Approximate Information Filtering in Structured Peer-to-Peer Networks
Christian Zimmer
Max-Planck Institute for Informatics, Saarbrücken
Summary

The architecture presented in this work provides the tools and protocols to determine the most suitable publishers for a continuous query by exploiting metadata stored in a conceptually global, but physically distributed directory. To select the best publishers, a subscriber computes scores that capture past behavior and use it to predict future behavior. These scores are based on a combination of resource selection and behavior prediction in order to account for the dynamics of the setting. Behavior prediction applies time-series analysis with double exponential smoothing techniques to forecast future behavior under rapid change. In addition, correlations among the keywords of a query can be exploited to further improve publisher selection. For scalability reasons, the metadata stored in the directory has publisher granularity rather than document granularity; the directory thus identifies the best publishers for individual keywords. Building on the main architecture, this work contributes two algorithms that exploit correlations among keywords to improve publisher selection. The first algorithm uses individual per-keyword synopses from the directory to improve the selection.
Existing approaches, which are restricted to sets of two keywords, are extended to an arbitrary number of keywords. In addition, this work presents an algorithm that approximates statistics for sets of several keywords by combining arbitrary subset statistics. Applying the proposed techniques makes approximate information filtering more scalable: for an acceptable loss in recall, one obtains faster response times and reduced message traffic.
Finally, this doctoral thesis presents a digital library use case that supports approximate retrieval and filtering functionality in a unified architecture, as well as a prototype implementation that demonstrates the applicability of approximate information filtering. Comparative experiments on several real-world data sets measure the efficiency and effectiveness of the methods of this work.
Chapter 1 Introduction

This introductory chapter motivates the doctoral thesis in Section 1.1 by introducing the problem of P2P information filtering. Section 1.2 highlights the main solution and presents the major contributions that will be investigated in the following chapters, while Section 1.3 presents the thesis outline.
1.1 Motivation The peer-to-peer (P2P) paradigm has received increasing attention in recent years, mainly in the context of file-sharing applications (Gnutella, BitTorrent) and other Internet applications such as IP telephony (e.g., Skype). In the meantime, the P2P paradigm is moving towards distributed data management systems such as information retrieval (IR) and information filtering (IF). The main advantage of P2P is its ability to handle huge amounts of data in a decentralized and self-organizing manner. These characteristics give P2P high potential for building information systems that are powerful in terms of scalability, efficiency, and resilience to failures and dynamics. Beyond that, such an information system can benefit from the intellectual input of a large user community participating in the data-sharing network. Finally, P2P information systems can also promote pluralism in informing users about Internet content, which is crucial to guard against information-resource monopolies and the biased visibility of content from economically powerful sources. Information censorship is often mentioned in this context.
Information filtering, also referred to as publish/subscribe, continuous querying, or information push, can be seen as equally important as information retrieval (or one-time querying), since users are able to subscribe to information sources and be notified when new events of interest happen. The need for content-based push technologies is also stressed by the deployment of new applications (e.g., Google Alerts) where a user posts a subscription (or continuous query) to the system to receive notifications whenever matching events of interest take place. Information filtering and information retrieval are often referred to as two sides of the same coin [BC92]. Although many of the underlying issues and goals are similar in retrieval and filtering, since in both cases a document needs to be matched against an information demand, the design issues, techniques, and algorithms devised to increase filtering efficiency differ significantly. This distinction is one of the main drivers of this thesis.
The main challenge addressed in this thesis is to exploit P2P technology for efficient and approximate information filtering. While several approaches exist to perform exact information filtering in P2P environments, the work in this thesis emphasizes a novel architecture to support content-based approximate information filtering. Most information filtering approaches taken so far rest on the hypothesis of potentially delivering notifications from all information producers.
An approximate approach relaxes this assumption by monitoring only selected sources that are likely to publish documents relevant to the user interests in the future. Since in an IF scenario the data is originally highly distributed, residing on millions of sites (e.g., with people contributing to blogs), a P2P approach seems an ideal candidate for such a setting.
However, exact publish/subscribe functionality has proven expensive for such distributed environments. In contrast, approximate IF offers a natural solution to this problem by avoiding document-granularity dissemination, which is the main scalability bottleneck of exact IF approaches.
The targeted P2P system consists of a number n of publisher and/or subscriber peers (or nodes¹) dj with j = 1, ..., n, forming a network. In general, peers are assumed to be autonomous in publishing and requesting data. Every publisher peer dj constructs and stores a local index structure consisting of index lists Ij(k) over each key k (also referred to as keyword or term). A continuous query in such a filtering system is a set of multiple distinct keywords q = {k1, k2, ..., km}. The subscriber wants to be notified about new documents containing these keywords published by other peers in the network. Therefore, all peers maintain precomputed statistical summaries (metadata) on their local index contents, organized on a per-key basis and typically including measures such as the number of documents that a peer's local index contains for a given key, the average term frequency in these documents, and so on. Additionally, these summaries may contain compact synopses representing the documents each peer maintains. These summaries are posted to a conceptually global, but physically distributed directory conveniently accessible by every peer, with O(log n) communication cost per key, where n is the size of the network. Assuming the existence of such a directory, the approximate information filtering system has to provide the ability to identify the top-k most appropriate publisher peers p1, p2, ..., pk for a given continuous query q, i.e., those publishers that are expected to provide the best documents matching q in the future. In this thesis, this task is referred to as publisher peer selection. To select this set of peers, this thesis introduces new strategies that combine already known resource selection techniques with new behavior prediction techniques. These techniques utilize the distributed directory to predict the future publishing behavior of peers by applying prediction methods to time series of IR statistics.
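The publisher peer selection task described above can be sketched as follows. This is a simplified illustration rather than the thesis's actual scoring formulas: the names PeerStats, resource_selection_score, and behavior_prediction_score, the concrete score definitions, and the mixing parameter alpha are all invented for the example.

```python
from dataclasses import dataclass

@dataclass
class PeerStats:
    doc_count: int   # documents the peer's local index holds for a key
    avg_tf: float    # average term frequency of the key in these documents

def resource_selection_score(stats: PeerStats) -> float:
    # Authority derived from past publications (stand-in for a real
    # resource selection formula such as CORI-style scoring).
    return stats.doc_count * stats.avg_tf

def behavior_prediction_score(history: list) -> float:
    # Naive trend of matching documents published per period; the thesis
    # uses time-series prediction (double exponential smoothing) instead.
    if len(history) < 2:
        return 0.0
    return float(history[-1] - history[-2])

def select_top_k_publishers(stats_by_peer, history_by_peer, alpha=0.5, k=2):
    # Combine past authority with predicted future publishing behavior.
    scores = {
        pid: alpha * resource_selection_score(s)
             + (1 - alpha) * behavior_prediction_score(history_by_peer[pid])
        for pid, s in stats_by_peer.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

With three hypothetical peers, a peer with a strong past record ("p1") and a peer with rising publishing activity ("p3") can outrank a peer that only recently started publishing ("p2"), illustrating why both score components matter.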
1.2 Main Contributions This section summarizes the major contributions of this doctoral thesis. Section 1.2.1 discusses the general framework for supporting approximate information filtering, whereas Section 1.2.2 extends the framework to consider correlations among keywords. Section 1.2.3 introduces a prototype system, and Section 1.2.4 presents a digital library (DL) use case.
1.2.1 Approximate Information Filtering This thesis presents a novel architecture, coined MAPS (Minerva Approximate Publish Subscribe) [ZTB+08, ZTB+07], to support content-based approximate information filtering in P2P environments. Most IF approaches so far have the underlying hypothesis of potentially delivering notifications from every information producer. In contrast, MAPS relaxes this assumption and monitors only selected sources that are likely to publish documents relevant to the user interests in the future. In MAPS, a user subscribes with a continuous query, and only documents published by these monitored sources are forwarded to the user.
¹ The thesis uses the terminology peer rather than node.
MAPS provides a network-agnostic P2P architecture with different services and related protocols (directory, subscription, publication, and notification protocols) for supporting approximate IF functionality in a distributed P2P environment. It is the first approach that looks into the problem of approximate IF in such a setting. The architecture exploits metadata stored in a conceptually global, but physically distributed directory. The most critical task in approximate IF is the selection of appropriate publisher peers to meet the information demand in the future. Therefore, MAPS combines existing resource selection techniques with new behavior prediction strategies. It is shown that existing resource selection approaches (also referred to as collection or database selection) are not sufficient in a dynamic filtering setting, since resource selection can only select appropriate authorities that have already published matching documents in the past.
The novel method of behavior prediction of publisher peers completes the peer selection strategy by applying prediction techniques to time series of IR statistics. Thus, MAPS introduces time-series analysis to P2P information filtering environments. The experimental evaluation of MAPS confirms its effectiveness and efficiency in several settings using real Web data. Various publishing behaviors are investigated, leading to the conclusion that only the combination of resource selection and behavior prediction improves recall while monitoring only a small number of publishers.
MAPS also contributes an approach to improve the proposed prediction method of time-series analysis with double exponential smoothing (DES). The MAPS Selective Method (MSM) does not need any additional communication to adjust the prediction parameters.
Its key concept is to recognize the individual publishing behavior of peers for different continuous queries. In addition, MAPS is compared to an existing exact information filtering approach in the P2P setting with regard to several major characteristics (e.g., load balancing, query placement, and routing infrastructure) [TZWK07].
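Double exponential smoothing, the core of the behavior prediction technique mentioned above, can be sketched in its standard Holt linear form. The smoothing parameters alpha and beta are placeholders here; choosing them per peer and query without extra communication is exactly the kind of adjustment MSM addresses.

```python
def des_forecast(series, alpha=0.5, beta=0.5, horizon=1):
    """Forecast a future value of a time series using double exponential
    smoothing, which maintains both a level and a trend component and can
    therefore track series with rapid changes."""
    level = series[0]
    trend = series[1] - series[0]   # initial trend estimate
    for x in series[1:]:
        prev_level = level
        # Smooth the observation against the previous level-plus-trend.
        level = alpha * x + (1 - alpha) * (level + trend)
        # Smooth the trend against the change in level.
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return level + horizon * trend
```

On a perfectly linear series the forecast continues the trend exactly: des_forecast([1, 2, 3, 4, 5]) yields 6.0.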
1.2.2 Correlation Awareness In MAPS, publisher peer selection for a given continuous query with multiple keywords is driven by statistical summaries (metadata) that are stored by the system. These summaries are provided to the directory by the publishers and can be managed in different ways, ranging from centralized solutions like servers or server farms, to super-peer or pure peer-to-peer solutions in the form of a distributed P2P directory built on top of a distributed hash table (DHT) or other kinds of overlay networks. For scalability, the summaries in MAPS have publisher granularity, not document granularity, thus capturing the best publishers for certain keywords or keys.
This, together with the per-key organization of the directory, which disregards keyword correlations² and correlated key sets, is one of the main reasons that may lead to insufficient recall. Considering statistics for all possible key sets is clearly infeasible due to the explosion of the feature space. The baseline approach would decompose a continuous query into its individual keys and use the statistics from the directory to compute a combined score (e.g., an intersection or some other kind of aggregation of individual key scores) for each key and publisher. This score would represent the probability of each source publishing documents matching the information demand in the near future. This approach may lead to poor filtering quality, as the top-ranked publishers for the complete query may not be among the top selected publishers. In the worst case, a selected publisher may deliver many documents for each single keyword, but not a single document matching all keywords, since this information is not present in the directory.
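The pitfall of the baseline decomposition can be illustrated with a toy example (the peer names and score values are fabricated): aggregating per-key scores may rank first a publisher whose documents never contain all query keys together.

```python
# Per-key directory scores for the two-key query {"music", "jazz"}.
per_key_scores = {
    "peerA": {"music": 0.9, "jazz": 0.9},  # many docs per key, but the two
                                           # keys appear in disjoint documents
    "peerB": {"music": 0.6, "jazz": 0.6},  # fewer docs, but keys co-occur
}

def baseline_score(scores):
    # Aggregate single-key scores, ignoring any key correlations.
    return sum(scores.values())

ranking = sorted(per_key_scores,
                 key=lambda p: baseline_score(per_key_scores[p]),
                 reverse=True)
# The baseline prefers peerA, although peerB may be the only publisher
# that actually produces documents matching the conjunctive query.
```

Only statistics about key co-occurrence, which the per-key directory does not hold, could expose this difference; that is the gap the USS and CSS algorithms below address.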
² In general statistical usage, correlation (or co-relation) refers to the departure of two variables from independence.
Thus, the thesis introduces two approaches, as suggested in [ZTW08], that use correlations among keys to improve filtering quality in the scenario described above:
• The USS (Unique Synopses Storage) algorithm uses existing single-key synopses stored in the directory to estimate the publishing behavior of information sources for key sets.
• The CSS (Combined Synopses Storage) algorithm enhances the directory to explicitly maintain statistical metadata about selectively chosen key sets.
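A minimal sketch of the USS idea, under the simplifying assumption that each single-key synopsis is a set of hashed document identifiers (the thesis uses compact synopses whose exact data structure is not reproduced here): intersecting a publisher's single-key synopses estimates how many of its documents contain all query keys.

```python
def estimate_keyset_count(per_key_synopses):
    """Estimate the number of a publisher's documents that contain ALL
    query keys by intersecting its single-key synopses (USS-style
    combination of existing per-key directory metadata)."""
    synopses = list(per_key_synopses.values())
    return len(set.intersection(*map(set, synopses)))
```

For example, if a publisher's synopsis for "p2p" covers documents {1, 2, 3} and for "filtering" covers {2, 3, 5}, the estimated key-set count is 2. CSS instead stores such key-set statistics explicitly for selectively chosen key sets, trading directory space for estimation accuracy.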
In contrast to distributed IR settings for one-time searching, where sources are ranked according to their document collections (i.e., using resource selection strategies), in approximate IF the publishers are ranked according to their probability of publishing relevant documents in the near future, which poses different requirements for maintaining metadata.
This thesis presents the first work to develop algorithms for exploiting keyword correlations in such a dynamic IF setting. Existing approaches, which are limited to two-key queries, are extended to multi-key continuous queries with an arbitrary number of keys.