Approximate Information Filtering in Structured Peer-to-Peer Networks
Christian Zimmer, Max-Planck Institute for Informatics, Saarbrücken ...
The main challenge addressed in this thesis was to exploit P2P technology for efficient and approximate information filtering. While several approaches exist to perform exact information filtering in P2P environments, this thesis presented a novel architecture to support content-based approximate information filtering. Most information filtering approaches taken so far rest on the underlying hypothesis of potentially delivering notifications from all information producers. This exact pub/sub functionality has proven expensive for such distributed environments. The approximate approach in this thesis relaxed this assumption by monitoring only selected sources that are likely to publish documents relevant to the user interests in the future. Since in an IF scenario the data is originally highly distributed, residing on millions of sites (e.g., with people contributing to blogs), a P2P approach is an ideal candidate for such a setting.
This chapter focuses on the most important open research questions.
Chapter 8 Conclusion and Open Questions
8.2 Contributions

The present thesis makes the following four main contributions, which have been discussed in detail in the previous chapters:

8.2.1 Approximate Information Filtering

One of the main contributions of this thesis was to introduce the novel MAPS architecture to support content-based approximate information filtering in P2P environments. Most IF approaches so far rest on the underlying hypothesis of potentially delivering notifications from every information producer. In contrast, MAPS relaxes this assumption and monitors only selected sources that are likely to publish documents relevant to the user interests in the future. In MAPS, a user subscribes with a continuous query, and only documents published by these monitored sources are forwarded to the user. The system provides a network-agnostic P2P architecture with different services and their related protocols (directory, subscription, publication, and notification protocol) for supporting approximate IF functionality in a distributed P2P environment. It is the first approach to address approximate IF in such a setting by exploiting metadata stored in a conceptually global, but physically distributed directory. The most critical task in approximate IF is the selection of appropriate publisher peers to meet the information demand in the future. Therefore, MAPS combines existing resource selection techniques with new behavior prediction strategies. The thesis showed that existing resource selection approaches (also referred to as collection or database selection) are not sufficient in a dynamic filtering setting, since resource selection can only determine appropriate authorities that have already published matching documents in the past.
A novel method for predicting the behavior of publisher peers complements the peer selection strategy by applying prediction techniques to time series of IR statistics. MAPS thus introduces time-series analysis to P2P information filtering environments. The experimental evaluation of MAPS confirmed its effectiveness and efficiency in several settings using real Web data. Various publishing behaviors were investigated, leading to the conclusion that only the combination of resource selection and behavior prediction improves recall while monitoring only a small number of publishers.
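The interplay of resource selection and behavior prediction can be illustrated with a small sketch. It is not the thesis's exact scoring formula: the double exponential smoothing parameters, the simple sum-based selection score, and the linear score combination are illustrative assumptions.

```python
def predict_next(series, alpha=0.5, gamma=0.5):
    """Double exponential smoothing: predict the next value of a time
    series of IR statistics (e.g., matching documents published per round)."""
    level, trend = series[0], series[1] - series[0]
    for x in series[1:]:
        prev_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = gamma * (level - prev_level) + (1 - gamma) * trend
    return level + trend

def peer_score(past_counts, weight=0.5):
    """Combine a resource-selection score (past authority) with a
    prediction of the peer's future publishing behavior."""
    sel = sum(past_counts)                       # what the peer has published
    pred = max(predict_next(past_counts), 0.0)   # predicted future activity
    return (1 - weight) * sel + weight * pred

# A peer that stopped publishing vs. one with rising activity:
stale  = [9, 7, 4, 2, 0]
rising = [0, 1, 3, 5, 8]
# Pure resource selection (weight=0) still prefers the stale authority:
assert peer_score(stale, 0) > peer_score(rising, 0)
```

Adding the prediction component (e.g., `weight=0.5`) flips the ranking in favor of the peer with rising activity, which is exactly the effect the combination is meant to achieve.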
Finally, the aforementioned prediction method was further enhanced by an automatic parameter tuning component. The MAPS Selective Method automatically tunes the prediction parameters used in the time-series analysis by utilizing already known values. This is achieved at no additional communication cost, as tuning is a local task performed at each peer individually. To demonstrate the potential gains of MAPS, the thesis compared it against an existing exact information filtering approach in the P2P setting and studied several major characteristics (e.g., load balancing issues, query placement, and routing infrastructure).
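A minimal sketch in the spirit of this tuning component: each peer selects smoothing parameters that best "postdict" the values it already knows, so no extra communication is needed. The grid search and error measure below are illustrative choices, not necessarily the thesis's exact procedure.

```python
def smooth_forecasts(series, alpha, gamma):
    """One-step-ahead forecasts from double exponential smoothing."""
    level, trend = series[0], series[1] - series[0]
    forecasts = []
    for x in series[1:]:
        forecasts.append(level + trend)     # forecast made before seeing x
        prev = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = gamma * (level - prev) + (1 - gamma) * trend
    return forecasts

def tune(series, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Pick (alpha, gamma) minimizing squared one-step-ahead error on the
    known part of the time series -- a purely local computation."""
    def sse(params):
        a, g = params
        return sum((f - x) ** 2
                   for f, x in zip(smooth_forecasts(series, a, g), series[1:]))
    return min(((a, g) for a in grid for g in grid), key=sse)

history = [2, 4, 6, 8, 10, 12]   # known publishing counts of one peer
best_alpha, best_gamma = tune(history)
```

For this perfectly linear history, any parameters chosen by `tune` reproduce the known values exactly, i.e., the postdiction error is zero.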
8.2.2 Correlation Awareness

The second contribution of this thesis covers publisher peer selection for a given continuous query with multiple keywords. For scalability reasons, MAPS's summaries are stored in the distributed directory and have publisher (rather than document) granularity, thus capturing the best publishers for certain keys. This granularity, together with the per-key organization of the directory that disregards keyword correlations (also referred to as correlated key sets), is one of the basic reasons that may lead to low recall. However, maintaining statistics for all possible key sets is clearly infeasible due to the explosion of the feature space. In the baseline approach, a continuous query is decomposed into individual keys, and the statistics from the directory are used to compute a combined score for each publisher. This score represents the probability of each source publishing documents matching the information demand in the near future. This approach may lead to poor filtering quality, as the top-ranked publishers for the complete query may not be among the top selected publishers. In the worst case, a selected publisher may deliver many documents for each single keyword, but not a single document matching all keywords, since this information is not present in the directory.
To overcome this problem, this thesis introduced two approaches that use correlations among keys to improve filtering quality: (i) the USS (Unique Synopses Storage) algorithm, which uses existing single-key synopses stored in the directory to estimate the publishing behavior of information sources for key sets, and (ii) the CSS (Combined Synopses Storage) algorithm, which enhances the directory to explicitly maintain statistical metadata about selectively chosen key sets. Contrary to distributed IR settings for one-time searching, where sources are ranked according to their document collections (i.e., using resource selection strategies), in approximate IF the publishers are ranked according to their probability of publishing relevant documents in the near future, which poses different requirements for maintaining metadata. This thesis presented the first work to develop algorithms for exploiting keyword correlations in such a dynamic IF setting. Existing approaches, limited to two-key queries, have been extended to the case of multi-key continuous queries with an arbitrary number of keys. Beyond that, new algorithms to approximate multi-key statistics by combining the statistics of arbitrary subsets have been provided. Hash sketches have been used in similar algorithms with an IR emphasis to compactly represent documents. This choice, however, yielded inaccurate results for continuous queries with more than two keys. For this reason, the use of recent state-of-the-art techniques (KMV synopses) for the compact representation of multisets is proposed and applied to an IF setting. These new structures allow the system to compute accurate synopses for multi-key queries and improve the filtering effectiveness. The experimental evaluation of both algorithms illustrated the filtering performance improvements in comparison to the basic MAPS approach.
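The idea behind KMV synopses can be sketched as follows: each per-key document set is summarized by its k smallest hash values, and the size of an intersection (documents matching all keys) is estimated from the synopses alone. The parameter k, the hash function, and the toy document sets are illustrative assumptions, not the thesis's configuration.

```python
import hashlib

K = 256  # synopsis size (assumed)

def h(doc_id):
    """Hash a document id to a float in [0, 1)."""
    d = hashlib.sha1(doc_id.encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

def kmv(doc_ids, k=K):
    """KMV synopsis: the k smallest hash values of the set."""
    return sorted({h(d) for d in doc_ids})[:k]

def estimate_distinct(syn, k=K):
    """Estimate the number of distinct documents behind a synopsis."""
    if len(syn) < k:
        return len(syn)            # small set: the synopsis is exact
    return (k - 1) / syn[k - 1]

def estimate_intersection(syn_a, syn_b, k=K):
    """Estimate |A intersect B| from the two synopses alone."""
    union = sorted(set(syn_a) | set(syn_b))[:k]   # synopsis of A union B
    k_eff = min(k, len(union))
    sa, sb = set(syn_a), set(syn_b)
    in_both = sum(1 for v in union if v in sa and v in sb)
    d_union = (k_eff - 1) / union[k_eff - 1] if k_eff == k else len(union)
    return in_both / k_eff * d_union

docs_a = {f"doc{i}" for i in range(0, 3000)}      # documents matching key A
docs_b = {f"doc{i}" for i in range(2000, 5000)}   # documents matching key B
est = estimate_intersection(kmv(docs_a), kmv(docs_b))
# true intersection size is 1000; the estimate should land in that range
assert 300 < est < 3000
```

The crucial property for the directory is that only the compact synopses, not the document sets themselves, need to be stored and combined.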
All experimental series used two different real-world collections of Web and blog data, and applied real-world queries from Google Zeitgeist. The evaluation also investigated filtering performance gains depending on the introduced correlation measure (conditional probability), which provides a way to compute the relatedness among keys.
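A conditional-probability correlation measure of this kind can be sketched over a toy collection; the document sets and function names below are hypothetical.

```python
def df(docs, *keys):
    """Document frequency: number of documents containing all given keys."""
    return sum(1 for d in docs if all(k in d for k in keys))

def cond_prob(docs, a, b):
    """P(a | b): how strongly the presence of key b predicts key a."""
    return df(docs, a, b) / df(docs, b) if df(docs, b) else 0.0

docs = [
    {"new", "york", "city"},
    {"new", "york", "hotel"},
    {"new", "car"},
    {"new", "jersey"},
    {"york", "minster"},
]
# "york" predicts "new" more strongly than the reverse:
assert cond_prob(docs, "new", "york") == 2 / 3
assert cond_prob(docs, "york", "new") == 0.5
```

The asymmetry of the measure is the point: a directory could maintain combined synopses only for key pairs whose conditional probability exceeds some threshold in at least one direction.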
8.2.3 Prototype System

In addition to the architecture and algorithms for approximate information filtering presented above, the thesis also developed a prototype implementation of the MAPS approach. This implementation serves as a testing environment for the conducted experiments, but also as a prototype that demonstrates the applicability of the proposed techniques.
The approximate information filtering approach introduced in this thesis has been integrated into the Minerva search prototype [BMT+05b] such that Minerva additionally provides approximate publish/subscribe functionality. In this regard, the implementation aspects concerning the extension of Minerva have been investigated, and new components supporting IF in addition to IR have been implemented and integrated into the existing prototype. An extensive showcase described the application of the extended Minerva prototype by executing sample one-time and continuous queries in detail.
8.2.4 Digital Library Use Case

Finally, the thesis presented a digital library use case using the MinervaDL architecture as an application scenario to support approximate information retrieval and filtering functionality in a single unifying framework. MinervaDL is hierarchical and utilizes a distributed hash table to achieve scalability, fault tolerance, and robustness in its routing layer.
The MinervaDL architecture allows handling huge amounts of data provided by DLs in a distributed and self-organizing way, and provides an infrastructure for creating large networks of digital libraries with minimal administration costs. The DL architecture of MinervaDL offers two kinds of basic functionality: (i) an information retrieval scenario (or one-time querying), where a user poses a one-time query and the system returns all resources matching the query (e.g., all currently available documents relevant to the requested query), and (ii) an information filtering scenario (or publish/subscribe, or continuous querying), where a user submits a continuous query (or subscription) and waits to be notified by the system about certain events of interest that take place (i.e., about newly published documents relevant to the continuous query).
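The difference between the two scenarios can be captured in a few lines. This is a deliberately simplified sketch: the class and method names are hypothetical, and the real system distributes documents and subscriptions across peers.

```python
class Node:
    """Toy node offering both retrieval (IR) and filtering (IF)."""

    def __init__(self):
        self.documents = []        # published documents (term sets)
        self.subscriptions = {}    # continuous query -> subscriber id

    def one_time_query(self, keys):
        """IR scenario: return all currently matching documents."""
        return [d for d in self.documents if keys <= d["terms"]]

    def subscribe(self, subscriber, keys):
        """IF scenario: store the continuous query for future matching."""
        self.subscriptions[frozenset(keys)] = subscriber

    def publish(self, doc):
        """Match a new publication against all stored continuous queries
        and return the subscribers to notify."""
        self.documents.append(doc)
        return [sub for q, sub in self.subscriptions.items()
                if q <= doc["terms"]]

node = Node()
node.publish({"id": 1, "terms": {"p2p", "filtering"}})
node.subscribe("alice", {"p2p"})
# the earlier document is visible to retrieval but was not delivered,
# since pub/sub only matches events after the subscription...
assert node.one_time_query({"p2p"}) != []
# ...while a new matching publication notifies the subscriber
assert node.publish({"id": 2, "terms": {"p2p", "networks"}}) == ["alice"]
```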
The proposed DL architecture is built upon a distributed directory storing metadata.
The thesis presents routing protocols for information ﬁltering and retrieval that use this directory information to perform the two functionalities. MinervaDL introduces three main components. Super-peers run the DHT protocol and form a distributed directory maintaining metadata about providers’ local knowledge in compact form. Consumers are utilized by users to connect to the MinervaDL network, using a single super-peer as their access point. Providers are implemented by information sources that want to expose their content to the MinervaDL network. Typical examples of provider peers are digital libraries of large institutions, like research centers or content providers.
Finally, this thesis presented different scoring functions to select appropriate provider peers for answering a one-time query or for storing a continuous query to be matched against future publications. The experimental evaluation investigated the influence of the individual scoring functions used to perform the two functionalities of MinervaDL. Furthermore, MinervaDL was compared to another DL architecture for P2P networks (LibraRing) that provides retrieval and filtering at the same time. Unlike MinervaDL, this architecture provides exact retrieval and filtering functionality.
8.3 Open Questions

In this section, a list of interesting open problems for approximate information filtering is presented. The topics discussed are relevant not only within the scope of this thesis, but are also interesting in the context of P2P data management in general. The discussion will focus on topics relevant to the efficiency and effectiveness of data management, and will not go into security or privacy issues that, although of high importance, are independent of the proposed solutions.
Thus, the remainder of this section will discuss open questions in the ﬁelds of load balancing, dynamics, economics, and replication & caching.
8.3.1 Load Balancing

The first open problem related to this thesis is load balancing in a distributed P2P setting.
Load balancing can be defined as distributing processing and communication activity evenly across a computer network so that no single device is overwhelmed. In a P2P system, the data load of a peer is usually determined by the number of data items it stores. The total data load of the P2P system is defined as the sum of the loads of all peers participating in the system. In a load-balanced network of n peers, the load of each peer should be around 1/n of the total load. A peer is referred to as overloaded or heavy if it has a significantly higher load than it would have in the load-balanced case. Similarly, a peer is light if it has significantly less load than in the load-balanced case.
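These definitions can be stated as a short sketch. The factor-of-2 thresholds for "heavy" and "light" are an assumed convention for illustration, not a definition from the thesis.

```python
def classify(loads, factor=2.0):
    """Classify peers as heavy or light relative to the fair share L/n,
    given the number of stored data items per peer."""
    n = len(loads)
    fair = sum(loads) / n    # load of each peer in the load-balanced case
    heavy = [i for i, l in enumerate(loads) if l > factor * fair]
    light = [i for i, l in enumerate(loads) if l < fair / factor]
    return {"heavy": heavy, "light": light, "fair_share": fair}

loads = [10, 12, 55, 3, 10]          # items stored per peer
result = classify(loads)
assert result["fair_share"] == 18.0  # total load 90 over 5 peers
assert result["heavy"] == [2] and result["light"] == [3]
```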
As presented in the previous chapters of this thesis, the MAPS approach utilizes distributed hash tables (e.g., Chord [SMK+01] or Pastry [RD01a]) as its underlying infrastructure, thus exploiting existing work on load balancing techniques for this kind of overlay network. The literature distinguishes two particular problems related to load balancing in distributed hash tables: