In the US, more than 10 billion searches were conducted at core search engines in November 2007. Here, the big players share the market as shown in Table 6.1¹. In Germany, Google is even more dominant, with a market share of almost 90%². Various projects have been started to build and operate a P2P Web search network (e.g., [TD04, LKP+05, CAPMN03, YVGM04]), including the Minerva project. Web search and Internet-scale file content search seem to be perfect candidates for a P2P approach for several reasons:
• The Web is increasing at a much faster rate than the indexing capability of any centralized search engine [HT99, WMYL01, SE00]. In addition, the data is highly distributed by nature, residing on millions of sites.
• A P2P network could potentially outperform even the largest server farm in terms of processing power and could, thus, enable much more advanced methods (e.g., ontology-based background knowledge). A P2P Web search engine can potentially beneﬁt from the intellectual input of a large user community, as every peer’s behavior is inﬂuenced by a human user.
• There is growing concern about the world’s dependency on a few monopolistic search engines and their susceptibility to commercial interests.
¹ http://www.comscore.com
² http://www.webhits.de
6.2 The Minerva Search System

This section introduces the Minerva prototype for P2P search [BMWZ05, BMT+05a, BMT+05b]. The following subsections elaborate on the main principles of the Minerva search engine (Section 6.2.1), present its architecture (Section 6.2.2), and summarize some core fundamentals of its prototype implementation (Section 6.2.3).
6.2.1 System Principles

In [LLH+03], approaches to comprehensive Web search based on a P2P network have been considered infeasible, or at least a grand challenge, from a scalability viewpoint.
Early approaches typically spread inverted index lists across the directory such that each peer is responsible for maintaining a subset of index lists. Such systems allow for exact and complete execution of top-k style aggregation queries over the P2P network. However, bandwidth requirements and also latency issues raise concerns about their scalability claims.
Novel approaches [CW04, MTW05] try to overcome these challenges by utilizing eﬃcient large-scale top-k query aggregation algorithms for distributed systems.
The system design of Minerva differs from the approaches mentioned above. Instead of disseminating inverted index lists across the directory, Minerva stores only pointers to promising peers (enriched with compact statistical metadata describing the index contents of each peer) and utilizes these pointers to answer multi-keyword queries. The following fundamental design principles influenced the architecture of the Minerva search system presented in the next section:
Figure 6.1: Minerva Search Architecture and Query Execution.
• Minerva does not forward requests to all possible peers in the network. The scalability principle ensures that only the most appropriate peers are involved in the query execution.
6.2.2 System Architecture

The P2P Web search prototype system Minerva [BMT+05b] assumes a P2P collaboration in which every peer is autonomous and has a local index that can be built from the peer's own crawls or imported from external sources representing the user's interest profile. The local index contains inverted index lists with URLs of Web pages that contain specific keywords. A conceptually global but physically distributed directory, which is layered on top of a distributed hash table (DHT), holds compact, aggregated information about the peers' local indexes. The DHT partitions the key space such that each directory peer is responsible for maintaining the metadata about a randomized subset of keys. For failure resilience and availability, the metadata may be replicated across multiple directory peers. The three steps of query execution work as follows (see Figure 6.1; a minimal code sketch of the first two steps follows the list):
1. Directory Maintenance: In step 1, every peer publishes a number of key-specific statistical summaries (posts) describing its local index to the directory (shown in Figure 6.1a). Posts contain contact information about the peer that published the summary, together with statistical information to support appropriate query routing strategies (e.g., the size of the inverted list for the key, the maximum or average score among the key's inverted list entries, or various other statistical synopses [MBN+06]). The DHT is used to determine the directory peer responsible for each key.
2. Query Routing: If the results of a local query execution are unsatisfactory, the user can, in step 2, utilize the distributed directory to identify more promising peers for a particular query as follows (shown in Figure 6.1b): for each query key, the query initiator identifies the peer that is currently maintaining the appropriate statistics, using the DHT functionality. Then the query initiator retrieves the relevant posts by issuing requests directly to these peers. The statistical synopses contained in the posts are used to perform query routing, i.e., to identify the most promising peers for that particular query.
3. Query Processing: After a small subset of peers has been selected, the query is forwarded to them in step 3 and executed based on their local indexes (shown in Figure 6.1c). Note that this communication is carried out in a point-to-point manner. Finally, the remote peers return their local results to the query initiator, where they are combined into a single result list (result merging).
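To make the interplay between posts, directory peers, and the query initiator concrete, the following minimal Java sketch models steps 1 and 2 with an in-memory map standing in for the DHT. All names (Post, DirectoryPeer, QueryInitiator) and the reduction of the routing statistics to a single score value are illustrative assumptions, not the actual prototype API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** A per-key statistical summary (post) published to the directory. */
class Post {
    final String peer;      // contact information of the publishing peer
    final long listSize;    // size of the peer's inverted list for the key
    final double maxScore;  // score synopsis supporting query routing

    Post(String peer, long listSize, double maxScore) {
        this.peer = peer;
        this.listSize = listSize;
        this.maxScore = maxScore;
    }
}

/** A directory peer maintains the posts for the keys it is responsible for. */
class DirectoryPeer {
    private final Map<String, List<Post>> postsPerKey = new HashMap<String, List<Post>>();

    /** Step 1 (directory maintenance): store an incoming post under its key. */
    void publish(String key, Post post) {
        List<Post> posts = postsPerKey.get(key);
        if (posts == null) {
            posts = new ArrayList<Post>();
            postsPerKey.put(key, posts);
        }
        posts.add(post);
    }

    /** Step 2 (query routing): serve all posts for a key to a query initiator. */
    List<Post> postsFor(String key) {
        List<Post> posts = postsPerKey.get(key);
        return posts == null ? new ArrayList<Post>() : posts;
    }
}

/** The query initiator selects the most promising peers for a query. */
class QueryInitiator {
    /** Ranks peers by their aggregated per-key score and keeps the top k. */
    List<String> route(List<String> queryKeys, Map<String, DirectoryPeer> dht, int k) {
        Map<String, Double> scorePerPeer = new HashMap<String, Double>();
        for (String key : queryKeys) {
            DirectoryPeer directoryPeer = dht.get(key); // stands in for a DHT lookup
            for (Post post : directoryPeer.postsFor(key)) {
                Double sum = scorePerPeer.get(post.peer);
                scorePerPeer.put(post.peer, (sum == null ? 0.0 : sum) + post.maxScore);
            }
        }
        List<Map.Entry<String, Double>> ranked =
                new ArrayList<Map.Entry<String, Double>>(scorePerPeer.entrySet());
        Collections.sort(ranked, new Comparator<Map.Entry<String, Double>>() {
            public int compare(Map.Entry<String, Double> a, Map.Entry<String, Double> b) {
                return b.getValue().compareTo(a.getValue());
            }
        });
        List<String> selected = new ArrayList<String>();
        for (int i = 0; i < Math.min(k, ranked.size()); i++) {
            selected.add(ranked.get(i).getKey());
        }
        return selected;
    }
}
```

In the real system, the map lookup corresponds to a DHT lookup followed by a direct request to the responsible directory peer, and the routing score is computed from the full statistical synopses rather than a single value.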
This Minerva baseline approach can be extended such that multiple directories are utilized to maintain information beyond local index summaries, such as local bookmarks [BMWZ04], relevance assessments (e.g., from peer-specific query logs or click streams [KLFW06]), and (implicit or explicit) user feedback.
The upcoming sections discuss the two major challenges of query execution with Minerva, namely query routing in Section 6.2.2.1 and result merging in Section 6.2.2.2.
6.2.2.1 Query Routing
Query routing is one of the key issues in making P2P search feasible. A user issuing a multi-keyword query expects a high-quality top-10 or top-100 ranked result list. Therefore, the system has to select the peers that will answer the query. This decision is based on statistical information stored in the directory. [BMWZ05] introduces the baseline query routing approach utilizing resource selection strategies (e.g., CORI [CLC95], DTF [NF03], or GlOSS [GGM95]). These resource selection (or database selection) strategies, originally designed for distributed IR, need to be adapted to the large scale and high dynamics of a P2P system. The work presented in [BMT+05a, MBTW06] introduces an overlap-aware query routing approach that takes the novelty of peers into account. Another extension [MBN+06] considers the correlations among keywords to improve P2P query routing. Caching strategies as shown in [ZBW08] can improve the result quality of P2P search and reduce response times by retrieving cached results of previously executed queries. In addition, aggressively reusing cached results of even subsets of a query, moving towards an approximate caching technique, can drastically reduce bandwidth overhead.
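As an illustration of such a resource selection strategy, the following sketch computes a CORI-style belief score [CLC95] for one peer and one query term; the per-peer statistics would be taken from the posts in the directory. The constants follow the usual CORI formulation, but the mapping onto Minerva's post statistics is an assumption made here.

```java
/** A compact sketch of CORI-style resource selection for query routing. */
class CoriScorer {
    private static final double DEFAULT_BELIEF = 0.4; // the customary CORI default

    /**
     * Belief that a peer's collection satisfies one query term.
     *
     * @param df    document frequency of the term at the peer
     * @param cw    number of words in the peer's collection
     * @param avgCw average collection size over all candidate peers
     * @param cf    number of peers whose posts contain the term
     * @param numC  total number of candidate peers (collections)
     */
    static double scoreTerm(long df, long cw, double avgCw, long cf, long numC) {
        double t = df / (df + 50.0 + 150.0 * (cw / avgCw));              // term weight
        double i = Math.log((numC + 0.5) / cf) / Math.log(numC + 1.0);   // scaled ICF
        return DEFAULT_BELIEF + (1.0 - DEFAULT_BELIEF) * t * i;
    }

    /** A peer's score for a multi-keyword query: the sum of per-term beliefs. */
    static double scoreQuery(double[] termBeliefs) {
        double sum = 0.0;
        for (double belief : termBeliefs) {
            sum += belief;
        }
        return sum;
    }
}
```

Peers are then ranked by their query score, and only the top-ranked peers receive the forwarded query.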
Figure 6.2: Minerva Search Engine Implementation.
6.2.3 Implementation

Minerva is implemented in a platform-independent way using Java 5. The software architecture of a peer is illustrated in Figure 6.2. Each peer is built on top of a globally distributed directory which is organized as a distributed hash table, e.g., Chord [SMK+01] or Pastry [RD01a]. Minerva utilizes the lookup functionality of the DHT to provide a mapping from keys to peers. Early versions of Minerva relied on a reimplementation of the Chord protocol; more recent versions run as a FreePastry application, using Pastry's network routing mechanisms and Past's storage functionality to become resilient to network dynamics and peer failures. The latter version works as follows: each peer maintains a PastryNode, implements the PastryApplication interface, and is registered at a PastryEndpoint. Once registered, the PastryNode delivers incoming messages to the registered applications. There exist two different implementations of Past (PastImpl and GCPastImpl); GCPastImpl is an extension of PastImpl that offers garbage collection based on timestamps. Minerva uses this extended version in order to prune outdated metadata objects after a specific time interval.
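The following outline shows how such a peer-side application could hook into FreePastry's common API (package rice.p2p.commonapi). The interface methods follow the published common API, but registration details differ between FreePastry versions, so this is a hedged sketch rather than a drop-in implementation.

```java
import rice.p2p.commonapi.Application;
import rice.p2p.commonapi.Endpoint;
import rice.p2p.commonapi.Id;
import rice.p2p.commonapi.Message;
import rice.p2p.commonapi.Node;
import rice.p2p.commonapi.NodeHandle;
import rice.p2p.commonapi.RouteMessage;

/** Sketch of a Minerva peer application registered on a Pastry node. */
class MinervaApp implements Application {
    private final Endpoint endpoint;

    MinervaApp(Node node) {
        // Register this application under an instance name; incoming messages
        // for that instance are delivered to this object. (In some FreePastry
        // versions, buildEndpoint registers implicitly and register() is not needed.)
        this.endpoint = node.buildEndpoint(this, "minerva");
        this.endpoint.register();
    }

    public void deliver(Id id, Message message) {
        // Dispatch to the Event Handler: posts, queries, notifications, ...
    }

    public boolean forward(RouteMessage message) {
        return true; // let intermediate nodes forward messages unchanged
    }

    public void update(NodeHandle handle, boolean joined) {
        // React to neighbor churn, e.g., trigger metadata re-replication.
    }
}
```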
The DHT layer returns a Peer Descriptor object containing the contact information (e.g., IP address and port) of the peer currently responsible for a key. A Communicator is instantiated with this data to perform the communication with remote peers. Each peer runs an Event Handler listener that receives incoming messages and forwards them to the appropriate local components of the peer. Every peer has a Local Index holding the peer's data in any database system capable of executing standard SQL commands (e.g., Oracle, MySQL, or Cloudscape/Derby). The index can be used for query execution by the Local Query Processor component. Additionally, the Poster component uses the local index to produce the key-specific summaries that are published to the global directory using the Communicator. Each peer implements a PeerList Processor to maintain the incoming posts, i.e., all posts from across the network regarding the subset of keys that this peer is currently responsible for. Notice that Past is designed to handle such inserts natively, such that recent versions of Minerva do not need to use a PeerList Processor.
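As a sketch of the Poster's summary generation, the following JDBC snippet derives the per-key statistics from the Local Index. It reuses the hypothetical Post class from the earlier sketch, and the table layout terms(term, url, score) is an assumption, not the actual Minerva schema; any JDBC-capable database (Oracle, MySQL, Cloudscape/Derby) would do.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/** Sketch of the Poster: builds key-specific summaries from the Local Index. */
class Poster {
    private final Connection localIndex; // JDBC connection to the local index

    Poster(Connection localIndex) {
        this.localIndex = localIndex;
    }

    /** Builds the statistical summary (post) for one key. */
    Post summarize(String key, String peerContact) throws SQLException {
        PreparedStatement stmt = localIndex.prepareStatement(
                "SELECT COUNT(*), MAX(score) FROM terms WHERE term = ?");
        stmt.setString(1, key);
        ResultSet rs = stmt.executeQuery();
        rs.next();
        long listSize = rs.getLong(1);     // size of the inverted list for the key
        double maxScore = rs.getDouble(2); // score synopsis used for routing
        rs.close();
        stmt.close();
        return new Post(peerContact, listSize, maxScore);
    }
}
```

The resulting posts would then be handed to the Communicator, which routes each one to the directory peer responsible for the respective key.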
When the user initiates a one-time query using Minerva, the Global Query Processor component uses the DHT to locate the peer responsible for each query key and retrieves the respective metadata using Communicator components. After appropriately processing the metadata, the Global Query Processor forwards the complete query to selected peers, which in turn process the query using their Local Query Processors and return their results.
In a ﬁnal step, the Global Query Processor merges these remote results and presents the merged result to the user. There are some cases where Minerva peers communicate directly, i.e., without using the DHT lookup functionality.
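Result merging has to cope with local scores that are not directly comparable across autonomous peers. The following sketch shows one simple strategy, normalizing each peer's scores by its local maximum before merging; this is an illustrative choice, not necessarily the merging strategy of the Minerva prototype.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** One scored result entry returned by a remote peer. */
class ResultEntry {
    final String url;
    final double score;

    ResultEntry(String url, double score) {
        this.url = url;
        this.score = score;
    }
}

/** Sketch of result merging with per-peer max-score normalization. */
class ResultMerger {
    static List<ResultEntry> merge(List<List<ResultEntry>> perPeerResults, int topK) {
        List<ResultEntry> merged = new ArrayList<ResultEntry>();
        for (List<ResultEntry> peerList : perPeerResults) {
            double max = 0.0;
            for (ResultEntry e : peerList) {
                max = Math.max(max, e.score);
            }
            // Normalize so each peer's best result gets score 1.0.
            // Note: duplicate URLs from different peers are not deduplicated here.
            for (ResultEntry e : peerList) {
                merged.add(new ResultEntry(e.url, max > 0 ? e.score / max : 0.0));
            }
        }
        Collections.sort(merged, new Comparator<ResultEntry>() {
            public int compare(ResultEntry a, ResultEntry b) {
                return Double.compare(b.score, a.score);
            }
        });
        return merged.subList(0, Math.min(topK, merged.size()));
    }
}
```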
6.3 The MAPS Filtering Extension

The MAPS approximate information filtering approach introduced in this thesis has been integrated into the Minerva search prototype, such that Minerva provides approximate publish/subscribe functionality in addition to one-time searching. Section 6.3.1 presents the implementation aspects concerning the extension of Minerva, while Section 6.3.2 explains the usage of the extended Minerva prototype by executing example one-time and continuous queries. There, the various parts of the graphical user interface (GUI) are illustrated.
6.3.1 Implementation

In this section, the changes implemented to add publish/subscribe functionality to the Minerva prototype are explained. Figure 6.3 shows how the three new components of a peer are integrated into the existing system.
Besides one-time queries, the modifications described below allow issuing continuous queries. Thus, the Global Query Processor component is able to handle both types of queries as input. In addition, a continuous query has a lifetime parameter that determines how long the request remains valid. To process a continuous query requested by a user, the Global Query Processor utilizes the Time Analysis Storage component, which maintains statistical metadata about active continuous queries. Based on this metadata, the future behavior of publisher peers can be predicted by applying time series analysis techniques, as described in detail in Chapter 3.
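As one concrete instance of such a technique, the sketch below implements double exponential smoothing over the publishing behavior observed per peer and period. The smoothing parameters and the choice of this particular predictor are assumptions made here for illustration; Chapter 3 describes the actual MAPS predictors.

```java
/** Sketch of a double exponential smoothing predictor for one publisher peer. */
class DoubleExponentialSmoothing {
    private final double alpha;  // level smoothing factor, 0 < alpha < 1
    private final double gamma;  // trend smoothing factor, 0 < gamma < 1
    private double level;
    private double trend;
    private boolean initialized = false;

    DoubleExponentialSmoothing(double alpha, double gamma) {
        this.alpha = alpha;
        this.gamma = gamma;
    }

    /** Feed one observation, e.g., matching documents published in a period. */
    void observe(double value) {
        if (!initialized) {
            level = value;
            trend = 0.0;
            initialized = true;
            return;
        }
        double prevLevel = level;
        level = alpha * value + (1 - alpha) * (level + trend);
        trend = gamma * (level - prevLevel) + (1 - gamma) * trend;
    }

    /** Predicted value h periods ahead, used to rank publisher peers. */
    double predict(int h) {
        return level + h * trend;
    }
}
```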
Collecting publisher metadata is performed similarly to one-time querying, by utilizing the lookup functionality of the DHT and asking the peers responsible for the keys in the continuous query. Having selected the most promising publisher peers for the query, the Global Query Processor uses the Communicator to send the continuous query to the remote peers.
Whenever a peer receives a continuous query (as an event of the Event Handler), the query is stored in the Continuous Query Store, which denotes the second new component for filtering. Each time a peer inserts a new query, outdated queries stored at the peer are removed from the Continuous Query Store.
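A minimal sketch of the Continuous Query Store follows, with expiry-based pruning piggybacked on every insert as described above; class and field names are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** A continuous query with a lifetime, stored at a publisher peer. */
class ContinuousQuery {
    final String subscriberPeer;  // whom to notify on a match
    final List<String> keys;      // the query terms
    final long expiresAtMillis;   // insertion time plus lifetime

    ContinuousQuery(String subscriberPeer, List<String> keys, long expiresAtMillis) {
        this.subscriberPeer = subscriberPeer;
        this.keys = keys;
        this.expiresAtMillis = expiresAtMillis;
    }

    boolean expired(long now) {
        return now > expiresAtMillis;
    }
}

/** Sketch of the Continuous Query Store with pruning on insert. */
class ContinuousQueryStore {
    private final List<ContinuousQuery> queries = new ArrayList<ContinuousQuery>();

    /** Inserting a new query also prunes all expired ones. */
    synchronized void insert(ContinuousQuery query) {
        long now = System.currentTimeMillis();
        for (Iterator<ContinuousQuery> it = queries.iterator(); it.hasNext();) {
            if (it.next().expired(now)) {
                it.remove();
            }
        }
        queries.add(query);
    }

    /** Snapshot of the currently stored queries, used by the Publisher. */
    synchronized List<ContinuousQuery> active() {
        return new ArrayList<ContinuousQuery>(queries);
    }
}
```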
The third new component integrating approximate publish/subscribe functionality into Minerva is the Publisher module. This component receives new documents as input and adds them to the Local Index. In addition, the Publisher checks the Continuous Query Store for active continuous queries matching the publication; subscriptions that are no longer valid (e.g., because their lifetime has expired) are removed from the store. Matching the publication against the stored continuous queries yields the set of peers that subscribed with one of these queries and have to be notified about the published document. The Publisher component sends a notification message via the Communicator to inform the subscriber peers about the new document.
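The following sketch summarizes the Publisher's matching and notification logic, building on the ContinuousQueryStore sketch above. Matching is reduced to a simple term containment test, and the Communicator call is left as a placeholder; both simplifications are assumptions for illustration.

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch of the Publisher: index a new document and notify matching subscribers. */
class Publisher {
    private final ContinuousQueryStore store;

    Publisher(ContinuousQueryStore store) {
        this.store = store;
    }

    void publish(String url, Set<String> docTerms) {
        // 1. Add the document to the Local Index (omitted in this sketch).

        // 2. Find subscribers whose continuous query matches the document;
        //    each subscriber is notified at most once per publication.
        Set<String> toNotify = new HashSet<String>();
        long now = System.currentTimeMillis();
        for (ContinuousQuery query : store.active()) {
            if (!query.expired(now) && docTerms.containsAll(query.keys)) {
                toNotify.add(query.subscriberPeer);
            }
        }

        // 3. Send one notification per subscriber via the Communicator.
        for (String peer : toNotify) {
            // communicator.sendNotification(peer, url);  // placeholder call
        }
    }
}
```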
Figure 6.3: Extended MAPS Implementation with IF Components.
6.3.2 Example Use of MAPS

This section showcases the usage of Minerva (respectively MAPS) with the extended approximate publish/subscribe functionality by means of screenshots taken from the latest prototype version. It also serves as a short explanation of how the one-time and continuous query functionality is used within Minerva.
In this showcase example, 10 Cloudscape/Derby databases are used, each hosting about 100 documents. Thus, 10 peers can be created, each managing one of the 10 collections. To publish new documents, there is an additional database that hosts about 100,000 further documents, simulating the input from a crawling component such as BINGO!