Approximate Information Filtering in Structured Peer-to-Peer Networks, Christian Zimmer, Max-Planck Institute for Informatics, Saarbrücken ...
[STSW02] commonly used in such a P2P search engine. A peer can select a random or query-speciﬁc document from this shared collection to store it in its local database. The shared collection is also realized by a Cloudscape database. The peer instances can run on one or more machines. The following sections present a usage scenario and explain the graphical user interface of the client in detail.
6.3.1 Minerva Initialization
When starting the Minerva client, the user has to input the network details and database login as shown in Figure 6.4. On the left, the Local Port and the Nickname of the peer have to be speciﬁed. If the peer joins an existing network, the Remote IP address and the Remote Port number of a random peer in the network must be declared. On the right, the connection to a local database has to be stated including DB Service Name, Host Name, Port number, Username, and Password. Here, the database DB1 runs on the same machine realized by the server mentioned in the previous section.
The Create Ring button specifies that a new P2P network should be created, whereas the Join Ring button is used to contact an existing network. The form also allows the user to select which widgets should be shown automatically after initialization. All three check boxes are preselected.
After initialization, the peer is connected to a P2P network. Figure 6.5 shows on the right the Network Properties and Collection Statistics. The Network Properties illustrate that the peer is connected, has a certain Pastry Node ID with local port, and owns the local collection with DB1 as DB Service Name. The Collection Statistics present some statistical information (e.g., the number of documents) concerning the local collection.
The Received Posts widget in the middle manages the metadata the peer is responsible for. The list shows all keys the peer stores metadata for; e.g., three peers in the network have published metadata for key music. The entry for Peer02 indicates that this peer hosts 17 documents containing the key music. The Refresh button updates the displayed list so that metadata received from other network peers in the meantime is shown. The Post all button starts the posting procedure of this peer and distributes the metadata of its local collection to the network. In addition, the Collection Statistics are recomputed to incorporate recently published documents.
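The bookkeeping behind the Received Posts widget can be pictured as a directory structure mapping each key to per-peer statistics. The following sketch is purely illustrative; the class and method names are hypothetical and not Minerva's actual API:

```python
from collections import defaultdict

class KeyDirectory:
    """Illustrative per-key metadata directory of a directory peer."""

    def __init__(self):
        # key -> {peer_id: number of documents that peer hosts for the key}
        self.posts = defaultdict(dict)

    def post(self, key, peer_id, doc_count):
        """A remote peer publishes metadata for one of its keys (Post all)."""
        self.posts[key][peer_id] = doc_count

    def statistics(self, key):
        """Return the per-peer statistics for a key, as shown after Refresh."""
        return dict(self.posts[key])

directory = KeyDirectory()
directory.post("music", "Peer02", 17)
directory.post("music", "Peer05", 4)
print(directory.statistics("music"))  # → {'Peer02': 17, 'Peer05': 4}
```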
6.3.2 One-Time Query Execution
Figure 6.6 summarizes the one-time query execution.
The query input ﬁeld contains the requested query modern music, and the unselected check box designates that this is a one-time query. When the user executes the query, the peer contacts the directory peers storing the metadata for key modern and key music to retrieve the key statistics. For this query, the peer itself hosts the key statistics, so it can determine directly that only Peer02 stores documents for both keys.
Figure 6.5: Updating and Disseminating Metadata.
Peer selection for this query is trivial and the one-time query is forwarded to this peer.
Nevertheless, Minerva allows the user to specify the peer selection strategy (e.g., CORI [SJCO02]) and the number of remote peers that receive the query. The Query Results widget shows the ﬁnal result document list to the user. All three documents are hosted at Peer02. The URL allows the user to visit the document’s online version. If more than one peer contributes local results for the requested query, and if the overall number of results is high, Minerva merges the collected result documents (e.g., using CORI-based merging algorithms) and displays only the top-ranked documents to the user.
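The merging step can be approximated as follows: each peer's local document scores are weighted by the quality score that peer received during selection, and only the top-ranked documents are kept. This is a strong simplification of CORI-based merging; the function names and the weighting scheme are assumptions:

```python
import heapq

def merge_results(per_peer_results, peer_scores, k=10):
    """Merge ranked result lists from several peers.

    Each local (url, score) pair is weighted by the contributing peer's
    selection score; a simplification of CORI-based result merging.
    """
    merged = []
    for peer, results in per_peer_results.items():
        weight = peer_scores.get(peer, 0.0)
        for url, score in results:
            merged.append((score * weight, url, peer))
    # Keep only the k documents with the highest combined score.
    return heapq.nlargest(k, merged)

results = merge_results(
    {"Peer02": [("http://a", 0.9), ("http://b", 0.7)],
     "Peer01": [("http://c", 0.8)]},
    {"Peer02": 1.0, "Peer01": 0.5},
    k=2)
```

With these illustrative scores, Peer02's documents dominate the merged top-2 list because Peer01's contribution is down-weighted by its lower selection score.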
6.3.3 Continuous Query Execution
Subscribing with a continuous query using the Minerva prototype with extended MAPS functionality is shown in Figure 6.7. The user enters the continuous query in the text ﬁeld mentioned before and selects the check box on the left to specify that this request is a continuous query. In the background, Minerva checks whether this continuous query is already active, i.e., it was requested by the same user in the past. If the query is active, the directory is used to retrieve updated statistical metadata about the query keys, and time series analysis is applied according to the MAPS approach presented in this thesis.
Prediction methods such as double exponential smoothing cannot be applied to continuous queries requested for the ﬁrst time. Thus, resource selection alone is used to select the publishers expected to be most promising in the future. Having selected the interesting publisher peers, the Minerva prototype system sends the continuous query to them and waits for newly published documents. Continuous queries have to be resubmitted periodically to track changes in publishing behavior; this can be done automatically by the system or manually by the user.
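The time-series prediction applied by MAPS for already-active queries can be illustrated with a plain double exponential smoothing routine. This is the generic textbook formulation, not the exact MAPS implementation, and the smoothing parameters are arbitrary:

```python
def des_predict(series, alpha=0.5, beta=0.5, horizon=1):
    """Double exponential smoothing over a key's publication counts.

    Maintains a level and a trend estimate and extrapolates `horizon`
    steps into the future; requires at least two observations.
    """
    level = series[0]
    trend = series[1] - series[0]
    for x in series[1:]:
        prev_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return level + horizon * trend

# A peer publishing two extra documents per period is predicted to continue:
print(des_predict([10, 12, 14, 16]))  # → 18.0
```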
Figure 6.8: Publishing Documents with Notifying Subscriber Peers.
6.3.4 Document Publication

Figure 6.8 illustrates the publication process for the showcase, where a peer can publish new documents by adding them to its own local collection. Each peer in the network has a database connection to this server, and two diﬀerent methods allow publishing new documents:
• Random Insert selects a completely random document from the server collection and adds it to the local database. A document cannot be published twice; the server keeps track of previous publications.
• Matched Insert selects, in this showcase, documents that match the active continuous queries a peer stores. In this case, the peer picks a random stored information demand and retrieves a matching document from the server to add to its local store.
If there is no matching document available or no active continuous query stored, a random publication occurs.
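The two insertion modes and the fallback rule above can be sketched as a single selection routine. The names are hypothetical, and query matching is simplified to requiring every query term to occur in the document text:

```python
import random

def choose_publication(server_docs, published, continuous_queries):
    """Pick the next document a peer publishes (illustrative sketch).

    Matched Insert: prefer a document matching a randomly chosen stored
    continuous query. Random Insert fallback: any not-yet-published
    document. Returns None when the server collection is exhausted.
    """
    # The server remembers earlier publications, so they are excluded.
    available = [d for d in server_docs if d not in published]
    if continuous_queries:
        query = random.choice(continuous_queries)
        matching = [d for d in available
                    if all(term in d.lower() for term in query.split())]
        if matching:
            return random.choice(matching)
    # No match or no active continuous query: random publication.
    return random.choice(available) if available else None
```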
On the lower right, Figure 6.8 shows the incoming requests. The ﬁrst request was the one-time query for modern music, and the second request was the same query as a long-term demand. Having added new documents to the local collection, a peer updates its Collection Statistics and disseminates its refreshed metadata to the network by pushing the Post all button again. This procedure can be performed periodically to decrease the network traﬃc caused by update messages.
Receiving a notiﬁcation for a newly published document is shown in Figure 6.9. Here, the peer that queried for modern music gets a notiﬁcation message from Peer02, which appears in the Notiﬁcations widget. Of course, the user can directly follow the URL link to access the online version of the published document. In addition, the Running continuous queries widget lists all active subscriptions of the current peer. Continuous queries with expired lifetimes are removed from this list.
6.3.5 Resubmitting Continuous Queries
The last screenshot deals with the resubmission of already existing continuous queries.
Figure 6.10 shows that the query modern music is requested again.
In the meantime, several peers in the network have published new documents. The Received Posts widget lists all currently available metadata for the two requested keys. Now, two peers have published documents concerning modern, and ﬁve peers store data regarding music. Peer selection for the whole query thus selects Peer01 and Peer02, also taking into account the time-series observations gathered since the query was last requested.
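Conceptually, peer selection for a resubmitted query combines the current resource-selection score with a behavior prediction derived from the observed time series. A minimal sketch, assuming a simple linear combination (the actual MAPS combination may differ):

```python
def select_publishers(peers, weight=0.5, top_k=2):
    """Rank candidate publisher peers by a weighted sum of their current
    resource-selection score and their predicted future publishing score,
    then return the ids of the top_k peers. Illustrative only."""
    ranked = sorted(
        peers,
        key=lambda p: weight * p["resource_score"]
                      + (1 - weight) * p["predicted_score"],
        reverse=True)
    return [p["id"] for p in ranked[:top_k]]

candidates = [
    {"id": "Peer01", "resource_score": 0.4, "predicted_score": 0.9},
    {"id": "Peer02", "resource_score": 0.8, "predicted_score": 0.7},
    {"id": "Peer03", "resource_score": 0.1, "predicted_score": 0.2},
]
print(select_publishers(candidates))  # → ['Peer02', 'Peer01']
```

Raising `weight` favors peers that are good sources now; lowering it favors peers predicted to publish matching documents in the future.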
The Running continuous queries widget indicates that this query was stored before, such that metadata from previous executions and selection processes is available for future use.
Figure 6.10: Resubmitting a Continuous Query.
6.4 Other Prototypes

This section brieﬂy introduces some existing prototypes for P2P retrieval and P2P ﬁltering.
LibraRing [TIK05a] is the only other system that combines retrieval and ﬁltering functionality in a P2P environment of digital libraries. In contrast to the prototype presented in this chapter, LibraRing focuses on exact searching and ﬁltering functionality by disseminating documents or continuous queries in the network. Section 6.4.1 presents some P2P retrieval systems, whereas Section 6.4.2 discusses relevant P2P ﬁltering systems. Overall, the prototype presented in this thesis is the only approach that provides approximate P2P searching and ﬁltering functionality in a unifying framework.
6.4.1 P2P Retrieval Prototypes

Galanx [WGD03, GWJD03] is a P2P search engine implemented using the Apache HTTP server and BerkeleyDB. Web site servers form the P2P layer of this architecture; pages are stored only where they originate. Galanx directs user queries to relevant peers by consulting a local peer index that is maintained on each peer. The experimental evaluation investigates the use of peer indices to direct searches. In contrast, the Minerva approach relies on peers to decide to what extent they want to crawl interesting fractions of the Web and build their own local indexes. [GWJD03] focuses on XML data repositories and postulates that, upon completion of the query, regardless of the number of results or how they are ranked and presented, the system guarantees that all relevant data sources known at query submission time have been contacted. For this purpose, a distributed catalog service that maintains summaries of all peers is designed.
PlanetP [CAPMN03] is a publish-subscribe service for unstructured P2P communities, supporting content ranking search. PlanetP distinguishes local indexes and a global index describing all peers and their shared information. The global index is replicated using a gossiping algorithm. PlanetP does not provide notiﬁcation messages about newly published data.

Odissea [SMwW+03] (Open DIStributed Search Engine Architecture) assumes a two-layered search engine architecture with a global index structure distributed over the peers in the system. The system provides a highly distributed global indexing and query execution service that can be used for content residing inside or outside the P2P network. A single peer holds the complete, Web-scale index for a given text key (i.e., keyword or word stem). Query execution uses a distributed version of Fagin’s threshold algorithm [Fag02].
The system appears to cause higher network traﬃc when posting document metadata into the network, and the presented query execution method seems limited to queries with at most two keywords. The paper actually advocates using a limited number of peers, in the spirit of a server farm.
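The threshold algorithm underlying Odissea's query execution can be illustrated in its classic, non-distributed form: the per-key posting lists, sorted by descending score, are scanned in lockstep; random accesses complete the aggregate score of each newly seen document; and scanning stops once the k-th best aggregated score reaches the threshold. This sketch omits all distribution aspects:

```python
import heapq

def threshold_algorithm(postings, k=2):
    """Fagin's threshold algorithm (simplified, centralized form).

    `postings` maps each query key to a list of (doc, score) pairs sorted
    by descending score; a document's total score is the sum of its
    per-key scores. Returns the top-k (doc, total_score) pairs.
    """
    lookup = {key: dict(lst) for key, lst in postings.items()}  # random access
    seen = {}
    depth = 0
    while True:
        threshold = 0.0
        advanced = False
        for key, lst in postings.items():
            if depth < len(lst):
                advanced = True
                doc, score = lst[depth]
                threshold += score  # sum of scores at the current scan depth
                if doc not in seen:
                    # Random access: complete this document's total score.
                    seen[doc] = sum(lookup[j].get(doc, 0.0) for j in postings)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        # Stop when the k-th best score can no longer be beaten.
        if not advanced or (len(top) == k and top[-1][1] >= threshold):
            return top
        depth += 1
```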
The OverCite system [SCL+ 05] was proposed as a distributed alternative for the scientiﬁc literature digital library CiteSeer. This functionality was made possible by utilizing a DHT infrastructure to harness distributed resources (storage, computational power, etc.).
OverCite is able to support new features such as document alerts. The work presented in [RV03] adopts an architecture very similar to Minerva, but seems incomplete since one cannot locate a running implementation. The presented results are based on simulations that also support the assumption that Minerva-like architectures do in fact scale and stay well within reasonable bandwidth limits. The system described in the paper provides keyword search functionality for a DHT-based ﬁle system or archival storage system, mapping keyword queries to unique routing keys. It does so by mapping each keyword to a peer in the DHT that stores a list of documents containing that keyword.
The eSearch system presented in [TD04] is a P2P keyword search system based on a hybrid indexing structure in which each peer is responsible for certain keys. Given a document, eSearch selects a small number of important keys in the document and publishes the complete key list for the document to peers responsible for those top keys. This selective replication of key lists allows a multi-key query to be processed locally at the peers responsible for the query keys, but the document granularity indexes may interfere with the goal of unlimited scalability. The authors claim that eSearch is scalable and eﬃcient, and obtains search results as good as state-of-the-art centralized systems.
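The selective replication idea of eSearch can be sketched as follows: for each document, the complete key list is published to the peers responsible for the document's most important keys, so that a multi-key query can later be processed locally at any peer responsible for one of those keys. The names and the notion of "importance" here are illustrative:

```python
def publish_key_lists(doc_id, term_weights, top_n=3):
    """eSearch-style hybrid indexing sketch.

    Pick the top_n most important keys of a document (by weight) and
    replicate the document's *complete* key list under each of them.
    Returns a mapping: responsible key -> {doc_id: full key list}.
    """
    top_keys = sorted(term_weights, key=term_weights.get, reverse=True)[:top_n]
    full_key_list = sorted(term_weights)
    placements = {}
    for key in top_keys:
        # Each peer responsible for a top key stores the full key list,
        # enabling local evaluation of multi-key queries.
        placements.setdefault(key, {})[doc_id] = full_key_list
    return placements
```

The trade-off described in the text is visible here: replicating the full key list multiplies index size (document-granularity indexes), in exchange for local multi-key query processing.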
Rumorama [EH05] is an approach based on the replication of peer data summaries via rumor spreading and multi-casting techniques in a structured overlay. Rumorama utilizes a hierarchical structure, and adopts a summary-based approach to support P2P-IR in the spirit of PlanetP. In a Rumorama network, each peer views the network as a small PlanetP network with connections to peers that see other small PlanetP networks. Each peer can select the size of the PlanetP network it wants to see according to its local processing power and bandwidth. Rumorama manages to process a query such that the summary of each peer is considered exactly once in a network without churn. The actual number of peers to be contacted for a query is a small fraction of the total number of peers in the network.
Alvis [LKP+05] is a prototype for scalable full-text P2P-IR using the notion of Highly Discriminative Keys (HDK) for indexing, which aims to overcome the scalability problem of single-key retrieval in structured P2P networks. Alvis is a fully functional retrieval engine built on top of P-Grid, providing distributed indexing, retrieval, and a content-based ranking module. While the HDK index is even larger than a single-key index, the authors argue that storage is plentiful in P2P systems, as opposed to network bandwidth.