Approximate Information Filtering in Structured Peer-to-Peer Networks. Christian Zimmer, Max-Planck Institute for Informatics, Saarbrücken.
7.1.3 Previous Research on P2P Digital Library Architectures P2P-DIET [IKT04a] and LibraRing were the ﬁrst approaches that tried to support both IR and IF functionality in a single unifying framework. P2P-DIET utilizes an expressive query language based on IR concepts and is implemented as an unstructured P2P network with routing techniques based on shortest paths and minimum weight spanning trees. An extension of P2P-DIET [CIKN04] considers a similar problem for distributing RDF metadata in an Edutella [NWQ+ 02] fashion. LibraRing [TIK05a] was the ﬁrst approach to provide protocols for the support of both IR and IF functionality in DLs using DHTs. In LibraRing, super-peers are organized in a Chord DHT, and both (continuous) queries and documents are indexed by hashing the words contained in them. This hashing scheme depends heavily on the data model and query language adopted, and the protocols have to be modiﬁed when the data model changes [TIK05a]. The DHT is used to make sure that queries meet the matching documents (in the IR scenario) or that published documents meet the indexed continuous queries (in the IF scenario). In this way the retrieval eﬀectiveness of a centralized system is achieved, while a number of routing optimizations (such as value proxying, content-based multicasting, etc.) are used to enhance scalability.
Figure 7.2: High-Level View of the MinervaDL Architecture.
7.2 The MinervaDL Architecture This section of the use case presents a high-level view of the MinervaDL architecture and introduces the various system components. The system architecture of MinervaDL is composed of three diﬀerent types of peers: super-peers, consumer peers (or consumers), and provider peers (or providers).
Figure 7.2 shows a high-level view using an underlying DHT-based directory (e.g., Chord [SMK+ 01]). The following sections describe the three main components and explain their properties and abilities in detail, while Section 7.3 presents the protocols regulating peer interactions.
7.2.1 Super-Peers Super-peers run the DHT protocol and form a distributed directory that maintains statistics (metadata) about providers’ local knowledge in compact form. In MinervaDL, the Chord DHT is used to partition the key space such that each directory peer (super-peer) is responsible for the statistics of a randomized subset of keys. Directory peers are super-peers, i.e., peers with more capabilities than consumer or provider peers (e.g., more CPU power and bandwidth capacity) that are responsible for serving information consumers and providers and act as their access point to the MinervaDL network. When the number of super-peers is small, each peer can easily locate the others in a single hop by maintaining a full routing table. When the super-peer network grows in size, the DHT provides a scalable means of locating other super-peers in the network.
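The key-space partitioning described above can be sketched as follows. This is a minimal, illustrative model (class and method names are assumptions, not MinervaDL's actual implementation): each super-peer's identifier is placed on a Chord-style ring, and the statistics for a term are stored at the successor of the term's hashed key.

```python
import hashlib

class ChordDirectory:
    """Minimal sketch of Chord-style key partitioning among super-peers.
    Names and the tiny identifier space are illustrative only."""

    def __init__(self, m=16):
        self.m = m       # number of identifier bits
        self.ring = []   # sorted list of super-peer identifiers

    def _id(self, value):
        # Hash a name or term onto the identifier ring.
        digest = hashlib.sha1(value.encode()).hexdigest()
        return int(digest, 16) % (2 ** self.m)

    def add_super_peer(self, name):
        self.ring.append(self._id(name))
        self.ring.sort()

    def responsible_peer(self, term):
        """Return the super-peer responsible for a term's statistics:
        the first peer id clockwise from the term's key (its successor)."""
        key = self._id(term)
        for peer_id in self.ring:
            if peer_id >= key:
                return peer_id
        return self.ring[0]   # wrap around the ring
```

With three super-peers on the ring, every term deterministically maps to exactly one of them, which is what lets consumers and providers locate directory statistics without flooding.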
Super-peers can be deployed by large institutions like universities, research centers or content providers (e.g., CiteSeer, ACM, Springer, Elsevier) to provide access points for their users (students, faculty or employees) or digital libraries. As shown in Figure 7.2, more than one provider and/or consumer can be connected to a single super-peer that acts as their common access point.
7.2.2 Consumer Peers Consumer peers (or consumers) are utilized by users (e.g., students, faculty, or employees) to connect to the MinervaDL network, using a single super-peer as their access point. A consumer peer allows users to pose one-time queries, receive relevant resources, subscribe to resource publications with continuous queries, and receive notiﬁcations about published resources (e.g., documents) that match their interests. Consumer peers are responsible for selecting the best information sources to query (respectively monitor) with respect to a given one-time query (respectively continuous query). If consumer peers are not online to receive notiﬁcations about documents matching their submitted continuous queries, these notiﬁcations are stored by their access point and are delivered upon reconnection. Section 7.3 presents the protocols regulating the activities of consumer peers.
7.2.3 Provider Peers Provider peers (or providers) are implemented by information sources that want to expose their content to the MinervaDL network. Typical examples are digital libraries deployed by larger institutions, like research centers or content providers (e.g., CiteSeer, ACM, Springer, or Elsevier). Provider peers use a directory peer (super-peer) as their access point and utilize it to distribute statistics about their local resources to the network. Providers answer one-time queries and store continuous queries submitted by consumers to match them against new documents they publish. More than one provider peer may be used to expose the contents of large digital libraries, and an integration layer can be used to unify diﬀerent types of DLs.
7.3 The MinervaDL Protocols Having introduced the main MinervaDL architecture with its three diﬀerent types of peers in the previous section, this section explains in detail the protocols that regulate the interactions between all types of peers in the DL architecture. The protocols cover the joining and leaving of consumers, providers, and super-peers, as well as the publication of new documents, the submission of one-time or continuous queries, and the receipt of answers and notiﬁcations for submitted requests.
Before explaining the individual protocols, three diﬀerent functions have to be deﬁned. These are used to ensure the basic communication procedures of the presented protocols.
7.3.1 Provider & Consumer Join/Leave The ﬁrst time a provider peer P wants to connect to the existing MinervaDL network, it has to follow the join protocol. P has to ﬁnd the IP address of a super-peer S using out-of-band means (e.g., via a secure Web site that contains IP addresses for the super-peers that are currently online in the network). Then, P sends to S a NewProv(key(P ), ip(P )) message, and S adds P to its local provider table (PT ), which is a hash table used for identifying the providers that use S as their access point. Here, key(P ) is used to index providers in PT, while each PT slot stores contact information about the provider, including its status (connected or disconnected) and its stored notiﬁcations (see Section 7.3.8 for notiﬁcation delivery).
Subsequently, super-peer S sends to provider P an appropriate acknowledgement message AckNewProv(id(S), ip(S)). Once P has joined, it can use the connect/disconnect protocol described next to connect to and disconnect from the network. A consumer C uses a similar protocol to join the MinervaDL network. In this case the messages NewCons and AckNewCons are utilized in combination with a consumer table CT managing contact information about consumers.
A provider peer P or a consumer peer C that wants to leave the network has to send a LeaveProv(key(P ), ip(P ), id(S)) or LeaveCons(key(C), ip(C), id(S)) message to its access point S, respectively. The super-peer S then deletes the peer from its provider or consumer table, including all contact information.
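The join/leave bookkeeping at a super-peer can be sketched as follows. This is an illustrative model only: the message and table names (NewProv, AckNewProv, PT, CT) follow the text, while the data layout and method names are assumptions.

```python
class SuperPeer:
    """Sketch of a super-peer's access-point bookkeeping for the
    join/leave protocol. Field layout is illustrative."""

    def __init__(self, sid):
        self.sid = sid
        self.PT = {}   # provider table: key(P) -> contact info
        self.CT = {}   # consumer table: key(C) -> contact info

    def on_new_prov(self, key, ip):
        # NewProv(key(P), ip(P)): register the provider with its status
        # and an (initially empty) list of stored notifications.
        self.PT[key] = {"ip": ip, "connected": True, "notifications": []}
        # Acknowledge with the super-peer's own identity (AckNewProv).
        return ("AckNewProv", self.sid)

    def on_leave_prov(self, key):
        # LeaveProv: remove the provider and all its contact information.
        self.PT.pop(key, None)
```

The consumer side (NewCons/LeaveCons against CT) would mirror these two handlers.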
7.3.2 Provider & Consumer Connect/Disconnect When a provider P wants to connect to the network, it sends to its access point S a ConnectProv(key(P ), ip(P ), id(S)) message. If key(P ) exists in the PT of S, P is marked as connected. If key(P ) does not exist in PT, this means that S was not the access point of P the last time that P connected (Section 7.3.8 discusses this case). When a provider P wants to disconnect, it sends to its access point S a DisconnectProv(key(P ), ip(P ), id(S)) message, and S marks P as disconnected in its PT.
Consumers connect/disconnect from the network in a similar way (applying messages ConnectCons and DisconnectCons), but S has also to make sure that a disconnecting consumer C will not miss notiﬁcations about resources of interest while not online. Thus, notiﬁcations for C are stored in the consumer table CT of S and wait to be delivered upon reconnection of C (see Section 7.3.8).
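The buffering of notifications for offline consumers can be sketched as follows. The structure is hypothetical (the text fixes only the message names and the consumer table CT); it shows the essential behavior: notifications for a disconnected consumer are queued and handed over on reconnection.

```python
class ConsumerTable:
    """Sketch of notification buffering for offline consumers."""

    def __init__(self):
        self.CT = {}  # key(C) -> {"connected": bool, "pending": [...]}

    def connect(self, key):
        # ConnectCons: mark connected and deliver buffered notifications.
        entry = self.CT.setdefault(key, {"connected": False, "pending": []})
        entry["connected"] = True
        pending, entry["pending"] = entry["pending"], []
        return pending

    def disconnect(self, key):
        # DisconnectCons: keep the entry so notifications can be buffered.
        self.CT[key]["connected"] = False

    def notify(self, key, notification):
        # Deliver immediately if online, otherwise store in CT.
        entry = self.CT[key]
        if entry["connected"]:
            return [notification]
        entry["pending"].append(notification)
        return []
```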
7.3.3 Super-Peer Join/Leave To join the MinervaDL network, a super-peer S must ﬁnd the IP address of another super-peer S′ using out-of-band means. S creates a NewSPeer(id(S), ip(S)) message and sends it to S′, which performs a lookup operation by calling lookup(id(S)) to ﬁnd Ssucc = successor(id(S)), similarly to the Chord joining procedure. S′ sends an AckNewSPeer(id(Ssucc ), ip(Ssucc )) message to S, and S updates its successor to Ssucc. S also contacts Ssucc asking for its predecessor and the data that should now be stored at S. Ssucc updates its predecessor to S, and answers back with the contact information of its previous predecessor, Spred, and all continuous queries and publications that were indexed under a key k with id(Spred ) < k ≤ id(S). S makes Spred its predecessor and populates its index structures with the new data that arrived. After that, S populates its ﬁnger table entries by repeatedly performing lookup operations on the desired keys.
When a super-peer S wants to leave the MinervaDL network, it constructs a DisconnectSPeer(id(S), ip(S), id(Spred ), ip(Spred ), data) message, where data are all the continuous queries, published resources, and stored notiﬁcations of oﬀ-line peers that S was responsible for. Subsequently, S sends the message to its successor Ssucc and notiﬁes Spred that its successor is now Ssucc. Clients that used S as their access point connect to the network through another super-peer S′. Stored notiﬁcations can be retrieved through successor(id(S)).
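The key hand-over during a super-peer join reduces to one range test on the identifier ring: the joining peer S takes from its successor exactly the keys k with id(Spred) < k ≤ id(S). A small sketch (function name and flat key list are assumptions for illustration; the range test must handle the wrap-around at zero):

```python
def keys_transferred_to_new_peer(new_id, pred_id, stored_keys):
    """Return the successor's keys that move to a joining super-peer S:
    those k with id(S_pred) < k <= id(S) on the identifier ring."""
    def in_range(k):
        if pred_id < new_id:
            # Ordinary interval on the ring.
            return pred_id < k <= new_id
        # Interval wraps around the zero point of the ring.
        return k > pred_id or k <= new_id
    return [k for k in stored_keys if in_range(k)]
```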
Subsequently, C creates a GetResults(ip(C), key(C), q) message and forwards it, using the contact information associated with the statistics, to all provider peers selected previously. Once a provider peer P receives a GetResults message containing a query q, it matches q against its local document collection to retrieve the documents matching q.
The local results are ranked according to their relevance to the query to create a result list R. Subsequently, P creates a RetResults(ip(P ), R, q) message and sends it to C. In this way, C collects the local result lists of all selected providers and uses them to compute a ﬁnal result list that is then presented to the user. To merge the retrieved result lists, standard IR scoring functions (e.g., CORI [CLC95], GlOSS [GGMT99], or CVV [YL97]) are used. In [FPC+ 99], various standard approaches are compared.
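The merging step can be illustrated with a deliberately simplified stand-in for CORI-style merging (the cited functions are considerably more elaborate): each provider's local scores are min-max normalized to make them comparable, duplicates keep their best score, and the top-k documents form the final result list. All names here are illustrative assumptions.

```python
def merge_result_lists(result_lists, k=10):
    """Illustrative merge of per-provider result lists of (doc_id, score)
    pairs; a simplified stand-in for CORI/GlOSS/CVV-style merging."""
    merged = {}
    for results in result_lists:
        scores = [s for _, s in results]
        lo, hi = min(scores), max(scores)
        for doc, s in results:
            # Min-max normalize so scores from different providers
            # (with different local scoring scales) become comparable.
            norm = (s - lo) / (hi - lo) if hi > lo else 1.0
            merged[doc] = max(merged.get(doc, 0.0), norm)
    ranked = sorted(merged.items(), key=lambda item: -item[1])
    return ranked[:k]
```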
7.3.6 Subscribing with a Continuous Query
This section describes how to extend the protocols of Section 7.3.5 to provide information ﬁltering functionality. To submit a continuous query cq containing keys k1, k2,..., kn, the one-time query submission protocol needs to be modiﬁed. The ﬁrst three steps are identical while step four is modiﬁed as follows.
C uses the scoring function pred(P, cq) described in Section 7.4 to rank the providers with respect to cq and to identify the top-k providers that may publish documents matching cq in the future. These are the peers that will store cq, and C will receive notiﬁcations from these peers only. This query indexing scheme makes provider selection a critical component of the ﬁltering functionality. Notice that, in a ﬁltering setting, resource selection techniques like sel(P, cq), described in Section 7.4 and used for one-time querying, are not appropriate, since MinervaDL is not interested in the current document collection of the providers but rather in their future publishing behavior.
Once the providers that will store cq have been determined, consumer C creates a message IndexQuery(key(C), ip(C), id(S), ip(S), cq) and sends it to these providers using the IP addresses associated with the GetStats messages C received in the previous step. When a provider peer P receives an IndexQuery message, it stores cq in its local continuous query data structures to match it against future publications. P utilizes these data structures at publication time to quickly ﬁnd all continuous queries that match a publication. This can be done using eﬃcient algorithms, e.g., BestFitTrie [TKD04] or SQI [YGM99].
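The provider-side continuous-query store can be sketched as a simple conjunctive-keyword index in the spirit of SQI/BestFitTrie; the real algorithms are considerably more refined, and all names below are illustrative assumptions.

```python
class ContinuousQueryIndex:
    """Sketch of a provider's continuous-query store: each query is a
    set of required keys, matched against incoming publications."""

    def __init__(self):
        self.queries = {}   # query id -> set of required keys

    def index_query(self, qid, keys):
        # IndexQuery: store cq to match it against future publications.
        self.queries[qid] = set(keys)

    def match_publication(self, text):
        # At publication time, return all continuous queries whose
        # required keys all occur in the published document.
        words = set(text.lower().split())
        return [qid for qid, keys in self.queries.items()
                if keys <= words]
```

A matching query id would then trigger a notification to the subscribing consumer (or be buffered at its access point if the consumer is offline, as in Section 7.3.8).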
7.3.7 Publishing a new Document