«Approximate Information Filtering in Structured Peer-to-Peer Networks Christian Zimmer Max-Planck Institute for Informatics Saarbrücken ...»
This scoring function can be based on standard resource selection approaches [LCC00] from the IR literature (e.g., CORI). However, as it will be shown in Chapter 4, these approaches alone are not suﬃcient in an IF setting, since they were designed for retrieval scenarios, in contrast to the IF scenario considered here, and are aimed at identifying specialized authorities. Once the peers that will store q have been determined, S uses the IP addresses associated with them to construct a message IndexQ(id(S), ip(S), q).
The message is forwarded to the peer that will store q. When a publisher P receives a message IndexQ containing q, it stores q using a local query indexing mechanism such as [TKD04, YGM99, TKD08]. This procedure is shown in step 2 of Figure 3.3, where S contacts publishers P1, P2 and P3. Filtering and peer selection are dynamic processes, therefore periodic query repositioning, based on user-set preferences, is necessary to adapt to changes in publisher’s behavior. To reposition an already indexed query q, a subscriber would re-execute the subscription protocol, to acquire new peer statistics, compute a new ranking, and appropriately modify the set of peers indexing q.
3.3.3 Publication and Notiﬁcation Protocol The publication service is employed by users that want to expose their content to the network. A publisher P utilizes the directory to update statistics about the keys contained in the documents it publishes. All queries that match a published document produce appropriate notiﬁcations to interested subscribers. According to this, the procedure followed by P at publication time is as follows.
When a document d is published by P, it is matched against P ’s local query database to determine which subscribers should be notiﬁed. Then, for each subscriber S, P constructs a notiﬁcation message Notify(id(P ), ip(P ), d) and sends it to S using the IP address associated with the stored query (shown in Figure 3.4). If S is not online at notiﬁcation arrival, then P utilizes function send() to send the message through the DHT, by using the id(S) also associated with q. In this way, S will receive the message from its successor upon reconnection. Notice that peers publishing documents relevant to a query q, but not storing it, will produce no notiﬁcation (peer P4 in Figure 3.4).
3.3.4 One-Time Search Protocol The one-time search protocol is similar to the subscription protocol, and is used by peers to request one-time queries to the network. It is assumed that a querying peer Q requests a multi-key or multi-keyword query q of the form k1 k2... kk with t distinct keys. Q needs to determine which content providers in the network are currently the most promising candidates to satisfy the query. To do resource selection, metadata about data sources is collected from the directory using the directory service and protocol. Thus, a ranking of these information sources is calculated based on resource selection techniques. In this theses, a full discussion of query routing in Minerva is not possible for space reasons (see [BMT+ 05a, MBN+ 06].
To collect statistics about data sources, Q needs to contact all directory peers responsible for the query keys. Similar to the subscription protocol, for each query key ki, Q computes H(ki ), which is the identiﬁer of the peer responsible for storing statistics about other peers that publish documents containing the key ki. The same CollectStats(id(S), ip(S), ki ), is created and the function send() is used to forward the message in O(log n) hops to the peer responsible for identiﬁer H(ki ). A peer D receiving the message searches its local store of Post messages to retrieve the peer list Li of all posts of the key. Subsequently, a message RetStats(Li, ki ) is created by D and sent to Q using its IP found in the CollectStats message. Figure 3.5 illustrates this ﬁrst step where Q contacts directory peers D1 and D4. Once Q has collected all the peer lists Li for the keys contained in q, it utilizes an appropriate scoring function score(P, q) to compute a peer score with respect to q, for each one of the peers P contained in Li. Based on the score calculated for each peer, a ranking of peers is determined and the top-k ranked peers are candidates for answering the request q. In contrast to the subscription protocol, Q sends the query q to all selected content providers contained in a message RequestQ(id(S), ip(S), q). When a content provider Ci receives a message RequestQ containing q, it locally executes q on its own current document collection and creates a list Ri of result documents matching the query. This result list is returned to the query initiator Q using a RetResults(Ri, q) message. This is illustrated in step 2 of Figure 3.5.
- 46 Chapter 3 System Architecture and Protocols In the third and last step of the protocol (also illustrated in Figure 3.5, the query initiator Q merges all received local result lists Ri (e.g., using appropriate merging algorithms) to provide one combined global result list to the user. Section 6.2.2 gives some more details.
3.4 Comparison to Exact Information Filtering
This Section compares the MAPS architecture for approximate information ﬁltering presented in this chapter with an exact IF approach named DHTrie as introduced in [TIK05b].
This comparison considers diﬀerent characteristics of both approaches [TZWK07]. Table
3.1 gives an general overview concerning them. The following sections will investigate four major characteristics concerning DHTrie and MAPS in more detail: routing infrastructure, query placement, statistical information, and load balancing. Additional experimental issues (e.g., message traﬃc costs) are included in Chapter 4.
A crucial design decision in both approaches is the use of the distributed hash table (DHT) as the underlying routing infrastructure. Content-based ﬁltering requires an eﬃcient object location mechanism to support expressive query languages that capture the user’s speciﬁc interests. This renders Gnutella-style networks an ineﬃcient solution and simple keyword functionality of standard DHTs an insuﬃcient one.
To overcome this limitation of exact lookup, both approaches extend the DHT functionality to support richer data models and more expressive queries. However, to be able to eﬃciently support this functionality, changes and extensions in DHT protocols and data structures are necessary. DHTrie uses message grouping and extra routing tables to overcome ineﬃciencies, whereas MAPS uses batch posting of key summaries, piggy-backing of post messages to directory maintenance messages and decrease in the key space by using correlated multi-key sets to reduce network traﬃc (see Chapter 5).
3.4.2 Query Placement
Distributed publish/subscribe systems involve some sort of peer selection techniques to decide where to place a user query. This selection is critical, since future publications that may match this query should also be able to reach the peer storing it to trigger a notiﬁcation in case of a match. Query placement in DHTrie is deterministic and it depends upon the keys contained in the query and the hash function provided by the DHT. To decide where a keyword query should be placed, all query keys are hashed and the query is forwarded to the peers responsible for these identiﬁers to ensure correctness at publication time. These query placement protocols lead to ﬁltering performance that is exactly the same with that of a centralized system.
On the other hand, in MAPS only the most specialized and promising peers store a user query and are thus monitored. Publications produced by each peer are matched against its local query database only, since for scalability reasons no publication forwarding is used.
In this case recall is lower than that of DHTrie, but document-granularity dissemination to the network is avoided.
3.4.3 Statistical Information
Matching incoming documents with stored subscriptions involves maintenance and estimation of important global statistics such as the document frequency of a certain key, i.e., the number of distinct documents seen in the last interval that contain a speciﬁc keyword. In DHTrie this functionality is implicit; every time a new document is published by some peer, it reaches the peers responsible for the distinct keywords contained in the document, since these are the candidate peers that may store continuous queries that will potentially match the document. These peers do all the necessary bookkeeping of the statistical information, and can be queried for this when other peers need to compute a similarity score.
On the other hand, MAPS explicitly addresses this issue with the maintenance of a directory. Notice that in the case of DHTrie the statistics maintenance is a byproduct of the ﬁltering process and it is only necessary for computing similarity scores, whereas in the MAPS case it is the cornerstone of the ﬁltering process, since both peer selection (and thus query routing) and also similarity computation utilize it.
- 48 Chapter 3 System Architecture and Protocols 3.4.4 Load Balancing The eﬀect of load imbalances between peers in the two approaches is diﬀerent. DHTrie peers are more susceptible to load imbalances due to the nature of the subscription and publication protocol. Through the mapping of the DHT hash function, an overloaded peer or a peer with small processing capabilities may get responsible for a popular key, and thus be forced to store high volumes of user queries and process high numbers of publications.
DHTrie load balancing mechanisms are based on load-shedding and prove eﬃcient even for highly skewed distributions.
On the other hand, MAPS has an intrinsic load balancing mechanism to cope with imbalances. Assuming that peers are willing to oﬀer some of their resources to the community, a peer with resources to share soon becomes a hub peer for some topic and receives more subscriptions, whereas a peer with few resources will be forced to specialize more, and thus reduce the number of users that are interested in its publications. MAPS is more sensitive to ﬁltering load balancing (i.e., load imposed by the ﬁltering requests that need to be processed), since a directory peer responsible for a popular key will get high numbers of statistics retrieval requests. Finally, routing load (i.e., load imposed by the messages a peer has to forward due to the overlay maintenance protocols) is similar in both systems, and it mainly depends on the DHT usage.
3.5 Discussion This chapter presented the architecture with services and protocols of the MAPS approach for approximate publish/subscribe functionality in a structured P2P environment. While most information ﬁltering approaches taken so far have the underlying hypothesis of potentially delivering notiﬁcations from every information producer, MAPS relaxes this assumption by monitoring only selected sources that are likely to publish documents relevant to the user interests in the future. A subscriber requesting a continuous query uses a distributed directory to collect metadata concerning publishers. Next chapter will focus on publisher peers selection approximate information ﬁltering where a combination of resource selection and behavior prediction techniques is applied to the collected metadata such that the most promising publishers can be selected. The directory, subscription, publication and notiﬁcation protocols regulate the interactions of peers in the P2P network.
For completeness, this chapter also provided the retrieval protocols for approximate onetime searching as presented in the Minerva P2P search architecture. The MAPS approach extends the Minerva system such that two functionalities (one-time and continuous searching) are available. In addition, this chapter compared the main characteristics of MAPS with an existing information ﬁltering system called DHTrie realizing exact ﬁltering functionality over P2P networks. This comparison highlighted the diﬀerences, advantages, and disadvantages of both architectural approaches. The next chapter will discuss the strategies to select the publisher peers that a subscriber should monitor.
Chapter 4 Publisher Peer Selection This chapter discusses appropriate strategies to select the publisher peers that a subscriber should monitor. Notice that in MAPS peer selection is a critical task since it may lead to poor ﬁltering eﬀectiveness [ZTB+ 08].
In Section 4.1, the peer selection problem in an IF setting is introduced. Then, Section
4.2 presents the main peer selection algorithm that combines resource selection and behavior prediction to improve ﬁltering eﬀectiveness. The proposed prediction function of MAPS uses time series analysis with double exponential smoothing (as described in Section 18.104.22.168) to predict the future behavior of publisher peers. An extensive experimental evaluation in Section 4.3 shows the eﬀectiveness and eﬃciency of the proposed algorithm. Section 4.4 develops an algorithm called MAPS Selective Method (MSM) to improve the prediction results without additional communication overhead, while Section 4.5 presents the beneﬁts of the prediction improvement method MSM by analyzing several publishing scenarios.
Finally, Section 4.6 discusses and summarizes the results of this chapter.
The previous chapter introduced the architecture for approximate information retrieval and information ﬁltering including the appropriate services and protocols. This doctoral thesis does not focus on one-time search in structured P2P networks since there is a well-known understanding concerning query routing in the Minerva search architecture. More details about Minerva query routing can be found in Chapter 6 where the prototype implementation is explained. Here, approximate publish/subscribe will be presented in detail using the MAPS approach. The most critical task in MAPS is the selection of most promising publisher peers for a requested continuous query.