«Approximate Information Filtering in Structured Peer-to-Peer Networks Christian Zimmer Max-Planck Institute for Informatics Saarbrücken ...»
The main contribution of this chapter is the presentation of the peer selection approach for MAPS by utilizing novel publisher selection strategies based on a combination of resource selection and behavior prediction. The evaluation of this approach shows that traditional resource selection strategies alone are not suﬃcient in this setting, and devise a novel method to predict peer publishing behavior based on time series analysis of IR metrics.
This technique allows to improve recall, while monitoring only a small number of publishers.
In addition, this chapter investigates an improvement algorithm called MAPS Selective Method to improve the prediction techniques without additional communication overhead.
An extended experimental evaluation of both methods investigates the usefulness of the proposed approaches in several publishing behaviors and scenarios.
- 51 Chapter 4 Publisher Peer Selection
4.2 Peer Selection Strategy The protocols presented in the previous chapter have shown that the selection of peers is the most critical part of the approximate information ﬁltering approach MAPS. The subscriber only receives notiﬁcations about published documents from peers that store the continuous query. In other words: a subscriber monitors a publisher with a continuous query.
To decide which publisher peers should be monitored, the subscription protocol of Section 3.3.2 uses a scoring function to rank peers. In the MAPS approach, the subscriber computes a publisher peer score based on a combination of resource selection 1 and behavior prediction
formulas as shown below:
score(P, q) = α · sel(P, q) + (1 − α) · pred(P, q) (4.1) In this equation, q is a continuous query, P is a certain publisher peer, and sel(P, q) and pred(P, q) are scoring functions based on resource selection and behavior prediction respectively that assign a score to a peer P with respect to query q. Function score(P, q) is used to combine the selection and prediction scores, and decides the ﬁnal score of a peer P with respect to q. The tunable parameter α aﬀects the balance between authorities (high sel(P, q) scores) and peers with potential to publish matching documents in the future (high pred(P, q) scores). The value of α is from 0 to 1 where α = 1 means that the peer score is only based on resource selection, and α = 0 only considers behavior prediction.
Based on these scores, a ranking of peers is determined and q is forwarded to the highest ranked peers. In this way, the subscriber can monitor the top-k publisher peers concerning its information demand.
Notice that this peer selection strategy is general and can also be used in centralized settings, where a server (instead of the distributed directory) maintains the necessary statistics, and mediates the interaction between publishers and subscribers.
The main contribution of this peer selection approach respect to predicting peer behavior, is to view the IR statistics provided by the (centralized or distributed) directory as time series and use statistical analysis tools to model publisher peer behavior. Time series analysis accounts for the fact that the time series observations have some sort of internal structure (e.g., trend, or seasonality etc.), and uses this observation to analyze older values and predict future ones (see Section 2.4). In the next sections, both resource selection (Section 4.2.1) and novel behavior prediction (Section 4.2.2) are presented and investigated in detail. Section 4.2.3 explains the main reasons why both components of the scoring function are needed to get a satisfying system eﬀectiveness. An extensive discussion of resource selection for one-time search in structured P2P networks is not in the focus of this thesis. Chapter 6 includes some aspects of the Minerva one-time search approach where resource selection is adopted to overlapping P2P collections.
4.2.1 Resource Selection In the literature, resource selection (or database selection or collection selection) is the task to decide to which information source a one-time query should be routed. Obviously, in a distributed system, it is too expensive to query all available information sources such that the decision is critical and strongly inﬂuences the retrieval quality (e.g., recall and precision).
1 The term resource selection is also known as database selection or collection selection. Resource selection
is throughout this thesis.
- 52 Chapter 4 Publisher Peer Selection Given the large-scale data distribution of a P2P network, one of the key technical challenges is query routing, which is the process of eﬃciently selecting the most promising peers (from a potentially very large set of peers storing relevant data) for a particular information need. Query routing is very similar to resource selection that has been studied extensively in the literature in the context of metasearch engines 2. There, approaches are typically designed for a small and rather static set of search engines and did not consider the challenges present in a P2P network, such as peer autonomy, network dynamics and high inter-peer latencies. The interested reader is referred to [MYL02, NF06, BMWZ05] for existing approaches to query routing.
In MAPS, the sel(P, q) function returns a score for a peer P and a query q, and is calculated using standard resource selection algorithms such as tf − idf -based methods, CORI [CLC95, SJCO02], GlOSS [GGM95, GGMT99], the decision-theoretic framework (DTF) [Fuh99, NF03], or Statistical Language Models [SC99, PC98, XC99, ZL01, MLS99].
A P2P setting poses some new requirements to route a one-time query to an appropriate set of selected information sources. Here, the assumption of non-overlapping collections does not hold. In [BMT+ 05a], an overlap-aware query routing approach is presented that considers the novelty of single information sources. The idea behind this approach is that a query initiator should not ask sources that provide similar content concerning a query.
Bloom ﬁlters [Blo70] and/or min-wise independent permutations [BCFM00] are used to enrich the distributed directory and to compute a novelty score to combine quality and novelty measures. In addition, correlated key sets as described in [MBN+ 06] are recognized and exploited to further improve peer selection.
Using the scoring function sel(P, q), the MAPS approach identiﬁes authorities specialized in a topic, which, as argued above, is not suﬃcient for the IF setting. Next, two resource selection strategies are presented in more detail including the well-known CORI approach and a simple approach based on IR statistics.
18.104.22.168 The CORI Approach
The Collection Retrieval Inference Network (CORI) [CLC95, SJCO02] reduces the task of resource selection to a document retrieval task by relying on inference networks. In CORI, a super-document is selected as the representative of a collection of documents, and represents the concatenation of all documents with all distinct keys of this collection. If a key appears in t distinct documents in the collection, the key appears t times in the superdocument. The set of all super-documents forms a special-purpose collection that is used to identify the most promising collections for a given query. In principle, standard approaches (e.g., tf − idf and cosine measure) could now be applied. From the viewpoint of regular document scoring and retrieval, term frequencies are replaced by document frequencies, and document frequencies by collection frequencies, i.e., the number of collections that contain a speciﬁc key.
The approach taken by CORI is implemented in the context of the INQUERY [CCH92] retrieval system based on inference networks [TC91] to compute the ranking score of a collection with respect to query q as the estimated belief that the database or information source contains useful documents. The belief is essentially the combined probability that the database contains useful documents for each query key. More speciﬁcally, the belief that the information need expressed by a query key k is satisﬁed by searching the collection
of P is determined by the following equations:
2A metasearch engine is a search engine that sends user requests to several other search engines and/or databases and aggregates the results into a single list or displays them according to their source.
Compared to the original application of inference networks to document scoring, document peers are replaced by the super-documents, yielding a moderate-sized network. The frequency values are typically higher, but that does not aﬀect the computational complexity. However, in a highly dynamic network of (cooperating) peers, the proper estimation of n, clavg, and cf is a non-trivial problem.
22.214.171.124 The MRS Approach
The MAPS Resource Selection approach is based on tf − idf statistics directly available by the distributed directory. This simple strategy (compared to more sophisticated ones as CORI or DTF ) uses the peer document frequency (df ), and the maximum peer term frequency (tf max ) as deﬁned before. The scoring functions sel(P, q) aggregates over all
query keys k as follows:
4.2.2 Behavior Prediction To predict peer behavior MAPS considers time series analysis of IR statistics, thus making a rich repository of techniques from time series analysis [Cha04] applicable to this problem.
These techniques predict future time series values based on past observations and diﬀer in (i) their assumptions about the internal structure of the time series (e.g., whether trends and seasonality can be observed) and (ii) their ﬂexibility to put emphasis on more recent observations. Details about time series analysis including diﬀerent prediction techniques are presented in this thesis in Section 2.4.
- 54 Chapter 4 Publisher Peer Selection Since the considered IR statistics exhibit trends, for instance, when peers successively crawl sites that belong to similar topics, or, gradually change their thematic focus, the employed time series prediction technique must be able to deal with trends properly. Further, in this scenario it is obvious to put emphasis on a peer’s recent behavior and thus assign higher weight to recent observations when making predictions about its future behavior.
Moving average techniques do not support a higher weight for more recent observations, and single exponential smoothing (SES) is not able to recognize trends.
For this reason, MAPS uses double exponential smoothing (DES) as a prediction technique, since it can both deal with trends and put emphasis on more recent observations.
4.2 explains double exponential smoothing and also other techniques. For completeness, notice that there is also triple exponential smoothing (TES) that, in addition, handles seasonality in the observed data. The MAPS setting assumes that many queries are expected to be short-lived so that no seasonality will be observed in the IR statistics time series. For an application with many long-lasting queries, one could use triple exponential smoothing, so that seasonality is taken into account. An example for seasonality could be the continuous query for Christmas gifts. In this case, every year before Christmas, the publication of matching documents would be recognizable.
The function pred(P, q) returns a prediction score for a publisher peer P that represents the likelihood of publishing documents relevant to query q in the future. Using the double exponential smoothing (DES) prediction technique as described before, two values are predicted.
Thus, two aspects of the peer’s behavior are modeled: (i) its potential to publish relˆ evant documents in the future (reﬂected by δ(df P,k )), and (ii) its overall expected future publishing activity (reﬂected by δ(cs)). The time series of IR statistics that are needed as ˆ an input to the prediction mechanism are obtained using the distributed directory. The
predicted behavior for peer P is quantiﬁed as follows in Equation 4.7:
4.2.3 Why Both Strategies Are Needed A key component of the peer selection procedure and peer ranking function 4.1 is the behavior prediction mechanism introduced in this thesis. Prediction is complementary to well-known resource selection techniques (e.g., MRS or CORI approach). The following example using the characteristics of Table 4.1 will demonstrate the necessity in a ﬁltering setting to combine both components of the ranking function, and shows why an approach that relies only on resource selection is not suﬃcient. Table 4.1 lists all four possible combinations describing the past 4 and the future 5.
Table 4.1: Diﬀerent Publisher Characteristics.
Assume a publisher peer P1 that has specialized and become an authority in Sports, but publishes no relevant documents any more. Assume that P1 will not publish any Sportsrelated documents in the future. Another publisher peer P2 is not specialized in Sports, but is currently crawling a sports portal. Imagine a user who wants to stay informed about the upcoming 2008 Olympic Games, and subscribes with the continuous query 2008 Olympic Games. If the ranking function solely relied on resource selection, peer P1 would always be chosen to index the user’s information demand, which would be wrong given that peer P1 no longer publishes Sports-related documents. On the other hand, to have a high ranking score assigned, peer P2 would have to specialize in Sports – a long procedure that is inapplicable in a IF setting which is by deﬁnition dynamic. This makes the use of a ranking and selection strategy that recognizes and predicts publishing behaviors an important component.