«Approximate Information Filtering in Structured Peer-to-Peer Networks Christian Zimmer Max-Planck Institute for Informatics Saarbrücken ...»
Contrary to approaches such as [TIK05a] that distribute the documents or the index lists among the peers to achieve exact retrieval and ﬁltering functionality, publications in MinervaDL are kept locally at each provider. This lack of publication forwarding mechanism is a design decision that oﬀers increased scalability in MinervaDL by trading recall. Thus, only monitored provider peers (i.e., indexing a continuous query cq) can notify a consumer C, although other providers may also publish relevant documents. As already stated, this makes the scoring function pred(P, cq) a critical component.
Thus, when a provider peer P wants to publish a new document d to the MinervaDL network, the document is only matched against P ’s local continuous query database to determine which continuous queries match d, and thus which consumers should be notiﬁed.
Additionally, at certain intervals P creates Post messages with updated statistics and sends them to its access point S.
7.4 Scoring Functions Selecting the appropriate provider peers to forward an one-time or a continuous query requires a ranking or scoring function that will be used to determine the most appropriate sources (providers) for a given information demand. Although, both IR and IF protocols utilize the same metadata stored in the distributed directory, the peer ranking strategies
in this use case diﬀer signiﬁcantly and have to consider varying objectives:
So, it is clear, that both functionalities have to consider diﬀerent objectives. In the ﬁltering scenario, the main approach of this doctoral theses is used to select the best providers. To even better illustrate this, consider the following example: Assume two providers, where the ﬁrst is specialized in soccer (i.e., in the past has published a lot of documents about soccer), although now it is rarely publishing new documents. The second provider is not specialized in soccer but currently it is publishing many documents about soccer. Now, a consumer subscribes beginning 2008 for documents with the continuous query soccer Euro 2008 1. A ranking function based on resource selection algorithms would always choose the ﬁrst provider. To get a higher ranking score, and thus get selected for indexing the query, the second provider has to specialize in soccer, a long procedure that is inapplicable in a ﬁltering setting, which is by deﬁnition dynamic. The above example illustrates that the ranking formula should be able to predict the publishing behavior of providers by observing both their past and current behavior and projecting it to the future.
2 discusses the appropriate technique from this doctoral thesis.
1 The2008 UEFA European Football Championship, commonly referred to as Euro 2008, takes place in Austria and Switzerland, from 7 to 29 June 2008.
- 130 Chapter 7 Digital Library Use Case 7.4.1 Resource Selection As mentioned before, a number of resource selection methods (e.g., CORI [CLC95], GlOSS [GGMT99], CVV [YL97], language models, etc.) are available to identify good authorities with respect to a given multi-key query q. A simple but eﬀective approach utilizes term frequencies (tf ) and document frequencies (df ) already stored in the directory to compute
a score sel(P, q) for all provider peers P :
The parameter β in Equation 7.1 can be chosen in the range between 0 and 1. It is used to stress the importance of df or tf max. Preliminary experiments have shown that β = 0.5 is an appropriate value in most cases [BMT+ 05a]. Using the scoring function presented above, providers can be can identiﬁed that store high-quality documents for query q, and thus achieve high recall by querying only a few providers in the MinervaDL network. To further improve recall, there are peer selection strategies that consider overlap-awareness [BMT+ 05a] and key correlations [MBN+ 06].
7.4.2 Behavior Prediction The prediction mechanism collects IR statistics from the distributed directory and treats them as time series data to perform statistical analysis over them. A statistical analysis assumes that the data points taken over time have some sort of internal structure (e.g., trend etc.), and uses this observation to analyze older values and predict future ones [Cha04].
There exist various approaches (see 2.4) that diﬀer in their assumptions about the internal structure of the time-series (e.g., whether it exhibits a trend or seasonality). Moving average techniques as described in 18.104.22.168 are a well-known group of time series prediction techniques that assign equal weights to past observations (e.g., averaging is the simplest form of moving average techniques), and thus cannot cope with trends or seasonality. In the Minerva DL setting, it is reasonable to put more emphasis on a peer’s recent behavior and thus assign higher weights to recent observations. Double exponential smoothing assigns exponentially decreasing weights to past observations and assumes that the time series data present some trend to predict future values. Since many queries are expected to be short-lived so that no seasonality will be observed in the IR statistics time series, seasonality is not considered in the predictions (and thus, triple exponential smoothing is not used).
The scoring function pred(P, cq) returns for a score representing the probability that provider P will publish in the future documents relevant to the continuous query cq. MinervaDL uses double exponential smoothing to predict the following two statistical values.
ˆ Initially, for all keys k in cq, the approach predicts the value for dfP,k (denoted as df P,k ), ˆ )) between df ˆ and uses the diﬀerence (denoted as δ(df P,k P,k and the last received value ˆ ) reﬂects the number of relevant from the directory to calculate the score for P. δ(df P,k documents concerning t that P will publish in the next period. Secondly, MinervaDL predicts δ(cs) as the diﬀerence in the collection size of P reﬂecting the provider peer’s overall ˆ expected future publishing activity. In this way two aspects of the peer’s behavior are modeled: its ability to publish relevant documents in the future and its overall expected publishing activity. The consumer peer that submits cq obtains the IR statistics that are needed as an input to the prediction mechanism by utilizing the distributed directory. The following formula is used to compute the prediction score for a provider P with respect to
a continuous query cq:
7.5 Experimental Evaluation This section presents the evaluation of MinervaDL. In the experimental evaluation, a total number of 1000 providers and the same number of consumers is assumed. All these peers are connected to a MinervaDL network using 400 super-peers as access points. At bootstrapping, each provider stores about 300 documents and is mainly specialized in one out of ten categories: Music, Finance, Arts, Sports, Natural Science, Health, Movies, Travel, Politics, and Nature. Thus, there are 100 specialized information sources per category.
Overall, the experimental runs use 30 queries2 containing two, three or four keys that are strong representatives of a certain document category and are used both as one-time and continuous queries. The next sections investigate search performance, ﬁltering performance, and a message cost analysis.
7.5.1 Search Performance In Figure 7.3, the experimental results of the information retrieval functionality of MinervaDL are presented. Consumers issue the 30 one-time queries and the evaluation measures
the relative recall to a centralized search engine hosting the complete document collection:
the ratio of top-25 documents for a query included in the merged P2P query result. Thus, MinervaDL applies the resource selection strategy as described before in 7.4.1 and increase the percentage ρ of providers in the system that are requested to answer the query.
Figure 7.3 shows that the retrieval performance of MinervaDL manages to retrieve more than 90% of the top rated documents by asking only a fraction of the providers, whereas the baseline approach of random peer selection reaches a much lower recall (around 20%).
This experiment shows that a random selection of provider peers does not lead to satisfying retrieval results.
7.5.2 Filtering Performance To evaluate the ﬁltering functionality of MinervaDL, the experiments assume that providers publish 30 documents within a speciﬁc time period (called publishing round ) and the results after ten publishing rounds are investigated. A scenario that models periods of inactivity in the publishing behavior is considered. This scenario assumes that consumers subscribe with 30 continuous queries and reposition them in each publishing round. In the experiments, the percentage ρ of monitored providers is varied and the providers are ranked using the two scoring functions approaches described in Section 7.4. Recall is used as ﬁltering measures;
it is deﬁned as the ratio of total number of notiﬁcations received by the consumers to the total number of published documents matching a subscription.
2 Example queries are museum modern art, or space model.
Figure 7.5: Approximate vs.
Exact Information Filtering.
As illustrated in Figure 7.4, the use of behavior prediction improves recall over resource selection as it manages to model more accurately the publishing behavior of providers.
In this way, the proposed scoring function manages to achieve a recall of around 80% by monitoring only 20% of the providers in the MinervaDL network. Notice that for resource selection, this number is only 35% which makes it an inapplicable solution to a ﬁltering environment.
7.5.3 Message Costs Analysis Figures 7.5 and 7.6 show diﬀerent aspects of a message costs analysis of MinervaDL. Here, in each round, the subscriptions are repositioned and the one-time queries are requested
again (e.g., by another consumer peer). In this setting, there are three types of messages:
directory, retrieval, and ﬁltering messages. Figure 7.5 compares MinervaDL with an implementation of the exact matching protocols of LibraRing [TIK05a].
The overall message costs are considered as a function of the total number of documents published in the system for both approaches. As shown, message costs of MinervaDL are independent of the number of publications, whereas LibraRing, as with all exact matching approaches, is sensitive to this parameter since documents have to be disseminated to the network at publication time. This imposes a higher network cost especially for high publication rates as those expected in a DL setting.
Finally, in Figure 7.6, it can be observed that directory messages dominate both retrieval and ﬁltering messages. This is expected since matching and ﬁltering mechanisms are by design local to accommodate high publication rates. This means that building both IR and IF functionality on top of the routing infrastructure imposes no extra cost on the architecture, compared to one that supports only one type of functionality. Since other types of information can also be stored in the directory in the same fashion (e.g., QoS statistics or load balancing information) building extra functionality will increase the added value of the directory maintenance, and come at almost no extra cost.
1.5 Figure 7.6: Message Costs in MinervaDL.
7.6 Discussion This chapter presented a digital library use case utilizing a novel DL architecture coined MinervaDL, designed to provide both IR and IF functionality in a single unifying framework.
All introduced protocols regulate the communication between the three peer types (superpeers, providers, and consumers). The experimental evaluation of this use case investigated the inﬂuence of the individual scoring functions used to perform the two functionalities of MinervaDL. Furthermore, MinervaDL was compared against the LibraRing architecture designed to provide exact retrieval and ﬁltering. The next chapter will conclude this thesis by summarizing the main contributions, and giving an outlook of open problems.
Chapter 8 Conclusion and Open Questions The last chapter of this doctoral thesis summarizes the presented work (Section 8.1), and lists the main contributions (Section 8.2). Finally, Section 8.3 discusses possible extensions to the presented work and elaborates on open research questions with regard to load balancing, dynamics in P2P networks, replication & caching, and economic aspects1.
8.1 Summary Recently, the P2P paradigm has moved towards distributed data management systems that are able to oﬀer information retrieval or information ﬁltering functionality. The main advantage of P2P is its ability to handle huge amounts of data in a decentralized and selforganizing manner. The characteristics of P2P oﬀer high potential beneﬁt for information systems powerful regarding scalability, eﬃciency, and resilience to failures and dynamics.
Beyond that, such an information management system can potentially beneﬁt from the intellectual input of a large user community participating in the data sharing network. Finally, P2P information systems can also bring forward pluralism in informing users about Internet content, a crucial property that breaks information-resource monopolies and bypasses the biased visibility of content from economically powerful sources. In the above distributed and dynamic setting, information ﬁltering, also referred to as publish/subscribe or continuous querying or information push, can be seen as equally important to information retrieval (or one-time querying), since users are able to subscribe to information sources and be notiﬁed when new events of interest happen. A user posts a subscription (or continuous query) to the system to receive notiﬁcations whenever matching events of interest take place. Information ﬁltering and information retrieval are often referred as two sides of the same coin [BC92].