# «Approximate Information Filtering in Structured Peer-to-Peer Networks Christian Zimmer Max-Planck Institute for Informatics Saarbrücken ...»

The results show the recall improvements obtained by the proposed algorithms. For two-key queries, 24% publishers have to be monitored to reach a recall level of 0.50 in the baseline approach, whereas, using the CSS approach, the subscriber only has to monitor 19% of the network. The CSS outperforms both USS approaches, because it oﬀers more accurate and explicit statistics for the key-pairs. Comparing the two USS approaches, the use of KMV synopses slightly improves ﬁltering quality when compared to hash sketches.

Considering the results for the three-key query set, the improvements are much higher.

The CSS approach reaches a recall of 0.79 by subscribing to 15% of all publishers. In contrast, the baseline approach reaches a recall of 0.44 by monitoring the same number of publishers. Using this query set, a considerable diﬀerence between the results of the two USS approaches can be seen. The USS-KMV approach that uses KMV synopses almost reaches the result quality of the multi-key approach, while USS-HS suﬀers from the inaccuracy of combining hash sketches of more than two sets. Recall that KMV synopses use a direct way to intersect multisets, whereas hash sketches reside on an indirect computation that causes loss in accuracy.

** Figure 5.2: Average Recall for Three-Key Zeitgeist Queries.**

To investigate the ﬁltering quality of both approaches for four-key continuous queries, queries were created using the single keys from the real-world Zeitgeist query set. Out of the full set of all possible four-key queries, the 500 combinations with the highest number of matching documents in the Web collection have been selected. Examples of resulting queries are time war unit world, day nation world game, or war world london british. All selected continuous queries have more than 3, 000 matching documents within the Web data collection.

In the ﬁrst experimental series with this query set, the general ﬁltering quality of the CSS and USS approaches are investigated. Figure 5.3 shows that CSS outperforms the baseline approach and both USS approaches. When hash sketches are used to represent documents the ﬁltering performance is worse even from the baseline approach due to the inaccuracy of multiset operations. Hash sketches have to use the sieve formula to perform the multiset operation thus suﬀering from inaccurate estimations. Contrary, the use of KMV synopses improves the ﬁltering quality because it guaranties better distinct-value estimations for the intersections of multiple sets (see Section 2.5.2). Notice that the selected four-key queries do not fully show the improvement margins of the CSS algorithm, since more selective key sets with less matching documents beneﬁt more from multi-key statistics.

Using the set of four-key continuous queries described above, the evaluation measured the ﬁltering quality for combining multi-key statistics as proposed in Section 5.4.2.4. Figure

5.4 illustrates the improvements for several scenarios of available multi-key statistics in comparison to the baseline approach. The ﬁltering quality of CSS gives the upper bound

**by using the exact multi-key statistics of the full continuous query:**

• In the ﬁrst scenario CSS-1, all four possible three-key statistics are available to use select the most appropriate publishers. This means for a four-key continuous query abcd that the statistics for abc, abd, acd, and bcd are available. Notice that multi-key statistics for single keys or pairs can also be stored in the directory but not be used since the proposed algorithm in Section 5.4.2.4 to combine multi-key statistics only considers maximal subsets.

0.6 Figure 5.3: Average Recall for Four-Key Zeitgeist Queries.

** Figure 5.5: Average Recall Improvements for Two-Key Queries.**

• All six two-key statistics are provided by the directory in the second scenario denoted as CSS-2 but the directory does not store statistics for three key subsets.

• The third CSS-3 scenario assumes the existence of two two-key statistics such that the whole key set is covered (e.g., for query abcd, the statistics of ab and cd are available).

• One three-key plus one complementary single-key statistics are provided by the directory in the CSS-4 scenario (e.g., statistics for acd and b).

• The last CSS-5 scenario assumes two overlapping multi-key statistics for one two-key plus one three-key statistics. Here, the multi-key statistics for abc and cd meet the scenario condition.

The results show that all combinations improve the baseline approach, but cannot reach the ﬁltering quality of the CSS approach. Naturally, ﬁltering quality depends on the size of the multi-key statistics: the scenarios (CSS-5 and CSS-1 ) where multi-key statistics for three keys are considered to select publishers show the best ﬁltering performance; in contrast, the three other scenarios only have two-key statistics.

** Figures 5.5 and 5.**

6 demonstrate another series of experiments that targeted the improvements in ﬁltering performance over the baseline algorithm, when using the proposed correlation measures for multi-key statistics. To conduct this experiment, all possible twoand three-key queries are created using the single-keys from the Zeitgeist query-log. Three diﬀerent query sets with a size of 100 each are selected; the query sets have the following

**properties:**

• Low-Corr includes those key sets where all keys have a low conditional probability such that there are only a few documents in the Web data set that contain both keys at the same time.

** Tables 5.2 and 5.**

3 show example two-key and three-key queries for the used query sets described above. Underlined scores denote high conditional probabilities. The threshold depend on the number of keys. The key sets island brother and british sport star show low conditional probabilities whereas in the key sets football sport and nation unit world, all keys have a high conditional probability. The other two examples belong to the query set High-Corr where keys show a diverse correlation measure (at least one key has a high conditional probability).

As already explained in Section 5.4, subscribing to key sets where all keys have low conditional probabilities yields to the highest ﬁltering result improvements when monitoring the same percentage of publishers. This is caused by the fact that a low conditional probability means that there are a lot of matching documents this key without matching the remaining keys. Thus, when monitoring only 10% of the publishers, 2-Key-Low-Corr has an recall improvement from 0.44 to 0.65 whereas 2-Key-All-High-Corr only improves ﬁltering quality from 0.25 to 0.28 (see Figure 5.5). Similar results hold for three-key queries in Figure 5.6 where 3-Key-Low-Corr has an improvement from 0.43 to 0.68 compared to improvement for 3-Key-All-High-Corr from 0.30 to 0.33. This leads to the conclusion that the CSS algorithm has no signiﬁcant eﬀect for key sets where all keys are highly correlated, while it signiﬁcantly improves ﬁltering for key sets with low correlations.

Analyzing the results for key sets where one key is highly correlated to all others (HighCorr ), an smaller improvement compared to unrelated keys can be observed. Here, the use of the CSS approach is possible, but an alternative strategy is proposed: a high conditional probability for a key k means that there is almost no additional information included in multi-key statistics for the full key set in comparison to the key set without k. As an example, the multi-key statistics for key set time face friend yield to similar ﬁltering results than the multi-key statistics for face friend because the conditional probability P (time|f ace, f riend) is high (83% of all documents containing face and friend also contain time). The same observation holds for the key set time gift where time denotes the highly correlated key. There, 60% of all documents containing gift also contain time.

**5.5.3.2 Results Using Blog Data**

In this section, experimental results from the blog data experiment are presented. Overall, the observations for blog data lead to similar conclusions with the Web data corpus, with CSS outperforming USS and baseline.

The results for all two-, and three-key queries from the Zeitgeist query-log are presented in Figure 5.7. The complete Zeitgeist queries are listed in the appendix B. Again, the KMV synopses perform better than hash sketches. The USS-KMV approach is almost as good as the the upper bound of CSS whereas USS-HS at least outperforms the baseline approach. Notice that, contrary to the Web data collection, a smaller number of blogs have to be monitored to acquire a suﬃciently high recall. This happens because all blogs are highly specialized and thus only a small number of them contribute new posts matching the requested continuous queries. Compared to this, the Web data collection provides less speciﬁed publisher peers.

Subsequently, Figures 5.8, 5.9, and 5.10 illustrate the average recall improvements for two-key, three-key queries, or four-key queries respectively considering the diﬀerent query sets introduced in the previous section (Low-Corr, High-Corr, and All-High-Corr ). Again, the experiments use selections of all possible key set combinations created by single keys with maximal numbers of matching blog posts. The sizes of all query sets range from 10 to 20, and now, key sets with four keys are regarded, too.

Similar to the experiments with Web data, the CSS approach for multi-key queries with low conditional probabilities shows the highest recall improvement. In Figure 5.8, there is almost no improvement for two-key queries where both keys have a high conditional probability (2-All-High-Corr ) but the improvement for queries with low conditional probability (Low-Corr ) is signiﬁcant: from 0.53 to 0.67 when monitoring 10 blogs. Thus, blog data experiments conﬁrm the previous observations. In Figures 5.9 and 5.10, the diﬀerences between low correlated (Low-Corr ) and high correlated key sets (All-High-Corr ) are reduced but still existing.

0.6 Figure 5.9: Average Recall Improvements for Three-Key Queries.

** Table 5.4: Examples of Relatedness among Four-Key Queries.**

The following example makes this point clear using the conditional probabilities in Table

5.4. Using a multi-key continuous query london time train bbc, it is suﬃcient to apply the CSS approach for the subset time train bbc because the key london has a conditional probability of 0.97. In other words, 97% of all posts containing the key london also contain the remaining keys. So, there is no signiﬁcant additional information included in the multikey statistics of the complete key set.

- 102 Chapter 5 Correlated Key Sets

5.6 Discussion This chapter introduced two approaches that utilize correlations among keys to improve ﬁltering quality. The USS algorithm uses existing single-key synopses stored in the directory to estimate the publishing behavior of information sources for key sets, while the CSS algorithm enhances the directory to explicitly maintain statistical metadata about selectively chosen key sets. This is the ﬁrst work to develop algorithms for exploiting keyword correlations in such an dynamic IF setting.

Whereas the approach of [MBN+ 06] is limited to two-key queries, the USS algorithm can be applied to multi-key continuous queries for an arbitrary number of keys. Therefore the usage of very recent state-of-the-art techniques for compact representation of multisets (KMV synopses) is included in the algorithm. These synopses overcome the restrictions of hash sketches. The CSS algorithm uses a new strategy to approximate multi-key statistics by combining the statistics of arbitrary subsets. This strategy is useful when the multi-key statistics for the full continuous query are not included in the directory.

The experimental evaluation illustrated the ﬁltering performance improvements of both algorithms in comparison to the baseline approach presented in this thesis. All experimental series used two diﬀerent real-world collections for Web and blog data, and applied real-world Google Zeitgeist queries. The evaluation also investigated ﬁltering performance gains depending on the introduced correlation measure (conditional probability) representing a way to compute the relatedness among keys. The next chapter presents the current prototype implementation of the MAPS approximate information ﬁltering approach.

Chapter 6 Prototype Implementation This chapter presents the current prototype implementation [ZHTW08] of the MAPS approximate information ﬁltering approach introduced in the previous chapters of this thesis.

Therefore, the Minerva search engine developed as explained in [BMPC07] is used and extended with additional components to realize both functionalities (one-time searching and publish/subscribe) in one unifying system.

** Section 6.1 introduces some aspects of current Web search.**

In Section 6.2, the main aspects of the Minerva search engine are explained including an implementation overview. In addition, the most important principles that drive the P2P search paradigm are illustrated.

** Section 6.3 presents the additional components for approximate pub/sub functionality that have been implemented.**

This section also shows the usage of both functionalities by following some example (continuous and one-time) queries. Screenshots from the system’s graphical user interface (GUI) are used to demonstrate and explain the MAPS system. A short overview of some other prototype systems for IR and IF in P2P networks presented in Section 6.4, and the discussion in Section 6.5 conclude this chapter.

6.1 Introduction Today, full-ﬂedged Web search is more or less under the control of centralized search engines.