# Approximate Information Filtering in Structured Peer-to-Peer Networks

Christian Zimmer, Max-Planck Institute for Informatics, Saarbrücken

**5.4 Exploiting Correlations**

This section presents two new algorithms that exploit correlations among key sets: (i) the USS algorithm (Unique Synopses Storage), presented in Section 5.4.1, uses the single-key synopses (hash sketches or KMV synopses) already stored in the distributed directory to estimate the filtering behavior for key sets, and (ii) the CSS algorithm (Combined Synopses Storage), explained in Section 5.4.2, enhances the single-key directory to explicitly maintain statistical metadata about selectively chosen sets of multiple keys, depending on the correlation measure (conditional probability) introduced in Section 5.3. Both algorithms build upon and extend the baseline algorithm introduced in Section 5.2.

**5.4.1 The USS Algorithm**

In this section, the USS algorithm (Unique Synopses Storage) is presented. USS uses the single-key synopses already stored in the distributed directory to estimate the publishing behavior for complete key sets. Since the distributed statistics of a publisher p_i regarding a key a contain the synopsis SYN_i(a) (representing the documents of p_i that contain key a), it is possible to estimate the number of documents containing all keys of a given key set S.

For this, the algorithm uses the intersection operation for multisets, explained in detail in Section 2.5, as an estimate of the frequency of the whole key set in p_i's local collection (df_i(S)). Together with the prediction techniques presented in this thesis, the most promising publishers can then be selected by applying appropriate scoring functions based on publisher behavior prediction.

To explain the algorithm, consider a subscriber s subscribing with a multi-key continuous query cq = {k_1, k_2, ..., k_n} containing n distinct keys. According to the USS algorithm, the following steps are executed:

1. For each key k_j, 1 ≤ j ≤ n, in cq, subscriber s contacts the directory peer d(k_j) responsible for k_j and retrieves the statistical summaries published individually for all keys, including the synopses SYN_i(k_j) produced by a publisher p_i.
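The intersection-based estimation underlying USS can be sketched as follows. This is a minimal KMV-style synopsis in Python; the class and function names are our own illustration, not taken from the thesis, and a production synopsis would follow the definitions of Section 2.5:

```python
import hashlib

def h(x):
    """Hash a string to a float in (0, 1]."""
    d = hashlib.sha1(x.encode()).digest()
    return (int.from_bytes(d[:8], "big") + 1) / 2**64

class KMV:
    """K-minimum-values synopsis: keeps the k smallest hash values of a set."""
    def __init__(self, k=256):
        self.k = k
        self.mins = []  # sorted list of at most k smallest hash values

    def add(self, item):
        v = h(item)
        if v not in self.mins:
            self.mins.append(v)
            self.mins.sort()
            del self.mins[self.k:]  # keep only the k smallest

def intersection_estimate(synopses):
    """Estimate the size of the intersection of the sets summarized by the
    given KMV synopses (the combined KMV estimator)."""
    k = min(len(s.mins) for s in synopses)
    # k smallest hashes of the union of all synopses
    union_mins = sorted(set().union(*(set(s.mins) for s in synopses)))[:k]
    d_union = (k - 1) / union_mins[-1]  # distinct-count estimate of the union
    in_all = sum(1 for v in union_mins if all(v in s.mins for s in synopses))
    return in_all / k * d_union  # Jaccard fraction times union size

# Demo: estimate df_i({a, b}) for a publisher whose documents overlap on a and b.
docs_a = [f"doc{i}" for i in range(1000)]       # documents containing key a
docs_b = [f"doc{i}" for i in range(500, 1500)]  # documents containing key b
syn_a, syn_b = KMV(256), KMV(256)
for d in docs_a:
    syn_a.add(d)
for d in docs_b:
    syn_b.add(d)
est = intersection_estimate([syn_a, syn_b])  # true overlap is 500
```

The subscriber would run `intersection_estimate` per publisher over the synopses SYN_i(k_j) retrieved in step 1 to obtain an estimate of df_i(S).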

**5.4.2 The CSS Algorithm**

To overcome the problems of the USS algorithm, the CSS algorithm (Combined Synopses Storage) is introduced. The algorithm identifies valuable key combinations, enhances the directory to explicitly maintain these multi-key statistics, and exploits them to improve publisher selection. In addition, the CSS algorithm can combine multi-key statistics for subsets of the requested continuous query.

As already discussed in Section 5.3, uncorrelated or negatively correlated keys are the ones for which collecting multi-key statistics pays off. The CSS algorithm addresses the following questions: (i) how to decide which key sets are of sufficient interest to disseminate additional multi-key statistics; (ii) how to disseminate the multi-key statistics of publishers to the directory; (iii) how to exploit the additional statistics to improve publisher selection; and (iv) how to deal with multi-key statistics for subsets of the complete continuous query.

**5.4.2.1 Assessing Key Sets**

Given a key set S = {k_1, k_2, ..., k_n} with n distinct keys, a deterministic function (e.g., selecting the lexicographically highest or lowest key) is employed to choose one directory peer that assesses the relatedness among the keys of S and becomes responsible for this key set. Such a deterministic approach, with pre-hashing of the key values, ensures load balancing among the directory peers. The CSS approach uses the following three steps to assess a candidate key set S, where d(S) is the directory peer responsible for the assessment decision:

1. Initially, d(S) contacts all directory peers responsible for the other keys k_j ∈ S to retrieve the synopsis SYN(k_j) representing all documents in the network containing that key. Each answering directory peer d(k_j) computes the synopsis SYN(k_j) locally, using the union operation over all individual publisher synopses SYN_i(k_j) (see Section 2.5).

2. Subsequently, d(S) computes the intersection of the synopses SYN(k_j) to obtain the estimated number of documents containing all keys of the candidate set (denoted df(S)).

3. Finally, using Equation 5.2, directory peer d(S) computes the conditional probabilities for each key k_j ∈ S.

Small values of the conditional probabilities indicate that the occurrences of the keys in documents are largely independent, meaning that publisher selection can strongly benefit from the existence of multi-key statistics. Thus, CSS initiates the creation of these additional summaries if the conditional probabilities for all keys k_j ∈ S are below some threshold β.

To further optimize message traffic between directory peers, a second threshold α is introduced: if the conditional probability of a key k_j is above α, the publisher selection strategy does not have to consider the single-key statistics of that key. The idea behind this is that k_j is contained in almost all documents containing the rest of the keys S \ {k_j}, so its single-key statistics add little information, and the key set is a candidate for multi-key statistics. The experimental evaluation in Section 5.5 also investigates this strategy by computing the conditional probabilities of several continuous queries.
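The assessment decision can be sketched compactly. The sketch below assumes that Equation 5.2 computes the conditional probability of k_j given the remaining keys as df(S) / df(S \ {k_j}); this reading, the function names, and the `df` callback interface are our assumptions, not verbatim from the thesis:

```python
def conditional_probabilities(df, S):
    """P(k_j | S \\ {k_j}) = df(S) / df(S \\ {k_j}) for every k_j in S.

    `df` is a callable returning the (estimated) number of documents that
    contain all keys of a given key set -- in CSS these counts come from
    intersecting the network-wide per-key synopses SYN(k_j).  Assumes all
    denominators are nonzero.
    """
    S = frozenset(S)
    return {kj: df(S) / df(S - {kj}) for kj in S}

def assess_key_set(df, S, beta=0.2):
    """CSS assessment: collect multi-key statistics for S only if every
    conditional probability is below the threshold beta."""
    probs = conditional_probabilities(df, S)
    return all(p < beta for p in probs.values()), probs

# Toy document frequencies: a and b rarely co-occur, so {a, b} is useful.
counts = {frozenset({"a"}): 1000, frozenset({"b"}): 800,
          frozenset({"a", "b"}): 10}
df = lambda s: counts[frozenset(s)]
useful, probs = assess_key_set(df, {"a", "b"}, beta=0.2)
```

Here `useful` is true because both conditional probabilities (10/800 and 10/1000) are far below β; for highly correlated keys the same check would reject the candidate set.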

**5.4.2.2 Disseminating Multi-Key Statistics**

As soon as a key set has been assessed as a candidate worth collecting multi-key statistics for (as explained above), publishers have to learn this fact. Especially in an information filtering scenario, where statistics have to be updated continuously anyway, the ongoing process of directory refreshment can be used to disseminate the knowledge about valuable multi-key sets.

Assume that a directory peer d(S), responsible for a key k ∈ S, has identified the key set S as useful. According to the algorithm, d(S) is then also responsible for maintaining the multi-key statistics for the complete key set S. Whenever a publisher p updates its statistics for key k, d(S) can inform the publisher about the new key set. In this way, p will compute its local statistics for the complete set S and disseminate them to the directory peer d(S). This procedure does not incur any additional messages compared to the baseline approach described in Chapter 3, since the additional statistics can be piggybacked on messages that need to be sent anyway.

Next, assume that a subscriber s submitting a continuous query cq asks the directory peers responsible for the keys in cq to provide the related statistics. Since cq is included in the request message, a directory peer d(S) that stores statistics about a key set S with S ⊆ cq will return the statistics for S together with the single-key statistics it is responsible for. The following section explains how to compute the statistics for a multi-key query by combining statistics of subsets available in the directory. Subsequently, the subscriber collects the multi-key statistics and uses them to perform publisher ranking.

Note that the subscriber applies publisher selection based on multi-key statistics and is thus able to predict the publishing behavior of sources more accurately. In contrast, the baseline algorithm performs a separate prediction for each single key. If no summaries for the key set (or for its subsets, see Section 5.4.2.4) have been published, the baseline procedure using single-key summaries is still applicable without contacting additional directory peers.

**5.4.2.4 Combining Multi-Key Statistics**

A common case for continuous queries with more than two keys is that no statistics exist for the full key set S of the query because, e.g., the assessment step did not consider the full key set sufficiently useful. However, there may be statistics available for several multi-key subsets of S. In this case, by design, all of these subsets can be found by the subscriber by simply asking the directory peers responsible for the single keys, as explained in the previous section.

To give an example, assume that the directory maintains multi-key statistics for S_1 = {a, b, c} and S_2 = {b, c, d}. Then, for a given five-key query S = {a, b, c, d, e}, the algorithm has several options at hand. The CSS approach proposes to select all maximal subsets among the available multi-key summaries. This is efficient in terms of network costs because the entire continuous query is sent to all single-key directory peers anyway. Still, this consideration opens up a space of optimization strategies; combining such incomparable but mutually related multi-key statistics is reminiscent of recent work on multidimensional histograms with incomplete information [MMK+05], but the setting here has the additional complexity of a very high-dimensional key space.

To combine the different multi-key statistics, the CSS algorithm weights them by the size of the multi-key subset. The scoring function to calculate a publisher score is given by Equation 5.3:

score_S(p) = Σ_{S_i ⊆ S} |S_i| · predScore_{S_i}(p)    (5.3)

Here, S_i is a subset of the key set S contained in the continuous query, |S_i| is its cardinality, and predScore_{S_i}(p) represents the likelihood that p will produce a document containing S_i in the future. This score is produced using time series analysis techniques (plus resource selection techniques) similar to those presented in Chapter 4. Thus, to obtain a publisher score, the system sums up the prediction scores for all subsets S_i received from the directory peers such that there is no S_j with S_i ⊂ S_j.

The intuition behind weighting the prediction score with the size |S_i| of the key set is that, without this weighting, the prediction scores for small subsets would dominate the sum. This happens because the number of documents containing all keys of a small key set is higher, resulting in higher prediction scores. The next section experimentally investigates scenarios where multi-key statistics for the full continuous query are not available but a combination of the statistics of subsets is possible.
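Putting the maximal-subset rule and Equation 5.3 together gives the following sketch. The function names and the `pred_score(p, S_i)` interface, which stands in for the time-series-based behavior prediction of Chapter 4, are our illustration:

```python
def maximal_subsets(available, S):
    """Among the key sets with directory statistics, keep those contained in
    the query S that are not strictly contained in another kept set."""
    cands = [frozenset(a) for a in available if frozenset(a) <= frozenset(S)]
    return [si for si in cands if not any(si < sj for sj in cands)]

def score(publisher, S, available, pred_score):
    """Equation 5.3: sum |S_i| * predScore_{S_i}(p) over the maximal
    available subsets S_i of the query key set S."""
    return sum(len(si) * pred_score(publisher, si)
               for si in maximal_subsets(available, S))

# Example from the text: statistics for {a,b,c}, {b,c,d} (and a redundant
# {a,b}) are available for the five-key query {a,b,c,d,e}.
avail = [{"a", "b", "c"}, {"b", "c", "d"}, {"a", "b"}]
query = {"a", "b", "c", "d", "e"}
# With a constant prediction score of 1.0 the result is 3*1 + 3*1 = 6:
# {a,b} is dropped because it is contained in {a,b,c}.
s = score("p1", query, avail, lambda p, si: 1.0)
```

The weighting by `len(si)` implements the size factor |S_i| that keeps small subsets from dominating the sum.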


**5.5 Experimental Evaluation**

In this experimental section, the USS and CSS algorithms are evaluated using two real-life corpora with Web and blog data. Since it is extremely difficult to obtain real-life continuous queries except through proprietary data (e.g., from CNN's news or Springer's journal alert system), the experiments utilize the popular Web queries contained in the Zeitgeist query-log and treat them as subscriptions (e.g., assuming that a user is interested in staying informed about a favorite pop star or a big sports event). The Google Zeitgeist³ query-log is published periodically for different geographic regions to reflect the most interesting search trends.

**5.5.1 Experimental Setup**

The evaluation compares the filtering performance of both algorithms with different synopses (hash sketches and KMV synopses) and against a baseline algorithm that computes publisher scores using single-key frequencies to predict the publishing behavior of information producers. This baseline corresponds to the approach presented in Section 5.2. The prediction mechanisms have an average prediction error of 10% to ensure realistic system behavior.

As the quality measure for the filtering performance of both algorithms, the experiments use recall. Recall in the IF setting is defined as the ratio of the total number of notifications received by subscribers to the total number of published documents matching the subscriptions. In the experiments, the average recall over several publication rounds is considered. In all graphs in this section, recall is plotted against the percentage of monitored publisher peers.
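The averaged recall measure can be stated as a small helper; the per-round `(notified, matching)` input format is our own convention for illustration:

```python
def average_recall(rounds):
    """IF recall, averaged over publication rounds.

    `rounds` is a list of (notified, matching) pairs: the number of
    notifications received by subscribers and the number of published
    documents matching the subscriptions in that round.  Rounds without
    matching documents are skipped.
    """
    recalls = [notified / matching for notified, matching in rounds if matching > 0]
    return sum(recalls) / len(recalls)

# Two rounds with recall 0.8 and 0.5 give an average recall of 0.65.
r = average_recall([(8, 10), (5, 10)])
```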

**5.5.2 Experimental Data**

For the experimental setup, a recently proposed benchmark [NBMW06], designed specifically for the evaluation of distributed and P2P settings, is utilized. Briefly, the benchmark consists of more than 800,000 Web documents drawn from the Wikipedia corpus, together with an algorithm to distribute the documents among 1,000 publishers with controlled overlap that mimics real-world behavior.

The Zeitgeist query-log contains queries with one, two, and three keys. To investigate filtering quality depending on the conditional probabilities, the set of distinct single keys of all queries is used to create two-, three-, and four-key queries by combination. During query evaluation, a continuous query is indexed at up to 25% of the publishers. The experiments use the concept of publication rounds to simulate the temporal properties of the published documents; here, a publisher peer publishes around 400 documents in each round, which could represent the weekly publications of a digital library or news portal.

The second data set focuses on blog data⁴. Out of a huge set of blogs, about 1,000 blogs were selected that each published more than 300 postings during a time period of three weeks. The selected blogs were crawled and assigned to 1,000 publishers; the crawled blog data thus also spans three weeks. Again, the same Zeitgeist query-log is used to evaluate the filtering performance, while the query repositioning algorithm was executed to simulate one query repositioning per week.

³ http://www.google.com/press/zeitgeist.html archives query-logs since 2001.
⁴ Another well-known source of blog data is Blogsphere [BK07].

Figure 5.1 (respectively 5.2) shows the filtering quality for all two-key (respectively three-key) continuous queries from the Web query-log using the Web data set. The evaluation compares a baseline publisher selection algorithm based on behavior prediction for single-key statistics, as described in Chapter 3, with the CSS approach of multi-key statistics maintained in the directory, and with the USS approach using the two different synopses (hash sketches and KMV synopses).