«Approximate Information Filtering in Structured Peer-to-Peer Networks Christian Zimmer Max-Planck Institute for Informatics Saarbrücken ...»
On the other hand, authorities are expected with high probability to publish new documents from the same topic in the future. In this case, assume another publisher peer P3 similar to P1 that is not publishing Sports-related documents at the moment. But contrary to P1, P3 is not a good authority for Sports. So, a peer selection strategy should prefer P1 because a Sports-authority is expected to publish Sports-related documents in the future with a higher probability than a publisher that is specialized in another topic. For this reason, MAPS peer selection strategy combines resource selection and behavior prediction techniques. The following experiments in Section 4.3 will investigate several scenarios where a combination of both parts have to be stressed using the system parameter α in Equation
4.1. Obviously, a publisher P4 that is an authority for Sports (since it stores many documents regarding Sports) and that also publishes new documents concerning this topic, will be the best choice for monitoring the issued continuous query.
The fact that resource selection alone is not suﬃcient is even more evident in the case of news items. News items have a short self-life, making them the worst candidate for slowpaced resource selection algorithms. The above example shows the need to make slow-paced selection algorithms more sensitive to the publication dynamics in the network, and MAPS employs peer behavior prediction to cope with such dynamic scenarios.
4 The past describes whether a publisher is or not a good authority for a topic based on the stored data that has been published before.
5 The future describes the predicted and expected publishing behavior whether a publisher will provide
4.3 Experimental Evaluation This experimental section evaluates the MAPS approach and investigates the ﬁltering effectiveness and eﬃciency depending on the peer selection strategy in diﬀerent publishing scenarios. Section 4.3.1 describes the experimental setup including the performance measures used in the evaluation. The experimental data is explained in Section 4.3.2 including the continuous query set. Section 4.3.3 shows experimental results for various publishing scenarios in terms of recall and beneﬁt/cost ratio. An investigation across scenarios is included in Section 4.3.4, whereas Section 4.3.5 compares an existing exact ﬁltering approach with MAPS. Finally, Section 4.3.6 sums up the observed experimental results of this extensive evaluation.
4.3.1 Experimental Setup To conduct each experiment described in the next sections of this evaluation, the following ﬁve steps are executed:
1. Initially the network is set up and the underlying distributed hash table is created.
All publisher peers join the DHT such that a structured P2P network is build-up.
2. Then, all peers in the DHT distribute metadata concerning their local collection by applying the directory protocol as described in Section 3.3.1.
3. Subsequently, subscribers utilize the protocol described in Section 3.3.2 to subscribe to selected publishers. It can be said that a publisher peer is monitored with a continuous query q by a subscriber, when it stores q in its local query database store, and notiﬁes the subscriber for all publications matching q.
4. Once queries are stored, the documents are published to the network and at certain intervals (called rounds) queries are repositioned. A repositioning round occurs every 30 document publications per peer.
5. At the end of each round, message costs and recall for this round are calculated, and subscribers rank publishers using formula score(P, q) described in Section 4.2 and reposition their queries accordingly.
To investigate the eﬀectiveness and eﬃciency of the MAPS approach, peer publishing behavior is modeled through diﬀerent publishing scenarios described in Section 4.3.3 in detail. Additionally, the two following system parameters ρ and α have to be considered in
the experimental setting:
• System parameter ρ determines the percentage of publishers that a subscriber monitors after ranking them using the methods of Section 4.2. In the experimental evaluation, ρ is constant for all subscribers and diﬀerent system properties for a value of ρ up to 25% are investigated. It follows that when ρ = 100% (i.e., all publishers in the network are monitored) then recall is 1.0, and the MAPS approach degenerates to exact information ﬁltering.
• System parameter α controls the inﬂuence of resource selection vs. peer behavior prediction in the experiments. The value of α in the peer selection formula is varied from 0.0 to 1.0. A value of α close to 0.0 emphasizes peer behavior prediction, while a value close to 1.0 stresses resource selection.
To evaluate the overall ﬁltering eﬀectiveness of the approach the evaluation utilizes recall, while eﬃciency is measured using a beneﬁt/cost ratio metric. Here, the deﬁnitions of both
measures are given:
As explained in the protocols of Chapter 3, the number of subscription messages depends on the number of query keys and monitored publishers. In addition, the subscription costs are proportional to the number of query repositionings, since for each repositioning the subscription protocol is re-executed. Finally, for each publication matching an indexed query, a notiﬁcation message is created and sent to the subscriber. In the experiments of Sections 4.3.3 and 4.3.4, the message cost needed to maintain the distributed directory information is not taken into account since the main goal is to focus on the ﬁltering protocol costs.
Since directory messages are typically included in DHT maintenance messages, they can be considered as part of the underlying routing infrastructure. The directory maintenance messages are included in the cost analysis of Section 4.3.5 when MAPS is compared with DHTrie [TIK05b], an distributed exact IF approach that delivers the recall of a centralized system in a P2P setting.
4.3.2 Experimental Data The data collection used in the experiments contains more than 2 million documents from a focused Web crawl with the focused Web crawler system BINGO! [STSW02]. All documents are categorized in one of ten categories: Music, Finance, Arts, Sports, Natural Science, Health, Movies, Travel, Politics, and Nature. The overall number of corpus documents is 2, 052, 712. The smallest category consists of 67, 374 documents, the largest category of 325, 377 documents. In Addition, there are 36, 831 documents with unknown category (documents that are non-categorized by the focused crawler). Table 4.2 summaries the statistical overview of the experimental data collection. The number of distinct keys after stemming and removing stop words amounts to 593, 876.
In all experiments, the network consists of 1, 000 peers containing 300 documents each in their initial local collection. Each peer hosts 15% random documents (i.e., from random categories), 10% not categorized documents, and 75% documents from one single category, resulting in 100 peers specializing in each category. Using the document collection, 30 continuous queries are constructed containing two, three or four query keys as follows.
Each of the query keys selected is a strong representative of a document category (i.e., a frequent key in documents of one category and infrequent in documents of the other categories). Example queries are music instrument, museum modern art, or space model research study. The complete list of continuous queries is shown in Table 4.3.
To measure MAPS’s eﬃciency in terms of recall and message cost under various settings, ﬁve scenarios are considered representing diﬀerent publishing behaviors. The overall number of published documents is constant in all scenarios (300, 000 documents) therefore the maximum number of notiﬁcations concerning the 30 active continuous queries is also constant (146, 319 notiﬁcations), allowing us to compare across diﬀerent scenarios. The following sections shows experimental results with average recall and beneﬁt/cost ratio for diﬀerent publishing scenarios and diﬀerent α and ρ values. A baseline approach (called rand ) that implements a random peer selection method is included for comparison purposes.
220.127.116.11 The Consistent Publishing Scenario
The ﬁrst publishing scenario Consist targets the performance of the approach when peers’ interests remain unchanged over time. Figures 4.1 and 4.2 show that the average recall and the beneﬁt/cost ratio do not depend on the ranking method used, and the MAPS approach presents the same performance for all values of α. This can be explained as follows. Publishers that are consistently publishing documents from one category have built up an expertise in this category and peer selection techniques are able to detect this and monitor the authorities for each topic.
Similarly, publication prediction observes this trend for consistent behavior and chooses to monitor the most specialized peers. Compared to the baseline approach of random selection (rand ), the MAPS approach achieves up to 7 times a higher average recall (e.g., for ρ = 10%). Finally, the best value for the beneﬁt/cost ratio is when ρ = 10% that means when 10% of the publisher peers in the network store a continuous query.
Figure 4.2: Beneﬁt/Cost Ratio in the Consist Scenario.
18.104.22.168 The Half Publishing Scenario The second scenario (called Half ) that is observed in this evaluation series assumes that only half of the peers publish documents at all. Image a network with so-called free-riders that means peers only subscribing but not publishing anything. Figures 4.3 and 4.4 illustrate that the average recall is slightly higher than in the Consist scenario. The explanation is that MAPS peer selection recognizes that some peers do not publish documents at all.
The beneﬁt/cost ratio is also higher than in in previous scenario. It reaches the highest beneﬁt/cost ratio for ρ = 5%. As in the previous scenario, MAPS outperforms the random peer selection approach rand for all parameter settings.
22.214.171.124 The Category Change Scenario
Since users may publish documents from diﬀerent topics, the evaluation uses the scenario CatChg to simulate the changes in a publisher’s content. In the CatChg scenario, a peer initially publishes documents from one category, and later on switches to a diﬀerent category at some point in time. Figures 4.5 and 4.6 illustrate the performance of the approach in this scenario for diﬀerent values for α and ρ. The most important observation from these ﬁgures is the performance of the prediction method in comparison to resource selection. In some cases (e.g., when ρ = 10%) not only publication prediction achieves more that 6 times better average recall than resource selection, but also resource selection is only marginally better than rand (e.g., when monitoring 15% of publishers). So, this scenario aﬃrms the example of Section 4.2.3 why resource selection is not suﬃcient for ﬁltering scenarios.
In general, both average recall and beneﬁt cost/ratio improve as α reaches 0.0 and prediction is stressed. This abrupt changes in the publishers’ content cannot be captured by any resource selection method, which favors topic authorities. On the other hand, publication prediction detects the publishers’ topic change from observed changes in the IR statistics and adapts the scoring function to monitor peers publishing documents relevant to subscribed continuous queries.
0.6 Figure 4.5: Average Recall in the CatChg Scenario.
The experiments shown in Figures 4.7 and 4.8 are introduced to demonstrate the retrieval eﬀectiveness of MAPS in each of the 10 query repositioning rounds for the CatChg scenario, when α is variable and for ρ = 5% and ρ = 10%6. In round one, all combinations of publication prediction and resource selection provide the same performance because no time series values are available yet; thus only the resource selection method is used to rank peers. As new statistics become available in later rounds, the prediction method improves recall. The more emphasis is put on the prediction part of the ranking, the faster the learning process evolves, as is clearly shown by the recall increase. A comparison of the two ﬁgures reveals that the learning process of the scoring function remains unaﬀected by ρ, and that prediction manages to adapt quickly in topic shifts, by needing only two repositioning rounds to reach the maximum recall.
126.96.36.199 The Publishing Breaks Scenario
The Break scenario models the behavior of peers as they log in and out of the network. It is assumed that some publishers are active and publish documents for some rounds, and then log out of the network, publishing no documents any more. This procedure is continued in intervals, modeling, e.g., a user using the publication service at home, and switching it oﬀ every day in the oﬃce. The ranking mechanism should recognize and adapt to these inactivity periods, and distinguish between peers not publishing documents any more and peers making temporary pauses. Figures 4.9 and 4.10 demonstrate that both average recall and beneﬁt/cost ratio improve when resource selection is emphasized (i.e., when α is close to 1.0), since pauses in the publishing mislead the prediction formula to foresee that, in the future, no relevant publications will occur. For this reason, peers with inactivity periods are ranked lower resulting in miss of relevant documents. On the other hand, resource selection accommodates less dynamics, so temporary breaks remain undetected and the topic authorities continue to be monitored since the ranking procedure is not aﬀected.
6 Monitoring 5% or 10% of the network seems to be a realistic size.
Figure 4.8: Recall per Publishing Round for ρ = 10% in the CatChg Scenario.