FREE ELECTRONIC LIBRARY - Abstract, dissertation, book

Pages:     | 1 |   ...   | 14 | 15 || 17 | 18 |   ...   | 25 |

«Approximate Information Filtering in Structured Peer-to-Peer Networks Christian Zimmer Max-Planck Institute for Informatics Saarbrücken ...»

-- [ Page 16 ] --

The results show the recall improvements obtained by the proposed algorithms. For two-key queries, 24% publishers have to be monitored to reach a recall level of 0.50 in the baseline approach, whereas, using the CSS approach, the subscriber only has to monitor 19% of the network. The CSS outperforms both USS approaches, because it offers more accurate and explicit statistics for the key-pairs. Comparing the two USS approaches, the use of KMV synopses slightly improves filtering quality when compared to hash sketches.

Considering the results for the three-key query set, the improvements are much higher.

The CSS approach reaches a recall of 0.79 by subscribing to 15% of all publishers. In contrast, the baseline approach reaches a recall of 0.44 by monitoring the same number of publishers. Using this query set, a considerable difference between the results of the two USS approaches can be seen. The USS-KMV approach that uses KMV synopses almost reaches the result quality of the multi-key approach, while USS-HS suffers from the inaccuracy of combining hash sketches of more than two sets. Recall that KMV synopses use a direct way to intersect multisets, whereas hash sketches reside on an indirect computation that causes loss in accuracy.

–  –  –

Figure 5.2: Average Recall for Three-Key Zeitgeist Queries.

To investigate the filtering quality of both approaches for four-key continuous queries, queries were created using the single keys from the real-world Zeitgeist query set. Out of the full set of all possible four-key queries, the 500 combinations with the highest number of matching documents in the Web collection have been selected. Examples of resulting queries are time war unit world, day nation world game, or war world london british. All selected continuous queries have more than 3, 000 matching documents within the Web data collection.

In the first experimental series with this query set, the general filtering quality of the CSS and USS approaches are investigated. Figure 5.3 shows that CSS outperforms the baseline approach and both USS approaches. When hash sketches are used to represent documents the filtering performance is worse even from the baseline approach due to the inaccuracy of multiset operations. Hash sketches have to use the sieve formula to perform the multiset operation thus suffering from inaccurate estimations. Contrary, the use of KMV synopses improves the filtering quality because it guaranties better distinct-value estimations for the intersections of multiple sets (see Section 2.5.2). Notice that the selected four-key queries do not fully show the improvement margins of the CSS algorithm, since more selective key sets with less matching documents benefit more from multi-key statistics.

Using the set of four-key continuous queries described above, the evaluation measured the filtering quality for combining multi-key statistics as proposed in Section Figure

5.4 illustrates the improvements for several scenarios of available multi-key statistics in comparison to the baseline approach. The filtering quality of CSS gives the upper bound

by using the exact multi-key statistics of the full continuous query:

• In the first scenario CSS-1, all four possible three-key statistics are available to use select the most appropriate publishers. This means for a four-key continuous query abcd that the statistics for abc, abd, acd, and bcd are available. Notice that multi-key statistics for single keys or pairs can also be stored in the directory but not be used since the proposed algorithm in Section to combine multi-key statistics only considers maximal subsets.

–  –  –

0.6 Figure 5.3: Average Recall for Four-Key Zeitgeist Queries.

Figure 5.5: Average Recall Improvements for Two-Key Queries.

• All six two-key statistics are provided by the directory in the second scenario denoted as CSS-2 but the directory does not store statistics for three key subsets.

• The third CSS-3 scenario assumes the existence of two two-key statistics such that the whole key set is covered (e.g., for query abcd, the statistics of ab and cd are available).

• One three-key plus one complementary single-key statistics are provided by the directory in the CSS-4 scenario (e.g., statistics for acd and b).

• The last CSS-5 scenario assumes two overlapping multi-key statistics for one two-key plus one three-key statistics. Here, the multi-key statistics for abc and cd meet the scenario condition.

The results show that all combinations improve the baseline approach, but cannot reach the filtering quality of the CSS approach. Naturally, filtering quality depends on the size of the multi-key statistics: the scenarios (CSS-5 and CSS-1 ) where multi-key statistics for three keys are considered to select publishers show the best filtering performance; in contrast, the three other scenarios only have two-key statistics.

Figures 5.5 and 5.

6 demonstrate another series of experiments that targeted the improvements in filtering performance over the baseline algorithm, when using the proposed correlation measures for multi-key statistics. To conduct this experiment, all possible twoand three-key queries are created using the single-keys from the Zeitgeist query-log. Three different query sets with a size of 100 each are selected; the query sets have the following


• Low-Corr includes those key sets where all keys have a low conditional probability such that there are only a few documents in the Web data set that contain both keys at the same time.

–  –  –

Tables 5.2 and 5.

3 show example two-key and three-key queries for the used query sets described above. Underlined scores denote high conditional probabilities. The threshold depend on the number of keys. The key sets island brother and british sport star show low conditional probabilities whereas in the key sets football sport and nation unit world, all keys have a high conditional probability. The other two examples belong to the query set High-Corr where keys show a diverse correlation measure (at least one key has a high conditional probability).

–  –  –

As already explained in Section 5.4, subscribing to key sets where all keys have low conditional probabilities yields to the highest filtering result improvements when monitoring the same percentage of publishers. This is caused by the fact that a low conditional probability means that there are a lot of matching documents this key without matching the remaining keys. Thus, when monitoring only 10% of the publishers, 2-Key-Low-Corr has an recall improvement from 0.44 to 0.65 whereas 2-Key-All-High-Corr only improves filtering quality from 0.25 to 0.28 (see Figure 5.5). Similar results hold for three-key queries in Figure 5.6 where 3-Key-Low-Corr has an improvement from 0.43 to 0.68 compared to improvement for 3-Key-All-High-Corr from 0.30 to 0.33. This leads to the conclusion that the CSS algorithm has no significant effect for key sets where all keys are highly correlated, while it significantly improves filtering for key sets with low correlations.

Analyzing the results for key sets where one key is highly correlated to all others (HighCorr ), an smaller improvement compared to unrelated keys can be observed. Here, the use of the CSS approach is possible, but an alternative strategy is proposed: a high conditional probability for a key k means that there is almost no additional information included in multi-key statistics for the full key set in comparison to the key set without k. As an example, the multi-key statistics for key set time face friend yield to similar filtering results than the multi-key statistics for face friend because the conditional probability P (time|f ace, f riend) is high (83% of all documents containing face and friend also contain time). The same observation holds for the key set time gift where time denotes the highly correlated key. There, 60% of all documents containing gift also contain time. Results Using Blog Data

In this section, experimental results from the blog data experiment are presented. Overall, the observations for blog data lead to similar conclusions with the Web data corpus, with CSS outperforming USS and baseline.

The results for all two-, and three-key queries from the Zeitgeist query-log are presented in Figure 5.7. The complete Zeitgeist queries are listed in the appendix B. Again, the KMV synopses perform better than hash sketches. The USS-KMV approach is almost as good as the the upper bound of CSS whereas USS-HS at least outperforms the baseline approach. Notice that, contrary to the Web data collection, a smaller number of blogs have to be monitored to acquire a sufficiently high recall. This happens because all blogs are highly specialized and thus only a small number of them contribute new posts matching the requested continuous queries. Compared to this, the Web data collection provides less specified publisher peers.

Subsequently, Figures 5.8, 5.9, and 5.10 illustrate the average recall improvements for two-key, three-key queries, or four-key queries respectively considering the different query sets introduced in the previous section (Low-Corr, High-Corr, and All-High-Corr ). Again, the experiments use selections of all possible key set combinations created by single keys with maximal numbers of matching blog posts. The sizes of all query sets range from 10 to 20, and now, key sets with four keys are regarded, too.

Similar to the experiments with Web data, the CSS approach for multi-key queries with low conditional probabilities shows the highest recall improvement. In Figure 5.8, there is almost no improvement for two-key queries where both keys have a high conditional probability (2-All-High-Corr ) but the improvement for queries with low conditional probability (Low-Corr ) is significant: from 0.53 to 0.67 when monitoring 10 blogs. Thus, blog data experiments confirm the previous observations. In Figures 5.9 and 5.10, the differences between low correlated (Low-Corr ) and high correlated key sets (All-High-Corr ) are reduced but still existing.

–  –  –

0.6 Figure 5.9: Average Recall Improvements for Three-Key Queries.

Table 5.4: Examples of Relatedness among Four-Key Queries.

The following example makes this point clear using the conditional probabilities in Table

5.4. Using a multi-key continuous query london time train bbc, it is sufficient to apply the CSS approach for the subset time train bbc because the key london has a conditional probability of 0.97. In other words, 97% of all posts containing the key london also contain the remaining keys. So, there is no significant additional information included in the multikey statistics of the complete key set.

- 102 Chapter 5 Correlated Key Sets

5.6 Discussion This chapter introduced two approaches that utilize correlations among keys to improve filtering quality. The USS algorithm uses existing single-key synopses stored in the directory to estimate the publishing behavior of information sources for key sets, while the CSS algorithm enhances the directory to explicitly maintain statistical metadata about selectively chosen key sets. This is the first work to develop algorithms for exploiting keyword correlations in such an dynamic IF setting.

Whereas the approach of [MBN+ 06] is limited to two-key queries, the USS algorithm can be applied to multi-key continuous queries for an arbitrary number of keys. Therefore the usage of very recent state-of-the-art techniques for compact representation of multisets (KMV synopses) is included in the algorithm. These synopses overcome the restrictions of hash sketches. The CSS algorithm uses a new strategy to approximate multi-key statistics by combining the statistics of arbitrary subsets. This strategy is useful when the multi-key statistics for the full continuous query are not included in the directory.

The experimental evaluation illustrated the filtering performance improvements of both algorithms in comparison to the baseline approach presented in this thesis. All experimental series used two different real-world collections for Web and blog data, and applied real-world Google Zeitgeist queries. The evaluation also investigated filtering performance gains depending on the introduced correlation measure (conditional probability) representing a way to compute the relatedness among keys. The next chapter presents the current prototype implementation of the MAPS approximate information filtering approach.

–  –  –

Chapter 6 Prototype Implementation This chapter presents the current prototype implementation [ZHTW08] of the MAPS approximate information filtering approach introduced in the previous chapters of this thesis.

Therefore, the Minerva search engine developed as explained in [BMPC07] is used and extended with additional components to realize both functionalities (one-time searching and publish/subscribe) in one unifying system.

Section 6.1 introduces some aspects of current Web search.

In Section 6.2, the main aspects of the Minerva search engine are explained including an implementation overview. In addition, the most important principles that drive the P2P search paradigm are illustrated.

Section 6.3 presents the additional components for approximate pub/sub functionality that have been implemented.

This section also shows the usage of both functionalities by following some example (continuous and one-time) queries. Screenshots from the system’s graphical user interface (GUI) are used to demonstrate and explain the MAPS system. A short overview of some other prototype systems for IR and IF in P2P networks presented in Section 6.4, and the discussion in Section 6.5 conclude this chapter.

6.1 Introduction Today, full-fledged Web search is more or less under the control of centralized search engines.

Pages:     | 1 |   ...   | 14 | 15 || 17 | 18 |   ...   | 25 |

Similar works:

«Thomas Philipott Villare Cantianum London 393a sig 3E The ETYMOLOGY, DERIVATION, and DEFINITION, of all the Hundreds and Parishes mentioned in the Map of KENT, as they are derived from some Saxon Radix. Blackheath is from a Saxon Radix. Bleach is turned into Bleke, which suites with the high open and cold situation of the Heath, which gives name to the Hundred. v Bromley in old Engish from Brome and Leah, which is Pasture now called Ley, and is the same with Bromefield. Lesnes, at present...»

«DATENBLÄTTER AUSLAND Eine unvollständige Zusammenfassung bisher verkosteter Auslandsbiere. Die Liste wird ständig erweitert. Frank Markert administrator@bundesbierschutz.eu Bierdatenblatt Bundesbierschutz Name Alfa Super Strong Typ Herkunftsland Niederlande Homepage http://www.alfabier.nl/ 9,2% Alkoholgehalt. Das Alfa Bier Super Strong selbst ist nicht mein Favorit, man muss es mögen oder nicht. Ziemlich kräftig im Geschmack, der DosenbierGeschmack kommt klar durch. Bierdatenblatt...»

«Auftraggeber: Freie und Hansestadt Hamburg | Bezirksamt Hamburg-Nord 5. Fachgespräch Wohnungsbau im Bezirksamt Hamburg Nord Das Pergolenviertel: Jetzt geht’s los! Dienstag, 28.10.2014 Bezirksamt Hamburg Nord, Großer Sitzungssaal Robert-Koch-Str. 17 20249 Hamburg Auftraggeber: Freie und Hansestadt Hamburg | Bezirksamt Hamburg-Nord Protokoll Fachgespräch Wohnungsbau im Bezirksamt Hamburg Nord Das Pergolenviertel: Jetzt geht’s los! Datum: 28.10.2014 Ort: Bezirksamt Hamburg Nord, Großer...»

«.. t * I r si t ; WHEELS,'LIFE,I.. t i AND OTHER fh 3 Tr MATHEMATICAL AMUSEMENTS 2' i I \ -. elk',.*,'. $ : Q ; a 54qp46 -2 9+...»

«Senso-motorische Interaktionen des stomatogastrischen Nervensystems beim Taschenkrebs (Cancer pagurus, LINNAEUS, 1758) Dissertation zur Erlangung des Doktorgrades (Dr. rer. nat.) der Mathematisch-Naturwissenschaftlichen Fakultät der Rheinischen-Friedrich-Wilhelms-Universität Bonn vorgelegt von Dirk Weigeldt aus Lippstadt Bonn 2002 Angefertigt mit Genehmigung der Mathematisch-Naturwissenschaftlichen Fakultät der Rheinischen Friedrich-Wilhelms-Universität Bonn 1. Referent: Prof. Dr....»

«CURRICULUM VITAE September 2014 Hannah Brenkert-Smith, Ph.D. Research Associate University of Colorado Environment & Society Program UCB 483 Institute of Behavioral Science Boulder, CO 80309-0483 hannahb@colorado.edu 303.492.5743 EDUCATION Ph.D., Sociology University of Colorado at Boulder August 2008 Ph.D. Dissertation: Placing Wildfire in Context: Environmental and Social Dimensions of Mitigation Decision-Making Dissertation Committee: Lori Hunter (Chair) Kathleen Tierney, Daniel Williams,...»

«Number the Stars Lois Lowry Table of Contents Title Page Table of Contents. Copyright Dedication Contents Introduction 1. Why Are You Running?2. Who Is the Man Who Rides Past?3. Where Is Mrs. Hirsch?4. It Will Be a Long Night 5. Who Is the Dark-Haired One?6. Is the Weather Good for Fishing?7. The House by the Sea 8. There Has Been a Death 9. Why Are You Lying?10. Let Us Open the Casket 11. Will We See You Again Soon, Peter? 12. Where Was Mama? 13. Run! As Fast As You Can! 14. On the Dark Path...»

«DRAFT RED HERRING PROSPECTUS Please read Section 60B of the Companies Act, 1956 Dated[*] SEA TV NETWORK LIMITED (Incorporated as Sea TV Network Limited on May 21, 2004 with the Registrar of Companies, Uttar Pradesh and Uttaranchal, Kanpur) Registered Office: 148, Manas Nagar, Shahganj, Agra282010 Tel.:0562-4036666, 2512122, 2512123, 2512223 Fax: 0562-2511070 Email: admin@seatvnetwork.com Website: www.seatvnetwork.com Contact Person: Akshay Kumar Jain, Director (Finance) and Compliance Officer...»

«See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/273704613 RIVER FRAGMENTATION AND CONNECTIVITY PROBLEMS IN THE GANGE RIVER OF THE UPPER HIMALAYA (INDIA): THE EFFECT ON THE FISH COMMUNITIES Article · January 2011 READS 1 author: Harcharan Singh Rumana Independent Researcher 6 PUBLICATIONS 13 CITATIONS SEE PROFILE All in-text references underlined in blue are linked to publications on ResearchGate,...»

«SANDIA REPORT SAND2013-5472 Unlimited Release Printed July 2013 Microgrid Cyber Security Reference Architecture Version 1.0 Cynthia K. Veitch, Jordan M. Henry, Bryan T. Richardson, Derek H. Hart Prepared by Sandia National Laboratories Albuquerque, New Mexico 87185 and Livermore, California 94550 Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s...»

«2015/2016 Peabody Undergraduate Handbook Class of 2019 FOREWORD Welcome to Peabody! This is your guidebook designed to lead you to successful completion of your major at Peabody College. Over the next four years, you will find it to be a ready source of information on your major requirements, policies and procedures, and offices to contact with your questions as you make your way toward the Bachelor of Science degree in May 2019. You will be expected to keep this handbook for four years and to...»

«Preprints of the Max Planck Institute for Research on Collective Goods Bonn 2006/12 Wettbewerb als sozial erwünschtes Dilemma Christoph Engel MAX PLANCK SOCIETY Preprints of the Max Planck Institute for Research on Collective Goods Bonn 2006/12 Wettbewerb als sozial erwünschtes Dilemma Christoph Engel May 2006 Max Planck Institute for Research on Collective Goods, Kurt-Schumacher-Str. 10, D-53113 Bonn http://www.coll.mpg.de Wettbewerb als sozial erwünschtes Dilemma* Christoph Engel * Für...»

<<  HOME   |    CONTACTS
2016 www.abstract.xlibx.info - Free e-library - Abstract, dissertation, book

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.