FREE ELECTRONIC LIBRARY - Abstract, dissertation, book

Pages:     | 1 |   ...   | 12 | 13 || 15 | 16 |   ...   | 25 |

«Approximate Information Filtering in Structured Peer-to-Peer Networks Christian Zimmer Max-Planck Institute for Informatics Saarbrücken ...»

-- [ Page 14 ] --

The prediction error of the MAPS Selective Method is about as good as the minimum absolute prediction error computed min using a global parameter setting. The error varies for both strategies from 13 to 20 on average. Notice, that oracle and max are for the known reasons not included in this graph. The ExpInc Publishing Scenario

Figures 4.23 and 4.

24 show the recall and absolute prediction error results of the second scenario where all 100 peers follow a ExpInc publishing behavior. The total number of published documents amounts to 175, 600. In this scenario, recall is almost independent of the parameter choice (see Figure 4.23). All approaches (min, max, oracle, and selective) perform almost the same. Only the random strategy is not competitive. Even the prediction quality of selective and min are similar and vary from about 22 to 35 per key, round, and peer. The absolute prediction error for max is still very high and not included in Figure 4.24, but there is no influence to the recall. This is caused by the fact that all publisher peers follow the same behavior ExpInc. Compared to other behaviors, this scenario shows the highest absolute prediction errors since the data series shows a strong growing trend.

–  –  –

0.6 0.4 0.2

–  –  –

0.6 0.4 0.2

–  –  – The QuadDec Publishing Scenario The last investigated scenario considers the case that all peers publish documents using a QuadDec behavior. The overall number of published documents is about 231, 000. Here, the MAPS Selective Method performs much better than min in terms of recall but does not reach the level of max (see Figure 4.25). This result corresponds to the observations of Section 4.4.2. The same result holds for the absolute prediction errors where selective causes higher errors than the best global parameter choice. As also shown in the graph, the selective approach still performs better than the worst parameter choice for η and γ.

It can also be constituted that the absolute prediction error is almost independent from the number of monitored peers in the network since there is only a small error variation recognizable in Figure 4.26.

4.5.5 Summing Up The experimental section presented the effectiveness of the MAPS Selective Method in terms of recall improvement against several opponents. In the most considered scenarios, the proposed novel method reached almost the same recall level as the best global parameter combination providing the smallest prediction error. When compared to the opponents examined, the main advantage of MSM is the fact that no additional communication is needed to adopt the two prediction parameters.

4.6 Discussion This chapter presented a novel peer selection approach, called MAPS, appropriate for supporting approximate publish/subscribe functionality in a structured P2P environment. The peer selection strategy is embedded in the service and protocol architecture presented in the previous chapter. A combination of resource selection and behavior prediction techniques is applied to the collected metadata such that the most promising publisher peers can be selected. The extensive experimental evaluation approved the effectiveness and efficiency of MAPS in several settings using real Web data.

This chapter also conducted an approach to improve the proposed prediction method of time series analysis with double exponential smoothing. The MAPS Selective Method clearly improved prediction by individually recognizing publisher behavior without any additional communication overhead. In the next chapter, two strategies to consider correlated keys in MAPS will be presented.

–  –  –

Chapter 5 Correlated Key Sets Based on the MAPS architecture and peer selection strategies presented earlier, this chapter provides two novel algorithms for exploiting correlations among keywords in continuous queries for an approximate information filtering setting [ZTW08]. This setting assumes that metadata is maintained in a distributed directory, usually on a per-keyword basis, thus disregarding possible relatedness among keywords. The work presented in this chapter extends traditional query routing techniques and strategies from the domain of distributed information retrieval.

Section 5.1 introduces the issue of correlated keys (or keywords), mentions the main contributions of correlation awareness, and investigates some previous work in this research area.

The baseline approach, a slightly modified protocol for publish/subscribe, and the correlation measure are presented in Sections 5.2 and 5.3. The two algorithms USS and CSS to exploit correlations among key sets are explained in Section 5.4 and evaluated in Section 5.5. This chapter will revisit the DV estimation synopses (hash sketches and KMV synopses) listed in Section 2.5. The chapter is concluded with a closing discussion in Section 5.6.

5.1 Introduction In an approximate information filtering as the one presented in Chapter 3, only a few carefully selected, specialized, and promising publishers store the user query and are monitored for new publications. Thus, the user query is replicated to these sources and only published documents from these sources are forwarded to the subscriber. The system is responsible for managing the user query, discovering new potential sources and moving queries to better or more promising sources. Since in an IF scenario the data is originally highly distributed residing on millions of sites (e.g., with people contributing to blogs), approximate IF seems an ideal candidate for such a setting. This is also supported by the fact that exact IF functionality has proven expensive for such distributed environments [TIK05a, TX03, AT05].

Thus, approximate IF achieves much better scalability of such systems by trading faster response times and lower message traffic for a moderate loss in recall.

Publisher selection in approximate IF regarding a given continuous query with multiple keywords is driven by statistical summaries (metadata) that are stored by the system.

These summaries are provided to the directory by the publishers and can be managed in different ways ranging from centralized solutions like servers or server farms, to super-peer or pure peer-to-peer solutions in the form of a distributed P2P directory built on top of a DHT [Abe01, SMK+ 01] or other kinds of overlay networks. For scalability, the summaries have publisher granularity, not document granularity, thus capturing the best publisher for certain keywords (also referred to as keys) but not for specific documents.

- 85 Chapter 5 Correlated Key Sets

This, together with per-key organization of the directory that disregards keyword correlations (also referred to as key sets) are two of the basic reasons that may possibly lead to insufficient recall. On the other hand, considering statistics for all possible key sets is clearly not possible due to the explosion in the feature space.

As an example scenario, consider a user Bob who wants to follow the discussion about the presidential elections 1 in the US, and wants to receive notifications from a number of different sites like news agencies, portals, and user blogs. Clearly, Bob would be interested in monitoring a variety of publishers but is not interested in receiving all the articles published by all sources, as it would be the case for exact IF. Thus, in an approximate IF scenario, Bob would submit the continuous query US presidential elections to the filtering system.

The basic approach would decompose the continuous query into the three individual keys and use the statistics from the directory to compute a combined score (e.g., intersection or some other kind of aggregation of individual key scores) for each key and publisher.

This score would represent the probability of each source to publish documents about US presidential elections in the near future. This approach may lead to poor filtering quality as the top-ranked publishers for the complete query may not be among the top selected publishers. In the worst case, a selected publisher may deliver many documents for each single keyword, but no single document matching all keywords, since this information is not present in the directory.

In this chapter, two approaches that use correlations among keys to improve filtering quality in the scenario described above are introduced and evaluated in detail. The first algorithm (coined USS for Unique Synopses Storage) uses existing single-key synopses stored in the directory to estimate the publishing behavior of information sources for key sets, while the second (coined CSS for Combined Synopses Storage) enhances the directory to explicitly maintain statistical metadata about selectively chosen key sets. Both algorithms build upon previous work in the area of information retrieval over distributed P2P networks [MBN+ 06] and extend it to the approximate IF setting in various novel ways.

–  –  –

Table 5.1: Summary of Notation 5.

1.2 Previous Research on Correlated Key Sets This section present some previous work in the context of information retrieval and filtering that includes the usage of correlations among keys within these settings. All methods mentioned in this thesis organize the statistics about publisher peers, which drive the necessary query routing decisions, on a per-key basis disregarding key correlations. The only recent works that consider key correlations in the context of P2P search are [MBN+ 06] and [SLZ+ 07]. [MBN+ 06] considers frequent key combinations in query logs and documents for P2P IR, where [PRL+ 07] proposes a framework for Highly Discriminative Keys (HDK), which includes correlated key combinations; however, it does not give any algorithms for managing the corresponding statistics in a distributed setting and for correlation-aware query routing. Previous work on information filtering does not consider correlations among keys so far such that the work in this thesis presents the first approach.

5.2 The Baseline Approach This section presents the baseline approach as a starting point to introduce the two algorithms that exploit correlations in key sets. The baseline approach uses the protocols presented in Section 3.3 with an approximate IF architecture consisting of three different system components: the directory, the publishers and the subscribers. The notation that will be used is summarized in Table 5.1.

In this chapter, the additions and modifications over the baseline protocols presented in Chapter 3 are described including the distribution of special-purpose synopses by the publishers. The synopses represent the inverted lists of documents a publisher hosts concerning a key a. This can be done using hash sketches or KMV synopses as presented in Section

2.5. A publisher pi inserts an identifier for each document contained in its collection into a local hash sketch or KMV synopsis for key a to obtain SY Ni (a). The distributed directory stores all disseminated statistics. Since there is a well-defined directory peer responsible for each key (through the DHT hash function), the synopses representing the index lists of all publishers for a particular key a are all sent to the same directory peer d(a). Thus, d(a) can compute a moving-window estimate for the global df value of a – df (a) – by performing an union operation for all synopses SY Ni (a) sent by every publisher pi for key a.

- 87 Chapter 5 Correlated Key Sets

The baseline approach intersects the statistics for the single keys of a continuous query cq = {k1, k2,..., kn } consisting of multiple keys ki, and sends the continuous query only to (a subset of) the publishers that published (or will publish) statistics for all queried keys. However, this approach may lead to reduced recall, since a publisher appearing in the publisher lists for both keys ki and kj will not necessarily publish documents containing both ki and kj. Thus, to select an appropriate publisher for cq, the statistics of the key set have to be considered to determine more accurately its future publishing behavior.

Obviously, the larger the subset of cq statistics are maintained for, the more accurate the prediction about the behavior of the publisher will be.

–  –  –

Notice that, in an information filtering setting, the continuous queries are actually longstanding queries that will be assessed several times in the future. Thus, all continuous queries can be effectively considered as candidate key sets for harvesting multi-key statistics.

In contrast, a requested query in the retrieval scenario can be asked only once. Should a more selective system behavior be intended (e.g., for performance reasons), the submitted continuous queries can be further analyzed using frequent itemset mining techniques [AIS93, FSGM+ 98] to discover the most common ones.

To identify either uncorrelated key pairs or negatively correlated key pairs, additional statistics are needed to find the best publishers to index a multi-key continuous query. To clarify this need, consider a key pair ab that has no correlation. For these keys, there are only few publishers in the network that have the potential to publish in the future relevant documents that would contain both a and b, and these publishers cannot be located by selecting and combining the statistics for key a and key b alone. The conclusion from ˆ ˆ this is that a pair ab has to be considered as interesting if both P (A|B) and P (B|A) are below some threshold β within the documents. Extending the above to a key set S with an arbitrary number of keys using the conditional probability in Equation 5.2, for each key ˆ k0 the value of P (K0 |K1... Kn−1 ) has to be estimated. The key set S is of interest if all estimated probabilities are below some threshold β.

Pages:     | 1 |   ...   | 12 | 13 || 15 | 16 |   ...   | 25 |

Similar works:

«AIPPI Association Internationale pour la Protection de la Propriété Industrielle Réunion du Comité exécutif Annuaire 1997/I Vienne 1997 (18-22 avril 1997) Rapports de synthèse Commission spéciale Q 94: GATT/WTO Résolution Premier Forum de l'AIPPI Rapports des représentants de l'AIPPI Rapports tardifs des Groupes nationaux © AIPPI Zürich 1997 ISBN Nr. 3.9050.2878.6 Edité au nom de l'AIPPI par J. David MEISSER, Klosters (Suisse) Assistants de I' Editeur: Jean-François LEGER, Genève,...»

«PCB Fabrication Services and Preparation Robert Saunders PCB Fabrication Services and Preparation RSGC ACES | Robert Saunders PCB Fabrication Service Providers Bittele Electronics Inc. Preparing Eagle Files For Production What is a Gerber RS-274X File? Short Wikipedia Spiel Creating Silkscreens Including an Image to the Silkscreen Step 1 (Import) Step 2 (Select Image) Step 3 (No Scan & Select Colors) Step 4 (Configure and Execute) Creating a Dimension (.gko file) Creating Mounting Holes...»

«Airline Ticket Prices and Flight Characteristics: Lowest Price Roundtrip Flights from Los Angeles to New York Abstract This paper examines variation in airline ticket prices and relates this to observable differences in flight time, how far in advance the ticket was purchased, the time in transit for roundtrip flight, airport used, and airline carrier. It relates these differences to the underlying theory of hedonic pricing and price discrimination and the special nature of air travel in a...»

«Chapter 1 in Airport Cities: The Evolution (London: Insight Media, 2008) The Evolution of Airport Cities and the Aerotropolis John D. Kasarda Airports like cities are never static. They are constantly evolving in form and function. Historically, airports have been understood as places where aircraft operate, including runways, control towers, terminals, hangers and other facilities which directly serve aircraft, passengers and cargo. This traditional understanding is giving way to much broader,...»

«Universität Hohenheim Institut für Physiologie und Biotechnologie der Pflanzen Prof. Dr. Andreas Schaller Untersuchungen über die Prozessierung und Sekretion der Subtilase LeSBT3 Diplomarbeit vorgelegt von Benjamin Kuhn Studiengang Agrarbiologie, Fakultät Agrarwissenschaften Hohenheim, August 2007 Betreut von Dr. Anna Cedzich Zusammenfassung Für subtilisinähnliche Proteasen wird eine Steuerungsfunktion pflanzlicher Entwicklungsprozesse vermutet, ihre physiologische Wirkungsweise und...»

«AMNESTY INTERNATIONAL’S LGBT PRIDE TOOLKIT Lesbian, Gay, Bisexual & Transgender Rights are Human Rights CONTENTS 3 Welcome and Introduction What is PRIDE? 4-6 Background Information: LGBT Rights are Human Rights!Amnesty International’s Pride Action Guide: Key Actions, Issue Briefs & Talking Points 8-9 ACTION—UNITED STATES Support Respect for Marriage Act; Repeal Defense of Marriage Act 10-11 ACTION—INTERNATIONAL Stop Discrimination and Criminalization in BOTSWANA 12-13...»

«Leben aus der Auferstehung – 1. Korinther 15 Themenreihe B zum Textplan der Apis Thorsten Müller 1 10.04.2011 – Die Auferstehung Christi – Grundlage der Evangeliumsverkündigung und der Sündenvergebung (1-19) 1.1 Texterklärung Anlass für das Kapitel 15 des 1. Korintherbriefs ist die Ansicht in Korinth, dass es keine Auferstehung der Toten gibt. Die Korinther, zumindest einige, denen die Gemeinde nicht energisch widersprach, waren der Meinung, als Christen auf dieser Welt hätten sie...»

«G C/43/4 Rev. ORIGINAL: englisch DATUM: 22. Oktober 2009 INTERNATIONALER VERBAND ZUM SCHUTZ VON PFLANZENZÜCHTUNGEN GENF DER RAT Dreiundvierzigste ordentliche Tagung Genf, 22. Oktober 2009 PROGRAMM UND HAUSHALTSPLAN FÜR DIE RECHNUNGSPERIODE 2010-2011 vom Rat angenommen 1. Auf seiner dreiundvierzigsten ordentlichen Tagung am 22. Oktober 2009 in Genf prüfte der Rat das Dokument C/43/4 „Entwurf eines Programms und Haushaltsplans für die Rechnungsperiode 2010-2011“ und billigte: a) die in...»

«Владислав Скобелев, «Доктор Живаго» Б. Пастернака и «Хождение по мукам» А. Н. Толстого К вопросу о судьбах русского романа в двадцатом столетии (Vladislav Skobelev, Doktor Živago B. Pasternaka i Choždenie po mukam A. N. Tolstogo. K voprosu o sud'bach russkogo romana v dvadcatom stoletii) aus: Analysieren als Deuten Wolf Schmid zum 60. Geburtstag Herausgegeben von Lazar...»

«This is a summary of the meeting of the Accounting Standards Accounting Forum (ASAF) held in London on 25-26 September 2013. It was prepared by staff of the IASB, and is a high-level summary of the discussions that took place. A full recording of the meeting is available on the IASB website. The meeting was chaired by Hans Hoogervorst, Chairman of the IASB, with Ian Mackintosh, Vice-Chairman of the IASB. ASAF Members attending Kim Bromfield for the South African Financial Reporting Standards...»

«Ergebnisdarstellung – SaarLernNetz –Lernende Region Saarland Planungsphase http://www.saarlernnetz.de/ Autoren: Dr. Josef Burgard, DFKI, Gesamtprojektleiter SaarLernNetz Wolfgang Vogt, BFW, Teilprojekt 1: Neugier auf Wissen Michael Röser, CJDe-Learning Center, Teilprojekt 2: Go2Live Karl Schiene, Steinbeis, Teilprojekt 3:TILL Erwin Irmisch, Bildungszentrum Kirkel, Teilprojekt 4 Christof Ecker, Arbeitskammer, Teilprojekt 5: ARGUS Patric Kany, CEB, Teilprojekt 6: Virtuelle Akademie Nico...»

«CBI Product Factsheet: Promising target groups in Europe CBI | Market Intelligence Product Factsheet Cloves in Germany | 1 Introduction The increase of single and single-parent households in Europe is fuelling demand for both solo travel and single-parent travel. Solo travelling is gaining increasing social acceptance, and it is no longer limited to global backpackers in the 18-30 age group. Gay people are also a valuable, growing niche market that can be regarded as an interesting segment....»

<<  HOME   |    CONTACTS
2016 www.abstract.xlibx.info - Free e-library - Abstract, dissertation, book

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.