WWW.ABSTRACT.XLIBX.INFO
FREE ELECTRONIC LIBRARY - Abstract, dissertation, book
 
<< HOME
CONTACTS



Pages:     | 1 |   ...   | 13 | 14 || 16 | 17 |   ...   | 25 |

«Approximate Information Filtering in Structured Peer-to-Peer Networks Christian Zimmer Max-Planck Institute for Informatics Saarbrücken ...»

-- [ Page 15 ] --

5.4 Exploiting Correlations This section presents two new algorithms to exploit correlations among key sets: (i) the USS algorithm (Unique Synopses Storage) presented in 5.4.1 uses single-key synopses (hash sketches or KMV synopsis) already stored in the distributed directory to estimate the filtering behaviors for key sets, and (ii) the CSS algorithm (Combined Synopses Storage) explained in 5.4.2 enhances the single key directory to explicitly maintain statistical metadata about selectively chosen sets of multiple keys depending on the correlation measures (conditional probability) introduced in previous Section 5.3. Both algorithm build-upon and extend the baseline algorithm introduced in Section 5.2.

5.4.1 The USS Algorithm In this section, the USS algorithm (Unique Synopses Storage) is presented. USS uses single-key synopses already stored in the distributed directory to estimate the publishing behavior for complete key sets. As the distributed statistics of a publisher pi regarding a key a contain the synopsis SY Ni (a) (to represent the documents of pi containing key a), it is possible to compute the number of documents containing all keys of a given key set S.

For this, the algorithm uses the intersection operation for multisets as explained in detail in Section 2.5 as an estimation for the frequency of the whole key set in pi ’s local collection (dfi (S)). Together with prediction techniques presented in this thesis, the most promising publishers can be selected by applying appropriate scoring functions based on publisher behavior prediction.

To explain the algorithm, consider a subscriber s subscribing with a multi-key continuous query cq = {k1, k2,..., kn } containing n distinct keys. According to the USS algorithm the

following steps are executed:

1. For each key kj, 1 ≤ k ≤ n in cq, subscriber s contacts the directory peer d(kj ) responsible for kj and retrieves the statistical summaries published individually for all keys, including the synopses SY Ni (kj ) produced by a publisher pi.

–  –  –

5.4.2 The CSS Algorithm To overcome the problems of the USS algorithm, the CSS algorithm (Combined Synopses Storage) is introduced. The algorithm identifies valuable key combinations, enhances the directory to explicitly maintain these multi-key statistics, and exploits these statistics to improve publisher selection. In addition, the CSS algorithm combines multi-key statistics for subsets of the requested continuous query.

As already discussed in Section 5.3, uncorrelated or negatively correlated keys are of interest to collect multi-key statistics. The CSS algorithm aims at addressing the following questions: (i) how to decide which key sets are of interest to disseminate additional multikey statistics; (ii) how to disseminate multi-key statistics of publishers to the directory;

(iii) how to exploit the additional statistics to improve publisher selection; (iv) how to deal with multi-key statistics for subsets of the complete continuous query.

–  –  –

5.4.2.1 Assessing Key Sets Given a key set S = {k1, k2,..., kn } with n distinct keys, a deterministic function is employed (e.g., by selecting the lexicographically highest or lowest key) to select one directory peer that has to assess the relatedness among the keys of S and be responsible for this key set. A deterministic approach with pre-hashing the key-values ensures load balancing among the directory peers. The CSS approach uses the following three steps to assess a candidate key set S where d(S) is the directory peer that is responsible for the assessment

decision:

1. Initially, d(S) contacts all directory peers responsible for the other keys kj ∈ S to retrieve the synopsis SY N (kj ) representing all documents in the network containing the key. The answering directory peers d(kj ) for a key kj compute locally the synopsis SY N (kj ) by using the union operation over all individual publisher synopses SY Ni (kj ) (see Section 2.5).

2. Subsequently d(S) computes the intersections among the synopses SY N (kj ), to retrieve the estimated cardinality of documents containing all keys in the candidate set (denoted as df (S)).

3. Finally, using Equation 5.2, directory peer d(S) then computes the conditional probabilities for each key kj ∈ S.

Small values for the conditional probabilities show that the occurrence of all keys in the documents are largely independent, meaning that publisher selection decisions can strongly benefit from the existence of available multi-key statistics. Thus, CSS initiates the creation of these additional summaries if the conditional probabilities for all keys kj ∈ S are below some threshold β.

To further optimize message traffic between directory peers, a threshold α is introduced, such that if the conditional probability of a key kj is above α, publisher selection strategy does not have to consider the single-key statistics of that key. The idea behind this is that kj is contained in almost all documents containing the rest of the keys S\{kj }, making this key set a candidate for multi-key statistics. The experimental evaluation will also investigate this strategy in Section 5.5 by computing the conditional probabilities of several continuous query.





5.4.2.2 Disseminating Multi-Key Statistics

As soon as a key set has been assessed as a useful candidate that is worth collecting multikey statistics for (as explained before), publishers have to learn this fact immediately.

Especially in an information filtering scenario, where statistics have to get updated, the continuous process of directory refreshment can be used to disseminate the knowledge about valuable multi-key sets.

Assuming that a directory peer d(S) that is responsible for a key k ∈ S and key set S identified key set S as useful. According to the algorithm, d(S) is also responsible to maintain the multi-key statistics for the complete key set S. Whenever a publisher p updates its statistics for key k, d(S) can inform the publisher about the new key set. In this way, p will compute its local statistics for the complete set S and disseminate them to the directory peer d(S). This procedure does not incur any additional messages compared to the baseline approach described in Chapter 3, since additional statistics can be piggybacked on messages that need to be sent anyway.

–  –  –

Next, it is assumed that a subscriber s submitting a continuous query cq asks the directory peers responsible for the keys in cq to provide the related statistics. Since cq is included in the request message, a directory peer d(S) that stores statistics about a key set S, where S ⊆ cq, will return the statistics for S together with the single-key statistics it is responsible for. The following section explains how to compute the statistics for a multikey query by combining statistics from subsets available in the directory. Subsequently, the subscriber collects the multi-key statistics and uses them to perform publisher ranking.

Note that the subscriber applies publisher selection based on multi-key statistics, and is thus able to predict the publishing behavior of sources more accurately. Contrary, the baseline algorithm performs a separate prediction for each single key. If no summaries for the key set (or subsets, see Section 5.4.2.4) have been published, the baseline procedure using single-key summaries is still applicable without contacting additional directory peers.

5.4.2.4 Combining Multi-Key Statistics

A usual case for continuous queries with more than two keys is that there are no statistics for the full key set S of the query, because, e.g., the assessment step did not consider the full key set as sufficiently useful. However there might be available statistics for several multi-key subsets of S. In this case, by design, all of these subsets can be found by the subscriber by simply asking the directory peers responsible for the single keys as explained in the previous section.

To give an example, assume that the directory maintains the multi-key statistics of S1 = {a, b, c} and S2 = {b, c, d}. Then, for a given five-key query S = {a, b, c, d, e}, the algorithm has several options at hand. The CSS approach proposes to select all maximal subsets among the available multi-key summaries. This is efficient in terms of network costs because the entire continuous query will be sent to all single-key directory peers anyway. But this consideration opens up a space of optimization strategies; Combining such incomparable but mutually related multi-key statistics is reminiscent of the recent work on multidimensional histograms with incomplete information [MMK+ 05], but the setting here has the additional complexity of very-high-dimensional key space.

To combine the different multi-key statistics, the CSS algorithm weights them depending on the size of the multi-key subset. The scoring function to calculate a publisher score is

given by Equation 5.3:

|Si | · predScoresi (p) scoreS (p) = (5.3) Si ⊆S Here, Si is a subset of the key set S contained in the continuous query, |Si | is its cardinality, and predScoresi (p) represents the likelihood that p will produce a document containing Si in the future. This score is produced using time series analysis techniques similarly (plus resource selection techniques) as presented in Chapter 4. Thus, to obtain a publisher score, the system sums-up all prediction scores for the subsets Si that have been received from the directory peers such there is no Sj with Si ⊂ Sj.

The intuition behind weighting the prediction score with the size of the key set |Si | is that the prediction score for small subsets dominates the sum. This happens because the number of documents containing all the keys of small key sets is higher, resulting in higher prediction scores. Next section experimentally investigates different scenarios where multi-key statistics for the full continuous query are not available, but a combination of the statistics of subsets is possible.

- 92 Chapter 5 Correlated Key Sets

5.5 Experimental Evaluation In this experimental section, the USS and CSS algorithms are evaluated using two real-life corpora with Web and blog data. Since it is extremely difficult to find real-life continuous queries except by obtaining proprietary data (e.g., from CNN’s news or Springer’s journal alert system), the experiments utilize the popular Web queries contained in the Zeitgeist query-log, and treats them as subscriptions (e.g., assuming that a user is interested in staying informed about his favorite pop star, or a big sport event). The Google Zeitgeist 3 query-log is published periodically for different geographic regions to express the most interesting search trends.

5.5.1 Experimental Setup The evaluation compares the filtering performance of both algorithms with different synopses (hash sketches and KMV synopses) and against a baseline algorithm that computes publisher scores using single-key frequencies to predict the publishing behavior of information producers. This baseline algorithm corresponds with the baseline approach presented in 5.2. The prediction mechanisms have an average prediction error of 10% to ensure a realistic system behavior.

As quality measure to evaluate the filtering performance of both algorithms, the experiments use recall. Recall in the IF setting is defined as the ratio of the total number of notifications received by subscribers to the total number of published documents matching the subscriptions. In the experiments, the average recall over several publication rounds is considered. In all graphs illustrated in this section, recall depends on the percentage of monitored publisher peers.

5.5.2 Experimental Data For the experimental setup, a recently proposed benchmark [NBMW06] designed specifically for use in the evaluation of distributed and P2P settings is utilized. Briefly, the benchmark consists of more than 800, 000 Web documents drawn from the Wikipedia corpus, and an algorithm to distribute the documents among 1, 000 publishers with controlled overlap that mimics the real-world behavior.

The Zeitgeist query-log contains queries with one, two, and three keys. To investigate filtering quality depending on the conditional probabilities, the set of distinct single-keys in all queries is used to create two-, three, and four-key queries by combinations. During query evaluation, a continuous query is indexed in up to 25% of the publishers. The experiments use the concept of publication rounds to simulate the time properties of the published documents; here, a publisher peer publishes around 400 documents in each round, which could represent the weekly publications of a digital library or news portal.

The second data set focuses on blog data4. Out of a huge set of blogs, about 1, 000 blogs have been selected that published more than 300 postings each during a time period of three weeks. The selected blogs have been crawled and assigned to 1, 000 publishers. The crawled blog data also spans over three weeks. Again, the same Zeitgeist query-log is used to evaluate the filtering performance, while the query repositioning algorithm was executed to simulate one query repositioning per week.

3 http://www.google.com/press/zeitgeist.html archives query-logs since 2001 4 Another well-known source of blog data is Blogsphere [BK07].

–  –  –

Figure 5.1 (respectively 5.

2) shows the filtering quality for all two-key (respectively threekey) continuous queries from the Web query-log using the Web data set. The evaluation compares a baseline publisher selection algorithm based on behavior prediction for singlekey statistics as described in this thesis in Chapter 3, with the CSS approach of multikey statistics maintained in the directory, and with the USS approach using two different synopses (hash sketches and KMV synopses).



Pages:     | 1 |   ...   | 13 | 14 || 16 | 17 |   ...   | 25 |


Similar works:

«Kálvinizmus és a magyarság hazaszeretete A reformáció elterjedésével lényegesen csökkent, egyes területeken teljesen meg is szűnt Róma befolyása Európa egyházi életére. A reformált egyházak érthetően nem függtek a pápától, sőt el sem ismerték annak hatalmát. Ugyanakkor nem jött létre más egyházi „ csúcsszerv” sem, amely összefogta volna Európa protestáns egyházait, érdemlegesen beleszólhatott volna életükbe. Ezek az egyházak, bár különböző...»

«Gardener Winter They would make prices in your team but you report a module of service when increasing to the ownership. Be how you are to make, when they are to have up and what only do with a sale. Who can you triple in a statistics mean their costs to build if events how yourself have you? There are a day properties, municipality and loan, looking % post and troubled drop to take globally. They can come instructions with yellow tools and logos for non cities really are especially every...»

«Linguistic Awareness of Cultures Grundlagen eines Trainingsmoduls Bernd Müller-Jacquier aus: Bolten, Jürgen (ed.) (2000). Studien zur internationalen Unternehmenskommunikation. Leipzig: Popp, 20-49 LINGUISTIC AWARENESS OF CULTURES. PRINCIPLES OF A TRAINING MODULE by Bernd-Dietrich Müller (Chemnitz) 1 Introduction 2 Intercultural Communication A Communication Problem? 3 Intercultural Training and the Teaching of Intercultural Competences 3.1 Training Module: Linguistic Awareness of Cultures...»

«Deutscher Bundestag Drucksache 18/3270 18. Wahlperiode 20.11.2014 Unterrichtung durch die Bundesregierung Fortschrittsbericht zur Lage in Afghanistan 2014 einschließlich einer Zwischenbilanz des Afghanistan-Engagements Inhaltsverz eich nis Seite Einleitung Zusammenfassung des Fortschrittsberichts 2014 1. Staatswesen und Regierungsführung 1.1 Wahlen und Regierungswechsel 1.2 Regierungsführung und Institutionen 1.3 Internationale Beziehungen 1.4 Zivilgesellschaft und Menschenrechte 1.5...»

«UNITED STATES BANKRUPTCY COURT NOT FOR PUBLICATION SOUTHERN DISTRICT OF NEW YORK -x In re: Chapter 11 JOYCE E. BROWN, Case No. 07 B 22631 (ASH) Debtor.-x APPEARANCES: JOYCE E. BROWN, Debtor Pro Se 9 Weaver Street Scarsdale, NY 10583 LAW OFFICES OF JEFFREY S. GREENE, P.C. Attorneys for Secured Creditor By: Jeffrey S. Greene, Esq. One Barker Avenue Second Floor White Plains, NY 10601 ADLAI S. HARDIN, JR. UNITED STATES BANKRUPTCY JUDGE DECISION AND ORDER ANNULLING AUTOMATIC STAY NUNC PRO TUNC...»

«Alien terrestrial arthropods of Europe i Alien terrestrial arthropods of Europe Edited by Alain ROQUES, Marc KENIS, David LEES, Carlos LOPEZ-VAAMONDE, Wolfgang RABITSCH, Jean-Yves RASPLUS and David B. ROY Sofia–Moscow ii Alain Roques et al. (Eds) / BioRisk 4(1) (2010) BioRisk 4(1) (Special Issue) Alien terrestrial arthropods of Europe Edited by Alain Roques, Marc Kenis, David Lees, Carlos Lopez-Vaamonde, Wolfgang Rabitsch, Jean-Yves Rasplus And David B. Roy This work was supported by a grant...»

«CORRECTED UNITED STATES OF AMERICA Before the SECURITIES AND EXCHANGE COMMISSION SECURITIES EXCHANGE ACT OF 1934 Release No. 66212A / January 23, 2012 INVESTMENT ADVISERS ACT OF 1940 Release No. 3360A / January 23, 2012 ADMINISTRATIVE PROCEEDING File No. 3-14710 ORDER INSTITUTING ADMINISTRATIVE In the Matter of PROCEEDINGS PURSUANT TO SECTION 15(b) OF THE SECURITIES EXCHANGE 1st DISCOUNT BROKERAGE, ACT OF 1934 AND SECTIONS 203(e) AND INC. and MICHAEL R. 203(f) OF THE INVESTMENT ADVISERS FISHER...»

«George Walkden DiGS, Cambridge, 14th July 2010 Verb-third in early West Germanic: a comparative perspective George Walkden Department of Linguistics, University of Cambridge gw249@cam.ac.uk · http://www.srcf.ucam.org/~gw249/ Introduction A V2/V3 alternation has often been observed in Old English (OE) root clauses, with some authors (e.g. Hinterhölzl & Petrova 2009) speculating that the V3 pattern resulted from an innovation. I’ll present this alternation and my analysis of it, then draw on...»

«Forschung, August 2014 Programm Photovoltaik Ausgabe 2014 Überblicksbericht, Liste der Projekte Jahresberichte der Beauftragten 2013 ausgearbeitet durch: NET Nowak Energie & Technologie AG Auftraggeber: Bundesamt für Energie BFE Forschungsprogramm Photovoltaik 3003 Bern www.bfe.admin.ch Auftragnehmer: NET Nowak Energie & Technologie AG Waldweg 8, 1717 St. Ursen Tel. 026 494 00 30, Fax. 026 494 00 34 info@netenergy.ch Autor: Dr. Stefan Nowak, NET Nowak Energie & Technologie AG...»

«Institut für Tierwissenschaften Abteilung Tierernährung der Rheinischen Friedrich-Wilhelms-Universität Bonn Effect of thyme, oregano and their major active components on performance and intestinal microbial populations of broilers Inaugural-Dissertation zur Erlangung des Grades Doktor der Agrarwissenschaften (Dr. agr.) der Hohen Landwirtschaftlichen Fakultät der Rheinischen Friedrich-Wilhelms-Universität zu Bonn Vorgelegt im Februar 2011 Von M. Sc. Ahmed Aboubaker Abdel-Moniem Abdel-Wareth...»

«Einführung Erste Behauptung:Muhammad Bestrebungen nach Reichtum? Macht und Ruhm?. 12 Vereinigung der Araber? Hatte er Epilepsie? Zweite Behauptung: Jemand Lernte er den Qur’an von einem Dritte Behauptung: Der Qur’an Unerreichte Eloquenz Hätte ein Mensch diese Dinge wissen können? Dieses Wunder ist nicht durch Zeit und Der Qur’an verändert die Art zu Leben Keine Widersprüche Einführung Im Namen Allahs, Des Allerbarmers, Des Barmherzigen. Wahrlich, das Lob ist Allahs. Ich bezeuge,...»

«General Disclaimer The use of trade, firm, or corporation names in this publication is for the information and convenience of the reader. Such use does not constitute an official endorsement or approval by the Government of British Columbia of any product or service to the exclusion of any others that may also be suitable. Contents of this report are presented as information only. Funding assistance does not imply endorsement of any statements or information contained herein by the Government...»





 
<<  HOME   |    CONTACTS
2016 www.abstract.xlibx.info - Free e-library - Abstract, dissertation, book

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.