FREE ELECTRONIC LIBRARY - Abstract, dissertation, book

Pages:     | 1 |   ...   | 5 | 6 || 8 | 9 |   ...   | 25 |

«Approximate Information Filtering in Structured Peer-to-Peer Networks Christian Zimmer Max-Planck Institute for Informatics Saarbrücken ...»

-- [ Page 7 ] --

More formally, if β(S) is the set of bit positions ρ(h(d)) for all d ∈ S, then β(S1 ∪ S2 ) = β(S1 ) ∪ β(S2 ). Notice that, if both original collections carry a random document, the document will conceptually be counted only once, effectively providing duplicate-insensitive (i.e. distinct item) counting for the union of the original multisets.

–  –  –

Obviously, to compute the intersection of a huge number of hash sketches, the relative error is propagated and the distinct-value estimation is getting inaccurate. Even regarding the overlap of four multisets, the sieve formula needs a high computation complexity.

–  –  –

2.5.2 KMV Synopses In [BHR+ 07], the KMV synopses and appropriate DV estimators are introduced. The main focus of KMVs is on arbitrary multiset operations including union, intersection, and differences. Their major differences when compared to hash sketches are the lower computational costs and the more accurate DV estimation. In this section, the main motivation behind the DV estimators is explained, and subsequently the KMV data structure and the basic DV estimator for them are introduced. Finally, by using the KMV synopsis and the basic estimator, the multiset operations union, intersection, and difference can be applied.

Assume that D points are placed randomly and uniformly on the unit interval. The expected distance between two neighboring points is 1/(D + 1) ≈ 1/D, such that the expected value of Uk, the k-th smallest point, is E[Uk ] ≈ k/D. Thus D ≈ k/E[Uk ]. If Uk itself is known, a basic estimator for the number of points as proposed in [BYJK+ 02] is

given in Equation 2.23:

ˆb Dk = k/Uk (2.23) In the DV estimation problem, an enumeration of distinct values v1, v2,..., vD in dataset A with domain Θ(A) is given. Using a hash function h : Θ(A) → 0, 1,..., M such that the sequence h(v1 ), h(v2 ),..., h(vD ) reminds a sequence of independent and identically distributed samples from the discrete uniform distribution on 0, 1,..., M. Assuming that M is sufficiently greater than D, the sequence U1 = h(v1 )/M, U2 = h(v2 )/M,..., UD = h(vD )/M will approximate the realization of a sequence of samples from the continuous uniform distribution on [0, 1]. The requirement that M is much larger than D avoids collisions and ensures that h(vi ) = h(vj ) for all i = j, with high probability. Creation and DV Estimator

Using the idea of the basic estimator introduced in [BYJK+ 02], a KMV synopsis for a multiset S is created as described in [BHR+ 07]: by applying the hash function h to each value of Θ(S), the k smallest of the hashed values are recorded. This simple synopsis (e.g., set LS of hashed values) is called KMV synopsis (stands for k minimum values).

As discussed previously, M = O(D2 ) is needed to avoid collisions such that each of the k hashed values requires O(log M ) = O(log D) bits. Thus, the size of the KMV synopsis is O(k·log D) bits. To compute the KMV synopsis, one single scan of the dataset S is done and only a sorted list of k hashed values is necessary. KMV synopses can deal with a variety of multiset operations (union, intersection, difference). To handle multiset differences, KMVs need to be augmented with counters. These AKMV synopses [BHR+ 07] also provide the deletion of single values.

In [BHR+ 07], the DV estimator for KMV synopses extends the basic estimator 2.23 and

uses the following computation:

–  –  –

So far, there is an estimator for KMV synopses. Now, the focus is on multiset operations using two or more KMV synopses in combination with estimating the compound set. Here, the operations union and intersection are explained in details. To perform other multiset operations, including difference or deletion, extended KMV synopses (AKMV synopses) are needed. AKMV synopses involve adding counters to the basic synopses, in the spirit of [CG05].

In the description, it is assumed that all synopses are created using the same hash function h : Θ → 0, 1,..., M where Θ denotes the data value domain appearing in the synopsis and M = O(|Θ|2 ). Ordinary set operations are denoted by ∪, ∩ and multiset operations by ∪m, ∩m.

–  –  –

Chapter 3 System Architecture and Protocols This chapter introduces the main architecture and presents the related protocols for approximate information filtering in structured P2P networks. The architecture is named MAPS standing for Minerva Approximate Publish Subscribe, and builds upon the Minerva P2P Web search system to enrich one-time search with novel approximate publish/subscribe functionality. With MAPS the concept of approximate IF is introduced and the and the the first architecture to support such functionality for P2P networks is provided. Although approximate information retrieval as realized in Minerva is not the main focus of this doctoral thesis, the appropriate protocols are presented below for completeness reasons.

Section 3.1 gives an introduction to approximate information filtering, discusses the main contributions, and refers to related work in the research area of IF.

Sections 3.2 and 3.3 introduce the types of services implemented in MAPS and the main protocols regulating the interactions between peers in the network. Here, the thesis also presents the appropriate retrieval protocols for one-time searching. In Section 3.4, the MAPS approach is compared to an existing exact information filtering approach for P2P networks. Section

3.5 summarizes and concludes this chapter.

3.1 Introduction Much information of interest to humans is available today on the Web, making it extremely difficult to stay informed without sifting through enormous amounts of information. One of the most important issues is how to dig out from the information avalanche. In such a dynamic setting, information filtering, also referred to as publish/subscribe or continuous querying or information push, is equally important to one-time querying, since users are able to subscribe to information sources and be notified when documents of interest are published. This need for content-based push technologies is also stressed by the deployment of new tools such as Google Alerts 1 or the QSR system [YJ06]. In an IF scenario, a user posts a subscription (or continuous query) or profile representing his information demand (e.g., the fact that the user is interested in some sports) to the system to receive notifications whenever certain events of interest take place (e.g., when a paper on distributed systems becomes available). Information filtering and information retrieval are often referred as two sides of the same coin [BC92]. Although many of the underlying issues and goals are similar in retrieval and filtering, since in both cases a document needs to be matched against an information demand, the design issues, the techniques and algorithms devised to increase filtering efficiency differ significantly.

1 GoogleAlerts (http://www.google.com/alerts) is a service offered by search engine company Google which notifies its users (i.e., by sending a notification email) about the latest Web and news pages of their choice.

- 37 Chapter 3 System Architecture and Protocols

This doctoral thesis presents how to combine approximate search and filtering functionality in a single unifying framework. Therefore, the existing Minerva system providing search capability is extended with novel approximate publish/subscribe functionality. The present thesis mainly focuses on information filtering rather than information retrieval. However, the full set of services and protocols is presented for completeness reasons. Since the focus is on IF, issues related to query routing for P2P search are only briefly discussed and the interested reader is referred to [BMWZ05, BMT+ 05a, MBN+ 06] for more details.

MAPS (Minerva Approximate Publish Subscribe) is a novel architecture based on the architecture of the Minerva search system to support content-based approximate information filtering in P2P environments. While most information filtering approaches taken so far have the underlying hypothesis of potentially delivering notifications from every information producer, MAPS relaxes this assumption by monitoring only selected sources that are likely to publish documents relevant to the user interests in the future. In MAPS, a user subscribes with a continuous query and monitors only the most interesting sources in the network. The user query is replicated to these sources and only published documents from these sources are forwarded to him. The system itself is responsible for managing the user query, discovering new potential sources and moving queries to better or more promising information sources. Since in an IF scenario the data is originally highly distributed residing on millions of sites (e.g., with people contributing to blogs), a P2P approach seems an ideal candidate for such a setting. However, exact pub/sub functionality has proven expensive for such distributed environments [TX03, TIK05b, AT06]. By contrast, MAPS offers a natural solution to this problem, by avoiding document granularity dissemination as the main scalability bottleneck for other approaches.

As possible application scenarios for MAPS consider the case of news filtering (but with the emphasis on information quality rather than timeliness of delivery) or blog filtering where users subscribe to new posts. Not only do these settings pose scalability challenges, but they would also incur information avalanche and cognitive overload to the subscribed users, if these were alerted for each and every new document published at any source whenever this matched a submitted continuous query. The proposed approximate IF approach ranks sources, and delivers matches only from the top-ranked providers, by utilizing novel publisher selection strategies based on a combination of resource selection and behavior prediction. Despite that the presented approach focuses on a P2P setting, notice that the MAPS architecture can also be realized in other settings, like a single server monitoring a number of distributed sources, or a farm of servers in a data center providing an alerting service. The publisher peer selection strategies will be presented in Chapter 4. In the following, the main architecture including the services and protocols is introduced.

–  –  –

This chapter presents the services and the protocols to provide approximate information filtering. Since, the MAPS approach extends the Minerva search system, the service and protocol for one-time search are also explained for completeness reasons. Chapter 4 covers the selection strategy coming along with MAPS because publisher selection denotes the most critical task in MAPS. In Section 3.4 of this chapter, the main characteristics of MAPS are compared to an existing exact information filtering approach.

3.1.2 Previous Work on Information Filtering Before presenting the MAPS architecture including the services and protocols in detail, previous work on information filtering is presented. The following areas are covered: pub/sub in databases (Section, pub/sub in information retrieval (Section, and exact pub/sub in P2P networks (Section Since MAPS denotes the first approach for approximate IF in P2P systems, no previous work can be listed. IF in Databases

Database research on continuous queries has its origins in the paper [TGNO92] and systems OpenCQ [LPT00] and NiagaraCQ [CDTW00]. All these papers offered centralized solutions to the problem of continuous query processing. More recently, continuous queries have been studied in depth in the context of monitoring and stream processing with various centralized [MSHR02, CF03] and distributed proposals [GL03, Ac04, JHR+ 04]. The efforts to improve network efficiency and reduce delivery delays in content-based pub/sub systems lead to approaches like HYPER [ZH05], where a hybrid architecture that exploits properties of subject-based pub/sub approaches is presented.

Hermes [PB02] was one of the first proposals to use a distributed hash table for building a topic-based pub/sub system, while PeerCQ [GL03] utilized a DHT to build a content-based system for processing continuous queries. It was also one of the first approaches to assume that data is not stored in the DHT but are kept locally at external data sources. Finally, Meghdoot [GSAA04] utilized the CAN DHT [RFH+ 01] to support an attribute-value data model and offered new ideas for the processing of subscriptions with range predicates and load balancing. IF in Information Retrieval

Recently, several systems that employed an IR-based query language to support information filtering on top of structured overlay networks have been deployed. DHTrie [TIK05b] extended the Chord protocol to achieve exact information filtering functionality and applied document-granularity dissemination to achieve the recall of a centralized system at low message costs. Section 3.4 compares the DHTrie approach with MAPS to explain the architectural distinguishing features between exact and approximate information filtering in (structured) P2P networks.

In the same spirit, LibraRing [TIK05a] presented a framework to provide information retrieval and filtering services in two-tier digital library environments. Similarly, pFilter [TX03] used a hierarchical extension of the CAN DHT [RFH+ 01] to store user queries and relied on multi-cast trees to notify subscribers. Again, the DHT served as a distributed index, and documents were multi-casted to reach stored subscriptions. Finally, [AT06] shows how to implement a DHT-agnostic solution to support prefix and suffix operations over string attributes in a pub/sub environment.

–  –  –

Pages:     | 1 |   ...   | 5 | 6 || 8 | 9 |   ...   | 25 |

Similar works:

«MEGEMLÉKEZÉS DR. BABOCSAY JÓZSEFRŐL, SZÜLETÉSÉNEK 200. ÉVFORDULÓJA ALKALMÁBÓL í r t a : Prof. C S E K E Y ISTVÁN (Pécs) r. Babocsay József nagykanizsai orvos H é v í z gyógyfürdő első D ismertetője v o l t. 1795-ben jelent meg névtelenül, csupán kez­ dőbetűivel ( D. B. J.) megjelölve Sopronban, Sziesz K l á r a nyom­ d á j á b a n 28 lapra terjedő hírverő füzete a következő c í m m e l : „ B o l ­ d o g Z a l a v á r m e g y e ! Keszhelyi...»

«Casualty List 63rd New York Volunteer Infantry Meagher’s Irish Brigade Antietam Maryland September 17, 1862 Casualty List 63rd New York “Bloody Lane” Antietam September 17, 1862 This list was prepared from a photocopy of an original casualty list I found in the National Archives, Washington D. C., with entries from the “Annual Report of the Adjutant General of the State of New York,” 1901, which is the 63rd New York Volunteer Infantry unit roster, and other sources. The names are...»

«1 Aus dem Inhalt Einladung zum Ehemaligentreffen Das Gymnasium Zitadelle 2007 von Charly Kreiner Die Zitadelle im Blick von Ernst Fettweis Neues aus der Anstalt Anke de Wit Thomas Bonzeck André Hansen Dirk Neumann Bernhard Wille Peter Joachim Reichard und Inge Duwe Impressionen eines großen Abschieds Was mich mit P.J. Reichard und Inge Duwe verbindet.25 Der „Chef“ und seine Gäste Zum Abschied von Catherine Wagner Unsere neuen Sextaner Abitur 2007 Die Schülerrede von David Joecken und...»

«УДК 615.11.4 ББК 52.8 Р 17 Р 17 Развитие фармацевтической практики: фокус на пациента. — Б.: 2008. —112 с. СитиХоуп Интернешнл, Инк. ISBN 978–9967–24–850–2 Опубликовано Всемирной организацией здравоохранения и Международной фармацевтической федерацией в 2006 г. под названием «Developing pharmacy practice...»

«NC STATE UNIVERSITY BioResources, a peer-reviewed journal College of Natural Resources devoted to the science of lignocellulosic Department of Wood and Paper Science Campus Box 8005 materials, chemicals, and applications Raleigh, NC 27695-8005 919.515.7707/919.513.3022 919.515.6302 (fax) BioResources Contents: Vol. 2, Issue 1, February 2007 Hubbe, M. A. (2007). Incinerate, recycle, or wash and reuse, BioRes. 2(1), 1-2. Bi, W., and Coffin, D. W., (2007). Racking strength of paperboard based...»

«Police Aviation News August 2010 ©Police Aviation Research Number 172 August 2010 IPAR Police Aviation News August 2010 2 PAN—Police Aviation News is published monthly by POLICE AVIATION RESEARCH, 7 Windmill Close, Honey Lane, Waltham Abbey, Essex EN9 3BQ UK. Contacts: Main: +44 1992 714162 Cell: +44 7778 296650 Skype: BrynElliott E-mail: editor@policeaviationnews.com SPONSORS Bob Crowe www.bobcroweaircraft.com Broadcast Microwave www.downlinkexperts.com Diamond Aircraft www.diamond-air.at...»

«Moon Take A Hike Portland Hikes Within Two Hours Of The City The number depends to deter to embark card on your bank could keep like a more. You was have different amenities and be challenging on whole exhausts over they buy to want even provide along and discuss out their objectives or help ready you and become defined if all sought-after franchise. Of it have a many frequency process, the hand allocation may file it on payment a coverage is how you are again through courier as your epub...»

«Hong Kong Exchanges and Clearing Limited and The Stock Exchange of Hong Kong Limited take no responsibility for the contents of this announcement, make no representation as to its accuracy or completeness and expressly disclaim any liability whatsoever for any loss howsoever arising from or in reliance upon the whole or any part of the contents of this announcement. DAQING DAIRY HOLDINGS LIMITED 大慶乳業控股有限公司 (Incorporated in the Cayman Islands with limited liability) (Stock...»

«Department of Microbiology and Genetics Institute of Biochemical Microbiology Global transcriptional responses of Candida albicans to histone acetyltranferases deletion Ph.D Thesis Zahra Rashki Ghalehnoo Ph.D Thesis Adviser Prof. Dr. Angel Dominguez Olavarri Department of Microbiology and Genetics, University of Salamanca Salamanca, Spain, 2009 Prof. Dr. Angel Dominguez Olavarri Professor and Chair of Microbiology at the Department of Microbiology and Genetics, University of Salamanca Autoriza:...»

«Journal of Hebrew Scriptures Volume 14, Article 7 DOI:10.5508/jhs.2014.v14.a7 The Ashkar-Gilson Manuscript: Remnant of a Proto-Masoretic Model Scroll of the Torah PAUL SANDERS Articles in JHS are being indexed in the ATLA Religion Database, RAMBI, and BiBIL. Their abstracts appear in Religious and Theological Abstracts. The journal is archived by Library and Archives Canada and is accessible for consultation and research at the Electronic Collection site maintained by Library and Archives...»

«Rexane Dehdashti-Rasmussen Der Konflikt um Berg-Karabach: Ursachen, Verhandlungsstand und Perspektiven1 Zwölf Jahre nach dem Waffenstillstand von 1994 steht eine Reglung des Berg-Karabach-Konflikts noch immer aus. Einige Faktoren scheinen nach wie vor unverändert: Noch immer herrschen in der Region Blockaden und Hunderttausende Flüchtlinge und Binnenvertriebene warten auf die Rückkehr in ihre Heimat. Die 1992 begonnenen Verhandlungen unter der Ägide der OSZE dauern an, während sich der...»

«Datenschutzrichtlinie Die Firma Colbette II Ltd., 3rd Floor, 178 Athalassas Avenue, 2025 Nikosia, Zypern (nachstehend “COLBETTE” “wir, unser oder uns) ist dazu verpflichtet, Ihre persönlichen Daten jederzeit zu schützen. Diese DATENSCHUTZRICHTLINIE, zusammen mit den NUTZUNGSBEDINGUNGEN, den NUTZUNGSBEDINGUNGEN FÜR PUBLISHER und allen anderen sich darauf beziehenden Dokumenten, bildet die Basis für die Erhebung, Verarbeitung und Nutzung der von uns von den NUTZERN oder PUBLISHERN der...»

<<  HOME   |    CONTACTS
2016 www.abstract.xlibx.info - Free e-library - Abstract, dissertation, book

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.