«Approximate Information Filtering in Structured Peer-to-Peer Networks Christian Zimmer Max-Planck Institute for Informatics Saarbrücken ...»
More formally, if β(S) is the set of bit positions ρ(h(d)) for all d ∈ S, then β(S1 ∪ S2 ) = β(S1 ) ∪ β(S2 ). Notice that, if both original collections carry a random document, the document will conceptually be counted only once, eﬀectively providing duplicate-insensitive (i.e. distinct item) counting for the union of the original multisets.
Obviously, to compute the intersection of a huge number of hash sketches, the relative error is propagated and the distinct-value estimation is getting inaccurate. Even regarding the overlap of four multisets, the sieve formula needs a high computation complexity.
2.5.2 KMV Synopses In [BHR+ 07], the KMV synopses and appropriate DV estimators are introduced. The main focus of KMVs is on arbitrary multiset operations including union, intersection, and diﬀerences. Their major diﬀerences when compared to hash sketches are the lower computational costs and the more accurate DV estimation. In this section, the main motivation behind the DV estimators is explained, and subsequently the KMV data structure and the basic DV estimator for them are introduced. Finally, by using the KMV synopsis and the basic estimator, the multiset operations union, intersection, and diﬀerence can be applied.
Assume that D points are placed randomly and uniformly on the unit interval. The expected distance between two neighboring points is 1/(D + 1) ≈ 1/D, such that the expected value of Uk, the k-th smallest point, is E[Uk ] ≈ k/D. Thus D ≈ k/E[Uk ]. If Uk itself is known, a basic estimator for the number of points as proposed in [BYJK+ 02] is
given in Equation 2.23:
ˆb Dk = k/Uk (2.23) In the DV estimation problem, an enumeration of distinct values v1, v2,..., vD in dataset A with domain Θ(A) is given. Using a hash function h : Θ(A) → 0, 1,..., M such that the sequence h(v1 ), h(v2 ),..., h(vD ) reminds a sequence of independent and identically distributed samples from the discrete uniform distribution on 0, 1,..., M. Assuming that M is suﬃciently greater than D, the sequence U1 = h(v1 )/M, U2 = h(v2 )/M,..., UD = h(vD )/M will approximate the realization of a sequence of samples from the continuous uniform distribution on [0, 1]. The requirement that M is much larger than D avoids collisions and ensures that h(vi ) = h(vj ) for all i = j, with high probability.
188.8.131.52 Creation and DV Estimator
Using the idea of the basic estimator introduced in [BYJK+ 02], a KMV synopsis for a multiset S is created as described in [BHR+ 07]: by applying the hash function h to each value of Θ(S), the k smallest of the hashed values are recorded. This simple synopsis (e.g., set LS of hashed values) is called KMV synopsis (stands for k minimum values).
As discussed previously, M = O(D2 ) is needed to avoid collisions such that each of the k hashed values requires O(log M ) = O(log D) bits. Thus, the size of the KMV synopsis is O(k·log D) bits. To compute the KMV synopsis, one single scan of the dataset S is done and only a sorted list of k hashed values is necessary. KMV synopses can deal with a variety of multiset operations (union, intersection, diﬀerence). To handle multiset diﬀerences, KMVs need to be augmented with counters. These AKMV synopses [BHR+ 07] also provide the deletion of single values.
In [BHR+ 07], the DV estimator for KMV synopses extends the basic estimator 2.23 and
uses the following computation:
So far, there is an estimator for KMV synopses. Now, the focus is on multiset operations using two or more KMV synopses in combination with estimating the compound set. Here, the operations union and intersection are explained in details. To perform other multiset operations, including diﬀerence or deletion, extended KMV synopses (AKMV synopses) are needed. AKMV synopses involve adding counters to the basic synopses, in the spirit of [CG05].
In the description, it is assumed that all synopses are created using the same hash function h : Θ → 0, 1,..., M where Θ denotes the data value domain appearing in the synopsis and M = O(|Θ|2 ). Ordinary set operations are denoted by ∪, ∩ and multiset operations by ∪m, ∩m.
Chapter 3 System Architecture and Protocols This chapter introduces the main architecture and presents the related protocols for approximate information ﬁltering in structured P2P networks. The architecture is named MAPS standing for Minerva Approximate Publish Subscribe, and builds upon the Minerva P2P Web search system to enrich one-time search with novel approximate publish/subscribe functionality. With MAPS the concept of approximate IF is introduced and the and the the ﬁrst architecture to support such functionality for P2P networks is provided. Although approximate information retrieval as realized in Minerva is not the main focus of this doctoral thesis, the appropriate protocols are presented below for completeness reasons.
Section 3.1 gives an introduction to approximate information ﬁltering, discusses the main contributions, and refers to related work in the research area of IF.
Sections 3.2 and 3.3 introduce the types of services implemented in MAPS and the main protocols regulating the interactions between peers in the network. Here, the thesis also presents the appropriate retrieval protocols for one-time searching. In Section 3.4, the MAPS approach is compared to an existing exact information ﬁltering approach for P2P networks. Section
3.5 summarizes and concludes this chapter.
3.1 Introduction Much information of interest to humans is available today on the Web, making it extremely diﬃcult to stay informed without sifting through enormous amounts of information. One of the most important issues is how to dig out from the information avalanche. In such a dynamic setting, information ﬁltering, also referred to as publish/subscribe or continuous querying or information push, is equally important to one-time querying, since users are able to subscribe to information sources and be notiﬁed when documents of interest are published. This need for content-based push technologies is also stressed by the deployment of new tools such as Google Alerts 1 or the QSR system [YJ06]. In an IF scenario, a user posts a subscription (or continuous query) or proﬁle representing his information demand (e.g., the fact that the user is interested in some sports) to the system to receive notiﬁcations whenever certain events of interest take place (e.g., when a paper on distributed systems becomes available). Information ﬁltering and information retrieval are often referred as two sides of the same coin [BC92]. Although many of the underlying issues and goals are similar in retrieval and ﬁltering, since in both cases a document needs to be matched against an information demand, the design issues, the techniques and algorithms devised to increase ﬁltering eﬃciency diﬀer signiﬁcantly.
1 GoogleAlerts (http://www.google.com/alerts) is a service oﬀered by search engine company Google which notiﬁes its users (i.e., by sending a notiﬁcation email) about the latest Web and news pages of their choice.
- 37 Chapter 3 System Architecture and ProtocolsThis doctoral thesis presents how to combine approximate search and ﬁltering functionality in a single unifying framework. Therefore, the existing Minerva system providing search capability is extended with novel approximate publish/subscribe functionality. The present thesis mainly focuses on information ﬁltering rather than information retrieval. However, the full set of services and protocols is presented for completeness reasons. Since the focus is on IF, issues related to query routing for P2P search are only brieﬂy discussed and the interested reader is referred to [BMWZ05, BMT+ 05a, MBN+ 06] for more details.
MAPS (Minerva Approximate Publish Subscribe) is a novel architecture based on the architecture of the Minerva search system to support content-based approximate information ﬁltering in P2P environments. While most information ﬁltering approaches taken so far have the underlying hypothesis of potentially delivering notiﬁcations from every information producer, MAPS relaxes this assumption by monitoring only selected sources that are likely to publish documents relevant to the user interests in the future. In MAPS, a user subscribes with a continuous query and monitors only the most interesting sources in the network. The user query is replicated to these sources and only published documents from these sources are forwarded to him. The system itself is responsible for managing the user query, discovering new potential sources and moving queries to better or more promising information sources. Since in an IF scenario the data is originally highly distributed residing on millions of sites (e.g., with people contributing to blogs), a P2P approach seems an ideal candidate for such a setting. However, exact pub/sub functionality has proven expensive for such distributed environments [TX03, TIK05b, AT06]. By contrast, MAPS oﬀers a natural solution to this problem, by avoiding document granularity dissemination as the main scalability bottleneck for other approaches.
As possible application scenarios for MAPS consider the case of news ﬁltering (but with the emphasis on information quality rather than timeliness of delivery) or blog ﬁltering where users subscribe to new posts. Not only do these settings pose scalability challenges, but they would also incur information avalanche and cognitive overload to the subscribed users, if these were alerted for each and every new document published at any source whenever this matched a submitted continuous query. The proposed approximate IF approach ranks sources, and delivers matches only from the top-ranked providers, by utilizing novel publisher selection strategies based on a combination of resource selection and behavior prediction. Despite that the presented approach focuses on a P2P setting, notice that the MAPS architecture can also be realized in other settings, like a single server monitoring a number of distributed sources, or a farm of servers in a data center providing an alerting service. The publisher peer selection strategies will be presented in Chapter 4. In the following, the main architecture including the services and protocols is introduced.
This chapter presents the services and the protocols to provide approximate information ﬁltering. Since, the MAPS approach extends the Minerva search system, the service and protocol for one-time search are also explained for completeness reasons. Chapter 4 covers the selection strategy coming along with MAPS because publisher selection denotes the most critical task in MAPS. In Section 3.4 of this chapter, the main characteristics of MAPS are compared to an existing exact information ﬁltering approach.
3.1.2 Previous Work on Information Filtering Before presenting the MAPS architecture including the services and protocols in detail, previous work on information ﬁltering is presented. The following areas are covered: pub/sub in databases (Section 184.108.40.206), pub/sub in information retrieval (Section 220.127.116.11), and exact pub/sub in P2P networks (Section 18.104.22.168). Since MAPS denotes the ﬁrst approach for approximate IF in P2P systems, no previous work can be listed.
22.214.171.124 IF in Databases
Database research on continuous queries has its origins in the paper [TGNO92] and systems OpenCQ [LPT00] and NiagaraCQ [CDTW00]. All these papers oﬀered centralized solutions to the problem of continuous query processing. More recently, continuous queries have been studied in depth in the context of monitoring and stream processing with various centralized [MSHR02, CF03] and distributed proposals [GL03, Ac04, JHR+ 04]. The eﬀorts to improve network eﬃciency and reduce delivery delays in content-based pub/sub systems lead to approaches like HYPER [ZH05], where a hybrid architecture that exploits properties of subject-based pub/sub approaches is presented.
Hermes [PB02] was one of the ﬁrst proposals to use a distributed hash table for building a topic-based pub/sub system, while PeerCQ [GL03] utilized a DHT to build a content-based system for processing continuous queries. It was also one of the ﬁrst approaches to assume that data is not stored in the DHT but are kept locally at external data sources. Finally, Meghdoot [GSAA04] utilized the CAN DHT [RFH+ 01] to support an attribute-value data model and oﬀered new ideas for the processing of subscriptions with range predicates and load balancing.
126.96.36.199 IF in Information Retrieval
Recently, several systems that employed an IR-based query language to support information ﬁltering on top of structured overlay networks have been deployed. DHTrie [TIK05b] extended the Chord protocol to achieve exact information ﬁltering functionality and applied document-granularity dissemination to achieve the recall of a centralized system at low message costs. Section 3.4 compares the DHTrie approach with MAPS to explain the architectural distinguishing features between exact and approximate information ﬁltering in (structured) P2P networks.
In the same spirit, LibraRing [TIK05a] presented a framework to provide information retrieval and ﬁltering services in two-tier digital library environments. Similarly, pFilter [TX03] used a hierarchical extension of the CAN DHT [RFH+ 01] to store user queries and relied on multi-cast trees to notify subscribers. Again, the DHT served as a distributed index, and documents were multi-casted to reach stored subscriptions. Finally, [AT06] shows how to implement a DHT-agnostic solution to support preﬁx and suﬃx operations over string attributes in a pub/sub environment.