«Approximate Information Filtering in Structured Peer-to-Peer Networks Christian Zimmer Max-Planck Institute for Informatics Saarbrücken ...»
2.1.4 Example: The Pastry Protocol Pastry 5 [RD01a] is another self-organizing structured overlay network. The protocol uses a routing schema based on preﬁx matching. Each Pastry node is assigned a globally unique 128-bit identiﬁer from the domain [0, 2128 − 1], in form of sequences of digits with base 2b where b is a conﬁguration parameter. A typical value for b is 4. Similar to Chord, Pastry oﬀers a simple routing method that eﬃciently determines the node that is numerically closest to a given key, i.e., which is currently responsible for maintaining this key. To enable eﬃcient routing within a network of N nodes, each peer maintains a routing table that consists of log2b N rows with 2b − 1 entries each. Each entry consists of a Pastry identiﬁer and the corresponding contact information (i.e., IP address, port) of the numerically closest node currently responsible for that key. All 2b − 1 entries in row n represent nodes with a Pastry identiﬁer that shares the ﬁrst n digits with the current node, but each with a diﬀerent n + 1-st digit (2b − 1 possible values). Figure 2.6 shows an example routing table with 8 rows for a Pastry node with identiﬁer 10233102 and b = 2.
The preﬁx routing now works as follows: For a given key, the current node forwards the request to that node from its routing table that has the longest common preﬁx with the key. Intuitively, each routing hop can ﬁx one additional digit toward the desired key. Thus, in a network of N nodes, Pastry can route a message to a currently responsible node with less than log2b N message hops.
Pastry is intended as general substrate for the construction of a variety of P2P applications like global ﬁle sharing, ﬁle storage, group communication and naming systems.
Several application have been built on top of Pastry to date, including a global, persistent storage utility called PAST [RD01b] and a scalable publish/subscribe system called Scribe [RKCD01].
2.2 Information Retrieval Information retrieval is the science of searching for information in documents, searching for documents themselves, searching for metadata which describe documents, or searching within databases. IR systems keep collections of unstructured or semi-structured data (e.g., text documents or HTML pages) and oﬀer search functionalities for delivering documents relevant to a query. Typical examples of IR systems include Web search engines such as Google, Yahoo or Live Search (formerly MSN Search), or digital libraries. Lately, relational database systems are integrating IR functionality as well (see [MYL02, Cha02] for more details about database systems that support IR functionalities).
2.2.1 Effectiveness Measures Precision and recall are the two most widely used measures for evaluating the quality of results in the area of IR. Precision can be seen as a measure of exactness or ﬁdelity, whereas
recall is a measure of completeness. Both measures are deﬁned as follows:
The F1 -measure is a good candidate for optimization. Given the fact that one can get a perfect precision score by not returning any documents, or a perfect recall score by returning all documents. A powerful document retrieval system will yield all truly relevant documents and only truly relevant documents, maximizing precision and recall at the same time, and therefore maximizing the F1 -score which falls in the range from 0 to 1 (with 1 being the best possible score). Two other commonly used F-measures are the F2 -measure, which weights recall twice as much as precision, and the F0.5 -measure, which weights precision twice as much as recall. More about IR measures can be found in [BY99].
2.2.2 Document and Query Representation Usually, documents are represented by the words (or terms, keywords, or keys) that are contained in them. Typically, a small set of so-called stop words are not considered and are ignored because these words do not contain any semantic information (e.g., a, the, of, etc.).
Stemming describes the reduction of words to their common preﬁx stem thus combining combining words with a common root (e.g., walk, walking, walker ). A very widely used stemmer that became the de-facto standard algorithm used for English words is the Porter Stemmer [RvRP80].
After the removal of all stop words and after stemming, a document in a collection can be represented by a vector of n key occurrences, where n is the total number of distinct keys in the set of all documents in a collection. In this vector space model, a document is represented by a vector (w1, w2,..., wn ), where each wi is the weight (or score) indicating the importance of key i in a document. Due to the high number of distinct keys in the whole collection and due to the absent of most keys in a typical document, most of the vector entries will be zero. A weight assigned to a key is usually based on the following two
• The number of occurrences of key t in document d is called term frequency and typically denoted as tft,d. Intuitively, the weight of a key within a document increases with the number of occurrences.
• The number of documents in a collection that contain a key t is called document frequency and denoted as dft ; the inverse document frequency idft is deﬁned as the inverse of dft. Intuitively, the relative importance of a key decreases as the number of documents that contain this key increases, i.e., the key oﬀers less diﬀerentiation between the documents.
In Equation 2.6, q and d are the query and document vectors, respectively. For a particular query, the scalar product intuitively sums up the importance scores for those keys of the document that are contained in the query vector, weighted by their respective weights within the query. The cosine measure can be used to overcome the tendency of the scalar product function to favor longer documents having many keys, capturing the angle between
the two vectors by using the vector lengths:
2.2.3 Top-k Query Processing Computing the scalar product, the cosine measure, or the BM25 score for all documents concerning a query is a computationally expensive task. Thus, there are several strategies to eﬃciently identify only the top-k documents (i.e., the k documents with the highest score for the given query) without evaluating the full scalar product for all possible documents.
In order to eﬃciently support this process, the concept of inverted index lists has been developed [ZM06]. All keys that appear in the whole collection form a tree-based index structure (often a B + -tree or a trie) where the leaves contain a list of unique document identiﬁers for exactly those documents that contain this key (see Figure 2.7). Depending on the exact query execution strategy, the lists of document identiﬁers may be ordered according to the document identiﬁers or according to their key scores to allow for eﬃcient pruning.
An eﬃcient algorithm should avoid reading inverted index lists completely, but would ideally limit the eﬀort to O(k) steps where k is the number of desired results. In the IR and multimedia-search literature, there are various algorithms to accomplish this task.
The threshold algorithm (TA) by Fagin [FLN01] (independently proposed also by Nepal et al. [NR99] and Güntzer et al. [GBK00]) is the best known method for top-k queries.
It uses index lists that are sorted in descending order of key weights under the additional assumption that the ﬁnal score for a document is calculated using a monotone aggregation function (such as a simple sum function). TA traverses all inverted index lists in a roundrobin manner, i.e., lists are mainly traversed using sorted accesses. For every new document d encountered, TA uses random accesses in the remaining index lists to calculate the ﬁnal score for d and keeps this information in a document candidate set of size k. Since TA additionally keeps track of an upper score bound for documents not yet encountered, the algorithm terminates as soon as this bound assures that no previously unseen document can enter the candidate set.
Figure 2.7: B + -Tree with Inverted Index Lists.
There are extensions of this algorithm that further reduce the number of index list accesses (especially expensive random accesses) by a more sophisticated scheduling. Also, probabilistic methods have been studied that can further improve the eﬃciency of index processing [TSW05, BMS+ 06].
2.3 Information Filtering Information retrieval and information ﬁltering (or selective dissemination of information or publish/subscribe) are often referred as two sides of the same coin [BC92]. Although many of the underlying issues and goals are similar in retrieval and ﬁltering, since in both cases a document needs to be matched against an information demand, the design issues, the techniques and algorithms devised to increase ﬁltering eﬃciency diﬀer signiﬁcantly.
Information ﬁltering can also be seen as an general application of information retrieval because most of the issues which appear at ﬁrst to be unique to IF, are really specializations of IR problems.
2.3.1 IF in Databases and Distributed Systems Selective dissemination of information has its origin in a 1958 article about Business Intelligence System [Luh58] where a selection module (selective dissemination of new information) is used to produce lists of new documents matching interests of users described by proﬁles.
The concept of information ﬁltering was introduced in [Den82], where the need to ﬁlter incoming mail messages to sort them in order of urgency is described6. In the following, only work most relevant to this thesis and mainly referred as content-based ﬁltering will be discussed. Early approaches to information ﬁltering by IR researchers focused mainly on appropriate representations of user interests [MS94] and on improving ﬁltering eﬀectiveness [HPS96].
6 At that time, spam ﬁlters were not needed since spam was not such a big issue than today.
- 27 Chapter 2 Background and Technical Basics[BM96] is one of the ﬁrst approaches to address performance, where an information ﬁltering system capable of scaling up to large tasks is described. Here, a server receives documents at a high rate, and proposed algorithms support vector space queries by improving the algorithm SQI of [YGM94a]. InRoute [Cal96] creates documents and query networks and uses belief propagation techniques to ﬁlter incoming documents based on inference networks. [ZC01, Cal98] mainly focus on adaptive ﬁltering and how vector space queries and their dissemination thresholds are adapted based on documents processed in the past. Filtering news articles [CGR05, GDH04] is also an important research area related to IF but stresses the focus on personalized duplicate elimination (or information novelty) and freshness of the results shown to the user.
In the database literature, the term selective dissemination of information is used [FZ98].
The term publish/subscribe system (or pub/sub system) comes from distributed systems, and has also been used in this context by database researchers. SIFT [YGM99, YGM94b] constitutes another inﬂuential system where publications are documents in free text form and queries are conjunctions of keywords. This system was the ﬁrst approach to emphasize query indexing as a means to achieve scalability in pub/sub systems [YGM94b]. Later, other work focused on pub/sub systems with data models based on attribute-value pairs and query languages based on attributes with arithmetic and string comparison operators [FJL+ 01, NACP01]. [CCC+ 01] considers a data model based on attribute-value pairs but goes beyond conjunctive queries (the standard class of queries). Subscription summarization to support pub/sub functionality was also a prominent example [TE04].
Recent work on XML data management [KP05] has focused on publications that are XML documents and queries that are subsets of XPath or XQuery (e.g., XFilter [AF00], YFilter [DAF+ 03], or Xtrie [CFGR02]). All these papers discuss sophisticated ﬁltering algorithms that base on indexing queries. There are several publish/subscribe systems that have been developed over the years in the area of distributed systems and networks. Data models are based on channels, topics, and attribute-value pairs [CRW01]. The latter systems are called content-based as in the IR literature, because attribute-value data models are ﬂexible enough to express the content of messages in various applications. The query language of content-based systems is based on Boolean expressions of arithmetic and string operations.
The SIENA system [CRW01] uses a data model and language based on attribute-value pairs and demonstrates how to express notiﬁcations, subscriptions and advertisements in this language.
2.3.2 IF in Peer-to-Peer Systems The two pub/sub systems DIAS [KKTR02] and P2P-DIET [KTID03] use some prominent ideas from the database and distributed systems in a single unifying framework. The P2PDIET [IKT04b, IKT04a] approach demonstrates how to provide the traditional one-time query scenario of typical super-peer systems and the pub/sub features of the SIENA system [CRW01]. An extension of P2P-DIET considers a similar problem for distributing RDF meta-data in an Edutella [NWQ+ 02] fashion. The development of structured P2P systems mainly based on distributed hash tables such as CAN [RFH+ 01], Chord [SMK+ 01], Pastry [RD01a]), P-Grid [Abe01], or Tapestry [ZZJ+ 01] introduced a wide-area of new pub/sub systems. Scribe [RKCD01] is a topic-based publish/subscribe system based on Pastry.
Hermes [PB02] is similar to Scribe because it uses with Pastry the same underlying DHT but it allows more expressive subscriptions by supporting the notion of an event type with attributes. Every event type in Hermes is managed by an event broker which is a rendezvous node for subscriptions and publications related to this event. Related ideas and approaches appear in [TAJ04, TBF+ 03].