«Approximate Information Filtering in Structured Peer-to-Peer Networks Christian Zimmer Max-Planck Institute for Informatics Saarbrücken ...»
Beyond that, new algorithms to approximate multi-key statistics by combining the statistics of arbitrary subsets are provided. Whereas hash sketches, used for compactly representing the documents, yield inaccurate results when considering continuous queries with more than two keys, the usage of very recent state-of-the-art techniques (KMV synopses) for compact representation of multisets is proposed and applied. These new structures allow the system to compute accurate synopses for multi-key queries, and improve the ﬁltering eﬀectiveness.
The experimental evaluation of both algorithms (USS and CSS) illustrates the ﬁltering performance improvements in comparison to the basic MAPS approach. All experimental series use two diﬀerent real-world collections for Web and blog data, and apply real-world queries from Google Zeitgeist. The evaluation also investigates ﬁltering performance gains depending on the introduced correlation measure (conditional probability) representing a way to compute the relatedness among keys.
1.2.3 Prototype System Within this thesis, a prototype implementation [ZHTW08] has been developed. This prototype was initially meant to serve as a testing environment for the conducted experiments, in order to evaluate the novel methods for approximate ﬁltering, but also aimed at demonstrating the feasibility of the combination of retrieval and ﬁltering functionality in a single unifying system.
Thus, the approximate information ﬁltering approach MAPS introduced in this thesis has been integrated into the Minerva search prototype [BMT+ 05b] such that Minerva also provides an approximate publish/subscribe functionality. In this regard, the implementation aspects concerning the extension of Minerva are investigated and some new components (to subscribe with continuous query, and to publish new documents) upgrade the existent prototype. An extensive use case describes the appliance of the extended Minerva prototype by executing sample one-time and continuous queries in detail. There, the various parts of the graphical user interface are illustrated. In this context, a short overview of existing retrieval and ﬁltering prototypes is given.
1.2.4 Digital Library Use Case The thesis presents a digital library use case (called MinervaDL) as an application scenario to support approximate information retrieval and ﬁltering functionality in a single
unifying framework. MinervaDL, as introduced in [ZTW07], is hierarchical and utilizes a distributed hash table to achieve scalability, fault-tolerance, and robustness in its routing layer. The MinervaDL architecture allows handling huge amounts of data provided by DLs in a distributed and self-organizing way, and provides an infrastructure for creating large networks of digital libraries with minimal administration costs. There are two kinds of
basic functionality that are oﬀered in the DL architecture of MinervaDL:
• In an information retrieval scenario (or one-time querying), a user poses an one-time query and the system returns all resources matching the query (e.g., all currently available documents relevant to the requested query).
• In an information ﬁltering scenario (or publish/subscribe or continuous querying), a user submits a continuous query (or subscription) and will later be notiﬁed from the DL system about certain events of interest that take place (i.e., about newly published documents relevant to the continuous query).
The proposed DL architecture is built upon a distributed directory storing metadata.
The thesis presents routing protocols for information ﬁltering and retrieval that use this directory information to perform the two functionalities. MinervaDL distinguishes three
• Super-peers run the DHT protocol and form a distributed directory that maintains metadata about providers’ local knowledge in compact form. In MinervaDL, superpeers are responsible for serving information consumers and providers and act as their access point to the network. Super-peers can be deployed by large institutions like universities, research centers or content providers to provide access points for their users or digital libraries.
• Consumers are utilized by users (e.g., students, faculty or employees) to connect to the MinervaDL network, using a single super-peer as their access point. Utilizing a consumer peer allows users to pose one-time queries, receive relevant resources, subscribe to resource publications with continuous queries and receive notiﬁcations about published resources (e.g., documents) that match their interests.
• Providers are implemented by information sources that want to expose their content to the MinervaDL network. Typical examples are digital libraries deployed by larger institutions, like research centers or content providers (e.g., CiteSeer, ACM, Springer, or Elsevier). Provider peer use a directory peer (super-peer) as their access point and utilize it to distribute statistics about their local resources to the network. Providers answer one-time queries and store continuous queries submitted by consumers to match them against new documents they publish.
This thesis presents diﬀerent scoring functions to select appropriate provider peers answering a one-time query or storing a continuous query for future matching publications.
The experimental evaluation investigates the inﬂuence of the individual scoring functions used to perform the two functionalities of MinervaDL. Furthermore, MinervaDL is compared to another DL architecture for P2P networks (LibraRing) providing retrieval and ﬁltering at the same time. Unlike MinervaDL, this architecture provides exact retrieval and ﬁltering functionality.
- 13 Chapter 1 Introduction
1.3 Thesis OutlineThis thesis is comprised of eight chapters, the ﬁrst being the current introductory chapter. The remainder of this thesis is organized as follows. Chapter 2 gives an background overview about existing work in the areas of P2P, information retrieval, information ﬁltering, time series analysis, and distinct-value estimation. Chapter 3 introduces the main architecture that is used for approximate information retrieval and information ﬁltering in structured P2P networks. This architecture extends the Minerva search architecture with the MAPS approach for approximate information ﬁltering. This chapter also presents the services and protocols of the complete architecture but with focus on approximate ﬁltering.
Chapter 4 focuses on publisher peer selection in the approximate information ﬁltering approach MAPS and includes an extensive evaluation. Chapter 5 extends the MAPS approach and considers correlated keys. Two diﬀerent algorithms that improve ﬁltering quality are presented and compared to the baseline approach. Chapter 6 presents the implemented prototype system for approximate information ﬁltering. This prototype extends the Minerva P2P search prototype allowing one-time searching. In Chapter 7, a DL use case for searching and ﬁltering in a distributed digital library environment is put forward. The proposed DL architecture combines two functionality using the same infrastructure. This thesis is concluded in Chapter 8 by summarizing the main results and pointing out some open questions and directions for future work in this area.
- 14 Chapter 2 Background and Technical Basics Chapter 2 Background and Technical Basics This chapter introduces some background and state-of-the-art work used to develop this doctoral thesis. First, a short overview of existing P2P architectures and protocols is described in Section 2.1 including the basic underlying infrastructure. Sections 2.2 and
2.3 present necessary background knowledge about information retrieval and information ﬁltering.
The main techniques in the area of time series analysis are explained in Section 2.4, providing the background for the publication prediction methods. Finally, Section 2.5 presents two data structures (or synopses) for distinct-value estimation on large data sets that are used to count the number of distinct data items within a multi-set.
2.1 Peer-to-Peer Systems In recent years, the peer-to-peer approach has been receiving increasing attention and has become a true hype paradigm for communication on the Internet and the World Wide Web. Although becoming popular mainly in the context of ﬁle sharing applications (e.g., Napster, Gnutella, or BitTorrent), the P2P paradigm can be applied to access any kind of distributed data. Currently, P2P is making its way into distributed data management systems and is oﬀering possibilities for new Web applications.
In contrast, the traditional client-server approach (shown in Figure 2.1) requires a huge amount of eﬀort and resources to meet the increasing challenges of the continuously growing resources shared over the Internet. This centralized client-server network model results in increased load for the server component, and creates scalability bottlenecks and single-points-of-failure, where a failure of one entity results in terminating the functionality of the whole system. As a consequence, such systems can easily be attacked, e.g., by denial-of-service attacks. In addition, dedicated servers are often diﬃcult and expensive to administrate and to relocate due to their strategic placement within the Internet infrastructure.
Contrary to the client-server model, P2P computing promises to oﬀer enormous potential beneﬁts to important issues including scalability, security, reliability, eﬃciency, ﬂexibility, and resilience to failures and dynamics. [SW05] discusses in detail the issues addressed by this fundamental paradigm shift.
An exact answer to the question What exactly is P2P? is not obvious. Especially the early P2P applications that have not even been true P2P in a strict sense, make it diﬃcult to give a precise deﬁnition. The Internet encyclopedia Wikipedia currently deﬁnes the
following aspects of a P2P network:
• A peer-to-peer (or "P2P", or, rarely, "PtP") computer network exploits diverse connectivity between participants in a network and the cumulative bandwidth of network
A more technical deﬁnition of P2P systems (shown in Figure 2.2) can be found in [Ora01, SW05]: A peer-to-peer system is a self-organizing system of equal, autonomous entities (peers) which aims for the shared usage of distributed resources in a networked environment avoiding central services. Both deﬁnitions share the idea of decentralization and point at decentralized resource usage and decentralized self-organization as potential beneﬁts.
2.1.1 The Lookup Problem How to ﬁnd a certain data item in a large P2P system in a scalable manner without any centralized service? This problem is at the heart of any P2P system [BKK+ 03] and is often referred to as the lookup problem. The main reason of this challenge is caused by decentralization such that the main system strength also yields to the main system challenge.
In contrast to centralized client-server systems, where data is provided by dedicated physical entities that are explicitly referenced, P2P systems store data in multiple, distant, transient, and unreliable locations within the network. Thus, the eﬃcient location of data stored in the network is one of the predominant challenges of a P2P system.
Figure 2.2: Peer-to-Peer Network Architecture.
2.1.2 P2P Architectures There are several ways to classify P2P networks. One approach considers the application a P2P network is used for (e.g., ﬁle sharing, telephony, media streaming etc.). Another approach includes the degree of centralization and distinguishes between pure P2P without central server (peers act as equals) and networks with central server keeping information on peers. In this context, the following terminology can be found: centralized, or decentralized P2P networks, structured, unstructured, or hybrid (so-called super-peer architectures) P2P networks.
The research literature commonly tries to classify the existing approaches to deal with the lookup problem. The following sections present the respective approaches in more detail.
Subsequent, two example protocols of P2P architectures are presented: Chord (2.1.3) and Pastry (2.1.4).
220.127.116.11 Centralized P2P Architectures
The ﬁrst occurrence of the P2P paradigm in a broader public cognition was probably Napster 1. This famous ﬁle-sharing system elegantly solved the lookup problem by utilizing an architecture in which a centralized entity provides a directory service to all participating peers (or users), eﬀectively forming a star network. All peers joining the system have to register their data (mostly music ﬁles in the early days) with this centralized server thus allowing a comfortable way for other peers in the system to locate any data in the network by presence of a physically centralized directory. The fact that only pointers to decentralized available peers are stored at the centralized server (instead of the actual data) conceptually decreases the load at the central entity; the fact that (after relevant data has been located by means of the directory) each peer could directly communicate with other peers that store the data in a decentralized manner, completely bypassing the centralized directory entity, drives the perception of Napster as a P2P system.
1 Napster’s brand and logo continue to be used by a pay service, having been acquired by Roxio.
Figure 2.3: Unstructured P2P Network with Message Flooding.
However, in a pure P2P system, it should be possible to remove any entity from the network without loss of functionality. Instead, a peer should conceptually fulﬁll both roles as server and client, such that the functionality of the system is spread equally over all the participating peers in the network. Thus, by this deﬁnition, Napster can not be denoted as a P2P system. The shutdown of the the centralized server of Napster by legal authorities allowed the easy shutdown of the complete system.
18.104.22.168 Unstructured P2P Architectures