«Approximate Information Filtering in Structured Peer-to-Peer Networks Christian Zimmer Max-Planck Institute for Informatics Saarbrücken ...»
In contrast to centralized approaches, non-centralized or decentralized P2P architectures do not rely on any centralized or coordinating entity to locate data items within the network. More speciﬁcally, in unstructured P2P approaches which are a specialization of the decentralized architectures, peers recursively forward received requests to neighboring peers (also referred to as neighbors), in an attempt to ﬁnd all relevant items in the network. In order to avoid missing peers in the network, each peer broadcasts messages to all known other peers, regardless of whether these neighbors store relevant data or not. This message forwarding approach leads to a breadth-ﬁrst search strategy and is also known as message ﬂooding. To prevent inﬁnite loops and to control the number of messages generated by one single request, each message gets assigned a time-to-live (TTL) value. Each peer forwarding such a message decreases this value by one, and only messages with positive TTL values get forwarded.
Figure 2.3 illustrates message ﬂooding in an unstructured P2P network architecture.
The peer on the left issues a query and forwards this request message to its three known neighbors, as indicated by the arrows. As shown, the initial TTL value of the query message is three. After decreasing the TTL value, these peers forward the message to their own neighbors. If the TTL value is not positive after decreasing it, the peer will not forward the message further, thus terminating the forwarding procedure. Note that some messages unnecessarily address peers that have already received the query before (e.g., from diﬀerent peers), but peers do not forward the same message to its neighbors more than once.
- 18 Chapter 2 Background and Technical Basics There are several studies of real-world networks showing that the peers of such a network form graphs with small diameters, typically in a range from ﬁve to seven, supporting the so-called small-world-phenomenon [Kle00], a property that has also been observed in social networks 2. One important property of such networks is that messages with a relatively small TTL value can locate any requested data with high probability. But, this technique still represents a lossy protocol as it has no guarantees to successfully locate the data of interest, e.g., due to higher graph diameters or disconnected graph partitions. Additionally, the number of messages created by a single request is signiﬁcant; it depends on the degree of the peers in the network (i.e., the number of neighbors of a peer) and the chosen TTL value.
The main advantage of unstructured P2P networks is the fact that there is no need to proactively maintain a network structure. Peers maintain only pointers to an upperbounded number of direct neighbors. Also note that there is no enforcement of a speciﬁc storage location for data items, as they can be located anywhere in the network. In other words, the data stored at a peer is unrelated to the peer’s position in the network. Freenet [CMH+ 02] and early versions of the Gnutella protocol3 are popular representatives of this paradigm. In unreliable networks (e.g., mobile ad-hoc networks), ﬂooding is also a fundamental message dissemination strategy. While brute-force ﬂooding algorithms cause a high number of unnecessary messages, network contention, packet collisions, and wasting energy, several probabilistic and epidemic approaches have been studied to optimize plain ﬂooding.
18.104.22.168 Structured P2P Architectures
Centralized P2P architectures suﬀer from scalability bottlenecks which prevent an a-priori unlimited number of participating peers. In a centralized architecture, the linear storage complexity of the central directory entity is prohibitive. In an unstructured architecture, the communication overhead caused by message ﬂooding is a signiﬁcant deﬁcit. Thus, an eﬃcient and scalable approach requires a sub-linear increase in the storage and search complexity as more and more peers join the network.
Structured P2P architectures utilize particular overlay structures to map peers and data items into a common address space, enabling a unique mapping from data items to peers given the current state of the network. Therefore, each peer manages a small number (typically O(log N ), where N is the number of peers in the whole network) of pointers to carefully selected other peers. If these structured overlays are used for routing, then reaching the peer currently responsible for a given data item requires O(log N ) messages.
To guarantee balanced storage and retrieval loads among the peers, the responsibilities for data items have to be distributed as uniformly as possible.
To realize this routing functionality, a distributed data structure based on a hash table is used to allow the insertion and retrieval of key/value pairs. Thus, to insert or retrieve a key/value pair, the peer responsible for a key in the network as deﬁned by the structured P2P network is utilized. This peer stores and maintains all appropriate key/value pairs for a key from across the directory. Note that, in contrast to unstructured P2P architectures, the data placement is no longer arbitrary; the placement is accurately determined by the underlying overlay architecture, and this provides a guarantee to ﬁnd data items indexed by the network.
2A social network is a social structure made of nodes (which are generally individuals or organizations) that are tied by one or more speciﬁc types of interdependency.
3 Gnutella is a ﬁle sharing network. In December 2005, it was the third-most-popular ﬁle sharing network
The expression distributed hash table stands for such a functionality in a P2P network and DHT is commonly used as a synonym for structured P2P architectures in general.
But, there is also the strict distinction between structured P2P routing primitives on one side and the DHT interfaces of inserting and retrieving data as upper layer of functionality on the other side. Throughout this thesis, DHT is used to refer to the class of structured P2P architectures.
There are several proposals for structured overlay topologies including geometries as hypercubes (e.g., CAN [RFH+ 01]), rings (Chord [SMK+ 01], or Pastry [RD01a]), tree-like structures (P-Grid [Abe01], or PRR [PRR97]), and butterﬂy networks (Viceroy [MNR02]).
Special cases are random graphs [MS05]. [GGG+ 03] presents a study concerning the general resilience and proximity properties of these diﬀerent geometries.
22.214.171.124 Super-Peer Architectures
Super-peer architectures exploit the fact that the performance characteristics of the peers (processing power, bandwidth, availability, etc.) is not evenly distributed over all peers in the network. Thus, the beneﬁts of a perfect decentralization are decreasing.
In super-peer architecture, a small subset of peers takes over speciﬁc responsibilities in the network, e.g., aggregation or routing tasks. Thus, the super-peers can be viewed as the distributed extensions of the centralized entity in the Napster architecture. Conceptually, only the super-peers build-up the P2P network; all other peers connect to this backbone by communicating to one super-peer, which acts in the spirit of database mediators aggregating the content of downstream peers.
Routing in super-peer architectures is conducted in a two-phase mode. A request is routed within the super-peer backbone at ﬁrst, and is then distributed to the peers connected via the super-peers. While dedicating speciﬁc peers potentially limits the self-organizing capabilities of a P2P network, super-peer architectures have been proven a way to alleviate the performance issues of pure unstructured topologies. [YGM03] examines super-peer networks in detail to gain an understanding of their fundamental characteristics and performance tradeoﬀs. In Chapter 7, this thesis uses a super-peer architecture to build-up a distributed digital library environment.
2.1.3 Example: The Chord Protocol The Chord protocol [SMK+ 01] is deﬁnitively one of the most prominent DHT implementations and one of the most important representatives of structured P2P architectures. The main advantage of Chord when compared to other DHT approaches is its elegance and simplicity of concepts and implementation, and its still growing popularity.
In Chord, all data items and peers are mapped to a unique one-dimensional identiﬁer space such that identiﬁers are l-bit numbers, i.e., integer values in the range [0, 2l − 1], forming a cyclic identiﬁer space modulo 2l. In the following, the identiﬁer of a data item is referred to as a key, while the identiﬁer of a peer as an id. The responsibility of maintaining the data items (key, value) associated with key lies at the nearest peer on the identiﬁer circle (ID circle or Chord ring) whose id is greater or equal to key. This peer p is called the successor of key. Thus, a peer p is responsible for all key identiﬁers between its predecessor ’s identiﬁer (exclusive) and its own identiﬁer (inclusive), and each key/value pair is located and managed on a single, well-deﬁned peer.
Figure 2.4 illustrates an identiﬁer circle with l = 6, i.
e., identiﬁers in the range [0, 63].
For example, key K54 is maintained by peer P56 as its next successor on the ID circle, whereas both keys K24 and K30 are maintained by peer P32.
Figure 2.4: Example Chord Ring.
The solution to eﬃcient lookup and modiﬁcation operations on the stored data is to eﬃciently solve the lookup problem introduced in 2.1.1, i.e., to quickly locate the peer responsible for a particular key. As a naive approach, every peer could store a direct pointer to its successor peer on the identiﬁer circle. When a key is being requested, each peer forwards the request to its successor, until one peer determines that the key lies between its own id and the id of its successor. Thus, the key must be hosted by its successor. As a consequence, contact information about the successor is communicated as the result of the query back to the originator. When maintaining only a minimum amount of state at each peer (O(1)), this form of key location leads to an expected number of messages linear in the number of peers in the network, which is not considered scalable and acceptable.
Figure 2.4 also shows this naive approach where peer P8 issues a lookup request for key K54.
The request is forwarded along the ID circle linearly until the responsible peer P56 can be identiﬁed.
To improve scalability and eﬃcient lookups, Chord keeps additional state at each peer.
A peer maintains a routing table (also referred to as ﬁnger table), pointing to other peers on the identiﬁer circle. The m-th entry in the ﬁnger table of peer Pi contains a pointer to the ﬁrst peer Pj that succeeds Pi by at least 2m−1 on the identiﬁer circle, leading to a ﬁnger table with at most l distinct entries (independent of the actual number of keys or peers).
There are two important characteristics of this schema. First, each peer only maintains state about a logarithmic number of other peers, and knows more about peers closely following it on the identiﬁer circle than about peers farther away. Second, a peer’s ﬁnger table does not necessarily contain enough information to directly determine the peer responsible for an arbitrary key. However, since each peer has ﬁnger entries at power of two intervals around the identiﬁer circle, each peer can forward a query at least halfway along the remaining distance between itself and the requested peer. This property is illustrated in Figure 2.5 for peers P8, P42, and P51. It follows that the number of peers to be contacted (and, thus, the number of messages to be sent) to ﬁnd a target peer in an N -peer system is O(log N ).
To handle churn 4 in a network, Chord implements a stabilization protocol that each peer runs periodically in the background and which updates the ﬁnger tables and successor pointers in order to ensure that lookups execute correctly. The stabilization requires an additional predecessor pointer, as each peer requests its successor, Succ(Pi ), to return its predecessor P red(Succ(Pi )). If Pi equals P red(Succ(Pi )), Pi and Succ(Pi ) agree on being each other’s respective predecessor and successor. In contrast, the fact that P red(Succ(Pi )) lies between Pi and Succ(Pi ) indicates that P red(Succ(Pi )) recently joined the identiﬁer circle as Pi ’s successor. Thus, Pi updates its successor pointer to P red(Succ(Pi )) and notiﬁes P red(Succ(Pi )) of being its predecessor. At this stage, all successor pointers are up to date and requests can be routed correctly. As the impact of outdated ﬁnger table entries on lookup performance is small, Chord updates ﬁnger tables only lazily by periodically picking a ﬁnger table entry j randomly and performing a lookup to ﬁnd the true peer that currently succeeds peer Pi by 2j−1.
Chord addresses peer failures by checking all communication with remote peers for timeouts. To further ensure routing stability in the presence of multiple simultaneous peer failures, each peer maintains not only a pointer to its immediate successor, but a list of the ﬁrst r successors. When a peer detects a failure of its successor, it reverts to the next peer in its successor list. The successor list is also maintained by the stabilization protocol.
[LNBK02] gives a theoretical analysis of Chord’s stability in the face of concurrent joins and multiple simultaneous peer failures. The failure of a peer not only puts the routing stability at risk, but also makes the data managed by this peer unavailable. Such data loss can be prevented by replicating the data items to other peers. In Chord, the successor of a failed peer becomes responsible for the keys and data of the failed peer. Thus, an obvious replication strategy to prevent data loss is to replicate data to immediate successors, using the successor list.
0 _0_2212102 1 _2_2301203 _3_1203203 1 0 1_1_301233 1_2_230203 1_3_0201022 2 10_0_31203 10_1_32102 2 10_3_23302 3 102_0_0230 102_1_1302 102_2_2302 3 4 1023_0_322 1023_1_000 1023_2_121 3 5 10233_0_01 1 10233_2_32 6 0 102331_2_0 Figure 2.6: Pastry Routing Table.