Approximate Information Filtering in Structured Peer-to-Peer Networks. Christian Zimmer, Max-Planck Institute for Informatics, Saarbrücken.
• Address-space load balancing tries to partition the address space of the DHT evenly among peers. This problem can be considered solved by relying on consistent hashing (the main property of DHT overlay networks) and constructions such as virtual peers. With virtual peers, each real peer pretends to be several distinct peers, each participating independently in the DHT protocol. The load of a real peer is then the sum of the loads of all its virtual peers. The Chord DHT [SMK+01] is based on consistent hashing and requires O(log n) virtual peers to be operated by every real peer [RLS+03]. Drawbacks of virtual peers are increased storage requirements and higher network bandwidth demand. In [KR06], the authors address these drawbacks by activating only one of the virtual peers at any given time. The work presented in [BKadH05] considers the task of keeping a system balanced in the presence of peers leaving and joining the network. The goal is to keep the lengths of the intervals assigned to peers within a constant factor of each other. The proposed scheme works in a constant number of rounds and achieves optimal balance with high probability.
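The interplay of consistent hashing and virtual peers can be illustrated with a minimal sketch. The class below is not taken from any of the cited systems; the names and the choice of SHA-1 (as in Chord) are illustrative. Each real peer appears at several points on the identifier ring, and a key is assigned to the first virtual peer clockwise from its hash:

```python
import hashlib
from bisect import bisect_right

def ring_hash(s: str) -> int:
    # Map a string to a point on a 160-bit identifier ring (as in Chord).
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Toy consistent-hash ring with virtual peers (illustrative sketch)."""

    def __init__(self, peers, virtual=8):
        # Each real peer is represented by `virtual` points on the ring.
        self.ring = sorted((ring_hash(f"{p}#{i}"), p)
                           for p in peers for i in range(virtual))
        self.points = [pt for pt, _ in self.ring]

    def lookup(self, key: str) -> str:
        # A key is owned by the first virtual peer clockwise from its hash.
        idx = bisect_right(self.points, ring_hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["peerA", "peerB", "peerC"], virtual=32)
owners = {ring.lookup(f"doc{i}") for i in range(1000)}
```

Raising `virtual` smooths out the interval lengths (at the cost of the storage and bandwidth overheads mentioned above), since each real peer's share of the ring becomes an average over many random intervals.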
• Item load balancing aims to directly balance the distribution of items among the peers in the network. Here, an arbitrary load distribution of items is assumed, i.e., some keys are more popular or appear more frequently than others. [ADH05] and [KR06] are prominent works addressing this issue. A simple and cost-effective approach to item load balancing is presented in [BCM03], applying the power of two choices paradigm: an item is stored at the least loaded of two or more random alternatives.
[RPW04] presents a simple approach for balancing stored data items between peers in a fashion analogous to the dissipation of heat energy in materials. The algorithm improves the distribution of load in DHTs without requiring major changes to the DHTs themselves and, in addition, increases the fault tolerance of P2P systems. Another approach, described in [SVF08], considers differences in peer capacities and also investigates load balance in heavily loaded P2P systems.
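The power of two choices paradigm from [BCM03] can be sketched in a few lines. The simulation below is illustrative only (peer ids are list indices, the seed and sizes are arbitrary): each item probes d random peers and is placed on the least loaded one, which concentrates the maximum load much closer to the mean than a single random choice does.

```python
import random

def place_item(load, d=2, rng=random):
    """Place one item on the least loaded of d randomly sampled peers
    (the power-of-d-choices paradigm; peers are list indices here)."""
    candidates = rng.sample(range(len(load)), d)
    target = min(candidates, key=lambda p: load[p])
    load[target] += 1
    return target

rng = random.Random(7)
n_peers, n_items = 50, 5000
two_choice = [0] * n_peers   # loads under power of two choices
one_choice = [0] * n_peers   # loads under a single random choice
for _ in range(n_items):
    place_item(two_choice, d=2, rng=rng)
    one_choice[rng.randrange(n_peers)] += 1
# two_choice stays within a few items of the mean load (100 here),
# while one_choice exhibits noticeably larger deviations.
```

The classical result is that with d = 2 the most loaded peer exceeds the average by only O(log log n) items, versus a much larger deviation for one random choice.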
Chapter 8 Conclusion and Open Questions

Regarding the MAPS approach presented in this thesis, load balancing concerns both address-space and item load balancing at the level of the distributed metadata directory. Beyond that, the distribution of active subscriptions is also important, since highly monitored publishers may become a bottleneck. Overall, the following load balancing questions arise and have to be addressed in future work:
• Load balancing is also important with regard to active subscriptions. Publisher peers that store many subscriptions for requested continuous queries have to send a high number of notifications for newly published documents that match these subscriptions. Thus, load balancing also depends on the number of publications, since each publisher peer can adjust its publication rate to its capacities: decreasing the publication rate also decreases the load for sending notifications, because publishers with a low publication rate are monitored less.
Load balancing in P2P information management systems, including the presented MAPS approach, is a very important topic for future work. Although existing balancing techniques can be applied, some additional issues remain that require more sophisticated balancing strategies.
8.3.2 Dynamics

Another open problem for future work involves the effect of dynamics in P2P networks.
Within a P2P network, the churn rate² refers to the number of peers leaving the system during a given time period (e.g., an hour, or a year). Churn is a significant problem for large-scale systems, as they must maintain consistent information about peers in the system in order to operate effectively. For example, with a large churn rate it may be impossible to index file locations, making some files inaccessible even though the peers that store these files are online. In addition to peers leaving the network, dynamics in such a setting also includes peers joining or even changing content in an abrupt way.
[KEAAH05] presents an analytical study of churn based on a master-equation approach, using the Chord DHT. For any churn and stabilization rate, and any system size, the authors predict the fraction of failed or incorrect successor and finger pointers (see Section 2.1.3). Earlier work [RGRK04, CCR04, LNBK02] considered theoretical studies of asymptotic performance bounds of DHTs under churn, or simulation-based studies.
Discussing the influence of dynamics on the MAPS architecture has to consider several aspects, including directory maintenance and the behavior of subscribers and publishers. Since publishing documents is a main part of the information filtering setting, data dynamics is not a concern in MAPS. Thus, dynamics in MAPS only covers leaving and joining peers:
• The impact of publishers that leave or join the MAPS network is small. Whenever a publisher joins the network, it has to update the metadata stored in the distributed directory. If it is not the first time this publisher joins (e.g., the digital library had some downtime), its publishing behavior is recognizable, and subscriptions from previous participation remain valid for future publications. Publishers leaving the MAPS network cannot publish documents anymore until their next join, and a subscriber is not able to send a query to a publisher that has left the network. Therefore, the number of monitored publishers needs to be increased (depending on the expected or estimated number of inactive publishers) to ensure a predefined number of monitored publishers.
• Similarly to the previous discussion, the impact of leaving or joining subscribers is minor. Whenever a subscriber leaves the network, it cannot request new continuous queries, although its active subscriptions stay valid and trigger notifications for matching publications. Of course, the subscriber will not receive these notifications, so protocols that support storing notifications in the directory seem to be a necessary system component (see Section 7.3). A subscriber rejoining the MAPS network then receives the stored notifications for publications that matched during its downtime.
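The store-and-forward behavior for offline subscribers can be sketched as follows. This is a minimal illustration of the idea, not the protocol from Section 7.3; all names are made up for the example:

```python
class NotificationStore:
    """Sketch of a directory-side buffer that holds notifications for
    offline subscribers until they rejoin (illustrative only)."""

    def __init__(self):
        self.online = set()
        self.pending = {}   # subscriber id -> buffered notifications

    def join(self, sub):
        self.online.add(sub)
        # Deliver everything that arrived while the subscriber was away.
        return self.pending.pop(sub, [])

    def leave(self, sub):
        self.online.discard(sub)

    def notify(self, sub, message):
        if sub in self.online:
            return message          # delivered immediately
        self.pending.setdefault(sub, []).append(message)
        return None                 # buffered for later delivery

store = NotificationStore()
store.notify("s1", "doc-17 matches q1")   # s1 is offline: buffered
delivered = store.join("s1")              # rejoining flushes the buffer
```

In a real deployment, the buffer itself would live on directory peers and would therefore need the replication discussed in the next paragraphs to survive directory churn.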
• The effect of peer churn on directory maintenance is a challenging problem that might affect both the effectiveness and the efficiency of the system. If a new directory peer joins the MAPS system, it becomes responsible for a certain subset of the key space. Thus, there are two possibilities to handle joins: (i) the directory peer waits for updated metadata from publishers, but until then, the directory cannot provide metadata for certain keys; (ii) the joining protocol includes the dissemination of existing metadata by exploiting the consistent hashing property. In this context, a critical issue is the departure of directory peers from the network without prior notification or without following the network disconnection protocol. In that case, the metadata for the keys that the directory peer was responsible for is no longer available, and publisher peers have to update their statistics to ensure proper directory maintenance.
² The phrase is based on the English idiom to churn up, meaning to agitate or produce violent motion.
A possible solution to the problem of churn in directory maintenance is replication: more than one directory peer may store the metadata for a key to ensure the availability of directory statistics. This issue is discussed in the next section.
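A common way to realize this in a DHT is to store each key's metadata not only on its owner but also on the next k−1 peers along the ring, so a lookup can fall back to a replica when the owner departs without notice. The sketch below is a simplified illustration of that idea, not the MAPS protocol; peer ids and the stored metadata are invented for the example:

```python
from bisect import bisect_left

class ReplicatedDirectory:
    """Toy directory storing each key's metadata on its owner plus k-1
    successor peers (simplified successor-list replication, illustrative)."""

    def __init__(self, peer_ids, k=3):
        self.peers = sorted(peer_ids)        # positions on the identifier ring
        self.alive = set(self.peers)
        self.store = {p: {} for p in self.peers}
        self.k = k

    def _successors(self, key_id):
        # The k peers clockwise from the key's position hold its replicas.
        start = bisect_left(self.peers, key_id) % len(self.peers)
        return [self.peers[(start + i) % len(self.peers)] for i in range(self.k)]

    def put(self, key_id, value):
        for p in self._successors(key_id):
            self.store[p][key_id] = value

    def get(self, key_id):
        for p in self._successors(key_id):
            if p in self.alive:
                return self.store[p].get(key_id)
        return None   # all k replica holders are gone

    def fail(self, peer):
        self.alive.discard(peer)

d = ReplicatedDirectory([10, 20, 30, 40, 50], k=3)
d.put(15, {"df": 42})   # stored on peers 20, 30, and 40
d.fail(20)              # the owner leaves without notice
value = d.get(15)       # the replica on peer 30 still answers
```

Metadata is lost only if all k replica holders fail before the publisher refreshes its statistics, which makes the choice of k a trade-off between storage overhead and tolerance to churn.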
8.3.3 Replication & Caching

An interesting open question for future work involves dealing with replication and caching.
Replication is the process of sharing information so as to ensure consistency between redundant resources, such as software or hardware components, in order to improve reliability, fault tolerance, or accessibility. It can take the form of data replication, where the same data is stored on multiple storage devices, or computation replication, where the same computing task is executed many times. Typically, access to a replicated entity is uniform with access to a single, non-replicated entity: the replication itself should be transparent to an external user, and in a failure scenario, the failover to a replica is hidden as much as possible.
In computer science, a cache is a collection of data duplicating original values stored elsewhere or computed earlier, where the original data is expensive to fetch (owing to longer access time) or to compute compared to the cost of reading the cache. In other words, a cache is a temporary storage area where frequently accessed data can be kept for rapid access. Once data is stored in the cache, future requests can be served from the cached copy rather than by re-fetching or recomputing the original data, so that the average access time is shorter. Caches thereby expedite access to data that would otherwise have to be fetched from a slower medium, such as main memory in the case of a CPU cache.
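The basic read-through caching pattern described above can be captured in a few lines. This is a generic sketch, not tied to any of the systems cited below; the expensive origin lookup is simulated with a trivial function:

```python
class Cache:
    """Minimal read-through cache: expensive fetches are performed once
    and reused afterwards (a generic sketch)."""

    def __init__(self, fetch):
        self.fetch = fetch          # the expensive origin lookup
        self.data = {}
        self.hits = self.misses = 0

    def get(self, key):
        if key in self.data:
            self.hits += 1
        else:
            self.misses += 1
            self.data[key] = self.fetch(key)   # fetch once, keep the copy
        return self.data[key]

calls = []
# The lambda records every origin access, then "computes" the result.
cache = Cache(lambda k: calls.append(k) or k.upper())
cache.get("query"); cache.get("query"); cache.get("query")
# The origin is contacted only once; later requests hit the cache.
```

In the P2P setting the interesting questions are exactly the ones the following paragraphs discuss: where such a cache lives, which entity is cached, and when a cached entry may be reused for a non-identical query.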
Both caching and replication are popular topics in distributed information management. [RL05] compares two popular redundancy schemes: replication and erasure encoding. Replication is also considered in both DHT protocols used in MAPS (Chord [RFH+01] and Pastry [RD01b]), whereas erasure encoding is investigated in [DLS+04, KBC+00]. In general, replication denotes the simple redundancy scheme where k identical copies of each data object are kept in the system. In the erasure encoding redundancy scheme, each data object is divided into m fragments and recoded into n fragments that are stored separately, with n > m. The key property of erasure encoding is that the original data object can be reconstructed from any m fragments. In [ZBW08], a cache-aware, multi-round query routing strategy is developed that balances query efficiency and result quality in distributed P2P information retrieval (e.g., for the Minerva architecture). In addition to this Exact Caching (EC) strategy, the aggressive reuse of cached results of even subsets of a query is proposed as the Approximate Caching (AC) technique, which can drastically reduce bandwidth overheads. Older work proposes a Uniform Index Caching (UIC) approach, where query results are cached at all peers along the inverse query path so that the same query issued by other peers can be answered from nearby cached results. The DiCAS (Distributed Caching and Adaptive Search) approach [WXLZ04] extends this by distributing the cached results among neighboring peers and forwarding queries to peers with a high probability of providing the desired cached results.
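The trade-off between the two redundancy schemes can be quantified with a standard availability model (my illustration, not from [RL05]): assume each replica or fragment is independently available with probability p; replication with k copies survives if at least one copy is reachable, while an (n, m) erasure code survives if at least m of the n fragments are reachable.

```python
from math import comb

def replication_availability(p, k):
    # Data survives if at least one of k full copies is available.
    return 1 - (1 - p) ** k

def erasure_availability(p, n, m):
    # Data survives if at least m of the n stored fragments are available.
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i)
               for i in range(m, n + 1))

p = 0.9
rep = replication_availability(p, k=3)    # 3x storage overhead
era = erasure_availability(p, n=6, m=3)   # only 2x storage overhead
# Both schemes reach ~99.9% availability, but erasure encoding does so
# with a smaller storage footprint.
```

This is the essence of the comparison in [RL05]: for the same availability target, erasure encoding typically needs less redundant storage than plain replication, at the price of encoding and reconstruction work.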
Both the UIC and DiCAS protocols cache the final results of a query and reuse them only if a query exactly matches the cached results, i.e., only if the results were cached for an identical query issued earlier. It should be noted that UIC and DiCAS were developed in the context of unstructured P2P networks and are not directly applicable to DHT-based structured networks. In [BCG+03], the authors present a way to efficiently maintain a distributed version of inverted list intersections that are computed frequently to answer queries. Here, the cacheable entity is a materialized intersection of inverted lists, which reduces the bandwidth consumption caused by frequently shipping inverted lists. Similar ideas have also been proposed recently in [SA06].
In the context of P2P information filtering, it is not obvious what to cache or replicate, or how. In the setting of this thesis, the greatest benefits of replication and caching come from applying them to the distributed directory of statistics, to ensure directory maintenance under high dynamics. In a combined architecture such as MinervaDL, caching strategies like those mentioned above should be applied. Other caching and replication strategies need more work in the future.
8.3.4 Economics

Finally, an interesting direction for future research covers the economic aspects of the presented approximate information filtering approach. In contrast to existing exact filtering approaches, where publishers have to disseminate their data items or documents so that the data is available in the system, publishers in MAPS are not required to expose their actual content to the rest of the network; instead, they disseminate metadata about it. This paradigm makes it possible to introduce a payment model for approximate filtering, which opens up a new research direction and an interesting business model.
The design of an incentive-based payment model has to consider several cost aspects and restrictions; e.g., the MAPS system has to prevent subscribers from increasing the number of monitored publishers above a certain threshold. This can be achieved by introducing a cost for submitting a subscription to a publisher. In the following, the properties of a possible payment model are driven by two assumptions: (i) subscribers are interested in harvesting information while minimizing their costs, and (ii) publishers are interested in selling information in order to maximize their profit.
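Under these two assumptions, publisher selection becomes a budgeted optimization problem for the subscriber. The sketch below is purely hypothetical (benefit scores, prices, and the greedy benefit-per-cost heuristic are my own illustration, not a model proposed in the thesis): the subscriber ranks candidate publishers by predicted benefit per unit of subscription cost and monitors as many as its budget allows.

```python
def select_publishers(candidates, budget):
    """Greedy sketch: choose publishers by benefit-per-cost until the
    subscription budget is exhausted. `candidates` maps a publisher id
    to (predicted_benefit, subscription_price); all values hypothetical."""
    ranked = sorted(candidates.items(),
                    key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
    chosen, spent = [], 0.0
    for pub, (benefit, price) in ranked:
        if spent + price <= budget:   # skip publishers we cannot afford
            chosen.append(pub)
            spent += price
    return chosen, spent

candidates = {"dl1": (0.9, 5.0), "dl2": (0.6, 2.0), "dl3": (0.3, 4.0)}
chosen, spent = select_publishers(candidates, budget=7.0)
```

The predicted benefit would come from the publisher-selection scores already computed in MAPS, so an economic model mainly adds the price dimension to the existing ranking; note that the greedy heuristic is only an approximation of the underlying knapsack-style problem.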
An interesting dimension introduced by adding an economic perspective is the limitation on the number of publishers a subscriber can monitor. This limitation makes publisher selection a crucial issue, since subscribers may be able to afford monitoring only a few sources. In the following, interesting issues introduced by an economic model are elaborated:
• Directory peers have to maintain metadata about publishers and provide this information to subscribers that submit a continuous query. One possible payment model could assume that publishers pay for storing metadata and subscribers pay for receiving metadata. Directory peers thus have an incentive to store as much metadata as possible, restricted only by their storage capacity and computational power. This arrangement ensures that peers get paid for taking part in directory maintenance even if they have no interest of their own in the stored metadata.