# «Approximate Information Filtering in Structured Peer-to-Peer Networks Christian Zimmer Max-Planck Institute for Informatics Saarbrücken ...»

- 28 Chapter 2 Background and Technical Basics The PeerCQ system [GL03] is another important pub/sub approach using a DHT infrastructure. PeerCQ takes into account peer heterogeneity and extends consistent hashing [KLL+ 97] with simple load balancing techniques based on appropriate assignment of peer identiﬁers to network peers. The Meghdoot system [GSAA04] is a content-based pub/sub approach implemented on top of a CAN-like DHT, and supports an attribute-value data model and new ideas for the processing of subscriptions with range predicates (e.g., the price is between 50 and 100 Euros) and load balancing. Mercury [BAS04] also uses an attribute-value data model similar to Meghdoot and is utilized in the implementation of a pub/sub system for network games. PastryStrings [AT06] also supports a rich set of queries in the context of a pub/sub system. The system utilizes preﬁx-based routing to facilitate both numerical and string queries.

The pFilter system [TX03] uses a hierarchical extension of CAN DHT to ﬁlter unstructured documents. This approach relies on multi-cast trees to notify subscribers. VSM and LSI can be used to match documents to user queries. The multi-cast trees of pFilter take into account network distance. Supporting preﬁx and suﬃx queries in string attributes is the focus of the DHTStrings system [AT05]. This system utilizes a DHT-agnostic architecture to develop algorithms for eﬃcient multi-dimensional event processing. LibraRing [TIK05a] was the ﬁrst approach to provide protocols for the support of both IR and IF functionality in digital libraries (DLs) using DHTs. In LibraRing, super-peers are organized in a Chord DHT and both (continuous) queries and documents are indexed by hashing words contained in them. This hashing scheme depends heavily on the data model and query language adopted, and the protocols have to be modiﬁed when the data model changes. The DHT is used to make sure that queries meet the matching documents (in the IR scenario) or that published documents meet the indexed continuous queries (in the IF scenario).

In this way the retrieval eﬀectiveness of a centralized system is achieved (in contrary to the work in this thesis), while a number of routing optimizations (such as value proxying, content based-multicasting, etc.) are used to enhance scalability. The ﬁltering approach of LibraRing is presented in detail in [TIK05b, TKD04, TKD08].

2.4 Time Series Analysis In the literature (e.g., [Cha04],or [Ham94]), a time series is declined as a collection of observations made sequentially through time. Time series play an important role in a huge variety of ﬁelds ranging from economics to engineering. The diﬀerent methods to analyze and explain time series constitute an notable area of statistics and this area is called time series analysis.

Many time series arise in practice, for instance by recording values for a period of time.

Prominent examples of time series in economics and ﬁnance are export totals in successive years, or average incomes in successive months. [Cha04] mentions the classic Beveridge wheat price index series consisting of the average wheat price in nearly 50 places in many countries in successive years from 1500 to 1869. Besides economic and ﬁnancial time series, there are a lot of other important areas: physical time series 7, marketing time series, and demographic time series 8. Time series analysis also occurs in process control to ensure a certain quality level. Binary processes are a special type of time series where observations can take only one of two values (usually denoted by 0 and 1). Communication theory represents a situation of this type of time series.

7 Includingmeteorology, marine science and geophysics (e.g., rainfall on successive days).

8 Time series study the population development of countries considering among other things birth and death rates.

## - 29 Chapter 2 Background and Technical Basics

The literature also considers events occurring randomly through time9. A series of events of this type is called point process. There is a big diﬀerence between analyzing point processes and analyzing standard time series data as the interested quantities diﬀer (e.g., the distribution of the number of events occurring in a given time period).** 2.4.1 Terminology**

A time series is called continuous time series when the observations are made continuously through time even when the measured variable can only take a discrete set of values (e.g., in binary processes). A time series is said to be a discrete time series when observations are taken only at speciﬁc times, usually equally spaced (e.g., every day at the same time).

A discrete time series can even measure continuous variables. The analyzing methods presented in the following, consider only discrete time series. Recall that every continuous time series can be transformed into a discrete one by just recording the values at equal intervals of time. This type of time series is also called a sampled series. Discrete time series also arise by aggregation of values over equal intervals of time10.

Statistical theory is concerned with random samples of independent observations. In contrast, time series analysis considers successive observations that are not independent and that have an important time ordering. The fact that successive observations are dependent, future values may be predicted from past values. A deterministic time series assumes that a time series can be predicted exactly whereas a stochastic time series determines the future only partly by past observations such that exact predictions are not possible; stochastic time series use the idea the future values have a probability distribution which is conditioned by a knowledge of past values.

** 2.4.2 Objections of Time Series Analysis**

The literature enumerates several objectives in analyzing a time series. The description of time series mainly includes plotting the observations against time to get a so-called time plot. These time plots are used to obtain simple descriptive measures of the main series properties (e.g., regular seasonal eﬀects). Linear systems are used to convert an input series to an output series by a linear operation. This technique is important in the explanation of time series especially when observations are taken on two or more variables. Then, the variation in one time series can be used to explain the variation in another series. Control is another objective of time series analysis and mainly consider physical or economic systems (e.g., measure the quality of a manufacturing process).

The most important objective for this thesis is prediction. Given an observed time series, prediction is used to get the future values of the series. Predictions appear in sales forecasting, or in the analysis of economic and industrial time series. In general, the terms prediction and forecasting are used interchangeably. In the time series literature, there is also the diﬀerentiation between prediction to describe subjective methods and forecasting to describe objective methods. This thesis uses the term prediction in the following chapters to specify future observations.

2.4.3 Types of Variation The literature decomposes the variation in a time series into trend, seasonal variation (or seasonality, or periodicity), and other cyclic changes. All other remaining types of variation are called irregular ﬂuctuations.

• Trend: This variation is loosely deﬁned as long-term change in the mean level. A main issue with this deﬁnition is the meaning of long-term. A time series of increasing or decreasing observations shows a trend.

• Seasonality or Periodicity: Time series with periodic variations show seasonal changes. Typical examples are sales ﬁgures and temperature reading where time series exhibit variation that is annual in period.

• Cyclic Changes: Time series with variations that do not have a ﬁxed period but which are predictable to some extent belong to this category of cyclic changes. For example, business cycles vary from 3 or 4 years to more than 10 years.

• Irregular Fluctuations: Other types of variations that do not show trends, seasonal or other cyclic changes form this group of irregular ﬂuctuations.

Trend and seasonality (or periodicity) will be in the main focus of this thesis such that prediction techniques have to consider these types of variation in the observations.

2.4.4 Prediction Techniques Prediction or forecasting 11 of future values of an observed time series occurs in many areas including economics or stock control. There is a wide variety of prediction techniques available that can be grouped into three categories: subjective, univariate, and multivariate methods. This thesis only considers univariate methods12 where predictions are based on present and past observations and ignores other factors (e.g., external economic inﬂuences).

Assume, there is an observed time series x1, x2,..., xn of n values. The basic issues is to estimate future values xn+h where the integer h is called the lead time or horizon. The predicted value of xn+h made at time n for h time steps ahead is denoted as x(n, h) or ˆ xn (h). Both notations emphasize the time a prediction is made and the prediction horizon.

ˆ In the following, two groups of prediction techniques will be presented in detail: moving average and exponential smoothing techniques.

2.4.4.1 Moving Average Techniques

Obviously, all past observed values have the same weight. An alternative way to summarize the past observations is to compute the mean of successive smaller sets of observed values. Single moving average is the simplest technique that uses the mean of the k most

**recent observations to predict xn (1):**

ˆ 11 Forecasting is the art of saying what will happen, and then explaining why it didn’t – Anonymous 12 Methods of this type are also called naive or projection methods

The smoothing process is continued by advancing one period and calculating the next average of k observations, dropping the ﬁrst. Two general objections about moving average techniques in general are that they cannot cope well with trends observed in the time series and assign equal weights to past observations. There exists a variation on the moving average procedure that performs better in terms of handling trend. It is called double moving average for a linear trend process. It calculates a second moving average from the original moving average, using the same value for k.

Thus, three updating equations with three parameters η, γ and δ are needed. As before, the smoothing parameters are usually chosen in the range (0, 1). Level, trend, and seasonal

**index are computed as follows where a complete season’s data consists of l periods:**

2.5 Distinct-Value Estimation The task of determining the number of distinct values (distinct-value estimation or DV estimation) in a large dataset is an important question in a variety of ﬁelds in computer science, including data integration, query optimization, or network monitoring.

The exact number of distinct values can be computed by sorting the dataset and then executing a straightforward scan-and-count pass over the data; alternatively, a hash table can be constructed and used to compute the number of distinct values. Both naive approaches do not scale. Most research eﬀorts have therefore focused on approximate methods that scale to very large datasets. These approaches either draw a random sample of the data items and use the observed frequencies of the values in the sample as a basis for estimation (e.g., [CCMN00]) or take a single pass through the data and use hashing techniques to compute an estimate using a bounded amount of memory (e.g., [Gib01]).

Here, two recent methods (hash sketches and KMV synopses) are introduced in Sections 2.5.1 and 2.5.2. Both methods support an eﬃcient DV estimation and allow multi-set operations such as union, or intersection.

2.5.1 Hash Sketches Hash sketches denote a statistical tool for probabilistically estimating the cardinality of a multiset S. This distinct-value estimation techniques was proposed by Flajolet and Martin in [FM85]. Hash sketches rely on the existence of a pseudo-uniform hash function h() : S → [0, 1,..., 2L ), which spreads input values pseudo-uniformly over its output values.

In [DF03], Durand and Flajolet further improved hash sketches (super-LogLog counting) by reducing the space complexity for maintaining Hash Sketches and relaxing the requirements on the statistical properties of the hash function.

**2.5.1.1 Creation and DV Estimation**

In their essence, the synopsis13 works as follows. The function ρ(y) : [0, 2L ) → [0, L) is used to designate the position of the least signiﬁcant 1-bit in the binary representation of

**y as follows:**

13 Synopsisrefers to a summary or abstract. In this thesis, data structures to represent sets and multisets in a compact manner (e.g., hash sketches) are called synopses.

In Equation 2.19, ρ(0) = L, and bit(y, k) denotes the k-th bit in the binary representation of y (bit-position 0 corresponds to the least signiﬁcant bit). Estimating n, the number of distinct elements in a multiset S, proceeds as follows. For all d ∈ S, apply ρ(h(d)) and record the least-signiﬁcant 1-bits in a bitmap vector B[0... L]. Since h() distributes values uniformly over [0, 2L ), it follows that

A key property of hash sketches with great implications to the eﬃciency of large-scale network applications (including distributed IR) lies in the ability to combine them.

Union Operation The hash sketch of the union of an arbitrary number of multisets is derived from the hash sketches of each multiset by taking their bit-wise OR. Thus, given the compact synopses of a set of multisets, one can instantly estimate the number of distinct items in the union of these multisets.