
Machine Learning, 25, 117–149 (1996)

© 1996 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

# The Power of Amnesia: Learning Probabilistic Automata with Variable Memory Length

## DANA RON

Laboratory for Computer Science, MIT, Cambridge, MA 02139
danar@theory.lcs.mit.edu

## YORAM SINGER

AT&T Labs, 600 Mountain Avenue, Murray Hill, NJ 07974
singer@research.att.com

## NAFTALI TISHBY

Institute of Computer Science, Hebrew University, Jerusalem 91904, Israel
tishby@cs.huji.ac.il

Editor: Thomas Hancock

**Abstract.** We propose and analyze a distribution learning algorithm for variable memory length Markov processes. These processes can be described by a subclass of probabilistic finite automata which we name Probabilistic Suffix Automata (PSA). Though hardness results are known for learning distributions generated by general probabilistic automata, we prove that the algorithm we present can efficiently learn distributions generated by PSAs. In particular, we show that for any target PSA, the KL-divergence between the distribution generated by the target and the distribution generated by the hypothesis the learning algorithm outputs can be made small with high confidence in polynomial time and sample complexity. The learning algorithm is motivated by applications in human-machine interaction. Here we present two applications of the algorithm. In the first one we apply the algorithm in order to construct a model of the English language, and use this model to correct corrupted text. In the second application we construct a simple stochastic model for E. coli DNA.

**Keywords:** Learning distributions, probabilistic automata, Markov models, suffix trees, text correction

**1. Introduction**

Statistical modeling of complex sequences is a fundamental goal of machine learning due to its wide variety of natural applications. The most noticeable examples of such applications are statistical models in human communication such as natural language, handwriting and speech (Jelinek, 1985, Nadas, 1984), and statistical models of biological sequences such as DNA and proteins (Krogh, Mian & Haussler, 1993).

These kinds of complex sequences clearly do not have any simple underlying statistical source since they are generated by natural sources. However, they typically exhibit the following statistical property, which we refer to as the *short memory property*. If we consider the (empirical) probability distribution on the next symbol given the preceding subsequence of some given length, then there exists a length L (the *memory length*) such that the conditional probability distribution does not change substantially if we condition it on preceding subsequences of length greater than L.
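The short memory property can be checked empirically. The following sketch (illustrative only; the function name and the toy source are our own, not from the paper) estimates the conditional next-symbol distributions for a given context length; for a short-memory source, conditioning on contexts longer than L barely changes these distributions.

```python
from collections import Counter, defaultdict

def conditional_distributions(sequence, context_length):
    """Empirical distribution of the next symbol given each context
    of the given length observed in `sequence`."""
    counts = defaultdict(Counter)
    for i in range(context_length, len(sequence)):
        counts[sequence[i - context_length:i]][sequence[i]] += 1
    # Normalize counts into probability distributions per context.
    return {
        ctx: {sym: c / sum(ctr.values()) for sym, c in ctr.items()}
        for ctx, ctr in counts.items()
    }

# Toy periodic source: distributions conditioned on longer contexts
# quickly become deterministic and stop changing.
text = "abracadabra" * 50
short = conditional_distributions(text, 1)
longer = conditional_distributions(text, 3)
```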

This observation led Shannon, in his seminal paper (Shannon, 1951), to suggest modeling such sequences by Markov chains of order L, where the order is the memory length of the model. Alternatively, such sequences may be modeled by Hidden Markov Models (HMMs) which are more complex distribution generators and hence may capture additional properties of natural sequences. These statistical models define rich families of sequence distributions and moreover, they give efficient procedures both for generating sequences and for computing their probabilities. However, both models have severe drawbacks. The size of Markov chains grows exponentially with their order, and hence only very low order Markov chains can be considered in practical applications. Such low order Markov chains might be very poor approximators of the relevant sequences. In the case of HMMs, there are known hardness results concerning their learnability which we discuss in Section 1.1.

In this paper we propose a simple stochastic model and describe its learning algorithm.

It has been observed that in many natural sequences, the memory length depends on the context and is not fixed. The model we suggest is hence a variant of order L Markov chains, in which the order, or equivalently, the memory, is variable. We describe this model using a subclass of Probabilistic Finite Automata (PFA), which we name Probabilistic Suffix Automata (PSA).

Each state in a PSA is labeled by a string over an alphabet Σ. The transition function between the states is defined based on these string labels, so that a walk on the underlying graph of the automaton, related to a given sequence, always ends in a state labeled by a suffix of the sequence. The lengths of the strings labeling the states are bounded by some upper bound L, but different states may be labeled by strings of different length, and are viewed as having varying memory length. When a PSA generates a sequence, the probability distribution on the next symbol generated is completely defined given the previously generated subsequence of length at most L. Hence, as mentioned above, the probability distributions these automata generate can be equivalently generated by Markov chains of order L, but the description using a PSA may be much more succinct. Since the size of order L Markov chains is exponential in L, their estimation requires data length and time exponential in L.
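The suffix-based transition rule described above can be sketched as follows. This is a minimal illustration under our own assumptions (a toy state set over {0,1}; the function name is ours): after reading a history, the current state is the longest suffix of the history, of length at most L, that labels a state.

```python
def psa_state(states, history, L):
    """Return the state (a suffix of `history` of length <= L) reached
    after reading `history`, assuming the state labels are chosen so
    that every sufficiently long history has a matching suffix."""
    for k in range(min(L, len(history)), -1, -1):
        suffix = history[-k:] if k > 0 else ""
        if suffix in states:
            return suffix
    raise ValueError("no matching state; label set is not suffix-complete")

# Toy variable-memory PSA: seeing a final '1' suffices to determine the
# state, but after a final '0' the model remembers two symbols.
states = {"1", "00", "10"}
```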

In our learning model we assume that the learning algorithm is given a sample (consisting either of several sample sequences or of a single sample sequence) generated by an unknown target PSA M of some bounded size. The algorithm is required to output a hypothesis machine M̂, which is not necessarily a PSA but which has the following properties. M̂ can be used both to efficiently generate a distribution which is similar to the one generated by M, and, given any sequence s, to efficiently compute the probability assigned to s by this distribution.

Several measures of the quality of a hypothesis can be considered. Since we are mainly interested in models for statistical classification and pattern recognition, the most natural measure is the Kullback-Leibler (KL) divergence. Our results hold equally well for the variation (L1) distance and other norms, which are upper bounded by the KL-divergence.

Since the KL-divergence between Markov sources grows linearly with the length of the sequence, the appropriate measure is the KL-divergence per symbol. Therefore, we define an ε-good hypothesis to be a hypothesis whose KL-divergence per symbol from the target source is at most ε.
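In our notation (a sketch; P denotes the target distribution and P̂ the hypothesis distribution over strings of length N), the KL-divergence per symbol is:

```latex
% KL-divergence per symbol between the target distribution P and the
% hypothesis distribution \hat{P}, over strings s of length N:
\[
  \frac{1}{N}\, D_{KL}\!\left(P \,\|\, \hat{P}\right)
  \;=\; \frac{1}{N} \sum_{s \in \Sigma^{N}} P(s)\,\log\frac{P(s)}{\hat{P}(s)} .
\]
% An \epsilon-good hypothesis satisfies
% \frac{1}{N}\, D_{KL}(P \,\|\, \hat{P}) \le \epsilon .
```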

In particular, the hypothesis our algorithm outputs, belongs to a class of probabilistic machines named Probabilistic Suffix Trees (PST). The learning algorithm grows such a suffix tree starting from a single root node, and adaptively adds nodes (strings) for which


there is strong evidence in the sample that they significantly affect the prediction properties of the tree. We show that every distribution generated by a PSA can equivalently be generated by a PST which is not much larger. The converse is not true in general. We can however characterize the family of PSTs for which the converse claim holds, and in general, it is always the case that for every PST there exists a not much larger PFA that generates an equivalent distribution. There are some contexts in which PSAs are preferable, and some in which PSTs are preferable, and therefore we use both representations in the paper. For example, PSAs are more efficient generators of distributions, and since they are probabilistic automata, their well defined state space and transition function can be exploited by dynamic programming algorithms which are used for solving many practical problems. In addition, there is a natural notion of the stationary distribution on the states of a PSA which PSTs lack. On the other hand, PSTs sometimes have more succinct representations than the equivalent PSAs, and there is a natural notion of growing them.
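A minimal PST sketch (our own illustration, not the paper's construction): represent the tree as a dictionary mapping context strings (including the root "") to next-symbol distributions, and compute the probability of a sequence symbol by symbol, always predicting from the deepest node that is a suffix of the history.

```python
import math

def pst_log_prob(tree, sequence, L):
    """Log-probability assigned to `sequence` by a PST given as a dict
    mapping context strings (the root "" must be present) to
    next-symbol distributions. For each position, the longest suffix
    of the bounded history that is a node of the tree is used."""
    logp = 0.0
    for i, sym in enumerate(sequence):
        history = sequence[max(0, i - L):i]
        for k in range(len(history), -1, -1):
            node = history[len(history) - k:]  # suffix of length k
            if node in tree:
                logp += math.log(tree[node][sym])
                break
    return logp

# Toy PST over {"a", "b"}: only the context "a" refines the root's
# prediction, so the tree has just two nodes.
tree = {
    "":  {"a": 0.5, "b": 0.5},
    "a": {"a": 0.1, "b": 0.9},
}
```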

Stated formally, our main theoretical result is the following. If both a bound L, on the memory length of the target PSA, and a bound n, on the number of states the target PSA has, are known, then for every given 0 < ε < 1 and 0 < δ < 1, our learning algorithm outputs an ε-good hypothesis PST, with confidence 1 − δ, in time polynomial in L, n, |Σ|, 1/ε, and 1/δ. Furthermore, such a hypothesis can be obtained from a single sample sequence if the sequence length is also polynomial in a parameter related to the rate in which the target machine converges to its stationary distribution. Despite an intractability result concerning the learnability of distributions generated by Probabilistic Finite Automata (Kearns, et al., 1994), that is described in Section 1.1, our restricted model can be learned in a PAC-like sense efficiently. This has not been shown so far for any of the more popular sequence modeling algorithms.

We present two applications of the learning algorithm. In the first application we apply the algorithm in order to construct a model of the English language, and use this model to correct corrupted text. In the second application we construct a simple stochastic model for E. coli DNA. Combined with a learning algorithm for a different subclass of probabilistic automata (Ron, Singer & Tishby, 1995), the algorithm presented here is part of a complete cursive handwriting recognition system (Singer & Tishby, 1995).

**1.1. Related Work**

The most powerful (and perhaps most popular) model used in modeling natural sequences is the Hidden Markov Model (HMM). A detailed tutorial on the theory of HMMs as well as selected applications in speech recognition is given by Rabiner (Rabiner, 1989). A commonly used procedure for learning an HMM from a given sample is a maximum likelihood parameter estimation procedure that is based on the Baum-Welch method (Baum, et al., 1970, Baum, 1972) (which is a special case of the EM (Expectation-Maximization) algorithm (Dempster, Laird & Rubin, 1977)). However, this algorithm is guaranteed to converge only to a local maximum, and thus we are not assured that the hypothesis it outputs can serve as a good approximation for the target distribution. One might hope that the problem can be overcome by improving the algorithm used or by finding a new approach. Unfortunately, there is strong evidence that the problem cannot be solved efficiently.

Abe and Warmuth (Abe & Warmuth, 1992) study the problem of training HMMs. The HMM training problem is the problem of approximating an arbitrary, unknown source distribution by distributions generated by HMMs. They prove that HMMs are not trainable in time polynomial in the alphabet size, unless RP = NP. Gillman and Sipser (Gillman & Sipser, 1994) study the problem of exactly inferring an (ergodic) HMM over a binary alphabet when the inference algorithm can query a probability oracle for the long-term probability of any binary string. They prove that inference is hard: any algorithm for inference must make exponentially many oracle calls. Their method is information theoretic and does not depend on separation assumptions for any complexity classes.

Natural simpler alternatives, which are often used as well, are order L Markov chains (Shannon, 1951, Good, 1969), also known as n-gram models. As noted earlier, the size of an order L Markov chain is exponential in L, and hence, if we want to capture more than very short-term memory dependencies in the sequences, these models are clearly not practical.
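The blowup is easy to quantify. A sketch (the function name and the example figures are ours): an order-L chain keeps one next-symbol distribution per length-L context, i.e. on the order of |Σ|^L (|Σ| − 1) free parameters.

```python
def markov_chain_params(alphabet_size, L):
    """Free parameters of an order-L Markov chain: one distribution
    over the alphabet (alphabet_size - 1 free values) for each of the
    alphabet_size ** L possible length-L contexts."""
    return alphabet_size ** L * (alphabet_size - 1)

# For the 26 English letters, even modest orders are impractical:
# order 3 already requires 26**3 * 25 = 439,400 parameters.
letters_order_3 = markov_chain_params(26, 3)
```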

Höffgen (Höffgen, 1993) studies families of distributions related to the ones studied in this paper, but his algorithms depend exponentially and not polynomially on the order, or memory length, of the distributions. Freund et al. (Freund, et al., 1993) point out that their result for learning typical deterministic finite automata from random walks without membership queries can be extended to learning typical PFAs. Unfortunately, there is strong evidence indicating that the problem of learning general PFAs is hard. Kearns et al. (Kearns, et al., 1994) show that PFAs are not efficiently learnable under the assumption that there is no efficient algorithm for learning noisy parity functions in the PAC model.

The machines used as our hypothesis representation, namely Probabilistic Suffix Trees (PSTs), were introduced (in a slightly different form) in (Rissanen, 1983) and have been used for other tasks such as universal data compression (Rissanen, 1983, Rissanen, 1986, Weinberger, Lempel & Ziv, 1982, Willems, Shtarkov & Tjalkens, 1993). Perhaps the strongest among these results (which was brought to our attention after the completion of this work), and the one most tightly related to our result, is (Willems, Shtarkov & Tjalkens, 1993). This paper describes an efficient sequential procedure for universal data compression for PSTs by using a larger model class. This algorithm can be viewed as a distribution learning algorithm but the hypothesis it produces is not a PST or a PSA and hence cannot be used for many applications. Willems et al. show that their algorithm can be modified to give the minimum description length PST. However, in case the source generating the examples is a PST, they are able to show only that this PST converges to that source in the limit of infinite sequence length.

Vitter and Krishnan (Vitter & Krishnan, 1991, Krishnan & Vitter, 1993) adapt a version of the Ziv-Lempel data compression algorithm (Ziv & Lempel, 1978) to get a page prefetching algorithm, where the sequence of page accesses is assumed to be generated by a PFA. They show that the page fault rate of their algorithm converges to the page fault rate of the best algorithm that has full knowledge of the source. This is true for almost all page access sequences (in the limit of the sequence length). Laird and Saul (Laird & Saul, 1994) describe a prediction algorithm which is similar in spirit to our algorithm and is based