
Machine Learning, 25, 117-149 (1996)

© 1996 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

The Power of Amnesia: Learning Probabilistic Automata with Variable Memory Length



DANA RON
Laboratory for Computer Science, MIT, Cambridge, MA 02139



YORAM SINGER
AT&T Labs, 600 Mountain Avenue, Murray Hill, NJ 07974




NAFTALI TISHBY
Institute of Computer Science, Hebrew University, Jerusalem 91904, Israel

Editor: Thomas Hancock

Abstract. We propose and analyze a distribution learning algorithm for variable memory length Markov processes. These processes can be described by a subclass of probabilistic finite automata which we name Probabilistic Suffix Automata (PSA). Though hardness results are known for learning distributions generated by general probabilistic automata, we prove that the algorithm we present can efficiently learn distributions generated by PSAs. In particular, we show that for any target PSA, the KL-divergence between the distribution generated by the target and the distribution generated by the hypothesis the learning algorithm outputs can be made small with high confidence in polynomial time and sample complexity. The learning algorithm is motivated by applications in human-machine interaction. Here we present two applications of the algorithm. In the first one we apply the algorithm in order to construct a model of the English language, and use this model to correct corrupted text. In the second application we construct a simple stochastic model for E. coli DNA.

Keywords: learning distributions, probabilistic automata, Markov models, suffix trees, text correction

1. Introduction

Statistical modeling of complex sequences is a fundamental goal of machine learning due to its wide variety of natural applications. The most noticeable examples of such applications are statistical models in human communication such as natural language, handwriting and speech (Jelinek, 1985; Nadas, 1984), and statistical models of biological sequences such as DNA and proteins (Krogh, Mian & Haussler, 1993).

These kinds of complex sequences clearly do not have any simple underlying statistical source since they are generated by natural sources. However, they typically exhibit the following statistical property, which we refer to as the short memory property. If we consider the (empirical) probability distribution on the next symbol given the preceding subsequence of some given length, then there exists a length L (the memory length) such that the conditional probability distribution does not change substantially if we condition it on preceding subsequences of length greater than L.
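The short memory property can be checked empirically on a sample sequence. The sketch below is illustrative only (the function name and the test sequence are ours, not the paper's): it estimates the conditional distribution on the next symbol for every preceding context of a given length, so one can compare estimates across context lengths.

```python
from collections import Counter, defaultdict

def next_symbol_dist(seq, context_len):
    """Empirical distribution of the next symbol given each
    preceding context of length context_len."""
    counts = defaultdict(Counter)
    for i in range(context_len, len(seq)):
        counts[seq[i - context_len:i]][seq[i]] += 1
    return {ctx: {sym: n / sum(c.values()) for sym, n in c.items()}
            for ctx, c in counts.items()}

# On a sequence from an order-1 source, conditioning on contexts
# longer than 1 does not change the estimates: the memory length is 1.
seq = "abab" * 50
d1 = next_symbol_dist(seq, 1)   # {'a': {'b': 1.0}, 'b': {'a': 1.0}}
d2 = next_symbol_dist(seq, 2)   # same predictions per final symbol
```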

This observation led Shannon, in his seminal paper (Shannon, 1951), to suggest modeling such sequences by Markov chains of order L, where the order is the memory length of the model. Alternatively, such sequences may be modeled by Hidden Markov Models (HMMs), which are more complex distribution generators and hence may capture additional properties of natural sequences. These statistical models define rich families of sequence distributions and, moreover, they give efficient procedures both for generating sequences and for computing their probabilities. However, both models have severe drawbacks. The size of Markov chains grows exponentially with their order, and hence only very low order Markov chains can be considered in practical applications. Such low order Markov chains might be very poor approximators of the relevant sequences. In the case of HMMs, there are known hardness results concerning their learnability which we discuss in Section 1.1.

In this paper we propose a simple stochastic model and describe its learning algorithm.

It has been observed that in many natural sequences, the memory length depends on the context and is not fixed. The model we suggest is hence a variant of order L Markov chains, in which the order, or equivalently, the memory, is variable. We describe this model using a subclass of Probabilistic Finite Automata (PFA), which we name Probabilistic Suffix Automata (PSA).

Each state in a PSA is labeled by a string over an alphabet Σ. The transition function between the states is defined based on these string labels, so that a walk on the underlying graph of the automaton, related to a given sequence, always ends in a state labeled by a suffix of the sequence. The lengths of the strings labeling the states are bounded by some upper bound L, but different states may be labeled by strings of different length, and are viewed as having varying memory length. When a PSA generates a sequence, the probability distribution on the next symbol generated is completely defined given the previously generated subsequence of length at most L. Hence, as mentioned above, the probability distributions these automata generate can be equivalently generated by Markov chains of order L, but the description using a PSA may be much more succinct. Since the size of order L Markov chains is exponential in L, their estimation requires data length and time exponential in L.
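A minimal sketch of this transition rule (the helper name `psa_step` and the toy label set are ours, for illustration): given the current state's label and the next symbol, the next state is the one whose label is the longest suffix, among the automaton's state labels, of the concatenation of the two.

```python
def psa_step(state_label, symbol, labels):
    """One PSA transition: move to the state whose label is the
    longest suffix of state_label + symbol among the given labels."""
    s = state_label + symbol
    # Scan suffixes from longest to shortest.
    for i in range(len(s)):
        if s[i:] in labels:
            return s[i:]
    return ""  # fall back to the empty (root) label

# Toy PSA over {0, 1} with memory length 2 and states "00", "10", "1".
labels = {"00", "10", "1"}
state = "1"
for sym in "001":
    state = psa_step(state, sym, labels)
# state is now "1", a suffix of the sequence fed in, as required.
```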

In our learning model we assume that the learning algorithm is given a sample (consisting either of several sample sequences or of a single sample sequence) generated by an unknown target PSA M of some bounded size. The algorithm is required to output a hypothesis machine M̂, which is not necessarily a PSA but which has the following properties: M̂ can be used both to efficiently generate a distribution which is similar to the one generated by M, and, given any sequence s, to efficiently compute the probability assigned to s by this distribution.

Several measures of the quality of a hypothesis can be considered. Since we are mainly interested in models for statistical classification and pattern recognition, the most natural measure is the Kullback-Leibler (KL) divergence. Our results hold equally well for the variation (L1) distance and other norms, which are upper bounded by the KL-divergence.

Since the KL-divergence between Markov sources grows linearly with the length of the sequence, the appropriate measure is the KL-divergence per symbol. Therefore, we define an ε-good hypothesis to be a hypothesis which has at most ε KL-divergence per symbol to the target source.
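As a rough illustration of this measure (a Monte-Carlo sketch using i.i.d. Bernoulli toy models, not the paper's analysis), the KL-divergence per symbol can be estimated by averaging the per-symbol log-likelihood difference over sequences drawn from the target model:

```python
import math

def seq_logprob(seq, p1):
    """Log-probability of a binary string under an i.i.d. Bernoulli(p1) model."""
    return sum(math.log(p1 if c == "1" else 1.0 - p1) for c in seq)

def kl_per_symbol(sample, log_p, log_q):
    """Monte-Carlo estimate of the KL-divergence per symbol:
    average of (log p(s) - log q(s)) / |s| over sequences s drawn from p."""
    return sum((log_p(s) - log_q(s)) / len(s) for s in sample) / len(sample)

# For i.i.d. sources the per-symbol KL equals the single-symbol KL,
# independent of sequence length. With exactly balanced samples the
# estimate matches KL(Bern(0.5) || Bern(0.25)) ≈ 0.1438 nats.
sample = ["1100", "1010"]
est = kl_per_symbol(sample,
                    lambda s: seq_logprob(s, 0.5),
                    lambda s: seq_logprob(s, 0.25))
```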

In particular, the hypothesis our algorithm outputs belongs to a class of probabilistic machines named Probabilistic Suffix Trees (PST). The learning algorithm grows such a suffix tree starting from a single root node, and adaptively adds nodes (strings) for which there is strong evidence in the sample that they significantly affect the prediction properties of the tree.
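The growing idea can be caricatured as follows. This is a simplified sketch with ad-hoc thresholds (`min_count`, `eps` are our illustrative parameters, not the smoothed statistical criteria the paper analyzes): a longer context is added only if it occurs often enough and its empirical next-symbol distribution differs noticeably from its parent's.

```python
from collections import Counter, defaultdict

def grow_pst(seq, max_depth, min_count, eps):
    """Toy PST growth: start from the root (empty context) and add a
    context if its parent (its suffix shorter by one) is in the tree,
    it occurs at least min_count times, and its next-symbol
    distribution differs from the parent's by more than eps."""
    counts = defaultdict(Counter)
    for d in range(max_depth + 1):
        for i in range(d, len(seq)):
            counts[seq[i - d:i]][seq[i]] += 1

    def dist(ctx):
        c = counts[ctx]
        total = sum(c.values())
        return {a: c[a] / total for a in c}

    tree = {""}
    for d in range(1, max_depth + 1):          # parents before children
        for ctx in list(counts):
            if len(ctx) != d or ctx[1:] not in tree:
                continue
            if sum(counts[ctx].values()) < min_count:
                continue
            parent, child = dist(ctx[1:]), dist(ctx)
            if any(abs(child.get(a, 0) - parent.get(a, 0)) > eps
                   for a in set(parent) | set(child)):
                tree.add(ctx)
    return tree

# On "aab" repeated, the context "aa" sharpens prediction (next symbol
# is always "b"), so it is kept; "ab" predicts exactly like "b" alone,
# so it is pruned away.
tree = grow_pst("aab" * 60, 2, 5, 0.1)
```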

We show that every distribution generated by a PSA can equivalently be generated by a PST which is not much larger. The converse is not true in general. We can, however, characterize the family of PSTs for which the converse claim holds, and in general it is always the case that for every PST there exists a not much larger PFA that generates an equivalent distribution. There are some contexts in which PSAs are preferable, and some in which PSTs are preferable, and therefore we use both representations in the paper. For example, PSAs are more efficient generators of distributions, and since they are probabilistic automata, their well-defined state space and transition function can be exploited by dynamic programming algorithms which are used for solving many practical problems. In addition, there is a natural notion of the stationary distribution on the states of a PSA which PSTs lack. On the other hand, PSTs sometimes have more succinct representations than the equivalent PSAs, and there is a natural notion of growing them.

Stated formally, our main theoretical result is the following. If both a bound L on the memory length of the target PSA, and a bound n on the number of states the target PSA has, are known, then for every given 0 < ε < 1 and 0 < δ < 1, our learning algorithm outputs an ε-good hypothesis PST, with confidence 1 − δ, in time polynomial in L, n, |Σ|, 1/ε, and 1/δ. Furthermore, such a hypothesis can be obtained from a single sample sequence if the sequence length is also polynomial in a parameter related to the rate at which the target machine converges to its stationary distribution. Despite an intractability result concerning the learnability of distributions generated by Probabilistic Finite Automata (Kearns, et al., 1994), which is described in Section 1.1, our restricted model can be learned efficiently in a PAC-like sense. This has not been shown so far for any of the more popular sequence modeling algorithms.

We present two applications of the learning algorithm. In the first application we apply the algorithm in order to construct a model of the English language, and use this model to correct corrupted text. In the second application we construct a simple stochastic model for E. coli DNA. Combined with a learning algorithm for a different subclass of probabilistic automata (Ron, Singer & Tishby, 1995), the algorithm presented here is part of a complete cursive handwriting recognition system (Singer & Tishby, 1995).

1.1. Related Work

The most powerful (and perhaps most popular) model used in modeling natural sequences is the Hidden Markov Model (HMM). A detailed tutorial on the theory of HMMs as well as selected applications in speech recognition is given by Rabiner (Rabiner, 1989). A commonly used procedure for learning an HMM from a given sample is a maximum likelihood parameter estimation procedure that is based on the Baum-Welch method (Baum, et al., 1970; Baum, 1972), which is a special case of the EM (Expectation-Maximization) algorithm (Dempster, Laird & Rubin, 1977). However, this algorithm is guaranteed to converge only to a local maximum, and thus we are not assured that the hypothesis it outputs can serve as a good approximation for the target distribution. One might hope that the problem can be overcome by improving the algorithm used or by finding a new approach. Unfortunately, there is strong evidence that the problem cannot be solved efficiently.

Abe and Warmuth (Abe & Warmuth, 1992) study the problem of training HMMs. The HMM training problem is the problem of approximating an arbitrary, unknown source distribution by distributions generated by HMMs. They prove that HMMs are not trainable in time polynomial in the alphabet size, unless RP = NP. Gillman and Sipser (Gillman & Sipser, 1994) study the problem of exactly inferring an (ergodic) HMM over a binary alphabet when the inference algorithm can query a probability oracle for the long-term probability of any binary string. They prove that inference is hard: any algorithm for inference must make exponentially many oracle calls. Their method is information theoretic and does not depend on separation assumptions for any complexity classes.

Natural simpler alternatives, which are often used as well, are order L Markov chains (Shannon, 1951; Good, 1969), also known as n-gram models. As noted earlier, the size of an order L Markov chain is exponential in L and hence, if we want to capture more than very short term memory dependencies in the sequences, then these models are clearly not practical.

Höffgen (Höffgen, 1993) studies families of distributions related to the ones studied in this paper, but his algorithms depend exponentially, not polynomially, on the order, or memory length, of the distributions. Freund et al. (Freund, et al., 1993) point out that their result for learning typical deterministic finite automata from random walks without membership queries can be extended to learning typical PFAs. Unfortunately, there is strong evidence indicating that the problem of learning general PFAs is hard. Kearns et al. (Kearns, et al., 1994) show that PFAs are not efficiently learnable under the assumption that there is no efficient algorithm for learning noisy parity functions in the PAC model.

The machines used as our hypothesis representation, namely Probabilistic Suffix Trees (PSTs), were introduced (in a slightly different form) in (Rissanen, 1983) and have been used for other tasks such as universal data compression (Rissanen, 1983; Rissanen, 1986; Weinberger, Lempel & Ziv, 1982; Willems, Shtarkov & Tjalkens, 1993). Perhaps the strongest among these results (which was brought to our attention after the completion of this work), and the one most tightly related to ours, is (Willems, Shtarkov & Tjalkens, 1993). This paper describes an efficient sequential procedure for universal data compression for PSTs by using a larger model class. This algorithm can be viewed as a distribution learning algorithm, but the hypothesis it produces is not a PST or a PSA and hence cannot be used for many applications. Willems et al. show that their algorithm can be modified to give the minimum description length PST. However, in case the source generating the examples is a PST, they are able to show only that this PST converges to that source in the limit of infinite sequence length.

Vitter and Krishnan (Vitter & Krishnan, 1991; Krishnan & Vitter, 1993) adapt a version of the Ziv-Lempel data compression algorithm (Ziv & Lempel, 1978) to get a page prefetching algorithm, where the sequence of page accesses is assumed to be generated by a PFA. They show that the page fault rate of their algorithm converges to the page fault rate of the best algorithm that has full knowledge of the source. This is true for almost all page access sequences (in the limit of the sequence length). Laird and Saul (Laird & Saul, 1994) describe a prediction algorithm which is similar in spirit to our algorithm and is based on the Markov tree or Directed Acyclic Word Graph approach which is used for data compression (Blumer, 1990). They do not analyze the correctness of the algorithm formally, but present several applications of the algorithm.
