«Bachelor dissertation Alina Varfolomeeva Institute of Hospitality Management in Prague Department of Hospitality Management Major field of study: ...»

Similar to data mining, text mining explores data in text files to establish valuable patterns and rules that indicate trends and significant features about specific topics (Nasukawa and Nagano, 2001). Both types of systems rely on techniques such as document preprocessing and pattern discovery. Unlike data mining, however, text mining retrieves those patterns from unstructured or partially structured data in document collections, which fall outside the scope of data-mining processes. Data-mining applications use structured data that has been carefully prepared. Text mining, on the other hand, works with natural language rather than computer-stored structured data, and employs a variety of techniques from information retrieval, information extraction, and corpus-based computational linguistics (Feldman and Sanger, 2007).

2.3 Text Mining Techniques

As mentioned above, text mining combines techniques from areas such as natural language processing, information retrieval, information extraction and data mining. Moreover, each of those techniques was developed long before the term text mining itself was coined.

2.3.1 Information Retrieval

Text mining starts with natural language processing (NLP), which in the most general terms makes human language understandable to a computer. Text mining appears to encompass the whole of automatic natural language processing and more besides, for example the analysis of linkage structures such as hyperlinks in Web literature, a useful source of information that lies outside the traditional domain of natural language processing (Witten, 2004).

Information retrieval is another text mining technique. It is most commonly associated with online documents and the use of search engines to browse the Internet. Essentially, information retrieval narrows down the set of documents to those relevant to the current problem.

One of the main aspects of information retrieval is assessing the similarity between different documents using categories. According to Sebastiani (2002), text categorization means assigning natural language documents to predefined categories and grouping documents into natural clusters according to their content. The set of categories is often called a “controlled vocabulary.” In text categorization the categories are known beforehand and are determined in advance for each document. In contrast, document clustering is “unsupervised” learning: there is no predefined category or “class,” and groups of documents that belong together are sought instead. Document clustering assists in document retrieval by creating links between similar documents, which in turn allows related documents to be retrieved once one of the documents has been deemed relevant to a query (Martin, 1995).
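
The notion of similarity between documents can be made concrete with a small sketch. The text does not prescribe a particular measure; cosine similarity over raw word counts is used here purely as an assumption, and the review texts and function names are invented for illustration.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lower-case the text and count its alphabetic words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical mini-collection: two hotel reviews and one unrelated text.
docs = {
    "review_1": "the room was clean and the staff was friendly",
    "review_2": "friendly staff and a clean room",
    "review_3": "the flight was delayed for three hours",
}
vectors = {name: word_counts(text) for name, text in docs.items()}
sim_12 = cosine_similarity(vectors["review_1"], vectors["review_2"])
sim_13 = cosine_similarity(vectors["review_1"], vectors["review_3"])
```

In this toy collection the two hotel reviews score higher against each other than either does against the unrelated text. A clustering procedure could use such pairwise scores to group documents without any predefined categories.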

2.3.2 Information Extraction

One of the principal techniques of text mining is information extraction, a term used for the task of filling templates from natural language input (Appelt, 1999). The process of information extraction examines the document set to discover higher-level concepts, which can be used instead of words to construct a more complex matrix. In this way, more interesting patterns can be uncovered during the subsequent learning process.

Turmo et al. (2006) describe the main approaches to information extraction and discuss various machine learning techniques that can be used to adapt existing extraction methods to any set of textual information in any domain. These techniques are a part of web mining, which is “the application of data mining techniques to discover patterns from the web” (Zhang et al., 2010). Web mining can be divided into three types: web content mining, web usage mining, and web structure mining (Pabarskaite and Raudys, 2007; Sanchez et al., 2008). Web content mining is the process of discovering useful information from websites and web servers, which may contain textual and graphical information as well as more complex Flash media applications. Web usage mining analyzes the behavioral patterns of users browsing web pages. Web structure mining is used to analyze the structure of websites.

Part of web content mining is web-based text mining, which involves discovering useful patterns in textual information on the Internet. Most text on the Internet contains detailed structural markup. Some markup is internal and indicates document structure or format; some is external and gives explicit hypertext links between documents. These information sources give additional leverage for mining Web documents (Chakrabarti, 2003). Hyperlinks lead to other connected documents and might give the researcher additional information. The structure of the page markup can be used to mine textual information from only a particular part of the document, so that only the most relevant data are acquired.
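
As a rough illustration of using markup to restrict mining to one part of a page, the sketch below relies on Python's standard `html.parser` module. The page content, the choice of `<article>` as the relevant region, and the class name `SectionTextParser` are all assumptions made for this example.

```python
from html.parser import HTMLParser


class SectionTextParser(HTMLParser):
    """Collect text inside one chosen tag and every hyperlink target."""

    def __init__(self, target_tag):
        super().__init__()
        self.target_tag = target_tag
        self.depth = 0    # greater than 0 while inside the target tag
        self.chunks = []  # text fragments found inside the target tag
        self.links = []   # href values of all <a> elements on the page

    def handle_starttag(self, tag, attrs):
        if tag == self.target_tag:
            self.depth += 1
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == self.target_tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


# Hypothetical page: only the <article> part holds the review text.
page = """
<html><body>
  <nav><a href="/home">Home</a></nav>
  <article>Great location. <a href="/hotel/42">Hotel page</a> Rooms were small.</article>
  <footer>Contact us</footer>
</body></html>
"""
parser = SectionTextParser("article")
parser.feed(page)
article_text = " ".join(parser.chunks)
```

The navigation and footer text are ignored, while the collected hyperlinks remain available as a separate source of information about connected documents.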

2.3.3 Standardization

The computer industry as a whole, including most of the text-processing community, has adopted XML (Extensible Markup Language) as its standard exchange format. Many currently available information systems are in this format.

The main reason to identify different parts of a document consistently is to allow more precise selection of those words and word-phrases that will be used to generate features.

Many software programs nowadays allow documents to be saved in XML format, and individual filters can be installed to convert existing documents without having to process each one manually. Documents encoded as images are currently harder to deal with. Some optical character recognition systems can be useful, but they can introduce errors into the text and must be used carefully (Weiss et al., 2005).

The main advantage of standardizing the data is that the mining tools can be applied without the need to consider the original structure of the document. For retrieving information from a document, it is irrelevant which program was used to create it or what the original format was. The software tools need to read data in just one format, not in the many different formats in which the documents originally arrived.
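
A minimal sketch of reading standardized documents, using Python's standard `xml.etree.ElementTree` module. The element names `reviews`, `review`, `title` and `body` are hypothetical; the source does not specify any schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML collection; the schema is an assumption for illustration.
xml_source = """
<reviews>
  <review id="r1">
    <title>Lovely stay</title>
    <body>The staff were helpful and the room was quiet.</body>
  </review>
  <review id="r2">
    <title>Disappointing</title>
    <body>Breakfast was cold.</body>
  </review>
</reviews>
"""

root = ET.fromstring(xml_source)

# Select only the <body> part of each review as input for feature generation.
bodies = {
    review.get("id"): review.findtext("body")
    for review in root.iter("review")
}
```

Because every document follows the same structure, the same few lines select the relevant part of each one regardless of the program that originally produced it.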

2.3.4 Tokenization and Stemming

If the given document collection is in XML format, we are ready to examine the unstructured text to identify useful features. The very first step in working with text is to break the stream of characters into words or, more precisely, tokens. This is crucial for further analysis: without identifying the tokens, it is difficult even to imagine extracting higher-level information from the document. Each token is an instance of a type, that is, of a distinct word, so the number of tokens in a text is usually much higher than the number of types.

The tokenizer should always be customized beforehand so that the researcher obtains the best possible features from the available text; otherwise extra work may be required after the key words are obtained. The tokenization process is language-dependent. In this work the focus is on documents in English. For other languages the general principles remain the same, but the details differ.
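
The difference between tokens and types can be shown with a very small tokenizer. This is only a sketch for English text: the regular expression, which keeps alphabetic words and simple contractions, is an assumption and would need customizing for a real collection.

```python
import re

# Assumed pattern: runs of letters, optionally followed by an apostrophe part
# (so "weren't" stays one token). Digits and punctuation are dropped.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def tokenize(text):
    """Split text into lower-cased word tokens."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]


text = "The room was clean. The staff weren't rude, and the room was quiet."
tokens = tokenize(text)  # every occurrence, in order
types = set(tokens)      # distinct word forms only
```

For this sentence the tokenizer yields 13 tokens but only 9 types, because "the", "room" and "was" occur more than once.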

After all text in the document set has been segmented into a sequence of tokens, the next, optional step is to convert each token to a standard form, a process usually referred to as stemming or lemmatization. Whether this step is necessary depends on the application. Stemming can be beneficial in some cases for document classification. One outcome of stemming is a reduced number of distinct word types in a set of documents; another is an increased frequency of occurrence of some individual word types. This can sometimes make a difference, especially for classification algorithms that take frequency into account. In other cases the extra processing may not provide any significant benefit.

When normalization concentrates on unifying grammatical variants such as singular/plural and present/past forms of one word, the process is called inflectional stemming. In linguistic terminology, this is called morphological analysis. For English and some other languages, with many irregular word forms and nonintuitive spelling, this is comparatively difficult. There is no simple rule that a computer can learn to bring together, for example, teach and taught.

Similarly, the stems of similar-sounding words might differ: the stem for “rebelled” is “rebel,” but the stem for “belled” is “bell.” In other languages, for example Spanish, morphological analysis is comparatively simple (Weiss et al., 2005).

When normalizing texts in English, an algorithm for inflectional stemming must be part dictionary-based and part rule-based. Any stemming algorithm for English that operates only on tokens, without grammatical information such as parts of speech or semantics, will make at least some mistakes because of ambiguity. For example, is “lie” a verb meaning to say untrue things, or a verb meaning to rest in a horizontal position on some surface? Or maybe it is a noun, or somebody misspelled the word. In the absence of a lexical disambiguation process, which is often complicated, a stemming algorithm would probably pick the most frequently occurring choice.

Although the inflectional stemmer will most probably correctly identify a rather significant number of stems, it is not expected to be perfect.
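
The combination of a dictionary and rules can be sketched as follows. This is a deliberately tiny illustration, not an established stemming algorithm: the exception table and the three suffix rules are assumptions chosen to show both the idea and its limits.

```python
# Dictionary part: a small, hypothetical table of irregular forms.
IRREGULAR = {"taught": "teach", "went": "go", "children": "child", "men": "man"}


def inflectional_stem(token):
    """Map a token to a base form: look up exceptions, then apply suffix rules."""
    word = token.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    # Rule part: a few naive English suffix rules (assumed, far from complete).
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"   # cities -> city
    if word.endswith("ed") and len(word) > 4:
        return word[:-2]         # walked -> walk
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]         # rooms -> room
    return word


stems = {w: inflectional_stem(w) for w in ["taught", "rooms", "walked", "belled", "rebelled"]}
```

The rule for "-ed" handles "belled" correctly but turns "rebelled" into "rebell" rather than "rebel", which is exactly the kind of mistake that purely token-based rules cannot avoid without further dictionary entries or grammatical information.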

Another, more aggressive way of normalizing words through stemming aims to reach the root form of each word, with no inflectional or derivational prefixes and suffixes. For instance, “denormalization” is reduced to the stem “norm.” Such aggressive stemming is used to reduce the number of word types in a document set rather drastically, which makes distributional statistics more reliable.

Additionally, words with the same core meaning are grouped together, so that one concept such as locate has only one stem, although the text may contain location, locating, etc. However, “the usefulness of stemming is very much application-dependent” (Weiss et al., 2005). It is worth running the text mining analysis both with and without stemming, as results might differ from project to project.

2.3.5 Dictionary

Each document in a set of documents can be characterized by the tokens or key words it contains. This means that, even without deep linguistic analysis, any chosen document from the set can be described by features that represent its most frequent words.

According to Weiss et al. (2005), “the collective set of features is typically called a dictionary”. The key words in this dictionary form the basis for creating a spreadsheet or matrix of numeric data that corresponds to the document collection. Each row represents a document, and each column represents a feature. As a result, each cell in the spreadsheet measures the frequency of one feature in one document of the set.
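
The dictionary and the resulting matrix can be sketched in a few lines. The tokenized documents below are invented, and sorting the dictionary alphabetically is only an assumption made to keep the column order predictable.

```python
from collections import Counter

# Hypothetical collection, already tokenized: one list of tokens per document.
documents = [
    ["clean", "room", "friendly", "staff"],
    ["room", "small", "room", "noisy"],
    ["friendly", "staff", "great", "breakfast"],
]

# Dictionary: the collective set of features, here every distinct token.
dictionary = sorted({token for doc in documents for token in doc})

# Matrix: one row per document, one column per feature, cells hold frequencies.
matrix = []
for doc in documents:
    counts = Counter(doc)
    matrix.append([counts[feature] for feature in dictionary])
```

The second row contains a 2 in the column for "room", since that word occurs twice in the second document; every feature that a document lacks gets a 0.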

In many cases it might be useful to work with a smaller dictionary. Its size can be reduced to improve performance results through various transformations.

One obvious way to reduce the size of the dictionary is to create a list of stop words and remove them from the dictionary. Stop words are words that almost never help in determining the topic of a document, such as the articles a and the or the pronouns it and they. These common words can be discarded before the feature generation process, but it might be even more effective to generate the features first, apply all the other transformations, and discard the features that correspond to stop words at the very end of the process.

Frequency information on the word counts can be quite useful in reducing dictionary size and can sometimes improve performance results for some methods. As was mentioned above, the most frequent words are often stop words and can be deleted. The remaining most frequently used words are often the important words that should remain in a local dictionary. The very rare words are often typos or words which will not drastically change the results and can also be discarded.
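
Both reductions, removing stop words and discarding very rare words, can be sketched together. The stop-word list and the `min_count` threshold are assumptions for this example; suitable values depend on the collection.

```python
from collections import Counter

# Assumed, very short stop-word list; real lists are much longer.
STOP_WORDS = frozenset({"a", "the", "it", "they", "and", "was"})


def prune_dictionary(documents, stop_words=STOP_WORDS, min_count=2):
    """Keep words that are not stop words and occur at least min_count times."""
    counts = Counter(token for doc in documents for token in doc)
    return sorted(
        word for word, count in counts.items()
        if word not in stop_words and count >= min_count
    )


# Hypothetical tokenized documents; "rooom" stands in for a typo.
documents = [
    ["the", "room", "was", "clean"],
    ["the", "staff", "was", "friendly", "and", "the", "room", "was", "quiet"],
    ["it", "was", "a", "clean", "rooom"],
]
small_dictionary = prune_dictionary(documents)
```

With these settings only "clean" and "room" survive: the most frequent words are stop words, and words seen once, including the typo, fall below the threshold.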

An alternative to generating a local dictionary, related to one document, is to construct a global dictionary from all documents in the collection. Instead of placing every possible word from the set of documents into the dictionary, it is better to follow the lead of printed dictionaries and avoid storing every variation of the same word. The rationale is that the different variations of a word really refer to the same concept. Text mining software has no need to distinguish singular from plural. Most verbs can be stored in their stem form. Synonyms can be grouped and mapped to one and the same key word.

However, it is important to apply stemming carefully, as it can occasionally be excessive or even harmful for some words. When the same procedure that shortens words to their root form is applied to all the words in the dictionary, we will most likely face cases where a subtle difference in meaning is lost.

Overall, stemming results in a large reduction in dictionary size and is quite beneficial for performance when a smaller dictionary is used. In general, the smaller the dictionary, the more thought must go into its composition in order to select the best words.

Tokenization and stemming are examples of helpful procedures for composing smaller dictionaries. These efforts will usually pay off in improved manageability of results and perhaps in improved accuracy and precision. Even if nothing else is gained from these manipulations of the text, analysis can proceed more rapidly with smaller dictionaries.

Once all the normalizing and standardizing is done and the set of features has been determined, the document collection can be converted to spreadsheet format. This spreadsheet will be populated by ones and zeros, representing the presence or absence of each dictionary word in each particular document.
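
A minimal sketch of this conversion, assuming the documents are already tokenized and the dictionary has already been chosen. The function name `binary_matrix` and the sample data are hypothetical.

```python
def binary_matrix(documents, dictionary):
    """Return one row per document: 1 if the feature is present, else 0."""
    rows = []
    for doc in documents:
        present = set(doc)  # repeated words count only once
        rows.append([1 if feature in present else 0 for feature in dictionary])
    return rows


# Hypothetical dictionary and tokenized reviews.
dictionary = ["breakfast", "clean", "room", "staff"]
documents = [
    ["clean", "room", "room"],
    ["staff", "breakfast"],
    ["pool"],  # contains no dictionary word at all
]
spreadsheet = binary_matrix(documents, dictionary)
```

Repetition is ignored in this representation, so the first document gets a single 1 for "room" although the word occurs twice, and a document without any dictionary word becomes a row of zeros.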

To make the representation more accurate, some additional transformations might be considered. They are discussed in Chapters 3 and 4 of the study.

2.4 Text Mining in Hospitality and Tourism Industry

As was already mentioned in the Introduction chapter of this thesis, more and more hoteliers are becoming aware of the importance of collecting clients’ feedback and of using automatic rather than manual approaches to analyze this information.
