Bachelor dissertation. Alina Varfolomeeva. Institute of Hospitality Management in Prague, Department of Hospitality Management. Major field of study: ...
Similar to data mining, text mining explores data in text files to establish valuable patterns and rules that indicate trends and significant features about specific topics (Nasukawa and Nagano, 2001). Both types of systems rely on techniques such as document preprocessing and pattern discovery, but unlike data mining, text mining retrieves those patterns from unstructured or partially structured data sets in document collections, which are not amenable to data-mining processes. Data-mining applications use structured data that has been carefully prepared. Text mining, on the other hand, works with natural language rather than structured, computer-stored data, and employs a variety of techniques from information retrieval, information extraction, and corpus-based computational linguistics (Feldman and Sanger, 2007).
2.3 Text Mining Techniques As was mentioned above, text mining is a combination of techniques from such areas as natural language processing, information retrieval, information extraction and data mining. Moreover, each of those techniques was developed long before the term text mining itself was coined.
2.3.1 Information Retrieval Text mining starts with natural language processing (NLP), which in the most general terms makes human language understandable to computers. Text mining appears to include in itself the whole idea of automatic natural language processing and more besides, for example the analysis of linkage structures such as hyperlinks in Web literature, a useful source of information that lies outside the traditional domain of natural language processing (Witten, 2004).
Information retrieval represents one more text mining technique. The process of information retrieval is most commonly associated with online documents and with using search engines to browse the Internet. Essentially, information retrieval narrows down the set of documents relevant to a given problem.
One of the main aspects of information retrieval is assessing the similarity between different documents using categories. According to Sebastiani (2002), text categorization is the task of assigning natural language documents to predefined categories and grouping documents into natural clusters according to their content. The set of categories is often called a “controlled vocabulary.” In text categorization the categories are known beforehand and determined in advance for each document. In contrast, document clustering is “unsupervised” learning: there is no predefined category or “class,” but groups of documents that belong together are sought. Document clustering assists in document retrieval by creating links between similar documents, which in turn allows related documents to be retrieved once one of them has been deemed relevant to a query (Martin, 1995).
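The contrast between categorization (predefined categories) and clustering (no predefined classes) can be illustrated with a minimal bag-of-words sketch. The categories, example documents, and similarity measure below are all invented for illustration; real systems use far richer features than raw word counts.

```python
from collections import Counter
import math

def bow(text):
    # Bag-of-words vector: lowercase token counts.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Predefined categories (the "controlled vocabulary"), each with labelled examples.
training = {
    "hotel":  ["the room was clean and the staff friendly",
               "great hotel with comfortable beds"],
    "flight": ["the flight was delayed for two hours",
               "smooth flight and polite cabin crew"],
}

# Build one centroid vector per category from its training documents.
centroids = {cat: bow(" ".join(docs)) for cat, docs in training.items()}

def categorize(doc):
    # Assign the document to the most similar predefined category.
    return max(centroids, key=lambda cat: cosine(bow(doc), centroids[cat]))

print(categorize("the staff at this hotel were very friendly"))  # hotel
```

Because the categories are fixed in advance and learned from labelled examples, this sketch shows the supervised, categorization side of the distinction; clustering would instead group the documents by mutual similarity without any labels.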
2.3.2 Information Extraction One of the principal techniques of text mining is information extraction, a term used to refer to the task of filling templates from natural language input (Appelt, 1999). The process of information extraction examines the document set to discover higher-level concepts, which can be used instead of words to devise a more complex matrix. In this way more interesting patterns can be uncovered during the subsequent learning process.
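Template filling can be sketched with a simple rule-based extractor. The template, its slot names, and the pattern below are purely hypothetical; production information extraction systems rely on much more sophisticated linguistic analysis.

```python
import re

# A hypothetical template with two slots: which person joined which company.
# This is a sketch of rule-based extraction, not a production IE system.
PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) joined (?P<company>[A-Z][a-zA-Z]+)"
)

def fill_template(sentence):
    # Return the filled template slots, or None when the pattern finds nothing.
    m = PATTERN.search(sentence)
    return m.groupdict() if m else None

print(fill_template("Yesterday John Smith joined Acme as head of sales."))
# {'person': 'John Smith', 'company': 'Acme'}
```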
Turmo et al. (2006) describe the main approaches to information extraction and discuss various machine learning techniques that can be used to adapt existing methods of extracting valuable information to any set of textual information in any domain. These techniques are part of web mining, which is “the application of data mining techniques to discover patterns from the web” (Zhang et al., 2010). Web mining can be divided into three types: web content mining, web usage mining, and web structure mining (Pabarskaite and Raudys, 2007; Sanchez et al., 2008). Web content mining is the process of discovering useful information from websites and web servers, which may contain both textual and graphical information as well as more complex Flash media applications. Web usage mining analyses the behavioral patterns of users browsing through web pages. Web structure mining is used to analyze the structure of websites.
Part of web content mining is web-based text mining, which involves discovering useful patterns in textual information on the Internet. Most of the text on the Internet contains detailed structural markup. Some markup is internal and indicates document structure or format; some is external and gives explicit hypertext links between documents. These information sources give additional leverage for mining Web documents (Chakrabarti, 2003). Hyperlinks lead to other connected documents and may give additional information to the researcher. The structure of the page markup can also be used to mine textual information only in a particular part of a document, so as to acquire only the most relevant data.
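The idea of using page markup to mine only the relevant part of a document can be sketched with Python's standard library. The page snippet and the class names in it are invented for illustration.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed page with navigation, a review, and a footer.
page = """<html><body>
  <div class="nav">Home | About | Contact</div>
  <div class="review">The breakfast buffet was excellent.</div>
  <div class="footer">Copyright 2014</div>
</body></html>"""

tree = ET.fromstring(page)
# Select only the element that carries the relevant text,
# ignoring navigation and footer boilerplate.
reviews = [el.text for el in tree.iter("div") if el.get("class") == "review"]
print(reviews)  # ['The breakfast buffet was excellent.']
```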
2.3.3 Standardization The computer industry as a whole, including most of the text-processing community, has adopted XML (Extensible Markup Language) as its standard exchange format. Many currently available information systems use this format.
The main reason to identify different parts of a document consistently is to allow more precise selection of those words and word-phrases that will be used to generate features.
Many software programs nowadays allow documents to be saved in XML format, and individual filters can be installed to convert existing documents without having to process each one manually. Documents encoded as images are currently harder to deal with. There are some optical character recognition systems that can be useful, but these can introduce errors into the text and must be used rather carefully (Weiss et al., 2005).
The mining tools can be applied without considering the structure of the document, which is the main advantage of standardizing the data. For retrieving information from a document, it is irrelevant which program created it or what its original format was. The software tools need to read data in just one format, rather than in the many different formats in which documents originally arrive.
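This advantage can be sketched as follows: hypothetical filters convert each original format into one canonical XML form, and the mining tool then reads only that form. All function and element names here are illustrative assumptions, not part of any real toolkit.

```python
import xml.etree.ElementTree as ET

# Hypothetical filters: each converts one source format to a canonical XML form.
def from_plaintext(text):
    doc = ET.Element("document")
    ET.SubElement(doc, "body").text = text
    return doc

def from_csv_row(row):
    # Assume a simple "id,text" layout for the sketch.
    doc = ET.Element("document")
    ET.SubElement(doc, "body").text = row.split(",", 1)[1]
    return doc

# The mining tool reads only the canonical format, regardless of origin.
def extract_text(doc):
    return doc.find("body").text

docs = [from_plaintext("A plain text review."),
        from_csv_row("42,A review stored in CSV.")]
print([extract_text(d) for d in docs])
```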
2.3.4 Tokenization and Stemming If the given document collection is in XML format, then we are ready to examine the unstructured text to identify useful features. The very first step in working with text is to break the stream of characters into words or, more precisely, tokens. This is crucial for further analysis: without identifying tokens, it is difficult even to imagine extracting higher-level information from the document. Each token is a representative of a type, that is, of a group of identical word forms, so the number of tokens is much higher than the number of types.
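Tokenization, and the distinction between tokens and types, can be sketched in a few lines; the regular expression used here is a deliberately crude tokenizer chosen for the example.

```python
import re

text = "The hotel was great. The staff of the hotel was friendly."

# Tokenization: split the character stream into word tokens.
tokens = re.findall(r"[a-z]+", text.lower())
# Types: the distinct word forms behind those tokens.
types = set(tokens)

print(len(tokens), len(types))  # 11 tokens, 7 types
```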
The tokenizer should be customized beforehand so that the researcher obtains the best possible features from the available text; otherwise extra work may be required after the key words are obtained. The tokenization process is language-dependent. In this work the focus is on documents in English. For other languages the general principles will remain the same, but the details will differ.
After all text in the document set has been segmented into a sequence of tokens, the next optional step is to convert each of the tokens to a standard form, a process usually referred to as stemming or lemmatization. Whether or not this step is necessary is application-dependent. Stemming might provide a positive benefit in some cases for the purpose of document classification. One outcome of stemming is a reduced number of different word types in a set of documents; the other is an increased frequency of occurrence of some individual word types. This can sometimes make a difference, especially in classification algorithms that take frequency into account. In other cases the extra processing may not provide any significant benefit.
When the normalization of the text concentrates on unifying grammatical variants such as singular/plural and present/past forms of one word, the process is called inflectional stemming. In linguistic terminology this is called morphological analysis. For English and some other languages, with many irregular word forms and nonintuitive spelling, it is more difficult. There is no simple rule a computer can learn, for example, to bring together teach and taught.
Similarly, the stems of similar-sounding words might differ: the stem for “rebelled” is “rebel,” but the stem for “belled” is “bell.” In other languages, for example Spanish, morphological analysis is comparatively simple (Weiss et al., 2005).
When normalizing texts in English, an algorithm for inflectional stemming must be part dictionary-based and part rule-based. Any stemming algorithm for English will make many mistakes because of ambiguity if it operates only on tokens, without grammatical information such as part-of-speech tags or semantics. For example, is “lie” a verb meaning to tell untrue things, or a verb meaning that something rests in a horizontal position on some surface? Or is it a noun, or a misspelling of another word? In the absence of a lexical disambiguation process, which is often complicated, a stemming algorithm will typically pick the most frequently appearing choice.
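Such a part-dictionary, part-rule design can be sketched as follows. The irregular-form table and the suffix rules are toy examples and make exactly the kinds of mistakes discussed here; the rules ignore consonant doubling, for instance, so “rebelled” would come out as “rebell”.

```python
# A toy inflectional stemmer: part dictionary-based (irregular forms),
# part rule-based (regular suffixes). Real stemmers are far more elaborate.
IRREGULAR = {"taught": "teach", "went": "go", "mice": "mouse"}

def inflectional_stem(word):
    if word in IRREGULAR:                  # dictionary lookup for irregular forms
        return IRREGULAR[word]
    if word.endswith("ies"):               # studies -> study
        return word[:-3] + "y"
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]                   # walking -> walk
    if word.endswith("ed") and len(word) > 4:
        return word[:-2]                   # crude: rebelled -> rebell
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                   # hotels -> hotel
    return word

print([inflectional_stem(w) for w in ["taught", "hotels", "studies", "walking"]])
# ['teach', 'hotel', 'study', 'walk']
```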
Although the inflectional stemmer will most probably correctly identify a rather significant number of stems, it is not expected to be perfect.
Another more aggressive way of normalizing words using stemming is intended to reach a root form of each word with no inflectional or derivational prefixes and suffixes. For instance, “denormalization” is reduced to the stem “norm.” Such aggressive stemming is used to reduce the number of word types in a document set rather drastically, as a result making distributional statistics more reliable.
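Aggressive derivational stemming can be sketched as the repeated stripping of known prefixes and suffixes; the affix lists and the minimum-length threshold below are arbitrary choices made for the illustration.

```python
# A toy aggressive stemmer: repeatedly strips known derivational
# prefixes and suffixes to approach a root form. Purely illustrative.
PREFIXES = ["de", "re", "un"]
SUFFIXES = ["alization", "ization", "ation", "ing", "ed", "s"]

def aggressive_stem(word):
    changed = True
    while changed:
        changed = False
        for p in PREFIXES:
            # Strip a prefix only if a reasonably long stem remains.
            if word.startswith(p) and len(word) - len(p) >= 4:
                word, changed = word[len(p):], True
        for s in SUFFIXES:
            if word.endswith(s) and len(word) - len(s) >= 4:
                word, changed = word[:-len(s)], True
    return word

print(aggressive_stem("denormalization"))  # norm
```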
Additionally, words with the same core meaning are grouped together, so that one concept such as locate has only one stem, although the text may contain location, locating, etc. However, “the usefulness of stemming is very much application-dependent” (Weiss et al., 2005). It does not hurt to run the text mining analysis both with and without stemming, as the results might differ in each new project.
2.3.5 Dictionary Each document in a set of documents can be characterized by the tokens or key words it contains. This means that even without deep linguistic analysis, it is possible to describe any document in the set by features that represent its most frequent words.
According to Weiss et al. (2005), “the collective set of features is typically called a dictionary”. The key words in this dictionary form the basis for creating a spreadsheet, or matrix, of numeric data corresponding to the document collection. Each row represents a document, and each column represents a feature. As a result, each cell in the spreadsheet is a measurement of the frequency of a feature in one of the documents in the set.
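Building such a dictionary and its document-by-feature frequency matrix can be sketched directly; the three documents below are invented examples.

```python
# Sketch: build a dictionary of features and a document-by-feature
# frequency matrix, one row per document, one column per word.
docs = ["clean room friendly staff",
        "friendly staff great breakfast",
        "noisy room"]

tokenized = [d.split() for d in docs]
dictionary = sorted(set(w for toks in tokenized for w in toks))

# Each cell counts how often a dictionary word occurs in a document.
matrix = [[toks.count(word) for word in dictionary] for toks in tokenized]

print(dictionary)
print(matrix)
```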
In many cases it might be useful to work with a smaller dictionary. Its size can be reduced to improve performance results through various transformations.
One obvious way to reduce the size of the dictionary is to create a list of stop-words and remove them from the dictionary. Stop-words are words that almost never have any use in determining the topic of a document, such as the articles a and the or pronouns such as it and they. These common words can be discarded before the feature generation process, but it may be even more effective to generate the features first, then apply all the other transformations, and only at the very end of the process discard the features that correspond to stop-words.
Frequency information on the word counts can be quite useful in reducing dictionary size and can sometimes improve performance results for some methods. As was mentioned above, the most frequent words are often stop-words and can be deleted. The remaining most frequently used words are often the important words that should stay in a local dictionary. The very rare words are often typos, or words that will not drastically change the results, and can also be discarded.
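Both transformations, stop-word removal and frequency-based pruning, can be sketched together. The stop-word list, the frequency threshold, and the deliberate typo "freindly" are all invented for the example; real thresholds must be tuned per collection.

```python
from collections import Counter

# A tiny, illustrative stop-word list.
STOP_WORDS = {"a", "the", "it", "they", "was", "and", "of", "with"}

text = ("the staff was friendly and the room was clean "
        "the staff served breakfast and the room had a view "
        "freindly staff")  # 'freindly' is a deliberate typo

counts = Counter(text.split())
# 1) discard stop-words; 2) discard very rare words (possible typos),
# here with an arbitrary minimum frequency of 2.
dictionary = {w: c for w, c in counts.items()
              if w not in STOP_WORDS and c >= 2}
print(dictionary)  # {'staff': 3, 'room': 2}
```

Note that the crude threshold also drops the legitimate but rare word "friendly"; as the text above observes, rare words can usually be discarded without drastically changing the results, but the cutoff is a judgment call.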
An alternative to generating a local dictionary for a single document is to construct a global dictionary from all documents in the collection. Instead of placing every possible word from the document set into the dictionary, it is better to follow the lead of printed dictionaries and avoid storing every single variation of the same word. The rationale is that the different variations of a word really refer to the same concept: for text-mining software there is no need to distinguish singular from plural, most verbs can be stored in their stem form, and synonyms can be grouped under one and the same key word.
However, it is important to apply stemming carefully, as it can occasionally be excessive or even harmful for some words. When the same procedure that shortens words to their root form is applied to all the words in the dictionary, there will most likely be cases where a subtle difference in meaning is lost.
Overall, stemming results in a large reduction in dictionary size, and a smaller dictionary is quite beneficial for performance. In general, the smaller the dictionary, the more thought is needed in composing it to select the best words.
Tokenization and stemming are examples of helpful procedures for composing smaller dictionaries. These efforts will pay off in improved manageability of results and perhaps in improved accuracy and precision. Even if nothing else is gained from these manipulations of the text, analysis can proceed more rapidly with smaller dictionaries.
Once all the normalizing and standardizing is done, and the set of features has been determined, the document collection can be converted to spreadsheet format. This spreadsheet will be populated by ones and zeros, representing the presence or absence of the words from the dictionary in each particular document.
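The presence/absence spreadsheet can be sketched in one row per document; the documents and the dictionary are invented examples.

```python
docs = ["clean room", "friendly staff", "clean friendly staff"]
dictionary = ["clean", "friendly", "room", "staff"]

# One row per document; 1 if the dictionary word occurs in it, else 0.
spreadsheet = [[1 if w in doc.split() else 0 for w in dictionary]
               for doc in docs]
print(spreadsheet)  # [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 1]]
```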
To make it more accurate some more additional transformations might be considered. They will be discussed in Chapters 3 and 4 of the study.
2.4 Text Mining in the Hospitality and Tourism Industry As was already mentioned in the Introduction chapter of this thesis, more and more hoteliers are becoming aware of the importance of collecting clients’ feedback and of using automatic approaches rather than manual ones to analyze this information.