Topic Identification Considering Word Order by Using Markov Chains

Kampas, Dimitrios

Doctoral thesis (Dissertations and theses)

2016

Permalink
https://hdl.handle.net/10993/27805

Files

Full Text

Thesis_DK_final.pdf

Publisher postprint (1.25 MB)

All documents in ORBilu are protected by a user license.

Details

Keywords :

Topic identification; Stochastic classifier; Finance

Abstract :

[en] Automated topic identification of text has gained significant attention as the volume of documents in digital form is vast and continuously increasing. Probabilistic topic models are a family of statistical methods that unveil the latent structure of documents by defining a priori the model that generates the text. They infer the topic(s) of a document under the bag-of-words assumption, which is unrealistic given the sophisticated structure of language. The result of this simplification is the extraction of topics that are vague in terms of interpretability, since they disregard any relations among words that may resolve word ambiguity. Topic models thus miss significant structural information inherent in the word order of a document. In this thesis we introduce a novel stochastic topic identifier for text data that addresses these shortcomings. This work is motivated primarily by the assertion that word order reveals text semantics in a human-like way. Our approach recognizes on-topic documents after being trained solely on an on-class corpus. It incorporates word order in terms of word groups to deal with the data sparsity of conventional n-gram language models, which usually require a large volume of training data. Markov chains provide a reliable means to capture short- and long-range language dependencies for topic identification. Words are deterministically associated with classes to improve the probability estimates of the infrequent ones. We demonstrate our approach and motivate its eligibility on several datasets from different domains and languages. Moreover, as a pioneering contribution, we introduce a hypothesis-testing experiment that strengthens the claim that word order is a significant factor for topic identification. Stochastic topic identifiers are a promising initiative for building more sophisticated topic identification systems in the future.
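
The general idea described in the abstract (words mapped deterministically to classes, with a Markov chain over the resulting class sequence trained only on on-topic text) can be illustrated with a small sketch. This is not the thesis's actual algorithm: the first-order chain, the add-one smoothing, the `<unk>` class, the toy finance corpus, and the names `train_class_chain`, `avg_log_likelihood`, and `word_to_class` are all assumptions made for illustration.

```python
import math

def train_class_chain(corpus, word_to_class, smoothing=1.0):
    """Estimate first-order transition probabilities between word classes.

    corpus: list of tokenized on-topic documents (lists of words).
    word_to_class: hypothetical deterministic word -> class mapping.
    """
    classes = sorted(set(word_to_class.values()) | {"<unk>"})
    # Start every transition count at the smoothing value (additive smoothing).
    counts = {a: {b: smoothing for b in classes} for a in classes}
    for doc in corpus:
        seq = [word_to_class.get(w, "<unk>") for w in doc]
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    # Normalize each row into a probability distribution.
    probs = {}
    for a, row in counts.items():
        total = sum(row.values())
        probs[a] = {b: c / total for b, c in row.items()}
    return probs

def avg_log_likelihood(doc, probs, word_to_class):
    """Average log-probability of the class transitions in a document."""
    seq = [word_to_class.get(w, "<unk>") for w in doc]
    pairs = list(zip(seq, seq[1:]))
    if not pairs:
        return float("-inf")
    return sum(math.log(probs[a][b]) for a, b in pairs) / len(pairs)

# Toy on-topic (finance) corpus and an assumed word -> class mapping.
word_to_class = {
    "bank": "ORG", "fund": "ORG",
    "raises": "ACT", "cuts": "ACT",
    "rates": "FIN", "dividend": "FIN",
}
corpus = [
    ["bank", "raises", "rates"],
    ["fund", "cuts", "dividend"],
    ["bank", "cuts", "rates"],
]
probs = train_class_chain(corpus, word_to_class)

# Same words, different order: the chain scores the two sequences differently.
in_order = ["fund", "raises", "dividend"]   # word trigram never seen in training
shuffled = ["dividend", "fund", "raises"]
score_in_order = avg_log_likelihood(in_order, probs, word_to_class)
score_shuffled = avg_log_likelihood(shuffled, probs, word_to_class)
print(score_in_order, score_shuffled)
```

In this toy setting the sequence "fund raises dividend" never occurs in the training corpus, yet its class sequence does, which is one way grouping words can ease n-gram sparsity. A bag-of-words model would assign the ordered and shuffled documents the same score, whereas the chain separates them. A real identifier would additionally need a decision threshold on the score, which this sketch leaves out.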

Disciplines :

Computer science

Author, co-author :

Kampas, Dimitrios ; University of Luxembourg > Faculty of Science, Technology and Communication (FSTC) > Computer Science and Communications Research Unit (CSC)

Language :

English

Title :

Topic Identification Considering Word Order by Using Markov Chains

Defense date :

30 May 2016

Institution :

Unilu - University of Luxembourg, Luxembourg

Degree :

Docteur en Informatique

Promotor :

Schommer, Christoph

Name of the research project :

ESCAPE

Available on ORBilu :

since 30 June 2016

Statistics

Number of views

204 (20 by Unilu)

Number of downloads

348 (7 by Unilu)
