References of "Kampas, Dimitrios 50002068"
Full Text
Topic Identification Considering Word Order by Using Markov Chains
Kampas, Dimitrios UL

Doctoral thesis (2016)

Automated topic identification of text has gained significant attention as the volume of documents in digital form continues to grow. Probabilistic topic models are a family of statistical methods that uncover the latent structure of documents by defining a priori the model that generates the text. They infer the topic(s) of a document under the bag-of-words assumption, which is unrealistic given the sophisticated structure of language. This simplification yields topics that are hard to interpret, because it disregards relations among words that could resolve word ambiguity. Topic models thus miss significant structural information inherent in the word order of a document. In this thesis we introduce a novel stochastic topic identifier for text data that addresses these shortcomings. The work is motivated by the assertion that word order reveals text semantics in a human-like way. Our approach recognizes an on-topic document after being trained solely on an on-class corpus. It incorporates word order in terms of word groups to deal with the data sparsity of conventional n-gram language models, which usually require a large volume of training data. Markov chains provide a reliable means of capturing short- and long-range language dependencies for topic identification. Words are deterministically associated with classes to improve the probability estimates of infrequent words. We demonstrate our approach and motivate its suitability on several datasets from different domains and languages. Moreover, we present pioneering work by introducing a hypothesis testing experiment that strengthens the claim that word order is a significant factor for topic identification. Stochastic topic identifiers are a promising step toward building more sophisticated topic identification systems in the future.
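
As an illustration only, the idea of scoring a document with a Markov chain over word classes can be sketched as follows. This is not the thesis's actual model: the word-to-class mapping, the first-order chain, the add-one smoothing, and all function names (`train_class_bigram`, `avg_log_likelihood`) are assumptions made for this sketch. It shows that two documents with the same words but a different order receive different scores.

```python
import math
from collections import defaultdict

def train_class_bigram(docs, word_to_class, unk="UNK"):
    """Count class-to-class transitions in a corpus of on-topic documents."""
    counts = defaultdict(lambda: defaultdict(int))
    classes = set(word_to_class.values()) | {unk}
    for doc in docs:
        # Each word is deterministically replaced by its class.
        seq = [word_to_class.get(w, unk) for w in doc]
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    return counts, classes

def avg_log_likelihood(doc, counts, classes, word_to_class, unk="UNK"):
    """Average log-probability per transition, with add-one smoothing."""
    seq = [word_to_class.get(w, unk) for w in doc]
    pairs = list(zip(seq, seq[1:]))
    if not pairs:
        return float("-inf")
    total = 0.0
    for prev, cur in pairs:
        row = counts.get(prev, {})
        num = row.get(cur, 0) + 1
        den = sum(row.values()) + len(classes)
        total += math.log(num / den)
    return total / len(pairs)

# Hypothetical toy data: a hand-made class mapping and two training documents.
word_to_class = {"fed": "ORG", "raises": "ACT", "cuts": "ACT", "rates": "FIN"}
train_docs = [["fed", "raises", "rates"], ["fed", "cuts", "rates"]]
counts, classes = train_class_bigram(train_docs, word_to_class)

on_score = avg_log_likelihood(["fed", "raises", "rates"], counts, classes, word_to_class)
shuffled_score = avg_log_likelihood(["rates", "raises", "fed"], counts, classes, word_to_class)
print(on_score, shuffled_score)
```

Mapping words to classes shrinks the transition table, which is one plausible way to read the abstract's remark about improving estimates for infrequent words; a decision threshold on the score would still have to be chosen separately.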

Full Text
Peer Reviewed
A Hidden Markov Model to detect relevance in financial documents based on on/off topics
Kampas, Dimitrios UL; Schommer, Christoph UL; Sorger, Ulrich UL

in European Conference on Data Analysis (2014)

Automated text classification has gained significant attention as the volume of documents in digital form continues to grow. Most standard classification methods posit the independence of the terms (features) in a document, which is unrealistic given the sophisticated structure of language. Our research concerns the discovery of relevance in documents, where relevance refers to a sufficient number of themes (or topics) that are either `on' or `off'. `On' topics are semantically close to a domain-specific discourse, whereas `off' topics are not. As a promising approach, we have modelled a stochastic process for term sequences in which each term is conditionally dependent on its preceding terms. Hidden Markov Models provide a reliable means of incorporating language and domain dependencies for classification. Terms are deterministically associated with classes to improve the probability estimates for infrequent words. In the paper presentation, we demonstrate our approach and motivate its suitability by exploring annotated Thomson Reuters news documents; in particular, the `on topic' documents discuss the monetary policy of the Federal Reserve. We estimate the transition and emission probabilities of our model on a training set of both on- and off-topic documents and evaluate the accuracy of our approach using 10-fold cross-validation. This work is part of the interdisciplinary research project ESCAPE, which is funded by the Fonds National de la Recherche. We kindly thank our colleagues from the Dept. of Finance for their support.
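
To make the roles of transition and emission probabilities concrete, the following is a minimal sketch of the standard scaled forward algorithm for a discrete Hidden Markov Model. It is a generic textbook procedure, not the paper's implementation: the two hidden states named `on` and `off`, the three-term vocabulary, and every probability value below are assumptions invented for illustration.

```python
import math

def forward_log_likelihood(obs, states, start_p, trans_p, emit_p):
    """Log P(obs) under a discrete HMM, using the scaled forward algorithm."""
    # Initialisation: joint probability of the first term and each state.
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Recursion: propagate through transitions, then emit the term.
            alpha = {
                s: emit_p[s][obs[t]] * sum(alpha[r] * trans_p[r][s] for r in states)
                for s in states
            }
        # Rescale at every step to avoid underflow on long sequences.
        scale = sum(alpha.values())
        log_lik += math.log(scale)
        alpha = {s: a / scale for s, a in alpha.items()}
    return log_lik

# Hypothetical toy parameters; real values would be estimated from annotated data.
states = ["on", "off"]
start_p = {"on": 0.5, "off": 0.5}
trans_p = {"on": {"on": 0.8, "off": 0.2}, "off": {"on": 0.3, "off": 0.7}}
emit_p = {
    "on": {"rate": 0.6, "goal": 0.1, "bank": 0.3},
    "off": {"rate": 0.1, "goal": 0.7, "bank": 0.2},
}

print(forward_log_likelihood(["bank", "rate", "rate"], states, start_p, trans_p, emit_p))
```

In a cross-validated evaluation such as the one described, the parameters would be re-estimated on each training fold and the resulting model applied to the held-out fold.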
