Institute of Information Theory and Automation

Text classification

Mgr. Antonín Malík
Defense type: 
Ph.D.
Date of Event: 
2008-09-30
Venue: 
Meeting room K112, Faculty of Electrical Engineering, Czech Technical University in Prague (ČVUT), Karlovo nám. 13, Praha 2
Mail: 
Status: 
defended
The goal of text document classification (TC) is to automatically assign a new document to one or more predefined classes based on its content. Representing documents with the bag-of-words approach leads to a very high-dimensional feature (word) space. The dominant approach to dimensionality reduction in TC is feature selection (FS). FS methods for TC typically apply an evaluation function to each word individually, so they ignore the other words and the way words work together. This thesis proposes novel algorithms for word selection: sequential forward selection methods based on proposed improved mutual information criterion functions. Their performance is compared with that of information gain, the chi-squared statistic and odds ratio, which evaluate features individually. Experimental results on the Reuters-21578 data set, obtained with a naive Bayes classifier based on the multinomial model, a linear support vector machine and a k-nearest neighbor classifier, are analyzed from various perspectives, including recall, precision and the F1-measure. The results indicate the effectiveness of the proposed FS algorithms in TC.
The thesis also presents probabilistic models for TC that use mixture models for the class-conditional probability functions, focusing on mixtures of multivariate Bernoulli distributions and mixtures of multinomial distributions. The proposed approach is a generalization of naive Bayes that tries to properly model significant class-conditional dependencies by spreading them over different class mixture components. Maximum-likelihood estimates of the mixture parameters are computed with the well-known expectation-maximization algorithm. Experimental results on the Reuters-21578 and Newsgroups data sets indicate the effectiveness of the proposed mixture models in TC: classification accuracy increased.
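The idea of a class-conditional mixture of multivariate Bernoulli distributions fitted by expectation-maximization can be illustrated with a minimal sketch. This is not the thesis implementation. The function names, the smoothing constant `eps`, the random initialization and the four-word toy data are all assumptions made for illustration. With one component per class the model reduces to Bernoulli naive Bayes. With two components it can capture word co-occurrence inside a class.

```python
import math
import random


def log_bernoulli(x, theta):
    """Log-probability of binary vector x under independent Bernoulli parameters."""
    return sum(math.log(t) if xi else math.log(1.0 - t) for xi, t in zip(x, theta))


def log_likelihood(x, weights, thetas):
    """Log of the mixture density sum_m w_m * p(x | theta_m), computed stably."""
    logs = [math.log(w) + log_bernoulli(x, th) for w, th in zip(weights, thetas)]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))


def fit_mixture(docs, n_components, n_iter=50, seed=0, eps=1e-3):
    """EM for a Bernoulli mixture on the documents of ONE class.

    docs: list of 0/1 word-presence vectors. eps is an assumed smoothing
    constant that keeps every parameter strictly between 0 and 1.
    """
    rng = random.Random(seed)
    n_words = len(docs[0])
    weights = [1.0 / n_components] * n_components
    thetas = [[rng.uniform(0.25, 0.75) for _ in range(n_words)]
              for _ in range(n_components)]
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each document.
        resp = []
        for x in docs:
            logs = [math.log(w) + log_bernoulli(x, th)
                    for w, th in zip(weights, thetas)]
            top = max(logs)
            unnorm = [math.exp(v - top) for v in logs]
            norm = sum(unnorm)
            resp.append([u / norm for u in unnorm])
        # M-step: re-estimate component weights and word probabilities.
        for m in range(n_components):
            total = sum(r[m] for r in resp)
            weights[m] = (total + eps) / (len(docs) + n_components * eps)
            thetas[m] = [
                (sum(r[m] * x[j] for r, x in zip(resp, docs)) + eps) / (total + 2 * eps)
                for j in range(n_words)
            ]
    return weights, thetas


def train(labeled_docs, n_components):
    """Fit one mixture per class; returns {label: (log_prior, weights, thetas)}."""
    n_total = sum(len(docs) for docs in labeled_docs.values())
    return {
        label: (math.log(len(docs) / n_total), *fit_mixture(docs, n_components))
        for label, docs in labeled_docs.items()
    }


def classify(x, models):
    """Pick the class maximizing log prior + log class-conditional likelihood."""
    return max(models, key=lambda c: models[c][0] + log_likelihood(x, *models[c][1:]))


# Hypothetical toy data: in each class the words occur in correlated pairs,
# yet every single word appears in exactly half of the documents of each class.
toy = {
    "a": [[1, 1, 0, 0]] * 5 + [[0, 0, 1, 1]] * 5,
    "b": [[1, 0, 1, 0]] * 5 + [[0, 1, 0, 1]] * 5,
}
single_models = train(toy, n_components=1)   # equivalent to Bernoulli naive Bayes
mixture_models = train(toy, n_components=2)  # two components per class
```

On this toy data the per-word statistics are identical in both classes, so the single-component model assigns every document the same likelihood under either class. The two-component model can separate the classes because each component specializes on one co-occurrence pattern.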
2018-05-03 08:01