Poddubnyy V.V.   Kubarev A.I.   Shevelyov O.G.   Kukushkina O.V.  

Building a style sheet using text classification algorithms based on decision trees

Reporter: Poddubnyy V.V.

A.I. Kubarev
Tomsk State University, kubarev_ai@mail.ru
V.V. Poddubny
vvpoddubny@gmail.com
O.G. Shevelyov
oshevelyov@gmail.com
O.V. Kukushkina
Lomonosov Moscow State University, kukush@orc.ru

(This work was supported by RFBR grant № 11-07-00776-a)

We propose a decision-tree-based algorithm for sequential binary segmentation of the n-dimensional feature space of text styles into 2^n non-overlapping n-dimensional intervals, optimized with respect to information criteria. The intervals form a table of text styles that represents the "style profile" of a text corpus. The feature space is assumed to be frequency-based, i.e., it consists of the frequencies of occurrence of function words, word combinations, bigrams, etc. The algorithm is implemented in the "StyleAnalyzer" system, developed for the comprehensive study of heterogeneous corpora. Using decision trees and tables of text styles, we investigated classification performance with respect to various text characteristics such as author, genre, and style. The style profiles found by the algorithm can be used to identify the style of texts of unknown authorship, which, for example, makes it possible to determine their most probable author.

Keywords: classification of texts, decision trees, tables of styles, profiles of styles, recognition of texts
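
The segmentation described in the abstract can be illustrated with a minimal sketch. It is not the StyleAnalyzer implementation. It assumes that the unspecified information criterion is entropy-based information gain, that features are split once each in a fixed order, and that each resulting cell is coded by one bit per feature. The names entropy, best_threshold, and segment, as well as the toy frequencies and author labels, are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) with the highest information gain.

    Returns (None, 0.0) when no split improves on the unsplit node.
    """
    base = entropy(labels)
    best = (None, 0.0)
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(labels)
        gain = base - remainder
        if gain > best[1]:
            best = (t, gain)
    return best


def segment(rows, labels, n_features, feature=0, cell=""):
    """Split on features 0..n-1 in turn (assumed order).

    Returns {cell code: Counter of labels}; with n features there are
    at most 2**n cells, and empty cells are omitted.
    """
    if not rows:
        return {}
    if feature == n_features:
        return {cell: Counter(labels)}
    values = [r[feature] for r in rows]
    t, _ = best_threshold(values, labels)
    table = {}
    for bit in ("0", "1"):
        # Without a useful threshold, every text stays in the "0" half.
        idx = [i for i, v in enumerate(values)
               if (bit == "1") == (t is not None and v > t)]
        table.update(segment([rows[i] for i in idx],
                             [labels[i] for i in idx],
                             n_features, feature + 1, cell + bit))
    return table


# Toy data: two relative frequencies per text, labelled by author.
rows = [(0.10, 0.02), (0.12, 0.03), (0.30, 0.02), (0.32, 0.08)]
labels = ["A", "A", "B", "C"]
table = segment(rows, labels, n_features=2)
for code, counts in sorted(table.items()):
    print(code, dict(counts))
```

In this sketch the returned dictionary plays the role of the table of styles: each key is a cell of the segmented feature space, and the label counts in that cell are what a style profile would be built from. A text of unknown authorship could then be assigned to the cell its frequencies fall into and attributed to the label that dominates there.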
