A novel text classification problem and its solution

Sławomir Zadrożny,

Janusz Kacprzyk,

Marek Gajewski,

Maciej Wysocki

Abstract

A new text categorization problem is introduced. As in the classical problem, there is a set of documents and a set of categories. However, in addition to being assigned to a specific category, each document belongs to a certain sequence of documents, referred to as a case. It is assumed that all documents in the same case belong to the same category. An example is a set of news articles whose categories may be sport, politics, entertainment, etc. Within each category there are cases, i.e., sequences of documents describing, for example, the evolution of some event. The problem considered is how to assign a document to the proper category and to the proper case within that category. In this paper we formalize the problem and discuss two approaches to its solution.

Keywords: text categorization, sequences of documents, sequence mining, hidden Markov models