Experiments with language combinatorics in text classification: lessons learned and future implications

Michal Ptaszynski,

Fumito Masui

Abstract

This article presents a meta-analysis of research conducted using language combinatorics (LC), a novel method of language model generation and feature extraction based on combinatorial manipulations of sentence elements (e.g., words). In recent years, LC has been applied to a number of text classification tasks, such as affect analysis, cyber-aggression detection, and extraction of references to future events. In this article we summarize two of the most extensive experiments and discuss the general implications for future applications of the combinatorial language model.

Keywords: language combinatorics, natural language processing, text classification