Word Embeddings for Morphologically Complex Languages

Grzegorz Jurdziński

Recent methods for learning word embeddings, such as GloVe and Word2Vec, have succeeded in representing semantic and syntactic relations spatially. We extend GloVe by introducing separate vectors for the base form and the grammatical form of a word, using a morphosyntactic dictionary for this purpose. This allows the vectors to capture the properties of words better. We also present the model's results on the word analogy test and introduce a new test based on WordNet.
Keywords: machine learning, word embeddings, natural language processing, morphology

1. Introduction
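
As a rough illustration of the idea summarized in the abstract, the sketch below builds a word vector from a base-form vector and a grammatical-form vector and then runs a word analogy query. It is a toy under stated assumptions, not the model described in this paper: the abstract does not say how the two vectors are combined, so summation is assumed here, and the dictionary `MORPH_DICT`, the tag names, all vector values, and the helper names are hypothetical.

```python
from math import sqrt

# Hypothetical toy morphosyntactic dictionary: surface form -> (base form, grammatical tag).
MORPH_DICT = {
    "kot": ("kot", "sg:nom"),
    "kota": ("kot", "sg:gen"),
    "koty": ("kot", "pl:nom"),
    "pies": ("pies", "sg:nom"),
    "psa": ("pies", "sg:gen"),
    "psy": ("pies", "pl:nom"),
}

# Made-up 3-dimensional vectors; a trained model would learn these.
BASE_VECTORS = {"kot": [1.0, 0.0, 0.5], "pies": [0.0, 1.0, 0.5]}
TAG_VECTORS = {
    "sg:nom": [0.2, 0.2, 0.0],
    "sg:gen": [-0.1, 0.3, 0.4],
    "pl:nom": [0.3, -0.2, 0.1],
}


def word_vector(word):
    """Compose a word vector as base-form vector + tag vector (assumed combination)."""
    base, tag = MORPH_DICT[word]
    return [b + t for b, t in zip(BASE_VECTORS[base], TAG_VECTORS[tag])]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))


def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' by maximizing cosine to vec(b) - vec(a) + vec(c)."""
    va, vb, vc = word_vector(a), word_vector(b), word_vector(c)
    target = [y - x + z for x, y, z in zip(va, vb, vc)]
    candidates = [w for w in MORPH_DICT if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, word_vector(w)))


if __name__ == "__main__":
    print(analogy("kot", "kota", "pies"))
```

Under the summation assumption, the base-form parts cancel in `vec(b) - vec(a)`, so an inflectional analogy reduces to a difference of tag vectors that is shared across all words with the same pair of grammatical forms.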


The journal is published continuously online.
The primary form of the journal is the electronic version.

The print version of the journal is available at www.wuj.pl