Word Embeddings for Morphologically Complex Languages

Grzegorz Jurdzinski

Abstract
Recent methods for learning word embeddings, such as GloVe or Word2Vec, have succeeded in representing semantic and syntactic relations spatially. We extend GloVe by introducing separate vectors for the base form and the grammatical form of a word, using a morphosyntactic dictionary to decompose each surface form. This allows the vectors to capture word properties more accurately. We also report the model's results on a word analogy test and introduce a new test based on WordNet.
Keywords: machine learning, word embeddings, natural language processing, morphology
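To make the extension concrete, the following is a minimal sketch, not the paper's actual implementation: each surface form's vector is composed as the sum of a base-form (lemma) vector and a grammatical-form (tag) vector looked up in a morphosyntactic dictionary, and both receive gradient updates under a GloVe-style weighted least-squares objective. The toy co-occurrence counts, the tiny dictionary, and all names (cooc, morph, B, G, analogy) are illustrative assumptions; the analogy helper shows only the standard vector-offset form of the word analogy test.

    import numpy as np

    # Illustrative toy data (invented, not from the paper): co-occurrence
    # counts over (word, context) pairs, and a morphosyntactic dictionary
    # mapping each surface form to (lemma, grammatical tag).
    cooc = {("kota", "czarnego"): 3.0, ("kotem", "czarnym"): 2.0}
    morph = {
        "kota": ("kot", "sg:gen"),        "kotem": ("kot", "sg:inst"),
        "czarnego": ("czarny", "sg:gen"), "czarnym": ("czarny", "sg:inst"),
    }

    dim = 10
    rng = np.random.default_rng(0)
    # Separate embedding tables for base forms and for grammatical tags.
    B = {l: rng.normal(scale=0.1, size=dim) for l, _ in morph.values()}
    G = {t: rng.normal(scale=0.1, size=dim) for _, t in morph.values()}
    bias = {w: 0.0 for w in morph}

    def vec(word):
        """Word vector = base-form vector + grammatical-form vector."""
        lemma, tag = morph[word]
        return B[lemma] + G[tag]

    def weight(x, x_max=100.0, alpha=0.75):
        """GloVe weighting function f(X_ij)."""
        return (x / x_max) ** alpha if x < x_max else 1.0

    lr = 0.05
    for epoch in range(200):
        for (w, c), x in cooc.items():
            vw, vc = vec(w), vec(c)
            # GloVe residual w.c + b_w + b_c - log X_wc, scaled by f(X_wc).
            g = weight(x) * (vw @ vc + bias[w] + bias[c] - np.log(x))
            lw, tw = morph[w]
            lc, tc = morph[c]
            # The gradient flows into both the lemma and the tag vectors.
            B[lw] -= lr * g * vc
            G[tw] -= lr * g * vc
            B[lc] -= lr * g * vw
            G[tc] -= lr * g * vw
            bias[w] -= lr * g
            bias[c] -= lr * g

    def analogy(a, b, c, vocab):
        """Standard analogy query: nearest word to vec(b) - vec(a) + vec(c)."""
        target = vec(b) - vec(a) + vec(c)
        return max(
            (w for w in vocab if w not in {a, b, c}),
            key=lambda w: (vec(w) @ target)
            / (np.linalg.norm(vec(w)) * np.linalg.norm(target)),
        )

One consequence of keeping the two tables separate is that every surface form with a given tag shares one tag vector, so grammatical regularities are learned once across the vocabulary rather than independently for each inflected form.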
References

[1] Manning C.D., Raghavan P., Schütze H., Introduction to Information Retrieval. Cambridge University Press, 2008.
[2] Sebastiani F., Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 2002, 34 (1), pp. 1–47.
[3] Tellex S., Katz B., Lin J., Fernandes A., Marton G., Quantitative evaluation of passage retrieval algorithms for question answering. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2003, pp. 41–47.
[4] Turian J., Ratinov L., Bengio Y., Word representations: a simple and general method for semi-supervised learning. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2010, pp. 384–394.
[5] Socher R., Bauer J., Manning C.D., Ng A.Y., Parsing with compositional vector grammars. In: ACL (1), 2013, pp. 455–465.
[6] Mikolov T., Yih W.-t., Zweig G., Linguistic regularities in continuous space word representations. In: HLT-NAACL, vol. 13, 2013, pp. 746–751.
[7] Mikolov T., Sutskever I., Chen K., Corrado G.S., Dean J., Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems 26, 2013, pp. 3111–3119.
[8] Pennington J., Socher R., Manning C.D., GloVe: Global vectors for word representation. In: EMNLP, vol. 14, 2014, pp. 1532–1543.
[9] Bengio Y., Ducharme R., Vincent P., Jauvin C., A neural probabilistic language model. Journal of Machine Learning Research, 2003, 3 (Feb), pp. 1137–1155.
[10] Mikolov T., Chen K., Corrado G., Dean J., Efficient estimation of word representations in vector space. CoRR, 2013, abs/1301.3781.
[11] Luong T., Socher R., Manning C.D., Better word representations with recursive neural networks for morphology. In: Proceedings of the Seventeenth Conference on Computational Natural Language Learning, CoNLL 2013, Sofia, Bulgaria, August 8–9, 2013, pp. 104–113.
[12] Botha J.A., Blunsom P., Compositional morphology for word representations and language modelling. In: ICML, 2014, pp. 1899–1907.
[13] Deerwester S., Dumais S.T., Furnas G.W., Landauer T.K., Harshman R., Indexing by latent semantic analysis. Journal of the American Society for Information Science, 1990, 41 (6), pp. 391–407.
[14] Duchi J., Hazan E., Singer Y., Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011, 12 (Jul), pp. 2121–2159.
[15] Miłkowski M., Polimorfologik, https://github.com/morfologik/polimorfologik, 2016.
[16] Maziarz M., Piasecki M., Szpakowicz S., Approaching plWordNet 2.0. In: Proceedings of the 6th Global Wordnet Conference, January 2012.
[17] Schnabel T., Labutov I., Mimno D.M., Joachims T., Evaluation methods for unsupervised word embeddings. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17–21, 2015, pp. 298–307.
