Hilberg’s Conjecture – a Challenge for Machine Learning

Łukasz Dębowski


We review three mathematical developments linked with Hilberg’s conjecture – a hypothesis about the power-law growth of entropy of texts in natural language, which sets up a challenge for machine learning. First, considerations concerning maximal repetition indicate that universal codes such as the Lempel-Ziv code may fail to efficiently compress sources that satisfy Hilberg’s conjecture. Second, Hilberg’s conjecture implies the empirically observed power-law growth of vocabulary in texts. Third, Hilberg’s conjecture can be explained by a hypothesis that texts describe consistently an infinite random object.

Słowa kluczowe: statistical language modeling, Hilberg’s conjecture, maximal repetition, grammar-based codes, Santa Fe processes

Charikar M., Lehman E., Lehman A., Liu D., Panigrahy R., Prabhakaran M., Sahai A., Shelat A., The smallest grammar problem, IEEE Transactions on Information Theory 51, 2005, pp. 2554–2576.

de Luca A., On the combinatorics of finite words, Theoretical Computer Science 218, 1999, pp. 13–39.

de Marcken C. G. Unsupervised Language Acquisition, PhD thesis, Massachussetts Institute of Technology, 1996.

Debowski Ł., Variable-length coding of two-sided asymptotically mean stationary measures, Journal of Theoretical Probability 23, 2010, pp. 237–256.

Debowski Ł., On the vocabulary of grammar-based codes and the logical consistency of texts, IEEE Transactions on Information Theory 57, 2011, pp. 4589–4599.

Debowski Ł., Maximal lengths of repeat in English prose. S. Naumann, P. Grzybek, R. Vulanovic, G. Altmann (Eds.), Synergetic Linguistics. Text and Language as Dynamic System, Praesens Verlag, Wien 2012, pp. 23–30.

Debowski Ł., Mixing, ergodic, and nonergodic processes with rapidly growing information between blocks, IEEE Transactions on Information Theory 58, 2012, pp. 3392–3401.

Debowski Ł., Empirical evidence for Hilberg’s conjecture in single-author texts. In: I. Obradovic, E. Kelih, R. K¨ohler (Eds.), Methods and Applications of Quantitative Linguistics – Selected papers of the 8th International Conference on Quantitative Linguistics (QUALICO), Academic Mind, Belgrade, 2013, pp. 143–151.

Debowski Ł., A preadapted universal switch distribution for testing Hilberg’s conjecture, http://arxiv.org/abs/1310.8511, 2013.

Debowski Ł., Estimation of entropy from subword complexity, http://www.ipipan.waw.pl/˜ldebowsk/, 2014.

Debowski Ł., Maximal repetitions in written texts: Finite energy hypothesis vs. strong Hilberg conjecture, http://www.ipipan.waw.pl/˜ldebowsk/, 2014.

Debowski Ł., A new universal code helps to distinguish natural language from random texts, http://www.ipipan.waw.pl/˜ldebowsk/, 2014.

Debowski Ł., On hidden Markov processes with infinite excess entropy, Journal of Theoretical Probability 27, 2014, pp. 539–551.

Ebeling W., P¨oschel T., Entropy and long-range correlations in literary English, Europhysics Letters 26, 1994, pp. 241–246.

Ebeling W., Nicolis G., Entropy of symbolic sequences: The role of correlations, Europhysics Letters 14, 1991, pp. 191–196.

Ebeling W., Nicolis G., Word frequency and entropy of symbolic sequences: A dynamical perspective, Chaos, Solitons and Fractals 2, 1992, pp. 635–650.

Graham R.L., Knuth D.E., Patashnik O., Concrete Mathematics. A Foundation for Computer Science, Addison-Wesley, 1994.

Guiraud P., Les caract`eres statistiques du vocabulaire, Presses Universitaires de France, Paris, France, 1954.

Heaps H.S., Information Retrieval – Computational and Theoretical Aspects, Academic Press, New York, USA, 1978.

Herdan G., Quantitative Linguistics, Butterworths, London, England, 1964.

Hilberg W., Der bekannte Grenzwert der redundanzfreien Information in Texten– eine Fehlinterpretation der Shannonschen Experimente? Frequenz 44, 1990, pp. 243–248.

Jelinek F., Statistical Methods for Speech Recognition, The MIT Press, Cambridge, Massachusetts, USA, 1997.

Kieffer J.C., Yang E., Grammar-based codes: A new class of universal lossless source codes, IEEE Transactions on Information Theory 46, 2000, pp. 737–754.

Kit Ch., Wilks Y., Unsupervised learning of word boundary with description length gain, M. Osborne and E.T.K. Sang (Eds.), Proceedings of the Computational Natural Language Learning ACL Workshop, Bergen, 1999, pp. 1–6.

K¨ohler R., Altmann G., Piotrowski R.G. (Eds.), Quantitative Linguistik. Ein internationales Handbuch / Quantitative Linguistics. An International Handbook, Walter de Gruyter, Berlin, Germany, 2005.

Kolmogorov A.N., Three approaches to the quantitative definition of information, Problems of Information Transmission 1 (1), 1965, pp. 1–7.

Kolpakov R., Kucherov G., On maximal repetitions in words, Journal of Discrete Algorithms 1, 1999, pp. 159–186.

Kuraszkiewicz W., Łukaszewicz J., The number of different words as a function of text length, Pamietnik Literacki (In Polish) 42 (1), 1951, pp. 168–182.

Li M., Vit´anyi P.M.B., An Introduction to Kolmogorov Complexity and Its Applications, 2nd ed. Springer, New York, USA, 1997.

Mandelbrot B., An informational theory of the statistical structure of languages, W. Jackson (Ed.), Communication Theory, Butterworths, London, England, 1953, pp. 486–502.

Mandelbrot B., Structure formelle des textes et communication, Word 10, 1954, pp. 1–27.

Markov A.A., An example of statistical investigation of the text ‘Eugene Onegin’ concerning the connection of samples in chains, Science in Context 19, 2006, pp. 591–600.

Menzerath P., ¨ Uber einige phonetische Probleme, Actes du premier Congres international de linguistes, Sijthoff, Leiden, Holland, 1928.

Nevill-Manning C.G., Inferring Sequential Structure, PhD thesis, University of Waikato, 1996.

Shannon C., A mathematical theory of communication, Bell System Technical Journal 30, 1948, pp. 379–423, 623–656.

Shannon C., Prediction and entropy of printed English, Bell System Technical Journal 30, 1951, pp. 50–64.

Shields P.C., String matching: The ergodic case, The Annals of Probability 20, 1992, pp. 1199–1203.

Shields P.C., Universal redundancy rates don’t exist, IEEE Transactions on Information Theory IT-39, 1993, pp. 520–524.

Shields P.C., String matching bounds via coding, The Annals of Probability 25, 1997, pp. 329–336.

Wolff J.G., Language acquisition and the discovery of phrase structure, Language and Speech 23, 1980, pp. 255–269.

Zipf G.K., The Psycho-Biology of Language: An Introduction to Dynamic Philology, 2nd ed. The MIT Press, Cambridge, Massachusetts, USA, 1965.

Ziv J., Lempel A., A universal algorithm for sequential data compression, IEEE Transactions on Information Theory, 23, 1977, pp. 337–343.

Czasopismo ukazuje się w sposób ciągły on-line.
Pierwotną formą czasopisma jest wersja elektroniczna.

Wersja papierowa czasopisma dostępna na www.wuj.pl