Parallelization of the Levenshtein distance algorithm

Artur Niewiarowski,

Marek Stanuszek

Abstrakt

This paper presents a method for the parallelization of the Levenshtein distance algorithm deployed on very large strings. The proposed approach was accomplished using .NET Framework 4.0 technology with a specific implementation of threads using the System. Threading.Task namespace library. The algorithms developed in this study were tested on a high performance machine using Xamarin Mono (for Linux RedHat/Fedora OS). The computational results demonstrate a high level of efficiency of the proposed parallelization procedure.

Słowa kluczowe: Levenshtein distance, Levenshtein-Damerau distance, edit distance, very large strings, parallel computing, threads, high performance computing, Microsoft .NET Framework 4.0, mono-project
References

Niewiarowski A., Stanuszek M., The mechanism of identification and classification of content, Studia Informatica, Vol. 34, 2B(112), Silesian University of Technology Press, Gliwice 2013, 205-222.

Niewiarowski A., Stanuszek M., Mechanism of analysis of similarity short texts, based on the Levenshtein distance, Studia Informatica. Vol. 34, 1 (110), Silesian University of Technology Press, Gliwice 2013, 107-114.

Niewiarowski A., Term frequency optimization for the vector space model, Czasopismo Techniczne, 9-M/2012, 155-165.

Kobzdej P., Waligóra D., Wielebińska K., Paprzycki M., Parallel Application of Levenshtein Distance to Establish Similarity Between Strings, International Journal of Computer Research, Vol. 12, No. 4, 2003, 625-633.

Mono-project (www.mono-project.com/What_is_Mono).

Левенштейн В.И., Двоичные коды с исправлением выпадений, вставок и замещений символов, Доклады Академий Наук СCCP 163 (4), 1965, 845-848.

Wypych M., Stochastic Spelling Correction of Texts in Polish, Institute of Linguistics, Adam Mickiewicz University, Poznań, Poland; Speech and Language Technology. Volume 6, Poznań 2002.

Damerau F.J., A technique for computer detection and correction of spelling errors, Communications of the ACM, 7 (3), 1964, 171-176.

Runkler T.A., Bezdek J.C., Web mining with relational clustering, International Journal of Approximate Reasoning, Vol. 32, Issues 2–3, February 2003, 217-236.

Niewiarowski A., Działanie parsera ‚Part-of-Speech Tagging’ w ujęciu mechanizmu Web Content Mining, Wydawnictwo VI Ogólnopolskiej Konferencji Naukowej Nauka i Przemysł, Politechnika Krakowska im. Tadeusza Kościuszki, Kraków 2011, 93-100.

Niewiarowski A., Stanuszek M., Parallelize edit distance algorithm, Proceedings. Seventh ACC Cyfronet AGH Users’ Conference, Academic Computer Centre Cyfronet AGH, Zakopane 2014, 31-32.

Niewiarowski A., Stanuszek M., Performance and quality of method for short text similarity algorithm based on edit distance and thesaurus, Proceedings. Seventh ACC Cyfronet AGH Users’ Conference, Academic Computer Centre Cyfronet AGH, Zakopane 2014, 33-34.

Ramos J., Using tf-idf to determine word relevance in document queries, Proceedings of the First Instructional Conference on Machine Learning, 2003.

Campbell C., Johnson R., Miller A., Toub S., Parallel Programming with Microsoft. NET. Design Patterns for Decomposition and Coordination on Multicore Architectures, Microsoft Press, 2010.

Niewiarowski A., Szybko zrozumieć Visual Basic 2012, Self Publishing. Kraków 2013, 66-73.