Article information

2017, Volume 22, No. 4, p. 11-21

Barakhnin V.B., Bakiyeva A.M., Bakiyev M.N., Tazhibayeva S.Z., Batura T.V., Lukpanova L.K.

Stemming and generation of word forms in automatic text processing systems in the Kazakh language

Purpose. There is an urgent need for automatic processing of texts in the Kazakh language. Morphological analysis during automatic text processing improves both the completeness and the accuracy of information retrieval results. Because Kazakh is an agglutinative language, a dictionary of word forms is impractical for automating morphological analysis; affix dictionaries and sets of rules are much more effective. This article proposes algorithms for the synthesis and analysis of Kazakh word forms.

Methodology. A distinctive feature of the proposed algorithms for stemming and word-form generation in the Kazakh language is that they rely on dividing words into inflectional classes. To implement these algorithms, we described sets of affix combination rules for all inflected parts of speech (noun, adjective, verb).
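
As a rough illustration of rule-based stemming with affix dictionaries grouped by inflectional class, the following Python sketch strips the longest matching affix combination from a word. The class labels (`noun_back_vowel`, `noun_front_vowel`), the tiny affix lists, and the `stem` function are hypothetical and chosen for illustration only; they are not the dictionaries or the algorithm described in the article.

```python
# Toy affix table: illustrative only, not the authors' dictionary.
# Keys are hypothetical inflectional-class labels; values are affix
# combinations stored whole (e.g. plural + accusative as one entry).
AFFIXES_BY_CLASS = {
    "noun_back_vowel": ["лар", "лары", "ларды"],
    "noun_front_vowel": ["лер", "лері", "лерді"],
}


def stem(word, min_stem_len=2):
    """Strip the longest known affix combination from `word`.

    Returns a tuple (stem, inflectional_class, affix). If no affix
    matches, the word is returned unchanged with class None.
    """
    best = (word, None, "")
    for cls, affixes in AFFIXES_BY_CLASS.items():
        for affix in affixes:
            long_enough = len(word) - len(affix) >= min_stem_len
            if word.endswith(affix) and long_enough and len(affix) > len(best[2]):
                best = (word[: -len(affix)], cls, affix)
    return best


if __name__ == "__main__":
    for w in ["балалар", "балаларды", "үйлер", "кітап"]:
        print(w, "->", stem(w))
```

Storing affix combinations as whole strings keeps the lookup to a simple longest-suffix match, at the cost of a larger table.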

Findings. A dictionary was developed in the course of the research. It includes about 2000 verbal affixes and their combinations for the 17 inflectional classes, and about 3500 affixes and their combinations (variants of endings) for nouns and adjectives. Some affix combinations are repeated. The system is supplemented with an exception dictionary of 18 nouns and 352 verbs whose word forms are formed by changing the stem. Dictionaries of this size are sufficient for analyzing texts on any subject. The generation and stemming modules are implemented in Python using the psycopg2 and collections libraries. The dictionaries are stored in a PostgreSQL database.
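
A minimal sketch of how word-form generation might combine an affix dictionary with an exception dictionary is given below. It makes several assumptions: in-memory structures stand in for the PostgreSQL tables, the row layout `(class, affix)` is hypothetical, and the helper names `group_affixes`, `generate`, and `generate_all` are invented for illustration. The two exception entries are ordinary Kazakh nouns whose stem loses a vowel before the third-person possessive affix; they are examples, not entries taken from the authors' dictionary.

```python
from collections import defaultdict

# Hypothetical (inflectional class, affix) rows, shaped like the result
# of a database query. In-memory data stands in for PostgreSQL here.
AFFIX_ROWS = [
    ("noun_back_vowel", "лар"),
    ("noun_back_vowel", "ларды"),
    ("noun_front_vowel", "лер"),
]

# Exception dictionary: (lemma, affix) -> full word form, for words
# whose stem changes when the affix is attached.
EXCEPTIONS = {
    ("орын", "ы"): "орны",
    ("халық", "ы"): "халқы",
}


def group_affixes(rows):
    """Group affixes by inflectional class, preserving row order."""
    grouped = defaultdict(list)
    for cls, affix in rows:
        grouped[cls].append(affix)
    return grouped


def generate(lemma, affix):
    """Return the word form, consulting the exception dictionary first."""
    return EXCEPTIONS.get((lemma, affix), lemma + affix)


def generate_all(lemma, cls, affixes_by_class):
    """Generate every form of `lemma` listed for its inflectional class."""
    return [generate(lemma, affix) for affix in affixes_by_class.get(cls, [])]


if __name__ == "__main__":
    affixes = group_affixes(AFFIX_ROWS)
    print(generate_all("бала", "noun_back_vowel", affixes))
    print(generate("орын", "ы"))
```

Checking the exception dictionary before plain concatenation keeps the regular rules simple, since irregular stems never reach them.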

Originality. We tested the software on words belonging to different parts of speech and found no errors, which supports the correctness of the proposed algorithms. The proposed algorithms can be applied at the morphological analysis stage of search engines, summarization systems, and question-answering systems, as well as in the construction of thesauri and ontologies.

Keywords: Kazakh language, stemming, generation, morphological analysis, affixes, inflectional classes

Author(s):
Barakhnin Vladimir Borisovich
Dr., Associate Professor
Position: Leading research officer
Office: Federal Research Center for Information and Computational Technologies
Address: 630090, Russia, Novosibirsk, Ac. Lavrentiev ave, 6
Phone Office: (383) 330 78 26
E-mail: bar@ict.nsc.ru
SPIN-code: 1541-0448

Bakiyeva Aigerim Muratovna
Office: Novosibirsk State University
Address: 630090, Russia, Novosibirsk, Ac. Lavrentiev ave, 6

Bakiyev Murat Nauryzbaevich
Office: L.N. Gumilyov Eurasian National University
Address: 010008, Kazakhstan, Astana, Ac. Lavrentiev ave, 6

Tazhibayeva Saule Zhaksylykbaevna
Office: L.N. Gumilyov Eurasian National University
Address: 010008, Kazakhstan, Astana, Ac. Lavrentiev ave, 6

Batura Tatiana Viktorovna
Office: Novosibirsk State University, A.P. Ershov Institute of Informatics Systems SB RAS
Address: 630090, Russia, Novosibirsk, Ac. Lavrentiev ave, 6

Lukpanova Lyazzat Khamitovna
Office: K.I. Satpayev Kazakh National Research Technical University
Address: 050013, Kazakhstan, Astana, Ac. Lavrentiev ave, 6

Bibliography link:
Barakhnin V.B., Bakiyeva A.M., Bakiyev M.N., Tazhibayeva S.Z., Batura T.V., Lukpanova L.K. Stemming and generation of word forms in automatic text processing systems in the Kazakh language // Computational technologies. 2017. V. 22. No. 4. P. 11-21
ISSN 1560-7534