Article information

2020, Volume 25, No. 5, p. 107–123

Krasnov F.V., Smaznevich I.S.

The explicability factor of the algorithm in the problems of searching for the similarity of text documents

The problem of providing any user with a comprehensible explanation of why an applied intelligent information system considers certain texts similar in meaning imposes significant requirements on the underlying intelligent algorithms. The article covers the entire set of technologies involved in solving the text clustering problem and states several conclusions.

Matrix decomposition aimed at reducing the dimension of the vector representation of a corpus does not give a user a clear explanation of the algorithmic principles. Ranking with the TF-IDF function and its modifications finds only a few documents that are similar in meaning; however, this method is the easiest for users to comprehend, since algorithms of this type detect specific matching words in the compared texts. Topic modeling methods (LSI, LDA, ARTM) assign large similarity values to texts that share only a few words, when a person can nevertheless easily tell that the texts cover the same general subject. Yet explaining how topic modeling works requires additional effort to interpret the detected topics. This interpretation becomes easier as the model quality grows, and the quality can be optimized through the average topic coherence. The experiment demonstrated that the absolute value of document similarity is not invariant across intelligent algorithms, so the optimal similarity threshold must be set separately for each problem to be solved.
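
A minimal Python sketch of the last observation (the code and the toy corpus are illustrative assumptions of this note, not the authors' implementation; scikit-learn stands in for the models examined in the article): a sparse TF-IDF model and an LSI-style model obtained by truncated SVD assign different absolute similarity values to the same pair of documents, which is why the similarity threshold has to be chosen separately for each model.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    # Toy corpus (hypothetical): the first two documents share most of their
    # vocabulary, the third one is on a different subject.
    corpus = [
        "the contract defines the delivery terms and the payment schedule",
        "payment schedule and delivery terms are fixed in the contract",
        "quarterly report on equipment maintenance at the northern site",
    ]

    # Sparse TF-IDF representation: similarity is driven by exact word overlap,
    # so the matching terms can be shown to the user as the explanation.
    tfidf = TfidfVectorizer().fit_transform(corpus)
    sim_tfidf = cosine_similarity(tfidf)

    # Dense LSI-style representation (truncated SVD of the TF-IDF matrix):
    # dimensionality reduction can rate documents as similar even with few
    # shared words, but the latent factors need separate interpretation.
    lsi = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
    sim_lsi = cosine_similarity(lsi)

    # The absolute values differ between the two models, so a single
    # threshold cannot be reused across algorithms.
    print("TF-IDF similarity(doc 0, doc 1):", round(float(sim_tfidf[0, 1]), 3))
    print("LSI similarity(doc 0, doc 1):", round(float(sim_lsi[0, 1]), 3))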

The results of this work can further be used to assess which of the various methods developed for detecting meaning similarity in texts can be effectively implemented in applied information systems, and to determine optimal model parameters based on the explicability requirements of the solution.
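
As a further hedged sketch of how the average topic coherence mentioned above can drive the choice of model parameters (the gensim library, the toy corpus and the parameter grid below are assumptions of this note, not taken from the article): several LDA models are fitted and the one with the highest average coherence is kept.

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel
    from gensim.models.coherencemodel import CoherenceModel

    # Toy tokenized corpus (hypothetical); in practice this would be the
    # lemmatized document collection of the applied information system.
    texts = [
        ["contract", "delivery", "payment", "schedule"],
        ["payment", "schedule", "contract", "terms"],
        ["equipment", "maintenance", "report", "site"],
        ["maintenance", "report", "equipment", "inspection"],
    ]
    dictionary = Dictionary(texts)
    bow_corpus = [dictionary.doc2bow(t) for t in texts]

    # Fit LDA models with different numbers of topics and keep the one with
    # the highest average coherence (u_mass needs only the corpus itself).
    best = None
    for num_topics in (2, 3):
        lda = LdaModel(corpus=bow_corpus, id2word=dictionary,
                       num_topics=num_topics, random_state=0, passes=10)
        coherence = CoherenceModel(model=lda, corpus=bow_corpus,
                                   dictionary=dictionary,
                                   coherence="u_mass").get_coherence()
        if best is None or coherence > best[0]:
            best = (coherence, num_topics)

    print("best average coherence %.3f with %d topics" % best)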

Keywords: explainable artificial intelligence, XAI, ranking function, document similarity

doi: 10.25743/ICT.2020.25.5.009

Author(s):
Krasnov Fedor Vladimirovich
PhD.
Office: NAUMEN R and D
Address: 620028, Russia, Ekaterinburg, 49A, Tatishcheva street
E-mail: fkrasnov@naumen.ru
SPIN-code: 8650-1127

Smaznevich Irina Sergeevna
PhD.
Position: business analyst
Office: NAUMEN R and D
Address: 620028, Russia, Ekaterinburg, 49A, Tatishcheva street
E-mail: ismaznevich@naumen.ru

References:
1. Pospelov D.A. “Soznanie”, “samosoznanie” i vychislitel’nye mashiny. Sistemnye issledovaniya. Metodologicheskie problemy. Ezhegodnik [“Consciousness”, “self-awareness” and computers. System research. Methodological problems. Yearbook]. Pod red. I.V. Blauberga, O.Ya. Gel’mana, V.P. Zinchenko. Moscow: Nauka; 1969: 178–184. (In Russ.)

2. Navrotskiy M.A., Zhukova N.A., Mouromtsev D.I., Mustafin N.G. Design, development and maintenance methodology of domain semantic portals of scientific and technical information. Scientific and Technical Journal of Information Technologies, Mechanics and Optics. 2018; 18(2):286–298.

3. Golenkov V.V., Guliakina N.A., Davydenko I.T., Shunkevich D.V., Eremeev A.P. Ontological design of hybrid semantically compatible intelligent systems based on sense representation of knowledge. Ontology of Designing. 2019; 9(1): 132–151.

4. Antonov V.V., Barmina O.V., Nikulina N.O. Decision-making support in software project management based on fuzzy ontology. Ontology of Designing. 2020; 10(1):121–140.

5. Golovko V.A., Golenkov V.V., Ivashenko V.P., Taberko V.V., Ivaniuk D.S., Kroshchanka A.A., Kovalev M.V. Integration of artificial neural networks and knowledge bases. Ontology of Designing. 2018; 8(3):366–386.

6. Minsky M., Kurzweil R., Mann S. The society of intelligent veillance. IEEE International Symposium on Technology and Society (ISTAS): Social Implications of Wearable Computing and Augmediated Reality in Everyday Life. 2013: 13–17.

7. Zanzotto F.M. Viewpoint: Human-in-the-loop artificial intelligence. Journal of Artificial Intelligence Research. 2019; (64): 243–252. DOI:10.1613/jair.1.11345.

8. Zhang S. How to invest my time: Lessons from human-in-the-loop entity extraction. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2019: 2305–2313.

9. Ng V., Rees E., Niu J., Zaghlool A., Ghiasbeglou H., Verster A. Application of natural language processing algorithms for extracting information from news articles in event-based surveillance. Canada Communicable Disease Report. 2020; 46(6):186–191.

10. Kluegl P., Toepfer M., Beck Ph.-D., Fette G., Puppe F. UIMA Ruta: Rapid development of rule-based information extraction applications. Natural Language Engineering. 2016; 22(1):1–40. DOI:10.1017/S1351324914000114.

11. Pauwels P., Zhang S., Lee Y.C. Semantic web technologies in AEC industry: A literature overview. Automation in Construction. 2017; (73):145–165.

12. Azad H.K., Deepak A. Query expansion techniques for information retrieval: A survey. Information Processing and Management. 2019; 56(5):1698–1735.

13. Bui Q.V., Sayadi K., Amor S.B., Bui M. Combining Latent Dirichlet Allocation and K-means for documents clustering: effect of probabilistic based distance measures. Asian Conference on Intelligent Information and Database Systems. Springer, Cham; 2017: 248–257.

14. Lample G., Ballesteros M., Subramanian S., Kawakami K., Dyer C. Neural architectures for named entity recognition. Proceedings of NAACL-HLT. 2016: 260–270. Available at: https://www.aclweb.org/anthology/N16-1030.pdf

15. McGovern A., Gagne D.J., Lagerquist R., Elmore K., Jergensen G.E. Making the black box more transparent: Understanding the physical implications of machine learning. Bulletin of the American Meteorological Society. 2019; 100(11):2175–2199. DOI:10.1175/BAMS-D-18-0195.

16. Gilpin L.H., Bau D., Yuan B.Z., Bajwa A., Specter M., Kagal L. Explaining explanations: An overview of interpretability of machine learning. 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA). Turin, Italy: IEEE; 2018: 80–89. DOI:10.1109/DSAA.2018.00018.

17. Hagras H. Toward human-understandable, explainable AI. Computer. 2018; 51(9):28–36.

18. Maaten L., Hinton G. Visualizing data using t-SNE. Journal of Machine Learning Research. 2008; 9(11):2579–2605.

19. Curry H.B. The method of steepest descent for non-linear minimization problems. Quarterly of Applied Mathematics. 1944; 2(3):258–261.

20. Sadeghi J., Sadeghi S., Niaki S.T.A. Optimizing a hybrid vendor-managed inventory and transportation problem with fuzzy demand: An improved particle swarm optimization algorithm. Information Sciences. 2014; (272):126–144.

21. Kennedy J., Eberhart R. Particle swarm optimization. Proceedings of ICNN’95 International Conference on Neural Networks. IEEE. 1995; (4):1942–1948.

22. Ribeiro M.T., Singh S., Guestrin C. “Why should I trust you?” Explaining the predictions of any classifier. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016: 1135–1144. Available at: https://www.aclweb.org/anthology/N16-3020.pdf

23. Deerwester S., Dumais S., Furnas G.W., Landauer T.K., Harshman R. Indexing by latent semantic analysis. Journal of the American Society for Information Science. 1990; 41(6):391–407.

24. Blei D.M., Ng A.Y., Jordan M.I. Latent Dirichlet allocation. Journal of Machine Learning Research. 2003; (3):993–1022.

25. Fevotte C., Idier J. Algorithms for nonnegative matrix factorization with the β-divergence. Neural Computation. 2011; 23(9):2421–2456. Available at: https://arxiv.org/pdf/1010.1763.pdf

26. Vorontsov K., Potapenko A. Additive regularization of topic models. Machine Learning. 2015; 101(1–3):303–323.

27. Huang G. Supervised word mover’s distance. Advances in Neural Information Processing Systems. 2016: 4862–4870.

28. Gomaa W.H., Fahmy A.A. A survey of text similarity approaches. International Journal of Computer Applications. 2013; 68(13):13–18.

29. Levandowsky M., Winter D. Distance between sets. Nature. 1971; 234(5323):34–35. DOI:10.1038/234034a0.

30. Salton G., Wu H. A term weighting model based on utility theory. Proceedings of SIGIR. New York: ACM; 1980: 9–22.

31. Robertson S., Zaragoza H. The probabilistic relevance framework: BM25 and beyond. Information Retrieval. 2009; 3(4):333–389. DOI:10.1561/1500000019.

32. Trotman A., Puurula A., Burgess B. Improvements to BM25 and language models examined. Proceedings of the 2014 Australasian Document Computing Symposium. Melbourne, Australia; 2014: 58–65.

33. Shavrina T., Shapovalova O. To the methodology of corpus construction for machine learning: “TAIGA” syntax tree corpus and parser. Trudy Mezhdunarodnoy Konferentsii “Korpusnaya Lingvistika — 2017”. Saint Petersburg: Izdatel’stvo SPbGU; 2017: 78–84.

34. Korobov M. Morphological analyzer and generator for Russian and Ukrainian languages. International Conference on Analysis of Images, Social Networks and Texts. Springer, Cham; 2015: 320–332.

35. Halko N., Martinsson P.G., Tropp J.A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review. 2011; 53(2):217–288.

36. Hofmann T. Probabilistic latent semantic indexing. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 1999: 50–57.

37. Newman D., Lau J.H., Grieser K., Baldwin T. Automatic evaluation of topic coherence. Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Los Angeles, California; 2010: 100–108.

38. Mimno D., Wallach H., Talley E., Leenders M., McCallum A. Optimizing semantic coherence in topic models. Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. Edinburgh, Scotland, UK; 2011: 262–272.

Bibliography link:
Krasnov F.V., Smaznevich I.S. The explicability factor of the algorithm in the problems of searching for the similarity of text documents // Computational technologies. 2020. V. 25. No. 5. P. 107–123