Beyond human labelling: an automatic topic identification framework for big web data


Abstract


Nowadays, the global amount of written text is growing at an ever-increasing rate: since 2011, the number of posts per minute on Facebook has grown from 650K to 3M. These unstructured data are the source of an enormous amount of information that can only be extracted by automatic engines. This is mainly accomplished by means of Natural Language Processing (NLP), a field of Artificial Intelligence devoted to analyzing and understanding human language as it is spoken and written. One common NLP task is topic identification, i.e., the recognition of a text's topic(s). Two popular methods for modeling latent topics are latent Dirichlet allocation (LDA) and the correlated topic model (CTM). Both assume that each word composing a document is associated with a latent topic, but they differ in the prior distribution assigned to topics, and thus show different pros and cons. In this work, LDA and CTM are tested and compared in a big data context by analyzing a large set of short documents automatically downloaded from the web by means of a modern crawler. In addition, under the assumption that each document is associated with a single topic, two new methods for the automatic classification of documents according to their real topic are proposed and tested, relying on LDA and CTM as (latent) topic model engines. Finally, under the more realistic hypothesis of multiple topics within a document, the two new methods, together with some combinations of the two, are tested as multi-class classification tools.
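To illustrate the single-topic setting described above, the following is a minimal, hypothetical sketch (not the authors' pipeline): fit an LDA model to a toy corpus of short documents and label each document with its dominant latent topic. The corpus, the number of topics, and the use of scikit-learn are assumptions for illustration only.

```python
# Hypothetical sketch of LDA-based topic assignment for short documents;
# this is NOT the exact method proposed in the paper.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus of short "web" documents (assumed for illustration).
docs = [
    "stock market shares trading investors",
    "football match goal league players",
    "market investors bank trading profit",
    "league season players coach goal",
]

# Bag-of-words representation of the corpus.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Fit an LDA model with a small, fixed number of latent topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)  # one row of topic proportions per document

# Under the single-topic assumption, each document is classified
# according to its dominant latent topic.
labels = doc_topic.argmax(axis=1)
```

In the multi-topic setting, one would instead keep the full row of topic proportions per document rather than collapsing it with `argmax`.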

Keywords: latent Dirichlet allocation; correlated topic model; automatic classification; textual data; topic identification

References


Aitchison, J. (2003). The Statistical Analysis of Compositional Data. The Blackburn Press, Caldwell, NJ, second edition.

Allahyari, M., Pouriyeh, S., Kochut, K., and Arabnia, H. R. (2017). A knowledge-based topic modeling approach for automatic topic labeling. International Journal of Advanced Computer Science and Applications, 8(9).

Blei, D. (2012). Probabilistic topic models. Communications of the ACM, 55:77–84.

Blei, D. and Lafferty, J. (2007). A correlated topic model of science. The Annals of Applied Statistics, 1(1):17–35.

Blei, D., Ng, A., and Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Griffiths, T. L. and Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl. 1):5228–5235.

He, D., Ren, Y., Khattak, A. M., Liu, X., Tao, S., and Gao, W. (2021). Automatic topic labeling using graph-based pre-trained neural embedding. Neurocomputing, 463:596–608.

Hulpus, I., Hayes, C., Karnstedt, M., and Greene, D. (2013). Unsupervised graph-based topic labelling using DBpedia. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM '13, pages 465–474, New York, NY, USA. Association for Computing Machinery.

Jelodar, H., Wang, Y., Yuan, C., Feng, X., Jiang, X., Li, Y., and Zhao, L. (2019). Latent Dirichlet allocation (LDA) and topic modeling: models, applications, a survey. Multimedia Tools and Applications, 78:15169–15211.

Jingxian, G. and Yong, Q. (2021). Selection of the optimal number of topics for LDA topic model – taking patent policy analysis as an example. Entropy, 23:1301.

Lau, J. H., Grieser, K., Newman, D., and Baldwin, T. (2011). Automatic labelling of topic models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1536–1545, Portland, Oregon, USA. Association for Computational Linguistics.

Lau, J. H., Newman, D., Karimi, S., and Baldwin, T. (2010). Best topic word selection for topic labelling. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, COLING '10, pages 605–613, USA. Association for Computational Linguistics.

Magatti, D., Calegari, S., Ciucci, D., and Stella, F. (2009). Automatic labeling of topics. In 2009 Ninth International Conference on Intelligent Systems Design and Applications, pages 1227–1232.

Manning, C. and Schütze, H. (1999). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge.

Mei, Q., Shen, X., and Zhai, C. (2007). Automatic labeling of multinomial topic models. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '07, pages 490–499, New York, NY, USA. Association for Computing Machinery.

Mimno, D., Wallach, H., and McCallum, A. (2008). Gibbs sampling for logistic normal topic models with graph-based priors. In NIPS Workshop on Analyzing Graphs, volume 61.

Mimno, D., Wallach, H., Talley, E., Leenders, M., and McCallum, A. (2011). Optimizing semantic coherence in topic models. In Proceedings of Empirical Methods in Natural Language Processing, pages 262–272.

Murshed, B., Mallappa, S., Abawajy, J., Saif, M., Al-ariki, H., and Abdulwahab, H. (2022). Short text topic modelling approaches in the context of big data: taxonomy, survey, and analysis. Artificial Intelligence Review.

Newman, D., Asuncion, A., Smyth, P., and Welling, M. (2009). Distributed algorithms for topic models. Journal of Machine Learning Research, 10(62):1801–1828.

Powers, D. and Turk, C. (1990). Machine Learning of Natural Language. Springer-Verlag, New York.

Röder, M., Both, A., and Hinneburg, A. (2015). Exploring the space of topic coherence measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pages 399–408.

Sbalchiero, S. and Eder, M. (2020). Topic modeling, long texts and the best number of topics. Some problems and solutions. Quality and Quantity, 54:1095–1108.

Sokolova, M. and Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing and Management, 45(4):427–437.

Sridhar, V. (2015). Unsupervised topic modeling for short texts using distributed representations of words. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 192–200.

Syed, S. and Spruit, M. (2017). Full-text or abstract? Examining topic coherence scores using latent Dirichlet allocation. In Proceedings of the International Conference on Data Science and Advanced Analytics (DSAA), pages 165–174.

Wood, J., Tan, P., Wang, W., and Arnold, C. (2017). Source-LDA: Enhancing probabilistic topic models using prior knowledge sources. In 2017 IEEE 33rd International Conference on Data Engineering (ICDE), pages 411–422.

Yan, X., Guo, J., Lan, Y., and Cheng, X. (2013). A biterm topic model for short texts. In Proceedings of the 22nd International Conference on World Wide Web, pages 1445–1456.

Yao, L., Mimno, D., and McCallum, A. (2009). Efficient methods for topic model inference on streaming document collections. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 937–946.



Creative Commons License
This work is licensed under a Creative Commons Attribution - NonCommercial - NoDerivatives 3.0 Italy License.