Text mining and semantic modeling of literary corpora: a machine learning–based study of Indonesian fiction

Authors

  • Rinda Widya Ikomah, Universitas Mataram, Indonesia
  • Zohaib Hassan Sain, Superior University, Pakistan

DOI:

https://doi.org/10.64595/lingtech.v2i1.133

Keywords:

affective intensity, Indonesian fiction, machine learning, semantic modeling, text mining

Abstract

Background: The large-scale digitization of Indonesian literary works has produced extensive textual corpora that challenge conventional close-reading approaches and call for systematic, data-driven methods capable of capturing thematic, semantic, and affective patterns in fiction.

Objective: This study aims to examine how text mining and semantic modeling can reveal lexical salience, intertextual relations, and narrative emotion in Indonesian fiction across different thematic orientations.

Method: Using a quantitative corpus-based design, the study analyzes 36 Indonesian literary texts published between 1980 and 2022 through TF–IDF–based lexical analysis, document-level semantic embeddings with cosine similarity and clustering, and sentence-level sentiment analysis.

Results: The findings show distinct lexical signatures that differentiate thematic clusters, coherent semantic groupings reflecting intertextual proximity, and sentiment trajectories dominated by neutral-to-negative polarity with strategically placed affective peaks across narrative progression.

Implication: These results demonstrate that computational methods can empirically support literary analysis without displacing interpretive criticism.

Novelty: The study integrates lexical, semantic, and affective modeling within a unified framework for Indonesian fiction, offering a scalable and replicable approach to digital literary studies.
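
The lexical weighting and similarity steps named in the Method paragraph can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not specify tokenization, the TF–IDF variant, or the embedding model, so the unsmoothed weighting below, the toy token lists, and the function names tf_idf_vectors and cosine_similarity are all assumptions made for illustration. TF–IDF vectors stand in here for the document-level semantic embeddings the study compares.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting: relative term frequency times ln(N / df),
    a plain unsmoothed TF-IDF variant chosen for illustration.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy "texts" (not from the study's corpus), already tokenized.
docs = [
    "laut ombak nelayan laut".split(),
    "laut ombak perahu".split(),
    "kota jalan gedung kota".split(),
]

vectors = tf_idf_vectors(docs)
sim_sea = cosine_similarity(vectors[0], vectors[1])    # share "laut", "ombak"
sim_mixed = cosine_similarity(vectors[0], vectors[2])  # no shared terms
print(sim_sea > sim_mixed)
```

In this toy setup the two sea-themed texts score higher than the sea and city pair, which shares no vocabulary and therefore scores zero. A pairwise similarity matrix built this way is the kind of input a clustering step could group into thematic clusters.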

Published

30-01-2026

How to Cite

Rinda Widya Ikomah, & Zohaib Hassan Sain. (2026). Text mining and semantic modeling of literary corpora: a machine learning–based study of Indonesian fiction. Lingua Technica: Journal of Digital Literary Studies, 2(1), 51–67. https://doi.org/10.64595/lingtech.v2i1.133
