<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3-mathml3.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:ali="http://www.niso.org/schemas/ali/1.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" dtd-version="1.3" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Artif. Intell.</journal-id>
<journal-title-group>
<journal-title>Frontiers in Artificial Intelligence</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Artif. Intell.</abbrev-journal-title>
</journal-title-group>
<issn pub-type="epub">2624-8212</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/frai.2026.1752103</article-id>
<article-version article-version-type="Version of Record" vocab="NISO-RP-8-2008"/>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Original Research</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>FinTextSim: a domain-specific sentence-transformer for extracting predictive latent topics from financial disclosures</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Jehnen</surname> <given-names>Simon</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Methodology" vocab-term-identifier="https://credit.niso.org/contributor-roles/methodology/">Methodology</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Conceptualization" vocab-term-identifier="https://credit.niso.org/contributor-roles/conceptualization/">Conceptualization</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Software" vocab-term-identifier="https://credit.niso.org/contributor-roles/software/">Software</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Project administration" vocab-term-identifier="https://credit.niso.org/contributor-roles/project-administration/">Project administration</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Visualization" vocab-term-identifier="https://credit.niso.org/contributor-roles/visualization/">Visualization</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Writing &#x2013; original draft" vocab-term-identifier="https://credit.niso.org/contributor-roles/writing-original-draft/">Writing &#x2013; original draft</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Investigation" vocab-term-identifier="https://credit.niso.org/contributor-roles/investigation/">Investigation</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Resources" vocab-term-identifier="https://credit.niso.org/contributor-roles/resources/">Resources</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Validation" vocab-term-identifier="https://credit.niso.org/contributor-roles/validation/">Validation</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Data curation" vocab-term-identifier="https://credit.niso.org/contributor-roles/data-curation/">Data curation</role>
<uri xlink:href="https://loop.frontiersin.org/people/3286253"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Villalba-D&#x000ED;ez</surname> <given-names>Javier</given-names></name>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
<xref ref-type="aff" rid="aff4"><sup>4</sup></xref>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Supervision" vocab-term-identifier="https://credit.niso.org/contributor-roles/supervision/">Supervision</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Project administration" vocab-term-identifier="https://credit.niso.org/contributor-roles/project-administration/">Project administration</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Validation" vocab-term-identifier="https://credit.niso.org/contributor-roles/validation/">Validation</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Methodology" vocab-term-identifier="https://credit.niso.org/contributor-roles/methodology/">Methodology</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Conceptualization" vocab-term-identifier="https://credit.niso.org/contributor-roles/conceptualization/">Conceptualization</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Writing &#x2013; review &amp; editing" vocab-term-identifier="https://credit.niso.org/contributor-roles/writing-review-editing/">Writing &#x2013; review &#x00026; editing</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Funding acquisition" vocab-term-identifier="https://credit.niso.org/contributor-roles/funding-acquisition/">Funding acquisition</role>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Ordieres-Mer&#x000E9;</surname> <given-names>Joaqu&#x000ED;n</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Supervision" vocab-term-identifier="https://credit.niso.org/contributor-roles/supervision/">Supervision</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Project administration" vocab-term-identifier="https://credit.niso.org/contributor-roles/project-administration/">Project administration</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Methodology" vocab-term-identifier="https://credit.niso.org/contributor-roles/methodology/">Methodology</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Writing &#x2013; review &amp; editing" vocab-term-identifier="https://credit.niso.org/contributor-roles/writing-review-editing/">Writing &#x2013; review &#x00026; editing</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Conceptualization" vocab-term-identifier="https://credit.niso.org/contributor-roles/conceptualization/">Conceptualization</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Funding acquisition" vocab-term-identifier="https://credit.niso.org/contributor-roles/funding-acquisition/">Funding acquisition</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Validation" vocab-term-identifier="https://credit.niso.org/contributor-roles/validation/">Validation</role>
<uri xlink:href="https://loop.frontiersin.org/people/106894"/>
</contrib>
</contrib-group>
<aff id="aff1"><label>1</label><institution>DEGIN Doctoral Program, Department of Industrial Management, Escuela T&#x000E9;cnica Superior de Ingenieros Industriales, Universidad Polit&#x000E9;cnica de Madrid</institution>, <city>Madrid</city>, <country country="es">Spain</country></aff>
<aff id="aff2"><label>2</label><institution>Beta Klinik GmbH</institution>, <city>Bonn</city>, <country country="de">Germany</country></aff>
<aff id="aff3"><label>3</label><institution>Fakult&#x000E4;t f&#x000FC;r Wirtschaft, Hochschule Heilbronn</institution>, <city>Heilbronn</city>, <country country="de">Germany</country></aff>
<aff id="aff4"><label>4</label><institution>Department of Mechanical Engineering, Universidad de La Rioja</institution>, <city>Logro&#x000F1;o</city>, <country country="es">Spain</country></aff>
<author-notes>
<corresp id="c001"><label>&#x0002A;</label>Correspondence: Joaqu&#x000ED;n Ordieres-Mer&#x000E9;, <email xlink:href="mailto:j.ordieres@upm.es">j.ordieres@upm.es</email></corresp>
</author-notes>
<pub-date publication-format="electronic" date-type="pub" iso-8601-date="2026-03-02">
<day>02</day>
<month>03</month>
<year>2026</year>
</pub-date>
<pub-date publication-format="electronic" date-type="collection">
<year>2026</year>
</pub-date>
<volume>9</volume>
<elocation-id>1752103</elocation-id>
<history>
<date date-type="received">
<day>22</day>
<month>11</month>
<year>2025</year>
</date>
<date date-type="rev-recd">
<day>10</day>
<month>01</month>
<year>2026</year>
</date>
<date date-type="accepted">
<day>16</day>
<month>01</month>
<year>2026</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2026 Jehnen, Villalba-D&#x000ED;ez and Ordieres-Mer&#x000E9;.</copyright-statement>
<copyright-year>2026</copyright-year>
<copyright-holder>Jehnen, Villalba-D&#x000ED;ez and Ordieres-Mer&#x000E9;</copyright-holder>
<license>
<ali:license_ref start_date="2026-03-02">https://creativecommons.org/licenses/by/4.0/</ali:license_ref>
<license-p>This is an open-access article distributed under the terms of the <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution License (CC BY)</ext-link>. The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</license-p>
</license>
</permissions>
<abstract>
<p>Recent advancements in information availability and computational capabilities have transformed the analysis of annual reports, integrating traditional financial metrics with insights from textual data. To extract actionable insights from this wealth of textual data, automated review processes, such as topic modeling, are essential. This study benchmarks classical approaches against contemporary neural techniques and introduces FinTextSim, a sentence-transformer finetuned for financial text. Using Item 7 and Item 7A of 10-K filings from S&#x00026;P 500 companies (2016&#x02013;2023), we systematically evaluate these models qualitatively and quantitatively. BERTopic in combination with FinTextSim consistently outperforms all alternatives, producing notably clearer, more coherent and financially relevant topic clusters. Compared to the most widely used standard embedding models and financial baselines, FinTextSim improves intratopic similarity by up to 71% and reduces intertopic similarity by more than 108%, highlighting the importance of domain-specific embeddings. Crucially, these qualitative gains translate into quantitative predictive benefits: incorporating FinTextSim-derived topic features into a logistic regression framework for corporate performance prediction leads to a statistically significant two-percentage-point increase in both ROC-AUC and F1-score over a purely financial baseline. In contrast, off-the-shelf sentence-transformers and classical topic models introduce noise that degrades predictive performance. For non-linear classifiers, several textual representations yield modest gains, reflecting their greater capacity to absorb noisier features. However, FinTextSim remains the most stable and consistently strong performer across both linear and non-linear settings. 
Overall, FinTextSim acts as a domain-adapted information filter, translating unstructured financial text into structured, semantically rich representations that human analysts and generic models often overlook. By bridging interpretability and predictive utility, it enables the extraction of economically relevant information from corporate narratives and supports more effective decision-making, resource allocation, and corporate performance forecasting.</p></abstract>
<kwd-group>
<kwd>artificial intelligence</kwd>
<kwd>BERTopic</kwd>
<kwd>company performance prediction</kwd>
<kwd>FinTextSim</kwd>
<kwd>LDA</kwd>
<kwd>machine learning</kwd>
<kwd>topic modeling</kwd>
</kwd-group>
<funding-group>
<award-group id="gs1">
<funding-source id="sp1">
<institution-wrap>
<institution>Agencia Estatal de Investigaci&#x000F3;n</institution>
<institution-id institution-id-type="doi" vocab="open-funder-registry" vocab-identifier="10.13039/open_funder_registry">10.13039/501100011033</institution-id>
</institution-wrap>
</funding-source>
</award-group>
<funding-statement>The author(s) declared that financial support was received for this work and/or its publication. JO-M and JV-D want to acknowledge the partial support by the Spanish &#x0201C;Agencia Estatal de Investigaci&#x000F3;n&#x0201D; through the grant PID2022-137748OB-C31 funded by MCIN/AEI/10.13039/501100011033 and &#x0201C;ERDF A way of making Europe.&#x0201D;</funding-statement>
</funding-group>
<counts>
<fig-count count="5"/>
<table-count count="5"/>
<equation-count count="0"/>
<ref-count count="133"/>
<page-count count="18"/>
<word-count count="14318"/>
</counts>
<custom-meta-group>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>AI in Finance</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec sec-type="introduction" id="s1">
<label>1</label>
<title>Introduction</title>
<p>In recent years, the increasing availability of information (<xref ref-type="bibr" rid="B26">Chen and Ji, 2025</xref>; <xref ref-type="bibr" rid="B109">Sun et al., 2026</xref>) and advances in computational capabilities have transformed the analysis of annual reports, including 10-K filings. These filings are among the most critical disclosures (<xref ref-type="bibr" rid="B55">Griffin, 2003</xref>; <xref ref-type="bibr" rid="B61">Hajek and Munk, 2024</xref>), providing a standardized snapshot of a company&#x00027;s financial situation through both numerical and textual data (<xref ref-type="bibr" rid="B88">Masson and Paroubek, 2020</xref>). Traditional evaluations of 10-K filings have focused on retrospective quantitative financial metrics, while textual data remains underexplored (<xref ref-type="bibr" rid="B63">Hida and Do Nascimento, 2026</xref>). However, growing evidence shows that qualitative textual components also carry predictive power for future performance (<xref ref-type="bibr" rid="B30">Cohen et al., 2020</xref>; <xref ref-type="bibr" rid="B11">Ashtiani and Raahemi, 2023</xref>; <xref ref-type="bibr" rid="B92">Nazareth and Reddy, 2023</xref>; <xref ref-type="bibr" rid="B132">Zhu, 2026</xref>; <xref ref-type="bibr" rid="B121">Wang et al., 2023</xref>; <xref ref-type="bibr" rid="B44">Frankel et al., 2022</xref>; <xref ref-type="bibr" rid="B106">Siano, 2025</xref>). While these studies demonstrate the predictive potential of textual disclosures, they largely adopt end-to-end predictive frameworks and provide limited insight into how alternative textual representations, particularly topic-based representations, differ in their ability to extract economically meaningful information. 
Thus, integrating these textual insights with financial metrics provides a more comprehensive basis for decision-making, benefiting investors, analysts, and regulators (<xref ref-type="bibr" rid="B65">Hsieh and Hristova, 2022</xref>; <xref ref-type="bibr" rid="B115">Ueda et al., 2024</xref>).</p>
<p>Within 10-K filings, Item 7 and Item 7A are particularly valuable. Item 7, the Management Discussion &#x00026; Analysis (MD&#x00026;A), presents management&#x00027;s perspective on various aspects, including operations, performance, risks, opportunities, and strategies to address future challenges (<xref ref-type="bibr" rid="B30">Cohen et al., 2020</xref>). Item 7A provides qualitative and quantitative disclosures about market risk. As 10-K filings are mandatory for publicly traded companies, they represent a rich source of financial text that requires systematic and scalable analysis. Manual review, however, is both time-consuming and prone to subjectivity bias (<xref ref-type="bibr" rid="B60">Hagen, 2018</xref>; <xref ref-type="bibr" rid="B67">Huang et al., 2025</xref>). The growing volume of available information (<xref ref-type="bibr" rid="B99">Rashid et al., 2019</xref>; <xref ref-type="bibr" rid="B123">Wang Y. et al., 2024</xref>) further increases the risk of information overload (<xref ref-type="bibr" rid="B85">Lu, 2022</xref>), making it essential to allocate resources efficiently (<xref ref-type="bibr" rid="B83">Liu, 2022</xref>; <xref ref-type="bibr" rid="B96">Pufahl et al., 2025</xref>). Automated approaches, such as topic modeling, address these challenges by uncovering latent topics and summarizing large text corpora (<xref ref-type="bibr" rid="B18">Blei et al., 2003</xref>; <xref ref-type="bibr" rid="B108">Song et al., 2025</xref>; <xref ref-type="bibr" rid="B31">Curiskis et al., 2020</xref>). A key advantage of topic modeling is its unsupervised nature. While supervised approaches often require extensive annotated datasets, which are infeasible in most real-world settings, unsupervised methods scale more efficiently (<xref ref-type="bibr" rid="B111">Taha, 2023</xref>).</p>
<p>Classical topic modeling approaches, including Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF), rely on the bag-of-words (BoW) assumption: each document is treated as a collection of words, disregarding their sequential order. This limits the model&#x00027;s ability to capture the semantic meaning of text. Neural topic modeling approaches address this issue by employing contextual embeddings (<xref ref-type="bibr" rid="B17">Blair et al., 2020</xref>), which capture semantic and contextual relationships between texts (<xref ref-type="bibr" rid="B19">Booker et al., 2024</xref>; <xref ref-type="bibr" rid="B16">Bhattacharya and Mickovic, 2024</xref>). Sentence-transformers further improve efficiency and semantic similarity comparisons (<xref ref-type="bibr" rid="B100">Reimers and Gurevych, 2019</xref>). These text representations are crucial, as they must faithfully reflect a document&#x00027;s content while distinguishing it from others (<xref ref-type="bibr" rid="B109">Sun et al., 2026</xref>), enabling advanced applications such as BERTopic (<xref ref-type="bibr" rid="B57">Grootendorst, 2022</xref>). Despite the widespread use of topic modeling and contextual embeddings in general Natural Language Processing (NLP), little is known about their effectiveness in financial applications, where specialized terminology and domain-specific context are critical (<xref ref-type="bibr" rid="B16">Bhattacharya and Mickovic, 2024</xref>).</p>
<p>To address this gap, we develop and evaluate FinTextSim, a sentence-transformer finetuned specifically for financial text. General-purpose models, such as all-MiniLM-L6-v2 (AM) and all-mpnet-base-v2 (MPNET), have become standard baselines due to their strong performance across a wide range of domains. Yet, they are not optimized for the semantic and contextual nuances of financial language. Furthermore, existing models tailored for the financial domain are primarily optimized for sentiment analysis (e.g., <xref ref-type="bibr" rid="B10">Araci, 2019</xref>; <xref ref-type="bibr" rid="B82">Li et al., 2023</xref>; <xref ref-type="bibr" rid="B59">Guo et al., 2024</xref>). As a result, their suitability for topic modeling and semantic clustering in financial text remains an open empirical question. In contrast, FinTextSim is explicitly designed to capture domain-specific semantic structure. Functioning as a domain-adapted information filter, FinTextSim mitigates a fundamental information processing and retrieval bottleneck in financial text analysis. By distilling unstructured narratives into structured, semantically rich representations that emphasize economically meaningful relations, it extracts signals often overlooked by both human analysts and generic models. Beyond model development, we systematically evaluate multiple topic modeling algorithms, comparing classical approaches with contemporary neural techniques. This dual benchmarking across embedding models and topic modeling paradigms provides the first comprehensive evaluation of topic modeling for financial text. Moreover, we demonstrate the practical relevance of FinTextSim-enhanced BERTopic, which generates higher-quality and financially relevant insights with direct implications for research, business valuation, and stock price prediction.</p>
<p>Extending this analysis, we integrate the outputs of topic models into a machine learning (ML) framework to assess their informational value for corporate performance prediction. Corporate performance prediction is a central objective in accounting and financial research, as accurate forecasts are closely linked to future excess investment returns (<xref ref-type="bibr" rid="B119">Veganzones and Severin, 2025</xref>; <xref ref-type="bibr" rid="B24">Cao and You, 2024</xref>; <xref ref-type="bibr" rid="B39">Easton et al., 2024</xref>; <xref ref-type="bibr" rid="B114">Uddin et al., 2022</xref>; <xref ref-type="bibr" rid="B27">Chen et al., 2022</xref>). Although several studies emphasize the potential of NLP and topic modeling to enhance corporate performance prediction (<xref ref-type="bibr" rid="B95">Peng, 2025</xref>; <xref ref-type="bibr" rid="B61">Hajek and Munk, 2024</xref>; <xref ref-type="bibr" rid="B113">Theodorakopoulos et al., 2025</xref>; <xref ref-type="bibr" rid="B78">Lee and Anderl, 2025</xref>), systematic evidence on how alternative textual representations, particularly topic-based representations, contribute incremental value when combined with quantitative financial indicators remains limited. To address this second gap, our approach combines topic-document distributions derived from topic models with fundamental financial indicators, allowing ML models to exploit both quantitative and qualitative information. This design enables us to assess which topic modeling approach most effectively quantifies qualitative textual information to improve corporate performance prediction, and to evaluate the robustness of these textual representations across both linear and non-linear predictive frameworks.</p>
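<p>As an illustrative sketch of this design (all values below are synthetic and the model choice is one of several evaluated in this study), topic-document distributions can be concatenated with fundamental financial indicators into a single feature matrix before fitting a classifier:</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical sketch: combine per-filing topic-document distributions
# (qualitative signal) with fundamental financial indicators (quantitative
# signal) into one design matrix for a linear classifier. All numbers are
# randomly generated for illustration only.
rng = np.random.default_rng(0)
n_filings = 200

topic_dist = rng.dirichlet(alpha=np.ones(5), size=n_filings)  # 5 topic shares per filing
financials = rng.normal(size=(n_filings, 3))                  # e.g. scaled ROA, leverage, size
X = np.hstack([financials, topic_dist])                       # combined feature matrix

# Synthetic binary performance label depending on both feature groups.
y = (financials[:, 0] + topic_dist[:, 0]
     + 0.1 * rng.normal(size=n_filings) > 0.2).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)
```

In this setup each filing contributes eight features: three financial indicators plus its five topic shares, which sum to one by construction.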
<p>We will explore the following research questions based on Item 7 and Item 7A from S&#x00026;P500 companies between 2016 and 2023:</p>
<list list-type="simple">
<list-item><p>RQ1 How can we leverage contextual embeddings for the financial domain?</p></list-item>
<list-item><p>RQ2 Which topic modeling approach yields the highest-quality and most coherent topics?</p></list-item>
<list-item><p>RQ3 Which topic modeling approach proves best in organizing and summarizing our large-scale financial text dataset?</p></list-item>
<list-item><p>RQ4 Does topic modeling improve corporate performance prediction?</p></list-item>
</list>
<p>The remainder of the paper is organized as follows. Section 2 reviews the state-of-the-art literature and methodologies. Section 3 describes our study&#x00027;s materials and methods, including the training procedure of FinTextSim. Section 4 presents and discusses the main findings. Finally, Section 5 concludes.</p></sec>
<sec id="s2">
<label>2</label>
<title>State of the art</title>
<p>The following subsections provide an overview of topic modeling approaches and corporate performance prediction, setting the foundation for the algorithms and methodologies used in this study.</p>
<sec>
<label>2.1</label>
<title>Classical topic modeling approaches</title>
<p>Among classical topic modeling approaches, we highlight LDA and NMF. Both operate under the BoW assumption, treating each document as a mixture of underlying topics and each topic as a mixture of words. Accordingly, they assign prevalence of terms to topics (&#x003B2;) and topics to documents (&#x003B3;) (<xref ref-type="bibr" rid="B18">Blei et al., 2003</xref>). To ensure robust performance, several preprocessing steps are typically applied, including tokenization, stopword removal and lemmatization or stemming of words (<xref ref-type="bibr" rid="B15">Bellstam et al., 2021</xref>; <xref ref-type="bibr" rid="B46">Fu et al., 2021</xref>; <xref ref-type="bibr" rid="B4">Albalawi et al., 2020</xref>).</p>
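<p>As a minimal illustration of these preprocessing steps (the stopword list below is a tiny placeholder; production pipelines use full stopword lists plus a lemmatizer or stemmer):</p>

```python
import re

# Illustrative preprocessing for BoW topic models: lowercasing,
# tokenization, and stopword removal. The stopword set here is a
# small placeholder, not a complete list.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are"}

def preprocess(document: str) -> list[str]:
    """Lowercase, split into alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return [t for t in tokens if t not in STOPWORDS]

tokens = preprocess("Revenue increased due to the expansion of operations.")
# tokens -> ['revenue', 'increased', 'due', 'expansion', 'operations']
```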
<sec>
<label>2.1.1</label>
<title>Latent Dirichlet allocation</title>
<p>LDA is the most widely applied topic modeling approach in the literature. It is a three-level parametric hierarchical Bayesian model. By defining a hypothetical generative process for documents, LDA works backwards to infer the topics that could have generated the documents (<xref ref-type="bibr" rid="B1">Abdelrazek et al., 2023</xref>). The model is governed by three key hyperparameters (<xref ref-type="bibr" rid="B18">Blei et al., 2003</xref>): the number of topics (<italic>k</italic>), the concentration parameter of the Dirichlet prior of the document-topic distribution (&#x003B1;), and the parameter controlling the distribution of words across topics (&#x003B7;) (<xref ref-type="bibr" rid="B43">Fernandes et al., 2020</xref>). These hyperparameters significantly influence the quality and stability of the generated topics. Yet, their selection remains challenging due to the inherent complexity of textual data (<xref ref-type="bibr" rid="B87">Maier et al., 2018</xref>; <xref ref-type="bibr" rid="B3">Agrawal et al., 2018</xref>).</p>
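<p>These three hyperparameters map directly onto common implementations. A hedged scikit-learn sketch (the toy corpus, topic count, and prior values are chosen purely for illustration):</p>

```python
# Sketch of fitting LDA with scikit-learn, mapping the hyperparameters
# from the text onto the estimator's arguments:
#   k     -> n_components
#   alpha -> doc_topic_prior
#   eta   -> topic_word_prior
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "revenue growth driven by strong product sales",
    "interest rate risk and market volatility exposure",
    "product sales increased while costs declined",
    "hedging against interest rate and currency risk",
]

bow = CountVectorizer().fit_transform(docs)  # BoW document-term matrix
lda = LatentDirichletAllocation(
    n_components=2,         # k: number of topics
    doc_topic_prior=0.1,    # alpha: Dirichlet prior on document-topic mix
    topic_word_prior=0.01,  # eta: Dirichlet prior on topic-word mix
    random_state=0,
).fit(bow)

gamma = lda.transform(bow)  # document-topic distributions (rows sum to 1)
```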
<p>Despite its popularity, LDA faces several limitations. LDA is sensitive to the order of training data. As a result, topic structures can vary when the training data is shuffled, introducing systematic errors into studies (<xref ref-type="bibr" rid="B3">Agrawal et al., 2018</xref>). Furthermore, overlapping topics can occur, as LDA extracts topics from word distributions independently (<xref ref-type="bibr" rid="B22">Campbell et al., 2015</xref>).</p>
<p>LDA has been used in various fields. <xref ref-type="bibr" rid="B13">Bao and Datta (2014)</xref> pioneered the integration of unsupervised learning methods into Management Accounting and Finance using LDA to analyze risk disclosures from 10-K reports. <xref ref-type="bibr" rid="B38">Dyer et al. (2017)</xref> examined topics contributing to the lengthening of 10-K reports over time, while <xref ref-type="bibr" rid="B20">Brown et al. (2020)</xref> identified topics predicting financial misreporting. <xref ref-type="bibr" rid="B33">Deveikyte et al. (2022)</xref> employ LDA to predict market volatility. In additional financial studies, LDA has been used to quantify the economic content in communications, identify central subjects or to estimate innovation capabilities, among other applications (<xref ref-type="bibr" rid="B70">Jegadeesh and Wu, 2017</xref>; <xref ref-type="bibr" rid="B84">Lowry et al., 2020</xref>; <xref ref-type="bibr" rid="B15">Bellstam et al., 2021</xref>; <xref ref-type="bibr" rid="B49">Garc&#x000ED;a-M&#x000E9;ndez et al., 2023</xref>).</p></sec>
<sec>
<label>2.1.2</label>
<title>Non-negative matrix factorization</title>
<p>NMF takes a decompositional, non-probabilistic approach to topic modeling, factorizing the input term-document matrix <italic>A</italic> into the product of the term-topic matrix <italic>W</italic> and the topic-document matrix <italic>H</italic> (<xref ref-type="bibr" rid="B79">Lee and Seung, 1999</xref>). By evaluating the discrepancy between <italic>A</italic> and <italic>W</italic>&#x000D7;<italic>H</italic> using the squared Frobenius norm, the topic modeling problem is framed as an optimization task restricted to non-negative entries (<xref ref-type="bibr" rid="B120">Wang and Zhang, 2023</xref>). Unlike LDA, NMF does not rely on Bayesian priors, although the number of topics still needs to be specified by the user.</p>
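<p>A minimal scikit-learn sketch of this factorization follows (illustrative corpus; note that scikit-learn orients the factors as documents&#x000D7;topics and topics&#x000D7;terms, the transpose of the Lee-Seung term&#x000D7;document convention, while still minimizing the squared Frobenius norm over non-negative entries):</p>

```python
# Hedged NMF sketch on a toy corpus. scikit-learn factorizes
# A (documents x terms) ~ W (documents x topics) @ H (topics x terms),
# minimizing ||A - WH||^2 subject to non-negativity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "credit risk disclosure in bank filings",
    "quarterly revenue and profit growth",
    "bank capital and credit exposure",
    "profit margins grew on higher revenue",
]

A = TfidfVectorizer().fit_transform(docs)  # non-negative input matrix
model = NMF(n_components=2, init="nndsvda", random_state=0)
W = model.fit_transform(A)                 # document-topic weights
H = model.components_                      # topic-term weights
```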
<p>While NMF offers advantages in simplicity and computational efficiency (<xref ref-type="bibr" rid="B41">Egger and Yu, 2022</xref>), it also faces several challenges. Compared to LDA, it lacks a solid statistical foundation and a defined generative model. Additionally, NMF relies on anchor words to enforce a block diagonal structure in the term-topic matrix <italic>W</italic>, ensuring consistent solutions (<xref ref-type="bibr" rid="B37">Donoho and Stodden, 2003</xref>; <xref ref-type="bibr" rid="B52">Gillis and Vavasis, 2014</xref>). This assumption posits that each topic is associated with a unique anchor word, absent in other topics (<xref ref-type="bibr" rid="B52">Gillis and Vavasis, 2014</xref>). Given the multifaceted nature of words, this assumption can be considered fragile (<xref ref-type="bibr" rid="B120">Wang and Zhang, 2023</xref>). Another assumption of NMF is that each topic has at least one &#x0201C;pure document,&#x0201D; a document discussing only that specific topic (<xref ref-type="bibr" rid="B52">Gillis and Vavasis, 2014</xref>). This assumption is particularly fragile for longer documents.</p>
<p>NMF has applications in various fields and domains. In finance, <xref ref-type="bibr" rid="B28">Chen et al. (2017)</xref> used NMF and other topic modeling methods on 10-K and 8-K filings of bank holding companies to distinguish failed from non-failed banks. Additionally, <xref ref-type="bibr" rid="B21">Cai et al. (2022)</xref> applied NMF to assess the impact of risk factor disclosures on bond pricing. In other fields, NMF has been primarily employed for short-text topic modeling (<xref ref-type="bibr" rid="B29">Chen et al., 2019</xref>; <xref ref-type="bibr" rid="B4">Albalawi et al., 2020</xref>; <xref ref-type="bibr" rid="B41">Egger and Yu, 2022</xref>).</p></sec>
<sec>
<label>2.1.3</label>
<title>Wrapup of classical topic modeling approaches</title>
<p>Classical topic modeling approaches offer both advantages and disadvantages. A main advantage is the easier interpretation of hyperparameters, which aids troubleshooting and model interpretation. However, the disadvantages become increasingly pronounced with more complex corpora (<xref ref-type="bibr" rid="B1">Abdelrazek et al., 2023</xref>). Classical models are particularly susceptible to the following issues:</p>
<list list-type="order">
<list-item><p>BoW Assumption: context and semantic relationships cannot be captured (<xref ref-type="bibr" rid="B91">Murphy et al., 2024</xref>); misrepresentation of topics and documents possible (<xref ref-type="bibr" rid="B57">Grootendorst, 2022</xref>),</p></list-item>
<list-item><p>Interpretability of topics (<xref ref-type="bibr" rid="B22">Campbell et al., 2015</xref>; <xref ref-type="bibr" rid="B108">Song et al., 2025</xref>),</p></list-item>
<list-item><p>Reliability, validity, and subjectivity: outcomes depend heavily on manual preprocessing choices and hyperparameter selection (<xref ref-type="bibr" rid="B12">Baden et al., 2022</xref>).</p></list-item>
</list></sec></sec>
<sec>
<label>2.2</label>
<title>Contemporary topic modeling approaches</title>
<p>Modern methodologies address the issues of classical topic modeling approaches by utilizing advanced text embedding techniques (<xref ref-type="bibr" rid="B17">Blair et al., 2020</xref>). The following subsections provide an overview of the evolution of contemporary techniques and a detailed examination of BERTopic, a state-of-the-art topic modeling approach.</p>
<sec>
<label>2.2.1</label>
<title>Evolution of contemporary topic modeling approaches</title>
<p>The integration of contextual embeddings has transformed topic modeling by moving beyond the BoW assumption, enabling models to better capture semantic relationships within text (<xref ref-type="bibr" rid="B17">Blair et al., 2020</xref>). These advances are rooted in key developments in NLP. The transformer architecture revolutionized the field by relying entirely on attention mechanisms, allowing models to capture long-range dependencies and contextual information (<xref ref-type="bibr" rid="B118">Vaswani et al., 2017</xref>). Encoder-only models such as BERT (<xref ref-type="bibr" rid="B34">Devlin et al., 2019</xref>) further advanced deep contextualized language modeling, while subsequent improvements (<xref ref-type="bibr" rid="B124">Warner et al., 2024</xref>) increased efficiency and performance on classification and retrieval tasks. Despite their strengths, encoder-only models are not designed for large-scale semantic similarity tasks. Sentence-transformers addressed this limitation by refining encoder-only models with siamese or triplet architectures, enabling efficient and precise similarity assessments (<xref ref-type="bibr" rid="B100">Reimers and Gurevych, 2019</xref>). They produce embeddings that reflect semantic similarity, providing a powerful foundation for neural topic models. Building on these advances, modern topic modeling approaches combine contextual embeddings with clustering techniques. For instance, centroid-based methods group embeddings into clusters and interpret words closest to the centroid as representative of the topic (<xref ref-type="bibr" rid="B105">Sia et al., 2020</xref>; <xref ref-type="bibr" rid="B8">Angelov, 2020</xref>). While computationally efficient, this approach can be fragile, since real-world clusters do not always follow spherical distributions, leading to potential misrepresentation of topics (<xref ref-type="bibr" rid="B57">Grootendorst, 2022</xref>). A promising approach for topic modeling based on contextual embeddings, addressing these centroid-based clustering issues, is BERTopic (<xref ref-type="bibr" rid="B57">Grootendorst, 2022</xref>).</p></sec>
<sec>
<label>2.2.2</label>
<title>BERTopic</title>
<p>BERTopic structures topic modeling into five sequential steps. First, document embeddings are generated using a pre-trained sentence-transformer, leveraging advances in modern language models (<xref ref-type="bibr" rid="B57">Grootendorst, 2022</xref>; <xref ref-type="bibr" rid="B58">Gu et al., 2024</xref>). Second, dimensionality reduction is applied to improve computational efficiency and clustering accuracy (<xref ref-type="bibr" rid="B5">Allaoui et al., 2020</xref>). Third, the reduced embeddings are clustered into semantically similar groups, i.e., topics. Fourth, documents are tokenized. Finally, token importance within topics is determined using class-based tfidf (c-tfidf), which weighs tokens by their importance within each topic and enables efficient extraction of topic representations.</p>
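<p>To make the weighting concrete, the following minimal sketch computes c-tfidf weights as defined by <xref ref-type="bibr" rid="B57">Grootendorst (2022)</xref>; the toy topic classes and vocabulary are purely illustrative.</p>

```python
import math
from collections import Counter

def c_tf_idf(class_tokens):
    """Class-based tf-idf: W[t, c] = tf[t, c] * log(1 + A / f[t]),
    where tf[t, c] is the frequency of term t in class c, f[t] its
    corpus-wide frequency, and A the average number of words per class."""
    tf = {c: Counter(tokens) for c, tokens in class_tokens.items()}
    f = Counter()
    for counts in tf.values():
        f.update(counts)
    avg_words = sum(sum(c.values()) for c in tf.values()) / len(tf)
    return {c: {t: n * math.log(1 + avg_words / f[t]) for t, n in counts.items()}
            for c, counts in tf.items()}

# toy classes built by concatenating the sentences assigned to each topic
classes = {"revenue": "revenue sales growth revenue".split(),
           "risk": "risk exposure hedging risk".split()}
weights = c_tf_idf(classes)
```

Terms frequent within one class but rare across classes receive the highest weights, which is what makes the top-weighted tokens usable as topic representations.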
<p>Despite its advantages, BERTopic also faces challenges. It tends to produce many closely interconnected topics that may vary across repeated runs (<xref ref-type="bibr" rid="B41">Egger and Yu, 2022</xref>). This variability undermines the consistency of results, and the difficulty of interpreting its hyperparameters further hinders troubleshooting and diminishes reliability (<xref ref-type="bibr" rid="B1">Abdelrazek et al., 2023</xref>). Moreover, BERTopic assumes that each document relates to a single topic, potentially oversimplifying real-world complexity (<xref ref-type="bibr" rid="B57">Grootendorst, 2022</xref>). Additionally, the sentence-transformer models used for the document embedding step perform optimally on sentences or paragraphs, while longer documents are truncated (<xref ref-type="bibr" rid="B100">Reimers and Gurevych, 2019</xref>). Furthermore, processing large amounts of data can lead to high computation times (<xref ref-type="bibr" rid="B57">Grootendorst, 2022</xref>).</p>
<p>Due to its novelty, applications of BERTopic are still in their infancy. In a financial context, <xref ref-type="bibr" rid="B76">Kim et al. (2022)</xref> utilized BERTopic on Item 1A from 10-K filings. They examined whether identified topics can enhance the accuracy of ESG rating predictions and quantify each topic&#x00027;s relative contribution to the final rating prediction. In other contexts, BERTopic has been applied in various studies: <xref ref-type="bibr" rid="B103">S&#x000E1;nchez-Franco and Rey-Moreno (2022)</xref> analyzed customer reviews, <xref ref-type="bibr" rid="B2">Abuzayed and Al-Khalifa (2021)</xref> explored its application with pre-trained Arabic language models, <xref ref-type="bibr" rid="B41">Egger and Yu (2022)</xref> evaluated its performance on Twitter data, and <xref ref-type="bibr" rid="B56">Grigore and Pintilie (2023)</xref> extended BERTopic to predict individuals&#x00027; responses to a questionnaire based on their social media activity.</p></sec></sec>
<sec>
<label>2.3</label>
<title>Topic modeling of Item 7 and Item 7A</title>
<p>Our research is driven by several motivations regarding the choice of documents and analysis techniques. Item 7 and Item 7A stand out as particularly crucial sections in 10-K reports (<xref ref-type="bibr" rid="B16">Bhattacharya and Mickovic, 2024</xref>). The MD&#x00026;A section (Item 7) provides a narrative that contextualizes the presented numbers. In this section, management offers its individual perspective, which is essential for understanding the company&#x00027;s strategic direction and potential challenges. Additionally, the MD&#x00026;A section offers the most leeway and flexibility, making it rich with insights and indicative of future performance (<xref ref-type="bibr" rid="B30">Cohen et al., 2020</xref>). Item 7A focuses on market risks, containing valuable information regarding the company&#x00027;s prospective performance. Analyzing these sections enables extraction of textual information relevant for predicting future firm performance. While Items 7 and 7A are our primary focus, we also analyze Items 1 and 1A, which are widely recognized for their economic relevance (<xref ref-type="bibr" rid="B69">Jamshed et al., 2025</xref>; <xref ref-type="bibr" rid="B76">Kim et al., 2022</xref>). This allows us to test FinTextSim&#x00027;s generalizability, with results for Items 1 and 1A reported in the <xref ref-type="supplementary-material" rid="SM1">Supplementary material</xref>. Whereas most prior work focuses on social media data (e.g., <xref ref-type="bibr" rid="B108">Song et al., 2025</xref>; <xref ref-type="bibr" rid="B131">Zheng et al., 2025</xref>; <xref ref-type="bibr" rid="B72">Ji and Han, 2022</xref>; <xref ref-type="bibr" rid="B33">Deveikyte et al., 2022</xref>), we extract and structure firm- and management-specific information from 10-K reports. To operationalize this analysis, we rely on topic modeling (<xref ref-type="bibr" rid="B98">Ranta et al., 2022</xref>; <xref ref-type="bibr" rid="B1">Abdelrazek et al., 2023</xref>).</p>
<p>Despite methodological advances, applications of topic modeling in finance remain scarce. LDA still dominates applied topic modeling, although more powerful approaches such as BERTopic are available (<xref ref-type="bibr" rid="B40">Egger and Yu, 2021</xref>; <xref ref-type="bibr" rid="B17">Blair et al., 2020</xref>). To bridge this gap, we benchmark classical models alongside contemporary ones, focusing on BERTopic. We demonstrate that FinTextSim, a finetuned sentence-transformer, substantially enhances BERTopic&#x00027;s ability to produce precise and coherent financial topics. Beyond improving research quality, FinTextSim contributes to the democratization of knowledge-intensive, expert-driven tasks (<xref ref-type="bibr" rid="B129">Zhang et al., 2026</xref>; <xref ref-type="bibr" rid="B50">Garc&#x000ED;a-M&#x000E9;ndez et al., 2024</xref>), enabling more efficient and effective interpretation of disclosures for both analysts and non-experts. It also lays the foundation for aspect-based managerial sentiment analysis, further improving predictive models in valuation and stock price forecasting (<xref ref-type="bibr" rid="B48">Garc&#x000ED;a-M&#x000E9;ndez et al., 2023</xref>; <xref ref-type="bibr" rid="B115">Ueda et al., 2024</xref>).</p></sec>
<sec>
<label>2.4</label>
<title>Corporate performance prediction</title>
<p>Forecasting corporate performance is a central objective in accounting and finance research due to its proven relationship with excess investment returns and capital market efficiency (<xref ref-type="bibr" rid="B94">Ou and Penman, 1989</xref>; <xref ref-type="bibr" rid="B24">Cao and You, 2024</xref>; <xref ref-type="bibr" rid="B119">Veganzones and Severin, 2025</xref>). Traditional approaches relied on statistical, regression-based models (<xref ref-type="bibr" rid="B94">Ou and Penman, 1989</xref>). More recently, ML techniques have gained prominence for their ability to learn complex patterns from large-scale financial data. These models uncover economically meaningful relationships between historical financial variables and future performance, generating significant abnormal returns when used for portfolio formation (<xref ref-type="bibr" rid="B68">Hunt et al., 2019</xref>; <xref ref-type="bibr" rid="B114">Uddin et al., 2022</xref>; <xref ref-type="bibr" rid="B27">Chen et al., 2022</xref>). Collectively, these studies highlight the growing potential of ML-based approaches to extract predictive insights that surpass those of human analysts or traditional benchmarks (<xref ref-type="bibr" rid="B23">Campbell et al., 2024</xref>; <xref ref-type="bibr" rid="B116">Van Binsbergen et al., 2023</xref>; <xref ref-type="bibr" rid="B9">Aoki et al., 2025</xref>).</p>
<p>Despite these advances, notable limitations remain. Most existing applications rely predominantly on structured numerical data. While ML models based on financial indicators can correct analyst biases and uncover hidden dependencies (<xref ref-type="bibr" rid="B23">Campbell et al., 2024</xref>; <xref ref-type="bibr" rid="B116">Van Binsbergen et al., 2023</xref>), they fail to capture forward-looking managerial information that is explicitly communicated through narrative sections such as the MD&#x00026;A (<xref ref-type="bibr" rid="B9">Aoki et al., 2025</xref>). Recent studies have begun to incorporate textual disclosures using ML models to predict corporate performance (e.g., <xref ref-type="bibr" rid="B44">Frankel et al., 2022</xref>; <xref ref-type="bibr" rid="B106">Siano, 2025</xref>). However, these approaches largely adopt end-to-end predictive frameworks and do not systematically compare alternative textual representations. Although prior work highlights the potential of NLP to improve corporate performance predictions (<xref ref-type="bibr" rid="B95">Peng, 2025</xref>; <xref ref-type="bibr" rid="B126">Xinyue et al., 2020</xref>; <xref ref-type="bibr" rid="B74">Jun et al., 2022</xref>; <xref ref-type="bibr" rid="B113">Theodorakopoulos et al., 2025</xref>), evidence on which types of textual representations, particularly topic-based representations, provide incremental value beyond standard financial indicators remains limited.</p>
<p>Our study addresses this gap by integrating financial indicators with topic modeling outputs to assess the incremental informational value of textual representations for corporate performance prediction. Specifically, we integrate topic-document distributions derived from Item 7 and Item 7A of 10-K filings with fundamental financial indicators in an ML framework to predict firms&#x00027; Return on Assets (ROA). We demonstrate that topic representations derived from BERTopic in combination with FinTextSim yield the most consistent predictive improvements when integrated with financial indicators, particularly in linear models. While several textual representations provide modest gains in more flexible non-linear models, FinTextSim is the only approach that improves performance reliably across both linear and non-linear settings. This finding suggests that domain-specific language models can effectively quantify qualitative disclosures, boosting both interpretability and reliability in corporate performance forecasting.</p></sec></sec>
<sec sec-type="materials|methods" id="s3">
<label>3</label>
<title>Materials and methods</title>
<p>In the following subsections, we outline the materials and methods of our study. This section is divided into several parts: sourcing the dataset, creating an enhanced financial keyword list, training FinTextSim, creating the topic models, presenting the metrics used to evaluate the performance of the topic models, and describing the downstream task of predicting corporate performance.</p>
<sec>
<label>3.1</label>
<title>Dataset</title>
<p>Our study focuses exclusively on Item 7 and Item 7A of 10-K reports while avoiding survivorship bias. Given their greater significance, we deliberately choose 10-K over 10-Q reports (<xref ref-type="bibr" rid="B55">Griffin, 2003</xref>). We source our data from the Notre Dame Software Repository for Accounting and Finance in text-file format, which underwent a &#x0201C;Stage One Parse&#x0201D; to remove all HTML tags.<xref ref-type="fn" rid="fn0003"><sup>1</sup></xref></p>
<p>To avoid survivorship bias, we include 10-K filings of all companies that have been listed in the S&#x00026;P 500 index between 2016 and 2023. Using a regular expression-based extractor, we isolate the text from the start of Item 7 to the start of Item 8. We refer to this combination of Item 7 and Item 7A as &#x0201C;documents.&#x0201D; To ensure comparability, documents containing fewer than 250 words are discarded.<xref ref-type="fn" rid="fn0004"><sup>2</sup></xref> Additional outlier documents are removed using z-scores, excluding documents more than two standard deviations from the mean length. Text preprocessing methods are applied to improve model performance and comparability across methods (<xref ref-type="bibr" rid="B107">Siino et al., 2024</xref>), including replacing contractions as well as removing URLs and numerical characters.</p>
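<p>The length-based filtering described above can be sketched as follows; the 250-word minimum and two-standard-deviation cutoff follow the text, while the function name and implementation details are illustrative assumptions.</p>

```python
from statistics import mean, stdev

def filter_documents(docs, min_words=250, z_cutoff=2.0):
    """Drop documents below the minimum word count, then remove length
    outliers more than z_cutoff standard deviations from the mean."""
    kept = [d for d in docs if len(d.split()) >= min_words]
    lengths = [len(d.split()) for d in kept]
    mu, sigma = mean(lengths), stdev(lengths)
    return [d for d, n in zip(kept, lengths) if abs(n - mu) <= z_cutoff * sigma]
```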
<p><xref ref-type="table" rid="T1">Table 1</xref> summarizes the number of documents at each preprocessing step.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Dataset.</p></caption>
<table frame="box" rules="all">
<thead>
<tr>
<th valign="top" align="left"><bold>Preprocessing step</bold></th>
<th valign="top" align="center"><bold>&#x00023; Documents</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Extracted documents</td>
<td valign="top" align="center">4,754</td>
</tr>
<tr>
<td valign="top" align="left">Outlier documents</td>
<td valign="top" align="center">629</td>
</tr>
<tr>
<td valign="top" align="left">Remaining documents in database</td>
<td valign="top" align="center">4,125</td>
</tr>
<tr>
<td valign="top" align="left">Number of sentences</td>
<td valign="top" align="center">2,178,712</td>
</tr></tbody>
</table>
</table-wrap>
<p>As BERTopic assumes single-topic documents and sentence-transformers and NMF perform best on short inputs (<xref ref-type="bibr" rid="B57">Grootendorst, 2022</xref>; <xref ref-type="bibr" rid="B100">Reimers and Gurevych, 2019</xref>; <xref ref-type="bibr" rid="B29">Chen et al., 2019</xref>), we tokenize each of the remaining 4,125 documents into individual sentences. This avoids losing information through truncation and prevents misleading single-topic assumptions for multi-topic MD&#x00026;A sections. As a result, our dataset contains 2,178,712 sentences.</p></sec>
<sec>
<label>3.2</label>
<title>Keyword list</title>
<p>To train FinTextSim, we build on an established keyword framework for financial text. The foundation is the economic anchorword list for 10-K and 10-Q reports proposed by <xref ref-type="bibr" rid="B81">Li (2010)</xref>, which covers eleven domains.<xref ref-type="fn" rid="fn0005"><sup>3</sup></xref> Subsequent work by <xref ref-type="bibr" rid="B42">Fengler and Phan (2025)</xref> expanded this list by identifying semantically related terms with a Word2Vec model trained on MD&#x00026;A sections of 10-K filings. Building on this evolution, we further refine the list to include common performance indicators and operational terms, and broaden it with a dedicated topic on Environmental Sustainability, reflecting the growing importance of ESG-related disclosures (<xref ref-type="bibr" rid="B53">Giudici and Wu, 2025</xref>; <xref ref-type="bibr" rid="B125">Xie et al., 2025</xref>).<xref ref-type="fn" rid="fn0006"><sup>4</sup></xref></p></sec>
<sec>
<label>3.3</label>
<title>FinTextSim</title>
<p>To accurately cluster semantically similar financial text, we introduce FinTextSim. FinTextSim is a sentence-transformer model specifically finetuned to enhance contextual embeddings for the financial domain. Given the financial jargon and its domain-specific nuances, off-the-shelf (OTS), general-purpose sentence-transformers fall short. Existing models tailored for the financial domain are primarily optimized for sentiment analysis (e.g., <xref ref-type="bibr" rid="B10">Araci, 2019</xref>; <xref ref-type="bibr" rid="B82">Li et al., 2023</xref>; <xref ref-type="bibr" rid="B59">Guo et al., 2024</xref>). By finetuning FinTextSim on financial text, we aim to improve the quality of generated topics, enhancing semantic coherence and separation between topics, bridging the gap between general-purpose models and the specific demands of financial text analysis.</p>
<p>We construct a labeled dataset from the corpus described in Section 3.1, using a dictionary-based approach that leverages the keyword list from Section 3.2. To this end, we create a keyword-sentence matrix by iterating over each word in every sentence and matching substrings to keywords. This approach allows recognition of variations such as &#x0201C;logistics&#x0201D; or &#x0201C;logistical&#x0201D; for the keyword &#x0201C;logistic.&#x0201D; Sentences containing two or more keywords from a single topic are labeled accordingly. This procedure ensures topic distinctiveness and provides a reliable ground truth for training, consistent with data-centric perspectives on model quality (<xref ref-type="bibr" rid="B35">Di Gennaro et al., 2024</xref>). To prevent overemphasis on repeated phrasings, only unique sentences are retained. Finally, our dataset comprises 113,291 labeled sentences. To avoid data leakage, we train the model using a temporal split. Data from 2016&#x02013;2021 is used for training while data from 2022&#x02013;2023 is reserved for testing. Following these steps, we obtain 85,903 training sentences and 27,388 test sentences. To assess the robustness of FinTextSim to reduced lexical cues, we conduct an additional evaluation in which 50% of the label-inducing keywords are randomly masked in the test set. Masking is applied only at evaluation time while the trained model remains unchanged, allowing us to examine whether learned representations generalize beyond explicit keyword presence. Results of this masked evaluation are reported in the <xref ref-type="supplementary-material" rid="SM1">Supplementary material</xref>.</p>
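<p>The dictionary-based labeling step can be sketched as follows. Word-prefix matching stands in for the substring matching described above, the topics and keywords shown are illustrative rather than the actual anchor-word list, and sentences matching more than one topic are discarded as one way of enforcing topic distinctiveness.</p>

```python
def label_sentence(sentence, topic_keywords, min_hits=2):
    """Assign a topic when at least `min_hits` of its keywords occur as
    word prefixes in the sentence ('logistic' matches 'logistical')."""
    words = sentence.lower().split()
    labels = []
    for topic, keywords in topic_keywords.items():
        hits = {kw for kw in keywords if any(w.startswith(kw) for w in words)}
        if len(hits) >= min_hits:
            labels.append(topic)
    # keep only sentences matched by exactly one topic for distinctiveness
    return labels[0] if len(labels) == 1 else None

topics = {  # illustrative keywords, not the actual list
    "liquidity": ["cash", "liquid", "borrow"],
    "operations": ["logistic", "supply", "manufactur"],
}
```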
<p>FinTextSim is trained using BatchHardTripletLoss, following methods outlined by <xref ref-type="bibr" rid="B100">Reimers and Gurevych (2019)</xref> and <xref ref-type="bibr" rid="B34">Devlin et al. (2019)</xref>. Unlike standard triplet loss, BatchHardTripletLoss dynamically selects the hardest positive (most dissimilar within the same class) and the hardest negative (most similar from a different class) for each anchor in the batch. This strategy forces the model to learn more discriminative embeddings, leading to faster convergence and improved representation quality (<xref ref-type="bibr" rid="B62">Hermans et al., 2017</xref>). As the base model, we select ModernBERT, a recent advancement in encoder-only architectures (<xref ref-type="bibr" rid="B124">Warner et al., 2024</xref>). We adapt it with a mean pooling and a normalization layer to enhance its performance for sentence similarity tasks (<xref ref-type="bibr" rid="B100">Reimers and Gurevych, 2019</xref>). Finally, we train FinTextSim with a batch size of 200 and a margin of five. Following this contrastive learning-based training approach, we aim to improve latent semantic discovery of financial topics (<xref ref-type="bibr" rid="B86">Luo et al., 2024</xref>).</p>
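<p>The batch-hard selection rule can be sketched in pure Python, assuming Euclidean distances within a batch; the margin of five follows the training setup, while the toy embeddings are illustrative.</p>

```python
import math

def batch_hard_triplet_loss(embeddings, labels, margin=5.0):
    """For each anchor, pick the farthest same-label example (hardest
    positive) and the closest different-label example (hardest negative),
    then average the hinge losses max(0, d_pos - d_neg + margin)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    losses = []
    for i, (e_i, l_i) in enumerate(zip(embeddings, labels)):
        pos = [dist(e_i, e_j)
               for j, (e_j, l_j) in enumerate(zip(embeddings, labels))
               if j != i and l_j == l_i]
        neg = [dist(e_i, e_j)
               for e_j, l_j in zip(embeddings, labels) if l_j != l_i]
        if pos and neg:  # anchors need both a positive and a negative
            losses.append(max(0.0, max(pos) - min(neg) + margin))
    return sum(losses) / len(losses)
```

Well-separated classes drive the loss to zero, whereas overlapping classes leave a positive loss that pushes embeddings of different topics apart.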
<p>We evaluate FinTextSim by comparing its embeddings with those generated by AM, MPNET, and distilroberta-finetuned-financial-news-sentiment-analysis (DR), using intra- and intertopic similarity (see Section 3.5.2). Being the most downloaded models for sentence similarity tasks on the Hugging Face website, AM and MPNET serve as robust baselines. DR is the most prominent model for financial sentiment analysis, acting as domain benchmark. To examine embedding structure, we visualize the learned representations using Uniform Manifold Approximation and Projection (UMAP). Compared to dimensionality reduction alternatives, such as t-SNE or PCA, UMAP better preserves both local and global structure (<xref ref-type="bibr" rid="B5">Allaoui et al., 2020</xref>; <xref ref-type="bibr" rid="B8">Angelov, 2020</xref>). For UMAP, we employ the following essential hyperparameters:</p>
<list list-type="bullet">
<list-item><p>Minimum distance: 0, to encourage closely grouped data points, facilitating the formation of clusters representing semantically similar documents.</p></list-item>
<list-item><p>Distance metric: Cosine similarity, standard for NLP similarity tasks.</p></list-item>
<list-item><p>n_neighbors: 125, prioritizing global structures in our data to identify overarching macrotopics as well as hierarchically lower-ranked microtopics (<xref ref-type="bibr" rid="B8">Angelov, 2020</xref>).</p></list-item>
</list>
<p>We share the labeled dataset alongside FinTextSim&#x00027;s training code in the following Github Repository: <ext-link ext-link-type="uri" xlink:href="https://github.com/JehnenS/FinTextSim">https://github.com/JehnenS/FinTextSim</ext-link>.</p></sec>
<sec>
<label>3.4</label>
<title>Model creation</title>
<sec>
<label>3.4.1</label>
<title>Classical approaches</title>
<p>For the classical topic modeling approaches, we follow widely adopted preprocessing steps: stopword removal, lemmatization and term frequency&#x02013;inverse document frequency (tf&#x02013;idf) weighting. We remove stopwords using financial domain-specific lists provided by the Notre Dame Software Repository for Accounting and Finance.<xref ref-type="fn" rid="fn0007"><sup>5</sup></xref> Next, we lemmatize words to reduce vocabulary size. We deliberately choose lemmatization over stemming, as it preserves the interpretability of words better (<xref ref-type="bibr" rid="B87">Maier et al., 2018</xref>). To capture multi-word expressions, we construct bigrams and trigrams, combining terms that frequently occur together. We then build a dictionary and corpus representation of the texts and apply tf&#x02013;idf weighting to emphasize informative words. Finally, we employ LDA and NMF with the number of topics fixed at 12, aligning with the number of domains in our keyword list.</p></sec>
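<p>One common way to implement the bigram construction step uses the default collocation scorer of gensim&#x00027;s Phrases model, (count(ab) &#x02212; min_count) &#x000B7; |V| / (count(a) &#x000B7; count(b)); the sketch below reimplements it in plain Python, and the thresholds and toy corpus are illustrative.</p>

```python
from collections import Counter

def detect_bigrams(token_lists, min_count=2, threshold=0.5):
    """Merge adjacent tokens when their collocation score
    (count(ab) - min_count) * |V| / (count(a) * count(b))
    exceeds the threshold (the default scorer of gensim's Phrases)."""
    unigrams, pairs = Counter(), Counter()
    for tokens in token_lists:
        unigrams.update(tokens)
        pairs.update(zip(tokens, tokens[1:]))
    vocab = len(unigrams)
    merged = []
    for tokens in token_lists:
        out, skip = [], False
        for a, b in zip(tokens, tokens[1:] + [None]):
            if skip:        # second half of a merged bigram
                skip = False
                continue
            if b is not None and pairs[(a, b)] >= min_count:
                score = (pairs[(a, b)] - min_count) * vocab / (unigrams[a] * unigrams[b])
                if score > threshold:
                    out.append(f"{a}_{b}")
                    skip = True
                    continue
            out.append(a)
        merged.append(out)
    return merged
```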
<sec>
<label>3.4.2</label>
<title>Contemporary approaches</title>
<p>For the contemporary approaches, we generate contextual embeddings using FinTextSim, AM and MPNET. Each embedding model is applied within BERTopic under identical settings, ensuring that embedding choice is the only factor influencing performance. Dimensionality reduction is performed using UMAP, which preserves both global and local structures (<xref ref-type="bibr" rid="B5">Allaoui et al., 2020</xref>; <xref ref-type="bibr" rid="B8">Angelov, 2020</xref>) and scales effectively to large datasets (<xref ref-type="bibr" rid="B8">Angelov, 2020</xref>). We configure UMAP with the same settings as in Section 3.3. To strike a balance between clustering efficiency and information retention, we reduce the dimensionality to ten components. For clustering, we adopt Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN). HDBSCAN accommodates clusters of varying size and shape, models noise as outliers and avoids forcing unrelated documents into topics (<xref ref-type="bibr" rid="B89">McInnes and Healy, 2017</xref>). We use the following hyperparameters:</p>
<list list-type="bullet">
<list-item><p>Minimum cluster size: 5,000, to prioritize global over highly local topics.</p></list-item>
<list-item><p>Minimum number of samples: 50, to reduce the number of outliers by requiring denser cluster formation.</p></list-item>
</list>
<p>We then vectorize documents using a CountVectorizer, removing financial stopwords. To extract relevant financial topics, we apply c-tfidf weighting, reduce overly common words and incorporate seed words from our keyword list with a weighting multiplier of 50. This guides the model toward generating finance-specific, domain-relevant topics while limiting generic clusters.</p></sec></sec>
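<p>The seed-word guidance can be sketched as a simple multiplier on topic&#x02013;term weights; the multiplier of 50 follows the text, while the function name and toy weights are illustrative.</p>

```python
def boost_seed_words(weights, seed_words, multiplier=50.0):
    """Multiply the weight of seed terms in each topic's term-weight
    dictionary, steering representations toward domain vocabulary."""
    return {topic: {term: w * (multiplier if term in seed_words else 1.0)
                    for term, w in terms.items()}
            for topic, terms in weights.items()}
```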
<sec>
<label>3.5</label>
<title>Topic model evaluation</title>
<p>To compare the performance of the topic models, we focus on two fundamental tasks (<xref ref-type="bibr" rid="B18">Blei et al., 2003</xref>; <xref ref-type="bibr" rid="B108">Song et al., 2025</xref>):</p>
<list list-type="order">
<list-item><p>Topic Quality: Ability to uncover interpretable topics in financial texts.</p></list-item>
<list-item><p>Organizing Power: Organizing and structuring documents into distinct, meaningful groups.</p></list-item>
</list>
<p>The following subsections detail how we operationalize these tasks and how we adapt evaluation to the financial domain.</p>
<sec>
<label>3.5.1</label>
<title>Topic quality</title>
<p>To assess topic quality, we use NPMI coherence (<xref ref-type="bibr" rid="B99">Rashid et al., 2019</xref>; <xref ref-type="bibr" rid="B127">Yadavilli et al., 2024</xref>; <xref ref-type="bibr" rid="B112">Tang et al., 2025</xref>; <xref ref-type="bibr" rid="B109">Sun et al., 2026</xref>). NPMI measures the strength of association between words by comparing observed co-occurrence with expected independence. Following <xref ref-type="bibr" rid="B101">R&#x000F6;der et al. (2015)</xref>, NPMI coherence is computed with a sliding window. For classical models, we maintain the default window size of ten. Due to the shorter sentence lengths resulting from stopword removal in classical models, we adjust the window size for BERTopic. Based on the ratio between sentence lengths of BERTopic versus classical models, we set the window size for BERTopic to 20, guaranteeing comparable context coverage. Moreover, we lemmatize BERTopic&#x00027;s input texts and topic representations to reduce the impact of divergent vocabulary sizes. For each model, we use the five most representative words per topic, balancing informativeness with interpretability (<xref ref-type="bibr" rid="B3">Agrawal et al., 2018</xref>).</p>
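<p>A minimal sketch of sliding-window NPMI coherence follows; window handling and the smoothing constant are simplified relative to <xref ref-type="bibr" rid="B101">R&#x000F6;der et al. (2015)</xref>, and the toy corpus is illustrative.</p>

```python
import math
from itertools import combinations

def npmi_coherence(topic_words, texts, window=10, eps=1e-12):
    """Average NPMI over all pairs of topic words, with probabilities
    estimated from sliding windows over the tokenized texts."""
    windows = []
    for tokens in texts:
        if len(tokens) <= window:
            windows.append(set(tokens))
        else:
            windows += [set(tokens[i:i + window])
                        for i in range(len(tokens) - window + 1)]
    n = len(windows)

    def p(*words):
        return sum(all(w in win for w in words) for win in windows) / n

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        joint = p(w1, w2)
        pmi = math.log((joint + eps) / (p(w1) * p(w2) + eps))
        scores.append(pmi / -math.log(joint + eps))
    return sum(scores) / len(scores)
```

Word pairs that always co-occur approach +1, independent pairs score near 0, and pairs that never co-occur tend toward -1.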
<p>Raw coherence scores alone do not guarantee financial relevance. To address this, we complement them with topic accuracy, evaluated by human experts. For each topic, ten representative sentences are manually annotated to determine whether the topic assignment is correct. Topic accuracy is then defined as the proportion of correctly classified sentences. This approach captures the ability of each model to identify economically meaningful financial topics and generalize to unseen text. In addition, we perform a qualitative analysis of topic assignments to examine strengths and weaknesses of each model in capturing domain-specific semantics.</p></sec>
<sec>
<label>3.5.2</label>
<title>Organizing power</title>
<p>To assess document organization and clustering performance, we measure intratopic similarity (cohesion within topics) and intertopic similarity (separation across topics). High intratopic similarity combined with low intertopic similarity indicates semantically well-structured and diverse topics.</p>
<p>For classical models, similarities are derived from document&#x02013;topic distributions. First, documents are assigned to their dominant topic. Next, topic embeddings are computed as means of assigned documents. Intertopic similarity is defined as the cosine similarity between topic embeddings. Intratopic similarity is based on the cosine similarity between each document assigned to the topic and the corresponding topic embedding.</p>
<p>For contemporary models, similarities are computed directly from sentence embeddings. Topic embeddings are calculated as the mean of sentence embeddings per topic. Intertopic similarity reflects pairwise cosine similarities between topic embeddings. Intratopic similarity is defined as the average cosine similarity of sentence embeddings to their topic embedding.</p>
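<p>The intra- and intertopic similarity computations for contemporary models can be sketched as follows, using toy two-dimensional embeddings; the classical variant differs only in deriving the vectors from document&#x02013;topic distributions.</p>

```python
import math
from itertools import combinations

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def topic_similarities(topic_embeddings):
    """topic_embeddings maps each topic to its member sentence embeddings.
    Returns (intratopic, intertopic) mean cosine similarities: members vs.
    their topic centroid, and centroid pairs, respectively."""
    centroids = {t: [sum(col) / len(col) for col in zip(*vecs)]
                 for t, vecs in topic_embeddings.items()}
    intra = [cosine(v, centroids[t])
             for t, vecs in topic_embeddings.items() for v in vecs]
    inter = [cosine(centroids[a], centroids[b])
             for a, b in combinations(centroids, 2)]
    return sum(intra) / len(intra), sum(inter) / len(inter)
```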
<p>Although similarity scores are computed in different latent representation spaces, all evaluated methods rely on cosine similarity, which is bounded and defined relative to a well-specified neutral reference: vector orthogonality. In both classical topic-distribution spaces and neural embedding spaces, orthogonal vectors correspond to the absence of semantic association. Importantly, we do not compare absolute cosine similarity magnitudes across model architectures. Instead, we assess relative topic structure within each model, focusing on intratopic cohesion and intertopic separation; these quantities are defined with respect to the model-specific similarity distribution and therefore remain interpretable despite differences in representation geometry. By evaluating the contrast between intratopic and intertopic similarities, rather than their raw levels, we obtain a scale-independent measure of topic organization that enables meaningful comparison of topic separability across architectures while respecting the distinct geometric properties of their underlying latent spaces.</p></sec></sec>
<sec>
<label>3.6</label>
<title>Downstream task: predictive validity</title>
<p>To assess the predictive value of textual information derived from topic modeling, we conduct a downstream task, evaluating whether the inclusion of topic-document distributions improves company performance prediction. Specifically, we examine the extent to which topics extracted from Item 7 and Item 7A contribute incremental predictive information for future firm profitability.</p>
<p>We define the prediction target as the normalized change in ROA. Following <xref ref-type="bibr" rid="B27">Chen et al. (2022)</xref>, we normalize by subtracting the average change in ROA over the past four years from the current ROA change. In line with recent literature on corporate performance prediction, we frame the task as a binary classification problem that predicts the direction of ROA change (<xref ref-type="bibr" rid="B95">Peng, 2025</xref>). This setup further helps in mitigating heteroscedasticity and outlier sensitivity (<xref ref-type="bibr" rid="B45">Freeman et al., 1982</xref>; <xref ref-type="bibr" rid="B94">Ou and Penman, 1989</xref>). Consistent with <xref ref-type="bibr" rid="B94">Ou and Penman (1989)</xref> and <xref ref-type="bibr" rid="B27">Chen et al. (2022)</xref>, we exclude observations with model probabilities between 0.4 and 0.6 to remove statistically ambiguous cases and strengthen the predictive signal (<xref ref-type="bibr" rid="B73">Jones et al., 2023</xref>; <xref ref-type="bibr" rid="B74">Jun et al., 2022</xref>).</p>
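<p>The target construction can be sketched as follows, assuming a chronological series of annual ROA values; the helper name is illustrative.</p>

```python
def roa_target(roa_history):
    """roa_history: chronological list of at least six annual ROA values.
    Label is 1 when the current ROA change exceeds the average change
    over the previous four years (the normalized change), else 0."""
    changes = [b - a for a, b in zip(roa_history, roa_history[1:])]
    benchmark = sum(changes[-5:-1]) / 4  # average of the prior four changes
    return 1 if changes[-1] - benchmark > 0 else 0
```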
<p>The independent variables comprise two components: (1) financial control variables and (2) textual topic features. The financial control variables are based on <xref ref-type="bibr" rid="B110">Swade et al. (2023)</xref> and <xref ref-type="bibr" rid="B77">Koval et al. (2024)</xref>. They comprise 15 features that capture value, growth, profitability, momentum, and size. Focusing on this limited set of features allows us to represent key firm characteristics while preserving the interpretability and visibility of the added textual components. The textual variables are derived from topic-document distributions generated by each topic modeling approach. For classical models, we use the model-implied topic&#x02013;document distributions directly. For BERTopic, which does not natively provide document-level topic probabilities, we employ HDBSCAN-based approximations of topic distributions. In all cases, document-level topic representations are obtained by averaging sentence-level topic probabilities, yielding vectors that reflect the relative importance of each topic within a document.</p>
<p>We evaluate two predictive models widely used in financial prediction: logistic regression (LR) and XGBoost (XGB). LR serves as a linear benchmark, offering simplicity and interpretability (<xref ref-type="bibr" rid="B47">Gangwani and Zhu, 2024</xref>; <xref ref-type="bibr" rid="B128">&#x0017B;bikowski and Antosiuk, 2021</xref>). XGB represents a more sophisticated tree-based model, known for its robustness and performance in financial prediction tasks. Tree-based models offer several advantages, as they are capable of handling high-dimensional data and capturing complex, non-linear interactions among features (<xref ref-type="bibr" rid="B80">Levy and O&#x00027;Malley, 2020</xref>; <xref ref-type="bibr" rid="B64">Ho, 1995</xref>; <xref ref-type="bibr" rid="B117">Varian, 2014</xref>; <xref ref-type="bibr" rid="B51">Geertsema and Lu, 2023</xref>). Both ML models are trained using a temporal split. We use data from 2016&#x02013;2021 for training and data from 2022&#x02013;2023 for testing. For LR, we perform several preprocessing steps to ensure robust model performance, including removing columns or rows with excessive placeholder or zero values, replacing outlier values, and scaling of features. All preprocessing steps are applied while preventing data leakage and look-ahead bias (<xref ref-type="bibr" rid="B128">&#x0017B;bikowski and Antosiuk, 2021</xref>). As tree-based models can internally manage missing values and are resilient to outliers, we do not apply any form of winsorizing or feature scaling for XGB (<xref ref-type="bibr" rid="B97">Ranta and Ylinen, 2024</xref>; <xref ref-type="bibr" rid="B51">Geertsema and Lu, 2023</xref>). The final dataset contains 3,454 firm-year observations, with 2,568 for training and 886 for testing. We apply balanced class weighting to mitigate minor class imbalance (43.3% positive, 56.7% negative), which is consistent across training and test set.</p>
<p>We evaluate predictive performance using Accuracy, F1-score, and the Area Under the Receiver Operating Characteristic Curve (ROC-AUC). Following <xref ref-type="bibr" rid="B27">Chen et al. (2022)</xref> and <xref ref-type="bibr" rid="B25">Carpenter and Bithell (2000)</xref>, we assess the statistical significance of ROC-AUC differences by constructing bootstrap <italic>p</italic>-values for deviations from 50%, i.e., a random guess. Specifically, we generate 10,000 bootstrap samples of equal size to the original test set. The <italic>p</italic>-value is defined as the proportion of bootstrap AUCs that fall below 50%.</p>
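The bootstrap test can be sketched as below (our own minimal implementation; `auc` uses the Mann-Whitney formulation rather than a library call, and degenerate single-class resamples are redrawn):

```python
import numpy as np

def auc(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """ROC-AUC via the Mann-Whitney formulation (ties count half)."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    gt = (pos[:, None] > neg[None, :]).mean()
    eq = (pos[:, None] == neg[None, :]).mean()
    return gt + 0.5 * eq

def bootstrap_auc_pvalue(y_true, y_score, n_boot=10_000, seed=0):
    """p-value for H0 'AUC is no better than a random guess (50%)':
    the share of bootstrap-resampled AUCs falling below 0.5."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    n = len(y_true)
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, n, n)  # resample test set with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue  # skip resamples containing only one class
        aucs.append(auc(y_true[idx], y_score[idx]))
    return float(np.mean(np.asarray(aucs) < 0.5))
```

A classifier that separates the classes well yields bootstrap AUCs concentrated above 0.5 and hence a p-value near zero.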
<p>For each ML model, we compare seven different inputs: a baseline model that relies solely on financial variables and six text-enhanced models that integrate topic-document distributions from distinct topic modeling approaches. This design enables a direct comparison of the incremental predictive power of textual representations, revealing which topic modeling approach most effectively contributes to corporate performance prediction. Additionally, by applying both linear and non-linear classifiers, we can assess how the benefit of textual features interacts with model complexity.</p></sec></sec>
<sec id="s4">
<label>4</label>
<title>Results and discussion</title>
<p>We structure the results and discussion section according to our research questions:</p>
<list list-type="simple">
<list-item><p>RQ1 FinTextSim: Leveraging the quality of contextual embeddings for the financial domain.</p></list-item>
<list-item><p>RQ2 Topic Quality: Creating qualitative, coherent topic representations.</p></list-item>
<list-item><p>RQ3 Organizing Power: Organizing large financial textual datasets.</p></list-item>
<list-item><p>RQ4 Improving corporate performance prediction with textual data.</p></list-item>
</list>
<p>The results are presented and contextualized in the following subsections.</p>
<sec>
<label>4.1</label>
<title>FinTextSim&#x02014;Leveraging contextual embeddings for the financial domain</title>
<p>FinTextSim generates substantially improved clusters and notably reduces the number of outliers compared to standard embedding models. As illustrated in <xref ref-type="fig" rid="F1">Figure 1</xref> and <xref ref-type="table" rid="T2">Table 2</xref>, FinTextSim (<xref ref-type="fig" rid="F1">Figure 1a</xref>) achieves a marked increase in intratopic similarity while simultaneously lowering intertopic similarity relative to AM, MPNET, and DR (<xref ref-type="fig" rid="F1">Figures 1b</xref>&#x02013;<xref ref-type="fig" rid="F1">d</xref>) on the test dataset.</p>
<fig position="float" id="F1">
<label>Figure 1</label>
<caption><p>UMAP reduced sentence embeddings <bold>(a)</bold> FinTextSim vs. <bold>(b)</bold> AM vs. <bold>(c)</bold> MPNET vs. <bold>(d)</bold> DR on the test dataset. The colors of the datapoints represent a topic from the keyword list.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-09-1752103-g0001.tif">
<alt-text content-type="machine-generated">Four scatter plots display two-dimensional UMAP projections of sentence embeddings produced by different models. The FinTextSim panel shows compact, well-separated topic clusters, whereas AM, MPNET, and DR projections exhibit diffuse point clouds with substantial overlap across topics and no visually distinct cluster structure. </alt-text>
</graphic>
</fig>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>FinTextSim vs. OTS embedding models: intra- and intertopic similarity on test dataset.</p></caption>
<table frame="box" rules="all">
<thead>
<tr>
<th valign="top" align="left"><bold>Model</bold></th>
<th valign="top" align="center"><bold>Intratopic similarity &#x02191;</bold></th>
<th valign="top" align="center"><bold>Intertopic similarity &#x02193;</bold></th>
<th valign="top" align="center"><bold>Outliers within BERTopic &#x02193;</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">FinTextSim</td>
<td valign="top" align="center">0.998</td>
<td valign="top" align="center">&#x02013;0.075</td>
<td valign="top" align="center">240,823</td>
</tr>
<tr>
<td valign="top" align="left">AM</td>
<td valign="top" align="center">0.584</td>
<td valign="top" align="center">0.563</td>
<td valign="top" align="center">781,965</td>
</tr>
<tr>
<td valign="top" align="left">MPNET</td>
<td valign="top" align="center">0.614</td>
<td valign="top" align="center">0.625</td>
<td valign="top" align="center">784,225</td>
</tr>
<tr>
<td valign="top" align="left">DR</td>
<td valign="top" align="center">0.773</td>
<td valign="top" align="center">0.883</td>
<td valign="top" align="center">1,332,620</td>
</tr></tbody>
</table>
</table-wrap>
<p>Specifically, FinTextSim attains an intratopic similarity of 0.998, substantially exceeding AM (0.584), MPNET (0.614), and DR (0.773). At the same time, FinTextSim reduces intertopic similarity by more than 108% compared to all baselines, achieving a score of &#x02013;0.075. In contrast, AM and MPNET yield 0.563 and 0.625, respectively, while DR exhibits the highest intertopic similarity at 0.883. Differences across models are further reflected in the number of outliers generated when combined with BERTopic. AM and MPNET generate 781,965 and 784,225 outliers, respectively. DR performs worst, resulting in more than 1.3 million outliers. In contrast, using FinTextSim leads to only 240,823 outliers, representing a reduction of more than 69% relative to all baselines.</p>
<p>These results show that FinTextSim creates significantly enhanced clusters of semantically similar concepts, characterized by high intratopic similarity and low intertopic similarity. AM, MPNET, and DR show limited ability to capture topic-specific nuances, leading to less differentiated embedding spaces (see <xref ref-type="fig" rid="F1">Figure 1</xref>). In parallel, FinTextSim notably reduces the number of outliers, preserving valuable information that standard embedding models discard. Taken together, these findings suggest that OTS sentence-transformers and models finetuned primarily for financial sentiment analysis are less well suited for semantic clustering of financial text. By explicitly modeling domain-specific semantic structure, FinTextSim provides embeddings that better align with financial topical distinctions.</p>
<p>Turning to a practical example, <xref ref-type="fig" rid="F2">Figure 2</xref> illustrates topic assignments for the same sentence under BERTopic in combination with FinTextSim, AM, MPNET, and DR. FinTextSim correctly identifies the topic as &#x0201C;Sales,&#x0201D; producing a coherent and interpretable topic representation. In contrast, AM and MPNET assign the sentence to cost- and debt-related topics, reflecting topic confusion and partial concept mixing that limits reliable topic differentiation in this setting. DR assigns the sentence to a diffuse topic lacking clear financial interpretation. This qualitative evidence reinforces the quantitative findings and underscores FinTextSim&#x00027;s advantage in producing interpretable, domain-aligned embeddings that preserve financial topical structure.</p>
<fig position="float" id="F2">
<label>Figure 2</label>
<caption><p>Topic representations for the same sentence using BERTopic with FinTextSim, AM, MPNET, and DR. Words are colored by their assigned topic from the keyword list. Black words are not included in the list. Sentence: &#x0201C;in addition we ended the year with a strong sales backlog up in homes and in dollar value which gives us a strong start for fiscal.&#x0201D;</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-09-1752103-g0002.tif">
<alt-text content-type="machine-generated">Four wordclouds show topic representations generated for the same sentence using different embedding models. The FinTextSim panel contains prominent sales-related terms, forming a coherent thematic group. The AM and MPNET panels display mixtures of cost-, debt-, and sales-related words, indicating less focused topic composition. The DR panel includes a broad set of financial terms without a clearly dominant theme.</alt-text>
</graphic>
</fig></sec>
<sec>
<label>4.2</label>
<title>Topic quality</title>
<p>As described in Section 3.5.1, we evaluate topic quality using two complementary criteria: NPMI coherence and topic accuracy. While coherence captures statistical word co-occurrence within topics, topic accuracy directly measures whether models correctly identify economically meaningful financial topics. <xref ref-type="table" rid="T3">Table 3</xref> reports coherence and topic accuracy for all models.</p>
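NPMI coherence can be computed as in the following simplified sketch (our own implementation using document-level co-occurrence; the paper's exact reference corpus and window settings may differ). NPMI ranges from &#x02013;1 (words never co-occur) to 1 (words always co-occur together):

```python
import math
from itertools import combinations

def npmi_coherence(topic_words, documents, eps=1e-12):
    """NPMI coherence of one topic: average normalized PMI over all
    top-word pairs, with document-level co-occurrence probabilities.
    Assumes top words do not occur in every single document."""
    n_docs = len(documents)
    doc_sets = [set(d) for d in documents]

    def p(*words):
        return sum(all(w in ds for w in words) for ds in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = p(w1), p(w2), p(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # never co-occur: minimal NPMI
            continue
        pmi = math.log(p12 / (p1 * p2 + eps))
        scores.append(pmi / (-math.log(p12 + eps)))
    return sum(scores) / len(scores)
```

Note that this score only rewards co-occurrence statistics; it cannot detect whether the word cluster corresponds to the economically correct topic, which is exactly the gap that topic accuracy fills.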
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Topic quality.</p></caption>
<table frame="box" rules="all">
<thead>
<tr>
<th valign="top" align="left"><bold>Model</bold></th>
<th valign="top" align="center"><bold>Coherence &#x02191;</bold></th>
<th valign="top" align="center"><bold>Topic accuracy &#x02191;</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">BERTopic-AM</td>
<td valign="top" align="center">0.387</td>
<td valign="top" align="center">0.06</td>
</tr>
<tr>
<td valign="top" align="left">BERTopic-MPNET</td>
<td valign="top" align="center">0.382</td>
<td valign="top" align="center">0.23</td>
</tr>
<tr>
<td valign="top" align="left">BERTopic-FinTextSim</td>
<td valign="top" align="center">0.287</td>
<td valign="top" align="center">0.81</td>
</tr>
<tr>
<td valign="top" align="left">BERTopic-DR</td>
<td valign="top" align="center">0.368</td>
<td valign="top" align="center">0.09</td>
</tr>
<tr>
<td valign="top" align="left">LDA</td>
<td valign="top" align="center">0.039</td>
<td valign="top" align="center">0</td>
</tr>
<tr>
<td valign="top" align="left">NMF</td>
<td valign="top" align="center">0.239</td>
<td valign="top" align="center">0.11</td>
</tr></tbody>
</table>
</table-wrap>
<p>BERTopic combined with FinTextSim outperforms all alternative approaches in topic accuracy, achieving 81% correct classification across all expert-labeled sentences. In contrast, OTS sentence-transformers achieve markedly lower accuracy. BERTopic with AM achieves a topic accuracy of only 6%, while MPNET reaches 23%. DR exhibits similarly limited performance, achieving 9% accuracy. Classical topic models perform at similarly low levels: while NMF reaches 11% accuracy, LDA is unable to correctly identify any topic. A topic-level breakdown shows that FinTextSim consistently identifies most financial topics with high accuracy, whereas baseline models succeed only in narrowly defined, lexically explicit topics such as litigation. This pattern suggests that baseline models rely heavily on surface-level keyword cues, while FinTextSim captures broader domain-specific contextual semantics.<xref ref-type="fn" rid="fn0008"><sup>6</sup></xref> These results highlight that FinTextSim reliably recovers a broad range of economically meaningful financial topics. In comparison, generic embedding models and classical topic models show reduced coverage and consistency, limiting their effectiveness for comprehensive, large-scale financial text analysis.</p>
<p>In terms of raw coherence, BERTopic models outperform classical topic modeling, consistent with <xref ref-type="bibr" rid="B2">Abuzayed and Al-Khalifa (2021)</xref> and <xref ref-type="bibr" rid="B41">Egger and Yu (2022)</xref>. In line with <xref ref-type="bibr" rid="B41">Egger and Yu (2022)</xref>, <xref ref-type="bibr" rid="B93">O&#x00027;Callaghan et al. (2015)</xref>, and <xref ref-type="bibr" rid="B29">Chen et al. (2019)</xref>, NMF produces more coherent topics than LDA, reflecting its strengths in short-text modeling (<xref ref-type="bibr" rid="B29">Chen et al., 2019</xref>) and handling non-mainstream text (<xref ref-type="bibr" rid="B93">O&#x00027;Callaghan et al., 2015</xref>). LDA, by contrast, generates more general and less domain-specific topics, consistent with <xref ref-type="bibr" rid="B93">O&#x00027;Callaghan et al. (2015)</xref>.</p>
<p>During the evaluation of raw coherence scores, an important discrepancy arises: BERTopic with AM, MPNET, and DR achieve higher coherence than with FinTextSim. At first glance, this seems to suggest lower quality for FinTextSim. Yet, this interpretation is incomplete in the financial domain. The paradox arises because coherence does not penalize misclassification, i.e., low topic accuracy. In addition, AM, MPNET, and DR generate a large number of outliers, which simplifies the compression and generation of topics. This artificially inflates coherence while losing valuable financial signals. In contrast, FinTextSim preserves topic distinctions, resulting in fewer outliers and richer topical structures. A further challenge lies in the vocabulary of the financial domain. Key terms often occur as standalone words rather than within a sliding window. Hence, &#x0201C;true&#x0201D; financial topics might suffer from low coherence scores. These factors demonstrate that coherence alone is insufficient to evaluate financial topic models. In line with <xref ref-type="bibr" rid="B57">Grootendorst (2022)</xref>, who emphasizes that topic evaluation requires both domain expertise and subjective interpretation, we argue that topic accuracy is necessary to capture meaningful financial insights. Standard embedding models within BERTopic and classical topic models exhibit limited ability to correctly identify economically meaningful topics, underscoring their limitations for finance-specific tasks.</p>
<p>A practical example illustrates this issue. In <xref ref-type="fig" rid="F3">Figure 3</xref>, FinTextSim correctly identifies the topic as &#x0201C;Sales.&#x0201D; AM, MPNET, and DR misclassify the same sentence. Yet, AM receives a coherence score of 0.611, more than double FinTextSim&#x00027;s 0.263. Here, coherence rewards an incorrect classification, undermining interpretability and predictive utility.</p>
<fig position="float" id="F3">
<label>Figure 3</label>
<caption><p>Topic representations - Sales. Original cleaned sentence: &#x0201C;we calculate revpar by dividing hotel room revenue by total number of room nights available to guests for a given period.&#x0201D;</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-09-1752103-g0003.tif">
<alt-text content-type="machine-generated">Six wordclouds display topic representations for the same sentence generated by FinTextSim, AM, MPNET, DR, LDA, and NMF. The FinTextSim panel contains prominent sales-related terms. The AM panel displays words related to non-financial concepts. The other panels contain mixed or generic financial words without a clear central theme.</alt-text>
</graphic>
</fig>
<p><xref ref-type="fig" rid="F4">Figure 4</xref> shows another case: FinTextSim correctly assigns the sentence to &#x0201C;Profit and Loss.&#x0201D; AM associates it with foreign currency, and NMF is unable to identify a financial topic at all. Nevertheless, AM (0.528) and NMF (0.341) achieve higher coherence scores than FinTextSim (0.261).</p>
<fig position="float" id="F4">
<label>Figure 4</label>
<caption><p>Topic representations - Profit and Loss. Original cleaned sentence: &#x0201C;reported operating profit of million in was million or higher than reported operating profit of million in.&#x0201D;</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-09-1752103-g0004.tif">
<alt-text content-type="machine-generated">Six wordclouds display topic representations for the same sentence generated by FinTextSim, AM, MPNET, DR, LDA, and NMF. The FinTextSim panel highlights profit- and loss-related terms. The other panels contain mixed or generic financial words without a clear central theme.</alt-text>
</graphic>
</fig>
<p>Overall, our findings highlight that domain-specific embeddings are essential for generating high-quality topic representations in financial text applications. Standard coherence metrics systematically undervalue accurate domain-specific topic assignments, while topic accuracy captures meaningful distinctions. By ensuring precise alignment between text and financial topics, FinTextSim provides the interpretability and reliability required for downstream tasks.</p></sec>
<sec>
<label>4.3</label>
<title>Organizing power</title>
<p>To efficiently organize and structure large collections of documents, maximizing intratopic similarity while simultaneously minimizing intertopic similarity is desirable. The results for intra- and intertopic similarity of our models are displayed in <xref ref-type="table" rid="T4">Table 4</xref>. These metrics are computed within each model&#x00027;s latent space and interpreted relatively, focusing on the contrast between cohesion and separation rather than absolute similarity values.</p>
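One common centroid-based formulation of these two metrics, computed from sentence embeddings and their topic labels, is sketched below (our own illustration; the paper's exact definition may differ):

```python
import numpy as np

def topic_similarities(embeddings: np.ndarray, labels: np.ndarray):
    """Cohesion vs. separation in a model's latent space.
    Intratopic: mean cosine similarity of sentences to their own topic centroid.
    Intertopic: mean pairwise cosine similarity between topic centroids."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    topics = np.unique(labels)
    centroids = np.stack([X[labels == t].mean(axis=0) for t in topics])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    intra = np.mean([(X[labels == t] @ centroids[i]).mean()
                     for i, t in enumerate(topics)])
    pairwise = centroids @ centroids.T
    upper = np.triu_indices(len(topics), k=1)  # unique centroid pairs
    return float(intra), float(pairwise[upper].mean())
```

A well-organized corpus pushes the first value toward 1 and the second toward 0 (or below, for anti-correlated centroids), matching the &#x02191;/&#x02193; directions in Table 4.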
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Topic similarities.</p></caption>
<table frame="box" rules="all">
<thead>
<tr>
<th valign="top" align="left"><bold>Model</bold></th>
<th valign="top" align="center"><bold>Intertopic similarity &#x02193;</bold></th>
<th valign="top" align="center"><bold>Intratopic similarity &#x02191;</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">BERTopic-AM</td>
<td valign="top" align="center">0.465</td>
<td valign="top" align="center">0.596</td>
</tr>
<tr>
<td valign="top" align="left">BERTopic-MPNET</td>
<td valign="top" align="center">0.511</td>
<td valign="top" align="center">0.656</td>
</tr>
<tr>
<td valign="top" align="left">BERTopic-FinTextSim</td>
<td valign="top" align="center">&#x02013;0.034</td>
<td valign="top" align="center">0.939</td>
</tr>
<tr>
<td valign="top" align="left">BERTopic-DR</td>
<td valign="top" align="center">0.745</td>
<td valign="top" align="center">0.948</td>
</tr>
<tr>
<td valign="top" align="left">LDA</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">0</td>
</tr>
<tr>
<td valign="top" align="left">NMF</td>
<td valign="top" align="center">0.202</td>
<td valign="top" align="center">0.881</td>
</tr></tbody>
</table>
</table-wrap>
<p>BERTopic combined with FinTextSim consistently achieves the strongest balance between cohesion and separation, producing highly coherent topic clusters (intratopic similarity 0.939) while maintaining strong separation between topics (intertopic similarity &#x02013;0.034). This demonstrates that FinTextSim captures domain-specific distinctions in financial text, forming distinct and semantically meaningful clusters. By contrast, generic OTS sentence-transformers produce weaker topic structure. Both AM and MPNET exhibit moderate intratopic similarity (0.596 and 0.656) but substantially higher intertopic similarity (0.465 and 0.511), indicating that topics are less well-separated and concepts are partially conflated. DR shows high intratopic similarity (0.948), yet its elevated intertopic similarity (0.745) points to limited topic differentiation. Classical topic models struggle as well. LDA collapses all sentences into a single dominant topic, resulting in maximal intertopic similarity and minimal intratopic similarity. NMF produces higher intratopic similarity than LDA, but intertopic similarity remains at a moderate level, indicating partial topic mixing.</p>
<p>Overall, these results highlight the importance of jointly evaluating intratopic and intertopic similarity. FinTextSim consistently forms clear and well-structured topic clusters, outperforming general-purpose embeddings, domain-specific sentiment-baselines, and classical topic models. <xref ref-type="fig" rid="F5">Figure 5</xref> illustrates this advantage in practice. FinTextSim correctly identifies the sentence as belonging to the &#x0201C;HR&#x0201D; topic, ensuring a precise and domain-relevant assignment. The alternative models associate the sentence with broader or mixed topics, failing to recover this specific financial concept. Such topic ambiguity manifests in higher intertopic similarity and lower intratopic similarity, underscoring the limitations of OTS sentence-transformers, sentiment-focused financial embeddings, and classical topic models for fine-grained financial semantic clustering.</p>
<fig position="float" id="F5">
<label>Figure 5</label>
<caption><p>Topic representations - HR. Original cleaned sentence: &#x0201C;in the fourth quarter we recognized our frontline employees for their commitment and contributions to their communities during the pandemic with a award that was paid in January.&#x0201D;</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-09-1752103-g0005.tif">
<alt-text content-type="machine-generated">Six wordclouds display topic representations for the same sentence generated by FinTextSim, AM, MPNET, DR, LDA, and NMF. The FinTextSim panel emphasizes HR-related terminology. The NMF panel displays financing- and debt-related words. The AM panel contains non-financial terms. The remaining panels show broader financial vocabulary without a clearly defined topic.</alt-text>
</graphic>
</fig></sec>
<sec>
<label>4.4</label>
<title>Predictive validity</title>
<p>As presented in Section 3.6, we evaluate our ML models for corporate performance prediction using Accuracy, F1-score, and ROC-AUC. The results are reported in <xref ref-type="table" rid="T5">Table 5</xref>.</p>
<table-wrap position="float" id="T5">
<label>Table 5</label>
<caption><p>ML performance comparison across feature sets and models.</p></caption>
<table frame="box" rules="all">
<thead>
<tr>
<th valign="top" align="left"><bold>Feature set</bold></th>
<th valign="top" align="center"><bold>Accuracy</bold></th>
<th valign="top" align="center"><bold>F1 score</bold></th>
<th valign="top" align="center"><bold>ROC-AUC</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" colspan="4"><bold>LR</bold></td>
</tr>
<tr>
<td valign="top" align="left">Financial</td>
<td valign="top" align="center"><italic><bold>69.2</bold></italic></td>
<td valign="top" align="center">57.8</td>
<td valign="top" align="center">68.8</td>
</tr>
<tr>
<td valign="top" align="left">Financial &#x0002B; AM</td>
<td valign="top" align="center">63.8</td>
<td valign="top" align="center">53.3</td>
<td valign="top" align="center">64.6</td>
</tr>
<tr>
<td valign="top" align="left">Financial &#x0002B; MPNET</td>
<td valign="top" align="center">66.9</td>
<td valign="top" align="center">56.5</td>
<td valign="top" align="center">65.8</td>
</tr>
<tr>
<td valign="top" align="left">Financial &#x0002B; FinTextSim</td>
<td valign="top" align="center">68.6</td>
<td valign="top" align="center"><bold>59.9</bold></td>
<td valign="top" align="center"><italic><bold>70.8</bold></italic></td>
</tr>
<tr>
<td valign="top" align="left">Financial &#x0002B; DR</td>
<td valign="top" align="center">66.5</td>
<td valign="top" align="center">53.9</td>
<td valign="top" align="center">66.7</td>
</tr>
<tr>
<td valign="top" align="left">Financial &#x0002B; LDA</td>
<td valign="top" align="center">67.4</td>
<td valign="top" align="center">55.6</td>
<td valign="top" align="center">67.7</td>
</tr>
<tr>
<td valign="top" align="left">Financial &#x0002B; NMF</td>
<td valign="top" align="center">66.2</td>
<td valign="top" align="center">56.4</td>
<td valign="top" align="center">69.0</td>
</tr>
<tr>
<td valign="top" align="left" colspan="4"><bold>XGB</bold></td>
</tr>
<tr>
<td valign="top" align="left">Financial</td>
<td valign="top" align="center">63.6</td>
<td valign="top" align="center">60.3</td>
<td valign="top" align="center">67.2</td>
</tr>
<tr>
<td valign="top" align="left">Financial &#x0002B; AM</td>
<td valign="top" align="center">66.3</td>
<td valign="top" align="center"><italic><bold>62.6</bold></italic></td>
<td valign="top" align="center">67.4</td>
</tr>
<tr>
<td valign="top" align="left">Financial &#x0002B; MPNET</td>
<td valign="top" align="center">64.8</td>
<td valign="top" align="center">58.0</td>
<td valign="top" align="center">66.7</td>
</tr>
<tr>
<td valign="top" align="left">Financial &#x0002B; FinTextSim</td>
<td valign="top" align="center">66.0</td>
<td valign="top" align="center">61.2</td>
<td valign="top" align="center"><bold>68.6</bold></td>
</tr>
<tr>
<td valign="top" align="left">Financial &#x0002B; DR</td>
<td valign="top" align="center">65.7</td>
<td valign="top" align="center">59.4</td>
<td valign="top" align="center">67.6</td>
</tr>
<tr>
<td valign="top" align="left">Financial &#x0002B; LDA</td>
<td valign="top" align="center"><bold>67.0</bold></td>
<td valign="top" align="center">62.2</td>
<td valign="top" align="center">67.6</td>
</tr>
<tr>
<td valign="top" align="left">Financial &#x0002B; NMF</td>
<td valign="top" align="center">66.7</td>
<td valign="top" align="center">60.8</td>
<td valign="top" align="center">68.2</td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>Bold values indicate the best performance within each model. Values highlighted in bold and italic indicate the best performing model-feature combination overall. Values are in percent.</p>
</table-wrap-foot>
</table-wrap>
<p>For LR, FinTextSim delivers the strongest and most consistent improvements. Topic features derived from BERTopic combined with FinTextSim yield the highest ROC-AUC (70.8) and F1-score (59.9), representing an improvement of approximately two percentage points over the financial baseline. These gains are statistically significant and reflect simultaneous improvements in both precision and recall. In contrast, text features derived from OTS sentence-transformers and models finetuned for financial sentiment analysis reduce predictive performance relative to the financial baseline. Classical topic models offer only marginal or inconsistent improvements. Overall, these results suggest that weak or noisy topic representations do not reliably contribute predictive signal and may adversely affect linear classifiers.</p>
<p>Results under XGB paint a complementary picture. As a more expressive, non-linear model, XGB is better able to accommodate heterogeneous feature quality. Several text-based feature sets yield modest, statistically significant improvements over the financial baseline. Nevertheless, FinTextSim remains the most consistent performer across evaluation metrics, achieving the highest ROC-AUC while maintaining competitive accuracy and F1-score. Importantly, no alternative model matches FinTextSim&#x00027;s joint gains across linear and non-linear classifiers.</p>
<p>Taken together, these findings highlight two key insights. First, predictive gains from textual topic features are highly sensitive to embedding quality, particularly in linear models where noise cannot be absorbed through model complexity. Second, FinTextSim is the only embedding approach that improves predictive performance robustly across both LR and XGB. FinTextSim&#x00027;s superior predictive validity aligns with its stronger intrinsic characteristics, namely higher topic quality and cluster separation. These properties are therefore not merely internal measures of representational quality but translate directly into extrinsic predictive utility. This demonstrates that domain-specific embeddings can effectively extract latent, forward-looking information embedded in corporate narratives. On the other hand, classical, general-purpose or sentiment-focused models tend to provide weaker predictive signals in our setting.</p>
<p>Our results are consistent with and extend previous findings in the earnings and profitability prediction literature. The accuracy and ROC-AUC values reported in our study exceed most previous work, where accuracy typically ranges between 57% and 64% and AUC scores are around 68% (<xref ref-type="bibr" rid="B7">Anand et al., 2019</xref>; <xref ref-type="bibr" rid="B14">Baranes et al., 2019</xref>; <xref ref-type="bibr" rid="B126">Xinyue et al., 2020</xref>; <xref ref-type="bibr" rid="B73">Jones et al., 2023</xref>; <xref ref-type="bibr" rid="B27">Chen et al., 2022</xref>). For example, <xref ref-type="bibr" rid="B73">Jones et al. (2023)</xref> report an AUC of 68.4% for LR, while <xref ref-type="bibr" rid="B27">Chen et al. (2022)</xref> achieve between 67.5% and 68.7%. Compared to these benchmarks, our FinTextSim-based model demonstrates superior predictive validity using a lightweight LR framework and the full range of predicted probabilities, whereas many studies focus only on the first and last quintiles (<xref ref-type="bibr" rid="B73">Jones et al., 2023</xref>; <xref ref-type="bibr" rid="B74">Jun et al., 2022</xref>). This reinforces that the observed improvement is not an artifact of model complexity or sample selection but stems from the added informational value of textual features derived from domain-specific contextual embeddings. Contrary to our expectations, LR outperforms XGB, diverging from prior work (<xref ref-type="bibr" rid="B6">Amel-Zadeh et al., 2020</xref>; <xref ref-type="bibr" rid="B102">Rossi and Utkus, 2020</xref>; <xref ref-type="bibr" rid="B80">Levy and O&#x00027;Malley, 2020</xref>; <xref ref-type="bibr" rid="B133">Zhu et al., 2025</xref>). We attribute XGB&#x00027;s comparatively weaker performance to our deliberately parsimonious feature set, which limits the scope for higher-order interactions.</p>
<p>Overall, our findings confirm that textual representations can meaningfully enhance the prediction of corporate performance when generated by a domain-adapted language model. FinTextSim captures subtle linguistic signals reflecting managerial expectations, strategic orientation, and forward-looking disclosures that are otherwise omitted in numerical data. By integrating such qualitative cues into financial prediction tasks, we demonstrate that corporate narratives contain actionable, forward-looking information that can improve the predictive power of conventional forecasting models and contribute to a more holistic understanding of firm performance.</p></sec>
<sec>
<label>4.5</label>
<title>Wrap-up of results and discussion</title>
<p>We find that BERTopic is highly effective on financial text when combined with FinTextSim. AM, MPNET, DR, and classical topic models tend to produce broader and less differentiated topics, limiting their ability to capture critical financial aspects and resulting in gaps in topical coverage. Only when paired with FinTextSim, BERTopic produces clear, distinct clusters of financial topics, minimizing misclassifications and enhancing interpretability. Conceptually, this aligns with <xref ref-type="bibr" rid="B32">Das et al. (2017)</xref>, who observed that financial text represented with expert keywords often exhibits almost linearly separable structures. Furthermore, our results support (<xref ref-type="bibr" rid="B36">Dong et al., 2024</xref>; <xref ref-type="bibr" rid="B58">Gu et al., 2024</xref>; <xref ref-type="bibr" rid="B123">Wang Y. et al., 2024</xref>; <xref ref-type="bibr" rid="B61">Hajek and Munk, 2024</xref>), demonstrating that finetuning on a domain-specific dataset improves both model performance and domain-specific understanding. While general-purpose embeddings often exhibit biases and limited coverage of specialized financial terminology (<xref ref-type="bibr" rid="B109">Sun et al., 2026</xref>; <xref ref-type="bibr" rid="B61">Hajek and Munk, 2024</xref>), models finetuned for financial sentiment analysis also appear less effective for robust topic modeling and semantic clustering. In contrast, domain-adapted models like FinTextSim produce sentence embeddings that better capture topic-specific nuances and context (<xref ref-type="bibr" rid="B123">Wang Y. et al., 2024</xref>), emphasizing that relying on alternatives may compromise reliability and introduce systematic errors (<xref ref-type="bibr" rid="B109">Sun et al., 2026</xref>). The hyperparameter choices for UMAP and HDBSCAN (see Section 3.4.2) are critical to our results. 
While we prioritized capturing global structures and macrotopics, these settings succeeded only with FinTextSim, which provided high-quality, pre-separated embeddings for financial text. AM, MPNET, and DR exhibit substantially higher outlier rates and produce less distinguishable topic structures under the same settings. This further highlights a unique advantage of FinTextSim: its domain-adapted representations not only enhance intratopic and intertopic similarity but also enable dimensionality reduction and clustering methods to effectively capture macro-level topic structures, reinforcing its suitability for financial text analysis where both clarity and interpretability are paramount.</p>
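The notion of pre-separated embeddings, i.e., high intratopic and low intertopic similarity, can be made concrete with a minimal sketch. This is an illustrative toy example (synthetic two-dimensional embeddings, numpy and scikit-learn), not the paper's actual pipeline or data:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def intra_inter_similarity(embeddings, labels):
    """Mean cosine similarity within topics (intratopic) vs. across topics (intertopic)."""
    sims = cosine_similarity(embeddings)
    same_topic = labels[:, None] == labels[None, :]
    off_diagonal = ~np.eye(len(labels), dtype=bool)
    intra = sims[same_topic & off_diagonal].mean()
    inter = sims[~same_topic].mean()
    return intra, inter

# Toy data: two tight "topic" clusters in near-orthogonal directions
rng = np.random.default_rng(0)
topic_a = rng.normal(loc=[1.0, 0.0], scale=0.05, size=(5, 2))
topic_b = rng.normal(loc=[0.0, 1.0], scale=0.05, size=(5, 2))
embeddings = np.vstack([topic_a, topic_b])
labels = np.array([0] * 5 + [1] * 5)

intra, inter = intra_inter_similarity(embeddings, labels)
# For well-separated embeddings, intratopic similarity far exceeds intertopic
```

On embeddings shaped like this, density-based clustering needs little tuning to recover the macrotopics; when the gap between intratopic and intertopic similarity shrinks, as for the off-the-shelf models, the same UMAP and HDBSCAN settings produce more outliers and blurred topics.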
<p>Beyond intrinsic topic quality, our results show that improved textual representations translate into tangible predictive benefits. For LR, topic features generated by BERTopic in combination with FinTextSim yield a statistically significant improvement over purely financial features, reflected in a two-percentage-point increase in both ROC-AUC and F1-score. In contrast, OTS sentence-transformers, DR, and classical topic models provide no improvement and, in some cases, even degrade performance, indicating that their latent features introduce noise rather than signal. Results under XGB present a complementary picture. As a non-linear learner, XGB is better able to absorb heterogeneous or partially noisy feature sets, leading to modest improvements for several textual representations. Nevertheless, FinTextSim remains the most consistent performer, achieving the highest ROC-AUC while maintaining competitive accuracy and F1-score. No alternative topic modeling approach delivers comparable gains across both linear and non-linear classifiers. Taken together, these findings bridge intrinsic and extrinsic evaluation. The superior topic quality and cluster separation achieved by FinTextSim are not merely internal quality measures but translate into robust predictive utility, particularly when model capacity cannot compensate for weak representations. Hence, we conclude that semantic differentiation between sentence representations not only contributes positively to topic modeling (<xref ref-type="bibr" rid="B123">Wang Y. et al., 2024</xref>), but also to corporate performance prediction. Therefore, we partially support prior literature suggesting that NLP can enhance corporate performance prediction. However, our evidence reveals that such improvements are realized only when domain-specific representations are employed. 
Together, these findings position FinTextSim as a bridge between qualitative disclosure analysis and quantitative forecasting, highlighting the promise of domain-adapted language models in advancing the methodological frontier of textual analysis in accounting and finance.</p>
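The extrinsic evaluation logic, comparing a purely financial baseline against the same classifier augmented with topic features, can be sketched schematically. The data below are synthetic stand-ins (hypothetical feature names, scikit-learn), not the paper's actual features or results:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 2000
financial = rng.normal(size=(n, 5))   # stand-in financial ratios
topics = rng.normal(size=(n, 3))      # stand-in topic-share features

# Simulated outcome: direction of performance change depends on both
# a numeric signal and a textual (topic) signal
logits = financial[:, 0] + 0.8 * topics[:, 0] - 0.5 * topics[:, 1]
y = (logits + rng.normal(scale=0.5, size=n) > 0).astype(int)

def auc_for(X):
    """Hold-out ROC-AUC of a logistic regression on feature matrix X."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

auc_financial = auc_for(financial)
auc_combined = auc_for(np.hstack([financial, topics]))
# When the topic features carry genuine signal, the combined model scores higher;
# noisy topic features would instead leave AUC flat or degrade it
```

The same comparison with noise injected into the topic columns reproduces the degradation observed for the off-the-shelf and classical representations: a linear learner cannot filter weak textual features, which is exactly where representation quality matters most.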
<p>Evaluating topic models remains challenging (<xref ref-type="bibr" rid="B130">Zhao et al., 2021</xref>). Our analysis reveals the limitations of standard coherence metrics. BERTopic paired with AM, MPNET, or DR attains higher raw coherence than with FinTextSim, yet exhibits low topic accuracy due to frequent misclassifications. These findings underscore the need for new coherence or topic-quality measures tailored to domain-specific texts.</p>
<p>While BERTopic enhances topic modeling relative to classical approaches, there is still significant room for improvement. The transformer architecture, which BERTopic heavily relies on, may not be fully optimized yet. Thus, more sophisticated and computationally efficient alternatives should be explored (<xref ref-type="bibr" rid="B75">Karami and Ghodsi, 2024</xref>). Further advancements in encoder-only models could enhance sentence-transformers by improving their contextual understanding of language (<xref ref-type="bibr" rid="B124">Warner et al., 2024</xref>). Moreover, applying domain-specific pre-training methods to optimized BERT variants may deepen the model&#x00027;s understanding of financial language, leading to more effective downstream task performance (<xref ref-type="bibr" rid="B66">Huang et al., 2023</xref>). Another promising direction is the integration of topic modeling with generative Large Language Models such as GPT. Although generative models alone do not exhibit competitive performance in topic modeling tasks due to difficulties in handling corpus-level information (<xref ref-type="bibr" rid="B122">Wang R. et al., 2024</xref>), hybrid approaches that combine their generalization capabilities with topic modeling frameworks may improve both generalization and textual understanding (<xref ref-type="bibr" rid="B112">Tang et al., 2025</xref>).</p>
<p>While our experiments focus on Item 7 and Item 7A of 10-K filings, experiments on Item 1 suggest similar performance, indicating that FinTextSim&#x00027;s effectiveness extends to other sections of 10-K filings.<xref ref-type="fn" rid="fn0009"><sup>7</sup></xref> Considering future improvements for FinTextSim, incorporating diverse high-quality financial sources, such as news, conference call transcripts, and analyst reports could lead to enhanced robustness and adaptability (<xref ref-type="bibr" rid="B90">Mohammed et al., 2025</xref>). Additionally, incorporating researcher-labeled data may provide further improvements (<xref ref-type="bibr" rid="B35">Di Gennaro et al., 2024</xref>; <xref ref-type="bibr" rid="B112">Tang et al., 2025</xref>). These advancements not only improve financial text analysis but also enable topic-specific sentiment extraction, which is highly valuable for performance prediction (<xref ref-type="bibr" rid="B54">Gracewell et al., 2025</xref>; <xref ref-type="bibr" rid="B33">Deveikyte et al., 2022</xref>; <xref ref-type="bibr" rid="B61">Hajek and Munk, 2024</xref>).</p>
<p>In terms of corporate performance prediction, FinTextSim&#x00027;s downstream utility could be leveraged to refine investment strategies, generating excess returns by capturing information beyond raw numerical data. We achieve the best results with a lightweight LR framework and a restricted number of features, highlighting that the predictive gains stem from FinTextSim&#x00027;s improved information quality rather than complex model architecture. Yet, applying more complex models and a richer feature set could further amplify FinTextSim&#x00027;s predictive power and strategic relevance.</p></sec></sec>
<sec sec-type="conclusions" id="s5">
<label>5</label>
<title>Conclusion</title>
<p>Increased availability of information and enhanced computational capabilities have transformed the analysis of annual reports, recognizing the value embedded within qualitative textual data. Automated review processes, such as topic modeling, are essential for analyzing this data. However, in the financial domain, the use of ML-based methods (<xref ref-type="bibr" rid="B98">Ranta et al., 2022</xref>), including contextual embeddings, remains underexplored (<xref ref-type="bibr" rid="B104">Senave et al., 2023</xref>; <xref ref-type="bibr" rid="B63">Hida and Do Nascimento, 2026</xref>). We address these issues by bridging the gap between classical and contemporary topic modeling approaches for Item 7 and Item 7A of 10-K reports from S&#x00026;P 500 companies between 2016 and 2023. Furthermore, we introduce FinTextSim, a finetuned sentence-transformer enhancing financial text analysis with BERTopic, and demonstrate its value in downstream corporate performance prediction.</p>
<p>Our study reveals the advantages of FinTextSim over OTS sentence-transformer models and demonstrates the benefits of contemporary topic modeling approaches over classical ones. FinTextSim excels at generating distinct clusters of topics, substantially outperforming OTS sentence-transformers and models finetuned for financial sentiment analysis. Additionally, FinTextSim enables BERTopic to identify high-quality, domain-relevant topics, whereas standard embeddings, financial domain baselines, and classical topic modeling approaches frequently miss key financial concepts, leading to misclassified documents. Combining BERTopic with FinTextSim further enhances the creation of well-separated clusters of financial topics. This underscores the critical role of domain-adapted embeddings for optimal topic modeling outcomes.</p>
<p>Beyond these intrinsic improvements, we demonstrate that enhanced textual representations also yield tangible benefits for corporate performance prediction. When FinTextSim-derived topic features are incorporated into an LR model predicting the direction of ROA changes, performance improves significantly, achieving a two-percentage-point increase in both ROC-AUC and F1-score over a purely financial baseline. In contrast, features derived from alternative embeddings or classical topic models tend to introduce noise, degrading predictive accuracy. Results under XGB present a more nuanced picture. As a non-linear learner, XGB can partially absorb heterogeneous or noisier textual feature representations, leading to modest improvements for several non-FinTextSim topic modeling approaches. Nevertheless, FinTextSim remains the most consistent performer across both linear and non-linear classifiers, achieving the highest ROC-AUC and stable performance across evaluation metrics. These results establish a direct link between topic quality and predictive validity, confirming that domain-specific textual representations can meaningfully enhance corporate performance forecasting.</p>
<p>Our work offers several key contributions. First, we advance contextual embeddings for the financial domain with FinTextSim, which functions as a domain-adapted information filter, addressing the fundamental information processing and retrieval bottleneck in financial text analysis. By transforming unstructured narratives into structured, semantically rich representations, FinTextSim enhances the quality of extracted information and enables ML models to detect economically meaningful signals often overlooked by human analysts and generic models. Second, FinTextSim strengthens the informational content of textual data, allowing analysts and researchers to derive actionable insights that support efficient resource allocation and more informed decision-making. Third, by bridging classical and contemporary topic modeling techniques, we establish a foundation for methodologically consistent and empirically validated model selection in financial text analysis. Finally, we demonstrate the practical value of FinTextSim in a downstream corporate performance prediction task. Thus, our research lays the foundation for integrating narrative information into valuation and forecasting frameworks, highlighting that qualitative disclosures can complement quantitative financial metrics in predictive applications.</p>
<p>Our study is not without limitations. Direct comparison between classical bag-of-words models and contextual embedding approaches remains challenging due to fundamental architectural differences. Additionally, the evaluation of topic models is inherently complex. Single metrics may be misleading, necessitating a holistic combination of quantitative and qualitative assessment.</p>
<p>Future research should continue refining domain-specific embeddings and topic evaluation metrics. Advancements in transformer architectures, embedding strategies, and hyperparameter optimization may further enhance topic stability and interpretability. Integrating FinTextSim-derived features with richer feature sets and more advanced learning frameworks represents another promising avenue. Ultimately, these developments will strengthen the role of FinTextSim as a semantic information filter, deepening our understanding of how corporate narratives convey actionable, forward-looking economic information.</p></sec>
</body>
<back>
<sec sec-type="data-availability" id="s6">
<title>Data availability statement</title>
<p>Publicly available datasets were analyzed in this study. This data can be found here: <ext-link ext-link-type="uri" xlink:href="https://sraf.nd.edu/data/stage-one-10-x-parse-data/">https://sraf.nd.edu/data/stage-one-10-x-parse-data/</ext-link> and in the GitHub repository (<ext-link ext-link-type="uri" xlink:href="https://github.com/JehnenS/FinTextSim">https://github.com/JehnenS/FinTextSim</ext-link>).</p>
</sec>
<sec sec-type="author-contributions" id="s7">
<title>Author contributions</title>
<p>SJ: Methodology, Conceptualization, Software, Project administration, Visualization, Writing &#x02013; original draft, Investigation, Resources, Validation, Data curation. JV-D: Supervision, Project administration, Validation, Methodology, Conceptualization, Writing &#x02013; review &#x00026; editing, Funding acquisition. JO-M: Supervision, Project administration, Methodology, Writing &#x02013; review &#x00026; editing, Conceptualization, Funding acquisition, Validation.</p>
</sec>
<ack><title>Acknowledgments</title><p>A preliminary version of this research appeared as an arXiv preprint (<xref ref-type="bibr" rid="B71">Jehnen et al., 2025</xref>). The present manuscript substantially extends that work by introducing refinements to FinTextSim&#x00027;s training process and by adding a downstream ROA-prediction task, demonstrating FinTextSim&#x00027;s economic relevance through improved corporate performance prediction.</p>
</ack>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>SJ was employed by Beta Klinik GmbH. The remaining author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="ai-statement" id="s9">
<title>Generative AI statement</title>
<p>The author(s) declared that generative AI was used in the creation of this manuscript. During the preparation of this work, the author(s) used ChatGPT-5 to improve readability and language of the work. After using this tool, the author(s) reviewed and edited the content as needed and take full responsibility for the content of the published article.</p>
<p>Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.</p></sec>
<sec sec-type="disclaimer" id="s10">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<sec sec-type="supplementary-material" id="s11">
<title>Supplementary material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/frai.2026.1752103/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/frai.2026.1752103/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Data_Sheet_1.pdf" id="SM1" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/></sec>
<ref-list>
<title>References</title>
<ref id="B1">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Abdelrazek</surname> <given-names>A.</given-names></name> <name><surname>Eid</surname> <given-names>Y.</given-names></name> <name><surname>Gawish</surname> <given-names>E.</given-names></name> <name><surname>Medhat</surname> <given-names>W.</given-names></name> <name><surname>Hassan</surname> <given-names>A.</given-names></name></person-group> (<year>2023</year>). <article-title>Topic modeling algorithms and applications: a survey</article-title>. <source>Inf. Syst</source>. <volume>112</volume>:<fpage>102131</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.is.2022.102131</pub-id></mixed-citation>
</ref>
<ref id="B2">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Abuzayed</surname> <given-names>A.</given-names></name> <name><surname>Al-Khalifa</surname> <given-names>H.</given-names></name></person-group> (<year>2021</year>). <article-title>BERT for Arabic topic modeling: an experimental study on BERTopic technique</article-title>. <source>Procedia Comput. Sci</source>. <volume>189</volume>, <fpage>191</fpage>&#x02013;<lpage>194</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.procs.2021.05.096</pub-id></mixed-citation>
</ref>
<ref id="B3">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Agrawal</surname> <given-names>A.</given-names></name> <name><surname>Fu</surname> <given-names>W.</given-names></name> <name><surname>Menzies</surname> <given-names>T.</given-names></name></person-group> (<year>2018</year>). <article-title>What is wrong with topic modeling? And how to fix it using search-based software engineering</article-title>. <source>Inf. Softw. Technol</source>. <volume>98</volume>, <fpage>74</fpage>&#x02013;<lpage>88</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.infsof.2018.02.005</pub-id></mixed-citation>
</ref>
<ref id="B4">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Albalawi</surname> <given-names>R.</given-names></name> <name><surname>Yeap</surname> <given-names>T. H.</given-names></name> <name><surname>Benyoucef</surname> <given-names>M.</given-names></name></person-group> (<year>2020</year>). <article-title>Using topic modeling methods for short-text data: a comparative analysis</article-title>. <source>Front. Artif. Intell</source>. <volume>3</volume>:<fpage>42</fpage>. doi: <pub-id pub-id-type="doi">10.3389/frai.2020.00042</pub-id><pub-id pub-id-type="pmid">33733159</pub-id></mixed-citation>
</ref>
<ref id="B5">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Allaoui</surname> <given-names>M.</given-names></name> <name><surname>Kherfi</surname> <given-names>M. L.</given-names></name> <name><surname>Cheriet</surname> <given-names>A.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Considerably improving clustering algorithms using umap dimensionality reduction technique: a comparative study,&#x0201D;</article-title> in <source>International Conference on Image and Signal Processing</source> (<publisher-loc>Springer</publisher-loc>), <fpage>317</fpage>&#x02013;<lpage>325</lpage>. doi: <pub-id pub-id-type="doi">10.1007/978-3-030-51935-3_34</pub-id></mixed-citation>
</ref>
<ref id="B6">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Amel-Zadeh</surname> <given-names>A.</given-names></name> <name><surname>Calliess</surname> <given-names>J.-P.</given-names></name> <name><surname>Kaiser</surname> <given-names>D.</given-names></name> <name><surname>Roberts</surname> <given-names>S.</given-names></name></person-group> (<year>2020</year>). <article-title>Machine learning-based financial statement analysis</article-title>. <source>SSRN Electr. J.</source>doi: <pub-id pub-id-type="doi">10.2139/ssrn.3520684</pub-id></mixed-citation>
</ref>
<ref id="B7">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Anand</surname> <given-names>V.</given-names></name> <name><surname>Brunner</surname> <given-names>R.</given-names></name> <name><surname>Ikegwu</surname> <given-names>K.</given-names></name> <name><surname>Sougiannis</surname> <given-names>T.</given-names></name></person-group> (<year>2019</year>). <article-title>Predicting profitability using machine learning</article-title>. <source>Available at SSRN 3466478</source>. doi: <pub-id pub-id-type="doi">10.2139/ssrn.3466478</pub-id></mixed-citation>
</ref>
<ref id="B8">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Angelov</surname> <given-names>D.</given-names></name></person-group> (<year>2020</year>). <article-title>Top2Vec: distributed representations of topics</article-title>. <source>arXiv preprint arXiv:2008.09470</source>.</mixed-citation>
</ref>
<ref id="B9">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Aoki</surname> <given-names>Y.</given-names></name> <name><surname>Ishida</surname> <given-names>S.</given-names></name> <name><surname>Jin</surname> <given-names>M.</given-names></name> <name><surname>Yoneda</surname> <given-names>T.</given-names></name></person-group> (<year>2025</year>). <article-title>Machine learning versus management earnings forecasts</article-title>. <source>Available at SSRN 5365902</source>. doi: <pub-id pub-id-type="doi">10.2139/ssrn.5365902</pub-id></mixed-citation>
</ref>
<ref id="B10">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Araci</surname> <given-names>D.</given-names></name></person-group> (<year>2019</year>). <article-title>FinBERT: financial sentiment analysis with pre-trained language models</article-title>. <source>arXiv preprint arXiv:1908.10063</source>.</mixed-citation>
</ref>
<ref id="B11">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Ashtiani</surname> <given-names>M. N.</given-names></name> <name><surname>Raahemi</surname> <given-names>B.</given-names></name></person-group> (<year>2023</year>). <article-title>News-based intelligent prediction of financial markets using text mining and machine learning: a systematic literature review</article-title>. <source>Expert Syst. Appl</source>. <volume>217</volume>:<fpage>119509</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.eswa.2023.119509</pub-id></mixed-citation>
</ref>
<ref id="B12">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Baden</surname> <given-names>C.</given-names></name> <name><surname>Pipal</surname> <given-names>C.</given-names></name> <name><surname>Schoonvelde</surname> <given-names>M.</given-names></name> <name><surname>van der Velden</surname> <given-names>M. A. G.</given-names></name></person-group> (<year>2022</year>). <article-title>Three gaps in computational text analysis methods for social sciences: a research agenda</article-title>. <source>Commun. Methods Meas</source>. <volume>16</volume>, <fpage>1</fpage>&#x02013;<lpage>18</lpage>. doi: <pub-id pub-id-type="doi">10.1080/19312458.2021.2015574</pub-id></mixed-citation>
</ref>
<ref id="B13">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Bao</surname> <given-names>Y.</given-names></name> <name><surname>Datta</surname> <given-names>A.</given-names></name></person-group> (<year>2014</year>). <article-title>Simultaneously discovering and quantifying risk types from textual risk disclosures</article-title>. <source>Manage. Sci</source>. <volume>60</volume>, <fpage>1371</fpage>&#x02013;<lpage>1391</lpage>. doi: <pub-id pub-id-type="doi">10.1287/mnsc.2014.1930</pub-id></mixed-citation>
</ref>
<ref id="B14">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Baranes</surname> <given-names>A.</given-names></name> <name><surname>Palas</surname> <given-names>R.</given-names></name></person-group> (<year>2019</year>). <article-title>Earning movement prediction using machine learning-support vector machines (SVM)</article-title>. <source>J. Manag. Inf. Dec. Sci</source>. <volume>22</volume>, <fpage>36</fpage>&#x02013;<lpage>53</lpage>. doi: <pub-id pub-id-type="doi">10.18910/100638</pub-id></mixed-citation>
</ref>
<ref id="B15">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Bellstam</surname> <given-names>G.</given-names></name> <name><surname>Bhagat</surname> <given-names>S.</given-names></name> <name><surname>Cookson</surname> <given-names>J. A.</given-names></name></person-group> (<year>2021</year>). <article-title>A text-based analysis of corporate innovation</article-title>. <source>Manage. Sci</source>. <volume>67</volume>, <fpage>4004</fpage>&#x02013;<lpage>4031</lpage>. doi: <pub-id pub-id-type="doi">10.1287/mnsc.2020.3682</pub-id></mixed-citation>
</ref>
<ref id="B16">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Bhattacharya</surname> <given-names>I.</given-names></name> <name><surname>Mickovic</surname> <given-names>A.</given-names></name></person-group> (<year>2024</year>). <article-title>Accounting fraud detection using contextual language learning</article-title>. <source>Int. J. Account. Inf. Syst</source>. <volume>53</volume>:<fpage>100682</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.accinf.2024.100682</pub-id></mixed-citation>
</ref>
<ref id="B17">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Blair</surname> <given-names>S. J.</given-names></name> <name><surname>Bi</surname> <given-names>Y.</given-names></name> <name><surname>Mulvenna</surname> <given-names>M. D.</given-names></name></person-group> (<year>2020</year>). <article-title>Aggregated topic models for increasing social media topic coherence</article-title>. <source>Appl. Intell</source>. <volume>50</volume>, <fpage>138</fpage>&#x02013;<lpage>156</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s10489-019-01438-z</pub-id></mixed-citation>
</ref>
<ref id="B18">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Blei</surname> <given-names>D. M.</given-names></name> <name><surname>Ng</surname> <given-names>A. Y.</given-names></name> <name><surname>Jordan</surname> <given-names>M. I.</given-names></name></person-group> (<year>2003</year>). <article-title>Latent Dirichlet allocation</article-title>. <source>J. Mach. Learn. Res</source>. <volume>3</volume>, <fpage>993</fpage>&#x02013;<lpage>1022</lpage>.</mixed-citation>
</ref>
<ref id="B19">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Booker</surname> <given-names>A.</given-names></name> <name><surname>Chiu</surname> <given-names>V.</given-names></name> <name><surname>Groff</surname> <given-names>N.</given-names></name> <name><surname>Richardson</surname> <given-names>V. J.</given-names></name></person-group> (<year>2024</year>). <article-title>Ais research opportunities utilizing machine learning: from a meta-theory of accounting literature</article-title>. <source>Int. J. Account. Inf. Syst</source>. <volume>52</volume>:<fpage>100661</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.accinf.2023.100661</pub-id></mixed-citation>
</ref>
<ref id="B20">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Brown</surname> <given-names>N. C.</given-names></name> <name><surname>Crowley</surname> <given-names>R. M.</given-names></name> <name><surname>Elliott</surname> <given-names>W. B.</given-names></name></person-group> (<year>2020</year>). <article-title>What are you saying? Using topic to detect financial misreporting</article-title>. <source>J. Account. Res</source>. <volume>58</volume>, <fpage>237</fpage>&#x02013;<lpage>291</lpage>. doi: <pub-id pub-id-type="doi">10.1111/1475-679X.12294</pub-id></mixed-citation>
</ref>
<ref id="B21">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Cai</surname> <given-names>K. N.</given-names></name> <name><surname>Hanley</surname> <given-names>K. W.</given-names></name> <name><surname>Huang</surname> <given-names>A. G.</given-names></name> <name><surname>Zhao</surname> <given-names>X.</given-names></name></person-group> (<year>2022</year>). <source>Risk disclosure and the pricing of corporate debt issues in private and public markets</source>. Georgetown McDonough School of Business Research Paper.</mixed-citation>
</ref>
<ref id="B22">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Campbell</surname> <given-names>J. C.</given-names></name> <name><surname>Hindle</surname> <given-names>A.</given-names></name> <name><surname>Stroulia</surname> <given-names>E.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;Latent dirichlet allocation: extracting topics from software engineering data,&#x0201D;</article-title> in <source>The Art and Science of Analyzing Software Data</source> (<publisher-loc>Elsevier</publisher-loc>), <fpage>139</fpage>&#x02013;<lpage>159</lpage>. doi: <pub-id pub-id-type="doi">10.1016/B978-0-12-411519-4.00006-9</pub-id></mixed-citation>
</ref>
<ref id="B23">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Campbell</surname> <given-names>J. L.</given-names></name> <name><surname>Ham</surname> <given-names>H.</given-names></name> <name><surname>Lu</surname> <given-names>Z.</given-names></name> <name><surname>Wood</surname> <given-names>K.</given-names></name></person-group> (<year>2024</year>). <article-title>Expectations matter: when (not) to use machine learning earnings forecasts</article-title>. <source>Available at SSRN 4495297</source>. doi: <pub-id pub-id-type="doi">10.2139/ssrn.4495297</pub-id></mixed-citation>
</ref>
<ref id="B24">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Cao</surname> <given-names>K.</given-names></name> <name><surname>You</surname> <given-names>H.</given-names></name></person-group> (<year>2024</year>). <article-title>Fundamental analysis via machine learning</article-title>. <source>Financ. Anal. J</source>. <volume>80</volume>, <fpage>74</fpage>&#x02013;<lpage>98</lpage>. doi: <pub-id pub-id-type="doi">10.1080/0015198X.2024.2313692</pub-id></mixed-citation>
</ref>
<ref id="B25">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Carpenter</surname> <given-names>J.</given-names></name> <name><surname>Bithell</surname> <given-names>J.</given-names></name></person-group> (<year>2000</year>). <article-title>Bootstrap confidence intervals: when, which, what? A practical guide for medical statisticians</article-title>. <source>Stat. Med</source>. <volume>19</volume>, <fpage>1141</fpage>&#x02013;<lpage>1164</lpage>. doi: <pub-id pub-id-type="doi">10.1002/(SICI)1097-0258(20000515)19:9&#x0003C;1141::AID-SIM479&#x0003E;3.0.CO;2-F</pub-id><pub-id pub-id-type="pmid">10797513</pub-id></mixed-citation>
</ref>
<ref id="B26">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>P.</given-names></name> <name><surname>Ji</surname> <given-names>M.</given-names></name></person-group> (<year>2025</year>). <article-title>Deep learning-based financial risk early warning model for listed companies: a multi-dimensional analysis approach</article-title>. <source>Expert Syst. Applic</source>. <volume>283</volume>:<fpage>127746</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.eswa.2025.127746</pub-id></mixed-citation>
</ref>
<ref id="B27">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>X.</given-names></name> <name><surname>Cho</surname> <given-names>Y. H.</given-names></name> <name><surname>Dou</surname> <given-names>Y.</given-names></name> <name><surname>Lev</surname> <given-names>B.</given-names></name></person-group> (<year>2022</year>). <article-title>Predicting future earnings changes using machine learning and detailed financial data</article-title>. <source>J. Account. Res</source>. <volume>60</volume>, <fpage>467</fpage>&#x02013;<lpage>515</lpage>. doi: <pub-id pub-id-type="doi">10.1111/1475-679X.12429</pub-id></mixed-citation>
</ref>
<ref id="B28">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>Y.</given-names></name> <name><surname>Rabbani</surname> <given-names>R. M.</given-names></name> <name><surname>Gupta</surname> <given-names>A.</given-names></name> <name><surname>Zaki</surname> <given-names>M. J.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Comparative text analytics via topic modeling in banking,&#x0201D;</article-title> in <source>2017 IEEE Symposium Series on Computational Intelligence (SSCI)</source> (<publisher-loc>IEEE</publisher-loc>), <fpage>1</fpage>&#x02013;<lpage>8</lpage>. doi: <pub-id pub-id-type="doi">10.1109/SSCI.2017.8280945</pub-id></mixed-citation>
</ref>
<ref id="B29">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>Y.</given-names></name> <name><surname>Zhang</surname> <given-names>H.</given-names></name> <name><surname>Liu</surname> <given-names>R.</given-names></name> <name><surname>Ye</surname> <given-names>Z.</given-names></name> <name><surname>Lin</surname> <given-names>J.</given-names></name></person-group> (<year>2019</year>). <article-title>Experimental explorations on short text topic mining between LDA and NMF based schemes</article-title>. <source>Knowl.-Based Syst</source>. <volume>163</volume>, <fpage>1</fpage>&#x02013;<lpage>13</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.knosys.2018.08.011</pub-id></mixed-citation>
</ref>
<ref id="B30">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Cohen</surname> <given-names>L.</given-names></name> <name><surname>Malloy</surname> <given-names>C.</given-names></name> <name><surname>Nguyen</surname> <given-names>Q.</given-names></name></person-group> (<year>2020</year>). <article-title>Lazy prices</article-title>. <source>J. Finance</source> <volume>75</volume>, <fpage>1371</fpage>&#x02013;<lpage>1415</lpage>. doi: <pub-id pub-id-type="doi">10.1111/jofi.12885</pub-id></mixed-citation>
</ref>
<ref id="B31">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Curiskis</surname> <given-names>S. A.</given-names></name> <name><surname>Drake</surname> <given-names>B.</given-names></name> <name><surname>Osborn</surname> <given-names>T. R.</given-names></name> <name><surname>Kennedy</surname> <given-names>P. J.</given-names></name></person-group> (<year>2020</year>). <article-title>An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit</article-title>. <source>Inf. Proc. Manag</source>. <volume>57</volume>:<fpage>102034</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.ipm.2019.04.002</pub-id></mixed-citation>
</ref>
<ref id="B32">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Das</surname> <given-names>A. S.</given-names></name> <name><surname>Mehta</surname> <given-names>S.</given-names></name> <name><surname>Subramaniam</surname> <given-names>L. V.</given-names></name></person-group> (<year>2017</year>). <article-title>Annofin-a hybrid algorithm to annotate financial text</article-title>. <source>Expert Syst. Appl</source>. <volume>88</volume>, <fpage>270</fpage>&#x02013;<lpage>275</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.eswa.2017.07.016</pub-id></mixed-citation>
</ref>
<ref id="B33">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Deveikyte</surname> <given-names>J.</given-names></name> <name><surname>Geman</surname> <given-names>H.</given-names></name> <name><surname>Piccari</surname> <given-names>C.</given-names></name> <name><surname>Provetti</surname> <given-names>A.</given-names></name></person-group> (<year>2022</year>). <article-title>A sentiment analysis approach to the prediction of market volatility</article-title>. <source>Front. Artif. Intell</source>. <volume>5</volume>:<fpage>836809</fpage>. doi: <pub-id pub-id-type="doi">10.3389/frai.2022.836809</pub-id><pub-id pub-id-type="pmid">36620753</pub-id></mixed-citation>
</ref>
<ref id="B34">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Devlin</surname> <given-names>J.</given-names></name> <name><surname>Chang</surname> <given-names>M.-W.</given-names></name> <name><surname>Lee</surname> <given-names>K.</given-names></name> <name><surname>Toutanova</surname> <given-names>K.</given-names></name></person-group> (<year>2019</year>). <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>. <source>arXiv preprint arXiv:1810.04805</source>.</mixed-citation>
</ref>
<ref id="B35">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Di Gennaro</surname> <given-names>G.</given-names></name> <name><surname>Greco</surname> <given-names>C.</given-names></name> <name><surname>Buonanno</surname> <given-names>A.</given-names></name> <name><surname>Cuciniello</surname> <given-names>M.</given-names></name> <name><surname>Amorese</surname> <given-names>T.</given-names></name> <name><surname>Ler</surname> <given-names>M. S.</given-names></name> <etal/></person-group>. (<year>2024</year>). <article-title>Hum-card: a human crowded annotated real dataset</article-title>. <source>Inf. Syst</source>. <volume>124</volume>:<fpage>102409</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.is.2024.102409</pub-id></mixed-citation>
</ref>
<ref id="B36">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Dong</surname> <given-names>M. M.</given-names></name> <name><surname>Stratopoulos</surname> <given-names>T. C.</given-names></name> <name><surname>Wang</surname> <given-names>V. X.</given-names></name></person-group> (<year>2024</year>). <article-title>A scoping review of chatgpt research in accounting and finance</article-title>. <source>Int. J. Account. Inf. Syst</source>. <volume>55</volume>:<fpage>100715</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.accinf.2024.100715</pub-id></mixed-citation>
</ref>
<ref id="B37">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Donoho</surname> <given-names>D.</given-names></name> <name><surname>Stodden</surname> <given-names>V.</given-names></name></person-group> (<year>2003</year>). <article-title>&#x0201C;When does non-negative matrix factorization give a correct decomposition into parts?,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source>, 16.</mixed-citation>
</ref>
<ref id="B38">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Dyer</surname> <given-names>T.</given-names></name> <name><surname>Lang</surname> <given-names>M.</given-names></name> <name><surname>Stice-Lawrence</surname> <given-names>L.</given-names></name></person-group> (<year>2017</year>). <article-title>The evolution of 10-k textual disclosure: evidence from latent dirichlet allocation</article-title>. <source>J. Account. Econ</source>. <volume>64</volume>, <fpage>221</fpage>&#x02013;<lpage>245</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.jacceco.2017.07.002</pub-id></mixed-citation>
</ref>
<ref id="B39">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Easton</surname> <given-names>P. D.</given-names></name> <name><surname>Kapons</surname> <given-names>M. M.</given-names></name> <name><surname>Monahan</surname> <given-names>S. J.</given-names></name> <name><surname>Sch&#x000FC;tt</surname> <given-names>H. H.</given-names></name> <name><surname>Weisbrod</surname> <given-names>E. H.</given-names></name></person-group> (<year>2024</year>). <article-title>Forecasting earnings using k-nearest neighbors</article-title>. <source>Account. Rev</source>. <volume>99</volume>, <fpage>115</fpage>&#x02013;<lpage>140</lpage>. doi: <pub-id pub-id-type="doi">10.2308/TAR-2021-0478</pub-id></mixed-citation>
</ref>
<ref id="B40">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Egger</surname> <given-names>R.</given-names></name> <name><surname>Yu</surname> <given-names>J.</given-names></name></person-group> (<year>2021</year>). <article-title>Identifying hidden semantic structures in instagram data: a topic modelling comparison</article-title>. <source>Tourism Rev</source>. <volume>77</volume>, <fpage>1234</fpage>&#x02013;<lpage>1246</lpage>. doi: <pub-id pub-id-type="doi">10.1108/TR-05-2021-0244</pub-id></mixed-citation>
</ref>
<ref id="B41">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Egger</surname> <given-names>R.</given-names></name> <name><surname>Yu</surname> <given-names>J.</given-names></name></person-group> (<year>2022</year>). <article-title>A topic modeling comparison between lda, nmf, top2vec, and bertopic to demystify twitter posts</article-title>. <source>Front. Sociol</source>. <volume>7</volume>:<fpage>886498</fpage>. doi: <pub-id pub-id-type="doi">10.3389/fsoc.2022.886498</pub-id><pub-id pub-id-type="pmid">35602001</pub-id></mixed-citation>
</ref>
<ref id="B42">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Fengler</surname> <given-names>M. R.</given-names></name> <name><surname>Phan</surname> <given-names>M. T.</given-names></name></person-group> (<year>2025</year>). <article-title>Unveiling themes in 10-k disclosures: a new topic modeling perspective</article-title>. <source>Int. Rev. Finan. Anal</source>. <volume>103</volume>:<fpage>104121</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.irfa.2025.104121</pub-id></mixed-citation>
</ref>
<ref id="B43">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Fernandes</surname> <given-names>N.</given-names></name> <name><surname>Gkolia</surname> <given-names>A.</given-names></name> <name><surname>Pizzo</surname> <given-names>N.</given-names></name> <name><surname>Davenport</surname> <given-names>J.</given-names></name> <name><surname>Nair</surname> <given-names>A.</given-names></name></person-group> (<year>2020</year>). <article-title>Unification of hdp and lda models for optimal topic clustering of subject specific question banks</article-title>. <source>arXiv preprint arXiv:2011.01035</source>.</mixed-citation>
</ref>
<ref id="B44">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Frankel</surname> <given-names>R.</given-names></name> <name><surname>Jennings</surname> <given-names>J.</given-names></name> <name><surname>Lee</surname> <given-names>J.</given-names></name></person-group> (<year>2022</year>). <article-title>Disclosure sentiment: machine learning vs. dictionary methods</article-title>. <source>Manag. Sci</source>. <volume>68</volume>, <fpage>5514</fpage>&#x02013;<lpage>5532</lpage>. doi: <pub-id pub-id-type="doi">10.1287/mnsc.2021.4156</pub-id></mixed-citation>
</ref>
<ref id="B45">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Freeman</surname> <given-names>R. N.</given-names></name> <name><surname>Ohlson</surname> <given-names>J. A.</given-names></name> <name><surname>Penman</surname> <given-names>S. H.</given-names></name></person-group> (<year>1982</year>). <article-title>Book rate-of-return and prediction of earnings changes: an empirical investigation</article-title>. <source>J. Account. Res</source>. <volume>20</volume>, <fpage>639</fpage>&#x02013;<lpage>653</lpage>. doi: <pub-id pub-id-type="doi">10.2307/2490890</pub-id></mixed-citation>
</ref>
<ref id="B46">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Fu</surname> <given-names>Q.</given-names></name> <name><surname>Zhuang</surname> <given-names>Y.</given-names></name> <name><surname>Gu</surname> <given-names>J.</given-names></name> <name><surname>Zhu</surname> <given-names>Y.</given-names></name> <name><surname>Guo</surname> <given-names>X.</given-names></name></person-group> (<year>2021</year>). <article-title>Agreeing to disagree: choosing among eight topic-modeling methods</article-title>. <source>Big Data Res</source>. <volume>23</volume>:<fpage>100173</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.bdr.2020.100173</pub-id></mixed-citation>
</ref>
<ref id="B47">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Gangwani</surname> <given-names>D.</given-names></name> <name><surname>Zhu</surname> <given-names>X.</given-names></name></person-group> (<year>2024</year>). <article-title>Modeling and prediction of business success: a survey</article-title>. <source>Artif. Intell. Rev</source>. <volume>57</volume>:<fpage>44</fpage>. doi: <pub-id pub-id-type="doi">10.1007/s10462-023-10664-4</pub-id></mixed-citation>
</ref>
<ref id="B48">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Garc&#x000ED;a-M&#x000E9;ndez</surname> <given-names>S.</given-names></name> <name><surname>de Arriba-P&#x000E9;rez</surname> <given-names>F.</given-names></name> <name><surname>Barros-Vila</surname> <given-names>A.</given-names></name> <name><surname>Gonz&#x000E1;lez-Casta&#x000F1;o</surname> <given-names>F. J.</given-names></name></person-group> (<year>2023</year>). <article-title>Targeted aspect-based emotion analysis to detect opportunities and precaution in financial twitter messages</article-title>. <source>Expert Syst. Appl</source>. <volume>218</volume>:<fpage>119611</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.eswa.2023.119611</pub-id></mixed-citation>
</ref>
<ref id="B49">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Garc&#x000ED;a-M&#x000E9;ndez</surname> <given-names>S.</given-names></name> <name><surname>de Arriba-P&#x000E9;rez</surname> <given-names>F.</given-names></name> <name><surname>Barros-Vila</surname> <given-names>A.</given-names></name> <name><surname>Gonz&#x000E1;lez-Casta&#x000F1;o</surname> <given-names>F. J.</given-names></name> <name><surname>Costa-Montenegro</surname> <given-names>E.</given-names></name></person-group> (<year>2023</year>). <article-title>Automatic detection of relevant information, predictions and forecasts in financial news through topic modelling with latent dirichlet allocation</article-title>. <source>Appl. Intell</source>. <volume>53</volume>, <fpage>19610</fpage>&#x02013;<lpage>19628</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s10489-023-04452-4</pub-id></mixed-citation>
</ref>
<ref id="B50">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Garc&#x000ED;a-M&#x000E9;ndez</surname> <given-names>S.</given-names></name> <name><surname>de Arriba-P&#x000E9;rez</surname> <given-names>F.</given-names></name> <name><surname>Gonz&#x000E1;lez-Gonz&#x000E1;lez</surname> <given-names>J.</given-names></name> <name><surname>Gonz&#x000E1;lez-Casta&#x000F1;o</surname> <given-names>F. J.</given-names></name></person-group> (<year>2024</year>). <article-title>Explainable assessment of financial experts&#x00027; credibility by classifying social media forecasts and checking the predictions with actual market data</article-title>. <source>Expert Syst. Appl</source>. <volume>255</volume>:<fpage>124515</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.eswa.2024.124515</pub-id></mixed-citation>
</ref>
<ref id="B51">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Geertsema</surname> <given-names>P.</given-names></name> <name><surname>Lu</surname> <given-names>H.</given-names></name></person-group> (<year>2023</year>). <article-title>Relative valuation with machine learning</article-title>. <source>J. Account. Res</source>. <volume>61</volume>, <fpage>329</fpage>&#x02013;<lpage>376</lpage>. doi: <pub-id pub-id-type="doi">10.1111/1475-679X.12464</pub-id></mixed-citation>
</ref>
<ref id="B52">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Gillis</surname> <given-names>N.</given-names></name> <name><surname>Vavasis</surname> <given-names>S. A.</given-names></name></person-group> (<year>2014</year>). <article-title>Fast and robust recursive algorithms for separable nonnegative matrix factorization</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell</source>. <volume>36</volume>, <fpage>698</fpage>&#x02013;<lpage>714</lpage>. doi: <pub-id pub-id-type="doi">10.1109/TPAMI.2013.226</pub-id></mixed-citation>
</ref>
<ref id="B53">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Giudici</surname> <given-names>P.</given-names></name> <name><surname>Wu</surname> <given-names>L.</given-names></name></person-group> (<year>2025</year>). <article-title>Sustainable artificial intelligence in finance: impact of ESG factors</article-title>. <source>Front. Artif. Intell</source>. <volume>8</volume>:<fpage>1566197</fpage>. doi: <pub-id pub-id-type="doi">10.3389/frai.2025.1566197</pub-id><pub-id pub-id-type="pmid">40115117</pub-id></mixed-citation>
</ref>
<ref id="B54">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Gracewell</surname> <given-names>J.</given-names></name> <name><surname>Raj</surname> <given-names>A. A. E.</given-names></name> <name><surname>Kalaivani</surname> <given-names>C.</given-names></name></person-group> (<year>2025</year>). <article-title>Hierarchical aspect-based sentiment analysis using semantic capsuled multi-granular networks</article-title>. <source>Inf. Syst</source>. <volume>132</volume>:<fpage>102556</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.is.2025.102556</pub-id></mixed-citation>
</ref>
<ref id="B55">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Griffin</surname> <given-names>P. A.</given-names></name></person-group> (<year>2003</year>). <article-title>Got information? Investor response to form 10-k and form 10-q EDGAR filings</article-title>. <source>Rev. Account. Stud</source>. <volume>8</volume>, <fpage>433</fpage>&#x02013;<lpage>460</lpage>. doi: <pub-id pub-id-type="doi">10.1023/A:1027351630866</pub-id></mixed-citation>
</ref>
<ref id="B56">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Grigore</surname> <given-names>D.-N.</given-names></name> <name><surname>Pintilie</surname> <given-names>I.</given-names></name></person-group> (<year>2023</year>). <article-title>&#x0201C;Transformer-based topic modeling to measure the severity of eating disorder symptoms,&#x0201D;</article-title> in <source>CLEF (Working Notes)</source>, <fpage>684</fpage>&#x02013;<lpage>692</lpage>.</mixed-citation>
</ref>
<ref id="B57">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Grootendorst</surname> <given-names>M.</given-names></name></person-group> (<year>2022</year>). <article-title>Bertopic: neural topic modeling with a class-based tf-idf procedure</article-title>. <source>arXiv preprint arXiv:2203.05794</source>.</mixed-citation>
</ref>
<ref id="B58">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Gu</surname> <given-names>H.</given-names></name> <name><surname>Schreyer</surname> <given-names>M.</given-names></name> <name><surname>Moffitt</surname> <given-names>K.</given-names></name> <name><surname>Vasarhelyi</surname> <given-names>M.</given-names></name></person-group> (<year>2024</year>). <article-title>Artificial intelligence co-piloted auditing</article-title>. <source>Int. J. Account. Inf. Syst</source>. <volume>54</volume>:<fpage>100698</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.accinf.2024.100698</pub-id></mixed-citation>
</ref>
<ref id="B59">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Guo</surname> <given-names>M.</given-names></name> <name><surname>Zong</surname> <given-names>X.</given-names></name> <name><surname>Guo</surname> <given-names>L.</given-names></name> <name><surname>Lei</surname> <given-names>Y.</given-names></name></person-group> (<year>2024</year>). <article-title>Does haze-related sentiment affect income inequality in China?</article-title> <source>Int. Rev. Econ. Finan</source>. <volume>94</volume>:<fpage>103371</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.iref.2024.05.050</pub-id></mixed-citation>
</ref>
<ref id="B60">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Hagen</surname> <given-names>L.</given-names></name></person-group> (<year>2018</year>). <article-title>Content analysis of e-petitions with topic modeling: How to train and evaluate lda models?</article-title> <source>Inf. Proc. Manag</source>. <volume>54</volume>, <fpage>1292</fpage>&#x02013;<lpage>1307</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.ipm.2018.05.006</pub-id></mixed-citation>
</ref>
<ref id="B61">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Hajek</surname> <given-names>P.</given-names></name> <name><surname>Munk</surname> <given-names>M.</given-names></name></person-group> (<year>2024</year>). <article-title>Corporate financial distress prediction using the risk-related information content of annual reports</article-title>. <source>Inf. Proc. Manag</source>. <volume>61</volume>:<fpage>103820</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.ipm.2024.103820</pub-id></mixed-citation>
</ref>
<ref id="B62">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Hermans</surname> <given-names>A.</given-names></name> <name><surname>Beyer</surname> <given-names>L.</given-names></name> <name><surname>Leibe</surname> <given-names>B.</given-names></name></person-group> (<year>2017</year>). <article-title>In defense of the triplet loss for person re-identification</article-title>. <source>arXiv preprint arXiv:1703.07737</source>.</mixed-citation>
</ref>
<ref id="B63">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Hida</surname> <given-names>G. S.</given-names></name> <name><surname>Do Nascimento</surname> <given-names>A. C. A.</given-names></name></person-group> (<year>2026</year>). <article-title>Overview of machine learning in class imbalance scenarios: trends, challenges, and approaches</article-title>. <source>Expert Syst. Applic</source>. <volume>296</volume>:<fpage>129592</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.eswa.2025.129592</pub-id></mixed-citation>
</ref>
<ref id="B64">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Ho</surname> <given-names>T. K.</given-names></name></person-group> (<year>1995</year>). <article-title>&#x0201C;Random decision forests,&#x0201D;</article-title> in <source>Proceedings of 3rd International Conference on Document Analysis and Recognition</source> (<publisher-loc>IEEE</publisher-loc>), <fpage>278</fpage>&#x02013;<lpage>282</lpage>.</mixed-citation>
</ref>
<ref id="B65">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Hsieh</surname> <given-names>H.-T.</given-names></name> <name><surname>Hristova</surname> <given-names>D.</given-names></name></person-group> (<year>2022</year>). <article-title>&#x0201C;Transformer-based summarization and sentiment analysis of sec 10-k annual reports for company performance prediction,&#x0201D;</article-title> in <source>Proceedings of the 55th Hawaii International Conference on System Sciences, HICSS</source> (<publisher-loc>Hawaii International Conference on System Sciences</publisher-loc>), <fpage>1759</fpage>&#x02013;<lpage>1768</lpage>. doi: <pub-id pub-id-type="doi">10.24251/HICSS.2022.218</pub-id></mixed-citation>
</ref>
<ref id="B66">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>A. H.</given-names></name> <name><surname>Wang</surname> <given-names>H.</given-names></name> <name><surname>Yang</surname> <given-names>Y.</given-names></name></person-group> (<year>2023</year>). <article-title>Finbert: a large language model for extracting information from financial text</article-title>. <source>Contemp. Account. Res</source>. <volume>40</volume>, <fpage>806</fpage>&#x02013;<lpage>841</lpage>. doi: <pub-id pub-id-type="doi">10.1111/1911-3846.12832</pub-id></mixed-citation>
</ref>
<ref id="B67">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>Y.</given-names></name> <name><surname>Tai</surname> <given-names>W.</given-names></name> <name><surname>Zhou</surname> <given-names>F.</given-names></name> <name><surname>Gao</surname> <given-names>Q.</given-names></name> <name><surname>Zhong</surname> <given-names>T.</given-names></name></person-group> (<year>2025</year>). <article-title>Extracting key insights from earnings call transcript via information-theoretic contrastive learning</article-title>. <source>Inf. Proc. Manag</source>. <volume>62</volume>:<fpage>103998</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.ipm.2024.103998</pub-id></mixed-citation>
</ref>
<ref id="B68">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Hunt</surname> <given-names>J.</given-names></name> <name><surname>Myers</surname> <given-names>J.</given-names></name> <name><surname>Myers</surname> <given-names>L.</given-names></name></person-group> (<year>2019</year>). <source>Improving earnings predictions with machine learning</source>. Working Paper.</mixed-citation>
</ref>
<ref id="B69">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Jamshed</surname> <given-names>A.</given-names></name> <name><surname>Oberoi</surname> <given-names>J. S.</given-names></name> <name><surname>Lawal</surname> <given-names>T. O.</given-names></name></person-group> (<year>2025</year>). <article-title>Speaking with one voice? The joint information content of tone in MD&#x00026;A and risk factor disclosures</article-title>. <source>Available at SSRN 5265813</source>. doi: <pub-id pub-id-type="doi">10.2139/ssrn.5265813</pub-id></mixed-citation>
</ref>
<ref id="B70">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Jegadeesh</surname> <given-names>N.</given-names></name> <name><surname>Wu</surname> <given-names>D. A.</given-names></name></person-group> (<year>2017</year>). <article-title>Deciphering fedspeak: the information content of FOMC meetings</article-title>. <source>SSRN Electr. J</source>. doi: <pub-id pub-id-type="doi">10.2139/ssrn.2939937</pub-id></mixed-citation>
</ref>
<ref id="B71">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Jehnen</surname> <given-names>S.</given-names></name> <name><surname>Ordieres-Mer&#x000E9;</surname> <given-names>J.</given-names></name> <name><surname>Villalba-D&#x000ED;ez</surname> <given-names>J.</given-names></name></person-group> (<year>2025</year>). <article-title>Fintextsim: enhancing financial text analysis with bertopic</article-title>. <source>arXiv preprint arXiv:2504.15683</source>.</mixed-citation>
</ref>
<ref id="B72">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Ji</surname> <given-names>R.</given-names></name> <name><surname>Han</surname> <given-names>Q.</given-names></name></person-group> (<year>2022</year>). <article-title>Understanding heterogeneity of investor sentiment on social media: a structural topic modeling approach</article-title>. <source>Front. Artif. Intell</source>. <volume>5</volume>:<fpage>884699</fpage>. doi: <pub-id pub-id-type="doi">10.3389/frai.2022.884699</pub-id><pub-id pub-id-type="pmid">36277168</pub-id></mixed-citation>
</ref>
<ref id="B73">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Jones</surname> <given-names>S.</given-names></name> <name><surname>Moser</surname> <given-names>W. J.</given-names></name> <name><surname>Wieland</surname> <given-names>M. M.</given-names></name></person-group> (<year>2023</year>). <article-title>Machine learning and the prediction of changes in profitability</article-title>. <source>Contemp. Account. Res</source>. <volume>40</volume>, <fpage>2643</fpage>&#x02013;<lpage>2672</lpage>. doi: <pub-id pub-id-type="doi">10.1111/1911-3846.12888</pub-id></mixed-citation>
</ref>
<ref id="B74">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Jun</surname> <given-names>S. Y.</given-names></name> <name><surname>Kim</surname> <given-names>D. S.</given-names></name> <name><surname>Jung</surname> <given-names>S. Y.</given-names></name> <name><surname>Jun</surname> <given-names>S. G.</given-names></name> <name><surname>Kim</surname> <given-names>J. W.</given-names></name></person-group> (<year>2022</year>). <article-title>Stock investment strategy combining earnings power index and machine learning</article-title>. <source>Int. J. Account. Inf. Syst</source>. <volume>47</volume>:<fpage>100576</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.accinf.2022.100576</pub-id></mixed-citation>
</ref>
<ref id="B75">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Karami</surname> <given-names>M.</given-names></name> <name><surname>Ghodsi</surname> <given-names>A.</given-names></name></person-group> (<year>2024</year>). <article-title>Orchid: flexible and data-dependent convolution for sequence modeling</article-title>. <source>arXiv preprint arXiv:2402.18508</source>.</mixed-citation>
</ref>
<ref id="B76">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Kim</surname> <given-names>M. G.</given-names></name> <name><surname>Kim</surname> <given-names>K. S.</given-names></name> <name><surname>Lee</surname> <given-names>K. C.</given-names></name></person-group> (<year>2022</year>). <article-title>&#x0201C;Analyzing the effects of topics underlying companies&#x00027; financial disclosures about risk factors on prediction of ESG risk ratings: emphasis on bertopic,&#x0201D;</article-title> in <source>2022 IEEE International Conference on Big Data (Big Data)</source> (<publisher-loc>IEEE</publisher-loc>), <fpage>4520</fpage>&#x02013;<lpage>4527</lpage>. doi: <pub-id pub-id-type="doi">10.1109/BigData55660.2022.10021110</pub-id></mixed-citation>
</ref>
<ref id="B77">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Koval</surname> <given-names>R.</given-names></name> <name><surname>Andrews</surname> <given-names>N.</given-names></name> <name><surname>Yan</surname> <given-names>X.</given-names></name></person-group> (<year>2024</year>). <article-title>&#x0201C;Financial forecasting from textual and tabular time series,&#x0201D;</article-title> in <source>Findings of the Association for Computational Linguistics: EMNLP 2024</source>, <fpage>8289</fpage>&#x02013;<lpage>8300</lpage>. doi: <pub-id pub-id-type="doi">10.18653/v1/2024.findings-emnlp.486</pub-id></mixed-citation>
</ref>
<ref id="B78">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Lee</surname> <given-names>C.-Y.</given-names></name> <name><surname>Anderl</surname> <given-names>E.</given-names></name></person-group> (<year>2025</year>). <article-title>Does business news sentiment matter in the energy stock market? Adopting sentiment analysis for short-term stock market prediction in the energy industry</article-title>. <source>Front. Artif. Intell</source>. <volume>8</volume>:<fpage>1559900</fpage>. doi: <pub-id pub-id-type="doi">10.3389/frai.2025.1559900</pub-id><pub-id pub-id-type="pmid">40771943</pub-id></mixed-citation>
</ref>
<ref id="B79">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Lee</surname> <given-names>D. D.</given-names></name> <name><surname>Seung</surname> <given-names>H. S.</given-names></name></person-group> (<year>1999</year>). <article-title>Learning the parts of objects by non-negative matrix factorization</article-title>. <source>Nature</source> <volume>401</volume>, <fpage>788</fpage>&#x02013;<lpage>791</lpage>. doi: <pub-id pub-id-type="doi">10.1038/44565</pub-id><pub-id pub-id-type="pmid">10548103</pub-id></mixed-citation>
</ref>
<ref id="B80">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Levy</surname> <given-names>J. J.</given-names></name> <name><surname>O&#x00027;Malley</surname> <given-names>A. J.</given-names></name></person-group> (<year>2020</year>). <article-title>Don&#x00027;t dismiss logistic regression: the case for sensible extraction of interactions in the era of machine learning</article-title>. <source>BMC Med. Res. Methodol</source>. <volume>20</volume>:<fpage>171</fpage>. doi: <pub-id pub-id-type="doi">10.1186/s12874-020-01046-3</pub-id><pub-id pub-id-type="pmid">32600277</pub-id></mixed-citation>
</ref>
<ref id="B81">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>F.</given-names></name></person-group> (<year>2010</year>). <article-title>The information content of forward-looking statements in corporate filings&#x02013;a na&#x000EF;ve Bayesian machine learning approach</article-title>. <source>J. Account. Res</source>. <volume>48</volume>, <fpage>1049</fpage>&#x02013;<lpage>1102</lpage>. doi: <pub-id pub-id-type="doi">10.1111/j.1475-679X.2010.00382.x</pub-id></mixed-citation>
</ref>
<ref id="B82">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>T.</given-names></name> <name><surname>Chen</surname> <given-names>H.</given-names></name> <name><surname>Liu</surname> <given-names>W.</given-names></name> <name><surname>Yu</surname> <given-names>G.</given-names></name> <name><surname>Yu</surname> <given-names>Y.</given-names></name></person-group> (<year>2023</year>). <article-title>Understanding the role of social media sentiment in identifying irrational herding behavior in the stock market</article-title>. <source>Int. Rev. Econ. Finance</source> <volume>87</volume>, <fpage>163</fpage>&#x02013;<lpage>179</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.iref.2023.04.016</pub-id></mixed-citation>
</ref>
<ref id="B83">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>M.</given-names></name></person-group> (<year>2022</year>). <article-title>Assessing human information processing in lending decisions: a machine learning approach</article-title>. <source>J. Account. Res</source>. <volume>60</volume>, <fpage>607</fpage>&#x02013;<lpage>651</lpage>. doi: <pub-id pub-id-type="doi">10.1111/1475-679X.12427</pub-id></mixed-citation>
</ref>
<ref id="B84">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Lowry</surname> <given-names>M.</given-names></name> <name><surname>Michaely</surname> <given-names>R.</given-names></name> <name><surname>Volkova</surname> <given-names>E.</given-names></name></person-group> (<year>2020</year>). <article-title>Information revealed through the regulatory process: interactions between the SEC and companies ahead of their IPO</article-title>. <source>Rev. Financ. Stud</source>. <volume>33</volume>, <fpage>5510</fpage>&#x02013;<lpage>5554</lpage>. doi: <pub-id pub-id-type="doi">10.1093/rfs/hhaa007</pub-id></mixed-citation>
</ref>
<ref id="B85">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Lu</surname> <given-names>J.</given-names></name></person-group> (<year>2022</year>). <article-title>Limited attention: implications for financial reporting</article-title>. <source>J. Account. Res</source>. <volume>60</volume>, <fpage>1991</fpage>&#x02013;<lpage>2027</lpage>. doi: <pub-id pub-id-type="doi">10.1111/1475-679X.12432</pub-id></mixed-citation>
</ref>
<ref id="B86">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Luo</surname> <given-names>Z.</given-names></name> <name><surname>Liu</surname> <given-names>L.</given-names></name> <name><surname>Ananiadou</surname> <given-names>S.</given-names></name> <name><surname>Xie</surname> <given-names>Q.</given-names></name></person-group> (<year>2024</year>). <article-title>Graph contrastive topic model</article-title>. <source>Expert Syst. Appl</source>. <volume>255</volume>:<fpage>124631</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.eswa.2024.124631</pub-id></mixed-citation>
</ref>
<ref id="B87">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Maier</surname> <given-names>D.</given-names></name> <name><surname>Waldherr</surname> <given-names>A.</given-names></name> <name><surname>Miltner</surname> <given-names>P.</given-names></name> <name><surname>Wiedemann</surname> <given-names>G.</given-names></name> <name><surname>Niekler</surname> <given-names>A.</given-names></name> <name><surname>Keinert</surname> <given-names>A.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>Applying LDA topic modeling in communication research: toward a valid and reliable methodology</article-title>. <source>Commun. Methods Meas</source>. <volume>12</volume>, <fpage>93</fpage>&#x02013;<lpage>118</lpage>. doi: <pub-id pub-id-type="doi">10.1080/19312458.2018.1430754</pub-id></mixed-citation>
</ref>
<ref id="B88">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Masson</surname> <given-names>C.</given-names></name> <name><surname>Paroubek</surname> <given-names>P.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;NLP analytics in finance with DoRe: a French 257M tokens corpus of corporate annual reports,&#x0201D;</article-title> in <source>Language Resources and Evaluation Conference (LREC 2020)</source> (<publisher-loc>ELRA</publisher-loc>), <fpage>2261</fpage>&#x02013;<lpage>2267</lpage>.</mixed-citation>
</ref>
<ref id="B89">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>McInnes</surname> <given-names>L.</given-names></name> <name><surname>Healy</surname> <given-names>J.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Accelerated hierarchical density clustering,&#x0201D;</article-title> in <source>2017 IEEE International Conference on Data Mining Workshops (ICDMW)</source>, 33&#x02013;42. doi: <pub-id pub-id-type="doi">10.1109/ICDMW.2017.12</pub-id></mixed-citation>
</ref>
<ref id="B90">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Mohammed</surname> <given-names>S.</given-names></name> <name><surname>Budach</surname> <given-names>L.</given-names></name> <name><surname>Feuerpfeil</surname> <given-names>M.</given-names></name> <name><surname>Ihde</surname> <given-names>N.</given-names></name> <name><surname>Nathansen</surname> <given-names>A.</given-names></name> <name><surname>Noack</surname> <given-names>N.</given-names></name> <etal/></person-group>. (<year>2025</year>). <article-title>The effects of data quality on machine learning performance on tabular data</article-title>. <source>Inf. Syst</source>. <volume>132</volume>:<fpage>102549</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.is.2025.102549</pub-id></mixed-citation>
</ref>
<ref id="B91">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Murphy</surname> <given-names>B.</given-names></name> <name><surname>Feeney</surname> <given-names>O.</given-names></name> <name><surname>Rosati</surname> <given-names>P.</given-names></name> <name><surname>Lynn</surname> <given-names>T.</given-names></name></person-group> (<year>2024</year>). <article-title>Exploring accounting and AI using topic modelling</article-title>. <source>Int. J. Account. Inf. Syst</source>. <volume>55</volume>:<fpage>100709</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.accinf.2024.100709</pub-id></mixed-citation>
</ref>
<ref id="B92">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Nazareth</surname> <given-names>N.</given-names></name> <name><surname>Reddy</surname> <given-names>Y. V. R.</given-names></name></person-group> (<year>2023</year>). <article-title>Financial applications of machine learning: a literature review</article-title>. <source>Expert Syst. Appl</source>. <volume>219</volume>:<fpage>119640</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.eswa.2023.119640</pub-id></mixed-citation>
</ref>
<ref id="B93">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>O&#x00027;Callaghan</surname> <given-names>D.</given-names></name> <name><surname>Greene</surname> <given-names>D.</given-names></name> <name><surname>Carthy</surname> <given-names>J.</given-names></name> <name><surname>Cunningham</surname> <given-names>P.</given-names></name></person-group> (<year>2015</year>). <article-title>An analysis of the coherence of descriptors in topic modeling</article-title>. <source>Expert Syst. Appl</source>. <volume>42</volume>, <fpage>5645</fpage>&#x02013;<lpage>5657</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.eswa.2015.02.055</pub-id></mixed-citation>
</ref>
<ref id="B94">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Ou</surname> <given-names>J. A.</given-names></name> <name><surname>Penman</surname> <given-names>S. H.</given-names></name></person-group> (<year>1989</year>). <article-title>Financial statement analysis and the prediction of stock returns</article-title>. <source>J. Account. Econ</source>. <volume>11</volume>, <fpage>295</fpage>&#x02013;<lpage>329</lpage>. doi: <pub-id pub-id-type="doi">10.1016/0165-4101(89)90017-7</pub-id></mixed-citation>
</ref>
<ref id="B95">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Peng</surname> <given-names>Y.</given-names></name></person-group> (<year>2025</year>). <article-title>Earnings prediction using machine learning: a survey</article-title>. <source>Osaka Univ. Econ</source>. <volume>74</volume>, <fpage>45</fpage>&#x02013;<lpage>60</lpage>.</mixed-citation>
</ref>
<ref id="B96">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Pufahl</surname> <given-names>L.</given-names></name> <name><surname>Stiehle</surname> <given-names>F.</given-names></name> <name><surname>Ihde</surname> <given-names>S.</given-names></name> <name><surname>Weske</surname> <given-names>M.</given-names></name> <name><surname>Weber</surname> <given-names>I.</given-names></name></person-group> (<year>2025</year>). <article-title>Resource allocation in business process executions&#x02013;a systematic literature study</article-title>. <source>Inf. Syst</source>. <volume>132</volume>:<fpage>102541</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.is.2025.102541</pub-id></mixed-citation>
</ref>
<ref id="B97">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Ranta</surname> <given-names>M.</given-names></name> <name><surname>Ylinen</surname> <given-names>M.</given-names></name></person-group> (<year>2024</year>). <article-title>Employee benefits and company performance: evidence from a high-dimensional machine learning model</article-title>. <source>Manag. Account. Res</source>. <volume>64</volume>:<fpage>100876</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.mar.2023.100876</pub-id></mixed-citation>
</ref>
<ref id="B98">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Ranta</surname> <given-names>M.</given-names></name> <name><surname>Ylinen</surname> <given-names>M.</given-names></name> <name><surname>J&#x000E4;rvenp&#x000E4;&#x000E4;</surname> <given-names>M.</given-names></name></person-group> (<year>2022</year>). <article-title>Machine learning in management accounting research: literature review and pathways for the future</article-title>. <source>Eur. Account. Rev</source>. <volume>32</volume>, <fpage>607</fpage>&#x02013;<lpage>636</lpage>. doi: <pub-id pub-id-type="doi">10.1080/09638180.2022.2137221</pub-id></mixed-citation>
</ref>
<ref id="B99">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Rashid</surname> <given-names>J.</given-names></name> <name><surname>Shah</surname> <given-names>S. M. A.</given-names></name> <name><surname>Irtaza</surname> <given-names>A.</given-names></name></person-group> (<year>2019</year>). <article-title>Fuzzy topic modeling approach for text mining over short text</article-title>. <source>Inf. Proc. Manag</source>. <volume>56</volume>:<fpage>102060</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.ipm.2019.102060</pub-id></mixed-citation>
</ref>
<ref id="B100">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Reimers</surname> <given-names>N.</given-names></name> <name><surname>Gurevych</surname> <given-names>I.</given-names></name></person-group> (<year>2019</year>). <article-title>Sentence-BERT: sentence embeddings using Siamese BERT-networks</article-title>. <source>arXiv preprint arXiv:1908.10084</source>.</mixed-citation>
</ref>
<ref id="B101">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>R&#x000F6;der</surname> <given-names>M.</given-names></name> <name><surname>Both</surname> <given-names>A.</given-names></name> <name><surname>Hinneburg</surname> <given-names>A.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;Exploring the space of topic coherence measures,&#x0201D;</article-title> in <source>Proceedings of the Eighth ACM International Conference on Web Search and Data Mining</source>, 399&#x02013;408. doi: <pub-id pub-id-type="doi">10.1145/2684822.2685324</pub-id></mixed-citation>
</ref>
<ref id="B102">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Rossi</surname> <given-names>A. G.</given-names></name> <name><surname>Utkus</surname> <given-names>S. P.</given-names></name></person-group> (<year>2020</year>). <source>Who benefits from robo-advising? Evidence from machine learning</source>. Working paper. doi: <pub-id pub-id-type="doi">10.64202/wp.120.202001</pub-id></mixed-citation>
</ref>
<ref id="B103">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>S&#x000E1;nchez-Franco</surname> <given-names>M. J.</given-names></name> <name><surname>Rey-Moreno</surname> <given-names>M.</given-names></name></person-group> (<year>2022</year>). <article-title>Do travelers&#x00027; reviews depend on the destination? An analysis in coastal and urban peer-to-peer lodgings</article-title>. <source>Psychol. Market</source>. <volume>39</volume>, <fpage>441</fpage>&#x02013;<lpage>459</lpage>. doi: <pub-id pub-id-type="doi">10.1002/mar.21608</pub-id></mixed-citation>
</ref>
<ref id="B104">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Senave</surname> <given-names>E.</given-names></name> <name><surname>Jans</surname> <given-names>M. J.</given-names></name> <name><surname>Srivastava</surname> <given-names>R. P.</given-names></name></person-group> (<year>2023</year>). <article-title>The application of text mining in accounting</article-title>. <source>Int. J. Account. Inf. Syst</source>. <volume>50</volume>:<fpage>100624</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.accinf.2023.100624</pub-id></mixed-citation>
</ref>
<ref id="B105">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Sia</surname> <given-names>S.</given-names></name> <name><surname>Dalmia</surname> <given-names>A.</given-names></name> <name><surname>Mielke</surname> <given-names>S. J.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Tired of topic models? Clusters of pretrained word embeddings make for fast and good topics too!,&#x0201D;</article-title> in <source>Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source> (<publisher-loc>Association for Computational Linguistics</publisher-loc>), <fpage>1728</fpage>&#x02013;<lpage>1736</lpage>. doi: <pub-id pub-id-type="doi">10.18653/v1/2020.emnlp-main.135</pub-id></mixed-citation>
</ref>
<ref id="B106">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Siano</surname> <given-names>F.</given-names></name></person-group> (<year>2025</year>). <article-title>The news in earnings announcement disclosures: capturing word context using LLM methods</article-title>. <source>Manag. Sci</source>. <volume>71</volume>, <fpage>9831</fpage>&#x02013;<lpage>9855</lpage>. doi: <pub-id pub-id-type="doi">10.1287/mnsc.2024.05417</pub-id></mixed-citation>
</ref>
<ref id="B107">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Siino</surname> <given-names>M.</given-names></name> <name><surname>Tinnirello</surname> <given-names>I.</given-names></name> <name><surname>La Cascia</surname> <given-names>M.</given-names></name></person-group> (<year>2024</year>). <article-title>Is text preprocessing still worth the time? A comparative survey on the influence of popular preprocessing methods on transformers and traditional classifiers</article-title>. <source>Inf. Syst</source>. <volume>121</volume>:<fpage>102342</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.is.2023.102342</pub-id></mixed-citation>
</ref>
<ref id="B108">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Song</surname> <given-names>J.</given-names></name> <name><surname>Lu</surname> <given-names>X.</given-names></name> <name><surname>Hong</surname> <given-names>J.</given-names></name> <name><surname>Wang</surname> <given-names>F.</given-names></name></person-group> (<year>2025</year>). <article-title>External information enhancing topic model based on graph neural network</article-title>. <source>Expert Syst. Appl</source>. <volume>263</volume>:<fpage>125709</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.eswa.2024.125709</pub-id></mixed-citation>
</ref>
<ref id="B109">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Sun</surname> <given-names>Y.</given-names></name> <name><surname>Zhao</surname> <given-names>J.</given-names></name> <name><surname>Xu</surname> <given-names>H.</given-names></name> <name><surname>Zhang</surname> <given-names>R.</given-names></name> <name><surname>Liu</surname> <given-names>C.</given-names></name> <name><surname>Yuan</surname> <given-names>L.</given-names></name></person-group> (<year>2026</year>). <article-title>Enhancing neural topic modeling for social media text via semantic bag of word clusters and log-domain Sinkhorn transport</article-title>. <source>Inf. Proc. Manag</source>. <volume>63</volume>:<fpage>104411</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.ipm.2025.104411</pub-id></mixed-citation>
</ref>
<ref id="B110">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Swade</surname> <given-names>A.</given-names></name> <name><surname>Hanauer</surname> <given-names>M. X.</given-names></name> <name><surname>Lohre</surname> <given-names>H.</given-names></name> <name><surname>Blitz</surname> <given-names>D.</given-names></name></person-group> (<year>2023</year>). <article-title>Factor zoo</article-title>. <source>J. Portfolio Manag</source>. <volume>50</volume>, <fpage>11</fpage>&#x02013;<lpage>31</lpage>. doi: <pub-id pub-id-type="doi">10.3905/jpm.2023.1.561</pub-id></mixed-citation>
</ref>
<ref id="B111">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Taha</surname> <given-names>K.</given-names></name></person-group> (<year>2023</year>). <article-title>Semi-supervised and un-supervised clustering: a review and experimental evaluation</article-title>. <source>Inf. Syst</source>. <volume>114</volume>:<fpage>102178</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.is.2023.102178</pub-id></mixed-citation>
</ref>
<ref id="B112">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Tang</surname> <given-names>Y.-K.</given-names></name> <name><surname>Huang</surname> <given-names>H.</given-names></name> <name><surname>Shi</surname> <given-names>X.</given-names></name> <name><surname>Mao</surname> <given-names>X.-L.</given-names></name></person-group> (<year>2025</year>). <article-title>Bridging insight gaps in topic dependency discovery with a knowledge-inspired topic model</article-title>. <source>Inf. Proc. Manag</source>. <volume>62</volume>:<fpage>103911</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.ipm.2024.103911</pub-id></mixed-citation>
</ref>
<ref id="B113">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Theodorakopoulos</surname> <given-names>L.</given-names></name> <name><surname>Theodoropoulou</surname> <given-names>A.</given-names></name> <name><surname>Bakalis</surname> <given-names>A.</given-names></name></person-group> (<year>2025</year>). <article-title>Big data in financial risk management: evidence, advances, and open questions. A systematic review</article-title>. <source>Front. Artif. Intell</source>. <volume>8</volume>:<fpage>1658375</fpage>. doi: <pub-id pub-id-type="doi">10.3389/frai.2025.1658375</pub-id><pub-id pub-id-type="pmid">41104145</pub-id></mixed-citation>
</ref>
<ref id="B114">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Uddin</surname> <given-names>A.</given-names></name> <name><surname>Tao</surname> <given-names>X.</given-names></name> <name><surname>Chou</surname> <given-names>C.-C.</given-names></name> <name><surname>Yu</surname> <given-names>D.</given-names></name></person-group> (<year>2022</year>). <article-title>&#x0201C;Machine learning for earnings prediction: a nonlinear tensor approach for data integration and completion,&#x0201D;</article-title> in <source>Proceedings of the Third ACM International Conference on AI in Finance</source>, 282&#x02013;290. doi: <pub-id pub-id-type="doi">10.1145/3533271.3561677</pub-id></mixed-citation>
</ref>
<ref id="B115">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Ueda</surname> <given-names>K.</given-names></name> <name><surname>Suwa</surname> <given-names>H.</given-names></name> <name><surname>Yamada</surname> <given-names>M.</given-names></name> <name><surname>Ogawa</surname> <given-names>Y.</given-names></name> <name><surname>Umehara</surname> <given-names>E.</given-names></name> <name><surname>Yamashita</surname> <given-names>T.</given-names></name> <etal/></person-group>. (<year>2024</year>). <article-title>SSCDV: social media document embedding with sentiment and topics for financial market forecasting</article-title>. <source>Expert Syst. Appl</source>. <volume>245</volume>:<fpage>122988</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.eswa.2023.122988</pub-id></mixed-citation>
</ref>
<ref id="B116">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Van Binsbergen</surname> <given-names>J. H.</given-names></name> <name><surname>Han</surname> <given-names>X.</given-names></name> <name><surname>Lopez-Lira</surname> <given-names>A.</given-names></name></person-group> (<year>2023</year>). <article-title>Man versus machine learning: the term structure of earnings expectations and conditional biases</article-title>. <source>Rev. Financ. Stud</source>. <volume>36</volume>, <fpage>2361</fpage>&#x02013;<lpage>2396</lpage>. doi: <pub-id pub-id-type="doi">10.1093/rfs/hhac085</pub-id></mixed-citation>
</ref>
<ref id="B117">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Varian</surname> <given-names>H. R.</given-names></name></person-group> (<year>2014</year>). <article-title>Big data: new tricks for econometrics</article-title>. <source>J. Econ. Perspect</source>. <volume>28</volume>, <fpage>3</fpage>&#x02013;<lpage>28</lpage>. doi: <pub-id pub-id-type="doi">10.1257/jep.28.2.3</pub-id></mixed-citation>
</ref>
<ref id="B118">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Vaswani</surname> <given-names>A.</given-names></name> <name><surname>Shazeer</surname> <given-names>N.</given-names></name> <name><surname>Parmar</surname> <given-names>N.</given-names></name> <name><surname>Uszkoreit</surname> <given-names>J.</given-names></name> <name><surname>Jones</surname> <given-names>L.</given-names></name> <name><surname>Gomez</surname> <given-names>A. N.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>&#x0201C;Attention is all you need,&#x0201D;</article-title> in <source>31st Conference on Neural Information Processing Systems (NIPS 2017)</source> (<publisher-loc>Long Beach, CA, USA</publisher-loc>), <fpage>1</fpage>&#x02013;<lpage>15</lpage>.</mixed-citation>
</ref>
<ref id="B119">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Veganzones</surname> <given-names>D.</given-names></name> <name><surname>Severin</surname> <given-names>E.</given-names></name></person-group> (<year>2025</year>). <article-title>Earnings management visualization and prediction using machine learning methods</article-title>. <source>Int. J. Account. Inf. Syst</source>. <volume>56</volume>:<fpage>100743</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.accinf.2025.100743</pub-id></mixed-citation>
</ref>
<ref id="B120">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>J.</given-names></name> <name><surname>Zhang</surname> <given-names>X.-L.</given-names></name></person-group> (<year>2023</year>). <article-title>Deep NMF topic modeling</article-title>. <source>Neurocomputing</source> <volume>515</volume>, <fpage>157</fpage>&#x02013;<lpage>173</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.neucom.2022.10.002</pub-id></mixed-citation>
</ref>
<ref id="B121">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>Q.</given-names></name> <name><surname>Su</surname> <given-names>T.</given-names></name> <name><surname>Lau</surname> <given-names>R. Y. K.</given-names></name> <name><surname>Xie</surname> <given-names>H.</given-names></name></person-group> (<year>2023</year>). <article-title>DeepEmotionNet: emotion mining for corporate performance analysis and prediction</article-title>. <source>Inf. Proc. Manag</source>. <volume>60</volume>:<fpage>103151</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.ipm.2022.103151</pub-id></mixed-citation>
</ref>
<ref id="B122">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>R.</given-names></name> <name><surname>Ren</surname> <given-names>P.</given-names></name> <name><surname>Liu</surname> <given-names>X.</given-names></name> <name><surname>Chang</surname> <given-names>S.</given-names></name> <name><surname>Huang</surname> <given-names>H.</given-names></name></person-group> (<year>2024</year>). <article-title>DCTM: dual contrastive topic model for identifiable topic extraction</article-title>. <source>Inf. Proc. Manag</source>. <volume>61</volume>:<fpage>103785</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.ipm.2024.103785</pub-id></mixed-citation>
</ref>
<ref id="B123">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>Y.</given-names></name> <name><surname>Zhang</surname> <given-names>J.</given-names></name> <name><surname>Yang</surname> <given-names>Z.</given-names></name> <name><surname>Wang</surname> <given-names>B.</given-names></name> <name><surname>Jin</surname> <given-names>J.</given-names></name> <name><surname>Liu</surname> <given-names>Y.</given-names></name></person-group> (<year>2024</year>). <article-title>Improving extractive summarization with semantic enhancement through topic-injection based BERT model</article-title>. <source>Inf. Proc. Manag</source>. <volume>61</volume>:<fpage>103677</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.ipm.2024.103677</pub-id></mixed-citation>
</ref>
<ref id="B124">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Warner</surname> <given-names>B.</given-names></name> <name><surname>Chaffin</surname> <given-names>A.</given-names></name> <name><surname>Clavi&#x000E9;</surname> <given-names>B.</given-names></name> <name><surname>Weller</surname> <given-names>O.</given-names></name> <name><surname>Hallstr&#x000F6;m</surname> <given-names>O.</given-names></name> <name><surname>Taghadouini</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2024</year>). <article-title>Smarter, better, faster, longer: a modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference</article-title>. <source>arXiv preprint arXiv:2412.13663</source>.</mixed-citation>
</ref>
<ref id="B125">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Xie</surname> <given-names>H.</given-names></name> <name><surname>Luo</surname> <given-names>J.</given-names></name> <name><surname>Tan</surname> <given-names>X.</given-names></name></person-group> (<year>2025</year>). <article-title>Artificial intelligence technology application and corporate ESG performance&#x02013;evidence from national pilot zones for artificial intelligence innovation and application</article-title>. <source>Front. Artif. Intell</source>. <volume>8</volume>:<fpage>1643684</fpage>. doi: <pub-id pub-id-type="doi">10.3389/frai.2025.1643684</pub-id><pub-id pub-id-type="pmid">41058911</pub-id></mixed-citation>
</ref>
<ref id="B126">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Xinyue</surname> <given-names>C.</given-names></name> <name><surname>Zhaoyu</surname> <given-names>X.</given-names></name> <name><surname>Yue</surname> <given-names>Z.</given-names></name></person-group> (<year>2020</year>). <article-title>Using machine learning to forecast future earnings</article-title>. <source>Atlantic Econ. J</source>. <volume>48</volume>, <fpage>543</fpage>&#x02013;<lpage>545</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s11293-020-09691-1</pub-id></mixed-citation>
</ref>
<ref id="B127">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Yadavilli</surname> <given-names>V. S.</given-names></name> <name><surname>Seshadri</surname> <given-names>K.</given-names></name> <name><surname>Bhattu</surname> <given-names>N.</given-names></name></person-group> (<year>2024</year>). <article-title>Joint modeling of causal phrases-sentiments-aspects using hierarchical pitman yor process</article-title>. <source>Inf. Proc. Manag</source>. <volume>61</volume>:<fpage>103753</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.ipm.2024.103753</pub-id></mixed-citation>
</ref>
<ref id="B128">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>&#x0017B;bikowski</surname> <given-names>K.</given-names></name> <name><surname>Antosiuk</surname> <given-names>P.</given-names></name></person-group> (<year>2021</year>). <article-title>A machine learning, bias-free approach for predicting business success using crunchbase data</article-title>. <source>Inf. Proc. Manag</source>. <volume>58</volume>:<fpage>102555</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.ipm.2021.102555</pub-id></mixed-citation>
</ref>
<ref id="B129">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>M.</given-names></name> <name><surname>Xu</surname> <given-names>C.</given-names></name> <name><surname>Gan</surname> <given-names>Y.</given-names></name> <name><surname>Wang</surname> <given-names>Y.</given-names></name> <name><surname>Fu</surname> <given-names>Y.</given-names></name> <name><surname>Chen</surname> <given-names>Y.</given-names></name></person-group> (<year>2026</year>). <article-title>Automating construction contract question answering using large language model and fine-tuning</article-title>. <source>Expert Syst. Appl</source>. <volume>297</volume>:<fpage>129493</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.eswa.2025.129493</pub-id></mixed-citation>
</ref>
<ref id="B130">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Zhao</surname> <given-names>H.</given-names></name> <name><surname>Phung</surname> <given-names>D.</given-names></name> <name><surname>Huynh</surname> <given-names>V.</given-names></name> <name><surname>Jin</surname> <given-names>Y.</given-names></name> <name><surname>Du</surname> <given-names>L.</given-names></name> <name><surname>Buntine</surname> <given-names>W.</given-names></name></person-group> (<year>2021</year>). <article-title>Topic modelling meets deep neural networks: a survey</article-title>. <source>arXiv preprint arXiv:2103.00498</source>.</mixed-citation>
</ref>
<ref id="B131">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Zheng</surname> <given-names>L.</given-names></name> <name><surname>He</surname> <given-names>Z.</given-names></name> <name><surname>He</surname> <given-names>S.</given-names></name></person-group> (<year>2025</year>). <article-title>A topic model-based knowledge graph to detect product defects from social media data</article-title>. <source>Expert Syst. Appl</source>. <volume>268</volume>:<fpage>126313</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.eswa.2024.126313</pub-id></mixed-citation>
</ref>
<ref id="B132">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Zhu</surname> <given-names>X.</given-names></name></person-group> (<year>2026</year>). <article-title>Intelligent decision support systems for improving financial forecasting and market trend analysis</article-title>. <source>Expert Syst. Appl</source>. <volume>297</volume>:<fpage>129462</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.eswa.2025.129462</pub-id></mixed-citation>
</ref>
<ref id="B133">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Zhu</surname> <given-names>Z.</given-names></name> <name><surname>Zheng</surname> <given-names>Y.</given-names></name> <name><surname>Wang</surname> <given-names>X.</given-names></name> <name><surname>Huang</surname> <given-names>D.</given-names></name> <name><surname>Feng</surname> <given-names>L.</given-names></name></person-group> (<year>2025</year>). <article-title>Forecasting china&#x00027;s precious metal futures volatility: Gbrt models and time-model dimension combination of tree shap</article-title>. <source>Int. Rev. Finan. Anal</source>. <volume>104</volume>:<fpage>104249</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.irfa.2025.104249</pub-id></mixed-citation>
</ref>
</ref-list>
<fn-group>
<fn fn-type="custom" custom-type="edited-by" id="fn0001">
<p>Edited by: <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/879062/overview">Ida Claudia Panetta</ext-link>, Sapienza University of Rome, Italy</p>
</fn>
<fn fn-type="custom" custom-type="reviewed-by" id="fn0002">
<p>Reviewed by: <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/3311470/overview">Asma Iqbal</ext-link>, Nawab Shah Alam Khan College of Engineering &#x00026; Technology, India</p>
<p><ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/3320588/overview">Federico Siano</ext-link>, The University of Texas at Dallas, United States</p>
</fn>
</fn-group>
<fn-group>
<fn id="fn0003"><label>1</label><p>The data can be found at: <ext-link ext-link-type="uri" xlink:href="https://sraf.nd.edu/data/stage-one-10-x-parse-data/">https://sraf.nd.edu/data/stage-one-10-x-parse-data/</ext-link>.</p></fn>
<fn id="fn0004"><label>2</label><p>Paragraphs typically consist of 100&#x02013;200 words. Moreover, sentence-transformers, such as AM and FinTextSim, are designed to capture the semantic information of sentences and short paragraphs. Input texts longer than 256 word pieces (approximately 170&#x02013;210 words) are truncated by default. The 250-word threshold ensures that each document includes at least two paragraphs, enhancing relevance, as shorter texts often lack substantive or complete ideas.</p></fn>
<fn id="fn0005"><label>3</label><p>The domains encompass sales, cost, profit/loss, operations, liquidity, investment, financing, litigation, employment, tax/regulation, and accounting.</p></fn>
<fn id="fn0006"><label>4</label><p>The keyword list is available in a GitHub repository at <ext-link ext-link-type="uri" xlink:href="https://github.com/JehnenS/FinTextSim">https://github.com/JehnenS/FinTextSim</ext-link>.</p></fn>
<fn id="fn0007"><label>5</label><p>The stopword lists can be found at: <ext-link ext-link-type="uri" xlink:href="https://sraf.nd.edu/textual-analysis/stopwords/">https://sraf.nd.edu/textual-analysis/stopwords/</ext-link>.</p></fn>
<fn id="fn0008"><label>6</label><p>Detailed topic-level accuracies are reported in the <xref ref-type="supplementary-material" rid="SM1">Supplementary material</xref>.</p></fn>
<fn id="fn0009"><label>7</label><p>The results of the experiment are displayed in the <xref ref-type="supplementary-material" rid="SM1">Supplementary material</xref>.</p></fn>
</fn-group>
</back>
</article>