<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="2.3" xml:lang="EN">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Artif. Intell.</journal-id>
<journal-title>Frontiers in Artificial Intelligence</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Artif. Intell.</abbrev-journal-title>
<issn pub-type="epub">2624-8212</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/frai.2025.1541087</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Artificial Intelligence</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>ML-based validation of experimental randomization in learning games</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Hsieh</surname>
<given-names>Pei-Hsuan</given-names>
</name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x002A;</sup></xref>
<uri xlink:href="https://loop.frontiersin.org/people/978716/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/conceptualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/formal-analysis/"/>
<role content-type="https://credit.niso.org/contributor-roles/methodology/"/>
<role content-type="https://credit.niso.org/contributor-roles/validation/"/>
<role content-type="https://credit.niso.org/contributor-roles/visualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Department of Computer Science, College of Informatics, National Chengchi University</institution>, <addr-line>Taipei</addr-line>, <country>Taiwan</country></aff>
<aff id="aff2"><sup>2</sup><institution>College of Education, National Chengchi University</institution>, <addr-line>Taipei</addr-line>, <country>Taiwan</country></aff>
<author-notes>
<fn fn-type="edited-by" id="fn0001">
<p>Edited by: <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1414052/overview">Kaixiang Yang</ext-link>, South China University of Technology, China</p>
</fn>
<fn fn-type="edited-by" id="fn0002">
<p>Reviewed by: <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/2742234/overview">Pradeep Paraman</ext-link>, SEGi University, Malaysia</p>
<p><ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/2919509/overview">Syeda Shumaila</ext-link>, Shenzhen Technology University, China</p>
</fn>
<corresp id="c001">&#x002A;Correspondence: Pei-Hsuan Hsieh, <email>hsiehph@nccu.edu.tw</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>30</day>
<month>10</month>
<year>2025</year>
</pub-date>
<pub-date pub-type="collection">
<year>2025</year>
</pub-date>
<volume>8</volume>
<elocation-id>1541087</elocation-id>
<history>
<date date-type="received">
<day>07</day>
<month>12</month>
<year>2024</year>
</date>
<date date-type="accepted">
<day>15</day>
<month>10</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x00A9; 2025 Hsieh.</copyright-statement>
<copyright-year>2025</copyright-year>
<copyright-holder>Hsieh</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>Randomization is a standard method in experimental research, yet its validity is not always guaranteed. This study introduces machine learning (ML) models as supplementary tools for validating participant randomization. A learning direction game with dichotomized scenarios was introduced, and both supervised and unsupervised ML models were evaluated on a binary classification task. Supervised models (logistic regression, decision tree, and support vector machine) achieved the highest accuracy of 87% after adding synthetic data to enlarge the sample size, while unsupervised models (k-means, k-nearest neighbors, and artificial neural networks [ANN]) performed less effectively. The ANN model, in particular, showed overfitting, even with synthetic data. Feature importance analysis further revealed predictors of assignment bias. These findings support the proposed methodology for detecting randomization patterns; however, its effectiveness is influenced by sample size and experimental design complexity. Future studies should apply this approach with caution and further examine its applicability across diverse experimental designs.</p>
</abstract>
<kwd-group>
<kwd>randomization</kwd>
<kwd>experimental design</kwd>
<kwd>sample assignment</kwd>
<kwd>scenarios</kwd>
<kwd>machine learning (ML) model</kwd>
<kwd>classification performance</kwd>
<kwd>learning game</kwd>
</kwd-group>
<counts>
<fig-count count="3"/>
<table-count count="3"/>
<equation-count count="0"/>
<ref-count count="60"/>
<page-count count="12"/>
<word-count count="9337"/>
</counts>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Machine Learning and Artificial Intelligence</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="sec1">
<title>Introduction</title>
<p>When conducting experimental research, randomization of research participants is the most commonly used method, typically followed by a series of comparisons between at least two different groups of participants (<xref ref-type="bibr" rid="ref8">Campbell and Stanley, 2015</xref>). A two-group comparison can be referred to as dichotomization (<xref ref-type="bibr" rid="ref17">DeCoster et al., 2009</xref>; <xref ref-type="bibr" rid="ref34">MacCallum et al., 2002</xref>). Randomization can be based on participants&#x2019; demographic differences, such as gender (e.g., <xref ref-type="bibr" rid="ref13">Chung and Chang, 2017</xref>) or game experience levels (e.g., <xref ref-type="bibr" rid="ref42">Procci et al., 2013</xref>). From the perspective of game design in educational contexts, students&#x2019; learning performance can also be compared after assigning them to play different types of games (e.g., two- versus three-dimensional; <xref ref-type="bibr" rid="ref51">Yilmaz and Cagiltay, 2016</xref>). A fair comparison in such a dichotomized setting can be expected if participants are appropriately assigned to one of the two game types in the experiment.</p>
<p>Machine learning (ML) models are broadly categorized into supervised (e.g., logistic regression, support vector machines) and unsupervised approaches (e.g., k-means clustering, artificial neural networks; <xref ref-type="bibr" rid="ref48">Verma et al., 2022</xref>). Both types are particularly valuable for studying game behavior in educational settings. For example, ML has been applied to enhance the analysis of game-based learning data and to highlight the value of game behaviors (e.g., <xref ref-type="bibr" rid="ref27">J&#x00E4;rvel&#x00E4; et al., 2020</xref>; <xref ref-type="bibr" rid="ref33">Luan and Tsai, 2021</xref>). In experimental research, ML can serve as a novel approach to detecting randomization flaws, complementing conventional balance tests. While statistical methods such as the <italic>t</italic>-test and chi-square test are commonly used to examine the validity of randomization, they may be limited in addressing fluid, high-dimensional, and nonlinear relationships among predictive factors (<xref ref-type="bibr" rid="ref12">Choubineh et al., 2023</xref>). ML, by contrast, enables the detection of complex patterns across all data points in an experimental study (<xref ref-type="bibr" rid="ref48">Verma et al., 2022</xref>). This research preliminarily explores the capability of ML models to classify well-randomized, dichotomized sample assignments. Successful classification is assumed to provide evidence of the validity of these assignments within the experimental design.</p>
<p>As researchers claim that their sample assignments are well-randomized, the classification performance of ML, measured by accuracy rate and influenced by implementation settings, is expected to surpass a satisfactory threshold (e.g., 60% or higher; <xref ref-type="bibr" rid="ref52">Zhang et al., 2017</xref>). In a fully randomized experiment, ML models can evaluate the effectiveness of dichotomization in sample randomization. A high accuracy rate in classifying samples within a dichotomized experimental design reinforces the validity and reliability of claims regarding the random assignment process. In addition, verification through ML not only supports researchers&#x2019; claims but also enhances their academic credibility by demonstrating robust sample randomization in experimental studies.</p>
<p>This study raises the research question: Can ML models serve as a methodological validation tool to enhance researchers&#x2019; accountability in claiming proper randomization in experiments? It is hypothesized that supervised ML models will outperform unsupervised ones in classifying sample assignments, while feature importance analysis from unsupervised ML models will reveal key predictors of assignment bias. Both approaches can provide insights into the effectiveness of small sample sizes and within-group experimental designs. Unlike prior studies focusing on game analytics, the novelty of this study lies in validating the capability of both supervised and unsupervised ML models to examine randomization in experimental assignments.</p>
<p>The following sections begin with a review of psychology research literature on randomization in sample assignment, followed by a discussion of the classification capabilities of ML models. To address the proposed research question, a learning game featuring two distinct scenarios was introduced. A detailed description of the recruitment plan and methodology, including randomization procedures, data collection, implementation of various ML models, and analysis methods, is provided. The results of employing these ML models are presented and thoroughly discussed. Finally, the capability of ML models to enhance research validity and reliability in experimental studies, as well as to analyze various behaviors in learning games, is explored.</p>
<sec id="sec2">
<title>Related work</title>
<sec id="sec3">
<title>Randomization approaches in experimental design</title>
<p>Randomization is widely regarded as one of the most effective solutions for experimental design in psychological research (<xref ref-type="bibr" rid="ref8">Campbell and Stanley, 2015</xref>; <xref ref-type="bibr" rid="ref40">Pirlott and MacKinnon, 2016</xref>). Randomization can effectively neutralize participant characteristics (e.g., age and gender) and ensure a representative selection of participants from the research context (e.g., <xref ref-type="bibr" rid="ref13">Chung and Chang, 2017</xref>; <xref ref-type="bibr" rid="ref39">Pan and Ke, 2023</xref>; <xref ref-type="bibr" rid="ref42">Procci et al., 2013</xref>; <xref ref-type="bibr" rid="ref44">Puolakanaho et al., 2020</xref>). Given the limitations of the research context, incorporating a control group becomes essential to rule out confounding factors (e.g., individual differences in personality and prior knowledge) within the experimental group, a design known as a randomized controlled design or trial (e.g., <xref ref-type="bibr" rid="ref44">Puolakanaho et al., 2020</xref>). A within-subject design can also be used to address confounding factors among research participants (<xref ref-type="bibr" rid="ref14">Cohen and Staub, 2015</xref>; <xref ref-type="bibr" rid="ref6">Brauer and Curtin, 2018</xref>; <xref ref-type="bibr" rid="ref36">Montoya, 2023</xref>). In this approach, participants&#x2019; characteristics can later serve as control variables during the data analysis stage to enhance the robustness of the findings (<xref ref-type="bibr" rid="ref8">Campbell and Stanley, 2015</xref>; <xref ref-type="bibr" rid="ref4">Bernerth and Aguinis, 2016</xref>).</p>
<p>In addition, sampling designs that consider population characteristics are an effective approach to preventing prominent bias and confounding effects in the study context (<xref ref-type="bibr" rid="ref30">Levy and Lemeshow, 2013</xref>; <xref ref-type="bibr" rid="ref32">Lohr, 2021</xref>). This research method is crucial and widely used in large-scale educational studies, such as the Program for International Student Assessment (PISA; <xref ref-type="bibr" rid="ref38">Organization for Economic Co-operation and Development, 2014</xref>) and the Trends in International Mathematics and Science Study (TIMSS; <xref ref-type="bibr" rid="ref503">Martin et al., 2016</xref>). Researchers who aim to inform educational, political, or healthcare policy can use the sampling weights provided by the experts of these large-scale studies to generate findings that generalize to the broader population (<xref ref-type="bibr" rid="ref2">Ar&#x0131;kan et al., 2020</xref>; <xref ref-type="bibr" rid="ref19">Ertl et al., 2020</xref>; <xref ref-type="bibr" rid="ref29">Laukaityte and Wiberg, 2018</xref>; <xref ref-type="bibr" rid="ref35">Meinck, 2015</xref>; <xref ref-type="bibr" rid="ref45">Rust, 2014</xref>). Practitioner researchers can also model the sampling design using various statistical analyses (e.g., multilevel analysis) to identify the impact of the factors considered in the sampling design (<xref ref-type="bibr" rid="ref11">Chiu et al., 2022</xref>; <xref ref-type="bibr" rid="ref29">Laukaityte and Wiberg, 2018</xref>).</p>
<p>Independent researchers in some fields (e.g., education, neurology, psychology) often use small sample sizes to examine their theories (<xref ref-type="bibr" rid="ref610">Francis et al., 2010</xref>; <xref ref-type="bibr" rid="ref502">Lakens, 2022</xref>; <xref ref-type="bibr" rid="ref530">Vasileiou et al., 2018</xref>). For such studies, the within-subject design is typically the most appropriate and convenient choice, offering higher statistical power (<xref ref-type="bibr" rid="ref502">Lakens, 2022</xref>) despite potential carryover effects (e.g., fatigue and order effects; <xref ref-type="bibr" rid="ref36">Montoya, 2023</xref>). A limitation is that researchers may struggle to determine whether this design is more appropriate than a randomized design.</p>
</sec>
<sec id="sec4">
<title>Classification performance of machine learning (ML) models</title>
<p>Machine learning models are increasingly used to analyze large datasets drawn from online learning platforms, such as ASSISTment (<xref ref-type="bibr" rid="ref20">Feng and Heffernan, 2007</xref>), and from large-scale assessments, such as PISA (<xref ref-type="bibr" rid="ref38">Organization for Economic Co-operation and Development, 2014</xref>) and TIMSS (<xref ref-type="bibr" rid="ref503">Martin et al., 2016</xref>). Learning analytics and educational data mining are the two general terms for research that uses machine learning (ML) or artificial intelligence (AI) models to advance goals in education or psychology, such as personalized learning or adaptive instruction (<xref ref-type="bibr" rid="ref22">Heffernan and Heffernan, 2014</xref>; <xref ref-type="bibr" rid="ref28">Khan and Ghosh, 2021</xref>; <xref ref-type="bibr" rid="ref54">Zotou et al., 2020</xref>). For example, affective computing can detect students&#x2019; affective states while they engage in learning or problem-solving on a computer (<xref ref-type="bibr" rid="ref3">Baker et al., 2010</xref>; <xref ref-type="bibr" rid="ref27">J&#x00E4;rvel&#x00E4; et al., 2020</xref>).</p>
<p>ML models fall into two main categories: supervised and unsupervised learning (<xref ref-type="bibr" rid="ref48">Verma et al., 2022</xref>). Supervised learning relies on labeled data. Dichotomous data (e.g., correct vs. incorrect answers or gameplay sequences in this study) are a basic and widely used labeling criterion in ML (<xref ref-type="bibr" rid="ref21">Goretzko and B&#x00FC;hner, 2022</xref>; <xref ref-type="bibr" rid="ref50">Yeung and Fernandes, 2022</xref>). For the supervised learning models in this study, the dataset was split into training and testing parts at a ratio of 80/20 (<xref ref-type="bibr" rid="ref5">Bichri et al., 2024</xref>). With human-verified criteria, supervised learning trains models effectively by using labeled data that match the intended practical use (<xref ref-type="bibr" rid="ref1">Alloghani et al., 2019</xref>; <xref ref-type="bibr" rid="ref48">Verma et al., 2022</xref>). In other words, supervised learning increases ecological validity. In contrast, unsupervised learning aims to identify patterns in the data points without any ground truth (i.e., labeled data) (<xref ref-type="bibr" rid="ref1">Alloghani et al., 2019</xref>). Although it is more exploratory and data-driven (not based on criteria or theories), unsupervised learning can still be useful in model training as long as patterns can be found in unlabeled data. Thus, unsupervised learning models are used to identify unknown categories, similar to exploratory factor analysis (EFA) in psychology and education (<xref ref-type="bibr" rid="ref15">Cox et al., 2020</xref>). The main difference between supervised and unsupervised learning models is whether labeled data are used during the training process (<xref ref-type="bibr" rid="ref48">Verma et al., 2022</xref>). This distinction allows for the potential evaluation of the effectiveness of a within-subject experimental design, which typically involves small sample sizes and only a few factors (e.g., gender), in a manner similar to that of a randomized design. Note that the results obtained by unsupervised learning usually need further validation (<xref ref-type="bibr" rid="ref49">Xie et al., 2020</xref>; <xref ref-type="bibr" rid="ref53">Zimmermann, 2020</xref>), just as confirmatory factor analysis is used to validate the results obtained by EFA.</p>
</sec>
<sec id="sec5">
<title>Theoretical foundations for employing ML models in randomization validation</title>
<p>There are several ways to check whether samples have been randomly assigned to groups in an experiment (<xref ref-type="bibr" rid="ref7">Bruhn and McKenzie, 2009</xref>; <xref ref-type="bibr" rid="ref8">Campbell and Stanley, 2015</xref>). Common approaches include examining descriptive statistics (e.g., mean, standard deviation, distribution shape via histograms or boxplots), running statistical tests (e.g., <italic>t</italic>-test, ANOVA, chi-square test), and creating a balance check table to list all relevant pre-treatment covariates, group means, <italic>p</italic>-values from statistical tests, and mean differences. However, traditional statistical tests are not always sufficient when sample characteristics are fluid and unstable over time (e.g., emotion, political inclination), when demographics involve high-dimensional data (e.g., brain neuron connectivity), or when unpredictable circumstances introduce nonlinear relationships during the experiment (e.g., bad weather, participant tardiness). While such complexities may be difficult for researchers to detect, various ML models can help uncover these patterns and provide supporting evidence (e.g., <xref ref-type="bibr" rid="ref12">Choubineh et al., 2023</xref>). Researchers are encouraged to make effective use of ML techniques to generate classification results. These results can then be used to assess whether standardized randomization has reached an acceptable threshold (e.g., 60% or higher; <xref ref-type="bibr" rid="ref52">Zhang et al., 2017</xref>).</p>
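<p>As an illustration of the conventional balance check described above, the following minimal Python sketch computes group means, the mean difference, and a Welch <italic>t</italic> statistic for each pre-treatment covariate. The function name, the numeric coding of gameplay frequency (0 = non-gamer to 3 = frequent player), and the assumed three-to-three gender split per group are hypothetical choices made for this example; they are not taken from the analysis code of this study.</p>
<preformat>
```python
from math import sqrt
from statistics import mean, variance


def balance_row(name, group_a, group_b):
    """One row of a balance table: group means, difference, Welch t statistic."""
    diff = mean(group_a) - mean(group_b)
    # Standard error under unequal variances (Welch)
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    t_stat = diff / se if se > 0 else float("nan")
    return {
        "covariate": name,
        "mean_a": mean(group_a),
        "mean_b": mean(group_b),
        "diff": diff,
        "welch_t": t_stat,
    }


# Assumed coding: 0 = non-gamer, 1 = seldom, 2 = occasional, 3 = frequent.
# Each group holds two non-gamers, one seldom, one occasional, two frequent players.
frequency_2d3d = [0, 0, 1, 2, 3, 3]
frequency_3d2d = [0, 0, 1, 2, 3, 3]

# Assumed coding: 0 = female, 1 = male, with an assumed 3/3 split in each group.
gender_2d3d = [0, 0, 0, 1, 1, 1]
gender_3d2d = [0, 0, 0, 1, 1, 1]

balance_table = [
    balance_row("gameplay_frequency", frequency_2d3d, frequency_3d2d),
    balance_row("gender", gender_2d3d, gender_3d2d),
]
```
</preformat>
<p>Under these assumptions both covariates are perfectly balanced, so every mean difference is zero. A full balance table would also report <italic>p</italic>-values, which require a <italic>t</italic> distribution function that this standard-library sketch omits.</p>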
<p>In this exploratory research, ML models are employed to detect patterns or predictive relationships among data points collected from an experiment designed to undergo proper randomization. ML classifiers are expected to perform beyond chance levels (e.g., ~50% accuracy in a binary classification task). Grounded in mathematical principles, both classification and clustering algorithms can be used to evaluate whether the claim of dichotomized, randomly assigned samples is valid. If randomization is achieved, the performance of supervised ML models on classification tasks is expected to reach a satisfactory level, defined here as standardized ML performance metrics meeting a benchmark accuracy rate of at least 60%, with measures such as accuracy, precision, recall, and F1-score used to assess performance (<xref ref-type="bibr" rid="ref52">Zhang et al., 2017</xref>). It should be noted that AUC-ROC and support were not included in this study, as the focus is on validating whether randomization was achieved, rather than on ranking quality (as in AUC-ROC) or the number of supporting cases in a binary classification (<xref ref-type="bibr" rid="ref48">Verma et al., 2022</xref>).</p>
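<p>The four performance measures named above can be computed directly from the confusion counts of a binary classification. The Python sketch below is a minimal illustration rather than the implementation used in this study. The function name and the example labels are hypothetical, and treating 2D3D as the positive class is an arbitrary assumption.</p>
<preformat>
```python
def binary_metrics(y_true, y_pred, positive="2D3D"):
    """Return accuracy, precision, recall and F1-score as percentages."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator > 0 else 0.0

    values = {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
    return {name: round(100 * value, 1) for name, value in values.items()}


THRESHOLD = 60.0  # benchmark accuracy rate (%) adopted in the text

# Hypothetical labels for illustration only; these are not the study's data.
y_true = ["2D3D", "2D3D", "2D3D", "3D2D", "3D2D", "3D2D"]
y_pred = ["2D3D", "2D3D", "3D2D", "3D2D", "3D2D", "2D3D"]

scores = binary_metrics(y_true, y_pred)
meets_threshold = scores["accuracy"] >= THRESHOLD
```
</preformat>
<p>In this example four of six cases are classified correctly, so all four measures equal 66.7% and the 60% benchmark is met.</p>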
<p>Supervised ML models are expected to perform effectively in classifying participants into their assigned scenarios. Unsupervised ML models, by contrast, may identify patterns through clustering that appear to improve classification performance but risk overlooking the validity of the underlying claim. In other words, assignment bias may be detected through unsupervised ML models. Large differences in feature values across cluster centroids can indicate that specific features (e.g., demographic attributes) are driving group separation. Conversely, if substantial centroid differences emerge in unexpected features (e.g., attributes unrelated to participants), this may provide evidence of systematic assignment bias.</p>
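<p>The centroid comparison described above can be sketched as follows. The example assumes that cluster labels have already been produced by a clustering algorithm such as k-means, and it ranks features by the absolute difference between the two cluster centroids. The function name, feature names, and data values are hypothetical and serve only to illustrate the idea.</p>
<preformat>
```python
from statistics import mean


def centroid_gaps(rows, clusters, feature_names):
    """Rank features by the absolute gap between the two cluster centroids."""
    groups = sorted(set(clusters))
    if len(groups) != 2:
        raise ValueError("expected exactly two clusters")

    centroids = []
    for group in groups:
        members = [row for row, c in zip(rows, clusters) if c == group]
        # Column-wise mean of the members gives the centroid of this cluster
        centroids.append([mean(column) for column in zip(*members)])

    gaps = {
        name: abs(a - b)
        for name, a, b in zip(feature_names, centroids[0], centroids[1])
    }
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


# Hypothetical normalized features and cluster labels (illustration only)
feature_names = ["gender", "gameplay_frequency", "score_2d"]
rows = [
    [0, 0.0, 0.9],
    [1, 0.5, 0.8],
    [0, 1.0, 0.3],
    [1, 0.5, 0.2],
]
clusters = [0, 0, 1, 1]

ranked = centroid_gaps(rows, clusters, feature_names)
```
</preformat>
<p>Here the score feature separates the clusters most strongly and gender not at all. A large gap on a balanced demographic attribute would instead be the kind of signal the text associates with assignment bias.</p>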
</sec>
</sec>
<sec id="sec6">
<title>Methodologies</title>
<p>This exploratory study implements a fully randomized experiment with two scenarios. Two hypotheses are proposed, drawing on ML classification and clustering algorithms grounded in mathematical principles. Supervised ML is applied to labeled data (e.g., group assignments), enabling the model to map features to known labels. By contrast, unsupervised ML is employed to uncover latent structures or groupings in unlabeled data (e.g., clustering) (<xref ref-type="fig" rid="fig1">Figure 1</xref>). It is hypothesized that supervised ML will perform effectively under randomized conditions, and that clustering within unsupervised ML may yield even higher apparent classification performance (H1).</p>
<fig position="float" id="fig1">
<label>Figure 1</label>
<caption>
<p>Research model.</p>
</caption>
<graphic xlink:href="frai-08-1541087-g001.tif" mimetype="image" mime-subtype="tiff">
<alt-text content-type="machine-generated">Flowchart illustrating relationships between components of machine learning classification. The center box reads "Supervised Machine Learning Models for Classification (MLMC)" with arrows labeled "H1" and "H2" pointing to "Performance of Binary Classification" and "Unsupervised MLMC Feature Importance Analysis," respectively. Another arrow connects "Valid Claim of Randomization" to "Performance of Binary Classification."</alt-text>
</graphic>
</fig>
<p>In unsupervised ML, large differences in feature values across cluster centroids may indicate that a given feature (e.g., a demographic attribute of participants) contributes significantly to group separation. Conversely, if such differences emerge in features where no separation is expected (e.g., attributes unrelated to participants), this may signal assignment bias (H2). By employing both classification and clustering algorithms, the proposed hypotheses aim to evaluate model performance under conditions of valid randomization.</p>
<disp-quote>
<p><italic>H1</italic>: Supervised ML models achieve higher accuracy than unsupervised ones in detecting randomization flaws.</p>
</disp-quote>
<disp-quote>
<p><italic>H2</italic>: Feature importance analysis reveals key predictors of assignment bias.</p>
</disp-quote>
</sec>
<sec id="sec7">
<title>An experiment of learning direction game</title>
<p>This study employed <italic>Utility</italic> to develop a learning game with two distinct difficulty levels, as two scenarios used to test learning outcomes in the implementation of a fully randomized experiment. In the first scenario, participants used a 2D interface to guide a virtual agent through eight mazes, sequentially locating eight treasures (<xref ref-type="fig" rid="fig2">Figure 2a</xref>). In the second scenario, participants used a 3D interface to assist a different virtual agent in finding eight treasures by following assigned directions (<xref ref-type="fig" rid="fig2">Figure 2b</xref>). Both scenarios incorporated the same mathematical learning concept, i.e., the NSEW directional system, with a compass continuously displayed on the interface. Participants required some adaptation to orient themselves within the environment before gameplay. To enhance engagement, participants were encouraged to imagine themselves as the virtual agents while playing.</p>
<fig position="float" id="fig2">
<label>Figure 2</label>
<caption>
<p>Learning direction game experimental scenarios designed in: <bold>(a)</bold> 2D and <bold>(b)</bold> 3D interfaces. This game was developed as part of the funded project (NSTC 110-2511-H-004-001-MY3).</p>
</caption>
<graphic xlink:href="frai-08-1541087-g002.tif" mimetype="image" mime-subtype="tiff">
<alt-text content-type="machine-generated">Panel (a) displays a 2D grid-based treasure-hunting game interface featuring navigation instructions in Traditional Chinese, a countdown timer, directional arrow controls, and a &#x201C;dig treasure&#x201D; button. A character stands at the center of a dimly lit maze, with only the surrounding tiles illuminated, indicating limited visibility. Panel (b) presents a first-person immersive perspective in a bright 3D forest environment, where the player faces trees and a campfire. The interface displays two heart icons indicating the remaining chances for the player to find treasures, along with a compass and coordinate tracker for navigation, and interactive buttons for movement, digging, accessing directional instructions, and adjusting game settings.</alt-text>
</graphic>
</fig>
<p>Each scenario included a time constraint: 100&#x202F;s in the 2D interface and 240&#x202F;s in the 3D interface. Participants&#x2019; performance was evaluated based on scores, with bonus points awarded for task completion within the allotted time. The maximum achievable score was 36 points in the 2D interface (eight treasures) and 80 points in the 3D interface (eight treasures).</p>
</sec>
<sec id="sec8">
<title>Participants</title>
<p>An equal number of female and male undergraduates (all aged 20&#x202F;years) were recruited to minimize potential gender effects on learning performance in the game-based context. Participants were recruited from humanities-related disciplines (e.g., linguistics, philosophy, sociology), ensuring that their coursework did not require the use of information technologies. Prior to the experiment, a brief survey was administered to assess participants&#x2019; weekly gameplay frequency. In total, 12 participants (six female, six male) were recruited and randomly assigned to one of the two experimental scenarios. Randomization ensured balance across groups with respect to gameplay frequency: each group included two non-gamers, one seldom player, one occasional player, and two frequent players.</p>
</sec>
<sec id="sec9">
<title>Data collection</title>
<p>All participants were instructed to collect as many treasures as possible in each interface of their assigned treatment. In addition to the total scores obtained at the end of the game, the task completion time for each interface and the time required to collect each treasure were recorded. As summarized in <xref ref-type="table" rid="tab1">Table 1</xref>, a total of 22 data points were collected for each participant across the two interfaces. For subsequent analyses, all variables except gender were normalized using ratio-based transformations to a 0&#x2013;1 scale, ensuring a consistent basis for comparing the performance of different machine learning models.</p>
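<p>One possible reading of the ratio-based transformation is to divide each variable by its maximum attainable value. The Python sketch below assumes this reading and uses the maxima given in the scenario description (36 and 80 points; 100 and 240&#x202F;s). The variable names are hypothetical, gameplay frequency is omitted because its scale is not specified, and the exact transformation applied in this study may differ.</p>
<preformat>
```python
# Maxima taken from the scenario description. Dividing by them is an assumed
# reading of "ratio-based transformation", not the study's own code.
MAXIMA = {"score_2d": 36, "score_3d": 80, "time_2d": 100, "time_3d": 240}


def normalize(record):
    """Scale each listed variable to the 0-1 range; leave other keys untouched."""
    scaled = dict(record)
    for key, maximum in MAXIMA.items():
        if key in scaled:
            ratio = scaled[key] / maximum
            scaled[key] = min(max(ratio, 0.0), 1.0)  # clip to the 0-1 range
    return scaled


# Hypothetical participant record (illustration only)
participant = {"gender": 1, "score_2d": 27, "score_3d": 60, "time_2d": 80, "time_3d": 240}
normalized = normalize(participant)
```
</preformat>
<p>Gender is left unchanged, consistent with the statement that it was the only variable not normalized.</p>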
<table-wrap position="float" id="tab1">
<label>Table 1</label>
<caption>
<p>All data points collected in each interface.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="top">Data types</th>
<th align="left" valign="top">Data points</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">Demographics</td>
<td align="left" valign="top">No. 1: Gender<break/>No. 2: Gameplay frequency</td>
</tr>
<tr>
<td align="left" valign="top">Learning scores</td>
<td align="left" valign="top">No. 3: Total scores obtained in 2D interface<break/>No. 4: Total scores obtained in 3D interface<break/>No. 5 ~ 12: Scores obtained from eight tasks in 2D interface<break/>No. 13 ~ 20: Scores obtained from eight tasks in 3D interface</td>
</tr>
<tr>
<td align="left" valign="top">Gameplay time usage</td>
<td align="left" valign="top">No. 21: Total playing time in seconds in 2D interface<break/>No. 22: Total playing time in seconds in 3D interface</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="sec10">
<title>Experimental procedures</title>
<p>To ensure the validity of randomized sample assignment, an equal number of female and male participants with comparable gameplay frequency levels from humanities-related fields were evenly distributed across two treatment groups. In Treatment 1 (labeled 2D3D), participants played the direction learning game starting with the 2D interface, followed by the 3D interface. In Treatment 2 (labeled 3D2D), the sequence was reversed, with participants starting on the 3D interface and then moving to the 2D interface.</p>
<p>Upon completion of the assigned sequence, data were collected on participants&#x2019; task completion times (speed in finding each treasure) and learning scores. As illustrated in <xref ref-type="fig" rid="fig3">Figure 3</xref>, both supervised and unsupervised ML models were developed using the dichotomized treatment assignments. In the supervised approach, the randomization sequence (2D3D vs. 3D2D) was explicitly provided as labels to facilitate learning. In contrast, unsupervised models attempted to recover the sequence structure directly from the experimental data. Binary classification performance was then evaluated to assess the extent to which ML models could validate the randomized experimental design.</p>
<fig position="float" id="fig3">
<label>Figure 3</label>
<caption>
<p>Experimental procedures.</p>
</caption>
<graphic xlink:href="frai-08-1541087-g003.tif" mimetype="image" mime-subtype="tiff">
<alt-text content-type="machine-generated">Flowchart illustrating an experimental process beginning with experimental design scenarios in a learning direction game, leading to the recruitment plan for samples. This connects through valid claims of randomization to experimental outcomes like learning scores. The chart also shows supervised and unsupervised machine learning models for classification influencing the outcomes.</alt-text>
</graphic>
</fig>
<p>To ensure consistency in comparison, the treatment assignment of each participant (i.e., group allocation by gender and gameplay frequency; see No. 1&#x202F;~&#x202F;2 in <xref ref-type="table" rid="tab1">Table 1</xref>) was incorporated into the ML process. Performance scores (No. 3&#x202F;~&#x202F;20 in <xref ref-type="table" rid="tab1">Table 1</xref>) and task completion times (No. 21&#x202F;~&#x202F;22 in <xref ref-type="table" rid="tab1">Table 1</xref>) were split into training and test sets (80/20) for supervised model development, whereas in unsupervised model development these were treated as features. Because participants were randomly assigned according to the recruitment plan, which balanced gender and gameplay frequency across groups, the models could learn from both the balance of the group assignment and the performance-related data points. To increase the effective sample size, synthetic data were generated by introducing random noise into existing data points.</p>
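<p>The noise-based augmentation described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure actually used: the Gaussian noise model, the noise scale of 0.05, the number of synthetic copies per participant, the toy feature values, and the helper name <italic>augment_with_noise</italic> are all hypothetical, as these details are not reported.</p>
<preformat><![CDATA[
```python
import random


def augment_with_noise(rows, labels, copies=2, scale=0.05, seed=0):
    """Return the original rows plus `copies` jittered versions of each row.

    Hypothetical helper: Gaussian noise with standard deviation `scale` is
    added to every (already normalized) feature; labels are copied unchanged.
    """
    rng = random.Random(seed)
    new_rows, new_labels = list(rows), list(labels)
    for row, label in zip(rows, labels):
        for _ in range(copies):
            new_rows.append([x + rng.gauss(0.0, scale) for x in row])
            new_labels.append(label)
    return new_rows, new_labels


# Toy normalized features for four participants (illustrative values only).
rows = [[0.10, 0.90], [0.20, 0.80], [0.70, 0.30], [0.90, 0.10]]
labels = ["2D3D", "2D3D", "3D2D", "3D2D"]

aug_rows, aug_labels = augment_with_noise(rows, labels)
print(len(aug_rows))  # 4 originals + 4 * 2 synthetic rows = 12
```
]]></preformat>
<p>In a sketch of this kind, noise would normally be added only to the training portion after the split, because jittered copies of a test sample that end up in the training set can inflate test accuracy.</p>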
</sec>
<sec id="sec11">
<title>Data analyses</title>
<p>According to <xref ref-type="table" rid="tab1">Table 1</xref>, the participants&#x2019; demographics were clearly reported in the recruitment plan. The analyses first compared the participants&#x2019; learning scores and the time spent playing the game between the two groups. Descriptive analyses were performed using the original scores, providing the mean and standard deviation. An independent samples <italic>t</italic>-test was then conducted to test whether learning scores differed significantly between the two groups. All data points, except for gender, were normalized for later analyses.</p>
<p>Supervised and unsupervised machine learning (ML) models were used to test the proposed hypotheses. These two types of models differ based on the presence of labels in the dataset (<xref ref-type="bibr" rid="ref48">Verma et al., 2022</xref>). Hypothesis testing referred to the ML performance on a binary classification task (<xref ref-type="bibr" rid="ref5">Bichri et al., 2024</xref>). Model performance evaluation was based on standard classification metrics, including accuracy, precision, recall, and F1-score (<xref ref-type="bibr" rid="ref33">Luan and Tsai, 2021</xref>; <xref ref-type="bibr" rid="ref52">Zhang et al., 2017</xref>). All metrics were expressed in percentage format to enable direct comparison of classification performance across models. Based on the literature, satisfactory learning performance in the classification task was defined as achieving metrics that exceeded a passing threshold of 60% (<xref ref-type="bibr" rid="ref5">Bichri et al., 2024</xref>; <xref ref-type="bibr" rid="ref52">Zhang et al., 2017</xref>), even though higher thresholds may be needed for robust validation. Notably, this study used Jupyter Notebook 6.5.4 and Python 3 to run all models. All models were expected to perform well by accurately classifying the samples into two groups (i.e., 2D3D vs. 3D2D), based on the data points listed in <xref ref-type="table" rid="tab1">Table 1</xref>.</p>
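<p>The four classification metrics can be made concrete with a short worked example. The sketch below assumes support-weighted averaging across the two classes and a three-sample test set (20% of 12 participants, rounded up); both are assumptions, since the averaging mode and exact test-set size are not stated, and the label vectors are hypothetical. The function name <italic>weighted_metrics</italic> is likewise illustrative.</p>
<preformat><![CDATA[
```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for cls, count in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        predicted = sum(p == cls for p in y_pred)
        prec = tp / predicted if predicted else 0.0
        rec = tp / count
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        weight = count / n  # each class is weighted by its share of y_true
        precision += weight * prec
        recall += weight * rec
        f1 += weight * f
    return accuracy, precision, recall, f1


# Hypothetical three-sample test set with one misclassified 2D3D sample.
y_true = ["2D3D", "2D3D", "3D2D"]
y_pred = ["2D3D", "3D2D", "3D2D"]
print([round(m, 2) for m in weighted_metrics(y_true, y_pred)])
# [0.67, 0.83, 0.67, 0.67]

# Predicting the majority class for every sample gives the same accuracy
# but lower precision and F1.
print([round(m, 2) for m in weighted_metrics(y_true, ["2D3D"] * 3)])
# [0.67, 0.44, 0.67, 0.53]
```
]]></preformat>
<p>With so few test samples, only a handful of metric values are attainable, so identical accuracy figures can arise from quite different prediction patterns.</p>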
<p>Commonly used ML models were selected to test the hypotheses (<xref ref-type="bibr" rid="ref1">Alloghani et al., 2019</xref>; <xref ref-type="bibr" rid="ref33">Luan and Tsai, 2021</xref>; <xref ref-type="bibr" rid="ref48">Verma et al., 2022</xref>). The representative supervised learning models employed in this study were logistic regression, decision tree, and support vector machine (SVM). The supervised learning models were trained and tested using an 80/20 split ratio (<xref ref-type="bibr" rid="ref5">Bichri et al., 2024</xref>). Unlike logistic regression, the decision tree and SVM are considered non-parametric algorithms. Logistic regression is a statistical linear model in which a coefficient is estimated for each variable. Decision trees recursively partition the feature space using a tree-like structure, consisting of root, parent, child, and leaf nodes; pruning may be applied when necessary to avoid overfitting (<xref ref-type="bibr" rid="ref1">Alloghani et al., 2019</xref>). In contrast, SVM constructs optimal separating hyperplanes to maximize the margin between classes without assuming a specific data distribution (<xref ref-type="bibr" rid="ref41">Prastyo et al., 2020</xref>). By incorporating kernel functions, SVM can map data that are not linearly separable into a higher-dimensional feature space, enabling the construction of a linear separation boundary and potentially improving performance compared to the basic linear SVM model. In addition, model performance varied depending on the values of <italic>random_state</italic> and other parameter settings (e.g., depth of tree, kernel type), reflecting the sensitivity of decision trees and SVM to initialization and hyperparameter choices.</p>
<p>The representative unsupervised learning models used in this study were k-means, k-nearest neighbor (KNN), and artificial neural networks (ANN). K-means is a centroid-based algorithm (<xref ref-type="bibr" rid="ref1">Alloghani et al., 2019</xref>). The number of clusters can be specified and then varied to determine which setting achieves the highest accuracy rate. The value of each cluster center can also be obtained; it is a mean score and was interpreted as reflecting the importance of the corresponding data point. A random seed can be provided to ensure that the output is stable across runs of the learning process. KNN is also a non-parametric algorithm, meaning that it does not involve explicit model learning or make assumptions about the data distribution (<xref ref-type="bibr" rid="ref33">Luan and Tsai, 2021</xref>). It operates as a supervised classification model. In the unsupervised learning phase, the only step is to assign the number of neighbors for each sample in the training dataset. However, when the data are labeled, the model can make accurate classification decisions and perform well on classification tasks. Finally, ANN can be used in both supervised and unsupervised learning contexts. This model extracts statistical properties or data features from the training dataset to inform learning. There are various approaches to running an ANN in an unsupervised manner. For example, an autoencoder can be used to help the model learn from vectorized data points (<xref ref-type="bibr" rid="ref46">Song et al., 2013</xref>). A clustering algorithm can also be applied to the hidden layers of the network to identify patterns and groupings within the data.</p>
</sec>
</sec>
<sec sec-type="results" id="sec12">
<title>Results</title>
<p>Participants&#x2019; raw gameplay scores and completion times were first analyzed to evaluate their performance. Various supervised and unsupervised machine learning (ML) models were then trained using normalized data points and gender information collected from the randomized sample assignment in the experiment. Finally, the proposed hypotheses were tested based on the classification performance of each model.</p>
<sec id="sec13">
<title>Learning scores and time usage in gameplay experiment</title>
<p>The 2D3D group achieved higher scores than the 3D2D group when playing the learning direction game on both the 2D and 3D interfaces (<xref ref-type="table" rid="tab2">Table 2</xref>). However, the differences in scores between the two groups were not statistically significant (2D: <italic>t</italic>&#x202F;=&#x202F;0.870, <italic>p</italic>&#x202F;=&#x202F;0.413; 3D: <italic>t</italic>&#x202F;=&#x202F;0.842, <italic>p</italic>&#x202F;=&#x202F;0.612). In addition, the 2D3D group spent less time playing on the 2D and 3D interfaces than the 3D2D group. These between-group differences may help the ML models classify samples accurately in the subsequent model development and learning processes.</p>
<table-wrap position="float" id="tab2">
<label>Table 2</label>
<caption>
<p>Differences between two treatments in learning performance and time use.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="middle">Participation number</th>
<th align="left" valign="middle">Biological<break/>gender</th>
<th align="center" valign="middle">Game frequency&#x002A;</th>
<th align="center" valign="top">2D Gameplay time usage (100&#x202F;s)</th>
<th align="center" valign="middle">2D Learning scores (36/36)</th>
<th align="center" valign="top">3D Gameplay time usage (240&#x202F;s)</th>
<th align="center" valign="top">3D Learning scores (80/80)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="middle" colspan="7">2D3D Group (average scores in 2D: 19.67/36, s.d. = 13.09; average scores in 3D: 53.33/80, s.d. = 35.02)</td>
</tr>
<tr>
<td align="left" valign="bottom">001</td>
<td align="left" valign="bottom">Female</td>
<td align="center" valign="bottom">1</td>
<td align="center" valign="bottom">100</td>
<td align="center" valign="bottom">10</td>
<td align="center" valign="bottom">240</td>
<td align="center" valign="bottom">60</td>
</tr>
<tr>
<td align="left" valign="bottom">002</td>
<td align="left" valign="bottom">Female</td>
<td align="center" valign="bottom">3</td>
<td align="center" valign="bottom">100</td>
<td align="center" valign="bottom">15</td>
<td align="center" valign="bottom">240</td>
<td align="center" valign="bottom">0</td>
</tr>
<tr>
<td align="left" valign="bottom">003</td>
<td align="left" valign="bottom">Female</td>
<td align="center" valign="bottom">2</td>
<td align="center" valign="bottom">100</td>
<td align="center" valign="bottom">6</td>
<td align="center" valign="bottom">191</td>
<td align="center" valign="bottom">80</td>
</tr>
<tr>
<td align="left" valign="bottom">007</td>
<td align="left" valign="bottom">Male</td>
<td align="center" valign="bottom">1</td>
<td align="center" valign="bottom">100</td>
<td align="center" valign="bottom">15</td>
<td align="center" valign="bottom">230</td>
<td align="center" valign="bottom">80</td>
</tr>
<tr>
<td align="left" valign="bottom">008</td>
<td align="left" valign="bottom">Male</td>
<td align="center" valign="bottom">4</td>
<td align="center" valign="bottom">74</td>
<td align="center" valign="bottom">36</td>
<td align="center" valign="bottom">221</td>
<td align="center" valign="bottom">80</td>
</tr>
<tr>
<td align="left" valign="bottom">009</td>
<td align="left" valign="bottom">Male</td>
<td align="center" valign="bottom">4</td>
<td align="center" valign="bottom">96</td>
<td align="center" valign="bottom">36</td>
<td align="center" valign="bottom">240</td>
<td align="center" valign="bottom">20</td>
</tr>
<tr>
<td align="left" valign="bottom" colspan="7">3D2D Group (average scores in 2D: 13.86/36, s.d.&#x202F;=&#x202F;11.54; average scores in 3D: 50.00/80, s.d.&#x202F;=&#x202F;23.80)</td>
</tr>
<tr>
<td align="left" valign="bottom">004</td>
<td align="left" valign="bottom">Female</td>
<td align="center" valign="bottom">4</td>
<td align="center" valign="bottom">100</td>
<td align="center" valign="bottom">0</td>
<td align="center" valign="bottom">216</td>
<td align="center" valign="bottom">80</td>
</tr>
<tr>
<td align="left" valign="bottom">005</td>
<td align="left" valign="bottom">Female</td>
<td align="center" valign="bottom">1</td>
<td align="center" valign="bottom">100</td>
<td align="center" valign="bottom">28</td>
<td align="center" valign="bottom">240</td>
<td align="center" valign="bottom">30</td>
</tr>
<tr>
<td align="left" valign="bottom">006</td>
<td align="left" valign="bottom">Female</td>
<td align="center" valign="bottom">2</td>
<td align="center" valign="bottom">100</td>
<td align="center" valign="bottom">21</td>
<td align="center" valign="bottom">240</td>
<td align="center" valign="bottom">40</td>
</tr>
<tr>
<td align="left" valign="bottom">010</td>
<td align="left" valign="bottom">Male</td>
<td align="center" valign="bottom">3</td>
<td align="center" valign="bottom">100</td>
<td align="center" valign="bottom">21</td>
<td align="center" valign="bottom">234</td>
<td align="center" valign="bottom">80</td>
</tr>
<tr>
<td align="left" valign="bottom">011</td>
<td align="left" valign="bottom">Male</td>
<td align="center" valign="bottom">1</td>
<td align="center" valign="bottom">100</td>
<td align="center" valign="bottom">6</td>
<td align="center" valign="bottom">240</td>
<td align="center" valign="bottom">20</td>
</tr>
<tr>
<td align="left" valign="bottom">012</td>
<td align="left" valign="bottom">Male</td>
<td align="center" valign="bottom">4</td>
<td align="center" valign="bottom">100</td>
<td align="center" valign="bottom">21</td>
<td align="center" valign="bottom">240</td>
<td align="center" valign="bottom">40</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>&#x002A;1&#x202F;=&#x202F;almost none, 2&#x202F;=&#x202F;a few times or about an hour per week, 3&#x202F;=&#x202F;not often, 4&#x202F;=&#x202F;often (more than three times or about 3 h per week).</p>
</table-wrap-foot>
</table-wrap>
</sec>
<sec id="sec14">
<title>Machine learning model performance on binary classification</title>
<p>This section outlines the approach used to implement each ML model, followed by detailed explanations of data entries and data processing procedures. Default parameter values provided by the original models were used to ensure the stability of the learning procedures. For instance, the <italic>random_state</italic> parameter in the Python <italic>Scikit-learn</italic> library was set to an integer based on the data size for certain machine learning models (e.g., decision tree, k-means). Finally, the performance metrics were reported to evaluate each model&#x2019;s performance on the binary classification tasks in this study. <xref ref-type="table" rid="tab3">Table 3</xref> presents a comprehensive comparison of the results across all machine learning models.</p>
<table-wrap position="float" id="tab3">
<label>Table 3</label>
<caption>
<p>Performance metrics achieved by different machine learning models.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="top">Machine learning models</th>
<th align="center" valign="top">A</th>
<th align="center" valign="top">P</th>
<th align="center" valign="top">R</th>
<th align="center" valign="top">F</th>
<th align="left" valign="top">Random state values and parameters used</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top" colspan="6">Supervised</td>
</tr>
<tr>
<td align="left" valign="top">Logistic regression<break/>(after adding synthetic data)</td>
<td align="center" valign="top">0.67<break/>(0.80)</td>
<td align="center" valign="top">0.83<break/>(0.87)</td>
<td align="center" valign="top">0.67<break/>(0.80)</td>
<td align="center" valign="top">0.67<break/>(0.80)</td>
<td align="left" valign="top">Train: test&#x202F;=&#x202F;0.8:0.2, random_state&#x202F;=&#x202F;114<break/>Solver&#x202F;=&#x202F;lbfgs</td>
</tr>
<tr>
<td align="left" valign="top">Decision tree<break/>(after adding synthetic data)</td>
<td align="center" valign="top">0.67<break/>(0.80)</td>
<td align="center" valign="top">0.44<break/>(0.87)</td>
<td align="center" valign="top">0.67<break/>(0.80)</td>
<td align="center" valign="top">0.53<break/>(0.80)</td>
<td align="left" valign="top">Train: test&#x202F;=&#x202F;0.8:0.2, random_state&#x202F;=&#x202F;143<break/>Depth of tree&#x202F;=&#x202F;2</td>
</tr>
<tr>
<td align="left" valign="top">SVM<break/>(after adding synthetic data)</td>
<td align="center" valign="top">0.67<break/>(0.80)</td>
<td align="center" valign="top">0.83<break/>(0.87)</td>
<td align="center" valign="top">0.67<break/>(0.80)</td>
<td align="center" valign="top">0.67<break/>(0.80)</td>
<td align="left" valign="top">Train: test&#x202F;=&#x202F;0.8:0.2, random_state&#x202F;=&#x202F;114<break/>Kernel&#x202F;=&#x202F;linear</td>
</tr>
<tr>
<td align="left" valign="top" colspan="6">Unsupervised</td>
</tr>
<tr>
<td align="left" valign="top">K-Means<break/>(after adding synthetic data)</td>
<td align="center" valign="top">0.58<break/>(0.58)</td>
<td align="center" valign="top">0.59<break/>(0.59)</td>
<td align="center" valign="top">0.58<break/>(0.58)</td>
<td align="center" valign="top">0.58<break/>(0.58)</td>
<td align="left" valign="top">Cluster&#x202F;=&#x202F;2, random_state&#x202F;=&#x202F;1</td>
</tr>
<tr>
<td align="left" valign="top">KNN<break/>(after adding synthetic data)</td>
<td align="center" valign="top">0.67<break/>(0.67)</td>
<td align="center" valign="top">0.83<break/>(0.83)</td>
<td align="center" valign="top">0.67<break/>(0.67)</td>
<td align="center" valign="top">0.67<break/>(0.67)</td>
<td align="left" valign="top">Neighbor&#x202F;=&#x202F;1, random_state&#x202F;=&#x202F;1</td>
</tr>
<tr>
<td align="left" valign="top">ANN<break/>(after adding synthetic data)</td>
<td align="center" valign="top">0.67<break/>(0.67)</td>
<td align="center" valign="top">0.44<break/>(1.00)</td>
<td align="center" valign="top">0.67<break/>(0.67)</td>
<td align="center" valign="top">0.53<break/>(0.80)</td>
<td align="left" valign="top">activation and optimizer, random_state&#x202F;=&#x202F;3<break/>(random_state&#x202F;=&#x202F;2)</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>A, accuracy rate; P, precision; R, recall; F, F1-score.</p>
</table-wrap-foot>
</table-wrap>
</sec>
<sec id="sec15">
<title>Supervised learning models</title>
<p>In the supervised learning models, the training dataset was labeled to help the machine identify which samples were assigned to the 2D3D or 3D2D groups. Each model&#x2019;s performance on the binary classification task was then evaluated using the testing dataset. It is worth noting that the sample size in this study was small. Consequently, although the data were split into training and testing sets in a conventional manner, the learning performance may not be as high as reported in the literature.</p>
<sec id="sec16">
<title>Logistic regression</title>
<p>This model was used to classify samples, represented by 22 data points (<xref ref-type="table" rid="tab1">Table 1</xref>), into the two treatment groups (2D3D and 3D2D). Using an 80/20 training-testing split and a <italic>random_state</italic> value of 114, the model achieved an accuracy of 67%. The solver parameter was set to its default (<italic>lbfgs</italic>), and alternative solvers (<italic>liblinear</italic>, <italic>newton-cg</italic>, <italic>sag</italic>, <italic>saga</italic>) produced identical results. However, when <italic>random_state</italic> values below or above 114 were applied, accuracy decreased substantially, in some cases dropping to 33% or even 0%. Following the incorporation of synthetic data, model performance improved, with accuracy reaching a maximum of 80% under the same parameter settings. Moreover, precision, recall, and F1-score each reached at least 80% (<xref ref-type="table" rid="tab3">Table 3</xref>), indicating that the model not only achieved higher overall accuracy but also maintained balanced and reliable classification performance across evaluation metrics.</p>
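<p>The sensitivity to <italic>random_state</italic> reported above is easier to interpret once the possible test sets are enumerated. The following sketch assumes six participants per treatment and a three-sample test set (20% of 12, rounded up); the test-set size is an assumption. It counts how many 2D3D participants each possible test set contains and lists the accuracy values that three test samples can produce.</p>
<preformat><![CDATA[
```python
from collections import Counter
from itertools import combinations

# Six participants per treatment, as in the recruitment plan.
labels = ["2D3D"] * 6 + ["3D2D"] * 6
test_size = 3  # assumed: 20% of 12, rounded up

# Enumerate every possible test set and count its 2D3D members.
composition = Counter(
    sum(labels[i] == "2D3D" for i in test_idx)
    for test_idx in combinations(range(len(labels)), test_size)
)
total = sum(composition.values())
print(total)  # 220 possible test sets
print(sorted(composition.items()))  # [(0, 20), (1, 90), (2, 90), (3, 20)]

# With three test samples, accuracy can only take these values.
print([round(k / test_size, 2) for k in range(test_size + 1)])
# [0.0, 0.33, 0.67, 1.0]
```
]]></preformat>
<p>Under these assumptions, 40 of the 220 possible test sets contain a single class, and accuracy moves in steps of one third, so changing the seed alone can shift the reported accuracy between 0%, 33%, 67%, and 100%.</p>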
</sec>
<sec id="sec17">
<title>Decision tree</title>
<p>In this study, the 22 data points were used as input features to predict the treatment groups (2D3D vs. 3D2D) by learning simple decision rules. The decision tree model consistently achieved a maximum accuracy of 67% with an 80/20 training&#x2013;testing split and a <italic>random_state</italic> value of 143, a higher seed value than that used for the logistic regression model. The tree depth was set to the default configuration, allowing the tree to expand fully, while the default <italic>random_state</italic> parameter resulted in trees being built differently in each iteration. Notably, when <italic>random_state</italic> values lower or higher than 143 were applied during sample splitting, accuracy dropped sharply, reaching as low as 33% or even 0%. After the inclusion of synthetic data, model performance improved, with accuracy reaching 80% when the tree depth was restricted to 2 while keeping other parameters unchanged. Under these conditions, precision, recall, and F1-score also reached at least 80% (<xref ref-type="table" rid="tab3">Table 3</xref>), indicating improved and balanced classification performance.</p>
</sec>
<sec id="sec18">
<title>Support vector machine (SVM)</title>
<p>The model was trained with an 80/20 training&#x2013;testing split and a <italic>random_state</italic> value of 114. A <italic>linear</italic> kernel function was applied instead of the default <italic>rbf</italic> (radial basis function) kernel, resulting in an accuracy of 67%. Alternative kernels such as <italic>poly</italic> and <italic>sigmoid</italic> produced substantially lower accuracy. After adding the synthetic data, accuracy improved to 80% under the same parameter setting, with precision, recall, and F1-score also reaching at least 80% (<xref ref-type="table" rid="tab3">Table 3</xref>). However, when the <italic>rbf</italic> or <italic>poly</italic> kernels were used with the synthetic data, the model achieved 100% accuracy, indicating potential overfitting. By contrast, the <italic>sigmoid</italic> kernel achieved only 40% accuracy, reflecting poor performance.</p>
</sec>
</sec>
<sec id="sec19">
<title>Unsupervised learning models</title>
<p>The ground truth refers to the actual classification of the samples in this study, which is also known as the target. To predict the target accurately, the model learned from the given data points, which were treated as vector features. In other words, even though the dataset was not labeled, the model was still able to learn effectively based on the features during the learning process.</p>
<sec id="sec20">
<title>K-means</title>
<p>All 22 data points were used to predict participants&#x2019; gameplay sequence assignments by clustering the samples into two groups. The Elbow Method plot also confirmed that two clusters provided the best fit, consistent with the original random sample assignment (2D3D vs. 3D2D). The highest accuracy rate (58%) was achieved with <italic>random_state&#x202F;=&#x202F;1</italic>. Increasing the <italic>random_state</italic> value did not improve accuracy. Other performance metrics included precision (59%), recall (58%), and F1-score (58%).</p>
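<p>Because cluster identifiers produced by k-means are arbitrary, computing an accuracy rate requires aligning clusters with the treatment labels first. The sketch below shows one common way to do this by trying every one-to-one mapping and keeping the best one; it is not necessarily the alignment used in this study. The helper name <italic>clustering_accuracy</italic> and the cluster assignments are hypothetical, and the helper assumes the number of clusters equals the number of classes.</p>
<preformat><![CDATA[
```python
from itertools import permutations


def clustering_accuracy(y_true, cluster_ids):
    """Best accuracy over all one-to-one mappings of cluster ids to labels.

    Assumes there are exactly as many clusters as classes.
    """
    classes = sorted(set(y_true))
    clusters = sorted(set(cluster_ids))
    best = 0.0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[c] == t for c, t in zip(cluster_ids, y_true))
        best = max(best, hits / len(y_true))
    return best


# True sequence labels for 12 participants and hypothetical cluster ids.
y_true = ["2D3D"] * 6 + ["3D2D"] * 6
cluster_ids = [1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

print(round(clustering_accuracy(y_true, cluster_ids), 2))  # 0.58 (7 of 12)
```
]]></preformat>
<p>If all 12 participants are clustered, an accuracy of 58% corresponds to 7 correctly grouped participants, only one more than the 6 expected from an uninformative split into two equal groups.</p>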
<p>Feature importance analysis showed that the most influential variable was the total time spent collecting all eight treasures in the 3D game (No. 22 in <xref ref-type="table" rid="tab1">Table 1</xref>). Gender and gameplay frequency were not identified as key predictors of assignment, although gender contributed slightly more than gameplay frequency. After incorporating synthetic data, results were reproduced consistently. In addition, Silhouette scores ranged from 0.165 to 0.410, suggesting that participants fit reasonably well within their assigned clusters compared to alternative cluster assignments (<xref ref-type="bibr" rid="ref1">Alloghani et al., 2019</xref>; <xref ref-type="bibr" rid="ref41">Prastyo et al., 2020</xref>). No evidence of overfitting was detected.</p>
</sec>
<sec id="sec21">
<title>K-nearest neighbor (KNN)</title>
<p>Given the small sample size, a lower number of neighbors was deemed appropriate. Accordingly, the number of neighbors was set to 1 with <italic>random_state&#x202F;=&#x202F;1</italic>, resulting in an accuracy rate of 67%, recall and F1-scores of 67%, and a higher precision score of 83%. These values were consistent across both the training and testing datasets. However, when the number of neighbors was increased to 2 or 3, accuracy dropped to 33%. Although using more than 4 neighbors occasionally achieved 67% accuracy, this was not considered reasonable given the limited sample size.</p>
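<p>The role of the neighbor count can be illustrated with a minimal nearest-neighbor classifier. This is a sketch using Euclidean distance and majority voting on toy two-feature data; the helper name <italic>knn_predict</italic> and all feature values are hypothetical and do not reproduce the configuration used in this study.</p>
<preformat><![CDATA[
```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=1):
    """Majority label among the k training rows closest to x."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    # On a tied vote, the label of the nearer neighbor is returned.
    return votes.most_common(1)[0][0]


# Toy normalized features (illustrative values only).
train_X = [[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8]]
train_y = ["2D3D", "2D3D", "3D2D", "3D2D"]

print(knn_predict(train_X, train_y, [0.15, 0.15], k=1))  # 2D3D
print(knn_predict(train_X, train_y, [0.85, 0.85], k=3))  # 3D2D
```
]]></preformat>
<p>With two classes, an even number of neighbors can produce tied votes, and with very few training samples a larger neighborhood quickly includes distant points from the other group, which is one reason a small odd value is a common choice.</p>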
<p>Since KNN is not a tree-based model (e.g., decision tree or random forest) and lacks built-in feature importance measures (unlike <italic>RandomForestClassifier</italic> or <italic>GradientBoostingClassifier</italic>) (<xref ref-type="bibr" rid="ref43">Pudjihartono et al., 2022</xref>), feature relevance was examined through statistical methods, including chi-square tests, <italic>f_classif</italic>, and RFE (Recursive Feature Elimination). These analyses identified the following important features: (1) the time spent hunting the sixth treasure in the 3D game (No. 18 in <xref ref-type="table" rid="tab1">Table 1</xref>), (2) the total time spent hunting all treasures in the 2D game (No. 21), and (3) the total scores earned in the 2D game (No. 3). Similar findings were obtained for gender and gameplay frequency in terms of their relatively low importance. After incorporating synthetic data, the same results were reproduced, confirming the robustness of the findings.</p>
</sec>
<sec id="sec22">
<title>Artificial neural networks (ANN)</title>
<p>Since ANN can be applied in both supervised and unsupervised learning contexts, this study implemented both approaches by encoding participants&#x2019; gameplay scores and completion time collected from the 2D and 3D games, using two hidden layers. To mitigate overfitting, where the model fits the training data perfectly but generalizes poorly to the test data (<xref ref-type="bibr" rid="ref501">Bejani and Ghatee, 2021</xref>), two activation functions were applied, i.e., <italic>relu</italic> (layer dense&#x202F;=&#x202F;4) followed by <italic>sigmoid</italic> (layer dense&#x202F;=&#x202F;1).</p>
<p>Hyperparameter tuning was then conducted using the <italic>keras-tuner</italic> library. The <italic>ADAM</italic> optimizer and categorical <italic>cross-entropy loss</italic> function were employed during training to minimize loss (<xref ref-type="bibr" rid="ref510">Ghosh and Gupta, 2023</xref>). In addition, <italic>GridSearch</italic> was used to explore combinations of hyperparameters, limited to 10 trials (as overfitting was observed with more than 10 trials given the dataset size). Under these conditions, the ANN achieved its best accuracy of 67%, with <italic>random_state&#x202F;=&#x202F;3</italic> for training&#x2013;testing splitting and an optimal learning rate of 0.0001. Other metrics were also obtained but did not exceed 67%. In addition, feature relevance was examined with multiple statistical tests. The chi-square test identified the total scores participants received in the 2D game (No. 3, <xref ref-type="table" rid="tab1">Table 1</xref>) as an important predictor. The <italic>f_classif</italic> and <italic>RFE</italic> tests highlighted the total time spent hunting all treasures in the 2D game (No. 21) as important. Gender and gameplay frequency were consistently found to have relatively low importance.</p>
<p>After synthetic data were incorporated, accuracy and recall remained unchanged with a lower <italic>random_state</italic>. The F1-score increased to 80%, while precision rose to 100%, indicating potential overfitting caused by false positives being misclassified. This suggests that the trained ANN may incorrectly classify well-randomized data as flawed. Nevertheless, across all three tests, the total 2D game score (No. 3) was consistently identified as an important feature in binary classification.</p>
</sec>
</sec>
</sec>
<sec sec-type="discussion" id="sec23">
<title>Discussion</title>
<p>Randomization is critical in experiments; however, traditional validation methods may lack sensitivity to hidden bias, such as distribution imbalance or non-linear interaction among participants. This study investigated the capability of supervised and unsupervised machine learning (ML) models to detect randomization flaws. Feature importance analyses were also conducted to identify predictors of potential assignment bias. A series of actions were carried out, including participant recruitment, implementation of a well-randomized sample assignment, training of various machine learning models, and evaluation of model performance on a binary classification task using accuracy, precision, recall, and F1-score. In this study, 12 participants were randomly assigned to play a learning direction game in two interface sequences (2D3D or 3D2D).</p>
<sec id="sec24">
<title>Supervised vs. unsupervised models</title>
<p>All supervised ML models (logistic regression, decision tree, and SVM) achieved satisfactory classification performance (67% accuracy) when trained on labeled data (<xref ref-type="bibr" rid="ref5">Bichri et al., 2024</xref>; <xref ref-type="bibr" rid="ref52">Zhang et al., 2017</xref>). After incorporating synthetic data, their performance further improved, with all three models reaching 80% accuracy and 87% precision (<xref ref-type="table" rid="tab3">Table 3</xref>). The unsupervised ML model k-means achieved only 58%. KNN and ANN consistently plateaued at 67% accuracy, even after synthetic data were added. The ANN model showed the weakest performance, with precision as low as 44%, likely due to the small sample size and its limited ability to capture meaningful patterns in binary classification (<xref ref-type="bibr" rid="ref1">Alloghani et al., 2019</xref>; <xref ref-type="bibr" rid="ref500">Alwosheel et al., 2018</xref>; <xref ref-type="bibr" rid="ref43">Pudjihartono et al., 2022</xref>). While synthetic data improved ANN performance, it also introduced overfitting, as indicated by inflated precision scores. Overall, supervised ML models achieved higher accuracy than unsupervised models in detecting randomization flaws, thereby supporting H1.</p>
</sec>
<sec id="sec25">
<title>Feature importance and assignment bias</title>
<p>Feature importance analysis offers a practical means of guiding unsupervised ML models by identifying variables that most strongly influence cluster separation (<xref ref-type="bibr" rid="ref43">Pudjihartono et al., 2022</xref>). In this study, feature importance results revealed consistent findings across models. In both KNN and ANN, the total scores earned in the 2D game and the total time spent collecting all treasures in the 2D game emerged as key predictors. In k-means clustering, the total time spent collecting all treasures in the 3D game (No. 22, <xref ref-type="table" rid="tab1">Table 1</xref>) was identified as the most influential feature. Gender and gameplay frequency were consistently less important, although gender showed slightly greater predictive relevance. This suggests that the experimental design achieved effective randomization with respect to demographic variables. At the same time, the findings indicate that assigning participants to groups based on gameplay performance risks data leakage, as learners may be prematurely categorized as low or high performers. Overall, the evidence supports H2: Feature importance analysis reveals key predictors of assignment bias.</p>
</sec>
<sec id="sec26">
<title>Confirmation of randomization and learning outcomes</title>
<p>The randomization procedures implemented in this study reflect a systematic sample assignment process. The results demonstrated that supervised ML classifiers effectively validated the randomization in binary classification, confirming that the two experimental treatments produced distinct learning outcomes in both scores and gameplay times. With supervised ML validation, the differences in learning outcomes (higher scores and faster completion in the 2D-first condition) can be attributed to treatment effects rather than pre-existing biases. Conversely, if independent variables such as gender or gaming frequency had significantly influenced the outcomes, supervised ML classifiers would have performed only at chance level (~50% accuracy), providing neither support for randomization nor evidence of model sensitivity. Unsupervised ML classifiers, which are designed to detect latent patterns and predictive relationships, may in some cases outperform supervised approaches in classification tasks. However, this was not the case in the present study. Overall, the consistent performance across multiple ML methods confirmed both the success of the randomization procedure and the impact of interface sequence on learning outcomes.</p>
</sec>
<sec id="sec27">
<title>Learning performance in 2D vs. 3D</title>
<p>Participants assigned to the 2D3D sequence achieved higher scores than those in the 3D2D sequence. This aligns with expectations, as 2D representations are inherently simpler and less cognitively demanding than 3D representations (<xref ref-type="bibr" rid="ref24">Herbert and Chen, 2015</xref>; <xref ref-type="bibr" rid="ref23">Hegarty, 2011</xref>; <xref ref-type="bibr" rid="ref25">Hicks et al., 2003</xref>). Two-dimensional tasks allow learners to concentrate more effectively on core learning concepts without the distraction of additional spatial complexity. In contrast, 3D representations require greater cognitive effort to interpret and are less prevalent in traditional educational materials. Cognitive science research further supports that 2D visualizations demand less mental effort and are more easily processed by the visual system than 3D representations (<xref ref-type="bibr" rid="ref9">Chang et al., 2017</xref>; <xref ref-type="bibr" rid="ref23">Hegarty, 2011</xref>). The human visual system processes flat 2D images more easily because all elements are presented on a single plane, which reduces cognitive load and enhances comprehension. Consequently, 2D representations tend to be more accessible for learners due to their simplicity, familiarity, ease of application, and reduced cognitive demand. Nevertheless, combining 2D and 3D perspectives can be beneficial in certain domains such as ML model development (<xref ref-type="bibr" rid="ref10">Chen et al., 2019</xref>) and structural design (<xref ref-type="bibr" rid="ref26">Hong et al., 2024</xref>).</p>
</sec>
<sec id="sec28">
<title>Implications for game-based learning</title>
<p>Game-based learning is gaining increasing popularity across educational levels and age groups (<xref ref-type="bibr" rid="ref31">Liu et al., 2020</xref>; <xref ref-type="bibr" rid="ref47">Sumi and Sato, 2022</xref>). Beyond the complexity of 3D games, the literature emphasizes their immersive effects, particularly in virtual reality (VR) and augmented reality (AR) settings, compared to traditional 2D games. Learners frequently report greater excitement and engagement with 3D games than with 2D ones (<xref ref-type="bibr" rid="ref9">Chang et al., 2017</xref>; <xref ref-type="bibr" rid="ref16">Dalgarno and Lee, 2010</xref>). To ensure that learners achieve their learning objectives, it is important to incorporate virtual guides or instructional pages that present fundamental subject concepts within the game environment (<xref ref-type="bibr" rid="ref11">Chiu et al., 2022</xref>; <xref ref-type="bibr" rid="ref18">Dicheva et al., 2021</xref>; <xref ref-type="bibr" rid="ref51">Yilmaz and Cagiltay, 2016</xref>). Recent advances in intelligent learning companions, powered by well-trained ML models, further enhance the educational value of game-based learning environments. These companions provide personalized interaction and adaptive support during gameplay, thereby improving engagement and learning outcomes (<xref ref-type="bibr" rid="ref1">Alloghani et al., 2019</xref>; <xref ref-type="bibr" rid="ref500">Alwosheel et al., 2018</xref>; <xref ref-type="bibr" rid="ref43">Pudjihartono et al., 2022</xref>).</p>
</sec>
<sec id="sec29">
<title>Limitations</title>
<p>While ML shows promise for validating randomization, its reliability depends strongly on sample size and experimental context. As shown in the ANN results, performance was poorest with the limited sample size (<italic>n</italic> =&#x202F;12). After synthetic data were added, overfitting emerged, suggesting that the model failed to generalize. In this study, the direction game served as the experimental scenario. Players required some adaptation to orient themselves within the environment before gameplay. This was especially challenging in the 3D2D treatment, which relied heavily on spatial orientation skills. Once players adapted to the 3D environment, however, performance improved more rapidly in the 2D3D treatment, where participants transitioned from simpler to more complex tasks. If the game is either too easy or too difficult, causing players to consistently achieve full scores or perform poorly, the resulting learning outcomes become highly predictable. In such cases, sample randomization would be trivial, and ML classification would add little value because no meaningful patterns could be explored. Therefore, the findings of the present study should be regarded as preliminary. They may not easily generalize to larger or more complex experimental designs, where greater variability in participant performance and richer datasets could lead to more robust validation of randomization.</p>
</sec>
</sec>
<sec sec-type="conclusions" id="sec30">
<title>Conclusion</title>
<p>This study contributes to advancing the use of machine learning (ML) models for validating sample randomization in experimental assignments. A carefully designed recruitment plan, followed by a well-randomized sample assignment, was shown to enable the evaluation of ML performance. Supervised ML models (i.e., logistic regression, decision trees, and SVM) achieved a satisfactory level of accuracy in detecting randomization validity. Feature importance analysis further demonstrated that while ML offers considerable promise, its reliability is contingent on factors such as sample size, noise tolerance, and experimental context.</p>
<p>Compared with traditional validation methods, ML models can capture complex and subtle biases, offering a more sensitive evaluation of non-linear relationships and higher-order interactions. In this study, ML models were employed as a robustness check to validate experimental randomization during sample assignment. The novelty lies in demonstrating that ML can be used to evaluate claims of participant randomization in experimental designs, suggesting its potential as a supplementary tool. However, while ML shows promise in detecting randomization patterns, its efficacy depends on sample size and design complexity. With very small samples, its reliability remains limited. Future work should therefore apply this approach to larger and more balanced datasets, combining ML with traditional balance tests (e.g., <italic>t</italic>-test, <italic>F</italic>-test). It is also recommended that ML be systematically compared with standard balance tests across diverse experimental contexts.</p>
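<p>For reference, the kind of traditional balance check mentioned above can be sketched in a few lines. The example computes Welch's <italic>t</italic> statistic and a standardized mean difference for one covariate across two groups. The function names and the weekly gameplay hours are hypothetical and serve only to illustrate the calculation. No p-value is computed here.</p>
<preformat>
```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for the difference in means (unequal variances)."""
    standard_error = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error


def standardized_mean_difference(group_a, group_b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = sqrt((variance(group_a) + variance(group_b)) / 2)
    return (mean(group_a) - mean(group_b)) / pooled_sd


# Hypothetical covariate: weekly gameplay hours in each treatment group.
hours_2d3d = [2, 4, 3, 5, 1, 3]
hours_3d2d = [3, 2, 4, 4, 2, 4]
t_statistic = welch_t(hours_2d3d, hours_3d2d)
smd = standardized_mean_difference(hours_2d3d, hours_3d2d)
```
</preformat>
<p>Values of both statistics close to zero are consistent with covariate balance. Such tests examine one variable at a time, whereas a classifier-based check can also respond to combinations of variables.</p>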
<p>Finally, future studies are encouraged to extend this framework to experiments with more than two treatments. The capabilities of other ML models, including semi-supervised and self-supervised approaches, should also be explored in classification tasks to further expand understanding in this area.</p>
</sec>
</body>
<back>
<sec sec-type="data-availability" id="sec31">
<title>Data availability statement</title>
<p>The raw data supporting the conclusions of this article will be made available by the author, without undue reservation.</p>
</sec>
<sec sec-type="ethics-statement" id="sec32">
<title>Ethics statement</title>
<p>The studies involving humans were approved by the National Chengchi University Research Ethics Committee (NCCU-REC-202105-I030). The studies were conducted in accordance with the local legislation and institutional requirements. The participants provided their written informed consent to participate in this study.</p>
</sec>
<sec sec-type="author-contributions" id="sec33">
<title>Author contributions</title>
<p>P-HH: Conceptualization, Formal analysis, Methodology, Validation, Visualization, Writing &#x2013; original draft, Writing &#x2013; review &#x0026; editing.</p>
</sec>
<sec sec-type="funding-information" id="sec34">
<title>Funding</title>
<p>The author(s) declare that financial support was received for the research and/or publication of this article. This research was funded by National Science and Technology Council, Taiwan: NSTC 110-2511-H-004-001-MY3, M.-S. Chiu, 2021&#x2013;2024 &#x201C;Affect in Mathematics Learning with Teaching: Theory Building and Virtual Reality Experimental Studies.&#x201D;</p>
</sec>
<sec sec-type="COI-statement" id="sec35">
<title>Conflict of interest</title>
<p>The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="ai-statement" id="sec36">
<title>Generative AI statement</title>
<p>The author(s) declare that no Gen AI was used in the creation of this manuscript.</p>
<p>Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.</p>
</sec>
<sec sec-type="disclaimer" id="sec37">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="ref1"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Alloghani</surname><given-names>M.</given-names></name> <name><surname>Al-Jumeily</surname><given-names>D.</given-names></name> <name><surname>Mustafina</surname><given-names>J.</given-names></name> <name><surname>Hussain</surname><given-names>A.</given-names></name> <name><surname>Aljaaf</surname><given-names>A. J.</given-names></name></person-group> (<year>2019</year>). &#x201C;<article-title>A systematic review on supervised and unsupervised machine learning algorithms for data science</article-title>&#x201D; in <source>Supervised and unsupervised learning for data science. Unsupervised and semi-supervised learning</source>. eds. <person-group person-group-type="editor"><name><surname>Berry</surname><given-names>W. M.</given-names></name> <name><surname>Mohamed</surname><given-names>A.</given-names></name> <name><surname>Yap</surname><given-names>B. W.</given-names></name></person-group> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>3</fpage>&#x2013;<lpage>21</lpage>.</citation></ref>
<ref id="ref500"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Alwosheel</surname><given-names>A.</given-names></name> <name><surname>Van Cranenburgh</surname><given-names>S.</given-names></name> <name><surname>Chorus</surname><given-names>C. G.</given-names></name></person-group> (<year>2018</year>). <article-title>Is your dataset big enough? Sample size requirements when using artificial neural networks for discrete choice analysis</article-title>. <source>J. Choice Modelling</source> <volume>28</volume>, <fpage>167</fpage>&#x2013;<lpage>182</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.jocm.2018.07.002</pub-id></citation></ref>
<ref id="ref2"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ar&#x0131;kan</surname><given-names>S.</given-names></name> <name><surname>&#x00D6;zer</surname><given-names>F.</given-names></name> <name><surname>&#x015E;eker</surname><given-names>V.</given-names></name> <name><surname>Erta&#x015F;</surname><given-names>G.</given-names></name></person-group> (<year>2020</year>). <article-title>The importance of sample weights and plausible values in large-scale assessments</article-title>. <source>J. Meas. Eval. Educ. Psychol</source> <volume>11</volume>, <fpage>43</fpage>&#x2013;<lpage>60</lpage>. doi: <pub-id pub-id-type="doi">10.21031/epod.602765</pub-id></citation></ref>
<ref id="ref3"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Baker</surname><given-names>R. S.</given-names></name> <name><surname>D'Mello</surname><given-names>S. K.</given-names></name> <name><surname>Rodrigo</surname><given-names>M. M. T.</given-names></name> <name><surname>Graesser</surname><given-names>A. C.</given-names></name></person-group> (<year>2010</year>). <article-title>Better to be frustrated than bored: the incidence, persistence, and impact of learners&#x2019; cognitive-affective states during interactions with three different computer-based learning environments</article-title>. <source>Int. J. Hum.-Comput. Stud.</source> <volume>68</volume>, <fpage>223</fpage>&#x2013;<lpage>241</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.ijhcs.2009.12.003</pub-id></citation></ref>
<ref id="ref501"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bejani</surname><given-names>M. M.</given-names></name> <name><surname>Ghatee</surname><given-names>M.</given-names></name></person-group> (<year>2021</year>). <article-title>A systematic review on overfitting control in shallow and deep neural networks</article-title>. <source>Artificial Intelligence Rev.</source> <volume>54</volume>, <fpage>6391</fpage>&#x2013;<lpage>6438</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s10462-021-09975-1</pub-id></citation></ref>
<ref id="ref4"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bernerth</surname><given-names>J. B.</given-names></name> <name><surname>Aguinis</surname><given-names>H.</given-names></name></person-group> (<year>2016</year>). <article-title>A critical review and best-practice recommendations for control variable usage</article-title>. <source>Pers. Psychol.</source> <volume>69</volume>, <fpage>229</fpage>&#x2013;<lpage>283</lpage>. doi: <pub-id pub-id-type="doi">10.1111/peps.12103</pub-id></citation></ref>
<ref id="ref5"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bichri</surname><given-names>H.</given-names></name> <name><surname>Chergui</surname><given-names>A.</given-names></name> <name><surname>Hain</surname><given-names>M.</given-names></name></person-group> (<year>2024</year>). <article-title>Investigating the impact of train/test split ratio on the performance of pre-trained models with custom datasets</article-title>. <source>Int. J. Adv. Comput. Sci. Appl.</source> <volume>15</volume>, <fpage>331</fpage>&#x2013;<lpage>339</lpage>. doi: <pub-id pub-id-type="doi">10.14569/IJACSA.2024.0150235</pub-id></citation></ref>
<ref id="ref6"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Brauer</surname><given-names>M.</given-names></name> <name><surname>Curtin</surname><given-names>J. J.</given-names></name></person-group> (<year>2018</year>). <article-title>Linear mixed-effects models and the analysis of nonindependent data: a unified framework to analyze categorical and continuous independent variables that vary within-subjects and/or within-items</article-title>. <source>Psychol. Methods</source> <volume>23</volume>, <fpage>389</fpage>&#x2013;<lpage>411</lpage>. doi: <pub-id pub-id-type="doi">10.1037/met0000159</pub-id>, PMID: <pub-id pub-id-type="pmid">29172609</pub-id></citation></ref>
<ref id="ref7"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bruhn</surname><given-names>M.</given-names></name> <name><surname>McKenzie</surname><given-names>D.</given-names></name></person-group> (<year>2009</year>). <article-title>In pursuit of balance: randomization in practice in development field experiments</article-title>. <source>Am. Econ. J. Appl. Econ.</source> <volume>1</volume>, <fpage>200</fpage>&#x2013;<lpage>232</lpage>. doi: <pub-id pub-id-type="doi">10.1257/app.1.4.200</pub-id></citation></ref>
<ref id="ref8"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Campbell</surname><given-names>D. T.</given-names></name> <name><surname>Stanley</surname><given-names>J. C.</given-names></name></person-group> (<year>2015</year>). <source>Experimental and quasi-experimental designs for research</source>. Boston, MA, U.S.A.: Houghton Mifflin Company.</citation></ref>
<ref id="ref9"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chang</surname><given-names>C. C.</given-names></name> <name><surname>Liang</surname><given-names>C.</given-names></name> <name><surname>Chou</surname><given-names>P. N.</given-names></name> <name><surname>Lin</surname><given-names>G. Y.</given-names></name></person-group> (<year>2017</year>). <article-title>Is game-based learning better in flow experience and various types of cognitive load than non-game-based learning? Perspective from multimedia and media richness</article-title>. <source>Comput. Hum. Behav</source> <volume>71</volume>, <fpage>218</fpage>&#x2013;<lpage>227</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.chb.2017.01.031</pub-id>, PMID: <pub-id pub-id-type="pmid">41098308</pub-id></citation></ref>
<ref id="ref10"><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Chen</surname><given-names>Y.</given-names></name> <name><surname>Yang</surname><given-names>B.</given-names></name> <name><surname>Liang</surname><given-names>M.</given-names></name> <name><surname>Urtasun</surname><given-names>R.</given-names></name></person-group> (<year>2019</year>). <source>Learning joint 2d-3d representations for depth completion</source>. In <conf-name>Proceedings of the IEEE/CVF International Conference on Computer Vision</conf-name> (pp. <fpage>10023</fpage>&#x2013;<lpage>10032</lpage>).</citation></ref>
<ref id="ref11"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chiu</surname><given-names>M.-S.</given-names></name> <name><surname>Lin</surname><given-names>F.-L.</given-names></name> <name><surname>Yang</surname><given-names>K.-L.</given-names></name> <name><surname>Hasumi</surname><given-names>T.</given-names></name> <name><surname>Wu</surname><given-names>T.-J.</given-names></name> <name><surname>Lin</surname><given-names>P.-S.</given-names></name></person-group> (<year>2022</year>). <article-title>The interplay of affect and cognition in the mathematics grounding activity: forming an affective teaching model</article-title>. <source>Eurasia J. Math. Sci. Technol. Educ</source> <volume>18</volume>:<fpage>em2187</fpage>. doi: <pub-id pub-id-type="doi">10.29333/ejmste/12579</pub-id></citation></ref>
<ref id="ref12"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Choubineh</surname><given-names>A.</given-names></name> <name><surname>Chen</surname><given-names>J.</given-names></name> <name><surname>Wood</surname><given-names>D. A.</given-names></name> <name><surname>Coenen</surname><given-names>F.</given-names></name> <name><surname>Ma</surname><given-names>F.</given-names></name></person-group> (<year>2023</year>). <article-title>Deep ensemble learning for high-dimensional subsurface fluid flow modeling</article-title>. <source>Eng. Appl. Artif. Intell.</source> <volume>126</volume>:<fpage>106968</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.engappai.2023.106968</pub-id></citation></ref>
<ref id="ref13"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chung</surname><given-names>L.-Y.</given-names></name> <name><surname>Chang</surname><given-names>R.-C.</given-names></name></person-group> (<year>2017</year>). <article-title>The effect of gender on motivation and student achievement in digital game-based learning: a case study of a contented-based classroom</article-title>. <source>Eurasia J. Math. Sci. Technol. Educ.</source> <volume>13</volume>, <fpage>2309</fpage>&#x2013;<lpage>2327</lpage>. doi: <pub-id pub-id-type="doi">10.12973/eurasia.2017.01227a</pub-id></citation></ref>
<ref id="ref14"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cohen</surname><given-names>A. L.</given-names></name> <name><surname>Staub</surname><given-names>A.</given-names></name></person-group> (<year>2015</year>). <article-title>Within-subject consistency and between-subject variability in Bayesian reasoning strategies</article-title>. <source>Cogn. Psychol.</source> <volume>81</volume>, <fpage>26</fpage>&#x2013;<lpage>47</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.cogpsych.2015.08.001</pub-id>, PMID: <pub-id pub-id-type="pmid">26354671</pub-id></citation></ref>
<ref id="ref15"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cox</surname><given-names>C. R.</given-names></name> <name><surname>Moscardini</surname><given-names>E. H.</given-names></name> <name><surname>Cohen</surname><given-names>A. S.</given-names></name> <name><surname>Tucker</surname><given-names>R. P.</given-names></name></person-group> (<year>2020</year>). <article-title>Machine learning for suicidology: a practical review of exploratory and hypothesis-driven approaches</article-title>. <source>Clin. Psychol. Rev.</source> <volume>82</volume>:<fpage>101940</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.cpr.2020.101940</pub-id>, PMID: <pub-id pub-id-type="pmid">33130528</pub-id></citation></ref>
<ref id="ref16"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dalgarno</surname><given-names>B.</given-names></name> <name><surname>Lee</surname><given-names>M. J. W.</given-names></name></person-group> (<year>2010</year>). <article-title>What are the learning affordances of 3-D virtual environments?</article-title> <source>Br. J. Educ. Technol.</source> <volume>41</volume>, <fpage>10</fpage>&#x2013;<lpage>32</lpage>. doi: <pub-id pub-id-type="doi">10.1111/j.1467-8535.2009.01038.x</pub-id></citation></ref>
<ref id="ref17"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>DeCoster</surname><given-names>J.</given-names></name> <name><surname>Iselin</surname><given-names>A. M. R.</given-names></name> <name><surname>Gallucci</surname><given-names>M.</given-names></name></person-group> (<year>2009</year>). <article-title>A conceptual and empirical examination of justifications for dichotomization</article-title>. <source>Psychol. Methods</source> <volume>14</volume>, <fpage>349</fpage>&#x2013;<lpage>366</lpage>. doi: <pub-id pub-id-type="doi">10.1037/a0016956</pub-id>, PMID: <pub-id pub-id-type="pmid">19968397</pub-id></citation></ref>
<ref id="ref18"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dicheva</surname><given-names>D.</given-names></name> <name><surname>Dichev</surname><given-names>C.</given-names></name> <name><surname>Agre</surname><given-names>G.</given-names></name> <name><surname>Angelova</surname><given-names>G.</given-names></name></person-group> (<year>2021</year>). <article-title>Gamification in education: a systematic mapping study</article-title>. <source>J. Educ. Technol. Soc.</source> <volume>24</volume>, <fpage>75</fpage>&#x2013;<lpage>88</lpage>. <ext-link xlink:href="https://www.jstor.org/stable/jeductechsoci.18.3.75" ext-link-type="uri">https://www.jstor.org/stable/jeductechsoci.18.3.75</ext-link></citation></ref>
<ref id="ref19"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ertl</surname><given-names>B.</given-names></name> <name><surname>Hartmann</surname><given-names>F. G.</given-names></name> <name><surname>Heine</surname><given-names>J. H.</given-names></name></person-group> (<year>2020</year>). <article-title>Analyzing large-scale studies: benefits and challenges</article-title>. <source>Front. Psychol.</source> <volume>11</volume>:<fpage>577410</fpage>. doi: <pub-id pub-id-type="doi">10.3389/fpsyg.2020.577410</pub-id>, PMID: <pub-id pub-id-type="pmid">33362642</pub-id></citation></ref>
<ref id="ref20"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Feng</surname><given-names>M.</given-names></name> <name><surname>Heffernan</surname><given-names>N. T.</given-names></name></person-group> (<year>2007</year>). <article-title>Towards live informing and automatic analyzing of student learning: reporting in ASSISTment system</article-title>. <source>J. Interact. Learn. Res.</source> <volume>18</volume>, <fpage>207</fpage>&#x2013;<lpage>230</lpage>. <ext-link xlink:href="https://web.cs.wpi.edu/~mfeng/pub/feng_heffernan_JILR.pdf" ext-link-type="uri">https://web.cs.wpi.edu/~mfeng/pub/feng_heffernan_JILR.pdf</ext-link></citation></ref>
<ref id="ref610"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Francis</surname><given-names>J. J.</given-names></name> <name><surname>Johnston</surname><given-names>M.</given-names></name> <name><surname>Robertson</surname><given-names>C.</given-names></name> <name><surname>Glidewell</surname><given-names>L.</given-names></name> <name><surname>Entwistle</surname><given-names>V.</given-names></name> <name><surname>Eccles</surname><given-names>M. P.</given-names></name> <etal/></person-group> (<year>2010</year>). <article-title>What is an adequate sample size? Operationalising data saturation for theory-based interview studies</article-title>. <source>Psychol. Health</source> <volume>25</volume>, <fpage>1229</fpage>&#x2013;<lpage>1245</lpage>. doi: <pub-id pub-id-type="doi">10.1080/08870440903194015</pub-id></citation></ref>
<ref id="ref510"><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Ghosh</surname><given-names>J.</given-names></name> <name><surname>Gupta</surname><given-names>S.</given-names></name></person-group> (<year>2023</year>). &#x201C;<article-title>ADAM optimizer and CATEGORICAL CROSSENTROPY loss function-based CNN method for diagnosing colorectal Cancer</article-title>,&#x201D; in <source>2023 international conference on computational intelligence and sustainable engineering solutions (CISES)</source> (IEEE), <fpage>470</fpage>&#x2013;<lpage>474</lpage>.</citation></ref>
<ref id="ref21"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Goretzko</surname><given-names>D.</given-names></name> <name><surname>B&#x00FC;hner</surname><given-names>M.</given-names></name></person-group> (<year>2022</year>). <article-title>Factor retention using machine learning with ordinal data</article-title>. <source>Appl. Psychol. Meas.</source> <volume>46</volume>, <fpage>406</fpage>&#x2013;<lpage>421</lpage>. doi: <pub-id pub-id-type="doi">10.1177/01466216221089345</pub-id>, PMID: <pub-id pub-id-type="pmid">35812814</pub-id></citation></ref>
<ref id="ref22"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Heffernan</surname><given-names>N. T.</given-names></name> <name><surname>Heffernan</surname><given-names>C. L.</given-names></name></person-group> (<year>2014</year>). <article-title>The ASSISTments ecosystem: building a platform that brings scientists and teachers together for minimally invasive research on human learning and teaching</article-title>. <source>Int. J. Artif. Intell. Educ.</source> <volume>24</volume>, <fpage>470</fpage>&#x2013;<lpage>497</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s40593-014-0024-x</pub-id>, PMID: <pub-id pub-id-type="pmid">41098809</pub-id></citation></ref>
<ref id="ref23"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hegarty</surname><given-names>M.</given-names></name></person-group> (<year>2011</year>). <article-title>The cognitive science of visual-spatial displays: implications for design</article-title>. <source>Top. Cogn. Sci.</source> <volume>3</volume>, <fpage>446</fpage>&#x2013;<lpage>474</lpage>. doi: <pub-id pub-id-type="doi">10.1111/j.1756-8765.2011.01150.x</pub-id>, PMID: <pub-id pub-id-type="pmid">25164399</pub-id></citation></ref>
<ref id="ref24"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Herbert</surname><given-names>G.</given-names></name> <name><surname>Chen</surname><given-names>X.</given-names></name></person-group> (<year>2015</year>). <article-title>A comparison of usefulness of 2D and 3D representations of urban planning</article-title>. <source>Cartogr. Geogr. Inf. Sci.</source> <volume>42</volume>, <fpage>22</fpage>&#x2013;<lpage>32</lpage>. doi: <pub-id pub-id-type="doi">10.1080/15230406.2014.987694</pub-id></citation></ref>
<ref id="ref25"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hicks</surname><given-names>M.</given-names></name> <name><surname>O'Malley</surname><given-names>C.</given-names></name> <name><surname>Nichols</surname><given-names>S.</given-names></name> <name><surname>Anderson</surname><given-names>B.</given-names></name></person-group> (<year>2003</year>). <article-title>Comparison of 2D and 3D representations for visualising telecommunication usage</article-title>. <source>Behav. Inf. Technol.</source> <volume>22</volume>, <fpage>185</fpage>&#x2013;<lpage>201</lpage>. doi: <pub-id pub-id-type="doi">10.1080/0144929031000117080</pub-id></citation></ref>
<ref id="ref26"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hong</surname><given-names>J.</given-names></name> <name><surname>Hnatyshyn</surname><given-names>R.</given-names></name> <name><surname>Santos</surname><given-names>E. A.</given-names></name> <name><surname>Maciejewski</surname><given-names>R.</given-names></name> <name><surname>Isenberg</surname><given-names>T.</given-names></name></person-group> (<year>2024</year>). <article-title>A survey of designs for combined 2D+ 3D visual representations</article-title>. <source>IEEE Trans. Vis. Comput. Graph.</source> <volume>30</volume>, <fpage>2888</fpage>&#x2013;<lpage>2902</lpage>. doi: <pub-id pub-id-type="doi">10.1109/TVCG.2024.3388516</pub-id>, PMID: <pub-id pub-id-type="pmid">38648152</pub-id></citation></ref>
<ref id="ref27"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>J&#x00E4;rvel&#x00E4;</surname><given-names>S.</given-names></name> <name><surname>Ga&#x0161;evi&#x0107;</surname><given-names>D.</given-names></name> <name><surname>Sepp&#x00E4;nen</surname><given-names>T.</given-names></name> <name><surname>Pechenizkiy</surname><given-names>M.</given-names></name> <name><surname>Kirschner</surname><given-names>P. A.</given-names></name></person-group> (<year>2020</year>). <article-title>Bridging learning sciences, machine learning and affective computing for understanding cognition and affect in collaborative learning</article-title>. <source>Br. J. Educ. Technol.</source> <volume>51</volume>, <fpage>2391</fpage>&#x2013;<lpage>2406</lpage>. doi: <pub-id pub-id-type="doi">10.1111/bjet.12917</pub-id></citation></ref>
<ref id="ref28"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Khan</surname><given-names>A.</given-names></name> <name><surname>Ghosh</surname><given-names>S. K.</given-names></name></person-group> (<year>2021</year>). <article-title>Student performance analysis and prediction in classroom learning: a review of educational data mining studies</article-title>. <source>Educ. Inf. Technol.</source> <volume>26</volume>, <fpage>205</fpage>&#x2013;<lpage>240</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s10639-020-10230-3</pub-id>, PMID: <pub-id pub-id-type="pmid">41098809</pub-id></citation></ref>
<ref id="ref502"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lakens</surname><given-names>D.</given-names></name></person-group> (<year>2022</year>). <article-title>Sample size justification</article-title>. <source>Collabra: Psychol.</source> <volume>8</volume>:<fpage>33267</fpage>. doi: <pub-id pub-id-type="doi">10.1525/collabra.33267</pub-id></citation></ref>
<ref id="ref29"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Laukaityte</surname><given-names>I.</given-names></name> <name><surname>Wiberg</surname><given-names>M.</given-names></name></person-group> (<year>2018</year>). <article-title>Importance of sampling weights in multilevel modeling of international large-scale assessment data</article-title>. <source>Commun. Stat. Theory Methods</source> <volume>47</volume>, <fpage>4991</fpage>&#x2013;<lpage>5012</lpage>. doi: <pub-id pub-id-type="doi">10.1080/03610926.2017.1383429</pub-id></citation></ref>
<ref id="ref30"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Levy</surname><given-names>P. S.</given-names></name> <name><surname>Lemeshow</surname><given-names>S.</given-names></name></person-group> (<year>2013</year>). <source>Sampling of populations: Methods and applications</source>. Hoboken, New Jersey: <publisher-name>John Wiley and Sons, Inc</publisher-name>.</citation></ref>
<ref id="ref31"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname><given-names>Z. Y.</given-names></name> <name><surname>Shaikh</surname><given-names>Z. A.</given-names></name> <name><surname>Gazizova</surname><given-names>F.</given-names></name></person-group> (<year>2020</year>). <article-title>Using the concept of game-based learning in education</article-title>. <source>Int. J. Emerg. Technol. Learn.</source> <volume>15</volume>, <fpage>53</fpage>&#x2013;<lpage>64</lpage>. doi: <pub-id pub-id-type="doi">10.3991/ijet.v15i14.14675</pub-id></citation></ref>
<ref id="ref32"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Lohr</surname><given-names>S. L.</given-names></name></person-group> (<year>2021</year>). <source>Sampling: Design and analysis</source>. New York: <publisher-name>Chapman and Hall/CRC</publisher-name>.</citation></ref>
<ref id="ref33"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Luan</surname><given-names>H.</given-names></name> <name><surname>Tsai</surname><given-names>C. C.</given-names></name></person-group> (<year>2021</year>). <article-title>A review of using machine learning approaches for precision education</article-title>. <source>Educ. Technol. Soc.</source> <volume>24</volume>, <fpage>250</fpage>&#x2013;<lpage>266</lpage>. <ext-link xlink:href="https://www.jstor.org/stable/26977871" ext-link-type="uri">https://www.jstor.org/stable/26977871</ext-link></citation></ref>
<ref id="ref34"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>MacCallum</surname><given-names>R. C.</given-names></name> <name><surname>Zhang</surname><given-names>S.</given-names></name> <name><surname>Preacher</surname><given-names>K. J.</given-names></name> <name><surname>Rucker</surname><given-names>D. D.</given-names></name></person-group> (<year>2002</year>). <article-title>On the practice of dichotomization of quantitative variables</article-title>. <source>Psychol. Methods</source> <volume>7</volume>, <fpage>19</fpage>&#x2013;<lpage>40</lpage>. doi: <pub-id pub-id-type="doi">10.1037/1082-989x.7.1.19</pub-id>, PMID: <pub-id pub-id-type="pmid">11928888</pub-id></citation></ref>
<ref id="ref503"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Martin</surname><given-names>M. O.</given-names></name> <name><surname>Mullis</surname><given-names>I. V. S.</given-names></name> <name><surname>Hooper</surname><given-names>M.</given-names></name></person-group> (Eds.). (<year>2016</year>). <article-title>Methods and procedures in TIMSS 2015</article-title>. <publisher-loc>Chestnut Hill, MA, U.S.A.</publisher-loc>: <publisher-name>TIMSS &#x0026; PIRLS International Study Center, Boston College</publisher-name>. Available at: <ext-link xlink:href="http://timssandpirls.bc.edu/publications/timss/2015-methods.html" ext-link-type="uri">http://timssandpirls.bc.edu/publications/timss/2015-methods.html</ext-link>, PMID: <pub-id pub-id-type="pmid">19968397</pub-id></citation></ref>
<ref id="ref35"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Meinck</surname><given-names>S.</given-names></name></person-group> (<year>2015</year>). <article-title>Computing sampling weights in large-scale assessments in education</article-title>. <source>Survey insights: Methods from the field, weighting: Practical issues and &#x2018;how to&#x2019; approach</source>. Available at: <ext-link xlink:href="https://surveyinsights.org/?p=5353" ext-link-type="uri">https://surveyinsights.org/?p=5353</ext-link></citation></ref>
<ref id="ref36"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Montoya</surname><given-names>A. K.</given-names></name></person-group> (<year>2023</year>). <article-title>Selecting a within- or between-subject design for mediation: validity, causality, and statistical power</article-title>. <source>Multivar. Behav. Res.</source> <volume>58</volume>, <fpage>616</fpage>&#x2013;<lpage>636</lpage>. doi: <pub-id pub-id-type="doi">10.1080/00273171.2022.2077287</pub-id>, PMID: <pub-id pub-id-type="pmid">35679239</pub-id></citation></ref>
<ref id="ref38"><citation citation-type="book"><person-group person-group-type="author"><collab id="coll1">Organization for Economic Co-operation and Development</collab></person-group> (<year>2014</year>). <source>PISA 2012 technical report. PISA</source>: <publisher-name>OECD Publishing</publisher-name>.</citation></ref>
<ref id="ref39"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pan</surname><given-names>Y.</given-names></name> <name><surname>Ke</surname><given-names>F.</given-names></name></person-group> (<year>2023</year>). <article-title>Effects of game-based learning supports on students&#x2019; math performance and perceived game flow</article-title>. <source>Educ. Technol. Res. Dev.</source> <volume>71</volume>, <fpage>459</fpage>&#x2013;<lpage>479</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s11423-022-10183-z</pub-id></citation></ref>
<ref id="ref40"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pirlott</surname><given-names>A. G.</given-names></name> <name><surname>MacKinnon</surname><given-names>D. P.</given-names></name></person-group> (<year>2016</year>). <article-title>Design approaches to experimental mediation</article-title>. <source>J. Exp. Soc. Psychol.</source> <volume>66</volume>, <fpage>29</fpage>&#x2013;<lpage>38</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.jesp.2015.09.012</pub-id>, PMID: <pub-id pub-id-type="pmid">27570259</pub-id></citation></ref>
<ref id="ref41"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Prastyo</surname><given-names>D. D.</given-names></name> <name><surname>Khoiri</surname><given-names>H. A.</given-names></name> <name><surname>Purnami</surname><given-names>S. W.</given-names></name> <name><surname>Suhartono</surname><given-names>F.</given-names></name> <name><surname>Fam</surname><given-names>S. F.</given-names></name> <name><surname>Suhermi</surname><given-names>N.</given-names></name></person-group> (<year>2020</year>). &#x201C;<article-title>Survival support vector machines: a simulation study and its health-related application</article-title>&#x201D; in <source>Supervised and unsupervised learning for data science. Unsupervised and Semi-Supervised Learning</source>. eds. <person-group person-group-type="editor"><name><surname>Berry</surname><given-names>M.</given-names></name> <name><surname>Mohamed</surname><given-names>A.</given-names></name> <name><surname>Yap</surname><given-names>B.</given-names></name></person-group> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer</publisher-name>).</citation></ref>
<ref id="ref42"><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Procci</surname><given-names>K.</given-names></name> <name><surname>James</surname><given-names>N.</given-names></name> <name><surname>Bowers</surname><given-names>C.</given-names></name></person-group> (<year>2013</year>). <article-title>The effects of gender, age, and experience on game engagement</article-title>. In <conf-name>Proceedings of the Human Factors and Ergonomics Society Annual Meeting</conf-name>. <fpage>2132</fpage>&#x2013;<lpage>2136</lpage>. <publisher-loc>Los Angeles, CA</publisher-loc>: <publisher-name>SAGE Publications</publisher-name>.</citation></ref>
<ref id="ref43"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pudjihartono</surname><given-names>N.</given-names></name> <name><surname>Fadason</surname><given-names>T.</given-names></name> <name><surname>Kempa-Liehr</surname><given-names>A. W.</given-names></name> <name><surname>O'Sullivan</surname><given-names>J. M.</given-names></name></person-group> (<year>2022</year>). <article-title>A review of feature selection methods for machine learning-based disease risk prediction</article-title>. <source>Front. Bioinform.</source> <volume>2</volume>:<fpage>927312</fpage>. doi: <pub-id pub-id-type="doi">10.3389/fbinf.2022.927312</pub-id>, PMID: <pub-id pub-id-type="pmid">36304293</pub-id></citation></ref>
<ref id="ref44"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Puolakanaho</surname><given-names>A.</given-names></name> <name><surname>Tolvanen</surname><given-names>A.</given-names></name> <name><surname>Kinnunen</surname><given-names>S. M.</given-names></name> <name><surname>Lappalainen</surname><given-names>R.</given-names></name></person-group> (<year>2020</year>). <article-title>A psychological flexibility-based intervention for burnout: a randomized controlled trial</article-title>. <source>J. Context. Behav. Sci.</source> <volume>15</volume>, <fpage>52</fpage>&#x2013;<lpage>67</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.jcbs.2019.11.007</pub-id></citation></ref>
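<ref id="ref45"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Rust</surname><given-names>K.</given-names></name></person-group> (<year>2014</year>). &#x201C;<article-title>Sampling, weighting, and variance estimation in international large scale assessments</article-title>&#x201D; in <source>Handbook of international large-scale assessment: background, technical issues, and methods of data analysis</source>. eds. <person-group person-group-type="editor"><name><surname>Rutkowski</surname><given-names>L.</given-names></name> <name><surname>von Davier</surname><given-names>M.</given-names></name> <name><surname>Rutkowski</surname><given-names>D.</given-names></name></person-group> (<publisher-loc>Boca Raton, FL, U.S.A.</publisher-loc>: <publisher-name>Chapman and Hall/CRC</publisher-name>), <fpage>117</fpage>&#x2013;<lpage>153</lpage>.</citation></ref>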
<ref id="ref46"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Song</surname><given-names>C.</given-names></name> <name><surname>Liu</surname><given-names>F.</given-names></name> <name><surname>Huang</surname><given-names>Y.</given-names></name> <name><surname>Wang</surname><given-names>L.</given-names></name> <name><surname>Tan</surname><given-names>T.</given-names></name></person-group> (<year>2013</year>). &#x201C;<article-title>Auto-encoder based data clustering</article-title>&#x201D; in <source>Progress in pattern recognition, image analysis, computer vision, and applications</source>. eds. <person-group person-group-type="editor"><name><surname>Ruiz-Shulcloper</surname><given-names>J.</given-names></name> <name><surname>Baja</surname><given-names>G. S.</given-names></name></person-group> (<publisher-loc>Berlin, Heidelberg</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>117</fpage>&#x2013;<lpage>124</lpage>.</citation></ref>
<ref id="ref47"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sumi</surname><given-names>K.</given-names></name> <name><surname>Sato</surname><given-names>S.</given-names></name></person-group> (<year>2022</year>). <article-title>Experiences of game-based learning and reviewing history of the experience using player&#x2019;s emotions</article-title>. <source>Front. Artif. Intell.</source> <volume>5</volume>:<fpage>874106</fpage>. doi: <pub-id pub-id-type="doi">10.3389/frai.2022.874106</pub-id>, PMID: <pub-id pub-id-type="pmid">35910190</pub-id></citation></ref>
<ref id="ref530"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vasileiou</surname><given-names>K.</given-names></name> <name><surname>Barnett</surname><given-names>J.</given-names></name> <name><surname>Thorpe</surname><given-names>S.</given-names></name> <name><surname>Young</surname><given-names>T.</given-names></name></person-group> (<year>2018</year>). <article-title>Characterising and justifying sample size sufficiency in interview-based studies: systematic analysis of qualitative health research over a 15-year period</article-title>. <source>BMC Med. Res. Methodol.</source> <volume>18</volume>:<fpage>148</fpage>. doi: <pub-id pub-id-type="doi">10.1186/s12874-018-0594-7</pub-id></citation></ref>
<ref id="ref48"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Verma</surname><given-names>K. K.</given-names></name> <name><surname>Singh</surname><given-names>B. M.</given-names></name> <name><surname>Dixit</surname><given-names>A.</given-names></name></person-group> (<year>2022</year>). <article-title>A review of supervised and unsupervised machine learning techniques for suspicious behavior recognition in intelligent surveillance system</article-title>. <source>Int. J. Inf. Technol.</source> <volume>14</volume>, <fpage>397</fpage>&#x2013;<lpage>410</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s41870-019-00364-0</pub-id></citation></ref>
<ref id="ref49"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xie</surname><given-names>X.</given-names></name> <name><surname>Zhang</surname><given-names>Z.</given-names></name> <name><surname>Chen</surname><given-names>T. Y.</given-names></name> <name><surname>Liu</surname><given-names>Y.</given-names></name> <name><surname>Poon</surname><given-names>P. L.</given-names></name> <name><surname>Xu</surname><given-names>B.</given-names></name></person-group> (<year>2020</year>). <article-title>METTLE: a metamorphic testing approach to assessing and validating unsupervised machine learning systems</article-title>. <source>IEEE Trans. Reliab.</source> <volume>69</volume>, <fpage>1293</fpage>&#x2013;<lpage>1322</lpage>. doi: <pub-id pub-id-type="doi">10.1109/TR.2020.2972266</pub-id></citation></ref>
<ref id="ref50"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yeung</surname><given-names>R. C.</given-names></name> <name><surname>Fernandes</surname><given-names>M. A.</given-names></name></person-group> (<year>2022</year>). <article-title>Machine learning to detect invalid text responses: validation and comparison to existing detection methods</article-title>. <source>Behav. Res. Methods</source> <volume>54</volume>, <fpage>3055</fpage>&#x2013;<lpage>3070</lpage>. doi: <pub-id pub-id-type="doi">10.3758/s13428-022-01801-y</pub-id>, PMID: <pub-id pub-id-type="pmid">35175566</pub-id></citation></ref>
<ref id="ref51"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yilmaz</surname><given-names>T. K.</given-names></name> <name><surname>Cagiltay</surname><given-names>K.</given-names></name></person-group> (<year>2016</year>). <article-title>Designing and developing game-like learning experience in virtual worlds: challenges and design decisions of novice instructional designers</article-title>. <source>Contemp. Educ. Technol.</source> <volume>7</volume>, <fpage>206</fpage>&#x2013;<lpage>222</lpage>. doi: <pub-id pub-id-type="doi">10.30935/cedtech/6173</pub-id></citation></ref>
<ref id="ref52"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname><given-names>C.</given-names></name> <name><surname>Liu</surname><given-names>C.</given-names></name> <name><surname>Zhang</surname><given-names>X.</given-names></name> <name><surname>Almpanidis</surname><given-names>G.</given-names></name></person-group> (<year>2017</year>). <article-title>An up-to-date comparison of state-of-the-art classification algorithms</article-title>. <source>Expert Syst. Appl.</source> <volume>82</volume>, <fpage>128</fpage>&#x2013;<lpage>150</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.eswa.2017.04.003</pub-id></citation></ref>
<ref id="ref53"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zimmermann</surname><given-names>A.</given-names></name></person-group> (<year>2020</year>). <article-title>Method evaluation, parameterization, and result validation in unsupervised data mining: a critical survey</article-title>. <source>Wiley Interdiscip. Rev. Data Min. Knowl. Discov.</source> <volume>10</volume>:<fpage>e1330</fpage>. doi: <pub-id pub-id-type="doi">10.1002/widm.1330</pub-id></citation></ref>
<ref id="ref54"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zotou</surname><given-names>M.</given-names></name> <name><surname>Tambouris</surname><given-names>E.</given-names></name> <name><surname>Tarabanis</surname><given-names>K.</given-names></name></person-group> (<year>2020</year>). <article-title>Data-driven problem based learning: enhancing problem based learning with learning analytics</article-title>. <source>Educ. Technol. Res. Dev.</source> <volume>68</volume>, <fpage>3393</fpage>&#x2013;<lpage>3424</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s11423-020-09828-8</pub-id></citation></ref>
</ref-list>
</back>
</article>