<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Educ.</journal-id>
<journal-title>Frontiers in Education</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Educ.</abbrev-journal-title>
<issn pub-type="epub">2504-284X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/feduc.2024.1389165</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Education</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Detecting differential item functioning in presence of multilevel data: do methods accounting for multilevel data structure make a DIFference?</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Svetina Valdivia</surname> <given-names>Dubravka</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>&#x002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/314202/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/conceptualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/data-curation/"/>
<role content-type="https://credit.niso.org/contributor-roles/formal-analysis/"/>
<role content-type="https://credit.niso.org/contributor-roles/investigation/"/>
<role content-type="https://credit.niso.org/contributor-roles/methodology/"/>
<role content-type="https://credit.niso.org/contributor-roles/software/"/>
<role content-type="https://credit.niso.org/contributor-roles/supervision/"/>
<role content-type="https://credit.niso.org/contributor-roles/validation/"/>
<role content-type="https://credit.niso.org/contributor-roles/visualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Huang</surname> <given-names>Sijia</given-names></name>
<role content-type="https://credit.niso.org/contributor-roles/conceptualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/formal-analysis/"/>
<role content-type="https://credit.niso.org/contributor-roles/methodology/"/>
<role content-type="https://credit.niso.org/contributor-roles/software/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Botter</surname> <given-names>Preston</given-names></name>
<role content-type="https://credit.niso.org/contributor-roles/formal-analysis/"/>
<role content-type="https://credit.niso.org/contributor-roles/resources/"/>
<role content-type="https://credit.niso.org/contributor-roles/software/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
</contrib-group>
<aff><institution>Department of Counseling and Educational Psychology, Indiana University</institution>, <addr-line>Bloomington, IN</addr-line>, <country>United States</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Xinya Liang, University of Arkansas, United States</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Alexander Robitzsch, IPN&#x2013;Leibniz Institute for Science and Mathematics Education, Germany</p><p>Yong Luo, Pearson, United States</p></fn>
<corresp id="c001">&#x002A;Correspondence: Dubravka Svetina Valdivia, <email>dsvetina@indiana.edu</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>29</day>
<month>04</month>
<year>2024</year>
</pub-date>
<pub-date pub-type="collection">
<year>2024</year>
</pub-date>
<volume>9</volume>
<elocation-id>1389165</elocation-id>
<history>
<date date-type="received">
<day>21</day>
<month>02</month>
<year>2024</year>
</date>
<date date-type="accepted">
<day>02</day>
<month>04</month>
<year>2024</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x00A9; 2024 Svetina Valdivia, Huang and Botter.</copyright-statement>
<copyright-year>2024</copyright-year>
<copyright-holder>Svetina Valdivia, Huang and Botter</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>Assessment practices are, among other things, concerned with issues of fairness and appropriate score interpretation, in particular when claims about subgroup differences in performance are of interest. In order to make such claims, measurement invariance, or the absence of differential item functioning (DIF), ought to be considered and met. Over the last decades, researchers have proposed and developed a plethora of methods aimed at detecting DIF. However, DIF detection methods that allow multilevel data structures to be modeled are limited and understudied. In the current study, we evaluated the performance of four methods: the model-based multilevel Wald and the score-based multilevel Mantel&#x2013;Haenszel (MH), and two well-established single-level methods, the model-based single-level Lord and the score-based single-level MH. We conducted a simulation study that mimics real-world scenarios. Our results suggested that when data were generated as multilevel, performance was mixed, and no single method consistently outperformed the others. Single-level Lord and multilevel Wald yielded the best control of the Type I error rates, in particular in conditions when latent means were generated as equal for the two groups. Power rates were low across all four methods in conditions with a small number of between- and within-level units and when small DIF was modeled. However, in those conditions, single-level MH and multilevel MH yielded higher power rates than either single-level Lord or multilevel Wald. This suggests that current practices in detecting DIF should strongly consider adopting one of the more recent methods only in certain contexts, as the tradeoff between power and complexity of the method may not warrant a blanket recommendation in favor of a single method. Limitations and future research directions are also discussed.</p>
</abstract>
<kwd-group>
<kwd>differential item functioning (DIF)</kwd>
<kwd>measurement invariance</kwd>
<kwd>multilevel data</kwd>
<kwd>fairness</kwd>
<kwd>simulation study</kwd>
</kwd-group>
<counts>
<fig-count count="2"/>
<table-count count="3"/>
<equation-count count="9"/>
<ref-count count="64"/>
<page-count count="13"/>
<word-count count="10597"/>
</counts>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Assessment, Testing and Applied Measurement</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec id="S1" sec-type="intro">
<title>Introduction</title>
<p>Educational and psychological assessment practices are, among other things, concerned with fairness and appropriate score interpretations. For example, data from international large-scale assessments (ILSAs), such as the Programme for International Student Assessment (PISA) or the Trends in International Mathematics and Science Study (TIMSS), are used to report on student academic performance across dozens of participating countries and educational systems and provide, in large part, the basis for educational reforms in those countries and systems. Further, the constructs measured on ILSAs need not be only cognitive in nature. PISA and TIMSS, in addition to measuring achievement in mathematics or science, also serve as a fruitful basis from which to derive measures of affective and motivational domains, such as how students feel about school or learning (e.g., <xref ref-type="bibr" rid="B50">Ozel et al., 2013</xref>; <xref ref-type="bibr" rid="B58">Segeritz and Pant, 2013</xref>; <xref ref-type="bibr" rid="B42">Marsh et al., 2015</xref>).</p>
<p>Similarly, the Teaching and Learning International Survey (TALIS) measures and compares teachers&#x2019; attitudes, perceptions, and experiences related to education. Outside of education, studies of psychological constructs across cultures abound, including social axioms (e.g., <xref ref-type="bibr" rid="B7">Bou Malham and Saucier, 2014</xref>), physical self-perception (e.g., <xref ref-type="bibr" rid="B28">Hagger et al., 2003</xref>), cognitive emotional regulation (e.g., <xref ref-type="bibr" rid="B45">Megreya et al., 2016</xref>), and identity processing styles during cultural transition (e.g., <xref ref-type="bibr" rid="B62">Szabo et al., 2016</xref>). Regardless of the context, scores that represent the underlying constructs of interest on surveys and assessments are often summarized in terms of total scores or model-based scale scores (<xref ref-type="bibr" rid="B49">Olson et al., 2008</xref>; <xref ref-type="bibr" rid="B19">Economic Co-operation and Development, 2010</xref>), which are then compared across groups.</p>
<p>Across the aforementioned examples, and for many others found in the social sciences, an important precursor to making meaningful comparisons across groups on scale scores is the establishment of measurement invariance (MI). Namely, this criterion states that a construct ought to be understood and measured equivalently across groups of interest (<xref ref-type="bibr" rid="B46">Meredith, 1993</xref>). In practice, a lack of MI has long been considered a threat to the validity of score interpretations and uses. Researchers often adopt multiple-group confirmatory factor analysis (MG-CFA; <xref ref-type="bibr" rid="B35">J&#x00F6;reskog, 1971</xref>) to examine whether the structure of an assessment is the same across groups.</p>
<p>At the item level, MI indicates the absence of differential item functioning (DIF; <xref ref-type="bibr" rid="B31">Holland and Wainer, 2012</xref>). An item is said to be a <italic>DIF item</italic> when it exhibits different psychometric properties for individuals with similar proficiencies who belong to different groups (e.g., Croatian students vs. German students on ILSAs). DIF is categorized as <italic>uniform</italic> when the relationship between group membership and the response to an item is constant across all levels of the matching proficiency (i.e., no interaction between group membership and ability), while <italic>non-uniform</italic> DIF is present when such an interaction exists. When two groups are considered, the group of interest in the analysis is referred to as the <italic>focal</italic> group, while the group to which the focal group is compared is known as the <italic>reference</italic> group.<sup><xref ref-type="fn" rid="footnote1">1</xref></sup> The importance of identifying DIF items in educational measurement and assessment, as well as in the broadly defined social and psychological sciences, has been well established (e.g., <xref ref-type="bibr" rid="B39">Magis et al., 2010</xref>; <xref ref-type="bibr" rid="B26">Gao, 2019</xref>). Over the last few decades, a number of methods have been proposed to detect DIF items more accurately and thus aid in the measurement or test development process, such as the score-based Mantel&#x2013;Haenszel procedure (score-based MH; <xref ref-type="bibr" rid="B41">Mantel and Haenszel, 1959</xref>; <xref ref-type="bibr" rid="B30">Holland and Thayer, 1988</xref>; <xref ref-type="bibr" rid="B48">Narayanon and Swaminathan, 1996</xref>) and the model-based Lord&#x2019;s Wald <italic>&#x03C7;</italic><sup>2</sup> test (<xref ref-type="bibr" rid="B38">Lord, 1980</xref>).
If an item is determined to be differentially functioning (that is, it is flagged as a DIF item), test developers can choose whether to revise or remove it depending on what the sources of DIF are determined to be.</p>
<p><xref ref-type="bibr" rid="B39">Magis et al. (2010)</xref> presented a useful framework for researchers to select and employ a DIF detection method in their analysis when data are scored dichotomously (e.g., 0 as incorrect/disagree; 1 as correct/agree). Specifically, the authors organized DIF detection methods along four main dimensions that the methods are able to accommodate: (a) number of focal groups (two vs. &#x003E; 2), (b) methodological approach in creating a matching variable (Item Response Theory-based vs. Classical Test Theory-based), (c) DIF type (uniform vs. non-uniform), and (d) item purification (considered). These four dimensions highlight important considerations that a researcher ought to engage with when conducting DIF analysis, with the ultimate aim of making appropriate and valid claims. Alongside the proposed framework, Magis and his colleagues developed an <italic>R</italic> package called <italic>difR</italic> (<xref ref-type="bibr" rid="B39">Magis et al., 2010</xref>), which includes a collection of standard DIF detection methods for dichotomous items. As described below, in the current study, we utilized two DIF detection methods implemented in <italic>difR</italic>, specifically the single-level MH method (<xref ref-type="bibr" rid="B41">Mantel and Haenszel, 1959</xref>) and Lord&#x2019;s Wald &#x03C7;<sup>2</sup> statistic (<xref ref-type="bibr" rid="B38">Lord, 1980</xref>). One aspect that is missing from the <xref ref-type="bibr" rid="B39">Magis et al. (2010)</xref> framework but should be considered when choosing DIF detection methods is the nested nature of (many) data.</p>
<p>While the importance of modeling nested data as such has been recognized for decades (e.g., <xref ref-type="bibr" rid="B57">Rubin, 1981</xref>; <xref ref-type="bibr" rid="B53">Peugh, 2010</xref>), and multilevel models have been used in applied research, little research has been devoted to identifying DIF items in the ubiquitous multilevel data structures. When such nesting occurs (e.g., students on ILSAs are nested within their respective countries or educational systems), it is often ignored and conventional single-level DIF detection methods are applied. To our knowledge, there exist only a handful of DIF detection methods that account for multilevel structures, including the score-based multilevel MH and the model-based multilevel Wald (<xref ref-type="bibr" rid="B33">Jin et al., 2014</xref>; <xref ref-type="bibr" rid="B23">French and Finch, 2015</xref>; <xref ref-type="bibr" rid="B25">French et al., 2016</xref>, <xref ref-type="bibr" rid="B24">2019</xref>; <xref ref-type="bibr" rid="B32">Huang and Valdivia, 2023</xref>). Additionally, a comprehensive comparison of these methods and single-level methods in the context of multilevel data is lacking. Part of the lack of research is due to the accessibility of the methods; namely, DIF detection methods that incorporate the nature of nested data have only recently become (easily) accessible to researchers (<xref ref-type="bibr" rid="B21">French and Finch, 2010</xref>, <xref ref-type="bibr" rid="B22">2013</xref>; <xref ref-type="bibr" rid="B24">French et al., 2019</xref>; <xref ref-type="bibr" rid="B32">Huang and Valdivia, 2023</xref>). Thus, the main research aim of the current study is to examine and compare the performance of single-level and multilevel methods of detecting DIF when data have more than one level. Through a comparison of four popular DIF detection methods, this research aims to extend the <xref ref-type="bibr" rid="B39">Magis et al. (2010)</xref> framework and provide guidance for applied researchers engaged in DIF analysis when data are nested. This study helps address questions that practitioners may ask, such as &#x201C;do single-level DIF detection methods perform sufficiently well when data are multilevel?&#x201D; This might be especially important because multilevel DIF methods are inherently harder to understand than their single-level counterparts. The current study focuses on dichotomous items and uniform DIF. Specifically, the performance of the single-level and multilevel versions of the score-based MH and model-based Lord/Wald procedures is studied.</p>
<p>The remainder of our paper is organized as follows. The next section discusses the four studied methods and research related to their performance in detecting DIF. Next, we describe the study design utilized to address the main research aim, including our justifications of choices for the manipulated factors and levels, as well as the outcome variables to evaluate the methods&#x2019; performance. A description of the planned analyses is included to guide the interpretation of results as well. Next, we report results as they pertain to the main research aim. Lastly, we discuss the findings and implications for future research, in addition to acknowledgment of the limitations.</p>
<sec id="S1.SS1">
<title>1.1 Single-level and multilevel DIF detection methods</title>
<p>As suggested above, a myriad of methods have been developed to investigate DIF, but only a few allow multilevel data structures to be modeled directly in the DIF detection. We briefly describe each of the four methods utilized in the current study, namely the single-level MH and Lord methods and the multilevel MH and Wald methods. We cite the original sources of the proposed methods, which provide more detailed specifications.</p>
<sec id="S1.SS1.SSS1">
<title>1.1.1 Single-level MH</title>
<p>The MH procedure is a score-based method that flags possible DIF items by testing whether an association exists between group membership and item responses, conditional on sum scores. This is done by testing the null hypothesis of no DIF using a 2 &#x00D7; 2 contingency table at each sum score when the number of groups is two (e.g., <xref ref-type="table" rid="T1">Table 1</xref>).</p>
<table-wrap position="float" id="T1">
<label>TABLE 1</label>
<caption><p>2 &#x00D7; 2 contingency table for sum scores across the reference and focal groups.</p></caption>
<table cellspacing="5" cellpadding="5" frame="box" rules="all">
<thead>
<tr>
<td valign="top" align="left" style="color:#ffffff;background-color: #7f8080;"></td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Correct</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Incorrect</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Row total</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Reference group (R)</td>
<td valign="top" align="center"><italic>A<sub>s</sub></italic></td>
<td valign="top" align="center"><italic>B<sub>s</sub></italic></td>
<td valign="top" align="center"><italic>n</italic><sub><italic>Rs</italic></sub> = <italic>A</italic><sub><italic>s</italic></sub> + <italic>B</italic><sub><italic>s</italic></sub></td>
</tr>
<tr>
<td valign="top" align="left">Focal group (F)</td>
<td valign="top" align="center"><italic>C<sub>s</sub></italic></td>
<td valign="top" align="center"><italic>D<sub>s</sub></italic></td>
<td valign="top" align="center"><italic>n</italic><sub><italic>Fs</italic></sub> = <italic>C</italic><sub><italic>s</italic></sub> + <italic>D</italic><sub><italic>s</italic></sub></td>
</tr>
<tr>
<td valign="top" align="left">Column total</td>
<td valign="top" align="center"><italic>m</italic><sub>1<italic>s</italic></sub> = <italic>A</italic><sub><italic>s</italic></sub> + <italic>C</italic><sub><italic>s</italic></sub></td>
<td valign="top" align="center"><italic>m</italic><sub>0<italic>s</italic></sub> = <italic>B</italic><sub><italic>s</italic></sub> + <italic>D</italic><sub><italic>s</italic></sub></td>
<td valign="top" align="center"><italic>T</italic><sub><italic>s</italic></sub> = <italic>A</italic><sub><italic>s</italic></sub> + <italic>B</italic><sub><italic>s</italic></sub> + <italic>C</italic><sub><italic>s</italic></sub> + <italic>D</italic><sub><italic>s</italic></sub></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn><p>Table contains counts, row margins, and column margins relevant to <xref ref-type="disp-formula" rid="E1">Eqs 1</xref>&#x2013;<xref ref-type="disp-formula" rid="E3">3</xref>. For example, <italic>A<sub>s</sub></italic> is the number of correct responses to a particular item in the reference group among respondents with sum score <italic>s</italic>, while <italic>D</italic><sub><italic>s</italic></sub> is the number of incorrect responses to that item in the focal group at the same sum score. Inclusion of this table was motivated by similar tables used to illustrate the MH procedure, such as the table used by <xref ref-type="bibr" rid="B56">Roussos et al. (1999)</xref>.</p></fn>
</table-wrap-foot>
</table-wrap>
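<p>As a minimal sketch of how the counts in <xref ref-type="table" rid="T1">Table 1</xref> could be tabulated, the Python snippet below builds one 2 &#x00D7; 2 table per observed sum score for a single dichotomous item. The function name <italic>stratified_counts</italic>, the group labels "R" and "F", and the toy data are assumptions made for illustration only; they are not taken from the study or from any software package.</p>

```python
from collections import defaultdict


def stratified_counts(item_responses, groups, sum_scores):
    """Return {s: [A_s, B_s, C_s, D_s]} for one dichotomous item.

    Assumed coding (hypothetical): responses are 0/1, and groups are
    labeled "R" (reference) or "F" (focal).
    """
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for y, g, s in zip(item_responses, groups, sum_scores):
        # Cell order follows Table 1:
        # A (R, correct), B (R, incorrect), C (F, correct), D (F, incorrect).
        cell = (0 if g == "R" else 2) + (0 if y == 1 else 1)
        tables[s][cell] += 1
    return dict(tables)


# Toy data for six respondents (illustrative values only).
item = [1, 0, 1, 1, 0, 1]
group = ["R", "R", "F", "F", "R", "F"]
score = [3, 3, 3, 3, 5, 5]

tables = stratified_counts(item, group, score)
for s in sorted(tables):
    print(s, tables[s])
```

<p>Each entry of the returned dictionary corresponds to one stratum, so the row and column margins of Table 1 can be obtained by summing the appropriate cells within a stratum.</p>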
<p>Letting <italic>s</italic> = 1, 2,&#x2026;,S denote the unique sum scores observed in the sample, the MH hypothesis of no DIF for an item is tested using the MH &#x03C7;<sup>2</sup> statistic:</p>
<disp-formula id="E1">
<label>(1)</label>
<mml:math id="M1">
<mml:mfrac>
<mml:msup>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="normal">&#x03A3;</mml:mi>
<mml:mrow>
<mml:mpadded width="+3.3pt">
<mml:mi>s</mml:mi>
</mml:mpadded>
<mml:mo rspace="5.8pt">=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>S</mml:mi>
</mml:msubsup>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>A</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
<mml:mo>-</mml:mo>
<mml:mrow>
<mml:mi>E</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi>A</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:mo>-</mml:mo>
<mml:mn>0.5</mml:mn>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="normal">&#x03A3;</mml:mi>
<mml:mrow>
<mml:mpadded width="+3.3pt">
<mml:mi>s</mml:mi>
</mml:mpadded>
<mml:mo rspace="5.8pt">=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>S</mml:mi>
</mml:msubsup>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>V</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>a</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi>A</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:math>
</disp-formula>
<p>where,</p>
<p><italic>A</italic><sub><italic>s</italic></sub> = the number of correct responses to a particular item in the reference group among respondents with sum score <italic>s</italic>,</p>
<disp-formula id="E2">
<label>(2)</label>
<mml:math id="M2">
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mi>E</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi>A</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
<mml:mo rspace="5.8pt">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo rspace="5.8pt">=</mml:mo>
<mml:mpadded width="+3.3pt">
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mi>m</mml:mi>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:msub>
<mml:mi>n</mml:mi>
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
</mml:mfrac>
</mml:mpadded>
<mml:mo rspace="5.8pt">=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>A</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x22C5;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>A</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>B</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>A</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>B</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>D</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="E3">
<label>(3)</label>
<mml:math id="M3">
<mml:mrow>
<mml:mrow>
<mml:mi>V</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>a</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi>A</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
<mml:mo rspace="5.8pt">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo rspace="5.8pt">=</mml:mo>
<mml:mpadded width="+3.3pt">
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mi>n</mml:mi>
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:msub>
<mml:mi>n</mml:mi>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:msub>
<mml:mi>m</mml:mi>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:msub>
<mml:mi>m</mml:mi>
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msubsup>
<mml:mi>T</mml:mi>
<mml:mi>s</mml:mi>
<mml:mn>2</mml:mn>
</mml:msubsup>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mpadded>
<mml:mo>=</mml:mo>
<mml:mi/>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mi> </mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>A</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>B</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x22C5;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>D</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x22C5;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>A</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x22C5;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>B</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>D</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>A</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>B</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>D</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mo>&#x22C5;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>A</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>B</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>D</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>and 0.5 is the Yates correction for continuity (<xref ref-type="bibr" rid="B64">Yates, 1934</xref>). Under the null hypothesis of no uniform DIF (<xref ref-type="bibr" rid="B24">French et al., 2019</xref>), the resulting &#x03C7;<sup>2</sup> statistic is approximately chi-square distributed with one degree of freedom, based on the assumption of a conditional binomial distribution for the events <italic>A<sub>s</sub></italic> (<xref ref-type="bibr" rid="B5">Bock and Gibbons, 2021</xref>).</p>
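<p>To make <xref ref-type="disp-formula" rid="E1">Eqs 1</xref>&#x2013;<xref ref-type="disp-formula" rid="E3">3</xref> concrete, the following Python sketch computes the MH &#x03C7;<sup>2</sup> statistic from a list of per-stratum counts (<italic>A<sub>s</sub></italic>, <italic>B<sub>s</sub></italic>, <italic>C<sub>s</sub></italic>, <italic>D<sub>s</sub></italic>). The function name <italic>mh_chi_square</italic> and the example counts are hypothetical. Skipping strata with fewer than two respondents, where Var(<italic>A<sub>s</sub></italic>) is undefined, is a choice made for this sketch and not something specified by the study.</p>

```python
def mh_chi_square(tables):
    """MH chi-square (Eq. 1) from an iterable of (A_s, B_s, C_s, D_s)."""
    sum_dev = 0.0  # running sum of A_s - E(A_s)
    sum_var = 0.0  # running sum of Var(A_s)
    for a, b, c, d in tables:
        n_r, n_f = a + b, c + d  # row totals (reference, focal)
        m1, m0 = a + c, b + d    # column totals (correct, incorrect)
        t = n_r + n_f            # stratum total T_s
        if t > 1:                # Var(A_s) needs T_s - 1 in the denominator
            expected = m1 * n_r / t                               # Eq. 2
            variance = n_r * n_f * m1 * m0 / (t ** 2 * (t - 1))   # Eq. 3
            sum_dev += a - expected
            sum_var += variance
    # 0.5 is the continuity correction from Eq. 1.
    return (abs(sum_dev) - 0.5) ** 2 / sum_var


# Two hypothetical strata, given as (A_s, B_s, C_s, D_s).
example_tables = [(10, 10, 5, 15), (20, 5, 15, 10)]
chi2 = mh_chi_square(example_tables)
print(round(chi2, 3))
```

<p>For these illustrative counts the statistic is about 3.98, which exceeds the 0.05 critical value of 3.84 for one degree of freedom, so the item would be flagged for uniform DIF under this sketch.</p>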
<p>The MH procedure remains a popular DIF detection method, in part because of its simplicity and in part because of its performance. For example, a meta-analysis (<xref ref-type="bibr" rid="B27">Guilera et al., 2013</xref>) of the MH procedure found that it displayed adequate statistical power and Type I error rates across a total of 3,774 conditions, especially when the sample size was between 500 and 2,000. Further, a recent review (<xref ref-type="bibr" rid="B4">Berr&#x00ED;o et al., 2020</xref>) of current trends in DIF detection research found the MH to be the most studied DIF detection method. Some research has also shown that the single-level MH holds promise for detecting DIF in ILSAs (<xref ref-type="bibr" rid="B61">Svetina and Rutkowski, 2014</xref>) as well as in multidimensional contexts (e.g., <xref ref-type="bibr" rid="B37">Liu, 2024</xref>). Lastly, we included the single-level MH because a multilevel variant of it (<xref ref-type="bibr" rid="B22">French and Finch, 2013</xref>) has been proposed. We note that several extensions to the single-level MH procedure exist, such as the generalized MH for polytomous data (<xref ref-type="bibr" rid="B52">Penfield, 2001</xref>), though such extensions are beyond the scope of this study.</p>
</sec>
<sec id="S1.SS1.SSS2">
<title>1.1.2 Single-level Lord</title>
<p>The first model-based DIF detection method we consider is Lord&#x2019;s (single-level) Wald &#x03C7;<sup>2</sup> test, which flags DIF items by comparing item parameter estimates between groups (<xref ref-type="bibr" rid="B5">Bock and Gibbons, 2021</xref>). The idea behind this method is that if trace lines differ meaningfully between groups, DIF is said to be present, as trace lines are a function of item parameters. The present study considers only trace lines parameterized by the two-parameter logistic (2PL) model outlined below.</p>
<p>The 2PL IRT model specifies the probability that respondent <italic>j</italic> (<italic>j</italic> = 1, &#x2026;, <italic>J</italic>) correctly answers or endorses item <italic>i</italic> (<italic>i</italic> = 1, &#x2026;, <italic>I</italic>) as presented in <xref ref-type="disp-formula" rid="E4">Eq. 4</xref>:</p>
<disp-formula id="E4">
<label>(4)</label>
<mml:math id="M5">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mpadded width="+3.3pt">
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mpadded>
<mml:mo rspace="5.8pt">=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B8;</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo rspace="5.8pt">)</mml:mo>
</mml:mrow>
<mml:mo rspace="5.8pt">=</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mi>exp</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mo>-</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B1;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B8;</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B2;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where &#x03B8;<sub><italic>j</italic></sub> represents a respondent&#x2019;s latent variable (e.g., student proficiency or motivation latent score), and &#x03B1;<sub><italic>i</italic></sub> and &#x03B2;<sub><italic>i</italic></sub> are item <italic>i</italic>&#x2019;s discrimination and location/difficulty parameters, respectively. Formally, Lord&#x2019;s Wald &#x03C7;<sup>2</sup> test evaluates the null hypothesis that item parameters do not differ between the focal and reference groups. Specifically, the &#x03C7;<sup>2</sup> statistic for each item is computed as shown in <xref ref-type="disp-formula" rid="E5">Eq. 5</xref>:</p>
<disp-formula id="E5">
<label>(5)</label>
<mml:math id="M6">
<mml:mrow>
<mml:mrow>
<mml:mpadded width="+3.3pt">
<mml:msubsup>
<mml:mi mathvariant="normal">&#x03C7;</mml:mi>
<mml:mi>i</mml:mi>
<mml:mn>2</mml:mn>
</mml:msubsup>
</mml:mpadded>
<mml:mo rspace="5.8pt">=</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mi>v</mml:mi>
<mml:mi>i</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:msubsup>
<mml:mo>&#x2062;</mml:mo>
<mml:msubsup>
<mml:mi mathvariant="normal">&#x03A3;</mml:mi>
<mml:mi>i</mml:mi>
<mml:mrow>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x2062;</mml:mo>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mrow>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where <inline-formula><mml:math id="INEQ19"><mml:mrow><mml:mpadded width="+3.3pt"><mml:msub><mml:mi>v</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mpadded><mml:mo rspace="5.8pt">=</mml:mo><mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mover accent="true"><mml:mi mathvariant="normal">&#x03B1;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mi>F</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo rspace="7.5pt">-</mml:mo><mml:msub><mml:mover accent="true"><mml:mi mathvariant="normal">&#x03B1;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mi>R</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo rspace="7.5pt">,</mml:mo><mml:msub><mml:mover accent="true"><mml:mi mathvariant="normal">&#x03B2;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mi>F</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo rspace="7.5pt">-</mml:mo><mml:msub><mml:mover accent="true"><mml:mi mathvariant="normal">&#x03B2;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mi>R</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:math></inline-formula> is a vector containing the differences between the focal and reference groups&#x2019; parameter estimates and &#x03A3;<sub><italic>i</italic></sub> is the estimated covariance matrix of these differences. The degrees of freedom associated with each <inline-formula><mml:math id="INEQ21"><mml:msubsup><mml:mi mathvariant="normal">&#x03C7;</mml:mi><mml:mi>i</mml:mi><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> equal the number of item parameters (per item) compared between the reference and focal groups, which is two when the items are modeled via the 2PL model. For an item to be flagged as displaying DIF, its &#x03C7;<sup>2</sup> statistic must be statistically significant; <italic>p</italic> &#x003C; 0.05 is typically used as the flagging criterion.</p>
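<p>As a rough illustration of Eq. 5, the following Python sketch computes Lord&#x2019;s &#x03C7;<sup>2</sup> for a single 2PL item. It is not the <italic>difR</italic> implementation. The function name <monospace>lord_chi2</monospace> and all input values are hypothetical, and the sketch assumes that the groups are independent (so that &#x03A3;<sub><italic>i</italic></sub> is the sum of the two groups&#x2019; sampling covariance matrices) and that the estimates are already on a common metric.</p>
<preformat><![CDATA[
```python
import math


def lord_chi2(est_focal, est_ref, cov_focal, cov_ref):
    """Wald chi-square (Eq. 5) for one 2PL item.

    est_*: (alpha_hat, beta_hat) for each group, on a common metric.
    cov_*: 2x2 sampling covariance matrices of those estimates.
    """
    # v_i: focal-minus-reference differences in (alpha, beta)
    v = [est_focal[0] - est_ref[0], est_focal[1] - est_ref[1]]
    # Sigma_i: covariance of the differences, assuming independent groups
    s = [[cov_focal[r][c] + cov_ref[r][c] for c in range(2)] for r in range(2)]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    chi2 = sum(v[r] * inv[r][c] * v[c] for r in range(2) for c in range(2))
    # With 2 degrees of freedom, the chi-square upper-tail probability is exp(-x / 2)
    p_value = math.exp(-chi2 / 2.0)
    return chi2, p_value


# Hypothetical estimates and covariance matrices for one item
chi2, p = lord_chi2(
    est_focal=(1.2, 0.5),
    est_ref=(1.0, 0.0),
    cov_focal=[[0.02, 0.0], [0.0, 0.01]],
    cov_ref=[[0.02, 0.0], [0.0, 0.015]],
)
flagged = p < 0.05  # flagging criterion described in the text
```
]]></preformat>
<p>The closed-form tail probability holds only for two degrees of freedom, which matches the 2PL case described above; other models would need a general &#x03C7;<sup>2</sup> distribution function.</p>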
<p>Many variants of Lord&#x2019;s &#x03C7;<sup>2</sup> DIF detection procedure exist. Two important ways in which implementations differ are (a) how each group&#x2019;s scale is placed on the same metric, and (b) whether some sort of item purification procedure is used. The present study places the reference and focal groups on the same metric using equal means anchoring (<xref ref-type="bibr" rid="B17">Cook and Eignor, 1991</xref>) and purifies items using an iterative method described by <xref ref-type="bibr" rid="B14">Candell and Drasgow (1988)</xref>. Alternatives to (a) include multiple group IRT (<xref ref-type="bibr" rid="B6">Bock and Zimowski, 1997</xref>) and alternatives to (b) include the Wald-1 (<xref ref-type="bibr" rid="B13">Cai et al., 2011</xref>) and Wald-2 (<xref ref-type="bibr" rid="B36">Langer, 2008</xref>) variants. Some of these updates to <xref ref-type="bibr" rid="B38">Lord&#x2019;s (1980)</xref> original formulation are discussed in a subsequent section. We decided to use the <italic>difR</italic> implementation of Lord&#x2019;s &#x03C7;<sup>2</sup>, which is closer to the original procedure than newer variants, mainly because newer implementations require knowledge of more specialized IRT software and because we wished to study methods that vary in accessibility and complexity.</p>
</sec>
<sec id="S1.SS1.SSS3">
<title>1.1.3 Multilevel MH</title>
<p>Motivated by the ubiquity of multilevel data structures in educational assessment, <xref ref-type="bibr" rid="B22">French and Finch (2013)</xref> proposed several extensions to the standard (single-level) MH procedure. The method employed in the current study is an extension based on work by <xref ref-type="bibr" rid="B2">Begg (1999)</xref>, which adjusted the MH statistic described above by dividing it by the ratio of two score test statistic variances. The two score statistic variances are obtained for each item using the following logistic regression model as presented in <xref ref-type="disp-formula" rid="E6">Eq. 6</xref>:</p>
<disp-formula id="E6">
<label>(6)</label>
<mml:math id="M7">
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mi>log</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mfrac>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>-</mml:mo>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:mo rspace="5.8pt">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo rspace="5.8pt">=</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B2;</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B2;</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B2;</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:msub>
<mml:mi>Y</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where,</p>
<p><italic>P</italic><sub><italic>ij</italic></sub> = the probability of a correct response to item <italic>i</italic> by student <italic>j</italic>,</p>
<p>&#x03B2;<sub>0</sub> = the intercept,</p>
<p><italic>X<sub>j</sub></italic> = group membership for student <italic>j</italic>,</p>
<p><italic>Y<sub>j</sub></italic> = sum score for student <italic>j</italic>,</p>
<p>&#x03B2;<sub>1</sub> = coefficient corresponding to the (dummy-coded) group variable,</p>
<p>&#x03B2;<sub>2</sub> = coefficient corresponding to the sum score.</p>
<p>More specifically, the model is fit to each item twice, using different estimation methods, to obtain both the na&#x00EF;ve score statistic variance <inline-formula><mml:math id="INEQ28"><mml:msubsup><mml:mi mathvariant="normal">&#x03C3;</mml:mi><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>v</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>e</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> and a modified score statistic variance that accounts for the multilevel nature of the data<sup><xref ref-type="fn" rid="footnote2">2</xref></sup> <inline-formula><mml:math id="INEQ29"><mml:msubsup><mml:mi mathvariant="normal">&#x03C3;</mml:mi><mml:mrow><mml:mi>G</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>E</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>E</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula>. Once obtained, these two variances are then used to calculate the <italic>f</italic> ratio:</p>
<disp-formula id="E7">
<label>(7)</label>
<mml:math id="M8">
<mml:mrow>
<mml:mpadded width="+3.3pt">
<mml:mi>f</mml:mi>
</mml:mpadded>
<mml:mo rspace="5.8pt">=</mml:mo>
<mml:mfrac>
<mml:msubsup>
<mml:mi mathvariant="normal">&#x03C3;</mml:mi>
<mml:mrow>
<mml:mi>G</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>E</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>E</mml:mi>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msubsup>
<mml:msubsup>
<mml:mi mathvariant="normal">&#x03C3;</mml:mi>
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>a</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>e</mml:mi>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msubsup>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
<p>and subsequently the adjusted MH statistic as shown in <xref ref-type="disp-formula" rid="E7">Eqs. 7</xref>, <xref ref-type="disp-formula" rid="E8">8</xref>, respectively:</p>
<disp-formula id="E8">
<label>(8)</label>
<mml:math id="M9">
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mpadded width="+3.3pt">
<mml:msub>
<mml:mi>H</mml:mi>
<mml:mi>B</mml:mi>
</mml:msub>
</mml:mpadded>
</mml:mrow>
<mml:mo rspace="5.8pt">=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>H</mml:mi>
</mml:mrow>
<mml:mi>f</mml:mi>
</mml:mfrac>
</mml:mrow>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>The idea behind <italic>MH<sub>B</sub></italic> is that when the population intraclass correlation (ICC) is large, <inline-formula><mml:math id="INEQ30"><mml:msubsup><mml:mi mathvariant="normal">&#x03C3;</mml:mi><mml:mrow><mml:mi>G</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>E</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>E</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> will be larger than <inline-formula><mml:math id="INEQ31"><mml:msubsup><mml:mi mathvariant="normal">&#x03C3;</mml:mi><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>v</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>e</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula>, resulting in an <italic>f</italic> ratio greater than 1 that decreases <italic>MH<sub>B</sub></italic> relative to MH. This decrease in <italic>MH<sub>B</sub></italic> is designed to correct for the within-cluster correlation induced by the data&#x2019;s multilevel structure. However, when the population ICC is 0, <italic>f</italic> = 1, and <italic>MH<sub>B</sub></italic> = MH.</p>
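<p>The adjustment in Eqs. 7 and 8 can be sketched in a few lines of Python. The variance values below are hypothetical and are not taken from any fitted model, and the function name <monospace>adjusted_mh</monospace> is ours; the sketch also assumes that the MH statistic is referred to a &#x03C7;<sup>2</sup> distribution with one degree of freedom (critical value 3.841 at the 0.05 level).</p>
<preformat><![CDATA[
```python
MH_CRITICAL_05 = 3.841  # chi-square critical value, 1 df, alpha = 0.05


def adjusted_mh(mh, var_gee, var_naive):
    """Return (MH_B, f) following Eqs. 7 and 8."""
    f = var_gee / var_naive  # Eq. 7: ratio of the two score statistic variances
    return mh / f, f         # Eq. 8: MH divided by f


# Hypothetical values in which clustering inflates the GEE-based variance
mh = 5.0
mh_b, f = adjusted_mh(mh, var_gee=0.75, var_naive=0.5)

flag_single_level = mh > MH_CRITICAL_05   # True
flag_multilevel = mh_b > MH_CRITICAL_05   # False, since 5.0 / 1.5 is about 3.33
```
]]></preformat>
<p>In this made-up case the unadjusted statistic would flag the item while the adjusted one would not, which is the kind of correction the <italic>f</italic> ratio is meant to provide when the ICC is large.</p>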
<p>The <italic>MH<sub>B</sub></italic> was considered in the present study mainly because it seemed to be the most popular DIF detection method that accounts for multilevel data structures. Another reason was its accessibility in the <italic>DIFplus R</italic> package (<xref ref-type="bibr" rid="B18">Dai et al., 2022</xref>). Finally, it should be noted that the multilevel MH method is not the only MH procedure developed for multilevel data structures, as <xref ref-type="bibr" rid="B22">French and Finch (2013)</xref> and others (<xref ref-type="bibr" rid="B24">French et al., 2019</xref>) have proposed similar extensions that show promise. Here, we evaluate only the <italic>MH<sub>B</sub></italic> variant (hereafter referred to as multilevel MH) to keep our study focused on a handful of DIF detection methods.</p>
</sec>
<sec id="S1.SS1.SSS4">
<title>1.1.4 Multilevel Wald</title>
<p>To motivate our discussion of the multilevel DIF detection method proposed by <xref ref-type="bibr" rid="B32">Huang and Valdivia (2023)</xref>, we return to our earlier presentation of the 2PL model [under the single-level Lord section, <xref ref-type="disp-formula" rid="E4">Eq. 4</xref>]. The 2PL IRT model can be extended to account for the multilevel data structure by incorporating a between-level latent construct. As <xref ref-type="bibr" rid="B43">Marsh et al. (2012)</xref> explained, the between-level latent construct can be defined as a clustering of characteristics of individuals within the between-level unit. For example, assume we have students (within-level) nested within schools (between-level). Then, the probability that student <italic>j</italic> in school <italic>k</italic> (<italic>k</italic> = 1, &#x2026;, <italic>K</italic>) correctly responds to item <italic>i</italic> is expressed as,</p>
<disp-formula id="E9">
<label>(9)</label>
<mml:math id="M10">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B8;</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B8;</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo rspace="5.8pt">)</mml:mo>
</mml:mrow>
<mml:mo rspace="5.8pt">=</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mi>exp</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mo>-</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B1;</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>B</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B8;</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B1;</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>W</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B8;</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B2;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where &#x03B8;<sub><italic>k</italic></sub> is a between-level latent variable that can be interpreted as the mean proficiency of students in school <italic>k</italic> and is assumed to follow a normal distribution <italic>N</italic>(&#x03BC;,&#x03C4;<sup>2</sup>). &#x03B8;<sub><italic>jk</italic></sub> is a within-level latent variable that captures the deviation in proficiency of student <italic>j</italic> from &#x03B8;<sub><italic>k</italic></sub> and is also assumed to follow a normal distribution. &#x03B1;<sub><italic>i</italic>,<italic>B</italic></sub> and &#x03B1;<sub><italic>i</italic>,<italic>W</italic></sub> are the item discrimination parameters associated with the between- and within-level latent variables, respectively, while &#x03B2;<sub><italic>i</italic></sub> is item <italic>i</italic>&#x2019;s location/difficulty parameter. The model in <xref ref-type="disp-formula" rid="E9">Eq. 9</xref> can be identified by constraining the lower-level variance term &#x03C3;<sup>2</sup> and the discrimination parameters &#x03B1;<sub><italic>i</italic>,<italic>B</italic></sub> and &#x03B1;<sub><italic>i</italic>,<italic>W</italic></sub>; for example, &#x03C3;<sup>2</sup> can be set to 1 while &#x03B1;<sub><italic>i,B</italic></sub> and &#x03B1;<sub><italic>i,W</italic></sub> are constrained to be equal (i.e., &#x03B1;<sub><italic>i</italic>,<italic>B</italic></sub> = &#x03B1;<sub><italic>i</italic>,<italic>W</italic></sub>).</p>
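<p>To make Eq. 9 concrete, the Python sketch below evaluates the response probability and draws responses to one item under the example identification constraints (&#x03C3;<sup>2</sup> = 1 and equal between- and within-level discriminations). It is a minimal illustration rather than the data-generation code used in this study: the function names, the seed, and the parameter values are hypothetical, and the within-level deviation is assumed to have mean 0.</p>
<preformat><![CDATA[
```python
import math
import random


def p_correct(theta_jk, theta_k, alpha_w, alpha_b, beta):
    """Eq. 9: probability of a correct response under the multilevel 2PL."""
    z = alpha_b * theta_k + alpha_w * theta_jk + beta
    return 1.0 / (1.0 + math.exp(-z))


def simulate_item(n_clusters, n_per_cluster, alpha, beta, mu, tau2, seed=1):
    """Draw 0/1 responses to one item with alpha_B = alpha_W = alpha and sigma^2 = 1."""
    rng = random.Random(seed)
    responses = []
    for _ in range(n_clusters):
        theta_k = rng.gauss(mu, math.sqrt(tau2))   # between-level latent variable
        for _ in range(n_per_cluster):
            theta_jk = rng.gauss(0.0, 1.0)         # within-level deviation (mean 0 assumed)
            p = p_correct(theta_jk, theta_k, alpha, alpha, beta)
            responses.append(1 if rng.random() < p else 0)
    return responses


# Hypothetical item and design: 10 clusters of 25 respondents
y = simulate_item(n_clusters=10, n_per_cluster=25, alpha=1.0, beta=0.0, mu=0.0, tau2=0.5)
```
]]></preformat>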
<p><xref ref-type="bibr" rid="B32">Huang and Valdivia (2023)</xref> introduced a procedure to detect both uniform and non-uniform DIF in the presence of multilevel data. This procedure extends the <xref ref-type="bibr" rid="B29">Hansen et al. (2014)</xref> approach by applying the Metropolis-Hastings Robbins-Monro algorithm (MH-RM; <xref ref-type="bibr" rid="B9">Cai, 2008</xref>, <xref ref-type="bibr" rid="B10">Cai, 2010a</xref>,<xref ref-type="bibr" rid="B11">b</xref>) to estimate parameters in multilevel IRT models and obtain the associated standard errors. The procedure for DIF in multilevel data consists of two stages. First, an <italic>initial screening</italic> stage designates the items as either anchor items or candidate items through an extended Wald-2 test. Then, the <italic>formal evaluation</italic> stage further evaluates the candidate items to identify DIF items using the extended Wald-1 test. A simulation study indicated that this two-stage procedure has high power for detecting DIF and controls the Type I error rate well.</p>
</sec>
</sec>
</sec>
<sec id="S2">
<title>2 Research aim</title>
<p>As noted above, our main research aim is to compare the performance of single-level and multilevel methods of detecting DIF when data are multilevel. To our knowledge, limited literature exists comparing methods for detecting DIF in nested data. Hence, we evaluate four DIF detection methods (single-level MH, single-level Lord, multilevel MH, and multilevel Wald) in terms of their ability to detect DIF when data are nested.</p>
</sec>
<sec id="S3" sec-type="materials|methods">
<title>3 Materials and methods</title>
<p>The research question regarding the performance of the DIF detection methods was addressed using a Monte Carlo simulation study. Our design choices, including manipulated factors and their levels, were motivated by empirical and methodological research including but not limited to assessments found in psychological research and education (e.g., <xref ref-type="bibr" rid="B60">Sulis and Toland, 2017</xref>; <xref ref-type="bibr" rid="B24">French et al., 2019</xref>; <xref ref-type="bibr" rid="B32">Huang and Valdivia, 2023</xref>).</p>
<sec id="S3.SS1">
<title>3.1 Fixed factors</title>
<p>We simulated data for 20 dichotomous items following the multilevel 2PL IRT model as shown in <xref ref-type="disp-formula" rid="E9">Eq. 9</xref>. Item location/difficulty and discrimination parameters for the 20 items in DIF-free (baseline) conditions are presented in Appendix A in Supplemental materials.<sup><xref ref-type="fn" rid="footnote3">3</xref></sup> The data-generating item parameters were selected by randomly sampling 20 item location/difficulty and discrimination parameters from the TIMSS 2015 eighth-grade mathematics assessment. Item parameters for DIF-induced conditions were produced by adding a constant of varied magnitude to the location/difficulty parameters of two DIF items (see &#x201C;3.2 Manipulated factors&#x201D;).</p>
<p>Two groups, reference and focal, were considered in the study, and DIF was modeled such that the difficulty/location parameters of two items were shifted upward by a specified magnitude in the focal group. We considered uniform DIF only. While we recognize that nonuniform DIF is also possible (see &#x201C;5 Discussion&#x201D;), our choice to examine only uniform DIF was driven by several factors: uniform DIF has been more prevalent in operational settings in some contexts (e.g., <xref ref-type="bibr" rid="B34">Joo et al., 2023</xref>), it allowed us to examine commonly used methods (e.g., single-level MH), and it kept our study manageable.<sup><xref ref-type="fn" rid="footnote4">4</xref></sup></p>
</sec>
<sec id="S3.SS2">
<title>3.2 Manipulated factors</title>
<p>Given our emphasis on nested data and on methodological approaches that do or do not model the nesting in DIF detection, we designed a simulation study that examined various conditions present in nested data. Specifically, we considered the following manipulated factors:</p>
<list list-type="simple">
<list-item>
<label>(a)</label>
<p>the number of clusters (N2; between-level units),</p>
</list-item>
<list-item>
<label>(b)</label>
<p>the number of subjects (N1; within-level units),</p>
</list-item>
<list-item>
<label>(c)</label>
<p>the sample size ratio (N2/N1 ratio),</p>
</list-item>
<list-item>
<label>(d)</label>
<p>the intraclass correlation in focal group (ICC),</p>
</list-item>
<list-item>
<label>(e)</label>
<p>latent trait proficiency means for the reference (&#x03B8;<sub><italic>r</italic></sub>) and focal (&#x03B8;<sub><italic>f</italic></sub>) groups, and</p>
</list-item>
<list-item>
<label>(f)</label>
<p>DIF magnitude.</p>
</list-item>
</list>
<sec id="S3.SS2.SSS1">
<title>3.2.1 Number of clusters (N2)</title>
<p>We considered two levels of N2 factor: 10 or 30 between-level units (clusters). These choices represented small to medium numbers of clusters, aiming to better understand DIF application when fewer between-level units are present (these choices are also similar to other studies, such as <xref ref-type="bibr" rid="B33">Jin et al., 2014</xref>; <xref ref-type="bibr" rid="B24">French et al., 2019</xref>; <xref ref-type="bibr" rid="B32">Huang and Valdivia, 2023</xref>).</p>
</sec>
<sec id="S3.SS2.SSS2">
<title>3.2.2 Number of subjects per cluster (N1)</title>
<p>Two levels of the N1 factor were manipulated. For the <italic>balanced</italic> conditions, the number of subjects (within-level units) per cluster was 25 or 50. For the <italic>imbalanced</italic> conditions, half of the clusters contained a reduced number of subjects relative to N1 (see more detail next under &#x201C;3.2.3 N2/N1 ratio&#x201D;). N1 values of 25 and 50 are suggestive of a smaller cluster (e.g., a classroom of 25 students) or a medium-sized unit (e.g., a group of participants in a feasibility study). These values also resemble choices in similar research studies.</p>
</sec>
<sec id="S3.SS2.SSS3">
<title>3.2.3 N2/N1 ratio</title>
<p>We considered two levels: a balanced sample size ratio, where all clusters (between-level units) had the same number of within-level units (e.g., 25 subjects in each of 10 clusters), or an imbalanced sample size ratio, where half of the between-level units contained the N1 within-level units, and the other half had 60% of the N1 size. For example, under imbalanced conditions, when N2 = 10 and N1 = 25, five clusters (between-level units) had 25 subjects (within-level units) each and the remaining five clusters had 15 (0.60 &#x002A; 25) subjects each. This imbalanced scenario represents a situation where clusters contain a different number of subjects, which may be more realistic in empirical data.<sup><xref ref-type="fn" rid="footnote5">5</xref></sup></p>
</sec>
<sec id="S3.SS2.SSS4">
<title>3.2.4 Intraclass correlation for focal group (ICC)</title>
<p>We manipulated three levels of ICCs for the focal group in the study. The reference group&#x2019;s ICC was fixed at 0.33. The focal group&#x2019;s ICC was set at 0.20 (smaller than the reference), 0.33 (same as the reference), or 0.50 (larger than the reference). Specifically, we manipulated the value of ICC by varying the between-cluster variance (<inline-formula><mml:math id="INEQ52"><mml:msubsup><mml:mi mathvariant="normal">&#x03C3;</mml:mi><mml:mi>B</mml:mi><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula>) while fixing the within-cluster variance (<inline-formula><mml:math id="INEQ53"><mml:msubsup><mml:mi mathvariant="normal">&#x03C3;</mml:mi><mml:mi>W</mml:mi><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula>) at 1. Effectively, this means that for the focal group, the <inline-formula><mml:math id="INEQ54"><mml:msubsup><mml:mi mathvariant="normal">&#x03C3;</mml:mi><mml:mi>B</mml:mi><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> values were set at 0.25, 0.50, and 1, respectively.<sup><xref ref-type="fn" rid="footnote6">6</xref></sup> ICC values were selected based on previous studies (<xref ref-type="bibr" rid="B33">Jin et al., 2014</xref>; <xref ref-type="bibr" rid="B24">French et al., 2019</xref>) and aimed to reflect ICC values observed in practice (<xref ref-type="bibr" rid="B47">Muthen, 1994</xref>).</p>
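<p>The correspondence between the ICC levels and the between-cluster variances can be checked with a short Python sketch. It assumes the usual definition of the ICC as the between-cluster variance divided by the sum of the between- and within-cluster variances, which reproduces the values reported above; the function names are ours.</p>
<preformat><![CDATA[
```python
def between_variance(icc, within_variance=1.0):
    """Solve ICC = s2_B / (s2_B + s2_W) for the between-cluster variance s2_B."""
    return icc * within_variance / (1.0 - icc)


def icc_from_variances(between_var, within_variance=1.0):
    """ICC implied by a between- and a within-cluster variance."""
    return between_var / (between_var + within_variance)


for s2_b in (0.25, 0.50, 1.0):
    print(s2_b, round(icc_from_variances(s2_b), 2))
# prints:
# 0.25 0.2
# 0.5 0.33
# 1.0 0.5
```
]]></preformat>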
</sec>
<sec id="S3.SS2.SSS5">
<title>3.2.5 Latent proficiency means (&#x03B8;<sub><italic>r</italic></sub> and &#x03B8;<sub><italic>f</italic></sub>)</title>
<p>We considered two levels of latent trait means: reference and focal group means were equal at 0 (i.e., &#x03B8;<sub><italic>r</italic></sub> mean = 0 and &#x03B8;<sub><italic>f</italic></sub> mean = 0), or focal group&#x2019;s mean was shifted downward to &#x2212;0.75, suggesting that the latent proficiency distributions of the two groups were unequal (i.e., &#x03B8;<sub><italic>r</italic></sub> mean = 0 and &#x03B8;<sub><italic>f</italic></sub> mean = &#x2212;0.75). We considered conditions where both groups were modeled with the same latent variable mean value (of 0) as a baseline condition; while different means represented contexts, such as in ILSA, where some participating countries (or educational systems) might have a lower latent variable mean.</p>
</sec>
<sec id="S3.SS2.SSS6">
<title>3.2.6 DIF magnitude</title>
<p>We simulated uniform within-cluster DIF (e.g., gender identity with two levels) with two different magnitudes. The two DIF magnitude values considered were 0.5 and 1, which reflected small and large DIF, respectively. The uniform DIF was introduced by adding the DIF magnitude value to the location/difficulty parameters of the first two items in the focal group.</p>
</sec>
</sec>
<sec id="S3.SS3">
<title>3.3 Data generation and analysis</title>
<p>Our fully crossed design yielded 48 baseline (non-DIF) conditions and 96 DIF conditions for a total of 144 conditions. Each condition was replicated 100 times. We simulated the data using the popular IRT software flexMIRT<sup>&#x00AE;</sup> (<xref ref-type="bibr" rid="B12">Cai, 2017</xref>) according to the specific conditions.</p>
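<p>The condition counts follow directly from crossing the manipulated factors, as the Python sketch below illustrates. The dictionary keys are hypothetical labels, and coding the DIF-free baseline as a DIF magnitude of 0 is an assumption made only for this enumeration.</p>
<preformat><![CDATA[
```python
from itertools import product

factors = {
    "n_clusters": (10, 30),                 # N2
    "n_per_cluster": (25, 50),              # N1
    "ratio": ("balanced", "imbalanced"),    # N2/N1 ratio
    "icc_focal": (0.20, 0.33, 0.50),        # focal group ICC
    "focal_mean": (0.0, -0.75),             # focal group latent mean
    "dif_magnitude": (0.0, 0.5, 1.0),       # 0.0 stands for the DIF-free baseline
}

# One dictionary per condition in the fully crossed design
conditions = [dict(zip(factors, levels)) for levels in product(*factors.values())]
baseline = [c for c in conditions if c["dif_magnitude"] == 0.0]
dif = [c for c in conditions if c["dif_magnitude"] > 0.0]

print(len(baseline), len(dif), len(conditions))  # 48 96 144
n_datasets = len(conditions) * 100               # 100 replications per condition
```
]]></preformat>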
<p>Once data were simulated, the datasets were submitted to each of the four studied DIF detection methods to examine their ability to detect DIF. Specifically, for single-level MH, we employed the difMH function in the <italic>difR</italic> package (<xref ref-type="bibr" rid="B39">Magis et al., 2010</xref>) in R (<xref ref-type="bibr" rid="B54">R Core Team, 2023</xref>), with most of its default options. Two changes were made to the defaults: we increased the number of iterations to 100 (from the default of 20) and we employed the purification process in the analysis. Similarly, for single-level Lord, we utilized the difLord function (also in the <italic>difR</italic> package in R) with the same changes to the defaults. For multilevel MH, we used the ML.DIF function in the <italic>DIFplus</italic> package (<xref ref-type="bibr" rid="B18">Dai et al., 2022</xref>) with most of its defaults, except that we specified the argument correct.factor = 0.85 and opted for purification. Lastly, for the multilevel Wald, we employed flexMIRT and the proposed two-stage DIF detection procedure, which implements both the MH-RM algorithm and Wald tests (per <xref ref-type="bibr" rid="B32">Huang and Valdivia, 2023</xref>).</p>
<p>To evaluate the performance of the four DIF detection methods, two outcome variables were computed. First, we examined Type I error rates, which we computed as the proportion of times that a DIF-free item (an item that was simulated to have no DIF) was identified as a DIF item (false positive rate) across converged replications. Second, for DIF conditions, we computed power as the proportion of times that the two DIF-simulated items were correctly identified as DIF items across the converged replications. Lastly, to guide our results presentation, we conducted an analysis of variance (ANOVA) to evaluate the impact of each of the manipulated factors and DIF detection methods on the outcome variables. Where appropriate, post-hoc pairwise comparisons were performed using the Bonferroni method. Sample code for data generation, analysis, and additional results are included in Supplemental materials at <ext-link ext-link-type="uri" xlink:href="https://osf.io/96j3g/?view_only=49e6378ac0da4b4ba78b9f17949aa1c2">https://osf.io/96j3g/?view_only=49e6378ac0da4b4ba78b9f17949aa1c2</ext-link>.</p>
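<p>The two outcome variables can be sketched in Python as follows. This is an illustration under our own simplifying assumptions, not the analysis code from the Supplemental materials: flags are pooled over items and converged replications within a condition, and the function name and the example flags are hypothetical.</p>
<preformat><![CDATA[
```python
def type1_and_power(flags, dif_items):
    """Summarize DIF flags from the converged replications of one condition.

    flags: one list of booleans per replication (True = item flagged as DIF).
    dif_items: zero-based indices of items simulated with DIF (empty for baseline).
    Returns (type_i_error, power); an entry is None when it is undefined.
    """
    dif_items = set(dif_items)
    false_pos = true_pos = n_free = n_dif = 0
    for rep in flags:
        for item, flagged in enumerate(rep):
            if item in dif_items:
                n_dif += 1
                true_pos += bool(flagged)    # DIF item correctly flagged
            else:
                n_free += 1
                false_pos += bool(flagged)   # DIF-free item wrongly flagged
    type_i_error = false_pos / n_free if n_free else None
    power = true_pos / n_dif if n_dif else None
    return type_i_error, power


# Two hypothetical replications of a four-item test; items 0 and 1 carry DIF
flags = [
    [True, True, False, False],
    [True, False, True, False],
]
type_i_error, power = type1_and_power(flags, dif_items=[0, 1])  # (0.25, 0.75)
```
]]></preformat>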
</sec>
</sec>
<sec id="S4" sec-type="results">
<title>4 Results</title>
<p>All tabulated results, as well as additional graphical visualizations, can be found in Supplemental documentation (in Results folder, as extended Appendix B [Figures B1-B6] and Appendix C [C1-C6]). In what follows, we describe the main trends in results for the two studied outcomes: Type I error rates and power rates. For each outcome separately, we fit a between-subjects ANOVA where manipulated factors in the study served as independent variables. Due to the complexity of the models, only main effects and associated effect sizes expressed as <inline-formula><mml:math id="INEQ61"><mml:msubsup><mml:mi mathvariant="normal">&#x03B7;</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>r</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>l</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> were examined. However, given that we used the ANOVA results only to guide presentation of the findings, omitting the interactions was not viewed as problematic.</p>
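<p>For readers less familiar with the effect size reported here, the Python sketch below computes partial &#x03B7;<sup>2</sup> from sums of squares using its standard definition (the effect sum of squares divided by the sum of the effect and error sums of squares). The input values are hypothetical and are not results from this study.</p>
<preformat><![CDATA[
```python
def partial_eta_squared(ss_effect, ss_error):
    """Partial eta squared for one effect in a between-subjects ANOVA."""
    return ss_effect / (ss_effect + ss_error)


# Hypothetical sums of squares for one main effect and the error term
eta_p2 = partial_eta_squared(ss_effect=2.0, ss_error=6.0)  # 0.25
```
]]></preformat>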
<sec id="S4.SS1">
<title>4.1 Type I error rates summary</title>
<p>Based on ANOVA, it was found that five of seven factors were statistically significant at the 0.05 level (i.e., N2, N1, &#x03B8;, method, and DIF magnitude), while two were not (ICC and N2/N1 ratio). Post-hoc analysis suggested significant pair-wise differences among all method pairs except for single-level Lord and multilevel Wald methods. The effect size was large for the method factor at <inline-formula><mml:math id="INEQ62"><mml:msubsup><mml:mi mathvariant="normal">&#x03B7;</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>r</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>l</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.51, followed by moderate effect sizes for &#x03B8; (<inline-formula><mml:math id="INEQ63"><mml:msubsup><mml:mi mathvariant="normal">&#x03B7;</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>r</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>l</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.08), DIF Type (<inline-formula><mml:math id="INEQ64"><mml:msubsup><mml:mi mathvariant="normal">&#x03B7;</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>r</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>l</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.05), and N2 (<inline-formula><mml:math id="INEQ65"><mml:msubsup><mml:mi mathvariant="normal">&#x03B7;</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>r</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>l</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.05). <xref ref-type="table" rid="T2">Table 2</xref> and <xref ref-type="fig" rid="F1">Figure 1</xref> show the results based on Type I error rate, averaged across the ICC and N2/N1 ratio levels because their main effects were statistically nonsignificant and their effect sizes negligible (<inline-formula><mml:math id="INEQ66"><mml:msubsup><mml:mi mathvariant="normal">&#x03B7;</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>r</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>l</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.003 and 0.005, respectively).
Corresponding results for all levels of manipulated factors can be found in Supplemental materials (under Results, B1-B6).</p>
<table-wrap position="float" id="T2">
<label>TABLE 2</label>
<caption><p>Type I error rates averaged across ICC and sample size ratio levels, by between-level and within-level sample sizes for the studied conditions (bolded values exceed 0.05).</p></caption>
<table cellspacing="5" cellpadding="5" frame="box" rules="all">
<thead>
<tr>
<td valign="top" align="left" style="color:#ffffff;background-color: #7f8080;"></td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;"></td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;"></td>
<td valign="top" align="center" colspan="4" style="color:#ffffff;background-color: #7f8080;">&#x03B8; <sub>Equal</sub></td>
<td valign="top" align="center" colspan="4" style="color:#ffffff;background-color: #7f8080;">&#x03B8; <sub>Unequal</sub></td>
</tr>
<tr>
<td valign="top" align="left" style="color:#ffffff;background-color: #7f8080;">N2</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">N1</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">DIF</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Single Lord</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Single MH</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Multilevel<break/> MH</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Multilevel Wald</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Single Lord</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Single MH</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Multilevel MH</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Multilevel Wald</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">10</td>
<td valign="top" align="center">25</td>
<td valign="top" align="center">No DIF</td>
<td valign="top" align="center">0.032</td>
<td valign="top" align="center">0.041</td>
<td valign="top" align="center"><bold>0.064</bold></td>
<td valign="top" align="center">0.038</td>
<td valign="top" align="center">0.026</td>
<td valign="top" align="center">0.036</td>
<td valign="top" align="center"><bold>0.054</bold></td>
<td valign="top" align="center">0.036</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="center">Small</td>
<td valign="top" align="center">0.031</td>
<td valign="top" align="center">0.040</td>
<td valign="top" align="center"><bold>0.070</bold></td>
<td valign="top" align="center">0.033</td>
<td valign="top" align="center">0.036</td>
<td valign="top" align="center">0.043</td>
<td valign="top" align="center"><bold>0.066</bold></td>
<td valign="top" align="center">0.034</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="center">Large</td>
<td valign="top" align="center">0.047</td>
<td valign="top" align="center">0.040</td>
<td valign="top" align="center"><bold>0.087</bold></td>
<td valign="top" align="center">0.028</td>
<td valign="top" align="center">0.049</td>
<td valign="top" align="center">0.041</td>
<td valign="top" align="center"><bold>0.072</bold></td>
<td valign="top" align="center">0.026</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">50</td>
<td valign="top" align="center">No DIF</td>
<td valign="top" align="center">0.032</td>
<td valign="top" align="center">0.044</td>
<td valign="top" align="center"><bold>0.057</bold></td>
<td valign="top" align="center">0.043</td>
<td valign="top" align="center">0.030</td>
<td valign="top" align="center">0.043</td>
<td valign="top" align="center"><bold>0.053</bold></td>
<td valign="top" align="center">0.043</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="center">Small</td>
<td valign="top" align="center">0.041</td>
<td valign="top" align="center">0.046</td>
<td valign="top" align="center"><bold>0.095</bold></td>
<td valign="top" align="center">0.033</td>
<td valign="top" align="center">0.041</td>
<td valign="top" align="center">0.039</td>
<td valign="top" align="center"><bold>0.068</bold></td>
<td valign="top" align="center">0.034</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="center">Large</td>
<td valign="top" align="center">0.039</td>
<td valign="top" align="center">0.048</td>
<td valign="top" align="center"><bold>0.111</bold></td>
<td valign="top" align="center">0.026</td>
<td valign="top" align="center">0.043</td>
<td valign="top" align="center">0.043</td>
<td valign="top" align="center"><bold>0.072</bold></td>
<td valign="top" align="center">0.030</td>
</tr>
<tr>
<td valign="top" align="left">30</td>
<td valign="top" align="center">25</td>
<td valign="top" align="center">No DIF</td>
<td valign="top" align="center">0.033</td>
<td valign="top" align="center"><bold>0.051</bold></td>
<td valign="top" align="center"><bold>0.075</bold></td>
<td valign="top" align="center">0.041</td>
<td valign="top" align="center">0.030</td>
<td valign="top" align="center">0.045</td>
<td valign="top" align="center"><bold>0.066</bold></td>
<td valign="top" align="center">0.040</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="center">Small</td>
<td valign="top" align="center">0.037</td>
<td valign="top" align="center">0.049</td>
<td valign="top" align="center"><bold>0.112</bold></td>
<td valign="top" align="center">0.028</td>
<td valign="top" align="center">0.037</td>
<td valign="top" align="center">0.042</td>
<td valign="top" align="center"><bold>0.073</bold></td>
<td valign="top" align="center">0.033</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="center">Large</td>
<td valign="top" align="center">0.037</td>
<td valign="top" align="center">0.048</td>
<td valign="top" align="center"><bold>0.172</bold></td>
<td valign="top" align="center">0.025</td>
<td valign="top" align="center">0.036</td>
<td valign="top" align="center">0.040</td>
<td valign="top" align="center"><bold>0.071</bold></td>
<td valign="top" align="center">0.026</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">50</td>
<td valign="top" align="center">No DIF</td>
<td valign="top" align="center">0.034</td>
<td valign="top" align="center"><bold>0.058</bold></td>
<td valign="top" align="center"><bold>0.079</bold></td>
<td valign="top" align="center">0.042</td>
<td valign="top" align="center">0.033</td>
<td valign="top" align="center">0.043</td>
<td valign="top" align="center"><bold>0.059</bold></td>
<td valign="top" align="center">0.042</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="center">Small</td>
<td valign="top" align="center">0.040</td>
<td valign="top" align="center"><bold>0.061</bold></td>
<td valign="top" align="center"><bold>0.150</bold></td>
<td valign="top" align="center">0.038</td>
<td valign="top" align="center">0.041</td>
<td valign="top" align="center">0.045</td>
<td valign="top" align="center"><bold>0.075</bold></td>
<td valign="top" align="center">0.031</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="center">Large</td>
<td valign="top" align="center">0.042</td>
<td valign="top" align="center"><bold>0.061</bold></td>
<td valign="top" align="center"><bold>0.263</bold></td>
<td valign="top" align="center">0.033</td>
<td valign="top" align="center">0.038</td>
<td valign="top" align="center">0.046</td>
<td valign="top" align="center"><bold>0.084</bold></td>
<td valign="top" align="center">0.034</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn><p>&#x03B8;<sub>Equal</sub> represents conditions where the latent means of the reference and focal groups were simulated to be equal (value of 0); &#x03B8;<sub>Unequal</sub> represents conditions where the latent mean was approximately 0 for the reference group and approximately &#x2212;0.75 for the focal group. N2 represents between-level unit sample size, while N1 represents sample size for within-level units. Small and large DIF were modeled by shifting the difficulty parameter for the focal group by 0.50 and 1.00, respectively.</p></fn>
</table-wrap-foot>
</table-wrap>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption><p>Type I error rates averaged across ICC and sample size ratio levels, by Level-2 and Level-1 sample sizes for the studied conditions, with a 0.05 reference line.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="feduc-09-1389165-g001.tif"/>
</fig>
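<p>To make the DIF manipulation described in the note to Table 2 concrete, the following minimal Python sketch evaluates the 2PL response probability for a reference and a focal examinee of equal ability when the focal group's difficulty is shifted by 0.50 or 1.00. The function names, the item parameters (a = 1, b = 0), and the direction of the shift (added to the difficulty, making the item harder for the focal group) are illustrative assumptions rather than the generating code used in the study.</p>
<preformat><![CDATA[
```python
import math

SMALL_DIF = 0.50  # shift in difficulty for small DIF (from the table note)
LARGE_DIF = 1.00  # shift in difficulty for large DIF (from the table note)


def prob_correct_2pl(theta, a, b):
    """2PL probability of a correct response for ability theta,
    discrimination a, and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def focal_difficulty(b_reference, dif_shift):
    """Difficulty for the focal group under uniform DIF.

    Assumption: the shift is added, so the item is harder for the focal group.
    """
    return b_reference + dif_shift


# Hypothetical item (a = 1, b = 0) and an examinee with theta = 0.
p_reference = prob_correct_2pl(0.0, 1.0, 0.0)
p_focal_small = prob_correct_2pl(0.0, 1.0, focal_difficulty(0.0, SMALL_DIF))
p_focal_large = prob_correct_2pl(0.0, 1.0, focal_difficulty(0.0, LARGE_DIF))
```
]]></preformat>
<p>Under these assumed values, the probability of a correct response falls from 0.50 for the reference examinee to roughly 0.38 (small DIF) and 0.27 (large DIF) for a focal examinee of the same ability.</p>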
<p>In null conditions, Type I error rates were maintained quite well at around the 0.05 level for three of the four studied methods (see <xref ref-type="table" rid="T2">Table 2</xref> and <xref ref-type="fig" rid="F1">Figure 1</xref>). When no DIF was simulated (null conditions), only multilevel MH rates consistently rose above the 0.05 level, in particular in conditions where the numbers of between-level (N2) and within-level (N1) units increased (in the range of 0.053 to 0.079). The pattern of performance was quite similar when small DIF and large DIF conditions were studied, in that multilevel MH Type I error rates were again higher across the studied conditions when compared to the other methods. For example, when small DIF was introduced, elevated Type I error rates were observed in particular for the multilevel MH method and under unequal &#x03B8; conditions (i.e., when means for the two groups were different), with Type I error rates reaching 0.15. When large DIF was simulated, patterns of elevated Type I error rates were similar to those previously noted, in that higher Type I error rates were found in conditions with unequal &#x03B8; and larger N2 and N1.</p>
<p>It is noteworthy that two methods reported in <xref ref-type="table" rid="T2">Table 2</xref>, the single-level Lord and multilevel Wald tests, yielded Type I error rates at or below 0.05, suggesting that these methods keep false positives at a reasonable level (see &#x201C;5 Discussion&#x201D; for more detailed reporting). The single-level MH method yielded rates below 0.05 across most conditions, particularly those with fewer N2 and N1 and across theta levels. One exception was noted in a condition with unequal &#x03B8;, and large N2 and N1, where the Type I error rate reached 0.058. Under large DIF, across studied conditions, elevated Type I error rates were observed for the multilevel MH, where Type I error rates ranged from 0.071 to 0.263. Unsurprisingly, the highest rates were observed in conditions where means between the focal and reference groups were unequal (i.e., the focal group&#x2019;s mean was lower by 0.75 standard deviation) and when between-level and within-level units were 30 and 50, respectively.</p>
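<p>The two outcomes summarized in this section can be illustrated with a short Python sketch: the empirical Type I error rate is the proportion of DIF-free items that a method flags, and power is the proportion of DIF items that it flags, both pooled over replications. The function name and the data layout (one list of item-level flags per replication) are hypothetical and are not the authors&#x2019; scoring code.</p>
<preformat><![CDATA[
```python
def rejection_rates(flags, is_dif_item):
    """Return (type_i_error_rate, power) pooled over replications.

    flags: list of replications; each replication is a list of booleans,
           one per item, True when the method flagged the item as DIF.
    is_dif_item: list of booleans, True for items simulated with DIF.
    """
    false_positives = true_positives = 0
    n_null_tests = n_dif_tests = 0
    for replication in flags:
        for flagged, has_dif in zip(replication, is_dif_item):
            if has_dif:
                n_dif_tests += 1
                true_positives += int(flagged)
            else:
                n_null_tests += 1
                false_positives += int(flagged)
    # A rate is undefined (NaN) when no items of that kind were tested.
    type_i = false_positives / n_null_tests if n_null_tests else float("nan")
    power = true_positives / n_dif_tests if n_dif_tests else float("nan")
    return type_i, power
```
]]></preformat>
<p>In a null condition, where no item is simulated with DIF, only the first value returned is defined.</p>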
</sec>
<sec id="S4.SS2">
<title>4.2 Power rates summary</title>
<p>Based on ANOVA, it was found that five of seven factors were statistically significant at the 0.05 level (i.e., N2, N1, N2/N1 ratio, method, and DIF magnitude), while two were not (ICC and &#x03B8;). Post-hoc analysis suggested only one statistically significant pairwise comparison: single-level Lord versus single-level MH. The effect sizes were large for DIF Type (<inline-formula><mml:math id="INEQ67"><mml:msubsup><mml:mi mathvariant="normal">&#x03B7;</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>r</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>l</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.49), N2 (<inline-formula><mml:math id="INEQ68"><mml:msubsup><mml:mi mathvariant="normal">&#x03B7;</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>r</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>l</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.45), and N1 (<inline-formula><mml:math id="INEQ69"><mml:msubsup><mml:mi mathvariant="normal">&#x03B7;</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>r</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>l</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.22), and moderate for the method factor at <inline-formula><mml:math><mml:msubsup><mml:mi mathvariant="normal">&#x03B7;</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>r</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>l</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.07. 
A small <inline-formula><mml:math id="INEQ70"><mml:msubsup><mml:mi mathvariant="normal">&#x03B7;</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>r</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>l</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.04 was associated with the N2/N1 ratio, while negligible effect sizes were found for the two remaining statistically nonsignificant main effects of &#x03B8; (<inline-formula><mml:math id="INEQ71"><mml:msubsup><mml:mi mathvariant="normal">&#x03B7;</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>r</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>l</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.001) and ICC (<inline-formula><mml:math id="INEQ72"><mml:msubsup><mml:mi mathvariant="normal">&#x03B7;</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>r</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>l</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> &#x003C; 0.001). Corresponding results for all levels of manipulated factors can be found in Supplemental materials (under Results, C1-C6).</p>
<p>Several observations were noted in examining results for power, as shown in <xref ref-type="table" rid="T3">Table 3</xref> and <xref ref-type="fig" rid="F2">Figure 2</xref>. Namely, across all four methods, when large DIF was introduced, high power rates were observed across all conditions. Rates of 0.80 or higher were noted, with large N2 and N1 yielding power rates at or near 1.00. The lowest power rates when DIF was large, albeit still above 0.80, were found in conditions where N2 = 10 and N1 = 25. It was only in these conditions, with fewer observations at the between and within levels, that we observed some variation in the methods&#x2019; performance, such that the highest power rates were observed for single-level MH, followed by multilevel MH, multilevel Wald, and single-level Lord, respectively.</p>
<table-wrap position="float" id="T3">
<label>TABLE 3</label>
<caption><p>Power rates for the studied conditions, averaged across ICC and theta levels (bolded values fall below the 0.80 level).</p></caption>
<table cellspacing="5" cellpadding="5" frame="box" rules="all">
<thead>
<tr>
<td valign="top" align="left" style="color:#ffffff;background-color: #7f8080;">N2</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">N1</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">DIF</td>
<td valign="top" align="center" colspan="4" style="color:#ffffff;background-color: #7f8080;">N2/N1 ratio balanced</td>
<td valign="top" align="center" colspan="4" style="color:#ffffff;background-color: #7f8080;">N2/N1 ratio imbalanced</td>
</tr>
<tr>
<td valign="top" align="left" style="color:#ffffff;background-color: #7f8080;"></td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;"></td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;"></td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Single Lord</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Single MH</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Multilevel<break/> MH</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Multilevel Wald</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Single Lord</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Single MH</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Multilevel MH</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Multilevel Wald</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">10</td>
<td valign="top" align="center">25</td>
<td valign="top" align="center">Small</td>
<td valign="top" align="center"><bold>0.350</bold></td>
<td valign="top" align="center"><bold>0.559</bold></td>
<td valign="top" align="center"><bold>0.550</bold></td>
<td valign="top" align="center"><bold>0.385</bold></td>
<td valign="top" align="center"><bold>0.240</bold></td>
<td valign="top" align="center"><bold>0.437</bold></td>
<td valign="top" align="center"><bold>0.429</bold></td>
<td valign="top" align="center"><bold>0.318</bold></td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="center">Large</td>
<td valign="top" align="center">0.920</td>
<td valign="top" align="center">0.983</td>
<td valign="top" align="center">0.970</td>
<td valign="top" align="center">0.942</td>
<td valign="top" align="center">0.827</td>
<td valign="top" align="center">0.960</td>
<td valign="top" align="center">0.926</td>
<td valign="top" align="center">0.889</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">50</td>
<td valign="top" align="center">Small</td>
<td valign="top" align="center"><bold>0.718</bold></td>
<td valign="top" align="center">0.892</td>
<td valign="top" align="center">0.833</td>
<td valign="top" align="center"><bold>0.707</bold></td>
<td valign="top" align="center"><bold>0.605</bold></td>
<td valign="top" align="center">0.811</td>
<td valign="top" align="center"><bold>0.730</bold></td>
<td valign="top" align="center"><bold>0.590</bold></td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="center">Large</td>
<td valign="top" align="center">0.999</td>
<td valign="top" align="center">1.000</td>
<td valign="top" align="center">1.000</td>
<td valign="top" align="center">0.999</td>
<td valign="top" align="center">0.991</td>
<td valign="top" align="center">0.998</td>
<td valign="top" align="center">1.000</td>
<td valign="top" align="center">0.997</td>
</tr>
<tr>
<td valign="top" align="left">30</td>
<td valign="top" align="center">25</td>
<td valign="top" align="center">Small</td>
<td valign="top" align="center">0.902</td>
<td valign="top" align="center">0.981</td>
<td valign="top" align="center">0.969</td>
<td valign="top" align="center">0.909</td>
<td valign="top" align="center"><bold>0.793</bold></td>
<td valign="top" align="center">0.919</td>
<td valign="top" align="center">0.913</td>
<td valign="top" align="center">0.805</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="center">Large</td>
<td valign="top" align="center">1.000</td>
<td valign="top" align="center">1.000</td>
<td valign="top" align="center">1.000</td>
<td valign="top" align="center">1.000</td>
<td valign="top" align="center">1.000</td>
<td valign="top" align="center">1.000</td>
<td valign="top" align="center">1.000</td>
<td valign="top" align="center">1.000</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">50</td>
<td valign="top" align="center">Small</td>
<td valign="top" align="center">0.993</td>
<td valign="top" align="center">0.999</td>
<td valign="top" align="center">0.997</td>
<td valign="top" align="center">0.989</td>
<td valign="top" align="center">0.983</td>
<td valign="top" align="center">0.997</td>
<td valign="top" align="center">0.997</td>
<td valign="top" align="center">0.973</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="center">Large</td>
<td valign="top" align="center">1.000</td>
<td valign="top" align="center">1.000</td>
<td valign="top" align="center">1.000</td>
<td valign="top" align="center">1.000</td>
<td valign="top" align="center">1.000</td>
<td valign="top" align="center">1.000</td>
<td valign="top" align="center">1.000</td>
<td valign="top" align="center">1.000</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn><p>N2 represents between-level unit sample size, while N1 represents sample size for within-level units. N2/N1 ratio balanced represents conditions where sample sizes for N2 and N1 were equal across reference and focal groups; N2/N1 ratio imbalanced represents conditions where half of the between-level units contained the N1 within-level units, and the other half had 60% of the N1 size. Small and large DIF were modeled by shifting the difficulty parameter for the focal group by 0.50 and 1.00, respectively.</p></fn>
</table-wrap-foot>
</table-wrap>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption><p>Power rates across ICC and theta levels for studied conditions with 0.80 reference line.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="feduc-09-1389165-g002.tif"/>
</fig>
<p>More differentiation among the methods in detecting items simulated with DIF was found under small DIF conditions. Namely, here again we observed that the ordering of the methods by power (i.e., higher observed power rates) remained similar to that found in the small-sample, large DIF conditions. Single-level MH and multilevel MH yielded the highest power rates across conditions, but to differing degrees. Moreover, the power rates were considerably lower across the conditions when DIF was small for all methods. For example, in conditions with the lowest N2 and N1, power rates ranged from 0.350 to 0.559 for balanced and 0.240 to 0.437 for imbalanced sample sizes, respectively. As N1 increased to a sample size of 50, power rates increased to 0.707 to 0.892 for balanced and 0.590 to 0.811 for imbalanced conditions, respectively. The impact of sample balance/imbalance was observed across the conditions such that, on average, imbalanced sample sizes across N1 yielded lower power rates when compared to the balanced sample sizes, although those differences diminished as sample sizes increased.</p>
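<p>The imbalanced design described in the note to Table 3 (half of the between-level units with N1 members, the other half with 60% of N1) can be sketched in Python as follows. The function name and the handling of an odd number of clusters are assumptions made for illustration. The sketch also shows that the imbalanced design has a smaller total sample, which is one plausible contributor to its lower power; this is offered as an interpretation rather than a finding tested in the study.</p>
<preformat><![CDATA[
```python
def cluster_sizes(n2, n1, balanced=True):
    """Return a list with the number of within-level units in each cluster.

    Balanced: all n2 clusters have n1 members.
    Imbalanced: half of the clusters have n1 members and the rest have
    60% of n1 (assumption: with odd n2, the extra cluster is a reduced one).
    """
    if balanced:
        return [n1] * n2
    n_full = n2 // 2
    reduced_size = n1 * 60 // 100  # integer arithmetic avoids float rounding
    return [n1] * n_full + [reduced_size] * (n2 - n_full)


# Total sample sizes for the smallest studied condition (N2 = 10, N1 = 25).
total_balanced = sum(cluster_sizes(10, 25, balanced=True))
total_imbalanced = sum(cluster_sizes(10, 25, balanced=False))
```
]]></preformat>
<p>With N2 = 10 and N1 = 25, this gives 250 examinees in the balanced design and 200 in the imbalanced design.</p>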
</sec>
</sec>
<sec id="S5" sec-type="discussion">
<title>5 Discussion</title>
<p>To achieve equitable measurement, identifying DIF items in an assessment is of paramount importance. While research on DIF detection methods abounds, little is known about how these methods perform in the presence of multilevel (nested) data. Given the prevalence of nested data in the social sciences (e.g., students are nested within schools; employees are nested within companies/industries), it is important to consider the multilevel structure of data when conducting DIF analysis and to evaluate the consequences of applying single-level methods when data are multilevel. Thus, the current study examined the performance of four DIF detection methods in their ability to appropriately identify DIF items when data are nested. Specifically, we considered two methods that directly allow for modeling of nested data within the procedure and two routinely used single-level DIF detection methods in the score- and model-based frameworks. As such, the current study extended the framework of <xref ref-type="bibr" rid="B39">Magis et al. (2010)</xref> and provided important information for practitioners to consider when investigating DIF, in particular regarding their choice of method.</p>
<p>A simulation study was conducted to evaluate the performance of the four DIF detection methods under various conditions when data were generated as multilevel. In addition to the aforementioned observations, which were averaged across factors that yielded nonsignificant main effects and minimal effect sizes, we reflect further on the methods&#x2019; performance. As presented in Figures B1&#x2013;B6 (Type I error rates) and C1&#x2013;C6 (Power rates) under Results in the Supplemental materials, a more nuanced look showed that no one method outperformed the other three across all conditions. For example, the recently proposed multilevel Wald and single-level Lord had similar performance in terms of controlling Type I error rates under or around the 0.05 level across most conditions. These two methods yielded Type I error rates above 0.05 in only two exceptions. Specifically, for multilevel Wald, Type I error rates exceeded the 0.05 level (the rates were around 0.07 and 0.08) in conditions where DIF was modeled as small, N2 = 30 and N1 = 50, with unequal &#x03B8;s, fixed ICC, and an imbalanced N2/N1 ratio. For single-level Lord, Type I error rates of 0.06 and 0.07 were observed only when both N2 and N1 were small (i.e., 10 and 25, respectively), ICC was fixed, the N2/N1 ratio was imbalanced, and DIF was large. Similarly, as shown in the figures under Results in the Supplemental materials, power rates across various conditions tended to be the highest for single-level MH and multilevel MH methods, and differentiation among the methods was largely found in small sample sizes (N2, N1, as well as N2/N1 balanced and imbalanced ratios) when DIF was modeled as small.</p>
<p>In addition to the performance in detecting DIF, it is worth noting that the four studied methods vary in complexity. The two model-based methods, single-level Lord and multilevel Wald, require estimating item parameters, while the two score-based methods rely on summed scores. The multilevel Wald method consists of two stages, with an initial screening stage that uses extended Wald-2, followed by the formal evaluation stage that uses extended Wald-1. This approach is more complex and requires more specific methodological skills than, for example, the more straightforward single-level MH method. Another promising technique for selecting anchor items is Regularized Differential Item Functioning (Reg-DIF; <xref ref-type="bibr" rid="B3">Belzak and Bauer, 2020</xref>), which introduces a penalty function during the estimation process for anchor item selection. This model can be implemented using either frequentist (<xref ref-type="bibr" rid="B40">Magis et al., 2015</xref>; <xref ref-type="bibr" rid="B55">Robitzsch, 2023</xref>) or Bayesian (<xref ref-type="bibr" rid="B16">Chen and Bauer, 2023</xref>) estimation methods. Additionally, <xref ref-type="bibr" rid="B63">Tutz and Schauberger (2015)</xref> proposed a new penalty approach to DIF in Rasch models. Currently, it appears that neither method has been adapted to handle nested data structures. Despite this, we find the Bayesian approach especially promising. This is because Bayesian software, like Stan (<xref ref-type="bibr" rid="B15">Carpenter et al., 2017</xref>), seamlessly integrates with other advanced DIF detection methodologies, such as Moderated Nonlinear Factor Analysis (MNLFA; <xref ref-type="bibr" rid="B1">Bauer and Hussong, 2009</xref>). 
Furthermore, the accessibility of Bayesian approaches to IRT (<xref ref-type="bibr" rid="B20">Fox, 2010</xref>) has been greatly enhanced by <italic>R</italic> packages like <italic>brms</italic> (<xref ref-type="bibr" rid="B8">B&#x00FC;rkner, 2017</xref>), which enable the fitting of complex models with minimal coding effort.</p>
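<p>To illustrate why the score-based approach is comparatively simple, the following Python sketch computes the standard single-level Mantel-Haenszel chi-square (with continuity correction), the common odds ratio, and the ETS delta statistic from 2 &#x00D7; 2 tables formed at each summed-score level. It is a textbook version of the single-level statistic under assumed inputs, not the implementation used in this study and not the multilevel adjustment; the function name and the example counts are hypothetical.</p>
<preformat><![CDATA[
```python
import math


def mantel_haenszel(strata):
    """Single-level Mantel-Haenszel DIF statistics for one item.

    strata: iterable of (a, b, c, d) counts, one tuple per summed-score level:
        a = reference correct, b = reference incorrect,
        c = focal correct,     d = focal incorrect.
    Returns (chi_square, common_odds_ratio, delta_mh).
    No guard is included for degenerate data (zero variance or zero odds).
    """
    sum_a = sum_expected = sum_variance = 0.0
    numerator = denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum this small carries no information
        sum_a += a
        # Expected count and variance of a under the no-DIF hypothesis.
        sum_expected += (a + b) * (a + c) / n
        sum_variance += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
        numerator += a * d / n
        denominator += b * c / n
    chi_square = (abs(sum_a - sum_expected) - 0.5) ** 2 / sum_variance
    odds_ratio = numerator / denominator
    delta_mh = -2.35 * math.log(odds_ratio)  # ETS delta metric
    return chi_square, odds_ratio, delta_mh


# Hypothetical counts at two score levels.
chi_square, odds_ratio, delta_mh = mantel_haenszel([(15, 5, 10, 10), (8, 12, 4, 16)])
```
]]></preformat>
<p>An odds ratio above 1 (negative delta) indicates that, among examinees matched on summed score, the item favors the reference group.</p>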
<p>Relatedly, the accessibility of the four studied methods also varies: three of the four methods have been developed and implemented with relatively easy access within <italic>R</italic>, while the multilevel Wald method requires knowledge of flexMIRT. Therefore, given the reasonably good performance of single-level methods when multilevel data are present, we recognize that it might not always be necessary to employ a more complex DIF detection method that accounts for the nested structure.</p>
<p>As with any simulation study, the generalizability of our results and their interpretation is bounded by the choice of simulation conditions. In what follows, we discuss limitations of our study while reflecting on future research directions. One limitation is related to our use of dichotomously scored data and the 2PL model for data generation. Namely, we studied only dichotomous items, which, while prevalent in educational contexts, may limit generalization to contexts where Likert-type or partial credit items are used. It would be important to further study the methods&#x2019; performance with polytomously scored nested data to gain a more complete understanding of the impact of multilevel data on detecting DIF. We note that the <xref ref-type="bibr" rid="B32">Huang and Valdivia (2023)</xref> study, which introduced the multilevel Wald method, examined polytomously scored data, and the authors demonstrated the promise of multilevel Wald in that context. Because item responses were generated based on the 2PL model, the score-based MH methods were relatively disadvantaged since they use unweighted raw scores as the matching criterion. However, we did find that they performed well in many simulation conditions. Additionally, ILSAs such as PISA use the 2PL to calibrate binary scored items, further motivating our study design choices. Second, we encountered some convergence issues, which should be further studied. As noted in Appendix D, three of the four methods had high levels of convergence, with over 99% of replications within conditions converging. The lowest convergence rate, with an overall average of just over 82%, was found for multilevel MH. While convergence was not an issue in the majority of the conditions, nonconvergence was most pronounced in conditions where N2 = 10 and N1 = 25 with unequal theta, which is something to keep in mind when analyzing data. 
Furthermore, we examined Monte Carlo standard errors (MCSEs) across the study&#x2019;s outcome variables in order to better understand our choice of 100 replications. As noted in Appendix E (Supplemental materials), we computed MCSEs and found them to be stable and comparable in size across the methods (with some variation). We also generated new data for two conditions with 1,000 replications and analyzed results for three of the four methods.<sup><xref ref-type="fn" rid="footnote7">7</xref></sup> The goal was to examine whether the MCSEs would change (possibly decrease) when a much larger number of replications was considered. The first selected condition yielded MCSEs based on 100 replications that were similar to the average MCSEs across the studied methods/conditions (&#x223C;0.05 and 0.36 for Type I error and power, respectively). The second selected condition yielded MCSEs that were more varied across the studied methods (e.g., for Type I error rate, MCSEs ranged from 0.05 to 0.10, and for power, MCSEs ranged from 0.05 to 0.15 across the methods). Both conditions included N2 = 10 and <italic>N</italic> = 25 with fixed ICC, while DIF and &#x03B8; differed between them. As summarized in Appendix C, the results suggested very small changes when replications were increased from 100 to 1,000. Recognizing that we examined only two such conditions, and because it is good practice, we encourage researchers to compute MCSEs when deciding how many replications are needed to achieve stable results, preferably prior to conducting the analysis.</p>
<p>An additional limitation of the current study relates to our DIF-related factors. Because we wanted to establish the impact of nested data structures on DIF detection methods, we focused on a simulation design that emphasized features of the data (e.g., between- and within-level sample sizes, ICC values) rather than features of DIF. In addition to including other choices across the data structure factors, further attention should be given to DIF-related factors. For example, we studied only uniform DIF, and while this is a reasonable choice (e.g., <xref ref-type="bibr" rid="B32">Huang and Valdivia, 2023</xref> found good performance of multilevel Wald in polytomous data for uniform and non-uniform DIF), it would be important to incorporate other DIF features, including non-uniform DIF. In the current study, we assumed that latent proficiency variances were equal across the groups. As <xref ref-type="bibr" rid="B51">Pei and Li (2010)</xref> found, latent proficiency variance had an impact on DIF detection. Thus, future research should consider examining this feature as well. Another aspect of DIF concerns what is known in the literature as within- and between-cluster DIF. Our study focused exclusively on within-cluster DIF, as opposed to between-cluster DIF. In a multilevel data context, within-level DIF is generated at the individual level, whereas between-level DIF is generated at the cluster level. Outside a multilevel data context, most simulation studies generate DIF in a way that is analogous to within-cluster DIF, as DIF effect sizes are typically not moderated by cluster membership. Given this, the present study focused only on within-cluster DIF, as we are most interested in research scenarios where one may reasonably consider well-established single-level methods. 
Future research should focus on between-cluster DIF in the presence of clustered observations, as past research has shown that this is when DIF detection methods explicitly designed for multilevel data structures are most advantageous (per <xref ref-type="bibr" rid="B21">French and Finch, 2010</xref>, <xref ref-type="bibr" rid="B22">2013</xref>, <xref ref-type="bibr" rid="B23">2015</xref>; <xref ref-type="bibr" rid="B24">French et al., 2019</xref>). Additionally, we surveyed only four DIF detection methods, which, for one of the first studies to conduct such a comparison, seems reasonable. However, we recognize that several other options exist. Thus, future researchers studying DIF in the context of nested data might include other methods, such as the aforementioned SIBTEST (<xref ref-type="bibr" rid="B59">Shealy and Stout, 1993</xref>), hierarchical logistic regression, or Bayesian approaches.</p>
<p>The current study provides important information to help practitioners select a DIF detection method. We recognize that aspects of the design (such as sample size) play an important role, which a researcher should consider when gathering validity evidence for generalization through DIF detection analysis. For example, having a larger sample size at the between level (N2) was shown to be advantageous in detecting DIF items (i.e., power rates were generally higher for conditions where N2 increased while N1 was held constant, compared to analogous conditions where N1 increased while N2 was held constant). This suggests that, when designing a study, a researcher might consider having a larger sample size at the between level. We further observed that the ICC levels we investigated did not seem to have a large impact on the results, which might partially explain why the single-level methods also performed quite well. Given that no single method outperformed the rest in the simulation, we recommend that researchers consider the data structure, along with the accessibility and complexity of the methods and their own knowledge of them, when selecting any DIF detection method.</p>
<p>Our recommendation for applied researchers regarding which method to use when studying DIF is somewhat complex. Given the results, we cannot provide a blanket recommendation in favor of one method over another, as their performance depended on context. For example, when the focus is on detecting only large DIF effects, most methods, except the multilevel MH, exhibited sufficient power and displayed appropriate Type I error rates. When sample sizes were larger, however, multilevel MH performed particularly well (as did the other methods), yielding high power rates across conditions. When small DIF effects were modeled, our results suggested that a more nuanced set of recommendations is warranted. Namely, when the number of between-level units is small, the single-level MH may be the best choice, as it (along with the multilevel MH) had the highest power while maintaining an acceptable Type I error rate. On the other hand, when the number of between-level units is large, either the single-level Lord or the multilevel Wald is preferable, as both maintain adequate power and Type I error rates.</p>
</sec>
<sec id="S6" sec-type="data-availability">
<title>Data availability statement</title>
<p>The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: <ext-link ext-link-type="uri" xlink:href="https://osf.io/96j3g/?view_only=49e6378ac0da4b4ba78b9f17949aa1c2">https://osf.io/96j3g/?view_only=49e6378ac0da4b4ba78b9f17949aa1c2</ext-link>.</p>
</sec>
<sec id="S7" sec-type="author-contributions">
<title>Author contributions</title>
<p>DSV: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Supervision, Validation, Visualization, Writing &#x2013; original draft, Writing &#x2013; review &#x0026; editing. SH: Conceptualization, Formal analysis, Methodology, Software, Writing &#x2013; original draft, Writing &#x2013; review &#x0026; editing. PB: Formal analysis, Resources, Software, Writing &#x2013; original draft, Writing &#x2013; review &#x0026; editing.</p>
</sec>
</body>
<back>
<sec id="S8" sec-type="funding-information">
<title>Funding</title>
<p>The authors declare that financial support was received for the research, authorship, and/or publication of this article. This work was partially supported by a grant to DSV: Indiana University Institute for Advanced Study, Indiana University&#x2014;Bloomington, IN, USA. Support for open access publication charges was provided by IU Libraries.</p>
</sec>
<sec id="S9" sec-type="COI-statement">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The authors declared that they were an editorial board member of Frontiers, at the time of submission. This had no impact on the peer review process and the final decision.</p>
</sec>
<sec id="S10" sec-type="disclaimer">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<fn-group>
<fn id="footnote1">
<label>1</label>
<p>It is possible to study more than two groups in DIF analysis. In those situations, the analyst selects one reference group, and the remaining groups are referred to as focal groups.</p></fn>
<fn id="footnote2">
<label>2</label>
<p>The score statistic variance, which accounts for the multilevel nature of the data, was calculated using generalized estimating equations&#x2014;a method that corrects for clustered data commonly found in fields such as medicine, biology, and epidemiology (<xref ref-type="bibr" rid="B44">McNeish et al., 2017</xref>).</p></fn>
<fn id="footnote3">
<label>3</label>
<p><ext-link ext-link-type="uri" xlink:href="https://osf.io/96j3g/?view_only=49e6378ac0da4b4ba78b9f17949aa1c2">https://osf.io/96j3g/?view_only=49e6378ac0da4b4ba78b9f17949aa1c2</ext-link></p></fn>
<fn id="footnote4">
<label>4</label>
<p>We also note that <xref ref-type="bibr" rid="B32">Huang and Valdivia (2023)</xref> found that the ML-Wald method yielded better results (higher power rates and controlled Type I error rates) in detecting non-uniform DIF than uniform DIF in polytomous multilevel data, thus motivating us to consider uniform DIF only.</p></fn>
<fn id="footnote5">
<label>5</label>
<p>We recognize that the choice of 60% to create imbalance is somewhat arbitrary and that other choices are possible. Studies such as <xref ref-type="bibr" rid="B24">French et al. (2019)</xref> included balanced cases; thus, our efforts here are intended to provide initial insights into sample size imbalance.</p></fn>
<fn id="footnote6">
<label>6</label>
<p>Stated differently, we examined ICCs that were either equal in value (0.33) between the reference and focal groups or varied. In the varied case, the ICC for the reference group was set at 0.33, while the ICC for the focal group was set at either 0.20 or 0.50.</p></fn>
<fn id="footnote7">
<label>7</label>
<p>Owing to computational time, we computed MCSEs for three of the four methods (all but the multilevel Wald method). Given the similarity of MCSEs between 100 and 1,000 replications for each of the three studied methods, we would expect multilevel Wald results based on 1,000 replications to be very consistent as well.</p></fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="B1"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bauer</surname> <given-names>D. J.</given-names></name> <name><surname>Hussong</surname> <given-names>A. M.</given-names></name></person-group> (<year>2009</year>). <article-title>Psychometric approaches for developing commensurate measures across independent studies: Traditional and new models.</article-title> <source><italic>Psychol. Methods</italic></source> <volume>14</volume> <fpage>101</fpage>&#x2013;<lpage>125</lpage>. <pub-id pub-id-type="doi">10.1037/a0015583</pub-id> <pub-id pub-id-type="pmid">19485624</pub-id></citation></ref>
<ref id="B2"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Begg</surname> <given-names>M. D.</given-names></name></person-group> (<year>1999</year>). <article-title>Analyzing k (2&#x00D7;2) tables under cluster sampling.</article-title> <source><italic>Biometrics</italic></source> <volume>55</volume> <fpage>302</fpage>&#x2013;<lpage>307</lpage>. <pub-id pub-id-type="doi">10.1111/j.0006-341X.1999.00302.x</pub-id> <pub-id pub-id-type="pmid">11318173</pub-id></citation></ref>
<ref id="B3"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Belzak</surname> <given-names>W. C. M.</given-names></name> <name><surname>Bauer</surname> <given-names>D. J.</given-names></name></person-group> (<year>2020</year>). <article-title>Improving the assessment of measurement invariance: Using regularization to select anchor items and identify differential item functioning.</article-title> <source><italic>Psychol. Methods</italic></source> <volume>25</volume> <fpage>673</fpage>&#x2013;<lpage>690</lpage>. <pub-id pub-id-type="doi">10.1037/met0000253</pub-id> <pub-id pub-id-type="pmid">31916799</pub-id></citation></ref>
<ref id="B4"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Berr&#x00ED;o</surname> <given-names>&#x00C1;I.</given-names></name> <name><surname>Gomez-Benito</surname> <given-names>J.</given-names></name> <name><surname>Arias-Pati&#x00F1;o</surname> <given-names>E. M.</given-names></name></person-group> (<year>2020</year>). <article-title>Developments and trends in research on methods of detecting differential item functioning.</article-title> <source><italic>Educ. Res. Rev.</italic></source> <volume>31</volume>:<issue>100340</issue>. <pub-id pub-id-type="doi">10.1016/j.edurev.2020.100340</pub-id></citation></ref>
<ref id="B5"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bock</surname> <given-names>R. D.</given-names></name> <name><surname>Gibbons</surname> <given-names>R. D.</given-names></name></person-group> (<year>2021</year>). <source><italic>Item response theory.</italic></source> <publisher-loc>Hoboken, NJ</publisher-loc>: <publisher-name>John Wiley &#x0026; Sons</publisher-name>.</citation></ref>
<ref id="B6"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bock</surname> <given-names>R. D.</given-names></name> <name><surname>Zimowski</surname> <given-names>M. F.</given-names></name></person-group> (<year>1997</year>). <source><italic>Multiple group IRT. Handbook of modern item response theory.</italic></source> <publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer New York</publisher-name>, <fpage>433</fpage>&#x2013;<lpage>448</lpage>.</citation></ref>
<ref id="B7"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bou Malham</surname> <given-names>P.</given-names></name> <name><surname>Saucier</surname> <given-names>G.</given-names></name></person-group> (<year>2014</year>). <article-title>Measurement invariance of social axioms in 23 countries.</article-title> <source><italic>J. Cross Cult. Psychol.</italic></source> <volume>45</volume> <fpage>1046</fpage>&#x2013;<lpage>1060</lpage>.</citation></ref>
<ref id="B8"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>B&#x00FC;rkner</surname> <given-names>P. C.</given-names></name></person-group> (<year>2017</year>). <article-title>brms: An R package for Bayesian multilevel models using Stan.</article-title> <source><italic>J. Stat. Softw.</italic></source> <volume>80</volume> <fpage>1</fpage>&#x2013;<lpage>28</lpage>. <pub-id pub-id-type="doi">10.18637/jss.v080.i01</pub-id></citation></ref>
<ref id="B9"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cai</surname> <given-names>L.</given-names></name></person-group> (<year>2008</year>). <source><italic>A Metropolis-Hastings Robbins-Monro algorithm for maximum likelihood nonlinear latent structure analysis with a comprehensive measurement model</italic></source>. <comment>Ph.D. thesis</comment>. <publisher-loc>Chapel Hill, NC</publisher-loc>: <publisher-name>The University of North Carolina at Chapel Hill</publisher-name>.</citation></ref>
<ref id="B10"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cai</surname> <given-names>L.</given-names></name></person-group> (<year>2010a</year>). <article-title>High-dimensional exploratory item factor analysis by a Metropolis-Hastings Robbins-Monro Algorithm</article-title>. <source><italic>Psychometrika</italic></source> <volume>75</volume>, <fpage>33</fpage>&#x2013;<lpage>57</lpage>.</citation></ref>
<ref id="B11"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cai</surname> <given-names>L.</given-names></name></person-group> (<year>2010b</year>). <article-title>Metropolis-Hastings Robbins-Monro algorithm for confirmatory item factor analysis</article-title>. <source><italic>J. Educ. Behav. Stat</italic></source>. <volume>35</volume>, <fpage>307</fpage>&#x2013;<lpage>335</lpage>.</citation></ref>
<ref id="B12"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cai</surname> <given-names>L.</given-names></name></person-group> (<year>2017</year>). <source><italic>Flexible multilevel multidimensional item analysis and test scoring [computer software]; flexMIRT&#x00AE; version 3.51.</italic></source> <publisher-loc>Chapel Hill, NC</publisher-loc>: <publisher-name>Vector Psychometric Group</publisher-name>.</citation></ref>
<ref id="B13"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cai</surname> <given-names>L.</given-names></name> <name><surname>Thissen</surname> <given-names>D.</given-names></name> <name><surname>du Toit</surname> <given-names>S. H. C.</given-names></name></person-group> (<year>2011</year>). <source><italic>IRTPRO: Flexible, multidimensional, multiple categorical IRT modeling [computer software].</italic></source> <publisher-loc>Lincolnwood, IL</publisher-loc>: <publisher-name>Scientific Software International</publisher-name>.</citation></ref>
<ref id="B14"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Candell</surname> <given-names>G. L.</given-names></name> <name><surname>Drasgow</surname> <given-names>F.</given-names></name></person-group> (<year>1988</year>). <article-title>An iterative procedure for linking metrics and assessing item bias in item response theory</article-title>. <source><italic>Appl. Psychol. Meas</italic></source>. <volume>12</volume>, <fpage>253</fpage>&#x2013;<lpage>260</lpage>. <pub-id pub-id-type="doi">10.1177/014662168801200304</pub-id></citation></ref>
<ref id="B15"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Carpenter</surname> <given-names>B.</given-names></name> <name><surname>Gelman</surname> <given-names>A.</given-names></name> <name><surname>Hoffman</surname> <given-names>M. D.</given-names></name> <name><surname>Lee</surname> <given-names>D.</given-names></name> <name><surname>Goodrich</surname> <given-names>B.</given-names></name> <name><surname>Betancourt</surname> <given-names>M.</given-names></name><etal/></person-group> (<year>2017</year>). <article-title>Stan: A probabilistic programming language.</article-title> <source><italic>J. Stat. Softw.</italic></source> <volume>76</volume> <fpage>1</fpage>&#x2013;<lpage>32</lpage>. <pub-id pub-id-type="doi">10.18637/jss.v076.i01</pub-id> <pub-id pub-id-type="pmid">36568334</pub-id></citation></ref>
<ref id="B16"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>S. M.</given-names></name> <name><surname>Bauer</surname> <given-names>D. J.</given-names></name></person-group> (<year>2023</year>). <article-title>Modeling growth in the presence of changing measurement properties between persons and within persons over time: A Bayesian regularized second-order growth curve model</article-title>. <source><italic>Multiv. Behav. Res</italic></source>. <volume>58</volume>, <fpage>150</fpage>&#x2013;<lpage>151</lpage>. <pub-id pub-id-type="doi">10.1080/00273171.2022.2160955</pub-id> <pub-id pub-id-type="pmid">36622866</pub-id></citation></ref>
<ref id="B17"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cook</surname> <given-names>L. L.</given-names></name> <name><surname>Eignor</surname> <given-names>D. R.</given-names></name></person-group> (<year>1991</year>). <article-title>IRT equating methods.</article-title> <source><italic>Educ. Meas. Issues Pract.</italic></source> <volume>10</volume> <fpage>37</fpage>&#x2013;<lpage>45</lpage>. <pub-id pub-id-type="doi">10.1111/j.1745-3992.1991.tb00207.x</pub-id></citation></ref>
<ref id="B18"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dai</surname> <given-names>S.</given-names></name> <name><surname>French</surname> <given-names>B. F.</given-names></name> <name><surname>Finch</surname> <given-names>W. H.</given-names></name> <name><surname>Iverson</surname> <given-names>A.</given-names></name> <name><surname>Dai</surname> <given-names>M. S.</given-names></name></person-group> (<year>2022</year>). <source><italic>Package &#x2018;DIFplus&#x2019;. R package version 1.1.</italic></source> Available online at: <ext-link ext-link-type="uri" xlink:href="https://CRAN.R-project.org/package=DIFplus">https://CRAN.R-project.org/package=DIFplus</ext-link> <comment>(accessed October 12, 2023)</comment>.</citation></ref>
<ref id="B19"><citation citation-type="journal"><collab>Organisation for Economic Co-operation and Development</collab> (<year>2010</year>). <source><italic>TALIS technical report.</italic></source> <publisher-loc>Paris</publisher-loc>: <publisher-name>Organisation for Economic Co-operation and Development</publisher-name>.</citation></ref>
<ref id="B20"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fox</surname> <given-names>J. P.</given-names></name></person-group> (<year>2010</year>). <source><italic>Bayesian item response modeling: Theory and applications.</italic></source> <publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer</publisher-name>.</citation></ref>
<ref id="B21"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>French</surname> <given-names>B. F.</given-names></name> <name><surname>Finch</surname> <given-names>W. H.</given-names></name></person-group> (<year>2010</year>). <article-title>Hierarchical logistic regression: Accounting for multilevel data in DIF detection.</article-title> <source><italic>J. Educ. Meas.</italic></source> <volume>47</volume> <fpage>299</fpage>&#x2013;<lpage>317</lpage>. <pub-id pub-id-type="doi">10.1111/j.1745-3984.2010.00115.x</pub-id></citation></ref>
<ref id="B22"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>French</surname> <given-names>B. F.</given-names></name> <name><surname>Finch</surname> <given-names>W. H.</given-names></name></person-group> (<year>2013</year>). <article-title>Extensions of Mantel&#x2013;Haenszel for multilevel DIF detection.</article-title> <source><italic>Educ. Psychol. Meas.</italic></source> <volume>73</volume> <fpage>648</fpage>&#x2013;<lpage>671</lpage>. <pub-id pub-id-type="doi">10.1177/0013164412472341</pub-id></citation></ref>
<ref id="B23"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>French</surname> <given-names>B. F.</given-names></name> <name><surname>Finch</surname> <given-names>W. H.</given-names></name></person-group> (<year>2015</year>). <article-title>Transforming SIBTEST to account for multilevel data structures.</article-title> <source><italic>J. Educ. Meas.</italic></source> <volume>52</volume> <fpage>159</fpage>&#x2013;<lpage>180</lpage>.</citation></ref>
<ref id="B24"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>French</surname> <given-names>B. F.</given-names></name> <name><surname>Finch</surname> <given-names>W. H.</given-names></name> <name><surname>Immekus</surname> <given-names>J. C.</given-names></name></person-group> (<year>2019</year>). <article-title>Multilevel generalized Mantel-Haenszel for differential item functioning detection.</article-title> <source><italic>Front. Educ.</italic></source> <volume>4</volume>:<issue>47</issue>. <pub-id pub-id-type="doi">10.3389/feduc.2019.00047</pub-id></citation></ref>
<ref id="B25"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>French</surname> <given-names>B. F.</given-names></name> <name><surname>Finch</surname> <given-names>W. H.</given-names></name> <name><surname>Vazquez</surname> <given-names>J. A. V.</given-names></name></person-group> (<year>2016</year>). <article-title>Differential item functioning on mathematics items using multilevel SIBTEST.</article-title> <source><italic>Psychol. Test Assess. Model.</italic></source> <volume>58</volume>:<issue>471</issue>.</citation></ref>
<ref id="B26"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gao</surname> <given-names>X.</given-names></name></person-group> (<year>2019</year>). <source><italic>A comparison of six DIF detection methods</italic></source>. <comment>Master&#x2019;s thesis</comment>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://opencommons.uconn.edu/gs_theses/1411">https://opencommons.uconn.edu/gs_theses/1411</ext-link></citation></ref>
<ref id="B27"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Guilera</surname> <given-names>G.</given-names></name> <name><surname>G&#x00F3;mez-Benito</surname> <given-names>J.</given-names></name> <name><surname>Hidalgo</surname> <given-names>M. D.</given-names></name> <name><surname>S&#x00E1;nchez-Meca</surname> <given-names>J.</given-names></name></person-group> (<year>2013</year>). <article-title>Type I error and statistical power of the Mantel-Haenszel procedure for detecting DIF: A meta-analysis.</article-title> <source><italic>Psychol. Methods</italic></source> <volume>18</volume>:<issue>553</issue>. <pub-id pub-id-type="doi">10.1037/a0034306</pub-id> <pub-id pub-id-type="pmid">24127986</pub-id></citation></ref>
<ref id="B28"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hagger</surname> <given-names>M.</given-names></name> <name><surname>Biddle</surname> <given-names>S.</given-names></name> <name><surname>Chow</surname> <given-names>E.</given-names></name> <name><surname>Stambulova</surname> <given-names>N.</given-names></name> <name><surname>Kavussanu</surname> <given-names>M.</given-names></name></person-group> (<year>2003</year>). <article-title>Physical self-perceptions in adolescence: Generalizability of a hierarchical multidimensional model across three cultures.</article-title> <source><italic>J. Cross Cult. Psychol.</italic></source> <volume>34</volume> <fpage>611</fpage>&#x2013;<lpage>628</lpage>.</citation></ref>
<ref id="B29"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hansen</surname> <given-names>M.</given-names></name> <name><surname>Cai</surname> <given-names>L.</given-names></name> <name><surname>Stucky</surname> <given-names>B. D.</given-names></name> <name><surname>Tucker</surname> <given-names>J. S.</given-names></name> <name><surname>Shadel</surname> <given-names>W. G.</given-names></name> <name><surname>Edelen</surname> <given-names>M. O.</given-names></name></person-group> (<year>2014</year>). <article-title>Methodology for developing and evaluating the PROMIS&#x00AE; smoking item banks</article-title>. <source><italic>Nicotine Tobacco Res</italic></source>. <volume>16</volume>, <fpage>S175</fpage>&#x2013;<lpage>S189</lpage>.</citation></ref>
<ref id="B30"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Holland</surname> <given-names>P. W.</given-names></name> <name><surname>Thayer</surname> <given-names>D. T.</given-names></name></person-group> (<year>1988</year>). &#x201C;<article-title>Differential item performance and the Mantel-Haenszel procedure</article-title>,&#x201D; in <source><italic>Test validity</italic></source>, <role>eds</role> <person-group person-group-type="editor"><name><surname>Wainer</surname> <given-names>H.</given-names></name> <name><surname>Braun</surname> <given-names>H. I.</given-names></name></person-group> (<publisher-name>Lawrence Erlbaum Associates, Inc</publisher-name>), <fpage>129</fpage>&#x2013;<lpage>145</lpage>.</citation></ref>
<ref id="B31"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Holland</surname> <given-names>P. W.</given-names></name> <name><surname>Wainer</surname> <given-names>H.</given-names></name></person-group> (<year>2012</year>). <source><italic>Differential item functioning.</italic></source> <publisher-loc>New York, NY</publisher-loc>: <publisher-name>Routledge</publisher-name>.</citation></ref>
<ref id="B32"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>S.</given-names></name> <name><surname>Valdivia</surname> <given-names>D. S.</given-names></name></person-group> (<year>2023</year>). <article-title>Wald &#x03C7;2 test for differential item functioning detection with polytomous items in multilevel data.</article-title> <source><italic>Educ. Psychol. Meas.</italic></source> <pub-id pub-id-type="doi">10.1177/00131644231181688</pub-id></citation></ref>
<ref id="B33"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jin</surname> <given-names>Y.</given-names></name> <name><surname>Myers</surname> <given-names>N. D.</given-names></name> <name><surname>Ahn</surname> <given-names>S.</given-names></name></person-group> (<year>2014</year>). <article-title>Complex versus simple modeling for DIF detection: When the intraclass correlation coefficient (r) of the studied item is less than the r of the Total score.</article-title> <source><italic>Educ. Psychol. Meas.</italic></source> <volume>74</volume> <fpage>163</fpage>&#x2013;<lpage>190</lpage>. <pub-id pub-id-type="doi">10.1177/0013164413497572</pub-id></citation></ref>
<ref id="B34"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Joo</surname> <given-names>S.</given-names></name> <name><surname>Valdivia</surname> <given-names>M.</given-names></name> <name><surname>Valdivia</surname> <given-names>D. S.</given-names></name> <name><surname>Rutkowski</surname> <given-names>L.</given-names></name></person-group> (<year>2023</year>). <article-title>Alternatives to weighted item fit statistics for establishing measurement invariance in many groups</article-title>. <source><italic>J. Educ. Behav. Stat</italic></source>. <pub-id pub-id-type="doi">10.3102/10769986231183326</pub-id> <pub-id pub-id-type="pmid">38293548</pub-id></citation></ref>
<ref id="B35"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>J&#x00F6;reskog</surname> <given-names>K. G.</given-names></name></person-group> (<year>1971</year>). <article-title>Simultaneous factor analysis in several populations.</article-title> <source><italic>Psychometrika</italic></source> <volume>36</volume> <fpage>409</fpage>&#x2013;<lpage>426</lpage>. <pub-id pub-id-type="doi">10.1007/BF02291366</pub-id></citation></ref>
<ref id="B36"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Langer</surname> <given-names>M. M.</given-names></name></person-group> (<year>2008</year>). <source><italic>A reexamination of Lord&#x2019;s Wald test for differential item functioning using item response theory and modern error estimation</italic></source>. <comment>Ph.D. thesis</comment>. <publisher-loc>Chapel Hill, NC</publisher-loc>: <publisher-name>The University of North Carolina at Chapel Hill</publisher-name>.</citation></ref>
<ref id="B37"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>X.</given-names></name></person-group> (<year>2024</year>). <article-title>Detecting differential item functioning with multiple causes: A comparison of three methods.</article-title> <source><italic>Int. J. Test.</italic></source> <volume>24</volume> <fpage>53</fpage>&#x2013;<lpage>59</lpage>. <pub-id pub-id-type="doi">10.1080/15305058.2023.2286381</pub-id></citation></ref>
<ref id="B38"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lord</surname> <given-names>F. M.</given-names></name></person-group> (<year>1980</year>). <source><italic>Applications of item response theory to practical testing problems</italic></source>. <publisher-name>Lawrence Erlbaum</publisher-name>.</citation></ref>
<ref id="B39"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Magis</surname> <given-names>D.</given-names></name> <name><surname>Beland</surname> <given-names>S.</given-names></name> <name><surname>Tuerlinckx</surname> <given-names>F.</given-names></name> <name><surname>De Boeck</surname> <given-names>P.</given-names></name></person-group> (<year>2010</year>). <article-title>A general framework and an R package for the detection of dichotomous differential item functioning.</article-title> <source><italic>Behav. Res. Methods</italic></source> <volume>42</volume> <fpage>847</fpage>&#x2013;<lpage>862</lpage>. <pub-id pub-id-type="doi">10.3758/BRM.42.3.847</pub-id> <pub-id pub-id-type="pmid">20805607</pub-id></citation></ref>
<ref id="B40"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Magis</surname> <given-names>D.</given-names></name> <name><surname>Tuerlinckx</surname> <given-names>F.</given-names></name> <name><surname>De Boeck</surname> <given-names>P.</given-names></name></person-group> (<year>2015</year>). <article-title>Detection of differential item functioning using the lasso approach.</article-title> <source><italic>J. Educ. Behav. Stat.</italic></source> <volume>40</volume> <fpage>111</fpage>&#x2013;<lpage>135</lpage>. <pub-id pub-id-type="doi">10.3102/1076998614559747</pub-id></citation></ref>
<ref id="B41"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mantel</surname> <given-names>N.</given-names></name> <name><surname>Haenszel</surname> <given-names>W.</given-names></name></person-group> (<year>1959</year>). <article-title>Statistical aspects of the analysis of data from retrospective studies of disease.</article-title> <source><italic>J. Natl. Cancer Inst.</italic></source> <volume>22</volume> <fpage>719</fpage>&#x2013;<lpage>748</lpage>. <pub-id pub-id-type="doi">10.1093/jnci/22.4.719</pub-id></citation></ref>
<ref id="B42"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Marsh</surname> <given-names>H. W.</given-names></name> <name><surname>Abduljabbar</surname> <given-names>A. S.</given-names></name> <name><surname>Morin</surname> <given-names>A. J. S.</given-names></name> <name><surname>Parker</surname> <given-names>P.</given-names></name> <name><surname>Abdelfattah</surname> <given-names>F.</given-names></name> <name><surname>Nagengast</surname> <given-names>B.</given-names></name><etal/></person-group> (<year>2015</year>). <article-title>The big-fish-little-pond effect: Generalizability of social comparison processes over two age cohorts from Western, Asian, and Middle Eastern Islamic countries.</article-title> <source><italic>J. Educ. Psychol.</italic></source> <volume>107</volume> <fpage>258</fpage>&#x2013;<lpage>271</lpage>. <pub-id pub-id-type="doi">10.1037/a0037485</pub-id></citation></ref>
<ref id="B43"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Marsh</surname> <given-names>H. W.</given-names></name> <name><surname>L&#x00FC;dtke</surname> <given-names>O.</given-names></name> <name><surname>Nagengast</surname> <given-names>B.</given-names></name> <name><surname>Trautwein</surname> <given-names>U.</given-names></name> <name><surname>Morin</surname> <given-names>A. J.</given-names></name> <name><surname>Abduljabbar</surname> <given-names>A. S.</given-names></name><etal/></person-group> (<year>2012</year>). <article-title>Classroom climate and contextual effects: Conceptual and methodological issues in the evaluation of group-level effects.</article-title> <source><italic>Educ. Psychol.</italic></source> <volume>47</volume> <fpage>106</fpage>&#x2013;<lpage>124</lpage>. <pub-id pub-id-type="doi">10.1080/00461520.2012.670488</pub-id></citation></ref>
<ref id="B44"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>McNeish</surname> <given-names>D.</given-names></name> <name><surname>Stapleton</surname> <given-names>L. M.</given-names></name> <name><surname>Silverman</surname> <given-names>R. D.</given-names></name></person-group> (<year>2017</year>). <article-title>On the unnecessary ubiquity of hierarchical linear modeling.</article-title> <source><italic>Psychol. Methods</italic></source> <volume>22</volume> <fpage>114</fpage>&#x2013;<lpage>140</lpage>. <pub-id pub-id-type="doi">10.1037/met0000078</pub-id> <pub-id pub-id-type="pmid">27149401</pub-id></citation></ref>
<ref id="B45"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Megreya</surname> <given-names>A. M.</given-names></name> <name><surname>Latzman</surname> <given-names>R. D.</given-names></name> <name><surname>Al-Attiyah</surname> <given-names>A. A.</given-names></name> <name><surname>Alrashidi</surname> <given-names>M.</given-names></name></person-group> (<year>2016</year>). <article-title>The robustness of the nine-factor structure of the Cognitive Emotion Regulation Questionnaire across four Arabic speaking Middle Eastern countries.</article-title> <source><italic>J. Cross-Cult. Psychol.</italic></source> <volume>47</volume> <fpage>875</fpage>&#x2013;<lpage>890</lpage>. <pub-id pub-id-type="doi">10.1177/0022022116644785</pub-id></citation></ref>
<ref id="B46"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Meredith</surname> <given-names>W.</given-names></name></person-group> (<year>1993</year>). <article-title>Measurement invariance, factor analysis and factorial invariance.</article-title> <source><italic>Psychometrika</italic></source> <volume>58</volume> <fpage>525</fpage>&#x2013;<lpage>543</lpage>. <pub-id pub-id-type="doi">10.1007/BF02294825</pub-id></citation></ref>
<ref id="B47"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Muth&#x00E9;n</surname> <given-names>B. O.</given-names></name></person-group> (<year>1994</year>). <article-title>Multilevel covariance structure analysis.</article-title> <source><italic>Sociol. Methods Res.</italic></source> <volume>22</volume> <fpage>376</fpage>&#x2013;<lpage>398</lpage>.</citation></ref>
<ref id="B48"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Narayanan</surname> <given-names>P.</given-names></name> <name><surname>Swaminathan</surname> <given-names>H.</given-names></name></person-group> (<year>1996</year>). <article-title>Identification of items that show nonuniform DIF.</article-title> <source><italic>Appl. Psychol. Meas.</italic></source> <volume>20</volume> <fpage>257</fpage>&#x2013;<lpage>274</lpage>. <pub-id pub-id-type="doi">10.1177/014662169602000306</pub-id></citation></ref>
<ref id="B49"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Olson</surname> <given-names>J.</given-names></name> <name><surname>Martin</surname> <given-names>M. O.</given-names></name> <name><surname>Mullis</surname> <given-names>I. V. S.</given-names></name></person-group> (<year>2008</year>). <source><italic>TIMSS 2007 technical report.</italic></source> <publisher-loc>Chestnut Hill, MA</publisher-loc>: <publisher-name>TIMSS &#x0026; PIRLS International Study Center, Boston College</publisher-name>.</citation></ref>
<ref id="B50"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ozel</surname> <given-names>M.</given-names></name> <name><surname>Caglak</surname> <given-names>S.</given-names></name> <name><surname>Erdogan</surname> <given-names>M.</given-names></name></person-group> (<year>2013</year>). <article-title>Are affective factors a good predictor of science achievement? Examining the role of affective factors based on PISA 2006.</article-title> <source><italic>Learn. Individ. Differ.</italic></source> <volume>24</volume> <fpage>73</fpage>&#x2013;<lpage>82</lpage>. <pub-id pub-id-type="doi">10.1016/j.lindif.2012.09.006</pub-id></citation></ref>
<ref id="B51"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pei</surname> <given-names>L. K.</given-names></name> <name><surname>Li</surname> <given-names>J.</given-names></name></person-group> (<year>2010</year>). <article-title>Effects of unequal ability variances on the performance of logistic regression, Mantel-Haenszel, SIBTEST IRT, and IRT likelihood ratio for DIF detection.</article-title> <source><italic>Appl. Psychol. Meas.</italic></source> <volume>34</volume> <fpage>453</fpage>&#x2013;<lpage>456</lpage>.</citation></ref>
<ref id="B52"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Penfield</surname> <given-names>R. D.</given-names></name></person-group> (<year>2001</year>). <article-title>Assessing differential item functioning among multiple groups: A comparison of three Mantel-Haenszel procedures.</article-title> <source><italic>Appl. Meas. Educ.</italic></source> <volume>14</volume> <fpage>235</fpage>&#x2013;<lpage>259</lpage>. <pub-id pub-id-type="doi">10.1207/S15324818AME1403_3</pub-id></citation></ref>
<ref id="B53"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Peugh</surname> <given-names>J. L.</given-names></name></person-group> (<year>2010</year>). <article-title>A practical guide to multilevel modeling.</article-title> <source><italic>J. Sch. Psychol.</italic></source> <volume>48</volume> <fpage>85</fpage>&#x2013;<lpage>112</lpage>. <pub-id pub-id-type="doi">10.1016/j.jsp.2009.09.002</pub-id> <pub-id pub-id-type="pmid">20006989</pub-id></citation></ref>
<ref id="B54"><citation citation-type="journal"><collab>R Core Team</collab> (<year>2023</year>). <source><italic>R: A language and environment for statistical computing.</italic></source> <publisher-loc>Vienna</publisher-loc>: <publisher-name>R Foundation for Statistical Computing</publisher-name>.</citation></ref>
<ref id="B55"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Robitzsch</surname> <given-names>A.</given-names></name></person-group> (<year>2023</year>). <article-title>Comparing robust linking and regularized estimation for linking two groups in the 1PL and 2PL models in the presence of sparse uniform differential item functioning.</article-title> <source><italic>Stats</italic></source> <volume>6</volume> <fpage>192</fpage>&#x2013;<lpage>208</lpage>. <pub-id pub-id-type="doi">10.3390/stats6010012</pub-id></citation></ref>
<ref id="B56"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Roussos</surname> <given-names>L. A.</given-names></name> <name><surname>Schnipke</surname> <given-names>D. L.</given-names></name> <name><surname>Pashley</surname> <given-names>P. J.</given-names></name></person-group> (<year>1999</year>). <article-title>A generalized formula for the Mantel-Haenszel differential item functioning parameter.</article-title> <source><italic>J. Educ. Behav. Stat.</italic></source> <volume>24</volume> <fpage>293</fpage>&#x2013;<lpage>322</lpage>. <pub-id pub-id-type="doi">10.3102/10769986024003293</pub-id></citation></ref>
<ref id="B57"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rubin</surname> <given-names>D. B.</given-names></name></person-group> (<year>1981</year>). <article-title>Estimation in parallel randomized experiments.</article-title> <source><italic>J. Educ. Stat.</italic></source> <volume>6</volume> <fpage>377</fpage>&#x2013;<lpage>401</lpage>. <pub-id pub-id-type="doi">10.3102/10769986006004377</pub-id></citation></ref>
<ref id="B58"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Segeritz</surname> <given-names>M.</given-names></name> <name><surname>Pant</surname> <given-names>H. A.</given-names></name></person-group> (<year>2013</year>). <article-title>Do they feel the same way about math?: Testing measurement invariance of the PISA &#x201C;students&#x2019; approaches to learning&#x201D; instrument across immigrant groups within Germany.</article-title> <source><italic>Educ. Psychol. Meas.</italic></source> <volume>73</volume> <fpage>601</fpage>&#x2013;<lpage>630</lpage>. <pub-id pub-id-type="doi">10.1177/0013164413481802</pub-id></citation></ref>
<ref id="B59"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shealy</surname> <given-names>R.</given-names></name> <name><surname>Stout</surname> <given-names>W.</given-names></name></person-group> (<year>1993</year>). <article-title>A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF.</article-title> <source><italic>Psychometrika</italic></source> <volume>58</volume> <fpage>159</fpage>&#x2013;<lpage>194</lpage>. <pub-id pub-id-type="doi">10.1007/BF02294572</pub-id></citation></ref>
<ref id="B60"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sulis</surname> <given-names>I.</given-names></name> <name><surname>Toland</surname> <given-names>M. D.</given-names></name></person-group> (<year>2017</year>). <article-title>Introduction to multilevel item response theory analysis: Descriptive and explanatory models.</article-title> <source><italic>J. Early Adolesc.</italic></source> <volume>37</volume> <fpage>85</fpage>&#x2013;<lpage>128</lpage>. <pub-id pub-id-type="doi">10.1177/0272431616642328</pub-id></citation></ref>
<ref id="B61"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Svetina</surname> <given-names>D.</given-names></name> <name><surname>Rutkowski</surname> <given-names>L.</given-names></name></person-group> (<year>2014</year>). <article-title>Detecting differential item functioning using generalized logistic regression in the context of large-scale assessments.</article-title> <source><italic>Large Scale Assess. Educ.</italic></source> <volume>2</volume> <fpage>1</fpage>&#x2013;<lpage>17</lpage>. <pub-id pub-id-type="doi">10.1186/s40536-014-0004-5</pub-id></citation></ref>
<ref id="B62"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Szabo</surname> <given-names>A.</given-names></name> <name><surname>Ward</surname> <given-names>C.</given-names></name> <name><surname>Fletcher</surname> <given-names>G. O.</given-names></name></person-group> (<year>2016</year>). <article-title>Identity processing styles during cultural transition: Construct and measurement.</article-title> <source><italic>J. Cross Cult. Psychol.</italic></source> <volume>47</volume> <fpage>483</fpage>&#x2013;<lpage>507</lpage>. <pub-id pub-id-type="doi">10.1177/0022022116631825</pub-id></citation></ref>
<ref id="B63"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tutz</surname> <given-names>G.</given-names></name> <name><surname>Schauberger</surname> <given-names>G.</given-names></name></person-group> (<year>2015</year>). <article-title>A penalty approach to differential item functioning in Rasch models.</article-title> <source><italic>Psychometrika</italic></source> <volume>80</volume> <fpage>21</fpage>&#x2013;<lpage>43</lpage>. <pub-id pub-id-type="doi">10.1007/s11336-013-9377-6</pub-id> <pub-id pub-id-type="pmid">24297435</pub-id></citation></ref>
<ref id="B64"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yates</surname> <given-names>F.</given-names></name></person-group> (<year>1934</year>). <article-title>Contingency tables involving small numbers and the &#x03C7;<sup>2</sup> test.</article-title> <source><italic>Suppl. J. R. Stat. Soc.</italic></source> <volume>1</volume> <fpage>217</fpage>&#x2013;<lpage>235</lpage>.</citation></ref>
</ref-list>
</back>
</article>