<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Psychol.</journal-id>
<journal-title>Frontiers in Psychology</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Psychol.</abbrev-journal-title>
<issn pub-type="epub">1664-1078</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fpsyg.2021.633896</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Psychology</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Linking of Rasch-Scaled Tests: Consequences of Limited Item Pools and Model Misfit</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Fischer</surname> <given-names>Luise</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1143140/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Rohm</surname> <given-names>Theresa</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Carstensen</surname> <given-names>Claus H.</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/317229/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Gnambs</surname> <given-names>Timo</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/171537/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Leibniz Institute for Educational Trajectories</institution>, <addr-line>Bamberg</addr-line>, <country>Germany</country></aff>
<aff id="aff2"><sup>2</sup><institution>Psychological Methods of Educational Research, University of Bamberg</institution>, <addr-line>Bamberg</addr-line>, <country>Germany</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Pei Sun, Tsinghua University, China</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Ze Lu, McMaster University, Canada; Jorge N. Tendeiro, Hiroshima University, Japan</p></fn>
<corresp id="c001">&#x002A;Correspondence: Timo Gnambs, <email>timo.gnambs@lifbi.de</email></corresp>
<fn fn-type="other" id="fn004"><p>This article was submitted to Educational Psychology, a section of the journal Frontiers in Psychology</p></fn>
</author-notes>
<pub-date pub-type="epub">
<day>06</day>
<month>07</month>
<year>2021</year>
</pub-date>
<pub-date pub-type="collection">
<year>2021</year>
</pub-date>
<volume>12</volume>
<elocation-id>633896</elocation-id>
<history>
<date date-type="received">
<day>26</day>
<month>11</month>
<year>2020</year>
</date>
<date date-type="accepted">
<day>14</day>
<month>06</month>
<year>2021</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x00A9; 2021 Fischer, Rohm, Carstensen and Gnambs.</copyright-statement>
<copyright-year>2021</copyright-year>
<copyright-holder>Fischer, Rohm, Carstensen and Gnambs</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>In the context of item response theory (IRT), linking the scales of two measurement points is a prerequisite to examine a change in competence over time. In educational large-scale assessments, non-identical test forms sharing a number of anchor-items are frequently scaled and linked using two- or three-parameter item response models. However, if item pools are limited and/or sample sizes are small to medium, the sparser Rasch model is a suitable alternative regarding the precision of parameter estimation. As the Rasch model implies stricter assumptions about the response process, a violation of these assumptions may manifest as model misfit in the form of item discrimination parameters empirically deviating from their fixed value of one. The present simulation study investigated the performance of four IRT linking methods&#x2014;fixed parameter calibration, mean/mean linking, weighted mean/mean linking, and concurrent calibration&#x2014;applied to Rasch-scaled data with a small item pool. Moreover, the number of anchor items required in the absence/presence of moderate model misfit was investigated in small to medium sample sizes. Effects on the link outcome were operationalized as bias, relative bias, and root mean square error of the estimated sample mean and variance of the latent variable. In the light of this limited context, concurrent calibration had substantial convergence issues, while the other methods resulted in overall satisfactory and similar parameter recovery&#x2014;even in the presence of moderate model misfit. Our findings suggest that in case of model misfit, the share of anchor items should exceed the 20% currently proposed in the literature. Future studies should further investigate the effects of anchor item composition regarding unbalanced model misfit.</p>
</abstract>
<kwd-group>
<kwd>Rasch model</kwd>
<kwd>item response theory</kwd>
<kwd>linking methods</kwd>
<kwd>model misfit</kwd>
<kwd>anchor-items design</kwd>
<kwd>limited item pools</kwd>
</kwd-group>
<contract-sponsor id="cn001">Deutsche Forschungsgemeinschaft<named-content content-type="fundref-id">10.13039/501100001659</named-content></contract-sponsor>
<counts>
<fig-count count="4"/>
<table-count count="2"/>
<equation-count count="5"/>
<ref-count count="31"/>
<page-count count="10"/>
<word-count count="0"/>
</counts>
</article-meta>
</front>
<body>
<sec id="S1">
<title>Introduction</title>
<p>Investigating differences between groups that were administered non-identical test forms in an item response theory (IRT) framework requires aligning two (or more) test forms onto a common scale, which is known as linking (<xref ref-type="bibr" rid="B13">Kolen and Brennan, 2014</xref>). As the process of linking requires an overlap of information among scales, this is frequently achieved by using an anchor-items design (<xref ref-type="bibr" rid="B26">Vale, 1986</xref>, p. 333&#x2013;344), where test forms share a number of common items. Linking is a common procedure in the context of large-scale assessments (LSA) in educational measurement such as the <italic>Programme for International Student Assessment</italic> (PISA) or the <italic>American National Assessment of Educational Progress</italic> (NAEP), which are characterized by large item pools and sample sizes. As such, LSAs provide an appropriate field for the application of 2-parameter logistic (2PL) and 3-parameter logistic (3PL) models (<xref ref-type="bibr" rid="B1">Birnbaum, 1968</xref>, p. 397&#x2013;472) as a basis for scaling and linking the data. In contrast, in contexts which are characterized by a limited pool of items and small to medium sample sizes (as is often the case in studies with restricted economic resources or longitudinal designs), the sparser <xref ref-type="bibr" rid="B20">Rasch (1960)</xref> model is a suitable alternative (<xref ref-type="bibr" rid="B22">Sinharay and Haberman, 2014</xref>, p. 23&#x2013;35). To date, the linking of Rasch-scaled data in this specific context has rarely been studied.</p>
<p>In this article, we systematically investigate the linking of Rasch-scaled data based on limited item pools and small to medium sample sizes. To mimic applied settings, the data simulation mirrored a longitudinal design similar to the German <italic>National Educational Panel Study</italic> (NEPS; <xref ref-type="bibr" rid="B2">Blossfeld et al., 2011</xref>). Although mean change in a longitudinal design is often larger than differences among groups in a cross-sectional design, the linking is conceptually equivalent (<xref ref-type="bibr" rid="B29">von Davier et al., 2006</xref>). More specifically, the present simulation study deals with the issues of comparing and evaluating the performance of four IRT linking methods and investigating the absolute and relative number of anchor items required in these contexts. Moreover, as strict assumptions are made on equal item slopes in the Rasch model that are hardly met in empirical data, the robustness of linking methods toward model-data misfit is investigated.</p>
<p>In the following sections, we describe the Rasch model, the four common IRT linking methods, as well as challenges inherent to linking with limited item pools and sample sizes. Next, we describe the set-up of the simulation study and report the present findings. Finally, we discuss implications and limitations of our results.</p>
</sec>
<sec id="S2">
<title>The Rasch Model</title>
<p>In the <xref ref-type="bibr" rid="B20">Rasch (1960)</xref> model, it is assumed that the probability <italic>P</italic> of person <italic>n</italic> &#x2208; 1&#x2026;<italic>N</italic> to correctly answer a dichotomous item <italic>i</italic> &#x2208; 1&#x2026;<italic>I</italic> is conditioned on the interaction of two parameters, that is, a person&#x2019;s ability &#x03B2;<sub><italic>n</italic></sub> and an item&#x2019;s difficulty &#x03B4;<sub><italic>i</italic></sub> on a latent continuum:</p>
<disp-formula id="S2.E1">
<label>(1)</label>
<mml:math id="M1">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mpadded width="+3.3pt">
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mpadded>
<mml:mo rspace="5.8pt">=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B2;</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B4;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo rspace="5.8pt">)</mml:mo>
</mml:mrow>
<mml:mo rspace="5.8pt">=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>exp</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B2;</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo rspace="7.5pt">-</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B4;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mi>exp</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B2;</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo rspace="7.5pt">-</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B4;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>Compared to 2PL and 3PL models, no parameter for item discrimination &#x03B1;<italic><sub><italic>i</italic></sub></italic> is directly incorporated. Therefore, a higher precision in (anchor) item difficulties can be obtained at smaller sample sizes (<xref ref-type="bibr" rid="B25">Thissen and Wainer, 1982</xref>, p. 397&#x2013;412) in the <xref ref-type="bibr" rid="B20">Rasch (1960)</xref> model.</p>
<p>Every item <italic>i</italic>, belonging to a test form fitting a Rasch model, measures the same latent construct with equal item discriminations <italic>&#x03B1;<sub><italic>i</italic></sub></italic> at all levels of &#x03B2;. Stated differently, items are not allowed to differ in their power to discriminate among persons (<xref ref-type="bibr" rid="B30">Wright, 1977</xref>, p. 97&#x2013;116) and, thus, an irrevocable rank order among individuals &#x03B2;<sub>1</sub> &#x003C; &#x2026; &#x003C; &#x03B2;<italic><sub><italic>n</italic></sub></italic> &#x003C; &#x2026; &#x003C; &#x03B2;<italic><sub><italic>N</italic></sub></italic> is determined based on the person sum scores, which are sufficient statistics. As it can be challenging for empirical data to fully meet this strict specification, the question is <italic>not</italic> whether the data do or do not fit a model, but is rather a &#x201C;matter of degree&#x201D; (<xref ref-type="bibr" rid="B17">Meijer and Tendeiro, 2015</xref>). As the weighting by &#x03B1;<italic><sub><italic>i</italic></sub></italic> of person sum scores is ignored in case of Rasch model-data misfit (i.e., &#x03B1;<italic><sub><italic>i</italic></sub></italic> &#x2260; 1), sample mean and variance estimates of the latent variable might be biased (<xref ref-type="bibr" rid="B7">Humphry, 2018</xref>, p. 216&#x2013;228) as they are based on Equation (1). Additionally, the precision of (anchor) item difficulties decreases (<xref ref-type="bibr" rid="B25">Thissen and Wainer, 1982</xref>, p. 397&#x2013;412).</p>
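<p>As a minimal illustration of Equation (1) and of what a deviating item discrimination means, the following Python sketch evaluates the response probability under the Rasch model and under a 2PL-type generalization. The function names and the parameterization &#x03B1;(&#x03B2; &#x2212; &#x03B4;) are assumptions chosen for illustration and are not taken from the study's own code.</p>
<preformat>
```python
import math


def rasch_probability(beta, delta):
    """Equation (1): probability of a correct response under the Rasch model."""
    return math.exp(beta - delta) / (1.0 + math.exp(beta - delta))


def two_pl_probability(beta, delta, alpha):
    """Assumed 2PL-type generalization; alpha = 1 reproduces the Rasch model."""
    z = alpha * (beta - delta)
    return math.exp(z) / (1.0 + math.exp(z))


# Ability equal to the item difficulty gives a success probability of one half.
p_equal = rasch_probability(0.0, 0.0)

# One logit above the item difficulty, the probability depends on alpha.
p_rasch = rasch_probability(1.0, 0.0)
p_misfit = two_pl_probability(1.0, 0.0, alpha=1.5)  # hypothetical misfitting item

print(round(p_equal, 3), round(p_rasch, 3), round(p_misfit, 3))
```
</preformat>
<p>With a discrimination above one, the same ability difference yields a higher success probability than the Rasch model predicts, which is the kind of deviation that a Rasch calibration does not represent.</p>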
</sec>
<sec id="S3">
<title>IRT Linking Methods</title>
<p>In IRT, only individual proficiencies and item difficulties located on equally defined scales are directly comparable over different measurement occasions (<xref ref-type="bibr" rid="B13">Kolen and Brennan, 2014</xref>). As such, prior to investigating proficiency development or group differences in an IRT framework, it is required to align two (or more) test forms onto a common scale (e.g., using an anchor-items design). As anchor item parameters are assumed to be measurement invariant and, thus, to maintain their difficulty over time, they allow for displaying an individual&#x2019;s change in proficiency. Several IRT linking methods exist, differentially &#x201C;translating&#x201D; the linking information during the linking process. The present study focuses on IRT linking methods compatible with Rasch-type models (<xref ref-type="bibr" rid="B28">van der Linden and Hambleton, 2013</xref>) that preserve uniform item discrimination parameters across the linked scales (<xref ref-type="bibr" rid="B5">Fischer et al., 2019</xref>, p. 37&#x2013;64). The different linking methods scale the different test forms either separately or concurrently. In separate calibration methods, anchor item difficulty parameters of each test form are estimated prior to the linking process. The link information extracted in this way is then used differently by each linking method. Hence, a once established reference scale remains unchanged throughout the course of measurement. In the present section, three separate calibration methods, namely (1) fixed parameter calibration (<xref ref-type="bibr" rid="B11">Kim, 2006</xref>, p. 355&#x2013;381), (2) mean/mean linking (<xref ref-type="bibr" rid="B15">Loyd and Hoover, 1980</xref>, p. 179&#x2013;193), and (3) weighted mean/mean linking (<xref ref-type="bibr" rid="B27">van der Linden and Barrett, 2016</xref>, p. 650&#x2013;673), are briefly described. 
Additionally, (4) a one-step approach of simultaneously calibrating and concurrently linking all test forms (e.g., <xref ref-type="bibr" rid="B12">Kim and Cohen, 1998</xref>, p. 131&#x2013;143) is presented.</p>
<sec id="S3.SS1">
<title>Fixed Parameter Calibration (FPC)</title>
<p>The difficulty parameters of the anchor items <italic>l</italic> &#x2208; 1&#x2026;<italic>L</italic> with <italic>L</italic>&#x2286;<italic>I</italic> of the test form <italic>A</italic> that is to be linked are fixed at the estimated item parameters of the reference test form <italic>B</italic>:</p>
<disp-formula id="S3.E2">
<label>(2)</label>
<mml:math id="M2">
<mml:mrow>
<mml:mrow>
<mml:mpadded width="+3.3pt">
<mml:msub>
<mml:mi mathvariant="normal">&#x03B4;</mml:mi>
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mpadded>
<mml:mo rspace="5.8pt">=</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B4;</mml:mi>
<mml:mrow>
<mml:mi>B</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>leaving no possibility for differences in anchor item parameters. Test forms based on a longitudinal design that vary in their sets of anchor items are linked sequentially (i.e., after test form t<sub>2</sub> is linked to t<sub>1</sub>, t<sub>3</sub> is linked to t<sub>2</sub> and so on).</p>
</sec>
<sec id="S3.SS2">
<title>Mean/Mean Linking (m/m)</title>
<p>To link test form <italic>A</italic> to test form <italic>B</italic> and, therefore, obtain the linked item difficulty parameters &#x03B4;<sup>&#x2217;</sup><sub><italic>Ai</italic></sub>, the linking constant <italic>v</italic> is added to each item difficulty &#x03B4;<sub><italic>Ai</italic></sub>:</p>
<disp-formula id="S3.E3">
<label>(3)</label>
<mml:math id="M3">
<mml:mrow>
<mml:mrow>
<mml:mpadded width="+3.3pt">
<mml:msubsup>
<mml:mi mathvariant="normal">&#x03B4;</mml:mi>
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mo>&#x002A;</mml:mo>
</mml:msubsup>
</mml:mpadded>
<mml:mo rspace="5.8pt">=</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B4;</mml:mi>
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mi>v</mml:mi>
</mml:mrow>
</mml:mrow>
<mml:mo>;</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>with <italic>v</italic> being the difference of the means of the <italic>anchor item</italic> difficulty parameters &#x03B4;<sub><italic>AL</italic></sub> and &#x03B4;<sub><italic>BL</italic></sub>:</p>
<disp-formula id="S3.E4">
<label>(4)</label>
<mml:math id="M4">
<mml:mrow>
<mml:mrow>
<mml:mpadded width="+3.3pt">
<mml:mi>v</mml:mi>
</mml:mpadded>
<mml:mo rspace="5.8pt">=</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B4;</mml:mi>
<mml:mrow>
<mml:mi>B</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>L</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo rspace="7.5pt">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo rspace="7.5pt">-</mml:mo>
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B4;</mml:mi>
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>L</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>After linking, it follows that M(&#x03B4;<sup>&#x2217;</sup><sub><italic>AL</italic></sub>) = M(&#x03B4;<sub><italic>BL</italic></sub>).</p>
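<p>Equations (3) and (4) can be sketched in a few lines of Python. The difficulty values below are hypothetical and serve only to illustrate the arithmetic; the function names are likewise illustrative assumptions.</p>
<preformat>
```python
from statistics import mean


def mean_mean_constant(anchor_a, anchor_b):
    """Equation (4): v = M(delta_BL) - M(delta_AL)."""
    return mean(anchor_b) - mean(anchor_a)


def link_difficulties(delta_a, v):
    """Equation (3): shift every difficulty of form A by the constant v."""
    return [d + v for d in delta_a]


# Hypothetical estimates for illustration only (not taken from the study).
anchor_a = [-0.5, 0.0, 0.5]  # anchor items as calibrated in form A
anchor_b = [0.2, 0.7, 1.2]   # the same items as calibrated in form B

v = mean_mean_constant(anchor_a, anchor_b)
linked_anchor_a = link_difficulties(anchor_a, v)

# After linking, the anchor means of both forms coincide (up to rounding).
print(v, mean(linked_anchor_a), mean(anchor_b))
```
</preformat>
<p>In practice, the same constant would be applied to all items of test form <italic>A</italic>, not only to the anchor items.</p>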
</sec>
<sec id="S3.SS3">
<title>Weighted Mean/Mean Linking (wm/m)</title>
<p>This approach incorporates estimation precision in weighting the anchor item difficulty parameter estimates by the inverse of their squared standard errors, <inline-formula><mml:math id="INEQ10"><mml:mrow><mml:mi>S</mml:mi><mml:mo>&#x2062;</mml:mo><mml:msubsup><mml:mi>E</mml:mi><mml:msub><mml:mi mathvariant="normal">&#x03B4;</mml:mi><mml:mrow><mml:mi>A</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>-</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="INEQ11"><mml:mrow><mml:mi>S</mml:mi><mml:mo>&#x2062;</mml:mo><mml:msubsup><mml:mi>E</mml:mi><mml:msub><mml:mi mathvariant="normal">&#x03B4;</mml:mi><mml:mrow><mml:mi>B</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>-</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>, prior to conducting a mean/mean linking, replacing <italic>v</italic> with</p>
<disp-formula id="S3.E5">
<label>(5)</label>
<mml:math id="M5">
<mml:mrow>
<mml:mrow>
<mml:mpadded width="+3.3pt">
<mml:msup>
<mml:mi>v</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:msup>
</mml:mpadded>
<mml:mo rspace="5.8pt">=</mml:mo>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mo largeop="true" symmetric="true">&#x2211;</mml:mo>
<mml:mrow>
<mml:mpadded width="+3.3pt">
<mml:mi>l</mml:mi>
</mml:mpadded>
<mml:mo rspace="5.8pt">=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>L</mml:mi>
</mml:msubsup>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B4;</mml:mi>
<mml:mrow>
<mml:mi>B</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>S</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:msubsup>
<mml:mi>E</mml:mi>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B4;</mml:mi>
<mml:mrow>
<mml:mi>B</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo>-</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mo largeop="true" symmetric="true">&#x2211;</mml:mo>
<mml:mrow>
<mml:mpadded width="+3.3pt">
<mml:mi>l</mml:mi>
</mml:mpadded>
<mml:mo rspace="5.8pt">=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>L</mml:mi>
</mml:msubsup>
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:msubsup>
<mml:mi>E</mml:mi>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B4;</mml:mi>
<mml:mrow>
<mml:mi>B</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo>-</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mfrac>
<mml:mo rspace="7.5pt">-</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mo largeop="true" symmetric="true">&#x2211;</mml:mo>
<mml:mrow>
<mml:mpadded width="+3.3pt">
<mml:mi>l</mml:mi>
</mml:mpadded>
<mml:mo rspace="5.8pt">=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>L</mml:mi>
</mml:msubsup>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B4;</mml:mi>
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>S</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:msubsup>
<mml:mi>E</mml:mi>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B4;</mml:mi>
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo>-</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mo largeop="true" symmetric="true">&#x2211;</mml:mo>
<mml:mrow>
<mml:mpadded width="+3.3pt">
<mml:mi>l</mml:mi>
</mml:mpadded>
<mml:mo rspace="5.8pt">=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>L</mml:mi>
</mml:msubsup>
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:msubsup>
<mml:mi>E</mml:mi>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B4;</mml:mi>
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo>-</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mrow>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>As such, the precision of the anchor item difficulty estimates of test forms <italic>A</italic> and <italic>B</italic> is taken into account, aiming at reducing the link error (i.e., a reflection of the uncertainty introduced to the link due to the selection of link items). Note that <italic>v</italic>&#x2032; is identical to <italic>v</italic> when the anchor item difficulty parameter estimates have equal standard errors within each test form. Hence, weighted mean/mean linking is expected to outperform mean/mean linking when anchor items differ in precision.</p>
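<p>A minimal Python sketch of Equation (5) follows, assuming inverse-variance weights as defined above. The difficulty estimates and standard errors are hypothetical, and the function names are illustrative assumptions rather than part of any published implementation.</p>
<preformat>
```python
def weighted_mean(values, standard_errors):
    """Mean of the values weighted by the inverse squared standard errors."""
    weights = [se ** -2 for se in standard_errors]
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def weighted_mean_mean_constant(anchor_a, se_a, anchor_b, se_b):
    """Equation (5): difference of the precision-weighted anchor means."""
    return weighted_mean(anchor_b, se_b) - weighted_mean(anchor_a, se_a)


# Hypothetical estimates for illustration only (not taken from the study).
anchor_a = [-0.5, 0.0, 0.5]
anchor_b = [0.2, 0.7, 1.2]

# Equal standard errors within each form: the weighted constant equals v.
v_equal = weighted_mean_mean_constant(
    anchor_a, [0.1, 0.1, 0.1], anchor_b, [0.1, 0.1, 0.1]
)

# The third anchor item of form A is estimated less precisely and counts less.
v_weighted = weighted_mean_mean_constant(
    anchor_a, [0.1, 0.1, 0.2], anchor_b, [0.1, 0.1, 0.1]
)

print(v_equal, v_weighted)
```
</preformat>
<p>In this example, down-weighting the imprecise item moves the linking constant away from the unweighted value, which shows how a single poorly estimated anchor item can influence the two methods differently.</p>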
</sec>
<sec id="S3.SS4">
<title>Concurrent Calibration (CC)</title>
<p>All test forms are scaled concurrently in a one-step estimation procedure, constraining the anchor item difficulties across time points. As such, anchor item difficulties are simultaneously fitted to best meet the characteristics of all measurement points interacting with the samples&#x2019; proficiency distributions.</p>
<p>Imprecision of (anchor) item difficulty estimates is reflected in their increased standard error (<italic>SE</italic>). In order to minimize estimation imprecision in item and person parameter estimates at <italic>each time point</italic>, a sample&#x2019;s proficiency and a test&#x2019;s difficulty should considerably overlap (also known as test targeting). In other words, the mean and variance of the item difficulties should closely match the proficiency distribution of the respective sample. Of course, this claim is also true for sets of anchor items. Since sets of anchor items are administered repeatedly, they are expected to fit <italic>several</italic> proficiency distributions simultaneously. Consequently, the more these proficiency distributions diverge, the wider the section of the latent scale that the sets of anchor items need to cover. It is to be noted that anchor items located at the outer edges of these joint ability distributions are prone to an increased <italic>SE</italic>. <xref ref-type="bibr" rid="B24">Svetina et al. (2013</xref>, p. 335&#x2013;360) reported that a mismatch between item and person parameter distributions (i.e., if the items are, on average, too easy or too difficult relative to the average proficiency of the sample) impacted the recovery of item difficulty parameters more than the person parameter estimates. As such, linking methods that do not derive the linking information from the item level may be more &#x201C;forgiving&#x201D; with respect to imprecise estimates, as they are more likely to cancel out. As was shown by <xref ref-type="bibr" rid="B27">van der Linden and Barrett (2016</xref>, p. 650&#x2013;673), the linking result of wm/m was superior to m/m in situations when anchor items did not perfectly match the samples&#x2019; ability distribution. 
Therefore, the estimated amount of change is expected to be closer to its true value, compared to a result that is based on linking methods that link on the item level. Consequently, the method of weighted mean/mean linking that accounts for possible imprecisions in difficulty estimates by weighting anchor items by their <italic>SE</italic>s is expected to outperform the linking methods mean/mean linking, concurrent calibration and fixed parameter calibration (in the given order).</p>
</sec>
</sec>
<sec id="S4">
<title>Challenges for the Linking of Rasch-Scaled Data</title>
<sec id="S4.SS1">
<title>Model-Data Misfit</title>
<p>There is a rather limited body of research examining the influence of Rasch model-data misfit on linking results. For example, <xref ref-type="bibr" rid="B31">Zhao and Hambleton (2017</xref>, p. 484) showed that in an LSA context with large sample sizes (<italic>N</italic> = 50,000) and long tests (78 items) with many anchor items (<italic>k</italic> = 39) fixed parameter calibration was more sensitive to model misfit and more robust against sizable ability shifts (up to 0.5 logits) as compared to linking methods that preserve the relation between item difficulty parameters during linking (i.e., mean/sigma method; <xref ref-type="bibr" rid="B16">Marco, 1977</xref>, and the characteristic curve methods; e.g., <xref ref-type="bibr" rid="B23">Stocking and Lord, 1983</xref>). As such, model fit was crucial to the appropriate use of FPC. So far, no research has investigated the sensitivity and reactivity of IRT linking methods toward model misfit under more realistic conditions with smaller samples and shorter tests. Following <xref ref-type="bibr" rid="B31">Zhao and Hambleton (2017</xref>, p. 484), we hypothesized that FPC would be more sensitive toward model misfit as compared to CC, whereas m/m and wm/m would be least affected.</p>
</sec>
<sec id="S4.SS2">
<title>Number of Anchor Items</title>
<p><xref ref-type="bibr" rid="B13">Kolen and Brennan (2014)</xref> formulated a rule of thumb for large item pools, proposing that anchor items should make up about 20% of the items. Nothing was stated for item pools consisting of fewer than 200 items. If a single anchor item fully reflected the latent construct and were free of differential item functioning (DIF), this item would be sufficient for aligning two tests on a common scale. As this is hardly ever the case in practice, several anchor items are typically used in operational tests. Generally, a larger number of anchor items is assumed to reduce random link error and, thus, is expected to more precisely recover the true value of mean change. Moreover, a larger number of anchor items increases the content validity of the link. However, when test length is rather short (e.g., 25 items) and changes in proficiency between measurement points of a longitudinal sample are expected to be sizable (i.e., &#x2265;0.25 logits; <xref ref-type="bibr" rid="B31">Zhao and Hambleton, 2017</xref>, p. 484), one repeatedly administered identical test form (i.e., 100% anchor items) would potentially affect test targeting and test reliability. In other words, when samples differ substantially in their mean proficiencies, the number of anchor items in a short test form becomes a question of measurement precision at each measurement point. More precisely, an item difficulty that matches the sample&#x2019;s mean ability well at the first measurement point <italic>t</italic><sub>1</sub> cannot match it equally well at the second measurement point <italic>t</italic><sub>2</sub> when there was a significant change in the sample&#x2019;s ability between <italic>t</italic><sub>1</sub> and <italic>t</italic><sub>2</sub>. 
Consider an illustrative example: We assume that there is a significant change in the ability of a sample that is administered two test forms with a length of 15 items that share 10 anchor items. We further assume that these 10 anchor items are very well targeted at <italic>t</italic><sub>1</sub>. It follows that the targeting of these 10 anchor items would have to be worse at <italic>t</italic><sub>2</sub>, affecting test reliability. Furthermore, administering items repeatedly may provoke memory effects that become more likely with an increasing number of anchor items. This raises the question of which proportion of anchor items optimally balances measurement precision and linking information. Is the recommended 20% share of anchor items transferable to (rather) short test forms? In addition, questions remain about the minimum number of anchor items necessary to accurately display growth and about how model-data misfit interacts with the number of anchor items.</p>
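<p>The targeting argument can be made concrete with Rasch item information, <italic>P</italic>(1 &#x2212; <italic>P</italic>), which is one common way to quantify measurement precision at a given ability. The following Python sketch is an illustration under assumed values: ten hypothetical anchor items centred on a mean ability of 0.0 logits at the first measurement point, and an assumed shift of the mean ability to 1.5 logits at a later measurement point. It is not the procedure used in the present study.</p>
<preformat>
```python
import math


def rasch_probability(beta, delta):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(beta - delta)))


def anchor_information(beta, difficulties):
    """Sum of Rasch item information P * (1 - P) at ability beta."""
    total = 0.0
    for delta in difficulties:
        p = rasch_probability(beta, delta)
        total += p * (1.0 - p)
    return total


# Ten hypothetical anchor items centred on the first mean ability (0.0 logits).
anchor_difficulties = [-0.9 + 0.2 * k for k in range(10)]

info_t1 = anchor_information(0.0, anchor_difficulties)
info_t2 = anchor_information(1.5, anchor_difficulties)  # assumed later mean ability

# Approximate standard error of an ability estimate based on these items alone.
se_t1 = 1.0 / math.sqrt(info_t1)
se_t2 = 1.0 / math.sqrt(info_t2)

print(round(info_t1, 2), round(info_t2, 2))
```
</preformat>
<p>Because each item contributes at most 0.25 units of information, and only when its difficulty equals the ability, the same anchor set is necessarily less informative for the shifted sample than for the sample it was targeted at.</p>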
<p>To sum up, the present study aimed at comparing the performance of four common IRT linking methods (fixed parameter calibration, mean/mean linking, weighted mean/mean linking and concurrent calibration) based on Rasch-scaled simulated data. Particularly, we examined to what degree the number of anchor items and the degree of Rasch model-data misfit affected the linking for the different approaches.</p>
</sec>
</sec>
<sec id="S5">
<title>Methods</title>
<sec id="S5.SS1">
<title>Data Generation</title>
<p>We simulated data for four time points (t<sub>1</sub>&#x2013;t<sub>4</sub>) to measure within-individual growth in an anchor-items design (<xref ref-type="bibr" rid="B26">Vale, 1986</xref>, p. 333&#x2013;344). The simulation was modeled after empirical data from the German National Educational Panel Study (NEPS; <xref ref-type="bibr" rid="B2">Blossfeld et al., 2011</xref>). The NEPS aims at measuring competence development over the life span. Therefore, respondents from different age cohorts (e.g., 10 or 15 years old) are followed and receive repeated competence tests at different ages. Thus, the measured competences of these respondents are characterized by large changes across childhood and adolescence. As such, the NEPS is confronted with various methodological issues such as linking test forms administered at different ages that vary significantly in their average difficulty. Nonetheless, these tests were intended to measure the same underlying construct. To gain deeper insight into the linking process under these conditions, the setup of the present simulation study was modeled on reading tests that were administered in grades 5, 7, 9, and 12 of the NEPS (<xref ref-type="bibr" rid="B18">Pohl et al., 2012</xref>; <xref ref-type="bibr" rid="B14">Krannich et al., 2017</xref>; <xref ref-type="bibr" rid="B21">Scharl et al., 2017</xref>). The observed mean proficiencies (in logits) were 0.0, 0.7, 1.2, and 1.5, respectively. Similarly, we randomly drew proficiencies from normal distributions with these means and unit variances. We simulated responses to four test forms, each including 25 items. The true item difficulties were generated in R 3.5.2 (<xref ref-type="bibr" rid="B19">R Core Team, 2018</xref>) from multivariate normal distributions matching the proficiency distributions (see <xref ref-type="table" rid="T1">Table 1</xref>), thus resulting in good test targeting. 
As the anchor items had to fit two distributions simultaneously (t<sub>1/2</sub>, t<sub>2/3</sub>, t<sub>3/4</sub>), they were set to fall between the two distributions (see <xref ref-type="table" rid="T1">Tables 1</xref>, <xref ref-type="table" rid="T2">2</xref>). Anchor items maintained their difficulty parameters over time and thus met the assumption of measurement invariance. The item response models were estimated using the R package TAM 3.1-26 (<xref ref-type="bibr" rid="B10">Kiefer et al., 2018</xref>), which iteratively updated the prior ability distribution using the EM algorithm (<xref ref-type="bibr" rid="B3">Bock and Aitkin, 1981</xref>, p. 443&#x2013;459) during MML estimation (<xref ref-type="bibr" rid="B8">Kang and Petersen, 2012</xref>, p. 311&#x2013;321). Because concurrent calibration is computationally demanding, the quasi-Monte Carlo estimation algorithm (based on 1,000 nodes) was used for it, whereas Gauss-Hermite quadrature was used for the other linking methods. The original code for data generation is provided at <ext-link ext-link-type="uri" xlink:href="https://osf.io/7vta8/">https://osf.io/7vta8/</ext-link>.</p>
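<p>As a rough illustration of the data-generating step described above, the following Python sketch draws proficiencies, item difficulties, and (optionally) misfitting item discriminations for a single test form and simulates dichotomous responses. It is not the original R code, which is available at the OSF link. The function name <monospace>simulate_form</monospace>, the random draw of difficulties from a normal distribution centered on the proficiency mean, and the parameterization alpha * (theta - delta) are assumptions made for illustration only; the anchor-item structure and within-person growth are omitted.</p>
<preformat>
```python
import numpy as np

# Means of the proficiency distributions at t1-t4 (in logits), as given in the text.
MEANS = (0.0, 0.7, 1.2, 1.5)
N_ITEMS = 25  # items per test form, as given in the text


def simulate_form(rng, n_persons, mean, misfit=False):
    """Simulate dichotomous responses to one test form (hypothetical helper).

    Assumption: difficulties are drawn from N(mean, 1) to mimic good targeting;
    the study itself used the fixed true values listed in its Table 1.
    """
    theta = rng.normal(mean, 1.0, size=n_persons)  # unit-variance proficiencies
    delta = rng.normal(mean, 1.0, size=N_ITEMS)    # item difficulties
    # Rasch fit: all discriminations equal 1; misfit: draws from N(1, 0.14^2).
    alpha = rng.normal(1.0, 0.14, size=N_ITEMS) if misfit else np.ones(N_ITEMS)
    logits = alpha * (theta[:, None] - delta[None, :])
    prob = 1.0 / (1.0 + np.exp(-logits))           # probability of a correct response
    responses = (prob > rng.random(prob.shape)).astype(int)
    return theta, delta, alpha, responses
```
</preformat>
<p>Calling <monospace>simulate_form(np.random.default_rng(1), 500, MEANS[1], misfit=True)</monospace> would return one simulated sample of 500 respondents for the second measurement point under moderate misfit.</p>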
<table-wrap position="float" id="T1">
<label>TABLE 1</label>
<caption><p>True item difficulty and item discrimination parameters of the four test forms (t<sub>1</sub>&#x2013;t<sub>4</sub>).</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<tbody>
<tr>
<td><inline-graphic xlink:href="fpsyg-12-633896-i000.jpg"/></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<attrib><italic><italic>Framed parameters represent anchor items linking adjacent measurement points. Position = item position in each test form; t<sub>1/2</sub>, t<sub>2/3</sub>, t<sub>3/4</sub> = true anchor item parameters linking measurement points t<sub>1</sub>, t<sub>2</sub>, t<sub>3</sub>, t<sub>4</sub>; M = mean of 25 true item parameters; SD = standard deviation of 25 true item parameters.</italic></italic></attrib>
</table-wrap-foot>
</table-wrap>
<table-wrap position="float" id="T2">
<label>TABLE 2</label>
<caption><p>Descriptive statistics of the true anchor item parameters split by the experimental factor number of anchor items.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center" colspan="4">t<sub>1/2</sub></td>
<td valign="top" align="center" colspan="3">t<sub>2/3</sub></td>
<td valign="top" align="center" colspan="3">t<sub>3/4</sub></td>
</tr>
<tr>
<td valign="top" align="center"></td>
<td valign="top" align="center" colspan="2"></td>
<td valign="top" align="center" colspan="5"><hr/></td>
<td valign="top" align="center" colspan="3"></td>
</tr>
<tr>
<td/>
<td valign="top" align="center" colspan="10">Anchor item difficulty parameters</td>
</tr>
<tr>
<td valign="top" align="center"></td>
<td valign="top" align="center" colspan="10"><hr/></td>
</tr>
<tr>
<td valign="top" align="left">Anchor</td>
<td valign="top" align="center">Position</td>
<td valign="top" align="center"><italic>M</italic></td>
<td valign="top" align="center"><italic>SD</italic></td>
<td valign="top" align="left" colspan="2">Position</td>
<td valign="top" align="center"><italic>M</italic></td>
<td valign="top" align="center"><italic>SD</italic></td>
<td valign="top" align="center">Position</td>
<td valign="top" align="center"><italic>M</italic></td>
<td valign="top" align="center"><italic>SD</italic></td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">3</td>
<td valign="top" align="center">2,5,8</td>
<td valign="top" align="center">0.369</td>
<td valign="top" align="center">1.051</td>
<td valign="top" align="left" colspan="2">2,5,8</td>
<td valign="top" align="center">0.976</td>
<td valign="top" align="center">0.855</td>
<td valign="top" align="center">2,5,8</td>
<td valign="top" align="center">1.393</td>
<td valign="top" align="center">1.193</td>
</tr>
<tr>
<td valign="top" align="left">5</td>
<td valign="top" align="center">2,3,4,6,9</td>
<td valign="top" align="center">0.333</td>
<td valign="top" align="center">1.050</td>
<td valign="top" align="left" colspan="2">1,5,6,7,8</td>
<td valign="top" align="center">0.873</td>
<td valign="top" align="center">1.138</td>
<td valign="top" align="center">2,3,4,6,9</td>
<td valign="top" align="center">1.316</td>
<td valign="top" align="center">1.193</td>
</tr>
<tr>
<td valign="top" align="left">7</td>
<td valign="top" align="center">1,2,4,5,6,7,9</td>
<td valign="top" align="center">0.332</td>
<td valign="top" align="center">1.066</td>
<td valign="top" align="left" colspan="2">1,3,4,5,6,8,9</td>
<td valign="top" align="center">0.976</td>
<td valign="top" align="center">1.169</td>
<td valign="top" align="center">1,3,4,5,6,7,9</td>
<td valign="top" align="center">1.362</td>
<td valign="top" align="center">1.096</td>
</tr>
<tr>
<td valign="top" align="left">9</td>
<td valign="top" align="center">1&#x2013;9</td>
<td valign="top" align="center">0.360</td>
<td valign="top" align="center">1.022</td>
<td valign="top" align="left" colspan="2">1&#x2013;9</td>
<td valign="top" align="center">0.950</td>
<td valign="top" align="center">1.074</td>
<td valign="top" align="center">1&#x2013;9</td>
<td valign="top" align="center">1.358</td>
<td valign="top" align="center">1.120</td>
</tr>
<tr>
<td valign="top" align="center" colspan="11"><hr/></td>
</tr>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center" colspan="10"><bold>Anchor item discrimination parameters</bold></td>
</tr>
<tr>
<td valign="top" align="center"></td>
<td valign="top" align="center" colspan="10"><hr/></td>
</tr>
<tr>
<td/>
<td valign="top" align="center"><bold>Position</bold></td>
<td valign="top" align="center"><italic><bold>M</bold></italic></td>
<td valign="top" align="center"><italic><bold>SD</bold></italic></td>
<td valign="top" align="left" colspan="2"><bold>Position</bold></td>
<td valign="top" align="center"><italic><bold>M</bold></italic></td>
<td valign="top" align="center"><italic><bold>SD</bold></italic></td>
<td valign="top" align="center"><bold>Position</bold></td>
<td valign="top" align="center"><italic><bold>M</bold></italic></td>
<td valign="top" align="center"><italic><bold>SD</bold></italic></td>
</tr>
<tr>
<td valign="top" align="center" colspan="11"><hr/></td>
</tr>
<tr>
<td valign="top" align="left">3</td>
<td valign="top" align="center">2,5,8</td>
<td valign="top" align="center">1.015</td>
<td valign="top" align="center">0.256</td>
<td valign="top" align="left" colspan="2">2,5,8</td>
<td valign="top" align="center">0.927</td>
<td valign="top" align="center">0.006</td>
<td valign="top" align="center">2,5,8</td>
<td valign="top" align="center">0.946</td>
<td valign="top" align="center">0.136</td>
</tr>
<tr>
<td valign="top" align="left">5</td>
<td valign="top" align="center">2,3,4,6,9</td>
<td valign="top" align="center">1.013</td>
<td valign="top" align="center">0.160</td>
<td valign="top" align="left" colspan="2">1,5,6,7,8</td>
<td valign="top" align="center">0.978</td>
<td valign="top" align="center">0.058</td>
<td valign="top" align="center">2,3,4,6,9</td>
<td valign="top" align="center">1.035</td>
<td valign="top" align="center">0.123</td>
</tr>
<tr>
<td valign="top" align="left">7</td>
<td valign="top" align="center">1,2,4,5,6,7,9</td>
<td valign="top" align="center">0.944</td>
<td valign="top" align="center">0.178</td>
<td valign="top" align="left" colspan="2">1,3,4,5,6,8,9</td>
<td valign="top" align="center">1.023</td>
<td valign="top" align="center">0.077</td>
<td valign="top" align="center">1,3,4,5,6,7,9</td>
<td valign="top" align="center">1.032</td>
<td valign="top" align="center">0.171</td>
</tr>
<tr>
<td valign="top" align="left">9</td>
<td valign="top" align="center">1&#x2013;9</td>
<td valign="top" align="center">1.013</td>
<td valign="top" align="center">0.206</td>
<td valign="top" align="left" colspan="2">1&#x2013;9</td>
<td valign="top" align="center">1.006</td>
<td valign="top" align="center">0.076</td>
<td valign="top" align="center">1&#x2013;9</td>
<td valign="top" align="center">1.030</td>
<td valign="top" align="center">0.148</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<attrib><italic><italic>Anchor = Number of anchor items used for linking; t<sub>1/2</sub>, t<sub>2/3</sub>, t<sub>3/4</sub> = true anchor item parameters linking adjacent measurement points; Position = selected anchor items out of anchor set (see <xref ref-type="table" rid="T1">Table 1</xref> for anchor item identification); M = mean of true anchor item parameters; SD = standard deviation of true anchor item parameters.</italic></italic></attrib>
</table-wrap-foot>
</table-wrap>
</sec>
<sec id="S5.SS2">
<title>Experimental Factors</title>
<p>For each simulated sample, the four test forms (t<sub>1</sub>&#x2013;t<sub>4</sub>) were linked using the four linking methods of fixed parameter calibration, mean/mean linking, weighted mean/mean linking, and concurrent calibration. Model fit was varied at two levels: the item discriminations either met the Rasch model assumption of constant item discriminations (&#x03B1;<italic><sub><italic>i</italic></sub></italic> = 1) or deviated slightly from it (see <xref ref-type="table" rid="T1">Table 1</xref>), in which case they were drawn from <italic>N</italic>(1, 0.14<sup>2</sup>). The resulting item discrimination parameters mirrored empirical results from a 2PL scaling of the tests (<xref ref-type="bibr" rid="B14">Krannich et al., 2017</xref>) mentioned above and, thus, were assumed to reflect a moderate degree of misfit within the range of operational proficiency test forms. Linking was based on 3 (12%), 5 (20%), 7 (28%), or 9 (36%) common items among adjacent test forms (see <xref ref-type="table" rid="T1">Table 1</xref>). While 5 anchor items fell in line with recommendations in the literature (<xref ref-type="bibr" rid="B13">Kolen and Brennan, 2014</xref>), the other conditions evaluated the consequences of using more anchor items (7 or 9) or of relying on a very restricted set of anchor items (3). Sample size was varied at two levels (<italic>N</italic> = 500, <italic>N</italic> = 3,000). Overall, in addition to the within-subject experimental factor (four IRT-linking methods), three between-subject experimental factors (model fit [2], number of anchor items [4], and sample size [2]) were manipulated, resulting in 4 &#x00D7; 2 &#x00D7; 4 &#x00D7; 2 = 64 conditions. Each within-subject experimental condition was simulated 100 times to control for random sampling error.</p>
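<p>The factorial design described above can be enumerated in a few lines of Python. This is only a sketch for checking the arithmetic; the variable names and condition labels are hypothetical, and the total number of samples rests on the assumption that each combination of the three between-sample factors is replicated 100 times.</p>
<preformat>
```python
from itertools import product

# Factor levels as described in the text; the labels are illustrative.
METHODS = (
    "fixed parameter calibration",
    "mean/mean linking",
    "weighted mean/mean linking",
    "concurrent calibration",
)
MODEL_FIT = ("Rasch fit", "moderate misfit")
N_ANCHORS = (3, 5, 7, 9)
SAMPLE_SIZES = (500, 3000)
TEST_LENGTH = 25
REPLICATIONS = 100

# Full crossing of the linking method with the three manipulated factors.
conditions = list(product(METHODS, MODEL_FIT, N_ANCHORS, SAMPLE_SIZES))

# Combinations for which new samples have to be generated
# (every sample is then linked with each method).
between = list(product(MODEL_FIT, N_ANCHORS, SAMPLE_SIZES))
n_samples = len(between) * REPLICATIONS

# Share of anchor items (in percent) in a 25-item test form.
anchor_share = {n: 100 * n / TEST_LENGTH for n in N_ANCHORS}
```
</preformat>
<p>Under these assumptions, <monospace>conditions</monospace> holds the 64 design cells, and <monospace>anchor_share</monospace> reproduces the percentages given for 3, 5, 7, and 9 anchor items.</p>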
</sec>
<sec id="S5.SS3">
<title>Outcome Variables</title>
<p>We examined (a) the convergence rate of the models and calculated (b) bias, (c) relative bias, and (d) root mean square error (RMSE) for the sample mean and variance of the latent variable. The bias was calculated as <inline-formula><mml:math id="INEQ12"><mml:mrow><mml:msub><mml:mover accent="true"><mml:mi mathvariant="normal">&#x03C4;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mi>d</mml:mi></mml:msub><mml:mo>-</mml:mo><mml:mi mathvariant="normal">&#x03C4;</mml:mi></mml:mrow></mml:math></inline-formula>, with <inline-formula><mml:math id="INEQ13"><mml:msub><mml:mover accent="true"><mml:mi mathvariant="normal">&#x03C4;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mi>d</mml:mi></mml:msub></mml:math></inline-formula> denoting the parameter estimate of the <italic>k</italic>th replication of condition <italic>d</italic> and &#x03C4; denoting the true parameter value. The bias was then averaged over all replications of each condition. Serving as an effect size, the relative bias was calculated as the proportion <inline-formula><mml:math id="INEQ14"><mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mover accent="true"><mml:mi mathvariant="normal">&#x03C4;</mml:mi><mml:mo>&#x00AF;</mml:mo></mml:mover><mml:mi>d</mml:mi></mml:msub><mml:mo rspace="7.5pt">-</mml:mo><mml:mi mathvariant="normal">&#x03C4;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>/</mml:mo><mml:mi mathvariant="normal">&#x03C4;</mml:mi></mml:mrow><mml:mo rspace="7.5pt">,</mml:mo></mml:mrow></mml:math></inline-formula> with <inline-formula><mml:math id="INEQ15"><mml:msub><mml:mover accent="true"><mml:mi mathvariant="normal">&#x03C4;</mml:mi><mml:mo>&#x00AF;</mml:mo></mml:mover><mml:mi>d</mml:mi></mml:msub></mml:math></inline-formula> being the parameter estimate averaged over all replications. Following <xref ref-type="bibr" rid="B6">Forero et al. (2009</xref>, p. 625&#x2013;641), we considered a relative bias below 10% as acceptable. The RMSE quantifies the precision of a parameter estimate and was calculated as <inline-formula><mml:math id="INEQ16"><mml:msqrt><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mi>c</mml:mi></mml:mfrac><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:msubsup><mml:mo largeop="true" symmetric="true">&#x2211;</mml:mo><mml:mrow><mml:mpadded width="+3.3pt"><mml:mi>k</mml:mi></mml:mpadded><mml:mo rspace="5.8pt">=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>c</mml:mi></mml:msubsup><mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mover accent="true"><mml:mi mathvariant="normal">&#x03C4;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mi>k</mml:mi></mml:msub><mml:mo>-</mml:mo><mml:mi mathvariant="normal">&#x03C4;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mrow></mml:msqrt></mml:math></inline-formula>, where <italic>c</italic> is the number of replications. As such, the RMSE was defined as the square root of the mean squared deviation of the estimates from the true parameter value.</p>
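<p>The three accuracy measures defined above can be written compactly in Python. The sketch below assumes that the estimates of all replications of one condition are collected in a single array; the function names are hypothetical and do not come from the original analysis code.</p>
<preformat>
```python
import numpy as np


def bias(estimates, true_value):
    """Mean deviation of the replication estimates from the true value."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.mean(estimates - true_value))


def relative_bias(estimates, true_value):
    """Bias of the averaged estimate as a proportion of the true value.

    Not defined for a true value of 0 (e.g., a mean fixed at 0 for identification).
    """
    estimates = np.asarray(estimates, dtype=float)
    return float((np.mean(estimates) - true_value) / true_value)


def rmse(estimates, true_value):
    """Square root of the mean squared deviation from the true value."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.sqrt(np.mean((estimates - true_value) ** 2)))
```
</preformat>
<p>With these helpers, a condition would meet the 10% criterion used in the text whenever the absolute value of <monospace>relative_bias</monospace> stays below 0.10.</p>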
</sec>
</sec>
<sec id="S6">
<title>Results</title>
<p>Only negligible differences among the three linking methods of fixed parameter calibration, mean/mean and weighted mean/mean linking were found with regard to the outcome variables bias, relative bias and RMSE. Results are, therefore, reported combined. Descriptive statistics split by linking methods and experimental factors of the respective outcome variables are reported in <xref ref-type="supplementary-material" rid="DS1">Supplementary Tables 2</xref>&#x2013;<xref ref-type="supplementary-material" rid="DS1">5</xref>.</p>
<sec id="S6.SS1">
<title>Convergence Rates</title>
<p>Only 50.8% (i.e., 813 of 1,600 samples) of the concurrently calibrated models converged. Non-convergence was split about evenly across the experimental factors of sample size and model-data misfit but varied substantially across the different numbers of anchor items (see <xref ref-type="supplementary-material" rid="DS1">Supplementary Table 1</xref>). Moreover, in-depth analyses (not reported in this manuscript) of successfully converged concurrently calibrated models revealed that smaller numbers of iteration steps did not necessarily lead to more precise parameter estimation. As these findings called into question the applicability of concurrent calibration in settings based on small absolute numbers of anchor items, this method was excluded from further analyses. In contrast, all models that were calibrated separately (fixed parameter calibration, mean/mean linking and weighted mean/mean linking) converged.</p>
</sec>
<sec id="S6.SS2">
<title>Sample Mean</title>
<sec id="S6.SS2.SSS1">
<title>Bias</title>
<p>Overall, there was no (change in) bias over the three time points (<italic>M</italic><sub><italic>t2</italic>&#x2013;<italic>t4</italic></sub> = 0.00; t<sub>1</sub> was constrained to 0 for model identification) in the absence of model misfit. Neither sample size nor the number of anchor items had a substantial effect on the consistency of the bias of the sample mean in the absence of model misfit (see <xref ref-type="fig" rid="F1">Figure 1</xref>), although the bias was marginally smaller when sample size was <italic>N</italic> = 3,000 compared to <italic>N</italic> = 500. However, the sample mean was recovered less well in case of moderate model misfit (see <xref ref-type="fig" rid="F1">Figure 1</xref> and <xref ref-type="supplementary-material" rid="DS1">Supplementary Table 2</xref>). Rather consistently, the sample mean was underestimated over the three time points, t<sub>2</sub>&#x2013;t<sub>4</sub>, in all conditions except those in which linking was based on 9 (36%) anchor items. The amount and pattern of the bias of the sample mean were rather heterogeneous across time points and numbers of anchor items. Overall, we found that the bias of the sample mean tended to decrease with an increasing number of anchor items.</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption><p>Bias of sample mean over three time points (t<sub>2</sub>&#x2013;t<sub>4</sub>). The figure is split by three linking methods and the experimental factors number of anchor items, sample size and Rasch model-data fit. FPC = fixed parameter calibration, m/m = mean/mean linking, w. m/m = weighted mean/mean linking. 95% confidence intervals are depicted.</p></caption>
<graphic xlink:href="fpsyg-12-633896-g001.tif"/>
</fig>
</sec>
<sec id="S6.SS2.SSS2">
<title>Relative Bias</title>
<p>The relative bias was always clearly below 10% and exceeded 5% in only two conditions (see <xref ref-type="supplementary-material" rid="DS1">Supplementary Table 2</xref>); it was thus considered acceptable.</p>
</sec>
<sec id="S6.SS2.SSS3">
<title>RMSE</title>
<p>The RMSE of the sample mean increased linearly from t<sub>2</sub> to t<sub>4</sub> (see <xref ref-type="fig" rid="F2">Figure 2</xref>). Sample size influenced the amount of RMSE as expected: a smaller sample size led to a larger RMSE with a marginally steeper slope over time [<italic>N</italic> = 500: t<sub>2</sub> = 0.06 (<italic>SD</italic> = 0.04), t<sub>3</sub> = 0.08 (<italic>SD</italic> = 0.06), t<sub>4</sub> = 0.10 (<italic>SD</italic> = 0.08)] compared to a larger sample size [<italic>N</italic> = 3,000: t<sub>2</sub> = 0.03 (<italic>SD</italic> = 0.02), t<sub>3</sub> and t<sub>4</sub> = 0.04 (<italic>SD</italic><sub><italic>t3</italic>,<italic>t4</italic></sub> = 0.03)]. Additionally, the RMSE of the sample mean was in general smaller when linking was based on a larger number of anchor items. More precisely, a larger number of anchor items seemed more beneficial for the smaller sample size (<italic>N</italic> = 500). It has to be noted that a moderate Rasch model-data misfit did not necessarily lead to decreased estimation precision of the sample mean. Rather, the effect of model misfit on the RMSE of the sample mean seemed to depend on the number of anchor items and was offset when the linking was based on at least 5 (20%) anchor items.</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption><p>RMSE of sample mean over three time points (t<sub>2</sub>&#x2013;t<sub>4</sub>). The figure is split by the three linking methods and the experimental factors number of anchor items, sample size and Rasch model-data fit. FPC = fixed parameter calibration, Mean/Mean = mean/mean linking, w. Mean/Mean = weighted Mean/Mean. 95% confidence intervals are depicted.</p></caption>
<graphic xlink:href="fpsyg-12-633896-g002.tif"/>
</fig>
</sec>
</sec>
<sec id="S6.SS3">
<title>Sample Variance</title>
<sec id="S6.SS3.SSS1">
<title>Bias</title>
<p>Overall, there was no change in bias or its <italic>SD</italic> over the four time points (<italic>M</italic><sub><italic>t1</italic>&#x2013;<italic>t4</italic></sub> = 0.00, <italic>SD</italic><sub><italic>t1</italic>&#x2013;<italic>t4</italic></sub> = 0.06) in the absence of model misfit. Neither sample size nor the number of anchor items had a substantial effect on the consistency of the bias of the sample variance in the absence of model misfit (see <xref ref-type="fig" rid="F3">Figure 3</xref>). In case of moderate Rasch model-data misfit, the sample variance was marginally underestimated at t<sub>1</sub> and almost returned to its true value at later measurement points. This pattern was observed similarly for different numbers of anchor items and for both sample sizes.</p>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption><p>Bias of sample variance over four time points (t<sub>1</sub>&#x2013;t<sub>4</sub>). The figure is split by three linking methods and the experimental factors number of anchor items, sample size and Rasch model-data fit. FPC = fixed parameter calibration, m/m = mean/mean linking, w. m/m = weighted mean/mean linking. 95% confidence intervals are depicted.</p></caption>
<graphic xlink:href="fpsyg-12-633896-g003.tif"/>
</fig>
</sec>
<sec id="S6.SS3.SSS2">
<title>Relative Bias</title>
<p>The relative bias was considered acceptable in all conditions as it was always below 5% (see <xref ref-type="supplementary-material" rid="DS1">Supplementary Table 4</xref>).</p>
</sec>
<sec id="S6.SS3.SSS3">
<title>RMSE</title>
<p>The RMSE of the sample variance did not change from t<sub>1</sub> to t<sub>4</sub> (see <xref ref-type="fig" rid="F4">Figure 4</xref>). Sample size influenced the amount of RMSE as expected: a smaller sample size led to a larger RMSE [<italic>N</italic> = 500: t<sub>1</sub>&#x2013;t<sub>4</sub> = 0.07 (<italic>SD</italic><sub><italic>t1</italic>&#x2013;<italic>t4</italic></sub> = 0.05)] compared to a larger sample size [<italic>N</italic> = 3,000: t<sub>1</sub>&#x2013;t<sub>4</sub> = 0.03 (<italic>SD</italic><sub><italic>t1</italic>&#x2013;<italic>t4</italic></sub> = 0.02)]. Neither the number of anchor items nor a moderate Rasch model-data misfit affected the precision of the sample variance estimate.</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption><p>RMSE of sample variance over four time points (t<sub>1</sub>&#x2013;t<sub>4</sub>). The figure is split by three linking methods and the experimental factors number of anchor items, sample size and Rasch model-data fit. FPC = fixed parameter calibration, Mean/Mean = mean/mean linking, w. Mean/Mean = weighted Mean/Mean. 95% confidence intervals are depicted.</p></caption>
<graphic xlink:href="fpsyg-12-633896-g004.tif"/>
</fig>
</sec>
</sec>
</sec>
<sec id="S7">
<title>Discussion</title>
<p>The present simulation study focused on the comparison of four common IRT-linking methods (fixed parameter calibration, mean/mean linking, weighted mean/mean linking and concurrent calibration) across three experimental factors (number of anchor items, sample size and model-data fit). Due to convergence issues, the application of concurrent calibration is not advisable for Rasch-scaled data when linking is based on a small absolute number of anchor items. The separate calibration linking methods somewhat unexpectedly resulted in negligible differences in the outcome variables of bias, relative bias and RMSE of sample mean and variance of the latent variable. Hence, the choice of linking method had no effect on the link outcome. This finding may result from the good test targeting at each measurement point in the present study. Thus, even though mean change between time points was substantial (up to 0.7 logits), there were only small differences in measurement precision within each set of anchor items, potentially depriving the method of weighted mean/mean linking of its unique strength in adjusting for differences in the anchor items&#x2019; <italic>SE</italic>s. Moreover, different amounts of mean change in proficiency over time were handled equally well by the three separate calibration methods. It should be noted that no differences were found among the three linking methods in sensitivity and reactivity regarding moderate Rasch model-data misfit in the context of longitudinal linking.</p>
<p>In the absence of model misfit, the recovery of the sample mean and variance was very good, regardless of the sample size or the number of anchor items used. However, in case of moderate Rasch model-data misfit, the parameters of sample mean and variance were generally slightly underestimated, suggesting an influence of the empirical relationship of anchor item difficulty parameters &#x03B4;<italic><sub><italic>i</italic></sub></italic> and anchor item discrimination parameters <italic>&#x03B1;<sub><italic>i</italic></sub></italic>. In contrast to prior findings reported in the literature (<xref ref-type="bibr" rid="B31">Zhao and Hambleton, 2017</xref>, p. 484), no substantial differences in performance were found between linking methods that based the linking on the anchor item level (e.g., FPC) or the anchor set level (e.g., m/m, wm/m). More specifically, a certain composition of &#x03B4;<italic><sub><italic>i</italic></sub></italic> and <italic>&#x03B1;<sub><italic>i</italic></sub></italic> in the anchor items seemed to substantially influence the estimation of sample parameters. Factors characterizing this composition may include a deviation of item discrimination from 1 on the anchor item and/or anchor set level (i.e., whether misfit is balanced or not), the size and/or direction of the correlation between &#x03B4;<italic><sub><italic>i</italic></sub></italic> and <italic>&#x03B1;<sub><italic>i</italic></sub></italic>, as well as person-item fit. Additionally, further investigating the consequences of Rasch model-data misfit seems a promising approach to disentangling the compositional effects of anchor items. As the degree of model misfit was assumed to reflect a moderate degree of misfit within the range of operational proficiency test forms, we would furthermore deduce that an increasing degree of model misfit leads to an increasing deviation of parameter estimates from their true values.</p>
<p>In the present simulation study, change in proficiency was modeled as decelerating growth in steps of 0.7, 0.5, and 0.3 logits. Nevertheless, the number of anchor items needed to adequately map the change in the proficiency distributions of the latent variable seemed independent of the amount of change between two time points. This may suggest that the present findings transfer to situations in which differences among groups are less pronounced.</p>
<p>It should be noted that the estimation of sample mean and variance differed in sensitivity to the number of anchor items in the case of moderate Rasch model-data misfit. However, accumulating effects of bias (as reported by <xref ref-type="bibr" rid="B9">Keller and Keller, 2011</xref>, p. 362&#x2013;379) were only found when linking was based on 3 (12%) anchor items. While 9 (36%) anchor items seemed sufficient to somewhat balance moderate misfit and resulted in good sample mean recovery, the recovery of sample variance seemed independent of the number of anchor items used. Similarly, for estimation precision of the sample mean, a larger number of anchor items somewhat attenuated moderate Rasch model-data misfit, although this effect was more beneficial for a smaller sample size. Estimation precision of sample variance seemed to depend only on the sample size.</p>
<sec id="S7.SS1">
<title>Practical Implications</title>
<p>As no substantial impact on parameter recovery of sample mean and variance was found due to moderate Rasch model-data misfit, the Rasch model seemed rather robust in the present context. However, special attention should be paid to anchor items, as their characteristics critically determine sample parameter estimates. Therefore, using a 2PL model seems a practicable diagnostic tool to uncover noticeable deviations in anchor item discrimination parameters. Only marginal differences were found between the three IRT-linking methods of fixed parameter calibration, mean/mean linking and weighted mean/mean linking. More specifically, all of them were equally robust toward a moderate Rasch model-data misfit and different numbers of anchor items even when mean growth was substantial (0.7 logits). As such, the decision for a linking method could rely on more functional factors (e.g., scale preservation, practicability) in case of good test targeting. If, however, test targeting is expected to be poor, we agree with <xref ref-type="bibr" rid="B27">van der Linden and Barrett (2016</xref>, p. 650&#x2013;673) that weighted mean/mean linking seems to be the preferable choice, as it allows for the inclusion of measurement precision and leaves the &#x201C;pre linking&#x201D; model fit unaltered. Furthermore, we would like to stress that defining an appropriate share of anchor items should depend on the respective Rasch model-data fit rather than following <xref ref-type="bibr" rid="B13">Kolen and Brennan&#x2019;s (2014)</xref> rule of thumb suggesting a share of 20%. In case of moderate misfit, we suggest 9 (36%) anchor items for the longitudinal linking of short (i.e., 25 items) operational test forms when a Rasch model is used for scaling. Additionally, in case of misfitting anchor items, findings hinted at a compensatory effect when the misfit present is balanced within an anchor item set.</p>
<p>Due to the issues of non-convergence and the disproportionate occurrence of extreme values in parameter recovery, concurrent calibration seemed less suitable for common use than separate calibration methods in longitudinal study designs using small absolute numbers of anchor items.</p>
</sec>
<sec id="S7.SS2">
<title>Limitations of the Study</title>
<p>The setup of the simulation study did not consider several issues relevant in empirical contexts such as missing data or differential item functioning in anchor items. Similarly, our simulated anchor items exhibited good test targeting for the two proficiency distributions they were intended to link, which might be hard to achieve in operational assessments. These simplifications of reality were accepted in order to manage the complexity of the central issue. As a consequence, results may be limited in their transferability to empirical data. Future research should study these aspects in more detail and, thus, could further elaborate on the conditions that allow precise linking in the context of the Rasch model. Moreover, the present study was motivated by operational LSAs, which are usually characterized by relatively large sample sizes and rather short test forms. In other empirical settings that include smaller sample sizes, substantially longer test forms can often be administered. Therefore, future research could address the particulars of linking in these studies. In particular, this research could also explore whether alternative scaling approaches (e.g., the 2-parameter logistic model) might show more pronounced benefits for data exhibiting misfit to the Rasch model or whether the linking results are comparable to the findings presented in the present study.</p>
<p>As the mean of <italic>&#x03B1;<sub><italic>i</italic></sub></italic> within anchor item sets as well as the correlations of &#x03B4;<italic><sub><italic>i</italic></sub></italic> and <italic>&#x03B1;<sub><italic>i</italic></sub></italic> in the present simulation study were not varied systematically, the underlying mechanisms affecting the recovery of sample mean and variance in case of moderate Rasch model-data misfit was not fully traceable and, thus, limited the conclusions on certain compositional effects inherent to sets of anchor items. However, regarding longitudinal measurements, considering the empirical correlation of &#x03B4;<italic><sub><italic>i</italic></sub></italic> and <italic>&#x03B1;<sub><italic>i</italic></sub></italic> only, would fall short for the effect of person-item fit. As anchor item difficulties are held constant in repeated administrations to samples with variable proficiencies, person-item fit differs between time points. Therefore, differential effects of an anchor item on the estimation of sample parameters (<xref ref-type="bibr" rid="B4">Bolt et al., 2014</xref>, p. 141&#x2013;162) are to be additionally considered between time points in case of Rasch model-data misfit (<xref ref-type="bibr" rid="B7">Humphry, 2018</xref>, p. 216&#x2013;228).</p>
</sec>
</sec>
<sec id="S8">
<title>Conclusion</title>
<p>Overall, the challenges inherent to contexts characterized by small absolute and relative numbers of anchor items due to short test length as well as small to medium sample sizes were mastered equally well by the three separate calibration methods (mean/mean linking, weighted mean/mean linking, and fixed parameter calibration), resulting in reliable and valid parameter recovery. Thus, the results of the present simulation study suggested that the choice of linking method is rather secondary when linking Rasch modeled data, independent of the absence or presence of (moderate) model misfit. More important seems to be the practitioner&#x2019;s awareness that a combination of moderate model misfit and certain factors (e.g., empirical relation of &#x03B4;<italic><sub><italic>i</italic></sub></italic> and <italic>&#x03B1;<sub><italic>i</italic></sub></italic>, composition of anchor items, person-item fit, sample size) may lead to distorted parameter estimation, although at present neither applicable diagnostics nor concrete guidelines for empirical data seem to be at hand. As such, future research should analytically deduce and systematically investigate the consequences of an interaction between Rasch model-data misfit and certain experimental factors.</p>
</sec>
<sec id="S9">
<title>Data Availability Statement</title>
<p>The original contributions presented in the study are included in the article/<xref ref-type="supplementary-material" rid="DS1">Supplementary Material</xref>; further inquiries can be directed to the corresponding author/s.</p>
</sec>
<sec id="S10">
<title>Author Contributions</title>
<p>LF conducted the literature research, drafted significant parts of the manuscript, and analyzed and interpreted the data used in this study. TG wrote the code for the simulation study. CC, TR, and TG substantively revised the manuscript and provided substantial input for the statistical analyses. All authors read and approved the final manuscript.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
</body>
<back>
<fn-group>
<fn fn-type="financial-disclosure">
<p><bold>Funding.</bold> We would like to thank the Deutsche Forschungsgemeinschaft (DFG; <ext-link ext-link-type="uri" xlink:href="http://www.dfg.de">www.dfg.de</ext-link>) for funding our research project within the Priority Programme 1646 entitled &#x201C;Analyzing relations between latent competencies and context information in the National Educational Panel Study&#x201D; under Grant No. CA 289/8-2 (awarded to CC). We furthermore thank the Leibniz Institute for Educational Trajectories (<ext-link ext-link-type="uri" xlink:href="http://www.lifbi.de">www.lifbi.de</ext-link>) for funding the open access publication fee.</p>
</fn>
</fn-group>
<sec id="S12" sec-type="supplementary material"><title>Supplementary Material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fpsyg.2021.633896/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fpsyg.2021.633896/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Data_Sheet_1.docx" id="DS1" mimetype="application/vnd.openxmlformats-officedocument.wordprocessingml.document" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Birnbaum</surname> <given-names>A.</given-names></name></person-group> (<year>1968</year>). &#x201C;<article-title>Some latent trait models and their use in inferring an examinee&#x2019;s ability</article-title>,&#x201D; in <source><italic>Statistical Theories of Mental Test Scores</italic></source>, <role>eds</role> <person-group person-group-type="editor"><name><surname>Lord</surname> <given-names>F. M.</given-names></name> <name><surname>Novick</surname> <given-names>M. R.</given-names></name></person-group> (<publisher-loc>Reading, MA</publisher-loc>: <publisher-name>Addison-Wesley Publishing</publisher-name>), <fpage>397</fpage>&#x2013;<lpage>472</lpage>.</citation></ref>
<ref id="B2"><citation citation-type="journal"><person-group person-group-type="editor"><name><surname>Blossfeld</surname> <given-names>H. P.</given-names></name> <name><surname>Ro&#x00DF;bach</surname> <given-names>H. G.</given-names></name> <name><surname>von Maurice</surname> <given-names>J.</given-names></name></person-group> (<role>Eds</role>.) (<year>2011</year>). &#x201C;<article-title>Zeitschrift f&#x00FC;r erziehungswissenschaft sonderheft</article-title>,&#x201D; in <source><italic>Education as a Lifelong Process: The German National Educational Panel Study (NEPS)</italic></source>, <volume>Vol. 14</volume> (<publisher-loc>Wiesbaden</publisher-loc>: <publisher-name>VS Verlag f&#x00FC;r Sozialwissenschaften</publisher-name>).</citation></ref>
<ref id="B3"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bock</surname> <given-names>R. D.</given-names></name> <name><surname>Aitkin</surname> <given-names>M.</given-names></name></person-group> (<year>1981</year>). <article-title>Marginal maximum likelihood estimation of item parameters: application of an EM algorithm.</article-title> <source><italic>Psychometrika</italic></source> <volume>46</volume> <fpage>443</fpage>&#x2013;<lpage>459</lpage>. <pub-id pub-id-type="doi">10.1007/bf02293801</pub-id></citation></ref>
<ref id="B4"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bolt</surname> <given-names>D. M.</given-names></name> <name><surname>Deng</surname> <given-names>S.</given-names></name> <name><surname>Lee</surname> <given-names>S.</given-names></name></person-group> (<year>2014</year>). <article-title>IRT model misspecification and measurement of growth in vertical scaling.</article-title> <source><italic>J. Educ. Meas.</italic></source> <volume>51</volume>:<issue>2</issue>. <pub-id pub-id-type="doi">10.1111/jedm.12039</pub-id></citation></ref>
<ref id="B5"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fischer</surname> <given-names>L.</given-names></name> <name><surname>Gnambs</surname> <given-names>T.</given-names></name> <name><surname>Rohm</surname> <given-names>T.</given-names></name> <name><surname>Carstensen</surname> <given-names>C. H.</given-names></name></person-group> (<year>2019</year>). <article-title>Longitudinal linking of Rasch-model-scaled competence tests in large-scale assessments: a comparison and evaluation of different linking methods and anchoring designs based on two tests on mathematical competence administered in grades 5 and 7.</article-title> <source><italic>Psychol. Test Assessment Model.</italic></source> <volume>61</volume> <fpage>37</fpage>&#x2013;<lpage>64</lpage>.</citation></ref>
<ref id="B6"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Forero</surname> <given-names>C. G.</given-names></name> <name><surname>Maydeu-Olivares</surname> <given-names>A.</given-names></name> <name><surname>Gallardo-Pujol</surname> <given-names>D.</given-names></name></person-group> (<year>2009</year>). <article-title>Factor analysis with ordinal indicators: a Monte Carlo study comparing DWLS and ULS estimation.</article-title> <source><italic>Struct. Equ. Model.</italic></source> <volume>16</volume> <fpage>625</fpage>&#x2013;<lpage>641</lpage>. <pub-id pub-id-type="doi">10.1080/10705510903203573</pub-id></citation></ref>
<ref id="B7"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Humphry</surname> <given-names>S. M.</given-names></name></person-group> (<year>2018</year>). <article-title>The impact of levels of discrimination on vertical equating in the Rasch model.</article-title> <source><italic>J. Appl. Meas.</italic></source> <volume>19</volume> <fpage>216</fpage>&#x2013;<lpage>228</lpage>.</citation></ref>
<ref id="B8"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kang</surname> <given-names>T.</given-names></name> <name><surname>Petersen</surname> <given-names>N. S.</given-names></name></person-group> (<year>2012</year>). <article-title>Linking item parameters to a base scale.</article-title> <source><italic>Asia Pacific Educ. Rev.</italic></source> <volume>13</volume>:<issue>2</issue>. <pub-id pub-id-type="doi">10.1007/s12564-011-9197-2</pub-id></citation></ref>
<ref id="B9"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Keller</surname> <given-names>L. A.</given-names></name> <name><surname>Keller</surname> <given-names>R. R.</given-names></name></person-group> (<year>2011</year>). <article-title>The long-term sustainability of different item response theory scaling methods.</article-title> <source><italic>Educ. Psychol. Meas.</italic></source> <volume>71</volume> <fpage>362</fpage>&#x2013;<lpage>379</lpage>. <pub-id pub-id-type="doi">10.1177/0013164410375111</pub-id></citation></ref>
<ref id="B10"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kiefer</surname> <given-names>T.</given-names></name> <name><surname>Robitzsch</surname> <given-names>A.</given-names></name> <name><surname>Wu</surname> <given-names>M.</given-names></name></person-group> (<year>2018</year>). <source><italic>TAM: Test Analysis Modules. [Computer Software].</italic></source> Available online at: <ext-link ext-link-type="uri" xlink:href="https://CRAN.R-project.org/package=TAM">https://CRAN.R-project.org/package=TAM</ext-link></citation></ref>
<ref id="B11"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kim</surname> <given-names>S.</given-names></name></person-group> (<year>2006</year>). <article-title>A comparative study of IRT fixed parameter calibration methods.</article-title> <source><italic>J. Educ. Meas.</italic></source> <volume>43</volume>:<issue>4</issue>. <pub-id pub-id-type="doi">10.1111/j.1745-3984.2006.00021.x</pub-id></citation></ref>
<ref id="B12"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kim</surname> <given-names>S.</given-names></name> <name><surname>Cohen</surname> <given-names>A. S.</given-names></name></person-group> (<year>1998</year>). <article-title>A comparison of linking and concurrent calibration under item response theory.</article-title> <source><italic>Appl. Psychol. Meas.</italic></source> <volume>22</volume>:<issue>2</issue>.</citation></ref>
<ref id="B13"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kolen</surname> <given-names>M. J.</given-names></name> <name><surname>Brennan</surname> <given-names>R. L.</given-names></name></person-group> (<year>2014</year>). <source><italic>Test Equating, Scaling, and Linking: Methods and Practices. Statistics for Social and Behavioral Sciences</italic></source>, <edition>3rd Edn</edition>. <publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer</publisher-name>.</citation></ref>
<ref id="B14"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Krannich</surname> <given-names>M.</given-names></name> <name><surname>Jost</surname> <given-names>O.</given-names></name> <name><surname>Rohm</surname> <given-names>T.</given-names></name> <name><surname>Koller</surname> <given-names>I.</given-names></name> <name><surname>Carstensen</surname> <given-names>C. H.</given-names></name> <name><surname>Fischer</surname> <given-names>L.</given-names></name><etal/></person-group> (<year>2017</year>). <source><italic>NEPS Technical Report for Reading: Scaling Results of Starting Cohort 3 for Grade 7.</italic></source> <comment>NEPS Survey Papers, 14</comment>. <publisher-loc>Bamberg</publisher-loc>: <publisher-name>Leibniz Institute for Educational Trajectories</publisher-name>.</citation></ref>
<ref id="B15"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Loyd</surname> <given-names>B. H.</given-names></name> <name><surname>Hoover</surname> <given-names>H. D.</given-names></name></person-group> (<year>1980</year>). <article-title>Vertical equating using the Rasch model.</article-title> <source><italic>J. Educ. Meas.</italic></source> <volume>17</volume> <fpage>179</fpage>&#x2013;<lpage>193</lpage>. <pub-id pub-id-type="doi">10.1111/j.1745-3984.1980.tb00825.x</pub-id></citation></ref>
<ref id="B16"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Marco</surname> <given-names>G. L.</given-names></name></person-group> (<year>1977</year>). <article-title>Item characteristic curve solutions to three intractable testing problems.</article-title> <source><italic>J. Educ. Meas.</italic></source> <volume>14</volume>:<issue>2</issue>. <pub-id pub-id-type="doi">10.1111/j.1745-3984.1977.tb00033.x</pub-id></citation></ref>
<ref id="B17"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Meijer</surname> <given-names>R. R.</given-names></name> <name><surname>Tendeiro</surname> <given-names>J. N.</given-names></name></person-group> (<year>2015</year>). <source><italic>The Effect of Item and Person Misfit on Selection Decisions: An Empirical Study.</italic></source> <comment>LSAC Research Report Series 15:05</comment>. <publisher-loc>Newtown, PA</publisher-loc>: <publisher-name>Law School Admission Council</publisher-name>.</citation></ref>
<ref id="B18"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pohl</surname> <given-names>S.</given-names></name> <name><surname>Haberkorn</surname> <given-names>K.</given-names></name> <name><surname>Hardt</surname> <given-names>K.</given-names></name> <name><surname>Wiegand</surname> <given-names>E.</given-names></name></person-group> (<year>2012</year>). <source><italic>NEPS Technical Report for Reading: Scaling Results of Starting Cohort 3 in Fifth Grade.</italic></source> <comment>NEPS Working Paper, 15</comment>. <publisher-loc>Bamberg</publisher-loc>: <publisher-name>Leibniz Institute for Educational Trajectories</publisher-name>.</citation></ref>
<ref id="B19"><citation citation-type="journal"><collab>R Core Team</collab> (<year>2018</year>). <source><italic>R: A Language and Environment for Statistical Computing.</italic></source> <publisher-loc>Vienna</publisher-loc>: <publisher-name>R Foundation for Statistical Computing</publisher-name>.</citation></ref>
<ref id="B20"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rasch</surname> <given-names>G.</given-names></name></person-group> (<year>1960</year>). <source><italic>Probabilistic Models For Some Intelligence And Attainment Tests: Studies In Mathematical Psychology: I.</italic></source> <publisher-loc>Copenhagen</publisher-loc>: <publisher-name>Danmarks Paedagogiske Institut</publisher-name>.</citation></ref>
<ref id="B21"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Scharl</surname> <given-names>A.</given-names></name> <name><surname>Fischer</surname> <given-names>L.</given-names></name> <name><surname>Gnambs</surname> <given-names>T.</given-names></name> <name><surname>Rohm</surname> <given-names>T.</given-names></name></person-group> (<year>2017</year>). <source><italic>NEPS Technical Report for Reading: Scaling Results of Starting Cohort 3 for Grade 9.</italic></source> <comment>NEPS Survey Papers, 20</comment>. <publisher-loc>Bamberg</publisher-loc>: <publisher-name>Leibniz Institute for Educational Trajectories</publisher-name>.</citation></ref>
<ref id="B22"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sinharay</surname> <given-names>S.</given-names></name> <name><surname>Haberman</surname> <given-names>S. J.</given-names></name></person-group> (<year>2014</year>). <article-title>How often is the misfit of item response theory models practically significant?</article-title> <source><italic>Educ. Meas. Issues Pract.</italic></source> <volume>33</volume>:<issue>1</issue>. <pub-id pub-id-type="doi">10.1111/emip.12024</pub-id></citation></ref>
<ref id="B23"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Stocking</surname> <given-names>M. L.</given-names></name> <name><surname>Lord</surname> <given-names>F. M.</given-names></name></person-group> (<year>1983</year>). <article-title>Developing a common metric in item response theory.</article-title> <source><italic>Appl. Psychol. Meas.</italic></source> <volume>7</volume> <fpage>201</fpage>&#x2013;<lpage>210</lpage>. <pub-id pub-id-type="doi">10.1177/014662168300700208</pub-id></citation></ref>
<ref id="B24"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Svetina</surname> <given-names>D.</given-names></name> <name><surname>Crawford</surname> <given-names>A. V.</given-names></name> <name><surname>Levy</surname> <given-names>R.</given-names></name> <name><surname>Green</surname> <given-names>S. B.</given-names></name> <name><surname>Scott</surname> <given-names>L.</given-names></name> <name><surname>Thompson</surname> <given-names>M.</given-names></name><etal/></person-group> (<year>2013</year>). <article-title>Designing small-scale tests: a simulation study of parameter recovery with the 1-PL.</article-title> <source><italic>Psychol. Test Assessment Model.</italic></source> <volume>55</volume> <fpage>335</fpage>&#x2013;<lpage>360</lpage>.</citation></ref>
<ref id="B25"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Thissen</surname> <given-names>D.</given-names></name> <name><surname>Wainer</surname> <given-names>H.</given-names></name></person-group> (<year>1982</year>). <article-title>Some standard errors in item response theory.</article-title> <source><italic>Psychometrika</italic></source> <volume>47</volume> <fpage>397</fpage>&#x2013;<lpage>412</lpage>. <pub-id pub-id-type="doi">10.1007/BF02293705</pub-id></citation></ref>
<ref id="B26"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vale</surname> <given-names>C. D.</given-names></name></person-group> (<year>1986</year>). <article-title>Linking item parameters onto a common scale.</article-title> <source><italic>Appl. Psychol. Meas.</italic></source> <volume>10</volume>:<issue>4</issue>. <pub-id pub-id-type="doi">10.1177/014662168601000402</pub-id></citation></ref>
<ref id="B27"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>van der Linden</surname> <given-names>W. J.</given-names></name> <name><surname>Barrett</surname> <given-names>M. D.</given-names></name></person-group> (<year>2016</year>). <article-title>Linking item response model parameters.</article-title> <source><italic>Psychometrika</italic></source> <volume>81</volume>:<issue>3</issue>. <pub-id pub-id-type="doi">10.1007/s11336-015-9469-6</pub-id> <pub-id pub-id-type="pmid">26155754</pub-id></citation></ref>
<ref id="B28"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>van der Linden</surname> <given-names>W. J.</given-names></name> <name><surname>Hambleton</surname> <given-names>R. K.</given-names></name></person-group> (<year>2013</year>). <source><italic>Handbook of Modern Item Response Theory.</italic></source> <publisher-loc>Berlin</publisher-loc>: <publisher-name>Springer Science &#x0026; Business Media</publisher-name>.</citation></ref>
<ref id="B29"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>von Davier</surname> <given-names>A. A.</given-names></name> <name><surname>Carstensen</surname> <given-names>C. H.</given-names></name> <name><surname>von Davier</surname> <given-names>M.</given-names></name></person-group> (<year>2006</year>). <article-title>Linking competencies in educational settings and measuring growth.</article-title> <source><italic>ETS Res. Rep. Ser.</italic></source> <volume>2006</volume>:<issue>1</issue>. <pub-id pub-id-type="doi">10.1002/j.2333-8504.2006.tb02018.x</pub-id></citation></ref>
<ref id="B30"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wright</surname> <given-names>B. D.</given-names></name></person-group> (<year>1977</year>). <article-title>Solving measurement problems with the Rasch model.</article-title> <source><italic>J. Educ. Meas.</italic></source> <volume>14</volume> <fpage>97</fpage>&#x2013;<lpage>116</lpage>. <pub-id pub-id-type="doi">10.1111/j.1745-3984.1977.tb00031.x</pub-id></citation></ref>
<ref id="B31"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhao</surname> <given-names>Y.</given-names></name> <name><surname>Hambleton</surname> <given-names>R. K.</given-names></name></person-group> (<year>2017</year>). <article-title>Practical consequences of item response theory model misfit in the context of test equating with mixed-format test data.</article-title> <source><italic>Front. Psychol.</italic></source> <volume>8</volume>:<issue>484</issue>. <pub-id pub-id-type="doi">10.3389/fpsyg.2017.00484</pub-id> <pub-id pub-id-type="pmid">28421011</pub-id></citation></ref>
</ref-list>
</back>
</article>