<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="2.3" xml:lang="EN">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Psychol.</journal-id>
<journal-title>Frontiers in Psychology</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Psychol.</abbrev-journal-title>
<issn pub-type="epub">1664-1078</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fpsyg.2023.1131019</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Psychology</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Using process features to investigate scientific problem-solving in large-scale assessments</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes"><name><surname>Gong</surname><given-names>Tao</given-names></name><xref rid="aff1" ref-type="aff"><sup>1</sup></xref><xref rid="aff2" ref-type="aff"><sup>2</sup></xref><xref rid="aff3" ref-type="aff"><sup>3</sup></xref><xref rid="c001" ref-type="corresp"><sup>&#x002A;</sup></xref>
<uri xlink:href="https://loop.frontiersin.org/people/82658/overview"/>
</contrib>
<contrib contrib-type="author"><name><surname>Shuai</surname><given-names>Lan</given-names></name><xref rid="aff2" ref-type="aff"><sup>2</sup></xref>
<uri xlink:href="https://loop.frontiersin.org/people/122097/overview"/>
</contrib>
<contrib contrib-type="author"><name><surname>Jiang</surname><given-names>Yang</given-names></name><xref rid="aff2" ref-type="aff"><sup>2</sup></xref>
<uri xlink:href="https://loop.frontiersin.org/people/619010/overview"/>
</contrib>
<contrib contrib-type="author"><name><surname>Arslan</surname><given-names>Burcu</given-names></name><xref rid="aff2" ref-type="aff"><sup>2</sup></xref>
<uri xlink:href="https://loop.frontiersin.org/people/1501973/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>School of Foreign Languages, Zhejiang University of Finance and Economics</institution>, <addr-line>Hangzhou, Zhejiang</addr-line>, <country>China</country></aff>
<aff id="aff2"><sup>2</sup><institution>Educational Testing Service</institution>, <addr-line>Princeton, NJ</addr-line>, <country>United States</country></aff>
<aff id="aff3"><sup>3</sup><institution>Google</institution>, <addr-line>New York, NY</addr-line>, <country>United States</country></aff>
<author-notes>
<fn id="fn0001" fn-type="edited-by"><p>Edited by: Heining Cham, Fordham University, United States</p></fn>
<fn id="fn0002" fn-type="edited-by"><p>Reviewed by: Mark D. Reckase, Michigan State University, United States; Zhi Wang, Columbia University, United States; Jun Feng, Hangzhou Normal University, China</p></fn>
<corresp id="c001">&#x002A;Correspondence: Tao Gong, <email>gtojty@gmail.com</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>18</day>
<month>04</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="collection">
<year>2023</year>
</pub-date>
<volume>14</volume>
<elocation-id>1131019</elocation-id>
<history>
<date date-type="received">
<day>24</day>
<month>12</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>03</day>
<month>04</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x00A9; 2023 Gong, Shuai, Jiang and Arslan.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Gong, Shuai, Jiang and Arslan</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<sec>
<title>Introduction</title>
<p>This study investigates the process data from scientific inquiry tasks of <italic>fair tests</italic> [requiring test-takers to manipulate a target variable while keeping other(s) constant] and <italic>exhaustive tests</italic> (requiring test-takers to construct all combinations of given variables) in the National Assessment of Educational Progress program.</p>
</sec>
<sec>
<title>Methods</title>
<p>We identify significant associations between item scores and temporal features of preparation time, execution time, and mean execution time.</p>
</sec>
<sec>
<title>Results</title>
<p>Reflecting, respectively, the durations of action planning and execution and the efficiency of execution, these process features quantitatively differentiate high- and low-performing students: in the fair tests, high-performing students tended to exhibit shorter execution time than low-performing ones, but in the exhaustive tests, they showed longer execution time; and in both types of tests, high-performing students had shorter mean execution time than low-performing ones.</p>
</sec>
<sec>
<title>Discussion</title>
<p>This study enriches process features reflecting scientific problem-solving process and competence and sheds important light on how to improve performance in large-scale, online delivered scientific inquiry tasks.</p>
</sec>
</abstract>
<kwd-group>
<kwd>scientific problem solving</kwd>
<kwd>fair test</kwd>
<kwd>exhaustive test</kwd>
<kwd>preparation time</kwd>
<kwd>execution time</kwd>
</kwd-group>
<counts>
<fig-count count="7"/>
<table-count count="7"/>
<equation-count count="0"/>
<ref-count count="47"/>
<page-count count="14"/>
<word-count count="10367"/>
</counts>
</article-meta>
</front>
<body>
<sec id="sec1" sec-type="intro">
<label>1.</label>
<title>Introduction</title>
<p>The past two decades have witnessed an increasing use of computers and relevant technologies in classroom teaching and learning (<xref ref-type="bibr" rid="ref19">Hoyles and Noss, 2003</xref>) and a swift transition from traditional paper-and-pencil tests to <italic>digitally-based assessments</italic> (DBAs) (<xref ref-type="bibr" rid="ref48">Zenisky and Sireci, 2002</xref>; <xref ref-type="bibr" rid="ref42">Scalise and Gifford, 2006</xref>) that accommodate advancement of educational technologies. Along with these trends, the National Assessment of Educational Progress (NAEP)<xref rid="fn0003" ref-type="fn">
<sup>1</sup></xref> began to use hand-held tablets to administer math assessments in the U.S. in 2017, and other subjects followed afterward. Capable of recording multi-dimensional data, DBAs offer ample opportunities to systematically investigate U.S. students&#x2019; problem-solving processes through well-designed <italic>technology-enhanced items</italic> (TEIs) (<xref ref-type="bibr" rid="ref36">National Assessment Governing Board, 2015</xref>). TEIs refer broadly to computer-aided items that incorporate technology beyond simple option selection as test-takers&#x2019; response method (<xref ref-type="bibr" rid="ref25">Koedinger and Corbett, 2006</xref>). In a TEI, test-takers are asked to interact with computers by performing a series of actions to solve one or more problems. For example, in scientific inquiry TEIs of <italic>fair tests</italic> (<xref ref-type="bibr" rid="ref5">Chen and Klahr, 1999</xref>), students are asked to adjust a target variable in an experimental setting or condition while keeping the other(s) constant, to reveal the effect or outcome of the target variable. In another type of scientific inquiry TEIs, <italic>exhaustive tests</italic> (<xref ref-type="bibr" rid="ref34">Montgomery, 2000</xref>; <xref ref-type="bibr" rid="ref3">Black, 2007</xref>), students are required to construct all possible combinations of given variables to investigate what combination(s) leads to a specific outcome (see Section 2 for details). In both types of tests, students need to apply the <italic>control-of-variables strategy</italic> (CVS, see Section 2 for details), a domain-general processing skill used to design controlled experiments in a multi-variable system (<xref ref-type="bibr" rid="ref28">Kuhn and Dean, 2005</xref>; <xref ref-type="bibr" rid="ref27">Kuhn, 2007</xref>).</p>
<p>Beyond final responses, interactive actions of students are captured as <italic>process data</italic>.<xref rid="fn0004" ref-type="fn">
<sup>2</sup></xref> Such data help (re)construct problem-solving processes, reflect durations (or frequencies) of major problem-solving stages, and infer how students deploy strategies they seem to know (<xref ref-type="bibr" rid="ref39">Pedaste et al., 2015</xref>; <xref ref-type="bibr" rid="ref40">Provasnik, 2021</xref>), all of which provide additional clues of students&#x2019; problem-solving behaviors (<xref ref-type="bibr" rid="ref23">Kim et al., 2007</xref>; <xref ref-type="bibr" rid="ref9">Ebenezer et al., 2011</xref>; <xref ref-type="bibr" rid="ref11">Gobert et al., 2012</xref>). For example, in drag-and-drop (D&#x0026;D) items, a popular type of TEIs, students drag some objects from source locations and drop them into target positions on screen. Compared to conventional multiple-choice items, such items can better represent construct-relevant skills, strengthen measurement, improve engagement/motivation of test-takers, and reduce interference of random guessing (<xref ref-type="bibr" rid="ref4">Bryant, 2017</xref>; <xref ref-type="bibr" rid="ref1">Arslan et al., 2020</xref>).</p>
<p>Despite these advantages, process data have long been treated as by-products in educational assessments. Only recently have scholars begun to investigate whether (and if so, how) process data inform (meta)cognitive processes and students&#x2019; strategies during problem solving (<xref ref-type="bibr" rid="ref16">Guo et al., 2019</xref>; <xref ref-type="bibr" rid="ref45">Tang et al., 2019</xref>; <xref ref-type="bibr" rid="ref12">Gong et al., 2021</xref>, <xref ref-type="bibr" rid="ref13">2022</xref>). By reviewing pioneering studies on NAEP process data before the formal transition to DBA, <xref ref-type="bibr" rid="ref2">Bergner and von Davier (2019)</xref> proposed a hierarchical framework that divides process data use into five levels based on their relative importance to outcome: <italic>Level 1</italic>, process data are irrelevant/ignored and only response data are considered; <italic>Level 2</italic>, process data are incorporated as <italic>auxiliary</italic> to understanding outcome; <italic>Level 3</italic>, process data are incorporated as <italic>essential</italic> to understanding outcome; <italic>Level 4</italic>, process data are outcome and incorporated into scoring rubrics; and <italic>Level 5</italic>, process data are outcome and incorporated into measurement models.</p>
<p>Most published process data studies remain at or below Level 2 of this framework; they directly use students&#x2019; actions, action sequences, and (partial/rough) durations of answering processes to interpret item outcome (e.g., answer change behaviors, <xref ref-type="bibr" rid="ref32">Liu et al., 2015</xref>; response time, <xref ref-type="bibr" rid="ref29">Lee and Jia, 2014</xref>; or action sequences, <xref ref-type="bibr" rid="ref18">Han et al., 2019</xref>; <xref ref-type="bibr" rid="ref47">Ulitzsch et al., 2021</xref>). Until correlations between process data and individual performance are explicitly revealed, inferences from these studies remain <italic>auxiliary</italic> rather than <italic>essential</italic>. In other words, discovering process features and their relatedness to test-takers&#x2019; performance is a <italic>prerequisite</italic> for using process features to understand or interpret individual performance, and thus for reaching higher levels of the framework.</p>
<p>This study aims to fulfill this prerequisite by investigating process data from <italic>scientific inquiry tasks</italic> (see <xref ref-type="supplementary-material" rid="SM1">Supplementary materials</xref>) and related research questions therein in a three-step procedure:</p>
<p><italic>Define time-related features to illustrate action planning and executing stages of scientific problem solving</italic>. Many early studies have examined action-related features that reflect conceptual formation (<xref ref-type="bibr" rid="ref22">Jonassen, 2000</xref>; <xref ref-type="bibr" rid="ref30">Lesh and Harel, 2011</xref>), response strategies, and internal (individual dispositions) or external (testing circumstances) factors probably affecting students&#x2019; choices of strategies (<xref ref-type="bibr" rid="ref14">Griffiths et al., 2015</xref>; <xref ref-type="bibr" rid="ref31">Lieder and Griffiths, 2017</xref>; <xref ref-type="bibr" rid="ref35">Moon et al., 2018</xref>). However, the time needed for problem solving has been largely undervalued (<xref ref-type="bibr" rid="ref8">Dost&#x00E1;l, 2015</xref>). As an informative indicator of problem solving stages, temporal information helps characterize patterns of students, and infer (meta)cognitive processes occurring at various stages of problem solving.</p>
<p>We propose three temporal features to reflect, respectively, the major stages of scientific problem solving (see <xref rid="fig1" ref-type="fig">Figure 1</xref>). In an assessment setting, <italic>preparation time</italic> (<italic>PT</italic>) is defined as the time difference (duration) between the moment students enter a test scene and the moment they make their first answer-related event. It denotes the duration during which students understand instructions and conceptually plan their actions, before making any. <italic>Execution time</italic> (<italic>ET</italic>) is defined as the time difference between students&#x2019; first and last answer-related events. It measures the duration during which students execute their planned actions. <italic>Mean execution time</italic> (<italic>MET</italic>) is measured as <italic>ET</italic> divided by the number of answer-related events.<xref rid="fn0005" ref-type="fn">
<sup>3</sup></xref> <italic>ET</italic> reflects the total effort students expend to construct their answers, including setting up answers and revising or (possibly) reviewing their choices, whereas <italic>MET</italic> reflects the average effort per event. Controlling for the number of answer-related events, <italic>MET</italic> indicates the efficiency of action execution. Our study examines whether these temporal features significantly correlate with item scores and characterize high/low-performing students in test scenes.<xref rid="fn0006" ref-type="fn">
<sup>4</sup></xref></p>
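<p>As a concrete illustration of these definitions, the three features can be computed from a timestamped event log. The Python sketch below is ours, not NAEP code; the event names and timestamps are hypothetical.</p>

```python
# Minimal sketch (ours, not NAEP's): computing PT, ET, and MET from a
# hypothetical timestamped event log for one student in one test scene.
# Event names and timestamps are invented for illustration.
events = [
    (0.0,  "enter_scene"),
    (42.5, "drop_tree"),   # first answer-related event
    (55.1, "drop_tree"),
    (71.8, "move_tree"),
    (84.0, "drop_tree"),   # last answer-related event
    (90.2, "submit"),
]
ANSWER_EVENTS = {"drop_tree", "move_tree"}  # assumed answer-related types

def temporal_features(events):
    enter = events[0][0]
    times = [t for t, e in events if e in ANSWER_EVENTS]
    pt = times[0] - enter        # preparation time: scene entry -> first answer event
    et = times[-1] - times[0]    # execution time: first -> last answer event
    met = et / len(times)        # mean execution time per answer-related event
    return pt, et, met

pt, et, met = temporal_features(events)
```

<p>For this invented log, PT is 42.5 s, ET is 41.5 s, and MET is 41.5/4 = 10.375 s per answer-related event.</p>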
<fig position="float" id="fig1">
<label>Figure 1</label>
<caption>
<p>Proposed process features [preparation time (PT) and execution time (ET)] and corresponding major stages of scientific problem solving (understanding and planning, and executing planned actions, denoted by colored bars) in a scientific inquiry test item. Vertical lines denote the times when a test-taker enters and exits the task, vertical bars denote answer-related actions. See <xref ref-type="supplementary-material" rid="SM1">Supplementary materials</xref> for review of scientific problem solving processes.</p>
</caption>
<graphic xlink:href="fpsyg-14-1131019-g001.tif"/>
</fig>
<p><italic>Explore correlations between process features and item scores</italic>. This is the missing link in many existing studies of problem solving; some cannot verify such correlations because the categorical features (e.g., action sequences) they use are unsuitable for correlation tests, whereas others simply assume such correlations, skipping this step and using process features to inform/interpret performance. Neither approach is complete. Our study focuses on detecting correlations between continuous process features and item scores and on explaining feature differences across score groups.</p>
<p>We apply two statistical tests to detect correlations. First, the Kruskal-Wallis test (<xref ref-type="bibr" rid="ref26">Kruskal and Wallis, 1952</xref>) compares process features across score groups and reports whether (at least) one of the multiple samples is significantly distinct from the others. As a non-parametric counterpart of ANOVA, this test does not require a normal distribution of the residual values of the features. An extension of the Mann&#x2013;Whitney test, it is an omnibus test applicable to small-scale, independent samples from multiple groups. Second, we conduct an omnibus ANOVA between score groups, with the process features log-transformed (base <italic>e</italic>) to meet the normality assumption. This test is applicable to large-scale datasets. We use both methods to cross-validate the results obtained by each.</p>
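<p>For illustration, both statistics can be computed from first principles. The sketch below is a didactic implementation of ours, not the analysis code used in the study; in practice a statistics package would be used.</p>

```python
# Didactic sketch (ours): Kruskal-Wallis H with tie correction, and one-way
# ANOVA F with an optional base-e log transform, in plain Python.
import math
from collections import Counter

def _ranks(values):
    # 1-based ranks, averaging over ties
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kruskal_h(*groups):
    """Kruskal-Wallis H statistic (tie-corrected) across score groups."""
    data = [x for g in groups for x in g]
    n = len(data)
    ranks = _ranks(data)
    h, idx = 0.0, 0
    for g in groups:
        rs = ranks[idx:idx + len(g)]
        idx += len(g)
        h += sum(rs) ** 2 / len(g)
    h = 12.0 / (n * (n + 1)) * h - 3 * (n + 1)
    ties = sum(t ** 3 - t for t in Counter(data).values())
    return h / (1 - ties / (n ** 3 - n))

def anova_f(*groups, log_transform=True):
    """One-way ANOVA F statistic, optionally on log-transformed (base e) data."""
    gs = [[math.log(x) for x in g] if log_transform else list(g) for g in groups]
    all_x = [x for g in gs for x in g]
    grand = sum(all_x) / len(all_x)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in gs)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in gs for x in g)
    df_b, df_w = len(gs) - 1, len(all_x) - len(gs)
    return (ss_between / df_b) / (ss_within / df_w)
```

<p>Each function takes one sample of a process feature per score group; <code>anova_f</code> applies the base-<italic>e</italic> log transform before computing <italic>F</italic>, mirroring the transformation described above.</p>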
<p><italic>Use process features to characterize performance (or competence) differences between students and/or tasks</italic>. After verifying correlations between process features and item scores, we further investigate: (a) <italic>whether there exist differences (or similarities) in the process features across score groups and/or inquiry tasks</italic>; and (b) <italic>whether the observed differences (or similarities) characterize problem solving performance (or competence) between high- and low-performing students and between inquiry tasks.</italic> Answers to these questions further establish these features as informative indicators of students&#x2019; performance and pave the way for incorporating them into scoring rubrics and measurement models that aim to classify and interpret students&#x2019; behaviors.</p>
<p>In the following sections, we first review the CVS strategies and scientific inquiry tasks, and then define the process metrics and analysis plans. After reporting the analysis results, we answer the abovementioned questions, summarize our contributions to scientific inquiry and problem solving research, and point out the general procedure of process data use in educational assessments.</p>
</sec>
<sec id="sec2">
<label>2.</label>
<title>Control-of-variables strategies and scientific inquiry tasks</title>
<p><italic>Control-of-variables strategy</italic> (CVS)<xref rid="fn0007" ref-type="fn">
<sup>5</sup></xref> has been widely studied in science assessments. CVS refers to the skill used to design controlled experiments in a multi-variable system. To avoid confounded experiments, all variables but those under investigation must be controlled in a way to meet task requirements. In the <italic>Next Generation Science Standards</italic> (NGSS), CVS and multivariate reasoning are viewed as two key scientific thinking skills. Central to early science instruction (<xref ref-type="bibr" rid="ref24">Klahr and Nigam, 2004</xref>) (around grades 4&#x2013;8), CVS cannot develop routinely without practice or instruction (<xref ref-type="bibr" rid="ref43">Schwichow et al., 2016</xref>), making it a critical issue in development of scientific thinking (<xref ref-type="bibr" rid="ref28">Kuhn and Dean, 2005</xref>). Children, adolescents, and adults with low science inquiry skills show difficulty in applying CVS in scientific problem solving (<xref ref-type="bibr" rid="ref5">Chen and Klahr, 1999</xref>).</p>
<p>In large-scale assessments like NAEP, CVS is often assessed by two types of scientific inquiry tasks: fair tests and exhaustive tests. <italic>A fair test</italic> (see examples in Section 3.1) refers to a controlled investigation carried out to answer a scientific question about the effect of a target variable. To control for confounding factors and be scientifically sound, students are expected to apply the CVS to meet the fair test requirements that: (a) all other variable(s) are kept constant; and (b) only the target one(s) changes across conditional sets for comparison. In such a &#x201C;fair&#x201D; setting, the effect of the target variable(s) can be observed with less interference from other variables. To properly complete the task, students need to choose, among possible combinations of different levels of the target and other variables, one or a few conditions that meet the requirement. Earlier studies of CVS in scientific inquiry have relied on small participant samples and response/survey data (<xref ref-type="bibr" rid="ref27">Kuhn, 2007</xref>). A recent meta-analysis of intervention studies (partially) designed to enhance CVS skills revealed that instruction/intervention (e.g., cognitive conflict and demonstration) influences achievement in scientific inquiry tasks (<xref ref-type="bibr" rid="ref43">Schwichow et al., 2016</xref>).</p>
<p><italic>An exhaustive test</italic> (a.k.a. <italic>all-pair</italic> or <italic>combinatorial test</italic>) (see examples in Section 3.2) requires test-takers to construct, physically or mentally, (nearly) all possible combinations of given variables to address an inquiry into what condition(s) induces a specific outcome. Similar to fair tests, students in exhaustive tests need to control the given variables by setting up combinations exhaustively or nearly so (in an open-ended case). Though not explicitly mentioned in NGSS, exhaustive testing is essentially related to CVS, or at least is a case of multivariate reasoning. How to conduct exhaustive tests is usually taught and learned relatively late in science education (around grades 9&#x2013;12). Such tests have also been adopted in fields other than educational assessment, e.g., software engineering and business (<xref ref-type="bibr" rid="ref15">Grindal et al., 2005</xref>).</p>
</sec>
<sec id="sec3" sec-type="materials|methods">
<label>3.</label>
<title>Materials and methods</title>
<p>Our study makes use of the 2018 NAEP science pilot tasks (see <xref ref-type="supplementary-material" rid="SM1">Supplementary materials</xref>). It adopted four tests from four tasks in the repertoire: two fair tests administered to fourth- and eighth-graders, respectively (the primary and middle school bands, per NGSS), and two exhaustive tests administered to twelfth-graders (the high school band). <xref rid="tab1" ref-type="table">Table 1</xref> shows the samples of these tests.</p>
<table-wrap position="float" id="tab1">
<label>Table 1</label>
<caption>
<p>Basic information of the testlets investigated in this paper.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="top">Test item</th>
<th align="left" valign="top">Subfield</th>
<th align="center" valign="top">Grade</th>
<th align="center" valign="top">No. Students (Female, Male) for analyses</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="middle">Fair test 1</td>
<td align="left" valign="top">Earth/space science</td>
<td align="center" valign="middle">8</td>
<td align="center" valign="top">1,607 (800, 807)</td>
</tr>
<tr>
<td align="left" valign="middle">Fair test 2</td>
<td align="left" valign="top">Physical science</td>
<td align="center" valign="middle">4</td>
<td align="center" valign="top">1,990 (977, 1,013)</td>
</tr>
<tr>
<td align="left" valign="middle">Exhaustive test 1</td>
<td align="left" valign="top">Life science</td>
<td align="center" valign="middle">12</td>
<td align="center" valign="top">2,726 (1,285, 1,341)</td>
</tr>
<tr>
<td align="left" valign="middle">Exhaustive test 2</td>
<td align="left" valign="top">Earth/space science</td>
<td align="center" valign="middle">12</td>
<td align="center" valign="top">2,947 (1,465, 1,482)</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Due to various reasons (e.g., early quit or data capture glitches), data of some students were missing. The rightmost column records the number of students whose process and response data were used for analyses.</p>
</table-wrap-foot>
</table-wrap>
<p>Two criteria led to the choice of these tasks. First, the sampled tests should cover most science subfields and grades in the NAEP sample. However, given that lower-grade students have not been taught to solve exhaustive tasks, no such tests were administered to fourth-graders. Second, since fair tests were administered mostly to eighth-graders and exhaustive tests to twelfth-graders, it was not possible to select fair and exhaustive tests administered to students of the same grade. Nonetheless, since all the NAEP fair and exhaustive tests were designed by content experts following similar constructs, and the only difference was that each task fell into one of the science disciplines (physical, life, or earth/space science), the chosen tests in our study are representative.</p>
<sec id="sec4">
<label>3.1.</label>
<title>The fair tests and scoring rubrics</title>
<p>The fair test 1 comes from an earth/space science task. Its cover task<xref rid="fn0008" ref-type="fn">
<sup>6</sup></xref> is as follows. A city near a mountain suffers from strong north winds each year. The government plans to test the wind-blocking effect of three types of trees. Each type can be planted at the foot (low), side (medium), or peak (high) of the northern ridge of the mountain to reduce wind speed, and there is no interaction between tree type and mountain position (e.g., there is no preference for one type of tree to be planted at a specific position).</p>
<p>The whole task is presented to students through multiple scenes, some involving items. The first few scenes help students understand, represent, and explore relevant issues. Then comes the fair test scene, in which students are asked to design a controlled experiment to investigate the wind-blocking effects of the three types of trees. The follow-up scenes ask them to interpret/revisit their answers and apply their knowledge in novel conditions. Students went through these scenes in the same order and could not freely jump around.</p>
<p>In the fair test scene (see <xref rid="fig2" ref-type="fig">Figure 2</xref>), students need to drag each type of the trees and drop it at one of the four virtual mountains resembling the real one near the city; students can drop the trees at the foot (low), side (medium), or peak (high) of the northern ridge of each mountain. Each mountain can hold one type of the trees, and each type can only be planted at one mountain. Students can move trees from one mountain to another, or from one position of a mountain to another position of the same or a different mountain. After making their final selections, students click an on-screen &#x201C;Submit&#x201D; button to initiate the experiment. Then, the wind speeds before and after passing over each of the mountains with/without trees are shown on screen as experimental results.</p>
<fig position="float" id="fig2">
<label>Figure 2</label>
<caption>
<p>Example answers of the fair test 1. &#x201C;Low,&#x201D; &#x201C;Medium,&#x201D; &#x201C;High&#x201D; denote tree positions (foot, side, peak) in the northern ridge of a virtual mountain. &#x201C;None&#x201D; means no tree planted. In <bold>(A)</bold>, the first &#x201C;Low&#x201D; indicates that one type of trees is planted at the foot of the mountain, the second and third &#x201C;Low&#x201D; indicate that the other two types of trees are planted on the second and third mountains, and &#x201C;None&#x201D; means that the fourth mountain has no trees planted. Since the scoring rubric (see <xref rid="tab2" ref-type="table">Table 2</xref>) does not specify tree type and ignores the mountain without trees, submitted answers can be simply denoted by the tree positions in the mountains with trees. In this way, answer <bold>(A)</bold> can be denoted as &#x201C;Low; Low; Low&#x201D;, answer <bold>(B)</bold> as &#x201C;Medium; High; High&#x201D;, and answer <bold>(C)</bold> as &#x201C;Low; High; Medium&#x201D;.</p>
</caption>
<graphic xlink:href="fpsyg-14-1131019-g002.tif"/>
</fig>
<p>This fair test has two variables: tree type (with three levels, corresponding to the three tree types) and tree position (with three levels, low, medium, and high). To conduct a fair test showing the effect of tree type, students must keep the tree positions across mountains identical. <xref rid="tab2" ref-type="table">Table 2</xref> shows the scoring rubric of the test. Since students can never plant the same type of trees on two mountains or at two positions of one mountain, the rubric focuses mainly on the types of trees planted on mountains. In addition, no matter how students plant trees, one mountain is left with no trees. A complete comparison of the effect of tree type needs a baseline condition of no trees, but students are not required to explicitly set up this condition in this test. Therefore, although there are in principle 3&#x2009;&#x00D7;&#x2009;3&#x2009;&#x00D7;&#x2009;3&#x2009;&#x00D7;&#x2009;<italic>P</italic>(4,3)&#x2009;=&#x2009;648 ways of tree planting, 3&#x2009;&#x00D7;&#x2009;<italic>P</italic>(4,3)&#x2009;=&#x2009;72 of which match the fair test requirement (<italic>P</italic> means permutation), the matching answers can be classified into three types: (a) those having the three types of trees all planted at the &#x201C;Low&#x201D; positions of any three out of the four mountains; (b) all at the &#x201C;Medium&#x201D; positions; and (c) all at the &#x201C;High&#x201D; positions. These answers receive a full score (3). Answers having trees planted at two distinct positions of any three mountains receive a partial score (2), and those having trees planted at three distinct positions of any three mountains receive the lowest score (1).</p>
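<p>These counts can be verified by brute-force enumeration. The sketch below is ours, for illustration: each planting assigns a position to each of the three tree types and an ordered choice of three of the four mountains, and is scored by the number of distinct positions used, per the rubric in Table 2.</p>

```python
# Illustrative enumeration (ours) of the fair test 1 answer space.
from itertools import permutations, product

POSITIONS = ("Low", "Medium", "High")

total = fair = 0
score_counts = {1: 0, 2: 0, 3: 0}
# Each of the 3 tree types gets a position (3^3 tuples) and one of the
# four mountains (ordered choices of 3 out of 4, i.e., P(4,3) = 24).
for positions in product(POSITIONS, repeat=3):
    for mountains in permutations(range(4), 3):
        total += 1
        distinct = len(set(positions))      # distinct positions used
        if distinct == 1:                   # all trees at the same position
            fair += 1
        # Rubric: 1 distinct position -> 3; 2 distinct -> 2; 3 distinct -> 1
        score_counts[{1: 3, 2: 2, 3: 1}[distinct]] += 1
```

<p>The enumeration gives <code>total == 648</code> and <code>fair == 72</code>, matching the counts above.</p>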
<table-wrap position="float" id="tab2">
<label>Table 2</label>
<caption>
<p>Scoring rubrics of the fair tests 1 and 2.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="top">Score</th>
<th align="left" valign="top">Rubric of the fair test 1</th>
<th align="left" valign="top">Rubric of the fair test 2</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">3</td>
<td align="left" valign="top">Trees are planted at the same positions of three mountains (e.g., Low; Low; Low in <xref rid="fig2" ref-type="fig">Figure 2A</xref>)</td>
<td align="left" valign="top">Select three distinct ingredients with identical amount (e.g., <xref rid="fig3" ref-type="fig">Figure 3B</xref>)</td>
</tr>
<tr>
<td align="left" valign="top">2</td>
<td align="left" valign="top">Two types of trees are planted at the same positions of the mountains (e.g., Medium; High; High in <xref rid="fig2" ref-type="fig">Figure 2B</xref>)</td>
<td align="left" valign="top">Select three distinct ingredients, but two of them have identical amounts or all three have distinct amounts (e.g., <xref rid="fig3" ref-type="fig">Figure 3C</xref>)</td>
</tr>
<tr>
<td align="left" valign="top">1</td>
<td align="left" valign="top">Tree positions on the mountains are distinct (e.g., Low; High; Medium in <xref rid="fig2" ref-type="fig">Figure 2C</xref>)</td>
<td align="left" valign="top">None of the above (e.g., <xref rid="fig3" ref-type="fig">Figure 3D</xref>)</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The fair test 2 comes from a physical science task. Its cover task is as follows. A bakery shop is developing a new product. The bakers want to test which of the three ingredients (white candy, butter, and honey) has the most acceptable sweetness in the new product. Each ingredient has three amounts to choose from: 50, 100, and 200 milligrams. After the instruction scenes, in the fair test scene, nine piles of the three ingredients with the three amounts are shown on the left side of the screen (see <xref rid="fig3" ref-type="fig">Figure 3A</xref>), and students can drag three of these piles into the three slots on the right side of the screen to show the effect of ingredients on the sweetness of the product. Students can move the piles from one slot to another. After making their final choices, students click on an on-screen &#x201C;Submit&#x201D; button to initiate the experiment, and the sweetness of each choice is shown on the screen.</p>
<fig position="float" id="fig3">
<label>Figure 3</label>
<caption>
<p>Example answers of the fair test 2. <bold>(A)</bold> Nine piles of ingredients for selection: 50, 100, and 200 are milligrams, and each pile is marked by an index of 1&#x2013;9. <bold>(B)</bold> A choice of three piles, denoted by 2&#x2013;5&#x2013;8, matching the fair test requirement. <bold>(C)</bold> A choice of three piles, 1&#x2013;5&#x2013;9, partially matching the requirement. <bold>(D)</bold> A choice of three piles, 7&#x2013;8&#x2013;6, not matching the requirement.</p>
</caption>
<graphic xlink:href="fpsyg-14-1131019-g003.tif"/>
</fig>
<p>This fair test has two variables: ingredient type (white candy, butter, and honey) and ingredient amount (50, 100, and 200 milligrams). To show the effect of the ingredients, one needs to keep the ingredient amount identical across conditions. Among a total of <italic>C</italic>(9,3)&#x2009;&#x00D7;&#x2009;<italic>P</italic>(3,3)&#x2009;=&#x2009;504 choices of three piles of ingredients (<italic>C</italic> denotes combination and <italic>P</italic> permutation), 3&#x2009;&#x00D7;&#x2009;<italic>P</italic>(3,3)&#x2009;=&#x2009;18 match the fair test requirement. <xref rid="tab2" ref-type="table">Table 2</xref> shows the scoring rubric of the test. Answers matching the fair test requirement receive the full score (3), and the others receive a partial (2) or the lowest score (1).</p>
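These counts can be verified by brute force. A minimal Python sketch, assuming piles are indexed 1&#x2013;9 row by row (piles 1&#x2013;3 are white candy, 4&#x2013;6 butter, 7&#x2013;9 honey, with amounts 50, 100, and 200 mg within each row; this indexing is an assumption, though it is consistent with the frequent answers reported in Section 4.1):

```python
from itertools import permutations

# Assumed indexing: pile i (1-9) has ingredient type (i - 1) // 3
# (0 = white candy, 1 = butter, 2 = honey) and amount index
# (i - 1) % 3 (0 = 50 mg, 1 = 100 mg, 2 = 200 mg).
def ingredient(i):
    return (i - 1) // 3

def amount(i):
    return (i - 1) % 3

# Ordered choices of three distinct piles into the three slots:
# C(9,3) * P(3,3) = P(9,3) = 504.
choices = list(permutations(range(1, 10), 3))

# A fair-test answer uses three distinct ingredient types with one
# identical amount: 3 amounts * P(3,3) orderings = 18.
fair = [c for c in choices
        if len({ingredient(i) for i in c}) == 3
        and len({amount(i) for i in c}) == 1]

print(len(choices), len(fair))  # 504 18
```

Under this indexing, the answer 2&#x2013;5&#x2013;8 (Figure 3B) satisfies the fair test requirement, whereas 1&#x2013;5&#x2013;9 (Figure 3C) varies both type and amount.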
</sec>
<sec id="sec5">
<label>3.2.</label>
<title>The exhaustive tests and scoring rubric</title>
<p>The exhaustive test 1 comes from a life science task. Its cover task is as follows. Farmers are trying to cultivate flowers with a special color, either in a natural way or using one or both of two types of fertilizers (A and B). After scenes in which students understand the related issues and represent and explore different conditions, the exhaustive test scene asks students to design an experiment to show which way has the highest chance of cultivating flowers with the target color. They can set up a condition by selecting either, both, or neither fertilizer, and save it by clicking an on-screen &#x201C;Save&#x201D; button. They can also remove a saved condition by clicking on it and then an on-screen &#x201C;Delete&#x201D; button. After saving some conditions, they can click an on-screen &#x201C;Submit&#x201D; button to submit all the conditions saved at that moment as their final answer. This test requires four variable combinations (see <xref rid="fig4" ref-type="fig">Figure 4</xref>). The follow-up scenes ask students to review their answers and apply their knowledge in similar domains. Students went through these scenes in a fixed order and could not jump freely between them.</p>
<fig position="float" id="fig4">
<label>Figure 4</label>
<caption>
<p>All the combinations in the exhaustive test 1: <bold>(A)</bold> None; <bold>(B)</bold> A; <bold>(C)</bold> B; <bold>(D)</bold> A&#x2009;+&#x2009;B.</p>
</caption>
<graphic xlink:href="fpsyg-14-1131019-g004.tif"/>
</fig>
<p><xref rid="tab3" ref-type="table">Table 3</xref> shows the scoring rubric of this test. The four scales are based on the types of the saved answers, especially whether they include the hard-to-foresee ones (e.g., <xref rid="fig4" ref-type="fig">Figures 4A</xref>,<xref rid="fig4" ref-type="fig">D</xref>). An exhaustive answer covering all combinations in <xref rid="fig4" ref-type="fig">Figure 4</xref> receives the full score (4), whereas answers lacking one, two, or three of the combinations receive the lower scores of 3, 2, and 1, respectively. The validity of the rubric (whether it can reasonably reflect students&#x2019; intuitive conceptions and classify students with various levels of problem solving skills) is beyond the scope of this paper.</p>
<table-wrap position="float" id="tab3">
<label>Table 3</label>
<caption>
<p>Scoring rubrics of the exhaustive tests 1 and 2.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="top">Score</th>
<th align="left" valign="top">Rubric of exhaustive test 1</th>
<th align="left" valign="top">Rubric of exhaustive test 2</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">4</td>
<td align="left" valign="top">Answers cover all the four conditions: None, A, B, A&#x2009;+&#x2009;B.</td>
<td align="left" valign="top">The 15 chosen loc. include:<break/>(1) One loc. in each of the 13 regions (except the demo region 6).<break/>(2) One additional loc. in one of the regions adjacent to City A.<break/>(3) One additional loc. in one of the regions adjacent to City B.</td>
</tr>
<tr>
<td align="left" valign="top">3</td>
<td align="left" valign="top">Answers exclude None, OR exclude A or B.</td>
<td align="left" valign="top">At least 14 chosen loc. match cases (1) and (2), or cases (1) and (3); OR at least 13 chosen loc. match case (1) only.</td>
</tr>
<tr>
<td align="left" valign="top">2</td>
<td align="left" valign="top">Answers exclude A&#x2009;+&#x2009;B,<break/>OR exclude A&#x2009;+&#x2009;B and A or B,<break/>OR exclude None and A&#x2009;+&#x2009;B.</td>
<td align="left" valign="top">At least 2 loc. match cases (2) and (3) above.</td>
</tr>
<tr>
<td align="left" valign="top">1</td>
<td align="left" valign="top">None of the above.</td>
<td align="left" valign="top">None of the above.</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>For the exhaustive test 1, the original rubric also evaluates whether students give a proper interpretation of submitted answers. Here, students are rescored based only on their saved conditions. &#x201C;Loc.&#x201D; stands for location.</p>
</table-wrap-foot>
</table-wrap>
<p>The exhaustive test 2 comes from an earth science task. Its cover task is as follows. Two cities (A and B) plan to build a transmission tower to broadcast television signals. To evaluate signal quality on the land between the cities, they segment the land into 14 regions, each having four locations for signal sampling (see <xref rid="fig5" ref-type="fig">Figure 5</xref>). After instructions, students are asked to select at most 15 locations (out of 42) in the 13 regions (one region, with one location therein already chosen, serves as a demo) to test the signal coverage. They can select a location by clicking on it and deselect it by clicking on it again. If 15 locations are already chosen, students must deselect some chosen locations before making new selections. After choosing some locations (not necessarily 15), students can click on an on-screen &#x201C;Submit&#x201D; button to submit the chosen locations as their final answer.</p>
<fig position="float" id="fig5">
<label>Figure 5</label>
<caption>
<p>Example answers in the exhaustive test 2. Squares marked as 1&#x2013;14 are the regions between City A and City B. Round dots in a region are locations for signal sampling. Region 6 is the demo region with a chosen location marked in red, and the other locations are marked in grey. Green dots are students&#x2019; chosen locations. In this answer, there is at least one chosen location in each of the 13 regions other than the demo region 6, and there is at least one additional chosen location in one of the three regions adjacent to City A (region 5) and to City B (region 11). This answer has a score of 4 (see <xref rid="tab3" ref-type="table">Table 3</xref>).</p>
</caption>
<graphic xlink:href="fpsyg-14-1131019-g005.tif"/>
</fig>
<p>In this exhaustive test, students need to (a) select at least one location in each of the 13 regions to test signal quality, and (b) choose one additional location among the three regions adjacent to each city (two additional locations in total) to evaluate the signal sources in the two cities. It is challenging to foresee both aspects of the requirement. <xref rid="tab3" ref-type="table">Table 3</xref> shows the scoring rubric of the test. The four scales depend on whether students fulfill both, one, or neither of the two aspects of the requirement. Whether this rubric is valid is not the focus of this paper.</p>
</sec>
<sec id="sec6">
<label>3.3.</label>
<title>Process-based measures</title>
<p>We define and measure three temporal features: preparation time (<italic>PT</italic>), execution time (<italic>ET</italic>), and mean execution time per answer-related event (<italic>MET</italic>). All are calculated from the time stamps of answer-related events. In the fair tests, answer-related events include dragging a type of tree (or a pile of ingredients) and dropping it on a position of a virtual mountain (or into a slot), and moving a type of tree (or a pile of ingredients) from one mountain position (or slot) to another; in the exhaustive tests, such events include selecting one or both fertilizers (or a number of locations), and saving or deleting a condition. The ending time point of <italic>ET</italic> is not the moment when students click the &#x201C;Submit&#x201D; button, because after executing the last answer-related event, students can review their answers and thereby move into the next stage of problem solving. Also, executing actions may involve planning bound to prior actions, which differs from the conceptual planning of related actions before making any. Therefore, we define <italic>ET</italic> as the duration between the first and last answer-related events. Apart from answer-related events, other factors (e.g., mouse or computer speed) might affect the efficiency of action execution. Since the tests were administered on site using the same model of tablets, the influence of these factors was minimal.</p>
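Given a sorted list of answer-related event time stamps, the three features can be sketched as follows. This is a minimal illustration: the flat-list log format and the <italic>scene_start</italic> variable are assumptions, and <italic>MET</italic> is taken as <italic>ET</italic> divided by the number of answer-related events, the natural reading of the definition above.

```python
def process_features(event_times, scene_start):
    """Compute PT, ET, and MET (in seconds) for one test scene.

    event_times: sorted time stamps of answer-related events;
    scene_start: time stamp at which the test scene began.
    Both names and the flat-list format are illustrative
    assumptions about the process-data log.
    """
    if not event_times:
        return None  # no answer-related events recorded
    pt = event_times[0] - scene_start        # time before the first event
    et = event_times[-1] - event_times[0]    # first-to-last event span
    met = et / len(event_times)              # mean time per event
    return pt, et, met
```

For example, events at 12, 20, and 30 s in a scene starting at 2 s yield PT = 10 s, ET = 18 s, and MET = 6 s.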
</sec>
<sec id="sec7">
<label>3.4.</label>
<title>Preprocessing and analysis plan</title>
<p>Before analysis, we first remove missing values. Then, for each process feature in a data set, we adopt a 98% winsorization (<xref ref-type="bibr" rid="ref6">Dixon, 1960</xref>): values below the 1st percentile are set to the value at the 1st percentile, and values above the 99th percentile are set to the value at the 99th percentile, which adjusts spurious outliers. Winsorization makes no assumption about the data distribution and retains all data points, thus being more flexible than outlier removal methods that presume normally distributed data.</p>
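A minimal sketch of this 98% winsorization, with percentiles computed by linear interpolation between closest ranks; production implementations (e.g., in R or SciPy) may differ slightly in interpolation and tie handling.

```python
def percentile(sorted_vals, q):
    # Linear interpolation between closest ranks; q is in [0, 1].
    k = (len(sorted_vals) - 1) * q
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)

def winsorize_98(values):
    """Clamp values below the 1st percentile up to it and values
    above the 99th percentile down to it; every data point is kept."""
    s = sorted(values)
    low, high = percentile(s, 0.01), percentile(s, 0.99)
    return [min(max(v, low), high) for v in values]
```

Applied to the integers 0 through 100, for instance, only the extremes 0 and 100 are pulled in to 1 and 99, and the sample size is unchanged.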
<p>For response data, we first show score distributions among students and summarize how many students appropriately applied the CVS in each test, and then show the most frequent (top 10) submitted answers.</p>
<p>For process data, we conduct the Kruskal&#x2013;Wallis test to compare the duration features across score groups. If this test reports a significant <italic>p</italic>-value, we adopt another non-parametric test, the Wilcoxon signed-rank test, on pairs of score groups to clarify which pair(s) of score groups differ in these features. These two tests, implemented using the kruskal.test and wilcox.test functions in the <italic>stats</italic> package in R 3.6.1 (<xref ref-type="bibr" rid="ref41">R Core Team, 2019</xref>), provide quantitative evidence on the relation between item scores and process features. Since three Kruskal&#x2013;Wallis tests are run on the three measures, respectively, the critical <italic>p</italic>-value for identifying significance is set to 0.05/3&#x2009;&#x2248;&#x2009;0.0167.</p>
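The omnibus step can be illustrated with a hand-rolled H statistic. This sketch assumes no tied values; R&#x2019;s kruskal.test additionally applies a tie correction and converts H to a chi-square <italic>p</italic>-value.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic over k groups (no tie correction).
    Under the null hypothesis, H is approximately chi-square
    distributed with k - 1 degrees of freedom."""
    pooled = sorted((v, g) for g, vals in enumerate(groups) for v in vals)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    for rank, (_, g) in enumerate(pooled, start=1):
        rank_sums[g] += rank  # sum of pooled ranks per group
    return 12.0 / (n * (n + 1)) * sum(
        r * r / len(groups[g]) for g, r in enumerate(rank_sums)
    ) - 3 * (n + 1)

# With three features tested, the Bonferroni-adjusted threshold is:
ALPHA = 0.05 / 3  # ~0.0167
```

Three fully separated groups such as [1, 2, 3], [4, 5, 6], and [7, 8, 9] give a large H, while groups with identical rank sums give H = 0.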
<p>To cross-validate the results of the Kruskal&#x2013;Wallis and Wilcoxon signed-rank tests, we also conduct an omnibus ANOVA and, if it reports a significant <italic>p</italic>-value, pairwise <italic>t</italic>-tests between score groups. The log-transformed (base <italic>e</italic>) features pass the normality test (we use the Shapiro&#x2013;Wilk test, and the <italic>p</italic>-values are all above 0.05, indicating that the distributions of the log-transformed data are not significantly distinct from a normal distribution). The ANOVA results are shown in the <xref ref-type="supplementary-material" rid="SM1">Supplementary materials</xref>.</p>
</sec>
</sec>
<sec id="sec8" sec-type="results">
<label>4.</label>
<title>Results</title>
<sec id="sec9">
<label>4.1.</label>
<title>The fair tests</title>
<p>The two fair tests show similar trends in score distribution and top 10 frequent submitted answers.</p>
<p>In the fair test 1, 41.4% of the students received the lowest score (1), 29.1% received a partial score (2), and only 29.5% properly applied the CVS and got the full score (3). In other words, the majority (over 70%) of the students failed to properly apply the CVS in this test. <xref rid="fig6" ref-type="fig">Figure 6A</xref> illustrates the top 10 frequently submitted answers in this test. &#x201C;Low; Low; Low&#x201D; was the most frequent correct answer, and other correct ones (e.g., &#x201C;Medium; Medium; Medium&#x201D; and &#x201C;High; High; High&#x201D;) were less frequent; &#x201C;Low; Medium; High,&#x201D; an answer with totally varied tree positions, was the most common incorrect answer, and its variants (e.g., &#x201C;High; Medium; Low&#x201D; or &#x201C;Low; High; Medium&#x201D;) were also common, all receiving the lowest score (1); and the answers having a partial score (2) (e.g., &#x201C;Medium; Low; Medium&#x201D;) were less frequent.</p>
<fig position="float" id="fig6">
<label>Figure 6</label>
<caption>
<p>Top 10 frequent answers of the fair test 1 <bold>(A)</bold> and those of the fair test 2 <bold>(B)</bold>. Values on top of the bars are numbers of students, and those inside brackets are proportions.</p>
</caption>
<graphic xlink:href="fpsyg-14-1131019-g006.tif"/>
</fig>
<p>In the fair test 2, only 28.8% of the students properly applied the CVS and got the full score (3), and most students had either the lowest score (1) (33.3%) or the partial score (2) (37.9%). <xref rid="fig6" ref-type="fig">Figure 6B</xref> shows that &#x201C;1,4,7&#x201D; was the most frequent correct answer, as was &#x201C;3,6,9,&#x201D; but other variants (e.g., &#x201C;2,5,8&#x201D; and &#x201C;9,6,3&#x201D;) were less frequent. &#x201C;1,5,9,&#x201D; an answer with totally varied ingredient amounts, was the most frequent incorrect answer. Others (e.g., &#x201C;1,2,3,&#x201D; &#x201C;7,8,9&#x201D; or &#x201C;9,8,7&#x201D;) that kept ingredient type consistent but varied ingredient amount were also frequent. Students who submitted these answers applied the CVS to the wrong variable. Other answers (e.g., &#x201C;1,2,4&#x201D; or &#x201C;1,6,9&#x201D;) that only partially controlled the target variable of ingredient amount could not get the full score.</p>
<p><xref rid="tab4" ref-type="table">Table 4</xref> shows the means and standard errors of the process features across score groups. As for the fair test 1, the Kruskal&#x2013;Wallis tests report significant differences of these features across score groups (<italic>PT</italic>, <italic>&#x03C7;</italic><sup>2</sup>&#x2009;=&#x2009;12.2, df&#x2009;=&#x2009;2, <italic>p</italic>&#x2009;&#x003C;&#x2009;0.005; <italic>ET</italic>, <italic>&#x03C7;</italic><sup>2</sup>&#x2009;=&#x2009;89.916, df&#x2009;=&#x2009;2, <italic>p</italic>&#x2009;&#x003C;&#x2009;0.001; <italic>MET</italic>, <italic>&#x03C7;</italic><sup>2</sup>&#x2009;=&#x2009;64.776, df&#x2009;=&#x2009;2, <italic>p</italic>&#x2009;&#x003C;&#x2009;0.001). The omnibus ANOVA tests show similar results [<italic>PT</italic>, <italic>F</italic>(2,1604)&#x2009;=&#x2009;5.943, <italic>p</italic>&#x2009;&#x003C;&#x2009;0.005; <italic>ET</italic>, <italic>F</italic>(2,1604)&#x2009;=&#x2009;51.7, <italic>p</italic>&#x2009;&#x003C;&#x2009;0.001; <italic>MET</italic>, <italic>F</italic>(2,1604)&#x2009;=&#x2009;38.93, <italic>p</italic>&#x2009;&#x003C;&#x2009;0.001].</p>
<table-wrap position="float" id="tab4">
<label>Table 4</label>
<caption>
<p>Means and standard errors of <italic>PT</italic>, <italic>ET</italic>, and <italic>MET</italic> in each score group of the two fair tests.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="top" rowspan="2">Score</th>
<th align="center" valign="top" colspan="3">Fair test 1</th>
<th align="center" valign="top" colspan="3">Fair test 2</th>
</tr>
<tr>
<th align="center" valign="top">PT</th>
<th align="center" valign="top">ET</th>
<th align="center" valign="top">MET</th>
<th align="center" valign="top">PT</th>
<th align="center" valign="top">ET</th>
<th align="center" valign="top">MET</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="middle">1</td>
<td align="center" valign="middle">85.571 (1.166)</td>
<td align="center" valign="middle">41.330 (1.098)</td>
<td align="center" valign="middle">5.125 (0.091)</td>
<td align="center" valign="middle">22.247 (1.007)</td>
<td align="center" valign="middle">53.043 (1.650)</td>
<td align="center" valign="middle">6.178 (0.170)</td>
</tr>
<tr>
<td align="left" valign="middle">2</td>
<td align="center" valign="middle">85.154 (1.407)</td>
<td align="center" valign="middle">38.807 (1.216)</td>
<td align="center" valign="middle">4.958 (0.103)</td>
<td align="center" valign="middle">22.219 (0.832)</td>
<td align="center" valign="middle">41.677 (1.265)</td>
<td align="center" valign="middle">5.777 (0.134)</td>
</tr>
<tr>
<td align="left" valign="middle">3</td>
<td align="center" valign="middle">79.745 (1.172)</td>
<td align="center" valign="middle">29.082 (1.081)</td>
<td align="center" valign="middle">4.138 (0.090)</td>
<td align="center" valign="middle">21.959 (0.874)</td>
<td align="center" valign="middle">37.108 (1.387)</td>
<td align="center" valign="middle">5.549 (0.157)</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Values (in seconds) outside brackets are means and those inside brackets are standard errors.</p>
</table-wrap-foot>
</table-wrap>
<p>As for the fair test 2, the Kruskal&#x2013;Wallis tests report marginally significant differences in <italic>PT</italic> (<italic>&#x03C7;</italic><sup>2</sup>&#x2009;=&#x2009;7.824, df&#x2009;=&#x2009;2, <italic>p</italic>&#x2009;=&#x2009;0.02) and <italic>MET</italic> (<italic>&#x03C7;</italic><sup>2</sup>&#x2009;=&#x2009;6.600, df&#x2009;=&#x2009;2, <italic>p</italic>&#x2009;=&#x2009;0.037) and significant differences in <italic>ET</italic> (<italic>&#x03C7;</italic><sup>2</sup>&#x2009;=&#x2009;78.111, df&#x2009;=&#x2009;2, <italic>p</italic>&#x2009;&#x003C;&#x2009;0.001) between score groups. The omnibus ANOVA tests show non-significant results for <italic>PT</italic> [<italic>F</italic>(2,1987)&#x2009;=&#x2009;2.744, <italic>p</italic>&#x2009;=&#x2009;0.065], but significant and marginally significant results for <italic>ET</italic> [<italic>F</italic>(2,1987)&#x2009;=&#x2009;37.53, <italic>p</italic>&#x2009;&#x003C;&#x2009;0.001] and <italic>MET</italic> [<italic>F</italic>(2,1987)&#x2009;=&#x2009;3.451, <italic>p</italic>&#x2009;=&#x2009;0.032].</p>
<p><xref rid="tab5" ref-type="table">Table 5</xref> shows the Wilcoxon signed-rank test results. In both fair tests, the students with higher scores had shorter <italic>ET</italic> than those with lower scores; and the full score students had shorter <italic>PT</italic> and <italic>MET</italic> than the lowest score students, but such differences were not statistically significant when the partial score group was involved.</p>
<table-wrap position="float" id="tab5">
<label>Table 5</label>
<caption>
<p>Wilcoxon signed-rank test results for pairwise comparisons of score groups in the two fair tests.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th rowspan="2"/>
<th align="center" valign="top" colspan="3">Fair test 1</th>
<th align="center" valign="top" colspan="3">Fair test 2</th>
</tr>
<tr>
<th align="center" valign="top">PT</th>
<th align="center" valign="top">ET</th>
<th align="center" valign="top">MET</th>
<th align="center" valign="top">PT</th>
<th align="center" valign="top">ET</th>
<th align="center" valign="top">MET</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="middle">1v2</td>
<td align="center" valign="middle">158,942 (0.527)</td>
<td align="center" valign="middle">
<bold>163,023 (0.016)</bold>
</td>
<td align="center" valign="middle">158,766 (0.548)</td>
<td align="center" valign="top">229,631 (0.025)</td>
<td align="center" valign="top">
<bold>289,097 (0.001)</bold>
</td>
<td align="center" valign="top">252,368.5 (0.458)</td>
</tr>
<tr>
<td align="left" valign="middle">1v3</td>
<td align="center" valign="middle">
<bold>176,639 (&#x003C;0.001)</bold>
</td>
<td align="center" valign="middle">
<bold>2,038,350.5 (&#x003C;0.001)</bold>
</td>
<td align="center" valign="middle">
<bold>199,945.5 (&#x003C;0.001)</bold>
</td>
<td align="center" valign="top">
<bold>177,693 (0.011)</bold>
</td>
<td align="center" valign="top">
<bold>247,740 (&#x003C;0.001)</bold>
</td>
<td align="center" valign="top">
<bold>209,433.5 (0.014)</bold>
</td>
</tr>
<tr>
<td align="left" valign="middle">2v3</td>
<td align="center" valign="middle">
<bold>120,966.5 (0.014)</bold>
</td>
<td align="center" valign="middle">
<bold>139,637.5 (&#x003C; 0.001)</bold>
</td>
<td align="center" valign="middle">
<bold>136,592.5 (&#x003C; 0.001)</bold>
</td>
<td align="center" valign="top">212,462 (0.580)</td>
<td align="center" valign="top">
<bold>244,581.5 (&#x003C;0.001)</bold>
</td>
<td align="center" valign="top">229,515 (0.056)</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>&#x201C;1&#x201D; to &#x201C;3&#x201D; refer to score groups. Values outside brackets are test statistics, and those inside are <italic>p</italic>-values. Significant (having <italic>p</italic>-values &#x003C;0.0167) results are marked in bold. <xref ref-type="supplementary-material" rid="SM1">Supplementary Table S1</xref> shows the omnibus ANOVA and pairwise <italic>t</italic>-test results.</p>
</table-wrap-foot>
</table-wrap>
</sec>
<sec id="sec10">
<label>4.2.</label>
<title>The exhaustive tests</title>
<p>The two exhaustive tests show similar results.</p>
<p>In the exhaustive test 1, 25.2% of the students received the lowest score (1), 33.9% properly applied the CVS and received the full score (4), and the rest got the partially high (3) (34.1%) or low (2) (6.8%) scores. In other words, the majority (over 65%) of the students failed to properly apply the CVS. Among the top 10 frequent answers (see <xref rid="fig7" ref-type="fig">Figure 7</xref>), &#x201C;A; B; A&#x2009;+&#x2009;B; None&#x201D; and its variants &#x201C;A; A&#x2009;+&#x2009;B; B; None&#x201D; and &#x201C;A&#x2009;+&#x2009;B; A; B; None&#x201D; received the full score, but they were less frequent than &#x201C;A&#x2009;+&#x2009;B,&#x201D; &#x201C;B,&#x201D; &#x201C;A,&#x201D; and &#x201C;None,&#x201D; which were the most frequent incorrect answers with the lowest score. Answers having partially high (e.g., &#x201C;A; A&#x2009;+&#x2009;B; None&#x201D;) or low (e.g., &#x201C;A; A&#x2009;+&#x2009;B&#x201D;) scores were less frequent.</p>
<fig position="float" id="fig7">
<label>Figure 7</label>
<caption>
<p>Top 10 frequent answers of the exhaustive test 1. Values on top of bars are numbers of students, and those inside brackets are percentages of students.</p>
</caption>
<graphic xlink:href="fpsyg-14-1131019-g007.tif"/>
</fig>
<p>In the exhaustive test 2, many students received the lowest score (1) (26.8%) or the partially low score (2) (60.6%); only 5.8% properly applied the CVS and got the full score (4), and 6% got the partially high score (3). Due to the large number of distinct submitted answers and the equivalence among many of them, we discuss the frequent answers based on <xref rid="fig5" ref-type="fig">Figure 5</xref> and the scoring rubric in <xref rid="tab3" ref-type="table">Table 3</xref>. Most students got the partially low score (2): their submitted answers did not ensure that at least one location in each of the 13 regions was chosen; instead, they chose two or more locations in the three regions adjacent to City A or City B, indicating that they failed to figure out the two requirements (see Section 3.2) of this test.</p>
<p><xref rid="tab6" ref-type="table">Table 6</xref> shows the means and standard errors of the process features across score groups. In the exhaustive test 1, the Kruskal&#x2013;Wallis tests report significant feature differences across score groups (<italic>PT</italic>, <italic>&#x03C7;</italic><sup>2</sup>&#x2009;=&#x2009;133.57, df&#x2009;=&#x2009;3, <italic>p</italic>&#x2009;&#x003C;&#x2009;0.001; <italic>ET</italic>, <italic>&#x03C7;</italic><sup>2</sup>&#x2009;=&#x2009;498.49, df&#x2009;=&#x2009;3, <italic>p</italic>&#x2009;&#x003C;&#x2009;0.001; <italic>MET</italic>, <italic>&#x03C7;</italic><sup>2</sup>&#x2009;=&#x2009;258.97, df&#x2009;=&#x2009;3, <italic>p</italic>&#x2009;&#x003C;&#x2009;0.001). The omnibus ANOVA tests show similar results [<italic>PT</italic>, <italic>F</italic>(3,2721)&#x2009;=&#x2009;65.4, <italic>p</italic>&#x2009;&#x003C;&#x2009;0.001; <italic>ET</italic>, <italic>F</italic>(3,2721)&#x2009;=&#x2009;224.5, <italic>p</italic>&#x2009;&#x003C;&#x2009;0.001; <italic>MET</italic>, <italic>F</italic>(3,2721)&#x2009;=&#x2009;78.4, <italic>p</italic>&#x2009;&#x003C;&#x2009;0.001].</p>
<table-wrap position="float" id="tab6">
<label>Table 6</label>
<caption>
<p>Means and standard errors of <italic>PT</italic>, <italic>ET</italic>, and <italic>MET</italic> across the score groups of the exhaustive tests 1 and 2.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="top" rowspan="2">Score</th>
<th align="center" valign="top" colspan="3">Exhaustive test 1</th>
<th align="center" valign="top" colspan="3">Exhaustive test 2</th>
</tr>
<tr>
<th align="center" valign="top">PT</th>
<th align="center" valign="top">ET</th>
<th align="center" valign="top">MET</th>
<th align="center" valign="top">PT</th>
<th align="center" valign="top">ET</th>
<th align="center" valign="top">MET</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="middle">1</td>
<td align="center" valign="middle">9.056 (0.325)</td>
<td align="center" valign="middle">24.949 (0.922)</td>
<td align="center" valign="middle">5.502 (0.144)</td>
<td align="center" valign="middle">15.994 (0.388)</td>
<td align="center" valign="middle">29.560 (0.670)</td>
<td align="center" valign="middle">2.324 (0.043)</td>
</tr>
<tr>
<td align="left" valign="middle">2</td>
<td align="center" valign="middle">6.797 (0.439)</td>
<td align="center" valign="middle">41.623 (1.804)</td>
<td align="center" valign="middle">3.520 (0.113)</td>
<td align="center" valign="middle">14.967 (0.231)</td>
<td align="center" valign="middle">35.772 (0.473)</td>
<td align="center" valign="middle">1.969 (0.022)</td>
</tr>
<tr>
<td align="left" valign="middle">3</td>
<td align="center" valign="middle">7.105 (0.207)</td>
<td align="center" valign="middle">31.700 (0.715)</td>
<td align="center" valign="middle">3.899 (0.070)</td>
<td align="center" valign="middle">15.821 (0.705)</td>
<td align="center" valign="middle">50.940 (1.624)</td>
<td align="center" valign="middle">2.705 (0.076)</td>
</tr>
<tr>
<td align="left" valign="middle">4</td>
<td align="center" valign="middle">5.714 (0.172)</td>
<td align="center" valign="middle">42.523 (0.763)</td>
<td align="center" valign="middle">3.140 (0.051)</td>
<td align="center" valign="middle">16.641 (0.755)</td>
<td align="center" valign="middle">71.503 (2.053)</td>
<td align="center" valign="middle">2.423 (0.066)</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Values (in seconds) outside brackets are means and those inside are standard errors.</p>
</table-wrap-foot>
</table-wrap>
<p>In the exhaustive test 2, the Kruskal&#x2013;Wallis tests report a marginally significant difference in <italic>PT</italic> (<italic>&#x03C7;</italic><sup>2</sup>&#x2009;=&#x2009;10.317, df&#x2009;=&#x2009;3, <italic>p</italic>&#x2009;&#x003C;&#x2009;0.017), and significant differences in <italic>ET</italic> (<italic>&#x03C7;</italic><sup>2</sup>&#x2009;=&#x2009;440.33, df&#x2009;=&#x2009;3, <italic>p</italic>&#x2009;&#x003C;&#x2009;0.001) and <italic>MET</italic> (<italic>&#x03C7;</italic><sup>2</sup>&#x2009;=&#x2009;158.79, df&#x2009;=&#x2009;3, <italic>p</italic>&#x2009;&#x003C;&#x2009;0.001). The omnibus ANOVA tests also show (marginally) significant differences in <italic>PT</italic> [<italic>F</italic>(3,2942)&#x2009;=&#x2009;3.094, <italic>p</italic>&#x2009;=&#x2009;0.026], <italic>ET</italic> [<italic>F</italic>(3,2942)&#x2009;=&#x2009;185.7, <italic>p</italic>&#x2009;&#x003C;&#x2009;0.001], and <italic>MET</italic> [<italic>F</italic>(3,2942)&#x2009;=&#x2009;45.58, <italic>p</italic>&#x2009;&#x003C;&#x2009;0.001].</p>
<p><xref rid="tab7" ref-type="table">Table 7</xref> shows the Wilcoxon signed-rank test results. As in the fair tests, the full score students had shorter <italic>PT</italic> and <italic>MET</italic> than the lower score students; but unlike the fair tests, the full score students had longer <italic>ET</italic> than students in most of the other score groups. These patterns were not always consistent when the partial score groups were involved.</p>
<table-wrap position="float" id="tab7">
<label>Table 7</label>
<caption>
<p>Wilcoxon signed-rank test results for pairwise comparisons of score groups in the two exhaustive tests.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th/>
<th align="center" valign="top" colspan="3">Exhaustive test 1</th>
<th align="center" valign="top" colspan="3">Exhaustive test 2</th>
</tr>
<tr>
<th/>
<th align="center" valign="top">PT</th>
<th align="center" valign="top">ET</th>
<th align="center" valign="top">MET</th>
<th align="center" valign="top">PT</th>
<th align="center" valign="top">ET</th>
<th align="center" valign="top">MET</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="middle">1v2</td>
<td align="center" valign="middle"><bold>75,018.0 (&#x003C;0.001)</bold></td>
<td align="center" valign="middle"><bold>29,065 (&#x003C;0.001)</bold></td>
<td align="center" valign="middle"><bold>83,948 (&#x003C;0.001)</bold></td>
<td align="center" valign="middle">743,358.5 (0.034)</td>
<td align="center" valign="middle"><bold>547,881.5 (&#x003C;0.001)</bold></td>
<td align="center" valign="middle"><bold>834,208 (&#x003C;0.001)</bold></td>
</tr>
<tr>
<td align="left" valign="middle">1v3</td>
<td align="center" valign="middle"><bold>372,475.5 (&#x003C;0.001)</bold></td>
<td align="center" valign="middle"><bold>215,673.5 (&#x003C;0.001)</bold></td>
<td align="center" valign="middle"><bold>400,288.5 (&#x003C;0.001)</bold></td>
<td align="center" valign="middle">77,802 (0.975)</td>
<td align="center" valign="middle"><bold>31,302 (&#x003C;0.001)</bold></td>
<td align="center" valign="middle"><bold>60,441.5 (&#x003C;0.001)</bold></td>
</tr>
<tr>
<td align="left" valign="middle">1v4</td>
<td align="center" valign="middle"><bold>422,941.5 (&#x003C;0.001)</bold></td>
<td align="center" valign="middle"><bold>128,978.5 (&#x003C;0.001)</bold></td>
<td align="center" valign="middle"><bold>458,813.5 (&#x003C;0.001)</bold></td>
<td align="center" valign="middle">63,511.5 (0.172)</td>
<td align="center" valign="middle"><bold>12,317 (&#x003C;0.001)</bold></td>
<td align="center" valign="middle"><bold>60,717 (&#x003C;0.001)</bold></td>
</tr>
<tr>
<td align="left" valign="middle">2v3</td>
<td align="center" valign="middle">84,443.5 (0.693)</td>
<td align="center" valign="middle"><bold>111,656 (&#x003C;0.001)</bold></td>
<td align="center" valign="middle">80,219.5 (0.147)</td>
<td align="center" valign="middle">166,080.5 (0.197)</td>
<td align="center" valign="middle"><bold>100,023 (&#x003C;0.001)</bold></td>
<td align="center" valign="middle"><bold>99,775 (&#x003C;0.001)</bold></td>
</tr>
<tr>
<td align="left" valign="middle">2v4</td>
<td align="center" valign="middle"><bold>98,433.5 (&#x003C;0.005)</bold></td>
<td align="center" valign="middle">81,207 (0.284)</td>
<td align="center" valign="middle"><bold>100,851 (&#x003C;0.001)</bold></td>
<td align="center" valign="middle"><bold>135,210 (0.009)</bold></td>
<td align="center" valign="middle"><bold>41,858 (&#x003C;0.001)</bold></td>
<td align="center" valign="middle"><bold>102,545 (&#x003C;0.001)</bold></td>
</tr>
<tr>
<td align="left" valign="middle">3v4</td>
<td align="center" valign="middle"><bold>501,936.5 (&#x003C;0.001)</bold></td>
<td align="center" valign="middle"><bold>274,362.5 (&#x003C;0.001)</bold></td>
<td align="center" valign="middle"><bold>531,023.5 (&#x003C;0.001)</bold></td>
<td align="center" valign="middle">15,784.5 (0.258)</td>
<td align="center" valign="middle"><bold>9,378 (&#x003C;0.001)</bold></td>
<td align="center" valign="middle"><bold>19,416.5 (0.016)</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>&#x201C;1&#x201D; to &#x201C;4&#x201D; refer to score groups. Values outside parentheses are test statistics; those inside are <italic>p</italic>-values. Significant results (<italic>p</italic> &#x003C; 0.0167) are marked in bold. <xref ref-type="supplementary-material" rid="SM1">Supplementary Table S2</xref> shows the omnibus ANOVA and pairwise <italic>t</italic>-test results.</p>
</table-wrap-foot>
</table-wrap>
</sec>
</sec>
<sec id="sec11" sec-type="discussions">
<label>5.</label>
<title>Discussion</title>
<sec id="sec12">
<label>5.1.</label>
<title>Problem solving processes of high- and low-performing students</title>
<p>This study examined two fair tests and two exhaustive tests from the NAEP scientific inquiry tasks, which require students to apply the control-of-variables strategy to design controlled experiments. We proposed three process features that reflect the major stages of problem solving and used them to investigate the performance of students with varying levels of problem-solving competency. In both types of tests, high- and low-performing students exhibited distinct response and process patterns.</p>
<p>In terms of response, more than 70% of the fourth- and eighth-graders failed to properly apply the control-of-variables strategy in the fair tests, and over 80% of the twelfth-graders failed to do so in the exhaustive tests. These results are consistent with the previous literature (<xref ref-type="bibr" rid="ref5">Chen and Klahr, 1999</xref>).</p>
<p>In the fair test 1, the most common inappropriate strategy was to vary the tree position on the mountain, e.g., &#x201C;Low; Medium; High&#x201D; (and its variations) (see <xref rid="fig6" ref-type="fig">Figure 6A</xref>). In the fair test 2, the most common inappropriate strategy was to vary the ingredient amount, e.g., &#x201C;1,5,9&#x201D; (and its variations) (see <xref rid="fig6" ref-type="fig">Figure 6B</xref>). These results are in line with early observations in response data (e.g., <xref ref-type="bibr" rid="ref44">Shimoda et al., 2002</xref>): students adopting inappropriate strategies failed to recognize that variation in this extraneous variable actually interfered with the effect of the target variable.</p>
<p>In the exhaustive test 1, the common wrong strategy was to save (and submit) only one of the four possible conditions. In the exhaustive test 2, the common wrong strategy was to select locations mainly in the regions adjacent to a city while ignoring those in between. These inappropriate strategies reveal that the low-performing students in these tests failed to conceive an exhaustive set of test data for the controlled experiments, probably due to a lack of intention or of the required skills; as a consequence, they simply submitted a subset of test data or guessed answers. These results are in line with early studies (<xref ref-type="bibr" rid="ref46">Tschirgi, 1980</xref>).</p>
<p>In terms of process, consistent patterns are evident in the process features. As for preparation time, in the fair tests, compared to students with the lowest score, those with a full score tended to spend less preparation time before making their first answer-related action. The longer preparation time of students with the lowest score indicates that they needed more time to understand the test and plan their activities, whereas high-performing students could do so efficiently. This difference at the planning stage reveals that whether a student can properly solve a problem depends on whether he/she efficiently grasps the instructions and plans the activities <italic>before</italic> any action is taken. Although apparently contradicting the intuition that longer planning leads to better outcomes, our finding is supported by results from other time-constrained tasks; e.g., high-performing students in a time-constrained writing test showed a shorter pre-writing pause (the duration between the moment a student entered the item and the moment he/she made the first typing event), indicating efficient task planning (<xref ref-type="bibr" rid="ref49">Zhang et al., 2017</xref>).</p>
<p>Patterns in preparation time between the high- and low-performing students were not consistent in the exhaustive tests. In the exhaustive test 1, students with the full score spent less preparation time than those with the lowest score, but in the exhaustive test 2, this pattern disappeared. The number of exhaustive combinations in the exhaustive test 1 (4) is much smaller than that in the exhaustive test 2 (15). Therefore, in the exhaustive test 2, both students with lower scores and those with the full score might have been unable to foresee all required combinations at the planning stage, so they simply started making selections right away and thought along with the process of answer formation. This led to the non-significant difference in preparation time between the high- and low-performing students in this test.</p>
<p>As for execution time and mean execution time, in the two fair tests, most students with the lowest score spent more execution time conducting the drag-and-drop actions than those with higher scores (see <xref rid="tab4" ref-type="table">Tables 4</xref>, <xref rid="tab5" ref-type="table">5</xref>). In these tests, the minimum number of actions required to construct an answer was just 3: drag and drop each type of tree (or three piles of different ingredients) respectively onto the same (or different) positions of the three mountains (or the three slots). Two situations caused longer execution time in students with the lowest scores: they either spent more time executing individual actions or kept revising their choices,<xref rid="fn0009" ref-type="fn">
<sup>7</sup></xref> both reflecting hesitation or uncertainty during the action execution stage of problem solving. The process feature of mean execution time (see <xref rid="tab4" ref-type="table">Tables 4</xref>, <xref rid="tab5" ref-type="table">5</xref>) explicitly reveals that, on average, students with the lowest score spent more time conducting each of their answer-related actions, i.e., they were less efficient in action execution than those with the full score.</p>
<p>Unlike in the fair tests, in the exhaustive tests, most students with the lowest score showed shorter execution time in answer formulation than those with higher scores (see <xref rid="tab6" ref-type="table">Tables 6</xref>, <xref rid="tab7" ref-type="table">7</xref>). According to <xref rid="tab3" ref-type="table">Table 3</xref>, low scores in these tests correspond to incomplete submissions. The longer execution time of most full-score students suggests that they were well motivated and endeavored to construct and save all possible conditions, even at the cost of spending more time in total. By contrast, the shorter execution times of the lower-score students mostly arose from two cases: (1) they did not spend much time exploring the conditions and finished the test with hastily produced results, reflecting low motivation/engagement or a lack of reasonable understanding; (2) without realizing that they needed to submit all possible conditions, some students left the test after submitting just one condition (consistent with the frequent wrong answers).</p>
<p>As for mean execution time, in the exhaustive test 1, though spending more time in problem solving, most students with the full score showed shorter mean execution time than those with lower scores (see <xref rid="tab5" ref-type="table">Tables 5</xref>, <xref rid="tab6" ref-type="table">6</xref>). This indicates that most high-performing students formulated their answers efficiently. In the exhaustive test 2, although spending more time selecting multiple locations for comparison, most high-performing students had a mean execution time smaller than or comparable to that of low-performing students, who submitted incomplete answers. To sum up, in both tests, high-performing students tended to be more efficient than low-performing ones in executing multiple answer-related actions.</p>
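The three process features discussed above can be extracted from timestamped action logs. The following is a minimal sketch, not the authors' implementation: the log format and the event names "enter" and "drag_drop" are assumptions, and mean execution time is operationalized here simply as execution time divided by the number of answer-related actions.

```python
def process_features(events):
    """Compute (preparation, execution, mean execution) time from an action log.

    events: time-sorted list of (timestamp_seconds, action_type) pairs, where
    "enter" marks item entry and "drag_drop" marks an answer-related action.
    Event names and the log format are hypothetical.
    """
    enter_time = next(t for t, a in events if a == "enter")
    answer_times = [t for t, a in events if a == "drag_drop"]
    preparation = answer_times[0] - enter_time        # time before first answer action
    execution = answer_times[-1] - answer_times[0]    # span of all answer actions
    mean_execution = execution / len(answer_times)    # average time per action
    return preparation, execution, mean_execution
```

Under this operationalization, a student who enters the item, pauses, and then acts in quick succession would show long preparation but short (mean) execution time, matching the full-score pattern described for the fair tests in reverse.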
</sec>
<sec id="sec13">
<label>5.2.</label>
<title>Process features and problem-solving competency</title>
<p>In all four tests, most students who properly applied the control-of-variables strategy (thus having high problem-solving competency) enacted more goal-oriented behaviors (<xref ref-type="bibr" rid="ref44">Shimoda et al., 2002</xref>). In the fair tests, they quickly grasped the goal at the planning stage and efficiently set up the conditions matching the fair-test requirement; in the exhaustive tests, with a clear goal in mind, they persistently constructed all the conditions for comparison over a longer execution time. By contrast, students with low problem-solving competency were confused about the target variable while formulating answers in the fair tests; in the exhaustive tests, they either ignored or did not fully understand the goal, and tended to drop out before submitting enough conditions.</p>
<p>The proposed process features of execution time and mean execution time reflect differences in goal orientation and motivation between students, which are crucial to problem solving (<xref ref-type="bibr" rid="ref10">Gardner, 2006</xref>; <xref ref-type="bibr" rid="ref7">D&#x00F6;rner and G&#x00FC;ss, 2013</xref>; <xref ref-type="bibr" rid="ref17">G&#x00FC;ss et al., 2017</xref>). The contrasting patterns of execution time between the two types of tests reveal the different characteristics of the solutions and execution stages therein: the fair tests require conditions matching the fair-test requirement, whereas the exhaustive tests require all possible conditions. They also reveal that task properties could influence how students deploy strategies that they seem to know, which echoes the knowledge-practice integration in NGSS.</p>
<p>The consistent patterns of mean execution time in high-performing students across the two types of tests indicate that both types of tests require similar control-of-variables strategies and high-performing students can efficiently apply such strategies in solving apparently-distinct problems. This suggests that the capabilities of doing analogical reasoning and employing key skills and related abilities across tasks of various contents are critical in scientific problem solving.</p>
<p>Most of the above discussion concerns the full and lowest score students, because the statistical tests report consistent results between these groups in each test. Inconsistent results exist between partial score groups, or between a partial score group and the full (or the lowest) score group. This inconsistency has several causes. First, some partial score groups contained fewer students than others. Second, as defined in the scoring rubrics, the response difference between the full (or the lowest) score and a partial score is smaller than that between the full and the lowest scores. Both of these factors reduced the statistical power of the analyses. Third, due to a lack of empirical bases (<xref ref-type="bibr" rid="ref38">Parshall and Brunner, 2017</xref>), the predefined rubrics might not clearly differentiate students with different levels of problem-solving competency. The reliability of scoring rubrics is worth further investigation, but it is beyond the scope of the current study.</p>
<p>These discussions of problem-solving process and competency, based on the process features of high- and low-performing students in different tests, provide useful insights into teaching and learning the control-of-variables strategy and related skills, as well as into applying them in similar scientific inquiry tasks. For example, comparing a specific student&#x2019;s performance with the typical patterns of high-performing students can reveal at which problem-solving stage the student needs to improve efficiency; comparing high- and low-performing students&#x2019; process patterns can also reveal which aspects the low-performing students need to improve, e.g., how to allocate time and effort across problem-solving stages in order to improve overall performance in scientific inquiry tasks.</p>
</sec>
<sec id="sec14">
<label>5.3.</label>
<title>Precision of process features</title>
<p>The temporal features of preparation time and execution time roughly estimate the process of action planning and that of action execution, respectively. In addition to individual differences, other factors may &#x201C;contaminate&#x201D; these features, especially in complex tasks requiring careful thinking and multiple answer formulation stages; e.g., students may change part of their answers during the problem solving process, and execution time may cover the time of answer change.</p>
<p>Answer change is part of action execution. In all four tests, most students conducted answer changes through drag-and-drop actions. For example, in the fair test 1, the minimum number of drag-and-drop actions for correctly answering the question is 3, but only 11% of the students conducted exactly 3 drag-and-drop actions, and more than 50% conducted 3 to 6 actions; in the exhaustive test 1, the minimum number of saved conditions for a correct answer is 4, but only 23% of the students saved exactly 4 cases, and more than 90% saved 4&#x2013;6 cases. In addition, answer-change actions are often intertwined with answer-formulation actions, indicating that the purpose of such actions is to correct execution errors and stick to the planned actions. In this sense, answer change is part of action execution, and its duration should be included in execution time.</p>
<p>However, students might occasionally clear all their answers and re-answer the question from scratch. In this case, they could spend some time re-planning their actions, but such time is embedded in the current definition of execution time. In the four tests, very few (&#x003C;1%) students went through such a re-planning and re-execution process, but in complex tasks, such cases may be abundant. To better handle such cases, we need to improve the precision of the process features by examining drag-and-drop action sequences and their time stamps to clearly identify whether a student re-planned. We leave such modification to future work.</p>
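The proposed refinement could be sketched as follows. This is a toy illustration under stated assumptions, not the authors' method: the log format, the event kinds "place"/"remove", and the pause threshold are all hypothetical; an episode is flagged when the workspace is emptied and a long pause precedes the next placement.

```python
def replanning_episodes(actions, pause_threshold=10.0):
    """Flag candidate re-planning episodes in a drag-and-drop action sequence.

    actions: time-sorted list of (timestamp_seconds, kind) pairs, with kind in
    {"place", "remove"} (hypothetical event names). Returns the timestamps of
    removals that emptied the workspace and were followed by a pause of at
    least pause_threshold before the next placement.
    """
    placed = 0
    episodes = []
    for i, (t, kind) in enumerate(actions):
        placed += 1 if kind == "place" else -1
        if kind == "remove" and placed == 0:
            # Time of the next placement after the workspace was emptied, if any.
            nxt = next((t2 for t2, k2 in actions[i + 1:] if k2 == "place"), None)
            if nxt is not None and nxt - t >= pause_threshold:
                episodes.append(t)
    return episodes
```

The pause after emptying the workspace would then be attributed to re-planning rather than execution, separating the two stages more precisely.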
</sec>
<sec id="sec15">
<label>5.4.</label>
<title>Procedure of process data use</title>
<p>In addition to the process features and insights on scientific problem solving, this study lays out a general procedure of using process data to study test-takers&#x2019; performance or competency:</p>
<p><italic>Discover or define process features that could (potentially) inform test-takers&#x2019; performance or competency</italic>. This step is often based on prior hypotheses or existing studies.</p>
<p><italic>Demonstrate correlation or relatedness between process features and test-takers&#x2019; performance</italic>. This step is critical in two aspects. First, it verifies whether the features are related to performance in the target dataset. Second, it bridges the first and third steps; only after relatedness or correlation between test-takers&#x2019; performance and the process features is validated would analyses on these features and derived understandings become meaningful.</p>
<p><italic>Understand or characterize test-takers&#x2019; performance, or incorporate process features into scoring rubrics, cognitive or measurement models</italic>. Understanding test-takers&#x2019; performance is based on defined or discovered features in the first step. In our study, the proposed features characterize high- and low-performing (or common vs. abnormal) test-takers. The observed consistent patterns of process features also pave the way for incorporating those features into scoring rubrics, e.g., specific values or ranges of values of process features correspond to various scales of scores. Moreover, the quantitative process features as in our study could serve as important components in cognitive or measurement models to predict, classify, or interpret test-takers&#x2019; performance.</p>
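The second step of this procedure can be illustrated with a group comparison such as the Mann-Whitney tests reported in the tables above. The sketch below is a toy example, not the authors' analysis code: it computes the U statistic by direct pair counting and a two-sided p-value via the normal approximation (without tie correction), and applies the Bonferroni-adjusted significance threshold of 0.0167 used in the tables. The feature values are hypothetical.

```python
import math

ALPHA = 0.05 / 3  # Bonferroni-adjusted threshold (~0.0167) used in the tables

def mann_whitney(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation.

    x, y: process-feature values (e.g., execution times) for two score groups.
    Returns (U, p). No tie correction; for illustration only.
    """
    n1, n2 = len(x), len(y)
    # U counts, over all cross-group pairs, how often x exceeds y (ties count 0.5).
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mu = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mu) / sd
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return u, p
```

A pairwise comparison between score groups would then be flagged as significant when p &#x003C; ALPHA, mirroring the bolded entries in the tables.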
</sec>
</sec>
<sec id="sec16" sec-type="conclusions">
<label>6.</label>
<title>Conclusion</title>
<p>This study proposes three process features and an analytical procedure for process data use. Based on four scientific inquiry tasks, we investigate how students apply the control-of-variables strategy in typical fair and exhaustive tests and how the process features characterize high- and low-performing students in these tasks. Although (meta)cognitive processes cannot be observed directly from process data, the proposed features have proven valuable in elucidating the planning and execution stages of problem solving, characterizing students&#x2019; performance patterns, and revealing the relatedness among capacities (the control-of-variables strategy), test properties (the fair and exhaustive tests), and performance (answers, scores, and answering process). Our study demonstrates that process data provide unique windows for interpreting students&#x2019; performance beyond scores, and that a combination of analytical procedures and process data helps infer students&#x2019; problem-solving strategies, fill gaps in early studies, and stimulate future work on process features reflecting problem-solving performance.</p>
</sec>
<sec id="sec17" sec-type="data-availability">
<title>Data availability statement</title>
<p>The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.</p>
</sec>
<sec id="sec18">
<title>Ethics statement</title>
<p>The studies involving human participants were reviewed and approved by NCES. Written informed consent to participate in this study was provided by the participants&#x2019; legal guardian/next of kin.</p>
</sec>
<sec id="sec19">
<title>Author contributions</title>
<p>TG and LS designed the study. TG collected the data and conducted the analysis and wrote the manuscript. TG, LS, YJ, and BA discussed the results. LS, YJ, and BA edited the manuscript. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec id="conf1" sec-type="COI-statement">
<title>Conflict of interest</title>
<p>TG was employed by the company Google.</p>
<p>The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec id="sec100" sec-type="disclaimer">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
</body>
<back>
<ack>
<p>We thank Madeleine Keehner, Kathleen Scalise, Christopher Agard, and Gary Feng from ETS for guidance on this work. Preliminary results of this paper were reported at the 13th International Conference on Educational Data Mining (EDM 2020).</p>
</ack>
<sec id="sec21" sec-type="supplementary-material">
<title>Supplementary material</title>
<p>The Supplementary material for this article can be found online at: <ext-link xlink:href="https://www.frontiersin.org/articles/10.3389/fpsyg.2023.1131019/full#supplementary-material" ext-link-type="uri">https://www.frontiersin.org/articles/10.3389/fpsyg.2023.1131019/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Table_1.DOC" id="SM1" mimetype="application/msword" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="ref1"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Arslan</surname> <given-names>B.</given-names></name> <name><surname>Jiang</surname> <given-names>Y.</given-names></name> <name><surname>Keehner</surname> <given-names>M.</given-names></name> <name><surname>Gong</surname> <given-names>T.</given-names></name> <name><surname>Katz</surname> <given-names>I. R.</given-names></name> <name><surname>Yan</surname> <given-names>F.</given-names></name></person-group> (<year>2020</year>). <article-title>The effect of drag-and-drop item features on test-taker performance and response strategies</article-title>. <source>Educ. Meas. Issues Pract.</source> <volume>39</volume>, <fpage>96</fpage>&#x2013;<lpage>106</lpage>. doi: <pub-id pub-id-type="doi">10.1111/emip.12326</pub-id></citation></ref>
<ref id="ref2"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bergner</surname> <given-names>Y.</given-names></name> <name><surname>von Davier</surname> <given-names>A. A.</given-names></name></person-group> (<year>2019</year>). <article-title>Process data in NAEP: past, present, and future</article-title>. <source>J. Educ. Behav. Stat.</source> <volume>44</volume>, <fpage>706</fpage>&#x2013;<lpage>732</lpage>. doi: <pub-id pub-id-type="doi">10.3102/1076998618784700</pub-id></citation></ref>
<ref id="ref3"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Black</surname> <given-names>R.</given-names></name></person-group> (<year>2007</year>). <source>Pragmatic software testing: Becoming an effective and efficient test professional</source>. <publisher-loc>New York</publisher-loc>: <publisher-name>Wiley</publisher-name>.</citation></ref>
<ref id="ref4"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bryant</surname> <given-names>W.</given-names></name></person-group> (<year>2017</year>). <article-title>Developing a strategy for using technology-enhanced items in large-scale standardized tests</article-title>. <source>Pract. Assess. Res. Eval.</source> <volume>22</volume>, <fpage>1</fpage>&#x2013;<lpage>10</lpage>. doi: <pub-id pub-id-type="doi">10.7275/70yb-dj34</pub-id></citation></ref>
<ref id="ref5"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>Z.</given-names></name> <name><surname>Klahr</surname> <given-names>D.</given-names></name></person-group> (<year>1999</year>). <article-title>All other things being equal: acquisition and transfer of the control-of-variables strategy</article-title>. <source>Child Dev.</source> <volume>70</volume>, <fpage>1098</fpage>&#x2013;<lpage>1120</lpage>. doi: <pub-id pub-id-type="doi">10.1111/1467-8624.00081</pub-id>, PMID: <pub-id pub-id-type="pmid">10546337</pub-id></citation></ref>
<ref id="ref6"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dixon</surname> <given-names>W. J.</given-names></name></person-group> (<year>1960</year>). <article-title>Simplified estimation from censored normal samples</article-title>. <source>Ann. Math. Stat.</source> <volume>31</volume>, <fpage>385</fpage>&#x2013;<lpage>391</lpage>. doi: <pub-id pub-id-type="doi">10.1214/aoms/1177705900</pub-id></citation></ref>
<ref id="ref7"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>D&#x00F6;rner</surname> <given-names>D.</given-names></name> <name><surname>G&#x00FC;ss</surname> <given-names>C. D.</given-names></name></person-group> (<year>2013</year>). <article-title>PSI: a computational architecture of cognition, motivation, and emotion</article-title>. <source>Rev. Gen. Psychol.</source> <volume>17</volume>, <fpage>297</fpage>&#x2013;<lpage>317</lpage>. doi: <pub-id pub-id-type="doi">10.1037/a0032947</pub-id></citation></ref>
<ref id="ref8"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dost&#x00E1;l</surname> <given-names>J.</given-names></name></person-group> (<year>2015</year>). <article-title>Theory of problem solving</article-title>. <source>Procedia Soc. Behav. Sci.</source> <volume>174</volume>, <fpage>2798</fpage>&#x2013;<lpage>2805</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.sbspro.2015.01.970</pub-id></citation></ref>
<ref id="ref9"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ebenezer</surname> <given-names>J.</given-names></name> <name><surname>Kaya</surname> <given-names>O. N.</given-names></name> <name><surname>Ebenezer</surname> <given-names>D. L.</given-names></name></person-group> (<year>2011</year>). <article-title>Engaging students in environmental research projects: perceptions of fluency with innovative technologies and levels of scientific inquiry abilities</article-title>. <source>J. Res. Sci. Teach.</source> <volume>48</volume>, <fpage>94</fpage>&#x2013;<lpage>116</lpage>. doi: <pub-id pub-id-type="doi">10.1002/tea.20387</pub-id></citation></ref>
<ref id="ref10"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gardner</surname> <given-names>E. A.</given-names></name></person-group> (<year>2006</year>). <article-title>Instruction in mastery goal orientation: developing problem solving and persistence for clinical settings</article-title>. <source>J. Nurs. Educ.</source> <volume>45</volume>, <fpage>343</fpage>&#x2013;<lpage>347</lpage>. doi: <pub-id pub-id-type="doi">10.3928/01484834-20060901-03</pub-id>, PMID: <pub-id pub-id-type="pmid">17002080</pub-id></citation></ref>
<ref id="ref11"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gobert</surname> <given-names>J. D.</given-names></name> <name><surname>Sao Pedro</surname> <given-names>M. A.</given-names></name> <name><surname>Baker</surname> <given-names>R. S. J. D.</given-names></name> <name><surname>Toto</surname> <given-names>E.</given-names></name> <name><surname>Montalvo</surname> <given-names>O.</given-names></name></person-group> (<year>2012</year>). <article-title>Leveraging educational data mining for real-time performance assessment of scientific inquiry skills within microworlds</article-title>. <source>J. Educ. Data Mining</source> <volume>4</volume>, <fpage>104</fpage>&#x2013;<lpage>143</lpage>. doi: <pub-id pub-id-type="doi">10.5281/zenodo.3554645</pub-id></citation></ref>
<ref id="ref12"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gong</surname> <given-names>T.</given-names></name> <name><surname>Jiang</surname> <given-names>Y.</given-names></name> <name><surname>Saldivia</surname> <given-names>L. E.</given-names></name> <name><surname>Agard</surname> <given-names>C.</given-names></name></person-group> (<year>2021</year>). <article-title>Using Sankey diagrams to visualize drag and drop action sequences in technology-enhanced items</article-title>. <source>Behav. Res. Methods</source> <volume>54</volume>, <fpage>117</fpage>&#x2013;<lpage>132</lpage>. doi: <pub-id pub-id-type="doi">10.3758/s13428-021-01615-4</pub-id>, PMID: <pub-id pub-id-type="pmid">34109559</pub-id></citation></ref>
<ref id="ref13"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gong</surname> <given-names>T.</given-names></name> <name><surname>Zhang</surname> <given-names>M.</given-names></name> <name><surname>Li</surname> <given-names>C.</given-names></name></person-group> (<year>2022</year>). <article-title>Association of keyboarding fluency and writing performance in online-delivered assessment</article-title>. <source>Assess. Writ.</source> <volume>51</volume>:<fpage>100575</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.asw.2021.100575</pub-id></citation></ref>
<ref id="ref14"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Griffiths</surname> <given-names>T. L.</given-names></name> <name><surname>Lieder</surname> <given-names>F.</given-names></name> <name><surname>Goodman</surname> <given-names>N. D.</given-names></name></person-group> (<year>2015</year>). <article-title>Rational use of cognitive resources: levels of analysis between the computational and the algorithmic</article-title>. <source>Top. Cogn. Sci.</source> <volume>7</volume>, <fpage>217</fpage>&#x2013;<lpage>229</lpage>. doi: <pub-id pub-id-type="doi">10.1111/tops.12142</pub-id>, PMID: <pub-id pub-id-type="pmid">25898807</pub-id></citation></ref>
<ref id="ref15"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Grindal</surname> <given-names>M.</given-names></name> <name><surname>Offutt</surname> <given-names>J.</given-names></name> <name><surname>Andler</surname> <given-names>S. F.</given-names></name></person-group> (<year>2005</year>). <article-title>Combination testing strategies: a survey</article-title>. <source>Soft. Testing, Verif. Reliab.</source> <volume>15</volume>, <fpage>167</fpage>&#x2013;<lpage>199</lpage>. doi: <pub-id pub-id-type="doi">10.1002/stvr.319</pub-id></citation></ref>
<ref id="ref16"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Guo</surname> <given-names>H.</given-names></name> <name><surname>Zhang</surname> <given-names>M.</given-names></name> <name><surname>Deane</surname> <given-names>P. D.</given-names></name> <name><surname>Bennett</surname> <given-names>R. E.</given-names></name></person-group> (<year>2019</year>). <article-title>Writing process differences in subgroups reflected in keystroke logs</article-title>. <source>J. Educ. Behav. Stat.</source> <volume>44</volume>, <fpage>571</fpage>&#x2013;<lpage>596</lpage>. doi: <pub-id pub-id-type="doi">10.3102/1076998619856590</pub-id></citation></ref>
<ref id="ref17"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>G&#x00FC;ss</surname> <given-names>C. D.</given-names></name> <name><surname>Burger</surname> <given-names>M. L.</given-names></name> <name><surname>D&#x00F6;rner</surname> <given-names>D.</given-names></name></person-group> (<year>2017</year>). <article-title>The role of motivation in complex problem solving</article-title>. <source>Front. Psychol.</source> <volume>8</volume>:<fpage>851</fpage>. doi: <pub-id pub-id-type="doi">10.3389/fpsyg.2017.00851</pub-id>, PMID: <pub-id pub-id-type="pmid">28588545</pub-id></citation></ref>
<ref id="ref18"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Han</surname> <given-names>Z.</given-names></name> <name><surname>He</surname> <given-names>Q.</given-names></name> <name><surname>von Davier</surname> <given-names>M.</given-names></name></person-group> (<year>2019</year>). <article-title>Predictive feature generation and selection using process data from PISA interactive problem-solving items: an application of random forests</article-title>. <source>Front. Psychol.</source> <volume>10</volume>:<fpage>2461</fpage>. doi: <pub-id pub-id-type="doi">10.3389/fpsyg.2019.02461</pub-id>, PMID: <pub-id pub-id-type="pmid">31824363</pub-id></citation></ref>
<ref id="ref19"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Hoyles</surname> <given-names>C.</given-names></name> <name><surname>Noss</surname> <given-names>R.</given-names></name></person-group> (<year>2003</year>). &#x201C;<article-title>What can digital technologies take from and bring to research in mathematics education?</article-title>&#x201D; in <source>Second international handbook of mathematics education</source>. eds. <person-group person-group-type="editor"><name><surname>Bishop</surname> <given-names>A.</given-names></name> <name><surname>Clements</surname> <given-names>M. A. K.</given-names></name> <name><surname>Keitel-Kreidt</surname> <given-names>C.</given-names></name> <name><surname>Kilpatrick</surname> <given-names>J.</given-names></name> <name><surname>Leung</surname> <given-names>F. K.-S.</given-names></name></person-group> (<publisher-loc>Dordrecht</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>323</fpage>&#x2013;<lpage>349</lpage>.</citation></ref>
<ref id="ref20"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Inhelder</surname> <given-names>B.</given-names></name> <name><surname>Piaget</surname> <given-names>J.</given-names></name></person-group> (<year>1958</year>). <source>The growth of logical thinking from childhood to adolescence: An essay on the construction of formal operational structures</source>. <publisher-loc>London</publisher-loc>: <publisher-name>Routledge and Kegan Paul.</publisher-name></citation></ref>
<ref id="ref21"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jiang</surname> <given-names>Y.</given-names></name> <name><surname>Gong</surname> <given-names>T.</given-names></name> <name><surname>Saldivia</surname> <given-names>L. E.</given-names></name> <name><surname>Cayton-Hodges</surname> <given-names>G.</given-names></name> <name><surname>Agard</surname> <given-names>C.</given-names></name></person-group> (<year>2021</year>). <article-title>Using process data to understand problem-solving strategies and processes in large-scale mathematics assessments</article-title>. <source>Large-Scale Assess. Educ.</source> <volume>9</volume>, <fpage>1</fpage>&#x2013;<lpage>31</lpage>. doi: <pub-id pub-id-type="doi">10.1186/s40536-021-00095-4</pub-id></citation></ref>
<ref id="ref22"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jonassen</surname> <given-names>D. H.</given-names></name></person-group> (<year>2000</year>). <article-title>Toward a design theory of problem solving</article-title>. <source>Educ. Technol. Res. Dev.</source> <volume>48</volume>, <fpage>63</fpage>&#x2013;<lpage>85</lpage>. doi: <pub-id pub-id-type="doi">10.1007/BF02300500</pub-id></citation></ref>
<ref id="ref23"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kim</surname> <given-names>M. C.</given-names></name> <name><surname>Hannafin</surname> <given-names>M. J.</given-names></name> <name><surname>Bryan</surname> <given-names>L. A.</given-names></name></person-group> (<year>2007</year>). <article-title>Technology-enhanced inquiry tools in science education: an emerging pedagogical framework for classroom practice</article-title>. <source>Sci. Educ.</source> <volume>91</volume>, <fpage>1010</fpage>&#x2013;<lpage>1030</lpage>. doi: <pub-id pub-id-type="doi">10.1002/sce.20219</pub-id></citation></ref>
<ref id="ref24"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Klahr</surname> <given-names>D.</given-names></name> <name><surname>Nigam</surname> <given-names>M.</given-names></name></person-group> (<year>2004</year>). <article-title>The equivalence of learning paths in early science instruction: effects of direct instruction and discovery learning</article-title>. <source>Psychol. Sci.</source> <volume>15</volume>, <fpage>661</fpage>&#x2013;<lpage>667</lpage>. doi: <pub-id pub-id-type="doi">10.1111/j.0956-7976.2004.00737.x</pub-id>, PMID: <pub-id pub-id-type="pmid">15447636</pub-id></citation></ref>
<ref id="ref25"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Koedinger</surname> <given-names>K. R.</given-names></name> <name><surname>Corbett</surname> <given-names>A.</given-names></name></person-group> (<year>2006</year>). &#x201C;<article-title>Cognitive tutors: technology bringing learning science to the classroom</article-title>&#x201D; in <source>The Cambridge handbook of the learning sciences</source>. ed. <person-group person-group-type="editor">
<name><surname>Sawyer</surname> <given-names>K.</given-names></name></person-group> (<publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>Cambridge University Press</publisher-name>), <fpage>61</fpage>&#x2013;<lpage>78</lpage>.</citation></ref>
<ref id="ref26"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kruskal</surname> <given-names>W. H.</given-names></name> <name><surname>Wallis</surname> <given-names>W. A.</given-names></name></person-group> (<year>1952</year>). <article-title>Use of ranks in one-criterion variance analysis</article-title>. <source>J. Am. Stat. Assoc.</source> <volume>47</volume>, <fpage>583</fpage>&#x2013;<lpage>621</lpage>. doi: <pub-id pub-id-type="doi">10.1080/01621459.1952.10483441</pub-id></citation></ref>
<ref id="ref27"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kuhn</surname> <given-names>D.</given-names></name></person-group> (<year>2007</year>). <article-title>Reasoning about multiple variables: control of variables is not the only challenge</article-title>. <source>Sci. Educ.</source> <volume>91</volume>, <fpage>710</fpage>&#x2013;<lpage>726</lpage>. doi: <pub-id pub-id-type="doi">10.1002/sce.20214</pub-id></citation></ref>
<ref id="ref28"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kuhn</surname> <given-names>D.</given-names></name> <name><surname>Dean</surname> <given-names>D.</given-names></name></person-group> (<year>2005</year>). <article-title>Is developing scientific thinking all about learning to control variables?</article-title> <source>Psychol. Sci.</source> <volume>16</volume>, <fpage>866</fpage>&#x2013;<lpage>870</lpage>. doi: <pub-id pub-id-type="doi">10.1111/j.1467-9280.2005.01628.x</pub-id></citation></ref>
<ref id="ref29"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lee</surname> <given-names>Y.</given-names></name> <name><surname>Jia</surname> <given-names>Y.</given-names></name></person-group> (<year>2014</year>). <article-title>Using response time to investigate students&#x2019; test-taking behaviors in a NAEP computer-based study</article-title>. <source>Large-Scale Assess. Educ.</source> <volume>2</volume>, <fpage>1</fpage>&#x2013;<lpage>24</lpage>. doi: <pub-id pub-id-type="doi">10.1186/s40536-014-0008-1</pub-id></citation></ref>
<ref id="ref30"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lesh</surname> <given-names>R.</given-names></name> <name><surname>Harel</surname> <given-names>G.</given-names></name></person-group> (<year>2011</year>). <article-title>Problem solving, modeling, and local conceptual development</article-title>. <source>Math. Think. Learn.</source> <volume>5</volume>, <fpage>157</fpage>&#x2013;<lpage>189</lpage>. doi: <pub-id pub-id-type="doi">10.1080/10986065.2003.9679998</pub-id></citation></ref>
<ref id="ref31"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lieder</surname> <given-names>F.</given-names></name> <name><surname>Griffiths</surname> <given-names>T. L.</given-names></name></person-group> (<year>2017</year>). <article-title>Strategy selection as rational metareasoning</article-title>. <source>Psychol. Rev.</source> <volume>124</volume>, <fpage>762</fpage>&#x2013;<lpage>794</lpage>. doi: <pub-id pub-id-type="doi">10.1037/rev0000075</pub-id>, PMID: <pub-id pub-id-type="pmid">29106268</pub-id></citation></ref>
<ref id="ref32"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>O. L.</given-names></name> <name><surname>Bridgeman</surname> <given-names>B.</given-names></name> <name><surname>Gu</surname> <given-names>L.</given-names></name> <name><surname>Xu</surname> <given-names>J.</given-names></name> <name><surname>Kong</surname> <given-names>N.</given-names></name></person-group> (<year>2015</year>). <article-title>Investigation of response changes in the GRE revised general test</article-title>. <source>Educ. Psychol. Meas.</source> <volume>75</volume>, <fpage>1002</fpage>&#x2013;<lpage>1020</lpage>. doi: <pub-id pub-id-type="doi">10.1177/0013164415573988</pub-id>, PMID: <pub-id pub-id-type="pmid">29795850</pub-id></citation></ref>
<ref id="ref34"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Montgomery</surname> <given-names>D. C.</given-names></name></person-group> (<year>2000</year>). <source>Design and analysis of experiments</source> (<edition>5th</edition>). <publisher-loc>Indianapolis, IN</publisher-loc>: <publisher-name>Wiley Text Books</publisher-name>.</citation></ref>
<ref id="ref35"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Moon</surname> <given-names>J. A.</given-names></name> <name><surname>Keehner</surname> <given-names>M.</given-names></name> <name><surname>Katz</surname> <given-names>I. R.</given-names></name></person-group> (<year>2018</year>). <article-title>Affordances of item formats and their effects on test-taker cognition under uncertainty</article-title>. <source>Educ. Meas. Issues Pract.</source> <volume>38</volume>, <fpage>54</fpage>&#x2013;<lpage>62</lpage>. doi: <pub-id pub-id-type="doi">10.1111/emip.12229</pub-id></citation></ref>
<ref id="ref36"><citation citation-type="other"><person-group person-group-type="author"><collab id="coll1">National Assessment Governing Board</collab></person-group>. (<year>2015</year>). Science framework for the 2015 national assessment of educational progress. Washington, DC. Available at: <ext-link xlink:href="https://www.nagb.gov/naep-frameworks/science/2015-science-framework.html" ext-link-type="uri">https://www.nagb.gov/naep-frameworks/science/2015-science-framework.html</ext-link></citation></ref>
<ref id="ref38"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Parshall</surname> <given-names>C. G.</given-names></name> <name><surname>Brunner</surname> <given-names>B.</given-names></name></person-group> (<year>2017</year>). &#x201C;<article-title>Content development and review</article-title>&#x201D; in <source>Testing in the professions: Credentialing policies and practice</source>. eds. <person-group person-group-type="editor"><name><surname>Davis-Becker</surname> <given-names>S.</given-names></name> <name><surname>Buckendahl</surname> <given-names>C. W.</given-names></name></person-group> (<publisher-loc>Abingdon</publisher-loc>: <publisher-name>Routledge</publisher-name>), <fpage>85</fpage>&#x2013;<lpage>104</lpage>.</citation></ref>
<ref id="ref39"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pedaste</surname> <given-names>M.</given-names></name> <name><surname>M&#x00E4;eots</surname> <given-names>M.</given-names></name> <name><surname>Siiman</surname> <given-names>L. A.</given-names></name> <name><surname>de Jong</surname> <given-names>T.</given-names></name> <name><surname>van Riesen</surname> <given-names>S. A. N.</given-names></name> <name><surname>Kamp</surname> <given-names>E. T.</given-names></name> <etal/></person-group>. (<year>2015</year>). <article-title>Phases of inquiry-based learning: definitions and the inquiry cycle</article-title>. <source>Educ. Res. Rev.</source> <volume>14</volume>, <fpage>47</fpage>&#x2013;<lpage>61</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.edurev.2015.02.003</pub-id></citation></ref>
<ref id="ref40"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Provasnik</surname> <given-names>S.</given-names></name></person-group> (<year>2021</year>). <article-title>Process data, the new frontier for assessment development: rich new soil or a quixotic quest?</article-title> <source>Large Scale Assess. Educ.</source> <volume>9</volume>:<fpage>1</fpage>. doi: <pub-id pub-id-type="doi">10.1186/s40536-020-00092-z</pub-id></citation></ref>
<ref id="ref41"><citation citation-type="book"><person-group person-group-type="author"><collab id="coll3">R Core Team</collab></person-group>. (<year>2019</year>). <source>R: A language and environment for statistical computing</source>. <publisher-name>R Foundation for Statistical Computing</publisher-name>, <publisher-loc>Vienna, Austria.</publisher-loc></citation></ref>
<ref id="ref42"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Scalise</surname> <given-names>K.</given-names></name> <name><surname>Gifford</surname> <given-names>B.</given-names></name></person-group> (<year>2006</year>). <article-title>Computer-based assessment in E-learning: a framework for constructing &#x201C;intermediate constraint&#x201D; questions and tasks for technology platforms</article-title>. <source>J. Technol. Learn. Assess.</source> <volume>4</volume>, <fpage>3</fpage>&#x2013;<lpage>44</lpage>. Available at: <ext-link xlink:href="https://ejournals.bc.edu/index.php/jtla/article/view/1653" ext-link-type="uri">https://ejournals.bc.edu/index.php/jtla/article/view/1653</ext-link></citation></ref>
<ref id="ref43"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schwichow</surname> <given-names>M.</given-names></name> <name><surname>Croker</surname> <given-names>S.</given-names></name> <name><surname>Zimmerman</surname> <given-names>C.</given-names></name> <name><surname>H&#x00F6;ffler</surname> <given-names>T. N.</given-names></name> <name><surname>Haertig</surname> <given-names>H.</given-names></name></person-group> (<year>2016</year>). <article-title>Teaching the control-of-variables strategy: a meta analysis</article-title>. <source>Dev. Rev.</source> <volume>39</volume>, <fpage>37</fpage>&#x2013;<lpage>63</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.dr.2015.12.001</pub-id></citation></ref>
<ref id="ref44"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shimoda</surname> <given-names>T. A.</given-names></name> <name><surname>White</surname> <given-names>B. Y.</given-names></name> <name><surname>Frederiksen</surname> <given-names>J. R.</given-names></name></person-group> (<year>2002</year>). <article-title>Student goal orientation in learning inquiry skills with modifiable software advisors</article-title>. <source>Sci. Educ.</source> <volume>86</volume>, <fpage>244</fpage>&#x2013;<lpage>263</lpage>. doi: <pub-id pub-id-type="doi">10.1002/sce.10003</pub-id></citation></ref>
<ref id="ref45"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tang</surname> <given-names>X.</given-names></name> <name><surname>Wang</surname> <given-names>Z.</given-names></name> <name><surname>He</surname> <given-names>Q.</given-names></name> <name><surname>Liu</surname> <given-names>J.</given-names></name> <name><surname>Ying</surname> <given-names>Z.</given-names></name></person-group> (<year>2019</year>). <article-title>Latent feature extraction for process data via multidimensional scaling</article-title>. <source>Psychometrika</source> <volume>85</volume>, <fpage>378</fpage>&#x2013;<lpage>397</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s11336-020-09708-3</pub-id></citation></ref>
<ref id="ref46"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tschirgi</surname> <given-names>J. E.</given-names></name></person-group> (<year>1980</year>). <article-title>Sensible reasoning: a hypothesis about hypotheses</article-title>. <source>Child Dev.</source> <volume>51</volume>, <fpage>1</fpage>&#x2013;<lpage>10</lpage>. doi: <pub-id pub-id-type="doi">10.2307/1129583</pub-id></citation></ref>
<ref id="ref47"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ulitzsch</surname> <given-names>E.</given-names></name> <name><surname>He</surname> <given-names>Q.</given-names></name> <name><surname>Pohl</surname> <given-names>S.</given-names></name></person-group> (<year>2021</year>). <article-title>Using sequence mining techniques for understanding incorrect behavioral patterns on interactive tasks</article-title>. <source>J. Educ. Behav. Stat.</source> <volume>47</volume>, <fpage>3</fpage>&#x2013;<lpage>35</lpage>. doi: <pub-id pub-id-type="doi">10.3102/10769986211010467</pub-id></citation></ref>
<ref id="ref48"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zenisky</surname> <given-names>A. L.</given-names></name> <name><surname>Sireci</surname> <given-names>S. G.</given-names></name></person-group> (<year>2002</year>). <article-title>Technological innovations in large-scale assessment</article-title>. <source>Appl. Meas. Educ.</source> <volume>15</volume>, <fpage>337</fpage>&#x2013;<lpage>362</lpage>. doi: <pub-id pub-id-type="doi">10.1207/S15324818AME1504_02</pub-id></citation></ref>
<ref id="ref49"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>M.</given-names></name> <name><surname>Zou</surname> <given-names>D.</given-names></name> <name><surname>Wu</surname> <given-names>A. D.</given-names></name> <name><surname>Deane</surname> <given-names>P.</given-names></name> <name><surname>Li</surname> <given-names>C.</given-names></name></person-group> (<year>2017</year>). &#x201C;<article-title>An investigation of the writing processes in timed task condition using keystrokes</article-title>&#x201D; in <source>Understanding and investigating writing processes in validation research</source>. eds. <person-group person-group-type="editor"><name><surname>Zumbo</surname> <given-names>B. D.</given-names></name> <name><surname>Hubley</surname> <given-names>A. M.</given-names></name></person-group> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>321</fpage>&#x2013;<lpage>339</lpage>.</citation></ref>
</ref-list>
<fn-group>
<fn id="fn0003"><p>
<sup>1</sup>
<ext-link xlink:href="https://nces.ed.gov/" ext-link-type="uri">https://nces.ed.gov/</ext-link>
</p></fn>
<fn id="fn0004"><p><sup>2</sup>Such data include, but are not limited to: <italic>student events and their time stamps</italic>, e.g., drag-and-drop, (de)selection, or tool-use actions during answer formulation, text (re)typing or editing behaviors during keyboard-based writing, or navigation across pages, scenes, or questions during on-screen reading; and <italic>system events and their time stamps</italic>, e.g., entering/leaving scenes, (de)activating on-scene tools, or displaying pop-up messages.</p></fn>
<fn id="fn0005"><p><sup>3</sup>In addition to durations, one can measure the number/sequence of answer-related events in the test scene. Such features carry some uncertainties: more/fewer events do not necessarily require more/less effort, more events sometimes indicate low competency, and it is non-trivial to align such features with performance (scores). The duration features proposed in our study can overcome these uncertainties by explicitly measuring students&#x2019; planning and executing stages and relating them to performance.</p></fn>
<fn id="fn0006"><p><sup>4</sup>Some recent studies have begun to touch upon duration features. For example, <xref ref-type="bibr" rid="ref1">Arslan et al. (2020)</xref> reported no significant correlations between item scores and preparation (and execution) times in math items. <xref ref-type="bibr" rid="ref21">Jiang et al. (2021)</xref> investigated action sequences and the response strategies derived from them. One can also measure ratios between durations. If durations can reflect the planning and executing stages during problem solving and characterize performance patterns of students, ratios between durations can further reveal the relative cognitive loads of the planning and executing processes. We leave such ratio-based features for future work.</p></fn>
<fn id="fn0007"><p><sup>5</sup><italic>a.k.a.</italic> &#x201C;isolation of variables&#x201D; (<xref ref-type="bibr" rid="ref20">Inhelder and Piaget, 1958</xref>), &#x201C;vary one thing at a time&#x201D; (<xref ref-type="bibr" rid="ref46">Tschirgi, 1980</xref>), or &#x201C;control of variables strategy&#x201D; (<xref ref-type="bibr" rid="ref5">Chen and Klahr, 1999</xref>).</p></fn>
<fn id="fn0008"><p><sup>6</sup>Due to the private and secure nature of NAEP data, we present conceptually equivalent cover tasks to maintain task security. The cover tasks have similar underlying structures and require similar cognitive processes to solve, but do not connect to the specific science content of the real tasks.</p></fn>
<fn id="fn0009"><p><sup>7</sup>Both can be identified from event logs, and action frequency data can further clarify which situation is more common.</p></fn>
</fn-group>
</back>
</article>