<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="review-article" dtd-version="2.3" xml:lang="EN">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Educ.</journal-id>
<journal-title>Frontiers in Education</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Educ.</abbrev-journal-title>
<issn pub-type="epub">2504-284X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/feduc.2023.858273</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Education</subject>
<subj-group>
<subject>Mini Review</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Automatic item generation: foundations and machine learning-based approaches for assessments</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Circi</surname>
<given-names>Ruhan</given-names>
</name>
<xref rid="aff1" ref-type="aff"><sup>1</sup></xref>
<xref rid="c001" ref-type="corresp"><sup>&#x002A;</sup></xref>
<uri xlink:href="https://loop.frontiersin.org/people/539144/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Hicks</surname>
<given-names>Juanita</given-names>
</name>
<xref rid="aff1" ref-type="aff"><sup>1</sup></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Sikali</surname>
<given-names>Emmanuel</given-names>
</name>
<xref rid="aff2" ref-type="aff"><sup>2</sup></xref>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>American Institutes for Research</institution>, <addr-line>Arlington, VA</addr-line>, <country>United States</country></aff>
<aff id="aff2"><sup>2</sup><institution>National Center for Education Statistics</institution>, <addr-line>Washington, DC</addr-line>, <country>United States</country></aff>
<author-notes>
<fn id="fn0001" fn-type="edited-by"><p>Edited by: April Lynne Zenisky, University of Massachusetts Amherst, United States</p></fn>
<fn id="fn0002" fn-type="edited-by"><p>Reviewed by: Mark Gierl, University of Alberta, Canada</p></fn>
<corresp id="c001">&#x002A;Correspondence: Ruhan Circi, <email>rcirci@air.org</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>10</day>
<month>05</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="collection">
<year>2023</year>
</pub-date>
<volume>8</volume>
<elocation-id>858273</elocation-id>
<history>
<date date-type="received">
<day>19</day>
<month>01</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>06</day>
<month>04</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x00A9; 2023 Circi, Hicks and Sikali.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Circi, Hicks and Sikali</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>This mini review summarizes the current state of knowledge about automatic item generation in the context of educational assessment and discusses key points in the item generation pipeline. Assessment is critical in all learning systems and digitalized assessments have shown significant growth over the last decade. This leads to an urgent need to generate more items in a fast and efficient manner. Continuous improvements in computational power and advancements in methodological approaches, specifically in the field of natural language processing, provide new opportunities as well as new challenges in automatic generation of items for educational assessment. This mini review asserts the need for more work across a wide variety of areas for the scaled implementation of AIG.</p>
</abstract>
<kwd-group>
<kwd>digital assessments</kwd>
<kwd>automatic item generation</kwd>
<kwd>item models</kwd>
<kwd>machine learning approaches</kwd>
<kwd>NLP</kwd>
</kwd-group>
<contract-num rid="cn1">ED-IES-12-D-0002/0004</contract-num>
<contract-sponsor id="cn1">NCES</contract-sponsor>
<counts>
<fig-count count="0"/>
<table-count count="0"/>
<equation-count count="0"/>
<ref-count count="50"/>
<page-count count="5"/>
<word-count count="4840"/>
</counts>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Digital Learning Innovations</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec id="sec1" sec-type="intro">
<title>Introduction</title>
<p>Due to the increase in large-scale summative (e.g., national assessments, re/certification assessments) and formative assessments (e.g., practice tests/assignments, feedback preparation), items need to be created at a higher pace than ever before to keep up with continuous testing (<xref ref-type="bibr" rid="ref3">Attali, 2018</xref>; <xref ref-type="bibr" rid="ref34">Kurdi et al., 2019</xref>). This new era of continuous testing presents a challenge to the traditional methods of item creation and item stability, as it is labor intensive and costly to create items <italic>individually</italic>, and it is difficult to keep a &#x201C;healthy&#x201D; item bank where items are not overexposed, especially for computer adaptive testing. In addition to traditional items and methods of item development, more attention has been given to innovative item types, which are needed to measure newer skills that have emerged in the 21st century (e.g., collaborative skills). These innovative and interactive items are even <italic>more</italic> labor intensive and costly to create.</p>
<p>A potential solution to these issues is to generate items automatically. Automatic/automated item generation (AIG) and automated question generation (AQG) are used interchangeably in the literature to broadly refer to the process of generating items/questions from various inputs, including models, templates, or schemas. Because automatic item generation is the term most often used in the education domain, AIG will be used throughout the remainder of this review.</p>
<p>Historically, AIG was first described by John Bormuth in the 1960s (<xref ref-type="bibr" rid="ref10">Bormuth, 1969</xref>) but was not developed until much later. Through the years, item generation techniques evolved from using traditional instructional objectives to semi-automated means (<xref ref-type="bibr" rid="ref42">Roid and Haladyna, 1978</xref>). In 2006, <xref ref-type="bibr" rid="ref14">Drasgow et al. (2006)</xref> established the base for the theoretical framework and methods of AIG that are widely used in education today. By 2012, with the increase in the number of assessments and the growth of software and computing resources, AIG had become a unique research area with rapid and continuing growth. At this time, there was enough research to provide analysis of both the theoretical concepts and practical applications of AIG (<xref ref-type="bibr" rid="ref20">Gierl and Haladyna, 2012</xref>). Over the last two decades, research on AIG has addressed the current challenges of test/assessment development by generating items on a large scale efficiently (e.g., <xref ref-type="bibr" rid="ref24">Gierl et al., 2021</xref>).</p>
<p>The main promises of AIG for test/assessment development include: (1) reduced item generation time, (2) reduced cost to create items, (3) support for continuous and rapid item development for large item pools, and (4) support for learning by tailoring items for customized measurement and learning needs. In the educational context, the goal of AIG is defined as creating more items in an efficient and fast manner, such that the items target the same construct but appear unique to test takers (e.g., <xref ref-type="bibr" rid="ref39">Pugh et al., 2016</xref>). Despite these promises, there is still not enough application of AIG in educational assessment. Therefore, it is critical to understand AIG regarding its feasibility, applicability, and item quality.</p>
<p>The following review covers over 40 papers, two multimedia sources, and one systematic literature review published in the field of automated/automatic item generation for educational purposes. Three databases (i.e., ACM, IEEE, ERIC), Google Scholar, AERA and NCME conference programs, and Google searches with selected key words (e.g., &#x201C;automated item generation,&#x201D; &#x201C;automated question generation,&#x201D; &#x201C;machine learning and item generation&#x201D;) were used to extract foundational studies and the most recent work related to AIG and to shed light on the most recent developments in this field. For the purpose of this mini review, we screened the extracted papers to create a finalized list and performed an in-depth review of each selected paper. Our review focuses on the following key points:</p>
<list list-type="bullet">
<list-item>
<p>Purpose of AIG in the reviewed material.</p>
</list-item>
<list-item>
<p>Type of items generated.</p>
</list-item>
<list-item>
<p>Input type and approaches to generate items.</p>
</list-item>
<list-item>
<p>Methods used to evaluate generated items.</p>
</list-item>
</list>
<p>The following section summarizes the results of the review with a specific focus on the previous key points to show the diversity of thought regarding the topic of automated/automatic item generation and to highlight areas of improvement.</p>
</sec>
<sec id="sec2">
<title>Purpose of AIG in the reviewed material</title>
<p>With regard to the purpose of AIG, most studies in the current review focus on using AIG for assessment, including large-scale assessments (e.g., <xref ref-type="bibr" rid="ref26">Gierl et al., 2008</xref>; <xref ref-type="bibr" rid="ref39">Pugh et al., 2016</xref>; <xref ref-type="bibr" rid="ref4">Attali et al., 2022</xref>), opinion questions (e.g., <xref ref-type="bibr" rid="ref5">Baghaee, 2017</xref>), classroom/formative assessment purposes such as exam questions (e.g., <xref ref-type="bibr" rid="ref18">Fridenfalk, 2013</xref>), and practice questions (e.g., <xref ref-type="bibr" rid="ref3">Attali, 2018</xref>). Other studies use AIG to generate personality items (<xref ref-type="bibr" rid="ref44">von Davier, 2018</xref>; <xref ref-type="bibr" rid="ref29">Hommel et al., 2022</xref>). AIG can also expand past the generation of items to more complex assessment tasks, such as stories and passages (e.g., <xref ref-type="bibr" rid="ref28">Harrison et al., 2021</xref>; <xref ref-type="bibr" rid="ref4">Attali et al., 2022</xref>), which is a critical next step in assessment development (<xref ref-type="bibr" rid="ref11">Burke, 2020</xref>).</p>
</sec>
<sec id="sec3">
<title>Types of generated items</title>
<p>In this review, it was found that multiple choice items are the most frequently generated question type in the large-scale educational assessment context, as they are the main item type in most large-scale assessments. In addition to the item stem, distractors can also be generated. While using AIG to generate distractors was initially a challenge (e.g., <xref ref-type="bibr" rid="ref16">Embretson and Kingston, 2018</xref>), methods for creating efficient distractors for multiple choice items have improved over time (<xref ref-type="bibr" rid="ref36">Lai et al., 2016</xref>; <xref ref-type="bibr" rid="ref24">Gierl et al., 2021</xref>). For other types of assessments, open-ended factual questions (i.e., who, where, when, etc.) are the most common generated item type; in comparison, fewer studies attempt to create open-ended questions beyond these factual types (e.g., <xref ref-type="bibr" rid="ref18">Fridenfalk, 2013</xref>; <xref ref-type="bibr" rid="ref49">Zhou and Huang, 2019</xref>).</p>
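<p>As a hedged illustration of how distractors can be produced alongside the stem, the following toy sketch derives distractors from the key by applying rules that model common student errors. The item, the error rules, and the function name are our own illustrative assumptions, not drawn from any cited system:</p>

```python
# Toy sketch of rule-based distractor generation for a multiple choice
# arithmetic item. Each rule encodes a plausible student error; the key
# is excluded from the distractor set. Illustrative only.

def generate_mc_item(a: int, b: int) -> dict:
    key = a * b
    distractors = {
        a + b,        # error: added instead of multiplied
        a * b + b,    # error: one extra repeated addition
        a * (b - 1),  # error: one missing repeated addition
    }
    distractors.discard(key)  # a rule can coincide with the key
    return {"stem": f"What is {a} x {b}?", "key": key,
            "distractors": sorted(distractors)}

item = generate_mc_item(6, 7)
# item["key"] is 42; item["distractors"] is [13, 36, 49]
```

<p>In practice, such rules would come from a cognitive model of student errors rather than hard-coded arithmetic.</p>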
</sec>
<sec id="sec4">
<title>Input type and approaches to generate items</title>
<p>In this review, commonly used input types for item generation can be divided into two groups: (a) structured inputs, such as item model templates (e.g., <xref ref-type="bibr" rid="ref25">Gierl et al., 2012</xref>, <xref ref-type="bibr" rid="ref23">2016</xref>; <xref ref-type="bibr" rid="ref37">Latifi et al., 2013</xref>; <xref ref-type="bibr" rid="ref13">Colvin et al., 2016</xref>; <xref ref-type="bibr" rid="ref3">Attali, 2018</xref>; <xref ref-type="bibr" rid="ref8">Blum and Holling, 2018</xref>), and (b) unstructured inputs such as available written material (e.g., <xref ref-type="bibr" rid="ref31">Khodeir et al., 2018</xref>; <xref ref-type="bibr" rid="ref46">Wang et al., 2018</xref>; <xref ref-type="bibr" rid="ref45">von Davier, 2019</xref>; <xref ref-type="bibr" rid="ref49">Zhou and Huang, 2019</xref>; <xref ref-type="bibr" rid="ref4">Attali et al., 2022</xref>). A third input type can be described as a combination of both structured and unstructured inputs (e.g., <xref ref-type="bibr" rid="ref2">Atapattu et al., 2012</xref>; <xref ref-type="bibr" rid="ref47">Wang et al., 2018</xref>).</p>
<p>In the educational context, item models are commonly used (<xref ref-type="bibr" rid="ref26">Gierl et al., 2008</xref>; <xref ref-type="bibr" rid="ref21">Gierl and Lai, 2013</xref>; <xref ref-type="bibr" rid="ref8">Blum and Holling, 2018</xref>; <xref ref-type="bibr" rid="ref16">Embretson and Kingston, 2018</xref>; <xref ref-type="bibr" rid="ref40">Pugh et al., 2020</xref>). An item model is defined as &#x201C;&#x2026; a template that specifies the features in an item that can be manipulated&#x201D; (<xref ref-type="bibr" rid="ref35">LaDuca et al., 1986</xref>; <xref ref-type="bibr" rid="ref6">Bejar et al., 2003</xref>). There are multiple approaches to generate an item model (<xref ref-type="bibr" rid="ref14">Drasgow et al., 2006</xref>) and they include: (a) weak theory [e.g., generate item sets that are derived from a parent item but look different from one another (<xref ref-type="bibr" rid="ref19">Geerlings et al., 2011</xref>)], (b) cognitive theory/strong theory [e.g., systematic variations of the parts in an item supported by an underlying theory (e.g., <xref ref-type="bibr" rid="ref21">Gierl and Lai, 2013</xref>)], and (c) automatic min-max [e.g., introduction of the construct to be measured into the item development process (<xref ref-type="bibr" rid="ref1">Arendasy and Sommer, 2012</xref>)]. The most applied approach, in an educational context, for generating item models is the cognitive theory/strong theory approach. The main steps in this approach are (a) highlighting the skills and knowledge required for the problem to be solved, (b) subject matter experts (SME) developing cognitive models, (c) creating item models based on the cognitive models that specify the features that can be manipulated, and (d) finally manipulating item models using computer-based algorithms (e.g., software called IGOR).</p>
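<p>The final step above, manipulating an item model with a computer-based algorithm, can be sketched minimally as exhaustive instantiation of a parent item's manipulable elements. The template text and element values below are hypothetical; real generators (e.g., IGOR) typically also enforce constraints among elements:</p>

```python
# Minimal sketch of item-model-based generation: every combination of the
# manipulable elements is substituted into the parent item template.
from itertools import product

item_model = "A train travels {speed} km/h for {hours} hours. How far does it go?"
elements = {"speed": [60, 80, 100], "hours": [2, 3]}

def generate_items(template: str, elements: dict) -> list:
    names = list(elements)
    items = []
    for values in product(*(elements[n] for n in names)):
        bindings = dict(zip(names, values))
        items.append({
            "stem": template.format(**bindings),
            "key": bindings["speed"] * bindings["hours"],  # distance = speed * time
        })
    return items

items = generate_items(item_model, elements)
# 3 speeds x 2 durations -> 6 generated items
```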
<p>The approaches used for question/item generation from available written text have developed gradually and are more diverse than item-model approaches. In the field of educational assessment (i.e., large scale), there is a limited amount of work using available written text for automatic item generation (e.g., <xref ref-type="bibr" rid="ref4">Attali et al., 2022</xref>). Hence, most of the examples in this review are from other assessment domains, such as practice quiz generation, factual question generation (e.g., <xref ref-type="bibr" rid="ref17">Fattoh et al., 2015</xref>; <xref ref-type="bibr" rid="ref5">Baghaee, 2017</xref>; <xref ref-type="bibr" rid="ref46">Wang et al., 2018</xref>; <xref ref-type="bibr" rid="ref33">Kumar et al., 2019</xref>; <xref ref-type="bibr" rid="ref7">Bl&#x0161;t&#x00E1;k and Rozinajov&#x00E1;, 2022</xref>), and personality item generation (e.g., <xref ref-type="bibr" rid="ref44">von Davier, 2018</xref>; <xref ref-type="bibr" rid="ref29">Hommel et al., 2022</xref>).</p>
<p>The approaches include machine learning/deep learning architectures (e.g., RNNs and their variants, as in <xref ref-type="bibr" rid="ref32">Kim et al., 2019</xref>), natural language processing (NLP) based models (as in <xref ref-type="bibr" rid="ref47">Wang et al., 2018</xref>; <xref ref-type="bibr" rid="ref7">Bl&#x0161;t&#x00E1;k and Rozinajov&#x00E1;, 2022</xref>), and large pre-trained language models (e.g., GPT-2, GPT-3, BERT, as in <xref ref-type="bibr" rid="ref4">Attali et al., 2022</xref>). Neural/deep networks, some of which integrate NLP-based approaches (e.g., <xref ref-type="bibr" rid="ref50">Zhou et al., 2017</xref>; <xref ref-type="bibr" rid="ref47">Wang et al., 2018</xref>), are trained on large data sets and have the potential to learn implicit rules from the data itself: examples include GPT-2 fine-tuned using the International Personality Item Pool in <xref ref-type="bibr" rid="ref29">Hommel et al. (2022)</xref>; GPT-2 fine-tuned on free medical articles in <xref ref-type="bibr" rid="ref45">von Davier (2019)</xref>; the Stanford Question Answering Dataset (SQuAD) in <xref ref-type="bibr" rid="ref33">Kumar et al. (2019)</xref>; Dolphin18K in <xref ref-type="bibr" rid="ref49">Zhou and Huang (2019)</xref>; the Amazon Question/Answer data set in <xref ref-type="bibr" rid="ref5">Baghaee (2017)</xref>; and Wikipedia in <xref ref-type="bibr" rid="ref28">Harrison et al. (2021)</xref>. For example, SQuAD (<xref ref-type="bibr" rid="ref41">Rajpurkar et al., 2016</xref>) consists of more than 100,000 questions from more than 500 articles. It is critical to note that none of the existing data sets that include question-answer pairs are tailored for educational assessment subjects. Research in automated item generation for educational assessments can benefit from the availability of targeted resources (e.g., science subject-specific input data) to take advantage of emerging approaches.</p>
<p>Among the approaches utilizing deep learning technology, sequence-to-sequence models, which aim to produce plausible questions with minimal human intervention, have come a long way since their inception (e.g., <xref ref-type="bibr" rid="ref34">Kurdi et al., 2019</xref>; <xref ref-type="bibr" rid="ref38">Pan et al., 2019</xref>). Starting from a baseline sequence-to-sequence model, <xref ref-type="bibr" rid="ref15">Du et al. (2017)</xref> proposed a model in which an encoder converts sentence-level and paragraph-level information into hidden vectors, and a decoder takes those vectors and creates hidden vectors to predict the next word. This approach is used to generate questions from text passages to measure reading comprehension. In this work, the authors observed that generated questions also included parts of the answer. To address this issue, <xref ref-type="bibr" rid="ref48">Zhao et al. (2018)</xref> and <xref ref-type="bibr" rid="ref32">Kim et al. (2019)</xref> proposed sequence-to-sequence models that are answer aware (i.e., taking the answer as additional information) and position aware (i.e., using the distance between the context words and the answer). Similarly, <xref ref-type="bibr" rid="ref43">Sun et al. (2018)</xref> addressed issues of unmatched words and unrelated copied context words using a more complex model.</p>
<p>Methods utilizing RNNs as sequence-to-sequence models to generate questions from sentences or passages (<xref ref-type="bibr" rid="ref15">Du et al., 2017</xref>; <xref ref-type="bibr" rid="ref32">Kim et al., 2019</xref>) are the most common. However, RNN models struggle with long contexts/sequences; that is, their performance decreases when applied to paragraph-level context. <xref ref-type="bibr" rid="ref12">Chan and Fan (2019)</xref> showed that pre-trained language models can also be used efficiently to generate questions. By altering the architecture of one language model (i.e., BERT) to allow sequential generation of words, the authors demonstrated the ability of such models to produce appropriate questions from a paragraph context.</p>
</sec>
<sec id="sec5">
<title>Methods used to evaluate generated items</title>
<p>Development of test specifications and production of items are the first steps of item development for operational purposes. Traditionally, items go through a rigorous review process that includes multiple rounds of review and editing; they are also pilot tested, and their psychometric characteristics, including item difficulty, discrimination, and differential item functioning, are evaluated before operational use (e.g., <xref ref-type="bibr" rid="ref27">Haladyna and Rodriguez, 2013</xref>). The main promise of AIG is to reduce one-by-one item production and instead produce items in large quantities. Yet another time-consuming part of item generation is reviewing and approving items for use in operational settings. Specifically, evaluation of the psychometric characteristics of automatically generated items is critical for operational use in large-scale educational settings. Therefore, it is important to have a systematic and automated evaluation of items generated by AIG.</p>
<p>The item model approach provides a more comprehensive method for the evaluation of AIG-generated items, including both qualitative and empirical methods. One qualitative approach is to mix AIG items with traditionally developed items and ask content experts to review them to examine whether the two sets are distinguishable (e.g., <xref ref-type="bibr" rid="ref22">Gierl and Lai, 2018</xref>; <xref ref-type="bibr" rid="ref40">Pugh et al., 2020</xref>). Empirical methods include examination of the psychometric properties of the items using classical test theory and item response theory measures (e.g., <xref ref-type="bibr" rid="ref23">Gierl et al., 2016</xref>; <xref ref-type="bibr" rid="ref3">Attali, 2018</xref>; <xref ref-type="bibr" rid="ref8">Blum and Holling, 2018</xref>; <xref ref-type="bibr" rid="ref16">Embretson and Kingston, 2018</xref>), as well as similarity metrics (e.g., <xref ref-type="bibr" rid="ref21">Gierl and Lai, 2013</xref>; <xref ref-type="bibr" rid="ref37">Latifi et al., 2013</xref>). Among these methods, the evaluation of psychometric properties for multiple choice items is more established than for other item types [e.g., most of the items in <xref ref-type="bibr" rid="ref3">Attali (2018)</xref>; <xref ref-type="bibr" rid="ref37">Latifi et al. (2013)</xref>].</p>
<p>Neural net/deep learning approaches show more variety in the evaluation of generated items; however, these evaluation procedures are still in their early stages and are less standardized. In addition to human evaluation of generated items, some studies use machine translation evaluation metrics, such as Bilingual Evaluation Understudy (BLEU) or Metric for Evaluation of Translation with Explicit ORdering (METEOR), and text summary measures such as Recall-Oriented Understudy for Gisting Evaluation (ROUGE), to compare their approach with other AIG methods (e.g., <xref ref-type="bibr" rid="ref5">Baghaee, 2017</xref>; <xref ref-type="bibr" rid="ref46">Wang et al., 2018</xref>; <xref ref-type="bibr" rid="ref49">Zhou and Huang, 2019</xref>). One study (<xref ref-type="bibr" rid="ref44">von Davier, 2018</xref>) used factor analysis as an evaluation method to show that dimensionality is the same for generated items. Another study (<xref ref-type="bibr" rid="ref17">Fattoh et al., 2015</xref>) used confusion matrix measures (i.e., precision, recall, F-measure) to evaluate the prediction of item types (i.e., who, what, when, etc.), not the actual items created using AIG.</p>
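<p>At their core, the machine translation and summarization metrics mentioned above are n-gram overlap statistics. The following is a minimal sketch of their unigram versions only; full BLEU adds higher-order n-grams and a brevity penalty, and ROUGE has several variants. The example sentences are hypothetical:</p>

```python
# Unigram sketches: BLEU-1 is precision of the candidate question against
# a reference; ROUGE-1 is recall of the reference by the candidate.
from collections import Counter

def clipped_matches(candidate: str, reference: str) -> int:
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    # Each candidate word counts at most as often as it appears in the reference.
    return sum(min(count, ref[word]) for word, count in cand.items())

def bleu1(candidate: str, reference: str) -> float:
    return clipped_matches(candidate, reference) / len(candidate.split())

def rouge1(candidate: str, reference: str) -> float:
    return clipped_matches(candidate, reference) / len(reference.split())

reference = "who wrote the declaration of independence"
candidate = "who wrote the declaration"
# bleu1 = 4/4 = 1.0 (every candidate word appears in the reference)
# rouge1 = 4/6 (two reference words are missed)
```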
</sec>
<sec id="sec6" sec-type="conclusions">
<title>Conclusion</title>
<p>In this mini review, we have shown that there are various approaches to item generation for assessment purposes. Similar to <xref ref-type="bibr" rid="ref34">Kurdi et al. (2019)</xref>, our review of the literature suggests that almost all the work conducted using AIG is experimental, not operational. However, it is hard to conclude that AIG is not commonly used in operational settings, as there is limited access to the methods used by testing organizations due to privacy and confidentiality policies. A few well-known testing organizations mention the use of AIG to produce operational items (e.g., <xref ref-type="bibr" rid="ref9">Bo et al., 2020</xref>). We observed that most of the work specific to AIG in large-scale assessments uses template or rule-based approaches as the primary method for creating item models from which to generate items (e.g., <xref ref-type="bibr" rid="ref22">Gierl and Lai, 2018</xref>). These approaches have the advantage of creating items that align well with the intended constructs. Yet, they are mostly used in subjects (e.g., math, medical assessments) where question types are more conventional (e.g., multiple choice, fill-in-the-blank). There are a few exceptions where data-driven methods such as deep learning and natural language processing are employed (<xref ref-type="bibr" rid="ref45">von Davier, 2019</xref>; <xref ref-type="bibr" rid="ref11">Burke, 2020</xref>; <xref ref-type="bibr" rid="ref4">Attali et al., 2022</xref>). In the realm of neural networks and large pre-trained models, these approaches aim to generate questions/items with minimal human involvement. Neural-based models dominate state-of-the-art question generation in various domains, and in the last few years large pretrained models have yielded significant performance gains on many tasks. Researchers now leverage existing models to generate semantically coherent and fluent questions, which is critical in the context of educational assessments as digitalization in education continues to grow. However, it is important to reiterate that non-template, data-driven models still have a long way to go to meet the standards of quality expected in operational testing situations.</p>
<p>Automated/automatic item generation is a process; therefore, the steps used in AIG are very important. While terminology is not consistent across fields or papers, it is helpful to understand two main stages of AIG (approaches differ in their number of stages): (1) the input stage (or encoder) and (2) the transformation stage (generation, or decoder). Algorithmic generation approaches lead to scalable item development and produce large numbers of items; with this abundance, however, the next challenge becomes identifying high-quality items. Not all studies provided an evaluation of generated items, and those that did varied widely in their evaluation approaches. These approaches include: (1) blind review of both AIG items and traditionally developed items by expert panels (e.g., <xref ref-type="bibr" rid="ref31">Khodeir et al., 2018</xref>; <xref ref-type="bibr" rid="ref40">Pugh et al., 2020</xref>), (2) factor analysis to examine the internal structure (e.g., <xref ref-type="bibr" rid="ref44">von Davier, 2018</xref>), (3) comparison of psychometric properties of AIG items with operational items (e.g., <xref ref-type="bibr" rid="ref3">Attali, 2018</xref>; <xref ref-type="bibr" rid="ref16">Embretson and Kingston, 2018</xref>), (4) examination of the similarity of generated items (using a cosine similarity index, e.g., <xref ref-type="bibr" rid="ref21">Gierl and Lai, 2013</xref>; <xref ref-type="bibr" rid="ref37">Latifi et al., 2013</xref>; <xref ref-type="bibr" rid="ref30">Kaliski et al., 2020</xref>), and (5) comparison of different models with the original text or human judgment using machine translation indices, with or without human evaluation (e.g., <xref ref-type="bibr" rid="ref46">Wang et al., 2018</xref>; <xref ref-type="bibr" rid="ref12">Chan and Fan, 2019</xref>; <xref ref-type="bibr" rid="ref49">Zhou and Huang, 2019</xref>). The variety of AIG evaluation approaches in the literature suggests a clear need for more research in this area.</p>
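<p>The cosine similarity index used to examine the similarity of generated items can be sketched in a few lines: items are represented as bag-of-words vectors and compared pairwise, with values near 1 flagging near-duplicate items. The example items below are hypothetical, and operational similarity checks may use richer representations:</p>

```python
# Bag-of-words cosine similarity between generated items. High values
# indicate near-duplicate items (e.g., siblings from one item model).
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

item_1 = "A train travels 60 km/h for 2 hours"
item_2 = "A train travels 80 km/h for 3 hours"  # sibling from the same model
item_3 = "Solve for x in the equation below"    # unrelated item
# item_1 vs item_2 scores high (0.75); item_1 vs item_3 scores low
```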
<p>Over the years, increasing item demands have led to multiple approaches to automatically generate items. Various researchers and practitioners have helped AIG to be more consistent and established, but at the same time have increased its complexity. All approaches used for AIG still need to be thoroughly tested to become well understood. Thus, this summary review suggests that the topic of automated/automatic item generation is wide and varied with its unique strengths and limitations as an assessment tool.</p>
</sec>
<sec id="sec7">
<title>Author contributions</title>
<p>RC and JH: wrote the original draft, reviewed, and edited. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec id="sec8" sec-type="funding-information">
<title>Funding</title>
<p>Research in this working paper was developed with funding from NCES under Contract No. ED-IES-12-D-0002/0004. The views, thoughts, and opinions expressed in the paper belong solely to the author(s) and do not reflect NCES position or endorsement.</p>
</sec>
<sec id="conf1" sec-type="COI-statement">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec id="sec100" sec-type="disclaimer">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
</body>
<back>
<ack>
<p>We would like to thank our colleagues, Fusun Sahin, Tiago Calico, and Xiaying Zheng, for their contributions to the original literature review referenced in this work. We also would like to thank Emmanuel Sikali, Senior Research Scientist/Mathematical Statistician at NCES, for his support to expand the original work.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="ref1"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Arendasy</surname> <given-names>M.</given-names></name> <name><surname>Sommer</surname> <given-names>M.</given-names></name></person-group> (<year>2012</year>). <article-title>Using automatic item generation to meet the increasing item demands of high-stakes assessment</article-title>. <source>Learn. Individ. Differ.</source> <volume>22</volume>, <fpage>112</fpage>&#x2013;<lpage>117</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.lindif.2011.11.005</pub-id></citation></ref>
<ref id="ref2"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Atapattu</surname> <given-names>T.</given-names></name> <name><surname>Falkner</surname> <given-names>K.</given-names></name> <name><surname>Falkner</surname> <given-names>N.</given-names></name></person-group> (<year>2012</year>). &#x201C;<article-title>Automated extraction of semantic concepts from semi structured data: supporting computer-based education through the analysis of lecture notes</article-title>,&#x201D; in <source>Database and Expert Systems Applications. DEXA 2012</source>. eds. <person-group person-group-type="editor"><name><surname>Liddle</surname> <given-names>S. W.</given-names></name> <name><surname>Schewe</surname> <given-names>K. D.</given-names></name> <name><surname>Tjoa</surname> <given-names>A. M.</given-names></name> <name><surname>Zhou</surname> <given-names>X.</given-names></name></person-group>, <series>Lecture Notes in Computer Science</series> (<publisher-loc>Berlin, Heidelberg</publisher-loc>: <publisher-name>Springer</publisher-name>).</citation></ref>
<ref id="ref3"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Attali</surname> <given-names>Y.</given-names></name></person-group> (<year>2018</year>). &#x201C;<article-title>Automatic item generation unleashed: an evaluation of a large-scale deployment of item models</article-title>,&#x201D; in <source>Artificial Intelligence in Education: 19th International Conference</source>, eds C. P. Ros&#x00E9;, R. Mart&#x00ED;nez-Maldonado, H. U. Hoppe, R. Luckin, M. Mavrikis, K. Porayska-Pomsta, B. McLaren, and B. du Boulay (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>17</fpage>&#x2013;<lpage>29</lpage>.</citation></ref>
<ref id="ref4"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Attali</surname> <given-names>Y.</given-names></name> <name><surname>Runge</surname> <given-names>A.</given-names></name> <name><surname>LaFlair</surname> <given-names>G. T.</given-names></name> <name><surname>Yancey</surname> <given-names>K.</given-names></name> <name><surname>Goodwin</surname> <given-names>S.</given-names></name> <name><surname>Park</surname> <given-names>Y.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>The interactive reading task: transformer-based automatic item generation</article-title>. <source>Front. Artif. Intell.</source> <volume>5</volume>:<fpage>903077</fpage>. doi: <pub-id pub-id-type="doi">10.3389/frai.2022.903077</pub-id>, PMID: <pub-id pub-id-type="pmid">35937141</pub-id></citation></ref>
<ref id="ref5"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Baghaee</surname> <given-names>T.</given-names></name></person-group> (<year>2017</year>). Automatic neural question generation using community-based question answering systems. Unpublished master&#x2019;s thesis. University of Lethbridge.</citation></ref>
<ref id="ref6"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bejar</surname> <given-names>I. I.</given-names></name> <name><surname>Lawless</surname> <given-names>R. R.</given-names></name> <name><surname>Morley</surname> <given-names>M. E.</given-names></name> <name><surname>Wagner</surname> <given-names>M. E.</given-names></name> <name><surname>Bennett</surname> <given-names>R. E.</given-names></name> <name><surname>Revuelta</surname> <given-names>J.</given-names></name></person-group> (<year>2003</year>). <article-title>A feasibility study of on-the-fly item generation in adaptive testing</article-title>. <source>J. Technol. Learn. Assess.</source> <volume>2</volume>, <fpage>1</fpage>&#x2013;<lpage>32</lpage>.</citation></ref>
<ref id="ref7"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bl&#x0161;t&#x00E1;k</surname> <given-names>M.</given-names></name> <name><surname>Rozinajov&#x00E1;</surname> <given-names>V.</given-names></name></person-group> (<year>2022</year>). <article-title>Automatic question generation based on sentence structure analysis using machine learning approach</article-title>. <source>Nat. Lang. Eng.</source> <volume>28</volume>, <fpage>487</fpage>&#x2013;<lpage>517</lpage>. doi: <pub-id pub-id-type="doi">10.1017/S1351324921000139</pub-id></citation></ref>
<ref id="ref8"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Blum</surname> <given-names>D.</given-names></name> <name><surname>Holling</surname> <given-names>H.</given-names></name></person-group> (<year>2018</year>). <article-title>Automatic generation of figural analogies with the IMak package</article-title>. <source>Front. Psychol.</source> <volume>9</volume>, <fpage>1</fpage>&#x2013;<lpage>13</lpage>. doi: <pub-id pub-id-type="doi">10.3389/fpsyg.2018.01286</pub-id>, PMID: <pub-id pub-id-type="pmid">30127757</pub-id></citation></ref>
<ref id="ref9"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Bo</surname> <given-names>E.</given-names></name> <name><surname>He</surname> <given-names>W.</given-names></name> <name><surname>Javurel</surname> <given-names>A.</given-names></name> <name><surname>Miller</surname> <given-names>S.</given-names></name> <name><surname>Scheuring</surname> <given-names>M. S.</given-names></name> <name><surname>Simpson</surname> <given-names>M. A.</given-names></name></person-group> (<year>2020</year>). Items and item models: AIG traditional and modern. In P. Kaliski (Chair), <italic>Challenges with automatic item generation implementation: Research, strategies, and lessons learned [virtual symposium]. Annual Meeting of the National Council on Measurement in Education</italic>.</citation></ref>
<ref id="ref10"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Bormuth</surname> <given-names>J.</given-names></name></person-group> (<year>1969</year>). <source>On a Theory of Achievement Test Items</source>. <publisher-loc>Chicago, IL</publisher-loc>: <publisher-name>University of Chicago Press</publisher-name>.</citation></ref>
<ref id="ref11"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Burke</surname> <given-names>A.</given-names></name></person-group> (Host). (<year>2020</year>). Creating test items with automated item generation: AIG, AIGL, &#x0026; POE (no. 7) [audio podcast episode]. In ACT Next Navigator. Available at: <ext-link xlink:href="https://www.youtube.com/watch?v=XUSGAIdfVsg" ext-link-type="uri">https://www.youtube.com/watch?v=XUSGAIdfVsg</ext-link></citation></ref>
<ref id="ref12"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Chan</surname> <given-names>Y.-H.</given-names></name> <name><surname>Fan</surname> <given-names>Y.-C.</given-names></name></person-group> (<year>2019</year>). A recurrent BERT-based model for question generation. In <italic>Proceedings of the 2nd Workshop on Machine Reading for Question Answering. Association for Computational Linguistics</italic>.</citation></ref>
<ref id="ref13"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Colvin</surname> <given-names>K. F.</given-names></name> <name><surname>Keller</surname> <given-names>L. A.</given-names></name> <name><surname>Robin</surname> <given-names>F.</given-names></name></person-group> (<year>2016</year>). <article-title>Effect of imprecise parameter estimation on ability estimation in a multistage test in an automatic item generation context</article-title>. <source>J. Comput. Adapt. Test.</source> <volume>4</volume>, <fpage>1</fpage>&#x2013;<lpage>18</lpage>. doi: <pub-id pub-id-type="doi">10.7333/1608-040101</pub-id></citation></ref>
<ref id="ref14"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Drasgow</surname> <given-names>F.</given-names></name> <name><surname>Luecht</surname> <given-names>R. M.</given-names></name> <name><surname>Bennett</surname> <given-names>R. E.</given-names></name></person-group> (<year>2006</year>). &#x201C;<article-title>Technology and testing</article-title>,&#x201D; in <source>Educational Measurement</source>. ed. <person-group person-group-type="editor"><name><surname>Brennan</surname> <given-names>R. L.</given-names></name></person-group> (<publisher-loc>Westport, CT</publisher-loc>: <publisher-name>Praeger Publishers</publisher-name>), <fpage>471</fpage>&#x2013;<lpage>516</lpage>.</citation></ref>
<ref id="ref15"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Du</surname> <given-names>X.</given-names></name> <name><surname>Shao</surname> <given-names>J.</given-names></name> <name><surname>Cardie</surname> <given-names>C.</given-names></name></person-group> (<year>2017</year>). Learning to ask: Neural question generation for reading comprehension. arXiv:1705.00106 [cs] <comment>[Epub ahead of preprint]</comment>.</citation></ref>
<ref id="ref16"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Embretson</surname> <given-names>S.</given-names></name> <name><surname>Kingston</surname> <given-names>N. M.</given-names></name></person-group> (<year>2018</year>). <article-title>Automatic item generation: a more efficient process for developing mathematics achievement items?</article-title> <source>J. Educ. Meas.</source> <volume>55</volume>, <fpage>112</fpage>&#x2013;<lpage>131</lpage>. doi: <pub-id pub-id-type="doi">10.1111/jedm.12166</pub-id></citation></ref>
<ref id="ref17"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fattoh</surname> <given-names>I. E.</given-names></name> <name><surname>Aboutabl</surname> <given-names>A. E.</given-names></name> <name><surname>Haggag</surname> <given-names>M. H.</given-names></name></person-group> (<year>2015</year>). <article-title>Semantic question generation using artificial immunity</article-title>. <source>Int. J. Mod. Educ. Comput. Sci.</source> <volume>7</volume>, <fpage>1</fpage>&#x2013;<lpage>8</lpage>. doi: <pub-id pub-id-type="doi">10.5815/ijmecs.2015.01.01</pub-id></citation></ref>
<ref id="ref18"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fridenfalk</surname> <given-names>M.</given-names></name></person-group> (<year>2013</year>). &#x201C;<article-title>System for automatic generation of examination papers in discrete mathematics</article-title>,&#x201D; in <source>Proceedings of the IADIS International Conference on e-Learning 2013, IADIS Multi Conference on Computer Science and Information Systems</source>, <fpage>365</fpage>&#x2013;<lpage>368</lpage>.</citation></ref>
<ref id="ref19"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Geerlings</surname> <given-names>H.</given-names></name> <name><surname>Glas</surname> <given-names>C. A. W.</given-names></name> <name><surname>van der Linden</surname> <given-names>W. J.</given-names></name></person-group> (<year>2011</year>). <article-title>Modeling rule-based item generation</article-title>. <source>Psychometrika</source> <volume>76</volume>, <fpage>337</fpage>&#x2013;<lpage>359</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s11336-011-9204-x</pub-id></citation></ref>
<ref id="ref20"><citation citation-type="book"><person-group person-group-type="editor"><name><surname>Gierl</surname> <given-names>M. J.</given-names></name> <name><surname>Haladyna</surname> <given-names>T. M.</given-names></name></person-group> (Eds.). (<year>2012</year>). <source>Automatic Item Generation: Theory and Practice</source> (<edition>1st</edition>). <publisher-loc>England</publisher-loc>: <publisher-name>Routledge</publisher-name>.</citation></ref>
<ref id="ref21"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gierl</surname> <given-names>M. J.</given-names></name> <name><surname>Lai</surname> <given-names>H.</given-names></name></person-group> (<year>2013</year>). <article-title>Instructional topics in educational measurement (items) module: using automated processes to generate test items</article-title>. <source>Educ. Meas. Issues Pract.</source> <volume>32</volume>, <fpage>36</fpage>&#x2013;<lpage>50</lpage>. doi: <pub-id pub-id-type="doi">10.1111/emip.12018</pub-id></citation></ref>
<ref id="ref22"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gierl</surname> <given-names>M. J.</given-names></name> <name><surname>Lai</surname> <given-names>H.</given-names></name></person-group> (<year>2018</year>). <article-title>Using automatic item generation to create solutions and rationales for computerized formative testing</article-title>. <source>Appl. Psychol. Meas.</source> <volume>42</volume>, <fpage>42</fpage>&#x2013;<lpage>57</lpage>. doi: <pub-id pub-id-type="doi">10.1177/0146621617726788</pub-id>, PMID: <pub-id pub-id-type="pmid">29881111</pub-id></citation></ref>
<ref id="ref23"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gierl</surname> <given-names>M. J.</given-names></name> <name><surname>Lai</surname> <given-names>H.</given-names></name> <name><surname>Pugh</surname> <given-names>D.</given-names></name> <name><surname>Touchie</surname> <given-names>C.</given-names></name> <name><surname>Boulais</surname> <given-names>A.-P.</given-names></name> <name><surname>De Champlain</surname> <given-names>A.</given-names></name></person-group> (<year>2016</year>). <article-title>Evaluating the psychometric characteristics of generated multiple-choice test items</article-title>. <source>Appl. Meas. Educ.</source> <volume>29</volume>, <fpage>196</fpage>&#x2013;<lpage>210</lpage>. doi: <pub-id pub-id-type="doi">10.1080/08957347.2016.1171768</pub-id></citation></ref>
<ref id="ref24"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Gierl</surname> <given-names>M. J.</given-names></name> <name><surname>Lai</surname> <given-names>H.</given-names></name> <name><surname>Tanygin</surname> <given-names>V.</given-names></name></person-group> (Eds.). (<year>2021</year>). <source>Advanced Methods in Automatic Item Generation</source> (<edition>1st</edition>). <publisher-loc>England</publisher-loc>: <publisher-name>Routledge</publisher-name>.</citation></ref>
<ref id="ref25"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gierl</surname> <given-names>M. J.</given-names></name> <name><surname>Lai</surname> <given-names>H.</given-names></name> <name><surname>Turner</surname> <given-names>S. R.</given-names></name></person-group> (<year>2012</year>). <article-title>Using automatic item generation to create multiple-choice test items</article-title>. <source>Med. Educ. J.</source> <volume>46</volume>, <fpage>757</fpage>&#x2013;<lpage>765</lpage>. doi: <pub-id pub-id-type="doi">10.1111/j.1365-2923.2012.04289.x</pub-id>, PMID: <pub-id pub-id-type="pmid">22803753</pub-id></citation></ref>
<ref id="ref26"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gierl</surname> <given-names>M. J.</given-names></name> <name><surname>Zhou</surname> <given-names>J.</given-names></name> <name><surname>Alves</surname> <given-names>C.</given-names></name></person-group> (<year>2008</year>). <article-title>Developing a taxonomy of item model types to promote assessment engineering</article-title>. <source>J. Technol. Learn. Assess.</source> <volume>7</volume>, <fpage>1</fpage>&#x2013;<lpage>51</lpage>.</citation></ref>
<ref id="ref27"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Haladyna</surname> <given-names>T.</given-names></name> <name><surname>Rodriguez</surname> <given-names>M.</given-names></name></person-group> (<year>2013</year>). &#x201C;<article-title>Developing the test item</article-title>,&#x201D; in <source>Developing and Validating Test Items</source>. eds. <person-group person-group-type="editor"><name><surname>Haladyna</surname> <given-names>T.</given-names></name> <name><surname>Rodriguez</surname> <given-names>M.</given-names></name></person-group> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Routledge</publisher-name>), <fpage>17</fpage>&#x2013;<lpage>27</lpage>.</citation></ref>
<ref id="ref28"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Harrison</surname> <given-names>B.</given-names></name> <name><surname>Purdy</surname> <given-names>C.</given-names></name> <name><surname>Riedl</surname> <given-names>M.</given-names></name></person-group> (<year>2021</year>). Toward automated story generation with Markov chain Monte Carlo methods and deep neural networks. In <italic>Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment</italic>, pp. 191&#x2013;197.</citation></ref>
<ref id="ref29"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hommel</surname> <given-names>B. E.</given-names></name> <name><surname>Wollang</surname> <given-names>F.-J. M.</given-names></name> <name><surname>Kotova</surname> <given-names>V.</given-names></name> <name><surname>Zacher</surname> <given-names>H.</given-names></name> <name><surname>Schmukle</surname> <given-names>S. C.</given-names></name></person-group> (<year>2022</year>). <article-title>Transformer-based deep neural language modeling for construct-specific automatic item generation</article-title>. <source>Psychometrika</source> <volume>87</volume>, <fpage>749</fpage>&#x2013;<lpage>772</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s11336-021-09823-9</pub-id>, PMID: <pub-id pub-id-type="pmid">34907497</pub-id></citation></ref>
<ref id="ref30"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Kaliski</surname> <given-names>P.</given-names></name> <name><surname>Clauser</surname> <given-names>J.</given-names></name> <name><surname>Burke</surname> <given-names>M.</given-names></name></person-group> (<year>2020</year>). Exploring the utility of semantic similarity indices for automated item generation. In P. Kaliski (Chair), <italic>Challenges with automatic item generation implementation: Research, strategies, and lessons learned [virtual symposium]. Annual Meeting of the National Council on Measurement in Education</italic>.</citation></ref>
<ref id="ref31"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Khodeir</surname> <given-names>N. A.</given-names></name> <name><surname>Elazhary</surname> <given-names>H.</given-names></name> <name><surname>Wanas</surname> <given-names>N.</given-names></name></person-group> (<year>2018</year>). <article-title>Generating story problems via controlled parameters in a web-based intelligent tutoring system</article-title>. <source>Int. J. Inf. Learn. Technol.</source> <volume>35</volume>, <fpage>199</fpage>&#x2013;<lpage>216</lpage>. doi: <pub-id pub-id-type="doi">10.1108/IJILT-09-2017-0085</pub-id></citation></ref>
<ref id="ref32"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kim</surname> <given-names>Y.</given-names></name> <name><surname>Lee</surname> <given-names>H.</given-names></name> <name><surname>Shin</surname> <given-names>J.</given-names></name> <name><surname>Jung</surname> <given-names>K.</given-names></name></person-group> (<year>2019</year>). <article-title>Improving neural question generation using answer separation</article-title>. <source>Proc. AAAI Conf. Artif. Intell.</source> <volume>33</volume>, <fpage>6602</fpage>&#x2013;<lpage>6609</lpage>. doi: <pub-id pub-id-type="doi">10.1609/aaai.v33i01.33016602</pub-id></citation></ref>
<ref id="ref33"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Kumar</surname> <given-names>V.</given-names></name> <name><surname>Muneeswaran</surname> <given-names>S.</given-names></name> <name><surname>Ramakrishnan</surname> <given-names>G.</given-names></name> <name><surname>Li</surname> <given-names>Y.-F.</given-names></name></person-group>. (<year>2019</year>). ParaQG: a system for generating questions and answers from paragraphs. arXiv <comment>[Epub ahead of preprint]</comment>: 1&#x2013;6.</citation></ref>
<ref id="ref34"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kurdi</surname> <given-names>G.</given-names></name> <name><surname>Leo</surname> <given-names>J.</given-names></name> <name><surname>Parsia</surname> <given-names>B.</given-names></name> <name><surname>Sattler</surname> <given-names>U.</given-names></name> <name><surname>Al-Emari</surname> <given-names>S.</given-names></name></person-group> (<year>2019</year>). <article-title>A systematic review of automatic question generation for educational purposes</article-title>. <source>Int. J. Artif. Intell. Educ.</source> <volume>30</volume>, <fpage>121</fpage>&#x2013;<lpage>204</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s40593-019-00186-y</pub-id></citation></ref>
<ref id="ref35"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>LaDuca</surname> <given-names>A.</given-names></name> <name><surname>Staples</surname> <given-names>W. I.</given-names></name> <name><surname>Templeton</surname> <given-names>B.</given-names></name> <name><surname>Holzman</surname> <given-names>G. B.</given-names></name></person-group> (<year>1986</year>). <article-title>Item modeling procedures for constructing content-equivalent multiple-choice questions</article-title>. <source>Med. Educ.</source> <volume>20</volume>, <fpage>53</fpage>&#x2013;<lpage>56</lpage>. doi: <pub-id pub-id-type="doi">10.1111/j.1365-2923.1986.tb01042.x</pub-id>, PMID: <pub-id pub-id-type="pmid">3951382</pub-id></citation></ref>
<ref id="ref36"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lai</surname> <given-names>H.</given-names></name> <name><surname>Gierl</surname> <given-names>M. J.</given-names></name> <name><surname>Touchie</surname> <given-names>C.</given-names></name> <name><surname>Pugh</surname> <given-names>D.</given-names></name> <name><surname>Boulais</surname> <given-names>A.</given-names></name> <name><surname>De Champlain</surname> <given-names>A.</given-names></name></person-group> (<year>2016</year>). <article-title>Using automatic item generation to improve the quality of MCQ distractors</article-title>. <source>Teach. Learn. Med.</source> <volume>28</volume>, <fpage>166</fpage>&#x2013;<lpage>173</lpage>. doi: <pub-id pub-id-type="doi">10.1080/10401334.2016.1146608</pub-id>, PMID: <pub-id pub-id-type="pmid">26849247</pub-id></citation></ref>
<ref id="ref37"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Latifi</surname> <given-names>S.</given-names></name> <name><surname>Gierl</surname> <given-names>M. J.</given-names></name> <name><surname>Lai</surname> <given-names>H.</given-names></name> <name><surname>Fung</surname> <given-names>K.</given-names></name></person-group> (<year>2013</year>) Establishing item uniqueness for automatic item generation [Paper presentation]. Annual Meeting of the National Council on Measurement in Education, San Francisco, CA.</citation></ref>
<ref id="ref38"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Pan</surname> <given-names>L.</given-names></name> <name><surname>Lei</surname> <given-names>W.</given-names></name> <name><surname>Chua</surname> <given-names>T. S.</given-names></name> <name><surname>Kan</surname> <given-names>M. Y.</given-names></name></person-group> (<year>2019</year>). Recent advances in neural question generation. arXiv <comment>[Epub ahead of preprint]</comment>.</citation></ref>
<ref id="ref39"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pugh</surname> <given-names>D.</given-names></name> <name><surname>De Champlain</surname> <given-names>A.</given-names></name> <name><surname>Gierl</surname> <given-names>M. J.</given-names></name> <name><surname>Lai</surname> <given-names>H.</given-names></name> <name><surname>Touchie</surname> <given-names>C.</given-names></name></person-group> (<year>2016</year>). <article-title>Using cognitive models to develop quality multiple-choice questions</article-title>. <source>Med. Teach.</source> <volume>38</volume>, <fpage>838</fpage>&#x2013;<lpage>843</lpage>. doi: <pub-id pub-id-type="doi">10.3109/0142159X.2016.1150989</pub-id>, PMID: <pub-id pub-id-type="pmid">26998566</pub-id></citation></ref>
<ref id="ref40"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pugh</surname> <given-names>D.</given-names></name> <name><surname>De Champlain</surname> <given-names>A.</given-names></name> <name><surname>Gierl</surname> <given-names>M. J.</given-names></name> <name><surname>Lai</surname> <given-names>H.</given-names></name> <name><surname>Touchie</surname> <given-names>C.</given-names></name></person-group> (<year>2020</year>). <article-title>Can automated item generation be used to develop high quality MCQs that assess application of knowledge?</article-title> <source>Res. Pract. Technol. Enhanc. Learn.</source> <volume>15</volume>, <fpage>1</fpage>&#x2013;<lpage>13</lpage>. doi: <pub-id pub-id-type="doi">10.1186/s41039-020-00134-8</pub-id></citation></ref>
<ref id="ref41"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Rajpurkar</surname> <given-names>P.</given-names></name> <name><surname>Zhang</surname> <given-names>J.</given-names></name> <name><surname>Lopyrev</surname> <given-names>K.</given-names></name> <name><surname>Liang</surname> <given-names>P.</given-names></name></person-group> (<year>2016</year>). SQuAD: 100,000+ questions for machine comprehension of text. arXiv <comment>[Epub ahead of preprint]</comment>.</citation></ref>
<ref id="ref42"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Roid</surname> <given-names>G. H.</given-names></name> <name><surname>Haladyna</surname> <given-names>T. M.</given-names></name></person-group> (<year>1978</year>). <article-title>A comparison of objective-based and modified-Bormuth item writing techniques</article-title>. <source>Educ. Psychol. Meas.</source> <volume>38</volume>, <fpage>19</fpage>&#x2013;<lpage>28</lpage>. doi: <pub-id pub-id-type="doi">10.1177/001316447803800104</pub-id></citation></ref>
<ref id="ref43"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Sun</surname> <given-names>X.</given-names></name> <name><surname>Liu</surname> <given-names>J.</given-names></name> <name><surname>Lyu</surname> <given-names>Y.</given-names></name> <name><surname>He</surname> <given-names>W.</given-names></name> <name><surname>Ma</surname> <given-names>Y.</given-names></name> <name><surname>Wang</surname> <given-names>S.</given-names></name></person-group> (<year>2018</year>). Answer-focused and position-aware neural question generation. In <italic>Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</italic>, pp. 3930&#x2013;3939.</citation></ref>
<ref id="ref44"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>von Davier</surname> <given-names>M.</given-names></name></person-group> (<year>2018</year>). <article-title>Automated item generation with recurrent neural networks</article-title>. <source>Psychometrika</source> <volume>83</volume>, <fpage>847</fpage>&#x2013;<lpage>857</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s11336-018-9608-y</pub-id>, PMID: <pub-id pub-id-type="pmid">29532403</pub-id></citation></ref>
<ref id="ref45"><citation citation-type="other"><person-group person-group-type="author"><name><surname>von Davier</surname> <given-names>M.</given-names></name></person-group> (<year>2019</year>). Training optimus prime, M.D.: generating medical certification items by fine-tuning OpenAI&#x2019;s gpt2 transformer model. arXiv <comment>[Epub ahead of preprint]</comment>, pp. 1&#x2013;19.</citation></ref>
<ref id="ref46"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>Z.</given-names></name> <name><surname>Lan</surname> <given-names>A. S.</given-names></name> <name><surname>Nie</surname> <given-names>W.</given-names></name> <name><surname>Waters</surname> <given-names>A. E.</given-names></name> <name><surname>Grimaldi</surname> <given-names>P. J.</given-names></name> <name><surname>Baraniuk</surname> <given-names>R. G.</given-names></name></person-group> (<year>2018</year>). QG-Net: a data-driven question generation model for educational content. In <italic>Proceedings of the Fifth Annual ACM Conference on Learning at Scale</italic>, pp. 1&#x2013;10.</citation></ref>
<ref id="ref47"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>H.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Ma</surname> <given-names>S.</given-names></name> <name><surname>Sun</surname> <given-names>X.</given-names></name> <name><surname>Wang</surname> <given-names>H.</given-names></name> <name><surname>Wang</surname> <given-names>M.</given-names></name></person-group> (<year>2018</year>). A neural question answering model based on semi-structured tables. In <italic>Proceedings of the 27th International Conference on Computational Linguistics</italic>, pp. 1941&#x2013;1951.</citation></ref>
<ref id="ref48"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Zhao</surname> <given-names>Y.</given-names></name> <name><surname>Ni</surname> <given-names>X.</given-names></name> <name><surname>Ding</surname> <given-names>Y.</given-names></name> <name><surname>Ke</surname> <given-names>Q.</given-names></name></person-group> (<year>2018</year>). Paragraph-level neural question generation with maxout pointer and gated self-attention networks. In <italic>Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</italic>, pp. 3901&#x2013;3910.</citation></ref>
<ref id="ref49"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Zhou</surname> <given-names>Q.</given-names></name> <name><surname>Huang</surname> <given-names>D.</given-names></name></person-group> (<year>2019</year>). Towards generating math word problems from equations and topics. In <italic>Proceedings of the 12th International Conference on Natural Language Generation</italic>, pp. 494&#x2013;503.</citation></ref>
<ref id="ref50"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Zhou</surname> <given-names>Q.</given-names></name> <name><surname>Yang</surname> <given-names>N.</given-names></name> <name><surname>Wei</surname> <given-names>F.</given-names></name> <name><surname>Tan</surname> <given-names>C.</given-names></name> <name><surname>Bao</surname> <given-names>H.</given-names></name> <name><surname>Zhou</surname> <given-names>M.</given-names></name></person-group> (<year>2017</year>). Neural question generation from text: a preliminary study. arXiv <comment>[Epub ahead of preprint]</comment>.</citation></ref>
</ref-list>
</back>
</article>