<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Artif. Intell.</journal-id>
<journal-title>Frontiers in Artificial Intelligence</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Artif. Intell.</abbrev-journal-title>
<issn pub-type="epub">2624-8212</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/frai.2024.1406857</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Artificial Intelligence</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Grammar-constrained decoding for structured information extraction with fine-tuned generative models applied to clinical trial abstracts</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Schmidt</surname> <given-names>David M.</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2693180/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/conceptualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/formal-analysis/"/>
<role content-type="https://credit.niso.org/contributor-roles/investigation/"/>
<role content-type="https://credit.niso.org/contributor-roles/methodology/"/>
<role content-type="https://credit.niso.org/contributor-roles/resources/"/>
<role content-type="https://credit.niso.org/contributor-roles/software/"/>
<role content-type="https://credit.niso.org/contributor-roles/validation/"/>
<role content-type="https://credit.niso.org/contributor-roles/visualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Cimiano</surname> <given-names>Philipp</given-names></name>
<uri xlink:href="http://loop.frontiersin.org/people/173982/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/funding-acquisition/"/>
<role content-type="https://credit.niso.org/contributor-roles/project-administration/"/>
<role content-type="https://credit.niso.org/contributor-roles/supervision/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
</contrib-group>
<aff><institution>Center for Cognitive Interaction Technology (CITEC), Technical Faculty, Bielefeld University</institution>, <addr-line>Bielefeld</addr-line>, <country>Germany</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Anisa Rula, University of Brescia, Italy</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Azanzi Jiomekong, University of Yaounde I, Cameroon</p>
<p>Disha Purohit, Technische Informationsbibliothek (TIB), Germany</p></fn>
<corresp id="c001">&#x0002A;Correspondence: David M. Schmidt <email>david.schmidt&#x00040;uni-bielefeld.de</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>07</day>
<month>01</month>
<year>2025</year>
</pub-date>
<pub-date pub-type="collection">
<year>2024</year>
</pub-date>
<volume>7</volume>
<elocation-id>1406857</elocation-id>
<history>
<date date-type="received">
<day>25</day>
<month>03</month>
<year>2024</year>
</date>
<date date-type="accepted">
<day>12</day>
<month>12</month>
<year>2024</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2025 Schmidt and Cimiano.</copyright-statement>
<copyright-year>2025</copyright-year>
<copyright-holder>Schmidt and Cimiano</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<sec>
<title>Background</title>
<p>In the field of structured information extraction, there are typically semantic and syntactic constraints on the output of information extraction (IE) systems. These constraints, however, typically cannot be guaranteed using standard (fine-tuned) encoder-decoder architectures. This has led to the development of constrained decoding approaches which allow constraints to be specified, e.g., in the form of context-free grammars. An open question is to what extent an IE system can be effectively guided by a domain-specific grammar to ensure that the output structures follow the requirements of a certain domain data model.</p>
</sec>
<sec>
<title>Methods</title>
<p>In this work we experimentally investigate the influence of grammar-constrained decoding as well as pointer generators on the performance of a domain-specific information extraction system. For this, we consider fine-tuned encoder-decoder models, Longformer and Flan-T5 in particular, and experimentally investigate whether the addition of grammar-constrained decoding and pointer generators improves information extraction results. Toward this goal, we consider the task of inducing structured representations from abstracts describing clinical trials, relying on the C-TrO ontology to semantically describe the clinical trials and their results. We frame the task as a slot-filling problem in which certain slots of templates need to be filled with token sequences occurring in the input text. We use a dataset comprising 211 annotated clinical trial abstracts about type 2 diabetes and glaucoma for training and evaluation. Our focus is on settings in which the available training data is on the order of a few hundred training examples, which we consider a <italic>low-resource setting</italic>.</p>
</sec>
<sec>
<title>Results</title>
<p>In all our experiments we could demonstrate the positive impact of grammar-constrained decoding, with an increase in <italic>F</italic><sub>1</sub> score of 0.351 pp (absolute score 0.413) and 0.425 pp (absolute score 0.47) for the best-performing models on the type 2 diabetes and glaucoma datasets, respectively. The addition of pointer generators had a detrimental impact on the results, decreasing <italic>F</italic><sub>1</sub> scores by 0.15 pp (absolute score 0.263) and 0.198 pp (absolute score 0.272) for the best-performing pointer generator models on the type 2 diabetes and glaucoma datasets, respectively.</p>
</sec>
<sec>
<title>Conclusion</title>
<p>The experimental results indicate that encoder-decoder models used for structure prediction in information extraction tasks in low-resource settings clearly benefit from grammar-constrained decoding guiding the output generation. In contrast, the evaluated pointer generator models decreased performance drastically in some cases. Moreover, the performance of the pointer models appears to depend both on the base model used and on the function used for aggregating the attention values. How the size of large language models affects the performance benefit of grammar-constrained decoding remains to be investigated more systematically in future work.</p>
</sec></abstract>
<kwd-group>
<kwd>grammar-constrained decoding</kwd>
<kwd>structured information extraction</kwd>
<kwd>clinical trials</kwd>
<kwd>deep learning</kwd>
<kwd>generative large language models</kwd>
<kwd>PICO</kwd>
<kwd>evidence-based medicine</kwd>
</kwd-group>
<contract-num rid="cn001">NW21-059A (SAIL)</contract-num>
<contract-sponsor id="cn001">Ministerium f&#x000FC;r Kultur und Wissenschaft des Landes Nordrhein-Westfalen<named-content content-type="fundref-id">10.13039/501100014690</named-content></contract-sponsor>
<counts>
<fig-count count="2"/>
<table-count count="4"/>
<equation-count count="10"/>
<ref-count count="50"/>
<page-count count="14"/>
<word-count count="11135"/>
</counts>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Natural Language Processing</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1 Introduction</title>
<p>The increasing success of large language models on a wide range of tasks together with their wide availability has inspired a number of approaches in structured information extraction as well as other tasks requiring a structured prediction, e.g., event extraction (Lu et al., <xref ref-type="bibr" rid="B23">2021</xref>), syntactic and semantic parsing (Roy et al., <xref ref-type="bibr" rid="B31">2024</xref>), symbolic expression generation like SMT formulas (Pan et al., <xref ref-type="bibr" rid="B25">2023</xref>; Sun et al., <xref ref-type="bibr" rid="B42">2023</xref>) or SQL query generation from text (Scholak et al., <xref ref-type="bibr" rid="B36">2021</xref>; Lin et al., <xref ref-type="bibr" rid="B22">2020</xref>).</p>
<p>By structured output, we refer to output structures that go beyond a linear sequence, that is, a tree, graph, or some other kind of nested structure. Many application areas have strict requirements on the structure of the corresponding output, and it is key to ensure that the output is valid w.r.t. some pre-defined data model. However, validity w.r.t. those constraints on the output sequence typically cannot be guaranteed using vanilla generative large language models (Sun et al., <xref ref-type="bibr" rid="B42">2023</xref>; Roy et al., <xref ref-type="bibr" rid="B31">2024</xref>). This is because the validity of tokens w.r.t. those constraints is not reflected in standard unconstrained greedy or beam search decoding. Typically, the output is determined by how likely the model considers specific tokens to be in specific positions. These predictions, however, can sometimes be wrong or violate output constraints, in the worst case rendering the entire output invalid. This is especially relevant when the output needs to be parsed or executed, as when generating code or formulas, e.g., SMT formulas (Pan et al., <xref ref-type="bibr" rid="B25">2023</xref>; Sun et al., <xref ref-type="bibr" rid="B42">2023</xref>) or SQL queries (Scholak et al., <xref ref-type="bibr" rid="B36">2021</xref>; Lin et al., <xref ref-type="bibr" rid="B22">2020</xref>). In those domains, &#x0201C;almost correct&#x0201D; is usually equivalent to invalid and wrong, as such outputs are commonly rejected outright by parsers or execution engines after generation. Related work (Sun et al., <xref ref-type="bibr" rid="B42">2023</xref>; Roy et al., <xref ref-type="bibr" rid="B31">2024</xref>) also shows that this is an actual problem and that validity rates even of the most recent language models are far from perfect for applications where it is important to strictly follow a certain grammar.</p>
<p>It is thus an important goal to ensure that the output of a model follows a certain semantic and syntactic structure. As one solution, we can consider grammar-constrained decoding approaches that ensure that the decoded structure follows the (production) rules of a given (semantic) grammar. In this paper, we thus experimentally investigate the impact of a grammar-constrained decoding approach on the well-formedness and correctness of the output of structured information extraction models. We investigate this impact in the context of fine-tuned large language models that are adapted in a supervised setting by optimizing their parameters on a given labeled dataset. Further, our focus is on what we call <italic>low-resource settings</italic>, by which we denote settings in which at most 500 training examples are available (compare Roy et al., <xref ref-type="bibr" rid="B31">2024</xref>) and in which the model has fewer than 500 million parameters. The first restriction matches typical information extraction settings, which rely on human-labeled text examples that are costly to obtain. The second restriction corresponds to situations where models are trained on standard hardware.</p>
<p>The low-resource setting we consider in this paper is of practical relevance. First, the training of large models requires substantial energy resources and generates a corresponding carbon footprint (Strubell et al., <xref ref-type="bibr" rid="B40">2019</xref>; George et al., <xref ref-type="bibr" rid="B15">2023</xref>), so reducing energy consumption through models with a smaller footprint is an important goal. Second, in many settings, pre-trained models are used in zero-shot or few-shot fashion but are not fine-tuned to a specific problem, due to the cost and resources that fine-tuning requires. Nevertheless, in order to achieve domain-adapted performance, it is important to optimize models on the actual target task, so fine-tuning remains an important paradigm. Yet, when fine-tuning models on a particular task, it remains a major challenge to manually annotate thousands of examples. In many cases, the resources available for annotating data for research tasks are limited. This is especially true in the biomedical domain considered in this paper, where the annotation of texts requires scarce domain expertise, so datasets with several thousand annotated documents are rare. For instance, the well-known Genia corpus (Kim et al., <xref ref-type="bibr" rid="B19">2003</xref>) features 1,999 annotated abstracts. Biomedical named-entity recognition and entity linking corpora like MedMentions (Mohan and Li, <xref ref-type="bibr" rid="B24">2019</xref>) contain around 4,000 abstracts, and biomedical text summarization datasets like MeQSum (Abacha and Demner-Fushman, <xref ref-type="bibr" rid="B1">2019</xref>) comprise 1,000 summarized health questions. Additionally, the BLUE benchmark (Peng et al., <xref ref-type="bibr" rid="B27">2019</xref>) contains corpora of different sizes, ranging from 64 to 11,232 examples; its second-largest dataset, however, already contains considerably fewer examples (5,203). All in all, biomedical datasets in general, and more specialized clinical datasets in particular, tend to be comparatively small.</p>
<p>Taken together, the considered low-resource setting assuming models to be in the order of hundreds of millions of parameters and hundreds of training examples is of practical value and relevance. In this setting, we empirically investigate the impact of a domain-specific grammar that is used at decoding time to ensure that output structures meet well-formedness criteria.</p>
<p>In the slot-filling information extraction paradigm we consider, where a template structure must be filled with slots extracted from the text, an important constraint is that the elements of the slots actually come from the original text. In fact, when using generative models, there is the risk that the model &#x0201C;hallucinates&#x0201D; slot fillers that were never mentioned in the text. Thus, in addition to considering decoding following a domain-specific grammar, we also consider the impact of pointer generators that additionally use the attention to the input tokens at each output step and thus allow a model to &#x0201C;copy&#x0201D; from the input more directly.</p>
<p>Given this motivation, in this article, we pose three research questions:</p>
<list list-type="simple">
<list-item><p><bold>RQ1</bold>. Impact of grammar-constrained decoding: How does grammar-constrained decoding (<monospace>GCD</monospace>) affect the performance of fine-tuned large language models in low-resource settings compared to greedy decoding (<monospace>noGCD</monospace>) w.r.t. structured information extraction tasks?</p></list-item>
<list-item><p><bold>RQ2</bold>. Combining grammar-constrained decoding with pointer generators: Does the combination of grammar-constrained decoding with pointer generators improve results?</p></list-item>
<list-item><p><bold>RQ3</bold>. Performance of different attention aggregation strategies: Which attention aggregation method (<monospace>ptr-sum</monospace>/<monospace>ptr-max</monospace>) works best for pointer generators combined with grammar-constrained decoding?</p></list-item>
</list>
<p>The specific application domain we consider in our paper is structured information extraction in the clinical trial domain. In particular, we focus on the extraction of PICO-related information from abstracts describing the results of randomized clinical trials (RCTs). Hereby, PICO refers to Patient, Intervention, Comparison and Outcomes, representing the key concepts relevant in describing the results of a randomized clinical trial (Schardt et al., <xref ref-type="bibr" rid="B33">2007</xref>; Richardson et al., <xref ref-type="bibr" rid="B30">1995</xref>).</p>
<p>Taken together, to the best of our knowledge, our work contributes novel insights w.r.t. the benefits and drawbacks of using grammar-constrained decoding and pointer generators with fine-tuned generative large language models in low-resource settings. Our paper features the following contributions:</p>
<list list-type="bullet">
<list-item><p>We show the positive impact of grammar-constrained decoding on generative LLMs fine-tuned for structured information extraction in a low-resource setting, improving <italic>F</italic><sub>1</sub> scores from 0.062 to 0.413 and from 0.102 to 0.47 for type 2 diabetes and glaucoma datasets, respectively.</p></list-item>
<list-item><p>We show that adding pointer generators on top of grammar-constrained decoding has a negative impact on the performance, decreasing <italic>F</italic><sub>1</sub> scores from 0.413 to 0.263 and from 0.47 to 0.292 for type 2 diabetes and glaucoma datasets, respectively.</p></list-item>
<list-item><p>We investigate the influence of different attention aggregation strategies (determining what to do with the attention values if a token occurs multiple times in the input) on the performance of pointer generators, considering the sum and the maximum function in particular. We show that the choice depends on the base model, as the maximum function generates the overall best results in this category, but only if paired with the <monospace>led-base-16384</monospace> model, whereas it yields the worst results when used together with the <monospace>flan-t5-base</monospace>. In contrast, the sum function achieves comparable although not the best scores for both tested base models.</p></list-item>
<list-item><p>An ablation experiment with a larger model analyzes the influence of model size on the benefits reached via grammar-constrained decoding, showing that the performance improvements persist or even increase when using larger models, suggesting that the model size alone does not solve the problem of LLMs not always sticking to the desired output specification.</p></list-item>
</list>
<p>In the following, we first discuss how this paper is embedded into related work (Section 2) before presenting the methods of this paper and the modifications applied to the base models (Section 3). Afterwards, the conducted experiments and the datasets used are described in detail in Section 4. The results of those experiments are reported in Section 5 and discussed w.r.t. the research questions in Section 6. Finally, we conclude in Section 7.</p>
</sec>
<sec id="s2">
<title>2 Related work</title>
<p>Many natural language processing tasks require structured output with a clearly defined syntax, including event extraction (Lu et al., <xref ref-type="bibr" rid="B23">2021</xref>), syntactic and semantic parsing (Roy et al., <xref ref-type="bibr" rid="B31">2024</xref>), symbolic expression generation such as SMT formulas (Pan et al., <xref ref-type="bibr" rid="B25">2023</xref>; Sun et al., <xref ref-type="bibr" rid="B42">2023</xref>), or generating SQL queries (Scholak et al., <xref ref-type="bibr" rid="B36">2021</xref>; Lin et al., <xref ref-type="bibr" rid="B22">2020</xref>). Moreover, with the exception of Lin et al. (<xref ref-type="bibr" rid="B22">2020</xref>), pointer generators are rarely evaluated in related work, whereas they are a main focus of ours.</p>
<p>Because of the typically strict constraints on the output structure, various types of constrained decoding algorithms have evolved over the years, e.g., pruning invalid tokens in beam search (Anderson et al., <xref ref-type="bibr" rid="B6">2017</xref>), incremental parsing techniques (Scholak et al., <xref ref-type="bibr" rid="B36">2021</xref>), or trie-based constraints (Cao et al., <xref ref-type="bibr" rid="B8">2021</xref>; Lu et al., <xref ref-type="bibr" rid="B23">2021</xref>). These kinds of constraints, however, differ in multiple ways from the flexible, generalized grammar-based approach pursued in this paper. For example, the trie-based constrained decoding for event extraction proposed by Lu et al. (<xref ref-type="bibr" rid="B23">2021</xref>) uses generative models to produce trie-like structures that capture the event structures present in a given text. The structural properties are, however, not generalized to the level where constraints on the desired structure can be flexibly formulated as grammar rules. Nevertheless, the approach and its trie-like structures closely resemble parse trees, such that the approach is a specific instance of the more general approach that we examine in this paper, which relies on context-free grammars to guide decoding.</p>
<p>Along these lines, recent work by Geng et al. (<xref ref-type="bibr" rid="B14">2023</xref>) has examined the impact of using a context-free grammar and Grammatical Framework (Ranta, <xref ref-type="bibr" rid="B29">2019</xref>) for constrained decoding, aiming to provide a unified approach to address various kinds of structures required in different domains and tasks. While they have focused on pre-trained models, our work specifically focuses on investigating the impact of grammar-constrained decoding in fine-tuning settings.</p>
<p>The effect of constrained decoding has been evaluated with respect to fine-tuned models on tasks other than information extraction, namely the generation of SQL queries (Scholak et al., <xref ref-type="bibr" rid="B36">2021</xref>). Stengel-Eskin et al. (<xref ref-type="bibr" rid="B39">2024</xref>) have presented an approach to convert ambiguous natural language descriptions into logic formulas and code, but in zero- and few-shot settings rather than the fine-tuning setting considered in our work.</p>
<p>Pan et al. (<xref ref-type="bibr" rid="B25">2023</xref>) do not use constrained decoding; instead, they explore the effect of self-refinement, triggered by symbolic reasoner parsing errors, on the validity of the generated logic formulas. Roy et al. (<xref ref-type="bibr" rid="B31">2024</xref>) propose a benchmark consisting of syntactic and semantic parsing tasks and evaluate it on a range of models, both fine-tuned and non-fine-tuned, with and without grammar constraints. In contrast to our work, neither the clinical domain nor structured information extraction is the focus of their evaluation, nor are pointer generators considered in detail as a supporting mechanism.</p>
<p>In conclusion, our work follows the lines of Geng et al. (<xref ref-type="bibr" rid="B14">2023</xref>), Lu et al. (<xref ref-type="bibr" rid="B23">2021</xref>), and Roy et al. (<xref ref-type="bibr" rid="B31">2024</xref>) by testing the benefit of grammar-constrained decoding in NLP settings. In contrast to previous work, we investigate the impact of grammar-constrained decoding in a fine-tuning setting and in particular on the task of structured information extraction, focusing on low-resource settings. The impact of grammar-constrained decoding has not been investigated in low-resource settings before.</p>
<p>With respect to the biomedical and clinical domain, various approaches have been proposed for tasks like relation extraction (Jiang and Kavuluru, <xref ref-type="bibr" rid="B18">2023</xref>; Kim and Meystre, <xref ref-type="bibr" rid="B20">2020</xref>), question answering (Wang et al., <xref ref-type="bibr" rid="B45">2020</xref>), named entity recognition (Stylianou et al., <xref ref-type="bibr" rid="B41">2021</xref>) or event extraction (Wang et al., <xref ref-type="bibr" rid="B45">2020</xref>; Ramponi et al., <xref ref-type="bibr" rid="B28">2020</xref>; Zhu and Zheng, <xref ref-type="bibr" rid="B50">2020</xref>; Huang et al., <xref ref-type="bibr" rid="B17">2020</xref>; Trieu et al., <xref ref-type="bibr" rid="B43">2020</xref>). Some approaches and models even aim to detect and extract information from (randomized) clinical trial abstracts, e.g., slot fillers (Papanikolaou et al., <xref ref-type="bibr" rid="B26">2022</xref>) or clinical trial outcomes (Abaho et al., <xref ref-type="bibr" rid="B4">2022b</xref>,<xref ref-type="bibr" rid="B3">a</xref>, <xref ref-type="bibr" rid="B2">2021</xref>; Ganguly et al., <xref ref-type="bibr" rid="B13">2021</xref>). Taken together, all of the listed examples either deal with a different task, do not work in a sequence-to-sequence manner as our approach does, or lack the nested structure and dependencies of Patient, Intervention, Comparison, Outcomes (PICO) templates and slots that are dealt with in this paper.</p>
<p>Considering the latter, this work represents randomized controlled trials (RCTs) in a structured way using the already mentioned Patient, Intervention, Comparison, Outcomes (PICO) framework (Schardt et al., <xref ref-type="bibr" rid="B33">2007</xref>; Richardson et al., <xref ref-type="bibr" rid="B30">1995</xref>). This framework consists of templates with corresponding slots, which can be filled either with textual data or, again, with template instances. Most related work, such as Schmidt et al. (<xref ref-type="bibr" rid="B35">2020</xref>) and Zhang et al. (<xref ref-type="bibr" rid="B49">2020</xref>), treats PICO elements as flat classes, i.e., parts of sentences that are simply labeled, e.g., P or I. In contrast, our approach treats PICO elements as nested structures in order to do justice to the complex information presented in those elements. In particular, we structure the information by means of templates with slots that have to be filled with some portion of text or with other template instances, thus creating a nested, structured representation of the PICO information. Furthermore, there are some approaches (Whitton and Hunter, <xref ref-type="bibr" rid="B46">2023</xref>; Dhrangadhariya et al., <xref ref-type="bibr" rid="B12">2021</xref>) that also aim to generate more structured representations of the PICO information in RCT abstracts, but they differ in architecture and decoding approach, and the structures they generate are still less complex than the recursive template structure we use in this paper.</p>
</sec>
<sec sec-type="methods" id="s3">
<title>3 Methods</title>
<p>In this section, we describe how we approach the structured information extraction task and describe two aspects that we add to the &#x0201C;raw&#x0201D; sequence-to-sequence model, namely grammar-constrained decoding and pointer generator-like behavior. This is also illustrated in <xref ref-type="fig" rid="F1">Figure 1</xref>.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Illustration of the baseline model as well as the two adjustments added to that baseline, grammar-constrained decoding and pointer generator-like behavior. Words in boxes represent single tokens, numbers below those boxes symbolize outputs from the decoder, where higher values stand for a higher probability that this is the best next token as estimated by the model. For greedy decoding, the token with the highest value is chosen. For <monospace>GCD</monospace>, a filter is applied before, visualized as gray, crossed-out boxes for tokens that are filtered out. Red boxes show the selected token. <bold>(A)</bold> Greedy decoding (baseline, <monospace>basic</monospace>). <bold>(B)</bold> Grammar-constrained decoding (<monospace>GCD</monospace>). <bold>(C)</bold> Pointer generators &#x0002B; grammar-constrained decoding (<monospace>ptr</monospace>).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-07-1406857-g0001.tif"/>
</fig>
<sec>
<title>3.1 Task</title>
<p>In this paper, we tackle the task of structured information extraction from RCT abstracts. We do this in a sequence-to-sequence manner by providing an abstract as input and expecting structured results in terms of the C-TrO ontology (Sanchez-Graillet et al., <xref ref-type="bibr" rid="B32">2019</xref>) as output. The information extraction task is framed as a slot-filling approach in this paper. In such a task, the templates, i.e., collections of slots, defined by the C-TrO ontology need to be filled using text from an RCT abstract. A slot can be filled with one of two types of slot fillers, depending on which slot of a template is filled: text from the RCT abstract or a (nested) instance of another template. The grammar used to represent and linearize the different parts of the C-TrO ontology is the one from Witte et al. (<xref ref-type="bibr" rid="B48">2024</xref>). An example is illustrated in <xref ref-type="fig" rid="F2">Figure 2</xref>.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Illustration of a linearized intervention template instance (Witte et al., <xref ref-type="bibr" rid="B48">2024</xref>). The nested template instance as shown in <bold>(A)</bold> is linearized to a flat string as shown in <bold>(B)</bold>, adding start and end tokens for both textual and complex, nested slots.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-07-1406857-g0002.tif"/>
</fig>
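<p>To make the linearization scheme concrete, the following is a minimal sketch of recursively flattening a nested template instance into a flat string with start and end tokens, in the spirit of Figure 2. The template and slot names used here (<monospace>Intervention</monospace>, <monospace>hasMedication</monospace>, etc.) and the exact bracket token format are illustrative assumptions, not the actual C-TrO vocabulary or the precise linearization of Witte et al. (2024).</p>

```python
# Sketch: linearize a nested template instance (a slot-filling structure)
# into a flat token string with start/end markers for slots and templates.
# Template and slot names below are hypothetical, for illustration only.

def linearize(template_name, slots):
    """Recursively flatten a template instance into a single string."""
    parts = [f"[{template_name}"]
    for slot_name, filler in slots.items():
        parts.append(f"[{slot_name}")
        if isinstance(filler, tuple):  # nested template instance
            parts.append(linearize(*filler))
        else:                          # textual filler copied from the abstract
            parts.append(filler)
        parts.append(f"{slot_name}]")
    parts.append(f"{template_name}]")
    return " ".join(parts)

instance = ("Intervention", {
    "hasMedication": ("Medication", {"hasDrug": "metformin"}),
    "hasFrequency": "twice daily",
})
flat = linearize(*instance)
```
<p>A sequence-to-sequence model is then trained to emit such flat strings, which can be parsed back into nested template instances.</p>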
</sec>
<sec>
<title>3.2 Baseline</title>
<p>As a baseline, the dataset comprising pairs of an abstract and the corresponding linearized C-TrO ontology representation is used to fine-tune a sequence-to-sequence model for the specific task. For this purpose, an encoder-decoder model is fine-tuned &#x0201C;as-is&#x0201D;, without any of the modifications presented in the remainder of this section. The baseline will also be called <monospace>basic</monospace> in the following.</p>
<p>In order to formally define the decoding methods as well as the pointer generator-like behavior later, we first have to define some general notation for vectors and matrices and how to access their values. This notation is inspired by the way NumPy (Harris et al., <xref ref-type="bibr" rid="B16">2020</xref>) arrays are accessed.</p>
<p>Let <inline-formula><mml:math id="M1"><mml:mover accent="true"><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mo>&#x02192;</mml:mo></mml:mover><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> be a <italic>d</italic>-dimensional vector, the elements of which are accessed using square brackets, i.e., <inline-formula><mml:math id="M2"><mml:mover accent="true"><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mo>&#x02192;</mml:mo></mml:mover><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula> with <italic>i</italic> &#x02208; &#x02115;, 0 &#x02264; <italic>i</italic> &#x0003C; <italic>d</italic>, retrieves the <italic>i</italic>-th element of <inline-formula><mml:math id="M3"><mml:mover accent="true"><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mo>&#x02192;</mml:mo></mml:mover></mml:math></inline-formula>. Similarly, let <inline-formula><mml:math id="M4"><mml:mi>M</mml:mi><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x000D7;</mml:mo><mml:mo>&#x022EF;</mml:mo><mml:mo>&#x000D7;</mml:mo><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> be an <italic>n</italic>-dimensional matrix with <italic>d</italic><sub><italic>i</italic></sub> &#x02208; &#x02115; values in each dimension <italic>i</italic>, 1 &#x02264; <italic>i</italic> &#x02264; <italic>n</italic>. 
To access a single element of the matrix, an index for every dimension of <italic>M</italic> has to be given via bracket notation: <italic>M</italic>[<italic>j</italic><sub>1</sub>, &#x02026;, <italic>j</italic><sub><italic>n</italic></sub>] with 0 &#x02264; <italic>j</italic><sub><italic>k</italic></sub> &#x02208; &#x02115; &#x0003C; <italic>d</italic><sub><italic>k</italic></sub> for every 1 &#x02264; <italic>k</italic> &#x02208; &#x02115; &#x02264; <italic>n</italic>. To access larger parts of a matrix, : can be used instead of an index to indicate that all values in that dimension are selected rather than a single element. With this notation in mind, we can now define the decoding algorithms used in the following.</p>
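<p>The notation mirrors NumPy indexing directly; as a minimal illustration:</p>

```python
import numpy as np

# Vector v in R^d: single elements are accessed with square brackets.
v = np.array([10.0, 20.0, 30.0, 40.0])
elem = v[2]                        # the element at index 2 -> 30.0

# Matrix M in R^{d1 x d2}: one index per dimension selects a single element.
M = np.arange(6.0).reshape(2, 3)
single = M[1, 0]                   # row 1, column 0 -> 3.0

# ':' selects all values of a dimension instead of a single index.
row = M[0, :]                      # the whole first row -> [0.0, 1.0, 2.0]
```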
<p>For the baseline, we use unconstrained greedy decoding, called <monospace>noGCD</monospace>, which always chooses the token with the highest value in the final distribution, regardless of any constraints. Let <italic>dist</italic> be the final token distribution produced by the model; this means:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M5"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>n</mml:mi><mml:mi>e</mml:mi><mml:mi>x</mml:mi><mml:mi>t</mml:mi><mml:mi>T</mml:mi><mml:mi>o</mml:mi><mml:mi>k</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo class="qopname">argmax</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
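<p>A minimal sketch of this unconstrained greedy loop; the <monospace>next_dist</monospace> callable is a hypothetical stand-in for the model producing the final token distribution:</p>

```python
import numpy as np

def greedy_decode(next_dist, eos_id, max_len=32):
    # Unconstrained greedy decoding (noGCD): in every step, pick the token
    # with the highest value in the final distribution (Equation 1).
    seq = []
    for _ in range(max_len):
        token = int(np.argmax(next_dist(seq)))
        seq.append(token)
        if token == eos_id:
            break
    return seq
```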
</sec>
<sec>
<title>3.3 Grammar-constrained decoding</title>
<p>The grammar-constrained decoding approach ensures that the production rules of a given context-free grammar are enforced during generation. For decoding, we thus use a grammar-constrained decoding algorithm, denoted as <monospace>GCD</monospace>, which masks the vocabulary in every step based on the possible next tokens determined by the given context-free grammar.</p>
<p>Concretely, this means the following: Let <italic>accepted</italic> be the set of tokens that are valid according to the context-free grammar used by the decoding algorithm, represented by their token ids, i.e., their indices in the tokenizer vocabulary &#x1D54D;. Then the vocabulary or distribution mask is defined as follows:</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M6"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>s</mml:mi><mml:mi>k</mml:mi><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x0211D;</mml:mi><mml:mo>&#x0222A;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mo>-</mml:mo><mml:mi>&#x0221E;</mml:mi></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:mo>&#x1D54D;</mml:mo></mml:mrow><mml:mo>|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>s</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable style="text-align:axis;" equalrows="false" columnlines="none" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mtext class="textrm" mathvariant="normal">if&#x000A0;</mml:mtext><mml:mi>i</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>c</mml:mi><mml:mi>e</mml:mi><mml:mi>p</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>-</mml:mo><mml:mi>&#x0221E;</mml:mi></mml:mtd><mml:mtd><mml:mtext class="textrm" mathvariant="normal">otherwise</mml:mtext></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The next token is then determined using the masked token distribution:</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M7"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>n</mml:mi><mml:mi>e</mml:mi><mml:mi>x</mml:mi><mml:mi>t</mml:mi><mml:mi>T</mml:mi><mml:mi>o</mml:mi><mml:mi>k</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo class="qopname">argmax</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>s</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
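<p>A sketch of one decoding step combining Equations (2) and (3), assuming <monospace>accepted</monospace> is given as a set of valid token ids:</p>

```python
import numpy as np

def gcd_step(dist, accepted, vocab_size):
    # Equation (2): 0 for grammatically valid token ids, -inf otherwise.
    mask = np.full(vocab_size, -np.inf)
    mask[list(accepted)] = 0.0
    # Equation (3): greedy choice over the masked distribution, so invalid
    # tokens can never be selected.
    return int(np.argmax(dist + mask))
```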
<p>In practice, this is implemented using the Lark parsing toolkit (Shinan, <xref ref-type="bibr" rid="B38">2024</xref>) together with the core grammar, which can be found in the <xref ref-type="supplementary-material" rid="SM1">Supplementary material</xref>. The decoding phase consists of two parts. First, the model output sequence is generated using grammar-constrained decoding as described above. Second, the generated output sequence is parsed, yielding a parse tree which is then further processed to build template instances that can be evaluated.</p>
<p>The first decoding phase is implemented using the Lark interactive parser, which is available in the LALR parsing mode (DeRemer, <xref ref-type="bibr" rid="B11">1969</xref>). The possible next tokens corresponding to the look-ahead become accessible and can be used to create the token distribution mask described above. As no backtracking mechanism is currently implemented in the decoding algorithm, the grammar needs to be defined such that the correct token can be chosen unambiguously in a single step, since reverting a previous decision based on later tokens is currently not possible. After decoding, the regular parsing mode can then be used to create an actual parse tree from the generated output.</p>
<p>In our structured information extraction task, and to keep the decoding process as efficient as possible, the first decoding phase using the interactive parser features the core grammar with a simple definition of the free-text non-terminal <monospace>POINT</monospace>. This definition only avoids matches of <monospace>[start:</monospace> and <monospace>[end:</monospace> in order to prevent errors related to the special tokens indicating boundaries of slots and templates in the linearization. As this decoding process can, by construction, only generate valid tokens in each step, no further validation is necessary here.</p>
<p>In contrast, the second phase, which can also be used separately to parse stored linearizations and reconstruct the corresponding template instances, uses a more restrictive definition of the free-text non-terminal <monospace>POINT</monospace>. In this phase, the definition of <monospace>POINT</monospace> is constructed from the tokenizer vocabulary in such a way that greedy matching is applied when there are multiple possible tokenizations for a string. Considering that tokenizer vocabularies typically contain thousands of tokens, this increases the size of the resulting grammar substantially, but in exchange ensures a meaningful parse tree even for different tokenizers.</p>
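<p>One way to obtain such greedy matching is to emit the <monospace>POINT</monospace> terminal as a longest-first regular-expression alternation over the vocabulary; the following sketch is a hypothetical illustration of this idea, not the exact grammar-construction code used here:</p>

```python
import re

def build_point_terminal(vocab_tokens):
    # Longest-first alternation approximates greedy matching: when several
    # vocabulary entries match at a position, the longest one wins.
    parts = sorted((t for t in vocab_tokens if t), key=len, reverse=True)
    return "|".join(re.escape(t) for t in parts)
```

<p>For example, with the tokens <monospace>a</monospace>, <monospace>b</monospace>, and <monospace>ab</monospace>, the resulting pattern matches <monospace>ab</monospace> as a single token rather than as <monospace>a</monospace> followed by <monospace>b</monospace>.</p>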
</sec>
<sec>
<title>3.4 Pointer generators</title>
<p>The second adjustment made to the baseline method is adding pointer generator-like behavior. In the following, this category of models is called <monospace>ptr</monospace>, or more precisely <monospace>ptr-max</monospace> when using the maximum function and <monospace>ptr-sum</monospace> when using the sum function to aggregate the attention values of a token that occurs multiple times in the input. Adding pointer generator-like behavior is intended to help the model copy tokens from the input, which is an important part of the considered information extraction task.</p>
<p>More concretely, we add a linear layer followed by a sigmoid activation as well as a slightly different method to calculate the final distribution over the token vocabulary. The method works in a similar fashion to the pointer generator described by See et al. (<xref ref-type="bibr" rid="B37">2017</xref>) and Deaton et al. (<xref ref-type="bibr" rid="B10">2019</xref>), but relies on a different architecture, described in more detail below.</p>
<p>For this purpose, we define the calculation of the token distribution for <monospace>ptr</monospace> models as follows: Let <italic>l</italic> be the latest prediction logits in a generation step (omitting the dimension for batching) of the generative model and <italic>dist</italic><sub><italic>gen</italic></sub> &#x0003D; <italic>softmax</italic>(<italic>l</italic>) the corresponding classical generative token distribution. Moreover, let <italic>p</italic><sub><italic>gen</italic></sub> &#x02208; [0, 1] be the output of an additional linear layer with a sigmoid activation applied afterwards. <italic>p</italic><sub><italic>gen</italic></sub> is the weight with which the classical token distribution contributes to the final token distribution; the pointer distribution is weighted by 1&#x02212;<italic>p</italic><sub><italic>gen</italic></sub> instead.</p>
<p>Now, let <italic>C</italic> &#x02208; &#x0211D;<sup><italic>heads</italic>&#x000D7;<italic>inputTokens</italic></sup> be the normalized (i.e., the sum of all values of a head is 1) values of the last cross-attention layer of the latest generation step with <italic>heads</italic> attention heads and <italic>inputTokens</italic> input tokens. We then first calculate the mean over all attention heads, i.e., <inline-formula><mml:math id="M8"><mml:mover accent="true"><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo>&#x02192;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>d</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:mfrac><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>d</mml:mi><mml:mi>s</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover><mml:mi>C</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mo>:</mml:mo></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula>.</p>
<p>Unlike <italic>dist</italic><sub><italic>gen</italic></sub>, however, <inline-formula><mml:math id="M9"><mml:mover accent="true"><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo>&#x02192;</mml:mo></mml:mover></mml:math></inline-formula> is a distribution over the input tokens rather than over the tokenizer vocabulary &#x1D54D;. To transform <inline-formula><mml:math id="M10"><mml:mover accent="true"><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo>&#x02192;</mml:mo></mml:mover></mml:math></inline-formula> into a suitable distribution over &#x1D54D;, we have to decide, on the one hand, what happens when a token receives multiple attention values because it occurs multiple times in the input and, on the other hand, how to aggregate those values at the position of the corresponding token in the resulting distribution.</p>
<p>In this work, we evaluated both the maximum (<monospace>ptr-max</monospace>) and the sum operation (<monospace>ptr-sum</monospace>). In the case of <monospace>ptr-max</monospace>, the maximum attention value of a given token <inline-formula><mml:math id="M11"><mml:mi>t</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mover accent="true"><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo>&#x02192;</mml:mo></mml:mover></mml:math></inline-formula> is used in the pointer distribution at the corresponding position (see <xref ref-type="disp-formula" rid="E4">Equation 4</xref>). Correspondingly, in the case of <monospace>ptr-sum</monospace>, the sum of all attention values of a token <italic>t</italic> is used in the pointer distribution (see <xref ref-type="disp-formula" rid="E5">Equation 5</xref>).</p>
<p>The resulting pointer token distribution <italic>dist</italic><sub><italic>ptr</italic></sub> is therefore determined as follows:</p>
<disp-formula id="E4"><label>(4)</label><mml:math id="M12"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>:</mml:mo><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo class="qopname">max</mml:mo></mml:mrow><mml:mrow><mml:mn>0</mml:mn><mml:mo>&#x02264;</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x0003C;</mml:mo><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>p</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi><mml:mi>T</mml:mi><mml:mi>o</mml:mi><mml:mi>k</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo class="qopname">&#x02192;</mml:mo></mml:mover><mml:mtext>&#x000A0;</mml:mtext><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x02223;</mml:mo><mml:mover class="overrightarrow"><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>p</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mo 
class="qopname">&#x020D7;</mml:mo></mml:mover><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>i</mml:mi></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:mo>&#x1D54D;</mml:mo></mml:mrow><mml:mo>|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mn>0</mml:mn><mml:mo>&#x02264;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>&#x02115;</mml:mi><mml:mo>&#x0003C;</mml:mo><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:mo>&#x1D54D;</mml:mo></mml:mrow><mml:mo>|</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E5"><label>(5)</label><mml:math id="M14"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>u</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>:</mml:mo><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mn>0</mml:mn><mml:mo>&#x02264;</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x0003C;</mml:mo><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>p</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi><mml:mi>T</mml:mi><mml:mi>o</mml:mi><mml:mi>k</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo>&#x02192;</mml:mo></mml:mover><mml:mtext>&#x000A0;</mml:mtext><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x02223;</mml:mo><mml:mover 
class="overrightarrow"><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>p</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mo>&#x020D7;</mml:mo></mml:mover><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>i</mml:mi></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:mo>&#x1D54D;</mml:mo></mml:mrow><mml:mo>|</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mn>0</mml:mn><mml:mo>&#x02264;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>&#x02115;</mml:mi><mml:mo>&#x0003C;</mml:mo><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:mo>&#x1D54D;</mml:mo></mml:mrow><mml:mo>|</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>with <inline-formula><mml:math id="M16"><mml:mover class="overrightarrow"><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>p</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mo>&#x020D7;</mml:mo></mml:mover><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x02115;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>p</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi><mml:mi>T</mml:mi><mml:mi>o</mml:mi><mml:mi>k</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> being the vector of input token ids.</p>
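<p>Equations (4) and (5) can be sketched as a scatter-aggregate over the input token ids; a simplified NumPy version, omitting batching:</p>

```python
import numpy as np

def pointer_distribution(C, input_ids, vocab_size, mode="sum"):
    # C: (heads, inputTokens) normalized cross-attention of the last layer.
    # First take the mean over all attention heads, then scatter the
    # per-position values onto the vocabulary positions of the input tokens.
    c = C.mean(axis=0)
    dist_ptr = np.zeros(vocab_size)
    for j, tok in enumerate(input_ids):
        if mode == "max":                      # Equation (4)
            dist_ptr[tok] = max(dist_ptr[tok], c[j])
        else:                                  # Equation (5)
            dist_ptr[tok] += c[j]
    return dist_ptr
```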
<p>The final token distribution used for decoding in case of the <monospace>ptr-max</monospace> model is then defined as follows:</p>
<disp-formula id="E6"><label>(6)</label><mml:math id="M17"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:msubsup><mml:mrow><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msubsup></mml:mtd><mml:mtd><mml:msubsup><mml:mrow><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:mo>&#x1D54D;</mml:mo></mml:mrow><mml:mo>|</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Similarly, the final token distribution for <monospace>ptr-sum</monospace> is:</p>
<disp-formula id="E7"><label>(7)</label><mml:math id="M18"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>u</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:msubsup><mml:mrow><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>u</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msubsup></mml:mtd><mml:mtd><mml:msubsup><mml:mrow><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>u</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:mo>&#x1D54D;</mml:mo></mml:mrow><mml:mo>|</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>These distributions over the tokenizer vocabulary &#x1D54D; are then used analogously in the decoding process to generate a sequence of output tokens.</p>
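<p>Equations (6) and (7) interpolate the two distributions with the learned <italic>p</italic><sub><italic>gen</italic></sub>; as a minimal sketch:</p>

```python
import numpy as np

def final_distribution(dist_gen, dist_ptr, p_gen):
    # Equations (6)/(7): weight the generative distribution by p_gen and
    # the pointer distribution by (1 - p_gen), then add them element-wise.
    return p_gen * dist_gen + (1.0 - p_gen) * dist_ptr
```

<p>Note that for <monospace>ptr-sum</monospace> both inputs sum to one, so the result is again a probability distribution, while for <monospace>ptr-max</monospace> this normalization is not guaranteed.</p>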
</sec>
</sec>
<sec id="s4">
<title>4 Experiments</title>
<p>In this section, the different experiments conducted in this paper are described, together with the dataset and other relevant experimental settings used for training and evaluation.</p>
<sec>
<title>4.1 Dataset</title>
<p>In our experiments, we reuse the dataset provided by Witte and Cimiano (<xref ref-type="bibr" rid="B47">2022</xref>) and Witte et al. (<xref ref-type="bibr" rid="B48">2024</xref>), which consists of abstracts of RCTs about type 2 diabetes and glaucoma, annotated according to the C-TrO ontology (Sanchez-Graillet et al., <xref ref-type="bibr" rid="B32">2019</xref>). The dataset comprises a total of 211 documents, 104 on type 2 diabetes and 107 on glaucoma. The 104 type 2 diabetes documents are split into training, validation and test sets of size 68, 16, and 20, respectively. Analogously, the 107 glaucoma documents are split into training, validation and test sets of size 69, 17, and 21, respectively. Thus, we use the same dataset as well as the same fixed train-validation-test split as Witte and Cimiano (<xref ref-type="bibr" rid="B47">2022</xref>) and Witte et al. (<xref ref-type="bibr" rid="B48">2024</xref>) and run separate experiments for those two diseases. The exact number of tokens naturally varies with the model, tokenizer and pre-processing steps used. However, to give a rough estimate, the whole dataset, including both input and output tokens, consists of around 300K tokens in total, with &#x0007E;200K training tokens, 50K validation tokens and around 60K test tokens. These numbers are roughly split in half by disease, i.e., around 100K training tokens each for type 2 diabetes and glaucoma. Given these sizes, the dataset can be considered small compared to typical fine-tuning tasks explored in related work, especially considering the complexity of the task and the length of the targeted output (e.g., even datasets with more samples are still considered low-resource in Roy et al., <xref ref-type="bibr" rid="B31">2024</xref>). 
In contrast, the dataset is also much larger than the data provided to large language models in zero- to few-shot prompting settings (e.g., in Stengel-Eskin et al., <xref ref-type="bibr" rid="B39">2024</xref>), thereby providing an interesting perspective on constrained decoding combined with fine-tuning in low-resource environments.</p>
</sec>
<sec>
<title>4.2 Models</title>
<p>In our experiments, we tested two different encoder-decoder transformers (Vaswani et al., <xref ref-type="bibr" rid="B44">2017</xref>) as base models, namely <monospace>google/flan-t5-base</monospace><xref ref-type="fn" rid="fn0001"><sup>1</sup></xref> (Chung et al., <xref ref-type="bibr" rid="B9">2022</xref>, 223M parameters) and <monospace>allenai/led-base-16384</monospace><xref ref-type="fn" rid="fn0002"><sup>2</sup></xref> (Beltagy et al., <xref ref-type="bibr" rid="B7">2020</xref>, 161M parameters). These base models were then evaluated in four variants:</p>
<list list-type="order">
<list-item><p><monospace>basic</monospace>: Vanilla model without modifications, paired with standard greedy decoding (<monospace>noGCD</monospace>).</p></list-item>
<list-item><p><monospace>GCD</monospace>: Vanilla model without modifications, paired with grammar-constrained decoding.</p></list-item>
<list-item><p><monospace>ptr</monospace>: Model with additional layers for pointer generator-like behavior, paired with grammar-constrained decoding, using different attention aggregation functions for the case of multiple occurrences of a token in the input sequence:</p>
<list list-type="simple">
<list-item><p>(a) <monospace>ptr-max</monospace>: Using the maximum function for aggregating attention values of multiple token occurrences.</p></list-item>
<list-item><p>(b) <monospace>ptr-sum</monospace>: Using the sum function for aggregating attention values of multiple token occurrences.</p></list-item>
</list>
</list-item>
</list>
<p>Therefore, when adding pointer generator-like behavior, exclusively grammar-constrained decoding is considered. The used decoding approach itself is then identical to the grammar-constrained decoding described in Section 3.3. However, in this case <inline-formula><mml:math id="M19"><mml:msubsup><mml:mrow><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula><mml:math id="M20"><mml:msubsup><mml:mrow><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>u</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> are used for <monospace>ptr-max</monospace> and <monospace>ptr-sum</monospace>, respectively, as token distributions over the respective vocabulary instead of the classical distribution <italic>dist</italic>.</p>
</sec>
<sec>
<title>4.3 Experimental setup</title>
<p>For the considered models and diseases, we ran hyperparameter optimizations using Optuna (Akiba et al., <xref ref-type="bibr" rid="B5">2019</xref>) with 30 trials each, measuring performance under grammar-constrained decoding via validation <italic>F</italic><sub>1</sub> scores, calculated as described in Section 4.4. The <monospace>noGCD</monospace> values are not obtained from separately trained models; instead, the already trained models are additionally evaluated with a different decoding technique. The training procedure itself is identical, and <monospace>GCD</monospace> behaves just like standard greedy decoding whenever valid output sequences are generated, so this difference should not be relevant. However, there could be training parameters that are more beneficial for <monospace>noGCD</monospace> than for <monospace>GCD</monospace> (which was used to measure validation performance); this has not been evaluated in this work.</p>
<p>In each trial, a &#x003BB; for the lambda learning rate scheduler (between 0.9 and 1.0, sampled in the logarithmic domain; the learning rate is calculated as <italic>lr</italic>(<italic>epoch</italic>) &#x0003D; &#x003BB;<sup><italic>epoch</italic></sup>) as well as a corresponding initial learning rate (between 1<italic>e</italic><sup>&#x02212;3</sup> and 1<italic>e</italic><sup>&#x02212;5</sup>, sampled in the logarithmic domain) are sampled by Optuna. The batch size is 1 and the number of epochs is 50 in all experiments, each of which is executed on a single NVIDIA A40 GPU.</p>
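<p>Our reading of this schedule can be sketched as follows; the assumption that the sampled initial learning rate is scaled by &#x003BB;<sup><italic>epoch</italic></sup>, as a PyTorch-style lambda scheduler would do, is ours:</p>

```python
def lr_schedule(init_lr, lam, epoch):
    # Multiplicative decay: every epoch shrinks the learning rate by a
    # factor lam in (0.9, 1.0), starting from the sampled initial value.
    return init_lr * lam ** epoch
```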
<p>The best hyperparameters for each disease-model-setting combination are then used to train 10 additional models. Unless stated otherwise, mean and standard deviation in tables refer to the results of these 10 training runs. The means and standard deviations of the test <italic>F</italic><sub>1</sub> scores of these 10 trained models are listed in <xref ref-type="table" rid="T1">Table 1</xref> for the comparison of grammar-constrained and unconstrained greedy decoding and in <xref ref-type="table" rid="T2">Table 2</xref> for the pointer generator variants.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Evaluation of the impact of grammar-constrained decoding vs. greedy decoding.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="center" colspan="2">&#x02193;<bold>Setting</bold></th>
<th valign="top" align="left"><bold>Dataset&#x02192;</bold></th>
<th valign="top" align="center"><bold>Type 2 diabetes</bold></th>
<th valign="top" align="center"><bold>Glaucoma</bold></th>
</tr>
</thead>
<tbody>
<tr style="background-color:#919498;color:#ffffff">
<td valign="top" align="left"><bold>Model</bold></td>
<td valign="top" align="left"><bold>Type</bold></td>
<td valign="top" align="left"><bold>Decoding</bold></td>
<td valign="top" align="center"><bold>Mean</bold> <italic>F</italic><sub>1</sub> <bold>(</bold>&#x000B1;&#x003C3;<bold>)</bold></td>
<td valign="top" align="center"><bold>Mean</bold> <italic>F</italic><sub>1</sub> <bold>(</bold>&#x000B1;&#x003C3;<bold>)</bold></td>
</tr> <tr>
<td valign="top" align="left">flan-t5-base</td>
<td valign="top" align="left">Basic</td>
<td valign="top" align="left">GCD</td>
<td valign="top" align="center"><bold>0.413 (&#x000B1;0.13)</bold></td>
<td valign="top" align="center"><bold>0.47 (&#x000B1;0.061)</bold></td>
</tr> <tr>
<td valign="top" align="left">flan-t5-base</td>
<td valign="top" align="left">Basic</td>
<td valign="top" align="left">noGCD</td>
<td valign="top" align="center">0.062 (&#x000B1;0.041)</td>
<td valign="top" align="center">0.045 (&#x000B1;0.043)</td>
</tr> <tr>
<td valign="top" align="left">led-base-16384</td>
<td valign="top" align="left">Basic</td>
<td valign="top" align="left">GCD</td>
<td valign="top" align="center">0.301 (&#x000B1;0.102)</td>
<td valign="top" align="center">0.292 (&#x000B1;0.12)</td>
</tr> <tr>
<td valign="top" align="left">led-base-16384</td>
<td valign="top" align="left">Basic</td>
<td valign="top" align="left">noGCD</td>
<td valign="top" align="center">0.016 (&#x000B1;0.029)</td>
<td valign="top" align="center">0.102 (&#x000B1;0.049)</td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>Mean and standard deviation &#x003C3; of test <italic>F</italic><sub>1</sub> scores across 10 models trained using the best-performing configuration (by <italic>F</italic><sub>1</sub> on the validation dataset) found in 30 trials of hyperparameter optimization. Numbers are rounded to three decimal places; the best configuration for each disease is marked in bold.</p>
</table-wrap-foot>
</table-wrap>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Evaluation of the impact of pointer generators and the used attention aggregation method (RQ2 and RQ3).</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="center" colspan="2">&#x02193;<bold>Setting</bold></th>
<th valign="top" align="left"><bold>Dataset&#x02192;</bold></th>
<th valign="top" align="center"><bold>Type 2 diabetes</bold></th>
<th valign="top" align="center"><bold>Glaucoma</bold></th>
</tr>
</thead>
<tbody>
<tr style="background-color:#919498;color:#ffffff">
<td valign="top" align="left"><bold>Model</bold></td>
<td valign="top" align="left"><bold>Type</bold></td>
<td valign="top" align="left"><bold>Decoding</bold></td>
<td valign="top" align="center"><bold>Mean</bold> <italic>F</italic><sub>1</sub> <bold>(</bold>&#x000B1;&#x003C3;<bold>)</bold></td>
<td valign="top" align="center"><bold>Mean</bold> <italic>F</italic><sub>1</sub> <bold>(</bold>&#x000B1;&#x003C3;<bold>)</bold></td>
</tr> <tr>
<td valign="top" align="left">flan-t5-base</td>
<td valign="top" align="left">Basic</td>
<td valign="top" align="left">GCD</td>
<td valign="top" align="center"><bold>0.413 (&#x000B1;0.13)</bold></td>
<td valign="top" align="center"><bold>0.47 (&#x000B1;0.061)</bold></td>
</tr>
<tr>
<td valign="top" align="left">flan-t5-base</td>
<td valign="top" align="left">ptr-max</td>
<td valign="top" align="left">GCD</td>
<td valign="top" align="center">0.092 (&#x000B1;0.075)</td>
<td valign="top" align="center">0.091 (&#x000B1;0.015)</td>
</tr>
<tr>
<td valign="top" align="left">flan-t5-base</td>
<td valign="top" align="left">ptr-sum</td>
<td valign="top" align="left">GCD</td>
<td valign="top" align="center">0.16 (&#x000B1;0.074)</td>
<td valign="top" align="center">0.211 (&#x000B1;0.084)</td>
</tr>
<tr>
<td valign="top" align="left">led-base-16384</td>
<td valign="top" align="left">ptr-max</td>
<td valign="top" align="left">GCD</td>
<td valign="top" align="center">0.263 (&#x000B1;0.067)</td>
<td valign="top" align="center">0.272 (&#x000B1;0.046)</td>
</tr>
<tr>
<td valign="top" align="left">led-base-16384</td>
<td valign="top" align="left">ptr-sum</td>
<td valign="top" align="left">GCD</td>
<td valign="top" align="center">0.236 (&#x000B1;0.064)</td>
<td valign="top" align="center">0.216 (&#x000B1;0.078)</td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>Mean and standard deviation &#x003C3; of test <italic>F</italic><sub>1</sub> scores across 10 models trained using the best-performing configuration (by <italic>F</italic><sub>1</sub> on the validation dataset) found in 30 trials of hyperparameter optimization. Numbers are rounded to three decimal places; the best configuration for each disease is marked in bold.</p>
</table-wrap-foot>
</table-wrap>
<p>However, as the dataset consists of information stored in a complex nested template structure, it is not immediately possible to train a model on this data. Therefore, the structure is linearized similarly to XML, i.e., with start <monospace>[start:&#x0003C;<italic>slot or template name</italic>&#x0003E;]</monospace> and end <monospace>[end:&#x0003C;<italic>slot or template name</italic>&#x0003E;]</monospace> tags for slots and templates, which allows templates to be freely nested even within other templates. For each of these tags, special tokens are added to the vocabulary. To reduce the input data variance and allow the models to learn the relations more easily, an arbitrary but fixed (e.g., alphanumerically sorted) order is enforced when linearizing templates and slots. An example of a linearized nested template with both textual slots and a slot containing a template is given in <xref ref-type="fig" rid="F2">Figure 2</xref>.</p>
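<p>The linearization described above can be sketched as follows. The template and slot names are illustrative assumptions, and the recursive helper is a simplification, not the authors' actual implementation:</p>

```python
# Sketch of the described linearization: nested templates (here plain dicts)
# are serialized with [start:<name>]...[end:<name>] tags, enforcing a fixed
# alphanumeric order of child slots/templates. Names are illustrative.
def linearize(name, value):
    if isinstance(value, dict):
        # Sort children to enforce an arbitrary but fixed order.
        inner = "".join(linearize(k, v) for k, v in sorted(value.items()))
    else:
        inner = str(value)
    return f"[start:{name}]{inner}[end:{name}]"

# Example: a Medication template with two textual slots.
medication = {"DoseValue": "20", "DoseUnit": "mg"}
print(linearize("Medication", medication))
# [start:Medication][start:DoseUnit]mg[end:DoseUnit][start:DoseValue]20[end:DoseValue][end:Medication]
```

Because the children are sorted before serialization, the same template always yields the same token sequence regardless of the order in which slots were annotated.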
<p>In order to answer our research questions, we focus on the following two experimental settings:</p>
<list list-type="bullet">
<list-item><p>Impact of grammar-constrained decoding (RQ1): We compare the setting in which a grammar is used to constrain the decoding (<monospace>GCD</monospace>) and the case in which it is not used (<monospace>noGCD</monospace>), i.e., in which standard greedy decoding is applied.</p></list-item>
<list-item><p>Impact of pointer generators and attention aggregation methods (RQ2 and RQ3): We quantify the impact of adding pointer generator-like behavior, comparing two different attention aggregation methods (sum/maximum) for the <monospace>GCD</monospace> case.</p></list-item>
</list>
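<p>The <monospace>GCD</monospace> setting can be illustrated with a toy sketch: at every decoding step, tokens that would violate the template grammar are masked out before the greedy argmax. The tag names, scores, and the <monospace>allowed</monospace> function below are invented for illustration and do not reflect the actual grammar or model:</p>

```python
import math

def constrained_greedy(logits_per_step, allowed_fn):
    # Greedy decoding with a grammar mask: at each step, only tokens
    # permitted by the current grammar state (a stack of open tags)
    # are considered; all others are effectively scored -inf.
    state, output = [], []
    for logits in logits_per_step:
        allowed = allowed_fn(state)
        best = max(allowed, key=lambda t: logits.get(t, -math.inf))
        output.append(best)
        if best.startswith("[start:"):
            state.append(best[len("[start:"):-1])
        elif best.startswith("[end:"):
            state.pop()
    return output

def allowed(state):
    # Toy grammar: at the top level only a Title slot may open; inside
    # it, text may be emitted or the slot closed.
    return {"[start:Title]"} if not state else {"text", f"[end:{state[-1]}]"}

steps = [{"text": 0.9, "[start:Title]": 0.1},
         {"text": 0.8, "[end:Title]": 0.5},
         {"[start:Title]": 0.9, "[end:Title]": 0.3}]
print(constrained_greedy(steps, allowed))
# ['[start:Title]', 'text', '[end:Title]']
```

Note that unconstrained greedy decoding on the same scores would emit <monospace>text</monospace> at the first step and produce a syntactically invalid sequence, which is exactly the failure mode the <monospace>noGCD</monospace> setting exposes.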
</sec>
<sec>
<title>4.4 Evaluation</title>
<p>Evaluating the predicted templates against the ground truth templates is again a non-trivial task, as, in some cases, multiple template instances have to be aligned with each other. This is done by modeling all possible alignments/matchings as a linear inequality system and maximizing the resulting <italic>F</italic><sub>1</sub> score. The <italic>F</italic><sub>1</sub> score for a single predicted and ground truth template is calculated by first determining true positives, false positives and false negatives for the textual slot fillers of the two templates.</p>
<p>Two textual slot fillers are considered equal when the concatenation of the tokens of that slot filler has a similarity of &#x02265;0.9 according to the following normalized Levenshtein similarity measure:</p>
<disp-formula id="E8"><label>(8)</label><mml:math id="M21"><mml:mtable class="eqnarray" columnalign="center"><mml:mtr><mml:mtd><mml:mi>n</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>m</mml:mi><mml:mi>L</mml:mi><mml:mi>e</mml:mi><mml:mi>v</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>s</mml:mi><mml:mi>h</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>S</mml:mi><mml:mi>i</mml:mi><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mo>:</mml:mo><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mfrac><mml:mrow><mml:mi>l</mml:mi><mml:mi>e</mml:mi><mml:mi>v</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>s</mml:mi><mml:mi>h</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>D</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>e</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mo class="qopname">max</mml:mo><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>|</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>levenshteinDistance</italic> in the above definition refers to the Levenshtein distance proposed by Levenshtein (<xref ref-type="bibr" rid="B21">1966</xref>). The concatenation and Levenshtein similarity calculation step is necessary to avoid tokenization artifacts, e.g., situations where the generated text is identical but tokenized slightly differently, which would otherwise lead to low scores. Furthermore, this reduces the bias toward long textual slots with many tokens, such as the <monospace>Title</monospace> slot of a publication, which is easy to predict and typically consists of many tokens, compared to, e.g., <monospace>Outcome</monospace> template instances, which in most cases comprise primarily numbers and short units. We consider this a more meaningful and fair evaluation, albeit with the drawback that our results are not directly comparable to those reported by Witte et al. (<xref ref-type="bibr" rid="B48">2024</xref>) and Witte and Cimiano (<xref ref-type="bibr" rid="B47">2022</xref>).</p>
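<p>A minimal implementation of the similarity measure from Equation (8), using a standard dynamic-programming edit distance (a sketch for illustration, not the authors' code):</p>

```python
def levenshtein_distance(s1, s2):
    # Standard dynamic-programming edit distance over two rows.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (c1 != c2)))  # substitution
        prev = cur
    return prev[-1]

def norm_levenshtein_sim(s1, s2):
    # Equation (8): 1 - distance / max(|s1|, |s2|).
    if not s1 and not s2:
        return 1.0
    return 1 - levenshtein_distance(s1, s2) / max(len(s1), len(s2))

def fillers_equal(s1, s2, threshold=0.9):
    # Two slot fillers count as equal at similarity >= 0.9.
    return norm_levenshtein_sim(s1, s2) >= threshold
```

With the 0.9 threshold, near-identical long fillers such as "intra-ocular pressure" vs. "intraocular pressure" count as equal (similarity &#x02248; 0.95), while a single edit in a very short filler already pushes the similarity below the threshold.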
<p>Correspondingly, slot fillers which are equal w.r.t. the above definition and occur in both templates are counted as true positives, those which only occur in the predicted template as false positives, and those which only occur in the ground truth template as false negatives. Template slot fillers are added to these counts in the same fashion, but without applying the approach recursively, i.e., only completely identical template slot fillers are considered equal and counted as such. The <italic>F</italic><sub>1</sub> score of a template is then calculated from the sums of these true positives, false positives and false negatives.</p>
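<p>The per-template score computation can be sketched as follows. Note that the paper aligns fillers optimally via a linear inequality system; the greedy matching below is a deliberate simplification for illustration:</p>

```python
def template_f1(pred_fillers, gold_fillers, equal):
    # Match each predicted slot filler (greedily) to an unmatched, equal
    # gold filler; matched pairs are true positives, leftover predictions
    # are false positives, leftover gold fillers are false negatives.
    remaining = list(gold_fillers)
    tp = 0
    for p in pred_fillers:
        match = next((g for g in remaining if equal(p, g)), None)
        if match is not None:
            tp += 1
            remaining.remove(match)
    fp = len(pred_fillers) - tp
    fn = len(remaining)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(template_f1(["20", "mg"], ["20", "mg", "daily"], lambda a, b: a == b))
# tp=2, fp=0, fn=1 -> precision 1.0, recall 2/3 -> F1 = 0.8
```

In the full evaluation, the <monospace>equal</monospace> predicate would be the thresholded normalized Levenshtein similarity from Equation (8) for textual fillers and exact identity for template-valued fillers.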
</sec>
</sec>
<sec sec-type="results" id="s5">
<title>5 Results</title>
<p>The results of the conducted experiments relevant for RQ1 can be found in <xref ref-type="table" rid="T1">Table 1</xref> and those for RQ2 as well as RQ3 in <xref ref-type="table" rid="T2">Table 2</xref>. In addition to the overall performance scores presented in <xref ref-type="table" rid="T1">Tables 1</xref>, <xref ref-type="table" rid="T2">2</xref>, the mean scores per template are shown in <xref ref-type="table" rid="T3">Table 3</xref> and per slot in <xref ref-type="table" rid="T4">Table 4</xref>. In both cases, the values of the <monospace>GCD</monospace> models on the glaucoma dataset are listed as an example. The remaining data can be found in the <xref ref-type="supplementary-material" rid="SM1">Supplementary material</xref>. This section summarizes the most important results w.r.t. all three research questions, together with a small ablation study on the model size.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Glaucoma test <italic>F</italic><sub>1</sub> scores per template.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th/>
<th valign="top" align="center" colspan="3"><bold>Glaucoma</bold> <italic><bold>F</bold></italic><sub><bold>1</bold></sub></th>
</tr>
</thead>
<tbody>
<tr style="background-color:#919498;color:#ffffff">
<td valign="top" align="left"><bold>Template name</bold></td>
<td valign="top" align="center"><bold>Basic GCD</bold></td>
<td valign="top" align="center"><bold>ptr-max GCD</bold></td>
<td valign="top" align="center"><bold>ptr-sum GCD</bold></td>
</tr> <tr>
<td valign="top" align="left">Arm</td>
<td valign="top" align="center">0.21 (&#x000B1;0.08)</td>
<td valign="top" align="center">0.07 (&#x000B1;0.06)</td>
<td valign="top" align="center">0.05 (&#x000B1;0.07)</td>
</tr> <tr>
<td valign="top" align="left">ClinicalTrial</td>
<td valign="top" align="center">0.53 (&#x000B1;0.05)</td>
<td valign="top" align="center">0.3 (&#x000B1;0.07)</td>
<td valign="top" align="center">0.22 (&#x000B1;0.09)</td>
</tr> <tr>
<td valign="top" align="left">DiffBetweenGroups</td>
<td valign="top" align="center">0.15 (&#x000B1;0.08)</td>
<td valign="top" align="center">0.07 (&#x000B1;0.04)</td>
<td valign="top" align="center">0.04 (&#x000B1;0.03)</td>
</tr> <tr>
<td valign="top" align="left">Endpoint</td>
<td valign="top" align="center">0.33 (&#x000B1;0.06)</td>
<td valign="top" align="center">0.22 (&#x000B1;0.03)</td>
<td valign="top" align="center">0.19 (&#x000B1;0.07)</td>
</tr> <tr>
<td valign="top" align="left">Intervention</td>
<td valign="top" align="center">0.49 (&#x000B1;0.1)</td>
<td valign="top" align="center">0.23 (&#x000B1;0.07)</td>
<td valign="top" align="center">0.21 (&#x000B1;0.1)</td>
</tr> <tr>
<td valign="top" align="left">Medication</td>
<td valign="top" align="center">0.51 (&#x000B1;0.11)</td>
<td valign="top" align="center">0.3 (&#x000B1;0.06)</td>
<td valign="top" align="center">0.27 (&#x000B1;0.12)</td>
</tr> <tr>
<td valign="top" align="left">Outcome</td>
<td valign="top" align="center">0.26 (&#x000B1;0.05)</td>
<td valign="top" align="center">0.1 (&#x000B1;0.04)</td>
<td valign="top" align="center">0.09 (&#x000B1;0.05)</td>
</tr> <tr>
<td valign="top" align="left">Population</td>
<td valign="top" align="center">0.47 (&#x000B1;0.06)</td>
<td valign="top" align="center">0.25 (&#x000B1;0.08)</td>
<td valign="top" align="center">0.16 (&#x000B1;0.1)</td>
</tr> <tr>
<td valign="top" align="left">Publication</td>
<td valign="top" align="center">0.69 (&#x000B1;0.06)</td>
<td valign="top" align="center">0.34 (&#x000B1;0.1)</td>
<td valign="top" align="center">0.0 (&#x000B1;0.0)</td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>Mean and standard deviation &#x003C3; per template of glaucoma test <italic>F</italic><sub>1</sub> scores, considering the best-performing base model of each category. Numbers rounded to two decimal places.</p>
</table-wrap-foot>
</table-wrap>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Glaucoma test <italic>F</italic><sub>1</sub> scores per slot.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th/>
<th valign="top" align="center" colspan="3"><bold>Glaucoma</bold> <italic><bold>F</bold></italic><sub><bold>1</bold></sub></th>
</tr>
</thead>
<tbody>
<tr style="background-color:#919498;color:#ffffff">
<td valign="top" align="left"><bold>Slot name</bold></td>
<td valign="top" align="center"><bold>Basic GCD</bold></td>
<td valign="top" align="center"><bold>ptr-max GCD</bold></td>
<td valign="top" align="center"><bold>ptr-sum GCD</bold></td>
</tr> <tr>
<td valign="top" align="left">AggregationMethod</td>
<td valign="top" align="center">0.52 (&#x000B1;0.08)</td>
<td valign="top" align="center">0.34 (&#x000B1;0.07)</td>
<td valign="top" align="center">0.31 (&#x000B1;0.13)</td>
</tr> <tr>
<td valign="top" align="left">AnalysesHealthCondition</td>
<td valign="top" align="center">0.87 (&#x000B1;0.02)</td>
<td valign="top" align="center">0.59 (&#x000B1;0.05)</td>
<td valign="top" align="center">0.57 (&#x000B1;0.1)</td>
</tr> <tr>
<td valign="top" align="left">Author</td>
<td valign="top" align="center">0.61 (&#x000B1;0.04)</td>
<td valign="top" align="center">0.34 (&#x000B1;0.11)</td>
<td valign="top" align="center">0.21 (&#x000B1;0.12)</td>
</tr> <tr>
<td valign="top" align="left">BaselineUnit</td>
<td valign="top" align="center">0.55 (&#x000B1;0.06)</td>
<td valign="top" align="center">0.38 (&#x000B1;0.04)</td>
<td valign="top" align="center">0.3 (&#x000B1;0.1)</td>
</tr> <tr>
<td valign="top" align="left">BaselineValue</td>
<td valign="top" align="center">0.47 (&#x000B1;0.17)</td>
<td valign="top" align="center">0.18 (&#x000B1;0.1)</td>
<td valign="top" align="center"><bold>0.19 (&#x000B1;0.16)</bold></td>
</tr> <tr>
<td valign="top" align="left">CTDesign</td>
<td valign="top" align="center">0.63 (&#x000B1;0.05)</td>
<td valign="top" align="center">0.39 (&#x000B1;0.09)</td>
<td valign="top" align="center">0.21 (&#x000B1;0.13)</td>
</tr> <tr>
<td valign="top" align="left">CTduration</td>
<td valign="top" align="center">0.68 (&#x000B1;0.09)</td>
<td valign="top" align="center">0.32 (&#x000B1;0.1)</td>
<td valign="top" align="center">0.25 (&#x000B1;0.14)</td>
</tr> <tr>
<td valign="top" align="left">ChangeValue</td>
<td valign="top" align="center">0.43 (&#x000B1;0.07)</td>
<td valign="top" align="center">0.21 (&#x000B1;0.05)</td>
<td valign="top" align="center">0.21 (&#x000B1;0.09)</td>
</tr> <tr>
<td valign="top" align="left">ConclusionComment</td>
<td valign="top" align="center">0.59 (&#x000B1;0.06)</td>
<td valign="top" align="center">0.25 (&#x000B1;0.1)</td>
<td valign="top" align="center">0.14 (&#x000B1;0.09)</td>
</tr> <tr>
<td valign="top" align="left">ConfIntervalDiff</td>
<td valign="top" align="center">0.11 (&#x000B1;0.14)</td>
<td valign="top" align="center">0.02 (&#x000B1;0.06)</td>
<td valign="top" align="center">0.0 (&#x000B1;0.0)</td>
</tr> <tr>
<td valign="top" align="left">Country</td>
<td valign="top" align="center">0.76 (&#x000B1;0.1)</td>
<td valign="top" align="center">0.39 (&#x000B1;0.12)</td>
<td valign="top" align="center">0.25 (&#x000B1;0.14)</td>
</tr> <tr>
<td valign="top" align="left">DeliveryMethod</td>
<td valign="top" align="center">0.23 (&#x000B1;0.2)</td>
<td valign="top" align="center">0.05 (&#x000B1;0.11)</td>
<td valign="top" align="center"><bold>0.11 (&#x000B1;0.14)</bold></td>
</tr> <tr>
<td valign="top" align="left">DiffGroupAbsValue</td>
<td valign="top" align="center">0.11 (&#x000B1;0.13)</td>
<td valign="top" align="center">0.04 (&#x000B1;0.09)</td>
<td valign="top" align="center">0.02 (&#x000B1;0.06)</td>
</tr> <tr>
<td valign="top" align="left">DoseUnit</td>
<td valign="top" align="center">0.69 (&#x000B1;0.13)</td>
<td valign="top" align="center">0.51 (&#x000B1;0.11)</td>
<td valign="top" align="center">0.45 (&#x000B1;0.2)</td>
</tr> <tr>
<td valign="top" align="left">DoseValue</td>
<td valign="top" align="center">0.65 (&#x000B1;0.12)</td>
<td valign="top" align="center">0.36 (&#x000B1;0.05)</td>
<td valign="top" align="center">0.31 (&#x000B1;0.14)</td>
</tr> <tr>
<td valign="top" align="left">Drug</td>
<td valign="top" align="center">0.45 (&#x000B1;0.07)</td>
<td valign="top" align="center">0.29 (&#x000B1;0.07)</td>
<td valign="top" align="center">0.23 (&#x000B1;0.1)</td>
</tr> <tr>
<td valign="top" align="left">EndoPointDescription</td>
<td valign="top" align="center">0.21 (&#x000B1;0.06)</td>
<td valign="top" align="center">0.16 (&#x000B1;0.04)</td>
<td valign="top" align="center">0.16 (&#x000B1;0.06)</td>
</tr> <tr>
<td valign="top" align="left">FinalNumPatientsArm</td>
<td valign="top" align="center">0.03 (&#x000B1;0.11)</td>
<td valign="top" align="center">0.0 (&#x000B1;0.0)</td>
<td valign="top" align="center">0.0 (&#x000B1;0.0)</td>
</tr> <tr>
<td valign="top" align="left">FinalNumberPatientsCT</td>
<td valign="top" align="center">0.11 (&#x000B1;0.16)</td>
<td valign="top" align="center">0.05 (&#x000B1;0.11)</td>
<td valign="top" align="center">0.0 (&#x000B1;0.0)</td>
</tr> <tr>
<td valign="top" align="left">Frequency</td>
<td valign="top" align="center">0.67 (&#x000B1;0.06)</td>
<td valign="top" align="center">0.4 (&#x000B1;0.08)</td>
<td valign="top" align="center">0.37 (&#x000B1;0.12)</td>
</tr> <tr>
<td valign="top" align="left">Journal</td>
<td valign="top" align="center">0.66 (&#x000B1;0.07)</td>
<td valign="top" align="center">0.36 (&#x000B1;0.1)</td>
<td valign="top" align="center">0.24 (&#x000B1;0.16)</td>
</tr> <tr>
<td valign="top" align="left">MeasurementDevice</td>
<td valign="top" align="center">0.06 (&#x000B1;0.13)</td>
<td valign="top" align="center">0.0 (&#x000B1;0.0)</td>
<td valign="top" align="center">0.0 (&#x000B1;0.0)</td>
</tr> <tr>
<td valign="top" align="left">NumberAffected</td>
<td valign="top" align="center">0.37 (&#x000B1;0.27)</td>
<td valign="top" align="center">0.01 (&#x000B1;0.02)</td>
<td valign="top" align="center"><bold>0.03 (&#x000B1;0.06)</bold></td>
</tr> <tr>
<td valign="top" align="left">NumberPatientsArm</td>
<td valign="top" align="center">0.38 (&#x000B1;0.2)</td>
<td valign="top" align="center">0.13 (&#x000B1;0.13)</td>
<td valign="top" align="center">0.1 (&#x000B1;0.15)</td>
</tr> <tr>
<td valign="top" align="left">NumberPatientsCT</td>
<td valign="top" align="center">0.48 (&#x000B1;0.07)</td>
<td valign="top" align="center">0.27 (&#x000B1;0.19)</td>
<td valign="top" align="center">0.23 (&#x000B1;0.16)</td>
</tr> <tr>
<td valign="top" align="left">ObjectiveDescription</td>
<td valign="top" align="center">0.36 (&#x000B1;0.09)</td>
<td valign="top" align="center">0.24 (&#x000B1;0.06)</td>
<td valign="top" align="center">0.12 (&#x000B1;0.1)</td>
</tr> <tr>
<td valign="top" align="left">ObservedResult</td>
<td valign="top" align="center">0.01 (&#x000B1;0.02)</td>
<td valign="top" align="center"><bold>0.02 (&#x000B1;0.02)</bold></td>
<td valign="top" align="center">0.01 (&#x000B1;0.02)</td>
</tr> <tr>
<td valign="top" align="left">PMID</td>
<td valign="top" align="center">0.76 (&#x000B1;0.07)</td>
<td valign="top" align="center">0.32 (&#x000B1;0.12)</td>
<td valign="top" align="center">0.21 (&#x000B1;0.14)</td>
</tr> <tr>
<td valign="top" align="left">PValueChangeValue</td>
<td valign="top" align="center">0.01 (&#x000B1;0.04)</td>
<td valign="top" align="center"><bold>0.08 (&#x000B1;0.13)</bold></td>
<td valign="top" align="center">0.03 (&#x000B1;0.07)</td>
</tr> <tr>
<td valign="top" align="left">PercentageAffected</td>
<td valign="top" align="center">0.19 (&#x000B1;0.1)</td>
<td valign="top" align="center">0.06 (&#x000B1;0.05)</td>
<td valign="top" align="center">0.05 (&#x000B1;0.07)</td>
</tr> <tr>
<td valign="top" align="left">Precondition</td>
<td valign="top" align="center">0.18 (&#x000B1;0.06)</td>
<td valign="top" align="center">0.12 (&#x000B1;0.06)</td>
<td valign="top" align="center">0.06 (&#x000B1;0.07)</td>
</tr> <tr>
<td valign="top" align="left">PublicationYear</td>
<td valign="top" align="center">0.88 (&#x000B1;0.09)</td>
<td valign="top" align="center">0.36 (&#x000B1;0.12)</td>
<td valign="top" align="center">0.29 (&#x000B1;0.18)</td>
</tr> <tr>
<td valign="top" align="left">PvalueDiff</td>
<td valign="top" align="center">0.24 (&#x000B1;0.04)</td>
<td valign="top" align="center">0.14 (&#x000B1;0.06)</td>
<td valign="top" align="center">0.09 (&#x000B1;0.05)</td>
</tr> <tr>
<td valign="top" align="left">RelativeChangeValue</td>
<td valign="top" align="center">0.1 (&#x000B1;0.18)</td>
<td valign="top" align="center">0.05 (&#x000B1;0.08)</td>
<td valign="top" align="center"><bold>0.1 (&#x000B1;0.15)</bold></td>
</tr> <tr>
<td valign="top" align="left">RelativeFreqTime</td>
<td valign="top" align="center">0.31 (&#x000B1;0.17)</td>
<td valign="top" align="center">0.07 (&#x000B1;0.12)</td>
<td valign="top" align="center">0.05 (&#x000B1;0.16)</td>
</tr> <tr>
<td valign="top" align="left">ResultMeasuredValue</td>
<td valign="top" align="center">0.39 (&#x000B1;0.11)</td>
<td valign="top" align="center">0.16 (&#x000B1;0.09)</td>
<td valign="top" align="center">0.11 (&#x000B1;0.08)</td>
</tr> <tr>
<td valign="top" align="left">SdDevBL</td>
<td valign="top" align="center">0.31 (&#x000B1;0.14)</td>
<td valign="top" align="center">0.07 (&#x000B1;0.08)</td>
<td valign="top" align="center">0.07 (&#x000B1;0.07)</td>
</tr> <tr>
<td valign="top" align="left">SdDevChangeValue</td>
<td valign="top" align="center">0.24 (&#x000B1;0.09)</td>
<td valign="top" align="center">0.08 (&#x000B1;0.07)</td>
<td valign="top" align="center"><bold>0.09 (&#x000B1;0.12)</bold></td>
</tr> <tr>
<td valign="top" align="left">SdDevResValue</td>
<td valign="top" align="center">0.43 (&#x000B1;0.12)</td>
<td valign="top" align="center">0.15 (&#x000B1;0.1)</td>
<td valign="top" align="center">0.14 (&#x000B1;0.11)</td>
</tr> <tr>
<td valign="top" align="left">SdErrorChangeValue</td>
<td valign="top" align="center">0.11 (&#x000B1;0.18)</td>
<td valign="top" align="center">0.03 (&#x000B1;0.11)</td>
<td valign="top" align="center">0.0 (&#x000B1;0.0)</td>
</tr> <tr>
<td valign="top" align="left">TimePoint</td>
<td valign="top" align="center">0.36 (&#x000B1;0.05)</td>
<td valign="top" align="center">0.18 (&#x000B1;0.1)</td>
<td valign="top" align="center">0.14 (&#x000B1;0.1)</td>
</tr> <tr>
<td valign="top" align="left">Title</td>
<td valign="top" align="center">0.56 (&#x000B1;0.05)</td>
<td valign="top" align="center">0.32 (&#x000B1;0.1)</td>
<td valign="top" align="center">0.21 (&#x000B1;0.14)</td>
</tr> <tr>
<td valign="top" align="left">Total Micro <italic>F</italic><sub>1</sub> Score</td>
<td valign="top" align="center">0.47 (&#x000B1;0.06)</td>
<td valign="top" align="center">0.27 (&#x000B1;0.05)</td>
<td valign="top" align="center">0.22 (&#x000B1;0.08)</td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>Mean and standard deviation &#x003C3; per slot of glaucoma test <italic>F</italic><sub>1</sub> scores, considering the best-performing base model of each category. Numbers rounded to two decimal places. Notable exceptions where either <monospace>ptr-max</monospace> outperforms <monospace>basic</monospace> or <monospace>ptr-sum</monospace> outperforms <monospace>ptr-max</monospace> are marked bold.</p>
</table-wrap-foot>
</table-wrap>
<sec>
<title>5.1 Impact of grammar-constrained decoding (RQ1)</title>
<p>This section presents the results of <xref ref-type="table" rid="T1">Table 1</xref>, i.e., the results w.r.t. RQ1, discussing <monospace>basic</monospace> models with and without grammar-constrained decoding.</p>
<p>The combination of the <monospace>flan-t5-base</monospace> base model with grammar-constrained decoding (<monospace>GCD</monospace>) yields the overall best results, achieving an <italic>F</italic><sub>1</sub> score of 0.413 (&#x000B1;0.13) for the type 2 diabetes and 0.47 (&#x000B1;0.061) for the glaucoma test set. The second best results are achieved by the combination of grammar-constrained decoding and the <monospace>led-base-16384</monospace> base model, albeit performing considerably worse with 0.301 (&#x000B1;0.102) for the type 2 diabetes and 0.292 (&#x000B1;0.12) for the glaucoma test set.</p>
<p>The grammar-constrained decoding approach outperforms the greedy decoding approach by a wide margin on average, both for <monospace>flan-t5-base</monospace> (0.413 vs. 0.062 for type 2 diabetes and 0.470 vs. 0.045 for glaucoma) and for <monospace>led-base-16384</monospace> (0.301 vs. 0.016 for type 2 diabetes and 0.292 vs. 0.102 for glaucoma).</p>
<p>Despite achieving the best overall performance, the <monospace>basic</monospace> &#x0002B; <monospace>GCD</monospace> combination also exhibits the highest standard deviations, in general and for the best-performing model <monospace>flan-t5-base</monospace> in particular (0.13 for type 2 diabetes). However, the highest standard deviation on the glaucoma dataset is that of <monospace>led-base-16384</monospace> with <monospace>basic</monospace> &#x0002B; <monospace>GCD</monospace> at 0.12.</p>
</sec>
<sec>
<title>5.2 Interplay between grammar-constrained decoding and pointer generators (RQ2)</title>
<p>This section shows the results of <xref ref-type="table" rid="T2">Table 2</xref> comparing the pointer models as a whole with the <monospace>basic</monospace> baseline using grammar-constrained decoding, i.e., <monospace>GCD</monospace>. For this purpose, the best <monospace>basic</monospace> scores are shown at the top of <xref ref-type="table" rid="T2">Table 2</xref> for both datasets.</p>
<p>In the conducted experiments, the pointer models on average always performed worse than their <monospace>basic</monospace> counterparts, with a much larger performance drop for <monospace>flan-t5-base</monospace> (0.413 vs. 0.092 for type 2 diabetes and 0.47 vs. 0.091 for glaucoma) than for <monospace>led-base-16384</monospace> (0.301 vs. 0.263 for type 2 diabetes and 0.292 vs. 0.272 for glaucoma). Thus, overall, pointer models do not outperform vanilla <monospace>basic</monospace> models in combination with grammar-constrained decoding. Since <monospace>basic</monospace> &#x0002B; <monospace>GCD</monospace> models occupy the first and second places overall, the best pointer model ranks third: the <monospace>ptr-max</monospace> model using <monospace>led-base-16384</monospace> as a base model, with 0.263 (&#x000B1;0.067) for type 2 diabetes and 0.272 (&#x000B1;0.046) for glaucoma.</p>
<p>Examining <xref ref-type="table" rid="T3">Table 3</xref>, it is also striking that the <monospace>basic</monospace> model achieves better mean <italic>F</italic><sub>1</sub> scores than both <monospace>ptr-max</monospace> and <monospace>ptr-sum</monospace> for every single template type. For <xref ref-type="table" rid="T4">Table 4</xref>, the results are slightly more mixed. For the majority of slots, the performance ranking is the same as for the template, namely <monospace>basic</monospace> outperforming <monospace>ptr-max</monospace> and <monospace>ptr-max</monospace> performing slightly better than <monospace>ptr-sum</monospace>. Nevertheless, there are a few exceptions which are marked bold in <xref ref-type="table" rid="T4">Table 4</xref>. For example, for <monospace>PValueChangeValue</monospace>, the pointer model <monospace>ptr-max</monospace> achieves a mean <italic>F</italic><sub>1</sub> score of 0.08 whereas the <monospace>basic</monospace> model only reaches a score of 0.01.</p>
<p>However, there is no slot for which <monospace>ptr-sum</monospace> outperforms <monospace>basic</monospace>. Additionally, these deviations occur mostly for slots which have comparably low <italic>F</italic><sub>1</sub> scores anyway and are typically paired with high standard deviations. This may indicate that these exceptions are due to noise and random fluctuations rather than to actual architectural differences. Further experiments would be necessary to investigate this hypothesis, which remains future work.</p>
</sec>
<sec>
<title>5.3 Performance of different attention aggregation strategies (RQ3)</title>
<p>This section inspects the results of <xref ref-type="table" rid="T2">Table 2</xref> comparing the pointer models with each other in order to determine the best attention aggregation method, i.e., either sum (<monospace>ptr-sum</monospace>) or maximum (<monospace>ptr-max</monospace>).</p>
<p>In absolute numbers, <monospace>ptr-max</monospace> models achieve the best pointer results. However, the <monospace>ptr-sum</monospace> architecture works considerably better than <monospace>ptr-max</monospace> for <monospace>flan-t5-base</monospace> (0.16 vs. 0.092 for type 2 diabetes and 0.211 vs. 0.091 for glaucoma), whereas <monospace>ptr-max</monospace> is slightly ahead for <monospace>led-base-16384</monospace> (0.263 vs. 0.236 for type 2 diabetes and 0.272 vs. 0.216 for glaucoma).</p>
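<p>The two aggregation strategies differ only in how the cross-attention weights of repeated input tokens are combined into a copy score per vocabulary entry. A minimal sketch with invented token ids and attention weights (the actual models operate on full attention tensors):</p>

```python
from collections import defaultdict

def copy_scores(input_ids, attention, mode):
    # Aggregate attention mass per vocabulary id: input positions holding
    # the same token either add up ("sum") or compete via the maximum
    # ("max"). Illustrative sketch, not the exact model code.
    scores = defaultdict(float)
    for tok, weight in zip(input_ids, attention):
        if mode == "sum":
            scores[tok] += weight
        else:  # "max"
            scores[tok] = max(scores[tok], weight)
    return dict(scores)

# Token id 5 occurs twice in the input; under "sum" its occurrences pool,
# under "max" only the single strongest occurrence counts.
print(copy_scores([5, 7, 5], [0.25, 0.5, 0.25], "sum"))  # {5: 0.5, 7: 0.5}
print(copy_scores([5, 7, 5], [0.25, 0.5, 0.25], "max"))  # {5: 0.25, 7: 0.5}
```

The example shows why the two strategies can rank tokens differently: summation favors tokens that occur often in the input, while the maximum favors tokens with one strongly attended occurrence.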
<p>Regarding standard deviation, both pointer model architectures deliver comparable values for type 2 diabetes. For glaucoma, the standard deviation is considerably larger for <monospace>ptr-sum</monospace> than for <monospace>ptr-max</monospace>, for both base models (0.078 vs. 0.046 for <monospace>led-base-16384</monospace> and 0.084 vs. 0.015 for <monospace>flan-t5-base</monospace>).</p>
<p>Regarding <xref ref-type="table" rid="T3">Table 3</xref>, <monospace>ptr-max</monospace> outperforms <monospace>ptr-sum</monospace> for every template on the glaucoma dataset. However, the standard deviation is higher in most cases for the <monospace>ptr-sum</monospace> model, indicating that the performance of <monospace>ptr-sum</monospace> models is more volatile and can exceed that of <monospace>ptr-max</monospace> in extreme cases. For example, for the <monospace>Medication</monospace> template, the mean scores of <monospace>ptr-max</monospace> and <monospace>ptr-sum</monospace> are comparable (0.3 vs. 0.27), but the standard deviation of <monospace>ptr-sum</monospace> is twice as high (0.12 vs. 0.06), indicating a higher potential to achieve high scores in some runs. Whether such high-performing runs can be achieved more consistently with different training parameters is unclear and remains to be investigated in future work.</p>
<p>For <xref ref-type="table" rid="T4">Table 4</xref>, the results comparing the different pointer models are slightly more mixed, just as when comparing them to the <monospace>basic</monospace> baseline in the previous section. Nevertheless, <monospace>ptr-max</monospace> usually performs slightly better than <monospace>ptr-sum</monospace>. An exception is the slot <monospace>DeliveryMethod</monospace>, where <monospace>ptr-sum</monospace> outperforms <monospace>ptr-max</monospace> with a score of 0.11 vs. 0.05.</p>
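<p>To make the compared aggregation methods concrete, the following is a minimal, self-contained sketch (function and variable names are illustrative, not taken from our implementation): the cross-attention weights over the input positions are collapsed into per-vocabulary-token pointer scores, either by summing over all occurrences of a token (as in <monospace>ptr-sum</monospace>) or by keeping only the strongest occurrence (as in <monospace>ptr-max</monospace>).</p>

```python
def pointer_scores(attention, input_ids, vocab_size, aggregation="sum"):
    """Aggregate cross-attention weights over input positions into
    per-vocabulary-token pointer scores (illustrative sketch).

    attention   -- one attention weight per input position
    input_ids   -- the token id found at each input position
    aggregation -- "sum" (ptr-sum) or "max" (ptr-max)
    """
    scores = [0.0] * vocab_size
    for weight, token in zip(attention, input_ids):
        if aggregation == "sum":
            scores[token] += weight                     # accumulate over repeated tokens
        else:
            scores[token] = max(scores[token], weight)  # keep strongest occurrence
    return scores


# Token 5 occurs twice in the input (weights 0.1 and 0.2):
attention = [0.1, 0.4, 0.2, 0.3]
input_ids = [5, 7, 5, 2]
sum_scores = pointer_scores(attention, input_ids, vocab_size=10, aggregation="sum")
max_scores = pointer_scores(attention, input_ids, vocab_size=10, aggregation="max")
# sum_scores[5] is 0.1 + 0.2, whereas max_scores[5] is only 0.2
```

<p>A token that occurs repeatedly in the input thus receives a larger score under sum aggregation than under maximum aggregation, which is the essential behavioral difference between the two pointer variants.</p>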
</sec>
<sec>
<title>5.4 Ablation study: increasing model size</title>
<p>Although the influence of the model size is not systematically evaluated in this work, we conducted a small ablation study for a single branch of the experiments presented above. Concretely, we trained a <monospace>basic</monospace> model of the <monospace>google/flan-t5-large</monospace> base model on the glaucoma dataset in the same 30&#x0002B;10 trials fashion described above and evaluated the resulting 10 models both with and without grammar-constrained decoding. With grammar-constrained decoding, i.e., for <monospace>GCD</monospace>, the result on the glaucoma test set is an <italic>F</italic><sub>1</sub> score of 0.490 (&#x000B1;0.053). With standard greedy decoding, i.e., <monospace>noGCD</monospace>, a score of 0.044 (&#x000B1;0.043) is achieved.</p>
<p>Compared to <monospace>flan-t5-base</monospace>, which achieves 0.47 (&#x000B1;0.061) with <monospace>GCD</monospace> and 0.045 (&#x000B1;0.043) with <monospace>noGCD</monospace>, this is a slight performance improvement when using grammar-constrained decoding and a very similar, or even slightly worse, result when using standard greedy decoding. This indicates that increasing model size alone does not solve the problem of reliably generating syntactically correct output sequences. However, more structured evaluations are necessary to test this hypothesis further.</p>
<boxed-text id="C1" position="float">
<label>Listing 1</label>
<title>Case study of an arbitrarily chosen syntax error made by <monospace>google/flan-t5-large</monospace> trained on the glaucoma dataset when evaluated without grammar-constrained decoding, i.e., <monospace>noGCD</monospace>.</title>
<preformat>
1 &#x000A0;&#x000A0;...
2 &#x000A0;&#x000A0;[end:hasEndpoint]
3 &#x000A0;&#x000A0;<bold>[start:hasObservedResult]</bold>
4 &#x000A0;&#x000A0;After&#x000A0;3&#x000A0;months&#x000A0;of&#x000A0;treatment...
5 &#x000A0;&#x000A0;<bold>[end:hasObservedResult]</bold>
6 &#x000A0;&#x000A0;<bold>[start:hasPValueChangeValue]</bold>
7 &#x000A0;&#x000A0;After&#x000A0;3&#x000A0;months&#x000A0;of&#x000A0;treatment...
8 &#x000A0;&#x000A0;<bold>[end:hasObservedResult]</bold>
9 &#x000A0;&#x000A0;[start:hasPValueChangeValue]
10 &#x000A0;&#x000A0;P=0.01
11 &#x000A0;&#x000A0;[end:hasPValueChangeValue]
12 &#x000A0;&#x000A0;[end:Outcome]
13 &#x000A0;&#x000A0;[end:hasOutcome]
14 &#x000A0;&#x000A0;...
</preformat>
</boxed-text>
<p><xref ref-type="other" rid="C1">Listing 1</xref> shows an exemplary syntax error made by the fine-tuned <monospace>google/flan-t5-large</monospace> during evaluation on the glaucoma test set. Considering the presented output snippet, it is striking that the model generates the first part correctly, i.e., <monospace>[start:hasObservedResult]After 3 months of treatment...[end:hasObservedResult]</monospace> is syntactically correct. After that, however, some kind of mixture between <monospace>hasObservedResult</monospace> and <monospace>hasPValueChangeValue</monospace> seems to be generated: the content is almost identical to that of the <monospace>hasObservedResult</monospace> slot, whereas the start tag, i.e., <monospace>hasPValueChangeValue</monospace>, is not. The similar content might be the reason why the model confuses the end tags and chooses <monospace>hasObservedResult</monospace> over the correct <monospace>hasPValueChangeValue</monospace>. After this, a correct instance of <monospace>hasPValueChangeValue</monospace> is generated with different content and no confusion of the end tag.</p>
<p>Although the types of syntax errors were not systematically evaluated in this work, the presented example appears prototypical for the category of errors that is common in unconstrained output. Although large parts of the generated output are correct and meaningful, often a single small error renders the whole output invalid due to the strict requirements imposed by the context-free grammar. At the same time, this kind of error can easily be circumvented with grammar-constrained decoding, preserving the validity of the otherwise largely useful and correct output. A more structured evaluation of the kinds of mistakes unconstrained models make would be an interesting path for future work.</p>
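<p>The mechanism by which grammar-constrained decoding rules out such errors can be sketched in a few lines. This is an illustrative simplification with a toy vocabulary; in our actual setup, the set of allowed next tokens is derived from the domain-specific context-free grammar rather than hard-coded. At every decoding step, all tokens the grammar forbids in the current state are masked to negative infinity before the greedy argmax, so an invalid tag can never be emitted.</p>

```python
import math

def constrained_greedy_step(logits, allowed_token_ids):
    """One step of grammar-constrained greedy decoding: mask every token
    the grammar forbids in the current state to -inf, then take the argmax,
    so the chosen token is guaranteed to be grammatically valid."""
    masked = [score if token in allowed_token_ids else -math.inf
              for token, score in enumerate(logits)]
    return max(range(len(masked)), key=masked.__getitem__)

# Toy vocabulary: 0 = "[start:hasPValueChangeValue]", 1 = content,
# 2 = "[end:hasObservedResult]", 3 = "[end:hasPValueChangeValue]".
# Inside an open hasPValueChangeValue slot, the grammar only allows
# content (1) or the matching end tag (3).
logits = [0.1, 0.5, 2.0, 1.0]  # unconstrained argmax would pick the wrong end tag (2)
token = constrained_greedy_step(logits, allowed_token_ids={1, 3})
# token is 3: the matching end tag, despite the higher raw logit of token 2
```

<p>In this toy setting, the unconstrained argmax would emit the mismatched end tag, analogous to the confusion observed in Listing 1, whereas the constrained step is forced onto a grammatically valid continuation.</p>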
</sec>
</sec>
<sec sec-type="discussion" id="s6">
<title>6 Discussion</title>
<p>In this section, we discuss the results presented in the previous section w.r.t. the research questions of this paper and connect our findings to some related work.</p>
<p>Considering the poor performance of almost all <monospace>noGCD</monospace> configurations, with an absolute performance increase for <monospace>GCD</monospace> between 0.091 (glaucoma dataset, <monospace>flan-t5-base</monospace> &#x0002B; <monospace>ptr</monospace>) and 0.425 (glaucoma dataset, <monospace>flan-t5-base</monospace> &#x0002B; <monospace>basic</monospace>), the results suggest w.r.t. research question 1 that grammar-constrained decoding helps the models substantially to generate better results and eliminates the burden of having to learn the structure of the data from examples. Thus, grammar-constrained decoding positively affects the performance for the considered structured information extraction task.</p>
<p>This is in line with the results of Geng et al. (<xref ref-type="bibr" rid="B14">2023</xref>). However, they have only shown the positive impact of grammar-constrained decoding on pre-trained models. In contrast, we have focused on fine-tuned models and shown that grammar-constrained decoding also has a strongly positive impact in this setting, in particular in what we have called low-resource settings. Considering the larger amount of data given to the models compared to few-shot prompting, one could have expected the benefit of grammar-constrained decoding to decrease. Instead, our experiments indicate that grammar-constrained decoding can still be useful in fine-tuning settings, at least for the comparably small models that have been tested, and the resulting performance increase is similar to, if not larger than, the results obtained by Geng et al. (<xref ref-type="bibr" rid="B14">2023</xref>). The poor performance without grammar-constrained decoding may also be caused by the use of special tokens for the start and end tags, which were not part of the pre-training process. Thus, the available training data might have been too small for learning the meaning and structure of these tokens. Whether relying on the existing vocabulary for the structure generation instead of special tokens would improve the performance is unclear and remains to be investigated in future work. However, some preliminary experiments indicate that the task is actually easier for the models to learn with special tokens than with the slot start and end tags tokenized like regular text.</p>
<p>At the same time, this shows that reliably learning a complex output structure from relatively few examples is still not a trivial task for large language models of the considered size (ca. 220 million parameters for <monospace>flan-t5-base</monospace> and ca. 160 million parameters for <monospace>led-base-16384</monospace>). How the size of the used large language model affects this part of the performance remains to be investigated in a more structured way in future work. Our ablation study with <monospace>flan-t5-large</monospace> presented in the previous section suggests that the benefit of using grammar-constrained decoding remains similarly large even when the model size is increased. However, Geng et al. (<xref ref-type="bibr" rid="B14">2023</xref>) use much larger models and appear to get very promising results, such that the benefit of grammar-constrained decoding might be smaller when fine-tuning larger models. These results are also in agreement with Sun et al. (<xref ref-type="bibr" rid="B42">2023</xref>), although they primarily explored the validity of SMT solver formulas when varying the model temperature, whereas this work evaluated a domain-specific grammar and varied different architectural properties, yielding a much larger validity difference between constrained and unconstrained decoding. In summary, this illustrates that large language models actually struggle to adhere to strictly constrained output structures in practice, such that guarantees as provided by grammar-constrained decoding are useful, especially for smaller models or rare output structures, but also for larger models.</p>
<p>All in all, this emphasizes that grammar-constrained decoding appears to be beneficial when fine-tuning in low-resource settings for structured information extraction tasks.</p>
<p>Regarding research question 2, the presented results suggest that pointer generators are not beneficial for structured information extraction tasks in low-resource environments when combined with grammar-constrained decoding. They therefore do not seem to be a promising path for future research, at least with the considered models, dataset size, output syntax, and overall architecture. In contrast, Lin et al. (<xref ref-type="bibr" rid="B22">2020</xref>) successfully generated SQL statements from text using a BERT model in combination with pointer generators, which indicates that pointer generators can nevertheless be useful in the context of structure generation. However, it is not clear which aspect made the pointer generator approach fail in our case: the dataset size, the choice of models, the (compared to SQL) rare output structure, or something else. Therefore, a deeper analysis of the errors made by the pointer models, and of how our setting differs from the successful one of Lin et al. (<xref ref-type="bibr" rid="B22">2020</xref>), remains to be explored in future work.</p>
<p>All in all, this indicates that pointer generator-like behavior seems to hurt performance in structured information extraction tasks in low-resource settings when combined with grammar-constrained decoding, instead of improving it.</p>
<p>Considering research question 3, i.e., the two attention aggregation methods, the pointer generator-like behavior with the maximum used for attention aggregation (<monospace>ptr-max</monospace>) works better for the (otherwise worse-performing) <monospace>led-base-16384</monospace> base model than for <monospace>flan-t5-base</monospace>. The pointer generator models with summing as the aggregation function (<monospace>ptr-sum</monospace>), in contrast, perform comparably well in combination with both base models, although slightly worse in absolute numbers. It is not clear which properties of the architecture cause these differences, both between the pointer models and the <monospace>basic</monospace> model as well as between <monospace>ptr-max</monospace> and <monospace>ptr-sum</monospace>, such that this remains to be investigated in future work.</p>
<p>All in all, this means there is no clear winner between the maximum and sum functions as attention aggregation methods for pointer generator-like models, and the choice appears to depend on the used base model as well. Overall, the maximum aggregation function achieved the best scores for both datasets.</p>
</sec>
<sec sec-type="conclusions" id="s7">
<title>7 Conclusion</title>
<p>In this work, we have presented a grammar-constrained decoding approach for structured information extraction with fine-tuned generative large language models. Our sequence-to-sequence models predict complex output structures, consisting of nested templates with both textual slots as well as slots again containing templates. The chosen base models <monospace>google/flan-t5-base</monospace> and <monospace>allenai/led-base-16384</monospace> have been evaluated in multiple configurations, i.e., with and without support by grammar-constrained decoding as well as with two kinds of supporting pointer generator-like behavior.</p>
<p>We have instantiated the model specifically for PICO element extraction from randomized controlled trials in combination with a domain-specific grammar for that purpose and evaluated all different model configurations w.r.t. two diseases, namely type 2 diabetes and glaucoma.</p>
<p>In summary, our results indicate that grammar-constrained decoding can substantially increase the model performance in low-resource settings for structured information extraction tasks (research question 1) and that pointer generator-like behavior appears not to be beneficial in the considered settings, with varying intensities of performance degradation depending on the model and the chosen attention aggregation function (research question 2). Furthermore, the best attention aggregation method appears to depend on the used model and, in total, the maximum function achieves the best results for both datasets (research question 3).</p>
<p>Evaluating in a structured way how the size of the large language model affects the performance benefit of grammar-constrained decoding, as well as investigating the reasons for the generally poor performance of both pointer generator models <monospace>ptr-max</monospace> and <monospace>ptr-sum</monospace>, are just a few of the many open questions that remain to be investigated in future work.</p>
</sec>
</body>
<back>
<sec sec-type="data-availability" id="s8">
<title>Data availability statement</title>
<p>The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: <ext-link ext-link-type="uri" xlink:href="https://zenodo.org/doi/10.5281/zenodo.10419785">https://zenodo.org/doi/10.5281/zenodo.10419785</ext-link>. The code used in this work has been published as (Schmidt and Cimiano, <xref ref-type="bibr" rid="B34">2024</xref>).</p>
</sec>
<sec sec-type="author-contributions" id="s9">
<title>Author contributions</title>
<p>DS: Conceptualization, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing &#x02013; original draft, Writing &#x02013; review &#x00026; editing. PC: Funding acquisition, Project administration, Supervision, Writing &#x02013; review &#x00026; editing.</p>
</sec>
<sec sec-type="funding-information" id="s10">
<title>Funding</title>
<p>The author(s) declare financial support was received for the research, authorship, and/or publication of this article. The research of David M. Schmidt is funded by the Ministry of Culture and Science of the State of North Rhine-Westphalia under the grant no NW21-059A (SAIL). The research of Philipp Cimiano is partially funded by the Ministry of Culture and Science of the State of North Rhine-Westphalia under the grant no NW21-059A (SAIL). We acknowledge the financial support of the German Research Foundation (DFG) and the Open Access Publication Fund of Bielefeld University for the article processing charge.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s11">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec><sec sec-type="supplementary-material" id="s12">
<title>Supplementary material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/frai.2024.1406857/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/frai.2024.1406857/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Data_Sheet_1.pdf" id="SM1" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<fn-group>
<title>Abbreviations</title>
<fn fn-type="abbr"><p>C-TrO, clinical trial ontology; CFG, context-free grammar; EBM, evidence-based medicine; GT, ground truth; IE, information extraction; LED, longformer-encoder-decoder; MAD, mean absolute deviation; PICO, patient, intervention, comparison, outcomes; RCT, randomized controlled trial; SQL, structured query language; T5, text-to-text transfer transformer.</p></fn></fn-group>
<fn-group>
<fn id="fn0001"><p><sup>1</sup><ext-link ext-link-type="uri" xlink:href="https://huggingface.co/google/flan-t5-base">https://huggingface.co/google/flan-t5-base</ext-link>.</p></fn>
<fn id="fn0002"><p><sup>2</sup><ext-link ext-link-type="uri" xlink:href="https://huggingface.co/allenai/led-base-16384">https://huggingface.co/allenai/led-base-16384</ext-link>.</p></fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Abacha</surname> <given-names>A. B.</given-names></name> <name><surname>Demner-Fushman</surname> <given-names>D.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;On the summarization of consumer health questions,&#x0201D;</article-title> in <italic>Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</italic> (Florence: ACL), <fpage>2228</fpage>&#x02013;<lpage>2234</lpage>. <pub-id pub-id-type="doi">10.18653/v1/P19-1215</pub-id></citation></ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Abaho</surname> <given-names>M.</given-names></name> <name><surname>Bollegala</surname> <given-names>D.</given-names></name> <name><surname>Williamson</surname> <given-names>P.</given-names></name> <name><surname>Dodd</surname> <given-names>S.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Detect and classify - joint span detection and classification for health outcomes,&#x0201D;</article-title> in <italic>Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing</italic>, eds. M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih (Punta Cana: Association for Computational Linguistics), <fpage>8709</fpage>&#x02013;<lpage>8721</lpage>. <pub-id pub-id-type="doi">10.18653/v1/2021.emnlp-main.686</pub-id></citation></ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Abaho</surname> <given-names>M.</given-names></name> <name><surname>Bollegala</surname> <given-names>D.</given-names></name> <name><surname>Williamson</surname> <given-names>P.</given-names></name> <name><surname>Dodd</surname> <given-names>S.</given-names></name></person-group> (<year>2022a</year>). <article-title>&#x0201C;Position-based prompting for health outcome generation,&#x0201D;</article-title> in <italic>Proceedings of the 21st Workshop on Biomedical Language Processing</italic>, eds. D. Demner-Fushman, K. B. Cohen, S. Ananiadou, and J. Tsujii (Dublin: Association for Computational Linguistics), <fpage>26</fpage>&#x02013;<lpage>36</lpage>. <pub-id pub-id-type="doi">10.18653/v1/2022.bionlp-1.3</pub-id></citation></ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Abaho</surname> <given-names>M.</given-names></name> <name><surname>Bollegala</surname> <given-names>D.</given-names></name> <name><surname>Williamson</surname> <given-names>P. R.</given-names></name> <name><surname>Dodd</surname> <given-names>S.</given-names></name></person-group> (<year>2022b</year>). <article-title>Assessment of contextualised representations in detecting outcome phrases in clinical trials</article-title>. <source>arXiv</source> [Preprint]. arXiv:2203.03547. <pub-id pub-id-type="doi">10.48550/arXiv.2203.03547</pub-id></citation>
</ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Akiba</surname> <given-names>T.</given-names></name> <name><surname>Sano</surname> <given-names>S.</given-names></name> <name><surname>Yanase</surname> <given-names>T.</given-names></name> <name><surname>Ohta</surname> <given-names>T.</given-names></name> <name><surname>Koyama</surname> <given-names>M.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Optuna: a next-generation hyperparameter optimization framework,&#x0201D;</article-title> in <italic>Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</italic> (New York, NY: ACM). <pub-id pub-id-type="doi">10.1145/3292500.3330701</pub-id></citation>
</ref>
<ref id="B6">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Anderson</surname> <given-names>P.</given-names></name> <name><surname>Fernando</surname> <given-names>B.</given-names></name> <name><surname>Johnson</surname> <given-names>M.</given-names></name> <name><surname>Gould</surname> <given-names>S.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Guided open vocabulary image captioning with constrained beam search,&#x0201D;</article-title> in <source>Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source> (<publisher-loc>Copenhagen</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>936</fpage>&#x02013;<lpage>945</lpage>. <pub-id pub-id-type="doi">10.18653/v1/D17-1098</pub-id></citation></ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Beltagy</surname> <given-names>I.</given-names></name> <name><surname>Peters</surname> <given-names>M. E.</given-names></name> <name><surname>Cohan</surname> <given-names>A.</given-names></name></person-group> (<year>2020</year>). <article-title>Longformer: the long-document transformer</article-title>. <source>arXiv</source> [Preprint]. arXiv:2004.05150. <pub-id pub-id-type="doi">10.48550/arXiv.2004.05150</pub-id></citation>
</ref>
<ref id="B8">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Cao</surname> <given-names>N. D.</given-names></name> <name><surname>Izacard</surname> <given-names>G.</given-names></name> <name><surname>Riedel</surname> <given-names>S.</given-names></name> <name><surname>Petroni</surname> <given-names>F.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Autoregressive entity retrieval,&#x0201D;</article-title> in <source>9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3&#x02013;7, 2021</source> (<publisher-loc>OpenReview.net</publisher-loc>). Available at: <ext-link ext-link-type="uri" xlink:href="https://openreview.net/forum?id=5k8F6UU39V">https://openreview.net/forum?id=5k8F6UU39V</ext-link></citation>
</ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chung</surname> <given-names>H. W.</given-names></name> <name><surname>Hou</surname> <given-names>L.</given-names></name> <name><surname>Longpre</surname> <given-names>S.</given-names></name> <name><surname>Zoph</surname> <given-names>B.</given-names></name> <name><surname>Tay</surname> <given-names>Y.</given-names></name> <name><surname>Fedus</surname> <given-names>W.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>Scaling instruction-finetuned language models</article-title>. <source>arXiv</source> [Preprint]. arXiv:2210.11416. <pub-id pub-id-type="doi">10.48550/arXiv.2210.11416</pub-id></citation>
</ref>
<ref id="B10">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Deaton</surname> <given-names>J.</given-names></name> <name><surname>Jacobs</surname> <given-names>A.</given-names></name> <name><surname>Kenealy</surname> <given-names>K.</given-names></name></person-group> (<year>2019</year>). <source>Transformers and Pointer-Generator Networks for Abstractive Summarization</source>. Stanford. Available at: <ext-link ext-link-type="uri" xlink:href="https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/reports/custom/15784595.pdf">https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/reports/custom/15784595.pdf</ext-link></citation>
</ref>
<ref id="B11">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>DeRemer</surname> <given-names>F. L.</given-names></name></person-group> (<year>1969</year>). <source>Practical translators for LR (k) languages</source> (PhD thesis). <publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>Massachusetts Institute of Technology</publisher-name>.</citation>
</ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dhrangadhariya</surname> <given-names>A.</given-names></name> <name><surname>Aguilar</surname> <given-names>G.</given-names></name> <name><surname>Solorio</surname> <given-names>T.</given-names></name> <name><surname>Hilfiker</surname> <given-names>R.</given-names></name> <name><surname>M&#x000FC;ller</surname> <given-names>H.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;End-to-end fine-grained neural entity recognition of patients, interventions, outcomes,&#x0201D;</article-title> in <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction, Volume 12880</source>, eds. K. S. Candan, B. Ionescu, L. Goeuriot, B. Larsen, H. M&#x000FC;ller, A. Joly, et al. (Cham: Springer International Publishing), <fpage>65</fpage>&#x02013;<lpage>77</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-030-85251-1_6</pub-id></citation>
</ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ganguly</surname> <given-names>D.</given-names></name> <name><surname>Gleize</surname> <given-names>M.</given-names></name> <name><surname>Hou</surname> <given-names>Y.</given-names></name> <name><surname>Jochim</surname> <given-names>C.</given-names></name> <name><surname>Bonin</surname> <given-names>F.</given-names></name> <name><surname>Pascale</surname> <given-names>A.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>Outcome prediction from behaviour change intervention evaluations using a combination of node and word embedding</article-title>. <source>AMIA Annu. Symp. Proc</source>. <volume>2021</volume>, <fpage>486</fpage>&#x02013;<lpage>495</lpage>. <pub-id pub-id-type="pmid">35308987</pub-id></citation></ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Geng</surname> <given-names>S.</given-names></name> <name><surname>Josifoski</surname> <given-names>M.</given-names></name> <name><surname>Peyrard</surname> <given-names>M.</given-names></name> <name><surname>West</surname> <given-names>R.</given-names></name></person-group> (<year>2023</year>). <article-title>&#x0201C;Grammar-constrained decoding for structured NLP tasks without finetuning,&#x0201D;</article-title> in <italic>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</italic> (Singapore: Association for Computational Linguistics), <fpage>10932</fpage>&#x02013;<lpage>10952</lpage>. <pub-id pub-id-type="doi">10.18653/v1/2023.emnlp-main.674</pub-id></citation></ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>George</surname> <given-names>A. S.</given-names></name> <name><surname>George</surname> <given-names>A. H.</given-names></name> <name><surname>Martin</surname> <given-names>A. G.</given-names></name></person-group> (<year>2023</year>). <article-title>The environmental impact of AI: a case study of water consumption by chat gpt</article-title>. <source>Partners Universal Int. Innov. J</source>. <volume>1</volume>, <fpage>97</fpage>&#x02013;<lpage>104</lpage>. <pub-id pub-id-type="doi">10.5281/zenodo.7855594</pub-id></citation>
</ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Harris</surname> <given-names>C. R.</given-names></name> <name><surname>Millman</surname> <given-names>K. J.</given-names></name> <name><surname>van der Walt</surname> <given-names>S. J.</given-names></name> <name><surname>Gommers</surname> <given-names>R.</given-names></name> <name><surname>Virtanen</surname> <given-names>P.</given-names></name> <name><surname>Cournapeau</surname> <given-names>D.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>Array programming with NumPy</article-title>. <source>Nature</source> <volume>585</volume>, <fpage>357</fpage>&#x02013;<lpage>362</lpage>. <pub-id pub-id-type="doi">10.1038/s41586-020-2649-2</pub-id><pub-id pub-id-type="pmid">32939066</pub-id></citation></ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>K.-H.</given-names></name> <name><surname>Yang</surname> <given-names>M.</given-names></name> <name><surname>Peng</surname> <given-names>N.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Biomedical event extraction with hierarchical knowledge graphs,&#x0201D;</article-title> in <source>Findings of the Association for Computational Linguistics: EMNLP 2020</source>, eds. T. Cohn, Y. He, and Y. Liu (Association for Computational Linguistics), <fpage>1277</fpage>&#x02013;<lpage>1285</lpage>. <pub-id pub-id-type="doi">10.18653/v1/2020.findings-emnlp.114</pub-id></citation></ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jiang</surname> <given-names>Y.</given-names></name> <name><surname>Kavuluru</surname> <given-names>R.</given-names></name></person-group> (<year>2023</year>). <article-title>End-to-end <italic>n</italic>-ary relation extraction for combination drug therapies</article-title>. <source>arXiv</source> [Preprint]. arXiv:2303.16886. <pub-id pub-id-type="doi">10.48550/arXiv.2303.16886</pub-id></citation>
</ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kim</surname> <given-names>J.-D.</given-names></name> <name><surname>Ohta</surname> <given-names>T.</given-names></name> <name><surname>Tateisi</surname> <given-names>Y.</given-names></name> <name><surname>Tsujii</surname> <given-names>J.</given-names></name></person-group> (<year>2003</year>). <article-title>Genia corpus&#x02013;a semantically annotated corpus for bio-textmining</article-title>. <source>Bioinformatics</source> 19(suppl_1):i180-i182. <pub-id pub-id-type="doi">10.1093/bioinformatics/btg1023</pub-id><pub-id pub-id-type="pmid">12855455</pub-id></citation></ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kim</surname> <given-names>Y.</given-names></name> <name><surname>Meystre</surname> <given-names>S. M.</given-names></name></person-group> (<year>2020</year>). <article-title>Ensemble method&#x02013;based extraction of medication and related information from clinical texts</article-title>. <source>J. Am. Med. Inform. Assoc</source>. <volume>27</volume>, <fpage>31</fpage>&#x02013;<lpage>38</lpage>. <pub-id pub-id-type="doi">10.1093/jamia/ocz100</pub-id><pub-id pub-id-type="pmid">31282932</pub-id></citation></ref>
<ref id="B21">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Levenshtein</surname> <given-names>V. I.</given-names></name></person-group> (<year>1966</year>). <article-title>&#x0201C;Binary codes capable of correcting deletions, insertions, and reversals,&#x0201D;</article-title> in <source>Soviet Physics Doklady, Volume 10</source> (<publisher-loc>Soviet Union</publisher-loc>), <fpage>707</fpage>&#x02013;<lpage>710</lpage>.</citation>
</ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>X. V.</given-names></name> <name><surname>Socher</surname> <given-names>R.</given-names></name> <name><surname>Xiong</surname> <given-names>C.</given-names></name></person-group> (<year>2020</year>). <article-title>Bridging textual and tabular data for cross-domain text-to-SQL semantic parsing</article-title>. <source>arXiv</source> [Preprint]. arXiv:2012.12627. <pub-id pub-id-type="doi">10.48550/arXiv.2012.12627</pub-id></citation>
</ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lu</surname> <given-names>Y.</given-names></name> <name><surname>Lin</surname> <given-names>H.</given-names></name> <name><surname>Xu</surname> <given-names>J.</given-names></name> <name><surname>Han</surname> <given-names>X.</given-names></name> <name><surname>Tang</surname> <given-names>J.</given-names></name> <name><surname>Li</surname> <given-names>A.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>Text2Event: controllable sequence-to-structure generation for end-to-end event extraction</article-title>. <source>arXiv</source> [Preprint]. arXiv:2106.09232. <pub-id pub-id-type="doi">10.48550/arXiv.2106.09232</pub-id></citation></ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mohan</surname> <given-names>S.</given-names></name> <name><surname>Li</surname> <given-names>D.</given-names></name></person-group> (<year>2019</year>). <article-title>MedMentions: a large biomedical corpus annotated with UMLS concepts</article-title>. <source>arXiv</source> [Preprint]. arXiv:1902.09476. <pub-id pub-id-type="doi">10.48550/arXiv.1902.09476</pub-id></citation>
</ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pan</surname> <given-names>L.</given-names></name> <name><surname>Albalak</surname> <given-names>A.</given-names></name> <name><surname>Wang</surname> <given-names>X.</given-names></name> <name><surname>Wang</surname> <given-names>W. Y.</given-names></name></person-group> (<year>2023</year>). <article-title>Logic-LM: empowering large language models with symbolic solvers for faithful logical reasoning</article-title>. <source>arXiv</source> [Preprint]. arXiv:2305.12295. <pub-id pub-id-type="doi">10.48550/arXiv.2305.12295</pub-id></citation>
</ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Papanikolaou</surname> <given-names>Y.</given-names></name> <name><surname>Staib</surname> <given-names>M.</given-names></name> <name><surname>Grace</surname> <given-names>J. J.</given-names></name> <name><surname>Bennett</surname> <given-names>F.</given-names></name></person-group> (<year>2022</year>). <article-title>&#x0201C;Slot filling for biomedical information extraction,&#x0201D;</article-title> in <source>Proceedings of the 21st Workshop on Biomedical Language Processing, BioNLP&#x00040;ACL 2022, Dublin, Ireland, May 26, 2022</source>, eds. D. Demner-Fushman, K. B. Cohen, S. Ananiadou, and J. Tsujii (Dublin: Association for Computational Linguistics), <fpage>82</fpage>&#x02013;<lpage>90</lpage>. <pub-id pub-id-type="doi">10.18653/v1/2022.bionlp-1.7</pub-id></citation></ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Peng</surname> <given-names>Y.</given-names></name> <name><surname>Yan</surname> <given-names>S.</given-names></name> <name><surname>Lu</surname> <given-names>Z.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets,&#x0201D;</article-title> in <source>Proceedings of the 2019 Workshop on Biomedical Natural Language Processing (BioNLP 2019)</source> (Florence). <pub-id pub-id-type="doi">10.18653/v1/W19-5006</pub-id></citation></ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ramponi</surname> <given-names>A.</given-names></name> <name><surname>Van Der Goot</surname> <given-names>R.</given-names></name> <name><surname>Lombardo</surname> <given-names>R.</given-names></name> <name><surname>Plank</surname> <given-names>B.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Biomedical event extraction as sequence labeling,&#x0201D;</article-title> in <source>Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source> (Association for Computational Linguistics), <fpage>5357</fpage>&#x02013;<lpage>5367</lpage>. <pub-id pub-id-type="doi">10.18653/v1/2020.emnlp-main.431</pub-id></citation></ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ranta</surname> <given-names>A.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Grammatical framework: an interlingual grammar formalism,&#x0201D;</article-title> in <source>Proceedings of the 14th International Conference on Finite-State Methods and Natural Language Processing</source>, eds. H. Vogler, and A. Maletti (Dresden: Association for Computational Linguistics), <fpage>1</fpage>&#x02013;<lpage>2</lpage>. <pub-id pub-id-type="doi">10.18653/v1/W19-3101</pub-id></citation></ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Richardson</surname> <given-names>W. S.</given-names></name> <name><surname>Wilson</surname> <given-names>M. C.</given-names></name> <name><surname>Nishikawa</surname> <given-names>J.</given-names></name> <name><surname>Hayward</surname> <given-names>R. S.</given-names></name></person-group> (<year>1995</year>). <article-title>The well-built clinical question: a key to evidence-based decisions</article-title>. <source>ACP J. Club</source> <volume>123</volume>, <fpage>A12</fpage>&#x02013;<lpage>13</lpage>. <pub-id pub-id-type="doi">10.7326/ACPJC-1995-123-3-A12</pub-id></citation>
</ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Roy</surname> <given-names>S.</given-names></name> <name><surname>Thomson</surname> <given-names>S.</given-names></name> <name><surname>Chen</surname> <given-names>T.</given-names></name> <name><surname>Shin</surname> <given-names>R.</given-names></name> <name><surname>Pauls</surname> <given-names>A.</given-names></name> <name><surname>Eisner</surname> <given-names>J.</given-names></name> <etal/></person-group>. (<year>2024</year>). <article-title>BenchCLAMP: a benchmark for evaluating language models on syntactic and semantic parsing</article-title>. <source>arXiv</source> [Preprint]. arXiv:2206.10668. <pub-id pub-id-type="doi">10.48550/arXiv.2206.10668</pub-id></citation>
</ref>
<ref id="B32">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Sanchez-Graillet</surname> <given-names>O.</given-names></name> <name><surname>Cimiano</surname> <given-names>P.</given-names></name> <name><surname>Witte</surname> <given-names>C.</given-names></name> <name><surname>Ell</surname> <given-names>B.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;C-TrO: an ontology for summarization and aggregation of the level of evidence in clinical trials,&#x0201D;</article-title> in <source>Proc. of the 5th Joint Ontology Workshops (JOWO): Ontologies and Data in the Life Sciences</source> (Graz). Available at: <ext-link ext-link-type="uri" xlink:href="https://ceur-ws.org/Vol-2518/paper-ODLS7.pdf">https://ceur-ws.org/Vol-2518/paper-ODLS7.pdf</ext-link></citation>
</ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schardt</surname> <given-names>C.</given-names></name> <name><surname>Adams</surname> <given-names>M. B.</given-names></name> <name><surname>Owens</surname> <given-names>T.</given-names></name> <name><surname>Keitz</surname> <given-names>S.</given-names></name> <name><surname>Fontelo</surname> <given-names>P.</given-names></name></person-group> (<year>2007</year>). <article-title>Utilization of the PICO framework to improve searching PubMed for clinical questions</article-title>. <source>BMC Med. Inform. Decis. Mak</source>. <volume>7</volume>, <fpage>1</fpage>&#x02013;<lpage>6</lpage>. <pub-id pub-id-type="doi">10.1186/1472-6947-7-16</pub-id><pub-id pub-id-type="pmid">17573961</pub-id></citation></ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schmidt</surname> <given-names>D. M.</given-names></name> <name><surname>Cimiano</surname> <given-names>P.</given-names></name></person-group> (<year>2024</year>). <source>ag-sc/clinical-trial-ie-gcd: v1.1</source>. <pub-id pub-id-type="doi">10.5281/zenodo.10869280</pub-id></citation>
</ref>
<ref id="B35">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Schmidt</surname> <given-names>L.</given-names></name> <name><surname>Weeds</surname> <given-names>J.</given-names></name> <name><surname>Higgins</surname> <given-names>J. P. T.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Data mining in clinical trial text: transformers for classification and question answering tasks,&#x0201D;</article-title> in <source>Proceedings of the 13th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2020) - Volume 5: HEALTHINF, Valletta, Malta, February 24-26, 2020</source>, eds. F. Cabitza, A. L. N. Fred, and H. Gamboa (<publisher-loc>Setubal</publisher-loc>: <publisher-name>SCITEPRESS</publisher-name>), <fpage>83</fpage>&#x02013;<lpage>94</lpage>. <pub-id pub-id-type="doi">10.5220/0008945700002513</pub-id></citation>
</ref>
<ref id="B36">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Scholak</surname> <given-names>T.</given-names></name> <name><surname>Schucher</surname> <given-names>N.</given-names></name> <name><surname>Bahdanau</surname> <given-names>D.</given-names></name></person-group> (<year>2021</year>). <article-title>PICARD: parsing incrementally for constrained auto-regressive decoding from language models</article-title>. <source>arXiv</source> [Preprint]. arXiv:2109.05093. <pub-id pub-id-type="doi">10.48550/arXiv.2109.05093</pub-id></citation>
</ref>
<ref id="B37">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>See</surname> <given-names>A.</given-names></name> <name><surname>Liu</surname> <given-names>P. J.</given-names></name> <name><surname>Manning</surname> <given-names>C. D.</given-names></name></person-group> (<year>2017</year>). <article-title>Get to the point: summarization with pointer-generator networks</article-title>. <source>arXiv</source> [Preprint]. arXiv:1704.04368. <pub-id pub-id-type="doi">10.48550/arXiv.1704.04368</pub-id></citation>
</ref>
<ref id="B38">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shinan</surname> <given-names>E.</given-names></name></person-group> (<year>2024</year>). <source>Lark parsing library and toolkit</source>.</citation>
</ref>
<ref id="B39">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Stengel-Eskin</surname> <given-names>E.</given-names></name> <name><surname>Rawlins</surname> <given-names>K.</given-names></name> <name><surname>Van Durme</surname> <given-names>B.</given-names></name></person-group> (<year>2024</year>). <article-title>Zero and few-shot semantic parsing with ambiguous inputs</article-title>. <source>arXiv</source> [Preprint]. arXiv:2306.00824. <pub-id pub-id-type="doi">10.48550/arXiv.2306.00824</pub-id></citation>
</ref>
<ref id="B40">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Strubell</surname> <given-names>E.</given-names></name> <name><surname>Ganesh</surname> <given-names>A.</given-names></name> <name><surname>McCallum</surname> <given-names>A.</given-names></name></person-group> (<year>2019</year>). <article-title>Energy and policy considerations for deep learning in NLP</article-title>. <source>arXiv</source> [Preprint]. arXiv:1906.02243. <pub-id pub-id-type="doi">10.48550/arXiv.1906.02243</pub-id></citation>
</ref>
<ref id="B41">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Stylianou</surname> <given-names>N.</given-names></name> <name><surname>Kosmoliaptsis</surname> <given-names>P.</given-names></name> <name><surname>Vlahavas</surname> <given-names>I.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Improved biomedical entity recognition via longer context modeling,&#x0201D;</article-title> in <source>Artificial Intelligence Applications and Innovations, Volume 627</source>, eds. I. Maglogiannis, J. Macintyre, and L. Iliadis (Cham: Springer International Publishing), <fpage>45</fpage>&#x02013;<lpage>56</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-030-79150-6_4</pub-id></citation>
</ref>
<ref id="B42">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sun</surname> <given-names>M.</given-names></name> <name><surname>Yang</surname> <given-names>Y.</given-names></name> <name><surname>Wang</surname> <given-names>Y.</given-names></name> <name><surname>Wen</surname> <given-names>M.</given-names></name> <name><surname>Jia</surname> <given-names>H.</given-names></name> <name><surname>Zhou</surname> <given-names>Y.</given-names></name> <etal/></person-group>. (<year>2023</year>). <article-title>&#x0201C;SMT solver validation empowered by large pre-trained language models,&#x0201D;</article-title> in <italic>2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE)</italic> (<publisher-loc>Luxembourg</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1288</fpage>&#x02013;<lpage>1300</lpage>. <pub-id pub-id-type="doi">10.1109/ASE56229.2023.00180</pub-id></citation>
</ref>
<ref id="B43">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Trieu</surname> <given-names>H.-L.</given-names></name> <name><surname>Tran</surname> <given-names>T. T.</given-names></name> <name><surname>Duong</surname> <given-names>K. N. A.</given-names></name> <name><surname>Nguyen</surname> <given-names>A.</given-names></name> <name><surname>Miwa</surname> <given-names>M.</given-names></name> <name><surname>Ananiadou</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>DeepEventMine: end-to-end neural nested event extraction from biomedical texts</article-title>. <source>Bioinformatics</source> <volume>36</volume>, <fpage>4910</fpage>&#x02013;<lpage>4917</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btaa540</pub-id><pub-id pub-id-type="pmid">33141147</pub-id></citation></ref>
<ref id="B44">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Vaswani</surname> <given-names>A.</given-names></name> <name><surname>Shazeer</surname> <given-names>N.</given-names></name> <name><surname>Parmar</surname> <given-names>N.</given-names></name> <name><surname>Uszkoreit</surname> <given-names>J.</given-names></name> <name><surname>Jones</surname> <given-names>L.</given-names></name> <name><surname>Gomez</surname> <given-names>A. N.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>&#x0201C;Attention is all you need,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017</source> (<publisher-loc>Long Beach, CA</publisher-loc>), <fpage>5998</fpage>&#x02013;<lpage>6008</lpage>. <pub-id pub-id-type="doi">10.48550/arXiv.1706.03762</pub-id></citation>
</ref>
<ref id="B45">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>X. D.</given-names></name> <name><surname>Weber</surname> <given-names>L.</given-names></name> <name><surname>Leser</surname> <given-names>U.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Biomedical event extraction as multi-turn question answering,&#x0201D;</article-title> in <source>Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis</source> (Association for Computational Linguistics), <fpage>88</fpage>&#x02013;<lpage>96</lpage>. <pub-id pub-id-type="doi">10.18653/v1/2020.louhi-1.10</pub-id></citation></ref>
<ref id="B46">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Whitton</surname> <given-names>J.</given-names></name> <name><surname>Hunter</surname> <given-names>A.</given-names></name></person-group> (<year>2023</year>). <article-title>Automated tabulation of clinical trial results: a joint entity and relation extraction approach with transformer-based language representations</article-title>. <source>Artif. Intell. Med</source>. <volume>144</volume>:<fpage>102661</fpage>. <pub-id pub-id-type="doi">10.1016/j.artmed.2023.102661</pub-id><pub-id pub-id-type="pmid">37783549</pub-id></citation></ref>
<ref id="B47">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Witte</surname> <given-names>C.</given-names></name> <name><surname>Cimiano</surname> <given-names>P.</given-names></name></person-group> (<year>2022</year>). <article-title>&#x0201C;Intra-template entity compatibility based slot-filling for clinical trial information extraction,&#x0201D;</article-title> in <source>Proceedings of the 21st Workshop on Biomedical Language Processing</source> (<publisher-loc>Dublin</publisher-loc>), <fpage>178</fpage>&#x02013;<lpage>192</lpage>. <pub-id pub-id-type="doi">10.18653/v1/2022.bionlp-1.18</pub-id></citation></ref>
<ref id="B48">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Witte</surname> <given-names>C.</given-names></name> <name><surname>Schmidt</surname> <given-names>D. M.</given-names></name> <name><surname>Cimiano</surname> <given-names>P.</given-names></name></person-group> (<year>2024</year>). <article-title>Comparing generative and extractive approaches to information extraction from abstracts describing randomized clinical trials</article-title>. <source>J. Biomed. Semant</source>. <volume>15</volume>:<fpage>3</fpage>. <pub-id pub-id-type="doi">10.1186/s13326-024-00305-2</pub-id><pub-id pub-id-type="pmid">38654304</pub-id></citation></ref>
<ref id="B49">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>T.</given-names></name> <name><surname>Yu</surname> <given-names>Y.</given-names></name> <name><surname>Mei</surname> <given-names>J.</given-names></name> <name><surname>Tang</surname> <given-names>Z.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Li</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>Unlocking the power of deep PICO extraction: step-wise medical NER identification</article-title>. <source>arXiv</source> [Preprint]. arXiv:2005.06601. <pub-id pub-id-type="doi">10.48550/arXiv.2005.06601</pub-id></citation>
</ref>
<ref id="B50">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhu</surname> <given-names>L.</given-names></name> <name><surname>Zheng</surname> <given-names>H.</given-names></name></person-group> (<year>2020</year>). <article-title>Biomedical event extraction with a novel combination strategy based on hybrid deep neural networks</article-title>. <source>BMC Bioinformatics</source> <volume>21</volume>:<fpage>47</fpage>. <pub-id pub-id-type="doi">10.1186/s12859-020-3376-2</pub-id><pub-id pub-id-type="pmid">32028883</pub-id></citation></ref>
</ref-list>
</back>
</article>