<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="2.3" xml:lang="EN">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Oncol.</journal-id>
<journal-title>Frontiers in Oncology</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Oncol.</abbrev-journal-title>
<issn pub-type="epub">2234-943X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fonc.2023.1239570</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Oncology</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Can we predict discordant RECIST 1.1 evaluations in double read clinical trials?</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Beaumont</surname>
<given-names>Hubert</given-names>
</name>
<xref ref-type="author-notes" rid="fn001">
<sup>*</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1317900"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Iannessi</surname>
<given-names>Antoine</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/1960510"/>
</contrib>
</contrib-group>
<aff id="aff1">
<institution>Sciences, Median Technologies</institution>, <addr-line>Valbonne</addr-line>, <country>France</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>Edited by: Jayasree Chakraborty, Memorial Sloan Kettering Cancer Center, United States</p>
</fn>
<fn fn-type="edited-by">
<p>Reviewed by: Joao Santos, Memorial Sloan Kettering Cancer Center, United States; Stefano Trebeschi, The Netherlands Cancer Institute (NKI), Netherlands</p>
</fn>
<fn fn-type="corresp" id="fn001">
<p>*Correspondence: Hubert Beaumont, <email xlink:href="mailto:hubertbeaumont@hotmail.com">hubertbeaumont@hotmail.com</email>
</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>04</day>
<month>10</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="collection">
<year>2023</year>
</pub-date>
<volume>13</volume>
<elocation-id>1239570</elocation-id>
<history>
<date date-type="received">
<day>13</day>
<month>06</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>05</day>
<month>09</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2023 Beaumont and Iannessi</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Beaumont and Iannessi</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<sec>
<title>Background</title>
<p>In lung clinical trials with imaging, blinded independent central review (BICR) with double reads is recommended to reduce evaluation bias, and the Response Evaluation Criteria in Solid Tumors (RECIST) is still widely used. We retrospectively analyzed the inter-reader discrepancy rate over time, the risk factors for discrepancies related to baseline evaluations, and the potential of machine learning to predict inter-reader discrepancies.</p>
</sec>
<sec>
<title>Materials and methods</title>
<p>We retrospectively analyzed five BICR clinical trials for patients on immunotherapy or targeted therapy for lung cancer. Double reads of 1724 patients involving 17 radiologists were performed using RECIST 1.1. We evaluated the rate of discrepancies over time according to four endpoints: progressive disease declared (PDD), date of progressive disease (DOPD), best overall response (BOR), and date of the first response (DOFR). Risk factors associated with discrepancies were analyzed, and two predictive models were evaluated.</p>
</sec>
<sec>
<title>Results</title>
<p>At the end of trials, the discrepancy rates did not differ between trials. On average, the discrepancy rates were 21.0%, 41.0%, 28.8%, and 48.8% for PDD, DOPD, BOR, and DOFR, respectively. Over time, the discrepancy rate was higher for DOFR than DOPD, and the rates increased as the trial progressed, even after accrual was completed. It was rare for readers to not find any disease; for less than 7% of patients, at least one reader selected non-measurable disease only (NTL). Often the readers selected some of their target lesions (TLs) and NTLs in different organs, with ranges of 36.0-57.9% and 60.5-73.5% of patients, respectively. Rarely (4-8.1%) two readers selected all their TLs in different locations. Significant risk factors were different depending on the endpoint and the trial being considered. Prediction performance was poor, although the positive predictive value was higher than 80%. The best classification was obtained with BOR.</p>
</sec>
<sec>
<title>Conclusion</title>
<p>Predicting discordance rates necessitates having knowledge of patient accrual, patient survival, and the probability of discordances over time. In lung cancer trials, although risk factors for inter-reader discrepancies are known, they are weakly significant, and the ability to predict discrepancies from baseline data is limited. To boost prediction accuracy, it would be necessary to enhance baseline-derived features or create new ones, considering other risk factors and looking into optimal reader associations.</p>
</sec>
</abstract>
<kwd-group>
<kwd>clinical trial</kwd>
<kwd>Interobserver variation</kwd>
<kwd>RECIST</kwd>
<kwd>computed tomography</kwd>
<kwd>lung cancer</kwd>
</kwd-group>
<counts>
<fig-count count="5"/>
<table-count count="7"/>
<equation-count count="4"/>
<ref-count count="36"/>
<page-count count="14"/>
<word-count count="7044"/>
</counts>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-in-acceptance</meta-name>
<meta-value>Cancer Imaging and Image-directed Interventions</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec id="s1">
<title>Highlights</title>
<p>In RECIST BICR trials with double reads there is large variability in tumor measurement and localization.</p>
<p>Discrepancy rates can be modeled over time.</p>
<p>Few discrepancies can be predicted from baseline evaluations.</p>
</sec>
<sec id="s2" sec-type="intro">
<label>1</label>
<title>Introduction</title>
<p>In 2004, the Food and Drug Administration recommended double radiology reads for clinical trials with blinded independent central review (BICR) to minimize evaluation bias (<xref ref-type="bibr" rid="B1">1</xref>). Due to the variabilities in observers&#x2019; evaluations, diagnostic results can be discordant (<xref ref-type="bibr" rid="B2">2</xref>). In such situations, a third radiologist, the &#x201c;adjudicator&#x201d;, is required so that a final decision can be made (<xref ref-type="bibr" rid="B3">3</xref>). The rate of discordance (a.k.a. the adjudication rate), which is the number of discordant evaluations out of the total number of patients in the trial, is the preferred high-level indicator that summarizes the overall reliability of assessments of trials (<xref ref-type="bibr" rid="B4">4</xref>). The monitoring of observers&#x2019; variability through the adjudication rate requires a burdensome process, which all stakeholders aim to make cost-effective (<xref ref-type="bibr" rid="B5">5</xref>, <xref ref-type="bibr" rid="B6">6</xref>), with the goal of minimizing inter-reader discordances.</p>
<p>The Response Evaluation Criteria in Solid Tumors (RECIST) (<xref ref-type="bibr" rid="B7">7</xref>) is widely used and accepted by regulatory authorities for evaluating the efficacy of oncology therapies in clinical trials with imaging. The very purpose of RECIST is to assign each patient to one of the classes of response to therapy: progressive disease (PD), stable disease (SD) or responders (partial or complete response [PR, CR]) (<xref ref-type="bibr" rid="B7">7</xref>). When categorized as PD, patients must be withdrawn from the study and their treatment stopped. Depending on the development phase of the drug (<xref ref-type="bibr" rid="B8">8</xref>), different trial endpoints can be derived from the RECIST assessments. Indeed, in phase 2, the study endpoint generally relates to response (responder vs non-responder) while in phase 3, it relates to progression. Each of these trial endpoints has its own statistical features linked to its respective kind of inter-reader discrepancy (KoD). Therefore, the most relevant KoD to monitor can differ from one trial to another.</p>
<p>During trials, patients undergo a sequential radiological RECIST 1.1 evaluation with a probability of discrepancy occurring at each radiological timepoint response (RTPR). We can hypothesize that each of the KoDs has a different likelihood of occurrence over time. We can also assume that &#x201c;at-risk-periods&#x201d; and &#x201c;at-risk-factors&#x201d; exist for discrepancies occurring during patient follow-up. From an operational standpoint, confirming these assumptions would be particularly relevant for BICR trials with double reads (<xref ref-type="bibr" rid="B3">3</xref>, <xref ref-type="bibr" rid="B9">9</xref>).</p>
<p>Clinical trials can often take a long time to complete, during which changes may arise from various sources: readers may become more experienced, tumor shapes may become more complex, or operational parameters may have unintended impacts. Thus, it is essential to assess the broad trends while the trial is in progress to gain insight into any potential changes that may have occurred and their effects on the trial&#x2019;s ultimate results.</p>
<p>The issue around RECIST subjectivity has been widely discussed (<xref ref-type="bibr" rid="B10">10</xref>, <xref ref-type="bibr" rid="B11">11</xref>) and there is consensus on risk factors related to disease evaluation at baseline. These risk factors can be grouped into tumor selection (<xref ref-type="bibr" rid="B12">12</xref>) and quantification (<xref ref-type="bibr" rid="B13">13</xref>). However, it is still unclear which of these risk factors most impact the response and how they interact with each other to allow prediction at baseline as to which reads are more likely to become discrepant at follow-up. In addition, a data-driven approach using machine learning (ML) could be an opportunity to test whether baseline-derived features are predictors of inter-reader discordances.</p>
<p>We conducted a retrospective analysis of inter-reader discrepancies in five BICR RECIST lung trials with double reads with three primary objectives: 1) to investigate the discrepancy rate over time, aiming to identify influential high-level factors, 2) to identify risk factors for discrepancies related to RECIST-derived features at baseline, and 3) to use confirmed risk factors for discrepancies to evaluate predictive models.</p>
</sec>
<sec id="s3">
<label>2</label>
<title>Methods</title>
<sec id="s3_1">
<label>2.1</label>
<title>Study data inclusion criteria</title>
<p>Our retrospective analysis included results from five BICR clinical trials (Trials 1-5) that evaluated immunotherapy or targeted therapy for lung cancer (<xref ref-type="table" rid="T1">
<bold>Table&#xa0;1</bold>
</xref>). The selected BICR trials were conducted between 2017 and 2021 with double reads with adjudication based on RECIST 1.1 guidelines. All data were fully blinded regarding study sponsor, study protocol number, therapeutic agent, subject demographics, and randomization. For these five trials, a total of 1724 patients were expected, involving 17 radiologists. The central reads were all performed using the same radiological reading platform (iSee; Median Technologies, France).</p>
<table-wrap id="T1" position="float">
<label>Table&#xa0;1</label>
<caption>
<p>Description of included trials.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="top" align="left">Trial ID</th>
<th valign="top" align="left">Phase</th>
<th valign="top" align="left">Ranked adjudicated endpoints</th>
<th valign="top" align="left">Number of patients and visits</th>
<th valign="top" align="left">Visit period<break/>(weeks)</th>
<th valign="top" align="left">Average number of visits per patient</th>
<th valign="top" align="left">Per patient, mean and median follow-up duration (days)</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">
<bold>Trial 1</bold>
</td>
<td valign="top" align="left">3</td>
<td valign="top" align="left">DOPD, BOR, DOFR</td>
<td valign="top" align="left">nPatient=333<break/>nTP=2054</td>
<td valign="top" align="left">6 then 12 after 54 weeks</td>
<td valign="top" align="left">6</td>
<td valign="top" align="left">mean: 226 [212; 240]<break/>median: 192</td>
</tr>
<tr>
<td valign="top" align="left">
<bold>Trial 2</bold>
</td>
<td valign="top" align="left">3</td>
<td valign="top" align="left">DOPD, BOR</td>
<td valign="top" align="left">nPatient=493<break/>nTP=6006</td>
<td valign="top" align="left">6 then 12 after 48 weeks</td>
<td valign="top" align="left">12</td>
<td valign="top" align="left">mean: 234 [222; 246]<break/>median: 217</td>
</tr>
<tr>
<td valign="top" align="left">
<bold>Trial 3</bold>
</td>
<td valign="top" align="left">2</td>
<td valign="top" align="left">BOR, DOFR, DOPD</td>
<td valign="top" align="left">nPatient=243<break/>nTP=5260</td>
<td valign="top" align="left">8 weeks</td>
<td valign="top" align="left">14</td>
<td valign="top" align="left">mean: 514 [456; 571]<break/>median: 315</td>
</tr>
<tr>
<td valign="top" align="left">
<bold>Trial 4</bold>
</td>
<td valign="top" align="left">3</td>
<td valign="top" align="left">DOPD</td>
<td valign="top" align="left">nPatient=276<break/>nTP=2796</td>
<td valign="top" align="left">8 then 12 after cycle 19</td>
<td valign="top" align="left">19</td>
<td valign="top" align="left">mean: 516 [479; 553]<break/>median: 506</td>
</tr>
<tr>
<td valign="top" align="left">
<bold>Trial 5</bold>
</td>
<td valign="top" align="left">3</td>
<td valign="top" align="left">DOPD</td>
<td valign="top" align="left">nPatient=379<break/>nTP=2554</td>
<td valign="top" align="left">6 then 9 after 48 weeks</td>
<td valign="top" align="left">6</td>
<td valign="top" align="left">mean: 248 [233; 264]<break/>median: 198</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>In all trials, the indication was locally advanced (&gt;III-B) or metastatic non-small cell lung cancer treated with immunotherapy evaluated with RECIST 1.1. Trials 1 and 5 related to first line treatment, and all trials included a control group except Trial 3. For Trials 1 and 2, measurable disease was an inclusion criterion at baseline and brain lesions were not excluded (selected as NTLs). Adjudication endpoints were DOPD, BOR, and DOFR.</p>
</fn>
<fn>
<p>BOR, best overall response; DOFR, date of first response; DOPD, date of progressive disease; NTL, non-target lesion.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</sec>
<sec id="s3_2">
<label>2.2</label>
<title>Read paradigm</title>
<p>Two independent radiologists performed the review of each image and determined the RTPR in accordance with RECIST 1.1. According to the trial&#x2019;s endpoints (response or progression), specific KoDs triggered adjudications that were pre-defined in an imaging review charter. The adjudicator reviewed the response assessments from the two primary readers and endorsed the outcome of one of them, providing a rationale for the choice.</p>
</sec>
<sec id="s3_3">
<label>2.3</label>
<title>Analysis plan</title>
<p>We considered four KoDs related to the standard endpoints used on trials:</p>
<list list-type="simple">
<list-item>
<p>A. Two related to event or time-to-event of disease progression:</p>
</list-item>
<list-item>
<p>1. Progressive disease declared (PDD): Discrepant PD detection when only one of the readers declared PD during the follow-up.</p>
</list-item>
<list-item>
<p>2. Date of progressive disease (DOPD): Discrepant dates of PD detection as either one reader did not detect PD at all during the follow-up or both readers declared a PD but at different dates.</p>
</list-item>
<list-item>
<p>B. Two related to event or time-to-event of disease response:</p>
</list-item>
<list-item>
<p>3. Best overall response (BOR): Discrepant reporting of the best among all overall responses (CR was best, followed by PR, SD, and then PD) during follow-up. To simplify the analysis, we adopted the definition of BOR from the RECIST group (<xref ref-type="bibr" rid="B7">7</xref>) but without response confirmation and minimal SD duration definition. This definition is also known as best time point response.</p>
</list-item>
<list-item>
<p>4. Date of first response (DOFR): Discrepant dates of first CR or PR detection, as either one of the readers declared no CR or PR during the follow-up or both readers detected a first CR or PR but at different dates.</p>
</list-item>
</list>
<p>One example of patient follow-up with corresponding KoDs is provided in <xref ref-type="table" rid="T2">
<bold>Table&#xa0;2</bold>
</xref>.</p>
<table-wrap id="T2" position="float">
<label>Table&#xa0;2</label>
<caption>
<p>Example of KoDs.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="top" rowspan="2" align="left">Evaluation</th>
<th valign="top" colspan="6" align="center">Timepoints Evaluation</th>
<th valign="top" colspan="4" align="left">Kind of Discrepancy</th>
</tr>
<tr>
<th valign="top" align="left">TP1</th>
<th valign="top" align="left">TP2</th>
<th valign="top" align="left">TP3</th>
<th valign="top" align="left">TP4</th>
<th valign="top" align="left">TP5</th>
<th valign="top" align="left">TP6</th>
<th valign="top" align="left">BOR</th>
<th valign="top" align="left">DOFR</th>
<th valign="top" align="left">PDD</th>
<th valign="top" align="left">DOPD</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">
<italic>Reader 1</italic>
</td>
<td valign="top" align="left">SD</td>
<td valign="top" align="left">PR</td>
<td valign="top" align="left">PR</td>
<td valign="top" align="left">CR</td>
<td valign="top" align="left">PD</td>
<td valign="top" align="left">PD</td>
<td valign="top" align="left">
<italic>CR</italic>
</td>
<td valign="top" align="left">
<italic>TP2</italic>
</td>
<td valign="top" align="left">
<italic>YES</italic>
</td>
<td valign="top" align="left">
<italic>TP5</italic>
</td>
</tr>
<tr>
<td valign="top" align="left">
<italic>Reader 2</italic>
</td>
<td valign="top" align="left">SD</td>
<td valign="top" align="left">SD</td>
<td valign="top" align="left">PR</td>
<td valign="top" align="left">PR</td>
<td valign="top" align="left">PR</td>
<td valign="top" align="left">PR</td>
<td valign="top" align="left">
<italic>PR</italic>
</td>
<td valign="top" align="left">
<italic>TP3</italic>
</td>
<td valign="top" align="left">
<italic>NO</italic>
</td>
<td valign="top" align="left">
<italic>NA</italic>
</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>During double read evaluations by Reader 1 and Reader 2 over six time points, the discrepant values of the four KoDs were reported in the rightmost columns.</p>
</fn>
<fn>
<p>BOR, best overall response; CR, complete response; DOFR, date of first response; DOPD, date of progressive disease; NA, not applicable; PD, progressive disease; PDD, progressive disease declared; PR, partial response; SD, stable disease; TP, time point.</p>
</fn>
</table-wrap-foot>
</table-wrap>
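<p>As an illustration only, the four KoD definitions can be sketched in a few lines of Python. This is a minimal sketch and not the software used in the study: the function names, the 1-based time point indexing, and the ranking dictionary are assumptions introduced here, and BOR is taken as the best time point response, without confirmation. The input is the example follow-up of Table 2.</p>
<preformat>
```python
# Rank of each RECIST response, best first (CR best, then PR, SD, PD).
RESPONSE_RANK = {"CR": 3, "PR": 2, "SD": 1, "PD": 0}


def best_overall_response(responses):
    """Best time point response: no confirmation, no minimal SD duration."""
    return max(responses, key=RESPONSE_RANK.get)


def first_index(responses, wanted):
    """1-based time point of the first response in `wanted`, or None."""
    for position, response in enumerate(responses, start=1):
        if response in wanted:
            return position
    return None


def kinds_of_discrepancy(reader1, reader2):
    """Return a dict flagging each of the four KoDs for one patient."""
    pd1, pd2 = first_index(reader1, {"PD"}), first_index(reader2, {"PD"})
    resp1 = first_index(reader1, {"CR", "PR"})
    resp2 = first_index(reader2, {"CR", "PR"})
    return {
        # PDD: only one of the two readers declared PD during follow-up.
        "PDD": (pd1 is None) != (pd2 is None),
        # DOPD: PD missed by one reader, or declared at different dates.
        "DOPD": pd1 != pd2,
        "BOR": best_overall_response(reader1) != best_overall_response(reader2),
        # DOFR: no CR/PR for one reader, or first CR/PR at different dates.
        "DOFR": resp1 != resp2,
    }


# Follow-up of the example patient in Table 2 (TP1 to TP6).
reader_1 = ["SD", "PR", "PR", "CR", "PD", "PD"]
reader_2 = ["SD", "SD", "PR", "PR", "PR", "PR"]
example_kods = kinds_of_discrepancy(reader_1, reader_2)
```
</preformat>
<p>Under these assumptions, all four KoDs are flagged for the patient of Table 2, whereas two identical sequences of responses raise none.</p>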
<p>For each trial and KoD, our study addressed the distribution of the discrepancies, the risk factors for discrepancies, and their predictions:</p>
<sec id="s3_3_1">
<label>2.3.1</label>
<title>Distribution of discrepancies</title>
<p>a) The rate of discrepant patients</p>
<p>At trial completion (or near completion for Trial 1), we measured the ratio of the number of patients for whom a KoD was detected during their follow-up to the total number of patients.</p>
<p>b) The rate of discrepant patients over time</p>
<p>The temporal discrepancy rate can be written as the cumulative function of the probability of discrepancy at each time point, <italic>PDisc(t)</italic>, multiplied by the survival curve <italic>Surv(t)</italic> (or drop-out curve) and convolved with the number of patients included at each time point, <italic>PatIncl(t)</italic> (Equation 1).</p>
<disp-formula>
<label>Equation 1</label>
<mml:math display="block" id="M1">
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>O</mml:mi>
<mml:mi>f</mml:mi>
<mml:mi>D</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>c</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:munderover>
<mml:mo>&#x222b;</mml:mo>
<mml:mrow>
<mml:mi>&#x3c4;</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:munderover>
<mml:mi>P</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>I</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>&#x3c4;</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>*</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mi>D</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>c</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>&#x3c4;</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#xb7;</mml:mo>
<mml:mi>S</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>v</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>&#x3c4;</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mi>d</mml:mi>
<mml:mi>&#x3c4;</mml:mi>
</mml:mrow>
</mml:math>
</disp-formula>
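<p>One possible discrete reading of Equation 1 is sketched here, assuming a regular time grid (for instance, monthly steps). The function name, the list-based inputs, and the example values are hypothetical and introduced for illustration only. The sketch returns the running sum of the convolution of <italic>PatIncl</italic> with the product <italic>PDisc</italic>&#xb7;<italic>Surv</italic>, that is, an expected cumulative count of discrepant patients; any normalization by the number of included patients is left out, as it is not part of the equation as written.</p>
<preformat>
```python
def cumulative_discrepancies(pat_incl, p_disc, surv):
    """Discrete reading of Equation 1 on a regular time grid.

    pat_incl[a]: patients included at calendar step a (PatIncl).
    p_disc[u], surv[u]: discrepancy probability and survival at
    follow-up step u (PDisc, Surv). Returns, for each calendar step t,
    the running sum of the convolution PatIncl * (PDisc . Surv).
    """
    kernel = [p * s for p, s in zip(p_disc, surv)]
    totals, running = [], 0.0
    for t in range(len(pat_incl)):
        # Convolution at step t: patients included at step a have
        # reached follow-up step t - a.
        convolved = sum(
            pat_incl[a] * kernel[t - a]
            for a in range(t + 1)
            if len(kernel) > t - a
        )
        running += convolved
        totals.append(running)
    return totals


# Hypothetical accrual, discrepancy probability, and survival per step.
accrual = [10, 10, 0, 0]
discrepancy_probability = [0.0, 0.1, 0.2, 0.1]
survival = [1.0, 1.0, 0.5, 0.5]
expected_counts = cumulative_discrepancies(
    accrual, discrepancy_probability, survival
)
```
</preformat>
<p>In this hypothetical example the expected count keeps growing after accrual has stopped, because patients included earlier are still being followed.</p>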
<p>For each KoD and for each trial, we analyzed the rate of discrepancies over time. This rate can be simply calculated as the ratio of the number of discrepant patients included from the beginning of the study to the number of patients included from the start of the study during the same period (Equation 2).</p>
<disp-formula>
<label>Equation 2</label>
<mml:math display="block" id="M2">
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>O</mml:mi>
<mml:mi>f</mml:mi>
<mml:mi>D</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>c</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msubsup>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:msubsup>
<mml:mi>N</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>o</mml:mi>
<mml:mi>f</mml:mi>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>d</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>t</mml:mi>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>p</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:msubsup>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:msubsup>
<mml:mi>N</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>o</mml:mi>
<mml:mi>f</mml:mi>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>d</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>d</mml:mi>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>P</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
<p>Where <italic>t</italic> is the time, <italic>t=0</italic> at trial onset (first patient in).</p>
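<p>Equation 2 amounts to a ratio of two running totals. A minimal sketch, assuming that the numbers of newly discrepant and newly included patients are available per time step (the function name and the counts are hypothetical):</p>
<preformat>
```python
def rate_of_discrepancy(discrepant_per_step, included_per_step):
    """Equation 2: cumulative discrepant patients over cumulative inclusions.

    Both lists are indexed by time step since trial onset (t = 0 at first
    patient in). Steps before any inclusion yield None.
    """
    rates, discrepant_total, included_total = [], 0, 0
    for discrepant, included in zip(discrepant_per_step, included_per_step):
        discrepant_total += discrepant
        included_total += included
        rates.append(
            discrepant_total / included_total if included_total else None
        )
    return rates


# Hypothetical monthly counts, for illustration only.
monthly_discrepant = [0, 1, 2, 1]
monthly_included = [5, 5, 0, 0]
monthly_rates = rate_of_discrepancy(monthly_discrepant, monthly_included)
```
</preformat>
<p>With these hypothetical counts the rate rises from 0 to 0.4 over four months, although no patient is included after the second month.</p>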
<p>c) The proportion of discrepancies occurring at each follow-up time</p>
<p>For each KoD and each trial, we computed the average proportion of discrepant patients occurring between two time points (Equation 3). These proportions, which are functions of the progression free survival (PFS) curve (therefore of the survival curve) and the probability of KoD occurrence during an interval of time, are displayed along with the proportions of patients still evaluated until this time point.</p>
<disp-formula>
<label>Equation 3</label>
<mml:math display="block" id="M3">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>D</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>c</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mi>U</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mn>100</mml:mn>
<mml:mo>*</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msubsup>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>U</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>U</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mi>N</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>o</mml:mi>
<mml:mi>f</mml:mi>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>d</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>t</mml:mi>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>p</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:msubsup>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>B</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>U</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mi>N</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>o</mml:mi>
<mml:mi>f</mml:mi>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>d</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>t</mml:mi>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>p</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
<p>Where <italic>FU</italic> is an interval in patient follow-up time, <italic>t</italic> ranges over [<italic>Baseline; FUmax</italic>], and <italic>FUmax</italic> is the maximum follow-up duration measured in the study.</p>
<p>d) The probability of discrepancy during follow-up</p>
<p>For each KoD and each trial, as presented earlier in Equation 1, we provided the probability of a patient having a discrepant diagnosis during a given follow-up interval. We computed this probability as the ratio of the number of discrepant diagnoses to the number of patients that were evaluated during this follow-up interval of time (Equation 4).</p>
<disp-formula>
<label>Equation 4</label>
<mml:math display="block" id="M4">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mi>D</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>c</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mi>U</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msubsup>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>U</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>U</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mi>N</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>o</mml:mi>
<mml:mi>f</mml:mi>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>d</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>t</mml:mi>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>p</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:msubsup>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>U</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>U</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mi>N</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>o</mml:mi>
<mml:mi>f</mml:mi>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>p</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
<mml:mo>&#xa0;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
<p>Where <italic>FU</italic> is a given follow-up time point.</p>
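<p>Equations 3 and 4 share the same numerator but differ in their denominator, which the following sketch makes explicit. It assumes that discrepancies and evaluated patients have already been counted per follow-up interval; the function names and the counts are hypothetical.</p>
<preformat>
```python
def proportion_of_discrepancies(discrepant_per_interval):
    """Equation 3: share (in %) of all discrepancies seen in each interval."""
    total = sum(discrepant_per_interval)
    if total == 0:
        return [0.0] * len(discrepant_per_interval)
    return [100 * count / total for count in discrepant_per_interval]


def probability_of_discrepancy(discrepant_per_interval, evaluated_per_interval):
    """Equation 4: discrepancies per patient evaluated in each interval."""
    return [
        count / evaluated if evaluated else None
        for count, evaluated in zip(
            discrepant_per_interval, evaluated_per_interval
        )
    ]


# Hypothetical counts per follow-up interval, for illustration only.
discrepant = [2, 5, 3]
evaluated = [100, 50, 20]
proportions = proportion_of_discrepancies(discrepant)
probabilities = probability_of_discrepancy(discrepant, evaluated)
```
</preformat>
<p>With these hypothetical counts, the third interval holds 30% of all discrepancies but has the highest probability of discrepancy (0.15), because few patients are still being evaluated.</p>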
<p>Our discrepancy analysis considered only patients who underwent at least one follow-up visit after baseline. For a clearer display, we resampled curves in a standardized time-frame of one month.</p>
</sec>
<sec id="s3_3_2">
<label>2.3.2</label>
<title>Baseline-derived risk factors for discrepancy</title>
<p>We wanted to identify the variabilities in the RECIST process of selection and/or measurement performed at baseline (<xref ref-type="fig" rid="f1">
<bold>Figure&#xa0;1</bold>
</xref>) which were likely to entail discrepancy in responses (<xref ref-type="fig" rid="f2">
<bold>Figure&#xa0;2</bold>
</xref>). To this end, we arbitrarily considered risk factors likely 1) to quantify measurement variability at baseline, by computing the Delta Burden between the two readers as the relative difference of their sums of diameters (SOD) (Abs(SOD1-SOD2)/(SOD1+SOD2)) and by measuring SPropSOD (<xref ref-type="app" rid="app1C">
<bold>Annex C</bold>
</xref>); 2) to quantify the variability in TL selection at baseline, by reporting when the two readers did not select their TLs in exactly the same organs, when they selected TLs in totally different organs, or when one of the readers selected a TL in a particularly infrequent location. We also reported when at least one of the readers did not select any TL at baseline. Finally, we considered it a risk factor when the two readers did not select all their NTLs in the same organs. A comprehensive description of all the risk factors is provided in <xref ref-type="app" rid="app1A">
<bold>Annex A</bold>
</xref>. By means of odds ratios (ODDs), we performed a univariate analysis testing associations between KoDs and a set of predefined features (<xref ref-type="bibr" rid="B14">14</xref>) (<xref ref-type="app" rid="app1A">
<bold>Annex A</bold>
</xref>) derived from risk factors. These features applied to target lesions (TLs) and non-target lesions (NTLs) and were stratified according to the different diseased organs (see <xref ref-type="app" rid="app1B">
<bold>Annex B</bold>
</xref>).</p>
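<p>As a minimal sketch of the two quantities used above, the following Python code computes the Delta Burden from two readers&#x2019; baseline SODs and an odds ratio from a 2x2 table. The function names, the example values, the handling of a zero SOD, and the Wald confidence interval are assumptions made for illustration only; they are not the study data or the exact procedure used in the study.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


def delta_burden(sod1, sod2):
    """Relative difference of two readers' baseline sums of diameters (SOD)."""
    total = sod1 + sod2
    if total == 0:
        # Assumption: if neither reader measured any TL, the difference is set to zero.
        return 0.0
    return abs(sod1 - sod2) / total


def odds_ratio(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a: risk factor present, discrepant    b: risk factor present, concordant
    c: risk factor absent, discrepant     d: risk factor absent, concordant
    All four counts must be non-zero.
    """
    estimate = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se_log)
    upper = math.exp(math.log(estimate) + z * se_log)
    return estimate, lower, upper


# Hypothetical values, for illustration only.
print(delta_burden(52.0, 48.0))
print(odds_ratio(30, 70, 20, 80))
```
]]></preformat>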
<fig id="f1" position="float">
<label>Figure&#xa0;1</label>
<caption>
<p>Inter-observer discrepancies at baseline. For the same lesion, radiologists might consider the lesion non-measurable because its size does not meet the RECIST 1.1 measurability criteria of 1 cm long axis for non-lymph-node lesions (1a, 1b, 2a, 2b) and 1.5 cm short axis for lymph nodes (3a, 3b). The measurability criteria can also be challenged for large lesions with ill-defined margins and non-robust measurements (4a, 4b). Two radiologists might classify the same lesion as belonging to a lymph-node or a non-lymph-node organ, resulting in a large measurement discrepancy because the measurement method differs between the two types of organs (5a, 5b). Measurement discrepancies can also be linked to the series selected for measurement, such as arterial versus portal phase (6a, 6b), or to the type of lesion, such as cavitary lesions (7a, 7b).</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fonc-13-1239570-g001.tif"/>
</fig>
<fig id="f2" position="float">
<label>Figure&#xa0;2</label>
<caption>
<p>RECIST 1.1 inter-reader variability at baseline leading to endpoint discrepancy. The patient presented at baseline with involvement of several mediastinal and hilar lymph nodes. Both readers followed RECIST 1.1 guidelines and accurately selected two large lymph nodes for measurement without any errors. During the follow-up period, both readers observed a partial response, with the first response documented at week 9. However, at week 18, the readers disagreed on the assessment of disease progression.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fonc-13-1239570-g002.tif"/>
</fig>
</sec>
<sec id="s3_3_3">
<label>2.3.3</label>
<title>Predictions of discrepancy derived from baseline evaluations</title>
<p>Since the association of a single risk factor with a given outcome must be strong (ODD&gt;10) to make classification effective (<xref ref-type="bibr" rid="B15">15</xref>), we used the previously identified risk factors together to train an ML model. After feature reduction and classification, we documented the performance of two classification systems.</p>
</sec>
</sec>
<sec id="s3_4">
<label>2.4</label>
<title>Statistics</title>
<p>All statistical analyses were performed using the base version of R and packages from CRAN.</p>
<p>Confidence intervals (CIs) of discrepancy rates were computed using the Clopper-Pearson exact method (<xref ref-type="bibr" rid="B16">16</xref>) with the &#x201c;PropCIs&#x201d; package. Multiple comparisons of continuous variables were performed using the Dunnett-Tukey-Kramer method for unequal sample sizes (<xref ref-type="bibr" rid="B17">17</xref>) with the &#x201c;DTK&#x201d; package. Multiple comparisons of proportions were performed using the Marascuilo test (<xref ref-type="bibr" rid="B18">18</xref>).</p>
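<p>For illustration, a Clopper-Pearson interval can be sketched in Python with the standard library only, by inverting the binomial tail probabilities with a bisection search. This is not the &#x201c;PropCIs&#x201d; implementation; the function names and the bisection approach are assumptions of this sketch. The example uses the pooled PDD counts of Table 3 (362 of 1724 patients), so the result can be checked against the interval reported in Section 3.1.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with terms computed in log space."""
    if k < 0:
        return 0.0
    if k >= n or p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        total += math.exp(log_term)
    return min(total, 1.0)


def clopper_pearson(k, n, conf=0.95):
    """Exact (Clopper-Pearson) confidence interval for k events out of n."""
    alpha = 1.0 - conf

    def solve(func, target):
        # func decreases with p on [0, 1]; bisect until func(p) is close to target.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if func(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: P(X >= k) = alpha/2, i.e. P(X <= k-1) = 1 - alpha/2.
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound: P(X <= k) = alpha/2.
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


# Pooled PDD discrepancies from Table 3: 362 of 1724 patients.
low, high = clopper_pearson(362, 1724)
print(f"{100 * 362 / 1724:.1f}% [{100 * low:.1f}; {100 * high:.1f}%]")
```
]]></preformat>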
<p>We derived the proportions of detected KoDs from the 95<sup>th</sup> percentile of patient follow-up duration in trials and the 95<sup>th</sup> percentile of follow-up duration until the first occurrence of KoDs. ODDs were computed using the &#x201c;fmsb&#x201d; package (<xref ref-type="bibr" rid="B19">19</xref>), with associated p-values for significant associations.</p>
<p>Continuous variables were analyzed using the two-sample non-parametric Wilcoxon test (discrepant versus non-discrepant patient groups).</p>
<p>A predictive model of discrepant patient evaluation was trained and tested in a cross-validation setting with an 80:20 split. We reported classification accuracy when McNemar&#x2019;s test indicated no significant bias in assessments due to imbalance in the data. We also reported the Area Under the Curve (AUC).</p>
<p>We evaluated the classification performances using two different algorithms: 1) a random forest (RF) algorithm (<xref ref-type="bibr" rid="B20">20</xref>) from the &#x201c;caret&#x201d; package after recursive feature elimination (<xref ref-type="bibr" rid="B21">21</xref>) and 2) a deep learning (DL) algorithm from the &#x201c;h2o&#x201d; package (<xref ref-type="bibr" rid="B22">22</xref>) after grid search. CIs were computed for AUC, accuracy (Acc), sensitivity (Se), specificity (Sp), positive predictive value (PPV), and negative predictive value (NPV) using the bootstrap method from the &#x201c;DescTools&#x201d; package.</p>
</sec>
</sec>
<sec id="s4" sec-type="results">
<label>3</label>
<title>Results</title>
<sec id="s4_1">
<label>3.1</label>
<title>Discrepancy rates</title>
<p>
<xref ref-type="table" rid="T3">
<bold>Table&#xa0;3</bold>
</xref> provides a summary of the KoD rates obtained at the end of each trial. Per KoD, Marascuilo tests yielded no significant inter-trial differences (p&gt;0.05). The average discrepancy rates were 21.0% [19.1; 23.0%], 41.0% [38.7; 43.4%], 28.8% [26.6; 30.9%], and 48.8% [46.4; 51.2%] for PDD, DOPD, BOR, and DOFR, respectively. When combining the data from all five trials, a multiple comparison test showed that there were significant differences between the two KoDs related to time (DOPD and DOFR) and the two KoDs related to the event (PDD and BOR). The discrepancy rate for PDD was lower than for the other KoDs.</p>
<table-wrap id="T3" position="float">
<label>Table&#xa0;3</label>
<caption>
<p>Discrepancy rates at end of trial.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="top" align="left"/>
<th valign="top" align="left">PDD % (n)</th>
<th valign="top" align="left">DOPD % (n)</th>
<th valign="top" align="left">BOR % (n)</th>
<th valign="top" align="left">DOFR % (n)</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">
<bold>Trial 1</bold> <italic>(N=333)</italic>
</td>
<td valign="top" align="left">16.5 (55)</td>
<td valign="top" align="left">40.8 (136)</td>
<td valign="top" align="left">27.0 (90)</td>
<td valign="top" align="left">45.0 (150)</td>
</tr>
<tr>
<td valign="top" align="left">
<bold>Trial 2</bold> <italic>(N=493)</italic>
</td>
<td valign="top" align="left">21.2 (104)</td>
<td valign="top" align="left">39.3 (194)</td>
<td valign="top" align="left">26.8 (132)</td>
<td valign="top" align="left">49.7 (245)</td>
</tr>
<tr>
<td valign="top" align="left">
<bold>Trial 3</bold> <italic>(N=243)</italic>
</td>
<td valign="top" align="left">21.4 (52)</td>
<td valign="top" align="left">33.7 (82)</td>
<td valign="top" align="left">33.7 (82)</td>
<td valign="top" align="left">55.6 (135)</td>
</tr>
<tr>
<td valign="top" align="left">
<bold>Trial 4</bold> <italic>(N=276)</italic>
</td>
<td valign="top" align="left">26.1 (72)</td>
<td valign="top" align="left">41.3 (114)</td>
<td valign="top" align="left">31.9 (88)</td>
<td valign="top" align="left">51.9 (143)</td>
</tr>
<tr>
<td valign="top" align="left">
<bold>Trial 5</bold> <italic>(N=379)</italic>
</td>
<td valign="top" align="left">20.8 (79)</td>
<td valign="top" align="left">47.7 (181)</td>
<td valign="top" align="left">27.4 (104)</td>
<td valign="top" align="left">44.6 (169)</td>
</tr>
<tr>
<td valign="top" align="left">
<bold>Pooled</bold> <italic>(N=1724)</italic>
</td>
<td valign="top" align="left">21.0 (362)</td>
<td valign="top" align="left">41.0 (707)</td>
<td valign="top" align="left">28.8 (496)</td>
<td valign="top" align="left">48.8 (842)</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>For the five clinical trials (rows), we reported, per patient, the double read discrepancy rates as percentages (raw numbers are in parentheses). These were computed for each KoD (column): 1) discrepant PDD; 2) discrepant DOPD; 3) discrepant BOR; and 4) discrepant DOFR.</p>
</fn>
<fn>
<p>BOR, best overall response; DOFR, date of first response; DOPD, declaration of progressive disease; PDD, progressive disease declared.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>
<xref ref-type="fig" rid="f3">
<bold>Figure&#xa0;3</bold>
</xref> displays the discrepancy rates over time for the five trials and the different KoDs, showing that, most of the time, the DOFR rate was higher than the DOPD rate. We observed that the rates of all KoDs increased as the trial progressed, even long after the completion of patient accrual. The KoD curves did not always vary smoothly, and the accrual curves displayed different shapes. It seems that substantial patient recruitment from the very start of a study yields meaningful KoD curves early on.</p>
<fig id="f3" position="float">
<label>Figure&#xa0;3</label>
<caption>
<p>The discrepancy rate over time. The discrepancy rates for the four KoDs are displayed along with the proportion of accrued patients. Curves for DOPD, PDD, DOFR, and BOR KoDs are displayed in red, orange, blue, and green, respectively, with corresponding 95% CIs. The black curve is the cumulative proportion of accrued patients. The five trials are represented from top left to bottom right: a) Trial 1, b) Trial 2, c) Trial 3, d) Trial 4, and e) Trial 5. The time scale is one month.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fonc-13-1239570-g003.tif"/>
</fig>
<p>As depicted in <xref ref-type="fig" rid="f4">
<bold>Figure&#xa0;4</bold>
</xref>, the ratio of KoD occurrence and the proportion of patients remaining at each time point followed a steady downward trend over time. At the outset of follow-up, proportionally more DOFR and BOR occurred than DOPD and PDD. Additionally, <xref ref-type="fig" rid="f4">
<bold>Figure&#xa0;4</bold>
</xref> demonstrates that the decrease in the proportion of KoD occurrence had distinct patterns. For some trials (Trials 1, 2, and 5), the decrease followed a trend similar to that of the proportion of patients still present at each time point, while for others (Trials 3 and 4), this was not the case.</p>
<fig id="f4" position="float">
<label>Figure&#xa0;4</label>
<caption>
<p>Proportion of discrepancies as distributed during follow-ups. For each trial and KoD, we computed the proportion of discrepancies (Equation 3) occurring at each follow-up. The DOPD, PDD, DOFR, and BOR KoDs are displayed in red, orange, blue, and green, respectively. The black curve displays the proportion of patients evaluated at this time point. BOR, best overall response; DOFR, date of first response; DOPD, declaration of progressive disease; KoD, kind of inter-reader discrepancy; PDD, progressive disease declared.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fonc-13-1239570-g004.tif"/>
</fig>
<p>In <xref ref-type="fig" rid="f5">
<bold>Figure&#xa0;5</bold>
</xref>, we present the probability of KoDs occurring in relation to consecutive time points.</p>
<fig id="f5" position="float">
<label>Figure&#xa0;5</label>
<caption>
<p>Probability of discrepancy over patients follow-up. The probability of discrepancy for the four KoDs is displayed. Probabilities of DOPD, PDD, DOFR, and BOR KoD occurrence are displayed in red, orange, blue, and green, respectively, with corresponding 95% CIs. The five trials are represented from top left to bottom right: a) Trial 1, b) Trial 2, c) Trial 3, d) Trial 4, and e) Trial 5. BOR, best overall response; CI, confidence interval; DOFR, date of first response; DOPD, declaration of progressive disease; KoD, kind of inter-reader discrepancy; PDD, progressive disease declared.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fonc-13-1239570-g005.tif"/>
</fig>
<p>In <xref ref-type="fig" rid="f5">
<bold>Figure&#xa0;5</bold>
</xref>, we can see that the probability of discrepancies occurring with DOFR was significantly higher at the beginning of follow-up, but became closer to the other KoDs at the following time point.</p>
<p>After 6 months, the probabilities of KoDs in Trials 3 and 4 were generally below 10%; they were higher in the other trials. DOFR was the most likely KoD in the initial part of the patient evaluation, and the ordering of KoDs was stable when considering phase 3 studies, i.e., DOFR &gt; DOPD &gt; BOR &gt; PDD.</p>
<p>Regarding progression-related KoDs, the probability of PDD at the patient level was quite stable over cycles while the probability of DOPD tended to decrease over cycles. PDD was also globally lower than the KoDs of response at the beginning of evaluations.</p>
</sec>
<sec id="s4_2">
<label>3.2</label>
<title>Evaluation of risk factors derived from baseline discrepancies</title>
<p>We found that for a single trial (Trial 3), one reader did not identify any disease in two patients out of 243 (0.8%), without any evidence that this was a significant risk factor for discrepancy, particularly for DOPD (p=0.68) or DOFR (p=0.1).</p>
<p>Except for Trial 3, classification of disease as non-measurable by at least one reader occurred in less than 7% of patients. It was very common for readers to select some of their TLs and NTLs in different organs (36.0-57.9% and 60.5-73.5% of patients, respectively), but the two readers rarely (4.0-8.1%) selected all of their TLs in different locations.</p>
<p>We did not identify any set of risk factors which was relevant to all trials for a given KoD.</p>
<p>As summarized in <xref ref-type="table" rid="T4">
<bold>Table&#xa0;4</bold>
</xref>, significant risk factors varied depending on the KoD and the trial examined. Analysis of pooled data revealed significant risk factors associated with BOR, whereas DOPD had a single risk factor (non-measurable disease reported by one reader) that could be identified at baseline. PDD had none. One or more readers choosing an infrequent disease location was not a risk factor for discrepancy.</p>
<table-wrap id="T4" position="float">
<label>Table&#xa0;4</label>
<caption>
<p>Risk factors and occurrence of discrepancy.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="top" align="left"/>
<th valign="top" align="left">Non-measurable disease (NTLs)</th>
<th valign="top" align="left">TLs not all in same organs</th>
<th valign="top" align="left">All TLs in different organs</th>
<th valign="top" align="left">NTLs not all in same organs</th>
<th valign="top" align="left">Delta burden</th>
<th valign="top" align="left">TLs not lung for one of the readers</th>
<th valign="top" align="left">SPropSOD*</th>
<th valign="top" align="left">Infrequent disease<break/>location</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">
<bold>Trial 1</bold>
<break/>
<bold>(N=333)</bold>
</td>
<td valign="top" align="left">BOR<break/>
<break/>
<break/>
<break/>(4.8%)</td>
<td valign="top" align="left">BOR<break/>
<break/>
<break/>
<break/>(42.9%)</td>
<td valign="top" align="left" style="background-color:#d9d9d9">(7.6%)</td>
<td valign="top" align="left">BOR<break/>
<break/>
<break/>PDD<break/>(61.5%)</td>
<td valign="top" align="left">BOR<break/>
<break/>
<break/>
<break/>(NA)</td>
<td valign="top" align="left">BOR<break/>DOFR<break/>
<break/>
<break/>(23.3%)</td>
<td valign="top" align="left">
<break/>
<break/>
<break/>PDD<break/>(NA)</td>
<td valign="top" align="left" style="background-color:#d9d9d9">
<break/>
<break/>
<break/>
<break/>(6.6%)</td>
</tr>
<tr>
<td valign="top" align="left">
<bold>Trial 2</bold>
<break/>
<bold>(N=493)</bold>
</td>
<td valign="top" align="left" style="background-color:#d9d9d9">
<break/>
<break/>
<break/>
<break/>(0.4%)</td>
<td valign="top" align="left" style="background-color:#d9d9d9">
<break/>
<break/>
<break/>
<break/>(40.9%)</td>
<td valign="top" align="left">BOR<break/>DOFR<break/>
<break/>
<break/>(6.3%)</td>
<td valign="top" align="left">
<break/>
<break/>
<break/>PDD<break/>(66.9%)</td>
<td valign="top" align="left">BOR<break/>DOFR<break/>
<break/>
<break/>(NA)</td>
<td valign="top" align="left">BOR<break/>DOFR<break/>
<break/>
<break/>(13.0%)</td>
<td valign="top" align="left">BOR<break/>DOFR<break/>DOPD<break/>
<break/>(NA)</td>
<td valign="top" align="left" style="background-color:#ffffff">
<break/>DOFR<break/>(7.9%)</td>
</tr>
<tr>
<td valign="top" align="left">
<bold>Trial 3</bold>
<break/>
<bold>(N=243)</bold>
</td>
<td valign="top" align="left">BOR<break/>
<break/>
<break/>PDD<break/>(22.9%)</td>
<td valign="top" align="left" style="background-color:#d9d9d9">
<break/>
<break/>
<break/>
<break/>(47.0%)</td>
<td valign="top" align="left">BOR<break/>
<break/>
<break/>
<break/>(8.1%)</td>
<td valign="top" align="left">
<break/>
<break/>DOPD<break/>
<break/>(63.7%)</td>
<td valign="top" align="left" style="background-color:#d9d9d9">
<break/>
<break/>
<break/>
<break/>(NA)</td>
<td valign="top" align="left" style="background-color:#d9d9d9">
<break/>
<break/>
<break/>
<break/>(13.1%)</td>
<td valign="top" align="left" style="background-color:#d9d9d9">
<break/>
<break/>
<break/>
<break/>(NA)</td>
<td valign="top" align="left">BOR<break/>DOFR<break/>
<break/>
<break/>(17.1%)</td>
</tr>
<tr>
<td valign="top" align="left">
<bold>Trial 4</bold>
<break/>
<bold>(N=276)</bold>
</td>
<td valign="top" align="left">BOR<break/>DOFR<break/>
<break/>
<break/>
<break/>(6.1%)</td>
<td valign="top" align="left" style="background-color:#d9d9d9">
<break/>
<break/>
<break/>
<break/>
<break/>(57.9%)</td>
<td valign="top" align="left">
<break/>DOFR<break/>DOPD<break/>
<break/>
<break/>(5.8%)</td>
<td valign="top" align="left" style="background-color:#d9d9d9">
<break/>
<break/>
<break/>
<break/>
<break/>(73.5%)</td>
<td valign="top" align="left">
<break/>
<break/>DOPD<break/>
<break/>
<break/>(NA)</td>
<td valign="top" align="left">BOR<break/>
<break/>DOPD<break/>
<break/>
<break/>(25.1%)</td>
<td valign="top" align="left" style="background-color:#ffffff">BOR<break/>
<break/>
<break/>
<break/>
<break/>(NA)</td>
<td valign="top" align="left" style="background-color:#d9d9d9">
<break/>
<break/>
<break/>
<break/>
<break/>(43.1%)</td>
</tr>
<tr>
<td valign="top" align="left">
<bold>Trial 5</bold>
<break/>
<bold>(N=379)</bold>
</td>
<td valign="top" align="left">BOR<break/>
<break/>
<break/>
<break/>(1.3%)</td>
<td valign="top" align="left" style="background-color:#d9d9d9">
<break/>
<break/>
<break/>
<break/>(36.0%)</td>
<td valign="top" align="left" style="background-color:#d9d9d9">
<break/>
<break/>
<break/>
<break/>(4.0%)</td>
<td valign="top" align="left" style="background-color:#d9d9d9">
<break/>
<break/>
<break/>
<break/>(60.5%)</td>
<td valign="top" align="left">
<break/>DOFR<break/>
<break/>
<break/>(NA)</td>
<td valign="top" align="left" style="background-color:#d9d9d9">
<break/>
<break/>
<break/>
<break/>(13.1%)</td>
<td valign="top" align="left">PDD<break/>
<break/>
<break/>
<break/>(NA)</td>
<td valign="top" align="left" style="background-color:#d9d9d9">
<break/>
<break/>
<break/>(9.0%)</td>
</tr>
<tr>
<td valign="top" align="left">
<bold>Pooled (N=1724)</bold>
</td>
<td valign="top" align="left">BOR (4.7)<break/>DOFR (1.7)<break/>DOPD (0.6)<break/>
<break/>(5.5%)<break/>95/1718</td>
<td valign="top" align="left">BOR (1.3)<break/>
<break/>
<break/>
<break/>(43.5%)<break/>706/1623</td>
<td valign="top" align="left">BOR (1.8)<break/>DOFR (1.9)<break/>
<break/>
<break/>(6.2%)<break/>100/1623</td>
<td valign="top" align="left">
<break/>DOFR (1.3)<break/>
<break/>
<break/>(65.1%)<break/>1120/1720</td>
<td valign="top" align="left">BOR (1.7)<break/>DOFR (1.6)<break/>
<break/>
<break/>(NA)</td>
<td valign="top" align="left">BOR (1.7)<break/>DOFR (1.4)<break/>
<break/>
<break/>(17.0%)<break/>276/1623</td>
<td valign="top" align="left">BOR (1.8)<break/>
<break/>
<break/>
<break/>
<break/>(NA)</td>
<td valign="top" align="left" style="background-color:#d9d9d9">
<break/>
<break/>
<break/>
<break/>(14.8%)<break/>255/1720</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>For the five clinical trials (rows) and the different risk factors (columns), we reported the KoDs representing a significant risk: DOPD (red), PDD (orange), DOFR (blue), BOR (green). In parentheses of corresponding colors are the values of the ODDs. In black parentheses are the percentages of patients with the potential risk factors. ODDs derived from Delta Burden and SPropSOD (<xref ref-type="app" rid="app1C">Annex C</xref>) used optimized thresholds and therefore represent best cases.</p>
</fn>
<fn>
<p>BOR, best overall response; DOFR, date of first response; DOPD, declaration of progressive disease; NTL, non-target lesion; PDD, progressive disease declared; SPropSOD, percentage of specific sum of diameters; TL, target lesion.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</sec>
<sec id="s4_3">
<label>3.3</label>
<title>Prediction of discrepancy derived from baseline evaluations</title>
<p>Our risk factor analysis showed that progression-related KoDs were marginally impacted by baseline evaluation. Therefore, our evaluation of predictive models focused mainly on the response-related KoDs of DOFR and BOR.</p>
<p>Classification performance for response-related KoDs, based on our feature set (<xref ref-type="bibr" rid="B14">14</xref>) and, for the RF algorithm, on a feature selection pre-processing step, is summarized in <xref ref-type="table" rid="T5">
<bold>Table&#xa0;5</bold>
</xref>. For each individual clinical trial and for the pooled data, feature selection did not improve classification performance.</p>
<table-wrap id="T5" position="float">
<label>Table&#xa0;5</label>
<caption>
<p>Prediction of response KoDs derived from baseline features.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="top" align="center"/>
<th valign="top" align="center">DOFR</th>
<th valign="top" align="center">BOR</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" style="background-color:#8eaadb">
<bold>AUC</bold>
</td>
<td valign="top" align="center">57.2 [56.6; 57.7]</td>
<td valign="top" align="center">60.8 [60.2; 61.4]</td>
</tr>
<tr>
<td valign="top" align="left" style="background-color:#8eaadb">
<bold>Acc</bold>
</td>
<td valign="top" align="center">55.4 [54.9; 55.8]</td>
<td valign="top" align="center">73.1 [72.8; 73.3]</td>
</tr>
<tr>
<td valign="top" align="left" style="background-color:#8eaadb">
<bold>1.&#x2003;Se</bold>
<break/>
<bold>2.&#x2003;Sp</bold>
<break/>
<bold>3.&#x2003;PPV</bold>
<break/>
<bold>4.&#x2003;NPV</bold>
</td>
<td valign="top" align="center">44.1 [43.1; 45.1]<break/>66.2 [65.3; 67.2]<break/>56.0 [55.3; 56.2]<break/>55.5 [55.0; 56.0]</td>
<td valign="top" align="center">12.4 [11.8; 12.9]<break/>97.4 [97.1; 97.6]<break/>66.0 [63.0; 68.0]<break/>73.5 [73.0; 74.0]</td>
</tr>
<tr>
<td valign="top" align="left" style="background-color:#a8d08d">
<bold>AUC</bold>
</td>
<td valign="top" align="center">58.0 [57.7; 58.2]</td>
<td valign="top" align="center">61.9 [61.6; 62.2]</td>
</tr>
<tr>
<td valign="top" align="left" style="background-color:#a8d08d">
<bold>Acc</bold>
</td>
<td valign="top" align="center">52.8 [52.3; 53.3]</td>
<td valign="top" align="center">73.5 [73.0; 73.9]</td>
</tr>
<tr>
<td valign="top" align="left" style="background-color:#a8d08d">
<bold>1.&#x2003;Se</bold>
<break/>
<bold>2.&#x2003;Sp</bold>
<break/>
<bold>3.&#x2003;PPV</bold>
<break/>
<bold>4.&#x2003;NPV</bold>
</td>
<td valign="top" align="center">4.3 [3.5; 5.0]<break/>98.9 [98.6; 99.2]<break/>84.0 [81.6; 86.4]<break/>52.2 [51.6; 52.7]</td>
<td valign="top" align="center">10.3 [9.4; 11.2]<break/>98.8 [98.6; 99.0]<break/>81.0 [78.8; 83.2]<break/>73.3 [72.8; 73.8]</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>We pooled the data from the five trials to evaluate the classification performance of two algorithms in a cross-validation setting: RF after feature selection (top rows, in blue); DL after grid search of the hyperparameters (bottom rows, in green). For the KoDs of response, we measured the predictive performance as the AUC, Acc, Se, Sp, PPV, and NPV with corresponding CIs (in square brackets).</p>
</fn>
<fn>
<p>Acc, accuracy; AUC, area under ROC curve; BOR, best overall response; CI, confidence interval; DL, deep learning; DOFR, date of first response; KoD, kind of inter-reader discrepancy; NPV, negative predictive value; PPV, positive predictive value; RF, random forest; Se, sensitivity; Sp, specificity.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>Using the validation dataset of pooled data, DL outperformed the RF algorithm. Performance with DL was poor overall, although the PPV was higher than 80% for all the KoDs. Based on AUC, the best classification performance was obtained for BOR. For DOPD, the AUC was 57.3 [57.1; 57.5].</p>
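<p>The relationship between the metrics in Table 5 can be sketched in a few lines of Python. Given a sensitivity, a specificity, and a prevalence of discrepancy, the accuracy, PPV, and NPV follow from the expected confusion matrix. The function name is hypothetical, and the sketch assumes that the prevalence in the validation data equals the pooled BOR discrepancy rate of Table 3 (28.8%), which may not hold exactly. Under that assumption, the RF values for BOR (Se 12.4%, Sp 97.4%) give an accuracy, PPV, and NPV within about one percentage point of those reported, which illustrates that a high accuracy with a low sensitivity largely reflects the majority of concordant patients.</p>
<preformat preformat-type="code"><![CDATA[
```python
def metrics_from_rates(sensitivity, specificity, prevalence):
    """Accuracy, PPV and NPV implied by Se, Sp and the prevalence of discrepancy."""
    # Expected confusion-matrix cells, expressed as fractions of all patients.
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    accuracy = tp + tn
    ppv = tp / (tp + fp)  # undefined if no patient is predicted discrepant
    npv = tn / (tn + fn)  # undefined if no patient is predicted concordant
    return accuracy, ppv, npv


# RF, BOR column of Table 5, with the pooled BOR prevalence of Table 3 (assumption).
acc, ppv, npv = metrics_from_rates(0.124, 0.974, 0.288)
print(f"Acc={acc:.3f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```
]]></preformat>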
</sec>
</sec>
<sec id="s5" sec-type="discussion">
<label>4</label>
<title>Discussion</title>
<sec id="s5_1" sec-type="discussion">
<label>4.1</label>
<title>Discussion of our results</title>
<p>At completion of trials, discrepancy rates based on DOPD or DOFR were comparable across trials with respective average values of 41.0% and 48.8%. Over time, the rates of KoDs steadily increased as the trials progressed, even after the end of patient accrual. The discrepancy rates for the time-related endpoints, DOPD and DOFR, were always higher than their event-related PDD and BOR counterparts. A higher proportion for DOPD was expected as the counting of DOPD occurrences encompassed those of PDD. Translated into clinical study endpoints, these observations mean that discrepancy rates for overall response rate are generally higher than for PFS.</p>
<p>Assuming part of the discrepancies was attributable to a delayed event detection by one of the readers, it could be expected that the proportion of DOFR would be higher than DOPD at an earlier stage, as progressive patients were withdrawn from the trial; the only room for patient response was then before progression.</p>
<p>As <xref ref-type="fig" rid="f3">
<bold>Figures&#xa0;3</bold>
</xref>&#x2013;<xref ref-type="fig" rid="f5">
<bold>5</bold>
</xref> show, the number of recruited patients, <italic>PatIncl(t)</italic>, at each time point is an operations-related function that can vary significantly between trials and is difficult to predict. The survival curve, <italic>Surv(t)</italic> (and the PFS curve), however, depends on the drug and/or disease and can be predicted to some degree. The likelihood of a discrepancy occurring at each time point is more a measure of the reading process, which partially depends on the readers&#x2019; abilities and the complexity of the observed disease. To predict the rate of discrepancies over time, one must therefore be able to fully control the trials, the drugs, and the readers&#x2019; performance. The mathematical formulation of the temporal discrepancy rate (Equation 1) helps us understand the intricacies of trials by breaking down the components that interact with one another. When the rate of discrepancies increases, it can be difficult to determine whether the cause is the pattern of patient recruitment or whether readers are simply more prone to making mistakes.</p>
<p>Regarding our risk factor analysis, we found that at least one reader not detecting disease when it was likely present was very rare and not a major concern. Selecting non-measurable disease (i.e., NTLs only) was more prevalent and was considered a significant risk when pooling the five trials.</p>
<p>The analysis of DOPD and DOFR discrepancies showed that readers often selected TLs (43.5% of patients) or NTLs (65.1%) in different organs without this being critical for discrepancies. When all the data were pooled, most of the risk factors were significant for BOR. Selecting non-measurable disease was the only risk factor for DOPD discrepancies. The selection of an infrequent disease location by any reader was not a risk factor regardless of the KoD.</p>
<p>Feature selection did not improve RF classification performance. DL slightly outperformed the RF algorithm but with globally poor classification performance. Best performances (AUC based) were reached in detecting BOR (61.9).</p>
</sec>
<sec id="s5_2" sec-type="discussion">
<label>4.2</label>
<title>Discussion around the literature</title>
<sec id="s5_2_1">
<label>4.2.1</label>
<title>Discrepancy rate</title>
<p>The discrepancy rates we found at end of trial agree with the literature (<xref ref-type="bibr" rid="B8">8</xref>, <xref ref-type="bibr" rid="B23">23</xref>). Under the na&#xef;ve assumption of equiprobability between the four RECIST classes of response, the ranking of our measured discrepancy rates, DR(DOFR) &gt; DR(DOPD), is consistent with the basic rules of combinatorial probability.</p>
</sec>
<sec id="s5_2_2">
<label>4.2.2</label>
<title>Discrepancy timing</title>
<p>The fact that most discrepancies occurred earlier for DOFR than for DOPD can be explained by a genuine response only being possible before progression (except for pseudo-progression), because patients are withdrawn from trials after PD.</p>
<p>As we observed in <xref ref-type="fig" rid="f3">
<bold>Figure&#xa0;3</bold>
</xref>, the ranking of KoD rates could change over time (crossing curves). This indicates that the probability of inter-reader discrepancy is not a stationary process in time (<xref ref-type="bibr" rid="B24">24</xref>). This factor should be considered when improving the modeling of trial monitoring, as illustrated by Equation 1, and when designing new metrics. The probability of PDD at each cycle seemed the most stable and was lower than that of the other KoDs.</p>
</sec>
<sec id="s5_2_3">
<label>4.2.3</label>
<title>Risk factors</title>
<sec id="s5_2_3_1">
<label>4.2.3.1</label>
<title>NTL assessment</title>
<p>The NTL category only determines three events: CR, PD, and stability (Not CR/Not PD). The PR event is only determined by measurable disease. In support of this finding, Raskin et&#xa0;al. (<xref ref-type="bibr" rid="B25">25</xref>) showed that NTLs were an important factor for detecting PD, while Park et&#xa0;al. (<xref ref-type="bibr" rid="B26">26</xref>) revealed that selecting metastasis-only lesions as TLs may be more effective for determining response in kidney disease.</p>
<p>In our observations, for more than 5% of patients, at least one reader did not detect any measurable disease, so only NTLs were selected. Under these conditions, it is not surprising to observe a significant correlation with the occurrence of a response-related KoD (<xref ref-type="table" rid="T4">
<bold>Table&#xa0;4</bold>
</xref>).</p>
<p>Moreover, the NTL category theoretically includes less well-defined and smaller lesions, which makes them more equivocal during the first evaluation of the disease. The high prevalence of selected NTLs in different organs (65% of patients) reflects this uncertainty when capturing the disease at baseline. Lheureux et&#xa0;al. (<xref ref-type="bibr" rid="B27">27</xref>) provided a comprehensive discussion of the equivocality associated with RECIST, which is responsible for concerns related to its reliability. Moreover, during follow-up, due to the &#x201c;under-representation&#x201d; of this category and the qualitative assessment of non-measurable disease, RECIST recommends interpreting progression of NTLs by considering the entire disease. Indeed, it is rare to observe a PD event triggered solely by the NTL category. This is reported during paradoxical progression in approximately 10% of progression cases (<xref ref-type="bibr" rid="B28">28</xref>). Finally, since the CR event is quite rare in our patients, who were at an advanced clinical stage, a difference in the assessment of non-measurable disease ultimately presents little risk of variability in the study&#x2019;s endpoints. The cases ultimately most at risk are patients with a low measurable tumor mass relative to their non-measurable disease. However, the detection of these cases is very difficult in the absence of quantification of NTLs [see scenario F in the supplementary appendix of Seymour et&#xa0;al. (<xref ref-type="bibr" rid="B29">29</xref>)].</p>
</sec>
<sec id="s5_2_3_2">
<label>4.2.3.2</label>
<title>TL selection</title>
<p>We showed that readers selected TLs in different organs in 36.0% to 57.9% of patients, with no association with DOPD or DOFR discrepancies and poor association with BOR.</p>
<p>In the study by Keil et&#xa0;al. (<xref ref-type="bibr" rid="B12">12</xref>), readers chose different TLs for 39% of patients, and this difference was strongly associated with DOPD. Keil et&#xa0;al. had different study inclusion criteria, considering breast cancer, a single follow-up, no new target and no NTLs. Their mean number of TLs was 1.8 (2.3 in our study). Keil et&#xa0;al. adopted a strict definition of &#x201c;same TLs&#x201d; as those with the same coordinates, whereas ours required only that TLs be chosen in the same organ.</p>
<p>Kuhl et&#xa0;al. (<xref ref-type="bibr" rid="B30">30</xref>) reported higher discrepancy rates than ours (27% for DOPD). Readers selected different sets of TLs in 60% of patients, with an even stronger association with readers&#x2019; disagreement than that reported by Keil et&#xa0;al. Kuhl et&#xa0;al. adopted an even stricter definition of concordant selection than Keil et&#xa0;al. and also included a broad spectrum of primary cancers.</p>
</sec>
<sec id="s5_2_3_3">
<label>4.2.3.3</label>
<title>Sum of diameters value</title>
<p>Our study confirmed the findings of Sharma et&#xa0;al. (<xref ref-type="bibr" rid="B31">31</xref>), who concluded that the variability of SOD at baseline was associated with the variability of the study endpoint. However, our percentage of specific sum of diameters (SPropSOD) analysis neither confirmed nor contradicted other works about dissociated responses (<xref ref-type="bibr" rid="B28">28</xref>, <xref ref-type="bibr" rid="B32">32</xref>), probably because this phenomenon is reportedly observed in only around 10% of cases (<xref ref-type="bibr" rid="B28">28</xref>).</p>
</sec>
<sec id="s5_2_3_4">
<label>4.2.3.4</label>
<title>Location (TL lung &amp; infrequent)</title>
<p>Discrepancies in targeting the most frequent location were poorly associated with variability of responses. We considered five primary lung cancer trials, but for 17% of the patients, at least one reader did not select any lung TL.</p>
<p>Regarding discrepancies in identifying disease (TL or NTL) in infrequent locations, we surprisingly found no associated risk factors, although some authors discuss the controversial use of RECIST outside the most targeted disease locations (<xref ref-type="bibr" rid="B33">33</xref>).</p>
</sec>
<sec id="s5_2_3_5">
<label>4.2.3.5</label>
<title>Progression-related KoDs</title>
<p>Several studies (<xref ref-type="bibr" rid="B34">34</xref>, <xref ref-type="bibr" rid="B35">35</xref>) have documented that more than half of discrepancies in reporting progressive disease are triggered by debatable detection of new lesions. Such discrepancies are therefore unrelated to baseline evaluations, which prevents their prediction from baseline data. Unlike Keil et&#xa0;al. (<xref ref-type="bibr" rid="B12">12</xref>), we did not measure a systematic risk factor associated with TL selection. The only exception was in a specific trial (Trial 4), when readers selected all of their TLs from completely different organs.</p>
</sec>
</sec>
<sec id="s5_2_4">
<label>4.2.4</label>
<title>Classifications</title>
<p>According to Corso et&#xa0;al. (<xref ref-type="bibr" rid="B20">20</xref>), feature selection does not improve classification performance. For all the KoDs, DL outperformed RF. Overall, classification performance was low. Specifically regarding detection of DOPD, our findings were consistent with studies that reported that the majority of DOPD discrepancies were due to the misdetection of new lesions (<xref ref-type="bibr" rid="B34">34</xref>), which is not linked to baseline. The best classification performance was obtained for BOR, although it remained poor.</p>
<p>Even though certain studies suggest that baseline selection and measurement have considerable influence on the accuracy of response assessment (<xref ref-type="bibr" rid="B12">12</xref>, <xref ref-type="bibr" rid="B13">13</xref>), we found that these correlations, while present, were weak and heterogeneous. Therefore, additional root causes of variability could be studied, such as follow-up management (e.g., measurement variability of tumor burden or perception of NTL change), readers&#x2019; associations, or fluctuating readers&#x2019; perception (<xref ref-type="bibr" rid="B35">35</xref>). We also need to investigate the gap between selecting &#x201c;exactly the same TLs&#x201d; and &#x201c;TLs in same organs&#x201d;, as the first definition is reported to have a strong association with variability in the literature (<xref ref-type="bibr" rid="B36">36</xref>), while we found that the second has a weak association. If strong heterogeneity within the same organ is confirmed, the issue with RECIST would no longer be its subjectivity but its intrinsic inappropriateness for assessing patient follow-up.</p>
</sec>
<sec id="s5_2_5">
<label>4.2.5</label>
<title>Study limitations</title>
<p>First, we did not document some well-known risk factors for variability linked to image quality, such as variability due to reconstruction parameters (different image selection) or the timing of IV contrast injection. Indeed, we assumed that the imaging charters of the included trials adequately standardized the images so that the risks associated with these factors would be negligible.</p>
<p>Second, we did not analyze the impact of the selection of the same TLs by readers. When designing our study, we did not expect that collecting tumor coordinates would be of interest but assumed that checking the association with TLs in the same organ would be adequate.</p>
<p>Third, we did not investigate the impact of the inter-reader variability in assessing the &#x201c;measurability&#x201d; of tumors. Indeed, when a first reader considers a tumor measurable and includes it as a TL in the tumor burden, while the second reader considers the same tumor an NTL, the two readers would differ in their sensitivity for detecting responses.</p>
<p>Fourth, our study focused on lung data and therefore cannot be generalized to other diseases. For some other types of cancer, CT is not the preferred modality, the disease spreads differently, and the tumors feature different phenotypes; thus, risk factors and training performances would be different.</p>
<p>Lastly, as we were blinded to randomization, we were not able to refine our analysis by treatment/control. However, we can assume that KoD statistics and their association with variability are different for treatment and control, as those statistics are directly linked to the occurrence of response or progression events, which are supposedly different for the two arms.</p>
</sec>
<sec id="s5_2_6">
<label>4.2.6</label>
<title>Perspective on operations</title>
<p>Throughout the initial cycles of BICR with double reads, the rate of discrepancies is hard to analyze without a benchmark to refer to. This leads researchers to search for other key performance indicators to assess trial reliability and, eventually, take corrective measures as expeditiously as possible.</p>
<p>Even though we were not able to make powerful predictions, our analysis of the baseline data revealed that some inter-reader differences can affect the reliability of trial results. Tracking baseline assessments during BICR could be a beneficial addition to trial quality control.</p>
<p>The poor predictive performance was probably due, in part, to the fact that a predictive model cannot avoid including follow-up data. Another hypothesis is that, based only on baseline data, the risk factors and features we considered were unable to fully capture the complexity of disagreements. In the future, we can imagine improving the features derived from tumor selection or even creating new ones; considering other risk factors in the classification, such as variability in scan selection or variabilities involving the first follow-up; and optimizing readers&#x2019; associations. However, an open question will remain: &#x201c;How can we manage a baseline assessment with a high probability of becoming discrepant?&#x201d;. If future technologies can predict discrepancies, the prediction is likely to have limited explainability. Moreover, we can question whether the 2 + 1 adjudication paradigm is obsolete, as choosing between two medically justifiable differences (<xref ref-type="bibr" rid="B2">2</xref>) does not make sense.</p>
</sec>
</sec>
</sec>
<sec id="s6" sec-type="conclusions">
<label>5</label>
<title>Conclusion</title>
<p>For the discordance rate to be predictable over time, at each time point, we need to know patient accrual, patient survival, and probability of discrepancy. Discrepancies in date of responses occurred more often than those related to progressions. Careful thought should be given to corrective actions based on the analysis of KoD rates if less than 50% of patients have been enrolled.</p>
<p>Several risk factors for inter-reader discrepancies have been confirmed, albeit with relatively weak implications. We found that for around 50% of patients, readers chose tumors in different organs without impacting the variability of responses. The prediction performance for inter-reader discrepancies based on the baseline selection was poor. Baseline-derived features should be improved or new ones designed, other risk factors must be considered for predicting discordances, and optimal reader association must be investigated.</p>
</sec>
<sec id="s7">
<label>6</label>
<title>Take home messages</title>
<list list-type="simple">
<list-item>
<p>1) The discrepancy rate over time depends on patient accrual, PFS, and the probability of discrepancy during follow-up.</p>
</list-item>
<list-item>
<p>2) At the outset of follow-up, proportionally more DOFR and BOR discrepancies occur than DOPD and PDD discrepancies.</p>
</list-item>
<list-item>
<p>3) The KoD rates are not stabilized during the first half of total patient inclusion and should be interpreted carefully. Baseline variability evaluation can help to determine the risk of study endpoint variability.</p>
</list-item>
<list-item>
<p>4) The inter-reader variability in disease selection at baseline is frequent (50%). The general impact on the variability is more significant for response endpoints.</p>
</list-item>
</list>
</sec>
<sec id="s8" sec-type="data-availability">
<title>Data availability statement</title>
<p>The data analyzed in this study are subject to the following licenses/restrictions: Datasets are proprietary to sponsors. Requests to access these datasets should be directed to <ext-link ext-link-type="uri" xlink:href="https://clinicaltrials.gov">https://clinicaltrials.gov</ext-link>.</p>
</sec>
<sec id="s9" sec-type="author-contributions">
<title>Author contributions</title>
<p>HB: Conceptualization, methodology, data curation, formal analysis, original draft, writing, review, and editing. AI: Conceptualization, methodology, formal analysis, project administration, original draft, review, and editing. Both authors contributed to the article and approved the submitted version.</p>
</sec>
</body>
<back>
<sec id="s10" sec-type="funding-information">
<title>Funding</title>
<p>The authors declare that no financial support was received for the research, authorship, and/or publication of this article.</p>
</sec>
<ack>
<title>Acknowledgments</title>
<p>We would like to thank Jos&#xe9; Luis Macias and Sebastien Grosset for their continued support and commitment.</p>
</ack>
<sec id="s11" sec-type="COI-statement">
<title>Conflict of interest</title>
<p>HB and AI are employees of Median Technologies.</p>
</sec>
<sec id="s12" sec-type="disclaimer">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<fn-group>
<title>Abbreviations</title>
<fn fn-type="abbr">
<p>Acc, Accuracy; AUC, Area Under the ROC Curve; BOR, Best Overall Response; BICR, Blinded Independent Central Review; CI, Confidence Interval; CR, Complete Response; DOFR, Date Of First Response; DOPD, Date of Progressive Disease; DR, Discrepancy Rate; KoD, Kind of Inter-Reader Discrepancy; ML, Machine Learning; NPV, Negative Predictive Value; NTL, Non-Target Lesion; ODD, Odds Ratio; PD, Progressive Disease; PDD, Progressive Disease Declared; PFS, Progression Free Survival; PPV, Positive Predictive Value; PR, Partial Response; RECIST, Response Evaluation Criteria In Solid Tumors; RF, Random Forest; RTPR, Radiological Time Point Response; SD, Stable Disease; SOD, Sum Of Diameters; SPropSOD, Percentage of Specific Sum of Diameters; Se, Sensitivity; Sp, Specificity; TL, Target Lesion.</p>
</fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="B1">
<label>1</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lauritzen</surname> <given-names>PM</given-names>
</name>
<name>
<surname>Andersen</surname> <given-names>JG</given-names>
</name>
<name>
<surname>Stokke</surname> <given-names>MV</given-names>
</name>
<name>
<surname>Tennstrand</surname> <given-names>AL</given-names>
</name>
<name>
<surname>Aamodt</surname> <given-names>R</given-names>
</name>
<name>
<surname>Heggelund</surname> <given-names>T</given-names>
</name>
<etal/>
</person-group>. <article-title>Radiologist-initiated double reading of abdominal CT: Retrospective analysis of the clinical importance of changes to radiology reports</article-title>. <source>BMJ Qual Saf</source> (<year>2016</year>) <volume>25</volume>:<fpage>595</fpage>&#x2013;<lpage>603</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1136/bmjqs-2015-004536</pub-id>
</citation>
</ref>
<ref id="B2">
<label>2</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Beaumont</surname> <given-names>H</given-names>
</name>
<name>
<surname>Iannessi</surname> <given-names>A</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>Y</given-names>
</name>
<name>
<surname>Voyton</surname> <given-names>CM</given-names>
</name>
<name>
<surname>Cillario</surname> <given-names>J</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>Y</given-names>
</name>
</person-group>. <article-title>Blinded independent central review (BICR) in new therapeutic lung cancer trials</article-title>. <source>Cancers (Basel)</source> (<year>2021</year>) <volume>13</volume>:<elocation-id>4533</elocation-id>. doi:&#xa0;<pub-id pub-id-type="doi">10.3390/cancers13184533</pub-id>
</citation>
</ref>
<ref id="B3">
<label>3</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kahan</surname> <given-names>BC</given-names>
</name>
<name>
<surname>Feagan</surname> <given-names>B</given-names>
</name>
<name>
<surname>Jairath</surname> <given-names>V</given-names>
</name>
</person-group>. <article-title>A comparison of approaches for adjudicating outcomes in clinical trials</article-title>. <source>Trials</source> (<year>2017</year>) <volume>18</volume>:<fpage>1</fpage>&#x2013;<lpage>14</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1186/s13063-017-1995-3</pub-id>
</citation>
</ref>
<ref id="B4">
<label>4</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ford</surname> <given-names>R</given-names>
</name>
<name>
<surname>O&#x2019;Neal</surname> <given-names>M</given-names>
</name>
<name>
<surname>Moskowitz</surname> <given-names>S</given-names>
</name>
<name>
<surname>Fraunberger</surname> <given-names>J</given-names>
</name>
</person-group>. <article-title>Adjudication rates between readers in blinded independent central review of oncology studies</article-title>. <source>J Clin Trials</source> (<year>2016</year>) <volume>6</volume>:<fpage>289</fpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.4172/2167-0870.1000289</pub-id>
</citation>
</ref>
<ref id="B5">
<label>5</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Geijer</surname> <given-names>H</given-names>
</name>
<name>
<surname>Geijer</surname> <given-names>M</given-names>
</name>
</person-group>. <article-title>Added value of double reading in diagnostic radiology, a systematic review</article-title>. <source>Insights Imaging</source> (<year>2018</year>) <volume>9</volume>:<fpage>287</fpage>&#x2013;<lpage>301</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1007/s13244-018-0599-0</pub-id>
</citation>
</ref>
<ref id="B6">
<label>6</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Taylor-Phillips</surname> <given-names>S</given-names>
</name>
<name>
<surname>Stinton</surname> <given-names>C</given-names>
</name>
</person-group>. <article-title>Double reading in breast cancer screening: Considerations for policy-making</article-title>. <source>Br J Radiol</source> (<year>2020</year>) <volume>93</volume>(<issue>1106</issue>):<elocation-id>20190610</elocation-id>. doi:&#xa0;<pub-id pub-id-type="doi">10.1259/bjr.20190610</pub-id>
</citation>
</ref>
<ref id="B7">
<label>7</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Eisenhauer</surname> <given-names>EA</given-names>
</name>
<name>
<surname>Therasse</surname> <given-names>P</given-names>
</name>
<name>
<surname>Bogaerts</surname> <given-names>J</given-names>
</name>
<name>
<surname>Schwartz</surname> <given-names>LH</given-names>
</name>
<name>
<surname>Sargent</surname> <given-names>D</given-names>
</name>
<name>
<surname>Ford</surname> <given-names>R</given-names>
</name>
<etal/>
</person-group>. <article-title>New response evaluation criteria in solid tumours: Revised RECIST guideline (version 1.1)</article-title>. <source>Eur J Cancer</source> (<year>2009</year>) <volume>45</volume>:<page-range>228&#x2013;47</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1016/j.ejca.2008.10.026</pub-id>
</citation>
</ref>
<ref id="B8">
<label>8</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schmid</surname> <given-names>AM</given-names>
</name>
<name>
<surname>Raunig</surname> <given-names>DL</given-names>
</name>
<name>
<surname>Miller</surname> <given-names>CG</given-names>
</name>
<name>
<surname>Walovitch</surname> <given-names>RC</given-names>
</name>
<name>
<surname>Ford</surname> <given-names>RW</given-names>
</name>
<name>
<surname>O&#x2019;Connor</surname> <given-names>M</given-names>
</name>
<etal/>
</person-group>. <article-title>Radiologists and clinical trials: part 1 the truth about reader disagreements</article-title>. <source>Ther Innov Regul Sci</source> (<year>2021</year>) <volume>55</volume>(<issue>6</issue>):<page-range>1111&#x2013;21</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1007/s43441-021-00316-6</pub-id>
</citation>
</ref>
<ref id="B9">
<label>9</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Granger</surname> <given-names>CB</given-names>
</name>
<name>
<surname>Vogel</surname> <given-names>V</given-names>
</name>
<name>
<surname>Cummings</surname> <given-names>SR</given-names>
</name>
<name>
<surname>Held</surname> <given-names>P</given-names>
</name>
<name>
<surname>Fiedorek</surname> <given-names>F</given-names>
</name>
<name>
<surname>Lawrence</surname> <given-names>M</given-names>
</name>
<etal/>
</person-group>. <article-title>Do we need to adjudicate major clinical events</article-title>? <source>Clin Trials</source> (<year>2008</year>) <volume>5</volume>:<fpage>56</fpage>&#x2013;<lpage>60</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1177/1740774507087972</pub-id>
</citation>
</ref>
<ref id="B10">
<label>10</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Morse</surname> <given-names>B</given-names>
</name>
<name>
<surname>Jeong</surname> <given-names>D</given-names>
</name>
<name>
<surname>Ihnat</surname> <given-names>G</given-names>
</name>
<name>
<surname>Silva</surname> <given-names>AC</given-names>
</name>
</person-group>. <article-title>Pearls and pitfalls of response evaluation criteria in solid tumors (RECIST) v1.1 non-target lesion assessment</article-title>. <source>Abdom Radiol</source> (<year>2019</year>) <volume>44</volume>:<page-range>766&#x2013;74</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1007/s00261-018-1752-4</pub-id>
</citation>
</ref>
<ref id="B11">
<label>11</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Iannessi</surname> <given-names>A</given-names>
</name>
<name>
<surname>Beaumont</surname> <given-names>H</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>Y</given-names>
</name>
<name>
<surname>Bertrand</surname> <given-names>AS</given-names>
</name>
</person-group>. <article-title>RECIST 1.1 and lesion selection: How to deal with ambiguity at baseline</article-title>? <source>Insights Imaging</source> (<year>2021</year>) <volume>12</volume>(<issue>1</issue>):<fpage>36</fpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1186/s13244-021-00976-w</pub-id>
</citation>
</ref>
<ref id="B12">
<label>12</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Keil</surname> <given-names>S</given-names>
</name>
<name>
<surname>Barabasch</surname> <given-names>A</given-names>
</name>
<name>
<surname>Dirrichs</surname> <given-names>T</given-names>
</name>
<name>
<surname>Bruners</surname> <given-names>P</given-names>
</name>
<name>
<surname>Hansen</surname> <given-names>NL</given-names>
</name>
<name>
<surname>Bieling</surname> <given-names>HB</given-names>
</name>
<etal/>
</person-group>. <article-title>Target lesion selection: an important factor causing variability of response classification in the Response Evaluation Criteria for Solid Tumors 1.1</article-title>. <source>Invest Radiol</source> (<year>2014</year>) <volume>49</volume>:<page-range>509&#x2013;17</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1097/RLI.0000000000000048</pub-id>
</citation>
</ref>
<ref id="B13">
<label>13</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yoon</surname> <given-names>SH</given-names>
</name>
<name>
<surname>Kim</surname> <given-names>KW</given-names>
</name>
<name>
<surname>Goo</surname> <given-names>JM</given-names>
</name>
<name>
<surname>Kim</surname> <given-names>D-W</given-names>
</name>
<name>
<surname>Hahn</surname> <given-names>S</given-names>
</name>
</person-group>. <article-title>Observer variability in RECIST-based tumour burden measurements: a meta-analysis</article-title>. <source>Eur J Cancer</source> (<year>2016</year>) <volume>53</volume>:<fpage>5</fpage>&#x2013;<lpage>15</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1016/j.ejca.2015.10.014</pub-id>
</citation>
</ref>
<ref id="B14">
<label>14</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Iannessi</surname> <given-names>A</given-names>
</name>
<name>
<surname>Beaumont</surname> <given-names>H</given-names>
</name>
</person-group>. <article-title>Breaking down the RECIST 1.1 double read variability in lung trials: What do baseline assessments tell us</article-title>? <source>Front Oncol</source> (<year>2023</year>) <volume>13</volume>:<elocation-id>988784</elocation-id>. doi:&#xa0;<pub-id pub-id-type="doi">10.3389/fonc.2023.988784</pub-id>
</citation>
</ref>
<ref id="B15">
<label>15</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pepe</surname> <given-names>MS</given-names>
</name>
<name>
<surname>Janes</surname> <given-names>H</given-names>
</name>
<name>
<surname>Longton</surname> <given-names>G</given-names>
</name>
<name>
<surname>Leisenring</surname> <given-names>W</given-names>
</name>
<name>
<surname>Newcomb</surname> <given-names>P</given-names>
</name>
</person-group>. <article-title>Limitations of the odds ratio in gauging the performance of a diagnostic, prognostic, or screening marker</article-title>. <source>Am J Epidemiol</source> (<year>2004</year>) <volume>159</volume>:<page-range>882&#x2013;90</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1093/aje/kwh101</pub-id>
</citation>
</ref>
<ref id="B16">
<label>16</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Clopper</surname> <given-names>CJ</given-names>
</name>
<name>
<surname>Pearson</surname> <given-names>ES</given-names>
</name>
</person-group>. <article-title>The use of confidence or fiducial limits illustrated in the case of the binomial</article-title>. <source>Biometrika</source> (<year>1934</year>) <volume>26</volume>:<page-range>404&#x2013;13</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1093/biomet/26.4.404</pub-id>
</citation>
</ref>
<ref id="B17">
<label>17</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hayter</surname> <given-names>AJ</given-names>
</name>
</person-group>. <article-title>A proof of the conjecture that the Tukey-Kramer multiple comparisons procedure is conservative</article-title>. <source>Ann Stat</source> (<year>1984</year>) <volume>12</volume>:<fpage>590</fpage>&#x2013;<lpage>606</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1214/aos/1176346392</pub-id>
</citation>
</ref>
<ref id="B18">
<label>18</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Marascuilo</surname> <given-names>LA</given-names>
</name>
</person-group>. <article-title>Extensions of the significance test for one-parameter signal detection hypotheses</article-title>. <source>Psychometrika</source> (<year>1970</year>) <volume>35</volume>:<page-range>237&#x2013;43</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1007/BF02291265</pub-id>
</citation>
</ref>
<ref id="B19">
<label>19</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Stang</surname> <given-names>A</given-names>
</name>
</person-group>. <article-title>Kenneth J. Rothman: Epidemiology. An introduction</article-title>. <source>Eur J Epidemiol</source> (<year>2012</year>) <volume>27</volume>:<page-range>827&#x2013;9</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1007/s10654-012-9732-4</pub-id>
</citation>
</ref>
<ref id="B20">
<label>20</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Corso</surname> <given-names>F</given-names>
</name>
<name>
<surname>Tini</surname> <given-names>G</given-names>
</name>
<name>
<surname>Lo Presti</surname> <given-names>G</given-names>
</name>
<name>
<surname>Garau</surname> <given-names>N</given-names>
</name>
<name>
<surname>De Angelis</surname> <given-names>SP</given-names>
</name>
<name>
<surname>Bellerba</surname> <given-names>F</given-names>
</name>
<etal/>
</person-group>. <article-title>The challenge of choosing the best classification method in radiomic analyses: Recommendations and applications to lung cancer CT images</article-title>. <source>Cancers (Basel)</source> (<year>2021</year>) <volume>13</volume>:<fpage>3088</fpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.3390/cancers13123088</pub-id>
</citation>
</ref>
<ref id="B21">
<label>21</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Guyon</surname> <given-names>I</given-names>
</name>
<name>
<surname>Weston</surname> <given-names>J</given-names>
</name>
<name>
<surname>Barnhill</surname> <given-names>S</given-names>
</name>
<name>
<surname>Vapnik</surname> <given-names>V</given-names>
</name>
</person-group>. <article-title>Gene selection for cancer classification using support vector machines</article-title>. <source>Mach Learn</source> (<year>2002</year>) <volume>46</volume>:<fpage>389</fpage>&#x2013;<lpage>422</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1023/A:1012487302797</pub-id>
</citation>
</ref>
<ref id="B22">
<label>22</label>
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Aiello</surname> <given-names>S</given-names>
</name>
<name>
<surname>Eckstrand</surname> <given-names>E</given-names>
</name>
<name>
<surname>Fu</surname> <given-names>E</given-names>
</name>
<name>
<surname>Landry</surname> <given-names>M</given-names>
</name>
<name>
<surname>Aboyoun</surname> <given-names>P</given-names>
</name>
</person-group>. <source>Machine Learning with R and H2O</source> (<year>2022</year>). Available at: <uri xlink:href="http://h2o.ai/resources/">http://h2o.ai/resources/</uri>.</citation>
</ref>
<ref id="B23">
<label>23</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rosenkrantz</surname> <given-names>AB</given-names>
</name>
<name>
<surname>Duszak</surname> <given-names>R</given-names>
</name>
<name>
<surname>Babb</surname> <given-names>JS</given-names>
</name>
<name>
<surname>Glover</surname> <given-names>M</given-names>
</name>
<name>
<surname>Kang</surname> <given-names>SK</given-names>
</name>
</person-group>. <article-title>Discrepancy rates and clinical impact of imaging secondary interpretations: A systematic review and meta-analysis</article-title>. <source>J Am Coll Radiol</source> (<year>2018</year>) <volume>15</volume>:<page-range>1222&#x2013;31</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1016/j.jacr.2018.05.037</pub-id>
</citation>
</ref>
<ref id="B24">
<label>24</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gibson</surname> <given-names>EJ</given-names>
</name>
<name>
<surname>Begum</surname> <given-names>N</given-names>
</name>
<name>
<surname>Koblbauer</surname> <given-names>I</given-names>
</name>
<name>
<surname>Dranitsaris</surname> <given-names>G</given-names>
</name>
<name>
<surname>Liew</surname> <given-names>D</given-names>
</name>
<name>
<surname>McEwan</surname> <given-names>P</given-names>
</name>
<etal/>
</person-group>. <article-title>Cohort versus patient level simulation for the economic evaluation of single versus combination immuno-oncology therapies in metastatic melanoma</article-title>. <source>J Med Econ</source> (<year>2019</year>) <volume>22</volume>:<page-range>531&#x2013;44</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1080/13696998.2019.1569446</pub-id>
</citation>
</ref>
<ref id="B25">
<label>25</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Raskin</surname> <given-names>S</given-names>
</name>
<name>
<surname>Klang</surname> <given-names>E</given-names>
</name>
<name>
<surname>Amitai</surname> <given-names>M</given-names>
</name>
</person-group>. <article-title>Target versus non-target lesions in determining disease progression: analysis of 545 patients</article-title>. <source>Cancer Imaging</source> (<year>2015</year>) <volume>15</volume>:<fpage>2015</fpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1186/1470-7330-15-s1-s8</pub-id>
</citation>
</ref>
<ref id="B26">
<label>26</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Park</surname> <given-names>I</given-names>
</name>
<name>
<surname>Park</surname> <given-names>K</given-names>
</name>
<name>
<surname>Park</surname> <given-names>S</given-names>
</name>
<name>
<surname>Ahn</surname> <given-names>Y</given-names>
</name>
<name>
<surname>Ahn</surname> <given-names>JH</given-names>
</name>
<name>
<surname>Choi</surname> <given-names>HJ</given-names>
</name>
<etal/>
</person-group>. <article-title>Response evaluation criteria in solid tumors response of the primary lesion in metastatic renal cell carcinomas treated with sunitinib: Does the primary lesion have to be regarded as a target lesion</article-title>? <source>Clin Genitourin Cancer</source> (<year>2013</year>) <volume>11</volume>:<page-range>276&#x2013;82</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1016/j.clgc.2012.12.005</pub-id>
</citation>
</ref>
<ref id="B27">
<label>27</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lheureux</surname> <given-names>S</given-names>
</name>
<name>
<surname>Wilson</surname> <given-names>MK</given-names>
</name>
<name>
<surname>O&#x2019;Malley</surname> <given-names>M</given-names>
</name>
<name>
<surname>Sinaei</surname> <given-names>M</given-names>
</name>
<name>
<surname>Oza</surname> <given-names>AM</given-names>
</name>
</person-group>. <article-title>Non-target progression - The fine line between objectivity and subjectivity</article-title>. <source>Eur J Cancer</source> (<year>2014</year>) <volume>50</volume>:<page-range>3271&#x2013;2</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1016/j.ejca.2014.08.021</pub-id>
</citation>
</ref>
<ref id="B28">
<label>28</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Humbert</surname> <given-names>O</given-names>
</name>
<name>
<surname>Chardin</surname> <given-names>D</given-names>
</name>
</person-group>. <article-title>Dissociated response in metastatic cancer: an atypical pattern brought into the spotlight with immunotherapy</article-title>. <source>Front Oncol</source> (<year>2020</year>) <volume>10</volume>:<elocation-id>566297</elocation-id>. doi:&#xa0;<pub-id pub-id-type="doi">10.3389/fonc.2020.566297</pub-id>
</citation>
</ref>
<ref id="B29">
<label>29</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Seymour</surname> <given-names>L</given-names>
</name>
<name>
<surname>Bogaerts</surname> <given-names>J</given-names>
</name>
<name>
<surname>Perrone</surname> <given-names>A</given-names>
</name>
<name>
<surname>Ford</surname> <given-names>R</given-names>
</name>
<name>
<surname>Schwartz</surname> <given-names>LH</given-names>
</name>
<name>
<surname>Mandrekar</surname> <given-names>S</given-names>
</name>
<etal/>
</person-group>. <article-title>iRECIST: guidelines for response criteria for use in trials testing immunotherapeutics</article-title>. <source>Lancet Oncol</source> (<year>2017</year>) <volume>18</volume>:<page-range>e143&#x2013;52</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1016/S1470-2045(17)30074-8</pub-id>
</citation>
</ref>
<ref id="B30">
<label>30</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kuhl</surname> <given-names>CK</given-names>
</name>
<name>
<surname>Alparslan</surname> <given-names>Y</given-names>
</name>
<name>
<surname>Schmoee</surname> <given-names>J</given-names>
</name>
<name>
<surname>Sequeira</surname> <given-names>B</given-names>
</name>
<name>
<surname>Keulers</surname> <given-names>A</given-names>
</name>
<name>
<surname>Br&#xfc;mmendorf</surname> <given-names>TH</given-names>
</name>
<etal/>
</person-group>. <article-title>Validity of RECIST version 1.1 for response assessment in metastatic cancer: A prospective, multireader study</article-title>. <source>Radiology</source> (<year>2019</year>) <volume>290</volume>:<page-range>349&#x2013;56</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1148/radiol.2018180648</pub-id>
</citation>
</ref>
<ref id="B31">
<label>31</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sharma</surname> <given-names>M</given-names>
</name>
<name>
<surname>Singareddy</surname> <given-names>A</given-names>
</name>
<name>
<surname>Bajpai</surname> <given-names>S</given-names>
</name>
<name>
<surname>Narang</surname> <given-names>J</given-names>
</name>
<name>
<surname>O&#x2019;Connor</surname> <given-names>M</given-names>
</name>
<name>
<surname>Jarecha</surname> <given-names>R</given-names>
</name>
</person-group>. <article-title>To determine correlation of inter reader variability in sum of diameters using RECIST 1.1 with end point assessment in lung cancer</article-title>. <source>J Clin Oncol</source> (<year>2021</year>) <volume>39</volume>:<page-range>e13557&#x2013;7</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1200/JCO.2021.39.15_suppl.e13557</pub-id>
</citation>
</ref>
<ref id="B32">
<label>32</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lopci</surname> <given-names>E</given-names>
</name>
</person-group>. <article-title>Immunotherapy monitoring with immune checkpoint inhibitors based on [18F]FDG PET/CT in metastatic melanomas and lung cancer</article-title>. <source>J Clin Med</source> (<year>2021</year>) <volume>10</volume>:<fpage>5160</fpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.3390/jcm10215160</pub-id>
</citation>
</ref>
<ref id="B33">
<label>33</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Morgan</surname> <given-names>RL</given-names>
</name>
<name>
<surname>Camidge</surname> <given-names>DR</given-names>
</name>
</person-group>. <article-title>Reviewing RECIST in the era of prolonged and targeted therapy</article-title>. <source>J Thorac Oncol</source> (<year>2018</year>) <volume>13</volume>:<page-range>154&#x2013;64</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1016/j.jtho.2017.10.015</pub-id>
</citation>
</ref>
<ref id="B34">
<label>34</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Beaumont</surname> <given-names>H</given-names>
</name>
<name>
<surname>Evans</surname> <given-names>TL</given-names>
</name>
<name>
<surname>Klifa</surname> <given-names>C</given-names>
</name>
<name>
<surname>Guermazi</surname> <given-names>A</given-names>
</name>
<name>
<surname>Hong</surname> <given-names>SR</given-names>
</name>
<name>
<surname>Chadjaa</surname> <given-names>M</given-names>
</name>
<etal/>
</person-group>. <article-title>Discrepancies of assessments in a RECIST 1.1 phase II clinical trial &#x2013; association between adjudication rate and variability in images and tumors selection</article-title>. <source>Cancer Imaging</source> (<year>2018</year>) <volume>18</volume>:<fpage>50</fpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1186/s40644-018-0186-0</pub-id>
</citation>
</ref>
<ref id="B35">
<label>35</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Abramson</surname> <given-names>RG</given-names>
</name>
<name>
<surname>McGhee</surname> <given-names>CR</given-names>
</name>
<name>
<surname>Lakomkin</surname> <given-names>N</given-names>
</name>
<name>
<surname>Arteaga</surname> <given-names>CL</given-names>
</name>
</person-group>. <article-title>Pitfalls in RECIST data extraction for clinical trials</article-title>. <source>Acad Radiol</source> (<year>2015</year>) <volume>22</volume>(<issue>6</issue>):<fpage>779</fpage>&#x2013;<lpage>86</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1016/j.acra.2015.01.015</pub-id>
</citation>
</ref>
<ref id="B36">
<label>36</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Maskell</surname> <given-names>G</given-names>
</name>
</person-group>. <article-title>Error in radiology&#x2014;where are we now</article-title>? <source>Br J Radiol</source> (<year>2019</year>) <volume>92</volume>:<fpage>8</fpage>&#x2013;<lpage>9</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1259/bjr.20180845</pub-id>
</citation>
</ref>
</ref-list>
<app-group>
<app id="app1">
<title>Annexes</title>
<sec id="app1A">
<title>A. List of features</title>
<table-wrap>
<table frame="hsides">
<thead>
<tr>
<th valign="top" align="left">ID#</th>
<th valign="top" align="left">Feature abbreviation</th>
<th valign="top" align="left">Feature description</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">1</td>
<td valign="top" align="left">SumTL</td>
<td valign="top" align="left">Sum of the number of TLs selected by the two readers.</td>
</tr>
<tr>
<td valign="top" align="left">2</td>
<td valign="top" align="left">MinTL</td>
<td valign="top" align="left">Minimum number of TLs selected by the readers.</td>
</tr>
<tr>
<td valign="top" align="left">3</td>
<td valign="top" align="left">DeltaTL</td>
<td valign="top" align="left">Difference in the number of TLs selected by the two readers.</td>
</tr>
<tr>
<td valign="top" align="left">4</td>
<td valign="top" align="left">DeltaBurden</td>
<td valign="top" align="left">Relative difference of SOD between the two readers: Abs(SOD1-SOD2)/(SOD1+SOD2).</td>
</tr>
<tr>
<td valign="top" align="left">5</td>
<td valign="top" align="left">MeanBurden</td>
<td valign="top" align="left">Average SOD measured by the two readers.</td>
</tr>
<tr>
<td valign="top" align="left">6</td>
<td valign="top" align="left">NoDisease</td>
<td valign="top" align="left">No disease was found by at least one reader.</td>
</tr>
<tr>
<td valign="top" align="left">7</td>
<td valign="top" align="left">NonMeasurable</td>
<td valign="top" align="left">Non-measurable disease (TL) was found by at least one reader.</td>
</tr>
<tr>
<td valign="top" align="left">8-39</td>
<td valign="top" align="left">TL Organ</td>
<td valign="top" align="left">One of the readers selected their list of TLs in given organs from the catalogue (See <xref ref-type="app" rid="app1B">Annex B</xref>), while the other reader did not use the same list.</td>
</tr>
<tr>
<td valign="top" align="left">40</td>
<td valign="top" align="left">NotSameLoc</td>
<td valign="top" align="left">Binary: At least one TL selected in different organs by readers.</td>
</tr>
<tr>
<td valign="top" align="left">41</td>
<td valign="top" align="left">NoIdenticalLoc.</td>
<td valign="top" align="left">Binary: Readers did not select any TL in common organ.</td>
</tr>
<tr>
<td valign="top" align="left">42-73</td>
<td valign="top" align="left">NTL Organ</td>
<td valign="top" align="left">One of the readers selected their list of NTLs in given organs from the catalogue (See <xref ref-type="app" rid="app1B">Annex B</xref>), while the other reader did not use the same list.</td>
</tr>
<tr>
<td valign="top" align="left">74</td>
<td valign="top" align="left">NotSameNTLLoc</td>
<td valign="top" align="left">Binary: At least one NTL selected in different organs by readers.</td>
</tr>
<tr>
<td valign="top" align="left">75</td>
<td valign="top" align="left">NoIdenticalNTLLoc.</td>
<td valign="top" align="left">Binary: Readers did not select any NTL in common organ.</td>
</tr>
<tr>
<td valign="top" align="left">76</td>
<td valign="top" align="left">SPropSOD</td>
<td valign="top" align="left">% of specific SOD. (See <xref ref-type="app" rid="app1C">Annex C</xref>)</td>
</tr>
<tr>
<td valign="top" align="left">77</td>
<td valign="top" align="left">Unfreq. Disease</td>
<td valign="top" align="left">Binary: At least one reader selected an infrequent disease location (locations #10 to #31, see <xref ref-type="app" rid="app1B">Annex B</xref>).</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>NTL: non-target lesion; SOD: sum of diameters; TL: target lesion.</p>
</fn>
</table-wrap-foot>
</table-wrap>
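<p>As an illustration only, the count and burden features (ID# 1 to 5) can be sketched in Python as below. The <monospace>BaselineRead</monospace> container and the <monospace>burden_features</monospace> helper are hypothetical names that do not come from the study. The sketch also assumes that DeltaTL is an absolute difference and that DeltaBurden is set to 0 when neither reader reported any measurable burden, two points the table does not specify.</p>
<preformat>
```python
from dataclasses import dataclass
from typing import List


@dataclass
class BaselineRead:
    """Hypothetical container: diameters (mm) of the TLs selected by one reader."""
    tl_diameters_mm: List[float]

    @property
    def n_tl(self) -> int:
        # Number of target lesions selected by this reader
        return len(self.tl_diameters_mm)

    @property
    def sod(self) -> float:
        # Sum of diameters (SOD) reported by this reader
        return sum(self.tl_diameters_mm)


def burden_features(r1: BaselineRead, r2: BaselineRead) -> dict:
    """Features ID# 1 to 5 for one patient read by two readers."""
    total = r1.sod + r2.sod
    return {
        "SumTL": r1.n_tl + r2.n_tl,
        "MinTL": min(r1.n_tl, r2.n_tl),
        # Assumption: the difference is taken as an absolute value
        "DeltaTL": abs(r1.n_tl - r2.n_tl),
        # Abs(SOD1-SOD2)/(SOD1+SOD2); assumption: 0.0 when both SODs are zero
        "DeltaBurden": abs(r1.sod - r2.sod) / total if total > 0 else 0.0,
        "MeanBurden": total / 2,
    }
```
</preformat>
<p>For example, a reader reporting two TLs of 30 and 20 mm and a second reader reporting three TLs of 40, 20 and 20 mm give SumTL = 5, MinTL = 2, DeltaTL = 1, DeltaBurden = 30/130 and MeanBurden = 65 mm.</p>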
</sec>
<sec id="app1B">
<title>B. List of disease locations</title>
<table-wrap>
<table frame="hsides">
<tbody>
<tr>
<td valign="top" align="left">1. Lung</td>
<td valign="top" align="left">2. Liver</td>
<td valign="top" align="left">3. Lymph node</td>
<td valign="top" align="left">4. Pleura</td>
<td valign="top" align="left">5. Chest wall</td>
</tr>
<tr>
<td valign="top" align="left">6. Bone</td>
<td valign="top" align="left">7. Abdominal wall</td>
<td valign="top" align="left">8. Adrenal gland</td>
<td valign="top" align="left">9. Brain</td>
<td valign="top" align="left" style="background-color:#d5dce4">10. Spleen</td>
</tr>
<tr>
<td valign="top" align="left" style="background-color:#d5dce4">11. Bone marrow</td>
<td valign="top" align="left" style="background-color:#d5dce4">12. Esophagus</td>
<td valign="top" align="left" style="background-color:#d5dce4">13. Kidney</td>
<td valign="top" align="left" style="background-color:#d5dce4">14. Mediastinum</td>
<td valign="top" align="left" style="background-color:#d5dce4">15. Peritoneum</td>
</tr>
<tr>
<td valign="top" align="left" style="background-color:#d5dce4">16. Muscle</td>
<td valign="top" align="left" style="background-color:#d5dce4">17. Subcutis</td>
<td valign="top" align="left" style="background-color:#d5dce4">18. Pericardial cavity</td>
<td valign="top" align="left" style="background-color:#d5dce4">19. Skin</td>
<td valign="top" align="left" style="background-color:#d5dce4">20. Pancreas</td>
</tr>
<tr>
<td valign="top" align="left" style="background-color:#d5dce4">21. Gastric</td>
<td valign="top" align="left" style="background-color:#d5dce4">22. Blood vessels</td>
<td valign="top" align="left" style="background-color:#d5dce4">23. Heart</td>
<td valign="top" align="left" style="background-color:#d5dce4">24. Neck</td>
<td valign="top" align="left" style="background-color:#d5dce4">25. Spinal cord</td>
</tr>
<tr>
<td valign="top" align="left" style="background-color:#d5dce4">26. Thyroid</td>
<td valign="top" align="left" style="background-color:#d5dce4">27. Diaphragm</td>
<td valign="top" align="left" style="background-color:#d5dce4">28. Pelvis</td>
<td valign="top" align="left" style="background-color:#d5dce4">29. Breast</td>
<td valign="top" align="left" style="background-color:#d5dce4">30. Trachea</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>&#x2022; An additional 31st category, labeled &#x201c;miscellaneous&#x201d;, was added.</p>
<p>&#x2022; Infrequent disease locations are those numbered from 10 (spleen) to 31 (miscellaneous).</p>
</fn>
</table-wrap-foot>
</table-wrap>
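<p>A minimal Python sketch of how the binary TL location features (ID# 40, 41 and 77 of <xref ref-type="app" rid="app1A">Annex A</xref>) could be derived from this catalogue is given below. It is not the authors' implementation: <monospace>location_features</monospace> and <monospace>INFREQUENT_IDS</monospace> are hypothetical names, each reader's selection is assumed to be available as a list of catalogue numbers, and the infrequent-disease flag is assumed to be evaluated on TL locations.</p>
<preformat>
```python
from typing import Iterable

# Location IDs follow the Annex B catalogue; 31 is the "miscellaneous" category.
INFREQUENT_IDS = frozenset(range(10, 32))  # 10 (spleen) to 31 (miscellaneous)


def location_features(tl_ids_r1: Iterable[int], tl_ids_r2: Iterable[int]) -> dict:
    """Binary TL location features for one patient (hypothetical helper)."""
    r1, r2 = set(tl_ids_r1), set(tl_ids_r2)
    return {
        # ID# 40: at least one organ holds TLs for only one of the two readers
        "NotSameLoc": r1 != r2,
        # ID# 41: the readers share no TL organ at all
        "NoIdenticalLoc": r1.isdisjoint(r2),
        # ID# 77: at least one reader selected an infrequent location
        "UnfreqDisease": not INFREQUENT_IDS.isdisjoint(r1.union(r2)),
    }
```
</preformat>
<p>With one reader selecting TLs in lung and liver (1, 2) and the other in lung and kidney (1, 13), NotSameLoc and UnfreqDisease are true while NoIdenticalLoc is false, because the lung is shared.</p>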
</sec>
<sec id="app1C">
<title>C. Specific proportional sum of diameters</title>
<p>Considering:</p>
<p>SOD<sub>i</sub>: Tumor burden as reported by reader i</p>
<p>C_SOD<sub>i</sub>: Part of the tumor burden located in organs also selected by the other reader</p>
<p>S_SOD<sub>i</sub>: Part of the tumor burden located in organs not selected by the other reader <inline-formula>
<mml:math display="inline" id="im1">
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mi>O</mml:mi>
<mml:msub>
<mml:mi>D</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mi>C</mml:mi>
<mml:mo>_</mml:mo>
<mml:mi>S</mml:mi>
<mml:mi>O</mml:mi>
<mml:msub>
<mml:mi>D</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mi>S</mml:mi>
<mml:mo>_</mml:mo>
<mml:mi>S</mml:mi>
<mml:mi>O</mml:mi>
<mml:msub>
<mml:mi>D</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> (Equation 1)</p>
<p>The specific SOD can be defined as: <inline-formula>
<mml:math display="inline" id="im2">
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>S</mml:mi>
<mml:mi>O</mml:mi>
<mml:mi>D</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>100</mml:mn>
<mml:mo>*</mml:mo>
<mml:mi>S</mml:mi>
<mml:mo>_</mml:mo>
<mml:mi>S</mml:mi>
<mml:mi>O</mml:mi>
<mml:msub>
<mml:mi>D</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">/</mml:mo>
<mml:mi>S</mml:mi>
<mml:mi>O</mml:mi>
<mml:msub>
<mml:mi>D</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> (Equation 2)</p>
<p>Thus, SPropSOD = 0 when both readers targeted the same organs, and SPropSOD = 100 when the readers targeted completely different organs.</p>
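<p>The definition can be illustrated with a short Python sketch, given here under stated assumptions rather than as the study's code. The helper <monospace>sprop_sod</monospace> is a hypothetical name, each reader's burden is assumed to be stored as a mapping from organ to the summed TL diameters in that organ, and the value is computed per reader i as in Equation 2, which scales the ratio to a percentage. How the two per-reader values are combined into a single feature is not stated here, so the sketch leaves them separate.</p>
<preformat>
```python
from typing import Dict


def sprop_sod(own: Dict[str, float], other: Dict[str, float]) -> float:
    """Percentage of reader i's SOD lying in organs the other reader did not target.

    own, other: hypothetical mappings organ -> summed TL diameters (mm).
    """
    sod = sum(own.values())  # SOD_i = C_SOD_i + S_SOD_i (Equation 1)
    if sod == 0:
        # Assumption: the ratio is left undefined without measurable burden
        raise ValueError("SOD is zero, SPropSOD is undefined")
    # S_SOD_i: burden in organs absent from the other reader's selection
    s_sod = sum(d for organ, d in own.items() if organ not in other)
    return 100.0 * s_sod / sod  # Equation 2


reader1 = {"Lung": 40.0, "Liver": 20.0}
reader2 = {"Lung": 35.0, "Lymph node": 15.0}
print(sprop_sod(reader2, reader1))  # 15 of 50 mm is specific to reader 2: 30.0
```
</preformat>
<p>Identical organ selections give 0 and fully disjoint selections give 100, matching the two limiting cases of Equation 2.</p>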
</sec>
</app>
</app-group>
</back>
</article>