<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="2.3" xml:lang="EN">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Immunol.</journal-id>
<journal-title>Frontiers in Immunology</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Immunol.</abbrev-journal-title>
<issn pub-type="epub">1664-3224</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fimmu.2023.1128326</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Immunology</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Performance comparison of TCR-pMHC prediction tools reveals a strong data dependency</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Deng</surname>
<given-names>Lihua</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="author-notes" rid="fn003">
<sup>&#x2020;</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1851050"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Ly</surname>
<given-names>Cedric</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<xref ref-type="author-notes" rid="fn003">
<sup>&#x2020;</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/2150199"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Abdollahi</surname>
<given-names>Sina</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/2270403"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Zhao</surname>
<given-names>Yu</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/2146756"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Prinz</surname>
<given-names>Immo</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="author-notes" rid="fn001">
<sup>*</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/24822"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Bonn</surname>
<given-names>Stefan</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<xref ref-type="author-notes" rid="fn001">
<sup>*</sup>
</xref>
</contrib>
</contrib-group>
<aff id="aff1">
<sup>1</sup>
<institution>Institute of Systems Immunology, University Medical Center Hamburg-Eppendorf</institution>, <addr-line>Hamburg</addr-line>, <country>Germany</country>
</aff>
<aff id="aff2">
<sup>2</sup>
<institution>Institute of Medical Systems Biology, University Medical Center Hamburg-Eppendorf</institution>, <addr-line>Hamburg</addr-line>, <country>Germany</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>Edited by: Pieter Meysman, University of Antwerp, Belgium</p>
</fn>
<fn fn-type="edited-by">
<p>Reviewed by: Philippe Auguste Robert, University of Oslo, Norway; Lintai Da, Shanghai Jiao Tong University, China</p>
</fn>
<fn fn-type="corresp" id="fn001">
<p>*Correspondence: Immo Prinz, <email xlink:href="mailto:immo.prinz@zmnh.uni-hamburg.de">immo.prinz@zmnh.uni-hamburg.de</email>; Stefan Bonn, <email xlink:href="mailto:stefan.bonn@zmnh.uni-hamburg.de">stefan.bonn@zmnh.uni-hamburg.de</email>
</p>
</fn>
<fn fn-type="equal" id="fn003">
<p>&#x2020;These authors have contributed equally to this work and share first authorship</p>
</fn>
<fn fn-type="other" id="fn002">
<p>This article was submitted to Systems Immunology, a section of the journal Frontiers in Immunology</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>18</day>
<month>04</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="collection">
<year>2023</year>
</pub-date>
<volume>14</volume>
<elocation-id>1128326</elocation-id>
<history>
<date date-type="received">
<day>20</day>
<month>12</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>24</day>
<month>03</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2023 Deng, Ly, Abdollahi, Zhao, Prinz and Bonn</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Deng, Ly, Abdollahi, Zhao, Prinz and Bonn</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>The interaction of T-cell receptors with peptide-major histocompatibility complex molecules (TCR-pMHC) plays a crucial role in adaptive immune responses. Currently there are various models aiming at predicting TCR-pMHC binding, while a standard dataset and procedure to compare the performance of these approaches is still missing. In this work we provide a general method for data collection, preprocessing, splitting and generation of negative examples, as well as comprehensive datasets to compare TCR-pMHC prediction models. We collected, harmonized, and merged all the major publicly available TCR-pMHC binding data and compared the performance of five state-of-the-art deep learning models (TITAN, NetTCR-2.0, ERGO, DLpTCR and ImRex) using this data. Our performance evaluation focuses on two scenarios: 1) different splitting methods for generating training and testing data to assess model generalization and 2) different data versions that vary in size and peptide imbalance to assess model robustness. Our results indicate that the five contemporary models do not generalize to peptides that have not been in the training set. We can also show that model performance is strongly dependent on the data balance and size, which indicates a relatively low model robustness. These results suggest that TCR-pMHC binding prediction remains highly challenging and requires further high quality data and novel algorithmic approaches.</p>
</abstract>
<kwd-group>
<kwd>T-cell receptor (TCR)</kwd>
<kwd>peptide</kwd>
<kwd>MHC</kwd>
<kwd>machine learning/deep learning</kwd>
<kwd>TCR specificity prediction</kwd>
</kwd-group>
<contract-num rid="cn002">497674564</contract-num>
<contract-sponsor id="cn001">Deutsche Forschungsgemeinschaft<named-content content-type="fundref-id">10.13039/501100001659</named-content>
</contract-sponsor>
<contract-sponsor id="cn002">Deutsche Forschungsgemeinschaft<named-content content-type="fundref-id">10.13039/501100001659</named-content>
</contract-sponsor>
<counts>
<fig-count count="7"/>
<table-count count="1"/>
<equation-count count="1"/>
<ref-count count="22"/>
<page-count count="9"/>
<word-count count="4764"/>
</counts>
</article-meta>
</front>
<body>
<sec id="s1" sec-type="intro">
<label>1</label>
<title>Introduction</title>
<p>T-cell receptors (TCR) play a crucial role in adaptive immunity, mainly through the recognition of peptide fragments from foreign pathogens that are presented by major histocompatibility complex (MHC) molecules. TCRs consist of two transmembrane polypeptide chains, an &#x3b1; and a &#x3b2; chain, which form a heterodimer on the cell surface. The extraordinary diversity of the TCR repertoire is mainly attributed to a somatic recombination process, V(D)J recombination. Humans can theoretically generate more than 10<sup>15</sup> different antigen-specific TCRs Uziela et&#xa0;al. (<xref ref-type="bibr" rid="B1">1</xref>). The diversity of the TCR &#x3b1; and &#x3b2; chains resides mainly in the complementarity-determining regions (CDRs), with CDR3 being the contact site for the peptide fragment and consequently the most important region for antigen recognition Hennecke and Wiley (<xref ref-type="bibr" rid="B2">2</xref>). There are two types of MHC molecules, MHC class I and MHC class II, presenting peptides to CD8<sup>+</sup> and CD4<sup>+</sup> T cells, respectively.</p>
<p>The major public data resources for TCR-pMHC binding data are VDJdb Goncharov et&#xa0;al. (<xref ref-type="bibr" rid="B3">3</xref>), IEDB Vita et&#xa0;al. (<xref ref-type="bibr" rid="B4">4</xref>), McPAS-TCR Tickotsky et&#xa0;al. (<xref ref-type="bibr" rid="B5">5</xref>), ImmuneCODE Nolan et&#xa0;al. (<xref ref-type="bibr" rid="B6">6</xref>), TBAdb Zhang et&#xa0;al. (<xref ref-type="bibr" rid="B7">7</xref>) and 10X Genomics 10x Genomics (<xref ref-type="bibr" rid="B8">8</xref>), which all contain TCR CDR3 &#x3b2; chain information. These data are precious, since identifying cognate TCR-pMHC binding pairs typically requires both pMHC multimer technology and single-cell sequencing Pai and Satpathy (<xref ref-type="bibr" rid="B9">9</xref>); Joglekar and Li (<xref ref-type="bibr" rid="B10">10</xref>).</p>
<p>This vast diversity of the TCR repertoire makes it difficult to experimentally cover all possible TCR-pMHC binding pairs. Under the assumption that the binding between TCR and pMHC is governed by fundamental physicochemical interaction rules, computational approaches can detect and learn patterns in the data. Machine learning (ML) and deep learning (DL) approaches to predicting the interaction between TCR and pMHC have been explored, resulting in various models such as TITAN, NetTCR-2.0, ERGO, DLpTCR and ImRex Weber et&#xa0;al. (<xref ref-type="bibr" rid="B11">11</xref>); Montemurro et&#xa0;al. (<xref ref-type="bibr" rid="B12">12</xref>); Springer et&#xa0;al. (<xref ref-type="bibr" rid="B13">13</xref>); Xu et&#xa0;al. (<xref ref-type="bibr" rid="B14">14</xref>); Moris et&#xa0;al. (<xref ref-type="bibr" rid="B15">15</xref>). Among these models, ERGO and TITAN integrate natural language processing (NLP) techniques, NetTCR-2.0 and ImRex are based on convolutional neural networks (CNN), and DLpTCR is a combination of a CNN, a fully connected network (FCN) and a deep residual network (ResNet). Unfortunately, to date there exists no appropriate benchmark dataset or workflow to compare contemporary TCR-pMHC prediction models and improve them. In this work, we collected and preprocessed all available major TCR-pMHC data and compared the performance of these state-of-the-art models in different training and testing scenarios.</p>
</sec>
<sec id="s2" sec-type="results">
<label>2</label>
<title>Results</title>
<sec id="s2_1">
<label>2.1</label>
<title>Current available data showed a great imbalance</title>
<p>To compare currently available TCR-pMHC prediction models, we first collected data from the most comprehensive public resources, including 10X Genomics, McPAS-TCR, VDJdb, ImmuneCODE, TBAdb and IEDB, then preprocessed each resource separately and merged the results into one dataset (TCR preprocessed dataset, tpp dataset). The general process is depicted in <xref ref-type="fig" rid="f1">
<bold>Figure&#xa0;1</bold>
</xref>. The tpp dataset amounts to 113762 entries, of which 32237 contain paired TCR chains, 7167 contain only &#x3b1; chains (TRA) and 74358 contain only &#x3b2; chains (TRB) (<xref ref-type="fig" rid="f2">
<bold>Figure&#xa0;2A</bold>
</xref>). The composition of the database is shown in <xref ref-type="fig" rid="f2">
<bold>Figure&#xa0;2B</bold>
</xref>. Among the data resources, ImmuneCODE contains exclusively TRB information, whereas VDJdb contains the highest number of paired-chain examples (<xref ref-type="fig" rid="f2">
<bold>Figure&#xa0;2C</bold>
</xref>). If we further look into the binding pairs between TCRs and peptides presented by MHC molecules, there is a strong imbalance concerning the peptides, i.e. 0.12% of all peptides (20/1659) account for 58.38% of the total entries (66413/113762). More detailed peptides origin concerning different disease categories for each resource is shown in <xref ref-type="supplementary-material" rid="SM1">
<bold>Figure S1</bold>
</xref>.</p>
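The merge-and-profile step described above can be sketched with pandas. This is an illustrative sketch only: the column names (`cdr3_trb`, `cdr3_tra`, `peptide`) and the toy rows are assumptions, not the authors' actual schema or code.

```python
# Illustrative sketch: merge preprocessed resources into one "tpp"-style
# table and profile chain availability and peptide imbalance.
import pandas as pd

# Toy stand-ins for two preprocessed resources (column names assumed).
vdjdb = pd.DataFrame({"cdr3_trb": ["CASSF", None, "CASSY"],
                      "cdr3_tra": ["CAVR", "CAVS", None],
                      "peptide":  ["GILGFVFTL", "GILGFVFTL", "NLVPMVATV"]})
iedb = pd.DataFrame({"cdr3_trb": ["CASSL"], "cdr3_tra": [None],
                     "peptide":  ["GILGFVFTL"]})

tpp = pd.concat([vdjdb, iedb], ignore_index=True)

# Chain availability, analogous to Figure 2A: paired, TRA-only, TRB-only.
paired   = tpp.dropna(subset=["cdr3_trb", "cdr3_tra"])
trb_only = tpp[tpp["cdr3_trb"].notna() & tpp["cdr3_tra"].isna()]

# Peptide imbalance: share of entries covered by the top-k peptides.
counts = tpp["peptide"].value_counts()
top_share = counts.head(20).sum() / len(tpp)
print(len(paired), len(trb_only), round(top_share, 2))
```

On the real tpp dataset, the same `value_counts` profile yields the reported figure that 20 peptides cover more than half of all entries.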
<fig id="f1" position="float">
<label>Figure&#xa0;1</label>
<caption>
<p>Flow chart shows the basic procedure for preparing the different datasets. After collecting data from public resources and merging the preprocessed data into one dataset (TCR preprocessed dataset, tpp dataset), different filtering criteria were applied to obtain the positive examples for <inline-formula>
<mml:math display="inline" id="im1">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula>
<mml:math display="inline" id="im2">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>u</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>f</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula>
<mml:math display="inline" id="im3">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula>
<mml:math display="inline" id="im4">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> datasets. Negative examples were generated within folds (see subsection 4.1.3) after splitting (see subsection 4.1.2) to obtain the complete datasets. <inline-formula>
<mml:math display="inline" id="im5">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>: the base dataset filtered from tpp dataset. <inline-formula>
<mml:math display="inline" id="im6">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>: strict splitting used on <inline-formula>
<mml:math display="inline" id="im7">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>. <inline-formula>
<mml:math display="inline" id="im8">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>u</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>f</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>: uniform splitting used on <inline-formula>
<mml:math display="inline" id="im9">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>. <inline-formula>
<mml:math display="inline" id="im10">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>: the balanced dataset filtered from <inline-formula>
<mml:math display="inline" id="im11">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, then split using uniform splitting. <inline-formula>
<mml:math display="inline" id="im12">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>: the imbalanced dataset filtered from <inline-formula>
<mml:math display="inline" id="im13">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, then split using uniform splitting.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fimmu-14-1128326-g001.tif"/>
</fig>
<fig id="f2" position="float">
<label>Figure&#xa0;2</label>
<caption>
<p>Overview of TCR-pMHC binding data merged from different resources. <bold>(A)</bold> Venn diagram shows the overlap of entries that contain only TRA, paired chains or only TRB. The sizes of the ellipses correspond to the number of entries in each category. <bold>(B)</bold> Pie chart shows the composition of the merged database. The number of entries in each resource is indicated in parentheses. <bold>(C)</bold> TRA and TRB availability for the six major resources.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fimmu-14-1128326-g002.tif"/>
</fig>
<p>In order to compare the performance of TITAN, NetTCR-2.0, ERGO, DLpTCR and ImRex, they need to be trained and tested on the same data. We constructed a base dataset (<inline-formula>
<mml:math display="inline" id="im14">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>), which fulfills the requirements of all of these models so that every model can be trained and tested on it. The criteria are: 1) a peptide length of 9; 2) a CDR3 TRB length in the range of 10 to 18; 3) peptides presented by the HLA-A*02 MHC allele. After applying these criteria, we removed duplicates based on the CDR3 TRB and peptide, resulting in a total of 15331 entries for <inline-formula>
<mml:math display="inline" id="im15">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, across 15039 CDR3 TRB and 691 peptides. The data in <inline-formula>
<mml:math display="inline" id="im16">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is highly imbalanced towards high-frequency peptides: 82.66% (12672) of all entries are derived from the top 20 most frequent peptides. The total number of entries for the top 20 peptides in <inline-formula>
<mml:math display="inline" id="im17">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is shown in <xref ref-type="fig" rid="f3">
<bold>Figure&#xa0;3A</bold>
</xref>. The imbalance of TCRs pairing with the top 20 peptides is highlighted in <xref ref-type="fig" rid="f3">
<bold>Figure&#xa0;3B</bold>
</xref>. The top 20 peptides are paired with 82.66% of the total TCRs, while the remaining peptides are paired with the remaining 17.34% of TCRs. Furthermore, 517 of the 691 peptides have fewer than five examples per peptide in <inline-formula>
<mml:math display="inline" id="im18">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
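The three filtering criteria plus the de-duplication step translate directly into a pandas filter. A minimal sketch, assuming illustrative column names (`peptide`, `cdr3_trb`, `mhc`) and toy rows; the actual preprocessing code may differ.

```python
# Illustrative sketch of the d_base filter: peptide length 9, CDR3 TRB
# length 10-18, HLA-A*02 restriction, then de-duplication on the
# (CDR3 TRB, peptide) pair.
import pandas as pd

tpp = pd.DataFrame({
    "cdr3_trb": ["CASSLGQAYEQYF", "CASSLGQAYEQYF", "CASSF", "CASSPDRGAYEQYF"],
    "peptide":  ["GILGFVFTL", "GILGFVFTL", "GILGFVFTL", "NLVPMVATVLONG"],
    "mhc":      ["HLA-A*02:01", "HLA-A*02:01", "HLA-A*02:01", "HLA-B*07:02"],
})

d_base = tpp[
    (tpp["peptide"].str.len() == 9)                  # criterion 1
    & tpp["cdr3_trb"].str.len().between(10, 18)      # criterion 2
    & tpp["mhc"].str.startswith("HLA-A*02")          # criterion 3
].drop_duplicates(subset=["cdr3_trb", "peptide"])

print(len(d_base))
```

Here only the first row survives: the second is a duplicate, the third has a too-short CDR3, and the fourth fails both the peptide-length and MHC criteria.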
<fig id="f3" position="float">
<label>Figure&#xa0;3</label>
<caption>
<p>Overview of TCR-pMHC positive binding examples for <inline-formula>
<mml:math display="inline" id="im19">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>. <bold>(A)</bold> Barplot shows the number of entries for the top 20 peptides. <bold>(B)</bold> Pie chart shows the proportion of examples for the top 20 peptides vs. the rest in <inline-formula>
<mml:math display="inline" id="im20">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fimmu-14-1128326-g003.tif"/>
</fig>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Comparison of model performance on d<sub>base</sub> indicates that current DL models perform similarly well regardless of model complexity</title>
<p>After acquiring the merged dataset and filtering with the strictest requirements of all tested models, we obtained the <inline-formula>
<mml:math display="inline" id="im21">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> dataset. In the creation of <inline-formula>
<mml:math display="inline" id="im22">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> dataset, two steps were necessary. First, we split the data into five folds, as we use 5-fold cross-validation. We used two different splitting methods (see subsection 4.1.2): uniform splitting, which keeps the peptide distribution equal across all folds, and strict splitting, which keeps the peptides unique to each fold. The second prerequisite was to generate negative examples (see subsection 4.1.3), i.e. combinations of CDR3 &#x3b2; sequences and peptides that do not bind to each other.</p>
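The two fold-assignment schemes can be sketched as follows. This is an illustrative reconstruction, not the authors' code: "uniform" spreads each peptide's examples round-robin across folds, while "strict" maps every peptide to exactly one fold so test peptides are unseen during training.

```python
# Sketch of uniform vs. strict 5-fold assignment (here k=2 on toy data).
import pandas as pd

def uniform_folds(df: pd.DataFrame, k: int = 5) -> pd.Series:
    # Round-robin within each peptide group -> each peptide's examples
    # are spread (near-)equally over all folds.
    return df.groupby("peptide").cumcount() % k

def strict_folds(df: pd.DataFrame, k: int = 5) -> pd.Series:
    # Assign each unique peptide to a single fold -> no peptide overlap
    # between training and testing folds.
    fold_of = {p: i % k for i, p in enumerate(sorted(df["peptide"].unique()))}
    return df["peptide"].map(fold_of)

df = pd.DataFrame({"peptide": ["A"] * 6 + ["B"] * 4,
                   "cdr3_trb": [f"CASS{i}" for i in range(10)]})
df["fold_u"] = uniform_folds(df, k=2)
df["fold_s"] = strict_folds(df, k=2)

# Strict splitting: every peptide lives in exactly one fold.
assert (df.groupby("peptide")["fold_s"].nunique() == 1).all()
```

With uniform splitting, peptide "A" appears in both folds; with strict splitting, "A" is confined to one fold and "B" to the other.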
<p>Next, we tested six different DL models from five publications. The chosen models predict the binding between a given TCR-pMHC pair. The input features are the CDR3 TRB sequence of the TCR and the amino acid (aa) sequence of the peptide. The six models differ in their approaches to embedding and processing these features. This subsection compares the different approaches and measures their performance. Models were trained and tested on <inline-formula>
<mml:math display="inline" id="im23">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> using 5-fold cross-validation. In <xref ref-type="table" rid="T1">
<bold>Table&#xa0;1</bold>
</xref> the tested models are summarized.</p>
<table-wrap id="T1" position="float">
<label>Table&#xa0;1</label>
<caption>
<p>Overview of the tested models.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="top" align="left">Models</th>
<th valign="top" align="left">Architecture</th>
<th valign="top" align="left">Embedding</th>
<th valign="top" align="left">Year</th>
<th valign="top" align="left">Trainable parameters</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">TITAN Weber et&#xa0;al. (<xref ref-type="bibr" rid="B11">11</xref>)</td>
<td valign="top" align="left">Bimodal attention networks, pretrained with BindingDB.</td>
<td valign="top" align="left">Encoded peptides with SMILES, TCRs with BLOSUM62 and padded to the same length.</td>
<td valign="top" align="left">2021</td>
<td valign="top" align="left">15,506,099</td>
</tr>
<tr>
<td valign="top" align="left">DLpTCR Xu et&#xa0;al. (<xref ref-type="bibr" rid="B14">14</xref>)</td>
<td valign="top" align="left">Ensemble network out of: FCN, CNN and ResNet</td>
<td valign="top" align="left">depending on subNN: PCA on 500 amino acid indices, one-hot encoded or 20 different physicochemical properties (PCP)</td>
<td valign="top" align="left">2021</td>
<td valign="top" align="left">10,454,869</td>
</tr>
<tr>
<td valign="top" align="left">ERGO Springer et&#xa0;al. (<xref ref-type="bibr" rid="B13">13</xref>)</td>
<td valign="top" align="left">Autoencoder or LSTM <inline-formula>
<mml:math display="inline" id="im24">
<mml:mo>&#x2192;</mml:mo>
</mml:math>
</inline-formula>Multilayer perceptron (MLP)</td>
<td valign="top" align="left">One-hot encoded and embedded with either LSTM or Autoencoder</td>
<td valign="top" align="left">2020</td>
<td valign="top" align="left">580,299 (Autoencoder) or 6,557,421 (LSTM)</td>
</tr>
<tr>
<td valign="top" align="left">NetTCR2.0 Montemurro et&#xa0;al. (<xref ref-type="bibr" rid="B12">12</xref>)</td>
<td valign="top" align="left">CNN</td>
<td valign="top" align="left">Both sequences were encoded using the BLOSUM50 matrix</td>
<td valign="top" align="left">2021</td>
<td valign="top" align="left">21,345</td>
</tr>
<tr>
<td valign="top" align="left">ImRex Moris et&#xa0;al. (<xref ref-type="bibr" rid="B15">15</xref>)</td>
<td valign="top" align="left">CNN, L2 regularization penalty of 0.01. Dual-input CNN architecture</td>
<td valign="top" align="left">PCP interaction map between CDR3 and peptide sequence with 20x11x4 dimensions.</td>
<td valign="top" align="left">2020</td>
<td valign="top" align="left">248,257</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The number of trainable parameters (<xref ref-type="table" rid="T1">
<bold>Table&#xa0;1</bold>
</xref>) of a model indicates the model complexity. We do not see a correlation between the number of trainable parameters and the performance of a model. We used a 1:1 ratio of positive:negative binding examples for both training and testing sets. The ROC-AUC scores of each model on <inline-formula>
<mml:math display="inline" id="im25">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, except for ERGO with the long short-term memory (LSTM) embedding, were above 0.5 (<xref ref-type="fig" rid="f4">
<bold>Figure&#xa0;4</bold>
</xref>). Therefore, almost all models predicted the outcome of a given TCR-pMHC pair better than random guessing. With the exception of ERGO with the LSTM embedding, no ROC-AUC score stood out, and the performances of those models were within <inline-formula>
<mml:math display="inline" id="im26">
<mml:mrow>
<mml:mn>0.66</mml:mn>
<mml:mo>&#xb1;</mml:mo>
<mml:mn>0.04</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> ROC-AUC. A comparison of the ROC-AUC values reported in the original publications with our measurements on a distinct dataset is given in <xref ref-type="supplementary-material" rid="SM1">
<bold>Table S1</bold>
</xref>.</p>
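The ROC-AUC used for this comparison has a simple probabilistic reading: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties counted half), so 0.5 corresponds to random guessing on the 1:1 test set. A minimal self-contained sketch with synthetic labels and scores:

```python
# ROC-AUC as the pairwise ranking probability: fraction of (positive,
# negative) pairs in which the positive example scores higher.
def roc_auc(y_true, y_score):
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true  = [1, 1, 1, 0, 0, 0]               # 1:1 positive:negative labels
y_score = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]   # synthetic binding probabilities

print(round(roc_auc(y_true, y_score), 3))
```

In this toy example 8 of the 9 positive/negative pairs are ranked correctly, giving an AUC of 8/9; in practice the same quantity is computed with a library routine such as scikit-learn's `roc_auc_score`.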
<fig id="f4" position="float">
<label>Figure&#xa0;4</label>
<caption>
<p>ROC of models predicting binding of TCR-pMHC trained and tested on <inline-formula>
<mml:math display="inline" id="im27">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> using uniform splitting. The dashed red line indicates the performance of random guessing. The ROC curve for DLpTCR looks &#x201c;linear&#x201d; because DLpTCR outputs a binary label rather than a continuous probability.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fimmu-14-1128326-g004.tif"/>
</fig>
</sec>
<sec id="s2_3">
<label>2.3</label>
<title>Model performance on uniform or strict split data indicates that current models do not perform well on unseen peptides</title>
<p>A generalized prediction model will find interaction patterns that are transferable to new TCR-pMHC examples. We used two training and testing splitting methods (see subsection 4.1.2) to generate uniform and strict splitting datasets. The main difference between uniform splitting and strict splitting is whether the peptides in the testing set appear in the training set. In uniform splitting, the peptides in the testing set also exist in the training set (seen peptides), whereas in strict splitting there is no peptide overlap between training and testing set (unseen peptides). A generalized TCR-pMHC binding prediction model should be able to predict binding to unseen peptides.</p>
<p>The performance of all models under the two splitting methods is compared in <xref ref-type="fig" rid="f5">
<bold>Figure&#xa0;5</bold>
</xref>. DLpTCR returns a binary prediction, which explains why the curves for DLpTCR in <xref ref-type="fig" rid="f5">
<bold>Figure&#xa0;5</bold>
</xref> connect only three points. Every other model outputs a value between zero and one, which serves as the probability that the given TCR-pMHC pair binds. A continuous probability yields more points on the ROC and PR curves as the threshold separating binding from non-binding predictions is varied. Model performance collapsed under strict splitting (compare <xref ref-type="fig" rid="f5"><bold>Figure 5A</bold></xref> with <xref ref-type="fig" rid="f5"><bold>Figure 5C</bold></xref>, or <xref ref-type="fig" rid="f5"><bold>Figure 5B</bold></xref> with <xref ref-type="fig" rid="f5"><bold>Figure 5D</bold></xref>, for each model), indicating that current models do not generalize to unseen peptides.</p>
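The threshold-sweep behaviour can be made concrete with a small sketch (synthetic scores, not the evaluated models' outputs): a binary predictor yields a single operating point besides (0,0) and (1,1), while a continuous score yields one point per distinct threshold.

```python
# Enumerate ROC points by sweeping the decision threshold over the
# distinct scores; a binary output produces only three points in total.
def roc_points(y_true, y_score):
    thresholds = sorted(set(y_score), reverse=True)
    pts = [(0.0, 0.0)]
    P = sum(y_true)
    N = len(y_true) - P
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= t)
        pts.append((fp / N, tp / P))
    if pts[-1] != (1.0, 1.0):
        pts.append((1.0, 1.0))
    return pts

y_true = [1, 1, 0, 0]
continuous = roc_points(y_true, [0.9, 0.6, 0.7, 0.2])  # probability outputs
binary     = roc_points(y_true, [1, 1, 1, 0])          # binary outputs
print(len(continuous), len(binary))
```

The continuous scorer produces five ROC points here, the binary one exactly three, which is why the DLpTCR curves appear as straight segments.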
<fig id="f5" position="float">
<label>Figure&#xa0;5</label>
<caption>
<p>Model performance on d<sub>base</sub> using different splitting methods. <bold>(A)</bold> ROC curve and <bold>(B)</bold> PR curve for models using uniform splitting. <bold>(C)</bold> ROC curve and <bold>(D)</bold> PR curve for models using strict splitting. The dashed red lines indicate the performance of random guessing. The ROC and PR curves for DLpTCR look &#x201c;linear&#x201d; because DLpTCR outputs a binary label rather than a continuous probability.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fimmu-14-1128326-g005.tif"/>
</fig>
</sec>
<sec id="s2_4">
<label>2.4</label>
<title>Collapsing performance on d<sub>bal</sub> suggests that 5-10 examples per peptide is not sufficient for training state-of-the-art DL models</title>
<p>After comparing the results for <inline-formula>
<mml:math display="inline" id="im28">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> using uniform/strict splitting, we realized that current models are not able to predict binding for unseen peptides. Since the results for uniform splitting showed moderate prediction ability, we suspected that these models learned binding patterns only for the high-frequency peptides. To elucidate this, we prepared a new balanced dataset (<inline-formula>
<mml:math display="inline" id="im29">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>) to test this hypothesis. Based on <inline-formula>
<mml:math display="inline" id="im30">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, we removed peptides with fewer than five examples and afterwards downsampled (see subsection 4.1.4) each remaining peptide, so that each peptide in <inline-formula>
<mml:math display="inline" id="im31">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> contains only 5-10 examples. This resulted in <inline-formula>
<mml:math display="inline" id="im32">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> with a total of 2812 examples, across 1397 unique CDR3 TRB sequences and 174 unique peptides. Training the models on <inline-formula>
<mml:math display="inline" id="im33">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, we saw a complete collapse of performance for the models (<xref ref-type="fig" rid="f6">
<bold>Figure&#xa0;6</bold>
</xref>), similar to <inline-formula>
<mml:math display="inline" id="im34">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> under strict splitting. This indicates either that 5-10 examples per peptide are not sufficient for a predictive model to learn general TCR-pMHC binding rules, or that a total of 2812 examples is not enough to train and test the models. In the following subsection we investigate how data imbalance impacts model performance.</p>
<fig id="f6" position="float">
<label>Figure&#xa0;6</label>
<caption>
<p>Model performance on different datasets using uniform splitting. ROC curve for <bold>(A)</bold> NetTCR-2.0, <bold>(B)</bold> ImRex, <bold>(C)</bold> TITAN, <bold>(D)</bold> DLpTCR (curves appear &#x201c;linear&#x201d; because DLpTCR outputs a binary label rather than a continuous probability), <bold>(E)</bold> ERGO Autoencoder model and <bold>(F)</bold> ERGO LSTM model using <inline-formula>
<mml:math display="inline" id="im35">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula>
<mml:math display="inline" id="im36">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula>
<mml:math display="inline" id="im37">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>. The dashed red diagonal line indicates performance for random guessing.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fimmu-14-1128326-g006.tif"/>
</fig>
</sec>
<sec id="s2_5">
<label>2.5</label>
<title>Model performance comparison on d<sub>base</sub> and d<sub>imbal</sub> indicates that &#x201c;success&#x201d; is only due to the most frequent peptide</title>
<p>The difference between <inline-formula>
<mml:math display="inline" id="im38">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula>
<mml:math display="inline" id="im39">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> lies in their size and in the imbalance of their peptide distributions. The degree of balance can be quantified with the normalized Shannon entropy,</p>
<disp-formula>
<label>(1)</label>
<mml:math display="block" id="M1">
<mml:mrow>
<mml:mi>B</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>e</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>log</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>K</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mfrac>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>K</mml:mi>
</mml:munderover>
<mml:mrow>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mi>log</mml:mi>
</mml:mrow>
</mml:mstyle>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>with <inline-formula>
<mml:math display="inline" id="im40">
<mml:mi>K</mml:mi>
</mml:math>
</inline-formula> as the number of unique peptides and <inline-formula>
<mml:math display="inline" id="im41">
<mml:mrow>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> as the relative frequency (fraction of occurrences) of peptide <inline-formula>
<mml:math display="inline" id="im42">
<mml:mi>i</mml:mi>
</mml:math>
</inline-formula>. We constructed <inline-formula>
<mml:math display="inline" id="im43">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> to investigate whether peptide imbalance or data size has the larger impact on performance. This dataset includes all available data for the most frequent peptide (mfp, &#x201c;NLVPMVATV&#x201d;), while the remaining peptides (non-mfp) were filtered and downsampled. In total, <inline-formula>
<mml:math display="inline" id="im44">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> has 12268 entries, with 7678 unique CDR3 TRB sequences and 174 unique peptides. This dataset has a higher peptide imbalance than <inline-formula>
<mml:math display="inline" id="im45">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and a smaller size (see <xref ref-type="supplementary-material" rid="SM1">
<bold>Table S2</bold>
</xref>).</p>
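<p>For illustration, Equation 1 can be computed directly from the peptide column of a dataset. The sketch below is our own illustration, not code from this study; it returns 1 for a perfectly balanced peptide distribution and values near 0 when a single peptide dominates:</p>

```python
import math
from collections import Counter


def balance(peptides):
    """Normalized Shannon entropy of the peptide distribution (Equation 1).

    K is the number of unique peptides and c_i the fraction of examples
    belonging to peptide i. Returns 1.0 for a perfectly balanced dataset
    and values approaching 0.0 for a heavily imbalanced one.
    """
    counts = Counter(peptides)
    k = len(counts)
    if k < 2:
        return 1.0  # a single peptide is trivially "balanced"
    n = len(peptides)
    return -sum((c / n) * math.log(c / n) for c in counts.values()) / math.log(k)
```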
<p>We would expect <inline-formula>
<mml:math display="inline" id="im46">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, which contains more input data, to outperform <inline-formula>
<mml:math display="inline" id="im47">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> if the model can learn a general binding rule. However, models trained on <inline-formula>
<mml:math display="inline" id="im48">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> had a prediction power comparable to models trained on <inline-formula>
<mml:math display="inline" id="im49">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, and even slightly better than models trained on <inline-formula>
<mml:math display="inline" id="im50">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> (<xref ref-type="fig" rid="f6">
<bold>Figure&#xa0;6</bold>
</xref>). ERGO with the LSTM embedding, which performed no better than random guessing when trained on <inline-formula>
<mml:math display="inline" id="im51">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, even showed an increase in prediction performance when trained on <inline-formula>
<mml:math display="inline" id="im52">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>. We therefore conclude that peptide imbalance impacts performance more than the size of the dataset. This result also suggests that all models learned the binding rule primarily for the most frequent peptides.</p>
</sec>
<sec id="s2_6">
<label>2.6</label>
<title>Performance increases with peptide imbalance</title>
<p>Next, we investigated whether the binding rules learned for the most frequent peptide in <inline-formula>
<mml:math display="inline" id="im53">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> can be transferred to predict binding for less frequent peptides. Overall, the ROC-AUC scores of the models trained on <inline-formula>
<mml:math display="inline" id="im54">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> were significantly higher than those of models trained on <inline-formula>
<mml:math display="inline" id="im55">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> (<xref ref-type="fig" rid="f6">
<bold>Figure&#xa0;6</bold>
</xref>). If models trained on <inline-formula>
<mml:math display="inline" id="im56">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> also showed better performance on the non-mfp examples compared to models trained on <inline-formula>
<mml:math display="inline" id="im57">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, this would indicate that learning the mfp increases the likelihood of generalization. In <xref ref-type="fig" rid="f7">
<bold>Figure&#xa0;7</bold>
</xref>, taking NetTCR-2.0 as an example, we compared the accuracy on non-mfp data between models trained on the two datasets; no change in performance was observed. The performance of all models showed a strong data dependency (<xref ref-type="supplementary-material" rid="SM1">
<bold>Figure S2</bold>
</xref>). In retrospect, the success of previously published models could thus be attributed to the peptide imbalance within each dataset.</p>
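<p>The per-peptide accuracy comparison underlying Figure 7 can be sketched as follows. The column names (peptide, label, pred) are illustrative assumptions, not the authors' actual evaluation code:</p>

```python
import pandas as pd


def accuracy_by_peptide(df):
    """Per-peptide accuracy versus peptide occurrence, as plotted in Figure 7.

    Expects a DataFrame with columns: peptide, label (0/1 ground truth)
    and pred (0/1 model prediction). Column names are assumptions.
    """
    grouped = df.groupby("peptide")
    return pd.DataFrame({
        "occurrence": grouped.size(),
        "accuracy": grouped.apply(lambda g: float((g["label"] == g["pred"]).mean())),
    }).sort_values("occurrence", ascending=False)
```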
<fig id="f7" position="float">
<label>Figure&#xa0;7</label>
<caption>
<p>Exemplary comparison of NetTCR-2.0 performance trained on <inline-formula>
<mml:math display="inline" id="im58">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula>
<mml:math display="inline" id="im59">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>. Data points indicate the accuracy of models (trained on the different datasets) tested on unique peptides with different occurrence counts. mfp: most frequent peptide, with 20 examples in <inline-formula>
<mml:math display="inline" id="im60">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and 9476 examples in <inline-formula>
<mml:math display="inline" id="im61">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fimmu-14-1128326-g007.tif"/>
</fig>
</sec>
</sec>
<sec id="s3" sec-type="discussion">
<label>3</label>
<title>Discussion</title>
<p>In this work, we compared different state-of-the-art models for the prediction of TCR-pMHC binding. We chose to use these models as they were supplied, without optimizing them for our datasets. This might favor some models over others, but the aim of this study was to make a consistent comparison across all available data, rather than to compare the peak performance of these models. The data preprocessing and filtering criteria were based on the intersecting requirements of all models. In this way we fairly tested the models for their generalization ability using the same input data. By using different train/test splitting methods, we were able to contrast the performance of the models between unseen and seen peptides. Our findings clearly show that all models, regardless of complexity, fail to predict on unseen peptide examples. This is consistent with the findings of Grazioli et&#xa0;al. (<xref ref-type="bibr" rid="B16">16</xref>), who also contrasted the performance between uniform and strict splitting. They showed that both ERGO II and NetTCR-2.0 perform worse under strict splitting. Here, we have also tested NetTCR-2.0 and a predecessor model of ERGO II (ERGO), but additionally included TITAN, DLpTCR and ImRex to cover all the current state-of-the-art models (<xref ref-type="table" rid="T1">
<bold>Table&#xa0;1</bold>
</xref>) for TCR-pMHC binding prediction. We showed that the performance stays the same across models of different complexity. Notably, Grazioli et&#xa0;al. suggested that TITAN is a potential candidate for generalized prediction. The authors of TITAN, Weber et&#xa0;al. (<xref ref-type="bibr" rid="B11">11</xref>), applied strict splitting themselves and measured a performance of up to 0.62 ROC-AUC. However, we could not replicate this result on our dataset. TITAN did not perform significantly better than the other models tested in our study, despite using the most advanced model architecture. A possible explanation for why Weber et&#xa0;al. measured better performance could be that they only used data from VDJdb (peptides of various origins) and ImmuneCODE (exclusively COVID data). Merging those two datasets results in mostly COVID-associated peptides (105/192 [54.69%], assuming VDJdb does not contain much COVID data). Even if the peptides in the testing and training sets are disjoint under strict splitting, there may be similar peptides across the training and testing sets due to their common COVID origin. This may have contributed to the better reported performance. If this hypothesis is true, given enough training examples, it might be possible for TITAN and other models to predict not only peptide-specific binding but also origin-specific binding. Based on currently available data, models work better for epitope-specific predictions than for general predictions.</p>
<p>We also investigated the impact of peptide imbalance on model performance. To the best of our knowledge, these models have not previously been trained and tested on comparable data scenarios (<inline-formula>
<mml:math display="inline" id="im62">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula>
<mml:math display="inline" id="im63">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula>
<mml:math display="inline" id="im64">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>). The data scenarios vary in size and peptide distribution. We suggest that peptide imbalance contributes more to better model performance than data size, a finding also made in antibody-antigen binding prediction by Robert et&#xa0;al. (<xref ref-type="bibr" rid="B17">17</xref>). It will be interesting to see whether the models perform well purely because of peptide frequency, or whether other factors such as biological or physicochemical properties also influence performance. This can be explored by clustering peptides based on physicochemical features using different approaches, from HMMs (Rabiner (<xref ref-type="bibr" rid="B18">18</xref>)) to KNN (Taunk et&#xa0;al. (<xref ref-type="bibr" rid="B19">19</xref>)), and checking the resulting performance. With various clustering methods to choose from and an abundant set of parameters, we plan to continue this line of research in the future.</p>
<p>This is consistent with the consensus that currently available data are not sufficient, an issue raised by every study of these models so far: Weber et&#xa0;al. (<xref ref-type="bibr" rid="B11">11</xref>); Montemurro et&#xa0;al. (<xref ref-type="bibr" rid="B12">12</xref>); Springer et&#xa0;al. (<xref ref-type="bibr" rid="B13">13</xref>); Xu et&#xa0;al. (<xref ref-type="bibr" rid="B14">14</xref>); Moris et&#xa0;al. (<xref ref-type="bibr" rid="B15">15</xref>). The way we demonstrate data dependence in this study does not take into account the effects of sequence features or similarities, but this actually strengthens the findings: we have shown in the most straightforward and transparent way that, down to the smallest granularity (peptide as a categorical variable), data imbalance has a major impact on performance. Our results support the idea that a generalized predictive model requires data that is not only large but also massively diverse, in order to uncover a large range of potential pMHC-TCR binding rules. One suggestion would be to specifically increase the screening of scarce peptides to further increase dataset diversity. TCR sequencing on the single-cell level is a rapidly progressing field, so affordable screening technology to do so with high fidelity should be available soon.</p>
<p>The hypothesis that models such as TITAN might be able to predict unseen but similar peptides, or peptides from the same origin, is a very interesting research question for future work. If this hypothesis holds, a global effort will be needed to experimentally screen a set of peptides covering a diverse peptide pool, and to use the generated data to construct a generalizable prediction model.</p>
<p>A limitation of this study is that our datasets only comprised TCRs from CD8<sup>+</sup> T cells pairing with peptides presented by the HLA-A*02 allele, without considering other MHC alleles; however, it was important to exclude additional variables such as HLA isotypes at this point. Moreover, we only compared DL models for predicting binding between random TCRs and random pMHC, not epitope-specific models (i.e. the prediction of whether random TCRs bind to a specific peptide). Meysman et&#xa0;al. (<xref ref-type="bibr" rid="B20">20</xref>) have superficially compared different approaches to TCR-pMHC binding prediction, and also raised the importance of a truly independent benchmark. They revealed that additional information such as CDR1/2 improves prediction, but they did not investigate the role that imbalance, size or overtraining might play in model performance when using those additional features within the dataset.</p>
</sec>
<sec id="s4">
<label>4</label>
<title>Methods</title>
<sec id="s4_1">
<label>4.1</label>
<title>Data preprocessing</title>
<sec id="s4_1_1">
<label>4.1.1</label>
<title>Data merging and preprocessing</title>
<p>We downloaded the data from six different resources and unified the column names (CDR3 TRA, CDR3 TRB, peptide, MHC, etc.). We only kept entries that have a peptide and at least one of a CDR3 TRA or TRB sequence. Only TCR and peptide sequences composed of the 20 standard amino acid residues were kept. After this quality control, all data from the different resources were merged into one dataset (the tpp dataset), and duplicates in the merged dataset were removed. The preprocessing of the merged dataset and the prefiltering for the different datasets are shown in <xref ref-type="fig" rid="f1">
<bold>Figure&#xa0;1</bold>
</xref>.</p>
</sec>
<sec id="s4_1_2">
<label>4.1.2</label>
<title>Splitting</title>
<p>We explored two different splitting methods (<xref ref-type="supplementary-material" rid="SM1">
<bold>Figure S3</bold>
</xref>). The first method preserves the peptide distribution in each part (uniform splitting). The second method distributes peptides across parts so that no peptide appears in two different parts (strict splitting). The strict splitting used here is inspired by the splitting method of the TITAN model (<xref ref-type="bibr" rid="B11">11</xref>). Strict splitting was only used for <inline-formula>
<mml:math display="inline" id="im65">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> (<xref ref-type="fig" rid="f1">
<bold>Figure&#xa0;1</bold>
</xref>). <inline-formula>
<mml:math display="inline" id="im66">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula>
<mml:math display="inline" id="im67">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>u</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>f</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> vary in size (<xref ref-type="supplementary-material" rid="SM1">
<bold>Table S2</bold>
</xref>), because strict splitting includes peptides with fewer than five examples. In subsection 2.1 we showed a data imbalance across peptides. For the 5-fold cross-validation in strict splitting, we ensured that no single peptide accounted for more than half of a fold&#x2019;s entries; if a peptide had more entries, it was downsampled to half the fold size. Uniform splitting excludes peptides with fewer than five examples, because it requires at least one example for each peptide in all five folds. <xref ref-type="supplementary-material" rid="SM1">
<bold>Table S2</bold>
</xref> shows that <inline-formula>
<mml:math display="inline" id="im68">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> has more unique peptides but fewer total entries compared to <inline-formula>
<mml:math display="inline" id="im69">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>u</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>f</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>. In <inline-formula>
<mml:math display="inline" id="im70">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, we downsampled many positive examples (for high-frequency peptides) in order to generate negative examples within each fold without an external reference TCR repertoire; this reduces the total number of examples in the dataset. In <inline-formula>
<mml:math display="inline" id="im71">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>u</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>f</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, some examples for less frequent peptides were filtered out to ensure at least one example in each fold.</p>
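<p>Using scikit-learn&#x2019;s cross-validators, the two splitting schemes can be approximated as below. This is a sketch of the idea under stated assumptions (the per-fold downsampling and negative-example handling described above are omitted), not the authors&#x2019; pipeline:</p>

```python
from sklearn.model_selection import GroupKFold, StratifiedKFold


def strict_folds(peptides, n_splits=5):
    """Strict splitting: folds are disjoint in peptides (groups = peptide),
    so test peptides are never seen during training."""
    return GroupKFold(n_splits=n_splits).split(peptides, groups=peptides)


def uniform_folds(peptides, n_splits=5):
    """Uniform splitting: stratifying on the peptide preserves the peptide
    distribution in each fold, so every test peptide was also seen during
    training. Peptides with fewer examples than folds must be removed first."""
    return StratifiedKFold(n_splits=n_splits, shuffle=True,
                           random_state=0).split(peptides, peptides)
```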
</sec>
<sec id="s4_1_3">
<label>4.1.3</label>
<title>Negative example generation</title>
<p>The collected and merged dataset contains only positive binding examples, while training neural network models for binding prediction requires both positive and negative examples. Negative examples were created by rearranging TCR-pMHC pairs. Let <inline-formula>
<mml:math display="inline" id="im72">
<mml:mrow>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mo>,</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula>
<mml:math display="inline" id="im73">
<mml:mrow>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mo>,</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> be T cells which bind to peptide <inline-formula>
<mml:math display="inline" id="im74">
<mml:mrow>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>a</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula>
<mml:math display="inline" id="im75">
<mml:mrow>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mo>,</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula>
<mml:math display="inline" id="im76">
<mml:mrow>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mo>,</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> bind to <inline-formula>
<mml:math display="inline" id="im77">
<mml:mrow>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>b</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> or <inline-formula>
<mml:math display="inline" id="im78">
<mml:mrow>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>c</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> respectively. By pairing <inline-formula>
<mml:math display="inline" id="im79">
<mml:mrow>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mo>,</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula>
<mml:math display="inline" id="im80">
<mml:mrow>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mo>,</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> with <inline-formula>
<mml:math display="inline" id="im81">
<mml:mrow>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>b</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula>
<mml:math display="inline" id="im82">
<mml:mrow>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>c</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> we created negative pairing examples. Statistically, it is unlikely that such a newly generated TCR-pMHC pair binds. This method of generating negative examples agrees with most models&#x2019; original work. For each positive example, one negative example was created. <inline-formula>
<mml:math display="inline" id="im83">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula>
<mml:math display="inline" id="im84">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula>
<mml:math display="inline" id="im85">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> therefore have a positive-to-negative ratio of <inline-formula>
<mml:math display="inline" id="im86">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>:</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>. In case one peptide needs more <inline-formula>
<mml:math display="inline" id="im87">
<mml:mrow>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> (i.e. <inline-formula>
<mml:math display="inline" id="im88">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>) to generate the same number of negative examples, <inline-formula>
<mml:math display="inline" id="im89">
<mml:mrow>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> from previous downsampling served as additional reference <inline-formula>
<mml:math display="inline" id="im90">
<mml:mrow>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
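<p>A simplified sketch of this reshuffling, pairing each TCR with a peptide it is not known to bind, is shown below. The column names and the rejection-sampling loop are our own illustrative choices, not the authors&#x2019; code:</p>

```python
import random

import pandas as pd


def add_negatives(pos, seed=0):
    """Create one negative example per positive (CDR3b, peptide) pair by
    re-pairing each TCR with a peptide it is not recorded to bind.

    Assumes columns cdr3b and peptide; returns positives (label=1) plus
    shuffled negatives (label=0), i.e. a 1:1 positive-to-negative ratio.
    """
    rng = random.Random(seed)
    known = set(zip(pos["cdr3b"], pos["peptide"]))
    peptides = list(pos["peptide"].unique())
    rows = []
    for tcr in pos["cdr3b"]:
        pep = rng.choice(peptides)
        while (tcr, pep) in known:  # reject known binders; assumes >1 peptide
            pep = rng.choice(peptides)
        rows.append((tcr, pep, 0))
    neg = pd.DataFrame(rows, columns=["cdr3b", "peptide", "label"])
    return pd.concat([pos.assign(label=1), neg], ignore_index=True)
```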
</sec>
<sec id="s4_1_4">
<label>4.1.4</label>
<title>Downsampling</title>
<p>Peptides are not uniformly distributed throughout the tpp dataset. Some peptides occur only a few times (low-frequency peptides), while others occur hundreds of times (high-frequency peptides). For <inline-formula>
<mml:math display="inline" id="im91">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula>
<mml:math display="inline" id="im92">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> we downsampled the high-frequency peptides, keeping only 10 random examples for each peptide.</p>
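<p>This downsampling step can be sketched with pandas; the column name peptide and the fixed random seed are illustrative assumptions:</p>

```python
import pandas as pd


def downsample_peptides(df, max_n=10, seed=0):
    """Keep at most `max_n` randomly chosen examples per peptide,
    leaving low-frequency peptides untouched."""
    return (df.groupby("peptide", group_keys=False)
              .apply(lambda g: g.sample(min(len(g), max_n), random_state=seed))
              .reset_index(drop=True))
```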
</sec>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Model performance measurement</title>
<p>We downloaded the source code for all models from their respective GitHub repositories. We evaluated all models with 5-fold cross-validation, training them on our datasets with the default parameters. Performance is measured by the area under the receiver operating characteristic curve (ROC-AUC) Davis and Goadrich (<xref ref-type="bibr" rid="B21">21</xref>), as well as the area under the precision-recall curve (PR-AUC) Saito and Rehmsmeier (<xref ref-type="bibr" rid="B22">22</xref>). The model with the best ROC-AUC was saved and evaluated on the testing set.</p>
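<p>Both metrics are available in scikit-learn; average precision is used below as the standard estimator of the area under the PR curve. A minimal sketch, not the authors&#x2019; evaluation script:</p>

```python
from sklearn.metrics import average_precision_score, roc_auc_score


def evaluate(y_true, y_score):
    """ROC-AUC and PR-AUC for one model on one test fold. For models with
    binary output (e.g. DLpTCR) the score takes only two values, so the
    corresponding curves degenerate and appear "linear"."""
    return {"roc_auc": roc_auc_score(y_true, y_score),
            "pr_auc": average_precision_score(y_true, y_score)}
```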
</sec>
</sec>
<sec id="s5" sec-type="data-availability">
<title>Data availability statement</title>
<p>The original contributions presented in the study are included in the article/<xref ref-type="supplementary-material" rid="SM1">
<bold>Supplementary Material</bold>
</xref>. Further inquiries can be directed to the corresponding authors.</p>
</sec>
<sec id="s6" sec-type="author-contributions">
<title>Author contributions</title>
<p>LD and CL contributed equally to this work and share first authorship. IP and SB contributed to the conception and design of the study. LD and CL collected and preprocessed the data. SA supported LD and CL in performing the comparison of the existing prediction tools. LD and CL interpreted the comparison results. YZ supported the discussion of this study. LD and CL wrote the draft of the manuscript. All authors contributed to the article and approved the submitted version.</p>
</sec>
</body>
<back>
<sec id="s7" sec-type="funding-information">
<title>Funding</title>
<p>This work was funded by grants from the Deutsche Forschungsgemeinschaft (DFG): CRC1192 (project number 264599542) and PR727/14-1 (project number 497674564). IP is funded by DFG FOR2799. SB, YZ and CL are funded by SFB 1192 projects B8 and C3, FOR 5068 P9, as well as by the 3R reduction of animal testing initiative of the UKE.</p>
</sec>
<sec id="s8" sec-type="COI-statement">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec id="s9" sec-type="disclaimer">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<sec id="s10" sec-type="supplementary-material">
<title>Supplementary material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fimmu.2023.1128326/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fimmu.2023.1128326/full#supplementary-material</ext-link>
</p>
<supplementary-material xlink:href="DataSheet_1.pdf" id="SM1" mimetype="application/pdf"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<label>1</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Uziela</surname> <given-names>K</given-names>
</name>
<name>
<surname>Men&#xe9;ndez Hurtado</surname> <given-names>D</given-names>
</name>
<name>
<surname>Shu</surname> <given-names>N</given-names>
</name>
<name>
<surname>Wallner</surname> <given-names>B</given-names>
</name>
<name>
<surname>Elofsson</surname> <given-names>A</given-names>
</name>
</person-group>. <article-title>ProQ3D: improved model quality assessments using deep learning</article-title>. <source>Bioinformatics</source> (<year>2017</year>) <volume>33</volume>:<page-range>1578&#x2013;80</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1093/bioinformatics/btw819</pub-id>
</citation>
</ref>
<ref id="B2">
<label>2</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hennecke</surname> <given-names>J</given-names>
</name>
<name>
<surname>Wiley</surname> <given-names>DC</given-names>
</name>
</person-group>. <article-title>T cell receptor&#x2013;MHC interactions up close</article-title>. <source>Cell</source> (<year>2001</year>) <volume>104</volume>:<fpage>1</fpage>&#x2013;<lpage>4</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1016/S0092-8674(01)00185-4</pub-id>
</citation>
</ref>
<ref id="B3">
<label>3</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Goncharov</surname> <given-names>M</given-names>
</name>
<name>
<surname>Bagaev</surname> <given-names>D</given-names>
</name>
<name>
<surname>Shcherbinin</surname> <given-names>D</given-names>
</name>
<name>
<surname>Zvyagin</surname> <given-names>I</given-names>
</name>
<name>
<surname>Bolotin</surname> <given-names>D</given-names>
</name>
<name>
<surname>Thomas</surname> <given-names>PG</given-names>
</name>
<etal/>
</person-group>. <article-title>VDJdb in the pandemic era: a compendium of T cell receptors specific for SARS-CoV-2</article-title>. <source>Nat Methods</source> (<year>2022</year>) <volume>19</volume>(<issue>9</issue>):<page-range>1017&#x2013;9</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1038/s41592-022-01578-0</pub-id>
</citation>
</ref>
<ref id="B4">
<label>4</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vita</surname> <given-names>R</given-names>
</name>
<name>
<surname>Mahajan</surname> <given-names>S</given-names>
</name>
<name>
<surname>Overton</surname> <given-names>JA</given-names>
</name>
<name>
<surname>Dhanda</surname> <given-names>SK</given-names>
</name>
<name>
<surname>Martini</surname> <given-names>S</given-names>
</name>
<name>
<surname>Cantrell</surname> <given-names>JR</given-names>
</name>
<etal/>
</person-group>. <article-title>The immune epitope database (IEDB): 2018 update</article-title>. <source>Nucleic Acids Res</source> (<year>2018</year>) <volume>47</volume>:<page-range>D339&#x2013;43</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1093/nar/gky1006</pub-id>
</citation>
</ref>
<ref id="B5">
<label>5</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tickotsky</surname> <given-names>N</given-names>
</name>
<name>
<surname>Sagiv</surname> <given-names>T</given-names>
</name>
<name>
<surname>Prilusky</surname> <given-names>J</given-names>
</name>
<name>
<surname>Shifrut</surname> <given-names>E</given-names>
</name>
<name>
<surname>Friedman</surname> <given-names>N</given-names>
</name>
</person-group>. <article-title>McPAS-TCR: a manually curated catalogue of pathology-associated T cell receptor sequences</article-title>. <source>Bioinformatics</source> (<year>2017</year>) <volume>33</volume>:<page-range>2924&#x2013;9</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1093/bioinformatics/btx286</pub-id>
</citation>
</ref>
<ref id="B6">
<label>6</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Nolan</surname> <given-names>S</given-names>
</name>
<name>
<surname>Vignali</surname> <given-names>M</given-names>
</name>
<name>
<surname>Klinger</surname> <given-names>M</given-names>
</name>
<name>
<surname>Dines</surname> <given-names>JN</given-names>
</name>
<name>
<surname>Kaplan</surname> <given-names>IM</given-names>
</name>
<name>
<surname>Svejnoha</surname> <given-names>E</given-names>
</name>
<etal/>
</person-group>. <article-title>A large-scale database of T-cell receptor beta (TCR&#x3b2;) sequences and binding associations from natural and synthetic exposure to SARS-CoV-2</article-title>. <source>Res Square</source> (<year>2020</year>). doi:&#xa0;<pub-id pub-id-type="doi">10.21203/rs.3.rs-51964/v1</pub-id>
</citation>
</ref>
<ref id="B7">
<label>7</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname> <given-names>W</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>L</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>K</given-names>
</name>
<name>
<surname>Wei</surname> <given-names>X</given-names>
</name>
<name>
<surname>Yang</surname> <given-names>K</given-names>
</name>
<name>
<surname>Du</surname> <given-names>W</given-names>
</name>
<etal/>
</person-group>. <article-title>PIRD: pan immune repertoire database</article-title>. <source>Bioinformatics</source> (<year>2020</year>) <volume>36</volume>:<fpage>897</fpage>&#x2013;<lpage>903</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1093/bioinformatics/btz614</pub-id>
</citation>
</ref>
<ref id="B8">
<label>8</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<collab>10x Genomics</collab>
</person-group>. <article-title>A new way of exploring immunity&#x2013;linking highly multiplexed antigen recognition to immune repertoire and phenotype</article-title>. <source>Tech Rep</source> (<year>2019</year>).</citation>
</ref>
<ref id="B9">
<label>9</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pai</surname> <given-names>JA</given-names>
</name>
<name>
<surname>Satpathy</surname> <given-names>AT</given-names>
</name>
</person-group>. <article-title>High-throughput and single-cell T cell receptor sequencing technologies</article-title>. <source>Nat Methods</source> (<year>2021</year>) <volume>18</volume>:<page-range>881&#x2013;92</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1038/s41592-021-01201-8</pub-id>
</citation>
</ref>
<ref id="B10">
<label>10</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Joglekar</surname> <given-names>AV</given-names>
</name>
<name>
<surname>Li</surname> <given-names>G</given-names>
</name>
</person-group>. <article-title>T Cell antigen discovery</article-title>. <source>Nat Methods</source> (<year>2021</year>) <volume>18</volume>:<page-range>873&#x2013;80</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1038/s41592-020-0867-z</pub-id>
</citation>
</ref>
<ref id="B11">
<label>11</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Weber</surname> <given-names>A</given-names>
</name>
<name>
<surname>Born</surname> <given-names>J</given-names>
</name>
<name>
<surname>Mart&#xed;nez</surname> <given-names>MR</given-names>
</name>
</person-group>. <article-title>TITAN: T cell receptor specificity prediction with bimodal attention networks</article-title>. <source>Bioinformatics</source> (<year>2021</year>) <volume>37</volume>(<supplement>Supplement 1</supplement>):<page-range>i237&#x2013;44</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.48550/ARXIV.2105.03323</pub-id>
</citation>
</ref>
<ref id="B12">
<label>12</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Montemurro</surname> <given-names>A</given-names>
</name>
<name>
<surname>Schuster</surname> <given-names>V</given-names>
</name>
<name>
<surname>Povlsen</surname> <given-names>HR</given-names>
</name>
<name>
<surname>Bentzen</surname> <given-names>AK</given-names>
</name>
<name>
<surname>Jurtz</surname> <given-names>V</given-names>
</name>
<name>
<surname>Chronister</surname> <given-names>WD</given-names>
</name>
<etal/>
</person-group>. <article-title>NetTCR-2.0 enables accurate prediction of TCR-peptide binding by using paired TCR&#x3b1; and &#x3b2; sequence data</article-title>. <source>Commun Biol</source> (<year>2021</year>) <volume>4</volume>:<fpage>1060</fpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1038/s42003-021-02610-3</pub-id>
</citation>
</ref>
<ref id="B13">
<label>13</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Springer</surname> <given-names>I</given-names>
</name>
<name>
<surname>Besser</surname> <given-names>H</given-names>
</name>
<name>
<surname>Tickotsky-Moskovitz</surname> <given-names>N</given-names>
</name>
<name>
<surname>Dvorkin</surname> <given-names>S</given-names>
</name>
<name>
<surname>Louzoun</surname> <given-names>Y</given-names>
</name>
</person-group>. <article-title>Prediction of specific TCR-peptide binding from large dictionaries of TCR-peptide pairs</article-title>. <source>Front Immunol</source> (<year>2020</year>) <volume>11</volume>:<elocation-id>1803</elocation-id>. doi:&#xa0;<pub-id pub-id-type="doi">10.3389/fimmu.2020.01803</pub-id>
</citation>
</ref>
<ref id="B14">
<label>14</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Xu</surname> <given-names>Z</given-names>
</name>
<name>
<surname>Luo</surname> <given-names>M</given-names>
</name>
<name>
<surname>Lin</surname> <given-names>W</given-names>
</name>
<name>
<surname>Xue</surname> <given-names>G</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>P</given-names>
</name>
<name>
<surname>Jin</surname> <given-names>X</given-names>
</name>
<etal/>
</person-group>. <article-title>DLpTCR: an ensemble deep learning framework for predicting immunogenic peptide recognized by T cell receptor</article-title>. <source>Briefings Bioinf</source> (<year>2021</year>) <volume>22</volume>(<issue>6</issue>):<page-range>bbab335</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1093/bib/bbab335</pub-id>
</citation>
</ref>
<ref id="B15">
<label>15</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Moris</surname> <given-names>P</given-names>
</name>
<name>
<surname>De Pauw</surname> <given-names>J</given-names>
</name>
<name>
<surname>Postovskaya</surname> <given-names>A</given-names>
</name>
<name>
<surname>Gielis</surname> <given-names>S</given-names>
</name>
<name>
<surname>De Neuter</surname> <given-names>N</given-names>
</name>
<name>
<surname>Bittremieux</surname> <given-names>W</given-names>
</name>
<etal/>
</person-group>. <article-title>Current challenges for unseen-epitope TCR interaction prediction and a new perspective derived from image classification</article-title>. <source>Briefings Bioinf</source> (<year>2020</year>) <volume>22</volume>:<fpage>bbaa318</fpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1093/bib/bbaa318</pub-id>
</citation>
</ref>
<ref id="B16">
<label>16</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Grazioli</surname> <given-names>F</given-names>
</name>
<name>
<surname>M&#xf6;sch</surname> <given-names>A</given-names>
</name>
<name>
<surname>Machart</surname> <given-names>P</given-names>
</name>
<name>
<surname>Li</surname> <given-names>K</given-names>
</name>
<name>
<surname>Alqassem</surname> <given-names>I</given-names>
</name>
<name>
<surname>O&#x2019;Donnell</surname> <given-names>T</given-names>
</name>
<etal/>
</person-group>. <article-title>On TCR binding predictors failing to generalize to unseen peptides</article-title>. <source>Front Immunol</source> (<year>2022</year>) <volume>13</volume>. doi:&#xa0;<pub-id pub-id-type="doi">10.3389/fimmu.2022.1014256</pub-id>
</citation>
</ref>
<ref id="B17">
<label>17</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Robert</surname> <given-names>PA</given-names>
</name>
<name>
<surname>Akbar</surname> <given-names>R</given-names>
</name>
<name>
<surname>Frank</surname> <given-names>R</given-names>
</name>
<name>
<surname>Pavlovi&#x107;</surname> <given-names>M</given-names>
</name>
<name>
<surname>Widrich</surname> <given-names>M</given-names>
</name>
<name>
<surname>Snapkov</surname> <given-names>I</given-names>
</name>
<etal/>
</person-group>. <article-title>Unconstrained generation of synthetic antibody&#x2013;antigen structures to guide machine learning methodology for antibody specificity prediction</article-title>. <source>Nat Comput Sci</source> (<year>2022</year>) <volume>2</volume>:<page-range>845&#x2013;65</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1038/s43588-022-00372-4</pub-id>
</citation>
</ref>
<ref id="B18">
<label>18</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rabiner</surname> <given-names>L</given-names>
</name>
</person-group>. <article-title>A tutorial on hidden Markov models and selected applications in speech recognition</article-title>. <source>Proc IEEE</source> (<year>1989</year>) <volume>77</volume>:<page-range>257&#x2013;86</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1109/5.18626</pub-id>
</citation>
</ref>
<ref id="B19">
<label>19</label>
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Taunk</surname> <given-names>K</given-names>
</name>
<name>
<surname>De</surname> <given-names>S</given-names>
</name>
<name>
<surname>Verma</surname> <given-names>S</given-names>
</name>
<name>
<surname>Swetapadma</surname> <given-names>A</given-names>
</name>
</person-group>. (<year>2019</year>). <article-title>A brief review of nearest neighbor algorithm for learning and classification</article-title>, in: <conf-name>2019 International Conference on Intelligent Computing and Control Systems (ICCS)</conf-name>, <publisher-loc>Madurai, India</publisher-loc>: <publisher-name>IEEE</publisher-name>) <page-range>1255&#x2013;60</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1109/ICCS45141.2019.9065747</pub-id>
</citation>
</ref>
<ref id="B20">
<label>20</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Meysman</surname> <given-names>P</given-names>
</name>
<name>
<surname>Barton</surname> <given-names>J</given-names>
</name>
<name>
<surname>Bravi</surname> <given-names>B</given-names>
</name>
<name>
<surname>Cohen-Lavi</surname> <given-names>L</given-names>
</name>
<name>
<surname>Karnaukhov</surname> <given-names>V</given-names>
</name>
<name>
<surname>Lilleskov</surname> <given-names>E</given-names>
</name>
<etal/>
</person-group>. <article-title>Benchmarking solutions to the T-cell receptor epitope prediction problem: IMMREP22 workshop report</article-title>. <source>ImmunoInformatics</source> (<year>2023</year>) <volume>9</volume>:<page-range>100024</page-range>. doi: <pub-id pub-id-type="doi">10.1016/j.immuno.2023.100024</pub-id>
</citation>
</ref>
<ref id="B21">
<label>21</label>
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Davis</surname> <given-names>J</given-names>
</name>
<name>
<surname>Goadrich</surname> <given-names>M</given-names>
</name>
</person-group>. <article-title>The relationship between precision-recall and ROC curves</article-title>. In: <source>Proceedings of the 23rd international conference on machine learning</source>, vol. <volume>ICML &#x2018;06</volume>. <publisher-loc>New York, NY, USA</publisher-loc>: <publisher-name>Association for Computing Machinery</publisher-name> (<year>2006</year>). p. <page-range>233&#x2013;40</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1145/1143844.1143874</pub-id>
</citation>
</ref>
<ref id="B22">
<label>22</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Saito</surname> <given-names>T</given-names>
</name>
<name>
<surname>Rehmsmeier</surname> <given-names>M</given-names>
</name>
</person-group>. <article-title>The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets</article-title>. <source>PLoS One</source> (<year>2015</year>) <volume>10</volume>:<fpage>1</fpage>&#x2013;<lpage>21</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1371/journal.pone.0118432</pub-id>
</citation>
</ref>
</ref-list>
</back>
</article>