<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Artif. Intell.</journal-id>
<journal-title>Frontiers in Artificial Intelligence</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Artif. Intell.</abbrev-journal-title>
<issn pub-type="epub">2624-8212</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/frai.2022.736791</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Artificial Intelligence</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Critical Analysis of Deconfounded Pretraining to Improve Visio-Linguistic Models</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Cornille</surname> <given-names>Nathan</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1357990/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Laenen</surname> <given-names>Katrien</given-names></name>
<uri xlink:href="http://loop.frontiersin.org/people/1699110/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Moens</surname> <given-names>Marie-Francine</given-names></name>
<uri xlink:href="http://loop.frontiersin.org/people/1697849/overview"/>
</contrib>
</contrib-group>
<aff><institution>Department of Computer Science, Language Intelligence and Information Retrieval (LIIR), KU Leuven</institution>, <addr-line>Leuven</addr-line>, <country>Belgium</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Julia Hockenmaier, University of Illinois at Urbana-Champaign, United States</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Katerina Pastra, Cognitive Systems Research Institute, Greece; William Wang, University of California, Santa Barbara, United States; Raffaella Bernardi, University of Trento, Italy; Alessandro Suglia, Heriot-Watt University, United Kingdom, in collaboration with reviewer RB</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Nathan Cornille <email>nathan.cornille&#x00040;kuleuven.be</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Machine Learning and Artificial Intelligence, a section of the journal Frontiers in Artificial Intelligence</p></fn></author-notes>
<pub-date pub-type="epub">
<day>17</day>
<month>03</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>5</volume>
<elocation-id>736791</elocation-id>
<history>
<date date-type="received">
<day>05</day>
<month>07</month>
<year>2021</year>
</date>
<date date-type="accepted">
<day>15</day>
<month>02</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2022 Cornille, Laenen and Moens.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Cornille, Laenen and Moens</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>An important problem with many current visio-linguistic models is that they often depend on spurious correlations. A typical example of a spurious correlation between two variables is one that is due to a third variable causing both (a &#x0201C;confounder&#x0201D;). Recent work has addressed this by adjusting for spurious correlations using a technique of deconfounding with automatically found confounders. We will refer to this technique as <italic>AutoDeconfounding</italic>. This article dives more deeply into <italic>AutoDeconfounding</italic>, and surfaces a number of issues with the original technique. First, we evaluate whether its implementation is actually equivalent to deconfounding. We provide an explicit explanation of the relation between <italic>AutoDeconfounding</italic> and the underlying causal model on which it implicitly operates, and show that additional assumptions are needed before the implementation of <italic>AutoDeconfounding</italic> can be equated to correct deconfounding. Inspired by this result, we perform ablation studies to verify to what extent the improvement on downstream visio-linguistic tasks reported by the works that implement <italic>AutoDeconfounding</italic> is due to <italic>AutoDeconfounding</italic>, and to what extent it is specifically due to the deconfounding aspect of <italic>AutoDeconfounding</italic>. We evaluate <italic>AutoDeconfounding</italic> in a way that isolates its effect, and no longer see the same improvement. We also show that tweaking <italic>AutoDeconfounding</italic> to be less related to deconfounding does not negatively affect performance on downstream visio-linguistic tasks. Furthermore, we create a human-labeled ground truth causality dataset for objects in a scene to empirically verify whether and how well confounders are found.
We show that some models do indeed find more confounders than a random baseline, but also that finding more confounders is not correlated with performing better on downstream visio-linguistic tasks. Finally, we summarize the current limitations of <italic>AutoDeconfounding</italic> to solve the issue of spurious correlations and provide directions for the design of novel <italic>AutoDeconfounding</italic> methods that are aimed at overcoming these limitations.</p></abstract>
<kwd-group>
<kwd>causality</kwd>
<kwd>vision</kwd>
<kwd>language</kwd>
<kwd>deep learning</kwd>
<kwd>structural causal model (SCM)</kwd>
</kwd-group>
<contract-sponsor id="cn001">H2020 European Research Council<named-content content-type="fundref-id">10.13039/100010663</named-content></contract-sponsor>
<contract-sponsor id="cn002">Fonds Wetenschappelijk Onderzoek<named-content content-type="fundref-id">10.13039/501100003130</named-content></contract-sponsor>
<counts>
<fig-count count="6"/>
<table-count count="8"/>
<equation-count count="44"/>
<ref-count count="37"/>
<page-count count="20"/>
<word-count count="15552"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Recent years have seen great progress in vision and language research. Increasingly complex models trained on very large datasets of paired images and text seem to capture a lot of the correlations that are needed to solve tasks such as Visual Question Answering or Image Retrieval.</p>
<p>A concern however is that models make many of their predictions based on so-called <italic>spurious correlations</italic>: correlations that are present in the training data, but do not generalize to accurately make predictions when confronted with real world data.</p>
<p>To address this concern, researchers increasingly study the integration of causation into models. A model that is able to solve problems <italic>in a causal way</italic> will adapt faster to other distributions and thus generalize better (Sch&#x000F6;lkopf, <xref ref-type="bibr" rid="B25">2019</xref>). A key challenge is that it is not always easy to discover causal structure in purely observational data without being able to perform interventions<xref ref-type="fn" rid="fn0001"><sup>1</sup></xref> on the data, because there are often multiple possible underlying causal models that each perfectly match the observed statistics.</p>
<p>Recently, a technique has been proposed that aims to automatically discover and use knowledge of the underlying causal structure in order to avoid learning spurious correlations. More specifically, the goal is to avoid learning correlations that are due to a common cause (or &#x0201C;confounder&#x0201D;)<xref ref-type="fn" rid="fn0002"><sup>2</sup></xref>, a process called &#x0201C;deconfounding.&#x0201D; We will further refer to this technique as <italic>AutoDeconfounding</italic>. <italic>AutoDeconfounding</italic> was first developed by Wang et al. (<xref ref-type="bibr" rid="B31">2020</xref>) in their model named VC-R-CNN, and more recently adapted to the multi-modal setting by Zhang et al. (<xref ref-type="bibr" rid="B37">2020b</xref>) in their model DeVLBERT.</p>
<p>The goal of this article is to critically examine the technique of <italic>AutoDeconfounding</italic> and theoretically and empirically investigate whether <italic>AutoDeconfounding</italic> is an effective method to avoid spurious correlations. A closer inspection of <italic>AutoDeconfounding</italic> raises a number of questions that are addressed in this article.</p>
<p>First, deconfounding implies a certain assumption on the type of causal variables and their possible values. We make the underlying causal model of <italic>AutoDeconfounding</italic> explicit, and use that model to show that additional assumptions are needed before the implementation of <italic>AutoDeconfounding</italic> can be equated to correct deconfounding.</p>
<p>Inspired by this observation, we then set out to investigate to what extent the reported improvement on downstream tasks is due to <italic>AutoDeconfounding</italic>, and to what extent it is specifically due to the deconfounding aspect of <italic>AutoDeconfounding</italic>. Focusing on the most recent article (Zhang et al., <xref ref-type="bibr" rid="B37">2020b</xref>) that implements <italic>AutoDeconfounding</italic> in a visio-linguistic context, we retrain and evaluate their model (&#x0201C;DeVLBERT-repro&#x0201D;) and their baseline (&#x0201C;ViLBERT-repro&#x0201D;) in a way that isolates the contribution of <italic>AutoDeconfounding</italic>. We compare the scores for our reproductions with the reported scores, as well as with the score of a pretrained checkpoint provided by Zhang et al. (<xref ref-type="bibr" rid="B37">2020b</xref>) (&#x0201C;DeVLBERT-CkptCopy.&#x0201D;) We also train and evaluate two newly created variations of DeVLBERT (&#x0201C;DeVLBERT-NoPrior&#x0201D; and &#x0201C;DeVLBERT-DepPrior&#x0201D;) intended to isolate the component within <italic>AutoDeconfounding</italic> that is hypothesized to be responsible for its beneficial effect. Our experiments show no noticeable improvement in performance on downstream tasks with DeVLBERT-repro compared to ViLBERT-repro. Moreover, we show that DeVLBERT-NoPrior and DeVLBERT-DepPrior perform on par with DeVLBERT-repro as well. This casts doubt both on the role of deconfounding within <italic>AutoDeconfounding</italic> and on the effectiveness of <italic>AutoDeconfounding</italic> in general.</p>
<p>Finally, we investigate how accurately models that integrate <italic>AutoDeconfounding</italic> actually discover confounders. Such an experiment is relevant, because finding confounders from purely observational data seems to be at odds with the Causal Hierarchy Theorem (Bareinboim et al., <xref ref-type="bibr" rid="B3">2020</xref>). For this purpose, we collect a human-labeled ground truth dataset of causal relations. We make two observations here. First, we find that DeVLBERT-repro and DeVLBERT-NoPrior outperform a random baseline in finding confounders, implying that <italic>some</italic> knowledge useful for identifying causes is present in the data. Second, we see no correlation between better confounder-finding and improved performance on downstream tasks: while DeVLBERT-CkptCopy and DeVLBERT-DepPrior score higher on downstream tasks, they are no better than a random baseline at finding confounders.</p>
<p>The contributions of this work are the following:</p>
<list list-type="bullet">
<list-item><p>We theoretically clarify what deconfounding in the visio-linguistic domain means, and show which additional assumptions need to be made for previous work to be equivalent to deconfounding.</p></list-item>
<list-item><p>We verify the benefit of <italic>AutoDeconfounding</italic> on downstream task performance in a way that better isolates its effect, and fail to reproduce the reported gains.</p></list-item>
<list-item><p>We collect a dataset of hand-labeled causality relations between object presence in visual scenes coming from the Conceptual Captions (Sharma et al., <xref ref-type="bibr" rid="B27">2018</xref>) dataset. This dataset can be useful for validating future approaches to resolve spurious correlations through causality.</p></list-item>
</list>
<p>The rest of this article is structured as follows. In section 2, we discuss related work. In section 3, we explain what the terms causation, Structural Causal Models (SCMs)<xref ref-type="fn" rid="fn0003"><sup>3</sup></xref> and deconfounding mean in the sense of the do-calculus of Pearl and Mackenzie (<xref ref-type="bibr" rid="B22">2018</xref>) and we explain how deconfounding could indeed theoretically improve performance on out-of-distribution downstream visio-linguistic tasks. In section 4, we show in detail how <italic>AutoDeconfounding</italic> is implemented in Zhang et al. (<xref ref-type="bibr" rid="B37">2020b</xref>) and Wang et al. (<xref ref-type="bibr" rid="B31">2020</xref>), and we explain what the various approximations they make mean in terms of assumptions on the underlying SCM. In section 5, we explain our methodology for investigating <italic>AutoDeconfounding</italic> more closely on three fronts: is its implementation equivalent to deconfounding, what explains its reported improvement on downstream visio-linguistic tasks, and are confounders found in its implementation. We discuss experimental results in section 6. Finally, we conclude in section 7.</p>
</sec>
<sec id="s2">
<title>2. Related Work</title>
<p><bold>Visio-linguistic models</bold>. There has been a lot of work on creating the best possible general-purpose visio-linguistic models. Most of the recent models are based on the Transformer architecture (Vaswani et al., <xref ref-type="bibr" rid="B30">2017</xref>); examples include ViLBERT (Lu et al., <xref ref-type="bibr" rid="B17">2019</xref>), LXMERT (Tan and Bansal, <xref ref-type="bibr" rid="B29">2019</xref>), Uniter (Chen et al., <xref ref-type="bibr" rid="B8">2020</xref>), and VL-BERT (Su et al., <xref ref-type="bibr" rid="B28">2019</xref>). Often, the Transformer architecture is complemented with a convolutional Region Proposal Network (RPN) to convert images into sets of region features: Ren et al. (<xref ref-type="bibr" rid="B24">2015</xref>) and Anderson et al. (<xref ref-type="bibr" rid="B1">2018</xref>) present examples of RPNs that have been used for this purpose. The two articles that use <italic>AutoDeconfounding</italic>, the topic of this article, both use ViLBERT (Lu et al., <xref ref-type="bibr" rid="B17">2019</xref>) as a basis for multi-modal tasks.</p>
<p><bold>Issue of spurious correlations</bold>. The issue of models learning spurious correlations is widely recognized. Sch&#x000F6;lkopf et al. (<xref ref-type="bibr" rid="B26">2021</xref>) give a good overview of the theoretical benefits of learning causal representations as a way to address spurious correlations. A number of works have tried to put ideas from causality into practice to address this issue. Most of these assume a certain fixed underlying SCM, and use this structure to adjust for confounders. Examples include Qi et al. (<xref ref-type="bibr" rid="B23">2020</xref>), Zhang et al. (<xref ref-type="bibr" rid="B36">2020a</xref>), Niu et al. (<xref ref-type="bibr" rid="B18">2021</xref>), and Yue et al. (<xref ref-type="bibr" rid="B34">2020</xref>). An important difference between <italic>AutoDeconfounding</italic> and these works is that in <italic>AutoDeconfounding</italic> the structure of the SCM is <italic>automatically discovered</italic>, and that the variables of the SCM correspond to individual object classes.</p>
<p><bold>Discovering causal structure</bold>. There is theoretical work explaining the &#x0201C;ladder of causality&#x0201D; (Pearl and Mackenzie, <xref ref-type="bibr" rid="B22">2018</xref>), where the different &#x0201C;rungs&#x0201D; of the ladder correspond to the availability of observational, interventional and counterfactual information, respectively. The Causal Hierarchy Theorem (CHT) (Bareinboim et al., <xref ref-type="bibr" rid="B3">2020</xref>) shows that it is often very hard to discover the complete causal structure of an SCM (the second &#x0201C;rung&#x0201D; of the ladder) from purely observational data (the first &#x0201C;rung&#x0201D; of the ladder). The CHT does not, however, preclude discovering the causal structure of an SCM up to its Markov Equivalence Class<xref ref-type="fn" rid="fn0004"><sup>4</sup></xref>. This has been done with constraint-based methods such as that of Colombo et al. (<xref ref-type="bibr" rid="B10">2012</xref>), and score-based methods such as that of Chickering (<xref ref-type="bibr" rid="B9">2002</xref>).</p>
<p>Despite the CHT, there have also been attempts to go beyond the Markov Equivalence Class. One tactic is supervised training on ground truth causal annotations of synthetic data, and porting those results to real data (Lopez-Paz et al., <xref ref-type="bibr" rid="B16">2017</xref>). Another makes use of distribution shifts to discover causal structure: this does not violate the CHT, because distribution shifts serve as a proxy for access to interventional (&#x0201C;second rung&#x0201D;) data. More specifically, Bengio et al. (<xref ref-type="bibr" rid="B4">2019</xref>) and, more recently, Ke et al. (<xref ref-type="bibr" rid="B13">2020</xref>) train different models with different factorizations, observe which model is best at adapting to out-of-distribution data, and retroactively conclude that that model's factorization is the &#x0201C;causal&#x0201D; one.</p>
<p>In contrast to these methods, <italic>AutoDeconfounding</italic> does not make use of distribution shifts nor of ground truth labeled causal data, but only of &#x0201C;first rung&#x0201D; observational data.</p>
<p><bold>Investigating</bold> <italic><bold>AutoDeconfounding</bold></italic>. The works that implement <italic>AutoDeconfounding</italic> (Wang et al., <xref ref-type="bibr" rid="B31">2020</xref> and Zhang et al., <xref ref-type="bibr" rid="B37">2020b</xref>) both explain the benefit of <italic>AutoDeconfounding</italic> as coming from its deconfounding effect. This article presents novel experiments that surface a number of issues with <italic>AutoDeconfounding</italic>. We focus on the implementation by Zhang et al. (<xref ref-type="bibr" rid="B37">2020b</xref>) as it is the state of the art for <italic>AutoDeconfounding</italic>. First, we make the underlying SCM more explicit, showing the assumptions under which it corresponds to deconfounding. Second, we compare with the non-causal baseline in a way that better isolates the effect of <italic>AutoDeconfounding</italic>. Finally, we evaluate whether confounders (and thus &#x0201C;second rung&#x0201D; information about the underlying SCM) are indeed found by collecting and evaluating on a ground-truth confounder dataset.</p>
</sec>
<sec id="s3">
<title>3. Background: Causality</title>
<p><italic>AutoDeconfounding</italic> is based on the do-calculus (Pearl, <xref ref-type="bibr" rid="B21">2012</xref>), which is a calculus that operates on variables in so-called Structural Causal Models or SCMs. This section will explain the key aspects of SCMs and the do-calculus that are necessary to understand the discussion in the rest of this article.</p>
<sec>
<title>3.1. Structural Causal Models</title>
<p>To understand SCMs, consider the following example. Say that we have observations of the presence (1) or absence (0) of rain clouds (<italic>R</italic>) and umbrellas (<italic>U</italic>) in a scene, see <xref ref-type="table" rid="T1">Table 1</xref>:</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Example observations of the presence of certain objects in a scene.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Rain cloud (R)</bold></th>
<th valign="top" align="center"><bold>Umbrella (U)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">1</td>
<td valign="top" align="center">1</td>
</tr>
<tr>
<td valign="top" align="left">0</td>
<td valign="top" align="center">0</td>
</tr>
<tr>
<td valign="top" align="left">0</td>
<td valign="top" align="center">0</td>
</tr>
<tr>
<td valign="top" align="left">1</td>
<td valign="top" align="center">0</td>
</tr>
<tr>
<td valign="top" align="left">&#x022EE;</td>
<td valign="top" align="center">&#x022EE;</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>We might observe the following joint probabilities:</p>
<disp-formula id="E49"><graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-05-736791-e0001.tif"/></disp-formula>
<p>From the point of view of causal modeling, any particular distribution of data observed by a model is generated by <italic>physical, deterministic causal mechanisms</italic> (Parascandolo et al., <xref ref-type="bibr" rid="B19">2018</xref>).</p>
<p>For our example, say that the underlying mechanisms that generate the data are as follows. Whether or not it rains depends on factors outside of the observed data (such as the humidity of the air, the temperature, etc.). Whether or not an umbrella is present depends on whether it rains as well as on factors outside of the observed data (such as the psychology of the people carrying the umbrellas, etc.). Let us collect these external factors in the variables <italic>E</italic><sub><italic>R</italic></sub> for the rain cloud and <italic>E</italic><sub><italic>U</italic></sub> for the umbrella. Then, the value of <italic>R</italic> is determined by the function <italic>R</italic> &#x0003D; <italic>f</italic><sub><italic>R</italic></sub>(<italic>E</italic><sub><italic>R</italic></sub>), and the value of <italic>U</italic> is determined by the function <italic>U</italic> &#x0003D; <italic>f</italic><sub><italic>U</italic></sub>(<italic>R, E</italic><sub><italic>U</italic></sub>). Because the model cannot observe <italic>E</italic><sub><italic>R</italic></sub> and <italic>E</italic><sub><italic>U</italic></sub>, however, the best it can do is view them as random variables, and try to learn their distribution.</p>
<p>In this way, the probability distributions of the <italic>E</italic><sub><italic>i</italic></sub>, along with the causal mechanisms <italic>f</italic><sub><italic>i</italic></sub>, generate a probability distribution of the observed variables: <italic>P</italic>(<italic>R, U</italic>). The set of observed variables and their relation is typically represented in an SCM. An SCM is a Directed Acyclic Graph (DAG) whose vertices consist of observed random variables <italic>X, Y, Z</italic>, &#x02026; and in which a directed edge between nodes implies that the origin node is in the domain of the causal mechanism of the target node. Formally, if</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>X</mml:mi><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>P</mml:mi><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>X</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>E</mml:mi></mml:mrow><mml:mrow><mml:mi>X</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>then <italic>PA</italic><sub><italic>X</italic></sub> is the set of variables with outgoing arrows into <italic>X</italic>.</p>
<p><xref ref-type="fig" rid="F1">Figure 1</xref> shows the SCM for the example we discussed. Typically, the possible values and the unobserved variables are not explicitly shown, but we do so here for clarity.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Example SCM.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-05-736791-g0001.tif"/>
</fig>
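<p>The data-generating process above can be made concrete with a short simulation. The sketch below is illustrative only: the mechanisms <italic>f</italic><sub><italic>R</italic></sub> and <italic>f</italic><sub><italic>U</italic></sub> and all probabilities are assumptions for the example, not taken from any dataset.</p>

```python
import random

random.seed(0)

def sample_scene():
    # Unobserved external factors E_R and E_U, modeled as independent noise.
    e_r = random.random()
    e_u = random.random()
    # R = f_R(E_R): a rain cloud is present in 30% of scenes (assumed).
    r = 1 if e_r < 0.3 else 0
    # U = f_U(R, E_U): umbrellas are likely when it rains, rare otherwise (assumed).
    u = 1 if e_u < (0.8 if r == 1 else 0.1) else 0
    return r, u

# The deterministic mechanisms plus the noise distributions induce P(R, U).
scenes = [sample_scene() for _ in range(100_000)]
p_u_given_r1 = sum(u for r, u in scenes if r == 1) / sum(r for r, _ in scenes)
p_u_given_r0 = sum(u for r, u in scenes if r == 0) / sum(1 - r for r, _ in scenes)
print(p_u_given_r1, p_u_given_r0)  # close to the assumed 0.8 and 0.1
```

<p>Although each mechanism is deterministic given its inputs, marginalizing over the unobserved <italic>E</italic><sub><italic>i</italic></sub> yields the observed joint distribution <italic>P</italic>(<italic>R, U</italic>).</p>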
</sec>
<sec>
<title>3.2. Why Do We Want Models That Are More &#x0201C;Causal&#x0201D;?</title>
<p>To make the use of causality concrete for a visio-linguistic application, consider a model that needs to caption images. To do so successfully, it will need to recognize the different objects in the image. In order to recognize objects, the model will first of all make use of the per-object pixel-level information. However, to identify blurry, hard-to-recognize objects, it can complement that information with the context of the object. In other words, it can make use of knowledge about which objects are more or less likely to occur together. Consider the case where the image to be captioned is that of a rainy street. It is then useful for the model to be able to predict whether what it sees is a rain cloud given that it sees umbrellas (or vice versa).</p>
<p>A model that needs to learn parameters to perform this task can solve this in different ways. For instance, it could learn a parameter for each possible value of the joint probability <italic>P</italic>(<italic>U, R</italic>). This might be tractable for two variables, but for more variables, this approach requires too many parameters. Alternatively, it can learn a <italic>factorization</italic> of the joint probability and only learn parameters for each value of the factors. The two possible factorizations in this case are on the one hand <italic>P</italic>(<italic>U, R</italic>) &#x0003D; <italic>P</italic>(<italic>U</italic>|<italic>R</italic>)<italic>P</italic>(<italic>R</italic>), which is aligned with the underlying causal mechanism and on the other hand <italic>P</italic>(<italic>U, R</italic>) &#x0003D; <italic>P</italic>(<italic>R</italic>|<italic>U</italic>)<italic>P</italic>(<italic>U</italic>), which is not.</p>
<p>Each of these factorizations uses the same number of parameters, and each will be able to correctly answer queries like &#x0201C;what is the probability of not seeing umbrellas given that it rains.&#x0201D;</p>
<p>However, consider that there is a distribution change, for example, because we want our model to work for a location with more rain. In this case the factor <italic>P</italic>(<italic>R</italic>) has changed because of a change in the distribution of <italic>E</italic><sub><italic>R</italic></sub>. To adapt, the factorization that was aligned with the underlying causal mechanism needs to change the parameter for only one factor, while the other factorization needs to change all its parameters.</p>
<p>Generalizing from this toy example, a distribution change typically only affects a few external factors <italic>E</italic><sub><italic>i</italic></sub>. Because of this, a model with a factorization that is aligned with the underlying causal mechanisms (a &#x0201C;causal&#x0201D; factorization Sch&#x000F6;lkopf, <xref ref-type="bibr" rid="B25">2019</xref>) will typically need to update fewer parameters than a model with another (&#x0201C;entangled&#x0201D;) factorization, and thus perform well on out-of-distribution data with fewer modifications (Bengio et al., <xref ref-type="bibr" rid="B4">2019</xref>).</p>
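<p>The parameter-counting argument can be checked directly. In the sketch below (all probabilities are assumed for illustration), only the factor <italic>P</italic>(<italic>R</italic>) of the data-generating process changes, and we compare how many parameters each factorization must update.</p>

```python
def causal_params(p_r, p_u_given_r):
    # Parameters of the causal factorization P(U, R) = P(U|R) P(R).
    return {"P(R=1)": p_r, "P(U=1|R=0)": p_u_given_r[0], "P(U=1|R=1)": p_u_given_r[1]}

def entangled_params(p_r, p_u_given_r):
    # Parameters of the entangled factorization P(U, R) = P(R|U) P(U),
    # derived from the same joint distribution via Bayes' rule.
    p_u = (1 - p_r) * p_u_given_r[0] + p_r * p_u_given_r[1]
    p_r_given_u1 = p_r * p_u_given_r[1] / p_u
    p_r_given_u0 = p_r * (1 - p_u_given_r[1]) / (1 - p_u)
    return {"P(U=1)": p_u, "P(R=1|U=0)": p_r_given_u0, "P(R=1|U=1)": p_r_given_u1}

def changed(a, b):
    # Which parameters differ between the two settings?
    return [k for k in a if abs(a[k] - b[k]) > 1e-9]

p_u_given_r = {0: 0.1, 1: 0.8}  # mechanism P(U|R), untouched by the shift
# Distribution shift: P(R=1) moves from 0.3 to 0.6 (a rainier location).
before_c, after_c = causal_params(0.3, p_u_given_r), causal_params(0.6, p_u_given_r)
before_e, after_e = entangled_params(0.3, p_u_given_r), entangled_params(0.6, p_u_given_r)

print(changed(before_c, after_c))  # only the P(R=1) parameter changes
print(changed(before_e, after_e))  # all three entangled parameters change
```

<p>Both factorizations encode the same joint distribution with the same number of parameters, but after the shift the causal factorization needs to update only one of them.</p>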
</sec>
<sec>
<title>3.3. Do-Operator</title>
<p>Sometimes, we want to predict what effect <italic>setting</italic> the value of some variable <italic>X</italic> will have on the probability distribution of another variable <italic>Y</italic>, rather than what effect <italic>observing</italic> <italic>X</italic> will have.</p>
<p>Keeping Equation (1) in mind, &#x0201C;setting&#x0201D; a variable <italic>X</italic> to a value is the same as replacing the mechanism <italic>X</italic> &#x0003D; <italic>f</italic>(<italic>PA</italic><sub><italic>X</italic></sub>, <italic>E</italic><sub><italic>X</italic></sub>) that produces <italic>X</italic> with a constant <italic>X</italic> &#x0003D; <italic>x</italic>. Such an intervention to the underlying mechanisms then changes the resulting overall probability distribution. The distribution resulting from setting a variable <italic>X</italic> &#x0003D; <italic>x</italic> is indicated with the notation of the &#x0201C;do&#x0201D;-operator: the distribution of another variable <italic>Y</italic> in the SCM after we set (<italic>X</italic> &#x0003D; <italic>x</italic>) is noted as</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M2"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Y</mml:mi><mml:mo stretchy="false">|</mml:mo><mml:mi>d</mml:mi><mml:mi>o</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>X</mml:mi><mml:mo>=</mml:mo><mml:mi>x</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Note the difference with the distribution of <italic>Y</italic> given that we <italic>observe</italic> <italic>X</italic> (rather than setting it):</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M3"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Y</mml:mi><mml:mo stretchy="false">|</mml:mo><mml:mi>X</mml:mi><mml:mo>=</mml:mo><mml:mi>x</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>An important case where the do-operator highlights the difference between observation and intervention is in the case of confounders. When two variables <italic>X</italic> and <italic>Y</italic> share a common parent <italic>Z</italic> in the SCM, this parent is called a <italic>confounder</italic>. The statistical dependence between <italic>X</italic> and <italic>Y</italic> is then (at least partly) attributable to this parent.</p>
<p>In the presence of a confounder, <italic>P</italic>(<italic>Y</italic>|<italic>X</italic> &#x0003D; <italic>x</italic>) will be different from <italic>P</italic>(<italic>Y</italic>|<italic>do</italic>(<italic>X</italic> &#x0003D; <italic>x</italic>)).</p>
<p>For example, consider the SCM in <xref ref-type="fig" rid="F2">Figure 2</xref>. We might reasonably expect <italic>P</italic>(<italic>L</italic> &#x0003D; 1|<italic>U</italic> &#x0003D; 1) &#x0003E; <italic>P</italic>(<italic>L</italic> &#x0003D; 1|<italic>U</italic> &#x0003D; 0): the probability that a puddle is present increases given that we <italic>observe</italic> an umbrella. On the other hand, we should not expect that the probability of a puddle being present changes when we <italic>put</italic> an umbrella in a scene: <italic>P</italic>(<italic>L</italic> &#x0003D; 1|<italic>do</italic>(<italic>U</italic> &#x0003D; 1)) &#x0003D; <italic>P</italic>(<italic>L</italic> &#x0003D; 1|<italic>do</italic>(<italic>U</italic> &#x0003D; 0)) &#x0003D; <italic>P</italic>(<italic>L</italic> &#x0003D; 1).</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Example SCM.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-05-736791-g0002.tif"/>
</fig>
</sec>
<sec>
<title>3.4. Deconfounding</title>
<p>The presence of a confounder can stand in the way of learning a causally aligned factorization. Consider again the SCM from <xref ref-type="fig" rid="F2">Figure 2</xref>. The correct causal factorization is <italic>P</italic>(<italic>R, L, U</italic>) &#x0003D; <italic>P</italic>(<italic>U</italic>|<italic>R</italic>)<italic>P</italic>(<italic>L</italic>|<italic>R</italic>)<italic>P</italic>(<italic>R</italic>), but the correlation between <italic>puddle</italic> and <italic>umbrella</italic> might cause the model to use <italic>P</italic>(<italic>L</italic>|<italic>U</italic>) or <italic>P</italic>(<italic>U</italic>|<italic>L</italic>) as a factor. To discourage the use of these factors, we want to teach the model <italic>P</italic>(<italic>L</italic>|<italic>do</italic>(<italic>U</italic>)) (resp. <italic>P</italic>(<italic>U</italic>|<italic>do</italic>(<italic>L</italic>))) instead, as this spurious correlation disappears in the interventional view. In many domains, however, such as the visio-linguistic domain, we cannot perform real-life &#x0201C;interventions&#x0201D; on the data: we cannot &#x0201C;put&#x0201D; a rain cloud in a captioned image of a street and expect umbrellas to appear.</p>
<p>However, if we know the underlying SCM, we can still estimate <italic>P</italic>(<italic>Y</italic>|<italic>do</italic>(<italic>X</italic>)). To do so, we have to make sure that each confounder <italic>Z</italic><sub><italic>i</italic></sub> is adjusted for. Intuitively, we want each confounder to be &#x0201C;homogeneous&#x0201D; (Pearl, <xref ref-type="bibr" rid="B20">2009</xref>) with regard to <italic>X</italic>: we should take samples from the SCM in such a way that each <italic>Z</italic><sub><italic>i</italic></sub> has the same distribution for every value of <italic>X</italic>. In this way, we &#x0201C;neutralize&#x0201D; any effect <italic>Z</italic><sub><italic>i</italic></sub> might have.</p>
<p>For the example of <xref ref-type="fig" rid="F2">Figure 2</xref>, if we want to discover whether there is a causal link between <italic>L</italic> and <italic>U</italic>, we want to compare how often we see puddles when an umbrella <italic>is</italic> present with how often we see them when one <italic>is not</italic>, without being confounded by one case having rain more often than the other. We do this by making sure there are as many rainy days in the case without an umbrella as in the case with an umbrella.</p>
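<p>This homogenization can be sketched numerically. In the toy simulation below (all parameters are illustrative, not from the paper), the pooled umbrella&#x02013;puddle association is strong, but within each rain stratum it vanishes:</p>

```python
import random

random.seed(1)

# Observational samples from a toy SCM (illustrative parameters):
# rain R drives both umbrellas U and puddles L.
data = []
for _ in range(200_000):
    r = random.random() < 0.3
    u = random.random() < (0.9 if r else 0.1)
    pud = random.random() < (0.8 if r else 0.05)
    data.append((r, u, pud))

def p_l_given(u_val, r_val=None):
    # Estimate P(L=1 | U=u_val), optionally within one rain stratum.
    rows = [pud for r, u, pud in data
            if u == u_val and (r_val is None or r == r_val)]
    return sum(rows) / len(rows)

# Pooled: a strong spurious umbrella-puddle association.
gap_pooled = p_l_given(True) - p_l_given(False)
# Stratified by rain (R held "homogeneous"): the association vanishes.
gap_rainy = p_l_given(True, r_val=True) - p_l_given(False, r_val=True)
gap_dry = p_l_given(True, r_val=False) - p_l_given(False, r_val=False)
print(gap_pooled, gap_rainy, gap_dry)
```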
<p>More generally, there can be more than one confounder (e.g., if there are also sprinklers that cause people to take out their umbrellas and puddles to form), and each variable can have more than two possible values (e.g., if the variable is &#x0201C;color of the umbrella&#x0201D; rather than &#x0201C;presence of an umbrella&#x0201D;). In this more general case, the &#x0201C;back-door criterion&#x0201D; tells us which set of variables <italic>S</italic><sub><italic>Z</italic></sub> = {<italic>Z</italic><sub>1</sub>, &#x02026;, <italic>Z</italic><sub><italic>n</italic></sub>} can be adjusted for to identify the causal link between two target variables <italic>X</italic> and <italic>Y</italic>: <italic>S</italic><sub><italic>Z</italic></sub> is any set of variables such that:</p>
<list list-type="bullet">
<list-item><p>No node in <italic>S</italic><sub><italic>Z</italic></sub> is a descendant of <italic>X</italic>;</p></list-item>
<list-item><p>The nodes in <italic>S</italic><sub><italic>Z</italic></sub> block every path between <italic>X</italic> and <italic>Y</italic> that contains an arrow into <italic>X</italic>.</p></list-item>
</list>
<p>Note that this means it is also possible to adjust for <italic>too many</italic> variables, creating a spurious correlation where there was none before adjusting, so simply adjusting for <italic>every</italic> variable will not work.</p>
<p>Formally, if each <italic>Z</italic><sub><italic>i</italic></sub> &#x02208; <italic>S</italic><sub><italic>Z</italic></sub> has <italic>n</italic><sub><italic>i</italic></sub> possible values <inline-formula><mml:math id="M4"><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x02026;</mml:mo><mml:mi>c</mml:mi></mml:math></inline-formula>:</p>
<disp-formula id="E4"><label>(4)</label><mml:math id="M5"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Y</mml:mi><mml:mo stretchy="false">|</mml:mo><mml:mi>d</mml:mi><mml:mi>o</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub>
<mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msubsup></mml:mrow></mml:munderover></mml:mstyle><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Y</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo stretchy="false">|</mml:mo><mml:mi>d</mml:mi><mml:mi>o</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E5"><label>(5)</label><mml:math id="M6"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msubsup></mml:mrow></mml:munderover></mml:mstyle><mml:mi>P</mml:mi><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mrow><mml:mi>Y</mml:mi><mml:mo stretchy="false">|</mml:mo><mml:mi>d</mml:mi><mml:mi>o</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>&#x000B7;</mml:mo><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo stretchy="false">|</mml:mo><mml:mi>d</mml:mi><mml:mi>o</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E7"><label>(6)</label><mml:math id="M8"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msubsup></mml:mrow></mml:munderover></mml:mstyle><mml:mi>P</mml:mi><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mrow><mml:mi>Y</mml:mi><mml:mo stretchy="false">|</mml:mo><mml:mi>X</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>&#x000B7;</mml:mo><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>For example, for the case where there are only two confounders <italic>Z</italic><sub>1</sub> and <italic>Z</italic><sub>2</sub>, each with only two possible values: absent (0) and present (1), Equation (6) becomes</p>
<disp-formula id="E9"><label>(7)</label><mml:math id="M10"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Y</mml:mi><mml:mo stretchy="false">|</mml:mo><mml:mi>d</mml:mi><mml:mi>o</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Y</mml:mi><mml:mo stretchy="false">|</mml:mo><mml:mi>X</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Y</mml:mi><mml:mo 
stretchy="false">|</mml:mo><mml:mi>X</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Y</mml:mi><mml:mo stretchy="false">|</mml:mo><mml:mi>X</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Y</mml:mi><mml:mo stretchy="false">|</mml:mo><mml:mi>X</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
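<p>As a sketch, this adjustment formula can be evaluated directly once the conditional tables are known. The tables below are made-up illustrative numbers, not taken from the paper:</p>

```python
from itertools import product

# Made-up tables for two binary confounders Z1, Z2, a binary treatment X,
# and the outcome probability P(Y=1 | X, Z1, Z2).
p_z = {(0, 0): 0.4, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.1}   # P(Z1, Z2)

def p_y_given(x, z1, z2):
    # Illustrative conditional; values stay within [0, 1].
    return 0.1 + 0.5 * x + 0.2 * z1 + 0.1 * z2

def p_y_do(x):
    # Adjustment: sum over all confounder configurations of
    # P(Y | X, Z1, Z2) * P(Z1, Z2).
    return sum(p_y_given(x, z1, z2) * p_z[(z1, z2)]
               for z1, z2 in product((0, 1), repeat=2))

effect = p_y_do(1) - p_y_do(0)   # the deconfounded effect of X on Y
print(effect)
```

<p>With these tables the adjusted effect recovers exactly the 0.5 contribution of <italic>X</italic> in the conditional, because the confounder terms are averaged out with the same weights for both values of <italic>X</italic>.</p>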
<p>This example is relevant as it applies to <italic>AutoDeconfounding</italic>, where the variables correspond to different objects, and their possible values are either &#x0201C;present&#x0201D; (1) or &#x0201C;absent&#x0201D; (0). The SCM assumed by <italic>AutoDeconfounding</italic> does not put restrictions on the connectivity between objects, as long as the resulting graph is a DAG. Moreover, it assumes there are no hidden confounders.</p>
<p>We will discuss the link of <italic>AutoDeconfounding</italic> with deconfounding in more detail in section 5.1, after first clarifying the details of <italic>AutoDeconfounding</italic> itself in section 4.</p>
</sec>
</sec>
<sec id="s4">
<title>4. Details of <italic>AutoDeconfounding</italic></title>
<p>There are two variations of <italic>AutoDeconfounding</italic>: the one as implemented in VC-R-CNN (Wang et al., <xref ref-type="bibr" rid="B31">2020</xref>), referred to as AD-V, and the one as implemented in DeVLBERT (Zhang et al., <xref ref-type="bibr" rid="B37">2020b</xref>), referred to as AD-D.</p>
<p>Both Wang et al. (<xref ref-type="bibr" rid="B31">2020</xref>) and Zhang et al. (<xref ref-type="bibr" rid="B37">2020b</xref>) aim to improve performance on visio-linguistic tasks (e.g., Visual Question Answering or Image Captioning). Their models fall within a category of approaches that use <italic>transfer learning</italic>: they pretrain on a task different from the target task (a &#x0201C;proxy&#x0201D; task) for which more data is available (for example, context prediction or masked language modeling), and then fine-tune the resulting model on the actual downstream tasks of interest. Their innovation comes from an adaptation to the proxy task that is intended to prevent the model from learning spurious correlations.</p>
<p>VC-R-CNN and DeVLBERT differ slightly in both the context in which they use <italic>AutoDeconfounding</italic> and the exact implementation of <italic>AutoDeconfounding</italic>. This section will explain both models in detail.</p>
<sec>
<title>4.1. VC-R-CNN</title>
<sec>
<title>4.1.1. Backbone and Modalities</title>
<p>The backbone of VC-R-CNN (Wang et al., <xref ref-type="bibr" rid="B31">2020</xref>) is an image-region feature extractor [BUTD (Anderson et al., <xref ref-type="bibr" rid="B1">2018</xref>)], which produces feature vectors for all regions-of-interest (ROIs) in an image. The image-region feature extractor is adapted by retraining it on a proxy task designed to prevent spurious correlations from being learned from vision data. Hence, VC-R-CNN focuses its contribution only on the image modality during pretraining.</p>
</sec>
<sec>
<title>4.1.2. Proxy Task</title>
<p>During pretraining, the loss function of VC-R-CNN consists of two terms:</p>
<disp-formula id="E11"><label>(8)</label><mml:math id="M12"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mi>o</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>V</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mi>a</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mo>,</mml:mo><mml:mi>V</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mtext>AD-V</mml:mtext></mml:mrow></mml:msub><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The base objective <italic>L</italic><sub><italic>base,V</italic></sub> is ROI classification, i.e., predicting the class of each of the <italic>N</italic> ROIs in the image. More precisely, if <italic>x</italic><sub><italic>i</italic></sub> is the index of the class of the <italic>i</italic>th ROI and <italic><bold>p</bold></italic> is the vector of predicted probabilities computed based on feature vector <italic><bold>f</bold></italic><sub><italic><bold>x</bold></italic></sub> extracted for the <italic>i</italic>th ROI, then the base objective is:</p>
<disp-formula id="E12"><label>(9)</label><mml:math id="M13"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mi>a</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mo>,</mml:mo><mml:mi>V</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mo>-</mml:mo><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>p</mml:mi></mml:mstyle><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The additional objective <italic>L</italic><sub>AD-V</sub> concerns context prediction, i.e., predicting the class of one ROI based on the features of a different ROI in the image. It does this for each pair of ROIs in the image and sums the resulting losses:</p>
<disp-formula id="E13"><label>(10)</label><mml:math id="M14"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mi>D</mml:mi><mml:mo>-</mml:mo><mml:mi>V</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mo>-</mml:mo><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msubsup></mml:mstyle><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M15"><mml:mstyle mathvariant="bold-italic"><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msubsup></mml:mstyle></mml:math></inline-formula> is a probability distribution over the possible ROI classes for the <italic>i</italic>th ROI, computed from the feature vector <italic><bold>f</bold></italic><sub><italic><bold>x</bold></italic></sub> of the context ROI used for that prediction, <italic>y</italic><sub><italic>i</italic></sub> is the index of the class of the <italic>i</italic>th ROI, and <italic>N</italic> is the number of ROIs in an image.</p>
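<p>The two loss terms can be sketched with toy probability vectors; the numbers of ROIs and classes, as well as the predicted distributions, are illustrative placeholders, not the model's outputs:</p>

```python
import math

def nll(probs, target_idx):
    # Negative log-likelihood of the target class.
    return -math.log(probs[target_idx])

# Toy predictions for N = 2 ROIs over 3 classes.
p_self = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]  # p, from each ROI's own features
p_ctx = [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2]]   # p_y^x, from a context ROI's features
classes = [0, 1]                             # ground-truth class indices

l_base = sum(nll(p, c) for p, c in zip(p_self, classes))  # base ROI classification
l_ad_v = sum(nll(p, c) for p, c in zip(p_ctx, classes))   # context prediction
l_total = l_base + l_ad_v                                 # combined pretraining loss
print(l_base, l_ad_v, l_total)
```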
</sec>
<sec>
<title>4.1.3. AD-V</title>
<p>In order to predict the context in a &#x0201C;causal&#x0201D; way, VC-R-CNN introduces two elements that are gathered from the entire dataset: a confounder dictionary <inline-formula><mml:math id="M16"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:math></inline-formula> and prior probabilities <inline-formula><mml:math id="M17"><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="-tex-caligraphic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>.</p>
<p>The confounder dictionary <inline-formula><mml:math id="M18"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:math></inline-formula> is a set of <italic>C</italic> vectors, one per image class, where each <inline-formula><mml:math id="M19"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula> consists of the average ROI feature of all ROIs in all images belonging to class <italic>c</italic>. Given <italic><bold>f</bold></italic><sub><italic><bold>y</bold></italic></sub> and <italic><bold>f</bold></italic><sub><italic><bold>x</bold></italic></sub>, where <italic><bold>f</bold></italic><sub><italic><bold>y</bold></italic></sub> is the feature vector of the ROI whose class is to be predicted and <italic><bold>f</bold></italic><sub><italic><bold>x</bold></italic></sub> is the feature vector of the context ROI to use for that prediction, VC-R-CNN computes a vector of attention scores <bold>&#x003B1;</bold> to select among <inline-formula><mml:math id="M20"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:math></inline-formula> those variables that are confounders for the classes corresponding to <italic><bold>f</bold></italic><sub><italic><bold>x</bold></italic></sub> and <italic><bold>f</bold></italic><sub><italic><bold>y</bold></italic></sub>. More precisely, the attention is computed as the inner product between <italic><bold>f</bold></italic><sub><italic><bold>y</bold></italic></sub> and <inline-formula><mml:math id="M21"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:math></inline-formula> after projection to a shared embedding space, and converted to a probability distribution using softmax:</p>
<disp-formula id="E14"><label>(11)</label><mml:math id="M22"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>&#x003B1;</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mtext>softmax</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mo>&#x02329;</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mstyle mathvariant="bold-italic"><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:msub></mml:mstyle></mml:mrow><mml:mo>&#x0232A;</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x003B1;[<italic>c</italic>] denotes the element at the <italic>c</italic>th index in vector <bold>&#x003B1;</bold>, &#x02329;&#x000B7;, &#x000B7;&#x0232A; denotes the inner product, and <inline-formula><mml:math id="M23"><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>D</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>D</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> with <italic>D</italic> the dimension of feature representations. In section 5.3, we use this <bold>&#x003B1;</bold> to investigate whether confounders are actually found.</p>
<p>Next, the model retrieves a pooled vector <italic><bold>f</bold></italic><sub><italic><bold>z</bold></italic></sub> by taking a sum of each of the <italic>C</italic> vectors in <inline-formula><mml:math id="M24"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:math></inline-formula> weighted by both the attention score &#x003B1;[<italic>c</italic>] and the prior probability from <inline-formula><mml:math id="M25"><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="-tex-caligraphic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>:</p>
<disp-formula id="E15"><label>(12)</label><mml:math id="M26"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mstyle mathvariant="bold-italic"><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:msub></mml:mstyle><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mi>&#x003B1;</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="-tex-caligraphic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Here, vector <inline-formula><mml:math id="M27"><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="-tex-caligraphic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> is a probability distribution over the classes according to how many images they occur in<xref ref-type="fn" rid="fn0005"><sup>5</sup></xref>. The weighting of <inline-formula><mml:math id="M31"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:math></inline-formula> by its prior is intended to realize the deconfounding. In section 5.1, we explain the exact link with deconfounding.</p>
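<p>Equation (12) is a single weighted sum; a minimal sketch, in which the attention vector and per-class image counts are invented for illustration:</p>

```python
import numpy as np

C, D = 5, 8
rng = np.random.default_rng(1)
Z = rng.normal(size=(C, D))  # confounder dictionary
alpha = np.full(C, 1.0 / C)  # attention scores (uniform, for the example)

# P_Z[c]: prior probability of class c, proportional to the number of
# images the class occurs in (counts made up here).
image_counts = np.array([50.0, 30.0, 10.0, 5.0, 5.0])
P_Z = image_counts / image_counts.sum()

# Equation (12): each dictionary entry is weighted by attention AND prior.
f_z = (Z * (alpha * P_Z)[:, None]).sum(axis=0)
```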
<p>Finally, a simple feed-forward network <italic>FFN</italic> takes the concatenation of <italic><bold>f</bold></italic><sub><italic><bold>x</bold></italic></sub> and <italic><bold>f</bold></italic><sub><italic><bold>z</bold></italic></sub> and transforms it to the prediction over the possible classes <inline-formula><mml:math id="M32"><mml:mstyle mathvariant="bold-italic"><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msubsup></mml:mstyle></mml:math></inline-formula>:</p>
<disp-formula id="E16"><label>(13)</label><mml:math id="M33"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mstyle mathvariant="bold-italic"><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msubsup></mml:mstyle><mml:mo>=</mml:mo><mml:mtext>softmax</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>F</mml:mi><mml:mi>F</mml:mi><mml:mi>N</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mstyle><mml:mo>;</mml:mo><mml:mstyle mathvariant="bold-italic"><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:msub></mml:mstyle></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where [&#x000B7;;&#x000B7;] denotes the concatenation operation. The pipeline for VC-R-CNN is shown in <xref ref-type="fig" rid="F3">Figure 3</xref>.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Pipeline for VC-R-CNN.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-05-736791-g0003.tif"/>
</fig>
<p>In order to clarify the analysis that we make in section 5.1, it is useful to rewrite <inline-formula><mml:math id="M34"><mml:mstyle mathvariant="bold-italic"><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msubsup></mml:mstyle></mml:math></inline-formula>. First, in Equation (13) we can write the feed-forward network <italic>FFN</italic> as the function <italic>g</italic><sub><italic>V</italic></sub>:</p>
<disp-formula id="E17"><label>(14)</label><mml:math id="M35"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>F</mml:mi><mml:mi>F</mml:mi><mml:mi>N</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mstyle><mml:mo>;</mml:mo><mml:mstyle mathvariant="bold-italic"><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:msub></mml:mstyle></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mstyle><mml:mo>,</mml:mo><mml:mstyle mathvariant="bold-italic"><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:msub></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Second, in Equation (12) assume that each &#x003B1;[<italic>c</italic>] were to perfectly select confounders (i.e., if there are <italic>c</italic> true confounders, give each of those <italic>c</italic> true confounders a weight of <inline-formula><mml:math id="M36"><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:mfrac></mml:math></inline-formula><xref ref-type="fn" rid="fn0006"><sup>6</sup></xref>, and all other variables a weight of 0). Use <italic>S</italic><sub><italic>z</italic></sub> to denote the set of <italic>c</italic> indices of vectors in <inline-formula><mml:math id="M37"><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="-tex-caligraphic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> for which the corresponding class is a confounder for the classes corresponding to <italic><bold>f</bold></italic><sub><italic><bold>x</bold></italic></sub> and <italic><bold>f</bold></italic><sub><italic><bold>y</bold></italic></sub>. Then, we can summarize <inline-formula><mml:math id="M38"><mml:mstyle mathvariant="bold-italic"><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msubsup></mml:mstyle></mml:math></inline-formula> as:</p>
<disp-formula id="E18"><label>(15)</label><mml:math id="M39"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mstyle mathvariant="bold-italic"><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msubsup></mml:mstyle><mml:mo>=</mml:mo><mml:mtext>softmax</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mstyle><mml:mo>,</mml:mo><mml:mstyle displaystyle="true"><mml:munder><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x02208;</mml:mo><mml:msub><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:msub></mml:mrow></mml:munder></mml:mstyle><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="-tex-caligraphic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E19"><label>(16)</label><mml:math id="M40"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:mtext>softmax</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mstyle><mml:mo>,</mml:mo><mml:mfrac><mml:mrow><mml:mstyle displaystyle="true"><mml:msub><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x02208;</mml:mo><mml:msub><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mstyle><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="-tex-caligraphic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:mfrac></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
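<p>The simplification can be checked numerically: under a perfectly selecting attention (weight 1/<italic>c</italic> on each true confounder, 0 elsewhere), the full weighted sum of Equation (12) reduces to the restricted sum of Equations (15) and (16). The dictionary, prior, and confounder indices below are invented for the check.</p>

```python
import numpy as np

C, D = 6, 4
rng = np.random.default_rng(3)
Z = rng.normal(size=(C, D))  # confounder dictionary
P_Z = np.full(C, 1.0 / C)    # uniform prior, for simplicity

S_z = [1, 4]                 # indices of the "true" confounders
c = len(S_z)
alpha = np.zeros(C)
alpha[S_z] = 1.0 / c         # perfect selection: 1/c each, 0 elsewhere

# Equation (12) under this perfect attention ...
f_z_full = (Z * (alpha * P_Z)[:, None]).sum(axis=0)
# ... equals the restricted sum over S_z in Equations (15) and (16).
f_z_restricted = sum(Z[i] * P_Z[i] for i in S_z) / c
```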
<p>Note that when <italic>c</italic> &#x0003E; 1, the softmax from Equation (11) will tend to concentrate the selection on a single confounder, even when several of the pre-softmax attention scores are high.</p>
<p>To handle the case of multiple confounders, an element-wise sigmoid, subsequently rescaled so that the entries still sum to one, would have been a better choice. However, as Zhang et al. (<xref ref-type="bibr" rid="B37">2020b</xref>) and Wang et al. (<xref ref-type="bibr" rid="B31">2020</xref>) use a softmax for <italic>AutoDeconfounding</italic>, we will consider the softmax in our further analysis.</p>
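<p>A small numerical illustration of this point, with made-up scores: under a softmax, a second confounder whose score is only slightly lower receives much less weight, whereas a sigmoid rescaled to sum to one keeps the two nearly equal.</p>

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def scaled_sigmoid(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s / s.sum()  # rescale so the entries still sum to one

# Two high-scoring confounders among five candidates.
scores = np.array([5.0, 4.0, 0.0, 0.0, 0.0])

a = softmax(scores)         # ~[0.72, 0.27, ...]: the top score dominates
b = scaled_sigmoid(scores)  # ~[0.29, 0.28, ...]: both stay nearly equal
```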
</sec>
<sec>
<title>4.1.4. Pretraining Data</title>
<p>VC-R-CNN uses image datasets with ground truth bounding boxes [MS-COCO (Lin et al., <xref ref-type="bibr" rid="B15">2014</xref>) and Open Images (Kuznetsova et al., <xref ref-type="bibr" rid="B14">2020</xref>)]. The regions within these bounding boxes are the ROIs.</p>
</sec>
<sec>
<title>4.1.5. Downstream Tasks</title>
<p>VC-R-CNN aims to improve performance on Image Captioning (IC), Visual Commonsense Reasoning (VCR) (Zellers et al., <xref ref-type="bibr" rid="B35">2019</xref>) and Visual Question Answering (VQA) (Antol et al., <xref ref-type="bibr" rid="B2">2015</xref>). More precisely, the image features extracted with VC-R-CNN are used as part of the pipeline of the various downstream models<xref ref-type="fn" rid="fn0007"><sup>7</sup></xref>.</p>
</sec>
</sec>
<sec>
<title>4.2. DeVLBERT</title>
<sec>
<title>4.2.1. Backbone and Modalities</title>
<p>DeVLBERT (Zhang et al., <xref ref-type="bibr" rid="B37">2020b</xref>) uses the exact same backbone and modalities as ViLBERT (Lu et al., <xref ref-type="bibr" rid="B17">2019</xref>). Like ViLBERT, it uses a Faster R-CNN region extractor (Ren et al., <xref ref-type="bibr" rid="B24">2015</xref>) [with ResNet-101 (He et al., <xref ref-type="bibr" rid="B11">2016</xref>) backbone] to convert images into sets of region features, and initializes the weights for the linguistic stream with a BERT language model pretrained on the BookCorpus and English Wikipedia. It also adds the same cross-modal parameters.</p>
<p>Just like ViLBERT, DeVLBERT then performs visio-linguistic pretraining. The only difference is that it adds a number of &#x0201C;causal&#x0201D; parameters and losses during this pretraining, intended to make the model less prone to spurious correlations. These are detailed in the rest of this section. When the model is finetuned for downstream tasks, these extra parameters are no longer used: their only purpose is to shape the &#x0201C;non-causal&#x0201D; parameters during pretraining.</p>
</sec>
<sec>
<title>4.2.2. Proxy Task</title>
<p>During the pretraining, the loss function for DeVLBERT is:</p>
<disp-formula id="E20"><label>(17)</label><mml:math id="M41"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mi>o</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>D</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mi>a</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mo>,</mml:mo><mml:mi>D</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mtext>AD-D</mml:mtext></mml:mrow></mml:msub><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>DeVLBERT&#x00027;s base objective equals the one described in ViLBERT (Lu et al., <xref ref-type="bibr" rid="B17">2019</xref>) which consists of a masked token modeling loss for each modality (<italic>L</italic><sub><italic>MTM</italic><sub><italic>V</italic></sub></sub> and <italic>L</italic><sub><italic>MTM</italic><sub><italic>T</italic></sub></sub>), and a caption-image-alignment prediction loss (<italic>L</italic><sub><italic>VLA</italic></sub>):</p>
<disp-formula id="E21"><label>(18)</label><mml:math id="M42"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mi>a</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mo>,</mml:mo><mml:mi>D</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi><mml:mi>L</mml:mi><mml:mi>A</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>M</mml:mi><mml:mi>T</mml:mi><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>M</mml:mi><mml:mi>T</mml:mi><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>DeVLBERT&#x00027;s additional objective is similar to that of VC-R-CNN, but then extended to the multi-modal setting. More precisely, DeVLBERT predicts the class index <italic>y</italic><sub><italic>i</italic></sub> of the <italic>i</italic>th token (where &#x0201C;tokens&#x0201D; are words for the text modality, and ROIs for the vision modality) based on the contextualized feature <italic><bold>f</bold></italic><sub><italic><bold>y</bold></italic></sub> for that same token. This means <italic>L</italic><sub>AD-D</sub> consists of 4 loss terms:</p>
<disp-formula id="E22"><label>(19)</label><mml:math id="M43"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mi>D</mml:mi><mml:mo>-</mml:mo><mml:mi>D</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mi>D</mml:mi><mml:mo>-</mml:mo><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mn>2</mml:mn><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:msubsup><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mi>D</mml:mi><mml:mo>-</mml:mo><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mn>2</mml:mn><mml:mi>v</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:msubsup><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mi>D</mml:mi><mml:mo>-</mml:mo><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi><mml:mn>2</mml:mn><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:msubsup><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mi>D</mml:mi><mml:mo>-</mml:mo><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi><mml:mn>2</mml:mn><mml:mi>v</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>t</italic> stands for the text modality and <italic>v</italic> for the vision modality. As in VC-R-CNN, a cross-entropy loss is used for the <inline-formula><mml:math id="M44"><mml:msubsup><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mi>D</mml:mi><mml:mo>-</mml:mo><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mn>2</mml:mn><mml:mo>&#x000B7;</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> loss terms. For example, <inline-formula><mml:math id="M45"><mml:msubsup><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mi>D</mml:mi><mml:mo>-</mml:mo><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mn>2</mml:mn><mml:mi>v</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is computed as:</p>
<disp-formula id="E23"><label>(20)</label><mml:math id="M46"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mi>D</mml:mi><mml:mo>-</mml:mo><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mn>2</mml:mn><mml:mi>v</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mo>-</mml:mo><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mstyle><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M47"><mml:msubsup><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is the class index of the <italic>i</italic>th textual token, and the prediction over the possible classes <inline-formula><mml:math id="M48"><mml:mstyle mathvariant="bold-italic"><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mstyle></mml:math></inline-formula> for the <italic>i</italic>th textual token is computed using the confounder dictionary for the vision modality. The computation of the other <inline-formula><mml:math id="M49"><mml:msubsup><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mi>D</mml:mi><mml:mo>-</mml:mo><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mn>2</mml:mn><mml:mo>&#x000B7;</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> loss terms is completely analogous.</p>
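<p>Equation (20) is a standard cross-entropy over gold class indices. A sketch with made-up predicted distributions:</p>

```python
import numpy as np

# Toy batch: N textual tokens; p[i] is the predicted distribution p_y^t
# over C classes for token i, and y[i] is its gold class index y_i^t
# (all values invented for the example).
N, C = 3, 4
p = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.60, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])
y = np.array([0, 1, 3])

# Equation (20): sum over tokens of -log p_y^t[y_i^t].
loss_t2v = -np.log(p[np.arange(N), y]).sum()
```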
</sec>
<sec>
<title>4.2.3. AD-D</title>
<p>The pipeline for DeVLBERT is shown in <xref ref-type="fig" rid="F4">Figure 4</xref>.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Pipeline for DeVLBERT.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-05-736791-g0004.tif"/>
</fig>
<p>Note that <xref ref-type="fig" rid="F4">Figure 4</xref> only describes variation &#x0201C;D&#x0201D; of the variations proposed in DeVLBERT, as this is the variation for which most results were reported in Zhang et al. (<xref ref-type="bibr" rid="B37">2020b</xref>).</p>
<p>Since DeVLBERT does <italic>AutoDeconfounding</italic> in the multi-modal setting, one of the main differences compared to VC-R-CNN is that modality-specific confounder dictionaries <inline-formula><mml:math id="M50"><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula><mml:math id="M51"><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> and modality-specific prior probabilities <inline-formula><mml:math id="M52"><mml:msup><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="-tex-caligraphic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula><mml:math id="M53"><mml:msup><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="-tex-caligraphic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> are used to do the context prediction in a &#x0201C;causal&#x0201D; way. In the following, we describe the pipeline for the text-to-vision case only, but the other cases are completely analogous.</p>
<p>In the text-to-vision case, for the context prediction of a textual token with feature vector <inline-formula><mml:math id="M54"><mml:mstyle mathvariant="bold-italic"><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mstyle></mml:math></inline-formula> the confounder dictionary for the visual modality <inline-formula><mml:math id="M55"><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is used. First, attention scores <bold>&#x003B1;</bold> are calculated to select among the visual tokens in <inline-formula><mml:math id="M56"><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> those that are confounders for the textual token represented by <inline-formula><mml:math id="M57"><mml:mstyle mathvariant="bold-italic"><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mstyle></mml:math></inline-formula>. 
Next, a pooled vector <inline-formula><mml:math id="M58"><mml:mstyle mathvariant="bold-italic"><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msubsup></mml:mstyle></mml:math></inline-formula> is calculated by weighting the <italic>C</italic> vectors in the confounder dictionary <inline-formula><mml:math id="M59"><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> by both the attention score &#x003B1;[<italic>c</italic>] and the prior probability <inline-formula><mml:math id="M60"><mml:msup><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="-tex-caligraphic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>:</p>
<disp-formula id="E24"><label>(21)</label><mml:math id="M61"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>&#x003B1;</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mtext>softmax</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mo>&#x02329;</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mstyle mathvariant="bold-italic"><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mstyle></mml:mrow><mml:mo>&#x0232A;</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E25"><label>(22)</label><mml:math id="M62"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mstyle mathvariant="bold-italic"><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msubsup></mml:mstyle><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mi>&#x003B1;</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="-tex-caligraphic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x003B1;[<italic>c</italic>] denotes the element at the <italic>c</italic>th index in vector <bold>&#x003B1;</bold>, &#x02329;&#x000B7;, &#x000B7;&#x0232A; denotes the inner product, and <inline-formula><mml:math id="M63"><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>D</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>D</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> with <italic>D</italic> the dimension of feature representations. In section 5.1, we explain the exact link of weighting <inline-formula><mml:math id="M64"><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> by its prior <inline-formula><mml:math id="M65"><mml:msup><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="-tex-caligraphic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> with deconfounding. Furthermore, in section 5.3, <bold>&#x003B1;</bold> is used to investigate whether confounders are actually found.</p>
<p>Another main difference with VC-R-CNN is the input for the prediction, which is not a concatenation, but only the pooled vector <inline-formula><mml:math id="M66"><mml:mstyle mathvariant="bold-italic"><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msubsup></mml:mstyle></mml:math></inline-formula>:</p>
<disp-formula id="E26"><label>(23)</label><mml:math id="M67"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mstyle mathvariant="bold-italic"><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mstyle><mml:mo>=</mml:mo><mml:mtext>softmax</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>F</mml:mi><mml:mi>F</mml:mi><mml:mi>N</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msubsup></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
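<p>Equation (23) differs from Equation (13) only in its input. A sketch, again with a hypothetical one-hidden-layer stand-in for <italic>FFN</italic>, showing that the prediction is computed from the pooled vector alone, without concatenating a context feature:</p>

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

D, C = 8, 5
rng = np.random.default_rng(4)
f_z_v = rng.normal(size=D)  # pooled visual confounder vector from Equation (22)

# Hypothetical stand-in for FFN with random weights; note the input is
# f_z^v alone, with no concatenation of a context feature as in VC-R-CNN.
W1 = rng.normal(size=(16, D))
W2 = rng.normal(size=(C, 16))
p_y_t = softmax(W2 @ np.maximum(0.0, W1 @ f_z_v))  # Equation (23)
```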
<p><xref ref-type="fig" rid="F5">Figure 5</xref> shows a zoom-in of the <italic>AutoDeconfounding</italic> operation in DeVLBERT (Equations 21&#x02013;23). Note that for DeVLBERT, the ROI feature vector <italic><bold>f</bold></italic><sub><italic><bold>y</bold></italic></sub> that is used to make the prediction (by selecting <italic><bold>f</bold></italic><sub><italic><bold>z</bold></italic></sub> as an intermediate step) and the class <italic>y</italic><sub><italic>i</italic></sub> that we want to predict actually correspond to the same token. In other words, DeVLBERT is finding variables that are confounders for one and the same variable. Zhang et al. (<xref ref-type="bibr" rid="B37">2020b</xref>) justify this by arguing that <italic><bold>f</bold></italic><sub><italic><bold>y</bold></italic></sub> actually corresponds to a &#x0201C;mix&#x0201D; of tokens, since it has been contextualized through the self-attention mechanism in the ViLBERT backbone. However, this contextualization does not change the fact that <italic><bold>f</bold></italic><sub><italic><bold>y</bold></italic></sub> mainly corresponds to the token in the image with class <italic>y</italic><sub><italic>i</italic></sub>. In the three-way relation of a confounder, if the two variables <italic>X</italic><sub>1</sub> and <italic>X</italic><sub>2</sub> that are confounded by variable <italic>Z</italic> are one and the same (<italic>X</italic><sub>1</sub> &#x0003D; <italic>X</italic><sub>2</sub> &#x0003D; <italic>X</italic>), we can simply call <italic>Z</italic> a cause of <italic>X</italic>. <xref ref-type="fig" rid="F6">Figure 6</xref> illustrates this. We can thus also speak of &#x0201C;causes&#x0201D; instead of &#x0201C;confounders&#x0201D; when discussing DeVLBERT.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Zoom-in of AD-D.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-05-736791-g0005.tif"/>
</fig>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Three-way confounding relations collapsing to a two-way causal relation.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-05-736791-g0006.tif"/>
</fig>
<p>Again, we can rewrite <inline-formula><mml:math id="M68"><mml:mstyle mathvariant="bold-italic"><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mstyle></mml:math></inline-formula> into a form suitable for the analysis in section 5.1. First, we write the feed-forward network <italic>FFN</italic> as the function <italic>g</italic><sub><italic>D</italic></sub>:</p>
<disp-formula id="E27"><label>(24)</label><mml:math id="M69"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>F</mml:mi><mml:mi>F</mml:mi><mml:mi>N</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msubsup></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msubsup></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Second, we simplify by assuming that each &#x003B1;[<italic>c</italic>] perfectly selects confounders (or in this case: causes) and that <italic>S</italic><sub><italic>z</italic></sub> is the set of <italic>c</italic> indices of vectors in <inline-formula><mml:math id="M70"><mml:msup><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="-tex-caligraphic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> for which the corresponding class is a cause for the class corresponding to <inline-formula><mml:math id="M71"><mml:mstyle mathvariant="bold-italic"><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mstyle></mml:math></inline-formula>. Then, we can summarize <inline-formula><mml:math id="M72"><mml:mstyle mathvariant="bold-italic"><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mstyle></mml:math></inline-formula> as:</p>
<disp-formula id="E28"><label>(25)</label><mml:math id="M73"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mstyle mathvariant="bold-italic"><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mstyle><mml:mo>=</mml:mo><mml:mtext>softmax</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle displaystyle="true"><mml:munder><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x02208;</mml:mo><mml:msub><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munder></mml:mstyle><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:mfrac><mml:msup><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="-tex-caligraphic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E29"><label>(26)</label><mml:math id="M74"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:mtext>softmax</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mstyle displaystyle="true"><mml:msub><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x02208;</mml:mo><mml:msub><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mstyle><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="-tex-caligraphic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
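<p>As a concrete illustration, the simplified prediction of Equations (25)&#x02013;(26) can be sketched in a few lines of NumPy. This is a toy sketch under the stated assumption that &#x003B1; perfectly selects the <italic>c</italic> cause indices <italic>S</italic><sub><italic>z</italic></sub>; the dictionary <monospace>Z_dict</monospace>, the prior weights <monospace>P_Z</monospace>, and the linear layer standing in for <italic>g</italic><sub><italic>D</italic></sub> are random stand-ins, not DeVLBERT's actual parameters.</p>

```python
import numpy as np

# Toy sketch of Equations (25)-(26): average the selected confounder
# dictionary entries, weighted by their priors, then map through a
# linear stand-in for g_D and apply softmax. All values are random
# stand-ins, not DeVLBERT's parameters.
rng = np.random.default_rng(0)
n_classes, dim, vocab = 5, 8, 20

Z_dict = rng.standard_normal((n_classes, dim))  # confounder dictionary Z^v
P_Z = rng.random(n_classes)                     # prior weights P_Z^v
W = rng.standard_normal((vocab, dim))           # linear stand-in for g_D

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def p_y(S_z):
    c = len(S_z)
    # f_z^v: prior-weighted average of the selected dictionary entries.
    f_z = sum(Z_dict[i] * P_Z[i] for i in S_z) / c
    return softmax(W @ f_z)

probs = p_y([0, 2, 3])  # S_z: indices of the selected cause classes
```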
</sec>
<sec>
<title>4.2.4. Pretraining Data</title>
<p>DeVLBERT uses a dataset of images paired with captions [Conceptual Captions (Sharma et al., <xref ref-type="bibr" rid="B27">2018</xref>)]. We refer to an image-caption pair as a &#x0201C;record.&#x0201D; DeVLBERT does not use the images directly, however: it uses a frozen region feature extraction network [BUTD (Anderson et al., <xref ref-type="bibr" rid="B1">2018</xref>)] to represent each image as a set of ROI features. Note that DeVLBERT does not use ground-truth bounding boxes for the images (these do not exist for Conceptual Captions), but takes the predictions of the frozen, pretrained BUTD region proposal network as approximate ground truth.</p>
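<p>For concreteness, a record of this kind could be represented as follows. This is a hypothetical sketch: the field names, the number of regions (36), and the 2048-dimensional feature size are illustrative choices, not DeVLBERT's actual data structures.</p>

```python
from dataclasses import dataclass
import numpy as np

# Hypothetical representation of one pretraining "record": a caption
# plus the frozen BUTD ROI features extracted from the paired image.
# Field names and dimensions are illustrative stand-ins.
@dataclass
class Record:
    caption: str
    roi_features: np.ndarray  # shape (num_regions, feature_dim)
    roi_boxes: np.ndarray     # shape (num_regions, 4), proposal boxes

rec = Record(
    caption="a dog running on the beach",
    roi_features=np.zeros((36, 2048)),
    roi_boxes=np.zeros((36, 4)),
)
```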
</sec>
<sec>
<title>4.2.5. Downstream Tasks</title>
<p>For DeVLBERT, the downstream tasks are VQA, Image Retrieval (IR) and Zero-Shot Image Retrieval (ZSIR). Specifically, they train and evaluate VQA on the VQA 2.0 dataset (Antol et al., <xref ref-type="bibr" rid="B2">2015</xref>) consisting of 1.1 million questions about COCO images (Chen et al., <xref ref-type="bibr" rid="B7">2015</xref>), each with 10 correct answers, and (ZS)IR on the Flickr30k dataset (Young et al., <xref ref-type="bibr" rid="B32">2014</xref>) consisting of 31 000 images from Flickr with five captions each. They exactly follow the splits and training setup of ViLBERT for this, as do we in the experiments described in the next section. Note that when applied to downstream tasks, DeVLBERT has the same architecture as ViLBERT: the only difference is that its weights are different due to the different pretraining objective (Equation 17).</p>
</sec>
<sec>
<title>4.2.6. Out-Of-Distribution Setting</title>
<p>The DeVLBERT authors argue that there is a distribution shift between the pretraining data (Conceptual Captions) and the data of downstream tasks (VQA 2.0 and Flickr30k), namely because the captions in Conceptual Captions are automatically extracted based on alt-text attributes, whereas the VQA and Flickr30k captions are human-annotated. This is in line with other works on multi-modal pretraining that also consider Conceptual Captions to have no expected overlap with the data of common downstream tasks (Bugliarello et al., <xref ref-type="bibr" rid="B5">2021</xref>). As explained in section 3.2, a model that makes predictions using only correlations that are also causal should be expected to better adapt to a distribution shift.</p>
<p>Note that DeVLBERT uses the same region proposal network and text tokenizer for pretraining and downstream tasks. This means that the distribution is not different in <italic>which</italic> object or token classes appear, but rather in the statistics of the appearance of the same set of object or token classes.</p>
<p>To investigate the reported improvements of DeVLBERT, we adopt their out-of-distribution setting. For future work, however, it could be interesting to consider a setting where the distribution shift is more explicit, for example one in which certain classes are known to have different correlations across datasets.</p>
</sec>
</sec>
</sec>
<sec sec-type="methods" id="s5">
<title>5. Methodology</title>
<p>The previous sections provided background knowledge of causality and of <italic>AutoDeconfounding</italic>. This section will build upon that knowledge to examine <italic>AutoDeconfounding</italic> more closely.</p>
<p>Our methodology for investigating <italic>AutoDeconfounding</italic> consists of three parts: one theoretical analysis, and two empirical investigations for which we retrain a number of variations of the SOTA model that uses <italic>AutoDeconfounding</italic>. First, we theoretically examine whether the implementation of <italic>AutoDeconfounding</italic> actually corresponds to deconfounding. Second, we evaluate performance on downstream tasks to isolate which component of <italic>AutoDeconfounding</italic> is responsible for the reported improvements on those tasks. Finally, we examine to what extent confounders are actually found. We develop a ground truth dataset of causal relations to investigate this quantitatively, and qualitatively show a subset of the most-selected confounders.</p>
<sec>
<title>5.1. Theoretical Examination of Deconfounding in <italic>AutoDeconfounding</italic></title>
<p>The implementation of <italic>AutoDeconfounding</italic> was detailed in section 4. In this section, we explain how that implementation relates to the formula for a deconfounded prediction (Equation 6). We show the derivation as made in VC-R-CNN and DeVLBERT, and expand it to clarify the link with the underlying causal model.</p>
<p>The derivation starts with the formula for deconfounding:</p>
<disp-formula id="E30"><label>(27)</label><mml:math id="M75"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Y</mml:mi><mml:mo stretchy="false">|</mml:mo><mml:mi>d</mml:mi><mml:mi>o</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munder><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Y</mml:mi><mml:mo stretchy="false">|</mml:mo><mml:mi>X</mml:mi><mml:mo>,</mml:mo><mml:mi>z</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
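<p>A toy numeric check of this backdoor adjustment, with binary variables and a randomly generated joint distribution (all values illustrative), contrasts the deconfounded prediction with the observational conditional:</p>

```python
import numpy as np

# Backdoor adjustment on a toy discrete model:
#   P(Y | do(X)) = sum_z P(Y | X, z) P(z),
# versus the observational P(Y | X), which implicitly weights z
# by P(z | x). The joint distribution is a random stand-in.
rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))  # P(X, Y, Z), indexed [x, y, z]
joint /= joint.sum()

p_z = joint.sum(axis=(0, 1))                             # P(z)
p_y_given_xz = joint / joint.sum(axis=1, keepdims=True)  # P(y | x, z)

def p_y_do_x(x):
    # Deconfounded: marginalize Z with its prior P(z).
    return sum(p_y_given_xz[x, :, z] * p_z[z] for z in range(2))

def p_y_given_x(x):
    # Observational: implicitly weights z by P(z | x).
    return joint[x].sum(axis=1) / joint[x].sum()
```

<p>Both functions return a proper distribution over <italic>Y</italic>, but they generally differ whenever <italic>Z</italic> is correlated with <italic>X</italic>.</p>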
<p>As explained for Equation (6), this is a simplified notation for</p>
<disp-formula id="E31"><label>(28)</label><mml:math id="M76"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Y</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">|</mml:mo><mml:mi>d</mml:mi><mml:mi>o</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>X</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" 
accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mtext>&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mtext>&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover></mml:mstyle><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Y</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">|</mml:mo><mml:mi>X</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>v</italic><sub><italic>x</italic></sub> and <italic>v</italic><sub><italic>y</italic></sub> can each be either 0, meaning &#x0201C;absent from the scene,&#x0201D; or 1, meaning &#x0201C;present in the scene,&#x0201D; and {<italic>Z</italic><sub>1</sub>&#x02026;<italic>Z</italic><sub><italic>c</italic></sub>} is the set of classes that form a confounder for <italic>X</italic> and <italic>Y</italic>. For simplicity of notation, we rewrite</p>
<disp-formula id="E33"><mml:math id="M78"><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover></mml:mstyle><mml:mrow><mml:mo>&#x02329;</mml:mo><mml:mrow><mml:mo>&#x02026;</mml:mo></mml:mrow><mml:mo>&#x0232A;</mml:mo></mml:mrow><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></disp-formula>
<p>as</p>
<disp-formula id="E34"><mml:math id="M79"><mml:msub><mml:mrow><mml:mo>&#x1D53C;</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mo>&#x02329;</mml:mo><mml:mrow><mml:mo>&#x02026;</mml:mo></mml:mrow><mml:mo>&#x0232A;</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:math></disp-formula>
<p>The derivation in VC-R-CNN and DeVLBERT then shows that <italic>P</italic>(<italic>Y</italic> &#x0003D; <italic>v</italic><sub><italic>y</italic></sub>|<italic>X</italic> &#x0003D; <italic>v</italic><sub><italic>x</italic></sub>, <italic>Z</italic><sub>1</sub> &#x0003D; <italic>v</italic><sub><italic>z</italic><sub>1</sub></sub>, &#x02026;, <italic>Z</italic><sub><italic>c</italic></sub> &#x0003D; <italic>v</italic><sub><italic>z</italic><sub><italic>c</italic></sub></sub>) is calculated with softmax(<italic>g</italic>(<italic><bold>f</bold></italic><sub><italic><bold>x</bold></italic></sub>, <italic><bold>f</bold></italic><sub><italic><bold>z</bold></italic></sub>)). Equation (28) then becomes:</p>
<disp-formula id="E35"><label>(29)</label><mml:math id="M80"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Y</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">|</mml:mo><mml:mi>d</mml:mi><mml:mi>o</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>X</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mo>&#x1D53C;</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>softmax</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>g</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mstyle><mml:mo>,</mml:mo><mml:mstyle mathvariant="bold-italic"><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:msub></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Note that this implies that the state of <italic>X</italic> is encoded in <italic><bold>f</bold></italic><sub><italic><bold>x</bold></italic></sub>, and the joint state of <italic>Z</italic><sub>1</sub>, &#x02026;, <italic>Z</italic><sub><italic>c</italic></sub> is encoded in <italic><bold>f</bold></italic><sub><italic><bold>z</bold></italic></sub>.</p>
<p>The derivation then makes an approximation to move <italic>E</italic><sub><italic>c</italic><sub><italic>z</italic></sub></sub> into the softmax following the idea of the Normalized Weighted Geometric Mean, or NWGM. The idea of NWGM is similar to how dropout approximates an ensemble of models. It approximates the aggregate result of resampling (in multiple passes) cases where <italic>X</italic> &#x0003D; <italic>v</italic><sub><italic>x</italic></sub> so that <italic>Z</italic> occurs at the rate <italic>P</italic>(<italic>Z</italic> &#x0003D; <italic>v</italic><sub><italic>z</italic></sub>) by 1) doing only one pass for <italic>X</italic> &#x0003D; <italic>v</italic><sub><italic>x</italic></sub>, but 2) using the NWGM of the possible values <italic>v</italic><sub><italic>z</italic></sub> of <italic>Z</italic>, weighted by their prior distribution <italic>P</italic>(<italic>Z</italic> &#x0003D; <italic>v</italic><sub><italic>z</italic></sub>).</p>
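<p>The gap introduced by this approximation can be seen in a small toy computation (random stand-in logits and prior weights, not values from either model): one side averages softmax outputs over samples of <italic>Z</italic>, the other applies softmax to averaged logits, and the two generally differ.</p>

```python
import numpy as np

# Toy illustration of the NWGM-style step: E_z[softmax(g(f_x, f_z))]
# is replaced by softmax(E_z[g(f_x, f_z)]). Logits and prior are
# random stand-ins.
rng = np.random.default_rng(0)
logits = rng.standard_normal((4, 3))  # g(f_x, f_z) for 4 samples of z
p_z = np.array([0.4, 0.3, 0.2, 0.1])  # prior P(Z = v_z)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

exact = (p_z[:, None] * softmax(logits)).sum(axis=0)   # E[softmax(g)]
approx = softmax((p_z[:, None] * logits).sum(axis=0))  # softmax(E[g])
```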
<p>Further, given that the function <italic>g</italic>(&#x000B7;, &#x000B7;) is linear, the expectation can be moved next to the argument <italic><bold>f</bold></italic><sub><italic><bold>z</bold></italic></sub>:</p>
<disp-formula id="E36"><label>(30)</label><mml:math id="M81"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Y</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">|</mml:mo><mml:mi>d</mml:mi><mml:mi>o</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>X</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mtext>&#x000A0;&#x000A0;</mml:mtext></mml:mtd><mml:mtd><mml:mover><mml:mrow><mml:mo>&#x02248;</mml:mo></mml:mrow><mml:mrow><mml:mtext>NWGM</mml:mtext></mml:mrow></mml:mover><mml:mtext>&#x000A0;&#x000A0;softmax</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mo>&#x1D53C;</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>g</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mstyle><mml:mo>,</mml:mo><mml:mstyle mathvariant="bold-italic"><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:msub></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E37"><label>(31)</label><mml:math id="M82"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:mtext>softmax</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>g</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mstyle><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mo>&#x1D53C;</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:msub></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
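<p>This linearity step can be verified numerically. In the sketch below, <monospace>g</monospace> is a toy linear map and all vectors are random stand-ins; for such a <monospace>g</monospace>, taking the expectation of the output equals evaluating <monospace>g</monospace> at the expected input.</p>

```python
import numpy as np

# Check of the step in Equation (31): when g is linear in f_z,
# E_z[g(f_x, f_z)] = g(f_x, E_z[f_z]). All values are random stand-ins.
rng = np.random.default_rng(0)
Wx = rng.standard_normal((3, 5))
Wz = rng.standard_normal((3, 5))
f_x = rng.standard_normal(5)
f_z_samples = rng.standard_normal((4, 5))  # candidate values of f_z
p_z = np.array([0.4, 0.3, 0.2, 0.1])       # prior over those values

def g(fx, fz):
    return Wx @ fx + Wz @ fz  # linear in f_z

lhs = sum(p * g(f_x, fz) for p, fz in zip(p_z, f_z_samples))
rhs = g(f_x, p_z @ f_z_samples)  # E[f_z] = p_z @ f_z_samples
```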
<p>Writing &#x1D53C;<sub><italic>c</italic><sub><italic>z</italic></sub></sub> in full again:</p>
<disp-formula id="E38"><label>(32)</label><mml:math id="M83"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Y</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">|</mml:mo><mml:mi>d</mml:mi><mml:mi>o</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>X</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mtext>softmax</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>g</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mstyle><mml:mo>,</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" 
accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover></mml:mstyle></mml:mrow></mml:mrow></mml:mrow></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mstyle mathvariant="bold-italic"><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:msub></mml:mstyle><mml:mi>P</mml:mi><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>we can now compare this with the actual implementation as explained in section 4 in Equations (16) and (26).</p>
<disp-formula id="E40"><label>(33)</label><mml:math id="M85"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtext>VC-R-CNN:</mml:mtext><mml:mstyle mathvariant="bold-italic"><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msubsup></mml:mstyle><mml:mo>=</mml:mo><mml:mtext>softmax</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mstyle><mml:mo>,</mml:mo><mml:mfrac><mml:mrow><mml:mstyle displaystyle="true"><mml:msub><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x02208;</mml:mo><mml:msub><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mstyle><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="-tex-caligraphic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E41"><label>(34)</label><mml:math id="M86"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtext>DeVLBERT:</mml:mtext><mml:mstyle mathvariant="bold-italic"><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msubsup></mml:mstyle><mml:mo>=</mml:mo><mml:mtext>softmax</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mstyle displaystyle="true"><mml:msub><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x02208;</mml:mo><mml:msub><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mstyle><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="-tex-caligraphic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>First, specifically for DeVLBERT (Equation 34), there is a mismatch in the first argument of <italic>g</italic>: <italic><bold>f</bold></italic><sub><italic><bold>x</bold></italic></sub> is no longer involved for <italic>g</italic><sub><italic>D</italic></sub>, meaning that the prediction is made based on <italic>only</italic> the state of the confounder variables.</p>
<p>Second, for both VC-R-CNN and DeVLBERT, there is an apparent mismatch in the second argument of <italic>g</italic>. Equation (32) shows that the sum should be taken over all possible combinations of values for every possible confounder. However, in the actual implementation, Equations (33) and (34) show that there is only one term per confounder, corresponding to the case where its value is &#x0201C;present.&#x0201D;</p>
<p>For example, if <italic>Y</italic> is puddle, <italic>X</italic> is umbrella, <italic>Z</italic><sub>1</sub> is rain cloud and <italic>Z</italic><sub>2</sub> is sprinkler, then there are two confounders for <italic>P</italic>(<italic>Y</italic>|<italic>X</italic>): <italic>Z</italic><sub>1</sub> and <italic>Z</italic><sub>2</sub>. Instead of calculating <italic>P</italic>(<italic>Y</italic>|<italic>do</italic>(<italic>X</italic>)) by taking the average value of <italic>P</italic>(<italic>Y</italic>|<italic>X, Z</italic><sub>1</sub>, <italic>Z</italic><sub>2</sub>) weighted by the prior probabilities <italic>P</italic>(<italic>Z</italic><sub>1</sub> &#x0003D; <italic>present, Z</italic><sub>2</sub> &#x0003D; <italic>present</italic>), <italic>P</italic>(<italic>Z</italic><sub>1</sub> &#x0003D; <italic>present, Z</italic><sub>2</sub> &#x0003D; <italic>absent</italic>), <italic>P</italic>(<italic>Z</italic><sub>1</sub> &#x0003D; <italic>absent, Z</italic><sub>2</sub> &#x0003D; <italic>present</italic>), <italic>P</italic>(<italic>Z</italic><sub>1</sub> &#x0003D; <italic>absent, Z</italic><sub>2</sub> &#x0003D; <italic>absent</italic>), <italic>AutoDeconfounding</italic> takes the average value weighted by the prior probabilities <italic>P</italic>(<italic>Z</italic><sub>1</sub> &#x0003D; <italic>present</italic>), <italic>P</italic>(<italic>Z</italic><sub>2</sub> &#x0003D; <italic>present</italic>).</p>
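<p>The difference can be made concrete with a small numerical sketch (a toy illustration with hypothetical probabilities, not the authors' code): exact backdoor adjustment sums over every joint confounder state, while the <italic>AutoDeconfounding</italic>-style approximation keeps one &#x0201C;present&#x0201D; term per confounder, weighted by that confounder's marginal prior.</p>

```python
from itertools import product

# Assumed marginal priors P(Z_i = present); toy values, not dataset statistics.
prior_present = {"rain_cloud": 0.3, "sprinkler": 0.1}

def p_y_given_xz(states):
    """Hypothetical P(puddle | do(umbrella), Z1, Z2) for a joint state dict."""
    base = 0.05
    if states["rain_cloud"]:
        base += 0.7
    if states["sprinkler"]:
        base += 0.2
    return base

# Exact backdoor adjustment (Equation 32 style): sum over all 2^c joint
# states, weighted by the joint prior (assuming independent confounders).
exact = 0.0
for vals in product([0, 1], repeat=len(prior_present)):
    states = dict(zip(prior_present, vals))
    weight = 1.0
    for z, present in states.items():
        weight *= prior_present[z] if present else 1.0 - prior_present[z]
    exact += weight * p_y_given_xz(states)

# AutoDeconfounding-style approximation (Equations 33/34 style): one term
# per confounder, using only its "present" state and marginal prior.
c = len(prior_present)
approx = sum(
    prior_present[z] * p_y_given_xz({k: int(k == z) for k in prior_present})
    for z in prior_present
) / c
```

<p>With these toy numbers the two estimates differ, which is exactly the mismatch discussed above.</p>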
<p>Despite this apparent mismatch, with two additional assumptions that were not specified in Zhang et al. (<xref ref-type="bibr" rid="B37">2020b</xref>) or Wang et al. (<xref ref-type="bibr" rid="B31">2020</xref>), it <italic>is</italic> possible for this second argument of <italic>g</italic> from Equation 32 to simplify into matching the second argument of <italic>g</italic> in Equations (33) and (34):</p>
<p>For the first assumption, note that in the implementation, the joint state <italic>Z</italic><sub>1</sub>, &#x02026;, <italic>Z</italic><sub><italic>c</italic></sub> is encoded as follows: For each <italic>Z</italic><sub><italic>i</italic></sub>, if its state is &#x0201C;present,&#x0201D; it is represented by the average ROI feature vector <italic><bold>f</bold></italic><sub><italic><bold>z</bold></italic><sub><italic><bold>i</bold></italic></sub></sub> of the corresponding class. These average vectors are the ones collected in the confounder dictionary <inline-formula><mml:math id="M87"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:math></inline-formula>. If its state is &#x0201C;absent,&#x0201D; it is represented by a zero-vector with the same shape as <italic><bold>f</bold></italic><sub><italic><bold>z</bold></italic><sub><italic><bold>i</bold></italic></sub></sub>. Then, the joint state is represented as the average vector <inline-formula><mml:math id="M88"><mml:mstyle mathvariant="bold-italic"><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:msub></mml:mstyle><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msubsup><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:msub></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:mfrac></mml:math></inline-formula>. The assumption is then that <italic><bold>f</bold></italic><sub><italic><bold>z</bold></italic></sub> can successfully distinguish all possible joint states.</p>
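<p>The encoding of the joint confounder state described above can be sketched as follows (a minimal illustration with random toy vectors, not the released implementation):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                   # toy ROI-feature dimension
# Confounder dictionary Z: the class-average ROI feature of each confounder.
dictionary = {"rain_cloud": rng.normal(size=d), "sprinkler": rng.normal(size=d)}

def encode_joint_state(present):
    """Encode a joint state: dictionary entry if 'present', zero vector if 'absent'."""
    vecs = [dictionary[z] if z in present else np.zeros(d) for z in dictionary]
    return np.mean(vecs, axis=0)        # f_z = (1/c) * sum_i f_{z_i}

f_both = encode_joint_state({"rain_cloud", "sprinkler"})
f_rain = encode_joint_state({"rain_cloud"})
f_none = encode_joint_state(set())
# The first assumption is that these averaged encodings uniquely
# distinguish all 2^c joint states (e.g., f_none is the zero vector).
```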
<p>The second assumption is that all the confounders are independent of one another. This entails that <inline-formula><mml:math id="M89"><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mo>&#x0220F;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>.</p>
<p>Under these two assumptions, it can be shown that the terms of the sum in Equation (32) reduce (barring scaling factors) to the terms in the implementation Equations (33) and (34). The proof for this is in <xref ref-type="supplementary-material" rid="SM1">Appendix section 1.1</xref>.</p>
<p>In conclusion, we show that certain additional assumptions are needed to overcome the mismatch between theoretical deconfounding on the one hand, and the implementation of <italic>AutoDeconfounding</italic> on the other hand. More specifically, it is necessary to assume that the various confounders are independent of one another, and that the encoding of the joint confounder state as implemented in <italic>AutoDeconfounding</italic> can uniquely determine each state.</p>
<p>In the next sections, we evaluate aspects of <italic>AutoDeconfounding</italic> empirically. We perform our empirical experiments only for DeVLBERT, as it is the state of the art among the articles that use <italic>AutoDeconfounding</italic>.</p>
</sec>
<sec>
<title>5.2. Ablation Studies on Downstream Performance</title>
<p>As explained in section 4.2, Zhang et al. (<xref ref-type="bibr" rid="B37">2020b</xref>) evaluate the quality of the DeVLBERT model by its performance on downstream visio-linguistic tasks. They report a significant improvement on these tasks when extending their baseline model with <italic>AutoDeconfounding</italic>.</p>
<p>Our analysis from section 5.1 shows that additional assumptions are necessary to equate the implementation of <italic>AutoDeconfounding</italic> with deconfounding. Inspired by this, we perform ablation studies to verify to what extent deconfounding is actually responsible for the reported improvement in scores. We do this by adapting <italic>AutoDeconfounding</italic> in such a way that it retains access to the confounder dictionary, but is no longer related to deconfounding. We also verify the contribution of <italic>AutoDeconfounding</italic> as a whole by comparing, on a more like-for-like basis, with a baseline that does not use it.</p>
<p>Because we want to investigate the relation between the implementation of <italic>AutoDeconfounding</italic> and the performance increase on downstream tasks as reported by DeVLBERT, we evaluate on exactly the same downstream tasks as DeVLBERT.</p>
<sec>
<title>Isolating the Contribution of the Confounder Dictionary</title>
<p>For our first experiment on downstream task performance, we test the hypothesis that the key ingredient is the use of the &#x0201C;confounder&#x0201D; dictionary <inline-formula><mml:math id="M90"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:math></inline-formula>. We hypothesize that <inline-formula><mml:math id="M91"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:math></inline-formula> forces contextualized representations to be sufficiently close to a per-class average representation, thus providing some kind of regularization. In this hypothesis, <inline-formula><mml:math id="M92"><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="-tex-caligraphic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> is an irrelevant component of <italic>AutoDeconfounding</italic>. To isolate the effect of <inline-formula><mml:math id="M93"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:math></inline-formula>, we have implemented and trained from scratch two variations of <italic>AutoDeconfounding</italic> for which we alter <inline-formula><mml:math id="M94"><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="-tex-caligraphic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>.</p>
<p>First, we create a model named DeVLBERT-NoPrior in which <inline-formula><mml:math id="M95"><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="-tex-caligraphic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> is left out completely. This changes Equation (22) to:</p>
<disp-formula id="E42"><label>(35)</label><mml:math id="M96"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mstyle mathvariant="bold-italic"><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msubsup></mml:mstyle><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mi>&#x003B1;</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x000B7;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
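<p>As a sketch of this ablation (our reading of Equations 22 and 35, with toy dimensions and random values rather than the released code), dropping the prior amounts to pooling the dictionary with the attention weights alone:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
C, d = 5, 8                      # toy number of classes and feature dimension
Z = rng.normal(size=(C, d))      # confounder dictionary, one entry per class
alpha = rng.random(C)
alpha /= alpha.sum()             # attention scores over dictionary entries
P_Z = rng.random(C)
P_Z /= P_Z.sum()                 # class-prior weights used by vanilla DeVLBERT

# Vanilla pooling (Equation 22 style): weight entries by alpha[c] * P_Z[c].
f_z_vanilla = (alpha * P_Z) @ Z

# DeVLBERT-NoPrior (Equation 35): the prior is replaced by 1, so the
# dictionary is pooled with the attention weights alone.
f_z_noprior = alpha @ Z
```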
<p>Second, we create a model named DeVLBERT-DepPrior in which <inline-formula><mml:math id="M97"><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="-tex-caligraphic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> is replaced by a dependent prior. As opposed to vanilla DeVLBERT, DeVLBERT-DepPrior does not weight each entry in <inline-formula><mml:math id="M98"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:math></inline-formula> by the prior frequency of the class corresponding to that entry. Rather, it takes into consideration both the class <italic>C</italic><sub><italic>z</italic></sub> of the entry in the confounder dictionary <inline-formula><mml:math id="M99"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:math></inline-formula>, and the class <italic>C</italic><sub><italic>y</italic></sub> of the token that is being predicted. It weights each entry by the frequency of tokens of class <italic>C</italic><sub><italic>z</italic></sub> <italic>within</italic> records of the Conceptual Captions dataset where a token of class <italic>C</italic><sub><italic>y</italic></sub> is also present.</p>
<p>Because DeVLBERT has a loss term for each combination of modalities, DeVLBERT-DepPrior has a dependent prior for each modality combination. For example, the dependent prior <inline-formula><mml:math id="M100"><mml:msup><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="-tex-caligraphic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mn>2</mml:mn><mml:mi>v</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> used to calculate the loss term <inline-formula><mml:math id="M101"><mml:msubsup><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mi>D</mml:mi><mml:mo>-</mml:mo><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mn>2</mml:mn><mml:mi>v</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is:</p>
<disp-formula id="E43"><label>(36)</label><mml:math id="M102"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="-tex-caligraphic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mn>2</mml:mn><mml:mi>v</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E44"><label>(37)</label><mml:math id="M103"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E45"><label>(38)</label><mml:math id="M104"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>T</italic> is the total number of records, <italic>j</italic> and <italic>k</italic> denote the class indexes for modality <italic>t</italic> and <italic>v</italic>, respectively, and <italic>I</italic><sub><italic>t</italic></sub>(<italic>i, j</italic>) is an indicator function that is 1 if modality <italic>t</italic> of the <italic>i</italic>th record contains a token with class index <italic>j</italic> and 0 otherwise.</p>
<p>Equation (22) then becomes:</p>
<disp-formula id="E46"><label>(39)</label><mml:math id="M105"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mstyle mathvariant="bold-italic"><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msubsup></mml:mstyle><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mi>&#x003B1;</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="-tex-caligraphic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mn>2</mml:mn><mml:mi>v</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>i</italic> is the index of <italic>C</italic><sub><italic>y</italic></sub> and <italic>c</italic> the index of <italic>C</italic><sub><italic>z</italic></sub>.</p>
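<p>Under our reading of Equations (36)&#x02013;(38), the dependent prior can be computed from per-record class-presence indicator matrices as follows (a toy sketch with hypothetical indicator values, not Conceptual Captions data):</p>

```python
import numpy as np

# Toy indicator matrices over T = 3 records: I_t[i, j] = 1 iff modality t of
# record i contains a token of class j (and likewise I_v for the image side).
I_t = np.array([[1, 0],
                [1, 1],
                [0, 1]])
I_v = np.array([[1, 1],
                [0, 1],
                [1, 0]])

J = I_t.T @ I_v        # Eq. 37: J[j, k] = #records containing classes j and k
M = I_t.sum(axis=0)    # Eq. 38: M[j]    = #records containing class j
P_dep = J / M[:, None] # Eq. 36: dependent prior P_ZD^t2v[j, k] = J[j, k] / M[j]
```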
<p>If these variations achieve a similar score to DeVLBERT, that supports the hypothesis that <inline-formula><mml:math id="M106"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:math></inline-formula> is the key component of <italic>AutoDeconfounding</italic>.</p>
</sec>
<sec>
<title>Comparing Like-For-Like</title>
<p>For our second experiment on downstream task performance, we observe that the comparison between DeVLBERT and ViLBERT in Zhang et al. (<xref ref-type="bibr" rid="B37">2020b</xref>) is not made on a completely like-for-like basis, and so does not properly isolate the effect of <italic>AutoDeconfounding</italic>.</p>
<p>More specifically, DeVLBERT was trained for 24 epochs, where for the last 12 epochs the region mask probability is changed from 0.15 to 0.3 (Zhang et al., <xref ref-type="bibr" rid="B37">2020b</xref>), whereas ViLBERT is only trained for 10 epochs (with region mask probability 0.15) (Lu et al., <xref ref-type="bibr" rid="B17">2019</xref>). Zhang et al. (<xref ref-type="bibr" rid="B37">2020b</xref>) report that longer pretraining can be especially beneficial for zero-shot IR performance<xref ref-type="fn" rid="fn0008"><sup>8</sup></xref>. Moreover, for fine-tuning, ViLBERT uses the last checkpoint for evaluation, whereas DeVLBERT uses the best checkpoint (based on the validation score) for evaluation.</p>
<p>We retrain both ViLBERT and DeVLBERT ourselves on a like-for-like basis. The details of our experimental setup are in section 6.1.</p>
</sec>
</sec>
<sec>
<title>5.3. Investigating Confounder Finding</title>
<p>The last question we investigate is whether confounders are actually found. According to the Causal Hierarchy Theorem (Bareinboim et al., <xref ref-type="bibr" rid="B3">2020</xref>), with access to only observational data, one cannot build a model that correctly answers interventional (causal) queries for &#x0201C;almost-all<xref ref-type="fn" rid="fn0009"><sup>9</sup></xref>&#x0201D; underlying SCMs. Or, as stated by Cartwright (<xref ref-type="bibr" rid="B6">1994</xref>): &#x0201C;no causes-in, no causes-out.&#x0201D; Moreover, Zhang et al. (<xref ref-type="bibr" rid="B37">2020b</xref>) and Wang et al. (<xref ref-type="bibr" rid="B31">2020</xref>) did not quantitatively verify whether confounders are found, but took this as a given. Taking the above into account, it is not obvious that confounders are actually found by <italic>AutoDeconfounding</italic>.</p>
<p>We evaluate the confounder-finding capacities of DeVLBERT both quantitatively and qualitatively. Because we focus on DeVLBERT, we will speak of &#x0201C;causes&#x0201D; instead of &#x0201C;confounders.&#x0201D;</p>
<p>Many of the tokens from the text modality are not meaningful as causal variables (words such as &#x0201C;over,&#x0201D; &#x0201C;the,&#x0201D; etc.). Hence, we focus specifically on the image modality, where all of the tokens correspond to real objects.</p>
<sec>
<title>5.3.1. Quantitative Analysis</title>
<sec>
<title>5.3.1.1. How to Collect Ground Truth Confounders?</title>
<p>To check <italic>quantitatively</italic> whether actual causes are found, we need ground truth labels on the causality between objects in a scene.</p>
<p>To do this, we create a novel dataset with ground truth labels. Ideally, the way to gather causal labels would be to do interventions in the real world. However, this is difficult to realize (e.g., it is hard to &#x0201C;put&#x0201D; a rain cloud into a scene). Because many causal relations between objects are obvious to human common sense, we rely on human judgement to annotate causal relations instead. The assumption here is that the &#x0201C;mental intervention&#x0201D; that humans do when they answer a question like &#x0201C;Would changing the presence of &#x02018;umbrella&#x02019; influence the presence of &#x02018;rain cloud&#x02019; in the scene?&#x0201D; is a good approximation of the real-world intervention.</p>
</sec>
<sec>
<title>5.3.1.2. Details of the Data Collection</title>
<p><bold>Selecting Data to Label</bold>. DeVLBERT works with 1,600 visual object classes, so the number of class pairs for which a causal-link question can be asked is 1,600<sup>2</sup> = 2.56 million pairs. Because we believe most of these pairs will have no direct link at all, and labeling all 2.56 million pairs is too expensive, we select a subset of 1,000 class pairs to label. To select this subset, we use the following heuristic:</p>
<p>We assume that candidate pairs that are not correlated in the dataset will not exhibit a causal link either<xref ref-type="fn" rid="fn0010"><sup>10</sup></xref>. Hence, to select a subset of pairs, we ranked pairs by how strongly they were correlated in the dataset. More specifically, if <italic>P</italic>(<italic>X</italic> &#x0003D; 1) is the probability that an image contains an object of class <italic>X</italic>, and <italic>P</italic>(<italic>X</italic> &#x0003D; 1, <italic>Y</italic> &#x0003D; 1) is the probability that an image contains both an object of class <italic>X</italic> and an object of class <italic>Y</italic>, we select class pairs for which the difference between <italic>P</italic>(<italic>X</italic> &#x0003D; 1, <italic>Y</italic> &#x0003D; 1) and <italic>P</italic>(<italic>X</italic> &#x0003D; 1)<italic>P</italic>(<italic>Y</italic> &#x0003D; 1) is large.</p>
<p>We select 500 class pairs for which the absolute difference |<italic>P</italic>(<italic>X</italic> &#x0003D; 1, <italic>Y</italic> &#x0003D; 1)&#x02212;<italic>P</italic>(<italic>X</italic> &#x0003D; 1)<italic>P</italic>(<italic>Y</italic> &#x0003D; 1)| is the highest, and 500 pairs for which the relative difference <italic>log</italic>(<italic>P</italic>(<italic>X</italic> &#x0003D; 1, <italic>Y</italic> &#x0003D; 1)/<italic>P</italic>(<italic>X</italic> &#x0003D; 1)<italic>P</italic>(<italic>Y</italic> &#x0003D; 1))<xref ref-type="fn" rid="fn0011"><sup>11</sup></xref> is the highest<xref ref-type="fn" rid="fn0012"><sup>12</sup></xref>.</p>
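<p>The selection heuristic can be sketched as follows (the presence statistics below are hypothetical, not Conceptual Captions counts): the absolute gap favors frequent, strongly co-occurring pairs, while the log ratio also surfaces rare but tightly coupled pairs.</p>

```python
import math

# Hypothetical per-image presence probabilities (illustrative only).
p_marginal = {"umbrella": 0.10, "rain_cloud": 0.05, "laptop": 0.08,
              "sprinkler": 0.001, "garden": 0.002}
p_joint = {("umbrella", "rain_cloud"): 0.03,
           ("umbrella", "laptop"): 0.009,
           ("sprinkler", "garden"): 0.0008}

def abs_gap(x, y):
    """Absolute difference |P(X=1, Y=1) - P(X=1)P(Y=1)|."""
    return abs(p_joint[(x, y)] - p_marginal[x] * p_marginal[y])

def log_ratio(x, y):
    """Relative difference log(P(X=1, Y=1) / (P(X=1)P(Y=1)))."""
    return math.log(p_joint[(x, y)] / (p_marginal[x] * p_marginal[y]))

pairs = list(p_joint)
top_abs = max(pairs, key=lambda p: abs_gap(*p))    # frequent, co-occurring pair
top_rel = max(pairs, key=lambda p: log_ratio(*p))  # rare but tightly coupled pair
```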
<p><bold>Assuring Label Quality</bold>. To label the data, we used Amazon Mechanical Turk (MTurk)<xref ref-type="fn" rid="fn0013"><sup>13</sup></xref>. We ask workers to label pairs (<italic>X</italic>,<italic>Y</italic>) of correlated objects with one of three options: <italic>X</italic> causes <italic>Y</italic>, <italic>Y</italic> causes <italic>X</italic>, or neither (in other words, some confounder <italic>Z</italic> causes both <italic>X</italic> and <italic>Y</italic>). In the latter case, we also provide a free-form box where workers can enter what they think this confounder is. An example of the form we used can be found in the <xref ref-type="supplementary-material" rid="SM1">Supplementary Material</xref>. We also let workers fill in a confidence score of 1&#x02013;3 indicating how confident they are in their answer.</p>
<p>The kind of causality that we target is non-trivial to non-experts: we say that one object &#x0201C;is the cause of&#x0201D; another if intervening on its presence influences the probability of the other object being present. To ensure that workers understand the task well, we provide detailed instructions in the form, along with an explanation video. We also require that workers get a minimum score on a test with a small number of pairs with an obvious causal direction. The test questions can be found in the <xref ref-type="supplementary-material" rid="SM1">Supplementary Material</xref>.</p>
<p>We let each pair be labeled by 5 different workers and keep only those pairs for which agreement was at least 4 out of 5<xref ref-type="fn" rid="fn0014"><sup>14</sup></xref>. This left us with 595 pairs (about 60 percent of the labeled pairs). <xref ref-type="table" rid="T2">Table 2</xref> shows a sample of these pairs.</p>
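<p>The agreement filter described above amounts to a simple majority check per pair (a sketch; the label strings are illustrative, not the exact response options shown to workers):</p>

```python
from collections import Counter

def keep_pair(labels, min_agreement=4):
    """Keep a pair iff at least `min_agreement` of its worker labels agree."""
    return Counter(labels).most_common(1)[0][1] >= min_agreement

kept = keep_pair(["X causes Y"] * 4 + ["Z causes both"])       # 4/5 agree
dropped = keep_pair(["X causes Y", "Y causes X", "X causes Y",
                     "Z causes both", "X causes Y"])           # only 3/5 agree
```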
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Subset of response pairs from crowdworkers.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Object 1 (X)</bold></th>
<th valign="top" align="left"><bold>Object 2 (Y)</bold></th>
<th valign="top" align="left"><bold>Most selected response</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Trick</td>
<td valign="top" align="left">skater</td>
<td valign="top" align="left">Y causes X</td>
</tr>
<tr>
<td valign="top" align="left">Laptops</td>
<td valign="top" align="left">office</td>
<td valign="top" align="left">Y causes X</td>
</tr>
<tr>
<td valign="top" align="left">Person</td>
<td valign="top" align="left">shirt</td>
<td valign="top" align="left">X causes Y</td>
</tr>
<tr>
<td valign="top" align="left">Table</td>
<td valign="top" align="left">man</td>
<td valign="top" align="left">a confounder Z causes X and Y</td>
</tr>
<tr>
<td valign="top" align="left">Face</td>
<td valign="top" align="left">tree</td>
<td valign="top" align="left">a confounder Z causes X and Y</td>
</tr>
<tr>
<td valign="top" align="left">Sleeve</td>
<td valign="top" align="left">shirt</td>
<td valign="top" align="left">Y causes X</td>
</tr>
<tr>
<td valign="top" align="left">Arm</td>
<td valign="top" align="left">man</td>
<td valign="top" align="left">Y causes X</td>
</tr>
<tr>
<td valign="top" align="left">Players</td>
<td valign="top" align="left">plant</td>
<td valign="top" align="left">a confounder Z causes X and Y</td>
</tr>
<tr>
<td valign="top" align="left">Nose</td>
<td valign="top" align="left">sky</td>
<td valign="top" align="left">a confounder Z causes X and Y</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref ref-type="table" rid="T3">Table 3</xref> shows some examples of confounders that workers entered in the free-form box. One pattern to observe is that when the two objects are parts of a whole, the whole is sometimes suggested (e.g., <italic>windshield wipers</italic> and <italic>doors</italic> are caused by a <italic>car</italic>, or <italic>outfield</italic> and <italic>mound</italic> by <italic>baseball field</italic>). Suggested confounders can also vary quite widely (e.g., <italic>table, office</italic>, or <italic>coffee shop</italic> as confounder for <italic>coffee cup</italic> and <italic>monitors</italic>).</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Subset of free-form confounders suggested by crowdworkers.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Object 1 (X)</bold></th>
<th valign="top" align="left"><bold>Object 2 (Y)</bold></th>
<th valign="top" align="left"><bold>Free-form suggested confounders</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Hand</td>
<td valign="top" align="left">shoe</td>
<td valign="top" align="left">human, person, child</td>
</tr>
<tr>
<td valign="top" align="left">Ski poles</td>
<td valign="top" align="left">ski boot</td>
<td valign="top" align="left">mountains, skier, snow</td>
</tr>
<tr>
<td valign="top" align="left">Windshield wipers</td>
<td valign="top" align="left">doors</td>
<td valign="top" align="left">car, cars</td>
</tr>
<tr>
<td valign="top" align="left">Outfield</td>
<td valign="top" align="left">mound</td>
<td valign="top" align="left">baseball field, baseball</td>
</tr>
<tr>
<td valign="top" align="left">Barricade</td>
<td valign="top" align="left">cones</td>
<td valign="top" align="left">street, road, construction</td>
</tr>
<tr>
<td valign="top" align="left">Coffee cup</td>
<td valign="top" align="left">monitors</td>
<td valign="top" align="left">table, office, coffee shop</td>
</tr>
<tr>
<td valign="top" align="left">Cucumber</td>
<td valign="top" align="left">cauliflower</td>
<td valign="top" align="left">salad</td>
</tr>
<tr>
<td valign="top" align="left">Wall</td>
<td valign="top" align="left">pillow</td>
<td valign="top" align="left">bedroom, room</td>
</tr>
<tr>
<td valign="top" align="left">Geese</td>
<td valign="top" align="left">ducks</td>
<td valign="top" align="left">lake, animals</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>Note that these were not used in the confounder ranking metric</italic>.</p>
</table-wrap-foot>
</table-wrap>
<p>The full dataset with responses, confidences and free-text confounder responses is publicly available on Google Drive<xref ref-type="fn" rid="fn0015"><sup>15</sup></xref>.</p>
</sec>
</sec>
<sec>
<title>5.3.1.3. Confounder Ranking Metric</title>
<p>Recall from Equation 22 that the mechanism by which causes are found consists of using attention scores <bold>&#x003B1;</bold> to pool vectors from <inline-formula><mml:math id="M107"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:math></inline-formula> whose classes correspond to causes.</p>
<p>We take the trained DeVLBERT model and, for every ROI <italic>y</italic> in every image, produce <bold>&#x003B1;</bold>. We then examine the object classes <italic>o</italic><sub><italic>i</italic></sub> in our dataset <italic>for which we have the ground truth relation to</italic> <italic>y</italic> (i.e., either <italic>o</italic><sub><italic>i</italic></sub> is a cause of <italic>y</italic>, <italic>y</italic> is a cause of <italic>o</italic><sub><italic>i</italic></sub>, or they are &#x0201C;mere correlates&#x0201D;). We denote the set of such <italic>o</italic><sub><italic>i</italic></sub> for a particular <italic>y</italic> by <italic>O</italic><sub><italic>y</italic></sub>. A successful <bold>&#x003B1;</bold> should rank the <italic>o</italic><sub><italic>i</italic></sub> that are a cause of <italic>y</italic> higher than the <italic>o</italic><sub><italic>i</italic></sub> that are either consequences or mere correlates. We use mean average precision (mAP) as the metric to measure ranking performance, and compare the resulting mAP score with a baseline that ranks the elements of <italic>O</italic><sub><italic>y</italic></sub> uniformly at random.</p>
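<p>The ranking metric above can be sketched in a few lines. The following is a minimal illustration, not the authors' code: the data layout (a dictionary mapping each effect class <italic>y</italic> to the attention scores and binary cause labels of its candidates <italic>O</italic><sub><italic>y</italic></sub>) and all function names are hypothetical.</p>

```python
import random

def average_precision(scores, labels):
    """AP of ranking candidates by descending attention score; labels: 1 = true cause."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)  # precision at each relevant position
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_ap(per_effect):
    """per_effect: {y: (alpha_scores, cause_labels)} over the candidates O_y."""
    aps = [average_precision(s, l) for s, l in per_effect.values()]
    return sum(aps) / len(aps)

def random_baseline(per_effect, trials=1000, seed=0):
    """Estimate the mAP of a ranker that orders each O_y uniformly at random."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += mean_ap({y: ([rng.random() for _ in l], l)
                          for y, (_, l) in per_effect.items()})
    return total / trials
```

<p>The baseline is estimated empirically by scoring many random orderings; for a single true cause among <italic>n</italic> candidates its expected AP has a closed form, but sampling keeps the sketch general.</p>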
<p>We report on the results in section 6.2.1.</p>
</sec>
<sec>
<title>5.3.2. Qualitative Analysis</title>
<p>The quantitative analysis only considers classes for which we have collected ground truth causality information. It can be informative, however, to look at the <italic>complete</italic> ranking of candidate causes for a certain class.</p>
<p>For each class <italic>c</italic>, we calculate the average value of &#x003B1;[<italic>c</italic>] over all <italic>R</italic> ROIs in the dataset whose ROI class corresponds to <italic>c</italic>:</p>
<disp-formula id="E47"><label>(40)</label><mml:math id="M108"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>v</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mstyle displaystyle="true"><mml:msubsup><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:msubsup></mml:mstyle><mml:mi>I</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>&#x003B1;</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E48"><label>(41)</label><mml:math id="M109"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>&#x003B1;</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>&#x02329;</mml:mo><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mstyle mathvariant="bold-italic"><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>&#x0232A;</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M110"><mml:mstyle mathvariant="bold-italic"><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:mstyle></mml:math></inline-formula> is the contextualized ROI-feature corresponding to the <italic>i</italic>th ROI, <inline-formula><mml:math id="M111"><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is the confounder dictionary for the image modality, and <italic>I</italic>(<italic>i, c</italic>) is an indicator function that is 1 if the <italic>i</italic>th ROI has class <italic>c</italic> and 0 otherwise.</p>
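<p>Under the (hypothetical) assumption that each ROI's attention distribution over the confounder dictionary is available as a mapping from candidate cause classes to scores, the averaging in Equation (40) amounts to grouping ROIs by their class and taking the per-class mean, as in this sketch (all names are illustrative, not from the DeVLBERT codebase):</p>

```python
from collections import defaultdict

def per_class_average_attention(roi_classes, alphas):
    """
    roi_classes: list of length R with the class of each ROI.
    alphas: list of length R of dicts mapping candidate cause class c -> alpha[c]
            (one attention distribution over the confounder dictionary per ROI).
    Returns avg[effect_class][cause_class]: attention on the cause, averaged
    over all ROIs whose class is effect_class.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for cls, alpha in zip(roi_classes, alphas):
        counts[cls] += 1
        for c, a in alpha.items():
            sums[cls][c] += a
    return {cls: {c: s / counts[cls] for c, s in cause_sums.items()}
            for cls, cause_sums in sums.items()}

def top_causes(alpha_avg, effect_class, k=4):
    """Top-k candidate causes for an effect class, as displayed in Table 6."""
    ranked = sorted(alpha_avg[effect_class].items(), key=lambda kv: -kv[1])
    return ranked[:k]
```

<p>Ranking the averaged scores per effect class is what produces rows such as those in <xref ref-type="table" rid="T6">Table 6</xref>.</p>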
<p>We also show a few qualitative examples during a downstream task, specifically VQA. We check the last-layer cross-modal attention from a query word in the question to the bounding box of an object which we know is a cause of the query word. This can indicate whether certain models pay more attention to actual causes during downstream tasks.</p>
</sec>
</sec>
</sec>
<sec id="s6">
<title>6. Results</title>
<sec>
<title>6.1. Ablation Studies on Downstream Performance</title>
<sec>
<title>Setup</title>
<p>As explained in section 4.2, Zhang et al. (<xref ref-type="bibr" rid="B37">2020b</xref>) evaluate the quality of the DeVLBERT model by its performance on three downstream visio-linguistic tasks: Image Retrieval and Visual Question Answering, for which the model is further fine-tuned, and Zero-Shot Image Retrieval, for which the pretrained model is immediately used. Performance on (Zero-Shot) Image Retrieval is measured by recall at <italic>k</italic> or <italic>R&#x00040;k</italic>, and performance on Visual Question Answering is measured by accuracy. We evaluate in the same way.</p>
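<p>For concreteness, the evaluation metrics can be sketched as follows. This is an illustrative simplification with hypothetical names: in particular, the official VQA metric scores an answer by its agreement with multiple human annotators, whereas the sketch uses plain exact-match accuracy.</p>

```python
def recall_at_k(ranked_candidates, gold, k):
    """Fraction of queries whose gold item appears in the top-k retrieved list.

    ranked_candidates: one list of candidate image ids per caption query,
                       ordered by descending model score.
    gold: the matching image id for each query.
    """
    hits = sum(1 for ranked, g in zip(ranked_candidates, gold) if g in ranked[:k])
    return hits / len(gold)

def vqa_accuracy(predictions, answers):
    """Plain accuracy: fraction of questions answered exactly right."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)
```

<p>R&#x00040;1, R&#x00040;5, and R&#x00040;10 in <xref ref-type="table" rid="T4">Table 4</xref> correspond to <monospace>k</monospace> = 1, 5, and 10.</p>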
<p>When reproducing DeVLBERT, we have tried to follow the original setup as closely as possible. We train all models with the same batch size and learning rate as DeVLBERT<xref ref-type="fn" rid="fn0016"><sup>16</sup></xref>.</p>
<p>There are, however, still a few differences in setup. First, for Visual Question Answering, we only report performance on the &#x0201C;test-dev&#x0201D; split, and not on the &#x0201C;test-std&#x0201D; split: only 5 submissions to &#x0201C;test-std&#x0201D; are allowed, and we evaluate more than 5 models. Second, because the Conceptual Captions dataset consists of hyperlinks to web content, over time some of the links go stale. Hence, we work with 2.9 million records, compared to the 3.1 million that DeVLBERT originally trained on.</p>
<p>Finishing 24 epochs of pretraining on the Conceptual Captions dataset takes 3&#x02013;5 days<xref ref-type="fn" rid="fn0017"><sup>17</sup></xref>.</p>
<p>A pretrained checkpoint was made publicly available by Zhang et al. (<xref ref-type="bibr" rid="B37">2020b</xref>). We redo only the fine-tuning step ourselves for this checkpoint, and also include its performance in our results. We refer to this run as DeVLBERT-CkptCopy.</p>
<p>To make sure that the improvement is indeed due to <italic>AutoDeconfounding</italic>, we retrain ViLBERT and DeVLBERT ourselves, this time in exactly the same way: both for 24 epochs, where for the last 12 epochs the region mask probability is changed from 0.15 to 0.3, and both using the best checkpoint in the fine-tuning tasks for evaluation.</p>
</sec>
<sec>
<title>Results</title>
<p><xref ref-type="table" rid="T4">Table 4</xref> shows an overview of the downstream tasks performances of the different models we evaluated<xref ref-type="fn" rid="fn0018"><sup>18</sup></xref>.</p>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Scores on downstream tasks: top-1 recall, top-5 recall and top-10 recall for Image Retrieval (IR R&#x00040;1, IR R&#x00040;5, IR R&#x00040;10) and zero-shot image retrieval (ZSIR R&#x00040;1, ZSIR R&#x00040;5, ZSIR R&#x00040;10), and accuracy on the test-dev split for Visual Question Answering (VQA).</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th valign="top" align="center"><bold>IR R&#x00040;1</bold></th>
<th valign="top" align="center"><bold>IR R&#x00040;5</bold></th>
<th valign="top" align="center"><bold>IR R&#x00040;10</bold></th>
<th valign="top" align="center"><bold>ZSIR R&#x00040;1</bold></th>
<th valign="top" align="center"><bold>ZSIR R&#x00040;5</bold></th>
<th valign="top" align="center"><bold>ZSIR R&#x00040;10</bold></th>
<th valign="top" align="center"><bold>VQA test-dev</bold></th>
</tr>
</thead>
<tbody>
<tr style="border-bottom: thin solid #000000;">
<td valign="top" align="left"><bold>Run name</bold></td>
<td/>
<td/>
<td/>
<td/>
<td/>
<td/>
<td/>
</tr> <tr>
<td valign="top" align="left">DeVLBERT reported</td>
<td valign="top" align="center" style="background-color:#b6b7ba">61.60</td>
<td valign="top" align="center" style="background-color:#aaacaf">87.10</td>
<td valign="top" align="center" style="background-color:#b3b5b8">92.60</td>
<td valign="top" align="center" style="background-color:#aaacaf">36.00</td>
<td valign="top" align="center" style="background-color:#939598; color:#ffffff">67.10</td>
<td valign="top" align="center" style="background-color:#939598; color:#ffffff">78.30</td>
<td valign="top" align="center" style="background-color:#939598; color:#ffffff">71.50</td>
</tr>
<tr style="border-bottom: thin solid #000000;">
<td valign="top" align="left">ViLBERT reported</td>
<td valign="top" align="center" style="background-color:#f2f2f3">58.20</td>
<td valign="top" align="center" style="background-color:#f2f2f3">84.90</td>
<td valign="top" align="center" style="background-color:#f2f2f3">91.50</td>
<td valign="top" align="center" style="background-color:#f2f2f3">31.90</td>
<td valign="top" align="center" style="background-color:#f2f2f3">61.10</td>
<td valign="top" align="center" style="background-color:#f2f2f3">72.80</td>
<td valign="top" align="center" style="background-color:#c1c2c4">70.90</td>
</tr> <tr>
<td valign="top" align="left">DeVLBERT repro (5 run avg &#x000B1; stdev)</td>
<td valign="top" align="center" style="background-color:#bfc0c3">61.06 &#x000B1; 1.04</td>
<td valign="top" align="center" style="background-color:#a5a7aa; color:#ffffff">87.27 &#x000B1; 0.48</td>
<td valign="top" align="center" style="background-color:#a9abae">92.78 &#x000B1; 0.4</td>
<td valign="top" align="center" style="background-color:#b8b9bc">35.19 &#x000B1; 0.97</td>
<td valign="top" align="center" style="background-color:#babcbe">64.56 &#x000B1; 0.82</td>
<td valign="top" align="center" style="background-color:#c8c9cb">75.16 &#x000B1; 0.65</td>
<td valign="top" align="center" style="background-color:#b5b7b9">71.05 &#x000B1; 0.06</td>
</tr>
<tr style="border-bottom: thin solid #000000;">
<td valign="top" align="left">ViLBERT repro (5 run avg &#x000B1; stdev)</td>
<td valign="top" align="center" style="background-color:#adafb1">62.13 &#x000B1; 0.61</td>
<td valign="top" align="center" style="background-color:#9ea0a3; color:#ffffff">87.48 &#x000B1; 0.43</td>
<td valign="top" align="center" style="background-color:#a2a4a7; color:#ffffff">92.92 &#x000B1; 0.31</td>
<td valign="top" align="center" style="background-color:#c9cacc">34.2 &#x000B1; 1.17</td>
<td valign="top" align="center" style="background-color:#cacbcd">63.53 &#x000B1; 1.2</td>
<td valign="top" align="center" style="background-color:#d1d2d4">74.59 &#x000B1; 1.25</td>
<td valign="top" align="center" style="background-color:#e4e5e6">70.45 &#x000B1; 0.42</td>
</tr> <tr style="border-bottom: thin solid #000000;">
<td valign="top" align="left">DeVLBERT-CkptCopy</td>
<td valign="top" align="center" style="background-color:#b7b8bb">61.56</td>
<td valign="top" align="center" style="background-color:#9c9ea1; color:#ffffff">87.56</td>
<td valign="top" align="center" style="background-color:#9d9fa2; color:#ffffff">93.02</td>
<td valign="top" align="center" style="background-color:#939598; color:#ffffff">37.42</td>
<td valign="top" align="center" style="background-color:#999b9e; color:#ffffff">66.74</td>
<td valign="top" align="center" style="background-color:#9a9c9f; color:#ffffff">77.88</td>
<td valign="top" align="center" style="background-color:#c3c4c6">70.88</td>
</tr> <tr>
<td valign="top" align="left">DeVLBERT-DepPrior</td>
<td valign="top" align="center" style="background-color:#939598; color:#ffffff">63.70</td>
<td valign="top" align="center" style="background-color:#939598; color:#ffffff">87.86</td>
<td valign="top" align="center" style="background-color:#939598; color:#ffffff">93.20</td>
<td valign="top" align="center" style="background-color:#e7e8e9">32.40</td>
<td valign="top" align="center" style="background-color:#d7d8d9">62.70</td>
<td valign="top" align="center" style="background-color:#dedfe1">73.80</td>
<td valign="top" align="center" style="background-color:#f2f2f3">70.29</td>
</tr>
<tr>
<td valign="top" align="left">DeVLBERT-NoPrior</td>
<td valign="top" align="center" style="background-color:#bec0c2">61.12</td>
<td valign="top" align="center" style="background-color:#a3a5a8; color:#ffffff">87.32</td>
<td valign="top" align="center" style="background-color:#adafb1">92.72</td>
<td valign="top" align="center" style="background-color:#cbccce">34.08</td>
<td valign="top" align="center" style="background-color:#d9dadb">62.56</td>
<td valign="top" align="center" style="background-color:#eceded">73.06</td>
<td valign="top" align="center" style="background-color:#cccdcf">70.76</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>The table shows originally reported scores (DeVLBERT reported and ViLBERT reported), scores for retrained models (DeVLBERT repro, DeVLBERT-CkptCopy, ViLBERT repro, average over 5 runs), and scores for adapted models (DeVLBERT-DepPrior, DeVLBERT-NoPrior). The color scaling (darker for higher score) is per column</italic>.</p>
</table-wrap-foot>
</table-wrap>
<p>We make a couple of observations:</p>
<p>The top two rows show the improvement of DeVLBERT over ViLBERT as reported in Zhang et al. (<xref ref-type="bibr" rid="B37">2020b</xref>). The next two rows show the same comparison for our like-for-like reproduction, where we retrained DeVLBERT and ViLBERT from scratch. Contrary to Zhang et al. (<xref ref-type="bibr" rid="B37">2020b</xref>), we do <italic>not</italic> observe an improvement across the downstream tasks. Part of the gap is closed by the higher score of our retrained ViLBERT: especially on IR and ZSIR, training for more epochs improves R&#x00040;1 for ViLBERT by roughly 2 to 4 percentage points. This indicates that the improvement reported in Zhang et al. (<xref ref-type="bibr" rid="B37">2020b</xref>) might be largely due to differences in training, rather than to <italic>AutoDeconfounding</italic>.</p>
<p>However, another part of the difference between the reported scores and our reproduced scores is that we obtain a lower score than reported for DeVLBERT in our reproductions. As we use the code provided by Zhang et al. (<xref ref-type="bibr" rid="B37">2020b</xref>) to retrain models, we hypothesize that this difference is due to the slightly smaller size of the pretraining dataset that was available to us.</p>
<p>We see that DeVLBERT-CkptCopy scores are different from the reported scores. This is because the model checkpoint that the authors of DeVLBERT made available is not the one for which they reported results<xref ref-type="fn" rid="fn0019"><sup>19</sup></xref>.</p>
<p>Although we could not quite reproduce the results in the same way, our results indicate that the reported improvement in Zhang et al. (<xref ref-type="bibr" rid="B37">2020b</xref>) might be mainly due to a different training regime.</p>
<p>Finally, the results for DeVLBERT-DepPrior and DeVLBERT-NoPrior do not show a consistent degradation of performance on all downstream tasks when compared to DeVLBERT repro. This indicates that the <inline-formula><mml:math id="M112"><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="-tex-caligraphic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> component of <italic>AutoDeconfounding</italic> is indeed not a key component.</p>
<p>All in all, these results cast serious doubt on the validity of <italic>AutoDeconfounding</italic> as a method to improve performance on out-of-domain downstream tasks.</p>
</sec>
</sec>
<sec>
<title>6.2. Investigating Confounder Finding</title>
<sec>
<title>6.2.1. Quantitative Analysis</title>
<sec>
<title>Setup</title>
<p>For the experiments that investigate confounder finding, we only evaluate runs that are variations of DeVLBERT, but not runs that are variations of ViLBERT. ViLBERT does not contain the mechanism that selects confounders, shown in Equation (21), and so cannot be judged on whether it finds confounders. More specifically, we evaluate DeVLBERT repro, DeVLBERT-CkptCopy, DeVLBERT-DepPrior, and DeVLBERT-NoPrior.</p>
</sec>
<sec>
<title>Results</title>
<p><xref ref-type="table" rid="T5">Table 5</xref> shows the mAP results.</p>
<table-wrap position="float" id="T5">
<label>Table 5</label>
<caption><p>Comparison of confounder-finding performance of DeVLBERT with a random baseline.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th valign="top" align="center"><bold>mAP score</bold></th>
<th valign="top" align="center"><bold>mAP excess over random baseline</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left"><bold>Run name</bold></td>
<td/>
<td/>
</tr>
<tr>
<td valign="top" align="left">DeVLBERT repro (5 run avg &#x000B1; stdev)</td>
<td valign="top" align="center" style="background-color:#939598; color:#ffffff">0.81 &#x000B1; 0.03</td>
<td valign="top" align="center" style="background-color:#939598; color:#ffffff">0.1 &#x000B1; 0.03</td>
</tr>
<tr>
<td valign="top" align="left">DeVLBERT-CkptCopy</td>
<td valign="top" align="center" style="background-color:#dddedf">0.68</td>
<td valign="top" align="center" style="background-color:#dddedf">-0.02</td>
</tr>
<tr>
<td valign="top" align="left">DeVLBERT-DepPrior</td>
<td valign="top" align="center" style="background-color:#f2f2f3">0.65</td>
<td valign="top" align="center" style="background-color:#f2f2f3">-0.05</td>
</tr>
<tr>
<td valign="top" align="left">DeVLBERT-NoPrior</td>
<td valign="top" align="center" style="background-color:#97999c; color:#ffffff">0.80</td>
<td valign="top" align="center" style="background-color:#97999c; color:#ffffff">0.10</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>Recall from section 5.3.1.3 that the mAP score is calculated based on how the attention weights rank candidate causes for which we have a ground-truth label in the self-collected dataset described in section 5.3.1.2</italic>.</p>
</table-wrap-foot>
</table-wrap>
<p><xref ref-type="table" rid="T5">Table 5</xref> shows that the reproduced runs behave differently than DeVLBERT-CkptCopy. Whereas DeVLBERT-CkptCopy is worse than random at correctly ranking the causes in the ground truth dataset, the reproduced runs score better than random. We trained multiple reproductions to see whether this was due to high variance for the mAP score over different initializations, but this behavior holds up over 5 runs. Recall from section 6.1 that the reproduced model scores slightly <italic>lower</italic> than DeVLBERT-CkptCopy on downstream tasks. This indicates that finding or not finding confounders does not correlate with performance on the downstream tasks.</p>
<p>Further, for DeVLBERT-DepPrior and DeVLBERT-NoPrior, we get mixed results. DeVLBERT-NoPrior shows a similar result to the reproduced runs, while DeVLBERT-DepPrior again shows a worse-than-random result. Given the Causal Hierarchy Theorem (CHT) mentioned in section 2, we might expect that it is not possible to correctly identify causes from observational data alone, and so we would expect all runs to score around the random baseline. We propose the following explanation for this apparent paradox. While the CHT states that the correct SCM cannot be recovered from observational data alone, it <italic>can</italic> be recovered up to its Markov Equivalence Class. We hypothesize that, for the ranking task we developed, this might be sufficient to score above random. A more specialized ranking task that explicitly tests cause-finding beyond what can be deduced up to the level of a Markov Equivalence Class should be harder, and performance on such a harder confounder-finding task might then be a better predictor of out-of-distribution performance. We leave the development of such a task for future work.</p>
<p>In conclusion, we obtain mixed results when evaluating ground-truth confounder-finding; however, the results do indicate that confounder-finding ability does not correlate with performance on downstream tasks.</p>
</sec>
</sec>
<sec>
<title>6.2.2. Qualitative Analysis</title>
<sec>
<title>Setup</title>
<p>For the qualitative investigation of confounder-finding, we do not average the result of DeVLBERT repro, but show the result for only one of the runs. Because different runs might produce a different ranking, it is hard to display the average results in a compact way. Because it is a qualitative investigation, the result for a single run is sufficiently informative. We display the run with the best mAP score.</p>
</sec>
<sec>
<title>Results</title>
<p>The full 1600 &#x000D7; 1600 tables with the attention distributions can be found in the <xref ref-type="supplementary-material" rid="SM1">Supplementary Material</xref>. The scores shown are the values of &#x003B1;[<italic>c</italic>] in Equation (21), averaged over all records containing the effect variable. <xref ref-type="table" rid="T6">Table 6</xref> shows a subset: the top-ranked classes for the 10 most common objects.</p>
<table-wrap position="float" id="T6">
<label>Table 6</label>
<caption><p>The average top predicted &#x0201C;causes&#x0201D; in the confounder dictionary <inline-formula><mml:math id="M113"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:math></inline-formula> and the corresponding average attention score &#x003B1;[<italic>c</italic>] for the 10 most common objects (in bold in the leftmost column).</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th valign="top" align="left"><bold>Effect variable</bold></th>
<th valign="top" align="center" colspan="4"><bold>Top cause variables</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="middle" align="left" rowspan="10">DeVLBERT repro (best run)</td>
<td valign="top" align="left">Man</td>
<td valign="top" align="left" style="background-color:#eeefef">key: 0.029</td>
<td valign="top" align="left" style="background-color:#eeefef">brick: 0.028</td>
<td valign="top" align="left" style="background-color:#eeefef">kitchen: 0.019</td>
<td valign="top" align="left" style="background-color:#eeefef">houses: 0.019</td>
</tr>
<tr>
<td valign="top" align="left">Building</td>
<td valign="top" align="left" style="background-color:#eeefef">grape: 0.054</td>
<td valign="top" align="left" style="background-color:#eeefef">strawberry: 0.033</td>
<td valign="top" align="left" style="background-color:#eeefef">skull: 0.032</td>
<td valign="top" align="left" style="background-color:#eeefef">brick: 0.03</td>
</tr>
<tr>
<td valign="top" align="left">Woman</td>
<td valign="top" align="left" style="background-color:#eeefef">key: 0.03</td>
<td valign="top" align="left" style="background-color:#eeefef">brick: 0.029</td>
<td valign="top" align="left" style="background-color:#eeefef">kitchen: 0.022</td>
<td valign="top" align="left" style="background-color:#eeefef">houses: 0.019</td>
</tr>
<tr>
<td valign="top" align="left">Tree</td>
<td valign="top" align="left" style="background-color:#eeefef">skull: 0.043</td>
<td valign="top" align="left" style="background-color:#eeefef">key: 0.035</td>
<td valign="top" align="left" style="background-color:#eeefef">grape: 0.031</td>
<td valign="top" align="left" style="background-color:#eeefef">brick: 0.027</td>
</tr>
<tr>
<td valign="top" align="left">Window</td>
<td valign="top" align="left" style="background-color:#eeefef">brick: 0.031</td>
<td valign="top" align="left" style="background-color:#eeefef">houses: 0.022</td>
<td valign="top" align="left" style="background-color:#eeefef">strawberry: 0.022</td>
<td valign="top" align="left" style="background-color:#eeefef">grape: 0.018</td>
</tr>
<tr>
<td valign="top" align="left">Shirt</td>
<td valign="top" align="left" style="background-color:#eeefef">cupcake: 0.026</td>
<td valign="top" align="left" style="background-color:#eeefef">skull: 0.024</td>
<td valign="top" align="left" style="background-color:#eeefef">houses: 0.021</td>
<td valign="top" align="left" style="background-color:#eeefef">kitchen: 0.019</td>
</tr>
<tr>
<td valign="top" align="left">Sky</td>
<td valign="top" align="left" style="background-color:#eeefef">cupcake: 0.029</td>
<td valign="top" align="left" style="background-color:#eeefef">grape: 0.026</td>
<td valign="top" align="left" style="background-color:#eeefef">skull: 0.023</td>
<td valign="top" align="left" style="background-color:#eeefef">brick: 0.021</td>
</tr>
<tr>
<td valign="top" align="left">Wall</td>
<td valign="top" align="left" style="background-color:#eeefef">key: 0.038</td>
<td valign="top" align="left" style="background-color:#eeefef">grape: 0.034</td>
<td valign="top" align="left" style="background-color:#eeefef">brick: 0.028</td>
<td valign="top" align="left" style="background-color:#eeefef">strawberry: 0.023</td>
</tr>
<tr>
<td valign="top" align="left">Hair</td>
<td valign="top" align="left" style="background-color:#eeefef">key: 0.036</td>
<td valign="top" align="left" style="background-color:#eeefef">kitchen: 0.026</td>
<td valign="top" align="left" style="background-color:#eeefef">restaurant: 0.023</td>
<td valign="top" align="left" style="background-color:#eeefef">sunset: 0.02</td>
</tr>
<tr style="border-bottom: thin solid #000000;">
<td valign="top" align="left">Head</td>
<td valign="top" align="left" style="background-color:#eeefef">key: 0.03</td>
<td valign="top" align="left" style="background-color:#eeefef">strawberry: 0.022</td>
<td valign="top" align="left" style="background-color:#eeefef">kitchen: 0.021</td>
<td valign="top" align="left" style="background-color:#eeefef">restaurant: 0.02</td>
</tr> <tr>
<td valign="middle" align="left" rowspan="10">DeVLBERT-CkptCopy</td>
<td valign="top" align="left">Man</td>
<td valign="top" align="left" style="background-color:#96989b; color:#ffffff">butter knife: 0.943</td>
<td valign="top" align="left" style="background-color:#eeefef">home: 0.001</td>
<td valign="top" align="left" style="background-color:#eeefef">little girl: 0.0</td>
<td valign="top" align="left" style="background-color:#eeefef">strawberry: 0.0</td>
</tr>
<tr>
<td valign="top" align="left">Building</td>
<td valign="top" align="left" style="background-color:#96989b; color:#ffffff">butter knife: 0.961</td>
<td valign="top" align="left" style="background-color:#eeefef">home: 0.001</td>
<td valign="top" align="left" style="background-color:#eeefef">little girl: 0.0</td>
<td valign="top" align="left" style="background-color:#eeefef">doll: 0.0</td>
</tr>
<tr>
<td valign="top" align="left">Woman</td>
<td valign="top" align="left" style="background-color:#96989b; color:#ffffff">butter knife: 0.937</td>
<td valign="top" align="left" style="background-color:#eeefef">home: 0.001</td>
<td valign="top" align="left" style="background-color:#eeefef">little girl: 0.001</td>
<td valign="top" align="left" style="background-color:#eeefef">flooring: 0.0</td>
</tr>
<tr>
<td valign="top" align="left">Tree</td>
<td valign="top" align="left" style="background-color:#96989b; color:#ffffff">butter knife: 0.948</td>
<td valign="top" align="left" style="background-color:#eeefef">home: 0.001</td>
<td valign="top" align="left" style="background-color:#eeefef">strawberry: 0.0</td>
<td valign="top" align="left" style="background-color:#eeefef">orange: 0.0</td>
</tr>
<tr>
<td valign="top" align="left">Window</td>
<td valign="top" align="left" style="background-color:#96989b; color:#ffffff">butter knife: 0.957</td>
<td valign="top" align="left" style="background-color:#eeefef">home: 0.001</td>
<td valign="top" align="left" style="background-color:#eeefef">strawberry: 0.0</td>
<td valign="top" align="left" style="background-color:#eeefef">flooring: 0.0</td>
</tr>
<tr>
<td valign="top" align="left">Shirt</td>
<td valign="top" align="left" style="background-color:#96989b; color:#ffffff">butter knife: 0.935</td>
<td valign="top" align="left" style="background-color:#eeefef">home: 0.001</td>
<td valign="top" align="left" style="background-color:#eeefef">little girl: 0.0</td>
<td valign="top" align="left" style="background-color:#eeefef">skull: 0.0</td>
</tr>
<tr>
<td valign="top" align="left">Sky</td>
<td valign="top" align="left" style="background-color:#96989b; color:#ffffff">butter knife: 0.965</td>
<td valign="top" align="left" style="background-color:#eeefef">home: 0.001</td>
<td valign="top" align="left" style="background-color:#eeefef">grape: 0.0</td>
<td valign="top" align="left" style="background-color:#eeefef">flooring: 0.0</td>
</tr>
<tr>
<td valign="top" align="left">Wall</td>
<td valign="top" align="left" style="background-color:#96989b; color:#ffffff">butter knife: 0.969</td>
<td valign="top" align="left" style="background-color:#eeefef">home: 0.001</td>
<td valign="top" align="left" style="background-color:#eeefef">flooring: 0.0</td>
<td valign="top" align="left" style="background-color:#eeefef">doll: 0.0</td>
</tr>
<tr>
<td valign="top" align="left">Hair</td>
<td valign="top" align="left" style="background-color:#96989b; color:#ffffff">butter knife: 0.909</td>
<td valign="top" align="left" style="background-color:#eeefef">home: 0.001</td>
<td valign="top" align="left" style="background-color:#eeefef">little girl: 0.001</td>
<td valign="top" align="left" style="background-color:#eeefef">grape: 0.001</td>
</tr>
<tr style="border-bottom: thin solid #000000;">
<td valign="top" align="left">Head</td>
<td valign="top" align="left" style="background-color:#96989b; color:#ffffff">butter knife: 0.917</td>
<td valign="top" align="left" style="background-color:#eeefef">little girl: 0.001</td>
<td valign="top" align="left" style="background-color:#eeefef">home: 0.001</td>
<td valign="top" align="left" style="background-color:#eeefef">strawberry: 0.001</td>
</tr>
<tr>
<td valign="middle" align="left" rowspan="10">DeVLBERT-NoPrior</td>
<td valign="top" align="left">Man</td>
<td valign="top" align="left" style="background-color:#eeefef">floret: 0.013</td>
<td valign="top" align="left" style="background-color:#eeefef">pizzas: 0.013</td>
<td valign="top" align="left" style="background-color:#eeefef">remotes: 0.012</td>
<td valign="top" align="left" style="background-color:#eeefef">products: 0.011</td>
</tr>
<tr>
<td valign="top" align="left">Building</td>
<td valign="top" align="left" style="background-color:#eeefef">cub: 0.015</td>
<td valign="top" align="left" style="background-color:#eeefef">veggie: 0.014</td>
<td valign="top" align="left" style="background-color:#eeefef">floret: 0.013</td>
<td valign="top" align="left" style="background-color:#eeefef">boulders: 0.012</td>
</tr>
<tr>
<td valign="top" align="left">Woman</td>
<td valign="top" align="left" style="background-color:#eeefef">floret: 0.014</td>
<td valign="top" align="left" style="background-color:#eeefef">pizzas: 0.012</td>
<td valign="top" align="left" style="background-color:#eeefef">control: 0.012</td>
<td valign="top" align="left" style="background-color:#eeefef">pizza slice: 0.011</td>
</tr>
<tr>
<td valign="top" align="left">Tree</td>
<td valign="top" align="left" style="background-color:#eeefef">floret: 0.017</td>
<td valign="top" align="left" style="background-color:#eeefef">cub: 0.016</td>
<td valign="top" align="left" style="background-color:#eeefef">boulders: 0.014</td>
<td valign="top" align="left" style="background-color:#eeefef">products: 0.012</td>
</tr>
<tr>
<td valign="top" align="left">Window</td>
<td valign="top" align="left" style="background-color:#eeefef">cub: 0.017</td>
<td valign="top" align="left" style="background-color:#eeefef">boulders: 0.016</td>
<td valign="top" align="left" style="background-color:#eeefef">remotes: 0.012</td>
<td valign="top" align="left" style="background-color:#eeefef">floret: 0.012</td>
</tr>
<tr>
<td valign="top" align="left">Shirt</td>
<td valign="top" align="left" style="background-color:#eeefef">control: 0.013</td>
<td valign="top" align="left" style="background-color:#eeefef">pizzas: 0.012</td>
<td valign="top" align="left" style="background-color:#eeefef">floret: 0.012</td>
<td valign="top" align="left" style="background-color:#eeefef">air vent: 0.011</td>
</tr>
<tr>
<td valign="top" align="left">Sky</td>
<td valign="top" align="left" style="background-color:#eeefef">lunch: 0.014</td>
<td valign="top" align="left" style="background-color:#eeefef">cub: 0.013</td>
<td valign="top" align="left" style="background-color:#eeefef">products: 0.012</td>
<td valign="top" align="left" style="background-color:#eeefef">pizzas: 0.012</td>
</tr>
<tr>
<td valign="top" align="left">Wall</td>
<td valign="top" align="left" style="background-color:#eeefef">cub: 0.015</td>
<td valign="top" align="left" style="background-color:#eeefef">pizzas: 0.012</td>
<td valign="top" align="left" style="background-color:#eeefef">key: 0.011</td>
<td valign="top" align="left" style="background-color:#eeefef">trick: 0.011</td>
</tr>
<tr>
<td valign="top" align="left">Hair</td>
<td valign="top" align="left" style="background-color:#eeefef">floret: 0.015</td>
<td valign="top" align="left" style="background-color:#eeefef">remotes: 0.014</td>
<td valign="top" align="left" style="background-color:#eeefef">motorcyclist: 0.013</td>
<td valign="top" align="left" style="background-color:#eeefef">bath tub: 0.013</td>
</tr>
<tr style="border-bottom: thin solid #000000;">
<td valign="top" align="left">Head</td>
<td valign="top" align="left" style="background-color:#eeefef">remotes: 0.014</td>
<td valign="top" align="left" style="background-color:#eeefef">floret: 0.013</td>
<td valign="top" align="left" style="background-color:#eeefef">bath tub: 0.013</td>
<td valign="top" align="left" style="background-color:#eeefef">motorcyclist: 0.012</td>
</tr>
<tr>
<td valign="middle" align="left" rowspan="10">DeVLBERT-DepPrior</td>
<td valign="top" align="left">Man</td>
<td valign="top" align="left" style="background-color:#eeefef">sun: 0.018</td>
<td valign="top" align="left" style="background-color:#eeefef">palm tree: 0.012</td>
<td valign="top" align="left" style="background-color:#eeefef">star: 0.012</td>
<td valign="top" align="left" style="background-color:#eeefef">balloon: 0.012</td>
</tr>
<tr>
<td valign="top" align="left">Building</td>
<td valign="top" align="left" style="background-color:#eeefef">balloon: 0.023</td>
<td valign="top" align="left" style="background-color:#eeefef">thumb: 0.018</td>
<td valign="top" align="left" style="background-color:#eeefef">triangle: 0.015</td>
<td valign="top" align="left" style="background-color:#eeefef">elephant: 0.013</td>
</tr>
<tr>
<td valign="top" align="left">Woman</td>
<td valign="top" align="left" style="background-color:#eeefef">apple: 0.018</td>
<td valign="top" align="left" style="background-color:#eeefef">palm tree: 0.013</td>
<td valign="top" align="left" style="background-color:#eeefef">vehicle: 0.013</td>
<td valign="top" align="left" style="background-color:#eeefef">newspaper: 0.012</td>
</tr>
<tr>
<td valign="top" align="left">Tree</td>
<td valign="top" align="left" style="background-color:#eeefef">balloon: 0.031</td>
<td valign="top" align="left" style="background-color:#eeefef">brick: 0.018</td>
<td valign="top" align="left" style="background-color:#eeefef">paint: 0.014</td>
<td valign="top" align="left" style="background-color:#eeefef">stone: 0.011</td>
</tr>
<tr>
<td valign="top" align="left">Window</td>
<td valign="top" align="left" style="background-color:#eeefef">dot: 0.018</td>
<td valign="top" align="left" style="background-color:#eeefef">hole: 0.016</td>
<td valign="top" align="left" style="background-color:#eeefef">button: 0.015</td>
<td valign="top" align="left" style="background-color:#eeefef">bolt: 0.012</td>
</tr>
<tr>
<td valign="top" align="left">Shirt</td>
<td valign="top" align="left" style="background-color:#eeefef">sheep: 0.027</td>
<td valign="top" align="left" style="background-color:#eeefef">border: 0.021</td>
<td valign="top" align="left" style="background-color:#eeefef">hay: 0.018</td>
<td valign="top" align="left" style="background-color:#eeefef">skull: 0.016</td>
</tr>
<tr>
<td valign="top" align="left">Sky</td>
<td valign="top" align="left" style="background-color:#eeefef">elephant: 0.015</td>
<td valign="top" align="left" style="background-color:#eeefef">cake: 0.013</td>
<td valign="top" align="left" style="background-color:#eeefef">balloon: 0.012</td>
<td valign="top" align="left" style="background-color:#eeefef">thumb: 0.012</td>
</tr>
<tr>
<td valign="top" align="left">Wall</td>
<td valign="top" align="left" style="background-color:#eeefef">rain: 0.011</td>
<td valign="top" align="left" style="background-color:#eeefef">sheep: 0.01</td>
<td valign="top" align="left" style="background-color:#eeefef">moon: 0.009</td>
<td valign="top" align="left" style="background-color:#eeefef">string: 0.009</td>
</tr>
<tr>
<td valign="top" align="left">Hair</td>
<td valign="top" align="left" style="background-color:#eeefef">pavement: 0.019</td>
<td valign="top" align="left" style="background-color:#eeefef">rain: 0.016</td>
<td valign="top" align="left" style="background-color:#eeefef">blanket: 0.015</td>
<td valign="top" align="left" style="background-color:#eeefef">vehicle: 0.015</td>
</tr>
<tr>
<td valign="top" align="left">Head</td>
<td valign="top" align="left" style="background-color:#eeefef">pavement: 0.02</td>
<td valign="top" align="left" style="background-color:#eeefef">vehicle: 0.018</td>
<td valign="top" align="left" style="background-color:#eeefef">balloon: 0.016</td>
<td valign="top" align="left" style="background-color:#eeefef">apple: 0.014</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref ref-type="table" rid="T7">Table 7</xref> shows the same values, but for a particular example of each effect variable rather than for the average<xref ref-type="fn" rid="fn0020"><sup>20</sup></xref>.</p>
<table-wrap position="float" id="T7">
<label>Table 7</label>
<caption><p>The top predicted &#x0201C;causes&#x0201D; in the confounder dictionary <inline-formula><mml:math id="M114"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:math></inline-formula> and the corresponding attention score &#x003B1;[<italic>c</italic>], not averaged over all images as in <xref ref-type="table" rid="T6">Table 6</xref>, but for the objects detected in the specific images shown above.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="center" colspan="6"><inline-graphic xlink:href="frai-05-736791-i0001.tif"/></th>
</tr>
</thead>
<tbody>
<tr style="border-bottom: thin solid #000000;">
<td/>
<td valign="top" align="left"><bold>Effect variable</bold></td>
<td valign="top" align="center" colspan="4"><bold>Top cause variables</bold></td>
</tr>
<tr>
<td valign="middle" align="left" rowspan="10">DeVLBERT repro (best run)</td>
<td valign="top" align="left">Man</td>
<td valign="top" align="left" style="background-color:#eceded">key: 0.03</td>
<td valign="top" align="left" style="background-color:#eceded">brick: 0.027</td>
<td valign="top" align="left" style="background-color:#eceded">orange: 0.025</td>
<td valign="top" align="left" style="background-color:#eceded">kitchen: 0.022</td>
</tr>
<tr>
<td valign="top" align="left">Building</td>
<td valign="top" align="left" style="background-color:#eceded">key: 0.012</td>
<td valign="top" align="left" style="background-color:#eceded">strawberry: 0.011</td>
<td valign="top" align="left" style="background-color:#eceded">skull: 0.01</td>
<td valign="top" align="left" style="background-color:#eceded">brick: 0.009</td>
</tr>
<tr>
<td valign="top" align="left">Woman</td>
<td valign="top" align="left" style="background-color:#eceded">key: 0.032</td>
<td valign="top" align="left" style="background-color:#eceded">kitchen: 0.03</td>
<td valign="top" align="left" style="background-color:#eceded">brick: 0.029</td>
<td valign="top" align="left" style="background-color:#eceded">apple: 0.023</td>
</tr>
<tr>
<td valign="top" align="left">Tree</td>
<td valign="top" align="left" style="background-color:#eceded">key: 0.05</td>
<td valign="top" align="left" style="background-color:#eceded">skull: 0.048</td>
<td valign="top" align="left" style="background-color:#eceded">grape: 0.028</td>
<td valign="top" align="left" style="background-color:#eceded">strawberry: 0.027</td>
</tr>
<tr>
<td valign="top" align="left">Window</td>
<td valign="top" align="left" style="background-color:#eceded">brick: 0.031</td>
<td valign="top" align="left" style="background-color:#eceded">houses: 0.023</td>
<td valign="top" align="left" style="background-color:#eceded">tile: 0.022</td>
<td valign="top" align="left" style="background-color:#eceded">skull: 0.021</td>
</tr>
<tr>
<td valign="top" align="left">Shirt</td>
<td valign="top" align="left" style="background-color:#eceded">cupcake: 0.034</td>
<td valign="top" align="left" style="background-color:#eceded">houses: 0.026</td>
<td valign="top" align="left" style="background-color:#eceded">skull: 0.025</td>
<td valign="top" align="left" style="background-color:#eceded">pumpkin: 0.025</td>
</tr>
<tr>
<td valign="top" align="left">Sky</td>
<td valign="top" align="left" style="background-color:#eceded">cupcake: 0.047</td>
<td valign="top" align="left" style="background-color:#eceded">grape: 0.046</td>
<td valign="top" align="left" style="background-color:#eceded">kites: 0.036</td>
<td valign="top" align="left" style="background-color:#eceded">mouse pad: 0.027</td>
</tr>
<tr>
<td valign="top" align="left">Wall</td>
<td valign="top" align="left" style="background-color:#eceded">cupcake: 0.039</td>
<td valign="top" align="left" style="background-color:#eceded">key: 0.037</td>
<td valign="top" align="left" style="background-color:#eceded">grape: 0.028</td>
<td valign="top" align="left" style="background-color:#eceded">brick: 0.028</td>
</tr>
<tr>
<td valign="top" align="left">Hair</td>
<td valign="top" align="left" style="background-color:#eceded">key: 0.043</td>
<td valign="top" align="left" style="background-color:#eceded">kitchen: 0.028</td>
<td valign="top" align="left" style="background-color:#eceded">sunset: 0.028</td>
<td valign="top" align="left" style="background-color:#eceded">cupcake: 0.026</td>
</tr>
<tr style="border-bottom: thin solid #000000;">
<td valign="top" align="left">Head</td>
<td valign="top" align="left" style="background-color:#eceded">key: 0.025</td>
<td valign="top" align="left" style="background-color:#eceded">cupcake: 0.024</td>
<td valign="top" align="left" style="background-color:#eceded">page: 0.022</td>
<td valign="top" align="left" style="background-color:#eceded">brick: 0.021</td>
</tr>
<tr>
<td valign="middle" align="left" rowspan="10">DeVLBERT-CkptCopy</td>
<td valign="top" align="left">Man</td>
<td valign="top" align="left" style="background-color:#97999c; color:#ffffff">butter knife: 1.0</td>
<td valign="top" align="left" style="background-color:#eceded">flooring: 0.0</td>
<td valign="top" align="left" style="background-color:#eceded">home: 0.0</td>
<td valign="top" align="left" style="background-color:#eceded">skull: 0.0</td>
</tr>
<tr>
<td valign="top" align="left">Building</td>
<td valign="top" align="left" style="background-color:#97999c; color:#ffffff">butter knife: 1.0</td>
<td valign="top" align="left" style="background-color:#eceded">home: 0.0</td>
<td valign="top" align="left" style="background-color:#eceded">wild: 0.0</td>
<td valign="top" align="left" style="background-color:#eceded">houses: 0.0</td>
</tr>
<tr>
<td valign="top" align="left">Woman</td>
<td valign="top" align="left" style="background-color:#97999c; color:#ffffff">butter knife: 0.999</td>
<td valign="top" align="left" style="background-color:#eceded">little girl: 0.0</td>
<td valign="top" align="left" style="background-color:#eceded">skull: 0.0</td>
<td valign="top" align="left" style="background-color:#eceded">flooring: 0.0</td>
</tr>
<tr>
<td valign="top" align="left">Tree</td>
<td valign="top" align="left" style="background-color:#97999c; color:#ffffff">butter knife: 0.999</td>
<td valign="top" align="left" style="background-color:#eceded">home: 0.0</td>
<td valign="top" align="left" style="background-color:#eceded">orange: 0.0</td>
<td valign="top" align="left" style="background-color:#eceded">girls: 0.0</td>
</tr>
<tr>
<td valign="top" align="left">Window</td>
<td valign="top" align="left" style="background-color:#97999c; color:#ffffff">butter knife: 0.999</td>
<td valign="top" align="left" style="background-color:#eceded">home: 0.0</td>
<td valign="top" align="left" style="background-color:#eceded">adult: 0.0</td>
<td valign="top" align="left" style="background-color:#eceded">flooring: 0.0</td>
</tr>
<tr>
<td valign="top" align="left">Shirt</td>
<td valign="top" align="left" style="background-color:#97999c; color:#ffffff">butter knife: 1.0</td>
<td valign="top" align="left" style="background-color:#eceded">home: 0.0</td>
<td valign="top" align="left" style="background-color:#eceded">grape: 0.0</td>
<td valign="top" align="left" style="background-color:#eceded">skull: 0.0</td>
</tr>
<tr>
<td valign="top" align="left">Sky</td>
<td valign="top" align="left" style="background-color:#97999c; color:#ffffff">butter knife: 0.962</td>
<td valign="top" align="left" style="background-color:#eceded">flooring: 0.001</td>
<td valign="top" align="left" style="background-color:#eceded">home: 0.001</td>
<td valign="top" align="left" style="background-color:#eceded">adult: 0.0</td>
</tr>
<tr>
<td valign="top" align="left">Wall</td>
<td valign="top" align="left" style="background-color:#97999c; color:#ffffff">butter knife: 1.0</td>
<td valign="top" align="left" style="background-color:#eceded">end table: 0.0</td>
<td valign="top" align="left" style="background-color:#eceded">little girl: 0.0</td>
<td valign="top" align="left" style="background-color:#eceded">flooring: 0.0</td>
</tr>
<tr>
<td valign="top" align="left">Hair</td>
<td valign="top" align="left" style="background-color:#97999c; color:#ffffff">butter knife: 0.994</td>
<td valign="top" align="left" style="background-color:#eceded">grape: 0.0</td>
<td valign="top" align="left" style="background-color:#eceded">blueberry: 0.0</td>
<td valign="top" align="left" style="background-color:#eceded">strawberry: 0.0</td>
</tr>
<tr style="border-bottom: thin solid #000000;">
<td valign="top" align="left">Head</td>
<td valign="top" align="left" style="background-color:#97999c; color:#ffffff">butter knife: 1.0</td>
<td valign="top" align="left" style="background-color:#eceded">wild: 0.0</td>
<td valign="top" align="left" style="background-color:#eceded">home: 0.0</td>
<td valign="top" align="left" style="background-color:#eceded">grape: 0.0</td>
</tr>
<tr>
<td valign="middle" align="left" rowspan="10">DeVLBERT-NoPrior</td>
<td valign="top" align="left">Man</td>
<td valign="top" align="left" style="background-color:#eceded">lunch: 0.021</td>
<td valign="top" align="left" style="background-color:#eceded">ox: 0.02</td>
<td valign="top" align="left" style="background-color:#eceded">meal: 0.017</td>
<td valign="top" align="left" style="background-color:#eceded">racer: 0.017</td>
</tr>
<tr>
<td valign="top" align="left">Building</td>
<td valign="top" align="left" style="background-color:#eceded">trick: 0.024</td>
<td valign="top" align="left" style="background-color:#eceded">home: 0.023</td>
<td valign="top" align="left" style="background-color:#eceded">pizza slice: 0.019</td>
<td valign="top" align="left" style="background-color:#eceded">cub: 0.018</td>
</tr>
<tr>
<td valign="top" align="left">Woman</td>
<td valign="top" align="left" style="background-color:#eceded">sign post: 0.004</td>
<td valign="top" align="left" style="background-color:#eceded">pizza slice: 0.004</td>
<td valign="top" align="left" style="background-color:#eceded">pizzas: 0.004</td>
<td valign="top" align="left" style="background-color:#eceded">breast: 0.004</td>
</tr>
<tr>
<td valign="top" align="left">Tree</td>
<td valign="top" align="left" style="background-color:#eceded">sandwiches: 0.009</td>
<td valign="top" align="left" style="background-color:#eceded">wild: 0.007</td>
<td valign="top" align="left" style="background-color:#eceded">home: 0.006</td>
<td valign="top" align="left" style="background-color:#eceded">book shelf: 0.006</td>
</tr>
<tr>
<td valign="top" align="left">Window</td>
<td valign="top" align="left" style="background-color:#eceded">windshield wipers: 0.031</td>
<td valign="top" align="left" style="background-color:#eceded">foliage: 0.024</td>
<td valign="top" align="left" style="background-color:#eceded">plain: 0.024</td>
<td valign="top" align="left" style="background-color:#eceded">lunch: 0.022</td>
</tr>
<tr>
<td valign="top" align="left">Shirt</td>
<td valign="top" align="left" style="background-color:#eceded">mountain range: 0.024</td>
<td valign="top" align="left" style="background-color:#eceded">windshield wipers: 0.022</td>
<td valign="top" align="left" style="background-color:#eceded">trick: 0.021</td>
<td valign="top" align="left" style="background-color:#eceded">electrical outlet: 0.019</td>
</tr>
<tr>
<td valign="top" align="left">Sky</td>
<td valign="top" align="left" style="background-color:#eceded">lunch: 0.028</td>
<td valign="top" align="left" style="background-color:#eceded">trick: 0.024</td>
<td valign="top" align="left" style="background-color:#eceded">seasoning: 0.023</td>
<td valign="top" align="left" style="background-color:#eceded">pizza slice: 0.019</td>
</tr>
<tr>
<td valign="top" align="left">Wall</td>
<td valign="top" align="left" style="background-color:#eceded">trick: 0.025</td>
<td valign="top" align="left" style="background-color:#eceded">jets: 0.022</td>
<td valign="top" align="left" style="background-color:#eceded">lunch: 0.019</td>
<td valign="top" align="left" style="background-color:#eceded">plain: 0.015</td>
</tr>
<tr>
<td valign="top" align="left">Hair</td>
<td valign="top" align="left" style="background-color:#eceded">pizzas: 0.025</td>
<td valign="top" align="left" style="background-color:#eceded">soil: 0.024</td>
<td valign="top" align="left" style="background-color:#eceded">windshield wipers: 0.023</td>
<td valign="top" align="left" style="background-color:#eceded">lunch: 0.023</td>
</tr>
<tr style="border-bottom: thin solid #000000;">
<td valign="top" align="left">Head</td>
<td valign="top" align="left" style="background-color:#eceded">windshield wipers: 0.031</td>
<td valign="top" align="left" style="background-color:#eceded">flooring: 0.021</td>
<td valign="top" align="left" style="background-color:#eceded">mountain range: 0.02</td>
<td valign="top" align="left" style="background-color:#eceded">pizzas: 0.018</td>
</tr>
<tr>
<td valign="middle" align="left" rowspan="10">DeVLBERT-DepPrior</td>
<td valign="top" align="left">Man</td>
<td valign="top" align="left" style="background-color:#eceded">palm tree: 0.028</td>
<td valign="top" align="left" style="background-color:#eceded">graffiti: 0.026</td>
<td valign="top" align="left" style="background-color:#eceded">branches: 0.026</td>
<td valign="top" align="left" style="background-color:#eceded">photo: 0.023</td>
</tr>
<tr>
<td valign="top" align="left">Building</td>
<td valign="top" align="left" style="background-color:#eceded">moon: 0.085</td>
<td valign="top" align="left" style="background-color:#eceded">bolt: 0.05</td>
<td valign="top" align="left" style="background-color:#eceded">hole: 0.045</td>
<td valign="top" align="left" style="background-color:#eceded">dot: 0.039</td>
</tr>
<tr>
<td valign="top" align="left">Woman</td>
<td valign="top" align="left" style="background-color:#eceded">skull: 0.076</td>
<td valign="top" align="left" style="background-color:#eceded">string: 0.031</td>
<td valign="top" align="left" style="background-color:#eceded">sheep: 0.022</td>
<td valign="top" align="left" style="background-color:#eceded">graffiti: 0.021</td>
</tr>
<tr>
<td valign="top" align="left">Tree</td>
<td valign="top" align="left" style="background-color:#eceded">string: 0.046</td>
<td valign="top" align="left" style="background-color:#eceded">board: 0.036</td>
<td valign="top" align="left" style="background-color:#eceded">newspaper: 0.034</td>
<td valign="top" align="left" style="background-color:#eceded">bolt: 0.03</td>
</tr>
<tr>
<td valign="top" align="left">Window</td>
<td valign="top" align="left" style="background-color:#eceded">balloon: 0.061</td>
<td valign="top" align="left" style="background-color:#eceded">button: 0.046</td>
<td valign="top" align="left" style="background-color:#eceded">dot: 0.035</td>
<td valign="top" align="left" style="background-color:#eceded">newspaper: 0.034</td>
</tr>
<tr>
<td valign="top" align="left">Shirt</td>
<td valign="top" align="left" style="background-color:#eceded">skull: 0.038</td>
<td valign="top" align="left" style="background-color:#eceded">star: 0.024</td>
<td valign="top" align="left" style="background-color:#eceded">light: 0.022</td>
<td valign="top" align="left" style="background-color:#eceded">border: 0.022</td>
</tr>
<tr>
<td valign="top" align="left">Sky</td>
<td valign="top" align="left" style="background-color:#eceded">bench: 0.011</td>
<td valign="top" align="left" style="background-color:#eceded">baby: 0.008</td>
<td valign="top" align="left" style="background-color:#eceded">plane: 0.007</td>
<td valign="top" align="left" style="background-color:#eceded">flower: 0.007</td>
</tr>
<tr>
<td valign="top" align="left">Wall</td>
<td valign="top" align="left" style="background-color:#eceded">string: 0.034</td>
<td valign="top" align="left" style="background-color:#eceded">balloon: 0.032</td>
<td valign="top" align="left" style="background-color:#eceded">sheep: 0.022</td>
<td valign="top" align="left" style="background-color:#eceded">hay: 0.017</td>
</tr>
<tr>
<td valign="top" align="left">Hair</td>
<td valign="top" align="left" style="background-color:#eceded">umbrella: 0.054</td>
<td valign="top" align="left" style="background-color:#eceded">skull: 0.049</td>
<td valign="top" align="left" style="background-color:#eceded">screen: 0.042</td>
<td valign="top" align="left" style="background-color:#eceded">horse: 0.031</td>
</tr>
<tr>
<td valign="top" align="left">Head</td>
<td valign="top" align="left" style="background-color:#eceded">skull: 0.028</td>
<td valign="top" align="left" style="background-color:#eceded">mouse: 0.028</td>
<td valign="top" align="left" style="background-color:#eceded">word: 0.024</td>
<td valign="top" align="left" style="background-color:#eceded">candle: 0.023</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>For DeVLBERT-CkptCopy, the by-far most-attended-to element of the confounder dictionary <inline-formula><mml:math id="M115"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:math></inline-formula> is &#x0201C;butter knife,&#x0201D; and the remaining top-attended elements are similar across objects. For the other runs, there is no single most-attended-to element, but we likewise see a small set of classes recurring as top-cause candidates across different effect variables.</p>
<p>To explain the high-confidence &#x0201C;butter knife&#x0201D;-selecting behavior, we hypothesize that DeVLBERT-CkptCopy found a beneficial local optimum that makes use of <inline-formula><mml:math id="M116"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:math></inline-formula> in a specific way. The low-confidence predictions for the other runs then indicate that they did not find such an optimum.</p>
<p>It does <italic>not</italic>, however, seem that either behavior corresponds to finding actual confounders. For DeVLBERT-CkptCopy, it seems unlikely that &#x0201C;butter knife&#x0201D; is indeed the most likely confounder for every class. The top causes selected by the other models do not seem intuitively causal either: &#x0201C;key&#x0201D; causing &#x0201C;man&#x0201D; for DeVLBERT repro, &#x0201C;cub&#x0201D; causing &#x0201C;building&#x0201D; for DeVLBERT-NoPrior, or &#x0201C;apple&#x0201D; causing &#x0201C;woman&#x0201D; for DeVLBERT-DepPrior bear no apparent relation to true causality.</p>
<p>The fact that the same classes appear as top causes for unrelated effect variables indicates an indifference of the model to the value of the effect variable: which cause variable receives a large attention score depends more on the particular (fixed) embedding of that cause variable than on how well it matches the effect variable for which a cause is supposedly being found. Note that the cause variable representation comes from the (fixed) confounder dictionary, whereas the effect variable representation is the contextualized output of the Transformer model. We hypothesize that this asymmetry is due to the projection matrices that precede the inner product in the calculation of the attention scores. The model might have found it beneficial to set the weights of these projection matrices such that the attention output depends more on the cause variable than on the effect variable.</p>
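<p>This hypothesized mechanism can be sketched in a few lines of code. The sketch below is a toy example with made-up dimensions and random weights, not the actual DeVLBERT parameters: if the learned query projection is (close to) rank one, every effect embedding is mapped onto the same direction, so the ranking of attention scores over the confounder-dictionary keys is identical regardless of which effect variable is queried.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_conf = 8, 5   # illustrative embedding dimension and dictionary size

# Fixed confounder-dictionary entries Z (the "cause" keys) and two
# different contextualized effect-variable embeddings (the queries).
Z = rng.normal(size=(n_conf, d))
effect_a = rng.random(d)   # entries in (0, 1), so u . query > 0 below
effect_b = rng.random(d)

def attention(query, keys, W_Q, W_K):
    """alpha[c]: softmax over inner products of projected query and keys."""
    scores = (keys @ W_K.T) @ (W_Q @ query)
    e = np.exp(scores - scores.max())
    return e / e.sum()

W_K = rng.normal(size=(d, d))
# A rank-1 query projection W_Q = v u^T maps any query q to v * (u . q):
# every effect embedding lands on the same direction v, merely rescaled,
# so the same key receives the top attention score for every effect.
v, u = rng.normal(size=d), np.ones(d)
W_Q_rank1 = np.outer(v, u)

alpha_a = attention(effect_a, Z, W_Q_rank1, W_K)
alpha_b = attention(effect_b, Z, W_Q_rank1, W_K)
assert alpha_a.argmax() == alpha_b.argmax()   # same top "cause" for both
```

<p>In this degenerate regime, which key wins is determined entirely by the keys and the projections, mirroring the behavior observed in the tables, where the same cause candidates top the list for unrelated effect variables.</p>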
<p>Note that the attention parameters in <italic>AutoDeconfounding</italic> are never explicitly trained to make confounder-finding predictions. Rather, <italic>AutoDeconfounding</italic> assumes that the attention scores can be interpreted as cause-selecting scores. The selection of top causes observed in <xref ref-type="table" rid="T6">Table 6</xref> indicates that this interpretation is not valid.</p>
<p><xref ref-type="table" rid="T8">Table 8</xref> shows a few example images from a downstream task (VQA), where the query word is an effect and the bounding box shown is that of a cause. If a model learned during pretraining to depend less on spurious correlations and more on causes, we would expect this to be reflected in higher attention values from effects to causes. We do not observe that the models using variations of <italic>AutoDeconfounding</italic> pay significantly more attention to the cause object than the baseline ViLBERT model does.</p>
<table-wrap position="float" id="T8">
<label>Table 8</label>
<caption><p>Some examples of the attention from query word to cause bounding box, for different models.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>VQA Question</bold></th>
<th valign="top" align="left"><bold>Is there a car on the road?</bold></th>
<th valign="top" align="left"><bold>What is odd about the dog&#x00027;s eyes</bold></th>
<th valign="top" align="left"><bold>What color is the person&#x00027;s helmet?</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Object in bounding box</td>
<td valign="top" align="left">Street</td>
<td valign="top" align="left">Head</td>
<td valign="top" align="left">Person</td>
</tr>
<tr>
<td valign="top" align="left">DeVLBERT</td>
<td valign="top" align="left"><inline-graphic xlink:href="frai-05-736791-i0002.tif"/></td>
<td valign="top" align="left"><inline-graphic xlink:href="frai-05-736791-i0003.tif"/></td>
<td valign="top" align="left"><inline-graphic xlink:href="frai-05-736791-i0004.tif"/></td>
</tr>
<tr>
<td valign="top" align="left">ViLBERT</td>
<td valign="top" align="left"><inline-graphic xlink:href="frai-05-736791-i0005.tif"/></td>
<td valign="top" align="left"><inline-graphic xlink:href="frai-05-736791-i0006.tif"/></td>
<td valign="top" align="left"><inline-graphic xlink:href="frai-05-736791-i0007.tif"/></td>
</tr>
<tr>
<td valign="top" align="left">DeVLBERT-NoPrior</td>
<td valign="top" align="left"><inline-graphic xlink:href="frai-05-736791-i0008.tif"/></td>
<td valign="top" align="left"><inline-graphic xlink:href="frai-05-736791-i0009.tif"/></td>
<td valign="top" align="left"><inline-graphic xlink:href="frai-05-736791-i0010.tif"/></td>
</tr>
<tr>
<td valign="top" align="left">DeVLBERT-DepPrior</td>
<td valign="top" align="left"><inline-graphic xlink:href="frai-05-736791-i0011.tif"/></td>
<td valign="top" align="left"><inline-graphic xlink:href="frai-05-736791-i0012.tif"/></td>
<td valign="top" align="left"><inline-graphic xlink:href="frai-05-736791-i0013.tif"/></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>The word colored red in the VQA question is the query for the attention. Out of all possible bounding boxes, a bounding box containing a &#x0201C;cause&#x0201D; of the query word is manually selected and displayed. Note that the query word itself need not be present in the image. For the presence of &#x0201C;car,&#x0201D; the presence of &#x0201C;street&#x0201D; is a cause; for the presence of &#x0201C;eyes,&#x0201D; the presence of &#x0201C;head&#x0201D; is a cause; and for the presence of &#x0201C;helmet,&#x0201D; the presence of &#x0201C;person&#x0201D; is a cause. The number on the image indicates the attention score the model assigns to that bounding box</italic>.</p>
</table-wrap-foot>
</table-wrap>
</sec>
</sec>
</sec>
</sec>
<sec sec-type="conclusions" id="s7">
<title>7. Conclusion</title>
<p>Models relying on spurious correlations are an important issue, and leveraging causal knowledge to address it is a promising approach. Causal models benefit transfer learning, adapting faster to other distributions and thus generalizing better (Sch&#x000F6;lkopf, <xref ref-type="bibr" rid="B25">2019</xref>). Leveraging causal knowledge has been a popular approach to tackle spurious correlations specifically in the visio-linguistic domain, where a number of works have used it to further improve on the already impressive representation-learning capacities of Transformer-like models.</p>
<p>Leveraging causality with <italic>automatically</italic> discovered causal structure is especially interesting, as it could scale much better than human-labeled causal structure. This critical analysis has uncovered some of the issues with this approach.</p>
<p>First, care needs to be taken to be specific about the underlying causal model that is assumed. As shown in section 5.1, only when the link between causal variables and data representations is made explicit is it possible to specify the assumptions under which an implementation of deconfounding is valid. An interesting avenue for future work could be to adapt the implementation to work with a less strict set of assumptions. Furthermore, more thought should be given to the design of models that produce interpretable representations, which provide insight into the causal structure and relations of the objects captured by these models. It would also be insightful to validate whether the assumptions hold for the application of interest.</p>
<p>Second, it is important to <italic>isolate</italic> the effect of causal representation learning. As section 5.2 shows, a like-for-like comparison, in which the baseline is reproduced under the same circumstances, is needed to avoid confusing the effect of the training circumstances with that of the added loss. Moreover, it is crucial to ablate every component to verify to what extent it is responsible for the improvement of the whole. Specifically for out-of-distribution applications, it would be interesting to discover which spurious correlations change between the distributions of interest, and whether those have been correctly captured by the model.</p>
<p>Finally, a key element in leveraging causality with automatically discovered causal structure is assessing to what extent the discovered structure is accurate. Our investigation using a human proxy for ground truth shows mixed results in this regard, with the models that perform better on downstream visio-linguistic tasks scoring worse than random in a cause-ranking task. First testing models on domains for which the causal structure is known, as done for instance by Lopez-Paz et al. (<xref ref-type="bibr" rid="B16">2017</xref>), can help to build confidence that causal operations such as deconfounding are realized using the correct causal model.</p>
<p>For future work, creating more extensive causally annotated datasets can enable progress in causal discovery. Additionally, it can be interesting to explore causal models that are more fine-grained than object co-occurrence, as spurious correlations are present at the level of object attributes as well as at the level of objects. For example, attributes could be taken as causal variables, or temporal data such as videos could be used. More fine-grained variables can also be expected to be useful for novel distributions with unseen objects: if the unseen objects consist of known &#x0201C;parts,&#x0201D; their causal properties could still be predicted.</p>
</sec>
<sec sec-type="data-availability" id="s8">
<title>Data Availability Statement</title>
<p>The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/<xref ref-type="supplementary-material" rid="SM1">Supplementary Material</xref>.</p>
</sec>
<sec id="s9">
<title>Author Contributions</title>
<p>NC came up with the idea to critically investigate <italic>AutoDeconfounding</italic>, conducted the experiments, created the figures, and was the main contributor to the text. KL and M-FM provided useful insights and suggestions during meetings, and provided suggestions and corrections for this article text during writing. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec sec-type="funding-information" id="s10">
<title>Funding</title>
<p>This research was financed by the CALCULUS project&#x02014;Commonsense and Anticipation enriched Learning of Continuous representations&#x02014;European Research Council Advanced Grant H2020-ERC-2017-ADG 788506, <ext-link ext-link-type="uri" xlink:href="http://calculus-project.eu/">http://calculus-project.eu/</ext-link>. KL was supported by a grant of the Research Foundation&#x02014;Flanders (FWO) no. 1S55420N.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s11">
<title>Publisher&#x00027;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec> </body>
<back>
<ack><p>The authors would like to acknowledge the Vlaams Supercomputer Centrum for providing affordable access to sufficiently powerful hardware to perform the needed experiments in a reasonable timeframe.</p>
</ack>
<sec sec-type="supplementary-material" id="s12">
<title>Supplementary Material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/frai.2022.736791/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/frai.2022.736791/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Presentation_1.pdf" id="SM1" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="other"><person-group person-group-type="author"><name><surname>Anderson</surname> <given-names>P.</given-names></name> <name><surname>He</surname> <given-names>X.</given-names></name> <name><surname>Buehler</surname> <given-names>C.</given-names></name> <name><surname>Teney</surname> <given-names>D.</given-names></name> <name><surname>Johnson</surname> <given-names>M.</given-names></name> <name><surname>Gould</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>Bottom-up and top-down attention for image captioning and visual question answering</article-title>. <source>arXiv:1707.07998 [cs]</source> arXiv: 1707.07998.</citation>
</ref>
<ref id="B2">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Antol</surname> <given-names>S.</given-names></name> <name><surname>Agrawal</surname> <given-names>A.</given-names></name> <name><surname>Lu</surname> <given-names>J.</given-names></name> <name><surname>Mitchell</surname> <given-names>M.</given-names></name> <name><surname>Batra</surname> <given-names>D.</given-names></name> <name><surname>Zitnick</surname> <given-names>C. L.</given-names></name> <etal/></person-group>. (<year>2015</year>). <article-title>Vqa: visual question answering</article-title>, in <source>Proceedings of the IEEE International Conference on Computer Vision</source> (<publisher-loc>Santiago</publisher-loc>), <fpage>2425</fpage>&#x02013;<lpage>2433</lpage>.</citation></ref>
<ref id="B3">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Bareinboim</surname> <given-names>E.</given-names></name> <name><surname>Correa</surname> <given-names>J.</given-names></name> <name><surname>Ibeling</surname> <given-names>D.</given-names></name> <name><surname>Icard</surname> <given-names>T.</given-names></name></person-group> (<year>2020</year>). <article-title>On pearl&#x00027;s hierarchy and the foundations of causal inference</article-title>. <source>ACM Special Volume in Honor of Judea Pearl (provisional title)</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://www.semanticscholar.org/paper/1-On-Pearl-%E2%80%99-s-Hierarchy-and-the-Foundations-of-Bareinboim-Correa/6f7fe92f2bd20375b82f8a7f882086b88ca11ed2">https://www.semanticscholar.org/paper/1-On-Pearl-%E2%80%99-s-Hierarchy-and-the-Foundations-of-Bareinboim-Correa/6f7fe92f2bd20375b82f8a7f882086b88ca11ed2</ext-link></citation>
</ref>
<ref id="B4">
<citation citation-type="other"><person-group person-group-type="author"><name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Deleu</surname> <given-names>T.</given-names></name> <name><surname>Rahaman</surname> <given-names>N.</given-names></name> <name><surname>Ke</surname> <given-names>R.</given-names></name> <name><surname>Lachapelle</surname> <given-names>S.</given-names></name> <name><surname>Bilaniuk</surname> <given-names>O.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>A meta-transfer objective for learning to disentangle causal mechanisms</article-title>. <source>arXiv:1901.10912 [cs, stat]</source> arXiv: 1901.10912.</citation>
</ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bugliarello</surname> <given-names>E.</given-names></name> <name><surname>Cotterell</surname> <given-names>R.</given-names></name> <name><surname>Okazaki</surname> <given-names>N.</given-names></name> <name><surname>Elliott</surname> <given-names>D.</given-names></name></person-group> (<year>2021</year>). <article-title>Multimodal pretraining unmasked: a meta-analysis and a unified framework of vision-and-language berts</article-title>. <source>Trans. Assoc. Comput. Linguist.</source> <volume>9</volume>, <fpage>978</fpage>&#x02013;<lpage>994</lpage>. <pub-id pub-id-type="doi">10.1162/tacl_a_00408</pub-id></citation>
</ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cartwright</surname> <given-names>N.</given-names></name></person-group> (<year>1994</year>). <article-title>Nature&#x00027;s capacities and their measurement</article-title>. <source>OUP Catalogue</source>.</citation>
</ref>
<ref id="B7">
<citation citation-type="other"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>X.</given-names></name> <name><surname>Fang</surname> <given-names>H.</given-names></name> <name><surname>Lin</surname> <given-names>T.-Y.</given-names></name> <name><surname>Vedantam</surname> <given-names>R.</given-names></name> <name><surname>Gupta</surname> <given-names>S.</given-names></name> <name><surname>Doll&#x000E1;r</surname> <given-names>P.</given-names></name> <etal/></person-group>. (<year>2015</year>). <article-title>Microsoft coco captions: data collection and evaluation server</article-title>. <source>arXiv preprint</source> arXiv:1504.00325.</citation>
</ref>
<ref id="B8">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>Y.-C.</given-names></name> <name><surname>Li</surname> <given-names>L.</given-names></name> <name><surname>Yu</surname> <given-names>L.</given-names></name> <name><surname>El Kholy</surname> <given-names>A.</given-names></name> <name><surname>Ahmed</surname> <given-names>F.</given-names></name> <name><surname>Gan</surname> <given-names>Z.</given-names></name> <name><surname>Cheng</surname> <given-names>Y.</given-names></name> <name><surname>Liu</surname> <given-names>J.</given-names></name></person-group> (<year>2020</year>). <article-title>Uniter: universal image-text representation learning</article-title>, in <source>European Conference on Computer Vision</source> (<publisher-loc>Glasgow</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>104</fpage>&#x02013;<lpage>120</lpage>.</citation>
</ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chickering</surname> <given-names>D. M.</given-names></name></person-group> (<year>2002</year>). <article-title>Optimal structure identification with greedy search</article-title>. <source>J. Mach. Learn. Res.</source> <volume>3</volume>, <fpage>507</fpage>&#x02013;<lpage>554</lpage>. <pub-id pub-id-type="doi">10.1162/153244303321897717</pub-id></citation>
</ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Colombo</surname> <given-names>D.</given-names></name> <name><surname>Maathuis</surname> <given-names>M. H.</given-names></name> <name><surname>Kalisch</surname> <given-names>M.</given-names></name> <name><surname>Richardson</surname> <given-names>T. S.</given-names></name></person-group> (<year>2012</year>). <article-title>Learning high-dimensional directed acyclic graphs with latent and selection variables</article-title>. <source>Ann. Stat.</source> <volume>40</volume>, <fpage>294</fpage>&#x02013;<lpage>321</lpage>. <pub-id pub-id-type="doi">10.1214/11-AOS940</pub-id></citation>
</ref>
<ref id="B11">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Ren</surname> <given-names>S.</given-names></name> <name><surname>Sun</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). <article-title>Deep residual learning for image recognition</article-title>, in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Las Vegas, NV</publisher-loc>), <fpage>770</fpage>&#x02013;<lpage>778</lpage>.</citation></ref>
<ref id="B12">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>L.</given-names></name> <name><surname>Wang</surname> <given-names>W.</given-names></name> <name><surname>Chen</surname> <given-names>J.</given-names></name> <name><surname>Wei</surname> <given-names>X.-Y.</given-names></name></person-group> (<year>2019</year>). <article-title>Attention on attention for image captioning</article-title>, in <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source> (<publisher-loc>Seoul</publisher-loc>), <fpage>4634</fpage>&#x02013;<lpage>4643</lpage>.</citation>
</ref>
<ref id="B13">
<citation citation-type="other"><person-group person-group-type="author"><name><surname>Ke</surname> <given-names>N. R.</given-names></name> <name><surname>Bilaniuk</surname> <given-names>O.</given-names></name> <name><surname>Goyal</surname> <given-names>A.</given-names></name> <name><surname>Bauer</surname> <given-names>S.</given-names></name> <name><surname>Larochelle</surname> <given-names>H.</given-names></name> <name><surname>Sch&#x000F6;lkopf</surname> <given-names>B.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>Learning neural causal models from unknown interventions</article-title>. <source>arXiv:1910.01075 [cs, stat]</source> arXiv: 1910.01075.</citation>
</ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kuznetsova</surname> <given-names>A.</given-names></name> <name><surname>Rom</surname> <given-names>H.</given-names></name> <name><surname>Alldrin</surname> <given-names>N.</given-names></name> <name><surname>Uijlings</surname> <given-names>J.</given-names></name> <name><surname>Krasin</surname> <given-names>I.</given-names></name> <name><surname>Pont-Tuset</surname> <given-names>J.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>The open images dataset v4</article-title>. <source>Int. J. Comput. Vis.</source> <volume>7</volume>, <fpage>1</fpage>&#x02013;<lpage>26</lpage>. <pub-id pub-id-type="doi">10.1007/s11263-020-01316-z</pub-id></citation>
</ref>
<ref id="B15">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>T.-Y.</given-names></name> <name><surname>Maire</surname> <given-names>M.</given-names></name> <name><surname>Belongie</surname> <given-names>S.</given-names></name> <name><surname>Hays</surname> <given-names>J.</given-names></name> <name><surname>Perona</surname> <given-names>P.</given-names></name> <name><surname>Ramanan</surname> <given-names>D.</given-names></name> <etal/></person-group>. (<year>2014</year>). <article-title>Microsoft coco: common objects in context</article-title>, in <source>European Conference on Computer Vision</source> (<publisher-loc>Z&#x000FC;rich</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>740</fpage>&#x02013;<lpage>755</lpage>.</citation>
</ref>
<ref id="B16">
<citation citation-type="other"><person-group person-group-type="author"><name><surname>Lopez-Paz</surname> <given-names>D.</given-names></name> <name><surname>Nishihara</surname> <given-names>R.</given-names></name> <name><surname>Chintala</surname> <given-names>S.</given-names></name> <name><surname>Sch&#x000F6;lkopf</surname> <given-names>B.</given-names></name> <name><surname>Bottou</surname> <given-names>L.</given-names></name></person-group> (<year>2017</year>). <article-title>Discovering causal signals in images</article-title>. <source>arXiv:1605.08179 [cs, stat]</source> arXiv: 1605.08179.</citation>
</ref>
<ref id="B17">
<citation citation-type="other"><person-group person-group-type="author"><name><surname>Lu</surname> <given-names>J.</given-names></name> <name><surname>Batra</surname> <given-names>D.</given-names></name> <name><surname>Parikh</surname> <given-names>D.</given-names></name> <name><surname>Lee</surname> <given-names>S.</given-names></name></person-group> (<year>2019</year>). <article-title>ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks</article-title>. <source>arXiv:1908.02265 [cs]</source> arXiv: 1908.02265.</citation>
</ref>
<ref id="B18">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Niu</surname> <given-names>Y.</given-names></name> <name><surname>Tang</surname> <given-names>K.</given-names></name> <name><surname>Zhang</surname> <given-names>H.</given-names></name> <name><surname>Lu</surname> <given-names>Z.</given-names></name> <name><surname>Hua</surname> <given-names>X.-S.</given-names></name> <name><surname>Wen</surname> <given-names>J.-R.</given-names></name></person-group> (<year>2021</year>). <article-title>Counterfactual vqa: a cause-effect look at language bias</article-title>, in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Nashville, TN</publisher-loc>) <fpage>12700</fpage>&#x02013;<lpage>12710</lpage>.</citation>
</ref>
<ref id="B19">
<citation citation-type="other"><person-group person-group-type="author"><name><surname>Parascandolo</surname> <given-names>G.</given-names></name> <name><surname>Kilbertus</surname> <given-names>N.</given-names></name> <name><surname>Rojas-Carulla</surname> <given-names>M.</given-names></name> <name><surname>Sch&#x000F6;lkopf</surname> <given-names>B.</given-names></name></person-group> (<year>2018</year>). <article-title>Learning independent causal mechanisms</article-title>. <source>arXiv:1712.00961 [cs, stat]</source> arXiv: 1712.00961.</citation>
</ref>
<ref id="B20">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Pearl</surname> <given-names>J.</given-names></name></person-group> (<year>2009</year>). <source>Causality</source>. <publisher-loc>Los Angeles, CA</publisher-loc>: <publisher-name>Cambridge University Press</publisher-name>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://www.cambridge.org/core/books/causality/B0046844FAE10CBF274D4ACBDAEB5F5B">https://www.cambridge.org/core/books/causality/B0046844FAE10CBF274D4ACBDAEB5F5B</ext-link></citation>
</ref>
<ref id="B21">
<citation citation-type="other"><person-group person-group-type="author"><name><surname>Pearl</surname> <given-names>J.</given-names></name></person-group> (<year>2012</year>). <article-title>The do-calculus revisited</article-title>. <source>arXiv:1210.4852 [cs, stat]</source> arXiv: 1210.4852.</citation>
</ref>
<ref id="B22">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Pearl</surname> <given-names>J.</given-names></name> <name><surname>Mackenzie</surname> <given-names>D.</given-names></name></person-group> (<year>2018</year>). <source>The Book of Why: The New Science of Cause and Effect</source>, <edition>1st Edn</edition>. <publisher-loc>New York, NY</publisher-loc>: <publisher-name>Basic Books</publisher-name>.</citation>
</ref>
<ref id="B23">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Qi</surname> <given-names>J.</given-names></name> <name><surname>Niu</surname> <given-names>Y.</given-names></name> <name><surname>Huang</surname> <given-names>J.</given-names></name> <name><surname>Zhang</surname> <given-names>H.</given-names></name></person-group> (<year>2020</year>). <article-title>Two causal principles for improving visual dialog</article-title>, in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Seattle, WA</publisher-loc>), <fpage>10860</fpage>&#x02013;<lpage>10869</lpage>.</citation>
</ref>
<ref id="B24">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Ren</surname> <given-names>S.</given-names></name> <name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Girshick</surname> <given-names>R.</given-names></name> <name><surname>Sun</surname> <given-names>J.</given-names></name></person-group> (<year>2015</year>). <article-title>Faster R-CNN: towards real-time object detection with region proposal networks</article-title>, in <source>Advances in Neural Information Processing Systems</source>, Vol. <volume>28</volume>, <fpage>91</fpage>&#x02013;<lpage>99</lpage>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://proceedings.neurips.cc/paper/2015/hash/14bfa6bb14875e45bba028a21ed38046-Abstract.html">https://proceedings.neurips.cc/paper/2015/hash/14bfa6bb14875e45bba028a21ed38046-Abstract.html</ext-link></citation>
</ref>
<ref id="B25">
<citation citation-type="other"><person-group person-group-type="author"><name><surname>Sch&#x000F6;lkopf</surname> <given-names>B.</given-names></name></person-group> (<year>2019</year>). <article-title>Causality for machine learning</article-title>. <source>arXiv:1911.10500 [cs, stat]</source> arXiv: 1911.10500.</citation>
</ref>
<ref id="B26">
<citation citation-type="other"><person-group person-group-type="author"><name><surname>Sch&#x000F6;lkopf</surname> <given-names>B.</given-names></name> <name><surname>Locatello</surname> <given-names>F.</given-names></name> <name><surname>Bauer</surname> <given-names>S.</given-names></name> <name><surname>Ke</surname> <given-names>N. R.</given-names></name> <name><surname>Kalchbrenner</surname> <given-names>N.</given-names></name> <name><surname>Goyal</surname> <given-names>A.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>Towards causal representation learning</article-title>. <source>arXiv:2102.11107 [cs]</source> arXiv: 2102.11107.</citation>
</ref>
<ref id="B27">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sharma</surname> <given-names>P.</given-names></name> <name><surname>Ding</surname> <given-names>N.</given-names></name> <name><surname>Goodman</surname> <given-names>S.</given-names></name> <name><surname>Soricut</surname> <given-names>R.</given-names></name></person-group> (<year>2018</year>). <article-title>Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning</article-title>, in <source>Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source> (<publisher-loc>Melbourne, VIC</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>2556</fpage>&#x02013;<lpage>2565</lpage>.</citation>
</ref>
<ref id="B28">
<citation citation-type="other"><person-group person-group-type="author"><name><surname>Su</surname> <given-names>W.</given-names></name> <name><surname>Zhu</surname> <given-names>X.</given-names></name> <name><surname>Cao</surname> <given-names>Y.</given-names></name> <name><surname>Li</surname> <given-names>B.</given-names></name> <name><surname>Lu</surname> <given-names>L.</given-names></name> <name><surname>Wei</surname> <given-names>F.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>Vl-bert: pre-training of generic visual-linguistic representations</article-title>. <source>arXiv preprint</source> arXiv:1908.08530.</citation>
</ref>
<ref id="B29">
<citation citation-type="other"><person-group person-group-type="author"><name><surname>Tan</surname> <given-names>H.</given-names></name> <name><surname>Bansal</surname> <given-names>M.</given-names></name></person-group> (<year>2019</year>). <article-title>Lxmert: learning cross-modality encoder representations from transformers</article-title>. <source>arXiv preprint</source> arXiv:1908.07490.</citation>
</ref>
<ref id="B30">
<citation citation-type="other"><person-group person-group-type="author"><name><surname>Vaswani</surname> <given-names>A.</given-names></name> <name><surname>Shazeer</surname> <given-names>N.</given-names></name> <name><surname>Parmar</surname> <given-names>N.</given-names></name> <name><surname>Uszkoreit</surname> <given-names>J.</given-names></name> <name><surname>Jones</surname> <given-names>L.</given-names></name> <name><surname>Gomez</surname> <given-names>A. N.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Attention is all you need</article-title>. <source>arXiv:1706.03762 [cs]</source> arXiv: 1706.03762.</citation>
</ref>
<ref id="B31">
<citation citation-type="other"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>T.</given-names></name> <name><surname>Huang</surname> <given-names>J.</given-names></name> <name><surname>Zhang</surname> <given-names>H.</given-names></name> <name><surname>Sun</surname> <given-names>Q.</given-names></name></person-group> (<year>2020</year>). <article-title>Visual commonsense R-CNN</article-title>. <source>arXiv:2002.12204 [cs]</source> arXiv: 2002.12204.</citation>
</ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Young</surname> <given-names>P.</given-names></name> <name><surname>Lai</surname> <given-names>A.</given-names></name> <name><surname>Hodosh</surname> <given-names>M.</given-names></name> <name><surname>Hockenmaier</surname> <given-names>J.</given-names></name></person-group> (<year>2014</year>). <article-title>From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions</article-title>. <source>Trans. Assoc. Comput. Linguist.</source> <volume>2</volume>, <fpage>67</fpage>&#x02013;<lpage>78</lpage>. <pub-id pub-id-type="doi">10.1162/tacl_a_00166</pub-id></citation>
</ref>
<ref id="B33">
<citation citation-type="other"><person-group person-group-type="author"><name><surname>Yu</surname> <given-names>Z.</given-names></name> <name><surname>Yu</surname> <given-names>J.</given-names></name> <name><surname>Cui</surname> <given-names>Y.</given-names></name> <name><surname>Tao</surname> <given-names>D.</given-names></name> <name><surname>Tian</surname> <given-names>Q.</given-names></name></person-group> (<year>2019</year>). <article-title>Deep modular co-attention networks for visual question answering</article-title>. <source>arXiv:1906.10770 [cs]</source> arXiv: 1906.10770.</citation>
</ref>
<ref id="B34">
<citation citation-type="other"><person-group person-group-type="author"><name><surname>Yue</surname> <given-names>Z.</given-names></name> <name><surname>Zhang</surname> <given-names>H.</given-names></name> <name><surname>Sun</surname> <given-names>Q.</given-names></name> <name><surname>Hua</surname> <given-names>X.-S.</given-names></name></person-group> (<year>2020</year>). <article-title>Interventional few-shot learning</article-title>. <source>arXiv preprint</source> arXiv:2009.13000.</citation>
</ref>
<ref id="B35">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zellers</surname> <given-names>R.</given-names></name> <name><surname>Bisk</surname> <given-names>Y.</given-names></name> <name><surname>Farhadi</surname> <given-names>A.</given-names></name> <name><surname>Choi</surname> <given-names>Y.</given-names></name></person-group> (<year>2019</year>). <article-title>From recognition to cognition: visual commonsense reasoning</article-title>, in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Long Beach, CA</publisher-loc>), <fpage>6720</fpage>&#x02013;<lpage>6731</lpage>.</citation></ref>
<ref id="B36">
<citation citation-type="other"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>D.</given-names></name> <name><surname>Zhang</surname> <given-names>H.</given-names></name> <name><surname>Tang</surname> <given-names>J.</given-names></name> <name><surname>Hua</surname> <given-names>X.</given-names></name> <name><surname>Sun</surname> <given-names>Q.</given-names></name></person-group> (<year>2020a</year>). <article-title>Causal intervention for weakly-supervised semantic segmentation</article-title>. <source>arXiv preprint</source> arXiv:2009.12547.</citation>
</ref>
<ref id="B37">
<citation citation-type="other"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>S.</given-names></name> <name><surname>Jiang</surname> <given-names>T.</given-names></name> <name><surname>Wang</surname> <given-names>T.</given-names></name> <name><surname>Kuang</surname> <given-names>K.</given-names></name> <name><surname>Zhao</surname> <given-names>Z.</given-names></name> <name><surname>Zhu</surname> <given-names>J.</given-names></name> <etal/></person-group>. (<year>2020b</year>). <article-title>DeVLBert: learning deconfounded visio-linguistic representations</article-title>. <source>arXiv:2008.06884 [cs]</source> arXiv: 2008.06884.</citation>
</ref>
</ref-list>
<fn-group>
<fn id="fn0001"><p><sup>1</sup>Intuitively, intervening on the value of some variable (like the presence of an object in a scene) means manipulating its value independently of the value of all other variables. This is defined formally in section 3.3.</p></fn>
<fn id="fn0002"><p><sup>2</sup>For example, the correlation observed between the number of umbrellas in the street and the number of taxis being taken is not due to a direct causal link, but to the common cause of rainy weather.</p></fn>
<fn id="fn0003"><p><sup>3</sup>Also called a Structural Equation Model.</p></fn>
<fn id="fn0004"><p><sup>4</sup>The set of SCMs that encode the same set of conditional probabilities.</p></fn>
<fn id="fn0005"><p><sup>5</sup>Note that the way <inline-formula><mml:math id="M28"><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="-tex-caligraphic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> is normalized implies that if every class appeared in every image, each class would have a value of <inline-formula><mml:math id="M29"><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mfrac></mml:math></inline-formula> in <inline-formula><mml:math id="M30"><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="-tex-caligraphic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Z</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> rather than a value of 1.</p></fn>
<fn id="fn0006"><p><sup>6</sup>Note that because a softmax is used to calculate the attention score, it is not possible to give more than one confounder a weight of 1.</p></fn>
<fn id="fn0007"><p><sup>7</sup>Up-Down (Anderson et al., <xref ref-type="bibr" rid="B1">2018</xref>) for captioning and VQA, AoANet (Huang et al., <xref ref-type="bibr" rid="B12">2019</xref>) for captioning, MCAN (Yu et al., <xref ref-type="bibr" rid="B33">2019</xref>) for VQA, R2C (Zellers et al., <xref ref-type="bibr" rid="B35">2019</xref>) and ViLBERT (Lu et al., <xref ref-type="bibr" rid="B17">2019</xref>) for VCR.</p></fn>
<fn id="fn0008"><p><sup>8</sup>See DeVLBERT replication instructions at <ext-link ext-link-type="uri" xlink:href="https://github.com/shengyuzhang/DeVLBert">https://github.com/shengyuzhang/DeVLBert</ext-link>.</p></fn>
<fn id="fn0009"><p><sup>9</sup>&#x0201C;Almost-all&#x0201D; is meant in a measure-theoretic sense, explained in Bareinboim et al. (<xref ref-type="bibr" rid="B3">2020</xref>).</p></fn>
<fn id="fn0010"><p><sup>10</sup>Although correlation is not a necessary condition for causation, for this heuristic we assume correlation is still a useful criterion for weeding out many uninteresting pairwise relations.</p></fn>
<fn id="fn0011"><p><sup>11</sup>We do not use the absolute value of <italic>log</italic>(<italic>P</italic>(<italic>X</italic> &#x0003D; 1, <italic>Y</italic> &#x0003D; 1)/<italic>P</italic>(<italic>X</italic> &#x0003D; 1)<italic>P</italic>(<italic>Y</italic> &#x0003D; 1)) because this fills the top of the ranking with pairs of classes that each occur only once, but never together: for these pairs, <italic>log</italic>(<italic>P</italic>(<italic>X</italic> &#x0003D; 1, <italic>Y</italic> &#x0003D; 1)/<italic>P</italic>(<italic>X</italic> &#x0003D; 1)<italic>P</italic>(<italic>Y</italic> &#x0003D; 1)) is minus infinity. These pairs are almost all unrelated and thus not causally meaningful.</p></fn>
<fn id="fn0012"><p><sup>12</sup>The absolute difference surfaces more common classes, while the relative difference also surfaces rare classes.</p></fn>
<fn id="fn0013"><p><sup>13</sup>We detail the crowdworkers&#x00027; reward per label and estimated time spent per label in the Ethical Considerations section at the end of this article.</p></fn>
<fn id="fn0014"><p><sup>14</sup>We did not end up weighting agreement by confidence level as the difference in resulting pairs was small anyway.</p></fn>
<fn id="fn0015"><p><sup>15</sup><ext-link ext-link-type="uri" xlink:href="https://drive.google.com/file/d/17CTPMoZ4uJH6cSQaxD6Vmyv_bVLJwVJp/view?usp=sharing">https://drive.google.com/file/d/17CTPMoZ4uJH6cSQaxD6Vmyv_bVLJwVJp/view?usp=sharing</ext-link></p></fn>
<fn id="fn0016"><p><sup>16</sup>For pretraining, the batch size is 512; for finetuning on VQA, it is 256; and for finetuning on IR, it is 64.</p></fn>
<fn id="fn0017"><p><sup>17</sup>Using 8 32GB NVIDIA V100 GPUs with a batch size of 64 per GPU, training takes 3 days. Using 4 16GB NVIDIA P100 GPUs with a batch size of 64 per GPU and gradient accumulation over 2 steps (making the effective batch size 512), training takes about 5 days.</p></fn>
<fn id="fn0018"><p><sup>18</sup>The code to reproduce all results can be found on github: <ext-link ext-link-type="uri" xlink:href="https://github.com/Natithan/p1_causality">https://github.com/Natithan/p1_causality</ext-link>.</p></fn>
<fn id="fn0019"><p><sup>19</sup>This was confirmed after correspondence with the authors of DeVLBERT.</p></fn>
<fn id="fn0020"><p><sup>20</sup>The original images for &#x0201C;woman&#x0201D; and &#x0201C;wall&#x0201D; have a high height/width ratio and are shown trimmed for display purposes.</p></fn>
</fn-group>
</back>
</article>