<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Neurosci.</journal-id>
<journal-title>Frontiers in Neuroscience</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Neurosci.</abbrev-journal-title>
<issn pub-type="epub">1662-453X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fnins.2022.631347</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Neuroscience</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Implicit Counterfactual Effect in Partial Feedback Reinforcement Learning: Behavioral and Modeling Approach</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Barakchian</surname> <given-names>Zahra</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1139738/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Vahabie</surname> <given-names>Abdol-Hossein</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
<xref ref-type="author-notes" rid="fn002"><sup>&#x02020;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1154408/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Nili Ahmadabadi</surname> <given-names>Majid</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="author-notes" rid="fn002"><sup>&#x02020;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/138617/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Department of Cognitive Neuroscience, Institute for Research in Fundamental Sciences</institution>, <addr-line>Tehran</addr-line>, <country>Iran</country></aff>
<aff id="aff2"><sup>2</sup><institution>Cognitive Systems Laboratory, Control and Intelligent Processing Center of Excellence, School of Electrical and Computer Engineering, College of Engineering, University of Tehran</institution>, <addr-line>Tehran</addr-line>, <country>Iran</country></aff>
<aff id="aff3"><sup>3</sup><institution>Department of Psychology, Faculty of Psychology and Education, University of Tehran</institution>, <addr-line>Tehran</addr-line>, <country>Iran</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Paul E. M. Phillips, University of Washington, United States</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Duda Kvitsiani, Aarhus University, Denmark; Mael Lebreton, Universit&#x000E9; de Gen&#x000E8;ve, Switzerland</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Zahra Barakchian <email>zbarakchian&#x00040;ipm.ir</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Decision Neuroscience, a section of the journal Frontiers in Neuroscience</p></fn>
<fn fn-type="equal" id="fn002"><p>&#x02020;These authors have contributed equally to this work</p></fn></author-notes>
<pub-date pub-type="epub">
<day>10</day>
<month>05</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>16</volume>
<elocation-id>631347</elocation-id>
<history>
<date date-type="received">
<day>19</day>
<month>11</month>
<year>2020</year>
</date>
<date date-type="accepted">
<day>28</day>
<month>03</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2022 Barakchian, Vahabie and Nili Ahmadabadi.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Barakchian, Vahabie and Nili Ahmadabadi</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license> </permissions>
<abstract>
<p>Context strongly affects learning behavior by adjusting option values according to the distribution of available options. Displaying counterfactual outcomes (the outcomes of the unchosen option) alongside the chosen one, i.e., providing complete feedback, is expected to increase the contextual effect by inducing participants to compare the two outcomes during learning. However, when the context consists only of the juxtaposition of several options and there is no such explicit counterfactual information (i.e., only partial feedback is provided), it is not clear whether and how the contextual effect emerges. In this research, we employ Partial and Complete feedback paradigms in which options are associated with different reward distributions. Our modeling analysis shows that a model that uses the outcome of the chosen option to update the values of both the chosen and the unchosen options, in opposite directions, better accounts for the behavioral data. This is also in line with the diffuse effect of dopamine on the striatum. Furthermore, our data show that the contextual effect is not limited to reward probabilities but also extends to reward magnitudes. These results suggest that extending the counterfactual concept to include the effect of the chosen outcome on the unchosen option better explains why a contextual effect arises in situations in which there is no extra information about the unchosen outcome.</p></abstract>
<kwd-group>
<kwd>reinforcement learning</kwd>
<kwd>value learning</kwd>
<kwd>contextual effect</kwd>
<kwd>counterfactual outcome</kwd>
<kwd>partial and complete feedback</kwd>
</kwd-group>
<counts>
<fig-count count="8"/>
<table-count count="4"/>
<equation-count count="49"/>
<ref-count count="50"/>
<page-count count="19"/>
<word-count count="12320"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Behavior necessarily occurs within a specific context. A wealth of studies have investigated the effect of context on decision making (Summerfield and Tsetsos, <xref ref-type="bibr" rid="B44">2015</xref>; Rigoli et al., <xref ref-type="bibr" rid="B36">2016a</xref>,<xref ref-type="bibr" rid="B38">b</xref>, <xref ref-type="bibr" rid="B37">2017</xref>, <xref ref-type="bibr" rid="B35">2018</xref>; Tsetsos et al., <xref ref-type="bibr" rid="B49">2016</xref>), whereas the effect of context on reinforcement learning has received little attention. Recent studies have shown that many cognitive biases arise from the context in which value learning occurs (Palminteri et al., <xref ref-type="bibr" rid="B32">2015</xref>; Klein et al., <xref ref-type="bibr" rid="B26">2017</xref>; Bavard et al., <xref ref-type="bibr" rid="B4">2018</xref>). The choice context consists of the currently available options. Two paradigms have been used to investigate value learning. In the Complete feedback paradigm, participants are shown the outcomes of the options they select (factual outcomes) as well as the outcomes of the options they forgo (counterfactual outcomes). Participants can thus compare factual and counterfactual outcomes and thereby learn the value of the selected option relative to that of the forgone option (Palminteri et al., <xref ref-type="bibr" rid="B32">2015</xref>; Bavard et al., <xref ref-type="bibr" rid="B4">2018</xref>). In the Partial feedback paradigm, participants are shown only the outcomes of the selected options, so they cannot compare the two outcomes. Whether and how the contextual effect appears in the Partial feedback paradigm is unknown.</p>
<p>In reinforcement learning, the value of an option is usually learned through trial and error (Sutton and Barto, <xref ref-type="bibr" rid="B46">2018</xref>). Reinforcement learning is an incremental process in which option values are updated <italic>via</italic> prediction errors, that is, the difference between the received reward and the expected reward (Sutton and Barto, <xref ref-type="bibr" rid="B46">2018</xref>). Prediction errors are encoded in the brain by the neurotransmitter dopamine (Schultz et al., <xref ref-type="bibr" rid="B40">1997</xref>). Dopamine is released diffusely and has opposing effects, excitatory and inhibitory, respectively, on two distinct populations of striatal spiny projection neurons (SPNs), called D1-SPNs and D2-SPNs. These two populations encode the values of the two competing options (Frank et al., <xref ref-type="bibr" rid="B19">2004</xref>; Tai et al., <xref ref-type="bibr" rid="B47">2012</xref>; Collins and Frank, <xref ref-type="bibr" rid="B10">2014</xref>; Donahue et al., <xref ref-type="bibr" rid="B17">2018</xref>; Nonomura et al., <xref ref-type="bibr" rid="B30">2018</xref>; Shin et al., <xref ref-type="bibr" rid="B42">2018</xref>; Bariselli et al., <xref ref-type="bibr" rid="B1">2019</xref>). Inspired by the opposing effects of dopamine on D1- and D2-SPNs, we propose a simple reinforcement learning model called the Opposing Learning (OL) model. In the OL model, the chosen prediction error updates not only the value of the chosen option but also that of the unchosen option, in the opposite direction. Consequently, the learned value of each option depends not only on its own observed rewards but also on those of the option it competes with. This implies that two options with identical expected rewards can acquire different learned values when they appear in different contexts.</p>
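<p>The OL update described above can be illustrated with a minimal sketch. The exact OL equations are not reproduced here. For illustration only, the code assumes a learning rate <monospace>alpha</monospace> for the chosen option and a separate, hypothetical rate <monospace>alpha_opp</monospace>. The same chosen prediction error is applied at that rate, with opposite sign, to the unchosen option.</p>

```python
"""Minimal sketch of an opposing-learning (OL) style value update.

Assumption: the chosen prediction error updates the chosen value with
rate `alpha` and, with opposite sign, the unchosen value with a separate
hypothetical rate `alpha_opp`. Illustrative only, not the paper's equations.
"""


def ol_update(values: dict, chosen: str, unchosen: str, reward: float,
              alpha: float = 0.3, alpha_opp: float = 0.1) -> float:
    """Update `values` in place and return the chosen prediction error."""
    delta = reward - values[chosen]        # chosen prediction error
    values[chosen] += alpha * delta        # chosen value moves toward the reward
    values[unchosen] -= alpha_opp * delta  # unchosen value moves the opposite way
    return delta


if __name__ == "__main__":
    q = {"A1": 50.0, "B": 50.0}
    pe = ol_update(q, "A1", "B", reward=70.0)
    print(pe, q)  # 20.0 {'A1': 56.0, 'B': 48.0}
```

<p>Because the unchosen value moves with the competitor's prediction errors, each option's learned value in this sketch depends on the option it is paired with, not only on its own rewards.</p>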
<p>In a typical value learning task, participants aim to maximize expected rewards. However, in the Complete feedback paradigm, in which counterfactual outcomes are also presented, the value learning strategy can be more complex (Palminteri et al., <xref ref-type="bibr" rid="B32">2015</xref>; Klein et al., <xref ref-type="bibr" rid="B26">2017</xref>; Bavard et al., <xref ref-type="bibr" rid="B4">2018</xref>): participants learn option values by comparing the two outcomes with each other. This comparison triggers regret (when the factual outcome is the less favorable one) or relief (when the counterfactual outcome is the less favorable one). To minimize regret and maximize relief, people aim to maximize the outcome difference, i.e., [<italic>outcome</italic><sub><italic>factual</italic></sub> &#x02212; <italic>outcome</italic><sub><italic>counterfactual</italic></sub>] (Camille et al., <xref ref-type="bibr" rid="B8">2004</xref>; Coricelli et al., <xref ref-type="bibr" rid="B11">2005</xref>, <xref ref-type="bibr" rid="B12">2007</xref>). Recent studies have shown that people are neither pure expected-reward optimizers nor pure outcome-difference optimizers. Instead, they are hybrid optimizers who use both strategies but weight them differently (Kishida et al., <xref ref-type="bibr" rid="B25">2016</xref>; Bavard et al., <xref ref-type="bibr" rid="B4">2018</xref>). Individual differences reflect the degree to which each person relies on each strategy. By adding a hybrid component to the simple OL model, we extend it to account for the results in the Complete feedback paradigm as well.</p>
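<p>One common way to formalize such a hybrid optimizer is to learn from a weighted mixture of the factual outcome and the factual-minus-counterfactual difference. The sketch below assumes a convex combination with a hypothetical weight <monospace>w</monospace> and a simple delta rule. With <monospace>w</monospace> = 0 it is a pure expected-reward optimizer, and with <monospace>w</monospace> = 1 it is a pure outcome-difference optimizer. It illustrates the idea and is not the exact form of the extended OL model.</p>

```python
def hybrid_outcome(r_factual: float, r_counterfactual: float, w: float) -> float:
    """Effective outcome of a hybrid optimizer (assumed convex combination).

    w = 0: only the factual outcome counts (expected-reward optimizer);
    w = 1: only the outcome difference counts (outcome-difference optimizer).
    """
    return (1.0 - w) * r_factual + w * (r_factual - r_counterfactual)


def delta_update(value: float, outcome: float, alpha: float = 0.3) -> float:
    """Delta rule: move `value` a fraction `alpha` toward `outcome`."""
    return value + alpha * (outcome - value)


if __name__ == "__main__":
    # A "regret" trial: factual outcome 60, forgone outcome 70.
    for w in (0.0, 0.5, 1.0):
        print(w, hybrid_outcome(60.0, 70.0, w))
```

<p>As <monospace>w</monospace> grows, a trial in which the forgone outcome was better contributes a smaller, and eventually negative, effective outcome. This is how regret can enter value learning in this formulation.</p>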
<p>Most previous studies have explained contextual effects as resulting from the effect of the forgone outcome on the chosen value. In this study, we go beyond that explanation and account for the contextual effect as resulting from the effect of the factual outcome on the unchosen value, especially in situations in which there is no forgone outcome. To this end, we designed two feedback paradigms, with and without forgone outcomes, and show that the contextual effect is present in both. We introduce a novel reinforcement learning model that accounts for the underlying contextual bias in the behavioral data better than previous models. To study situations that occur frequently in everyday life, we use reward magnitude rather than reward probability, and thereby show that the contextual effect is also present in paradigms using reward magnitude.</p>
</sec>
<sec sec-type="results" id="s2">
<title>2. Results</title>
<sec>
<title>2.1. Behavioral Task</title>
<p>Two groups of participants performed two different versions of an instrumental learning task: the Partial feedback version, in which they were shown only factual outcomes, and the Complete feedback version, in which they were shown both factual and counterfactual outcomes. The participants&#x00027; goal was to earn as much reward as possible during the task. Rewards were independent random numbers drawn from specified normal distributions. Participants faced two pairs of options, (<italic>A</italic><sub>1</sub>, <italic>B</italic>) and (<italic>A</italic><sub>2</sub>, <italic>C</italic>). <italic>A</italic><sub>1</sub> and <italic>A</italic><sub>2</sub> were associated with rewards from the same distribution, <inline-formula><mml:math id="M75"><mml:mi mathvariant='script'>N</mml:mi></mml:math></inline-formula>(64, 13). <italic>B</italic> and <italic>C</italic> were associated with rewards from two different distributions, <inline-formula><mml:math id="M65"><mml:mi mathvariant='script'>N</mml:mi></mml:math></inline-formula>(54, 13) and <inline-formula><mml:math id="M76"><mml:mi mathvariant='script'>N</mml:mi></mml:math></inline-formula>(44, 13), respectively. To conceal the task structure from participants, different images were assigned to <italic>A</italic><sub>1</sub> and <italic>A</italic><sub>2</sub>, although their expected rewards were equal. After the learning phase, participants unexpectedly entered a post-learning transfer phase in which all six possible binary combinations of the four options were presented (each combination four times), and they were asked to choose the option with the higher expected reward. The transfer phase was designed to reveal any bias between <italic>A</italic><sub>2</sub> and <italic>A</italic><sub>1</sub>. 
Similar designs can be found in the context-dependent value learning literature (Palminteri et al., <xref ref-type="bibr" rid="B32">2015</xref>; Klein et al., <xref ref-type="bibr" rid="B26">2017</xref>; Bavard et al., <xref ref-type="bibr" rid="B4">2018</xref>). To avoid interfering with the participants&#x00027; previous learning, no feedback was provided in the transfer phase (Frank et al., <xref ref-type="bibr" rid="B19">2004</xref>, <xref ref-type="bibr" rid="B18">2007</xref>; Palminteri et al., <xref ref-type="bibr" rid="B32">2015</xref>; Klein et al., <xref ref-type="bibr" rid="B26">2017</xref>; Bavard et al., <xref ref-type="bibr" rid="B4">2018</xref>). After each choice, participants reported their confidence in that choice on a scale of 0 to 100. Finally, in the value estimation phase, participants reported their estimated expected value of each stimulus on a scale of 0&#x02013;100 (<xref ref-type="fig" rid="F1">Figure 1</xref>).</p>
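<p>The reward structure and transfer-phase design can be summarized in a short simulation sketch. It assumes that the second parameter of each normal distribution is the standard deviation and that transfer trials were presented in random order. The names are illustrative and are not taken from the authors&#x00027; task code.</p>

```python
import itertools
import random

# Reward distributions from the task description; the second parameter is
# assumed to be the standard deviation.
REWARD_DIST = {"A1": (64, 13), "A2": (64, 13), "B": (54, 13), "C": (44, 13)}
LEARNING_PAIRS = [("A1", "B"), ("A2", "C")]


def draw_reward(option: str, rng: random.Random) -> float:
    """Draw one independent reward for `option` from its normal distribution."""
    mu, sigma = REWARD_DIST[option]
    return rng.gauss(mu, sigma)


def transfer_trials(repeats: int = 4, seed: int = 0) -> list:
    """All binary combinations of the four options, each `repeats` times, shuffled."""
    rng = random.Random(seed)
    pairs = list(itertools.combinations(REWARD_DIST, 2))  # 6 unordered pairs
    trials = [pair for pair in pairs for _ in range(repeats)]
    rng.shuffle(trials)  # presentation order assumed random
    return trials


if __name__ == "__main__":
    rng = random.Random(1)
    for pair in LEARNING_PAIRS:
        print(pair, [round(draw_reward(opt, rng), 1) for opt in pair])
    print(len(transfer_trials()))  # 24 transfer trials
```

<p>The critical transfer trials are those containing the pair (A1, A2). These two options have identical reward distributions and differ only in the competitor they were paired with during learning.</p>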
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Behavioral design. Timelines of the Partial and Complete feedback versions of the task. Participants were given written instructions and practiced on 20 training trials before beginning the main task. In the learning phase, they learned two pairs of options by trial and error. In the subsequent transfer phase, they were presented with pairs of options and asked to choose the more advantageous option. They then reported their confidence in that choice. All possible binary combinations of options were presented in this phase. Finally, in the value estimation phase, they estimated the value of each option on a scale from 0 to 100.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-631347-g0001.tif"/>
</fig>
</sec>
<sec>
<title>2.2. Performance</title>
<p>First, to determine whether participants had learned the option values during the task, we assessed their performance in the learning phase. Performance was the proportion of trials on which they chose the advantageous option (the option with the higher expected reward). In both versions of the task, participants&#x00027; performance was significantly better than chance (0.5) [Partial: performance = 0.7613&#x000B1;0.1130, <italic>t</italic>-test, <italic>p</italic> &#x0003D; 1.1041<italic>e</italic> &#x02212; 15, <italic>t</italic><sub>(34)</sub> &#x0003D; 13.6787; Complete: performance = 0.8823&#x000B1;0.0853, <italic>t</italic>-test, <italic>p</italic> &#x0003D; 2.8382<italic>e</italic> &#x02212; 29, <italic>t</italic><sub>(41)</sub> &#x0003D; 29.0489; <xref ref-type="fig" rid="F3">Figure 3A</xref>]. Comparing the two versions, we found that performance was significantly better in the Complete feedback version [<italic>p</italic> &#x0003D; 4.5603<italic>e</italic> &#x02212; 07, <italic>t</italic><sub>(75)</sub> &#x0003D; 5.3522, one-tailed <italic>t</italic>-test]. Thus, providing information about counterfactual outcomes facilitated learning. This result is consistent with previous studies (Palminteri et al., <xref ref-type="bibr" rid="B32">2015</xref>; Klein et al., <xref ref-type="bibr" rid="B26">2017</xref>; Bavard et al., <xref ref-type="bibr" rid="B4">2018</xref>).</p>
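<p>The performance measure and the test against chance can be sketched with the Python standard library. Unanswered trials are encoded here as <monospace>None</monospace> and excluded, and the example performances are hypothetical. The <italic>p</italic>-value requires the cumulative distribution of Student&#x00027;s <italic>t</italic> with <italic>n</italic> &#x02212; 1 degrees of freedom, which the standard library does not provide. <monospace>scipy.stats.ttest_1samp</monospace> returns the same statistic together with its <italic>p</italic>-value.</p>

```python
import math
import statistics


def performance(choices, advantageous):
    """Proportion of answered trials on which the advantageous option was chosen.

    Unanswered trials are encoded as None and excluded.
    """
    answered = [(c, a) for c, a in zip(choices, advantageous) if c is not None]
    return sum(c == a for c, a in answered) / len(answered)


def one_sample_t(xs, mu0=0.5):
    """One-sample t statistic against chance level `mu0`, with its degrees of freedom."""
    n = len(xs)
    se = statistics.stdev(xs) / math.sqrt(n)
    return (statistics.mean(xs) - mu0) / se, n - 1


if __name__ == "__main__":
    # Hypothetical per-participant performances (not the study's data).
    perf = [0.62, 0.71, 0.80, 0.77, 0.69]
    t, df = one_sample_t(perf)
    print(f"t({df}) = {t:.3f}")
```

<p>A paired <italic>t</italic>-test, such as a comparison of two estimated values within participants, is equivalent to this one-sample test applied to per-participant differences with <monospace>mu0=0</monospace>.</p>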
<p>Participants&#x00027; performance was also significantly better than chance (0.5) in the transfer phase [Partial: performance = 0.8786&#x000B1;0.2868, <italic>t</italic>-test, <italic>p</italic> &#x0003D; 8.7844<italic>e</italic> &#x02212; 22, <italic>t</italic><sub>(34)</sub> &#x0003D; 21.673; Complete: performance = 0.9226&#x000B1;0.2618, <italic>t</italic>-test, <italic>p</italic> &#x0003D; 2.4064<italic>e</italic> &#x02212; 24, <italic>t</italic><sub>(41)</sub> &#x0003D; 21.6362; <xref ref-type="supplementary-material" rid="SM1">Supplementary Figure 1</xref>]. Additionally, reported confidence was significantly higher when participants chose the advantageous option than when they chose the non-advantageous option (Partial: mean confidence for advantageous choices = 0.7533&#x000B1;0.1895, for non-advantageous choices = 0.4882&#x000B1;0.2326; Complete: mean confidence for advantageous choices = 0.7961&#x000B1;0.1818, for non-advantageous choices = 0.5752&#x000B1;0.2124).</p>
<p>To determine whether reward sensitivity differed between the two versions of the task, we fitted the following hierarchical logistic regression model: <italic>action</italic> &#x0007E; 1 &#x0002B; <italic>vdif</italic> &#x0002A; <italic>task</italic> &#x0002B; (1 &#x0002B; <italic>vdif</italic> &#x0002A; <italic>task</italic>|<italic>subject</italic>). Here, <italic>action</italic> indicates choosing the left option, <italic>vdif</italic> is the difference between the option values, <italic>task</italic> is a categorical variable (1 for the Partial and 2 for the Complete feedback version), and <italic>subject</italic> is the random-effect grouping variable. As shown in <xref ref-type="supplementary-material" rid="SM1">Supplementary Table 1</xref>, reward sensitivity was significantly higher in the Complete than in the Partial feedback version (<italic>p</italic>-value of the <italic>vdif</italic>:<italic>task</italic>2 interaction regressor = 8.0551<italic>e</italic> &#x02212; 17). For this and all subsequent analyses, unanswered trials in the learning phase were excluded.</p>
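<p>The fixed-effects part of this model is a logistic function of the regressors. The <italic>vdif</italic>:<italic>task</italic>2 interaction is the change in the slope of <italic>vdif</italic> (reward sensitivity) in the Complete relative to the Partial version. The sketch below evaluates only this fixed-effects part, with Partial as the reference level. The coefficient values are hypothetical, and the subject-level random effects are omitted. Fitting the full model would require a mixed-effects package, for example <monospace>fitglme</monospace> in MATLAB or <monospace>glmer</monospace> in the R package lme4, both of which accept this formula notation.</p>

```python
import math


def p_choose_left(vdif: float, complete: bool, b0: float, b_vdif: float,
                  b_task: float, b_inter: float) -> float:
    """Fixed-effects part of action ~ 1 + vdif*task under a logistic link.

    Partial is the reference level, so the task2 dummy is 1 only for Complete.
    """
    task2 = 1.0 if complete else 0.0
    eta = b0 + b_vdif * vdif + b_task * task2 + b_inter * vdif * task2
    return 1.0 / (1.0 + math.exp(-eta))


if __name__ == "__main__":
    # Hypothetical coefficients: a positive interaction means a steeper
    # (more reward-sensitive) choice curve in the Complete version.
    coefs = dict(b0=0.0, b_vdif=0.08, b_task=0.0, b_inter=0.05)
    for vdif in (-20, 0, 20):
        print(vdif, round(p_choose_left(vdif, False, **coefs), 3),
              round(p_choose_left(vdif, True, **coefs), 3))
```

<p>In this parameterization, the slope of <italic>vdif</italic> is <monospace>b_vdif</monospace> in the Partial version and <monospace>b_vdif + b_inter</monospace> in the Complete version.</p>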
</sec>
<sec>
<title>2.3. Contextual Effect</title>
<p>After participants had learned the option values, we turned to the transfer phase to test for a contextual effect. Participants&#x00027; preferences between <italic>A</italic><sub>1</sub> and <italic>A</italic><sub>2</sub> were significantly modulated by the distance of each option from its paired option. Despite equal absolute values, participants preferred <italic>A</italic><sub>2</sub> over <italic>A</italic><sub>1</sub> (<italic>transfer bias</italic>) in both versions (Partial: <italic>p</italic> &#x0003D; 0.04, <italic>ratio</italic> &#x0003D; 0.65; Complete: <italic>p</italic> &#x0003D; 0.01, <italic>ratio</italic> &#x0003D; 0.66; binomial test; <xref ref-type="fig" rid="F2">Figures 2</xref>, <xref ref-type="fig" rid="F3">3B</xref>, <xref ref-type="supplementary-material" rid="SM1">Supplementary Figure 1</xref>). This analysis was performed on the first presentation of the (<italic>A</italic><sub>1</sub>, <italic>A</italic><sub>2</sub>) pair in the transfer phase. The same trend remained when we considered all four presentations (the rate of choosing <italic>A</italic><sub>2</sub> over <italic>A</italic><sub>1</sub> for each participant), although it was no longer significant (Partial: <italic>p</italic> &#x0003D; 0.083; Complete: <italic>p</italic> &#x0003D; 0.063; <italic>t</italic>-test). This loss of significance might be explained as follows. In the learning phase, only certain pairs of options appeared together, allowing participants to compare the options and learn their relative values. In the transfer phase, however, participants were presented with pairs of options that had not previously been paired, so they could not rely on learned relative values for these pairs. 
It may thus have been a better strategy not to rely completely on relative values but to use the absolute values of the options (for details of the binomial test, see <xref ref-type="supplementary-material" rid="SM1">Supplementary Material</xref>).</p>
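<p>The binomial test asks whether the number of participants who chose <italic>A</italic><sub>2</sub> over <italic>A</italic><sub>1</sub> on the first presentation exceeds what would be expected if both options were equally likely to be chosen. A minimal exact version using only the standard library is sketched below. It computes a one-sided <italic>p</italic>-value, which is an assumption here, since the sidedness of the authors&#x00027; test is described in their Supplementary Material rather than in this section. In SciPy, <monospace>scipy.stats.binomtest(k, n, p=0.5, alternative="greater")</monospace> gives the same one-sided <italic>p</italic>-value.</p>

```python
from math import comb


def binom_test_greater(k: int, n: int, p: float = 0.5) -> float:
    """One-sided exact binomial test: P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # Hypothetical counts: k of n participants chose A2 over A1 on first presentation.
    k, n = 22, 35
    print(f"ratio = {k / n:.2f}, one-sided p = {binom_test_greater(k, n):.3f}")
```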
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Transfer effect. In the transfer phase of both the Partial and the Complete feedback versions of the task, participants significantly more often preferred the option with the higher relative value (<italic>A</italic><sub>2</sub>, dark colors), although both options had equal absolute values.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-631347-g0002.tif"/>
</fig>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Behavioral results in the learning, transfer, and estimation phases. <bold>(A)</bold> The learning curves show that, when presented with paired options, participants learned to choose the more advantageous option in each pair (<italic>A</italic><sub>1</sub> in the <italic>A</italic><sub>1</sub><italic>B</italic> pair and <italic>A</italic><sub>2</sub> in the <italic>A</italic><sub>2</sub><italic>C</italic> pair). The learning curves generated by the OL model show similar results. Each bin on the x-axis is the average of choices over 10 trials. Solid lines show the behavioral data; dashed lines show the synthetic data. <bold>(B)</bold> The summarized preferences of the participants in the six combinations (top) and their corresponding confidence levels (bottom), along with the predictions of the OL model (black dots). <bold>(C)</bold> The participants&#x00027; value estimations (colored bars) are very close to the true expected rewards of the options <italic>A</italic><sub>1</sub> and <italic>A</italic><sub>2</sub> (colored lines). The Partial version is shown in green and the Complete version in brown. Shadings denote <italic>SD</italic> and error bars denote <italic>SEM</italic>.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-631347-g0003.tif"/>
</fig>
<p>To ensure that the observed bias in the transfer phase resulted from context-dependent value learning rather than from confounding factors, we examined which other factors could have affected participants&#x00027; preference for <italic>A</italic><sub>2</sub>. The bias may have occurred because participants chose <italic>A</italic><sub>2</sub> more frequently than <italic>A</italic><sub>1</sub> in the learning phase. To test this possibility, we ran a logistic regression analysis. It tested whether the preference for <italic>A</italic><sub>2</sub> over <italic>A</italic><sub>1</sub> was explained by the difference in how often <italic>A</italic><sub>2</sub> versus <italic>A</italic><sub>1</sub> had been chosen in the learning phase. The effect of this choice-frequency difference on the transfer bias was marginally significant in the Partial version and not significant in the Complete version (<italic>t</italic>-test on the regression weights, Partial: <italic>p</italic> &#x0003D; 0.054; Complete: <italic>p</italic> &#x0003D; 0.12). The intercept of the regression remained significant, confirming the transfer effect even when choice frequency was controlled for (<italic>t</italic>-test on the intercept weight, Partial: <italic>p</italic> &#x0003D; 0.03; Complete: <italic>p</italic> &#x0003D; 0.02). 
This analysis was performed on the first presentation of (<italic>A</italic><sub>1</sub>, <italic>A</italic><sub>2</sub>). The pattern of results was similar when we considered all presentations of (<italic>A</italic><sub>1</sub>, <italic>A</italic><sub>2</sub>), i.e., the rates of choosing <italic>A</italic><sub>2</sub> over <italic>A</italic><sub>1</sub>, although the intercepts no longer reached significance (<italic>t</italic>-test on the regression weights, Partial: <italic>p</italic> &#x0003D; 0.0851; Complete: <italic>p</italic> &#x0003D; 0.060; <italic>t</italic>-test on the intercept weight, Partial: <italic>p</italic> &#x0003D; 0.081; Complete: <italic>p</italic> &#x0003D; 0.080).</p>
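<p>The choice-frequency control can be sketched as a logistic regression. The outcome is each participant&#x00027;s first transfer choice (1 if <italic>A</italic><sub>2</sub> was chosen over <italic>A</italic><sub>1</sub>). The single predictor is the number of times <italic>A</italic><sub>2</sub> was chosen in the learning phase minus the number of times <italic>A</italic><sub>1</sub> was chosen. The predictor definition, the Newton-Raphson fitter, and the Wald-type standard errors below are illustrative assumptions rather than the authors&#x00027; pipeline. The key quantity is the intercept, the log-odds of choosing <italic>A</italic><sub>2</sub> when the frequency difference is zero. A significant intercept indicates a transfer bias beyond choice frequency.</p>

```python
import math


def fit_logistic(x, y, iters=25):
    """Fit P(y = 1) = sigmoid(b0 + b1 * x) by Newton-Raphson.

    Returns (b0, b1, se0, se1); standard errors come from the inverse of the
    Fisher information evaluated in the final iteration. Assumes the data are
    not perfectly separable (otherwise the estimates diverge).
    """
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p              # score for the intercept
            g1 += (yi - p) * xi       # score for the slope
            h00 += w                  # Fisher information entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta += inverse(information) @ score, written out for 2x2
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1, math.sqrt(h11 / det), math.sqrt(h00 / det)


if __name__ == "__main__":
    # Hypothetical data: frequency difference (A2 minus A1 choices) and first transfer choice.
    freq_diff = [-3, -1, 0, 2, 4, -2, 1, 3, 0, 5]
    chose_a2 = [0, 1, 1, 0, 1, 1, 0, 1, 1, 1]
    b0, b1, se0, se1 = fit_logistic(freq_diff, chose_a2)
    print(f"intercept z = {b0 / se0:.2f}, slope z = {b1 / se1:.2f}")
```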
<p>Furthermore, we repeated the analysis described in the previous paragraph using only the last 20 trials of the learning phase. Again, late choice frequencies had no significant effect on the transfer bias (<italic>t</italic>-test on the regression weights, Partial: <italic>p</italic> &#x0003D; 0.56; Complete: <italic>p</italic> &#x0003D; 0.29), while the intercepts remained significant or marginally significant (Partial: <italic>p</italic> &#x0003D; 0.06; Complete: <italic>p</italic> &#x0003D; 0.03). This analysis was performed on the first presentation of (<italic>A</italic><sub>1</sub>, <italic>A</italic><sub>2</sub>). The regression weights were likewise non-significant when we considered all presentations of (<italic>A</italic><sub>1</sub>, <italic>A</italic><sub>2</sub>) (<italic>t</italic>-test on the regression weights, Partial: <italic>p</italic> &#x0003D; 0.730; Complete: <italic>p</italic> &#x0003D; 0.798; <italic>t</italic>-test on the intercept weight, Partial: <italic>p</italic> &#x0003D; 0.132; Complete: <italic>p</italic> &#x0003D; 0.108).</p>
<p>Another possible confound of the transfer bias is the occurrence of very large or very small rewards (the upper and lower tails of the reward distributions). To test this, we first summed the rewards greater than &#x003BC; &#x0002B; 2.5&#x003C3; (where &#x003BC; and &#x003C3; are the mean and standard deviation of the rewards, respectively). We then used logistic regression to test whether this sum had a significant effect on the transfer bias. We repeated the same analysis for rewards less than &#x003BC; &#x02212; 2.5&#x003C3;. Neither large nor small rewards had a significant effect in either version (<italic>t</italic>-test on the regression weights, large rewards: [Partial: <italic>p</italic> &#x0003D; 0.40; Complete: <italic>p</italic> &#x0003D; 0.62], small rewards: [Partial: <italic>p</italic> &#x0003D; 0.54; Complete: <italic>p</italic> &#x0003D; 0.47]). As before, this analysis was performed on the first presentation of (<italic>A</italic><sub>1</sub>, <italic>A</italic><sub>2</sub>), and the results were the same when we considered all presentations of (<italic>A</italic><sub>1</sub>, <italic>A</italic><sub>2</sub>) (<italic>t</italic>-test on the regression weights, large rewards: [Partial: <italic>p</italic> &#x0003D; 0.684; Complete: <italic>p</italic> &#x0003D; 0.508], small rewards: [Partial: <italic>p</italic> &#x0003D; 0.630; Complete: <italic>p</italic> &#x0003D; 0.879]).</p>
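<p>The tail-reward predictors can be computed as sketched below. The code assumes that &#x003BC; and &#x003C3; are computed from the rewards a participant observed during learning, because the text does not specify which set of rewards defines them. It also uses the sample standard deviation. The resulting sums would then enter a logistic regression like the one used for choice frequency.</p>

```python
import statistics


def tail_sums(rewards, k=2.5):
    """Sum of rewards above mu + k*sigma and sum of rewards below mu - k*sigma.

    mu and sigma are computed from `rewards` itself (sample standard deviation);
    which rewards define mu and sigma is an assumption of this sketch.
    """
    mu = statistics.mean(rewards)
    sigma = statistics.stdev(rewards)
    large = sum(r for r in rewards if r > mu + k * sigma)
    small = sum(r for r in rewards if mu - k * sigma > r)
    return large, small


if __name__ == "__main__":
    import random
    rng = random.Random(0)
    # Hypothetical rewards from the N(64, 13) distribution of A1/A2.
    observed = [rng.gauss(64, 13) for _ in range(200)]
    print(tail_sums(observed))
```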
<p>Next, we assessed whether the confidence participants reported about their choices differed between the two feedback versions. An independent-samples <italic>t</italic>-test revealed no significant difference in reported confidence between the two versions [<italic>p</italic> &#x0003D; 0.156, <italic>t</italic><sub>(75)</sub> &#x0003D; &#x02212;1.43, <italic>t</italic>-test].</p>
</sec>
<sec>
<title>2.4. Value Estimation</title>
<p>We then turned to the value estimation phase. Participants estimated the expected rewards of the advantageous options fairly accurately, but they significantly underestimated the expected rewards of the non-advantageous options (<xref ref-type="fig" rid="F3">Figure 3C</xref>). These results can be explained as follows. When an option is chosen frequently, participants can either track its average reward or calculate its value at the moment of estimation. Non-advantageous options, by contrast, were chosen less often, which left participants fewer observed outcomes on which to base their estimates.</p>
<p>Next, we asked whether the value estimation phase showed a bias similar to that observed in the transfer phase. A paired <italic>t</italic>-test on the estimated values revealed no significant difference between the estimates of <italic>A</italic><sub>1</sub> and <italic>A</italic><sub>2</sub> in either version. However, the mean estimate of <italic>A</italic><sub>2</sub> was numerically higher than that of <italic>A</italic><sub>1</sub> [Partial: <italic>p</italic> &#x0003D; 0.1457, <italic>t</italic><sub>(34)</sub> &#x0003D; &#x02212;1.48; Complete: <italic>p</italic> &#x0003D; 0.651, <italic>t</italic><sub>(41)</sub> &#x0003D; &#x02212;0.45; paired <italic>t</italic>-test]. To assess whether estimation variability differed between the two feedback versions, we computed the standard error of the four reported values for each stimulus. An independent-samples <italic>t</italic>-test revealed no significant difference in estimation variability between the two versions [<italic>p</italic> &#x0003D; 0.888, <italic>t</italic><sub>(75)</sub> &#x0003D; 0.141, <italic>t</italic>-test].</p>
</sec>
<sec>
<title>2.5. Comparison Effect</title>
<p>Next, we studied the effects of regret and relief on participants&#x00027; behavior. The idea behind regret and relief is that, to learn the consequences of a decision, one compares the outcome of the selected option with that of the non-selected option. This comparison triggers regret or relief, depending on whether the outcome of one&#x00027;s decision is worse or better, respectively, than the outcome of the alternative decision. People tend to avoid regret and seek relief. After experiencing regret, they are likely to switch to the other option. After experiencing relief, they are likely to select the same option again (Camille et al., <xref ref-type="bibr" rid="B8">2004</xref>; Coricelli et al., <xref ref-type="bibr" rid="B11">2005</xref>).</p>
<p>In each trial of our experiment, regret and relief were operationalized as the difference between outcomes in that trial. We used a hierarchical logistic regression to test whether the outcome difference in the previous trial influenced the decision in the current trial. The participant could select a different option (&#x0201C;switch&#x0201D;) or the same option (&#x0201C;stay&#x0201D;) as in the previous trial. The model was <italic>action</italic> &#x0007E; 1 &#x0002B; <italic>vdif</italic> &#x0002B; <italic>odif</italic> &#x0002B; <italic>cond</italic> &#x0002B; (1 &#x0002B; <italic>vdif</italic> &#x0002B; <italic>odif</italic> &#x0002B; <italic>cond</italic>|<italic>subject</italic>). Here, <italic>action</italic> is the participant&#x00027;s switching behavior (1 if the participant switched, 0 if they stayed), <italic>odif</italic> is the outcome difference in the previous trial, and <italic>vdif</italic> is the difference between the option values in the current trial. In the Complete version, the outcome difference was defined as the difference between the factual and counterfactual outcomes, (<italic>r</italic><sub><italic>FC</italic></sub> &#x02212; <italic>r</italic><sub><italic>CF</italic></sub>). In the Partial version, in which the counterfactual outcome is not shown, we used <italic>V</italic><sub><italic>CF</italic></sub> instead of <italic>r</italic><sub><italic>CF</italic></sub>. The <italic>cond</italic> variable is a categorical variable (1 for the <italic>A</italic><sub>1</sub><italic>B</italic> pair and 2 for the <italic>A</italic><sub>2</sub><italic>C</italic> pair), and <italic>subject</italic> is the random-effect grouping variable.</p>
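<p>The switch indicator and outcome-difference regressor of this model can be assembled from the trial sequence as sketched below. The sketch makes two assumptions. First, the previous trial is taken to be the previous presentation of the same option pair. Second, <monospace>v_cf</monospace> holds the value of the unchosen option, supplied by a learning model. The dictionary keys are hypothetical. Each row pairs the current trial&#x00027;s switch indicator with the previous trial&#x00027;s outcome difference.</p>

```python
from typing import Optional


def outcome_difference(r_fc: float, r_cf: Optional[float], v_cf: float) -> float:
    """odif for one trial: r_FC - r_CF (Complete) or r_FC - V_CF (Partial, r_CF unseen)."""
    if r_cf is not None:
        return r_fc - r_cf
    return r_fc - v_cf


def switch_and_odif(trials):
    """Pair each trial's switch indicator with the previous trial's odif.

    `trials` holds consecutive presentations of ONE option pair as dicts with the
    hypothetical keys 'choice', 'r_fc', 'r_cf' (None in Partial) and 'v_cf'.
    """
    rows = []
    for prev, cur in zip(trials, trials[1:]):
        switched = int(cur["choice"] != prev["choice"])  # 1 = switch, 0 = stay
        odif = outcome_difference(prev["r_fc"], prev["r_cf"], prev["v_cf"])
        rows.append((switched, odif))
    return rows


if __name__ == "__main__":
    # Hypothetical Complete-feedback sequence for the A1B pair.
    seq = [
        {"choice": "A1", "r_fc": 60.0, "r_cf": 70.0, "v_cf": 55.0},
        {"choice": "B", "r_fc": 50.0, "r_cf": 40.0, "v_cf": 60.0},
        {"choice": "B", "r_fc": 57.0, "r_cf": 66.0, "v_cf": 61.0},
    ]
    print(switch_and_odif(seq))
```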
<p>We found a significant comparison effect in the Complete version but not in the Partial version (<xref ref-type="table" rid="T1">Table 1</xref>). In the Complete version, participants were more likely to switch after regret (a negative outcome difference) and to stay after relief (a positive outcome difference). To investigate this effect further, we performed a similar analysis on log-transformed reaction times: <italic>logrt</italic> &#x0007E; 1 &#x0002B; <italic>vdif</italic> &#x0002B; <italic>odif</italic> &#x0002B; <italic>cond</italic> &#x0002B; (1 &#x0002B; <italic>vdif</italic> &#x0002B; <italic>odif</italic> &#x0002B; <italic>cond</italic>|<italic>subject</italic>). In the Complete version, but not the Partial version, reaction times were significantly modulated by the outcome difference of the previous trial: the smaller the difference, the slower the response, and vice versa (<xref ref-type="table" rid="T1">Table 1</xref>). This result is consistent with the post-error slowing phenomenon frequently reported in the decision-making literature (Jentzsch and Dudschig, <xref ref-type="bibr" rid="B22">2009</xref>; Notebaert et al., <xref ref-type="bibr" rid="B31">2009</xref>).</p>
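<p>The switch model above is a hierarchical (mixed-effects) logistic regression with by-subject random intercepts and slopes. As a simplified sketch of its fixed-effects part only, the following Python code fits a pooled logistic regression by Newton-Raphson (iteratively reweighted least squares) and returns estimates with Wald standard errors. It ignores the random effects, so it is not equivalent to the reported analysis, and a full fit would require a dedicated mixed-model package. The synthetic data, the coefficient values, and the function name <monospace>fit_logistic</monospace> are invented for illustration. The log reaction-time model uses the same design matrix and could be fitted analogously with ordinary least squares.</p>
<preformat preformat-type="code">
```python
import numpy as np


def fit_logistic(X, y, n_iter=50, tol=1e-10):
    """Pooled (fixed-effects only) logistic regression fitted by Newton-Raphson.

    X : (n, k) predictors without an intercept column; an intercept is prepended.
    y : (n,) binary outcome, here 1 = switch and 0 = stay.
    Returns (beta, se) with the intercept first and Wald standard errors.
    """
    X = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        info = X.T @ (X * (p * (1.0 - p))[:, None])   # Fisher information matrix
        step = np.linalg.solve(info, X.T @ (y - p))    # Newton step
        beta = beta + step
        if tol > np.max(np.abs(step)):
            break
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    cov = np.linalg.inv(X.T @ (X * (p * (1.0 - p))[:, None]))
    return beta, np.sqrt(np.diag(cov))


# Synthetic data in the layout action ~ 1 + vdif + odif + cond (illustrative values)
rng = np.random.default_rng(0)
n = 20000
vdif = rng.normal(0.0, 1.0, n)
odif = rng.normal(0.0, 1.0, n)
cond = rng.integers(1, 3, n)              # 1 = A1B pair, 2 = A2C pair
true_beta = np.array([-1.5, -1.0, -0.5, 0.25])
X = np.column_stack([vdif, odif, cond])
logit = true_beta[0] + X @ true_beta[1:]
switch = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

beta, se = fit_logistic(X, switch)
for name, b, s in zip(["(Intercept)", "vdif", "odif", "cond"], beta, se):
    print(f"{name:12s} {b:+.3f} (SE {s:.3f}, z = {b / s:+.2f})")
```
</preformat>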
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Effect of the previous-trial outcome difference (comparison effect) on participants&#x00027; switching behavior and reaction times.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="center" colspan="10" style="border-bottom: thin solid #000000;"><bold>Switch</bold></th>
</tr>
<tr>
<th/>
<th valign="top" align="center" colspan="4" style="border-bottom: thin solid #000000;"><bold>Partial</bold></th>
<th valign="top" align="center" colspan="5" style="border-bottom: thin solid #000000;"><bold>Complete</bold></th>
</tr>
<tr>
<th valign="top" align="left" style="border-bottom: thin solid #000000;"><bold>Name</bold></th>
<th valign="top" align="center" style="border-bottom: thin solid #000000;"><bold>Estimate</bold></th>
<th valign="top" align="center" style="border-bottom: thin solid #000000;"><bold>SE</bold></th>
<th valign="top" align="center" style="border-bottom: thin solid #000000;"><bold>t-stat</bold></th>
<th valign="top" align="center" style="border-bottom: thin solid #000000;"><italic><bold>p</bold></italic><bold>-value</bold></th>
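<th style="border-bottom: thin solid #000000;"/>
<th valign="top" align="center" style="border-bottom: thin solid #000000;"><bold>Estimate</bold></th>
<th valign="top" align="center" style="border-bottom: thin solid #000000;"><bold>SE</bold></th>
<th valign="top" align="center" style="border-bottom: thin solid #000000;"><bold>t-stat</bold></th>
<th valign="top" align="left" style="border-bottom: thin solid #000000;"><italic><bold>p</bold></italic><bold>-value</bold></th>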
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">(Intercept)</td>
<td valign="top" align="center">&#x02212;1.5528097</td>
<td valign="top" align="center">0.10551335</td>
<td valign="top" align="center">&#x02212;14.716713</td>
<td valign="top" align="center">2.69E-48</td>
<td/>
<td valign="top" align="center">&#x02212;2.724902</td>
<td valign="top" align="center">0.17598005</td>
<td valign="top" align="center">&#x02212;15.484153</td>
<td valign="top" align="left">2.28E-53</td>
</tr>
<tr>
<td valign="top" align="left">Outcome difference</td>
<td valign="top" align="center">&#x02212;0.0879529</td>
<td valign="top" align="center">0.0567055</td>
<td valign="top" align="center">&#x02212;1.5510467</td>
<td valign="top" align="center">1.21E-01</td>
<td/>
<td valign="top" align="center">&#x02212;0.5462195</td>
<td valign="top" align="center">0.06292942</td>
<td valign="top" align="center">&#x02212;8.6798744</td>
<td valign="top" align="left">4.68E-18</td>
</tr>
<tr>
<td valign="top" align="left">Value difference</td>
<td valign="top" align="center">&#x02212;1.123403</td>
<td valign="top" align="center">0.08767908</td>
<td valign="top" align="center">&#x02212;12.812668</td>
<td valign="top" align="center">3.67E-37</td>
<td/>
<td valign="top" align="center">&#x02212;0.9158512</td>
<td valign="top" align="center">0.06505058</td>
<td valign="top" align="center">&#x02212;14.079062</td>
<td valign="top" align="left">1.57E-44</td>
</tr>
<tr>
<td valign="top" align="left" style="border-bottom: thin solid #000000;">Condition</td>
<td valign="top" align="center" style="border-bottom: thin solid #000000;">0.25688705</td>
<td valign="top" align="center" style="border-bottom: thin solid #000000;">0.08761796</td>
<td valign="top" align="center" style="border-bottom: thin solid #000000;">2.93189954</td>
<td valign="top" align="center" style="border-bottom: thin solid #000000;">0.00337999</td>
<td style="border-bottom: thin solid #000000;"/>
<td valign="top" align="center" style="border-bottom: thin solid #000000;">0.25104619</td>
<td valign="top" align="center" style="border-bottom: thin solid #000000;">0.12809207</td>
<td valign="top" align="center" style="border-bottom: thin solid #000000;">1.95988856</td>
<td valign="top" align="left" style="border-bottom: thin solid #000000;">0.05004073</td>
</tr> <tr>
<td valign="top" align="left" colspan="10" style="border-bottom: thin solid #000000;"><bold>Reaction time</bold></td>
</tr>
<tr>
<td/>
<td valign="top" align="center" colspan="4" style="border-bottom: thin solid #000000;"><bold>Partial</bold></td>
<td valign="top" align="center" colspan="5" style="border-bottom: thin solid #000000;"><bold>Complete</bold></td>
</tr>
<tr>
<td valign="top" align="left" style="border-bottom: thin solid #000000;"><bold>Name</bold></td>
<td valign="top" align="center" style="border-bottom: thin solid #000000;"><bold>Estimate</bold></td>
<td valign="top" align="center" style="border-bottom: thin solid #000000;"><bold>SE</bold></td>
<td valign="top" align="center" style="border-bottom: thin solid #000000;"><bold>t-stat</bold></td>
<td valign="top" align="center" style="border-bottom: thin solid #000000;"><italic><bold>p</bold></italic><bold>-value</bold></td>
<td style="border-bottom: thin solid #000000;"/>
<td valign="top" align="center" style="border-bottom: thin solid #000000;"><bold>Estimate</bold></td>
<td valign="top" align="center" style="border-bottom: thin solid #000000;"><bold>SE</bold></td>
<td valign="top" align="center" style="border-bottom: thin solid #000000;"><bold>t-stat</bold></td>
<td valign="top" align="left" style="border-bottom: thin solid #000000;"><italic><bold>p</bold></italic><bold>-value</bold></td>
</tr> <tr>
<td valign="top" align="left">(Intercept)</td>
<td valign="top" align="center">&#x02212;0.1164283</td>
<td valign="top" align="center">0.03073684</td>
<td valign="top" align="center">&#x02212;3.7879077</td>
<td valign="top" align="center">0.00015321</td>
<td/>
<td valign="top" align="center">&#x02212;0.1211333</td>
<td valign="top" align="center">0.03585658</td>
<td valign="top" align="center">&#x02212;3.3782727</td>
<td valign="top" align="left">0.00073263</td>
</tr>
<tr>
<td valign="top" align="left">Outcome difference</td>
<td valign="top" align="center">0.01123051</td>
<td valign="top" align="center">0.00651389</td>
<td valign="top" align="center">1.72408744</td>
<td valign="top" align="center">0.08473669</td>
<td/>
<td valign="top" align="center">&#x02212;0.0164905</td>
<td valign="top" align="center">0.00526292</td>
<td valign="top" align="center">&#x02212;3.1333433</td>
<td valign="top" align="left">0.00173402</td>
</tr>
<tr>
<td valign="top" align="left">Value difference</td>
<td valign="top" align="center">&#x02212;0.0699353</td>
<td valign="top" align="center">0.0101347</td>
<td valign="top" align="center">&#x02212;6.9005836</td>
<td valign="top" align="center">5.64E-12</td>
<td/>
<td valign="top" align="center">&#x02212;0.0698999</td>
<td valign="top" align="center">0.01654412</td>
<td valign="top" align="center">&#x02212;4.2250579</td>
<td valign="top" align="left">2.41E-05</td>
</tr>
<tr>
<td valign="top" align="left">Condition</td>
<td valign="top" align="center">0.04191482</td>
<td valign="top" align="center">0.02390541</td>
<td valign="top" align="center">1.75336139</td>
<td valign="top" align="center">0.07958424</td>
<td/>
<td valign="top" align="center">0.03658956</td>
<td valign="top" align="center">0.02364193</td>
<td valign="top" align="center">1.54765513</td>
<td valign="top" align="left">0.12174177</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>Hierarchical logistic regression and hierarchical linear regression analyses were performed on participants&#x00027; switching behavior and log-transformed reaction times, respectively. In addition to the outcome difference of the previous trial as the main regressor, the current value difference between the two paired options and the condition type (A<sub>1</sub>B, A<sub>2</sub>C) were included as control regressors. The results show that participants&#x00027; current choices and reaction times were significantly influenced by the outcome differences of their previous choices in the Complete, but not the Partial, feedback version</italic>.</p>
</table-wrap-foot>
</table-wrap>
</sec>
<sec>
<title>2.6. Opposing Learning Model (OL)</title>
<p>In the following, we introduce a novel reinforcement learning model, the Opposing Learning (OL) model, which is adapted from the standard Q-learning model and inspired by the effect of dopamine on the striatum. We first introduce the basic model for the Partial feedback version and then extend it to the Complete feedback version.</p>
<sec>
<title>2.6.1. Model Description</title>
<p>Our model focuses on the chosen option: value updating is driven solely by the prediction error of the chosen option. After each choice, this chosen prediction error simultaneously updates the chosen and unchosen values in opposite directions. A positive prediction error increases the chosen value and decreases the unchosen value, and a negative prediction error does the reverse.</p>
<disp-formula id="E1"><mml:math id="M1"><mml:mrow><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<disp-formula id="E2"><mml:math id="M2"><mml:mrow><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<p>where <italic>ch</italic> refers to the <italic>chosen</italic> option, <italic>un</italic> refers to the <italic>unchosen</italic> option, and &#x003B4;<sub><italic>ch</italic></sub> &#x0003D; <italic>r</italic><sub><italic>ch</italic></sub> &#x02212; <italic>Q</italic><sub><italic>ch</italic></sub> is the prediction error of the chosen option. Choices are made according to the softmax rule, <inline-formula><mml:math id="M3"><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x0002B;</mml:mo><mml:msup><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B2;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:mfrac></mml:math></inline-formula>, where &#x003B2; is the inverse temperature parameter. The model equations are inspired by the effect of dopamine on the striatum. The striatum contains D1 and D2 spiny projection neurons (SPNs), which have been proposed to encode the values of the chosen and unchosen options, respectively. The prediction error enters both the chosen and the unchosen value updates because dopamine release is diffuse and thus non-selective. The opposite signs of the prediction error in the two update equations reflect the opposite effects of dopamine on D1- and D2-SPNs (<xref ref-type="fig" rid="F4">Figure 4</xref>).</p>
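<p>The update and choice rules above can be sketched in Python as follows for a single pair of options. This is an illustrative sketch rather than the authors&#x00027; implementation; the function names and the 0/1 option indexing are assumptions. Setting <monospace>alpha2</monospace> to zero removes the update of the unchosen value and recovers standard Q-learning.</p>
<preformat preformat-type="code">
```python
import numpy as np


def choice_prob(q_ch, q_un, beta):
    """Softmax probability of choosing option ch over option un."""
    return 1.0 / (1.0 + np.exp(beta * (q_un - q_ch)))


def ol_update(q, ch, r_ch, alpha1, alpha2):
    """One Opposing Learning update for a pair of options.

    q      : current values [Q_0, Q_1]; a modified copy is returned
    ch     : index of the chosen option (0 or 1)
    r_ch   : outcome of the chosen option
    alpha1 : learning rate of the chosen value
    alpha2 : learning rate of the opposite-signed unchosen update
    """
    q = np.array(q, dtype=float)
    un = 1 - ch
    delta = r_ch - q[ch]        # prediction error of the chosen option only
    q[ch] += alpha1 * delta     # chosen value moves with the prediction error
    q[un] -= alpha2 * delta     # unchosen value moves against it
    return q


q = ol_update([0.0, 0.0], ch=0, r_ch=10.0, alpha1=0.2, alpha2=0.2)
print(q.tolist())                          # [2.0, -2.0]
print(choice_prob(q[0], q[1], beta=0.1))   # about 0.599
```
</preformat>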
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>The schematic of the OL model and its extension. <bold>(A)</bold> A common strategy in value learning tasks, especially when counterfactual outcomes are also provided, is to compare competing outcomes. This comparison triggers regret (relief), which subsequently drives avoidance (approach) behavior. A hybrid strategy that combines the tendency to minimize regret (and maximize relief) with the tendency to maximize expected rewards accounts for the behavioral data better than either strategy alone. The weights assigned to each strategy (maximizing expected rewards and minimizing regret) determine the extent of their effect on behavior. <bold>(B)</bold> The idea behind the OL model comes from the opposing effects of dopamine on two distinct populations of spiny projection neurons (viz., D1 and D2), which have been proposed to encode the values of the chosen and unchosen options, respectively. Dopamine promotes the activity of the former and inhibits that of the latter. Similarly, in the model, the chosen prediction error plays opposing roles in updating the chosen and unchosen option values: a positive prediction error strengthens the former and weakens the latter.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-631347-g0004.tif"/>
</fig>
</sec>
<sec>
<title>2.6.2. Contextual Effect in the OL Model</title>
<p>In the OL model, the chosen and unchosen values are coupled and therefore not independent. We measured the correlation between these two values in simulation. The correlation was negative and approximately equal to the negative ratio of the two learning rates (<xref ref-type="fig" rid="F5">Figure 5B</xref>):</p>
<disp-formula id="E3"><mml:math id="M4"><mml:mrow><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02248;</mml:mo><mml:mo>-</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula>
<p>When &#x003B1;<sub>2</sub> increases from 0 to &#x003B1;<sub>1</sub>, the correlation between <italic>Q</italic><sub>1</sub> and <italic>Q</italic><sub>2</sub> changes from 0 to &#x02212;1, and the encoding shifts from almost fully absolute to almost fully relative. <xref ref-type="fig" rid="F5">Figure 5A</xref> shows how <italic>Q</italic><sub>1</sub> and <italic>Q</italic><sub>2</sub> move from orthogonal to fully negatively correlated. In simulations, an agent with &#x003B1;<sub>2</sub> &#x0003D; 0 shows no contextual effect, an agent with 0 &#x0003C; &#x003B1;<sub>2</sub> &#x0003C; &#x003B1;<sub>1</sub> shows a moderate and temporary contextual effect, and an agent with &#x003B1;<sub>2</sub> &#x0003D; &#x003B1;<sub>1</sub> shows a large and permanent contextual effect (<xref ref-type="fig" rid="F5">Figure 5C</xref>).</p>
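<p>The link between &#x003B1;<sub>2</sub> and the correlation of the two values can be checked with a short simulation. The Python sketch below runs a single OL agent on one pair of Gaussian options and computes the correlation between the two value trajectories after a burn-in period. The reward means and parameters follow the first pair of the Figure 5 settings (means 7 and 5, SD 1, &#x003B2; &#x0003D; 0.1, &#x003B1;<sub>1</sub> &#x0003D; 0.2). The trial count, burn-in length, seed, zero initial values, and function names are arbitrary choices for illustration. With &#x003B1;<sub>2</sub> &#x0003D; &#x003B1;<sub>1</sub>, each update changes the two values by equal and opposite amounts, so their sum is conserved and the correlation is exactly &#x02212;1.</p>
<preformat preformat-type="code">
```python
import numpy as np


def simulate_ol(mu, alpha1, alpha2, beta, n_trials, sd=1.0, seed=0):
    """Simulate an OL agent on one pair of Gaussian options; return Q history."""
    rng = np.random.default_rng(seed)
    q = np.zeros(2)                       # initial values (assumed zero)
    history = np.empty((n_trials, 2))
    for t in range(n_trials):
        p0 = 1.0 / (1.0 + np.exp(beta * (q[1] - q[0])))  # softmax P(option 0)
        ch = int(rng.random() > p0)       # option 0 with probability p0, else 1
        un = 1 - ch
        delta = rng.normal(mu[ch], sd) - q[ch]
        q[ch] += alpha1 * delta
        q[un] -= alpha2 * delta
        history[t] = q
    return history


def value_correlation(history, burn_in=500):
    """Pearson correlation between Q_0 and Q_1 after discarding a burn-in."""
    h = history[burn_in:]
    return np.corrcoef(h[:, 0], h[:, 1])[0, 1]


for a2 in (0.0, 0.1, 0.18, 0.2):
    h = simulate_ol(mu=(7.0, 5.0), alpha1=0.2, alpha2=a2, beta=0.1, n_trials=20000)
    print(f"alpha2 = {a2:.2f}: corr = {value_correlation(h):+.3f}, "
          f"-alpha2/alpha1 = {-a2 / 0.2:+.3f}")
```
</preformat>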
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Correlation between two competing option values estimated by the OL model. <bold>(A)</bold> When &#x003B1;<sub>2</sub> &#x0003D; 0, the two estimated values equal their absolute values and are orthogonal. As &#x003B1;<sub>2</sub> approaches &#x003B1;<sub>1</sub>, the estimated values for each pair become increasingly (negatively) correlated, and when &#x003B1;<sub>2</sub> &#x0003D; &#x003B1;<sub>1</sub>, they are almost perfectly anticorrelated (<italic>corr</italic> &#x02248; &#x02212;1). <bold>(B)</bold> Correlation between two paired option values as a function of &#x02212;&#x003B1;<sub>2</sub>/&#x003B1;<sub>1</sub>. <bold>(C)</bold> The difference between the estimated values of A1 and A2 (contextual bias) emerges as &#x003B1;<sub>2</sub> increases. <italic>Q</italic>-values are shown at the top and their differences at the bottom. The simulation was performed using two different pairs of options [<inline-formula><mml:math id="M66"><mml:mi mathvariant='script'>N</mml:mi></mml:math></inline-formula>(7, 1), <inline-formula><mml:math id="M54"><mml:mi mathvariant='script'>N</mml:mi></mml:math></inline-formula>(5, 1)] and [<inline-formula><mml:math id="M55"><mml:mi mathvariant='script'>N</mml:mi></mml:math></inline-formula>(7, 1), <inline-formula><mml:math id="M56"><mml:mi mathvariant='script'>N</mml:mi></mml:math></inline-formula>(3, 1)], with &#x003B2; &#x0003D; 0.1, &#x003B1;<sub>1</sub> &#x0003D; 0.2, and four values of &#x003B1;<sub>2</sub> (0, 0.1, 0.18, and 0.2).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-631347-g0005.tif"/>
</fig>
</sec>
<sec>
<title>2.6.3. Performance of the OL Model</title>
<p>We performed a simulation analysis to study the behavior of the OL model. First, as expected for a reinforcement learning model, the OL model performs better as the difference between competing option values increases (<xref ref-type="supplementary-material" rid="SM1">Supplementary Figure 2</xref>). Second, we studied the effect of the parameter &#x003B1;<sub>2</sub> on agents&#x00027; learning performance. Average performance was better when &#x003B1;<sub>2</sub> &#x0003E; 0 than when &#x003B1;<sub>2</sub> &#x0003D; 0 (SQL model), and it increased further as &#x003B1;<sub>2</sub> increased (<xref ref-type="fig" rid="F6">Figures 6A,B</xref>). This improvement arises from the inhibitory effect of the chosen prediction error on the unchosen value, which increases the contrast between the two competing option values and thus improves performance (<xref ref-type="fig" rid="F6">Figure 6A</xref>). Note that these results hold only when the parameter &#x003B2; lies within a reasonable range (<xref ref-type="fig" rid="F6">Figure 6A</xref>). (For details about the simulation, see Section 4.4.)</p>
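<p>To illustrate the performance comparison, the Python sketch below estimates the proportion of trials on which an agent chooses the better option for &#x003B1;<sub>2</sub> &#x0003D; 0 (SQL), &#x003B1;<sub>2</sub> &#x0003D; &#x003B1;<sub>1</sub>/2, and &#x003B1;<sub>2</sub> &#x0003D; &#x003B1;<sub>1</sub> (OL<sub>1</sub>). The option distributions and &#x003B2; follow the Figure 6B settings (Gaussian options with means 10 and 7, SD 1, &#x003B2; &#x0003D; 0.1). The learning rate &#x003B1;<sub>1</sub> &#x0003D; 0.2, the numbers of trials and runs, the zero initial values, the function name, and the use of choice accuracy as the performance measure are assumptions for illustration and may differ from the simulation described in Section 4.4.</p>
<preformat preformat-type="code">
```python
import numpy as np


def ol_accuracy(alpha1, alpha2, beta, mu=(10.0, 7.0), sd=1.0,
                n_trials=200, n_runs=100, seed=0):
    """Fraction of choices of option 0 (the higher-mean option) across runs."""
    rng = np.random.default_rng(seed)
    n_correct = 0
    for _ in range(n_runs):
        q = np.zeros(2)                   # values reset at the start of each run
        for _ in range(n_trials):
            p0 = 1.0 / (1.0 + np.exp(beta * (q[1] - q[0])))   # softmax P(option 0)
            ch = int(rng.random() > p0)
            un = 1 - ch
            delta = rng.normal(mu[ch], sd) - q[ch]
            q[ch] += alpha1 * delta       # chosen value update
            q[un] -= alpha2 * delta       # opposing update of the unchosen value
            n_correct += int(ch == 0)
    return n_correct / (n_runs * n_trials)


for ratio in (0.0, 0.5, 1.0):
    acc = ol_accuracy(alpha1=0.2, alpha2=0.2 * ratio, beta=0.1)
    print(f"alpha2/alpha1 = {ratio:.1f}: accuracy = {acc:.3f}")
```
</preformat>
<p>This sketch fixes &#x003B2; at 0.1; as noted above, the advantage of &#x003B1;<sub>2</sub> &#x0003E; 0 is expected only within a reasonable range of &#x003B2;.</p>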
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Comparison of OL and SQL model performance. <bold>(A)</bold> As &#x003B1;<sub>2</sub>/&#x003B1;<sub>1</sub> goes from 0 (SQL) to 1 (OL<sub>1</sub>), the performance peak shifts to the left, toward smaller and more reasonable values of &#x003B2;. In this &#x003B2; range, performance peaks where &#x003B1;<sub>2</sub>/&#x003B1;<sub>1</sub> is higher. The larger &#x003B2; is, the larger the behavioral variance. Performance was averaged across all task settings and different ranges of &#x003B1;<sub>2</sub>/&#x003B1;<sub>1</sub>. <bold>(B)</bold> This heat map shows that performance increases with &#x003B1;<sub>2</sub>/&#x003B1;<sub>1</sub>. This simulation was performed using two different pairs of options [<inline-formula><mml:math id="M57"><mml:mi mathvariant='script'>N</mml:mi></mml:math></inline-formula>(10, 1), <inline-formula><mml:math id="M58"><mml:mi mathvariant='script'>N</mml:mi></mml:math></inline-formula>(7, 1)] with &#x003B2; &#x0003D; 0.1. &#x003BC; and &#x003C3; stand for the mean and standard deviation of the performance, respectively.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-631347-g0006.tif"/>
</fig>
</sec>
<sec>
<title>2.6.4. Extending the OL Model</title>
<p>Several studies have shown that, when counterfactual outcomes are available (as in the Complete feedback version of our task), the quantity encoded by dopamine is not the simple prediction error alone, but rather a combination of the simple prediction error and the counterfactual prediction error (i.e., the prediction error of the outcome difference; Kishida et al., <xref ref-type="bibr" rid="B25">2016</xref>). Furthermore, incorporating the outcome difference into the learning procedure has been shown to help models better account for behavioral (Bavard et al., <xref ref-type="bibr" rid="B4">2018</xref>) and physiological (Coricelli et al., <xref ref-type="bibr" rid="B12">2007</xref>) data. Accordingly, we replaced the reward term with a hybrid combination of the absolute reward (<italic>r</italic><sub><italic>FC</italic></sub>) and the relative reward (<italic>r</italic><sub><italic>FC</italic></sub> &#x02212; <italic>r</italic><sub><italic>CF</italic></sub>, the outcome difference; <xref ref-type="fig" rid="F4">Figure 4A</xref>). Recall that the outcome difference played a significant role in participants&#x00027; switching behavior in the Complete feedback version (see Section 2.5). The update equations of the extended OL model are identical to those of the original OL model; only the prediction error is redefined as follows:</p>
<disp-formula id="E4"><mml:math id="M5"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi><mml:mi>y</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<disp-formula id="E5"><mml:math id="M6"><mml:mrow><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi><mml:mi>y</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>w</mml:mi><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>b</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mi>w</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>l</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<disp-formula id="E6"><mml:math id="M7"><mml:mrow><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>b</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>F</mml:mi><mml:mi>C</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mtext>&#x02003;</mml:mtext><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>l</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>F</mml:mi><mml:mi>C</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mi>F</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<p>where <italic>w</italic> is the weight of the absolute strategy and 1 &#x02212; <italic>w</italic> is the weight of the relative strategy.</p>
<p>Substituting these definitions gives <italic>r</italic><sub><italic>hyb</italic></sub> &#x0003D; <italic>r</italic><sub><italic>FC</italic></sub> &#x02212; (1 &#x02212; <italic>w</italic>)<italic>r</italic><sub><italic>CF</italic></sub>. Hence, the extended model is an instance of the original model in which the mean rewards (&#x003BC;<sub>1</sub> and &#x003BC;<sub>2</sub>) are replaced by <inline-formula><mml:math id="M8"><mml:msubsup><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mi>w</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula><mml:math id="M9"><mml:msubsup><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mi>w</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>.
Note that since <inline-formula><mml:math id="M10"><mml:msubsup><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>-</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>2</mml:mn><mml:mo>-</mml:mo><mml:mi>w</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, the extended OL model behaves like a simple OL model in which the difference between the means is scaled by the factor 2 &#x02212; <italic>w</italic>, which lies between 1 and 2 for 0 &#x02264; <italic>w</italic> &#x02264; 1. Thus, this modification does not change the main characteristics of OL behavior, and the extended OL model preserves all of the properties described above. This illustrates how, by defining an appropriate prediction error, the OL model can be extended to other feedback conditions.</p>
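<p>The hybrid prediction error can be sketched in Python as follows. Substituting <italic>r</italic><sub><italic>abs</italic></sub> and <italic>r</italic><sub><italic>rlt</italic></sub> into the definition of <italic>r</italic><sub><italic>hyb</italic></sub> gives <italic>r</italic><sub><italic>hyb</italic></sub> &#x0003D; <italic>r</italic><sub><italic>FC</italic></sub> &#x02212; (1 &#x02212; <italic>w</italic>)<italic>r</italic><sub><italic>CF</italic></sub>, so when option 1 is chosen, its expected hybrid reward is &#x003BC;<sub>1</sub> &#x02212; (1 &#x02212; <italic>w</italic>)&#x003BC;<sub>2</sub>. The Monte Carlo check at the end confirms this numerically. The function names, the 0/1 option indexing, and the example means and weight are illustrative assumptions.</p>
<preformat preformat-type="code">
```python
import numpy as np


def hybrid_reward(r_fc, r_cf, w):
    """r_hyb = w * r_abs + (1 - w) * r_rlt, with r_abs = r_FC and r_rlt = r_FC - r_CF."""
    r_abs = r_fc
    r_rlt = r_fc - r_cf
    return w * r_abs + (1.0 - w) * r_rlt


def extended_ol_update(q, ch, r_fc, r_cf, alpha1, alpha2, w):
    """OL update driven by the hybrid prediction error (Complete feedback)."""
    q = np.array(q, dtype=float)
    delta = hybrid_reward(r_fc, r_cf, w) - q[ch]
    q[ch] += alpha1 * delta       # chosen value moves with the hybrid error
    q[1 - ch] -= alpha2 * delta   # unchosen value moves against it
    return q


# Monte Carlo check: expected hybrid reward when the option with mean mu1 is chosen
rng = np.random.default_rng(1)
mu1, mu2, w = 7.0, 5.0, 0.6
r_chosen = rng.normal(mu1, 1.0, 200_000)
r_other = rng.normal(mu2, 1.0, 200_000)
empirical = hybrid_reward(r_chosen, r_other, w).mean()
analytic = mu1 - (1.0 - w) * mu2
print(f"empirical {empirical:.3f} vs analytic {analytic:.3f}")
```
</preformat>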
</sec>
</sec>
<sec>
<title>2.7. Model Comparison</title>
<sec>
<title>2.7.1. Model Fitting and Model Validation</title>
<p>In this part of the analysis, we compared models in two ways: model fitting (learning phase) and model validation (transfer phase). Our model space consisted of benchmark models and models designed to explain context-dependent value learning. The main model space included the standard Q-learning model (SQL), the reference-point model (RP) (Palminteri et al., <xref ref-type="bibr" rid="B32">2015</xref>), the difference model (Klein et al., <xref ref-type="bibr" rid="B26">2017</xref>), and the hybrid model (Bavard et al., <xref ref-type="bibr" rid="B4">2018</xref>). The same analysis was also performed on an extended model space, which, in addition to the models named above, included the forgetting reinforcement learning model (FQL) (Barraclough et al., <xref ref-type="bibr" rid="B2">2004</xref>; Ito and Doya, <xref ref-type="bibr" rid="B21">2009</xref>; Katahira, <xref ref-type="bibr" rid="B23">2015</xref>; Niv et al., <xref ref-type="bibr" rid="B29">2015</xref>; Kato and Morita, <xref ref-type="bibr" rid="B24">2016</xref>), the experience-weighted attraction model (EWA) (Camerer and Hua Ho, <xref ref-type="bibr" rid="B7">1999</xref>), the sample-based episodic memory model (SBE) (Bornstein et al., <xref ref-type="bibr" rid="B5">2017</xref>), and the RelAsym model (Garcia et al., <xref ref-type="bibr" rid="B20">2021</xref>; Ting et al., <xref ref-type="bibr" rid="B48">2021</xref>, <xref ref-type="supplementary-material" rid="SM1">Supplementary Tables 2&#x02013;4</xref>).</p>
<p>All models except the difference model, which exists only in a Complete version, had both Partial and Complete feedback versions. The OL model had two variants: OL<sub>1</sub>, in which the chosen and unchosen options share the same learning rate (&#x003B1;<sub>1</sub> &#x0003D; &#x003B1;<sub>2</sub>), and OL<sub>2</sub>, in which they have different learning rates. For details of the models, see Section 4.</p>
<p>For the learning phase, we fitted each model to each participant separately and computed exceedance probabilities (xp) and protected exceedance probabilities (pxp). For the transfer phase, we computed the negative log-likelihood across all iterations. The model comparison showed that the OL<sub>1</sub> model (in both the Partial and Complete versions) fit the data better in the learning phase and also predicted the data better in the transfer phase (<xref ref-type="table" rid="T2">Table 2</xref>). In addition to the model fitting analysis, we applied all of the behavioral analyses from the Performance and Contextual effect sections to simulated data. Simulations for each participant and model used that participant&#x00027;s best-fitting parameters (averaged over 100 repetitions).</p>
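<p>For readers unfamiliar with exceedance probabilities, the following Python sketch implements the standard variational random-effects Bayesian model selection scheme (Stephan et al., 2009). In this scheme, each participant&#x00027;s log model evidence updates a Dirichlet posterior over model frequencies, and the exceedance probability of a model is the posterior probability that its frequency is the largest. It is an assumption that the values in Table 2 were obtained with an equivalent procedure, and the choice of log evidence approximation (for example, &#x02212;BIC/2) is left to the caller. Protected exceedance probabilities additionally require the Bayesian omnibus risk and are not computed here. The digamma function is implemented locally so that the sketch depends on NumPy only, and the function names are hypothetical.</p>
<preformat preformat-type="code">
```python
import numpy as np


def digamma(x):
    """Digamma function via recurrence plus asymptotic series (x assumed positive)."""
    x = np.asarray(x, dtype=float)
    acc = np.zeros_like(x)
    for _ in range(6):                    # shift upward: psi(x) = psi(x + 1) - 1/x
        acc = acc - 1.0 / x
        x = x + 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return acc + np.log(x) - 0.5 / x - series


def exceedance_probabilities(log_evidence, n_samples=100_000, max_iter=500,
                             tol=1e-8, seed=0):
    """Random-effects BMS: Dirichlet posterior counts and exceedance probabilities.

    log_evidence : (n_subjects, n_models) array of log model evidence.
    """
    lme = np.asarray(log_evidence, dtype=float)
    n_models = lme.shape[1]
    alpha0 = np.ones(n_models)            # flat Dirichlet prior
    alpha = alpha0.copy()
    for _ in range(max_iter):
        log_u = lme + digamma(alpha) - digamma(alpha.sum())
        log_u -= log_u.max(axis=1, keepdims=True)       # avoid overflow
        g = np.exp(log_u)
        g /= g.sum(axis=1, keepdims=True)               # per-subject model posteriors
        new_alpha = alpha0 + g.sum(axis=0)
        converged = tol > np.max(np.abs(new_alpha - alpha))
        alpha = new_alpha
        if converged:
            break
    # xp_k = P(model k has the highest frequency), estimated by Dirichlet sampling
    rng = np.random.default_rng(seed)
    freqs = rng.dirichlet(alpha, size=n_samples)
    winners = np.argmax(freqs, axis=1)
    xp = np.bincount(winners, minlength=n_models) / n_samples
    return alpha, xp


# Toy example: 20 participants, model 1 favored by 5 log units in every participant
lme = np.zeros((20, 3))
lme[:, 1] += 5.0
alpha, xp = exceedance_probabilities(lme)
print(alpha.round(2), xp)
```
</preformat>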
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Model comparison: model fitting and model prediction.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th style="border-bottom: thin solid #000000;"/>
<th valign="top" align="center" style="border-bottom: thin solid #000000;"><bold>SQL</bold></th>
<th valign="top" align="center" style="border-bottom: thin solid #000000;"><bold>RPA</bold></th>
<th valign="top" align="center" style="border-bottom: thin solid #000000;"><bold>Dif</bold></th>
<th valign="top" align="center" style="border-bottom: thin solid #000000;"><bold>Hyb</bold></th>
<th valign="top" align="center" style="border-bottom: thin solid #000000;"><bold>OL<sub>1</sub></bold></th>
<th valign="top" align="center" style="border-bottom: thin solid #000000;"><bold>OL<sub>2</sub></bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" colspan="7" style="background-color:#bcbdc0"><bold>FITTING (LEARNING PHASE)</bold></td>
</tr>
<tr>
<td valign="top" align="left" colspan="7"><bold>Partial</bold></td>
</tr>
<tr>
<td valign="top" align="left">xp</td>
<td valign="top" align="center">2<italic><bold>e</bold></italic>&#x02212;05</td>
<td valign="top" align="center">0</td>
<td/>
<td valign="top" align="center">0</td>
<td valign="top" align="center">0.99998</td>
<td valign="top" align="center">0</td>
</tr>
<tr>
<td valign="top" align="left">pxp</td>
<td valign="top" align="center">2.0047<italic><bold>e</bold></italic>&#x02212;05</td>
<td valign="top" align="center">4.7129<italic><bold>e</bold></italic>&#x02212;08</td>
<td/>
<td valign="top" align="center">4.7129<italic><bold>e</bold></italic>&#x02212;08</td>
<td valign="top" align="center">0.99998</td>
<td valign="top" align="center">4.7129<italic><bold>e</bold></italic>&#x02212;08</td>
</tr>
<tr>
<td valign="top" align="left" colspan="7"><bold>Complete</bold></td>
</tr>
<tr>
<td valign="top" align="left">xp</td>
<td valign="top" align="center">0.001594</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">0.16604</td>
<td valign="top" align="center">0.000685</td>
<td valign="top" align="center">0.66409</td>
<td valign="top" align="center">1<italic><bold>e</bold></italic>&#x02212;06</td>
</tr>
<tr>
<td valign="top" align="left">pxp</td>
<td valign="top" align="center">0.0024225</td>
<td valign="top" align="center">0.00083783</td>
<td valign="top" align="center">0.16591</td>
<td valign="top" align="center">0.0015188</td>
<td valign="top" align="center">0.66104</td>
<td valign="top" align="center">0.00083883</td>
</tr>
<tr>
<td valign="top" align="left" colspan="7" style="background-color:#bcbdc0"><bold>PREDICTION (TRANSFER PHASE)&#x02014;ALL ITERATIONS</bold></td>
</tr>
<tr>
<td valign="top" align="left" colspan="7"><bold>Partial</bold></td>
</tr>
<tr>
<td valign="top" align="left"><italic><bold>A</bold></italic><sub><bold>1</bold></sub><italic><bold>A</bold></italic><sub><bold>2</bold></sub></td>
<td valign="top" align="center">2.77 &#x000B1; 0.16</td>
<td valign="top" align="center">2.83 &#x000B1; 0.22</td>
<td/>
<td valign="top" align="center">2.88 &#x000B1; 0.21</td>
<td valign="top" align="center">2.51 &#x000B1; 0.14</td>
<td valign="top" align="center">2.63 &#x000B1; 0.13</td>
</tr>
<tr>
<td valign="top" align="left">all</td>
<td valign="top" align="center">9.15 &#x000B1; 0.55</td>
<td valign="top" align="center">9.05 &#x000B1; 0.52</td>
<td/>
<td valign="top" align="center">9.27 &#x000B1; 0.53</td>
<td valign="top" align="center">8.99 &#x000B1; 0.63</td>
<td valign="top" align="center">9.12 &#x000B1; 0.6</td>
</tr>
<tr>
<td valign="top" align="left" colspan="7"><bold>Complete</bold></td>
</tr>
<tr>
<td valign="top" align="left"><italic><bold>A</bold></italic><sub><bold>1</bold></sub><italic><bold>A</bold></italic><sub><bold>2</bold></sub></td>
<td valign="top" align="center">4.69 &#x000B1; 0.84</td>
<td valign="top" align="center">4.8 &#x000B1; 0.85</td>
<td valign="top" align="center">4.2 &#x000B1; 0.75</td>
<td valign="top" align="center">4.2 &#x000B1; 0.64</td>
<td valign="top" align="center">3.49 &#x000B1; 0.44</td>
<td valign="top" align="center">3.5 &#x000B1; 0.45</td>
</tr>
<tr>
<td valign="top" align="left">all</td>
<td valign="top" align="center">15.42 &#x000B1; 2.82</td>
<td valign="top" align="center">14.11 &#x000B1; 2.01</td>
<td valign="top" align="center">12.88 &#x000B1; 1.91</td>
<td valign="top" align="center">14.06 &#x000B1; 1.86</td>
<td valign="top" align="center">12.26 &#x000B1; 2.05</td>
<td valign="top" align="center">12.27 &#x000B1; 1.85</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>Fitting: Bayesian exceedance probability (xp) (Stephan et al., <xref ref-type="bibr" rid="B43">2009</xref>) and protected exceedance probability (pxp) (Rigoux et al., <xref ref-type="bibr" rid="B39">2014</xref>) of the learning phase. Prediction: negative log-likelihood (nll) of A<sub>1</sub>A<sub>2</sub> and all six combinations of the transfer phase separately. Mean&#x000B1;SEM</italic>.</p>
</table-wrap-foot>
</table-wrap>
<p>This analysis showed that the <italic>OL</italic><sub>1</sub> model was able to generate all key signatures of the behavioral data (<xref ref-type="fig" rid="F3">Figures 3A,B</xref>). In the learning phase, agents&#x00027; performance was above the chance level of 0.5 (Partial: performance = 0.6637&#x000B1;0.0627; Complete: performance = 0.8857&#x000B1;0.0639; <xref ref-type="fig" rid="F3">Figure 3A</xref>), and learning-phase performance was significantly higher in the Complete version than in the Partial version [<italic>p</italic> &#x0003D; 4.4086<italic>e</italic> &#x02212; 25, <italic>t</italic><sub>(75)</sub> &#x0003D; 15.3079, one-tailed <italic>t</italic>-test]. In the transfer phase, agents&#x00027; performance was also significantly above chance (0.5) [Partial: performance = 0.8238&#x000B1;0.1429, <italic>t</italic>-test, <italic>p</italic> &#x0003D; 3.2597<italic>e</italic> &#x02212; 22, <italic>t</italic><sub>(34)</sub> &#x0003D; 22.3594; Complete: performance = 0.9587&#x000B1;0.0746, <italic>t</italic>-test, <italic>p</italic> &#x0003D; 1.2272<italic>e</italic> &#x02212; 34, <italic>t</italic><sub>(41)</sub> &#x0003D; 39.6801; <xref ref-type="fig" rid="F3">Figure 3B</xref>, <xref ref-type="supplementary-material" rid="SM1">Supplementary Figure 1</xref>]. We also replicated the transfer effect (<xref ref-type="fig" rid="F3">Figure 3B</xref>): agents preferred <italic>A</italic><sub>2</sub> over <italic>A</italic><sub>1</sub> in both feedback versions (Partial: <italic>p</italic> &#x0003D; 0.04096, <italic>ratio</italic> &#x0003D; 0.65714; Complete: <italic>p</italic> &#x0003D; 6.8771<italic>e</italic> &#x02212; 05, <italic>ratio</italic> &#x0003D; 0.78571; binomial test).</p>
<p>We next assessed whether the estimated parameter &#x003B2; differed between feedback versions. A <italic>t</italic>-test showed that the exploitation rate &#x003B2; was significantly higher in the Complete version than in the Partial version (partial: <italic>mean</italic> &#x0003D; 0.0705, complete: <italic>mean</italic> &#x0003D; 0.368, <italic>p</italic> &#x0003D; 1.085<italic>e</italic> &#x02212; 07). Thus, participants explored less in the Complete version than in the Partial version.</p>
</sec>
<sec>
<title>2.7.2. Parameter Recovery and Model Recovery</title>
<p>To validate our model fitting and model comparison procedures, we conducted parameter recovery and model recovery analyses, respectively (Correa et al., <xref ref-type="bibr" rid="B13">2018</xref>; Wilson and Collins, <xref ref-type="bibr" rid="B50">2019</xref>).</p>
<p>For these analyses, we followed a common approach in the literature (Daw et al., <xref ref-type="bibr" rid="B15">2011</xref>; Palminteri et al., <xref ref-type="bibr" rid="B32">2015</xref>; Correa et al., <xref ref-type="bibr" rid="B13">2018</xref>): we fitted beta distributions to the best-fitting parameters of all participants and sampled synthetic participants from these distributions. With every model in the main model space, we then generated 30 &#x000D7; <italic>number of subjects</italic> simulated behaviors (30 repetitions, resulting in 30 &#x000D7; 35 simulations for the Partial version and 30 &#x000D7; 42 simulations for the Complete version). Finally, we fitted each simulated dataset with every model in the main model space to determine which model fitted it best. The task configurations were identical to those used for the real participants.</p>
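<p>The sampling step can be illustrated with a minimal Python sketch. It assumes that each parameter has already been rescaled to the open interval (0, 1). It fits the beta distribution by the method of moments, which is an assumption, since the original analysis may have used maximum-likelihood fitting instead. The function names and the example values are hypothetical.</p>
<preformat preformat-type="computer-code">
```python
import random
import statistics


def fit_beta_moments(values):
    """Method-of-moments estimate (a, b) of a Beta distribution.

    Assumes the values lie in (0, 1) and that their variance is smaller
    than m * (1 - m), where m is their mean.
    """
    m = statistics.fmean(values)
    v = statistics.pvariance(values, mu=m)
    common = m * (1.0 - m) / v - 1.0   # shared factor of the moment equations
    return m * common, (1.0 - m) * common


def sample_synthetic_participants(fitted_params, n_subjects, n_reps=30, seed=0):
    """Draw n_reps * n_subjects synthetic parameter values from the fitted Beta."""
    a, b = fit_beta_moments(fitted_params)
    rng = random.Random(seed)
    return [rng.betavariate(a, b) for _ in range(n_reps * n_subjects)]


# Hypothetical best-fitting learning rates, already in (0, 1)
fitted_alpha = [0.12, 0.25, 0.31, 0.18, 0.40, 0.22, 0.27, 0.15]
synthetic_alpha = sample_synthetic_participants(fitted_alpha, n_subjects=35)
```
</preformat>
<p>With 35 subjects and 30 repetitions, this yields 1,050 synthetic values, matching the 30 &#x000D7; 35 Partial-version simulations. Each value would then parameterize one simulated agent.</p>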
<p>For the parameter recovery analysis, we took the generating (true) and fitted (recovered) parameters of the OL models from the above simulations and calculated their Pearson correlations. As shown in <xref ref-type="fig" rid="F7">Figure 7</xref>, the correlations between true and recovered parameters are strong. We also regressed the recovered parameters against the true parameters. The regression results, reported in <xref ref-type="table" rid="T3">Table 3</xref>, indicate acceptable parameter recovery.</p>
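<p>The two recovery statistics can be sketched in Python as follows:</p>
<list list-type="bullet">
<list-item><p>the Pearson correlation between true and recovered values;</p></list-item>
<list-item><p>an ordinary least-squares regression of recovered on true values, giving the intercept b<sub>0</sub>, the slope b<sub>1</sub>, and the standard error of the slope.</p></list-item>
</list>
<p>The helper names are hypothetical. Using the classical OLS standard error is an assumption about how the SE column of Table 3 was obtained.</p>
<preformat preformat-type="computer-code">
```python
import math
import statistics


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def regress_recovered_on_true(true_vals, recovered_vals):
    """OLS fit of recovered = b0 + b1 * true; returns (b0, b1, se_b1)."""
    n = len(true_vals)
    mx, my = statistics.fmean(true_vals), statistics.fmean(recovered_vals)
    sxx = sum((a - mx) ** 2 for a in true_vals)
    b1 = sum((a - mx) * (b - my) for a, b in zip(true_vals, recovered_vals)) / sxx
    b0 = my - b1 * mx
    residuals = [b - (b0 + b1 * a) for a, b in zip(true_vals, recovered_vals)]
    sigma2 = sum(r * r for r in residuals) / (n - 2)   # residual variance
    return b0, b1, math.sqrt(sigma2 / sxx)


# Hypothetical true and recovered exploitation rates
true_beta = [0.02, 0.05, 0.08, 0.11, 0.14]
recovered_beta = [0.021, 0.046, 0.083, 0.104, 0.139]
r = pearson_r(true_beta, recovered_beta)
b0, b1, se_b1 = regress_recovered_on_true(true_beta, recovered_beta)
```
</preformat>
<p>Under this reading, good recovery corresponds to r near 1, b<sub>1</sub> near 1 and b<sub>0</sub> near 0. On Python 3.10 and later, <monospace>statistics.correlation</monospace> and <monospace>statistics.linear_regression</monospace> return the same r, slope and intercept.</p>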
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>Parameter recovery analysis of the OL models. Data for 30 &#x000D7; <italic>number of subjects</italic> synthetic participants were simulated with the OL models. The Pearson correlations between the true and recovered parameters of the OL models are strong.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-631347-g0007.tif"/>
</fig>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Parameter recovery of the <italic>OL</italic><sub>1</sub> model: regression results.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th/>
<th valign="top" align="center"><bold>Coef</bold></th>
<th valign="top" align="center"><bold>Parameter</bold></th>
<th valign="top" align="center"><bold>Estimate</bold></th>
<th valign="top" align="center"><bold>SE</bold></th>
<th valign="top" align="center"><italic><bold>p</bold></italic><bold>-value</bold></th>
<th/>
<th valign="top" align="center"><bold>Parameter</bold></th>
<th valign="top" align="center"><bold>Estimate</bold></th>
<th valign="top" align="center"><bold>SE</bold></th>
<th valign="top" align="center"><italic><bold>p</bold></italic><bold>-value</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td/>
<td/>
<td/>
<td/>
<td/>
<td/>
<td/>
<td/>
<td/>
<td/>
<td/>
</tr>
<tr>
<td/>
<td valign="top" align="center"><italic>b</italic><sub>0</sub></td>
<td valign="top" align="center">&#x003B2;</td>
<td valign="top" align="center">0.004</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">0</td>
<td/>
<td valign="top" align="center">&#x003B2;</td>
<td valign="top" align="center">0.005</td>
<td valign="top" align="center">0.001</td>
<td valign="top" align="center">0</td>
</tr>
<tr>
<td/>
<td valign="top" align="center"><italic>b</italic><sub>1</sub></td>
<td/>
<td valign="top" align="center">0.818</td>
<td valign="top" align="center">0.013</td>
<td valign="top" align="center">0</td>
<td/>
<td/>
<td valign="top" align="center">0.865</td>
<td valign="top" align="center">0.016</td>
<td valign="top" align="center">0</td>
</tr>
<tr>
<td/>
<td/>
<td/>
<td/>
<td/>
<td/>
<td/>
<td/>
<td/>
<td/>
<td/>
</tr>
<tr>
<td valign="top" align="left">Partial</td>
<td valign="top" align="center"><italic>b</italic><sub>0</sub></td>
<td valign="top" align="center">&#x003B1;</td>
<td valign="top" align="center">0.038</td>
<td valign="top" align="center">0.004</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">Complete</td>
<td valign="top" align="center">&#x003B1;</td>
<td valign="top" align="center">0.058</td>
<td valign="top" align="center">0.004</td>
<td valign="top" align="center">0</td>
</tr>
<tr>
<td/>
<td valign="top" align="center"><italic>b</italic><sub>1</sub></td>
<td/>
<td valign="top" align="center">0.734</td>
<td valign="top" align="center">0.026</td>
<td valign="top" align="center">0</td>
<td/>
<td/>
<td valign="top" align="center">0.611</td>
<td valign="top" align="center">0.025</td>
<td valign="top" align="center">0</td>
</tr>
<tr>
<td/>
<td/>
<td/>
<td/>
<td/>
<td/>
<td/>
<td/>
<td/>
<td/>
<td/>
</tr>
<tr>
<td/>
<td valign="top" align="center"><italic>b</italic><sub>0</sub></td>
<td/>
<td/>
<td/>
<td/>
<td/>
<td valign="top" align="center"><italic>w</italic></td>
<td valign="top" align="center">0.143</td>
<td valign="top" align="center">0.007</td>
<td valign="top" align="center">0</td>
</tr>
<tr>
<td/>
<td valign="top" align="center"><italic>b</italic><sub>1</sub></td>
<td/>
<td/>
<td/>
<td/>
<td/>
<td/>
<td valign="top" align="center">0.549</td>
<td valign="top" align="center">0.021</td>
<td valign="top" align="center">0</td>
</tr>
<tr>
<td/>
<td/>
<td/>
<td/>
<td/>
<td/>
<td/>
<td/>
<td/>
<td/>
<td/>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>The recovered parameters were regressed against the true parameters. The intercepts (b<sub>0</sub>) and slopes (b<sub>1</sub>) indicate acceptable parameter recovery</italic>.</p>
</table-wrap-foot>
</table-wrap>
<p>In the model recovery analysis, our aim was to investigate whether the models in the model space can be distinguished from each other. To do this, we followed the model recovery approach of Wilson and Collins (Wilson and Collins, <xref ref-type="bibr" rid="B50">2019</xref>; Ciranka et al., <xref ref-type="bibr" rid="B9">2022</xref>), which uses two metrics:</p>
<list list-type="bullet">
<list-item><p>the conditional probability that a model fits best given the true generative model [<italic>p</italic>(<italic>fit</italic>|<italic>gen</italic>)];</p></list-item>
<list-item><p>the conditional probability that the data were generated by a specific model given that it is the best-fitting model [<italic>p</italic>(<italic>gen</italic>|<italic>fit</italic>)].</p></list-item>
</list>
<p>To calculate <italic>p</italic>(<italic>fit</italic>|<italic>gen</italic>), we computed the AIC of each model fitted to each generated dataset and counted how often each model provided the best (lowest-AIC) fit. To calculate <italic>p</italic>(<italic>gen</italic>|<italic>fit</italic>), we applied Bayes&#x00027; rule with a uniform prior over models, <italic>p</italic>(<italic>gen</italic>):</p>
<disp-formula id="E7"><mml:math id="M11"><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>g</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mo>&#x0007C;</mml:mo><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mo>&#x0007C;</mml:mo><mml:mi>g</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>g</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mstyle displaystyle='true'><mml:msubsup><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mi>m</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mi>p</mml:mi><mml:msub><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mo>&#x0007C;</mml:mo><mml:mi>g</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mi>m</mml:mi></mml:msub><mml:mi>p</mml:mi><mml:msub><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>g</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mi>m</mml:mi></mml:msub></mml:mrow></mml:mstyle></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula>
<p>If our models could be recovered perfectly, the <italic>p</italic>(<italic>fit</italic>|<italic>gen</italic>) matrix would be an identity matrix (all diagonal entries 1 and all other entries 0). However, some of the models in our model space behave rather similarly on this task (e.g., the Hybrid model with <italic>w</italic> &#x0003D; 1 is identical to the <italic>SQL</italic> model), so this matrix has large off-diagonal elements (<xref ref-type="fig" rid="F8">Figure 8</xref>). Because model recovery was not perfect, we also computed <italic>p</italic>(<italic>gen</italic>|<italic>fit</italic>), which is the more critical metric for model recovery (Wilson and Collins, <xref ref-type="bibr" rid="B50">2019</xref>; Ciranka et al., <xref ref-type="bibr" rid="B9">2022</xref>).</p>
<p>As shown in <xref ref-type="fig" rid="F8">Figure 8</xref>, in the Partial version, every diagonal entry of the <italic>p</italic>(<italic>gen</italic>|<italic>fit</italic>) matrix except that of <italic>OL</italic><sub>2</sub> is dominant in its column. Thus, all models except <italic>OL</italic><sub>2</sub> could be identified well. In the Complete version, every diagonal entry of the <italic>p</italic>(<italic>gen</italic>|<italic>fit</italic>) matrix except those of the <italic>SQL</italic> and <italic>Dif</italic> models is dominant in its column. Thus, in the Complete version, all models could be distinguished from each other except the <italic>SQL</italic> model, which could not be confidently distinguished from the <italic>Dif</italic> model.</p>
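<p>The two metrics can be made concrete with a short Python sketch that does three things:</p>
<list list-type="bullet">
<list-item><p>builds <italic>p</italic>(<italic>fit</italic>|<italic>gen</italic>) from a table of AIC values by counting, for each generating model, how often each fitted model has the lowest AIC;</p></list-item>
<list-item><p>inverts that matrix with Bayes&#x00027; rule under a uniform prior, as in the equation for <italic>p</italic>(<italic>gen</italic>|<italic>fit</italic>);</p></list-item>
<list-item><p>checks whether each diagonal entry dominates its column.</p></list-item>
</list>
<p>The data layout aic[g][d][f] and all function names are hypothetical. Ties in AIC are broken in favor of the first model, which is an arbitrary choice.</p>
<preformat preformat-type="computer-code">
```python
def p_fit_given_gen(aic):
    """aic[g][d][f] is the AIC of fitted model f on dataset d generated by model g.

    Returns P[g][f]: fraction of datasets from model g best fitted by model f.
    """
    n_models = len(aic)
    mat = []
    for per_gen in aic:
        counts = [0] * n_models
        for dataset in per_gen:
            best = min(range(n_models), key=lambda f: dataset[f])  # lowest AIC wins
            counts[best] += 1
        mat.append([c / len(per_gen) for c in counts])
    return mat


def p_gen_given_fit(fit_given_gen, prior=None):
    """Bayes' rule over generating models; rows are gen, columns are fit."""
    n = len(fit_given_gen)
    prior = prior or [1.0 / n] * n          # uniform prior over models by default
    out = [[0.0] * n for _ in range(n)]
    for f in range(n):
        z = sum(fit_given_gen[g][f] * prior[g] for g in range(n))  # normalizer
        for g in range(n):
            out[g][f] = fit_given_gen[g][f] * prior[g] / z if z else 0.0
    return out


def diagonal_dominates_columns(mat):
    """For each model m, True if mat[m][m] is the largest entry of column m."""
    n = len(mat)
    return [mat[m][m] == max(mat[g][m] for g in range(n)) for m in range(n)]
```
</preformat>
<p>A model counts as well identified under this criterion when its entry in <monospace>diagonal_dominates_columns(p_gen_given_fit(...))</monospace> is True.</p>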
<fig id="F8" position="float">
<label>Figure 8</label>
<caption><p>Model recovery analysis. Data for 30 &#x000D7; <italic>number of subjects</italic> synthetic participants were simulated with all models in the model space, and the generated data were fitted by all models in the main model space. The <italic>OL</italic><sub>1</sub> model was reliably identified in both the Partial and Complete versions.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-631347-g0008.tif"/>
</fig>
<p>When we restricted the &#x003B1;<sub>2</sub> parameter of the <italic>OL</italic><sub>2</sub> model to a specific range (&#x003B1;<sub>2</sub> close, but not equal, to &#x003B1;<sub>1</sub>), model recovery successfully identified <italic>OL</italic><sub>2</sub>. However, when parameters were drawn from the range of best-fitting parameters for the behavioral data, <italic>OL</italic><sub>2</sub> could not be recovered. Importantly, although some models could not be identified well, our newly introduced <italic>OL</italic><sub>1</sub> model, which also won the model comparison, was reliably recovered, and we observed no strong confusion between <italic>OL</italic><sub>1</sub> and the other models.</p>
</sec>
</sec>
</sec>
<sec sec-type="discussion" id="s3">
<title>3. Discussion</title>
<p>Studies of the contextual effect on value learning have mostly focused on the putative role of the unchosen outcome in updating the chosen value in the Complete feedback paradigm (Palminteri et al., <xref ref-type="bibr" rid="B32">2015</xref>; Klein et al., <xref ref-type="bibr" rid="B26">2017</xref>; Bavard et al., <xref ref-type="bibr" rid="B4">2018</xref>). In this study, we showed that the contextual effect in the Partial feedback paradigm can be explained by using the chosen outcome to update the unchosen value. Inspired by the opposing effects of dopamine in the striatum on competing option values, we introduced the novel Opposing Learning (OL) model, in which the chosen prediction error updates the chosen and unchosen values in opposing directions. This update rule makes the competing option values correlated with each other, which leads to the emergence of the contextual effect during value learning. In addition, because the prediction error acts inhibitorily on the unchosen values, the contrast between option values is larger than in the standard Q-learning model, which leads to higher average performance. Compared to previous models, this model better accounted for the behavioral characteristics of the data (Camerer and Hua Ho, <xref ref-type="bibr" rid="B7">1999</xref>; Palminteri et al., <xref ref-type="bibr" rid="B32">2015</xref>; Kato and Morita, <xref ref-type="bibr" rid="B24">2016</xref>; Bornstein et al., <xref ref-type="bibr" rid="B5">2017</xref>; Klein et al., <xref ref-type="bibr" rid="B26">2017</xref>; Bavard et al., <xref ref-type="bibr" rid="B4">2018</xref>; Sutton and Barto, <xref ref-type="bibr" rid="B46">2018</xref>).</p>
<p>Most instrumental learning studies use discrete rewards (1 and 0) as gains and losses. Participants then estimate the probability of reward for each option to maximize their payoffs (Frank et al., <xref ref-type="bibr" rid="B19">2004</xref>; Palminteri et al., <xref ref-type="bibr" rid="B32">2015</xref>; Klein et al., <xref ref-type="bibr" rid="B26">2017</xref>). Although we sometimes encounter probabilistic rewards in daily life (e.g., the probability of making a profit on a stock or of winning at a horse race), we more often experience continuous outcomes of our choices. Examples include the amount of profit from a financial transaction (e.g., stocks, pension plans) and evaluation metrics (assessment scores, citation indices, or any other quantitative outcome). We then estimate the magnitude of our expected outcomes from these continuous outcomes. Therefore, a secondary aim of this study was to investigate the contextual effect in a paradigm with continuous reward magnitudes. We adapted previous instrumental learning tasks with a novel reward design in which each stimulus was associated with rewards drawn from a specific normal distribution. These complementary results show that the contextual effect is not limited to probabilistic rewards but extends to continuous reward magnitudes.</p>
<p>There are two pathways in the basal ganglia with opposing roles: the direct pathway, which promotes actions, and the indirect pathway, which suppresses actions (Cox and Witten, <xref ref-type="bibr" rid="B14">2019</xref>; Peak et al., <xref ref-type="bibr" rid="B34">2019</xref>). These pathways originate from two distinct populations of striatal neurons, D1- and D2-SPNs, on which dopamine has opposing effects (viz., stimulating D1-SPNs and inhibiting D2-SPNs; Surmeier et al., <xref ref-type="bibr" rid="B45">2007</xref>; Shen et al., <xref ref-type="bibr" rid="B41">2008</xref>). Associative learning studies have shown that D1- and D2-SPNs encode the values of the chosen and unchosen options, respectively (Frank et al., <xref ref-type="bibr" rid="B19">2004</xref>; Tai et al., <xref ref-type="bibr" rid="B47">2012</xref>; Collins and Frank, <xref ref-type="bibr" rid="B10">2014</xref>; Donahue et al., <xref ref-type="bibr" rid="B17">2018</xref>; Nonomura et al., <xref ref-type="bibr" rid="B30">2018</xref>; Shin et al., <xref ref-type="bibr" rid="B42">2018</xref>; Bariselli et al., <xref ref-type="bibr" rid="B1">2019</xref>). Inspired by these results, we introduced a novel model in which the chosen prediction error updates the chosen and unchosen values concurrently but in opposing directions (the former with a positive and the latter with a negative coefficient). The only model in the literature with similar update rules is the OpAL model introduced by Collins and Frank (<xref ref-type="bibr" rid="B10">2014</xref>). The crucial difference between the two models concerns the reference point. The OpAL model uses a reference-point mechanism to account for the contextual effect, whereas the OL model can better explain the effect without resorting to the concept of a reference point.</p>
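<p>The opposing update can be sketched for a single two-option context. This minimal Python illustration assumes the form implied by the text. With chosen prediction error &#x003B4; = <italic>r</italic> &#x02212; <italic>q</italic><sub><italic>ch</italic></sub>, the chosen value moves by +&#x003B1;<sub>1</sub>&#x003B4; and the unchosen value by &#x02212;&#x003B1;<sub>2</sub>&#x003B4;. The function name is hypothetical, and the model equations given earlier in the paper take precedence over this sketch.</p>
<preformat preformat-type="computer-code">
```python
def ol_update(q, chosen, reward, alpha1, alpha2):
    """One Opposing Learning step in a two-option context (illustrative sketch).

    The chosen prediction error moves the chosen value with coefficient +alpha1
    and the unchosen value with coefficient -alpha2.
    """
    unchosen = 1 - chosen
    delta = reward - q[chosen]          # chosen prediction error
    new_q = list(q)
    new_q[chosen] += alpha1 * delta     # same direction as delta
    new_q[unchosen] -= alpha2 * delta   # opposing direction
    return new_q


q = [0.0, 0.0]
q = ol_update(q, chosen=0, reward=60.0, alpha1=0.2, alpha2=0.1)
# q is now [12.0, -6.0] up to floating-point rounding
```
</preformat>
<p>In this sketch, a better-than-expected outcome for the chosen option raises its value and lowers the value of its competitor. A worse-than-expected outcome does the opposite.</p>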
<p>The parameter in the OL model that controls the magnitude of the correlation between competing option values (an indicator of the contextual effect) is &#x003B1;<sub>2</sub>. Depending on whether &#x003B1;<sub>2</sub>&#x02248;0, &#x003B1;<sub>2</sub>&#x02248;&#x003B1;<sub>1</sub>, or 0 &#x0003C; &#x003B1;<sub>2</sub> &#x0003C; &#x003B1;<sub>1</sub>, the model operates in one of three regimes:</p>
<list list-type="bullet">
<list-item><p>When &#x003B1;<sub>2</sub>&#x02248;0, the correlation is at its lowest (<italic>corr</italic>&#x02248;0) and there is no contextual effect at all.</p></list-item>
<list-item><p>When &#x003B1;<sub>2</sub> &#x0003D; &#x003B1;<sub>1</sub>, the absolute correlation is at its highest (<italic>corr</italic>&#x02248; &#x02212; 1) and the contextual effect is strongest and permanent.</p></list-item>
<list-item><p>When 0 &#x0003C; &#x003B1;<sub>2</sub> &#x0003C; &#x003B1;<sub>1</sub>, the correlation is moderate, and the contextual effect is moderate and temporary, disappearing over time (<xref ref-type="fig" rid="F5">Figure 5</xref>).</p></list-item>
</list>
<p>This negative correlation between the chosen and unchosen values in the OL model (especially in the <italic>OL</italic><sub>1</sub> model) causes the competing option values to be learned relative to each other (<italic>q</italic><sub><italic>un</italic></sub>&#x02248; &#x02212; <italic>q</italic><sub><italic>ch</italic></sub>). Through this relative encoding, the model can explain not only reward learning behavior but also punishment avoidance learning behavior (Palminteri et al., <xref ref-type="bibr" rid="B32">2015</xref>; Palminteri and Lebreton, <xref ref-type="bibr" rid="B33">2021</xref>).</p>
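<p>The regime with &#x003B1;<sub>2</sub> equal to &#x003B1;<sub>1</sub> can be checked numerically. Assume an opposing update in which the chosen value changes by +&#x003B1;<sub>1</sub>&#x003B4; and the unchosen value by &#x02212;&#x003B1;<sub>2</sub>&#x003B4;. With &#x003B1;<sub>2</sub> = &#x003B1;<sub>1</sub> and both values initialized at zero, every update adds some amount to one value and subtracts the same amount from the other. The sum <italic>q</italic><sub>1</sub> + <italic>q</italic><sub>2</sub> therefore stays at zero, and <italic>q</italic><sub><italic>un</italic></sub> = &#x02212;<italic>q</italic><sub><italic>ch</italic></sub> on every trial. The sketch uses random choices and the reward means of one learning-phase context purely for illustration. The function name and settings are hypothetical.</p>
<preformat preformat-type="computer-code">
```python
import random


def simulate_ol(alpha1, alpha2, n_trials=200, means=(64.0, 54.0), sd=13.0, seed=1):
    """Simulate assumed OL updates in one two-option context with random choices.

    Returns the per-trial trajectory of (q_option0, q_option1).
    """
    rng = random.Random(seed)
    q = [0.0, 0.0]
    trajectory = []
    for _ in range(n_trials):
        chosen = rng.randrange(2)                  # random choice, for illustration only
        reward = rng.gauss(means[chosen], sd)
        delta = reward - q[chosen]                 # chosen prediction error
        q[chosen] += alpha1 * delta
        q[1 - chosen] -= alpha2 * delta            # opposing update of the unchosen value
        trajectory.append((q[0], q[1]))
    return trajectory


coupled = simulate_ol(alpha1=0.2, alpha2=0.2)
largest_sum = max(abs(a + b) for a, b in coupled)   # stays 0: q_un mirrors -q_ch
```
</preformat>
<p>Rerunning the simulation with a smaller &#x003B1;<sub>2</sub> (e.g., 0.05) breaks this exact mirroring. Under these random choices, the values then drift back toward their own reward means, in line with the temporary regime. Setting &#x003B1;<sub>2</sub> = 0 removes the coupling altogether.</p>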
<p>The average performance of the OL model is better than that of the SQL model. In environments with a reasonable amount of noise, the more relative the model (&#x003B1;<sub>2</sub> closer to &#x003B1;<sub>1</sub>), the better its average performance. The performance of the OL model improves because of the increased contrast between option values, which makes the superior options easier to detect. Two related models should also be mentioned. First, the confirmation bias model (Lefebvre et al., <xref ref-type="bibr" rid="B28">2022</xref>) improves performance in a similar way: asymmetric updating of positive and negative prediction errors increases the contrast between option values. Second, the RelAsym model (Garcia et al., <xref ref-type="bibr" rid="B20">2021</xref>; Ting et al., <xref ref-type="bibr" rid="B48">2021</xref>) combines the confirmation bias and reference-point mechanisms. With these two components, the RelAsym model not only gains the performance advantage of asymmetric updating but also explains the contextual effect through its reference-point function. In terms of performance and the contextual effect, the RelAsym model is similar to the OL model, but the two models differ in their underlying mechanisms. The RelAsym model uses an explicit reference-point mechanism to explain the contextual effect, whereas the OL model explains it without any explicit reference-point mechanism.</p>
<p>One advantage of the OL model is that it can be extended to the Complete feedback version. Several studies have shown that people performing the Complete version of the task are affected not only by absolute rewards (chosen outcomes) but also by relative rewards (the difference between chosen and unchosen outcomes; Camille et al., <xref ref-type="bibr" rid="B8">2004</xref>; Coricelli et al., <xref ref-type="bibr" rid="B11">2005</xref>, <xref ref-type="bibr" rid="B12">2007</xref>). These relative rewards are encoded in the brain by dopamine (Kishida et al., <xref ref-type="bibr" rid="B25">2016</xref>; Lak et al., <xref ref-type="bibr" rid="B27">2016</xref>). Our results are consistent with these findings. In Section 2.5, we showed that relative rewards have a significant effect in the Complete version but not in the Partial version (<xref ref-type="table" rid="T1">Table 1</xref>). This suggests that participants used a hybrid strategy, that is, a weighted combination of absolute and relative rewards, when performing the Complete version. This finding is similar to those of previous studies (Coricelli et al., <xref ref-type="bibr" rid="B12">2007</xref>; Bavard et al., <xref ref-type="bibr" rid="B4">2018</xref>). Notably, the extended OL model preserves all the essential characteristics of the basic OL model.</p>
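<p>A weighted combination of absolute and relative outcomes can be written in several ways. The Python sketch below assumes one hypothetical parameterization, chosen so that <italic>w</italic> = 1 reduces to the absolute outcome. This matches the earlier statement that the Hybrid model with <italic>w</italic> = 1 is identical to the SQL model. It is not necessarily the equation used in the extended OL model, and the function names are illustrative.</p>
<preformat preformat-type="computer-code">
```python
def hybrid_outcome(r_chosen, r_unchosen, w):
    """Weighted mix of absolute and relative outcomes (hypothetical form).

    w = 1 uses only the absolute (chosen) outcome; w = 0 uses only the
    relative outcome r_chosen - r_unchosen.
    """
    absolute = r_chosen
    relative = r_chosen - r_unchosen
    return w * absolute + (1.0 - w) * relative


def hybrid_q_update(q_chosen, r_chosen, r_unchosen, alpha, w):
    """Delta-rule update of the chosen value toward the hybrid outcome."""
    return q_chosen + alpha * (hybrid_outcome(r_chosen, r_unchosen, w) - q_chosen)


# With factual 60 and counterfactual 50, w = 0.5 gives 0.5 * 60 + 0.5 * 10
effective = hybrid_outcome(60.0, 50.0, w=0.5)
```
</preformat>
<p>Under this parameterization, smaller values of <italic>w</italic> give more weight to the comparison with the forgone outcome.</p>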
<p>Learning models in which unchosen values change on trials where only the chosen outcome is observed fall into two types. The &#x0201C;reference-point learning model&#x0201D; is an example of the first type. In this model, the reference point of a state, which corresponds to its expected reward, is continuously updated with the state&#x00027;s outcomes. The valence of each outcome is specified relative to this reference point: it is positive when the outcome is greater than the reference point and negative when it is smaller. Thus, in the first type, the values of the competing options are learned relative to each other (Palminteri et al., <xref ref-type="bibr" rid="B32">2015</xref>; Klein et al., <xref ref-type="bibr" rid="B26">2017</xref>; Bavard et al., <xref ref-type="bibr" rid="B4">2018</xref>).</p>
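<p>The following is a simplified Python sketch of reference-point learning for one context. It assumes that the context value tracks the obtained reward with its own learning rate and that the option value learns the outcome relative to this reference point. Published reference-point models (e.g., Palminteri et al., <xref ref-type="bibr" rid="B32">2015</xref>) include further details, such as how counterfactual outcomes enter the context value. The function and parameter names here are hypothetical.</p>
<preformat preformat-type="computer-code">
```python
def reference_point_update(q, v_context, chosen, reward, alpha_q, alpha_v):
    """Simplified reference-point (relative) learning step for one context.

    v_context is the reference point (the context's expected reward); the
    chosen option learns the outcome relative to it.
    """
    relative_outcome = reward - v_context           # positive if better than reference
    new_q = list(q)
    new_q[chosen] += alpha_q * (relative_outcome - q[chosen])
    new_v = v_context + alpha_v * (reward - v_context)  # reference tracks outcomes
    return new_q, new_v


q, v = reference_point_update([0.0, 0.0], 50.0, chosen=0, reward=60.0,
                              alpha_q=0.5, alpha_v=0.1)
# q[0] becomes 5.0 (a positive valence) and the reference point moves to 51.0
```
</preformat>
<p>In this sketch, both options are valued against the same reference point. Their values are therefore expressed relative to the context rather than on an absolute scale.</p>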
<p>In the second type, the competing values are learned independently of each other. The &#x0201C;forgetting reinforcement learning model&#x0201D; is an example of the second type. Despite similarities between the OL model and the forgetting reinforcement learning model, there are crucial differences between them. First, in a forgetting reinforcement learning model, the unchosen value decays over time; therefore, if an option has not been chosen for a long time, its value decays toward zero. In the OL model, by contrast, the unchosen value does not decay but is updated by the chosen prediction error in the opposing direction. This implies that, if an option has not been chosen for a long time, its value does not decay to zero but converges toward &#x02212;(&#x003B1;<sub>2</sub>/&#x003B1;<sub>1</sub>) &#x000D7; the chosen value. Second, in the forgetting reinforcement learning model, unlike the OL model, the observed rewards of the chosen options do not affect the values of the unchosen options.</p>
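<p>The contrast with a forgetting model can be illustrated by choosing the same option on every trial with a constant reward. Assume the opposing update (+&#x003B1;<sub>1</sub>&#x003B4; for the chosen value, &#x02212;&#x003B1;<sub>2</sub>&#x003B4; for the unchosen value) and zero initial values. The unchosen value then equals &#x02212;(&#x003B1;<sub>2</sub>/&#x003B1;<sub>1</sub>) times the chosen value after every trial. A forgetting model, in contrast, shrinks the unchosen value geometrically toward zero regardless of the observed reward. The decay rate, the nonzero starting value of the forgetting model&#x00027;s unchosen option, and the function name are hypothetical choices made for this demonstration.</p>
<preformat preformat-type="computer-code">
```python
def repeat_choice(n_trials, reward, alpha1, alpha2, decay, forget_start=30.0):
    """Always choose option 0, which pays a constant reward (illustrative sketch).

    Returns (q_chosen, q_unchosen_ol, q_unchosen_forgetting). OL values start
    at 0; the forgetting model's unchosen value starts at forget_start
    (hypothetical) so that its decay is visible.
    """
    q_ch, q_un_ol, q_un_fg = 0.0, 0.0, forget_start
    for _ in range(n_trials):
        delta = reward - q_ch              # chosen prediction error
        q_ch += alpha1 * delta             # chosen update (assumed identical in both models)
        q_un_ol -= alpha2 * delta          # OL: opposing update driven by the chosen outcome
        q_un_fg *= 1.0 - decay             # forgetting: reward-independent decay toward 0
    return q_ch, q_un_ol, q_un_fg


q_ch, q_un_ol, q_un_fg = repeat_choice(200, reward=60.0, alpha1=0.2, alpha2=0.1, decay=0.1)
# q_ch is close to 60, q_un_ol close to -(0.1 / 0.2) * 60 = -30, q_un_fg close to 0
```
</preformat>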
<p>Taken together, we have shown that context affects people&#x00027;s behavior even under everyday conditions in which no counterfactual outcome is available. Although this contextual effect confers an ecological advantage by allowing one to gain more rewards in the original context, it results in suboptimal decision making outside that context. Studying the mechanism underlying context-dependent behavior may also help us find solutions to the problems that arise from such suboptimal behavior.</p>
</sec>
<sec sec-type="materials and methods" id="s4">
<title>4. Materials and Methods</title>
<sec>
<title>4.1. Participants</title>
<p>Two groups of 41 and 47 participants took part in the Partial and Complete versions of the experiment, respectively. We excluded six participants from the Partial version and five participants from the Complete version. In the Partial and Complete versions, respectively, two and three participants did not learn the associations, and for four and two participants the difference between the experienced mean rewards of <italic>A</italic><sub>1</sub> and <italic>A</italic><sub>2</sub> exceeded one. After these exclusions, <italic>N</italic> &#x0003D; 35 participants [age: 26&#x000B1;6 (<italic>mean</italic>&#x000B1;<italic>SD</italic>), female: <italic>n</italic> &#x0003D; 16] and <italic>N</italic> &#x0003D; 42 participants [age: 23&#x000B1;5 (<italic>mean</italic>&#x000B1;<italic>SD</italic>), female: <italic>n</italic> &#x0003D; 12] remained for the Partial and Complete versions, respectively. After completing the task, participants received monetary rewards according to their performance. All were healthy volunteers who gave written informed consent before starting the task. The study was approved by the local ethics committee.</p>
</sec>
<sec>
<title>4.2. Behavioral Task</title>
<p>Two different cohorts of participants performed two different versions of an instrumental learning task adapted from previous studies (Palminteri et al., <xref ref-type="bibr" rid="B32">2015</xref>; Klein et al., <xref ref-type="bibr" rid="B26">2017</xref>; Bavard et al., <xref ref-type="bibr" rid="B4">2018</xref>). The two versions were structured very similarly and included three consecutive phases: learning, post-learning transfer, and value estimation. They differed in the way feedback was provided to the participants. In the Partial version, only the outcome of the chosen option (factual outcome) was shown. In the Complete version, the outcomes of both the chosen and unchosen options (factual and counterfactual outcomes) were shown. Before the main task, participants performed a short training session (20 trials) to become familiar with the learning phase. The stimuli and reward statistics of the training session differed from those of the main session. The stimuli were selected from the Japanese Hiragana characters.</p>
<p>The learning phase consisted of a single session. In each trial, two stimuli were presented on the screen, and participants were instructed to choose the option with the higher expected reward. Through trial and error, participants thus gradually learned to choose the more advantageous option in each trial. The stimuli were presented in two fixed pairs, {<italic>A</italic><sub>1</sub><italic>B, A</italic><sub>2</sub><italic>C</italic>}; that is, each stimulus was always presented together with the same partner stimulus. Each stimulus pair thus established a fixed context, and the two contexts were pseudorandomly interleaved across trials. The rewards of the <italic>A</italic><sub>1</sub> and <italic>A</italic><sub>2</sub> stimuli were drawn from the same normal distribution, <inline-formula><mml:math id="M59"><mml:mi mathvariant='script'>N</mml:mi></mml:math></inline-formula>(64, 13). The rewards of the <italic>B</italic> and <italic>C</italic> stimuli were drawn from different normal distributions, <inline-formula><mml:math id="M60"><mml:mi mathvariant='script'>N</mml:mi></mml:math></inline-formula>(54, 13) and <inline-formula><mml:math id="M61"><mml:mi mathvariant='script'>N</mml:mi></mml:math></inline-formula>(44, 13), respectively. To control for confounding factors, reward samples were drawn from truncated distributions restricted to the [&#x003BC; &#x02212; 3&#x003C3;, &#x003BC; &#x0002B; 3&#x003C3;] interval ([0, 100]). The parameters of the distributions were unknown to the participants, who had to learn them. Although the reward statistics of <italic>A</italic><sub>1</sub> and <italic>A</italic><sub>2</sub> were identical, different images were associated with them to conceal the task structure from the participants.</p>
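<p>Reward sampling from a truncated normal distribution can be sketched in Python with rejection sampling. This sketch makes two assumptions about the implementation: it reads the second parameter of each normal distribution as the standard deviation, and it additionally restricts samples to [0, 100]. The dictionary and function names are hypothetical.</p>
<preformat preformat-type="computer-code">
```python
import random

# Learning-phase reward distributions as (mean, SD); reading the second
# parameter of N(., .) as the SD is an assumption.
REWARD_DISTS = {
    "A1": (64.0, 13.0),
    "A2": (64.0, 13.0),
    "B": (54.0, 13.0),
    "C": (44.0, 13.0),
}


def sample_reward(stim, rng, floor=0.0, ceiling=100.0):
    """Draw one reward from N(mu, sd) truncated to [mu - 3 sd, mu + 3 sd],
    further restricted to [floor, ceiling] (assumed), by rejection sampling."""
    mu, sd = REWARD_DISTS[stim]
    lo = max(floor, mu - 3 * sd)
    hi = min(ceiling, mu + 3 * sd)
    while True:
        x = rng.gauss(mu, sd)
        if hi >= x >= lo:          # accept only samples inside the interval
            return x


rng = random.Random(42)
example_rewards = [sample_reward("C", rng) for _ in range(5)]
```
</preformat>
<p>Because the interval spans &#x000B1;3 standard deviations, only about 0.3% of draws are rejected, so rejection sampling is inexpensive here.</p>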
<p>The side of the screen (left or right of the fixation point) on which each stimulus appeared was also pseudorandomized. Across all trials of a given context, each stimulus appeared on the right in half of the trials and on the left in the other half. Participants had 4,000 ms to make a choice. If they did not respond in time, they received no reward on that trial and a &#x0201C;No Response&#x0201D; message was displayed. Participants indicated their choice by pressing the left or right arrow key for the option displayed on the left or right, respectively. After the choice, the chosen option was surrounded by a blue square and, at the same time, the corresponding outcomes were displayed. In the Partial version, the factual outcome was shown below the chosen option for 500 ms. In the Complete version, the factual and counterfactual outcomes were shown below the chosen and unchosen options, respectively, for 1,000 ms. Participants in the Complete version had to process twice as much information as those in the Partial version. In our pilot study, 500 ms was not sufficient to process two outcomes and resulted in poorer performance in the Complete than in the Partial version, so we increased the presentation time in the Complete version to 1,000 ms. The next trial started after a 1,000-ms fixation screen. Each context was presented in at least 50 trials, for a minimum of 100 trials in total. After these first 100 trials, the task continued for each participant until the experienced mean reward of <italic>A</italic><sub>1</sub> was almost equal to that of <italic>A</italic><sub>2</sub> (i.e., the absolute difference between them fell below 1). If this condition was not met by the 300th trial, the learning phase was stopped and the participant&#x00027;s data were excluded from the analysis. As a result of this design, the number of trials always fell within [100, 300] and could differ across participants.</p>
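<p>The adaptive stopping rule can be written as a small decision function. In this sketch, the &#x0201C;experienced mean&#x0201D; of a stimulus is assumed to be the mean of all outcomes generated for it so far. The function name and its arguments are hypothetical.</p>
<preformat preformat-type="code" xml:space="preserve">
```python
from statistics import mean


def learning_phase_status(trials_per_context, rewards_a1, rewards_a2,
                          min_per_context=50, max_trials=300, tol=1.0):
    """Return "continue", "stop", or "exclude" for the adaptive learning phase.

    Hypothetical helper. trials_per_context: trials presented so far per context, e.g. [52, 51].
    rewards_a1, rewards_a2: outcomes experienced so far for A1 and A2 (assumption: every
    outcome generated for a stimulus counts as experienced).
    """
    n_trials = sum(trials_per_context)
    if min_per_context > min(trials_per_context):
        return "continue"  # each context needs at least 50 trials (100 in total)
    if rewards_a1 and rewards_a2 and tol > abs(mean(rewards_a1) - mean(rewards_a2)):
        return "stop"  # experienced means of A1 and A2 differ by less than 1
    if n_trials >= max_trials:
        return "exclude"  # criterion not met by trial 300: data are excluded
    return "continue"


print(learning_phase_status([50, 50], [60.0, 70.0], [64.0, 66.5]))  # stop (means 65.0 vs 65.25)
```
</preformat>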
<p>After the learning phase, participants immediately entered the post-learning transfer phase. They were not told about the transfer phase until they had completed the learning phase, so that they would not adopt memorization strategies during learning. In the transfer phase, all possible pairwise combinations of the stimuli (six pairs) were presented, and participants were asked to choose the option with the higher expected reward. They were told that they would see not only previously paired options but also options that had not been paired in the preceding (learning) phase. Each pair was presented four times, giving a total of 6 &#x000D7; 4 &#x0003D; 24 trials in pseudorandomized order. Unlike the learning phase, the transfer phase was self-paced (i.e., there was no response deadline). No feedback was provided, so that the learned values would not be altered (Frank et al., <xref ref-type="bibr" rid="B19">2004</xref>, <xref ref-type="bibr" rid="B18">2007</xref>; Palminteri et al., <xref ref-type="bibr" rid="B32">2015</xref>; Klein et al., <xref ref-type="bibr" rid="B26">2017</xref>; Bavard et al., <xref ref-type="bibr" rid="B4">2018</xref>). After each choice, participants used the computer mouse to report their confidence in that choice on a scale of 0&#x02013;100. The left end of the scale was labeled &#x0201C;completely unsure&#x0201D; and the right end &#x0201C;completely sure.&#x0201D;</p>
<p>After the transfer phase, participants completed the value estimation phase. In this phase, the stimuli were presented one at a time, and participants were asked to estimate the average reward of each stimulus on a scale of 0&#x02013;100. Each stimulus was presented four times, giving a total of 4 &#x000D7; 4 &#x0003D; 16 trials in pseudorandomized order. These trials were also self-paced, and no feedback was provided. Participants were informed that their payoff would be based on the sum of the rewards they earned during the learning phase. In the Complete version, only the outcomes of the chosen options counted toward this total. Participants were not paid for the transfer and value estimation phases. Nevertheless, they were encouraged to respond as accurately as possible, as if their rewards depended on correct responses. At the end of the task, their total reward was shown on the screen.</p>
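<p>The trial lists of the two post-learning phases can be generated with a few lines of Python. In this minimal sketch, a plain shuffle stands in for the pseudorandomization. The left/right placement of options in the transfer phase is not specified in the text and is therefore omitted. The function names are illustrative.</p>
<preformat preformat-type="code" xml:space="preserve">
```python
import itertools
import random

STIMULI = ["A1", "A2", "B", "C"]


def make_transfer_trials(rng, repeats=4):
    """Transfer phase (hypothetical helper): every unordered pair of stimuli, shown `repeats` times."""
    pairs = list(itertools.combinations(STIMULI, 2))  # 6 pairs
    trials = pairs * repeats  # 6 x 4 = 24 trials
    rng.shuffle(trials)
    return trials


def make_estimation_trials(rng, repeats=4):
    """Value estimation phase (hypothetical helper): each stimulus alone, 4 x 4 = 16 trials."""
    trials = STIMULI * repeats
    rng.shuffle(trials)
    return trials


rng = random.Random(1)
transfer = make_transfer_trials(rng)
estimation = make_estimation_trials(rng)
print(len(transfer), len(estimation))  # 24 16
```
</preformat>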
</sec>
<sec>
<title>4.3. Computational Models</title>
<sec>
<title>4.3.1. The Standard Q-Learning (SQL) Model</title>
<p>Context-dependent learning models are commonly compared with the standard Q-learning (SQL) model, an absolute learning model, which serves as a benchmark. In the SQL model, the value of each option is updated based only on its own outcomes (Sutton and Barto, <xref ref-type="bibr" rid="B46">2018</xref>).</p>
<disp-formula id="E8"><mml:math id="M12"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<disp-formula id="E9"><mml:math id="M13"><mml:mrow><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B1;</mml:mi><mml:msub><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<p>In the simplest form, only the chosen option is updated based on its outcomes, while in the extended form the unchosen options are also updated, but again with their own outcomes:</p>
<disp-formula id="E10"><mml:math id="M14"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<disp-formula id="E11"><mml:math id="M15"><mml:mrow><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<disp-formula id="E12"><mml:math id="M16"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<disp-formula id="E13"><mml:math id="M17"><mml:mrow><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<p>In this model, the learning rates can be the same or different (&#x003B1;<sub>1</sub> &#x0003D; &#x003B1;<sub>2</sub> or &#x003B1;<sub>1</sub>&#x02260;&#x003B1;<sub>2</sub>).</p>
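<p>The SQL updates above can be sketched as a single Python function that operates on a dictionary of option values. This is an illustrative implementation, not the authors&#x00027; fitting code. The function name and argument layout are assumptions.</p>
<preformat preformat-type="code" xml:space="preserve">
```python
def sql_update(q, chosen, r_ch, alpha1, unchosen=None, r_un=None, alpha2=None):
    """Standard Q-learning update on a dict mapping stimulus -> value (hypothetical helper).

    Simplest form: only the chosen option is updated. Extended form (Complete feedback):
    the unchosen option is also updated with its own outcome and learning rate alpha2;
    passing alpha2 = alpha1 gives the single-learning-rate variant.
    """
    q = dict(q)  # work on a copy so the caller's values are unchanged
    q[chosen] += alpha1 * (r_ch - q[chosen])  # delta_ch = r_ch - Q_ch
    if unchosen is not None:
        q[unchosen] += alpha2 * (r_un - q[unchosen])  # delta_un = r_un - Q_un
    return q


q0 = {"A1": 50.0, "B": 50.0}
print(sql_update(q0, "A1", 70.0, 0.5))                   # {'A1': 60.0, 'B': 50.0}
print(sql_update(q0, "A1", 70.0, 0.5, "B", 30.0, 0.25))  # {'A1': 60.0, 'B': 45.0}
```
</preformat>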
</sec>
<sec>
<title>4.3.2. The Reference-Point (RP) Model</title>
<p>The reference-point (RP) model is motivated by the reference-point phenomenon reported in behavioral and economic studies (De Martino et al., <xref ref-type="bibr" rid="B16">2009</xref>; Baucells et al., <xref ref-type="bibr" rid="B3">2011</xref>). According to this model, each context has its own reference point, determined by the expected reward of that context. The outcome of each option is then evaluated relative to this reference point. We implemented several forms of the RP model that differ in how the contextual reward is defined (Palminteri et al., <xref ref-type="bibr" rid="B32">2015</xref>). The three forms are the RPD (Reference-Point Direct), RPA (Reference-Point Average), and RPM (Reference-Point Max) models. In the Partial version, these models define the contextual reward <italic>r</italic><sub><italic>x</italic></sub> as <italic>r</italic><sub><italic>ch</italic></sub>, (<italic>r</italic><sub><italic>ch</italic></sub> &#x0002B; <italic>Q</italic><sub><italic>un</italic></sub>)/2, and max(<italic>r</italic><sub><italic>ch</italic></sub>, <italic>Q</italic><sub><italic>un</italic></sub>), respectively. In the Complete version, they define it as <italic>r</italic><sub><italic>ch</italic></sub>, (<italic>r</italic><sub><italic>ch</italic></sub> &#x0002B; <italic>r</italic><sub><italic>un</italic></sub>)/2, and max(<italic>r</italic><sub><italic>ch</italic></sub>, <italic>r</italic><sub><italic>un</italic></sub>), respectively.</p>
<disp-formula id="E14"><mml:math id="M18"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<disp-formula id="E15"><mml:math id="M19"><mml:mrow><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<disp-formula id="E16"><mml:math id="M20"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<disp-formula id="E17"><mml:math id="M21"><mml:mrow><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<p>where <italic>V</italic><sub><italic>x</italic></sub> is the value of the context and <italic>Q</italic><sub><italic>ch</italic></sub> is the value of the chosen option. In the Complete version, the unchosen option can also be updated as follows:</p>
<disp-formula id="E18"><mml:math id="M22"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<disp-formula id="E19"><mml:math id="M23"><mml:mrow><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
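<p>In the Complete version, we tested several RP variants. One updates only the chosen value. The others update both options, with either the same or different learning rates (&#x003B1;<sub>2</sub> &#x0003D; &#x003B1;<sub>3</sub> or &#x003B1;<sub>2</sub>&#x02260;&#x003B1;<sub>3</sub>).</p>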
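<p>The sketch below implements one RP trial update and covers the RPD, RPA, and RPM definitions of the contextual reward. It assumes that the context value <italic>V</italic><sub><italic>x</italic></sub> is updated first and that the updated value then serves as the reference point for the option updates, following the order in which the equations are listed. The function name, the <monospace>variant</monospace> labels, and the dictionary layout are hypothetical.</p>
<preformat preformat-type="code" xml:space="preserve">
```python
def rp_update(q, v, context, chosen, unchosen, r_ch, alpha1, alpha2,
              variant="direct", r_un=None, alpha3=None):
    """One reference-point (RP) trial update; hypothetical helper, not the authors' code.

    q: option values, v: context values (both dicts). Pass r_un for Complete feedback;
    with r_un=None (Partial feedback) the current Q_un stands in for the forgone outcome.
    Pass alpha3 as well to also update the unchosen option.
    """
    q, v = dict(q), dict(v)
    other = q[unchosen] if r_un is None else r_un
    if variant == "direct":      # RPD
        r_x = r_ch
    elif variant == "average":   # RPA
        r_x = (r_ch + other) / 2
    elif variant == "max":       # RPM
        r_x = max(r_ch, other)
    else:
        raise ValueError(f"unknown variant: {variant}")
    # Move the context value (reference point) toward the contextual reward.
    v[context] += alpha1 * (r_x - v[context])
    # The chosen option learns its outcome relative to the (updated) reference point.
    q[chosen] += alpha2 * ((r_ch - v[context]) - q[chosen])
    # Optional update of the unchosen option (Complete feedback variants only).
    if r_un is not None and alpha3 is not None:
        q[unchosen] += alpha3 * ((r_un - v[context]) - q[unchosen])
    return q, v


q0, v0 = {"A1": 0.0, "B": 0.0}, {"A1B": 50.0}
print(rp_update(q0, v0, "A1B", "A1", "B", r_ch=70.0, alpha1=0.5, alpha2=0.5))
# ({'A1': 5.0, 'B': 0.0}, {'A1B': 60.0})
```
</preformat>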
</sec>
<sec>
<title>4.3.3. The Difference (Dif) Model</title>
<p>When the goal is to maximize reward, an effective learning strategy is to identify the advantageous option as quickly as possible. The difference model is one of the models that allow fast detection of the advantageous option by learning relative values. In this model, participants learn how much better the superior option is than the inferior option (Klein et al., <xref ref-type="bibr" rid="B26">2017</xref>).</p>
<disp-formula id="E20"><mml:math id="M24"><mml:mrow><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>l</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>F</mml:mi><mml:mi>C</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mi>F</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<disp-formula id="E21"><mml:math id="M25"><mml:mrow><mml:mi>&#x003B4;</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>l</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<disp-formula id="E22"><mml:math id="M26"><mml:mrow><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B1;</mml:mi><mml:mi>&#x003B4;</mml:mi></mml:mrow></mml:math></disp-formula>
<p>This model was applied only to the Complete version.</p>
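<p>The difference model reduces to a Q-learning update applied to the relative outcome. The following is a minimal sketch with a hypothetical function name, assuming complete feedback so that the counterfactual outcome is available.</p>
<preformat preformat-type="code" xml:space="preserve">
```python
def dif_update(q, chosen, r_fc, r_cf, alpha):
    """Difference model (hypothetical helper): learn how much better the factual outcome
    was than the counterfactual one. Requires complete feedback."""
    q = dict(q)
    r_rlt = r_fc - r_cf  # relative outcome
    q[chosen] += alpha * (r_rlt - q[chosen])
    return q


q = {"A1": 0.0, "B": 0.0}
for r_fc, r_cf in [(70.0, 50.0), (60.0, 50.0), (66.0, 52.0)]:  # A1 chosen on each trial
    q = dif_update(q, "A1", r_fc, r_cf, alpha=0.5)
print(q)  # {'A1': 12.0, 'B': 0.0}
```
</preformat>
<p>With a constant learning rate, the chosen value tracks a recency-weighted average of the difference between factual and counterfactual outcomes rather than the factual outcome itself.</p>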
</sec>
<sec>
<title>4.3.4. The Hybrid (Hyb) Model</title>
<p>People have been shown to be neither fully absolute nor fully relative learners. Rather, they are hybrid learners who use both strategies but weight them differently (Bavard et al., <xref ref-type="bibr" rid="B4">2018</xref>).</p>
<disp-formula id="E23"><mml:math id="M27"><mml:mrow><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>b</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>F</mml:mi><mml:mi>C</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mtext>&#x02003;</mml:mtext><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>l</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>F</mml:mi><mml:mi>C</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mi>F</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<disp-formula id="E24"><mml:math id="M28"><mml:mrow><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi><mml:mi>y</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>w</mml:mi><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>b</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mi>w</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>l</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<disp-formula id="E25"><mml:math id="M29"><mml:mrow><mml:mi>&#x003B4;</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi><mml:mi>y</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<disp-formula id="E26"><mml:math id="M30"><mml:mrow><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B1;</mml:mi><mml:mi>&#x003B4;</mml:mi></mml:mrow></mml:math></disp-formula>
<p>For the Partial version, <italic>Q</italic><sub><italic>un</italic></sub> was used in place of <italic>r</italic><sub><italic>CF</italic></sub>.</p>
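<p>The sketch below implements the hybrid update, including the Partial-version substitution of <italic>Q</italic><sub><italic>un</italic></sub> for <italic>r</italic><sub><italic>CF</italic></sub>. The function name and argument layout are hypothetical.</p>
<preformat preformat-type="code" xml:space="preserve">
```python
def hyb_update(q, chosen, unchosen, r_fc, w, alpha, r_cf=None):
    """Hybrid model (hypothetical helper): mix absolute and relative outcomes with weight w.

    Complete feedback: r_cf is the counterfactual outcome. Partial feedback (r_cf=None):
    the current value of the unchosen option, Q_un, is used in its place.
    """
    q = dict(q)
    other = q[unchosen] if r_cf is None else r_cf
    r_abs = r_fc
    r_rlt = r_fc - other
    r_hyb = w * r_abs + (1 - w) * r_rlt
    q[chosen] += alpha * (r_hyb - q[chosen])
    return q


print(hyb_update({"A1": 0.0, "B": 0.0}, "A1", "B", 60.0, w=0.5, alpha=0.5, r_cf=40.0))
# {'A1': 20.0, 'B': 0.0}
```
</preformat>
<p>Setting w &#x0003D; 1 recovers the absolute (SQL) update of the chosen option. With complete feedback, w &#x0003D; 0 recovers the difference model.</p>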
</sec>
<sec>
<title>4.3.5. The Forgetting Q-Learning (FQL) Model</title>
<p>In the forgetting model, the chosen value is updated by its prediction error, while the unchosen value decays with a separate rate parameter (Barraclough et al., <xref ref-type="bibr" rid="B2">2004</xref>; Ito and Doya, <xref ref-type="bibr" rid="B21">2009</xref>; Katahira, <xref ref-type="bibr" rid="B23">2015</xref>; Niv et al., <xref ref-type="bibr" rid="B29">2015</xref>; Kato and Morita, <xref ref-type="bibr" rid="B24">2016</xref>).</p>
<disp-formula id="E27"><mml:math id="M31"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<disp-formula id="E28"><mml:math id="M32"><mml:mrow><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<disp-formula id="E29"><mml:math id="M33"><mml:mrow><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>*</mml:mo><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
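<p>The sketch below implements one forgetting Q-learning trial. It follows the equations above literally: the unchosen value is multiplied by &#x003B1;<sub>2</sub> and therefore decays toward zero. The function name is hypothetical.</p>
<preformat preformat-type="code" xml:space="preserve">
```python
def fql_update(q, chosen, unchosen, r_ch, alpha1, alpha2):
    """Forgetting Q-learning trial (hypothetical helper)."""
    q = dict(q)
    q[chosen] += alpha1 * (r_ch - q[chosen])  # prediction-error update of the chosen value
    q[unchosen] = alpha2 * q[unchosen]         # unchosen value is scaled by alpha2 (decay)
    return q


q = {"A1": 50.0, "B": 40.0}
for _ in range(3):  # choose A1 three times in a row, always receiving 70
    q = fql_update(q, "A1", "B", 70.0, alpha1=0.5, alpha2=0.9)
print(q)
```
</preformat>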
</sec>
<sec>
<title>4.3.6. The Experience-Weighted Attraction (EWA) Model</title>
<p>In addition to counterfactual outcomes, the number of times an option has been chosen has been shown to have a substantial effect on value learning. Camerer and Hua Ho (<xref ref-type="bibr" rid="B7">1999</xref>) therefore combined these two features in the experience-weighted attraction (EWA) model, an augmented version of the Rescorla-Wagner model:</p>
<disp-formula id="E30"><mml:math id="M34"><mml:mrow><mml:mi>N</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>&#x003C1;</mml:mi><mml:mi>N</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:math></disp-formula>
<disp-formula id="E31"><mml:math id="M35"><mml:mrow><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mi>&#x003C6;</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>/</mml:mo><mml:mi>N</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math></disp-formula>
<disp-formula id="E32"><mml:math id="M36"><mml:mrow><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mi>&#x003C6;</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B4;</mml:mi><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>/</mml:mo><mml:mi>N</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math></disp-formula>
<p>where <italic>N</italic> is the <italic>experience weight</italic> of the option, which decays with parameter &#x003C1;, and &#x003C6; is the decay parameter of the option values. When a counterfactual outcome is available (as in our Complete feedback version), it also contributes to the update of the unchosen value with weight &#x003B4;. When no counterfactual outcome is available (as in our Partial feedback version), this parameter is zero.</p>
<disp-formula id="E33"><mml:math id="M37"><mml:mrow><mml:mi>N</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>&#x003C1;</mml:mi><mml:mi>N</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:math></disp-formula>
<disp-formula id="E34"><mml:math id="M38"><mml:mrow><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mi>&#x003C6;</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>/</mml:mo><mml:mi>N</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math></disp-formula>
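<p>The sketch below implements one EWA trial update. It assumes a single experience weight <italic>N</italic> shared across options, since the equations carry no option index on <italic>N</italic>. When no counterfactual outcome is shown, it applies the unchosen-value update with &#x003B4; &#x0003D; 0, which is our reading of the text. All names are hypothetical.</p>
<preformat preformat-type="code" xml:space="preserve">
```python
def ewa_update(q, n, chosen, unchosen, r_ch, rho, phi, delta=0.0, r_un=None):
    """Experience-weighted attraction trial update (hypothetical helper).

    q: dict of option values; n: experience weight N(t). Returns (new q, N(t+1)).
    Partial feedback: leave r_un=None, so the counterfactual term delta * r_un is zero.
    """
    q = dict(q)
    n_next = rho * n + 1.0                              # decayed experience weight
    q[chosen] = (q[chosen] * n * phi + r_ch) / n_next   # chosen value
    cf_term = 0.0 if r_un is None else delta * r_un
    q[unchosen] = (q[unchosen] * n * phi + cf_term) / n_next  # unchosen value
    return q, n_next


q, n = {"A1": 0.0, "B": 0.0}, 1.0
q, n = ewa_update(q, n, "A1", "B", 60.0, rho=0.5, phi=1.0, delta=0.5, r_un=40.0)
print(n, q)
```
</preformat>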
</sec>
<sec>
<title>4.3.7. The Sample-Based Episodic (SBE) Model</title>
<p>The sample-based episodic model computes option values from a recency-weighted sampling of past outcomes rather than by tracking a running average of outcomes, as in Q-learning (Bornstein and Norman, <xref ref-type="bibr" rid="B6">2017</xref>; Bornstein et al., <xref ref-type="bibr" rid="B5">2017</xref>). To estimate the value of option <italic>a</italic> at trial <italic>t</italic>, denoted by <italic>Q</italic><sub><italic>a</italic></sub>, the model stochastically samples one observed reward <italic>r</italic><sub><italic>i</italic></sub> with the following probability:</p>
<disp-formula id="E35"><mml:math id="M39"><mml:mrow><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>&#x003B1;</mml:mi><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mi>i</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:math></disp-formula>
<p>Under this distribution, the most recent outcome is the most likely to be sampled, and the probability of sampling an outcome decays exponentially with its age. The likelihood, that is, the probability of the observed choice under this model, is therefore computed as follows:</p>
<disp-formula id="E36"><mml:math id="M40"><mml:mrow><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>P</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>Q</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x022C5;</mml:mo><mml:mtext>&#x02009;</mml:mtext><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>P</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>Q</mml:mi><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x022C5;</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mi>&#x003B2;</mml:mi><mml:msub><mml:mi>Q</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mi>&#x003B2;</mml:mi><mml:msub><mml:mi>Q</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mi>&#x003B2;</mml:mi><mml:msub><mml:mi>Q</mml:mi><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:mrow></mml:mfrac></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:mstyle></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:mstyle></mml:mrow></mml:math></disp-formula>
<p>This is a sum of softmax choice probabilities over all possible pairs of sampled values for the two competing options, each weighted by the probability of sampling that pair. In this model, the sample probability was set to 1 for trials on which there was no reward to sample from.</p>
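<p>The sketch below evaluates the SBE choice likelihood for a single trial under several assumptions. Each option&#x00027;s history holds only the trials on which its outcome was observed. The sampling weights are used unnormalized, exactly as written above. For an option with no reward to sample from, a single hypothetical default value (<monospace>default_value</monospace>) is used with sample probability 1. The two-option softmax is computed in an equivalent logistic form to avoid overflow. All names are illustrative.</p>
<preformat preformat-type="code" xml:space="preserve">
```python
import math


def sample_probs(history, t, alpha):
    """P(Q_a = r_i) = alpha * (1 - alpha) ** (t - i) for each past observation (hypothetical helper).

    history: list of (trial_index_i, reward_r_i) pairs observed for one option before trial t.
    """
    return [(r, alpha * (1 - alpha) ** (t - i)) for i, r in history]


def logistic_softmax(q_ch, q_un, beta):
    """e^(b*Qch) / (e^(b*Qch) + e^(b*Qun)), rewritten as a logistic function for stability."""
    x = beta * (q_ch - q_un)
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sbe_choice_likelihood(hist_ch, hist_un, t, alpha, beta, default_value=50.0):
    """Softmax probabilities summed over all pairs of sampled values, each pair weighted by
    its sampling probability. An option with no reward to sample from contributes a single
    hypothetical default value with sample probability 1."""
    samples_ch = sample_probs(hist_ch, t, alpha) or [(default_value, 1.0)]
    samples_un = sample_probs(hist_un, t, alpha) or [(default_value, 1.0)]
    total = 0.0
    for q_ch, p_ch in samples_ch:
        for q_un, p_un in samples_un:
            total += p_ch * p_un * logistic_softmax(q_ch, q_un, beta)
    return total


# Chosen option observed at trials 1 and 3, unchosen option at trial 2; evaluate trial 4.
print(sbe_choice_likelihood([(1, 60.0), (3, 70.0)], [(2, 40.0)], t=4, alpha=0.3, beta=0.1))
```
</preformat>
<p>Over a finite history, the weights &#x003B1;(1 &#x02212; &#x003B1;)<sup>(<italic>t</italic>&#x02212;<italic>i</italic>)</sup> do not sum exactly to 1. If a properly normalized probability is required, one possible variant is to divide each option&#x00027;s weights by their sum.</p>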
</sec>
<sec>
<title>4.3.8. The Relative Asymmetric (RelAsym) Model</title>
<p>The RelAsym model combines two components: relative value learning (through a reference-point mechanism) and asymmetric updating (through a confirmation-bias mechanism; Garcia et al., <xref ref-type="bibr" rid="B20">2021</xref>; Ting et al., <xref ref-type="bibr" rid="B48">2021</xref>). In the reference-point component, outcomes are context-dependent, so option values are learned relative to the reference point of their context. In the asymmetric component, values are updated with a larger weight after positive prediction errors. The reference-point part of the model is as follows:</p>
<disp-formula id="E37"><mml:math id="M41"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<disp-formula id="E38"><mml:math id="M42"><mml:mrow><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<disp-formula id="E39"><mml:math id="M43"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<disp-formula id="E40"><mml:math id="M44"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<p>where <italic>V</italic><sub><italic>x</italic></sub> is the value of the context, which serves as the reference point. The confirmation part of the model is as follows:</p>
<disp-formula id="E41"><mml:math id="M45"><mml:mrow><mml:mrow><mml:mo>{</mml:mo> <mml:mrow><mml:mtable><mml:mtr><mml:mtd><mml:mrow><mml:msub><mml:mi>Q</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>Q</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x003B1;</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>&#x003B4;</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mtext>if</mml:mtext></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:msub><mml:mi>&#x003B4;</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0003E;</mml:mo><mml:mn>0</mml:mn></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:msub><mml:mi>Q</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>Q</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x003B1;</mml:mi><mml:mrow><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>&#x003B4;</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mtext>if</mml:mtext></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:msub><mml:mi>&#x003B4;</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0003C;</mml:mo><mml:mn>0</mml:mn></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:msub><mml:mi>Q</mml:mi><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>Q</mml:mi><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x003B1;</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>&#x003B4;</mml:mi><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mtext>if</mml:mtext></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:msub><mml:mi>&#x003B4;</mml:mi><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0003C;</mml:mo><mml:mn>0</mml:mn></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:msub><mml:mi>Q</mml:mi><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>Q</mml:mi><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x003B1;</mml:mi><mml:mrow><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>&#x003B4;</mml:mi><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mtext>if</mml:mtext></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:msub><mml:mi>&#x003B4;</mml:mi><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0003E;</mml:mo><mml:mn>0</mml:mn></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow> </mml:mrow></mml:mrow></mml:math></disp-formula>
<p>where <italic>Q</italic><sub><italic>ch</italic></sub> and <italic>Q</italic><sub><italic>un</italic></sub> are the values of the chosen and unchosen options, and &#x003B1;<sub><italic>conf</italic></sub> and &#x003B1;<sub><italic>disc</italic></sub> are the learning rates for confirmatory and disconfirmatory information, respectively.</p>
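<p>As an illustration only, the following Python sketch applies one trial of the RelAsym updates in the order in which they are written above. The context value <italic>V</italic><sub><italic>x</italic></sub> is updated first. The relative prediction errors are then computed against it. Finally, the confirmation rule assigns &#x003B1;<sub><italic>conf</italic></sub> or &#x003B1;<sub><italic>disc</italic></sub> according to the sign of each prediction error. The sketch assumes that both outcomes are observed and that the context outcome <italic>r</italic><sub><italic>x</italic></sub> is supplied by the caller. The function and argument names are hypothetical, and this is not the authors' code.</p>
<preformat preformat-type="code">
```python
from dataclasses import dataclass


@dataclass
class RelAsymParams:
    alpha_1: float     # learning rate of the context value V_x
    alpha_conf: float  # learning rate for confirmatory prediction errors
    alpha_disc: float  # learning rate for disconfirmatory prediction errors


def relasym_update(q_ch, q_un, v_x, r_ch, r_un, r_x, p):
    """Apply one RelAsym trial and return the new (q_ch, q_un, v_x).

    Hypothetical helper: assumes both outcomes are observed and that the
    context outcome r_x is supplied by the caller.
    """
    # Reference-point component: update the context value first.
    delta_x = r_x - v_x
    v_x = v_x + p.alpha_1 * delta_x

    # Relative prediction errors, measured against the context value.
    delta_ch = (r_ch - v_x) - q_ch
    delta_un = (r_un - v_x) - q_un

    # Confirmation component: a positive chosen prediction error confirms
    # the choice; for the unchosen option, a negative one does.
    if delta_ch > 0:
        q_ch = q_ch + p.alpha_conf * delta_ch
    else:
        q_ch = q_ch + p.alpha_disc * delta_ch
    if delta_un > 0:
        q_un = q_un + p.alpha_disc * delta_un
    else:
        q_un = q_un + p.alpha_conf * delta_un
    return q_ch, q_un, v_x


params = RelAsymParams(alpha_1=0.5, alpha_conf=0.3, alpha_disc=0.1)
print(relasym_update(0.0, 0.0, 0.0, r_ch=60.0, r_un=40.0, r_x=50.0, p=params))
```
</preformat>
<p>For example, start from zero values with &#x003B1;<sub>1</sub> = 0.5, &#x003B1;<sub><italic>conf</italic></sub> = 0.3, and &#x003B1;<sub><italic>disc</italic></sub> = 0.1, and use the outcomes <italic>r</italic><sub><italic>ch</italic></sub> = 60, <italic>r</italic><sub><italic>un</italic></sub> = 40, and <italic>r</italic><sub><italic>x</italic></sub> = 50. The sketch then gives <italic>V</italic><sub><italic>x</italic></sub> = 25, <italic>Q</italic><sub><italic>ch</italic></sub> = 10.5 (a confirmatory update), and <italic>Q</italic><sub><italic>un</italic></sub> = 1.5 (a disconfirmatory update).</p>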
</sec>
<sec>
<title>4.3.9. The Opposing Learning (OL) Model</title>
<p>The opposing learning model was inspired by the opposing effects of dopamine on the chosen and unchosen options. In this model, the chosen and unchosen values are updated simultaneously with the chosen prediction error, but in opposite directions.</p>
<disp-formula id="E42"><mml:math id="M46"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<disp-formula id="E43"><mml:math id="M47"><mml:mrow><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<disp-formula id="E44"><mml:math id="M48"><mml:mrow><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
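<p>The OL update can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation, and the function name and the constant-reward demonstration are hypothetical. With a constant reward, the chosen value converges to that reward. The unchosen value keeps moving in the opposite direction until the chosen prediction error vanishes.</p>
<preformat preformat-type="code">
```python
def ol_update(q_ch, q_un, r_ch, alpha_1, alpha_2):
    """Apply one Opposing Learning (OL) trial; returns the new (q_ch, q_un)."""
    delta_ch = r_ch - q_ch            # chosen prediction error
    q_ch = q_ch + alpha_1 * delta_ch  # chosen value moves toward r_ch
    q_un = q_un - alpha_2 * delta_ch  # unchosen value moves the opposite way
    return q_ch, q_un


# Hypothetical demonstration: always choose the same option and receive 70.
q_ch, q_un = 50.0, 50.0
for _ in range(200):
    q_ch, q_un = ol_update(q_ch, q_un, r_ch=70.0, alpha_1=0.3, alpha_2=0.2)
print(round(q_ch, 3), round(q_un, 3))
```
</preformat>
<p>Over trials, the chosen prediction errors sum to (70 &#x02212; 50)/&#x003B1;<sub>1</sub>. The unchosen value therefore settles near 50 &#x02212; &#x003B1;<sub>2</sub>(70 &#x02212; 50)/&#x003B1;<sub>1</sub> &#x02248; 36.67 in this example.</p>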
<p>For the Complete version of the task, we extended this model by replacing the absolute reward with a weighted combination of absolute and relative rewards (a hybrid strategy).</p>
<disp-formula id="E45"><mml:math id="M49"><mml:mrow><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>b</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>F</mml:mi><mml:mi>C</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mtext>&#x02003;</mml:mtext><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>l</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>F</mml:mi><mml:mi>C</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mi>F</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<disp-formula id="E46"><mml:math id="M50"><mml:mrow><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi><mml:mi>y</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>w</mml:mi><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>b</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mi>w</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>l</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<disp-formula id="E47"><mml:math id="M51"><mml:mrow><mml:mi>&#x003B4;</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi><mml:mi>y</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<disp-formula id="E48"><mml:math id="M52"><mml:mrow><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mi>&#x003B4;</mml:mi></mml:mrow></mml:math></disp-formula>
<disp-formula id="E49"><mml:math id="M53"><mml:mrow><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mi>&#x003B4;</mml:mi></mml:mrow></mml:math></disp-formula>
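<p>The hybrid version differs from the basic OL model only in the reward that enters the prediction error. The sketch below is illustrative and is not the authors' code. Reading <italic>r</italic><sub><italic>FC</italic></sub> and <italic>r</italic><sub><italic>CF</italic></sub> as the outcomes of the chosen and unchosen options is our assumption, and the function name is hypothetical.</p>
<preformat preformat-type="code">
```python
def ol_hybrid_update(q_ch, q_un, r_fc, r_cf, w, alpha_1, alpha_2):
    """Apply one hybrid OL trial; returns the new (q_ch, q_un)."""
    r_abs = r_fc                            # absolute reward
    r_rlt = r_fc - r_cf                     # relative reward
    r_hyb = w * r_abs + (1.0 - w) * r_rlt   # weighted (hybrid) reward
    delta = r_hyb - q_ch                    # hybrid prediction error
    q_ch = q_ch + alpha_1 * delta           # chosen value update
    q_un = q_un - alpha_2 * delta           # opposite-direction unchosen update
    return q_ch, q_un


for w in (1.0, 0.5, 0.0):
    print(w, ol_hybrid_update(0.0, 0.0, r_fc=60.0, r_cf=40.0, w=w,
                              alpha_1=0.5, alpha_2=0.25))
```
</preformat>
<p>Algebraically, <italic>r</italic><sub><italic>hyb</italic></sub> = <italic>r</italic><sub><italic>FC</italic></sub> &#x02212; (1 &#x02212; <italic>w</italic>)<italic>r</italic><sub><italic>CF</italic></sub>. Setting <italic>w</italic> = 1 therefore recovers the basic OL update. A smaller <italic>w</italic> discounts the chosen outcome by a larger share of the unchosen one.</p>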
</sec>
</sec>
<sec>
<title>4.4. Pure Simulation Procedure</title>
<p>We examined the behavior of the OL model across a wide range of task and parameter settings. Without loss of generality, the simulations used normalized settings in which the reward distributions had &#x003C3; &#x0003D; 1. This is possible because the value updates are linear in the rewards and the choice rule depends only on the product of &#x003B2; and the value difference. Dividing all rewards (and initial values) by &#x003C3; and multiplying &#x003B2; by &#x003C3; therefore leaves the choice probabilities unchanged. For example, consider a task with rewards drawn from <inline-formula><mml:math id="M63"><mml:mi mathvariant='script'>N</mml:mi></mml:math></inline-formula>(&#x003BC; &#x0003D; 64, &#x003C3; &#x0003D; 10), &#x003B2; &#x0003D; 0.01, and any &#x003B1;<sub>1</sub>, &#x003B1;<sub>2</sub>. It is equivalent to the normalized setting <inline-formula><mml:math id="M64"><mml:mi mathvariant='script'>N</mml:mi></mml:math></inline-formula>(&#x003BC; &#x0003D; 6.4, &#x003C3; &#x0003D; 1) (rewards divided by 10) with &#x003B2; &#x0003D; 0.1 (multiplied by 10) and the same &#x003B1;<sub>1</sub>, &#x003B1;<sub>2</sub>. The task settings included 10 pairs of options whose mean differences covered {1, 2, &#x02026;, 10} ([&#x003BC;<sub>1</sub>, &#x003BC;<sub>2</sub>] &#x02208; {[10, 9], [10, 8], &#x02026;, [10, 0]}, with &#x003C3; &#x0003D; 1). The parameter settings covered a wide range of &#x003B2;: {0, 0.025, 0.05, 0.075, 0.1, 0.125, &#x02026;, 0.4} &#x0222A; {0.5, 0.6, &#x02026;, 1}; &#x003B1;<sub>1</sub>: {0.1, 0.2, &#x02026;, 1}; and &#x003B1;<sub>2</sub>/&#x003B1;<sub>1</sub>: {0, 0.5, 0.75, 0.875, 0.93, 0.96, 0.98, 0.992, 0.996, 0.998, 0.999, 1}.</p>
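<p>The normalization argument and the size of the simulation grid can be checked with a short Python sketch. The sketch assumes a two-option softmax choice rule with inverse temperature &#x003B2; and zero initial values. It replays an arbitrary, made-up sequence of choices and rewards through the OL updates. It then confirms that the raw setting (&#x003C3; = 10, &#x003B2; = 0.01) and its normalized counterpart (&#x003C3; = 1, &#x003B2; = 0.1) assign the same probabilities to the observed choices. The grid reads the &#x003B2; range as steps of 0.025 from 0 to 0.4. All function names are hypothetical.</p>
<preformat preformat-type="code">
```python
import math


def softmax_prob(q_a, q_b, beta):
    """Probability of choosing option a over b under a two-option softmax."""
    return 1.0 / (1.0 + math.exp(-beta * (q_a - q_b)))


def replay_probs(choices, rewards, beta, alpha_1, alpha_2, q0=0.0):
    """Replay fixed choices (0 or 1) and rewards through the OL updates and
    return the probability the model assigns to each observed choice."""
    q = [q0, q0]
    probs = []
    for c, r in zip(choices, rewards):
        other = 1 - c
        probs.append(softmax_prob(q[c], q[other], beta))
        delta = r - q[c]
        q[c] += alpha_1 * delta      # chosen value update
        q[other] -= alpha_2 * delta  # opposite-direction unchosen update
    return probs


def grid(start, stop, step):
    """Inclusive arithmetic grid, rounded to suppress float drift."""
    n = round((stop - start) / step)
    return [round(start + i * step, 10) for i in range(n + 1)]


# Illustrative, made-up data in raw units (sigma = 10).
choices = [0, 1, 0, 0, 1, 0, 1, 0]
rewards = [64.0, 52.0, 71.0, 58.0, 47.0, 66.0, 55.0, 60.0]
sigma = 10.0

raw = replay_probs(choices, rewards, beta=0.01, alpha_1=0.3, alpha_2=0.2)
norm = replay_probs(choices, [r / sigma for r in rewards],
                    beta=0.01 * sigma, alpha_1=0.3, alpha_2=0.2)
print(all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(raw, norm)))

betas = grid(0.0, 0.4, 0.025) + grid(0.5, 1.0, 0.1)
alpha_1s = grid(0.1, 1.0, 0.1)
ratios = [0, 0.5, 0.75, 0.875, 0.93, 0.96, 0.98, 0.992, 0.996, 0.998, 0.999, 1]
tasks = [(10, 10 - d) for d in range(1, 11)]  # [mu_1, mu_2] pairs, sigma = 1
n_settings = len(betas) * len(alpha_1s) * len(ratios) * len(tasks)
print(len(betas), n_settings)
```
</preformat>
<p>Replaying a fixed sequence, rather than sampling new choices, makes the comparison deterministic. Under this reading, the grid contains 23 values of &#x003B2;, 10 of &#x003B1;<sub>1</sub>, 12 ratios, and 10 task pairs.</p>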
</sec>
<sec>
<title>4.5. Fitting and Simulation Procedure</title>
<p>Models were fitted with the <italic>fmincon</italic> function in MATLAB (The MathWorks Inc., Natick, MA). Each fit was started from several initial points to increase the probability of finding the global optimum rather than getting stuck in a local optimum. For model comparison, we calculated exceedance probabilities (xp) and protected exceedance probabilities (pxp; Stephan et al., <xref ref-type="bibr" rid="B43">2009</xref>; Rigoux et al., <xref ref-type="bibr" rid="B39">2014</xref>). Because the number of trials differed across participants, we supplied each model's BIC to the Bayesian model selection (BMS) toolbox. Parameters were estimated by maximum a posteriori (MAP) optimization with a weakly informative Beta(1.2, 1.2) prior on each parameter. Option values lie on a 0&#x02013;100 scale, so the inverse temperature &#x003B2; is expected to be much smaller than 1. The Beta(1.2, 1.2) prior, which is defined on [0, 1], is therefore also appropriate for &#x003B2; (<xref ref-type="table" rid="T4">Table 4</xref>). For each participant, the model was simulated 100 times with the best-fitting parameters. The participant's representative simulated behavior was then obtained by averaging over these repetitions.</p>
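<p>The estimation steps can be sketched in pure Python: a MAP objective with Beta(1.2, 1.2) priors, several starting points, and a BIC for model comparison. This is a hedged illustration, not the authors' MATLAB code, and it rests on several assumptions. A simple compass (pattern) search stands in for <italic>fmincon</italic>. A basic two-option delta-rule model with a softmax choice rule serves as a placeholder model. Initial values of 50 are assumed. The BIC is evaluated from the log-likelihood at the MAP estimate, which is a common approximation. All function names are hypothetical.</p>
<preformat preformat-type="code">
```python
import math
import random


def neg_log_lik(params, data, q0=50.0):
    """Negative log-likelihood of (choice, reward) pairs under a placeholder
    delta-rule model with softmax choice; choices are 0 or 1, q0 is assumed."""
    alpha, beta = params
    q = [q0, q0]
    nll = 0.0
    for c, r in data:
        p = 1.0 / (1.0 + math.exp(-beta * (q[c] - q[1 - c])))
        nll -= math.log(max(p, 1e-300))
        q[c] += alpha * (r - q[c])
    return nll


def log_beta_pdf(x, a=1.2, b=1.2):
    """Log density of a Beta(a, b) prior at x in (0, 1)."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_norm


def neg_log_post(params, data):
    """MAP objective: negative log-likelihood minus the log prior."""
    return neg_log_lik(params, data) - sum(log_beta_pdf(x) for x in params)


def compass_search(f, x0, lo, hi, step=0.1, tol=1e-6):
    """Bounded pattern search: try +/- step on each coordinate, halve on failure."""
    x, fx = list(x0), f(x0)
    while step > tol:
        improved = False
        for i in range(len(x)):
            for s in (step, -step):
                y = list(x)
                y[i] = min(hi[i], max(lo[i], y[i] + s))
                fy = f(y)
                if fx > fy:
                    x, fx, improved = y, fy, True
        if not improved:
            step /= 2
    return x, fx


def fit_map(data, n_starts=5, seed=0):
    """Multi-start MAP fit; returns the best (params, objective)."""
    lo, hi = [1e-6, 1e-6], [1 - 1e-6, 1 - 1e-6]
    rng = random.Random(seed)
    best = None
    for _ in range(n_starts):
        x0 = [rng.uniform(l, h) for l, h in zip(lo, hi)]
        x, fx = compass_search(lambda p: neg_log_post(p, data), x0, lo, hi)
        if best is None or best[1] > fx:
            best = (x, fx)
    return best


def bic(nll, n_params, n_trials):
    """Bayesian information criterion from a negative log-likelihood."""
    return n_params * math.log(n_trials) + 2.0 * nll
```
</preformat>
<p>For each participant, a model's BIC would then be <monospace>bic(neg_log_lik(params, data), 2, len(data))</monospace>, and these values would be passed on to the model-comparison step.</p>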
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>The estimated parameters.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="center" colspan="8" style="border-bottom: thin solid #000000;"><bold>Parameters</bold></th>
</tr>
<tr>
<th valign="top" align="left" style="border-bottom: thin solid #000000;"><bold>Parameter</bold></th>
<th valign="top" align="center" style="border-bottom: thin solid #000000;"><bold>Constraint</bold></th>
<th valign="top" align="center" style="border-bottom: thin solid #000000;"><bold>SQL</bold></th>
<th valign="top" align="center" style="border-bottom: thin solid #000000;"><bold>RPA</bold></th>
<th valign="top" align="center" style="border-bottom: thin solid #000000;"><bold>Dif</bold></th>
<th valign="top" align="center" style="border-bottom: thin solid #000000;"><bold>Hyb</bold></th>
<th valign="top" align="center" style="border-bottom: thin solid #000000;"><bold>OL<sub>1</sub></bold></th>
<th valign="top" align="center" style="border-bottom: thin solid #000000;"><bold>OL<sub>2</sub></bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" colspan="8"><bold>Partial</bold></td>
</tr>
<tr>
<td valign="top" align="left">&#x003B2;</td>
<td valign="top" align="center">0 &#x02264; &#x003B2; &#x0003C; inf</td>
<td valign="top" align="center">0.07 &#x000B1; 0.03</td>
<td valign="top" align="center">0.12 &#x000B1; 0.08</td>
<td/>
<td valign="top" align="center">0.06 &#x000B1; 0.04</td>
<td valign="top" align="center">0.02 &#x000B1; 0.02</td>
<td valign="top" align="center">0.03 &#x000B1; 0.02</td>
</tr>
<tr>
<td valign="top" align="left">&#x003B1;<sub>1</sub></td>
<td valign="top" align="center">0 &#x02264; &#x003B1;<sub>1</sub> &#x02264; 1</td>
<td valign="top" align="center">0.25 &#x000B1; 0.26</td>
<td valign="top" align="center">0.26 &#x000B1; 0.27</td>
<td/>
<td valign="top" align="center">0.37 &#x000B1; 0.29</td>
<td valign="top" align="center">0.26 &#x000B1; 0.2</td>
<td valign="top" align="center">0.32 &#x000B1; 0.23</td>
</tr>
<tr>
<td valign="top" align="left">&#x003B1;<sub>2</sub></td>
<td valign="top" align="center">0 &#x0003C; &#x003B1;<sub>2</sub> &#x02264; &#x003B1;<sub>1</sub></td>
<td/>
<td valign="top" align="center">0.34 &#x000B1; 0.3</td>
<td/>
<td/>
<td/>
<td valign="top" align="center">0.21 &#x000B1; 0.18</td>
</tr>
<tr>
<td valign="top" align="left"><italic>w</italic></td>
<td valign="top" align="center">0 &#x02264; <italic>w</italic> &#x02264; 1</td>
<td/>
<td/>
<td/>
<td valign="top" align="center">0.55 &#x000B1; 0.37</td>
<td/>
<td/>
</tr>
<tr>
<td valign="top" align="left" colspan="8"><bold>Complete</bold></td>
</tr>
<tr>
<td valign="top" align="left">&#x003B2;</td>
<td valign="top" align="center">0 &#x02264; &#x003B2; &#x0003C; inf</td>
<td valign="top" align="center">0.12 &#x000B1; 0.09</td>
<td valign="top" align="center">0.37 &#x000B1; 0.24</td>
<td valign="top" align="center">0.37 &#x000B1; 0.23</td>
<td valign="top" align="center">0.2 &#x000B1; 0.15</td>
<td valign="top" align="center">0.11 &#x000B1; 0.12</td>
<td valign="top" align="center">0.1 &#x000B1; 0.1</td>
</tr>
<tr>
<td valign="top" align="left">&#x003B1;<sub>1</sub></td>
<td valign="top" align="center">0 &#x02264; &#x003B1;<sub>1</sub> &#x02264; 1</td>
<td valign="top" align="center">0.14 &#x000B1; 0.16</td>
<td valign="top" align="center">0.1 &#x000B1; 0.12</td>
<td valign="top" align="center">0.09 &#x000B1; 0.08</td>
<td valign="top" align="center">0.21 &#x000B1; 0.15</td>
<td valign="top" align="center">0.22 &#x000B1; 0.15</td>
<td valign="top" align="center">0.26 &#x000B1; 0.14</td>
</tr>
<tr>
<td valign="top" align="left">&#x003B1;<sub>2</sub></td>
<td valign="top" align="center">0 &#x0003C; &#x003B1;<sub>2</sub> &#x02264; &#x003B1;<sub>1</sub></td>
<td/>
<td valign="top" align="center">0.11 &#x000B1; 0.13</td>
<td/>
<td/>
<td/>
<td valign="top" align="center">0.19 &#x000B1; 0.16</td>
</tr>
<tr>
<td valign="top" align="left">&#x003B1;<sub>3</sub></td>
<td valign="top" align="center">0 &#x02264; &#x003B1;<sub>3</sub> &#x02264; 1</td>
<td/>
<td valign="top" align="center">0.35 &#x000B1; 0.3</td>
<td/>
<td/>
<td/>
<td/>
</tr>
<tr>
<td valign="top" align="left"><italic>w</italic></td>
<td valign="top" align="center">0 &#x02264; <italic>w</italic> &#x02264; 1</td>
<td/>
<td/>
<td/>
<td valign="top" align="center">0.28 &#x000B1; 0.23</td>
<td valign="top" align="center">0.28 &#x000B1; 0.17</td>
<td valign="top" align="center">0.32 &#x000B1; 0.19</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>Mean&#x000B1;SD</italic>.</p>
</table-wrap-foot>
</table-wrap>
</sec>
</sec>
<sec sec-type="data-availability" id="s5">
<title>Data Availability Statement</title>
<p>The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found at: <ext-link ext-link-type="uri" xlink:href="https://osf.io/emgph/">https://osf.io/emgph/</ext-link>.</p>
</sec>
<sec id="s6">
<title>Ethics Statement</title>
<p>The studies involving human participants were reviewed and approved by the Ethics Committee of the Institute for Research in Fundamental Sciences. The patients/participants provided their written informed consent to participate in this study.</p>
</sec>
<sec id="s7">
<title>Author Contributions</title>
<p>ZB designed and performed the experiment, analyzed the data, and drafted the manuscript. MN supervised the research and critically revised the manuscript. A-HV contributed to the conception and critically revised the manuscript. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s8">
<title>Publisher&#x00027;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
</body>
<back>
<sec sec-type="supplementary-material" id="s9">
<title>Supplementary Material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fnins.2022.631347/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fnins.2022.631347/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Data_Sheet_1.pdf" id="SM1" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bariselli</surname> <given-names>S.</given-names></name> <name><surname>Fobbs</surname> <given-names>W.</given-names></name> <name><surname>Creed</surname> <given-names>M.</given-names></name> <name><surname>Kravitz</surname> <given-names>A.</given-names></name></person-group> (<year>2019</year>). <article-title>A competitive model for striatal action selection</article-title>. <source>Brain Res</source>. <volume>1713</volume>, <fpage>70</fpage>&#x02013;<lpage>79</lpage>. <pub-id pub-id-type="doi">10.1016/j.brainres.2018.10.009</pub-id><pub-id pub-id-type="pmid">30300636</pub-id></citation></ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Barraclough</surname> <given-names>D. J.</given-names></name> <name><surname>Conroy</surname> <given-names>M. L.</given-names></name> <name><surname>Lee</surname> <given-names>D.</given-names></name></person-group> (<year>2004</year>). <article-title>Prefrontal cortex and decision making in a mixed-strategy game</article-title>. <source>Nat. Neurosci</source>. <volume>7</volume>, <fpage>404</fpage>&#x02013;<lpage>410</lpage>. <pub-id pub-id-type="doi">10.1038/nn1209</pub-id><pub-id pub-id-type="pmid">15004564</pub-id></citation></ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Baucells</surname> <given-names>M.</given-names></name> <name><surname>Weber</surname> <given-names>M.</given-names></name> <name><surname>Welfens</surname> <given-names>F.</given-names></name></person-group> (<year>2011</year>). <article-title>Reference-point formation and updating</article-title>. <source>Manage. Sci</source>. <volume>57</volume>, <fpage>506</fpage>&#x02013;<lpage>519</lpage>. <pub-id pub-id-type="doi">10.1287/mnsc.1100.1286</pub-id></citation>
</ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bavard</surname> <given-names>S.</given-names></name> <name><surname>Lebreton</surname> <given-names>M.</given-names></name> <name><surname>Khamassi</surname> <given-names>M.</given-names></name> <name><surname>Coricelli</surname> <given-names>G.</given-names></name> <name><surname>Palminteri</surname> <given-names>S.</given-names></name></person-group> (<year>2018</year>). <article-title>Reference-point centering and range-adaptation enhance human reinforcement learning at the cost of irrational preferences</article-title>. <source>Nat. Commun</source>. <volume>9</volume>, <fpage>1</fpage>&#x02013;<lpage>12</lpage>. <pub-id pub-id-type="doi">10.1038/s41467-018-06781-2</pub-id><pub-id pub-id-type="pmid">30374019</pub-id></citation></ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bornstein</surname> <given-names>A. M.</given-names></name> <name><surname>Khaw</surname> <given-names>M. W.</given-names></name> <name><surname>Shohamy</surname> <given-names>D.</given-names></name> <name><surname>Daw</surname> <given-names>N. D.</given-names></name></person-group> (<year>2017</year>). <article-title>Reminders of past choices bias decisions for reward in humans</article-title>. <source>Nat. Commun</source>. <volume>8</volume>, <fpage>1</fpage>&#x02013;<lpage>9</lpage>. <pub-id pub-id-type="doi">10.1038/ncomms15958</pub-id><pub-id pub-id-type="pmid">28653668</pub-id></citation></ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bornstein</surname> <given-names>A. M.</given-names></name> <name><surname>Norman</surname> <given-names>K. A.</given-names></name></person-group> (<year>2017</year>). <article-title>Reinstated episodic context guides sampling-based decisions for reward</article-title>. <source>Nat. Neurosci</source>. <volume>20</volume>, <fpage>997</fpage>&#x02013;<lpage>1003</lpage>. <pub-id pub-id-type="doi">10.1038/nn.4573</pub-id><pub-id pub-id-type="pmid">28581478</pub-id></citation></ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Camerer</surname> <given-names>C.</given-names></name> <name><surname>Hua Ho</surname> <given-names>T.</given-names></name></person-group> (<year>1999</year>). <article-title>Experience-weighted attraction learning in normal form games</article-title>. <source>Econometrica</source> <volume>67</volume>, <fpage>827</fpage>&#x02013;<lpage>874</lpage>. <pub-id pub-id-type="doi">10.1111/1468-0262.00054</pub-id><pub-id pub-id-type="pmid">9710553</pub-id></citation></ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Camille</surname> <given-names>N.</given-names></name> <name><surname>Coricelli</surname> <given-names>G.</given-names></name> <name><surname>Sallet</surname> <given-names>J.</given-names></name> <name><surname>Pradat-Diehl</surname> <given-names>P.</given-names></name> <name><surname>Duhamel</surname> <given-names>J.-R.</given-names></name> <name><surname>Sirigu</surname> <given-names>A.</given-names></name></person-group> (<year>2004</year>). <article-title>The involvement of the orbitofrontal cortex in the experience of regret</article-title>. <source>Science</source> <volume>304</volume>, <fpage>1167</fpage>&#x02013;<lpage>1170</lpage>. <pub-id pub-id-type="doi">10.1126/science.1094550</pub-id><pub-id pub-id-type="pmid">15919977</pub-id></citation></ref>
<ref id="B9">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Ciranka</surname> <given-names>S.</given-names></name> <name><surname>Linde-Domingo</surname> <given-names>J.</given-names></name> <name><surname>Padezhki</surname> <given-names>I.</given-names></name> <name><surname>Wicharz</surname> <given-names>C.</given-names></name> <name><surname>Wu</surname> <given-names>C. M.</given-names></name> <name><surname>Spitzer</surname> <given-names>B.</given-names></name></person-group> (<year>2022</year>). <article-title>Asymmetric reinforcement learning facilitates human inference of transitive relations</article-title>. <source>Nat. Hum. Behav</source>. <fpage>1</fpage>&#x02013;<lpage>10</lpage>. <pub-id pub-id-type="doi">10.1038/s41562-021-01263-w</pub-id>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://psyarxiv.com/k7w38">https://psyarxiv.com/k7w38</ext-link><pub-id pub-id-type="pmid">35102348</pub-id></citation></ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Collins</surname> <given-names>A. G.</given-names></name> <name><surname>Frank</surname> <given-names>M. J.</given-names></name></person-group> (<year>2014</year>). <article-title>Opponent actor learning (OpAL): modeling interactive effects of striatal dopamine on reinforcement learning and choice incentive</article-title>. <source>Psychol. Rev</source>. <volume>121</volume>, <fpage>337</fpage>. <pub-id pub-id-type="doi">10.1037/a0037015</pub-id><pub-id pub-id-type="pmid">25090423</pub-id></citation></ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Coricelli</surname> <given-names>G.</given-names></name> <name><surname>Critchley</surname> <given-names>H. D.</given-names></name> <name><surname>Joffily</surname> <given-names>M.</given-names></name> <name><surname>O&#x00027;Doherty</surname> <given-names>J. P.</given-names></name> <name><surname>Sirigu</surname> <given-names>A.</given-names></name> <name><surname>Dolan</surname> <given-names>R. J.</given-names></name></person-group> (<year>2005</year>). <article-title>Regret and its avoidance: a neuroimaging study of choice behavior</article-title>. <source>Nat. Neurosci</source>. <volume>8</volume>, <fpage>1255</fpage>&#x02013;<lpage>1262</lpage>. <pub-id pub-id-type="doi">10.1038/nn1514</pub-id><pub-id pub-id-type="pmid">16116457</pub-id></citation></ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Coricelli</surname> <given-names>G.</given-names></name> <name><surname>Dolan</surname> <given-names>R. J.</given-names></name> <name><surname>Sirigu</surname> <given-names>A.</given-names></name></person-group> (<year>2007</year>). <article-title>Brain, emotion and decision making: the paradigmatic example of regret</article-title>. <source>Trends Cogn. Sci</source>. <volume>11</volume>, <fpage>258</fpage>&#x02013;<lpage>265</lpage>. <pub-id pub-id-type="doi">10.1016/j.tics.2007.04.003</pub-id><pub-id pub-id-type="pmid">17475537</pub-id></citation></ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Correa</surname> <given-names>C. M.</given-names></name> <name><surname>Noorman</surname> <given-names>S.</given-names></name> <name><surname>Jiang</surname> <given-names>J.</given-names></name> <name><surname>Palminteri</surname> <given-names>S.</given-names></name> <name><surname>Cohen</surname> <given-names>M. X.</given-names></name> <name><surname>Lebreton</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>How the level of reward awareness changes the computational and electrophysiological signatures of reinforcement learning</article-title>. <source>J. Neurosci</source>. <volume>38</volume>, <fpage>10338</fpage>&#x02013;<lpage>10348</lpage>. <pub-id pub-id-type="doi">10.1523/JNEUROSCI.0457-18.2018</pub-id><pub-id pub-id-type="pmid">30327418</pub-id></citation></ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cox</surname> <given-names>J.</given-names></name> <name><surname>Witten</surname> <given-names>I. B.</given-names></name></person-group> (<year>2019</year>). <article-title>Striatal circuits for reward learning and decision-making</article-title>. <source>Nat. Rev. Neurosci</source>. <volume>20</volume>, <fpage>482</fpage>&#x02013;<lpage>494</lpage>. <pub-id pub-id-type="doi">10.1038/s41583-019-0189-2</pub-id><pub-id pub-id-type="pmid">31171839</pub-id></citation></ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Daw</surname> <given-names>N. D.</given-names></name> <name><surname>Gershman</surname> <given-names>S. J.</given-names></name> <name><surname>Seymour</surname> <given-names>B.</given-names></name> <name><surname>Dayan</surname> <given-names>P.</given-names></name> <name><surname>Dolan</surname> <given-names>R. J.</given-names></name></person-group> (<year>2011</year>). <article-title>Model-based influences on humans&#x00027; choices and striatal prediction errors</article-title>. <source>Neuron</source> <volume>69</volume>, <fpage>1204</fpage>&#x02013;<lpage>1215</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuron.2011.02.027</pub-id><pub-id pub-id-type="pmid">21435563</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>De Martino</surname> <given-names>B.</given-names></name> <name><surname>Kumaran</surname> <given-names>D.</given-names></name> <name><surname>Holt</surname> <given-names>B.</given-names></name> <name><surname>Dolan</surname> <given-names>R. J.</given-names></name></person-group> (<year>2009</year>). <article-title>The neurobiology of reference-dependent value computation</article-title>. <source>J. Neurosci</source>. <volume>29</volume>, <fpage>3833</fpage>&#x02013;<lpage>3842</lpage>. <pub-id pub-id-type="doi">10.1523/JNEUROSCI.4832-08.2009</pub-id><pub-id pub-id-type="pmid">19321780</pub-id></citation></ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Donahue</surname> <given-names>C. H.</given-names></name> <name><surname>Liu</surname> <given-names>M.</given-names></name> <name><surname>Kreitzer</surname> <given-names>A.</given-names></name></person-group> (<year>2018</year>). <article-title>Distinct value encoding in striatal direct and indirect pathways during adaptive learning</article-title>. <source>bioRxiv</source> <volume>2018</volume>, <fpage>277855</fpage>. <pub-id pub-id-type="doi">10.1101/277855</pub-id></citation>
</ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Frank</surname> <given-names>M. J.</given-names></name> <name><surname>Moustafa</surname> <given-names>A. A.</given-names></name> <name><surname>Haughey</surname> <given-names>H. M.</given-names></name> <name><surname>Curran</surname> <given-names>T.</given-names></name> <name><surname>Hutchison</surname> <given-names>K. E.</given-names></name></person-group> (<year>2007</year>). <article-title>Genetic triple dissociation reveals multiple roles for dopamine in reinforcement learning</article-title>. <source>Proc. Natl. Acad. Sci. U.S.A</source>. <volume>104</volume>, <fpage>16311</fpage>&#x02013;<lpage>16316</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.0706111104</pub-id><pub-id pub-id-type="pmid">17913879</pub-id></citation></ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Frank</surname> <given-names>M. J.</given-names></name> <name><surname>Seeberger</surname> <given-names>L. C.</given-names></name> <name><surname>O&#x00027;reilly</surname> <given-names>R. C.</given-names></name></person-group> (<year>2004</year>). <article-title>By carrot or by stick: cognitive reinforcement learning in Parkinsonism</article-title>. <source>Science</source> <volume>306</volume>, <fpage>1940</fpage>&#x02013;<lpage>1943</lpage>. <pub-id pub-id-type="doi">10.1126/science.1102941</pub-id><pub-id pub-id-type="pmid">15528409</pub-id></citation></ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Garcia</surname> <given-names>N. A. S.</given-names></name> <name><surname>Palminteri</surname> <given-names>S.</given-names></name> <name><surname>Lebreton</surname> <given-names>M.</given-names></name></person-group> (<year>2021</year>). <article-title>Salemgarcia_2021</article-title>. <source>PsyArXiv [Preprint]</source>. <pub-id pub-id-type="doi">10.31234/osf.io/k7w38</pub-id></citation>
</ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ito</surname> <given-names>M.</given-names></name> <name><surname>Doya</surname> <given-names>K.</given-names></name></person-group> (<year>2009</year>). <article-title>Validation of decision-making models and analysis of decision variables in the rat basal ganglia</article-title>. <source>J. Neurosci</source>. <volume>29</volume>, <fpage>9861</fpage>&#x02013;<lpage>9874</lpage>. <pub-id pub-id-type="doi">10.1523/JNEUROSCI.6157-08.2009</pub-id><pub-id pub-id-type="pmid">19657038</pub-id></citation></ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jentzsch</surname> <given-names>I.</given-names></name> <name><surname>Dudschig</surname> <given-names>C.</given-names></name></person-group> (<year>2009</year>). <article-title>Short article: why do we slow down after an error? Mechanisms underlying the effects of posterror slowing</article-title>. <source>Q. J. Exp. Psychol</source>. <volume>62</volume>, <fpage>209</fpage>&#x02013;<lpage>218</lpage>. <pub-id pub-id-type="doi">10.1080/17470210802240655</pub-id><pub-id pub-id-type="pmid">18720281</pub-id></citation></ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Katahira</surname> <given-names>K.</given-names></name></person-group> (<year>2015</year>). <article-title>The relation between reinforcement learning parameters and the influence of reinforcement history on choice behavior</article-title>. <source>J. Math. Psychol</source>. <volume>66</volume>, <fpage>59</fpage>&#x02013;<lpage>69</lpage>. <pub-id pub-id-type="doi">10.1016/j.jmp.2015.03.006</pub-id></citation>
</ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kato</surname> <given-names>A.</given-names></name> <name><surname>Morita</surname> <given-names>K.</given-names></name></person-group> (<year>2016</year>). <article-title>Forgetting in reinforcement learning links sustained dopamine signals to motivation</article-title>. <source>PLoS Comput. Biol</source>. <volume>12</volume>, <fpage>e1005145</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pcbi.1005145</pub-id><pub-id pub-id-type="pmid">27736881</pub-id></citation></ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kishida</surname> <given-names>K. T.</given-names></name> <name><surname>Saez</surname> <given-names>I.</given-names></name> <name><surname>Lohrenz</surname> <given-names>T.</given-names></name> <name><surname>Witcher</surname> <given-names>M. R.</given-names></name> <name><surname>Laxton</surname> <given-names>A. W.</given-names></name> <name><surname>Tatter</surname> <given-names>S. B.</given-names></name> <etal/></person-group>. (<year>2016</year>). <article-title>Subsecond dopamine fluctuations in human striatum encode superposed error signals about actual and counterfactual reward</article-title>. <source>Proc. Natl. Acad. Sci. U.S.A</source>. <volume>113</volume>, <fpage>200</fpage>&#x02013;<lpage>205</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.1513619112</pub-id><pub-id pub-id-type="pmid">26598677</pub-id></citation></ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Klein</surname> <given-names>T. A.</given-names></name> <name><surname>Ullsperger</surname> <given-names>M.</given-names></name> <name><surname>Jocham</surname> <given-names>G.</given-names></name></person-group> (<year>2017</year>). <article-title>Learning relative values in the striatum induces violations of normative decision making</article-title>. <source>Nat. Commun</source>. <volume>8</volume>, <fpage>16033</fpage>. <pub-id pub-id-type="doi">10.1038/ncomms16033</pub-id><pub-id pub-id-type="pmid">31249293</pub-id></citation></ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lak</surname> <given-names>A.</given-names></name> <name><surname>Stauffer</surname> <given-names>W. R.</given-names></name> <name><surname>Schultz</surname> <given-names>W.</given-names></name></person-group> (<year>2016</year>). <article-title>Dopamine neurons learn relative chosen value from probabilistic rewards</article-title>. <source>Elife</source> <volume>5</volume>, <fpage>e18044</fpage>. <pub-id pub-id-type="doi">10.7554/eLife.18044</pub-id><pub-id pub-id-type="pmid">27787196</pub-id></citation></ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lefebvre</surname> <given-names>G.</given-names></name> <name><surname>Summerfield</surname> <given-names>C.</given-names></name> <name><surname>Bogacz</surname> <given-names>R.</given-names></name></person-group> (<year>2022</year>). <article-title>A normative account of confirmatory biases during reinforcement learning</article-title>. <source>Neural Comput</source>. <volume>34</volume>, <fpage>307</fpage>&#x02013;<lpage>337</lpage>. <pub-id pub-id-type="doi">10.1162/neco_a_01455</pub-id><pub-id pub-id-type="pmid">34758486</pub-id></citation></ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Niv</surname> <given-names>Y.</given-names></name> <name><surname>Daniel</surname> <given-names>R.</given-names></name> <name><surname>Geana</surname> <given-names>A.</given-names></name> <name><surname>Gershman</surname> <given-names>S. J.</given-names></name> <name><surname>Leong</surname> <given-names>Y. C.</given-names></name> <name><surname>Radulescu</surname> <given-names>A.</given-names></name> <etal/></person-group>. (<year>2015</year>). <article-title>Reinforcement learning in multidimensional environments relies on attention mechanisms</article-title>. <source>J. Neurosci</source>. <volume>35</volume>, <fpage>8145</fpage>&#x02013;<lpage>8157</lpage>. <pub-id pub-id-type="doi">10.1523/JNEUROSCI.2978-14.2015</pub-id><pub-id pub-id-type="pmid">26019331</pub-id></citation></ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nonomura</surname> <given-names>S.</given-names></name> <name><surname>Nishizawa</surname> <given-names>K.</given-names></name> <name><surname>Sakai</surname> <given-names>Y.</given-names></name> <name><surname>Kawaguchi</surname> <given-names>Y.</given-names></name> <name><surname>Kato</surname> <given-names>S.</given-names></name> <name><surname>Uchigashima</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>Monitoring and updating of action selection for goal-directed behavior through the striatal direct and indirect pathways</article-title>. <source>Neuron</source> <volume>99</volume>, <fpage>1302</fpage>&#x02013;<lpage>1314</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuron.2018.08.002</pub-id><pub-id pub-id-type="pmid">30146299</pub-id></citation></ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Notebaert</surname> <given-names>W.</given-names></name> <name><surname>Houtman</surname> <given-names>F.</given-names></name> <name><surname>Van Opstal</surname> <given-names>F.</given-names></name> <name><surname>Gevers</surname> <given-names>W.</given-names></name> <name><surname>Fias</surname> <given-names>W.</given-names></name> <name><surname>Verguts</surname> <given-names>T.</given-names></name></person-group> (<year>2009</year>). <article-title>Post-error slowing: an orienting account</article-title>. <source>Cognition</source> <volume>111</volume>, <fpage>275</fpage>&#x02013;<lpage>279</lpage>. <pub-id pub-id-type="doi">10.1016/j.cognition.2009.02.002</pub-id><pub-id pub-id-type="pmid">19285310</pub-id></citation></ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Palminteri</surname> <given-names>S.</given-names></name> <name><surname>Khamassi</surname> <given-names>M.</given-names></name> <name><surname>Joffily</surname> <given-names>M.</given-names></name> <name><surname>Coricelli</surname> <given-names>G.</given-names></name></person-group> (<year>2015</year>). <article-title>Contextual modulation of value signals in reward and punishment learning</article-title>. <source>Nat. Commun</source>. <volume>6</volume>, <fpage>1</fpage>&#x02013;<lpage>14</lpage>. <pub-id pub-id-type="doi">10.1038/ncomms9096</pub-id><pub-id pub-id-type="pmid">26302782</pub-id></citation></ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Palminteri</surname> <given-names>S.</given-names></name> <name><surname>Lebreton</surname> <given-names>M.</given-names></name></person-group> (<year>2021</year>). <article-title>Context-dependent outcome encoding in human reinforcement learning</article-title>. <source>Curr. Opin. Behav. Sci</source>. <volume>41</volume>, <fpage>144</fpage>&#x02013;<lpage>151</lpage>. <pub-id pub-id-type="doi">10.1016/j.cobeha.2021.06.006</pub-id></citation>
</ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Peak</surname> <given-names>J.</given-names></name> <name><surname>Hart</surname> <given-names>G.</given-names></name> <name><surname>Balleine</surname> <given-names>B. W.</given-names></name></person-group> (<year>2019</year>). <article-title>From learning to action: the integration of dorsal striatal input and output pathways in instrumental conditioning</article-title>. <source>Eur. J. Neurosci</source>. <volume>49</volume>, <fpage>658</fpage>&#x02013;<lpage>671</lpage>. <pub-id pub-id-type="doi">10.1111/ejn.13964</pub-id><pub-id pub-id-type="pmid">29791051</pub-id></citation></ref>
<ref id="B35">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rigoli</surname> <given-names>F.</given-names></name> <name><surname>Chew</surname> <given-names>B.</given-names></name> <name><surname>Dayan</surname> <given-names>P.</given-names></name> <name><surname>Dolan</surname> <given-names>R. J.</given-names></name></person-group> (<year>2018</year>). <article-title>Learning contextual reward expectations for value adaptation</article-title>. <source>J. Cogn. Neurosci</source>. <volume>30</volume>, <fpage>50</fpage>&#x02013;<lpage>69</lpage>. <pub-id pub-id-type="doi">10.1162/jocn_a_01191</pub-id><pub-id pub-id-type="pmid">28949824</pub-id></citation></ref>
<ref id="B36">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rigoli</surname> <given-names>F.</given-names></name> <name><surname>Friston</surname> <given-names>K. J.</given-names></name> <name><surname>Dolan</surname> <given-names>R. J.</given-names></name></person-group> (<year>2016a</year>). <article-title>Neural processes mediating contextual influences on human choice behaviour</article-title>. <source>Nat. Commun</source>. <volume>7</volume>, <fpage>1</fpage>&#x02013;<lpage>11</lpage>. <pub-id pub-id-type="doi">10.1038/ncomms12416</pub-id><pub-id pub-id-type="pmid">27535770</pub-id></citation></ref>
<ref id="B37">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rigoli</surname> <given-names>F.</given-names></name> <name><surname>Mathys</surname> <given-names>C.</given-names></name> <name><surname>Friston</surname> <given-names>K. J.</given-names></name> <name><surname>Dolan</surname> <given-names>R. J.</given-names></name></person-group> (<year>2017</year>). <article-title>A unifying Bayesian account of contextual effects in value-based choice</article-title>. <source>PLoS Comput. Biol</source>. <volume>13</volume>, <fpage>e1005769</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pcbi.1005769</pub-id><pub-id pub-id-type="pmid">31577793</pub-id></citation></ref>
<ref id="B38">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rigoli</surname> <given-names>F.</given-names></name> <name><surname>Rutledge</surname> <given-names>R. B.</given-names></name> <name><surname>Dayan</surname> <given-names>P.</given-names></name> <name><surname>Dolan</surname> <given-names>R. J.</given-names></name></person-group> (<year>2016b</year>). <article-title>The influence of contextual reward statistics on risk preference</article-title>. <source>NeuroImage</source> <volume>128</volume>, <fpage>74</fpage>&#x02013;<lpage>84</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuroimage.2015.12.016</pub-id><pub-id pub-id-type="pmid">26707890</pub-id></citation></ref>
<ref id="B39">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rigoux</surname> <given-names>L.</given-names></name> <name><surname>Stephan</surname> <given-names>K. E.</given-names></name> <name><surname>Friston</surname> <given-names>K. J.</given-names></name> <name><surname>Daunizeau</surname> <given-names>J.</given-names></name></person-group> (<year>2014</year>). <article-title>Bayesian model selection for group studies-revisited</article-title>. <source>Neuroimage</source> <volume>84</volume>, <fpage>971</fpage>&#x02013;<lpage>985</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuroimage.2013.08.065</pub-id><pub-id pub-id-type="pmid">24018303</pub-id></citation></ref>
<ref id="B40">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schultz</surname> <given-names>W.</given-names></name> <name><surname>Dayan</surname> <given-names>P.</given-names></name> <name><surname>Montague</surname> <given-names>P. R.</given-names></name></person-group> (<year>1997</year>). <article-title>A neural substrate of prediction and reward</article-title>. <source>Science</source> <volume>275</volume>, <fpage>1593</fpage>&#x02013;<lpage>1599</lpage>. <pub-id pub-id-type="doi">10.1126/science.275.5306.1593</pub-id><pub-id pub-id-type="pmid">9054347</pub-id></citation></ref>
<ref id="B41">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shen</surname> <given-names>W.</given-names></name> <name><surname>Flajolet</surname> <given-names>M.</given-names></name> <name><surname>Greengard</surname> <given-names>P.</given-names></name> <name><surname>Surmeier</surname> <given-names>D. J.</given-names></name></person-group> (<year>2008</year>). <article-title>Dichotomous dopaminergic control of striatal synaptic plasticity</article-title>. <source>Science</source> <volume>321</volume>, <fpage>848</fpage>&#x02013;<lpage>851</lpage>. <pub-id pub-id-type="doi">10.1126/science.1160575</pub-id><pub-id pub-id-type="pmid">18687967</pub-id></citation></ref>
<ref id="B42">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shin</surname> <given-names>J. H.</given-names></name> <name><surname>Kim</surname> <given-names>D.</given-names></name> <name><surname>Jung</surname> <given-names>M. W.</given-names></name></person-group> (<year>2018</year>). <article-title>Differential coding of reward and movement information in the dorsomedial striatal direct and indirect pathways</article-title>. <source>Nat. Commun</source>. <volume>9</volume>, <fpage>1</fpage>&#x02013;<lpage>14</lpage>. <pub-id pub-id-type="doi">10.1038/s41467-017-02817-1</pub-id><pub-id pub-id-type="pmid">29374173</pub-id></citation></ref>
<ref id="B43">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Stephan</surname> <given-names>K. E.</given-names></name> <name><surname>Penny</surname> <given-names>W. D.</given-names></name> <name><surname>Daunizeau</surname> <given-names>J.</given-names></name> <name><surname>Moran</surname> <given-names>R. J.</given-names></name> <name><surname>Friston</surname> <given-names>K. J.</given-names></name></person-group> (<year>2009</year>). <article-title>Bayesian model selection for group studies</article-title>. <source>Neuroimage</source> <volume>46</volume>, <fpage>1004</fpage>&#x02013;<lpage>1017</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuroimage.2009.03.025</pub-id><pub-id pub-id-type="pmid">19306932</pub-id></citation></ref>
<ref id="B44">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Summerfield</surname> <given-names>C.</given-names></name> <name><surname>Tsetsos</surname> <given-names>K.</given-names></name></person-group> (<year>2015</year>). <article-title>Do humans make good decisions?</article-title> <source>Trends Cogn. Sci</source>. <volume>19</volume>, <fpage>27</fpage>&#x02013;<lpage>34</lpage>. <pub-id pub-id-type="doi">10.1016/j.tics.2014.11.005</pub-id><pub-id pub-id-type="pmid">25488076</pub-id></citation></ref>
<ref id="B45">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Surmeier</surname> <given-names>D. J.</given-names></name> <name><surname>Ding</surname> <given-names>J.</given-names></name> <name><surname>Day</surname> <given-names>M.</given-names></name> <name><surname>Wang</surname> <given-names>Z.</given-names></name> <name><surname>Shen</surname> <given-names>W.</given-names></name></person-group> (<year>2007</year>). <article-title>D1 and D2 dopamine-receptor modulation of striatal glutamatergic signaling in striatal medium spiny neurons</article-title>. <source>Trends Neurosci</source>. <volume>30</volume>, <fpage>228</fpage>&#x02013;<lpage>235</lpage>. <pub-id pub-id-type="doi">10.1016/j.tins.2007.03.008</pub-id><pub-id pub-id-type="pmid">17408758</pub-id></citation></ref>
<ref id="B46">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sutton</surname> <given-names>R. S.</given-names></name> <name><surname>Barto</surname> <given-names>A. G.</given-names></name></person-group> (<year>2018</year>). <source>Reinforcement Learning: An Introduction</source>. <publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>MIT Press</publisher-name>.</citation>
</ref>
<ref id="B47">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tai</surname> <given-names>L.-H.</given-names></name> <name><surname>Lee</surname> <given-names>A. M.</given-names></name> <name><surname>Benavidez</surname> <given-names>N.</given-names></name> <name><surname>Bonci</surname> <given-names>A.</given-names></name> <name><surname>Wilbrecht</surname> <given-names>L.</given-names></name></person-group> (<year>2012</year>). <article-title>Transient stimulation of distinct subpopulations of striatal neurons mimics changes in action value</article-title>. <source>Nat. Neurosci</source>. <volume>15</volume>, <fpage>1281</fpage>. <pub-id pub-id-type="doi">10.1038/nn.3188</pub-id><pub-id pub-id-type="pmid">22902719</pub-id></citation></ref>
<ref id="B48">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ting</surname> <given-names>C.</given-names></name> <name><surname>Palminteri</surname> <given-names>S.</given-names></name> <name><surname>Lebreton</surname> <given-names>M.</given-names></name> <name><surname>Engelmann</surname> <given-names>J. B.</given-names></name></person-group> (<year>2021</year>). <article-title>The elusive effects of incidental anxiety on reinforcement-learning</article-title>. <source>J. Exp. Psychol. Learn. Mem. Cogn</source>. <pub-id pub-id-type="doi">10.1037/xlm0001033</pub-id><pub-id pub-id-type="pmid">34516205</pub-id></citation></ref>
<ref id="B49">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tsetsos</surname> <given-names>K.</given-names></name> <name><surname>Moran</surname> <given-names>R.</given-names></name> <name><surname>Moreland</surname> <given-names>J.</given-names></name> <name><surname>Chater</surname> <given-names>N.</given-names></name> <name><surname>Usher</surname> <given-names>M.</given-names></name> <name><surname>Summerfield</surname> <given-names>C.</given-names></name></person-group> (<year>2016</year>). <article-title>Economic irrationality is optimal during noisy decision making</article-title>. <source>Proc. Natl. Acad. Sci. U.S.A</source>. <volume>113</volume>, <fpage>3102</fpage>&#x02013;<lpage>3107</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.1519157113</pub-id><pub-id pub-id-type="pmid">26929353</pub-id></citation></ref>
<ref id="B50">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wilson</surname> <given-names>R. C.</given-names></name> <name><surname>Collins</surname> <given-names>A. G.</given-names></name></person-group> (<year>2019</year>). <article-title>Ten simple rules for the computational modeling of behavioral data</article-title>. <source>Elife</source> <volume>8</volume>, <fpage>e49547</fpage>. <pub-id pub-id-type="doi">10.7554/eLife.49547</pub-id><pub-id pub-id-type="pmid">31769410</pub-id></citation></ref>
</ref-list> 
</back>
</article>