<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Syst. Neurosci.</journal-id>
<journal-title>Frontiers in Systems Neuroscience</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Syst. Neurosci.</abbrev-journal-title>
<issn pub-type="epub">1662-5137</issn>
<publisher>
<publisher-name>Frontiers Research Foundation</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fnsys.2011.00022</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Neuroscience</subject>
<subj-group>
<subject>Hypothesis and Theory</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Dopaminergic Balance between Reward Maximization and Policy Complexity</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Parush</surname> <given-names>Naama</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="author-notes" rid="fn001">&#x0002A;</xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Tishby</surname> <given-names>Naftali</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
<xref ref-type="aff" rid="aff4"><sup>4</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Bergman</surname> <given-names>Hagai</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff4"><sup>4</sup></xref>
<xref ref-type="aff" rid="aff5"><sup>5</sup></xref>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>The Interdisciplinary Center for Neural Computation, The Hebrew University</institution> <country>Jerusalem, Israel</country></aff>
<aff id="aff2"><sup>2</sup><institution>IBM Haifa Research Lab</institution> <country>Haifa, Israel</country></aff>
<aff id="aff3"><sup>3</sup><institution>The School of Engineering and Computer Science, The Hebrew University</institution> <country>Jerusalem, Israel</country></aff>
<aff id="aff4"><sup>4</sup><institution>The Edmond and Lily Safra Center for Brain Sciences, The Hebrew University</institution> <country>Jerusalem, Israel</country></aff>
<aff id="aff5"><sup>5</sup><institution>Department of Medical Neurobiology (Physiology), Institute of Medical Research Israel-Canada, Hadassah Medical School, The Hebrew University</institution> <country>Jerusalem, Israel</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited By: Charles J. Wilson, University of Texas at San Antonio, USA</p></fn>
<fn fn-type="edited-by"><p>Reviewed By: Charles J. Wilson, University of Texas at San Antonio, USA; Thomas Boraud, Universit&#x000E9; de Bordeaux, France</p></fn>
<fn fn-type="corresp" id="fn001"><p>&#x0002A;Correspondence: Naama Parush, The Interdisciplinary Center for Neural Computation, The Hebrew University, Jerusalem, Israel. e-mail: <email>naama.parush&#x00040;gmail.com</email></p></fn>
</author-notes>
<pub-date pub-type="epreprint">
<day>14</day>
<month>02</month>
<year>2011</year>
</pub-date>
<pub-date pub-type="epub">
<day>09</day>
<month>05</month>
<year>2011</year>
</pub-date>
<pub-date pub-type="collection">
<year>2011</year>
</pub-date>
<volume>5</volume>
<elocation-id>22</elocation-id>
<history>
<date date-type="received">
<day>31</day>
<month>12</month>
<year>2010</year>
</date>
<date date-type="accepted">
<day>20</day>
<month>04</month>
<year>2011</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2011 Parush, Tishby and Bergman.</copyright-statement>
<copyright-year>2011</copyright-year>
<license license-type="open-access" xlink:href="http://www.frontiersin.org/licenseagreement"><p>This is an open-access article subject to a non-exclusive license between the authors and Frontiers Media SA, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and other Frontiers conditions are complied with.</p></license>
</permissions>
<abstract>
<p>Previous reinforcement-learning models of the basal ganglia network have highlighted the role of dopamine in encoding the mismatch between prediction and reality. Far less attention has been paid to the computational goals and algorithms of the main axis (actor). Here, we construct a top-down model of the basal ganglia with emphasis on the role of dopamine as both a reinforcement learning signal and a pseudo-temperature signal controlling the general level of basal ganglia excitability and the motor vigilance of the acting agent. We argue that the basal ganglia endow the thalamo-cortical networks with the optimal dynamic tradeoff between two constraints: minimizing the policy complexity (cost) and maximizing the expected future reward (gain). We show that this multi-dimensional optimization process results in an experience-modulated version of the softmax behavioral policy. Thus, as in classical softmax behavioral policies, actions are selected with probabilities that depend on their estimated values and the pseudo-temperature; in addition, however, these probabilities vary with the frequency of previous choices of these actions. We conclude that the computational goal of the basal ganglia is not to maximize cumulative (positive and negative) reward. Rather, the basal ganglia aim at optimization of independent gain and cost functions. Unlike previously suggested single-variable maximization processes, this multi-dimensional optimization process leads naturally to a softmax-like behavioral policy. We suggest that beyond its role in the modulation of the efficacy of the cortico-striatal synapses, dopamine directly affects striatal excitability and thus provides a pseudo-temperature signal that modulates the tradeoff between gain and cost. The resulting experience- and dopamine-modulated softmax policy can then serve as a theoretical framework to account for the broad range of behaviors and clinical states governed by the basal ganglia and dopamine systems.</p>
</abstract>
<kwd-group>
<kwd>basal ganglia</kwd>
<kwd>dopamine</kwd>
<kwd>softmax</kwd>
<kwd>reinforcement-learning</kwd>
</kwd-group>
<counts>
<fig-count count="5"/>
<table-count count="0"/>
<equation-count count="14"/>
<ref-count count="63"/>
<page-count count="11"/>
<word-count count="8004"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="introduction">
<title>Introduction</title>
<p>Many studies have characterized basal ganglia (BG) activity in terms of reinforcement learning (RL) algorithms (Barto, <xref ref-type="bibr" rid="B5">1995</xref>; Schultz et al., <xref ref-type="bibr" rid="B51">1997</xref>; Bar-Gad et al., <xref ref-type="bibr" rid="B4">2003b</xref>; Gurney et al., <xref ref-type="bibr" rid="B19">2004</xref>; Balleine et al., <xref ref-type="bibr" rid="B2">2007</xref>). Early physiological works revealed that phasic dopamine activity encodes the mismatch between prediction and reality, or the RL temporal difference (TD) error signal (Schultz et al., <xref ref-type="bibr" rid="B51">1997</xref>; Dayan and Balleine, <xref ref-type="bibr" rid="B14">2002</xref>; Fiorillo et al., <xref ref-type="bibr" rid="B17">2003</xref>; Satoh et al., <xref ref-type="bibr" rid="B48">2003</xref>; Morris et al., <xref ref-type="bibr" rid="B32">2004</xref>; Nakahara et al., <xref ref-type="bibr" rid="B34">2004</xref>; Bayer and Glimcher, <xref ref-type="bibr" rid="B6">2005</xref>). In accordance with these RL models of the BG network, dopamine has been shown to modulate the efficacy of cortico-striatal transmission (Reynolds et al., <xref ref-type="bibr" rid="B43">2001</xref>; Surmeier et al., <xref ref-type="bibr" rid="B56">2007</xref>; Kreitzer and Malenka, <xref ref-type="bibr" rid="B26">2008</xref>; Pan et al., <xref ref-type="bibr" rid="B39">2008</xref>; Pawlak and Kerr, <xref ref-type="bibr" rid="B42">2008</xref>; Shen et al., <xref ref-type="bibr" rid="B53">2008</xref>). However, most RL models of the BG do not explicitly discuss the issue of BG-driven behavioral policy, or the interactions between the acting agent and the environment.</p>
<p>This work adopts the RL actor/critic framework to model the BG networks. We assume that cortical activity represents the state and modulates the activity of the BG input stage &#x02013; the striatum. Cortico-striatal synaptic efficacy is adjusted by dopamine-modulated Hebbian rules (Reynolds et al., <xref ref-type="bibr" rid="B43">2001</xref>; Reynolds and Wickens, <xref ref-type="bibr" rid="B44">2002</xref>; McClure et al., <xref ref-type="bibr" rid="B29">2003</xref>; Shen et al., <xref ref-type="bibr" rid="B53">2008</xref>). Striatal activity is further shaped in the downstream BG network (e.g., in the external segment of the globus pallidus, GPe). Finally, the activity of the BG output structures (the internal segment of the globus pallidus and the substantia nigra pars reticulata; GPi and SNr, respectively) modulates activity in the brainstem motor nuclei and the thalamo-frontal cortex networks that control ongoing and future actions (Deniau and Chevalier, <xref ref-type="bibr" rid="B15">1985</xref>; Mink, <xref ref-type="bibr" rid="B31">1996</xref>; Hikosaka, <xref ref-type="bibr" rid="B21">2007</xref>). It is assumed that the mapping between BG activity and action does not change along the BG main axis (from the striatum to the BG output stages) or in the BG target structures. Therefore, the specific or distributed activity of the striatal neurons and of the neurons in the downstream BG structures represents the desired action. Moreover, the excitatory cortical input to the striatum, as dictated by the cortical activity and the efficacy of the cortico-striatal synapses, represents the Q-value of the specific state&#x02013;action pair.</p>
<p>To simplify our BG model, we modeled the BG main axis as the connections from the D2-containing projection neurons of the striatum, through the GPe, to the BG output structures. We neglected (at this stage) many other critical features of the BG networks, such as the direct pathway (direct connections between D1 dopamine receptor-containing striatal cells and the GPi/SNr), the subthalamic nucleus (STN), and the reciprocal connections of the GPe with the striatum and the STN. We further assumed that the activity of the BG output structures inhibits their target structures &#x02013; the thalamus and brainstem motor nuclei (Hikosaka and Wurtz, <xref ref-type="bibr" rid="B22">1983</xref>; Deniau and Chevalier, <xref ref-type="bibr" rid="B15">1985</xref>; Parush et al., <xref ref-type="bibr" rid="B41">2008</xref>); thus the action probability is considered to be inversely proportional to the distributed activity of the BG output.</p>
<p>Most previous RL models of the BG network assume that the computational goal of the BG is to maximize the (discounted) cumulative sum of a single variable &#x02013; the reward (pleasure) prediction error. Thus, reward omission and aversive events are considered events with negative reward values, as compared to the positive values of food/water-predicting cues and delivery. However, in many cases the cost of an action is different from a negative gain. We therefore suggest that the emotional dimensions of behavior in animals and humans must be represented by more than a single axis. In the following sections we present a behavioral policy that seeks the optimal tradeoff between maximization of cumulative expected reward and minimization of cost. Here we use policy complexity as the representative cost. We assume that agents pay a price for a more complicated behavioral policy, and therefore try to minimize the complexity of their behavioral policy. We simulate the behavior of an agent aiming at multi-dimensional optimization of its behavior while engaged in a decision task similar to the multi-armed bandit problem (Vulkan, <xref ref-type="bibr" rid="B62">2000</xref>; Morris et al., <xref ref-type="bibr" rid="B33">2006</xref>).</p>
<p>Although we used two axes (gain and cost), we obviously do not claim that there are no other, or better, axes that span the emotional space of the animal. For example, arousal, novelty, and minimization of pain could all be functions that the BG network attempts to optimize. Nevertheless, we believe that demonstrating the much richer repertoire of behavioral policies enabled by multi-dimensional optimization processes sheds light on the goals and algorithms of the BG network. Future research should enable us to determine the actual computational aims and algorithms of the BG networks.</p>
</sec>
<sec>
<title>&#x0201C;Minimal Complexity &#x02013; Maximal Reward&#x0201D; Behavioral Policy</title>
<p>When an agent is faced with the task of selecting and executing an action, it needs to perform a transformation from a state representing the present and past (internal and external) environment to an action. However, at least two competitive principles guide the agent. On the one hand, it aims to maximize the valuable outcome (cumulative future-discounted reward) of the selected action. On the other hand, the agent is interested in minimizing the cost of its action, for example to act according to a policy with minimal complexity.</p>
<p>The transition from state to action requires knowledge of the state identity. A state identity representation can be thought of as a long vector of letters describing the size, shape, color, smell, and other variables of the objects in the current environment. The longer the vector representing the state, the better our knowledge of that state. The complexity of the state representation required by a policy reflects the complexity of the policy. We therefore define policy complexity as the length of the state representation required by that policy. This length can be estimated by measuring how much of the state identity can, on average, be recovered from the chosen actions. This definition classifies policies that require detailed representations of the state as complex. On the other hand, a policy that does not commit to a specific pairing of actions and states, and therefore does not require a lengthy state representation, has low complexity. Formally, we can therefore define the state&#x02013;action mutual information &#x02013; MI(<italic>S; A</italic>) (for a brief review of the concepts of entropy and mutual information, see Box <xref ref-type="boxed-text" rid="BX1">1</xref>) &#x02013; as a measure of policy complexity (see formal details in Appendix 1).</p>
<boxed-text id="BX1">
<title>Entropy, mutual information, and uncertainty</title>
<p>The entropy function quantifies in bits the amount of &#x0201C;randomness&#x0201D; or &#x0201C;uncertainty&#x0201D; of a distribution. If |<italic>X</italic>| &#x0003D;&#x02009;<italic>n</italic>, <italic>x</italic>&#x02009;&#x02208;&#x02009;<italic>X</italic> is a variable with distribution <inline-formula><mml:math id="M1"><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>X</mml:mi></mml:mrow></mml:msub><mml:mspace/><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math></inline-formula>, then the entropy is defined by: <inline-formula><mml:math id="M2"><mml:mrow><mml:mi>H</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mo>&#x02212;</mml:mo><mml:msub><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>X</mml:mi></mml:mrow></mml:msub><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:msub><mml:mrow><mml:mi>log</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math></inline-formula> (Cover and Thomas, <xref ref-type="bibr" rid="B12">1991</xref>).</p>
<p>The entropy values range from 0 to log<sub>2</sub>(<italic>n</italic>).</p>
<p>The situation where <italic>H</italic>(<italic>X</italic>) &#x0003D;&#x02009;0 is obtained when there is no randomness associated with the variable; i.e., the identity of <italic>x</italic> is known with full certainty. For example: <italic>p</italic>(<italic>x</italic>&#x02009;&#x0003D;&#x02009;<italic>c</italic>) &#x0003D;&#x02009;1, <italic>p</italic>(<italic>x</italic>&#x02009;&#x02260;&#x02009;<italic>c</italic>) &#x0003D;&#x02009;0.</p>
<p>The situation where <italic>H</italic>(<italic>X</italic>) &#x0003D;&#x02009;log<sub>2</sub>(<italic>n</italic>) is obtained when <italic>x</italic> is totally random: <italic>p</italic>(<italic>x</italic>) &#x0003D;&#x02009;1/<italic>n</italic> for all values of <italic>x</italic>. Intermediate values correspond to intermediate levels of uncertainty.</p>
<p>Entropy can also quantify the amount of &#x0201C;uncertainty&#x0201D; when dealing with two variables.</p>
<p><italic>H</italic>(<italic>X</italic>|<italic>Y</italic>) denotes the entropy of variable <italic>x</italic>&#x02009;&#x02208;&#x02009;<italic>X</italic> given variable <italic>y</italic>&#x02009;&#x02208;&#x02009;<italic>Y</italic>; i.e., <inline-formula><mml:math id="M3"><mml:mrow><mml:mi>H</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>X</mml:mi><mml:mo>&#x0007C;</mml:mo><mml:mi>Y</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mo>&#x02212;</mml:mo><mml:msub><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>X</mml:mi></mml:mrow></mml:msub><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:msub><mml:mrow><mml:mi>log</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x0007C;</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math></inline-formula>. The entropy of a pair of variables is given by <italic>H</italic>(<italic>X</italic>, <italic>Y</italic>) &#x0003D;&#x02009;<italic>H</italic>(<italic>X</italic>) &#x0002B;&#x02009;<italic>H</italic>(<italic>Y</italic>|<italic>X</italic>).</p>
<p>The mutual information between two variables can be defined as the number of bits of &#x0201C;uncertainty&#x0201D; of one of the variables reduced by knowledge of the other variable (on average): <italic>MI</italic>(<italic>X</italic>; <italic>Y</italic>) &#x0003D;&#x02009;<italic>H</italic>(<italic>X</italic>) &#x02212;&#x02009;<italic>H</italic>(<italic>X</italic>|<italic>Y</italic>).</p>
<p>The mutual information between two variables can also be defined by the Kullback&#x02013;Leibler divergence (Dkl) between the actual probability of the pair <italic>X</italic>, <italic>Y</italic> [<italic>p</italic>(<italic>x</italic>, <italic>y</italic>)] and the expected probability if the variables were independent [<italic>p</italic>(<italic>x</italic>)&#x0002A;<italic>p</italic> (<italic>y</italic>)] (Cover and Thomas, <xref ref-type="bibr" rid="B12">1991</xref>):</p>
<disp-formula id="E1"><mml:math id="M4"><mml:mrow><mml:mi>M</mml:mi><mml:mi>I</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>X</mml:mi><mml:mo>;</mml:mo><mml:mi>Y</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mtext>Dkl</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:msub><mml:mrow><mml:mi>log</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mstyle></mml:mrow><mml:mo>.</mml:mo></mml:math></disp-formula>
</boxed-text>
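The definitions in Box 1 can be checked numerically. The following Python sketch is not part of the original article; the function names and the NumPy dependency are our own choices, and the quantities are computed directly from the Box 1 formulas:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

def mutual_information(p_xy):
    """MI(X; Y) = H(X) + H(Y) - H(X, Y) for a joint distribution p_xy,
    which equals H(X) - H(X|Y) by the chain rule given in Box 1."""
    return (entropy(p_xy.sum(axis=1)) + entropy(p_xy.sum(axis=0))
            - entropy(p_xy.ravel()))

print(entropy([0.5, 0.5]))       # a fair coin: 1.0 bit
print(entropy([0.25] * 4))       # four equiprobable outcomes: 2.0 bits

# Independent variables share no information; fully coupled ones share all of it.
print(mutual_information(np.outer([0.5, 0.5], [0.5, 0.5])))  # 0.0
print(mutual_information(np.diag([0.5, 0.5])))               # 1.0
```

The two extreme entropy values (0 for a certain outcome, log<sub>2</sub>(<italic>n</italic>) for a uniform distribution over <italic>n</italic> outcomes) follow directly from these definitions.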
<p>The following example may serve to clarify the notion of representation length and policy complexity. Assume an agent is facing one of four possible states <italic>S</italic><sub>1</sub>,<italic>S</italic><sub>2</sub>,<italic>S</italic><sub>3</sub>,<italic>S</italic><sub>4</sub> with equal probability, and using policy A, B, or C chooses one of two possible actions <italic>A</italic><sub>1</sub>,<italic>A</italic><sub>2</sub>. Policy A determines that action <italic>A</italic><sub>1</sub> is chosen for all states, policy B chooses the action randomly for all states, and policy C determines that action <italic>A</italic><sub>1</sub> is chosen for states <italic>S</italic><sub>1</sub>,<italic>S</italic><sub>2</sub>, and action <italic>A</italic><sub>2</sub> is chosen for states <italic>S</italic><sub>3</sub>,<italic>S</italic><sub>4</sub>. In policies A and B, determining the action does not require knowledge of the state (and the state cannot be extracted given the chosen action); therefore no state representation is required, and the representation length and policy complexity are 0. By contrast, in policy C, determining the action does not require full knowledge of the state but only whether the state is <italic>S</italic><sub>1</sub>,<italic>S</italic><sub>2</sub> or <italic>S</italic><sub>3</sub>,<italic>S</italic><sub>4</sub>. Therefore the required state representation only needs to differentiate between two possibilities. This can be done using a codeword of one bit (for example, 0 representing <italic>S</italic><sub>1</sub>,<italic>S</italic><sub>2</sub> and 1 representing <italic>S</italic><sub>3</sub>,<italic>S</italic><sub>4</sub>). Hence the representation length and policy complexity are 1 bit. As expected, it can be shown that for policies A and B, MI(<italic>S; A</italic>) &#x0003D;&#x02009;0, and for policy C, MI(<italic>S; A</italic>) &#x0003D;&#x02009;1.</p>
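The MI values quoted for policies A, B, and C can be verified directly. Below is a minimal Python sketch (all names are ours, not the authors'), using the Kullback&#x02013;Leibler form of the mutual information from Box 1 and the four equiprobable states of the example:

```python
import numpy as np

def mi_bits(p_joint):
    """MI(S; A) in bits from a joint state-action distribution p(s, a),
    computed as Dkl(p(s, a) || p(s) p(a)) per Box 1."""
    p_s = p_joint.sum(axis=1, keepdims=True)
    p_a = p_joint.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = p_joint * np.log2(p_joint / (p_s * p_a))
    return float(np.nansum(terms))  # zero-probability cells contribute nothing

p_s = np.full(4, 0.25)  # four equiprobable states S1..S4

# Policy A: always choose A1.  Policy B: choose either action at random.
policy_a = np.tile([1.0, 0.0], (4, 1))
policy_b = np.full((4, 2), 0.5)
# Policy C: A1 for S1, S2; A2 for S3, S4.
policy_c = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

for name, pol in [("A", policy_a), ("B", policy_b), ("C", policy_c)]:
    joint = p_s[:, None] * pol   # p(s, a) = p(s) p(a|s)
    print(name, mi_bits(joint))  # A: 0.0, B: 0.0, C: 1.0
```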
<p>The policy complexity is a measure of the policy commitment to the future action given the state (see formal details in Appendix 1). Higher MI values make it possible to classify the action (given a state) with higher resolution. In the extreme high case, the specific action is determined from the state, MI(<italic>S; A</italic>) &#x0003D;&#x02009;log<sub>2</sub>(number of possible actions), and all the entropy (uncertainty) of the action is eliminated. In the extreme low case MI(<italic>S; A</italic>) &#x0003D;&#x02009;0, and the chosen action is completely unpredictable from the state.</p>
<p>Combining the expected reward and policy complexity factors produces an optimization problem that aims at minimal commitment to state&#x02013;action mapping (maximal exploration) while maximizing the future reward. A similar optimization can be found in previous studies (Klyubin et al., <xref ref-type="bibr" rid="B25">2007</xref>; Tishby and Polani, <xref ref-type="bibr" rid="B61">2010</xref>). Below we show that the optimization problem introduces a tradeoff parameter &#x003B2; that balances the two optimization goals. Setting a high &#x003B2; value biases the optimization problem toward maximizing the future reward, while setting a low &#x003B2; value biases it toward minimizing the cost, i.e., the policy complexity.</p>
<p>We solve the minimum complexity &#x02013; maximum reward optimization problem by a generalization of the Blahut-Arimoto algorithm for rate-distortion problems (Blahut, <xref ref-type="bibr" rid="B7">1972</xref>; Cover and Thomas, <xref ref-type="bibr" rid="B12">1991</xref>; see Appendix 2 for details). This results in the following equations:</p>
<disp-formula id="E2"><label>(1)</label><mml:math id="M5"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:mi>p</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mtext>&#x0007C;</mml:mtext><mml:mi>s</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>a</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>Z</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>s</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:msup><mml:mtext>e</mml:mtext><mml:mrow><mml:mi>&#x003B2;</mml:mi><mml:mi>Q</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>p</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>a</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mi>s</mml:mi></mml:munder><mml:mrow><mml:mi>p</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mtext>&#x0007C;</mml:mtext><mml:mi>s</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mi>p</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>s</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mstyle></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>Z</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>s</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle 
displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mi>a</mml:mi></mml:munder><mml:mrow><mml:mi>p</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>a</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:msup><mml:mtext>e</mml:mtext><mml:mrow><mml:mi>&#x003B2;</mml:mi><mml:mi>Q</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:mstyle></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>p</italic>(<italic>a|s</italic>) is the probability of action <italic>a</italic> given a state <italic>s</italic> (the behavioral policy), and <italic>p</italic>(<italic>a</italic>) is the overall probability of action <italic>a</italic>, averaged over all possible states. <italic>Q</italic>(<italic>s,a</italic>) is the value of the state&#x02013;action pair, and &#x003B2; is the inverse of the pseudo-temperature parameter, i.e., the tradeoff parameter that balances the two optimization goals. Finally, <italic>Z</italic>(<italic>s</italic>) is a normalization factor (summed over all possible actions) that ensures that <italic>p</italic>(<italic>a|s</italic>) sums to 1.</p>
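Because <italic>p</italic>(<italic>a|s</italic>) in Eq. 1 depends on <italic>p</italic>(<italic>a</italic>) and vice versa, the two can be computed by alternating the updates until they are self-consistent, in the spirit of the Blahut-Arimoto iteration cited above. A minimal Python sketch with hypothetical Q-values (function and variable names are ours, not the authors'):

```python
import numpy as np

def experience_modulated_softmax(Q, p_s, beta, n_iter=200):
    """Iterate Eq. 1: alternate p(a|s) = p(a) exp(beta Q(s,a)) / Z(s)
    and p(a) = sum_s p(a|s) p(s) until the pair is self-consistent."""
    n_states, n_actions = Q.shape
    p_a = np.full(n_actions, 1.0 / n_actions)           # uniform initial prior
    for _ in range(n_iter):
        w = p_a[None, :] * np.exp(beta * Q)             # unnormalized p(a|s)
        p_a_given_s = w / w.sum(axis=1, keepdims=True)  # divide by Z(s)
        p_a = p_s @ p_a_given_s                         # marginal over states
    return p_a_given_s, p_a

# Hypothetical 2-state, 2-action Q-values; states equiprobable.
Q = np.array([[1.0, 0.0], [0.0, 1.0]])
p_s = np.array([0.5, 0.5])

# High beta: near-deterministic, reward-maximizing policy.
pol_hi, _ = experience_modulated_softmax(Q, p_s, beta=10.0)
# Low beta: near-uniform policy, i.e., low complexity MI(S; A).
pol_lo, _ = experience_modulated_softmax(Q, p_s, beta=0.1)
print(np.round(pol_hi, 3))
print(np.round(pol_lo, 3))
```

Note that with a uniform prior <italic>p</italic>(<italic>a</italic>), one pass of the first update is exactly the classical softmax policy; the prior only departs from uniform, and biases the policy toward frequently chosen actions, when the Q-values favor some actions across states.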
<p>In the RL framework the state&#x02013;action Q-value is updated as a function of the discrepancy between the predicted Q-value and the actual outcome. Thus, when choosing the next step, the behavioral policy influences which of the state&#x02013;action pairs is updated. In the more general case of an agent interacting with a stochastic environment, the behavioral policy changes the state&#x02013;action Q-value (expected reward of a state&#x02013;action pair), which in turn may change the policy. Thus, another equation concerning the expected reward (Q(<italic>s,a</italic>) values) should be associated with the previous equations (convergence of value and policy iterations, Sutton and Barto, <xref ref-type="bibr" rid="B57">1998</xref>). However, in our simplified BG model, the policy and Q-value are not changed simultaneously, since the Q-value is modified by the cortico-striatal synaptic plasticity, whereas the policy is modified by the level of dopamine. These two specific modifications may occur through different molecular mechanisms, e.g., D1 activation that affects synaptic plasticity and D2 activation that affects postsynaptic excitability (Kerr and Wickens, <xref ref-type="bibr" rid="B24">2001</xref>; Pawlak and Kerr, <xref ref-type="bibr" rid="B42">2008</xref>; but see Shen et al., <xref ref-type="bibr" rid="B53">2008</xref>), and at different timescales (Schultz, <xref ref-type="bibr" rid="B50">1998</xref>; Goto et al., <xref ref-type="bibr" rid="B18">2007</xref>). At this stage of our model, we therefore do not require simultaneous convergence of the expected reward values with the policy.</p>
<p>The behavioral policy <inline-formula><mml:math id="M6"><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mi>Z</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac><mml:msup><mml:mtext>e</mml:mtext><mml:mrow><mml:mo>&#x003B2;</mml:mo><mml:mtext>Q</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> that optimizes the reward/complexity tradeoff resembles the classical RL softmax distribution, in which the probability of choosing an action depends exponentially on the action&#x00027;s expected reward and on &#x003B2; &#x02013; the inverse of the pseudo-temperature parameter (Sutton and Barto, <xref ref-type="bibr" rid="B57">1998</xref>). Here, the probability of choosing an action given a specific state, <italic>p</italic>(<italic>a|s</italic>), depends exponentially on the state&#x02013;action Q-value multiplied by the prior probability of choosing the specific action independently of the state &#x02013; <italic>p</italic>(<italic>a</italic>). This prior probability gives an advantage to actions that are chosen more often, and for this reason we dub it here the &#x0201C;experience-modulated softmax policy.&#x0201D; This is in line with perseverative behavior, where selected actions are influenced by the pattern of the agent&#x00027;s past choices (Slovin et al., <xref ref-type="bibr" rid="B54">1999</xref>; Lau and Glimcher, <xref ref-type="bibr" rid="B27">2005</xref>; Rutledge et al., <xref ref-type="bibr" rid="B47">2009</xref>). 
In cases where the a priori probability of all actions is equal, the experience-modulated softmax policy is equivalent to the classical softmax policy. Finally, in single-state scenarios (i.e., an agent is facing only one state, but still has more than one possible action), where <italic>p</italic>(<italic>a|s</italic>) &#x0003D;&#x02009;<italic>p</italic>(<italic>a</italic>), the policy maximizes the expected reward without minimizing the state&#x02013;action MI. Therefore, <italic>p</italic>(<italic>a</italic>)&#x02009;&#x0003D;<italic>&#x02009;</italic>1 for the action with the highest Q-value.</p>
</sec>
<sec>
<title>The Dual Role of Dopamine in the Model</title>
<p>Many studies have indicated that dopamine influences BG firing rate properties directly and not only by modulating cortico-striatal synaptic plasticity. Apomorphine (an ultrafast-acting D2 dopamine agonist) has an immediate (&#x0003C;1&#x02009;min) effect on Parkinsonian patients and on the discharge rate of BG neurons (Stefani et al., <xref ref-type="bibr" rid="B55">1997</xref>; Levy et al., <xref ref-type="bibr" rid="B28">2001</xref>; Nevet et al., <xref ref-type="bibr" rid="B35">2004</xref>). There is no consensus regarding the effect of dopamine on the excitability of striatal neurons (Nicola et al., <xref ref-type="bibr" rid="B36">2000</xref>; Onn et al., <xref ref-type="bibr" rid="B38">2000</xref>; Day et al., <xref ref-type="bibr" rid="B13">2008</xref>), probably since the <italic>in vivo</italic> effect of dopamine on striatal excitability is confounded by the many closed loops inside the striatum (Tepper et al., <xref ref-type="bibr" rid="B59">2008</xref>), and the reciprocal connections with the GPe and the STN. Nevertheless, most researchers concur that high tonic dopamine levels decrease the discharge rate of BG output structures, whereas low levels of tonic dopamine increase the activity of BG output (in rodents: Ruskin et al., <xref ref-type="bibr" rid="B45">1998</xref>; in primates: Filion et al., <xref ref-type="bibr" rid="B16">1991</xref>; Boraud et al., <xref ref-type="bibr" rid="B10">1998</xref>, <xref ref-type="bibr" rid="B9">2001</xref>; Papa et al., <xref ref-type="bibr" rid="B40">1999</xref>; Heimer et al., <xref ref-type="bibr" rid="B20">2002</xref>; Nevet et al., <xref ref-type="bibr" rid="B35">2004</xref>; and in human patients: Merello et al., <xref ref-type="bibr" rid="B30">1999</xref>; Levy et al., <xref ref-type="bibr" rid="B28">2001</xref>). These findings strongly indicate that tonic dopamine plays a significant role in shaping behavioral policy beyond a modulation of the efficacy of the cortico-striatal synapses. 
We suggest that dopamine serves as the inverse of &#x003B2;; i.e., as the pseudo-temperature, or the tradeoff parameter between policy complexity and expected reward (Eq. <xref ref-type="disp-formula" rid="E2">1</xref>).</p>
<p>In our model, dopamine thus plays a dual role in the striatum. First, dopamine updates the Q-values by modulating the efficacy of the cortico-striatal connections; second, it sets &#x003B2; (the inverse of the pseudo-temperature). However, since changes in excitability are faster than modulation of synaptic plasticity, dopamine acts on different timescales, and the effects of a lack or excess of dopamine may appear more rapidly as changes in the softmax pseudo-temperature parameter of the behavioral policy than as changes in the Q-values.</p>
<p>The following description provides a possible characterization of the influence of dopamine on the computational physiology of the BG. The baseline activity of the striatal neurons, and by extension of the BG output neurons that represent all actions, is modulated by the tonic level of striatal dopamine. In addition, striatal neural activity is modulated by the specific state&#x02013;action value (Q-value), which in turn determines the activity of the BG output neurons encoding a specific probability for each action. High dopamine levels decrease the dynamic range of the Q-value&#x00027;s influence (the baseline activity of the striatal neurons decreases, and consequently the dynamic range of the additional decrease in their discharge is reduced). Therefore, different Q-values will result in similar BG output activity, and consequently the action probabilities will be more uniform. On the other hand, low dopamine levels result in a large dynamic range of striatal discharge, producing a probability distribution that is more closely tied to the cortical Q-values, favoring higher values. At moderate (normal) dopamine levels the probability distribution of future actions depends on the Q-values.</p>
<p>This behavior is also captured in the specifics of our model. A high level of dopamine is equivalent to low &#x003B2; values (a high pseudo-temperature), yielding a low state&#x02013;action MI. This policy resembles gambling, where the probability of choosing an action does not depend on the state and therefore is not correlated with the outcome prospects. Lowering the level of dopamine (increasing &#x003B2;) increases the MI. In this case, the action probability is specifically related to the state&#x02013;action Q-value, favoring higher reward prospects. In the extreme and most conservative case, the policy deterministically chooses the action with the highest reward prospect (greedy behavior).</p>
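The three regimes can be illustrated with a toy calculation. The single-state value table and the three β values below are hypothetical, chosen only to display the random, graded, and greedy limits:

```python
import numpy as np

def softmax_policy(Q, beta):
    # Softmax over actions for each state row, with a numerically stable exponent.
    w = np.exp(beta * (Q - Q.max(axis=1, keepdims=True)))
    return w / w.sum(axis=1, keepdims=True)

Q = np.array([[0.75, 0.25]])   # one state, two actions (illustrative values)

for beta, regime in [(0.1, "high dopamine / low beta : ~random"),
                     (4.0, "moderate dopamine        : graded"),
                     (50.0, "low dopamine / high beta : ~greedy")]:
    print(regime, softmax_policy(Q, beta)[0].round(3))
```

As β grows the choice distribution moves from near-uniform (gambling) through graded preference to a near-deterministic (greedy) choice of the higher-valued action.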
</sec>
<sec>
<title>Simulating a Probabilistic Two-Choice Task</title>
<p>We simulated the behavior of the experience modulated softmax model in a probabilistic two-choice task similar to one used previously in our group (Morris et al., <xref ref-type="bibr" rid="B33">2006</xref>). We only simulated the portion of the task in which there are multiple states in which the subject is expected to choose one of two actions (either move left or right). Intermingled with the trials on the binary decision task are forced choice trials (not discussed here). The different states are characterized by their different (action dependent) reward prospects. Actions can lead to a reward with one of the following probabilities: 25, 50, 75, or 100%. The task states consist of all combinations of the different reward probabilities. The states are distributed uniformly (i.e., all 16 states have equal probability). Note that since both sides are symmetrically balanced between high and low probabilities, there should be no prior preference for either of the actions (the trials on the forced choice task are also symmetrically balanced). Therefore, there is equal probability of choosing either of the sides (<italic>p</italic>(left)&#x0003D;<italic>p</italic>(right)&#x0003D;0.5), and the experience-modulated softmax behaves like the regular softmax policy.</p>
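The state space of the simulated task can be reconstructed in a few lines. This is a schematic sketch based on the description above, not the original simulation code, and the β value is an assumed moderate level:

```python
import itertools
import numpy as np

reward_probs = [0.25, 0.50, 0.75, 1.00]
# All 16 (left, right) combinations of reward probabilities, uniformly likely.
states = list(itertools.product(reward_probs, reward_probs))
Q = np.array(states)                       # expected reward of each action per state
p_s = np.full(len(states), 1.0 / len(states))

beta = 3.0                                 # assumed moderate level
w = np.exp(beta * Q)
p_a_given_s = w / w.sum(axis=1, keepdims=True)

# The symmetric state set yields no overall side preference:
p_a = p_s @ p_a_given_s
print(p_a.round(3))                        # -> [0.5 0.5]
```

Because every (left, right) pair has a mirror-image pair of equal probability, the action marginal is exactly (0.5, 0.5), which is why the experience-modulated softmax coincides with the regular softmax here.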
<p>Figures <xref ref-type="fig" rid="F1">1</xref>&#x02013;<xref ref-type="fig" rid="F4">4</xref> illustrate the simulation results. Figure <xref ref-type="fig" rid="F1">1</xref> illustrates the expected reward as a function of the state&#x02013;action MI for different dopamine levels. Since in our model dopamine acts as 1/&#x003B2;, decreasing the dopamine level causes both the state&#x02013;action MI (complexity of the policy, cost) and the average expected reward (gain) to increase until they reach a plateau. On the other hand, increasing the dopamine level leads to conditions with close to 0 complexity and reward (&#x0201C;no pain, no gain&#x0201D; state).</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p><bold>Expected reward as a function of the complexity of the behavioral policy for different &#x003B2;/dopamine levels</bold>. High dopamine (Da) or low &#x003B2; (the inverse of the softmax pseudo-temperature variable) leads to a simple policy, e.g., a random policy with low state&#x02013;action mutual information (MI(<italic>s,a</italic>)) and low expected reward &#x0003C;Q&#x0003E;. On the other hand, a low dopamine level leads to a complex (deterministic) policy with high MI and high expected reward. In general, both the state&#x02013;action MI and the expected reward &#x0003C;Q&#x0003E; increase with &#x003B2; (though when &#x003B2; is close to 0, the increase in MI is very slow).</p></caption>
<graphic xlink:href="fnsys-05-00022-g001.tif"/>
</fig>
<p>Figure <xref ref-type="fig" rid="F2">2</xref> illustrates, for different dopamine levels, the probability of choosing an action as a function of the expected reward relative to the total sum of expected rewards. At low dopamine levels the expected reward is maximized, and therefore the action with a higher expected reward is always chosen (greedy behavioral policy). At moderate dopamine levels (i.e., simulating normal dopamine conditions) the probability of choosing an action is proportional to its relative expected reward. This is very similar to the results seen in (Morris et al., <xref ref-type="bibr" rid="B33">2006</xref>), and in line with a probability matching action selection policy (Vulkan, <xref ref-type="bibr" rid="B62">2000</xref>) where the probability of choosing an action is proportional to the action&#x00027;s relative expected reward. High dopamine levels yield a random policy, where the probability of choosing an action is not dependent on its expected reward.</p>
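For two actions, the softmax choice probability tracks the matching ratio Q1/(Q1&#x0002B;Q2) only approximately, and only for a suitable moderate &#x003B2;. The sketch below uses an illustrative &#x003B2; (an assumption chosen for display, not a value fitted to data) to show how close the two quantities can be:

```python
import math

beta = 2.2   # illustrative "moderate dopamine" value (assumed, not fitted)

def p_choose_1(q1, q2):
    # Two-action softmax choice probability for action 1.
    return math.exp(beta * q1) / (math.exp(beta * q1) + math.exp(beta * q2))

for q1, q2 in [(0.75, 0.25), (0.50, 0.50), (1.00, 0.25)]:
    print(f"matching ratio {q1 / (q1 + q2):.2f}  softmax p {p_choose_1(q1, q2):.2f}")
```

The agreement is approximate rather than exact: matching predicts a probability linear in the relative expected reward, while the softmax is a logistic function of the value difference; at moderate &#x003B2; the two stay close over the range of values used in the task.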
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p><bold>Behavioral policies at different &#x003B2;/dopamine levels</bold>. Probability of choosing Q1 as a function of the ratio between Q1 and (Q1&#x0002B;Q2): high dopamine (low &#x003B2;) &#x02013; random policy, not dependent on the Q-values; normal (moderate dopamine and &#x003B2;) &#x02013; policy dependent on the Q-values (preferring higher values); low dopamine (high &#x003B2;) &#x02013; deterministic (greedy) policy choosing the higher Q-value. The dots represent values calculated in the simulation, and the lines are linear curve fits of these points.</p></caption>
<graphic xlink:href="fnsys-05-00022-g002.tif"/>
</fig>
<p>A unique feature of the multi-dimensional optimization policy (Eq. <xref ref-type="disp-formula" rid="E2">1</xref>) is the effect of an <italic>a priori</italic> probability for action (modulation by experience). Figure <xref ref-type="fig" rid="F3">3</xref> illustrates the behavioral policy of an agent with moderate dopamine levels in two scenarios. In the first scenario the agent faces a uniform state distribution, whereas in the second the agent faces an asymmetrical distribution in which states with a higher reward probability on the left side appear twice as often as states with a higher reward probability on the right. In the latter distribution there is a clear preference for the left side (e.g., the policy prefers the left over the right side in states with equal reward probability on both sides). Thus, the history, or the prior probability of performing an action, influences the action selection policy. Figure <xref ref-type="fig" rid="F4">4</xref> illustrates the expected reward as a function of the state&#x02013;action MI for both the experience-modulated softmax and the regular softmax policies. As expected, since the experience-modulated softmax policy is driven by minimizing the state&#x02013;action MI while maximizing the reward, it results in a higher expected reward for a given complexity value.</p>
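The prior-dependent policy can be computed by alternating between the policy and its action marginal, a Blahut&#x02013;Arimoto-style iteration; this is our reading of the optimization, and the miniature state set and &#x003B2; below are illustrative assumptions:

```python
import numpy as np

def em_softmax(Q, p_s, beta, n_iter=200):
    """Self-consistent experience-modulated softmax:
    p(a|s) proportional to p(a) * exp(beta * Q(s,a)),
    with the prior p(a) = sum_s p(s) p(a|s)."""
    p_a = np.full(Q.shape[1], 1.0 / Q.shape[1])     # start from a uniform prior
    for _ in range(n_iter):
        w = p_a * np.exp(beta * Q)
        policy = w / w.sum(axis=1, keepdims=True)   # normalize per state
        p_a = p_s @ policy                          # update the action marginal
    return policy, p_a

# Three states: left-better, right-better, and equal-reward; the
# left-better state appears twice as often as the right-better one.
Q = np.array([[0.75, 0.25],
              [0.25, 0.75],
              [0.50, 0.50]])
p_s = np.array([0.4, 0.2, 0.4])
policy, p_a = em_softmax(Q, p_s, beta=2.0)
# In the equal-reward state the converged policy prefers the left action.
```

In the equal-reward state the exponential weights are identical, so the choice there is governed entirely by the experience-derived prior p(a), reproducing the left-side preference described in the text.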
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p><bold>Color coded illustration of behavioral policies for different state distributions</bold>. In the left column the agent is facing a uniform state distribution, whereas in the right column the agent is facing an asymmetrical distribution in which states with the higher reward probability on the left side appear twice as often as states with a higher reward probability on the right. In the right column there is a clear preference for the left side (e.g., the policy prefers the left over the right side in states with equal reward probability on both sides).</p></caption>
<graphic xlink:href="fnsys-05-00022-g003.tif"/>
</fig>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p><bold>Random learning: Illustration of the average probability of choosing Q1 as a function of the ratio between Q1 and (Q1&#x0002B;Q2) after 0 (normal), 30, 50, and 100 iterations of random learning</bold>. Random learning will eventually lead to a random policy (policy after 100 iterations, red curve).</p></caption>
<graphic xlink:href="fnsys-05-00022-g004.tif"/>
</fig>
</sec>
<sec>
<title>Modeling Dopamine Related Movement Disorders</title>
<p>Our simulations depict a maximization (greedy) action selection policy for low dopamine levels. However, in practice, an extreme lack of dopamine causes Parkinsonian patients to exhibit akinesia &#x02013; a lack of movement. Severe akinesia cannot be explained mathematically by our model. The normalization of the softmax equation ensures that the sum of <italic>p</italic>(<italic>a|s</italic>) over all <italic>a</italic> is 1, and for this reason there cannot be a condition where all <italic>p</italic>(<italic>a|s</italic>), for all <italic>a</italic> and all <italic>s</italic>, are close to 0. We suggest that in these extreme cases the BG neural network does not unequivocally implement the experience-modulated softmax algorithm. Since the activity of the BG output structures inhibits their target structures, and a lack of dopamine increases the BG output activity, extremely low dopamine levels can result in complete inhibition and therefore total blockage of activity, i.e., akinesia. In these cases an extraordinarily high Q-value may momentarily overcome the inhibition and cause paradoxical kinesia (Keefe et al., <xref ref-type="bibr" rid="B23">1989</xref>; Schlesinger et al., <xref ref-type="bibr" rid="B49">2007</xref>; Bonanni et al., <xref ref-type="bibr" rid="B8">2010</xref>).</p>
<p>Another dopamine related movement disorder is levo-3,4-dihydroxyphenylalanine (l-DOPA) induced dyskinesia. Dopamine replacement therapy (DRT) by either l-DOPA or dopamine agonists is the most effective pharmacological treatment for Parkinson&#x00027;s disease. However, almost all patients treated with long-term DRT develop dyskinesia &#x02013; severely disabling involuntary movements. Once these involuntary movements have been established, they will occur on every administration of DRT. Our model provides two possible computational explanations for l-DOPA induced dyskinesia. First, the high levels of dopamine force the system to act according to a random or gambling policy. The second possible cause of dyskinesia is related to the classical role of dopamine in modulating synaptic plasticity and reshaping the cortico-striatal connectivity (Surmeier et al., <xref ref-type="bibr" rid="B56">2007</xref>; Kreitzer and Malenka, <xref ref-type="bibr" rid="B26">2008</xref>; Russo et al., <xref ref-type="bibr" rid="B46">2010</xref>). Thus, high (but inappropriate) dopamine levels randomly reinforce state&#x02013;action pairs. We define this type of random reinforcement as random learning. Figure <xref ref-type="fig" rid="F5">5</xref> illustrates the average action policy caused by random learning over time. Thus, dyskinesia may be avoided by dopaminergic treatments that do not modulate the cortico-striatal synaptic efficacy (less D1 activation) while maintaining all other D2 therapeutic benefits.</p>
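The effect of random learning can be sketched with a toy simulation (our illustrative reconstruction; the learning rate, &#x003B2;, and initial values are assumptions): Q-values that start out informative, but are repeatedly updated toward rewards drawn independently of the true contingencies, drive the average choice probability toward chance.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, lr = 4.0, 0.2                  # assumed "moderate dopamine" and learning rate

def p_choose_0(q):
    # Two-action softmax choice probability for action 0.
    w = np.exp(beta * q)
    return w[0] / w.sum()

# "Random learning": values are reinforced toward rewards drawn
# independently of the true contingencies.
final = []
for run in range(500):
    q = np.array([0.75, 0.25])       # initially informative values
    for _ in range(100):
        q += lr * (rng.random(2) - q)
    final.append(p_choose_0(q))

print(p_choose_0(np.array([0.75, 0.25])))  # before: strong preference
print(np.mean(final))                      # after: on average a random policy (~0.5)
```

Averaged over runs, the initial asymmetry decays geometrically and the policy carries no information about the original action values, mirroring the flattened curves of the random-learning figure.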
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p><bold>Expected reward (&#x0003C;Q&#x0003E;) as a function of the complexity of the behavioral policy (MI(<italic>S,A</italic>)) for both the experience-modulated softmax and the regular softmax policies in the case of asymmetrical distribution of states</bold>. The agent is facing an asymmetrical distribution in which states with the higher reward probability on the left side appear twice as often as states with a higher reward probability on the right. In this scenario, for given complexity values, the experience-modulated softmax policy yields a higher expected reward value.</p></caption>
<graphic xlink:href="fnsys-05-00022-g005.tif"/>
</fig>
</sec>
<sec sec-type="discussion">
<title>Discussion</title>
<p>In contrast to previous BG models that have concentrated on either explaining pathological behavior (e.g., Albin et al., <xref ref-type="bibr" rid="B1">1989</xref>) or on learning paradigms and action selection (e.g., Schultz et al., <xref ref-type="bibr" rid="B51">1997</xref>; Cohen and Frank, <xref ref-type="bibr" rid="B11">2009</xref>; Wiecki and Frank, <xref ref-type="bibr" rid="B63">2010</xref>), here we attempt to integrate both the phasic and tonic effects of dopamine to account for both normal and pathological behaviors in the same model. We presented a BG related top-down model in which the tonic dopamine level balances maximizing the expected reward and reducing the policy complexity. Our agent aims to maximize the expected reward while minimizing the complexity of the state description, i.e., by preserving the minimal information for reward maximization. This approach is also related to the information bottleneck method (Tishby et al., <xref ref-type="bibr" rid="B60">1999</xref>), where dimensionality reduction aims to reduce the MI between the input and output layers while maximizing the MI between the output layer and a third variable. Hence, the transition from input to output preserves only relevant information. In the current model, the dimensionality reduction from state to action, or at the network level from cortex to BG (Bar-Gad et al., <xref ref-type="bibr" rid="B4">2003b</xref>), preserves relevant information on reward prospects. This dimensionality reduction can also account for the de-correlation issues associated with the BG pathway (Bar-Gad et al., <xref ref-type="bibr" rid="B3">2003a</xref>,<xref ref-type="bibr" rid="B4">b</xref>). In addition, the complexity of the representation of the states can be considered as the &#x0201C;cost&#x0201D; of the internal representation of these states. Hence the model solves a minimum-cost vs. maximum-reward variational problem. 
This is the first BG model to show that a softmax-like policy is not arbitrarily selected, but rather is the outcome of the optimization problem solved by the BG.</p>
<p>Like the softmax policy (Sutton and Barto, <xref ref-type="bibr" rid="B57">1998</xref>), our model&#x00027;s experience-modulated softmax policy depends exponentially on the expected reward. However, in this history-modulated distribution the probability of an action <italic>a</italic>, given a state <italic>s</italic>, also depends on the prior action probability. In cases where the prior probability is uniformly distributed over the different actions, the experience-modulated softmax policy behaves like the regular softmax. Therefore our model can account for the softmax and probability matching action selection policies seen in previous studies (Vulkan, <xref ref-type="bibr" rid="B62">2000</xref>; Morris et al., <xref ref-type="bibr" rid="B33">2006</xref>). Furthermore, it would be interesting to test these predictions by replicating these or similar experiments while manipulating the prior action statistics (for example, as in Figure <xref ref-type="fig" rid="F3">3</xref>).</p>
<p>Changing the dopamine level from low to high shifts the action policy from a conservative (greedy) policy that chooses the highest outcome to a policy that probabilistically chooses an action according to its outcome (probability matching). Eventually, at a very high level of dopamine, the policy turns into a random (gambling) policy where the probability of choosing an action is independent of its outcome. This shift in behavioral policy can result from normal or pathological transitions. High dopamine levels can be associated with situations that involve excitement or where the outcome provides high motivation (Satoh et al., <xref ref-type="bibr" rid="B48">2003</xref>; Niv et al., <xref ref-type="bibr" rid="B37">2006</xref>). A pathological lack or excess of dopamine also changes the policy, as seen in the akinetic and dyskinetic states typical of Parkinson&#x00027;s disease. We suggest that blocking the treatment effects that lead to random learning, while preserving the pseudo-temperature effects of the treatment, may ameliorate akinesia while avoiding l-DOPA induced dyskinesia.</p>
<p>To conclude, the experience-modulated softmax model provides a new conceptual framework that casts dopamine in the role of setting the action policy on a scale from risky to conservative and from normal to pathological behaviors. This model introduces additional dimensions to the problem of optimal behavioral policy: the organism aims not only at reward maximization but also at other objectives. This pattern has been observed in many experiments where behavior is not in line with merely maximizing task return (Talmi et al., <xref ref-type="bibr" rid="B58">2009</xref>). In the future, other objectives can be added to the model, as well as other balancing substances. These additional dimensions will introduce richer behavior to the BG model that will more closely resemble real-life decisions and perhaps account for other pathological cases as well.</p>
</sec>
<sec>
<title>Conflict of Interest Statement</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
</body>
<back>
<app-group>
<app id="A1">
<title>Appendix</title>
<sec>
<title>Formal Quantification of Policy Complexity</title>
<p>In this paper policy complexity is defined as the length of the state representation required by the policy; i.e., the length of the representation of the state identity that can be extracted given the chosen action.</p>
<p>A state representation is a codeword that encodes the state, and the representation length is the codeword length. The term &#x0201C;length&#x0201D; refers to the number of letters in the codeword that can uniquely represent the state (distinguish it from all other possible states). Since the codeword should be decoded in a unique way, its length is bounded from below by the minimal uniquely decodable encoding of the state identity that can be extracted from the chosen action. In order to quantify the minimal length we turn to the Kraft&#x02013;McMillan inequality: source symbols (<italic>x</italic>) from an alphabet of size <italic>d</italic> can be encoded into a uniquely decodable code if the codeword lengths <italic>l</italic>(<italic>x</italic>) satisfy &#x003A3;<sub>{<italic>x</italic>}</sub><italic>d</italic><sup>&#x02212;<italic>l</italic>(<italic>x</italic>)</sup>&#x02009;&#x02264;&#x02009;1 (Cover and Thomas, <xref ref-type="bibr" rid="B12">1991</xref>).</p>
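As a quick numeric check of the Kraft&#x02013;McMillan inequality, consider a small hand-made prefix-free binary code (d&#x02009;&#x0003D;&#x02009;2; the code itself is just an example):

```python
# A prefix-free (hence uniquely decodable) binary code over 4 source symbols.
code = {"s1": "0", "s2": "10", "s3": "110", "s4": "111"}
d = 2
kraft_sum = sum(d ** -len(cw) for cw in code.values())
print(kraft_sum)   # 1/2 + 1/4 + 1/8 + 1/8 = 1.0, satisfying the inequality
```

A code whose Kraft sum exceeds 1 could not be uniquely decodable; a sum strictly below 1 leaves unused codeword "capacity."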
<p>We denote the average codeword by <italic>L</italic>(<italic>C</italic>) &#x0003D;&#x02009;&#x003A3;<sub>{<italic>x</italic>}</sub><italic>p</italic>(<italic>x</italic>)<italic>l</italic>(<italic>x</italic>), where <italic>p</italic>(<italic>x</italic>) is the probability of source word <italic>x</italic>, and the entropy of the source is H<italic><sub>d</sub></italic>(<italic>X</italic>) &#x0003D;&#x02009;&#x02212;&#x003A3;<sub>{<italic>x</italic>}</sub><italic>p</italic>(<italic>x</italic>)log<italic><sub>d</sub></italic>(<italic>p</italic>(<italic>x</italic>)).</p>
<disp-formula id="E3"><mml:math id="M7"><mml:mtable><mml:mtr><mml:mtd><mml:mi>L</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>C</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02212;</mml:mo><mml:msub><mml:mi>H</mml:mi><mml:mi>d</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mo>&#x0007B;</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x0007D;</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>l</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mstyle><mml:mo>+</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mo>&#x0007B;</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x0007D;</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:msub><mml:mrow><mml:mi>log</mml:mi></mml:mrow><mml:mi>d</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mstyle></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mo>&#x0007B;</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x0007D;</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:msub><mml:mrow><mml:mi>log</mml:mi></mml:mrow><mml:mi>d</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>d</mml:mi><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mi>l</mml:mi><mml:mo 
stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mstyle><mml:mo>+</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mo>&#x0007B;</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x0007D;</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:msub><mml:mrow><mml:mi>log</mml:mi></mml:mrow><mml:mi>d</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mstyle></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mo>&#x0007B;</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x0007D;</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:msub><mml:mrow><mml:mi>log</mml:mi></mml:mrow><mml:mi>d</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:msup><mml:mi>d</mml:mi><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mi>l</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mfrac></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mstyle></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mo>&#x0007B;</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x0007D;</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo 
stretchy='false'>)</mml:mo><mml:msub><mml:mrow><mml:mi>log</mml:mi></mml:mrow><mml:mi>d</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>/</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mo>&#x0007B;</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo>&#x0007D;</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:msup><mml:mi>d</mml:mi><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mi>l</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mstyle></mml:mrow><mml:mrow><mml:msup><mml:mi>d</mml:mi><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mi>l</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>/</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mo>&#x0007B;</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo>&#x0007D;</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:msup><mml:mi>d</mml:mi><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mi>l</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mstyle></mml:mrow></mml:mfrac></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mstyle></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mo>&#x0007B;</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x0007D;</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo 
stretchy='false'>)</mml:mo><mml:msub><mml:mrow><mml:mi>log</mml:mi></mml:mrow><mml:mi>d</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:msup><mml:mi>d</mml:mi><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mi>l</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>/</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mo>&#x0007B;</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo>&#x0007D;</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:msup><mml:mi>d</mml:mi><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mi>l</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mstyle></mml:mrow></mml:mfrac></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mstyle></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x02009;&#x02009;</mml:mtext><mml:mo>&#x02212;</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mo>&#x0007B;</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x0007D;</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:msub><mml:mrow><mml:mi>log</mml:mi><mml:mo>&#x02061;</mml:mo></mml:mrow><mml:mi>d</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mo>&#x0007B;</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo>&#x0007D;</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:msup><mml:mi>d</mml:mi><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mi>l</mml:mi><mml:mo 
stretchy='false'>(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mstyle></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mstyle></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mo>&#x0007B;</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x0007D;</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:msub><mml:mrow><mml:mi>log</mml:mi></mml:mrow><mml:mi>d</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:msup><mml:mi>d</mml:mi><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mi>l</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>/</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mo>&#x0007B;</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo>&#x0007D;</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:msup><mml:mi>d</mml:mi><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mi>l</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mstyle></mml:mrow></mml:mfrac></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mstyle><mml:mo>&#x02212;</mml:mo><mml:msub><mml:mi>log</mml:mi><mml:mi>d</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mstyle 
displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mo>&#x0007B;</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo>&#x0007D;</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:msup><mml:mi>d</mml:mi><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mi>l</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mstyle></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Let&#x00027;s denote: <inline-formula><mml:math id="M8"><mml:mrow><mml:mi>c</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mo>&#x0007B;</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo>&#x0007D;</mml:mo></mml:mrow></mml:msub><mml:msup><mml:mi>d</mml:mi><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mi>l</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mtext>&#x02009;&#x02009;&#x02009;</mml:mtext><mml:mi>q</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mi>d</mml:mi><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mi>l</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow><mml:mi>c</mml:mi></mml:mfrac></mml:mrow></mml:math></inline-formula></p>
<disp-formula id="E4"><mml:math id="M9"><mml:mtable><mml:mtr><mml:mtd><mml:mi>L</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>C</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02212;</mml:mo><mml:msub><mml:mi>H</mml:mi><mml:mi>d</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mo>&#x0007B;</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x0007D;</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:msub><mml:mrow><mml:mi>log</mml:mi></mml:mrow><mml:mi>d</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mi>q</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mstyle><mml:mo>&#x02212;</mml:mo><mml:msub><mml:mi>log</mml:mi><mml:mi>d</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>c</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mi>q</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02212;</mml:mo><mml:msub><mml:mi>log</mml:mi><mml:mi>d</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>c</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p><italic>D</italic><sub><italic>kl</italic></sub>(<italic>p</italic>||<italic>q</italic>) &#x02265;&#x02009;0 (Cover and Thomas, <xref ref-type="bibr" rid="B12">1991</xref>), and by the Kraft&#x02013;McMillan inequality <italic>c</italic>&#x02009;&#x02264;&#x02009;1, so log<italic><sub>d</sub></italic>(<italic>c</italic>) &#x02264;&#x02009;0.</p>
<p>Therefore:</p>
<disp-formula id="E5"><mml:math id="M10"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mi>q</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02212;</mml:mo><mml:msub><mml:mi>log</mml:mi><mml:mi>d</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>c</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02265;</mml:mo><mml:mn>0</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>L</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>C</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02265;</mml:mo><mml:msub><mml:mi>H</mml:mi><mml:mi>d</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Hence the average codeword length is greater than or equal to the entropy of the source <italic>H</italic><sub><italic>d</italic></sub>(<italic>X</italic>).</p>
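The Kraft&#x02013;McMillan bound above can be checked numerically. The sketch below uses a made-up binary prefix code (the symbols, codeword lengths, and source probabilities are purely illustrative and do not come from the article):

```python
import math

# Hypothetical binary prefix code (d = 2) for a four-symbol source.
# Codeword lengths l(x) and source probabilities p(x) are illustrative only.
lengths = {"A": 1, "B": 2, "C": 3, "D": 3}
p = {"A": 0.5, "B": 0.25, "C": 0.15, "D": 0.10}
d = 2

# Kraft-McMillan: c = sum_x d^(-l(x)) <= 1 for any uniquely decodable code.
c = sum(d ** -l for l in lengths.values())

# Average codeword length L(C) and source entropy H_d(X).
L = sum(p[x] * lengths[x] for x in p)
H = -sum(px * math.log(px, d) for px in p.values())

assert c <= 1.0   # Kraft-McMillan inequality
assert L >= H     # L(C) >= H_d(X), the source-coding bound
```

For this particular code the bound is nearly tight (L = 1.75 bits against an entropy of about 1.74 bits), illustrating that an efficient prefix code approaches the source entropy.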
<p>The source entropy quantifies the amount of uncertainty in the distribution of source words X. This uncertainty is resolved once the identity of the source word is known. In our setting the source word is the state representation that can be extracted given the chosen action, and the relevant source entropy is the amount of uncertainty about the state identity that is resolved by knowing the chosen action. <italic>H</italic>(<italic>S</italic>) is the original state uncertainty, and <italic>H</italic>(<italic>S</italic>|<italic>A</italic>) is the uncertainty that remains even when the action is given. The difference between these terms is the state uncertainty that is resolved given the chosen action. Therefore, in our case the relevant source entropy is <italic>H</italic>(<italic>S</italic>) &#x02212;&#x02009;<italic>H</italic>(<italic>S</italic>|<italic>A</italic>). This term is also known as the state&#x02013;action mutual information, MI(<italic>S</italic>; <italic>A</italic>) &#x0003D;&#x02009;<italic>H</italic>(<italic>S</italic>) &#x02212;&#x02009;<italic>H</italic>(<italic>S</italic>|<italic>A</italic>). In other words, MI(<italic>S; A</italic>) is a lower bound on the length of the policy's state representation. Consequently, minimizing MI(<italic>S; A</italic>) is equivalent to minimizing the policy state representation length, i.e., minimizing the policy complexity.</p>
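The state&#x02013;action mutual information used here as the policy-complexity measure can be computed directly from p(s) and p(a|s). In the sketch below the state distribution and the policy are hypothetical, chosen only to illustrate the computation:

```python
import numpy as np

# Hypothetical example: 3 states, 2 actions. p_s is the state distribution,
# policy[s, a] = p(a|s). The numbers are illustrative, not from the article.
p_s = np.array([0.5, 0.3, 0.2])
policy = np.array([[0.9, 0.1],
                   [0.2, 0.8],
                   [0.5, 0.5]])

def policy_complexity(p_s, policy):
    """State-action mutual information MI(S; A) in bits."""
    joint = p_s[:, None] * policy          # p(s, a) = p(s) p(a|s)
    p_a = joint.sum(axis=0)                # p(a) = sum_s p(a|s) p(s)
    mask = joint > 0
    ratio = joint / (p_s[:, None] * p_a[None, :])
    return float((joint[mask] * np.log2(ratio[mask])).sum())

mi = policy_complexity(p_s, policy)

# A state-independent policy carries no information about the state: MI = 0.
uniform = np.full_like(policy, 0.5)
assert policy_complexity(p_s, uniform) < 1e-12
assert mi > 0
```

A deterministic state-dependent policy maximizes this complexity, while a policy that ignores the state entirely has MI(S; A) = 0, matching the interpretation of MI(S; A) as the cost of representing states in the policy.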
<p>In addition we can measure the commitment to the future directly by the mutual information between the current state (denoted by <italic>s</italic><sub><italic>t</italic></sub>) and the following series of actions and states [denoted by (<italic>a</italic><sub><italic>t</italic></sub>, <italic>s</italic><sub><italic>t</italic>&#x02009;&#x0002B;&#x02009;1</sub>, &#x02026;,&#x02009;<italic>a</italic><sub><italic>n</italic>&#x02009;&#x02212;&#x02009;1</sub>,<italic>s</italic><sub><italic>n</italic></sub>)]:</p>
<disp-formula id="E6"><mml:math id="M11"><mml:mtable><mml:mtr><mml:mtd><mml:mtext>MI</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>;</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mtext>MI</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>;</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mtext>MI</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>;</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x0007C;</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x02009;&#x02009;&#x02009;</mml:mtext><mml:mo>+</mml:mo><mml:mtext>MI</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>;</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x0007C;</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x02009;&#x02009;&#x02009;</mml:mtext><mml:mo>+</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>+</mml:mo><mml:mtext>MI</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>;</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>&#x0007C;</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>[according to the chain rule of information (Cover and Thomas, <xref ref-type="bibr" rid="B12">1991</xref>)]. However, due to the first-order Markov property of the series, the transition from state to state depends only on the action chosen according to the previous state. In other words, it is independent of states that lie more than one step back, and of the order of the states:</p>
<disp-formula id="E7"><mml:math id="M12"><mml:mtable><mml:mtr><mml:mtd><mml:mtext>MI</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>;</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x0007C;</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mtext>MI</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>;</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x0007C;</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x02009;&#x02009;&#x02009;&#x02009;&#x02009;&#x02009;</mml:mtext><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mtext>&#x02009;&#x02009;</mml:mtext><mml:mi>k</mml:mi><mml:mo>&#x02260;</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>MI</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>;</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mtext>MI</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:mi>S</mml:mi><mml:mo>;</mml:mo><mml:mi>A</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>MI</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>;</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mtext>MI</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:mi>S</mml:mi><mml:mo>;</mml:mo><mml:mi>A</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>+</mml:mo><mml:mtext>MI</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>;</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x0007C;</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where MI(<italic>s</italic><sub><italic>t</italic></sub>; <italic>s</italic><sub><italic>t</italic>&#x02009;&#x0002B;&#x02009;1</sub>|<italic>a</italic><sub><italic>t</italic></sub>) denotes the mutual information between two adjacent states (the state at step <italic>t</italic> and the state at step <italic>t</italic>&#x0002B;1) given the action that generated the transition between them. Since this measure depends solely on <italic>p</italic>(<italic>s</italic><sub><italic>t</italic></sub>; <italic>s</italic><sub><italic>t</italic>&#x02009;&#x0002B;&#x02009;1</sub>|<italic>a</italic><sub><italic>t</italic></sub>), and in our setting is independent of the agent&#x00027;s policy, minimizing MI(<italic>s</italic><sub>1</sub>;<italic>a</italic><sub>1</sub>,<italic>s</italic><sub>2</sub>, &#x02026;,&#x02009;<italic>a</italic><sub><italic>n</italic>&#x02009;&#x02212;&#x02009;1</sub>,<italic>s</italic><sub><italic>n</italic></sub>) is equivalent to minimizing MI(<italic>S</italic>; <italic>A</italic>). Therefore, MI(<italic>S</italic>; <italic>A</italic>) (state&#x02013;action MI) can be used as a measure of policy complexity.</p>
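The chain-rule decomposition used above, MI(X; Y, Z) = MI(X; Y) + MI(X; Z|Y), can be verified numerically on an arbitrary small joint distribution. The joint below is random and serves only as a check of the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical joint p(x, y, z) over small alphabets, purely for a numeric
# check of the chain rule MI(X; Y, Z) = MI(X; Y) + MI(X; Z | Y).
p = rng.random((3, 3, 3))
p /= p.sum()

def mi(pxy):
    """Mutual information (bits) of a 2-D joint distribution."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    m = pxy > 0
    return float((pxy[m] * np.log2(pxy[m] / (px * py)[m])).sum())

# MI(X; (Y, Z)), treating the pair (y, z) as a single variable.
mi_x_yz = mi(p.reshape(3, 9))
# MI(X; Y), marginalizing out z.
mi_x_y = mi(p.sum(axis=2))
# MI(X; Z | Y) = sum_y p(y) * MI(X; Z) under the conditional p(x, z | y).
p_y = p.sum(axis=(0, 2))
mi_x_z_given_y = sum(p_y[y] * mi(p[:, y, :] / p_y[y]) for y in range(3))

assert abs(mi_x_yz - (mi_x_y + mi_x_z_given_y)) < 1e-10
```

In the article's setting, every conditional term beyond MI(s_t; a_t) and MI(s_t; s_{t+1}|a_t) vanishes by the Markov property, which is what reduces the full trajectory information to the two-term sum of Equation E7.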
</sec>
<sec>
<title>Combining Maximum Reward and Minimum Complexity Goals</title>
<p>The optimal tradeoff between the two goals of maximum reward and minimum complexity can be found by solving a variational problem similar to that of rate distortion theory (RDT, Shannon, <xref ref-type="bibr" rid="B52">1959</xref>). In the framework of communication theory, RDT characterizes the tradeoff between the rate, or signal representation size, and the average distortion of the reconstructed signal. It determines the level of the expected distortion, given the desired information rate. Here we characterize the tradeoff between the state representation size and a function (the state&#x02013;action value) of the original state (similar formalizations can be found in Klyubin et al., <xref ref-type="bibr" rid="B25">2007</xref>; Tishby and Polani, <xref ref-type="bibr" rid="B61">2010</xref>):</p>
<disp-formula id="E8"><mml:math id="M13"><mml:mrow><mml:msub><mml:mrow><mml:mi>min</mml:mi><mml:mo>&#x02061;</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub><mml:mrow><mml:mo>{</mml:mo> <mml:mrow><mml:mtext>MI</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:mi>S</mml:mi><mml:mo>,</mml:mo><mml:mi>A</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mi>&#x003B2;</mml:mi><mml:mo>&#x0003C;</mml:mo><mml:mi>Q</mml:mi><mml:mo>&#x0003E;</mml:mo><mml:mo>+</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mi>s</mml:mi></mml:munder><mml:mrow><mml:mi>&#x003BB;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mi>a</mml:mi></mml:munder><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mstyle></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mstyle></mml:mrow> <mml:mo>}</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M14"><mml:mrow><mml:mtext>MI</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:mi>S</mml:mi><mml:mo>,</mml:mo><mml:mi>A</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:msub><mml:mrow><mml:mi>log</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:mstyle><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mrow></mml:math></inline-formula></p>
<p>&#x0003C;<italic>Q</italic>&#x0003E;&#x02009;&#x0003D;&#x02009;&#x003A3;<italic><sub>s,a</sub>p</italic>(<italic>a</italic>|<italic>s</italic>)<italic>p</italic>(<italic>s</italic>)<italic>Q</italic>(<italic>s</italic>, <italic>a</italic>),</p>
<p>&#x003B2; is the tradeoff parameter (the Lagrange multiplier), and <italic>Q</italic>(<italic>s,a</italic>) (the state&#x02013;action Q-value) denotes the expected reward when performing action <italic>a</italic> in state <italic>s</italic>.</p>
<p>The third term of the equation, <inline-formula><mml:math id="M15"><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mi>s</mml:mi></mml:munder><mml:mrow><mml:mi>&#x003BB;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mi>a</mml:mi></mml:munder><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mstyle></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mstyle></mml:mrow></mml:math></inline-formula>, enforces the normalization constraint that the action distribution of each state sums to 1 (&#x003BB;(<italic>s</italic>) are the normalization Lagrange multipliers, one for each state <italic>s</italic>).</p>
<p>The probability of choosing an action <italic>a</italic> irrespective of the state is given by:</p>
<p><italic>p</italic>(<italic>a</italic>) &#x0003D;&#x02009;&#x003A3;<italic><sub>s</sub>p</italic>(<italic>a</italic>|<italic>s</italic>)<italic>p</italic>(<italic>s</italic>).</p>
<p>The solution to the variational problem is obtained by setting the derivative with respect to <italic>p</italic>(<italic>a</italic>|<italic>s</italic>) to zero:</p>
<disp-formula id="E9"><mml:math id="M16"><mml:mrow><mml:mfrac><mml:mrow><mml:mo>&#x02202;</mml:mo><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mtext>MI</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:mi>S</mml:mi><mml:mo>,</mml:mo><mml:mi>A</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mi>&#x003B2;</mml:mi><mml:mi>f</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>A</mml:mi><mml:mo>,</mml:mo><mml:mi>S</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>+</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mi>s</mml:mi></mml:munder><mml:mrow><mml:mi>&#x003BB;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mi>a</mml:mi></mml:munder><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mstyle></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mstyle></mml:mrow> <mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x02202;</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow></mml:math></disp-formula>
<disp-formula id="E10"><label>1.</label><mml:math id="M17"><mml:mtable><mml:mtr><mml:mtd><mml:mfrac><mml:mrow><mml:mo>&#x02202;</mml:mo><mml:mtext>MI</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:mi>S</mml:mi><mml:mo>,</mml:mo><mml:mi>A</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x02202;</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:msub><mml:mrow><mml:mi>log</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mstyle></mml:mrow><mml:mrow><mml:mo>&#x02202;</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo 
stretchy='false'>)</mml:mo><mml:msub><mml:mi>log</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x02009;</mml:mtext><mml:mo>&#x000D7;</mml:mo><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:msup><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mfrac></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x02009;</mml:mtext><mml:mstyle 
displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:msup><mml:mi>s</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo>&#x02260;</mml:mo><mml:mi>s</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x0007C;</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac><mml:mfrac><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mo stretchy='false'>)</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:msup><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mfrac></mml:mrow></mml:mstyle></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>log</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo 
stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x02212;</mml:mo><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac></mml:mrow></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x02009;</mml:mtext><mml:mrow><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mfrac><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup></mml:munder><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac></mml:mrow> <mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo 
stretchy='false'>)</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>log</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x02212;</mml:mo><mml:mfrac><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mi>s</mml:mi></mml:munder><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac></mml:mrow> <mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:msub><mml:mrow><mml:mi>log</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo 
stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x02212;</mml:mo><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac></mml:mrow> <mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:msub><mml:mi>log</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E11"><label>2.</label><mml:math id="M18"><mml:mfrac><mml:mrow><mml:mo>&#x02202;</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>S</mml:mi><mml:mo>,</mml:mo><mml:mi>A</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x02202;</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>&#x02202;</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>Q</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mrow><mml:mo>&#x02202;</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>Q</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:math></disp-formula>
<disp-formula id="E12"><label>3.</label><mml:math id="M19"><mml:mfrac><mml:mrow><mml:mo>&#x02202;</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mi>s</mml:mi></mml:munder><mml:mrow><mml:mi>&#x003BB;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mi>a</mml:mi></mml:munder><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mstyle></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mstyle></mml:mrow><mml:mrow><mml:mo>&#x02202;</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mi>&#x003BB;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:math></disp-formula>
<disp-formula id="E13"><label>4.</label><mml:math id="M20"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:mfrac><mml:mrow><mml:mo>&#x02202;</mml:mo><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mtext>MI</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:mi>S</mml:mi><mml:mo>,</mml:mo><mml:mi>A</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mi>&#x003B2;</mml:mi><mml:mi>f</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>A</mml:mi><mml:mo>,</mml:mo><mml:mi>S</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>+</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mi>s</mml:mi></mml:munder><mml:mrow><mml:mi>&#x003BB;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mi>a</mml:mi></mml:munder><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mstyle></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mstyle></mml:mrow> <mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x02202;</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>&#x021D2;</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:msub><mml:mrow><mml:mi>log</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo 
stretchy='false'>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mi>&#x003B2;</mml:mi><mml:mi>Q</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>+</mml:mo><mml:mfrac><mml:mrow><mml:mi>&#x003BB;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac></mml:mrow> <mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>&#x021D2;</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:msup><mml:mtext>e</mml:mtext><mml:mrow><mml:mi>&#x003B2;</mml:mi><mml:mi>Q</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mfrac><mml:mrow><mml:mi>&#x003BB;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mi>a</mml:mi></mml:munder><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo 
stretchy='false'>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mstyle><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x021D2;</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mi>a</mml:mi></mml:munder><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:msup><mml:mtext>e</mml:mtext><mml:mrow><mml:mi>&#x003B2;</mml:mi><mml:mi>Q</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mfrac><mml:mrow><mml:mi>&#x003BB;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x021D2;</mml:mo></mml:mrow></mml:mstyle></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msup><mml:mtext>e</mml:mtext><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mfrac><mml:mrow><mml:mi>&#x003BB;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mi>a</mml:mi></mml:munder><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:msup><mml:mtext>e</mml:mtext><mml:mrow><mml:mi>&#x003B2;</mml:mi><mml:mi>Q</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo 
stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mstyle></mml:mrow></mml:mfrac><mml:mo>&#x021D2;</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x02009;&#x02009;&#x02009;&#x02009;&#x02009;</mml:mtext><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:msup><mml:mtext>e</mml:mtext><mml:mrow><mml:mi>&#x003B2;</mml:mi><mml:mi>Q</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:msup><mml:mi>a</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup></mml:munder><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>a</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:msup><mml:mtext>e</mml:mtext><mml:mrow><mml:mi>&#x003B2;</mml:mi><mml:mi>Q</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:msup><mml:mi>a</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mstyle></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The solution can be obtained by a generalization of the Blahut&#x02013;Arimoto algorithm for rate distortion problems (Blahut, <xref ref-type="bibr" rid="B7">1972</xref>; Cover and Thomas, <xref ref-type="bibr" rid="B12">1991</xref>); namely, by iterating the following equations alternately until convergence:</p>
<disp-formula id="E14"><mml:math id="M21"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mi>Z</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac><mml:msup><mml:mtext>e</mml:mtext><mml:mrow><mml:mi>&#x003B2;</mml:mi><mml:mi>Q</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mi>s</mml:mi></mml:munder><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mstyle></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>Z</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mi>a</mml:mi></mml:munder><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:msup><mml:mtext>e</mml:mtext><mml:mrow><mml:mi>&#x003B2;</mml:mi><mml:mi>Q</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo 
stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mstyle></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
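<p>As a minimal numerical sketch (not part of the original derivation), the alternating updates above can be implemented directly; the state distribution <italic>p</italic>(<italic>s</italic>), the reward matrix <italic>Q</italic>, and the &#x003B2; values below are illustrative assumptions, and <monospace>reward_complexity_policy</monospace> is a hypothetical helper name:</p>

```python
import numpy as np

def reward_complexity_policy(p_s, Q, beta, n_iter=500, tol=1e-12):
    """Alternate between p(a|s) = p(a) e^{beta Q(s,a)} / Z(s)
    and p(a) = sum_s p(a|s) p(s) until the marginal converges."""
    n_states, n_actions = Q.shape
    p_a = np.full(n_actions, 1.0 / n_actions)  # uniform initial marginal p(a)
    for _ in range(n_iter):
        # unnormalized p(a|s); Z(s) is the row sum
        unnorm = p_a[None, :] * np.exp(beta * Q)
        p_a_given_s = unnorm / unnorm.sum(axis=1, keepdims=True)
        new_p_a = p_s @ p_a_given_s  # p(a) = sum_s p(s) p(a|s)
        if np.max(np.abs(new_p_a - p_a)) < tol:
            p_a = new_p_a
            break
        p_a = new_p_a
    return p_a_given_s, p_a

# Illustrative example: two states, each rewarding a different action.
p_s = np.array([0.5, 0.5])
Q = np.array([[1.0, 0.0],
              [0.0, 1.0]])
soft, _ = reward_complexity_policy(p_s, Q, beta=1.0)   # graded policy
hard, _ = reward_complexity_policy(p_s, Q, beta=50.0)  # near-deterministic
```

<p>Small &#x003B2; keeps the policy close to the action marginal (low policy complexity); large &#x003B2; drives it toward the reward-maximizing deterministic policy, matching the trade-off discussed in the text.</p>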
<p>Note that using the state expected reward values <italic>V</italic>(<italic>s</italic>) instead of the state&#x02013;action pair expected reward values <italic>Q</italic>(<italic>s,a</italic>) yields similar results:</p>
<p><inline-formula><mml:math id="M22"><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mi>Z</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac><mml:msup><mml:mtext>e</mml:mtext><mml:mrow><mml:mi>&#x003B2;</mml:mi><mml:mi>V</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> where <italic>s</italic>&#x02032; is the state that follows state <italic>s</italic> given action <italic>a</italic>.</p>
</sec>
</app>
</app-group>
<ack><p>This study was partly supported by the FP7 Select and Act grant (Hagai Bergman) and by the Gatsby Charitable Foundation (Naftali Tishby).</p>
</ack>
<ref-list>
<title>References</title>
<ref id="B1"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Albin</surname> <given-names>R. L.</given-names></name> <name><surname>Young</surname> <given-names>A. B.</given-names></name> <name><surname>Penney</surname> <given-names>J. B.</given-names></name></person-group> (<year>1989</year>). <article-title>The functional anatomy of basal ganglia disorders</article-title>. <source>Trends Neurosci.</source> <volume>12</volume>, <fpage>366</fpage>&#x02013;<lpage>375</lpage>.<pub-id pub-id-type="doi">10.1016/0166-2236(89)90074-X</pub-id><pub-id pub-id-type="pmid">2479133</pub-id></citation></ref>
<ref id="B2"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Balleine</surname> <given-names>B. W.</given-names></name> <name><surname>Delgado</surname> <given-names>M. R.</given-names></name> <name><surname>Hikosaka</surname> <given-names>O.</given-names></name></person-group> (<year>2007</year>). <article-title>The role of the dorsal striatum in reward and decision-making</article-title>. <source>J. Neurosci.</source> <volume>27</volume>, <fpage>8161</fpage>&#x02013;<lpage>8165</lpage>.<pub-id pub-id-type="doi">10.1523/JNEUROSCI.1554-07.2007</pub-id><pub-id pub-id-type="pmid">17670959</pub-id></citation></ref>
<ref id="B3"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bar-Gad</surname> <given-names>I.</given-names></name> <name><surname>Heimer</surname> <given-names>G.</given-names></name> <name><surname>Ritov</surname> <given-names>Y.</given-names></name> <name><surname>Bergman</surname> <given-names>H.</given-names></name></person-group> (<year>2003a</year>). <article-title>Functional correlations between neighboring neurons in the primate globus pallidus are weak or nonexistent</article-title>. <source>J. Neurosci.</source> <volume>23</volume>, <fpage>4012</fpage>&#x02013;<lpage>4016</lpage>.</citation></ref>
<ref id="B4"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bar-Gad</surname> <given-names>I.</given-names></name> <name><surname>Morris</surname> <given-names>G.</given-names></name> <name><surname>Bergman</surname> <given-names>H.</given-names></name></person-group> (<year>2003b</year>). <article-title>Information processing, dimensionality reduction and reinforcement learning in the basal ganglia</article-title>. <source>Prog. Neurobiol.</source> <volume>71</volume>, <fpage>439</fpage>&#x02013;<lpage>473</lpage>.<pub-id pub-id-type="doi">10.1016/j.pneurobio.2003.12.001</pub-id></citation></ref>
<ref id="B5"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Barto</surname> <given-names>A. G.</given-names></name></person-group> (<year>1995</year>). <article-title>&#x0201C;Adaptive critics and the basal ganglia,&#x0201D;</article-title> in <source>Models of Information Processing in the Basal Ganglia</source>, eds <person-group person-group-type="editor"><name><surname>Houk</surname> <given-names>J. C.</given-names></name> <name><surname>Davis</surname> <given-names>J. L.</given-names></name> <name><surname>Beiser</surname> <given-names>D. G.</given-names></name></person-group> (<publisher-loc>Cambridge</publisher-loc>: <publisher-name>The MIT Press</publisher-name>), <fpage>215</fpage>&#x02013;<lpage>232</lpage>.</citation></ref>
<ref id="B6"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bayer</surname> <given-names>H. M.</given-names></name> <name><surname>Glimcher</surname> <given-names>P. W.</given-names></name></person-group> (<year>2005</year>). <article-title>Midbrain dopamine neurons encode a quantitative reward prediction error signal</article-title>. <source>Neuron</source> <volume>47</volume>, <fpage>129</fpage>&#x02013;<lpage>141</lpage>.<pub-id pub-id-type="doi">10.1016/j.neuron.2005.05.020</pub-id><pub-id pub-id-type="pmid">15996553</pub-id></citation></ref>
<ref id="B7"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Blahut</surname> <given-names>R. E.</given-names></name></person-group> (<year>1972</year>). <article-title>Computation of channel capacity and rate-distortion functions</article-title>. <source>IEEE Trans. Inform. Theory</source> <volume>IT-18</volume>, <fpage>460</fpage>&#x02013;<lpage>473</lpage>.<pub-id pub-id-type="doi">10.1109/TIT.1972.1054855</pub-id></citation></ref>
<ref id="B8"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bonanni</surname> <given-names>L.</given-names></name> <name><surname>Thomas</surname> <given-names>A.</given-names></name> <name><surname>Anzellotti</surname> <given-names>F.</given-names></name> <name><surname>Monaco</surname> <given-names>D.</given-names></name> <name><surname>Ciccocioppo</surname> <given-names>F.</given-names></name> <name><surname>Varanese</surname> <given-names>S.</given-names></name> <name><surname>Bifolchetti</surname> <given-names>S.</given-names></name> <name><surname>D&#x00027;Amico</surname> <given-names>M. C.</given-names></name> <name><surname>Di Iorio</surname> <given-names>A.</given-names></name> <name><surname>Onofrj</surname> <given-names>M.</given-names></name></person-group> (<year>2010</year>). <article-title>Protracted benefit from paradoxical kinesia in typical and atypical parkinsonisms</article-title>. <source>Neurol. Sci.</source> <volume>31</volume>, <fpage>751</fpage>&#x02013;<lpage>756</lpage>.<pub-id pub-id-type="pmid">20859648</pub-id></citation></ref>
<ref id="B9"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Boraud</surname> <given-names>T.</given-names></name> <name><surname>Bezard</surname> <given-names>E.</given-names></name> <name><surname>Bioulac</surname> <given-names>B.</given-names></name> <name><surname>Gross</surname> <given-names>C. E.</given-names></name></person-group> (<year>2001</year>). <article-title>Dopamine agonist-induced dyskinesias are correlated to both firing pattern and frequency alterations of pallidal neurones in the MPTP-treated monkey</article-title>. <source>Brain</source> <volume>124</volume>, <fpage>546</fpage>&#x02013;<lpage>557</lpage>.<pub-id pub-id-type="doi">10.1093/brain/124.3.546</pub-id><pub-id pub-id-type="pmid">11222455</pub-id></citation></ref>
<ref id="B10"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Boraud</surname> <given-names>T.</given-names></name> <name><surname>Bezard</surname> <given-names>E.</given-names></name> <name><surname>Guehl</surname> <given-names>D.</given-names></name> <name><surname>Bioulac</surname> <given-names>B.</given-names></name> <name><surname>Gross</surname> <given-names>C.</given-names></name></person-group> (<year>1998</year>). <article-title>Effects of L-DOPA on neuronal activity of the globus pallidus externalis (GPe) and globus pallidus internalis (GPi) in the MPTP-treated monkey</article-title>. <source>Brain Res.</source> <volume>787</volume>, <fpage>157</fpage>&#x02013;<lpage>160</lpage>.<pub-id pub-id-type="doi">10.1016/S0006-8993(97)01563-1</pub-id><pub-id pub-id-type="pmid">9518590</pub-id></citation></ref>
<ref id="B11"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cohen</surname> <given-names>M. X.</given-names></name> <name><surname>Frank</surname> <given-names>M. J.</given-names></name></person-group> (<year>2009</year>). <article-title>Neurocomputational models of basal ganglia function in learning, memory and choice</article-title>. <source>Behav. Brain Res.</source> <volume>199</volume>, <fpage>141</fpage>&#x02013;<lpage>156</lpage>.<pub-id pub-id-type="pmid">18950662</pub-id></citation></ref>
<ref id="B12"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Cover</surname> <given-names>T. M.</given-names></name> <name><surname>Thomas</surname> <given-names>J. A.</given-names></name></person-group> (<year>1991</year>). <source>Elements of Information Theory</source>. <publisher-loc>New York</publisher-loc>: <publisher-name>Wiley</publisher-name>.</citation></ref>
<ref id="B13"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Day</surname> <given-names>M.</given-names></name> <name><surname>Wokosin</surname> <given-names>D.</given-names></name> <name><surname>Plotkin</surname> <given-names>J. L.</given-names></name> <name><surname>Tian</surname> <given-names>X.</given-names></name> <name><surname>Surmeier</surname> <given-names>D. J.</given-names></name></person-group> (<year>2008</year>). <article-title>Differential excitability and modulation of striatal medium spiny neuron dendrites</article-title>. <source>J. Neurosci.</source> <volume>28</volume>, <fpage>11603</fpage>&#x02013;<lpage>11614</lpage>.<pub-id pub-id-type="doi">10.1523/JNEUROSCI.1840-08.2008</pub-id><pub-id pub-id-type="pmid">18987196</pub-id></citation></ref>
<ref id="B14"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dayan</surname> <given-names>P.</given-names></name> <name><surname>Balleine</surname> <given-names>B. W.</given-names></name></person-group> (<year>2002</year>). <article-title>Reward, motivation, and reinforcement learning</article-title>. <source>Neuron</source> <volume>36</volume>, <fpage>285</fpage>&#x02013;<lpage>298</lpage>.<pub-id pub-id-type="doi">10.1016/S0896-6273(02)00963-7</pub-id><pub-id pub-id-type="pmid">12383782</pub-id></citation></ref>
<ref id="B15"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Deniau</surname> <given-names>J. M.</given-names></name> <name><surname>Chevalier</surname> <given-names>G.</given-names></name></person-group> (<year>1985</year>). <article-title>Disinhibition as a basic process in the expression of striatal functions. II. The striato-nigral influence on thalamocortical cells of the ventromedial thalamic nucleus</article-title>. <source>Brain Res.</source> <volume>334</volume>, <fpage>227</fpage>&#x02013;<lpage>233</lpage>.<pub-id pub-id-type="doi">10.1016/0006-8993(85)90214-8</pub-id><pub-id pub-id-type="pmid">3995318</pub-id></citation></ref>
<ref id="B16"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Filion</surname> <given-names>M.</given-names></name> <name><surname>Tremblay</surname> <given-names>L.</given-names></name> <name><surname>Bedard</surname> <given-names>P. J.</given-names></name></person-group> (<year>1991</year>). <article-title>Effects of dopamine agonists on the spontaneous activity of globus pallidus neurons in monkeys with MPTP-induced parkinsonism</article-title>. <source>Brain Res.</source> <volume>547</volume>, <fpage>152</fpage>&#x02013;<lpage>161</lpage>.<pub-id pub-id-type="pmid">1677608</pub-id></citation></ref>
<ref id="B17"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fiorillo</surname> <given-names>C. D.</given-names></name> <name><surname>Tobler</surname> <given-names>P. N.</given-names></name> <name><surname>Schultz</surname> <given-names>W.</given-names></name></person-group> (<year>2003</year>). <article-title>Discrete coding of reward probability and uncertainty by dopamine neurons</article-title>. <source>Science</source> <volume>299</volume>, <fpage>1898</fpage>&#x02013;<lpage>1902</lpage>.<pub-id pub-id-type="doi">10.1126/science.1077349</pub-id><pub-id pub-id-type="pmid">12649484</pub-id></citation></ref>
<ref id="B18"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Goto</surname> <given-names>Y.</given-names></name> <name><surname>Otani</surname> <given-names>S.</given-names></name> <name><surname>Grace</surname> <given-names>A. A.</given-names></name></person-group> (<year>2007</year>). <article-title>The Yin and Yang of dopamine release: a new perspective</article-title>. <source>Neuropharmacology</source> <volume>53</volume>, <fpage>583</fpage>&#x02013;<lpage>587</lpage>.<pub-id pub-id-type="doi">10.1016/j.neuropharm.2007.07.007</pub-id><pub-id pub-id-type="pmid">17709119</pub-id></citation></ref>
<ref id="B19"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gurney</surname> <given-names>K.</given-names></name> <name><surname>Prescott</surname> <given-names>T. J.</given-names></name> <name><surname>Wickens</surname> <given-names>J. R.</given-names></name> <name><surname>Redgrave</surname> <given-names>P.</given-names></name></person-group> (<year>2004</year>). <article-title>Computational models of the basal ganglia: from robots to membranes</article-title>. <source>Trends Neurosci.</source> <volume>27</volume>, <fpage>453</fpage>&#x02013;<lpage>459</lpage>.<pub-id pub-id-type="doi">10.1016/j.tins.2004.06.003</pub-id><pub-id pub-id-type="pmid">15271492</pub-id></citation></ref>
<ref id="B20"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Heimer</surname> <given-names>G.</given-names></name> <name><surname>Bar-Gad</surname> <given-names>I.</given-names></name> <name><surname>Goldberg</surname> <given-names>J. A.</given-names></name> <name><surname>Bergman</surname> <given-names>H.</given-names></name></person-group> (<year>2002</year>). <article-title>Dopamine replacement therapy reverses abnormal synchronization of pallidal neurons in the 1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine primate model of parkinsonism</article-title>. <source>J. Neurosci.</source> <volume>22</volume>, <fpage>7850</fpage>&#x02013;<lpage>7855</lpage>.<pub-id pub-id-type="pmid">12223537</pub-id></citation></ref>
<ref id="B21"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hikosaka</surname> <given-names>O.</given-names></name></person-group> (<year>2007</year>). <article-title>GABAergic output of the basal ganglia</article-title>. <source>Prog. Brain Res.</source> <volume>160</volume>, <fpage>209</fpage>&#x02013;<lpage>226</lpage>.<pub-id pub-id-type="doi">10.1016/S0079-6123(06)60012-5</pub-id><pub-id pub-id-type="pmid">17499116</pub-id></citation></ref>
<ref id="B22"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hikosaka</surname> <given-names>O.</given-names></name> <name><surname>Wurtz</surname> <given-names>R. H.</given-names></name></person-group> (<year>1983</year>). <article-title>Visual and oculomotor functions of monkey substantia nigra pars reticulata. IV. Relation of substantia nigra to superior colliculus</article-title>. <source>J. Neurophysiol.</source> <volume>49</volume>, <fpage>1285</fpage>&#x02013;<lpage>1301</lpage>.<pub-id pub-id-type="pmid">6306173</pub-id></citation></ref>
<ref id="B23"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Keefe</surname> <given-names>K. A.</given-names></name> <name><surname>Salamone</surname> <given-names>J. D.</given-names></name> <name><surname>Zigmond</surname> <given-names>M. J.</given-names></name> <name><surname>Stricker</surname> <given-names>E. M.</given-names></name></person-group> (<year>1989</year>). <article-title>Paradoxical kinesia in parkinsonism is not caused by dopamine release. Studies in an animal model</article-title>. <source>Arch. Neurol.</source> <volume>46</volume>, <fpage>1070</fpage>&#x02013;<lpage>1075</lpage>.<pub-id pub-id-type="pmid">2508609</pub-id></citation></ref>
<ref id="B24"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kerr</surname> <given-names>J. N.</given-names></name> <name><surname>Wickens</surname> <given-names>J. R.</given-names></name></person-group> (<year>2001</year>). <article-title>Dopamine D-1/D-5 receptor activation is required for long-term potentiation in the rat neostriatum in vitro</article-title>. <source>J. Neurophysiol.</source> <volume>85</volume>, <fpage>117</fpage>&#x02013;<lpage>124</lpage>.<pub-id pub-id-type="pmid">11152712</pub-id></citation></ref>
<ref id="B25"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Klyubin</surname> <given-names>A. S.</given-names></name> <name><surname>Polani</surname> <given-names>D.</given-names></name> <name><surname>Nehaniv</surname> <given-names>C. L.</given-names></name></person-group> (<year>2007</year>). <article-title>Representations of space and time in the maximization of information flow in the perception-action loop</article-title>. <source>Neural Comput.</source> <volume>19</volume>, <fpage>2387</fpage>&#x02013;<lpage>2432</lpage>.<pub-id pub-id-type="doi">10.1162/neco.2007.19.9.2387</pub-id><pub-id pub-id-type="pmid">17650064</pub-id></citation></ref>
<ref id="B26"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kreitzer</surname> <given-names>A. C.</given-names></name> <name><surname>Malenka</surname> <given-names>R. C.</given-names></name></person-group> (<year>2008</year>). <article-title>Striatal plasticity and basal ganglia circuit function</article-title>. <source>Neuron</source> <volume>60</volume>, <fpage>543</fpage>&#x02013;<lpage>554</lpage>.<pub-id pub-id-type="doi">10.1016/j.neuron.2008.11.005</pub-id><pub-id pub-id-type="pmid">19038213</pub-id></citation></ref>
<ref id="B27"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lau</surname> <given-names>B.</given-names></name> <name><surname>Glimcher</surname> <given-names>P. W.</given-names></name></person-group> (<year>2005</year>). <article-title>Dynamic response-by-response models of matching behavior in rhesus monkeys</article-title>. <source>J. Exp. Anal. Behav.</source> <volume>84</volume>, <fpage>555</fpage>&#x02013;<lpage>579</lpage>.<pub-id pub-id-type="pmid">16596980</pub-id></citation></ref>
<ref id="B28"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Levy</surname> <given-names>R.</given-names></name> <name><surname>Dostrovsky</surname> <given-names>J. O.</given-names></name> <name><surname>Lang</surname> <given-names>A. E.</given-names></name> <name><surname>Sime</surname> <given-names>E.</given-names></name> <name><surname>Hutchison</surname> <given-names>W. D.</given-names></name> <name><surname>Lozano</surname> <given-names>A. M.</given-names></name></person-group> (<year>2001</year>). <article-title>Effects of apomorphine on subthalamic nucleus and globus pallidus internus neurons in patients with Parkinson&#x00027;s disease</article-title>. <source>J. Neurophysiol.</source> <volume>86</volume>, <fpage>249</fpage>&#x02013;<lpage>260</lpage>.<pub-id pub-id-type="pmid">11431506</pub-id></citation></ref>
<ref id="B29"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>McClure</surname> <given-names>S. M.</given-names></name> <name><surname>Daw</surname> <given-names>N. D.</given-names></name> <name><surname>Montague</surname> <given-names>P. R.</given-names></name></person-group> (<year>2003</year>). <article-title>A computational substrate for incentive salience</article-title>. <source>Trends Neurosci.</source> <volume>26</volume>, <fpage>423</fpage>&#x02013;<lpage>428</lpage>.<pub-id pub-id-type="doi">10.1016/S0166-2236(03)00177-2</pub-id><pub-id pub-id-type="pmid">12900173</pub-id></citation></ref>
<ref id="B30"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Merello</surname> <given-names>M.</given-names></name> <name><surname>Balej</surname> <given-names>J.</given-names></name> <name><surname>Delfino</surname> <given-names>M.</given-names></name> <name><surname>Cammarota</surname> <given-names>A.</given-names></name> <name><surname>Betti</surname> <given-names>O.</given-names></name> <name><surname>Leiguarda</surname> <given-names>R.</given-names></name></person-group> (<year>1999</year>). <article-title>Apomorphine induces changes in GPi spontaneous outflow in patients with Parkinson&#x00027;s disease</article-title>. <source>Mov. Disord.</source> <volume>14</volume>, <fpage>45</fpage>&#x02013;<lpage>49</lpage>.<pub-id pub-id-type="doi">10.1002/1531-8257(199901)14:1&#x0003C;45::AID-MDS1009&#x0003E;3.0.CO;2-F</pub-id><pub-id pub-id-type="pmid">9918343</pub-id></citation></ref>
<ref id="B31"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mink</surname> <given-names>J. W.</given-names></name></person-group> (<year>1996</year>). <article-title>The basal ganglia: focused selection and inhibition of competing motor programs</article-title>. <source>Prog. Neurobiol.</source> <volume>50</volume>, <fpage>381</fpage>&#x02013;<lpage>425</lpage>.<pub-id pub-id-type="pmid">9004351</pub-id></citation></ref>
<ref id="B32"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Morris</surname> <given-names>G.</given-names></name> <name><surname>Arkadir</surname> <given-names>D.</given-names></name> <name><surname>Nevet</surname> <given-names>A.</given-names></name> <name><surname>Vaadia</surname> <given-names>E.</given-names></name> <name><surname>Bergman</surname> <given-names>H.</given-names></name></person-group> (<year>2004</year>). <article-title>Coincident but distinct messages of midbrain dopamine and striatal tonically active neurons</article-title>. <source>Neuron</source> <volume>43</volume>, <fpage>133</fpage>&#x02013;<lpage>143</lpage>.<pub-id pub-id-type="doi">10.1016/j.neuron.2004.06.012</pub-id><pub-id pub-id-type="pmid">15233923</pub-id></citation></ref>
<ref id="B33"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Morris</surname> <given-names>G.</given-names></name> <name><surname>Nevet</surname> <given-names>A.</given-names></name> <name><surname>Arkadir</surname> <given-names>D.</given-names></name> <name><surname>Vaadia</surname> <given-names>E.</given-names></name> <name><surname>Bergman</surname> <given-names>H.</given-names></name></person-group> (<year>2006</year>). <article-title>Midbrain dopamine neurons encode decisions for future action</article-title>. <source>Nat. Neurosci.</source> <volume>9</volume>, <fpage>1057</fpage>&#x02013;<lpage>1063</lpage>.<pub-id pub-id-type="pmid">16862149</pub-id></citation></ref>
<ref id="B34"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nakahara</surname> <given-names>H.</given-names></name> <name><surname>Itoh</surname> <given-names>H.</given-names></name> <name><surname>Kawagoe</surname> <given-names>R.</given-names></name> <name><surname>Takikawa</surname> <given-names>Y.</given-names></name> <name><surname>Hikosaka</surname> <given-names>O.</given-names></name></person-group> (<year>2004</year>). <article-title>Dopamine neurons can represent context-dependent prediction error</article-title>. <source>Neuron</source> <volume>41</volume>, <fpage>269</fpage>&#x02013;<lpage>280</lpage>.<pub-id pub-id-type="doi">10.1016/S0896-6273(03)00869-9</pub-id><pub-id pub-id-type="pmid">14741107</pub-id></citation></ref>
<ref id="B35"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nevet</surname> <given-names>A.</given-names></name> <name><surname>Morris</surname> <given-names>G.</given-names></name> <name><surname>Saban</surname> <given-names>G.</given-names></name> <name><surname>Fainstein</surname> <given-names>N.</given-names></name> <name><surname>Bergman</surname> <given-names>H.</given-names></name></person-group> (<year>2004</year>). <article-title>Rate of substantia nigra pars reticulata neurons is reduced in non-parkinsonian monkeys with apomorphine-induced orofacial dyskinesia</article-title>. <source>J. Neurophysiol.</source> <volume>92</volume>, <fpage>1973</fpage>&#x02013;<lpage>1981</lpage>.<pub-id pub-id-type="doi">10.1152/jn.01036.2003</pub-id><pub-id pub-id-type="pmid">15115785</pub-id></citation></ref>
<ref id="B36"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nicola</surname> <given-names>S. M.</given-names></name> <name><surname>Surmeier</surname> <given-names>J.</given-names></name> <name><surname>Malenka</surname> <given-names>R. C.</given-names></name></person-group> (<year>2000</year>). <article-title>Dopaminergic modulation of neuronal excitability in the striatum and nucleus accumbens</article-title>. <source>Annu. Rev. Neurosci.</source> <volume>23</volume>, <fpage>185</fpage>&#x02013;<lpage>215</lpage>.<pub-id pub-id-type="pmid">10845063</pub-id></citation></ref>
<ref id="B37"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Niv</surname> <given-names>Y.</given-names></name> <name><surname>Joel</surname> <given-names>D.</given-names></name> <name><surname>Dayan</surname> <given-names>P.</given-names></name></person-group> (<year>2006</year>). <article-title>A normative perspective on motivation</article-title>. <source>Trends Cogn. Sci.</source> <volume>10</volume>, <fpage>375</fpage>&#x02013;<lpage>381</lpage>.<pub-id pub-id-type="pmid">16843041</pub-id></citation></ref>
<ref id="B38"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Onn</surname> <given-names>S. P.</given-names></name> <name><surname>West</surname> <given-names>A. R.</given-names></name> <name><surname>Grace</surname> <given-names>A. A.</given-names></name></person-group> (<year>2000</year>). <article-title>Dopamine-mediated regulation of striatal neuronal and network interactions</article-title>. <source>Trends Neurosci.</source> <volume>23</volume>, <fpage>S48</fpage>&#x02013;<lpage>S56</lpage>.<pub-id pub-id-type="pmid">11052220</pub-id></citation></ref>
<ref id="B39"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pan</surname> <given-names>W. X.</given-names></name> <name><surname>Schmidt</surname> <given-names>R.</given-names></name> <name><surname>Wickens</surname> <given-names>J. R.</given-names></name> <name><surname>Hyland</surname> <given-names>B. I.</given-names></name></person-group> (<year>2008</year>). <article-title>Tripartite mechanism of extinction suggested by dopamine neuron activity and temporal difference model</article-title>. <source>J. Neurosci.</source> <volume>28</volume>, <fpage>9619</fpage>&#x02013;<lpage>9631</lpage>.<pub-id pub-id-type="doi">10.1523/JNEUROSCI.0255-08.2008</pub-id><pub-id pub-id-type="pmid">18815248</pub-id></citation></ref>
<ref id="B40"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Papa</surname> <given-names>S. M.</given-names></name> <name><surname>DeSimone</surname> <given-names>R.</given-names></name> <name><surname>Fiorani</surname> <given-names>M.</given-names></name> <name><surname>Oldfield</surname> <given-names>E. H.</given-names></name></person-group> (<year>1999</year>). <article-title>Internal globus pallidus discharge is nearly suppressed during levodopa-induced dyskinesias</article-title>. <source>Ann. Neurol.</source> <volume>46</volume>, <fpage>732</fpage>&#x02013;<lpage>738</lpage>.<pub-id pub-id-type="pmid">10553990</pub-id></citation></ref>
<ref id="B41"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Parush</surname> <given-names>N.</given-names></name> <name><surname>Arkadir</surname> <given-names>D.</given-names></name> <name><surname>Nevet</surname> <given-names>A.</given-names></name> <name><surname>Morris</surname> <given-names>G.</given-names></name> <name><surname>Tishby</surname> <given-names>N.</given-names></name> <name><surname>Nelken</surname> <given-names>I.</given-names></name> <name><surname>Bergman</surname> <given-names>H.</given-names></name></person-group> (<year>2008</year>). <article-title>Encoding by response duration in the basal ganglia</article-title>. <source>J. Neurophysiol.</source> <volume>100</volume>, <fpage>3244</fpage>&#x02013;<lpage>3252</lpage>.<pub-id pub-id-type="doi">10.1152/jn.90400.2008</pub-id><pub-id pub-id-type="pmid">18842956</pub-id></citation></ref>
<ref id="B42"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pawlak</surname> <given-names>V.</given-names></name> <name><surname>Kerr</surname> <given-names>J. N.</given-names></name></person-group> (<year>2008</year>). <article-title>Dopamine receptor activation is required for corticostriatal spike-timing-dependent plasticity</article-title>. <source>J. Neurosci.</source> <volume>28</volume>, <fpage>2435</fpage>&#x02013;<lpage>2446</lpage>.<pub-id pub-id-type="doi">10.1523/JNEUROSCI.4402-07.2008</pub-id><pub-id pub-id-type="pmid">18322089</pub-id></citation></ref>
<ref id="B43"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Reynolds</surname> <given-names>J. N.</given-names></name> <name><surname>Hyland</surname> <given-names>B. I.</given-names></name> <name><surname>Wickens</surname> <given-names>J. R.</given-names></name></person-group> (<year>2001</year>). <article-title>A cellular mechanism of reward-related learning</article-title>. <source>Nature</source> <volume>413</volume>, <fpage>67</fpage>&#x02013;<lpage>70</lpage>.<pub-id pub-id-type="doi">10.1038/35092560</pub-id><pub-id pub-id-type="pmid">11544526</pub-id></citation></ref>
<ref id="B44"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Reynolds</surname> <given-names>J. N.</given-names></name> <name><surname>Wickens</surname> <given-names>J. R.</given-names></name></person-group> (<year>2002</year>). <article-title>Dopamine-dependent plasticity of corticostriatal synapses</article-title>. <source>Neural Netw.</source> <volume>15</volume>, <fpage>507</fpage>&#x02013;<lpage>521</lpage>.<pub-id pub-id-type="doi">10.1016/S0893-6080(02)00045-X</pub-id><pub-id pub-id-type="pmid">12371508</pub-id></citation></ref>
<ref id="B45"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ruskin</surname> <given-names>D. N.</given-names></name> <name><surname>Rawji</surname> <given-names>S. S.</given-names></name> <name><surname>Walters</surname> <given-names>J. R.</given-names></name></person-group> (<year>1998</year>). <article-title>Effects of full D1 dopamine receptor agonists on firing rates in the globus pallidus and substantia nigra pars compacta in vivo: tests for D1 receptor selectivity and comparisons to the partial agonist SKF 38393</article-title>. <source>J. Pharmacol. Exp. Ther.</source> <volume>286</volume>, <fpage>272</fpage>&#x02013;<lpage>281</lpage>.<pub-id pub-id-type="pmid">9655869</pub-id></citation></ref>
<ref id="B46"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Russo</surname> <given-names>S. J.</given-names></name> <name><surname>Dietz</surname> <given-names>D. M.</given-names></name> <name><surname>Dumitriu</surname> <given-names>D.</given-names></name> <name><surname>Morrison</surname> <given-names>J. H.</given-names></name> <name><surname>Malenka</surname> <given-names>R. C.</given-names></name> <name><surname>Nestler</surname> <given-names>E. J.</given-names></name></person-group> (<year>2010</year>). <article-title>The addicted synapse: mechanisms of synaptic and structural plasticity in nucleus accumbens</article-title>. <source>Trends Neurosci.</source> <volume>33</volume>, <fpage>267</fpage>&#x02013;<lpage>276</lpage>.<pub-id pub-id-type="doi">10.1016/j.tins.2010.02.002</pub-id><pub-id pub-id-type="pmid">20207024</pub-id></citation></ref>
<ref id="B47"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rutledge</surname> <given-names>R. B.</given-names></name> <name><surname>Lazzaro</surname> <given-names>S. C.</given-names></name> <name><surname>Lau</surname> <given-names>B.</given-names></name> <name><surname>Myers</surname> <given-names>C. E.</given-names></name> <name><surname>Gluck</surname> <given-names>M. A.</given-names></name> <name><surname>Glimcher</surname> <given-names>P. W.</given-names></name></person-group> (<year>2009</year>). <article-title>Dopaminergic drugs modulate learning rates and perseveration in Parkinson&#x00027;s patients in a dynamic foraging task</article-title>. <source>J. Neurosci.</source> <volume>29</volume>, <fpage>15104</fpage>&#x02013;<lpage>15114</lpage>.<pub-id pub-id-type="doi">10.1523/JNEUROSCI.3524-09.2009</pub-id><pub-id pub-id-type="pmid">19955362</pub-id></citation></ref>
<ref id="B48"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Satoh</surname> <given-names>T.</given-names></name> <name><surname>Nakai</surname> <given-names>S.</given-names></name> <name><surname>Sato</surname> <given-names>T.</given-names></name> <name><surname>Kimura</surname> <given-names>M.</given-names></name></person-group> (<year>2003</year>). <article-title>Correlated coding of motivation and outcome of decision by dopamine neurons</article-title>. <source>J. Neurosci.</source> <volume>23</volume>, <fpage>9913</fpage>&#x02013;<lpage>9923</lpage>.<pub-id pub-id-type="pmid">14586021</pub-id></citation></ref>
<ref id="B49"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schlesinger</surname> <given-names>I.</given-names></name> <name><surname>Erikh</surname> <given-names>I.</given-names></name> <name><surname>Yarnitsky</surname> <given-names>D.</given-names></name></person-group> (<year>2007</year>). <article-title>Paradoxical kinesia at war</article-title>. <source>Mov. Disord.</source> <volume>22</volume>, <fpage>2394</fpage>&#x02013;<lpage>2397</lpage>.<pub-id pub-id-type="doi">10.1002/mds.21739</pub-id><pub-id pub-id-type="pmid">17914720</pub-id></citation></ref>
<ref id="B50"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schultz</surname> <given-names>W.</given-names></name></person-group> (<year>1998</year>). <article-title>The phasic reward signal of primate dopamine neurons</article-title>. <source>Adv. Pharmacol.</source> <volume>42</volume>, <fpage>686</fpage>&#x02013;<lpage>690</lpage>.<pub-id pub-id-type="doi">10.1016/S1054-3589(08)60841-8</pub-id><pub-id pub-id-type="pmid">9327992</pub-id></citation></ref>
<ref id="B51"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schultz</surname> <given-names>W.</given-names></name> <name><surname>Dayan</surname> <given-names>P.</given-names></name> <name><surname>Montague</surname> <given-names>P. R.</given-names></name></person-group> (<year>1997</year>). <article-title>A neural substrate of prediction and reward</article-title>. <source>Science</source> <volume>275</volume>, <fpage>1593</fpage>&#x02013;<lpage>1599</lpage>.<pub-id pub-id-type="doi">10.1126/science.275.5306.1593</pub-id><pub-id pub-id-type="pmid">9054347</pub-id></citation></ref>
<ref id="B52"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shannon</surname> <given-names>C. E.</given-names></name></person-group> (<year>1959</year>). <article-title>Coding theorems for a discrete source with a fidelity criterion</article-title>. <source>IRE Nat. Conv. Rec.</source> <volume>4</volume>, <fpage>142</fpage>&#x02013;<lpage>163</lpage>.</citation></ref>
<ref id="B53"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shen</surname> <given-names>W.</given-names></name> <name><surname>Flajolet</surname> <given-names>M.</given-names></name> <name><surname>Greengard</surname> <given-names>P.</given-names></name> <name><surname>Surmeier</surname> <given-names>D. J.</given-names></name></person-group> (<year>2008</year>). <article-title>Dichotomous dopaminergic control of striatal synaptic plasticity</article-title>. <source>Science</source> <volume>321</volume>, <fpage>848</fpage>&#x02013;<lpage>851</lpage>.<pub-id pub-id-type="doi">10.1126/science.1160575</pub-id><pub-id pub-id-type="pmid">18687967</pub-id></citation></ref>
<ref id="B54"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Slovin</surname> <given-names>H.</given-names></name> <name><surname>Abeles</surname> <given-names>M.</given-names></name> <name><surname>Vaadia</surname> <given-names>E.</given-names></name> <name><surname>Haalman</surname> <given-names>I.</given-names></name> <name><surname>Prut</surname> <given-names>Y.</given-names></name> <name><surname>Bergman</surname> <given-names>H.</given-names></name></person-group> (<year>1999</year>). <article-title>Frontal cognitive impairments and saccadic deficits in low-dose MPTP-treated monkeys</article-title>. <source>J. Neurophysiol.</source> <volume>81</volume>, <fpage>858</fpage>&#x02013;<lpage>874</lpage>.<pub-id pub-id-type="pmid">10036286</pub-id></citation></ref>
<ref id="B55"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Stefani</surname> <given-names>A.</given-names></name> <name><surname>Stanzione</surname> <given-names>P.</given-names></name> <name><surname>Bassi</surname> <given-names>A.</given-names></name> <name><surname>Mazzone</surname> <given-names>P.</given-names></name> <name><surname>Vangelista</surname> <given-names>T.</given-names></name> <name><surname>Bernardi</surname> <given-names>G.</given-names></name></person-group> (<year>1997</year>). <article-title>Effects of increasing doses of apomorphine during stereotaxic neurosurgery in Parkinson&#x00027;s disease: clinical score and internal globus pallidus activity. Short communication</article-title>. <source>J. Neural. Transm.</source> <volume>104</volume>, <fpage>895</fpage>&#x02013;<lpage>904</lpage>.<pub-id pub-id-type="pmid">9451721</pub-id></citation></ref>
<ref id="B56"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Surmeier</surname> <given-names>D. J.</given-names></name> <name><surname>Ding</surname> <given-names>J.</given-names></name> <name><surname>Day</surname> <given-names>M.</given-names></name> <name><surname>Wang</surname> <given-names>Z.</given-names></name> <name><surname>Shen</surname> <given-names>W.</given-names></name></person-group> (<year>2007</year>). <article-title>D1 and D2 dopamine-receptor modulation of striatal glutamatergic signaling in striatal medium spiny neurons</article-title>. <source>Trends Neurosci.</source> <volume>30</volume>, <fpage>228</fpage>&#x02013;<lpage>235</lpage>.<pub-id pub-id-type="doi">10.1016/j.tins.2007.03.008</pub-id><pub-id pub-id-type="pmid">17408758</pub-id></citation></ref>
<ref id="B57"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Sutton</surname> <given-names>R. S.</given-names></name> <name><surname>Barto</surname> <given-names>A. G.</given-names></name></person-group> (<year>1998</year>). <source>Reinforcement Learning &#x02013; An Introduction</source>. <publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>The MIT Press</publisher-name>.</citation></ref>
<ref id="B58"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Talmi</surname> <given-names>D.</given-names></name> <name><surname>Dayan</surname> <given-names>P.</given-names></name> <name><surname>Kiebel</surname> <given-names>S. J.</given-names></name> <name><surname>Frith</surname> <given-names>C. D.</given-names></name> <name><surname>Dolan</surname> <given-names>R. J.</given-names></name></person-group> (<year>2009</year>). <article-title>How humans integrate the prospects of pain and reward during choice</article-title>. <source>J. Neurosci.</source> <volume>29</volume>, <fpage>14617</fpage>&#x02013;<lpage>14626</lpage>.<pub-id pub-id-type="doi">10.1523/JNEUROSCI.2026-09.2009</pub-id><pub-id pub-id-type="pmid">19923294</pub-id></citation></ref>
<ref id="B59"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tepper</surname> <given-names>J. M.</given-names></name> <name><surname>Wilson</surname> <given-names>C. J.</given-names></name> <name><surname>Koos</surname> <given-names>T.</given-names></name></person-group> (<year>2008</year>). <article-title>Feedforward and feedback inhibition in neostriatal GABAergic spiny neurons</article-title>. <source>Brain Res. Rev.</source> <volume>58</volume>, <fpage>272</fpage>&#x02013;<lpage>281</lpage>.<pub-id pub-id-type="pmid">18054796</pub-id></citation></ref>
<ref id="B60"><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Tishby</surname> <given-names>N.</given-names></name> <name><surname>Pereira</surname> <given-names>F.</given-names></name> <name><surname>Bialek</surname> <given-names>W.</given-names></name></person-group> (<year>1999</year>). <article-title>&#x0201C;The information bottleneck method,&#x0201D;</article-title> in <conf-name>The 37th Annual Allerton Conference on Communication, Control, and Computing</conf-name>, <conf-loc>Allerton</conf-loc>.</citation></ref>
<ref id="B61"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Tishby</surname> <given-names>N.</given-names></name> <name><surname>Polani</surname> <given-names>D.</given-names></name></person-group> (<year>2010</year>). <article-title>&#x0201C;Information theory of decisions and actions,&#x0201D;</article-title> in <source>Perception-Reason-Action Cycle: Models, Algorithms and Systems</source> eds <person-group person-group-type="editor"><name><surname>Vassilis</surname> <given-names>C.</given-names></name> <name><surname>Polani</surname> <given-names>D.</given-names></name> <name><surname>Hussain</surname> <given-names>A.</given-names></name> <name><surname>Tishby</surname> <given-names>N.</given-names></name> <name><surname>Taylor</surname> <given-names>J. G.</given-names></name></person-group> (<publisher-loc>New York</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>601</fpage>&#x02013;<lpage>636</lpage>.</citation></ref>
<ref id="B62"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vulkan</surname> <given-names>N.</given-names></name></person-group> (<year>2000</year>). <article-title>An economist&#x00027;s perspective on probability matching</article-title>. <source>J. Econ. Surv.</source> <volume>14</volume>, <fpage>101</fpage>&#x02013;<lpage>118</lpage>.</citation></ref>
<ref id="B63"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wiecki</surname> <given-names>T. V.</given-names></name> <name><surname>Frank</surname> <given-names>M. J.</given-names></name></person-group> (<year>2010</year>). <article-title>Neurocomputational models of motor and cognitive deficits in Parkinson&#x00027;s disease</article-title>. <source>Prog. Brain Res.</source> <volume>183</volume>, <fpage>275</fpage>&#x02013;<lpage>297</lpage>.<pub-id pub-id-type="doi">10.1016/S0079-6123(10)83014-6</pub-id><pub-id pub-id-type="pmid">20696325</pub-id></citation></ref>
</ref-list>
</back>
</article>
