<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="review-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Robot. AI</journal-id>
<journal-title>Frontiers in Robotics and AI</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Robot. AI</abbrev-journal-title>
<issn pub-type="epub">2296-9144</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/frobt.2019.00153</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Robotics and AI</subject>
<subj-group>
<subject>Review</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Symbolic, Distributed, and Distributional Representations for Natural Language Processing in the Era of Deep Learning: A Survey</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Ferrone</surname> <given-names>Lorenzo</given-names></name>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Zanzotto</surname> <given-names>Fabio Massimo</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/656611/overview"/>
</contrib>
</contrib-group>
<aff><institution>Department of Enterprise Engineering, University of Rome Tor Vergata</institution>, <addr-line>Rome</addr-line>, <country>Italy</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Giovanni Luca Christian Masala, Manchester Metropolitan University, United Kingdom</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Nicola Di Mauro, University of Bari Aldo Moro, Italy; Marco Pota, Institute for High Performance Computing and Networking (ICAR), Italy</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Fabio Massimo Zanzotto <email>fabio.massimo.zanzotto&#x00040;uniroma2.it</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Computational Intelligence in Robotics, a section of the journal Frontiers in Robotics and AI</p></fn></author-notes>
<pub-date pub-type="epub">
<day>21</day>
<month>01</month>
<year>2020</year>
</pub-date>
<pub-date pub-type="collection">
<year>2019</year>
</pub-date>
<volume>6</volume>
<elocation-id>153</elocation-id>
<history>
<date date-type="received">
<day>05</day>
<month>05</month>
<year>2019</year>
</date>
<date date-type="accepted">
<day>20</day>
<month>12</month>
<year>2019</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2020 Ferrone and Zanzotto.</copyright-statement>
<copyright-year>2020</copyright-year>
<copyright-holder>Ferrone and Zanzotto</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract><p>Natural language is inherently a discrete symbolic representation of human knowledge. Recent advances in machine learning (ML) and in natural language processing (NLP) seem to contradict the above intuition: discrete symbols are fading away, erased by vectors or tensors called <italic>distributed</italic> and <italic>distributional representations</italic>. However, there is a strict link between distributed/distributional representations and discrete symbols, the former being an approximation of the latter. A clearer understanding of this link may lead to radically new deep learning networks. In this paper we present a survey that aims to renew the link between symbolic representations and distributed/distributional representations. This is the right time to revitalize the area of interpreting how discrete symbols are represented inside neural networks.</p></abstract>
<kwd-group>
<kwd>natural language processing (NLP)</kwd>
<kwd>distributed representation</kwd>
<kwd>concatenative compositionality</kwd>
<kwd>deep learning (DL)</kwd>
<kwd>compositional distributional semantic models</kwd>
<kwd>compositionality</kwd>
</kwd-group>
<counts>
<fig-count count="3"/>
<table-count count="1"/>
<equation-count count="45"/>
<ref-count count="78"/>
<page-count count="15"/>
<word-count count="12337"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Natural language is inherently a discrete symbolic representation of human knowledge. Sounds are transformed into letters or ideograms and these discrete symbols are composed to obtain words. Words then form sentences and sentences form texts, discourses, dialogs, which ultimately convey knowledge, emotions, and so on. This composition of symbols into words and of words into sentences follows rules that both the hearer and the speaker know (Chomsky, <xref ref-type="bibr" rid="B11">1957</xref>). Hence, it seems extremely odd to think of natural language understanding systems that are not based on discrete symbols.</p>
<p>Recent advances in machine learning (ML) applied to natural language processing (NLP) seem to contradict the above intuition: discrete symbols are fading away, erased by vectors or tensors called <italic>distributed</italic> and <italic>distributional representations</italic>. In ML applied to NLP, <italic>distributed representations</italic> are pushing deep learning models (LeCun et al., <xref ref-type="bibr" rid="B40">2015</xref>; Schmidhuber, <xref ref-type="bibr" rid="B58">2015</xref>) toward amazing results in many high-level tasks such as image generation (Goodfellow et al., <xref ref-type="bibr" rid="B26">2014</xref>), image captioning (Vinyals et al., <xref ref-type="bibr" rid="B69">2015b</xref>; Xu et al., <xref ref-type="bibr" rid="B72">2015</xref>), machine translation (Zou et al., <xref ref-type="bibr" rid="B78">2013</xref>; Bahdanau et al., <xref ref-type="bibr" rid="B3">2015</xref>), syntactic parsing (Vinyals et al., <xref ref-type="bibr" rid="B68">2015a</xref>; Weiss et al., <xref ref-type="bibr" rid="B70">2015</xref>) and in a variety of other NLP tasks (Devlin et al., <xref ref-type="bibr" rid="B17">2019</xref>). In more traditional NLP, <italic>distributional representations</italic> are pursued as a more flexible way to represent the semantics of natural language, the so-called <italic>distributional semantics</italic> (see Turney and Pantel, <xref ref-type="bibr" rid="B64">2010</xref>). Words as well as sentences are represented as vectors or tensors of real numbers. Vectors for words are obtained by observing how these words co-occur with other words in document collections. 
Moreover, as in traditional compositional representations, vectors for phrases (Clark et al., <xref ref-type="bibr" rid="B12">2008</xref>; Mitchell and Lapata, <xref ref-type="bibr" rid="B46">2008</xref>; Baroni and Zamparelli, <xref ref-type="bibr" rid="B6">2010</xref>; Zanzotto et al., <xref ref-type="bibr" rid="B75">2010</xref>; Grefenstette and Sadrzadeh, <xref ref-type="bibr" rid="B28">2011</xref>) and sentences (Socher et al., <xref ref-type="bibr" rid="B60">2011</xref>, <xref ref-type="bibr" rid="B61">2012</xref>; Kalchbrenner and Blunsom, <xref ref-type="bibr" rid="B37">2013</xref>) are obtained by composing vectors for words.</p>
<p>The success of distributed and distributional representations over symbolic approaches is mainly due to the advent of new parallel paradigms that pushed neural networks (Rosenblatt, <xref ref-type="bibr" rid="B54">1958</xref>; Werbos, <xref ref-type="bibr" rid="B71">1974</xref>) toward deep learning (LeCun et al., <xref ref-type="bibr" rid="B40">2015</xref>; Schmidhuber, <xref ref-type="bibr" rid="B58">2015</xref>). Massively parallel algorithms running on Graphics Processing Units (GPUs) (Chetlur et al., <xref ref-type="bibr" rid="B10">2014</xref>; Cui et al., <xref ref-type="bibr" rid="B15">2015</xref>) crunch vectors, matrices, and tensors faster than was possible decades ago. The back-propagation algorithm can now be computed for complex and large neural networks. Symbols are no longer needed during &#x0201C;reasoning.&#x0201D; Hence, discrete symbols only survive as inputs and outputs of these wonderful learning machines.</p>
<p>However, there is a strict link between distributed/distributional representations and symbols, the former being an approximation of the latter (Fodor and Pylyshyn, <xref ref-type="bibr" rid="B22">1988</xref>; Plate, <xref ref-type="bibr" rid="B52">1994</xref>, <xref ref-type="bibr" rid="B53">1995</xref>; Ferrone et al., <xref ref-type="bibr" rid="B19">2015</xref>). The representation of the input and the output of these networks is not that far from their internal representation. The similarity and the interpretation of the internal representation are clearer in image processing (Zeiler and Fergus, <xref ref-type="bibr" rid="B76">2014a</xref>). In fact, networks are generally interpreted by visualizing how subparts represent salient subparts of target images. Both input images and subparts are tensors of real numbers. Hence, these networks can be examined and understood. The same does not apply to natural language processing with its discrete symbols.</p>
<p>A clearer understanding of the strict link between distributed/distributional representations and discrete symbols is needed (Jacovi et al., <xref ref-type="bibr" rid="B34">2018</xref>; Jang et al., <xref ref-type="bibr" rid="B35">2018</xref>) to understand how neural networks treat information and to propose novel deep learning architectures. Model interpretability is becoming an important topic in machine learning in general (Lipton, <xref ref-type="bibr" rid="B42">2018</xref>). This clearer understanding opens a new range of possibilities: understanding which parts of the current symbolic techniques for natural language processing have a sufficient representation in deep neural networks; and, ultimately, understanding whether a more brain-like model (the neural network) is compatible with the methods for syntactic parsing or semantic processing that have been defined over decades of studies in computational linguistics and natural language processing. There is thus a tremendous opportunity to understand whether and how symbolic representations are used and emitted in a brain model.</p>
<p>In this paper we present a survey that aims to draw the link between symbolic representations and distributed/distributional representations. This is the right time to revitalize the area of interpreting how symbols are represented inside neural networks. In our opinion, this survey will help to devise new deep neural networks that can exploit existing and novel symbolic models of classical natural language processing tasks.</p>
<p>The paper is structured as follows: first we give an introduction to the very general concept of representation, the notion of <italic>concatenative composition</italic> and the difference between <italic>local</italic> and <italic>distributed</italic> representations (Plate, <xref ref-type="bibr" rid="B53">1995</xref>). After that we present each technique in detail. Afterwards, we focus on distributional representations (Turney and Pantel, <xref ref-type="bibr" rid="B64">2010</xref>), which we treat as a specific example of a distributed representation. Finally, we discuss in more depth the general issue of compositionality, analyzing three different approaches to the problem: compositional distributional semantics (Clark et al., <xref ref-type="bibr" rid="B12">2008</xref>; Baroni et al., <xref ref-type="bibr" rid="B4">2014</xref>), holographic reduced representations (Plate, <xref ref-type="bibr" rid="B52">1994</xref>; Neumann, <xref ref-type="bibr" rid="B49">2001</xref>), and recurrent neural networks (Socher et al., <xref ref-type="bibr" rid="B61">2012</xref>; Kalchbrenner and Blunsom, <xref ref-type="bibr" rid="B37">2013</xref>).</p>
</sec>
<sec id="s2">
<title>2. Symbolic and Distributed Representations: Interpretability and <italic>Concatenative</italic> Compositionality</title>
<p><italic>Distributed representations</italic> put symbolic expressions in metric spaces where similarity among examples is used to learn regularities for specific tasks by using neural networks or other machine learning models. Given two symbolic expressions, their distributed representation should capture their similarity along specific features useful for the final task. For example, two sentences such as <italic>s</italic><sub>1</sub> &#x0003D; &#x0201C;<italic>a mouse eats some cheese&#x0201D;</italic> and <italic>s</italic><sub>2</sub> &#x0003D; &#x0201C;<italic>a cat swallows a mouse&#x0201D;</italic> can be considered similar in many different ways: (1) number of words in common; (2) realization of the pattern &#x0201C;<monospace>ANIMAL EATS FOOD</monospace>.&#x0201D; The key point is to decide or to let an algorithm decide which is the best representation for a specific task.</p>
<p><italic>Distributed representations</italic> are then replacing long-lasting, successful <italic>discrete symbolic representations</italic> in representing knowledge for learning machines, but these representations are less human-<italic>interpretable</italic>. Hence, discussing the basic, obvious properties of <italic>discrete symbolic representations</italic> is worthwhile, as these properties may guarantee distributed representations a success similar to that of discrete symbolic representations.</p>
<p>Discrete symbolic representations are human <italic>interpretable</italic> as <italic>symbols are not altered in expressions</italic>. This is one of the most important, obvious features of these representations. Infinite sets of expressions, which are sequences of symbols, can be <italic>interpreted</italic> as these expressions are obtained by concatenating a finite set of basic symbols according to some concatenative rules. During concatenation, symbols are not altered and, then, can be recognized. By using the principle of <italic>semantic compositionality</italic>, the meaning of expressions can be obtained by combining the meaning of the parts and, hence, recursively, by combining the meaning of the finite set of basic symbols. For example, given the set of basic symbols <inline-formula><mml:math id="M1"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow><mml:mo>=</mml:mo></mml:math></inline-formula> {<italic>mouse, cat, a, swallows, (</italic>,<italic>)</italic>}, expressions like:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M2"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x0201C;</mml:mo><mml:mi>a</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mi>s</mml:mi><mml:mi>w</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>w</mml:mi><mml:mi>s</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mi>a</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mi>m</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mo>&#x0201D;</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E2"><label>(2)</label><mml:math id="M3"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>(</mml:mo><mml:mo>(</mml:mo><mml:mi>a</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mo>)</mml:mo><mml:mo>(</mml:mo><mml:mi>s</mml:mi><mml:mi>w</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>w</mml:mi><mml:mi>s</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mo>(</mml:mo><mml:mi>a</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mi>m</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mo>)</mml:mo><mml:mo>)</mml:mo><mml:mo>)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>are totally plausible and interpretable given rules for producing natural language utterances or for producing tree structured representations in parenthetical form, respectively. This strongly depends on the fact that individual symbols can be recognized.</p>
<p>Distributed representations instead seem to <italic>alter symbols</italic> when applied to symbolic inputs and, thus, are less interpretable. In fact, symbols as well as expressions are represented as vectors in these metric spaces. Observing distributed representations, symbols and expressions do not immediately emerge. Moreover, these distributed representations may be transformed by using matrix multiplication or by using non-linear functions. Hence, it is generally unclear: (1) what is the relation between the initial symbols or expressions and their distributed representations and (2) how these expressions are manipulated during matrix multiplication or when applying non-linear functions. In other words, it is unclear whether symbols can be recognized in distributed representations.</p>
<p>Hence, a debated question is whether discrete symbolic representations and distributed representations are two very different ways of encoding knowledge because of the difference in <italic>altering symbols</italic>. The debate dates back to the late 1980s. For Fodor and Pylyshyn (<xref ref-type="bibr" rid="B22">1988</xref>), distributed representations in Neural Network architectures are &#x0201C;<italic>only an implementation of the Classical approach&#x0201D;</italic> where the classical approach is related to discrete symbolic representations. In contrast, for Chalmers (<xref ref-type="bibr" rid="B9">1992</xref>), distributed representations give the important opportunity to reason &#x0201C;<italic>holistically&#x0201D;</italic> about encoded knowledge. This means that decisions over some specific part of the stored knowledge can be taken without retrieving the specific part but acting on the whole representation. However, this does not settle the debated question as it is still unclear what is in a distributed representation.</p>
<p>To contribute to the above debated question, Gelder (<xref ref-type="bibr" rid="B24">1990</xref>) has formalized the property of <italic>altering symbols in expressions</italic> by defining two different notions of compositionality: <italic>concatenative</italic> compositionality and <italic>functional</italic> compositionality.</p>
<p><italic>Concatenative compositionality</italic> explains how discrete symbolic representations compose symbols to obtain expressions. In fact, the mode of combination is an extended concept of juxtaposition that provides a way of linking successive symbols without altering them as these form expressions. Concatenative compositionality explains discrete symbolic representations regardless of the medium used to store expressions: a piece of paper or a computer memory. Concatenation is sometimes expressed with an operator like &#x02218;, which can be used in an infix or prefix notation, that is, a sort of function with arguments &#x02218;(<italic>w</italic><sub>1</sub>, ..., <italic>w</italic><sub><italic>n</italic></sub>). By using the operator for concatenation, the two above examples <italic>s</italic><sub>1</sub> and <italic>t</italic><sub>1</sub> can be represented as the following:</p>
<disp-formula id="E3"><mml:math id="M4"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>a</mml:mi><mml:mo>&#x02218;</mml:mo><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mo>&#x02218;</mml:mo><mml:mi>s</mml:mi><mml:mi>w</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>w</mml:mi><mml:mi>s</mml:mi><mml:mo>&#x02218;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x02218;</mml:mo><mml:mi>m</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>that represents a sequence with the infix notation and</p>
<disp-formula id="E4"><mml:math id="M5"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mo>&#x02218;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>&#x02218;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x02218;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mi>w</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>w</mml:mi><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mo>&#x02218;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>that represents a tree with the prefix notation.</p>
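<p>As an illustrative sketch only, concatenation can be mimicked with plain Python tuples, under the assumption that a symbol is a string and an expression is a (possibly nested) tuple. The names <monospace>concat</monospace> and <monospace>leaves</monospace> are hypothetical helpers introduced here, not part of the surveyed work. The point is that symbols are stored unaltered, so they can be read back from both the sequence and the tree.</p>
<preformat>
```python
# Assumption: a symbol is a str, an expression is a (possibly nested) tuple.

def concat(*parts):
    """Prefix-style concatenation: group the arguments without altering them."""
    return tuple(parts)


def leaves(expr):
    """Read the basic symbols back, left to right, from a sequence or a tree."""
    if isinstance(expr, str):
        return [expr]
    symbols = []
    for part in expr:
        symbols.extend(leaves(part))
    return symbols


# Sequence s1 (infix chain of concatenations, flattened).
s1 = concat("a", "cat", "swallows", "a", "mouse")

# Tree t1 in prefix notation: ((a cat) (swallows (a mouse))).
t1 = concat(concat("a", "cat"), concat("swallows", concat("a", "mouse")))

print(leaves(s1))
print(leaves(t1))
```
</preformat>
<p>In this sketch both structures yield the same list of unaltered symbols, while the tree additionally keeps its bracketing.</p>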
<p><italic>Functional compositionality</italic> explains compositionality in distributed representations and in semantics. In functional compositionality, the mode of combination is a function &#x003A6; that gives a reliable, general process for producing expressions given their constituents. Within this perspective, semantic compositionality is a special case of functional compositionality where the target of the composition is a meaning representation (Blutner et al., <xref ref-type="bibr" rid="B8">2003</xref>).</p>
<p><italic>Local distributed representations</italic> (as referred to in Plate, <xref ref-type="bibr" rid="B53">1995</xref>) or <italic>one-hot encodings</italic> are the easiest way to visualize how <italic>functional compositionality</italic> acts on <italic>distributed representations</italic>. Local distributed representations give a first, simple encoding of discrete symbolic representations in a metric space. Given a set of symbols <inline-formula><mml:math id="M6"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula>, a local distributed representation maps the <italic>i</italic>-th symbol in <inline-formula><mml:math id="M7"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> to the <italic>i</italic>-th base unit vector <bold>e</bold><sub><italic>i</italic></sub> in &#x0211D;<sup><italic>n</italic></sup>, where <italic>n</italic> is the cardinality of <inline-formula><mml:math id="M8"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula>. Hence, the <italic>i</italic>-th unit vector represents the <italic>i</italic>-th symbol. In <italic>functional compositionality</italic>, expressions <italic>s</italic> &#x0003D; <italic>w</italic><sub>1</sub>&#x02026;<italic>w</italic><sub><italic>k</italic></sub> are represented by vectors <bold>s</bold> obtained with a possibly recursive function &#x003A6; applied to vectors <bold>e</bold><sub><italic>w</italic><sub>1</sub></sub>&#x02026;<bold>e</bold><sub><italic>w</italic><sub><italic>k</italic></sub></sub>. The function &#x003A6; may be as simple as the sum or more complex. In case the function &#x003A6; is the sum, that is:</p>
<disp-formula id="E5"><label>(3)</label><mml:math id="M9"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>func</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mo>&#x003A3;</mml:mo></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>e</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>the derived vector is the classical bag-of-words vector space model (Salton, <xref ref-type="bibr" rid="B57">1989</xref>). More complex functions &#x003A6; can range from different vector-to-vector operations like circular convolution in Holographic Reduced Representations (Plate, <xref ref-type="bibr" rid="B53">1995</xref>) to matrix multiplications plus non-linear operations in models such as recurrent neural networks (Hochreiter and Schmidhuber, <xref ref-type="bibr" rid="B33">1997</xref>; Schuster and Paliwal, <xref ref-type="bibr" rid="B59">1997</xref>) or neural networks with attention (Vaswani et al., <xref ref-type="bibr" rid="B65">2017</xref>; Devlin et al., <xref ref-type="bibr" rid="B17">2019</xref>). Example <italic>s</italic><sub>1</sub> in Equation (1) can be useful to describe <italic>functional</italic> compositionality. The set <inline-formula><mml:math id="M10"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow><mml:mo>=</mml:mo></mml:math></inline-formula> {<italic>mouse, cat, a, swallows, eats, some, cheese, (</italic>,<italic>)</italic>} may be represented with the base vectors <inline-formula><mml:math id="M11"><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>e</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mn>9</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> where <bold>e</bold><sub>1</sub> is the base vector for <italic>mouse</italic>, <bold>e</bold><sub>2</sub> for <italic>cat</italic>, <bold>e</bold><sub>3</sub> for <italic>a</italic>, <bold>e</bold><sub>4</sub> for <italic>swallows</italic>, <bold>e</bold><sub>5</sub> for <italic>eats</italic>, <bold>e</bold><sub>6</sub> for <italic>some</italic>, <bold>e</bold><sub>7</sub> for <italic>cheese</italic>, <bold>e</bold><sub>8</sub> for <italic>(</italic>, and <bold>e</bold><sub>9</sub> for <italic>)</italic>. The additive functional composition of the expression <italic>s</italic><sub>1</sub> &#x0003D; <italic>a cat swallows a mouse</italic> is then:</p>
<p><inline-graphic xlink:href="frobt-06-00153-i0001.tif"/></p>
<p>where the concatenative operator &#x02218; has been substituted with the sum &#x0002B;. Note that, in the additive functional composition <bold>func<sub>&#x003A3;</sub>(s<sub>1</sub>)</bold>, symbols are still visible but the sequence is lost. In fact, it is difficult to reproduce the initial discrete symbolic expression. However, the additive composition function does make it possible to compare two expressions. Given the expressions <italic>s</italic><sub>1</sub> and <italic>s</italic><sub>2</sub> &#x0003D; <italic>a mouse eats some cheese</italic>, the dot product between <bold>func<sub>&#x003A3;</sub>(s<sub>1</sub>)</bold> and <inline-formula><mml:math id="M12"><mml:mstyle mathvariant='bold'><mml:mtext>fun</mml:mtext></mml:mstyle><mml:msub><mml:mstyle mathvariant='bold'><mml:mtext>c</mml:mtext></mml:mstyle><mml:mo>&#x02211;</mml:mo></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mstyle mathvariant='bold'><mml:mtext>s</mml:mtext></mml:mstyle><mml:mn>2</mml:mn></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mtext>&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mtext>&#x000A0;</mml:mtext><mml:mn>0</mml:mn><mml:mtext>&#x000A0;</mml:mtext><mml:mn>1</mml:mn><mml:mtext>&#x000A0;</mml:mtext><mml:mn>0</mml:mn><mml:mtext>&#x000A0;</mml:mtext><mml:mn>1</mml:mn><mml:mtext>&#x000A0;</mml:mtext><mml:mn>1</mml:mn><mml:mtext>&#x000A0;</mml:mtext><mml:mn>1</mml:mn><mml:mtext>&#x000A0;</mml:mtext><mml:mn>0</mml:mn><mml:mtext>&#x000A0;</mml:mtext><mml:mn>0</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula> counts the words the two expressions have in common. In a functional composition with a function &#x003A6;, the expression <italic>s</italic><sub>1</sub> may become <bold>func<sub>&#x003A6;</sub>(s<sub>1</sub>)</bold> &#x0003D; &#x003A6;(&#x003A6;(&#x003A6;(&#x003A6;(<bold>e<sub>3</sub></bold>, <bold>e<sub>2</sub></bold>), <bold>e<sub>4</sub></bold>), <bold>e<sub>3</sub></bold>), <bold>e<sub>1</sub></bold>) by following the concatenative compositionality of the discrete symbolic expression. The same functional compositional principle can be applied to discrete symbolic trees such as <italic>t</italic><sub>1</sub> by producing this distributed representation &#x003A6;(&#x003A6;(<bold>e<sub>3</sub></bold>, <bold>e<sub>2</sub></bold>), &#x003A6;(<bold>e<sub>4</sub></bold>, &#x003A6;(<bold>e<sub>3</sub></bold>, <bold>e<sub>1</sub></bold>))). Finally, in the functional composition with a generic recursive function <bold>func<sub>&#x003A6;</sub>(s<sub>1</sub>)</bold>, the function &#x003A6; will be crucial to determine whether symbols can be recognized and the sequence is preserved.</p>
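<p>The additive composition over one-hot vectors can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the symbol ordering follows the text (<italic>mouse</italic> first, <italic>)</italic> last, with 0-based indices in the code), and the names <monospace>SYMBOLS</monospace>, <monospace>one_hot</monospace>, <monospace>func_sum</monospace> and <monospace>dot</monospace> are hypothetical, chosen here for readability.</p>
<preformat>
```python
# Ordering follows the text: e1 = mouse, ..., e9 = ")" (0-based in the code).
SYMBOLS = ["mouse", "cat", "a", "swallows", "eats", "some", "cheese", "(", ")"]


def one_hot(word):
    """Local (one-hot) representation of a single symbol."""
    vec = [0] * len(SYMBOLS)
    vec[SYMBOLS.index(word)] = 1
    return vec


def func_sum(expression):
    """Additive functional composition: sum of the one-hot vectors of the words."""
    total = [0] * len(SYMBOLS)
    for word in expression.split():
        for i, value in enumerate(one_hot(word)):
            total[i] += value
    return total


def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


v1 = func_sum("a cat swallows a mouse")
v2 = func_sum("a mouse eats some cheese")

print(v1)            # symbols remain visible as counts
print(dot(v1, v2))   # overlap between the two bags of words

# Word order is lost: a permutation of s1 gives the same vector.
print(func_sum("a mouse swallows a cat") == v1)
```
</preformat>
<p>In this sketch the dot product is 3 rather than 2 because <italic>a</italic> occurs twice in <italic>s</italic><sub>1</sub>: the product weights each shared word by its counts in the two expressions.</p>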
<p><italic>Distributed representations</italic> in their general form are more ambitious than distributed <italic>local</italic> representations and tend to encode basic symbols of <inline-formula><mml:math id="M13"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> in vectors in &#x0211D;<sup><italic>d</italic></sup> where <italic>d</italic> &#x0226A; <italic>n</italic>. These vectors generally alter symbols as there is not a direct link between symbols and dimensions of the space. Given a distributed local representation <bold>e</bold><sub><italic>w</italic></sub> of a symbol <italic>w</italic>, the encoder for a distributed representation is a matrix <bold>W<sub>d&#x000D7;n</sub></bold> that transforms <bold>e</bold><sub><italic>w</italic></sub> into <bold>y</bold><sub><italic>w</italic></sub> &#x0003D; <bold>W<sub>d&#x000D7;n</sub>e</bold><sub><italic>w</italic></sub>. As an example, the encoding matrix <bold>W<sub>d&#x000D7;n</sub></bold> can be built by modeling words in <inline-formula><mml:math id="M14"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> around three dimensions: number of vowels, number of consonants and, finally, number of non-alphabetic symbols. Given these dimensions, the matrix <bold>W<sub>3 &#x000D7;9</sub></bold> for the example is:</p>
<disp-formula id="E6"><label>(4)</label><mml:math id="M15"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mstyle mathvariant='bold'><mml:mtext>W</mml:mtext></mml:mstyle><mml:mrow><mml:mstyle mathvariant='bold'><mml:mn>3</mml:mn><mml:mo>&#x000D7;</mml:mo><mml:mn>9</mml:mn></mml:mstyle></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtable><mml:mtr><mml:mtd><mml:mn>3</mml:mn></mml:mtd><mml:mtd><mml:mn>1</mml:mn></mml:mtd><mml:mtd><mml:mn>1</mml:mn></mml:mtd><mml:mtd><mml:mn>2</mml:mn></mml:mtd><mml:mtd><mml:mn>2</mml:mn></mml:mtd><mml:mtd><mml:mn>2</mml:mn></mml:mtd><mml:mtd><mml:mn>3</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>2</mml:mn></mml:mtd><mml:mtd><mml:mn>2</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>6</mml:mn></mml:mtd><mml:mtd><mml:mn>2</mml:mn></mml:mtd><mml:mtd><mml:mn>2</mml:mn></mml:mtd><mml:mtd><mml:mn>3</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>1</mml:mn></mml:mtd><mml:mtd><mml:mn>1</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>This is a simple example of a <italic>distributed</italic> representation. In a distributed representation (Hinton et al., <xref ref-type="bibr" rid="B32">1986</xref>; Plate, <xref ref-type="bibr" rid="B53">1995</xref>) the informational content is distributed (hence the name) among multiple units, and at the same time each unit can contribute to the representation of multiple elements. A distributed representation has two evident advantages with respect to a distributed local representation: it is more efficient (in the example, the representation uses only 3 numbers instead of 9) and it does not treat each element as being equally different from any other. In fact, <italic>mouse</italic> and <italic>cat</italic> in this representation are more similar than <italic>mouse</italic> and <italic>a</italic>. In other words, this representation captures by construction something interesting about the set of symbols. The drawback is that symbols are altered and, hence, it may be difficult to determine which symbol a given distributed representation encodes. In the example, the distributed representations for <italic>eats</italic> and <italic>some</italic> are exactly the same vector <bold>W<sub>3 &#x000D7;9</sub> e<sub>5</sub></bold> &#x0003D; <bold>W<sub>3 &#x000D7;9</sub> e<sub>6</sub></bold>.</p>
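<p>The properties discussed above can be checked with a minimal Python sketch. It copies the matrix of Equation (4), encodes one-hot vectors, and compares the results. The helper names (<monospace>one_hot</monospace>, <monospace>encode</monospace>, <monospace>cosine</monospace>) and the choice of cosine similarity are illustrative assumptions, as is the assignment of columns 1 to 3 to <italic>mouse</italic>, <italic>cat</italic> and <italic>a</italic>, which is inferred from their letter counts.</p>
<preformat><![CDATA[
```python
# Encoding matrix W (3 x 9) copied from Equation (4).
# Rows: vowels, consonants, non-alphabetic symbols; one column per symbol.
W = [
    [3, 1, 1, 2, 2, 2, 3, 0, 0],
    [2, 2, 0, 6, 2, 2, 3, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
]


def one_hot(i, n=9):
    """Distributed local representation e_i (1-based index, as in the text)."""
    return [1 if j == i - 1 else 0 for j in range(n)]


def encode(e):
    """Distributed representation y = W e."""
    return [sum(w * x for w, x in zip(row, e)) for row in W]


def cosine(u, v):
    """Cosine similarity (an illustrative choice of similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)


# Columns 5 and 6 are "eats" and "some", as stated in the text.
y_eats, y_some = encode(one_hot(5)), encode(one_hot(6))

# Assumed columns: 1 = mouse, 2 = cat, 3 = a (inferred from letter counts).
y_mouse, y_cat, y_a = (encode(one_hot(i)) for i in (1, 2, 3))

print(y_eats, y_some)
print(cosine(y_mouse, y_cat), cosine(y_mouse, y_a))
```
]]></preformat>
<p>Under these assumptions, the vectors for <italic>eats</italic> and <italic>some</italic> coincide, and the cosine similarity between <italic>mouse</italic> and <italic>cat</italic> (about 0.87) exceeds the one between <italic>mouse</italic> and <italic>a</italic> (about 0.83).</p>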
<p>Even for distributed representations in the general form, it is possible to define <italic>functional composition</italic> to represent expressions. Vectors <bold>e<sub>i</sub></bold> should be replaced with vectors <bold>W<sub>d&#x000D7;n</sub>e<sub>i</sub></bold> in the definition of functional compositionality. Equation (3) for additive functional compositionality becomes:</p>
<disp-formula id="E7"><mml:math id="M16"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mstyle mathvariant="bold"><mml:mtext>y</mml:mtext></mml:mstyle><mml:mstyle mathvariant="bold"><mml:mtext>s</mml:mtext></mml:mstyle></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mstyle mathvariant="bold"><mml:mtext>W</mml:mtext></mml:mstyle><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>d&#x000D7;n</mml:mtext></mml:mstyle></mml:mrow></mml:msub><mml:msub><mml:mstyle mathvariant="bold"><mml:mtext>func</mml:mtext></mml:mstyle><mml:mo>&#x003A3;</mml:mo></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>s</mml:mi><mml:mo>)</mml:mo><mml:mo>=</mml:mo></mml:mrow><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mtext>&#x0200A;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x0200A;</mml:mtext><mml:mn>1</mml:mn></mml:mrow><mml:mi>k</mml:mi></mml:munderover><mml:mrow><mml:msub><mml:mstyle mathvariant="bold"><mml:mtext>W</mml:mtext></mml:mstyle><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>d&#x000D7;n</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow></mml:mstyle><mml:msub><mml:mstyle mathvariant="bold"><mml:mtext>e</mml:mtext></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>In the running example, the additive functional compositionality of sentence <italic>s</italic><sub>1</sub> in Example 1 is:</p>
<disp-formula id="E8"><mml:math id="M17"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mstyle mathvariant="bold"><mml:mtext>y</mml:mtext></mml:mstyle><mml:mrow><mml:msub><mml:mstyle mathvariant="bold"><mml:mtext>s</mml:mtext></mml:mstyle><mml:mstyle mathvariant="bold"><mml:mn>1</mml:mn></mml:mstyle></mml:msub></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mstyle mathvariant="bold"><mml:mtext>W</mml:mtext></mml:mstyle><mml:mrow><mml:mstyle mathvariant="bold"><mml:mn>3</mml:mn><mml:mo>&#x000D7;</mml:mo><mml:mn>9</mml:mn></mml:mstyle></mml:mrow></mml:msub><mml:msub><mml:mstyle mathvariant="bold"><mml:mtext>func</mml:mtext></mml:mstyle><mml:mo>&#x003A3;</mml:mo></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtable><mml:mtr><mml:mtd><mml:mn>8</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mn>12</mml:mn></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>0</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Clearly, in this case, it is extremely difficult to recover the discrete symbolic sequence <italic>s</italic><sub>1</sub> that generated the final distributed representation.</p>
<p>Hence, <bold>interpretability</bold> of distributed representations can be framed as the following question:</p>
<p><italic>to what extent is the underlying functional composition of distributed representations</italic> <bold><italic>concatenative</italic></bold><italic>?</italic></p>
<p>In fact, discrete symbolic representations are <italic>interpretable</italic> as their composition is concatenative. Then, in order to be interpretable, distributed representations, and the related functional composition, should have some concatenative properties.</p>
<p>Then, since a distributed representation <italic>y</italic><sub><italic>s</italic></sub> of a discrete symbolic expression <italic>s</italic> is obtained by using an encoder <bold>W<sub>d&#x000D7;n</sub></bold> and a composition function, assessing interpretability becomes:</p>
<list list-type="bullet">
<list-item><p><bold>Symbol-level Interpretability</bold> - The question &#x0201C;Can discrete symbols be recognized?&#x0201D; becomes &#x0201C;to what degree is the embedding matrix <bold>W</bold> invertible?&#x0201D;</p></list-item>
<list-item><p><bold>Sequence-level Interpretability</bold> - The question &#x0201C;Can symbols and their relations be recognized in sequences of symbols?&#x0201D; becomes &#x0201C;to what extent are functional composition models concatenative?&#x0201D;</p></list-item>
</list>
<p>The two driving questions of <italic>Symbol-level Interpretability</italic> and <italic>Sequence-level Interpretability</italic> will be used to describe the presented distributed representations. In fact, we are interested in understanding whether distributed representations can be used to encode discrete symbolic structures and whether it is possible to decode the underlying discrete symbolic structure given a distributed representation. For example, it is clear that a distributed local representation is more interpretable at symbol level than the distributed representation presented in Equation (4). Yet, both representations lack concatenative compositionality when sequences are collapsed into vectors. In fact, the sum as composition function builds bag-of-words local and distributed representations, which neglect the order of symbols in sequences. In the rest of the paper, we analyze whether other representations, such as holographic reduced representations (Plate, <xref ref-type="bibr" rid="B53">1995</xref>), recurrent and recursive neural networks (Hochreiter and Schmidhuber, <xref ref-type="bibr" rid="B33">1997</xref>; Schuster and Paliwal, <xref ref-type="bibr" rid="B59">1997</xref>) or neural networks with attention (Vaswani et al., <xref ref-type="bibr" rid="B65">2017</xref>; Devlin et al., <xref ref-type="bibr" rid="B17">2019</xref>), are instead more interpretable.</p>
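<p>The loss of word order under additive composition can be illustrated with a short Python sketch. The six-word vocabulary, its ordering, and the two sentences are hypothetical choices made only for this illustration; the function <monospace>additive_composition</monospace> simply sums one-hot vectors.</p>
<preformat><![CDATA[
```python
# Hypothetical vocabulary and ordering, used only for this illustration.
VOCAB = ["a", "cat", "catches", "dog", "eats", "mouse"]


def one_hot(word):
    """Distributed local representation of a word."""
    return [1 if w == word else 0 for w in VOCAB]


def additive_composition(sentence):
    """Sum of the local representations of the words in the sentence."""
    total = [0] * len(VOCAB)
    for word in sentence.split():
        total = [t + x for t, x in zip(total, one_hot(word))]
    return total


# Two sentences with the same words but opposite meaning.
v1 = additive_composition("a cat catches a mouse")
v2 = additive_composition("a mouse catches a cat")
print(v1, v2, v1 == v2)
```
]]></preformat>
<p>Both sentences map to the same vector, so the sequence cannot be recovered from the sum alone, although each symbol is still recognizable in the local representation.</p>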
</sec>
<sec id="s3">
<title>3. Strategies to Obtain Distributed Representations from Symbols</title>
<p>There is a wide range of techniques to transform symbolic representations into distributed representations. When combining natural language processing and machine learning, this is a major issue: transforming symbols, sequences of symbols or symbolic structures into vectors or tensors that can be used in learning machines. These techniques generally propose a function &#x003B7; to transform a <italic>local representation</italic> with a large number of dimensions into a <italic>distributed representation</italic> with a lower number of dimensions:</p>
<disp-formula id="E9"><mml:math id="M18"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>&#x003B7;</mml:mi><mml:mo>:</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msup><mml:mo>&#x02192;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>This function is often called an <italic>encoder</italic>.</p>
<p>We propose to categorize techniques to obtain distributed representations in two broad categories, showing some degree of overlap (Cotterell et al., <xref ref-type="bibr" rid="B14">2017</xref>):</p>
<list list-type="bullet">
<list-item><p>Representations derived from dimensionality reduction techniques;</p></list-item>
<list-item><p>Learned representations.</p></list-item>
</list>
<p>In the rest of the section, we will introduce the different strategies according to the proposed categorization. Moreover, for each representation and its related function &#x003B7;, we will emphasize its degree of interpretability by answering two questions:</p>
<list list-type="bullet">
<list-item><p>Does a specific dimension in &#x0211D;<sup><italic>d</italic></sup> have a clear meaning?</p></list-item>
<list-item><p>Can we decode an encoded symbolic representation? In other words, assuming a decoding function &#x003B4; : &#x0211D;<sup><italic>d</italic></sup> &#x02192; &#x0211D;<sup><italic>n</italic></sup>, how far is <italic>v</italic> &#x02208; &#x0211D;<sup><italic>n</italic></sup>, which represents a symbolic representation, from <italic>v</italic>&#x02032; &#x0003D; &#x003B4;(&#x003B7;(<italic>v</italic>))?</p></list-item>
</list>
<p><italic>Sequence-level interpretability</italic> of the resulting representations will be analyzed in section 5.</p>
<sec>
<title>3.1. Dimensionality Reduction With Random Projections</title>
<p><italic>Random projection</italic> (RP) (Bingham and Mannila, <xref ref-type="bibr" rid="B7">2001</xref>; Fodor, <xref ref-type="bibr" rid="B21">2002</xref>) is a technique based on random matrices <inline-formula><mml:math id="M19"><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. Generally, the rows of the matrix <italic>W</italic><sub><italic>d</italic></sub> are sampled from a Gaussian distribution with zero mean and normalized so as to have unit length (Johnson and Lindenstrauss, <xref ref-type="bibr" rid="B36">1984</xref>), or are even less complex random vectors (Achlioptas, <xref ref-type="bibr" rid="B1">2003</xref>). Random projections from Gaussian distributions approximately preserve pairwise distances between points (see the <italic>Johnson-Lindenstrauss Lemma</italic>; Johnson and Lindenstrauss, <xref ref-type="bibr" rid="B36">1984</xref>), that is, for any vectors <italic>x, y</italic> &#x02208; <italic>X</italic>:</p>
<disp-formula id="E10"><mml:math id="M20"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x02212;</mml:mo><mml:mi>&#x003B5;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:msup><mml:mrow><mml:mo>&#x02016;</mml:mo><mml:mrow><mml:mstyle mathvariant='bold'><mml:mtext>x&#x000A0;</mml:mtext></mml:mstyle><mml:mo>&#x02212;</mml:mo><mml:mstyle mathvariant='bold'><mml:mtext>&#x000A0;y</mml:mtext></mml:mstyle></mml:mrow><mml:mo>&#x02016;</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup><mml:mo>&#x02264;</mml:mo><mml:msup><mml:mrow><mml:mo>&#x02016;</mml:mo><mml:mrow><mml:mi>W</mml:mi><mml:mstyle mathvariant='bold'><mml:mtext>x</mml:mtext></mml:mstyle><mml:mo>&#x02212;</mml:mo><mml:mi>W</mml:mi><mml:mstyle mathvariant='bold'><mml:mtext>y</mml:mtext></mml:mstyle></mml:mrow><mml:mo>&#x02016;</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup><mml:mo>&#x02264;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mi>&#x003B5;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mo>&#x02016;</mml:mo><mml:mrow><mml:mstyle mathvariant='bold'><mml:mtext>x&#x000A0;</mml:mtext></mml:mstyle><mml:mo>&#x02212;</mml:mo><mml:mstyle mathvariant='bold'><mml:mtext>&#x000A0;y</mml:mtext></mml:mstyle></mml:mrow><mml:mo>&#x02016;</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
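<p>where the approximation factor &#x003B5; depends on the dimension of the projection: given <italic>m</italic> points in <italic>X</italic>, to ensure that the approximation factor is &#x003B5;, the dimension <italic>k</italic> of the projected space (written <italic>d</italic> elsewhere in this section) must be chosen such that:</p>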
<disp-formula id="E11"><mml:math id="M21"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>k</mml:mi><mml:mo>&#x02265;</mml:mo><mml:mfrac><mml:mrow><mml:mn>8</mml:mn><mml:mo class="qopname">log</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mi>&#x003B5;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
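<p>To give a feeling for the magnitudes involved, the bound above can be evaluated numerically. The sketch below assumes that <italic>m</italic> is the number of points and that the logarithm is natural; the function name <monospace>jl_min_dimension</monospace> is introduced here only for illustration.</p>
<preformat><![CDATA[
```python
import math


def jl_min_dimension(m, eps):
    """Smallest integer k with k >= 8 * ln(m) / eps**2.

    Assumes m is the number of points and log is the natural logarithm.
    """
    return math.ceil(8 * math.log(m) / eps ** 2)


k_loose = jl_min_dimension(1000, 0.5)   # tolerate 50% distortion
k_tight = jl_min_dimension(1000, 0.1)   # tolerate 10% distortion
print(k_loose, k_tight)
```
]]></preformat>
<p>The required dimension grows only logarithmically with the number of points but quadratically as the tolerated distortion shrinks.</p>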
<p>Constraints for building the matrix <italic>W</italic> can be significantly relaxed to less complex random vectors (Achlioptas, <xref ref-type="bibr" rid="B1">2003</xref>). Entries of the matrix can be sampled from very simple zero-mean distributions such as:</p>
<disp-formula id="E12"><mml:math id="M22"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msqrt><mml:mn>3</mml:mn></mml:msqrt><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable><mml:mtr><mml:mtd><mml:mrow><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mtext>&#x000A0;&#x000A0;with&#x000A0;probability&#x000A0;</mml:mtext><mml:mfrac><mml:mn>1</mml:mn><mml:mn>6</mml:mn></mml:mfrac></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn><mml:mtext>&#x000A0;&#x000A0;with&#x000A0;probability&#x000A0;</mml:mtext><mml:mfrac><mml:mn>1</mml:mn><mml:mn>6</mml:mn></mml:mfrac></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mn>0</mml:mn><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;with&#x000A0;probability&#x000A0;</mml:mtext><mml:mfrac><mml:mn>2</mml:mn><mml:mn>3</mml:mn></mml:mfrac></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>without the need to manually ensure unit-length of the rows, and at the same time providing a significant speed up in computation due to the sparsity of the projection.</p>
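<p>A sparse matrix of this kind can be sampled as in the following Python sketch. The function name <monospace>sparse_projection_matrix</monospace>, the matrix size, and the fixed seed are assumptions made for illustration; only the three values and their probabilities come from the formula above.</p>
<preformat><![CDATA[
```python
import math
import random


def sparse_projection_matrix(d, n, rng):
    """d x n matrix with entries sqrt(3) * {+1, 0, -1}, probabilities 1/6, 2/3, 1/6."""
    scale = math.sqrt(3)
    values = [scale, 0.0, -scale]
    weights = [1 / 6, 2 / 3, 1 / 6]
    return [rng.choices(values, weights=weights, k=n) for _ in range(d)]


rng = random.Random(0)  # fixed seed only for reproducibility
W = sparse_projection_matrix(200, 300, rng)

entries = [x for row in W for x in row]
zero_fraction = entries.count(0.0) / len(entries)
# The distribution has zero mean, so the mean square estimates the variance.
variance = sum(x * x for x in entries) / len(entries)
print(zero_fraction, variance)
```
]]></preformat>
<p>About two thirds of the entries are zero, and the factor &#x0221A;3 gives each entry unit variance.</p>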
<p>These vectors &#x003B7;(<bold>v</bold>) are <italic>interpretable at symbol level</italic> as these functions can be inverted. The inverted function, that is, the decoding function, is:</p>
<disp-formula id="E13"><mml:math id="M23"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>&#x003B4;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mstyle mathvariant='bold'><mml:mi>v</mml:mi></mml:mstyle><mml:mo>&#x02032;</mml:mo></mml:msup></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msubsup><mml:mi>W</mml:mi><mml:mi>d</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:msup><mml:mstyle mathvariant='bold'><mml:mi>v</mml:mi></mml:mstyle><mml:mo>&#x02032;</mml:mo></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>and <inline-formula><mml:math id="M24"><mml:msubsup><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02248;</mml:mo><mml:mi>I</mml:mi></mml:math></inline-formula> when <italic>W</italic><sub><italic>d</italic></sub> is derived using Gaussian random vectors. Hence, distributed vectors in &#x0211D;<sup><italic>d</italic></sup> can be approximately decoded back into the original symbolic representation with a degree of approximation that depends on how much smaller <italic>d</italic> is than <italic>n</italic>.</p>
<p>The major advantage of RP is that the matrix <italic>W</italic><sub><italic>d</italic></sub> can be produced <italic>&#x000E0;-la-carte</italic> starting from the symbols encountered so far in the encoding procedure. In fact, it is sufficient to generate new Gaussian vectors for new symbols when they appear.</p>
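<p>The encoding, decoding, and on-demand generation described in this section can be sketched in Python as follows. The class <monospace>RandomProjectionEncoder</monospace> is hypothetical; the entry variance of 1/<italic>d</italic> (chosen so that each symbol vector has unit expected squared length), the dimension, the seed, and the arg-max decoding rule are assumptions of this sketch rather than prescriptions of the cited works.</p>
<preformat><![CDATA[
```python
import random


class RandomProjectionEncoder:
    """Assigns each new symbol a random Gaussian vector in R^d (one column of W_d)."""

    def __init__(self, d, seed=0):
        self.d = d
        self.rng = random.Random(seed)
        self.columns = {}  # symbol -> its column of W_d

    def encode(self, symbol):
        """eta: return the vector of a symbol, creating it on first use."""
        if symbol not in self.columns:
            # Assumption: variance 1/d, so the expected squared length is 1.
            std = self.d ** -0.5
            self.columns[symbol] = [self.rng.gauss(0.0, std) for _ in range(self.d)]
        return self.columns[symbol]

    def decode(self, vector):
        """delta: apply W_d^T and return the symbol with the largest score."""
        scores = {
            symbol: sum(a * b for a, b in zip(column, vector))
            for symbol, column in self.columns.items()
        }
        return max(scores, key=scores.get)


encoder = RandomProjectionEncoder(d=300)
words = ["a", "cat", "catches", "dog", "eats", "mouse"]
vectors = {w: encoder.encode(w) for w in words}       # columns created on demand
decoded = [encoder.decode(vectors[w]) for w in words]
print(decoded)
```
]]></preformat>
<p>With these settings every symbol is recovered, because the score of the correct symbol is close to 1 while the scores of the others are close to 0; the gap narrows as <italic>d</italic> decreases.</p>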
</sec>
<sec>
<title>3.2. Learned Representation</title>
<p>Learned representations differ from dimensionality reduction techniques in that: (1) encoding/decoding functions may not be linear; (2) learning can optimize objectives that differ from the target of Principal Component Analysis (see section 4.2); and, (3) solutions are not derived in a closed form but are obtained using optimization techniques such as <italic>stochastic gradient descent</italic>.</p>
<p>Learned representation can be further classified into:</p>
<list list-type="bullet">
<list-item><p><italic>Task-independent representations</italic> learned with a standalone algorithm (as in <italic>autoencoders</italic>; Socher et al., <xref ref-type="bibr" rid="B60">2011</xref>; Liou et al., <xref ref-type="bibr" rid="B41">2014</xref>) which is independent of any task, and which learns a representation that depends only on the dataset used;</p></list-item>
<list-item><p><italic>Task-dependent representations</italic> learned as the first step of another algorithm (this is called <italic>end-to-end training</italic>), usually the first layer of a deep neural network. In this case the new representation is driven by the task.</p></list-item>
</list>
<sec>
<title>3.2.1. Autoencoder</title>
<p>Autoencoders are a task-independent technique to learn a distributed representation encoder &#x003B7; : &#x0211D;<sup><italic>n</italic></sup> &#x02192; &#x0211D;<sup><italic>d</italic></sup> by using local representations of a set of examples (Socher et al., <xref ref-type="bibr" rid="B60">2011</xref>; Liou et al., <xref ref-type="bibr" rid="B41">2014</xref>). The distributed representation encoder &#x003B7; is half of an autoencoder.</p>
<p>An autoencoder is a neural network that aims to reproduce an input vector in &#x0211D;<sup><italic>n</italic></sup> as output by traversing hidden layer(s) that are in &#x0211D;<sup><italic>d</italic></sup>. Given &#x003B7; : &#x0211D;<sup><italic>n</italic></sup> &#x02192; &#x0211D;<sup><italic>d</italic></sup> and &#x003B4; : &#x0211D;<sup><italic>d</italic></sup> &#x02192; &#x0211D;<sup><italic>n</italic></sup> as the encoder and the decoder, respectively, an autoencoder aims to minimize the following reconstruction error:</p>
<disp-formula id="E14"><mml:math id="M25"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mrow><mml:mi>&#x02112;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mstyle mathvariant='bold'><mml:mtext>x</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:msup><mml:mi>x</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup></mml:mstyle><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo>&#x02016;</mml:mo><mml:mrow><mml:mstyle mathvariant='bold'><mml:mtext>x</mml:mtext></mml:mstyle><mml:mo>&#x02212;</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:msup><mml:mi>x</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup></mml:mstyle></mml:mrow><mml:mo>&#x02016;</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where</p>
<disp-formula id="E15"><mml:math id="M26"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mstyle mathvariant='bold'><mml:msup><mml:mi>x</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup></mml:mstyle><mml:mo>=</mml:mo><mml:mi>&#x003B4;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x003B7;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mstyle mathvariant='bold'><mml:mi>x</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The encoding and decoding modules are two neural networks, which means that they are functions depending on a set of parameters &#x003B8; of the form</p>
<disp-formula id="E16"><mml:math id="M27"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>s</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>W</mml:mi><mml:mi>x</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>b</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>s</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mi>y</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:msup><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where the parameters of the entire model are &#x003B8;, &#x003B8;&#x02032; &#x0003D; {<italic>W, b, W</italic>&#x02032;, <italic>b</italic>&#x02032;} with <italic>W, W</italic>&#x02032; matrices, <italic>b, b</italic>&#x02032; vectors and <italic>s</italic> is a function that can be either a non-linear sigmoid-shaped function or, in some cases, the identity function. In some variants the matrices <italic>W</italic> and <italic>W</italic>&#x02032; are constrained to <italic>W</italic><sup><italic>T</italic></sup> &#x0003D; <italic>W</italic>&#x02032;. This model differs from PCA in its target loss function and in its use of non-linear functions.</p>
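<p>The following Python sketch trains a very small autoencoder of this form on one-hot inputs. It is only an illustration under stated assumptions: the encoder uses a sigmoid, the decoder uses the identity for <italic>s</italic>, the weights are untied, and the initialization range, learning rate, number of epochs, and the name <monospace>train_autoencoder</monospace> are arbitrary choices that are not taken from the cited works.</p>
<preformat><![CDATA[
```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_autoencoder(data, d, epochs=2000, lr=0.05, seed=0):
    """Sigmoid encoder, identity decoder, trained by stochastic gradient descent."""
    rng = random.Random(seed)
    n = len(data[0])
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(d)]   # encoder, d x n
    b = [0.0] * d
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(n)]  # decoder, n x d
    b2 = [0.0] * n

    def forward(x):
        y = [sigmoid(sum(W[i][j] * x[j] for j in range(n)) + b[i]) for i in range(d)]
        x_rec = [sum(W2[j][i] * y[i] for i in range(d)) + b2[j] for j in range(n)]
        return y, x_rec

    def loss():
        """Mean squared reconstruction error over the data set."""
        errors = [sum((a - r) ** 2 for a, r in zip(x, forward(x)[1])) for x in data]
        return sum(errors) / len(data)

    initial = loss()
    for _ in range(epochs):
        for x in data:
            y, x_rec = forward(x)
            err = [2 * (r - a) for r, a in zip(x_rec, x)]  # dL/dx_rec
            # Back-propagate through the decoder, then through the sigmoid.
            grad_y = [sum(err[j] * W2[j][i] for j in range(n)) for i in range(d)]
            grad_z = [grad_y[i] * y[i] * (1 - y[i]) for i in range(d)]
            for j in range(n):
                for i in range(d):
                    W2[j][i] -= lr * err[j] * y[i]
                b2[j] -= lr * err[j]
            for i in range(d):
                for j in range(n):
                    W[i][j] -= lr * grad_z[i] * x[j]
                b[i] -= lr * grad_z[i]
    return initial, loss(), forward


# Four symbols in local (one-hot) representation, compressed to d = 2.
data = [[1.0 if j == i else 0.0 for j in range(4)] for i in range(4)]
initial_loss, final_loss, forward = train_autoencoder(data, d=2)
code, reconstruction = forward(data[0])
print(initial_loss, final_loss)
```
]]></preformat>
<p>The first value returned by <monospace>forward</monospace> is the distributed representation &#x003B7;(<bold>x</bold>) and the second is the reconstruction &#x003B4;(&#x003B7;(<bold>x</bold>)). Training lowers the reconstruction error, but with these choices it need not reach zero.</p>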
<p>Autoencoders have been further improved with <italic>denoising autoencoders</italic> (Vincent et al., <xref ref-type="bibr" rid="B66">2008</xref>, <xref ref-type="bibr" rid="B67">2010</xref>; Masci et al., <xref ref-type="bibr" rid="B44">2011</xref>) that are a variant of autoencoders where the goal is to reconstruct the input from a corrupted version. The intuition is that higher level features should be robust with regard to small noise in the input. In particular, the input <bold>x</bold> gets corrupted via a stochastic function:</p>
<disp-formula id="E17"><mml:math id="M28"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mi>g</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>and then one minimizes again the reconstruction error, but with regard to the <italic>original</italic> (uncorrupted) input:</p>
<disp-formula id="E18"><mml:math id="M29"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mrow><mml:mi>&#x02112;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mstyle mathvariant='bold'><mml:mtext>x</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:msup><mml:mi>x</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup></mml:mstyle><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo>&#x02016;</mml:mo><mml:mrow><mml:mstyle mathvariant='bold'><mml:mtext>x</mml:mtext></mml:mstyle><mml:mo>&#x02212;</mml:mo><mml:mi>&#x003B4;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x003B7;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>g</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mstyle mathvariant='bold'><mml:mtext>x</mml:mtext></mml:mstyle><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mo>&#x02016;</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Usually <italic>g</italic> can be either:</p>
<list list-type="bullet">
<list-item><p>Adding Gaussian noise: <italic>g</italic>(<bold>x</bold>) &#x0003D; <bold>x</bold> &#x0002B; &#x003B5;, where <inline-formula><mml:math id="M30"><mml:mi>&#x003B5;</mml:mi><mml:mo>&#x0007E;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">N</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>&#x003C3;</mml:mi><mml:mo>&#x1D540;</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>;</p></list-item>
<list-item><p>Masking noise: where a given fraction &#x003BD; of the components of the input is set to 0.</p></list-item>
</list>
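<p>The two corruption functions can be sketched in Python as below. The function names are hypothetical; the sketch reads &#x003C3; in the Gaussian case as a per-component variance, and in the masking case it chooses the masked components uniformly at random and rounds their number to the nearest integer, which are assumptions of this illustration.</p>
<preformat><![CDATA[
```python
import math
import random


def gaussian_noise(x, variance, rng):
    """g(x) = x + eps, each component of eps drawn from N(0, variance)."""
    std = math.sqrt(variance)
    return [xi + rng.gauss(0.0, std) for xi in x]


def masking_noise(x, fraction, rng):
    """Set a randomly chosen fraction of the components of x to 0."""
    k = round(fraction * len(x))
    masked = set(rng.sample(range(len(x)), k))
    return [0.0 if i in masked else xi for i, xi in enumerate(x)]


rng = random.Random(0)  # fixed seed only for reproducibility
x = [1.0] * 10
x_gauss = gaussian_noise(x, 0.01, rng)
x_masked = masking_noise(x, 0.3, rng)
print(x_gauss)
print(x_masked)
```
]]></preformat>
<p>In a denoising autoencoder, the corrupted vector is fed to the encoder while the loss is still computed against the uncorrupted <bold>x</bold>.</p>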
<p>Concerning <italic>symbol-level interpretability</italic>, as with random projection, distributed representations &#x003B7;(<bold>v</bold>) obtained with encoders from autoencoders and denoising autoencoders are <italic>invertible</italic>, that is, decodable, as this is the nature of autoencoders.</p>
</sec>
<sec>
<title>3.2.2. Embedding Layers</title>
<p>Embedding layers are generally the first layers of more complex neural networks and are responsible for transforming an initial local representation into the first internal distributed representation. The main difference with autoencoders is that these layers are shaped by the entire overall learning process. The learning process is generally task dependent. Hence, these first embedding layers depend on the final task.</p>
<p>It is argued that each layer learns a higher-level representation of its input. This is particularly visible with convolutional networks (Krizhevsky et al., <xref ref-type="bibr" rid="B38">2012</xref>) applied to computer vision tasks. In these suggestive visualizations (Zeiler and Fergus, <xref ref-type="bibr" rid="B77">2014b</xref>), the hidden layers are seen to correspond to abstract features of the image, starting from simple edges (in lower layers) up to faces in the higher ones.</p>
<p>However, these embedding layers produce encoding functions and, thus, distributed representations that are not interpretable at symbol level. In fact, these embedding layers do not naturally provide decoders.</p>
</sec>
</sec>
</sec>
<sec id="s4">
<title>4. <italic>Distributional</italic> Representations as Another Side of the Coin</title>
<p><italic>Distributional</italic> semantics is an important area of research in natural language processing that aims to describe meaning of words and sentences with vectorial representations (see Turney and Pantel, <xref ref-type="bibr" rid="B64">2010</xref> for a survey). These representations are called <italic>distributional representations</italic>.</p>
<p>It is a strange historical accident that two similar sounding names, <italic>distributed</italic> and <italic>distributional</italic>, have been given to two concepts that should not be confused. Maybe, this has happened because the two concepts are definitely related. We argue that distributional representations are nothing more than a subset of distributed representations, and in fact can be categorized neatly into the divisions presented in the previous section.</p>
<p>Distributional semantics is based on a famous slogan, &#x0201C;<italic>you shall judge a word by the company it keeps&#x0201D;</italic> (Firth, <xref ref-type="bibr" rid="B20">1957</xref>), and on the <italic>distributional hypothesis</italic> (Harris, <xref ref-type="bibr" rid="B30">1954</xref>): words have similar meaning if used in similar contexts, that is, words with the same or similar <italic>distribution</italic>. Hence, the name distributional as well as the core hypothesis come from a linguistic rather than computer science background.</p>
<p>Distributional vectors represent words by describing information related to the contexts in which they appear. Put in this way it is apparent that a distributional representation <italic>is</italic> a specific case of a distributed representation, and the different name is only an indicator of the context in which these techniques originated. Representations for sentences are generally obtained combining vectors representing words.</p>
<p>Hence, distributional semantics is a special case of distributed representations with a restriction on what can be used as features in vector spaces: features represent a bit of contextual information. Then, the largest body of research is on what should be used to represent contexts and how it should be taken into account. Once this is decided, large matrices <italic>X</italic> representing words in context are collected and, then, dimensionality reduction techniques are applied to obtain tractable and more discriminative vectors.</p>
<p>In the rest of the section, we present how to build matrices representing words in context, we briefly recap how dimensionality reduction techniques have been used in distributional semantics, and, finally, we report on <monospace>word2vec</monospace> (Mikolov et al., <xref ref-type="bibr" rid="B45">2013</xref>), which is a novel distributional semantic technique based on deep learning.</p>
<sec>
<title>4.1. Building Distributional Representations for Words From a Corpus</title>
<p>The major issue in distributional semantics is how to build distributional representations for words by observing word contexts in a collection of documents. In this section, we will describe these techniques using the example of the corpus in <xref ref-type="table" rid="T1">Table 1</xref>.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>A very small corpus.</p></caption>
<table frame="hsides" rules="groups">
<tbody>
<tr>
<td valign="top" align="left"><italic>s</italic><sub>1</sub></td>
<td valign="top" align="left"><italic>a cat catches a mouse</italic></td>
</tr>
<tr>
<td valign="top" align="left"><italic>s</italic><sub>2</sub></td>
<td valign="top" align="left"><italic>a dog eats a mouse</italic></td>
</tr>
<tr>
<td valign="top" align="left"><italic>s</italic><sub>3</sub></td>
<td valign="top" align="left"><italic>a dog catches a cat</italic></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>A first and simple distributional semantic representation of words is given by word vs. document matrices such as those typical in information retrieval (Salton, <xref ref-type="bibr" rid="B57">1989</xref>). Word contexts are represented by document indexes. Then, words are similar if they appear in similar sets of documents. This is generally referred to as <italic>topical similarity</italic> (Landauer and Dumais, <xref ref-type="bibr" rid="B39">1997</xref>) as words belonging to the same topic tend to be more similar.</p>
<p>A second strategy to build distributional representations for words is to build word vs. contextual feature matrices. These contextual features represent <italic>proxies</italic> for semantic attributes of modeled words (Baroni and Lenci, <xref ref-type="bibr" rid="B5">2010</xref>). For example, contexts of the word <italic>dog</italic> will somehow have relation with the fact that a dog has four legs, barks, eats, and so on. In this case, these vectors capture a similarity that is more related to co-hyponymy, that is, words sharing similar attributes are similar. For example, <italic>dog</italic> is more similar to <italic>cat</italic> than to <italic>car</italic> as <italic>dog</italic> and <italic>cat</italic> share more attributes than <italic>dog</italic> and <italic>car</italic>. This is often referred to as <italic>attributional similarity</italic> (Turney, <xref ref-type="bibr" rid="B63">2006</xref>).</p>
<p>A simple example of this second strategy is given by word-to-word matrices obtained by observing n-word windows of target words. For example, a word-to-word matrix obtained for the corpus in <xref ref-type="table" rid="T1">Table 1</xref> by considering a 1-word window is the following:</p>
<p><inline-graphic xlink:href="frobt-06-00153-i0002.tif"/></p>
<p>Hence, the word <italic>cat</italic> is represented by the vector <bold>cat</bold> &#x0003D; (2 0 0 0 1 0) and the similarity between <italic>cat</italic> and <italic>dog</italic> is higher than the similarity between <italic>cat</italic> and <italic>mouse</italic> as the cosine similarity <italic>cos</italic>(<bold>cat</bold>, <bold>dog</bold>) is higher than the cosine similarity <italic>cos</italic>(<bold>cat</bold>, <bold>mouse</bold>).</p>
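<p>As a minimal sketch of how such a matrix can be built, the following Python fragment counts 1-word-window co-occurrences for the corpus in <xref ref-type="table" rid="T1">Table 1</xref> and compares cosine similarities. The column order (<italic>a</italic>, <italic>cat</italic>, <italic>dog</italic>, <italic>mouse</italic>, <italic>catches</italic>, <italic>eats</italic>) and the function names are assumptions made for illustration; the order was chosen only because it is consistent with the vector reported for <italic>cat</italic>.</p>
<preformat><![CDATA[
```python
import math
from collections import Counter

# Toy corpus from Table 1.
corpus = [
    "a cat catches a mouse",
    "a dog eats a mouse",
    "a dog catches a cat",
]

# Assumed column order (the original matrix is only available as an image).
vocab = ["a", "cat", "dog", "mouse", "catches", "eats"]


def cooccurrence(sentences, window=1):
    """Count, for each target word, the words within `window` positions of it."""
    counts = {w: Counter() for w in vocab}
    for sentence in sentences:
        tokens = sentence.split()
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[target][tokens[j]] += 1
    # Turn each Counter into a dense row following the column order in `vocab`.
    return {w: [counts[w][c] for c in vocab] for w in vocab}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


vectors = cooccurrence(corpus)
cat, dog, mouse = vectors["cat"], vectors["dog"], vectors["mouse"]
```
]]></preformat>
<p>Under these assumptions, <bold>dog</bold> &#x0003D; (2 0 0 0 1 1) and <bold>mouse</bold> &#x0003D; (2 0 0 0 0 0), so that <italic>cos</italic>(<bold>cat</bold>, <bold>dog</bold>) &#x02248; 0.913 and <italic>cos</italic>(<bold>cat</bold>, <bold>mouse</bold>) &#x02248; 0.894.</p>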
<p>The research on distributional semantics focuses on two aspects: (1) the best features to represent contexts; (2) the best correlation measure among target words and features.</p>
<p>How to represent contexts is a crucial problem in distributional semantics. This problem is closely related to the classical question of feature definition and feature selection in machine learning. A wide variety of features have been tried. Contexts have been represented as sets of relevant words, sets of relevant syntactic triples involving target words (Pado and Lapata, <xref ref-type="bibr" rid="B50">2007</xref>; Rothenh&#x000E4;usler and Sch&#x000FC;tze, <xref ref-type="bibr" rid="B55">2009</xref>) and sets of labeled lexical triples (Baroni and Lenci, <xref ref-type="bibr" rid="B5">2010</xref>).</p>
<p>Finding the best correlation measure among target words and their contextual features is the other issue. Many correlation measures have been tried. The classical measures are <italic>term frequency-inverse document frequency</italic> (<italic>tf-idf</italic>) (Salton, <xref ref-type="bibr" rid="B57">1989</xref>) and <italic>point-wise mutual information</italic> (<italic>pmi</italic>). These, among other measures, are used to better capture the importance of contextual features for representing the distributional semantics of words.</p>
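<p>To make the role of such a measure concrete, the following sketch reweights raw co-occurrence counts with point-wise mutual information, computed as the base-2 logarithm of P(w, c) / (P(w) P(c)). The count table, the function name <monospace>pmi_table</monospace> and the optional clipping of negative values to zero (a common variant usually called positive pmi) are illustrative assumptions and are not taken from the works cited above.</p>
<preformat><![CDATA[
```python
import math

# Toy co-occurrence counts: target word -> context word -> count (assumed values).
counts = {
    "cat": {"a": 2, "catches": 1},
    "dog": {"a": 2, "catches": 1, "eats": 1},
    "mouse": {"a": 2},
}


def pmi_table(counts, positive=False):
    """Return pmi(w, c) = log2(P(w, c) / (P(w) * P(c))) for every observed pair."""
    total = sum(sum(row.values()) for row in counts.values())
    target_totals = {w: sum(row.values()) for w, row in counts.items()}
    context_totals = {}
    for row in counts.values():
        for context, n in row.items():
            context_totals[context] = context_totals.get(context, 0) + n

    table = {}
    for word, row in counts.items():
        table[word] = {}
        for context, n in row.items():
            ratio = (n * total) / (target_totals[word] * context_totals[context])
            value = math.log2(ratio)
            # Optional variant: clip negative associations to zero.
            table[word][context] = max(0.0, value) if positive else value
    return table


pmi = pmi_table(counts)
ppmi = pmi_table(counts, positive=True)
```
]]></preformat>
<p>In this toy table, the frequent context <italic>a</italic> receives a negative score for <italic>dog</italic> (the logarithm of 0.75), while the rarer context <italic>eats</italic> receives a positive one (the logarithm of 2.25, about 1.17), which is the kind of reweighting these measures are meant to provide.</p>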
<p>This first formulation of distributional semantics is a distributed representation that is <italic>human-interpretable</italic>. In fact, features represent contextual information which is a proxy for semantic attributes of target words (Baroni and Lenci, <xref ref-type="bibr" rid="B5">2010</xref>).</p>
</sec>
<sec>
<title>4.2. Compacting Distributional Representations</title>
<p>As distributed representations, <italic>distributional representations</italic> can undergo dimensionality reduction with Principal Component Analysis or Random Indexing. This process serves two purposes. The first is the classical one of reducing the dimensions of the representation to obtain more compact representations. The second is to help the representation focus on more discriminative dimensions. This latter purpose involves feature selection and merging, which is an important step in making these representations more effective on the final task of similarity detection.</p>
<p>Principal Component Analysis (PCA) is largely applied in compacting distributional representations: Latent Semantic Analysis (LSA) is a prominent example (Landauer and Dumais, <xref ref-type="bibr" rid="B39">1997</xref>). LSA was born in Information Retrieval with the idea of reducing word-to-document matrices. Hence, in this compact representation, word contexts are documents and distributional vectors of words report on the documents where words appear. This or similar matrix reduction techniques have then been applied to word-to-word matrices.</p>
<p>Principal Component Analysis (PCA) (Pearson, <xref ref-type="bibr" rid="B51">1901</xref>; Markovsky, <xref ref-type="bibr" rid="B43">2011</xref>) is a linear method which reduces the number of dimensions by projecting &#x0211D;<sup><italic>n</italic></sup> into the &#x0201C;<italic>best&#x0201D;</italic> linear subspace of a given dimension <italic>d</italic> by using a set of data points. The &#x0201C;<italic>best&#x0201D;</italic> linear subspace is a subspace where dimensions maximize the variance of the data points in the set. PCA can be interpreted either as a probabilistic method or as a matrix approximation; in the latter case it is usually known as <italic>truncated singular value decomposition</italic>. We are here interested in describing PCA as a probabilistic method as it relates to the <italic>interpretability</italic> of the related <italic>distributed representation</italic>.</p>
<p>As a probabilistic method, PCA finds an orthogonal projection matrix <inline-formula><mml:math id="M32"><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> such that the variance of the projected set of data points is maximized. The set of data points is represented as a matrix <italic>X</italic> &#x02208; &#x0211D;<sup><italic>m</italic>&#x000D7;<italic>n</italic></sup> where each row <inline-formula><mml:math id="M33"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is a single observation. Hence, the quantity whose variance is maximized is the projected data set <inline-formula><mml:math id="M34"><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>X</mml:mi><mml:msubsup><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>.</p>
<p>More specifically, let us consider the first weight vector <bold>w<sub>1</sub></bold>, which maps an element of the dataset <bold>x</bold> into a single number &#x02329;<bold>x</bold>, <bold>w<sub>1</sub></bold>&#x0232A;. Maximizing the variance means that <bold>w<sub>1</sub></bold> is such that:</p>
<disp-formula id="E20"><mml:math id="M35"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mrow><mml:msub><mml:mstyle mathvariant='bold'><mml:mtext>w</mml:mtext></mml:mstyle><mml:mstyle mathvariant='bold'><mml:mtext>1</mml:mtext></mml:mstyle></mml:msub><mml:mo>=</mml:mo><mml:munder><mml:mrow><mml:mtext>arg&#x000A0;max</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mo>&#x02016;</mml:mo><mml:mstyle mathvariant='bold'><mml:mtext>w</mml:mtext></mml:mstyle><mml:mo>&#x02016;</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munder><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mi>i</mml:mi></mml:munder><mml:mrow><mml:msup><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mo>&#x02329;</mml:mo><mml:msub><mml:mstyle mathvariant='bold'><mml:mi>x</mml:mi></mml:mstyle><mml:mtext>i</mml:mtext></mml:msub><mml:mo>,</mml:mo><mml:mstyle mathvariant='bold'><mml:mi>w</mml:mi></mml:mstyle><mml:mo>&#x0232A;</mml:mo><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mstyle></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>and it can be shown that the optimal value is achieved when <bold>w</bold> is the eigenvector of <italic>X</italic><sup><italic>T</italic></sup><italic>X</italic> with largest eigenvalue. This then produces a projected dataset:</p>
<disp-formula id="E21"><mml:math id="M36"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mover accent='true'><mml:mi>X</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover><mml:mn>1</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mi>X</mml:mi><mml:msubsup><mml:mi>W</mml:mi><mml:mn>1</mml:mn><mml:mi>T</mml:mi></mml:msubsup><mml:mo>=</mml:mo><mml:mi>X</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:msub><mml:mstyle mathvariant='bold'><mml:mi>w</mml:mi></mml:mstyle><mml:mstyle mathvariant='bold'><mml:mn>1</mml:mn></mml:mstyle></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The algorithm can then compute iteratively the second and further components by first subtracting the components already computed from <italic>X</italic>:</p>
<disp-formula id="E22"><mml:math id="M37"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>X</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mi>X</mml:mi><mml:msub><mml:mstyle mathvariant='bold'><mml:mi>w</mml:mi></mml:mstyle><mml:mn>1</mml:mn></mml:msub><mml:msubsup><mml:mstyle mathvariant='bold'><mml:mi>w</mml:mi></mml:mstyle><mml:mn>1</mml:mn><mml:mi>T</mml:mi></mml:msubsup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>and then proceed as before. However, it turns out that all subsequent components are related to the eigenvectors of the matrix <italic>X</italic><sup><italic>T</italic></sup><italic>X</italic>, that is, the <italic>d</italic>-th weight vector is the eigenvector of <italic>X</italic><sup><italic>T</italic></sup><italic>X</italic> with the <italic>d</italic>-th largest corresponding eigenvalue.</p>
<p>The encoding matrix for distributed representations derived with a PCA method is the matrix:</p>
<disp-formula id="E23"><mml:math id="M38"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mtable style="text-align:axis;" equalrows="false" columnlines="" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>w</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>w</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>&#x02026;</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>w</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <bold>w</bold><sub><italic>i</italic></sub> are eigenvectors with eigenvalues decreasing with <italic>i</italic>. Hence, local representations <bold>v</bold> &#x02208; &#x0211D;<sup><italic>n</italic></sup> are represented in distributed representations in &#x0211D;<sup><italic>d</italic></sup> as:</p>
<disp-formula id="E24"><mml:math id="M39"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>&#x003B7;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Hence, vectors &#x003B7;(<bold>v</bold>) are <italic>human-interpretable</italic>: each of their dimensions is a linear combination of the dimensions, that is, of the symbols, of the original local representation, and these dimensions are ordered according to their importance in the dataset, that is, their variance. The matrix <italic>W</italic><sub><italic>d</italic></sub> thus reports which combinations of the original symbols are most important for distinguishing data points in the set.</p>
<p>Moreover, vectors &#x003B7;(<bold>v</bold>) are <italic>decodable</italic>. The decoding function is:</p>
<disp-formula id="E25"><mml:math id="M40"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>&#x003B4;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mstyle mathvariant='bold'><mml:mi>v</mml:mi></mml:mstyle><mml:mo>&#x02032;</mml:mo></mml:msup></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msubsup><mml:mi>W</mml:mi><mml:mi>d</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:msup><mml:mstyle mathvariant='bold'><mml:mi>v</mml:mi></mml:mstyle><mml:mo>&#x02032;</mml:mo></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>and <inline-formula><mml:math id="M41"><mml:msubsup><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> acts as the identity on the data points in <italic>X</italic> if <italic>d</italic> is the rank of the matrix <italic>X</italic>; otherwise decoding yields a degraded approximation (for more details refer to Fodor, <xref ref-type="bibr" rid="B21">2002</xref>; Sorzano et al., <xref ref-type="bibr" rid="B62">2014</xref>). Hence, distributed vectors in &#x0211D;<sup><italic>d</italic></sup> can be decoded back into the original symbolic representation with a degree of approximation that depends on the distance between <italic>d</italic> and the rank of the matrix <italic>X</italic>.</p>
<p>The main limitation of PCA is that all the data points have to be available in order to obtain the encoding/decoding matrices. This is not feasible in two cases: first, when the model has to deal with big data; second, when the set of symbols to be encoded is extremely large. In this latter case, local representations cannot be used to produce matrices <italic>X</italic> for applying PCA.</p>
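<p>The encoding and decoding steps can be sketched in a few lines of Python. The fragment below is only an illustration under stated assumptions: it approximates each weight vector with power iteration on <italic>X</italic><sup><italic>T</italic></sup><italic>X</italic> and removes the component already found from <italic>X</italic> before computing the next one, it assumes that the data are already centered, and all function names and the toy matrix are hypothetical.</p>
<preformat><![CDATA[
```python
import math


def matvec(M, v):
    """Multiply a row-major matrix M by a vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def gram(X):
    """Return X^T X for a row-major data matrix X with m rows and n columns."""
    n = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]


def top_eigenvector(S, iterations=200):
    """Power iteration: approximate the unit eigenvector of S with largest eigenvalue."""
    n = len(S)
    w = [float(i + 1) for i in range(n)]  # arbitrary non-zero starting vector
    norm = math.sqrt(sum(x * x for x in w))
    w = [x / norm for x in w]
    for _ in range(iterations):
        nxt = matvec(S, w)
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm < 1e-12:  # S is numerically zero: nothing left to extract
            break
        w = [x / norm for x in nxt]
    return w


def pca(X, d):
    """Return W_d as a list of d rows, one weight vector per row."""
    X = [[float(x) for x in row] for row in X]
    W = []
    for _ in range(d):
        w = top_eigenvector(gram(X))
        W.append(w)
        scores = matvec(X, w)  # projection of every data point on w
        # Deflation: X <- X - X w w^T
        X = [[x - s * wi for x, wi in zip(row, w)] for row, s in zip(X, scores)]
    return W


def encode(W, v):
    """eta(v) = W_d v"""
    return matvec(W, v)


def decode(W, z):
    """delta(z) = W_d^T z"""
    n = len(W[0])
    return [sum(W[k][i] * z[k] for k in range(len(W))) for i in range(n)]


# Centered toy data of rank 2 in a 3-dimensional space.
X = [[2, 0, 0], [0, 1, 0], [-2, 0, 0], [0, -1, 0]]
W_rank2 = pca(X, 2)
W_rank1 = pca(X, 1)
```
]]></preformat>
<p>Since the toy matrix has rank 2, decoding the encoding obtained with <monospace>W_rank2</monospace> recovers each data point up to numerical error, whereas with <monospace>W_rank1</monospace> the second coordinate is lost.</p>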
<p>In Distributional Semantics, <italic>random indexing</italic> has been used to solve some issues that arise naturally with PCA when working with large vocabularies and large corpora. PCA has some scalability problems:</p>
<list list-type="bullet">
<list-item><p>The original co-occurrence matrix is very costly to obtain and store, even though it is only needed as an intermediate step before being transformed;</p></list-item>
<list-item><p>Dimensionality reduction is also very costly and, with the dimensions at hand, it can only be done with iterative methods;</p></list-item>
<list-item><p>The entire method is not incremental: if we want to add new words to our corpus, we have to recompute the entire co-occurrence matrix and then re-perform the PCA step.</p></list-item>
</list>
<p>Random Indexing (Sahlgren, <xref ref-type="bibr" rid="B56">2005</xref>) solves these problems: it is an incremental method (new words can be easily added at any time at low computational cost) which creates word vectors of reduced dimension without the need to create the full dimensional matrix.</p>
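<p>The incremental nature of the method can be illustrated with a short sketch. It assumes a common formulation in which every context word receives a fixed sparse random index vector with a few +1 and -1 entries, and the vector of a target word is the running sum of the index vectors of the words found in its window. The class name, the dimension, the number of non-zero entries and the window size are illustrative choices, not values taken from the cited work.</p>
<preformat><![CDATA[
```python
import random
from collections import defaultdict


class RandomIndexer:
    """Illustrative random indexing with sparse +1/-1 index vectors."""

    def __init__(self, dim=64, nonzeros=4, window=1, seed=0):
        self.dim = dim
        self.nonzeros = nonzeros
        self.window = window
        self.rng = random.Random(seed)
        self.index_vectors = {}  # fixed random signature of each context word
        self.context_vectors = defaultdict(lambda: [0] * dim)  # learned vectors

    def _index_vector(self, word):
        # Created lazily, so new words can appear at any time.
        if word not in self.index_vectors:
            vec = [0] * self.dim
            for pos in self.rng.sample(range(self.dim), self.nonzeros):
                vec[pos] = self.rng.choice((-1, 1))
            self.index_vectors[word] = vec
        return self.index_vectors[word]

    def update(self, sentence):
        """Add the contexts observed in one sentence; no global matrix is built."""
        tokens = sentence.split()
        for i, target in enumerate(tokens):
            lo = max(0, i - self.window)
            hi = min(len(tokens), i + self.window + 1)
            for j in range(lo, hi):
                if j != i:
                    index = self._index_vector(tokens[j])
                    vector = self.context_vectors[target]
                    for k, value in enumerate(index):
                        vector[k] += value


indexer = RandomIndexer()
for sentence in ["a cat catches a mouse", "a dog eats a mouse", "a dog catches a cat"]:
    indexer.update(sentence)
```
]]></preformat>
<p>Calling <monospace>update</monospace> on a new sentence only touches the vectors of the words in that sentence, which is what makes the procedure incremental; vectors keep the same reduced dimension however large the vocabulary grows.</p>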
<p>Interpretability of compacted distributional semantic vectors is comparable to the interpretability of distributed representations obtained with the same techniques.</p>
</sec>
<sec>
<title>4.3. Learning Representations: Word2vec</title>
<p>Recently, the <italic>distributional hypothesis</italic> has made its way into neural networks: <italic>word2vec</italic> (Mikolov et al., <xref ref-type="bibr" rid="B45">2013</xref>) uses contextual information to learn word vectors. Hence, we discuss this technique in the section devoted to <italic>distributional semantics</italic>.</p>
<p>The name word2vec covers two similar techniques, called <italic>skip-gram</italic> and <italic>continuous bag of words</italic> (CBOW). Both methods are neural networks: the former takes a word as input and tries to predict its context, while the latter does the reverse, predicting a word from the words surrounding it. With this technique there is no explicitly computed co-occurrence matrix, nor is there an explicit association feature between pairs of words; instead, the regularities and distribution of the words are learned implicitly by the network.</p>
<p>We describe only CBOW because it is conceptually simpler and because the core ideas are the same in both cases. The full network is generally realized with two layers <italic>W</italic>1<sub><italic>n</italic>&#x000D7;<italic>k</italic></sub> and <italic>W</italic>2<sub><italic>k</italic>&#x000D7;<italic>n</italic></sub> plus a softmax layer to reconstruct the final vector representing the word. In the learning phase, the input and the output of the network are local representations of words. In CBOW, the network aims to predict a target word given context words. For example, given the sentence <italic>s</italic><sub>1</sub> of the corpus in <xref ref-type="table" rid="T1">Table 1</xref>, the network has to predict <italic>catches</italic> given its context (see <xref ref-type="fig" rid="F1">Figure 1</xref>).</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>word2vec: CBOW model.</p></caption>
<graphic xlink:href="frobt-06-00153-g0001.tif"/>
</fig>
<p>Hence, CBOW offers an encoder <italic>W</italic>1<sub><italic>n</italic>&#x000D7;<italic>k</italic></sub>, that is, a linear word encoder learned from data, where <italic>n</italic> is the size of the vocabulary and <italic>k</italic> is the size of the distributional vector. This encoder models contextual information learned by maximizing the prediction capability of the network. A clear description of how this approach is related to previous techniques is given in Goldberg and Levy (<xref ref-type="bibr" rid="B25">2014</xref>).</p>
<p>Clearly, CBOW distributional vectors are not easily <italic>interpretable</italic> by humans or machines. In fact, specific dimensions of vectors do not have a particular meaning and, unlike auto-encoders (see section 3.2.1), these networks are not trained to be invertible.</p>
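<p>The forward pass of such a network can be sketched as follows. This is an untrained illustration under several assumptions: the weights are random, the context vectors are averaged (summing is an equally plausible choice), and the vocabulary order, the size <italic>k</italic> &#x0003D; 3 and the function name are hypothetical. It only shows how the two layers and the softmax fit together, not how word2vec is actually trained.</p>
<preformat><![CDATA[
```python
import math
import random

vocab = ["a", "cat", "catches", "mouse", "dog", "eats"]
word_to_id = {word: i for i, word in enumerate(vocab)}
n, k = len(vocab), 3  # vocabulary size and size of the distributional vector

rng = random.Random(0)
# W1 is n x k (the word encoder) and W2 is k x n; random, i.e., untrained.
W1 = [[rng.uniform(-0.5, 0.5) for _ in range(k)] for _ in range(n)]
W2 = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(k)]


def cbow_forward(context_words):
    """Return the hidden vector and the predicted distribution over target words."""
    ids = [word_to_id[word] for word in context_words]
    # Multiplying a one-hot (local) vector by W1 selects one row of W1;
    # the rows of all context words are averaged.
    hidden = [sum(W1[i][j] for i in ids) / len(ids) for j in range(k)]
    scores = [sum(hidden[j] * W2[j][t] for j in range(k)) for t in range(n)]
    # Numerically stable softmax.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return hidden, [e / total for e in exps]


# Context of "catches" in the sentence "a cat catches a mouse".
hidden, probs = cbow_forward(["a", "cat", "a", "mouse"])
cat_vector = W1[word_to_id["cat"]]  # row of W1 used as the vector for "cat"
```
]]></preformat>
<p>Training would adjust <monospace>W1</monospace> and <monospace>W2</monospace> so that the probability assigned to the observed target word increases; after training, the rows of <monospace>W1</monospace> are the word vectors.</p>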
</sec>
</sec>
<sec id="s5">
<title>5. Composing Distributed Representations</title>
<p>In the previous sections, we described how one symbol or a bag-of-symbols can be transformed into distributed representations, focusing on whether these distributed representations are <italic>interpretable</italic>. In this section, we investigate a second important aspect of these representations: do they have <italic>Concatenative Compositionality</italic> as symbolic representations do? And, if these representations are <italic>composed</italic>, are they still <italic>interpretable</italic>?</p>
<p><italic>Concatenative Compositionality</italic> is the ability of a symbolic representation to describe sequences or structures by composing symbols with specific rules. In this process, symbols remain distinct and composing rules are clear. Hence, final sequences and structures can be used for subsequent steps as knowledge repositories.</p>
<p><italic>Concatenative Compositionality</italic> is an important aspect for any representation and, thus, for a distributed representation. Understanding to what extent a distributed representation has <italic>concatenative compositionality</italic> and how information can be recovered is then a critical issue. In fact, this issue has been strongly posed by Plate (<xref ref-type="bibr" rid="B52">1994</xref>, <xref ref-type="bibr" rid="B53">1995</xref>) who analyzed how some specific distributed representations encode structural information and how this structural information can be recovered.</p>
<p>Current approaches for treating distributed/distributional representation of sequences and structures mix two aspects in one model: a &#x0201C;<italic>semantic&#x0201D;</italic> aspect and a <italic>representational</italic> aspect. Generally, the semantic aspect is predominant and the representational aspect is left aside. By &#x0201C;<italic>semantic&#x0201D;</italic> aspect, we refer to the reason why distributed symbols are composed: a final task in neural network applications or the need to give a <italic>distributional semantic vector</italic> for sequences of words. The latter is the case for <italic>compositional distributional semantics</italic> (Clark et al., <xref ref-type="bibr" rid="B12">2008</xref>; Baroni et al., <xref ref-type="bibr" rid="B4">2014</xref>). By <italic>representational</italic> aspect, we refer to the fact that composed distributed representations are in fact representing structures and these representations can be decoded back in order to extract what is in these structures.</p>
<p>Although the &#x0201C;<italic>semantic&#x0201D;</italic> aspect seems to be predominant in <italic>models-that-compose</italic>, the <italic>convolution conjecture</italic> (Zanzotto et al., <xref ref-type="bibr" rid="B74">2015</xref>) hypothesizes that the two aspects coexist and that the <italic>representational</italic> aspect always plays a crucial role. According to this conjecture, structural information is preserved in any model that composes and structural information emerges back when comparing two distributed representations with dot product to determine their similarity.</p>
<p>Hence, given the <italic>convolution conjecture, models-that-compose</italic> produce distributed representations for structures that can be interpreted back. <italic>Interpretability</italic> is a very important feature in these <italic>models-that-compose</italic> which will drive our analysis.</p>
<p>In this section we will explore the issues faced with the compositionality of representations, and the main &#x0201C;trends&#x0201D;, which correspond somewhat to the categories already presented. In particular we will start from the work on compositional distributional semantics, then we review the work on holographic reduced representations (Plate, <xref ref-type="bibr" rid="B53">1995</xref>; Neumann, <xref ref-type="bibr" rid="B49">2001</xref>) and, finally, we analyze the recent approaches with recurrent and recursive neural networks. Again, these categories are not entirely disjoint, and methods presented in one class can often be interpreted as belonging to another class.</p>
<sec>
<title>5.1. Compositional Distributional Semantics</title>
<p>In distributional semantics, <italic>models-that-compose</italic> are known as <italic>compositional distributional semantics models</italic> (CDSMs) (Mitchell and Lapata, <xref ref-type="bibr" rid="B47">2010</xref>; Baroni et al., <xref ref-type="bibr" rid="B4">2014</xref>) and aim to apply the principle of compositionality (Frege, <xref ref-type="bibr" rid="B23">1884</xref>; Montague, <xref ref-type="bibr" rid="B48">1974</xref>) to compute distributional semantic vectors for phrases. These CDSMs produce distributional semantic vectors of phrases by composing distributional vectors of words in these phrases. These models generally exploit <italic>structured or syntactic representations</italic> of phrases to derive their distributional meaning. Hence, CDSMs aim to give a complete semantic model for distributional semantics.</p>
<p>As in distributional semantics for words, the aim of CDSMs is to produce similar vectors for semantically similar sentences regardless of their lengths or structures. For example, words and word definitions in dictionaries should have similar vectors as discussed in Zanzotto et al. (<xref ref-type="bibr" rid="B75">2010</xref>). As usual in distributional semantics, similarity is captured with dot products (or similar metrics) among distributional vectors.</p>
<p>The applications of these CDSMs encompass multi-document summarization, recognizing textual entailment (Dagan et al., <xref ref-type="bibr" rid="B16">2013</xref>) and, obviously, semantic textual similarity detection (Agirre et al., <xref ref-type="bibr" rid="B2">2013</xref>).</p>
<p>Apparently, these CDSMs are far from having <italic>concatenative compositionality</italic>, since they do not seem to produce distributed representations that can be <italic>interpreted</italic> back. In some sense, by their nature the resulting vectors forget how they were obtained and focus on the final distributional meaning of phrases. There is, however, some evidence that this is not exactly the case.</p>
<p>The <italic>convolution conjecture</italic> (Zanzotto et al., <xref ref-type="bibr" rid="B74">2015</xref>) suggests that many CDSMs produce distributional vectors where structural information and vectors for individual words can still be <italic>interpreted</italic>. Hence, many CDSMs have the <italic>concatenative compositionality</italic> property and are <italic>interpretable</italic>.</p>
<p>In the rest of this section, we will show some classes of these CDSMs and we focus on describing how these models are interpretable.</p>
<sec>
<title>5.1.1. Additive Models</title>
<p><italic>Additive models</italic> for compositional distributional semantics are important examples of <italic>models-that-compose</italic> where <italic>semantic</italic> and <italic>representational</italic> aspects are clearly separated. Hence, these models can be highly <italic>interpretable</italic>.</p>
<p>These additive models have been formally captured in the general framework for two-word sequences proposed by Mitchell and Lapata (<xref ref-type="bibr" rid="B46">2008</xref>). The general framework for composing distributional vectors of two-word sequences &#x0201C;<italic>uv&#x0201D;</italic> is the following:</p>
<disp-formula id="E26"><label>(6)</label><mml:math id="M42"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:mstyle mathvariant="bold"><mml:mtext>p</mml:mtext></mml:mstyle><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>u</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle><mml:mo>;</mml:mo><mml:mi>R</mml:mi><mml:mo>;</mml:mo><mml:mi>K</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <bold>p</bold> &#x02208; &#x0211D;<sup><italic>n</italic></sup> is the composition vector, <bold>u</bold> and <bold>v</bold> are the vectors for the two words <italic>u</italic> and <italic>v</italic>, <italic>R</italic> is the grammatical relation linking the two words and <italic>K</italic> is any other additional knowledge used in the composition operation. In the additive model, this equation has the following form:</p>
<disp-formula id="E27"><label>(7)</label><mml:math id="M43"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:mstyle mathvariant="bold"><mml:mtext>p</mml:mtext></mml:mstyle><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>u</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle><mml:mo>;</mml:mo><mml:mi>R</mml:mi><mml:mo>;</mml:mo><mml:mi>K</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:msub><mml:mstyle mathvariant="bold"><mml:mtext>u</mml:mtext></mml:mstyle><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>B</mml:mi></mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:msub><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>A</italic><sub><italic>R</italic></sub> and <italic>B</italic><sub><italic>R</italic></sub> are two square matrices depending on the grammatical relation <italic>R</italic> which may be learned from data (Guevara, <xref ref-type="bibr" rid="B29">2010</xref>; Zanzotto et al., <xref ref-type="bibr" rid="B75">2010</xref>).</p>
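<p>As a small numerical illustration of Equation (7), the sketch below composes two toy word vectors. The vectors and the relation-specific matrices are made-up values (in practice such matrices may be learned from data), and the function names are hypothetical. With identity matrices the model reduces to the plain vector sum.</p>
<preformat><![CDATA[
```python
def matvec(M, v):
    """Multiply a row-major square matrix M by a vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def additive_compose(u, v, A, B):
    """p = A u + B v, with A and B depending on the grammatical relation."""
    return [a + b for a, b in zip(matvec(A, u), matvec(B, v))]


# Toy distributional vectors for the two words (assumed values).
u = [1.0, 2.0]
v = [3.0, 4.0]

identity = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical matrices for one grammatical relation R.
A_R = [[0.5, 0.0], [0.0, 0.5]]
B_R = [[1.0, 0.0], [0.0, 2.0]]

p_sum = additive_compose(u, v, identity, identity)  # plain vector addition
p_rel = additive_compose(u, v, A_R, B_R)  # relation-specific weighting
```
]]></preformat>
<p>Here <monospace>p_sum</monospace> is (4, 6) and <monospace>p_rel</monospace> is (3.5, 9), showing how the matrices change the contribution of each word to the composed vector.</p>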
<p>Before investigating whether these models are interpretable, let us introduce a recursive formulation of additive models which can be applied to structural representations of sentences. For this purpose, we use dependency trees. A dependency tree can be defined as a tree whose nodes are words and whose typed links are the relations between two words. The root of the tree represents the word that governs the meaning of the sentence. A dependency tree <italic>T</italic> is then either a word, if it is a final node, or it has a root <italic>r</italic><sub><italic>T</italic></sub> and links (<italic>r</italic><sub><italic>T</italic></sub>, <italic>R, C</italic><sub><italic>i</italic></sub>) where <italic>C</italic><sub><italic>i</italic></sub> is the i-th subtree of the node <italic>r</italic><sub><italic>T</italic></sub> and <italic>R</italic> is the relation that links the node <italic>r</italic><sub><italic>T</italic></sub> with <italic>C</italic><sub><italic>i</italic></sub>. The dependency trees of two example sentences are reported in <xref ref-type="fig" rid="F2">Figure 2</xref>. The recursive formulation is then the following:</p>
<disp-formula id="E28"><mml:math id="M44"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mi>f</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>T</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mi>i</mml:mi></mml:munder><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>A</mml:mi><mml:mi>R</mml:mi></mml:msub><mml:msub><mml:mstyle mathvariant='bold'><mml:mi>r</mml:mi></mml:mstyle><mml:mstyle mathvariant='bold'><mml:mi>T</mml:mi></mml:mstyle></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>B</mml:mi><mml:mi>R</mml:mi></mml:msub><mml:msub><mml:mi>f</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo></mml:mrow></mml:mstyle><mml:msub><mml:mi>C</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>According to the recursive definition of the additive model, the function <italic>f</italic><sub><italic>r</italic></sub>(<italic>T</italic>) results in a linear combination of elements <italic>M</italic><sub><italic>s</italic></sub><bold>w</bold><sub><italic>s</italic></sub> where <italic>M</italic><sub><italic>s</italic></sub> is a product of matrices that <italic>represents the structure</italic> and <bold>w</bold><sub><italic>s</italic></sub> is the <italic>distributional meaning</italic> of one word in this structure, that is:</p>
<disp-formula id="E29"><mml:math id="M45"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>S</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:munder></mml:mstyle><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>M</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>w</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>S</italic>(<italic>T</italic>) are the relevant substructures of <italic>T</italic>. In this case, <italic>S</italic>(<italic>T</italic>) contains the link chains. For example, the first sentence in <xref ref-type="fig" rid="F2">Figure 2</xref> has a distributed vector defined in this way:</p>
<disp-formula id="E30"><mml:math id="M46"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>cows eat animal extracts</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:msub><mml:mstyle mathvariant="bold"><mml:mtext>eat</mml:mtext></mml:mstyle><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>B</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:msub><mml:mstyle mathvariant="bold"><mml:mtext>cows</mml:mtext></mml:mstyle><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:msub><mml:mstyle mathvariant="bold"><mml:mtext>eat</mml:mtext></mml:mstyle><mml:mo>&#x0002B;</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>B</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>animal extracts</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:msub><mml:mstyle mathvariant="bold"><mml:mtext>eat</mml:mtext></mml:mstyle><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>B</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:msub><mml:mstyle
mathvariant="bold"><mml:mtext>cows</mml:mtext></mml:mstyle><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:msub><mml:mstyle mathvariant="bold"><mml:mtext>eat</mml:mtext></mml:mstyle><mml:mo>&#x0002B;</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>B</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:msub><mml:mstyle mathvariant="bold"><mml:mtext>extracts</mml:mtext></mml:mstyle><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>B</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>B</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:msub><mml:mstyle mathvariant="bold"><mml:mtext>animal</mml:mtext></mml:mstyle></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>A sentence and its dependency graph.</p></caption>
<graphic xlink:href="frobt-06-00153-g0002.tif"/>
</fig>
<p>Each term of the sum has a part that represents the structure and a part that represents the meaning, for example:</p>
<disp-formula id="E31"><mml:math id="M47"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mover class="msup"><mml:mrow><mml:mover accent="false"><mml:mrow><mml:msub><mml:mrow><mml:mi>B</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>B</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x0FE37;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>u</mml:mi><mml:mi>c</mml:mi><mml:mi>t</mml:mi><mml:mi>u</mml:mi><mml:mi>r</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:mover><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mstyle displaystyle="true"><mml:munder accentunder="false"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>beef</mml:mtext></mml:mstyle></mml:mrow><mml:mo>&#x0FE38;</mml:mo></mml:munder></mml:mstyle></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:munder></mml:mstyle></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Hence, this recursive additive model for compositional semantics is a <italic>model-that-composes</italic> which, in principle, can be highly <italic>interpretable</italic>. By selecting matrices <bold>M</bold><sub><italic>s</italic></sub> such that:</p>
<disp-formula id="E32"><label>(8)</label><mml:math id="M48"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mstyle mathvariant='bold'><mml:mi>M</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow><mml:mi>T</mml:mi></mml:msubsup><mml:msub><mml:mstyle mathvariant='bold'><mml:mi>M</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:msub><mml:mo>&#x02248;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable><mml:mtr><mml:mtd><mml:mstyle mathvariant='bold'><mml:mi>I</mml:mi></mml:mstyle></mml:mtd><mml:mtd><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mtext>1</mml:mtext></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mstyle mathvariant='bold'><mml:mn>0</mml:mn></mml:mstyle></mml:mtd><mml:mtd><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>&#x02260;</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>it is possible to recover distributional semantic vectors related to words that are in specific parts of the structure. For example, the vector of the main verb of the sample sentence in <xref ref-type="fig" rid="F2">Figure 2</xref> can be approximately recovered by multiplying the sentence vector by the matrix <inline-formula><mml:math id="M49"><mml:msubsup><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, that is:</p>
<disp-formula id="E33"><mml:math id="M50"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>cows eat animal extracts</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02248;</mml:mo><mml:mn>2</mml:mn><mml:mstyle mathvariant="bold"><mml:mtext>eat</mml:mtext></mml:mstyle></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>In general, matrices derived for compositional distributional semantic models (Guevara, <xref ref-type="bibr" rid="B29">2010</xref>; Zanzotto et al., <xref ref-type="bibr" rid="B75">2010</xref>) do not have this property, but it is possible to obtain matrices with this property by applying the Johnson-Lindenstrauss Transform (Johnson and Lindenstrauss, <xref ref-type="bibr" rid="B36">1984</xref>) or similar techniques, as also discussed in Zanzotto et al. (<xref ref-type="bibr" rid="B74">2015</xref>).</p>
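<p>The recovery step described above can be illustrated with a minimal NumPy sketch. It is not the procedure of Zanzotto et al. (2015): as an assumption made only for illustration, the structure matrices are random orthogonal matrices obtained from a QR decomposition, for which the first case of Equation (8) holds exactly, while the second case holds only in the weaker sense that cross terms are nearly orthogonal to every word vector. Word vectors are random unit vectors, and the dimension <monospace>d</monospace> and all variable names are hypothetical.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512  # assumed dimension; larger d gives less cross-talk


def random_orthogonal(dim):
    # The Q factor of a Gaussian matrix is orthogonal: Q.T @ Q = I
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q


def random_unit(dim):
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)


# Hypothetical word vectors (random stand-ins for distributional vectors)
words = {w: random_unit(d) for w in ["cows", "eat", "animal", "extracts"]}
A_VN, B_VN, A_NN, B_NN = (random_orthogonal(d) for _ in range(4))

# f_r(animal extracts) = A_NN extracts + B_NN animal
f_np = A_NN @ words["extracts"] + B_NN @ words["animal"]

# f_r(cows eat animal extracts), following the expansion in the text
f_sentence = (A_VN @ words["eat"] + B_VN @ words["cows"]
              + A_VN @ words["eat"] + B_VN @ f_np)

# Multiplying by A_VN^T keeps the two "eat" terms and scrambles the rest
decoded = A_VN.T @ f_sentence
scores = {w: float(decoded @ v) for w, v in words.items()}
best = max(scores, key=scores.get)
```

<p>In this sketch the score for <italic>eat</italic> is close to 2, matching the factor in the equation above, while the scores of the other words stay near zero; the residual is cross-talk that shrinks as <monospace>d</monospace> grows.</p>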
</sec>
<sec>
<title>5.1.2. Lexical Functional Compositional Distributional Semantic Models</title>
<p>Lexical Functional Models are compositional distributional semantic models where words are tensors and each type of word is represented by tensors of a different order. Composing meaning then amounts to composing these tensors to obtain vectors. These models have a solid mathematical background linking Lambek pregroup theory, formal semantics and distributional semantics (Coecke et al., <xref ref-type="bibr" rid="B13">2010</xref>). Lexical Functional models are concatenative compositional; yet, in the following, we will examine whether these models produce vectors that may be <italic>interpreted</italic>.</p>
<p>To determine whether these models produce <italic>interpretable</italic> vectors, we start from a simple Lexical Functional model applied to two-word sequences. This model has been extensively analyzed in Baroni and Zamparelli (<xref ref-type="bibr" rid="B6">2010</xref>), as matrices were considered better linear models to encode <italic>adjectives</italic>.</p>
<p>In Lexical Functional models over two-word sequences, one of the two words is represented as a tensor of order 2 (that is, a matrix) and the other word is represented by a vector. For example, <italic>adjectives</italic> are matrices and nouns are vectors (Baroni and Zamparelli, <xref ref-type="bibr" rid="B6">2010</xref>) in adjective-noun sequences. Hence, adjective-noun sequences like &#x0201C;<italic>black cat</italic>&#x0201D; or &#x0201C;<italic>white dog</italic>&#x0201D; are represented as:</p>
<disp-formula id="E34"><mml:math id="M51"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>black&#x000A0;cat</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>BLACKcat</mml:mtext></mml:mstyle></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E35"><mml:math id="M52"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>white&#x000A0;dog</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>WHITEdog</mml:mtext></mml:mstyle></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <bold>BLACK</bold> and <bold>WHITE</bold> are matrices representing the two adjectives and <bold>cat</bold> and <bold>dog</bold> are the two vectors representing the two nouns.</p>
<p>These two-word models are <italic>partially interpretable</italic>: knowing the adjective, it is possible to extract the noun, but not vice versa. In fact, if matrices for adjectives are invertible, it is possible to extract which nouns have been related to particular adjectives. For example, if <bold>BLACK</bold> is invertible, the inverse matrix <bold>BLACK</bold><sup>&#x02212;1</sup> can be used to extract the vector of <italic>cat</italic> from the vector <italic>f</italic>(black cat):</p>
<disp-formula id="E36"><mml:math id="M53"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mstyle mathvariant="bold"><mml:mtext>cat</mml:mtext></mml:mstyle><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>BLACK</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>black cat</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>This contributes to the <italic>interpretability</italic> of this model. Moreover, if matrices for adjectives are built using Johnson-Lindenstrauss Transforms (Johnson and Lindenstrauss, <xref ref-type="bibr" rid="B36">1984</xref>), that is, matrices with the property in Equation (8), it is possible to pack different pieces of sentences in a single vector and, then, select only relevant information, for example:</p>
<disp-formula id="E37"><mml:math id="M54"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mstyle mathvariant="bold"><mml:mtext>cat</mml:mtext></mml:mstyle><mml:mo>&#x02248;</mml:mo><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>BLACK</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>black cat</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>white dog</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>On the contrary, knowing noun vectors, it is not possible to extract back adjective matrices. This is a strong limitation in terms of interpretability.</p>
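<p>Both extraction routes for two-word models can be sketched in NumPy. The sketch rests on assumptions that are not part of the original models: word vectors are random unit vectors, the exactly invertible adjective is a random Gaussian matrix, and the matrices used for packing are random orthogonal matrices standing in for matrices with the property in Equation (8). All names and the dimension <monospace>d</monospace> are hypothetical.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
d = 512  # assumed dimension


def unit(v):
    return v / np.linalg.norm(v)


# Hypothetical noun vectors
cat = unit(rng.standard_normal(d))
dog = unit(rng.standard_normal(d))

# Exact case: an invertible adjective matrix lets us recover the noun.
BLACK = rng.standard_normal((d, d))
black_cat = BLACK @ cat
# Solving BLACK x = f(black cat) is the same as applying BLACK^{-1}
recovered = np.linalg.solve(BLACK, black_cat)

# Approximate case: orthogonal stand-ins, so BLACK_Q.T @ BLACK_Q = I and
# BLACK_Q.T @ WHITE_Q sends vectors to essentially unrelated directions.
BLACK_Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
WHITE_Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
packed = BLACK_Q @ cat + WHITE_Q @ dog      # f(black cat) + f(white dog)
probe = unit(BLACK_Q.T @ packed)            # noisy copy of cat

sim_cat = float(probe @ cat)
sim_dog = float(probe @ dog)
```

<p>In the packed case the probe is only a noisy copy of <bold>cat</bold>, so in practice it would be compared against the known noun vectors and the most similar one selected.</p>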
<p>Lexical Functional models for larger structures are concatenative compositional but not interpretable at all. In fact, in general these models have tensors in the middle and these tensors are the only parts that can be inverted. Hence, in general these models are not interpretable. However, using the <italic>convolution conjecture</italic> (Zanzotto et al., <xref ref-type="bibr" rid="B74">2015</xref>), it is possible to know whether subparts are contained in some final vectors obtained with these models.</p>
</sec>
</sec>
<sec>
<title>5.2. Holographic Representations</title>
<p>Holographic reduced representations (HRRs) are <italic>models-that-compose</italic> expressly designed to be <italic>interpretable</italic> (Plate, <xref ref-type="bibr" rid="B53">1995</xref>; Neumann, <xref ref-type="bibr" rid="B49">2001</xref>). In fact, these models encode flat structures representing assertions, and these assertions should then be searched in order to recover the pieces of knowledge they contain. For example, these representations have been used to encode logical propositions such as <italic>eat</italic>(<italic>John, apple</italic>). In this case, each atomic element has an associated vector and the vector for the compound is obtained by combining these vectors. The major concern here is to build encoding functions that can be decoded, that is, it should be possible to retrieve composing elements from final distributed vectors such as the vector of <italic>eat</italic>(<italic>John, apple</italic>).</p>
<p>In HRRs, <italic>nearly orthogonal unit vectors</italic> (Johnson and Lindenstrauss, <xref ref-type="bibr" rid="B36">1984</xref>) for basic symbols, <italic>circular convolution</italic> &#x02297; and <italic>circular correlation</italic> &#x02295; guarantee <italic>composability</italic> and <italic>interpretability</italic>. HRRs are the extension of Random Indexing (see section 3.1) to structures. Hence, symbols are represented with vectors sampled from a multivariate normal distribution <inline-formula><mml:math id="M55"><mml:mi>N</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:mfrac><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. The composition function is the circular convolution indicated as &#x02297; and defined as:</p>
<disp-formula id="E38"><mml:math id="M56"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>a&#x000A0;</mml:mtext></mml:mstyle><mml:mo>&#x02297;</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>&#x000A0;b</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover></mml:mstyle><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>-</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where subscripts are modulo <italic>d</italic>. Circular convolution is commutative and bilinear. This operation can also be computed using <italic>circulant matrices</italic>:</p>
<disp-formula id="E39"><mml:math id="M57"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mstyle mathvariant="bold"><mml:mtext>z</mml:mtext></mml:mstyle><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>a&#x000A0;</mml:mtext></mml:mstyle><mml:mo>&#x02297;</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>&#x000A0;b</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>A</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mo>&#x02218;</mml:mo></mml:mrow></mml:msub><mml:mstyle mathvariant="bold"><mml:mtext>b</mml:mtext></mml:mstyle><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>B</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mo>&#x02218;</mml:mo></mml:mrow></mml:msub><mml:mstyle mathvariant="bold"><mml:mtext>a</mml:mtext></mml:mstyle></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <bold>A</bold><sub>&#x02218;</sub> and <bold>B</bold><sub>&#x02218;</sub> are circulant matrices of the vectors <bold>a</bold> and <bold>b</bold>. Given the properties of vectors <bold>a</bold> and <bold>b</bold>, matrices <bold>A</bold><sub>&#x02218;</sub> and <bold>B</bold><sub>&#x02218;</sub> have the property in Equation (8). Hence, <italic>circular convolution</italic> is approximately invertible with the <italic>circular correlation</italic> function (&#x02295;) defined as follows:</p>
<disp-formula id="E40"><mml:math id="M58"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>z&#x000A0;</mml:mtext></mml:mstyle><mml:mo>&#x02295;</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>&#x000A0;b</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover></mml:mstyle><mml:msub><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where again subscripts are modulo <italic>d</italic>. Circular correlation is related to inverse matrices of circulant matrices, that is <inline-formula><mml:math id="M59"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>B</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mo>&#x02218;</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>. In the decoding with &#x02295;, parts of the structures can be derived in an approximated way, that is:</p>
<disp-formula id="E41"><mml:math id="M60"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>a&#x000A0;</mml:mtext></mml:mstyle><mml:mo>&#x02297;</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>&#x000A0;b</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02295;</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>b</mml:mtext></mml:mstyle><mml:mo>&#x02248;</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>a</mml:mtext></mml:mstyle></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Hence, circular convolution &#x02297; and circular correlation &#x02295; make it possible to build interpretable representations. For example, having the vectors <bold>e</bold>, <bold>J</bold>, and <bold>a</bold> for <italic>eat</italic>, <italic>John</italic> and <italic>apple</italic>, respectively, the following encoding and decoding produces a vector that approximates the original vector for <italic>John</italic>:</p>
<disp-formula id="E42"><mml:math id="M61"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mstyle mathvariant="bold"><mml:mtext>J</mml:mtext></mml:mstyle><mml:mo>&#x02248;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>J</mml:mtext></mml:mstyle><mml:mo>&#x02297;</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>e</mml:mtext></mml:mstyle><mml:mo>&#x02297;</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>a</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02295;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>e&#x000A0;</mml:mtext></mml:mstyle><mml:mo>&#x02297;</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>&#x000A0;a</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The &#x0201C;invertibility&#x0201D; of these representations is important because it allows us not to consider these representations as black boxes.</p>
<p>However, holographic representations have severe limitations, as they can encode and decode only simple, flat structures. In fact, these representations are based on the circular convolution, which is a commutative function; this implies that the representation cannot keep track of composition of objects where the order matters, and this phenomenon is particularly important when encoding nested structures.</p>
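<p>The encoding and decoding of <italic>eat</italic>(<italic>John, apple</italic>) can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation of HRRs: both operations are computed through the FFT, the dimension <monospace>d</monospace> and the helper names are hypothetical, and the correlation sums <italic>b</italic><sub><italic>k</italic></sub><italic>z</italic><sub><italic>j</italic>&#x0002B;<italic>k</italic></sub>, which is the index convention under which (<bold>a</bold> &#x02297; <bold>b</bold>) &#x02295; <bold>b</bold> &#x02248; <bold>a</bold> holds and which corresponds to multiplying by the transposed circulant matrix of <bold>b</bold>.</p>

```python
import numpy as np

rng = np.random.default_rng(2)
d = 1024  # assumed dimension


def symbol():
    # Entries drawn from N(0, 1/d), so the expected squared norm is 1
    return rng.normal(0.0, 1.0 / np.sqrt(d), d)


def cconv(a, b):
    # Circular convolution via FFT: z_j = sum_k a_k b_{j-k}
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=d)


def ccorr(z, b):
    # Circular correlation via FFT: c_j = sum_k b_k z_{j+k}
    return np.fft.irfft(np.conj(np.fft.rfft(b)) * np.fft.rfft(z), n=d)


def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


e, J, a = symbol(), symbol(), symbol()  # eat, John, apple

# Circulant-matrix view: entry (j, k) of A_circ is a_{j-k} (indices mod d)
idx = (np.arange(d)[:, None] - np.arange(d)[None, :]) % d
A_circ = a[idx]
same_as_matrix = np.allclose(A_circ @ e, cconv(a, e))

trace = cconv(cconv(J, e), a)        # J (*) e (*) a
decoded = ccorr(trace, cconv(e, a))  # noisy copy of J

sims = {"eat": cos(decoded, e), "John": cos(decoded, J), "apple": cos(decoded, a)}
best = max(sims, key=sims.get)
```

<p>The decoded vector is only an approximation of <bold>J</bold>, so the usual last step is a clean-up that picks the known symbol with the highest similarity, as the final two lines do.</p>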
<p>Distributed trees (Zanzotto and Dell&#x00027;Arciprete, <xref ref-type="bibr" rid="B73">2012</xref>) have shown that the principles expressed in holographic representation can be applied to encode larger structures, overcoming the problem of reliably encoding the order in which elements are composed by using the <italic>shuffled circular convolution</italic> function as the composition operator. Distributed trees are encoding functions that transform trees into low-dimensional vectors that also contain the encoding of every substructure of the tree. Thus, these distributed trees are particularly attractive as they can be used to represent structures in linear learning machines, which are computationally efficient.</p>
<p>Distributed trees and, in particular, distributed smoothed trees (Ferrone and Zanzotto, <xref ref-type="bibr" rid="B18">2014</xref>) represent an interesting middle way between compositional distributional semantic models and holographic representation.</p>
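<p>To make the role of order concrete, the following NumPy sketch contrasts plain circular convolution with one possible shuffled variant. The definition used here, permuting each argument with its own fixed random permutation before convolving, is an assumption made for illustration and should be checked against Zanzotto and Dell&#x00027;Arciprete (2012); the names <monospace>p1</monospace>, <monospace>p2</monospace> and <monospace>shuffled_cconv</monospace> are hypothetical.</p>

```python
import numpy as np

rng = np.random.default_rng(3)
d = 1024  # assumed dimension

# Two fixed, independently drawn permutations of the indices
p1, p2 = rng.permutation(d), rng.permutation(d)


def cconv(a, b):
    # Circular convolution via FFT
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=d)


def shuffled_cconv(a, b):
    # Shuffle each argument with its own permutation, then convolve
    return cconv(a[p1], b[p2])


def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


a = rng.normal(0.0, 1.0 / np.sqrt(d), d)
b = rng.normal(0.0, 1.0 / np.sqrt(d), d)

# Plain convolution cannot tell (a, b) from (b, a) ...
plain_commutes = np.allclose(cconv(a, b), cconv(b, a))
# ... while the shuffled variant maps the two orders to unrelated vectors
order_similarity = cos(shuffled_cconv(a, b), shuffled_cconv(b, a))
```

<p>Because the two argument positions are treated differently, swapping the operands yields a nearly orthogonal vector, which is what allows the order of composition to be tracked.</p>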
</sec>
<sec>
<title>5.3. Compositional Models in Neural Networks</title>
<p>When neural networks are applied to sequences or structured data, these networks are in fact <italic>models-that-compose</italic>. However, the resulting <italic>models-that-compose</italic> are not interpretable. In fact, composition functions are trained on specific tasks and not on the possibility of reconstructing the structured input, except in some rare cases (Socher et al., <xref ref-type="bibr" rid="B60">2011</xref>). The inputs of these networks are sequences or structured data where basic symbols are embedded in <italic>local</italic> representations or <italic>distributed</italic> representations obtained with word embedding (see section 4.3). The outputs are distributed vectors derived for specific tasks. Hence, these <italic>models-that-compose</italic> are not interpretable in our sense, both because of their final aim and because <italic>non-linear</italic> functions are adopted in the specification of the neural networks.</p>
<p>In this section, we review some prominent neural network architectures that can be interpreted as <italic>models-that-compose</italic>: the <italic>recurrent neural networks</italic> (Krizhevsky et al., <xref ref-type="bibr" rid="B38">2012</xref>; Graves, <xref ref-type="bibr" rid="B27">2013</xref>; Vinyals et al., <xref ref-type="bibr" rid="B68">2015a</xref>; He et al., <xref ref-type="bibr" rid="B31">2016</xref>) and the <italic>recursive neural networks</italic> (Socher et al., <xref ref-type="bibr" rid="B61">2012</xref>).</p>
<sec>
<title>5.3.1. Recurrent Neural Networks</title>
<p>Recurrent neural networks form a very broad family of neural network architectures that deal with the representation (and processing) of complex objects. At its core, a recurrent neural network (RNN) is a network which takes as input the current element in the sequence and processes it based on an internal state which depends on previous inputs. At the moment the most powerful network architectures are convolutional neural networks (Krizhevsky et al., <xref ref-type="bibr" rid="B38">2012</xref>; He et al., <xref ref-type="bibr" rid="B31">2016</xref>) for vision-related tasks and LSTM-type networks for language-related tasks (Graves, <xref ref-type="bibr" rid="B27">2013</xref>; Vinyals et al., <xref ref-type="bibr" rid="B68">2015a</xref>).</p>
<p>A recurrent neural network takes as input a sequence <bold>x</bold> &#x0003D; (<bold>x<sub>1</sub></bold> &#x02026; <bold>x<sub>n</sub></bold>) and produces as output a single vector <bold>y</bold> &#x02208; &#x0211D;<sup><italic>n</italic></sup> which is a representation of the entire sequence. At each step <xref ref-type="fn" rid="fn0001"><sup>1</sup></xref> <italic>t</italic> the network takes as input the current element <bold>x<sub>t</sub></bold> and the previous output <bold>h<sub>t&#x02212;1</sub></bold>, and performs the following operation to produce the current output <bold>h<sub>t</sub></bold>:</p>
<disp-formula id="E43"><label>(9)</label><mml:math id="M62"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x003C3;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>W</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mstyle mathvariant='bold'><mml:mi>h</mml:mi></mml:mstyle><mml:mrow><mml:mstyle mathvariant='bold'><mml:mi>t</mml:mi></mml:mstyle><mml:mo>&#x02212;</mml:mo><mml:mstyle mathvariant='bold'><mml:mn>1</mml:mn></mml:mstyle></mml:mrow></mml:msub><mml:mtext>&#x000A0;</mml:mtext><mml:msub><mml:mstyle mathvariant='bold'><mml:mi>x</mml:mi></mml:mstyle><mml:mstyle mathvariant='bold'><mml:mi>t</mml:mi></mml:mstyle></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mi>b</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x003C3; is a non-linear function such as the logistic function or the hyperbolic tangent and [<bold>h<sub>t&#x02212;1</sub> x<sub>t</sub></bold>] denotes the concatenation of the vectors <bold>h<sub>t&#x02212;1</sub></bold> and <bold>x<sub>t</sub></bold>. The parameters of the model are the matrix <italic>W</italic> and the bias vector <italic>b</italic>.</p>
<p>Hence, a recurrent neural network is effectively a learned composition function, which dynamically depends on its current input, all of its previous inputs and also on the dataset on which it is trained. However, this learned composition function is basically impossible to analyze or interpret in any way. Sometimes an &#x0201C;intuitive&#x0201D; explanation is given about what the learned weights represent, with some weights representing information that must be remembered or forgotten.</p>
<p>Even more complex recurrent neural networks such as long short-term memory (LSTM) networks (Hochreiter and Schmidhuber, <xref ref-type="bibr" rid="B33">1997</xref>) have the same problem of interpretability. LSTMs are a recent and successful way for neural networks to deal with longer sequences of inputs, overcoming some difficulties that RNNs face in the training phase. As with RNNs, an LSTM network takes as input a sequence <bold>x</bold> &#x0003D; (<bold>x<sub>1</sub></bold> &#x02026; <bold>x<sub>n</sub></bold>) and produces as output a single vector <bold>y</bold> &#x02208; &#x0211D;<sup><italic>n</italic></sup> which is a representation of the entire sequence. At each step <italic>t</italic> the network takes as input the current element <bold>x<sub>t</sub></bold> and the previous output <bold>h<sub>t&#x02212;1</sub></bold>, and performs the following operations to produce the current output <bold>h<sub>t</sub></bold> and update the internal state <bold>c<sub>t</sub></bold>:</p>
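<p>Read as a composition function, Equation (9) can be sketched in a few lines of NumPy. The sketch assumes untrained random weights, the hyperbolic tangent as &#x003C3;, a zero initial state and arbitrary sizes <monospace>d_in</monospace> and <monospace>d_h</monospace>; these choices and the function names are hypothetical and serve only to show how a sequence is folded into one vector.</p>

```python
import numpy as np

rng = np.random.default_rng(4)
d_in, d_h = 8, 16  # assumed input and state sizes

# Parameters of the model: one matrix W acting on [h_{t-1} x_t] and a bias b
W = rng.standard_normal((d_h, d_h + d_in)) * 0.1
b = np.zeros(d_h)


def rnn_step(h_prev, x_t):
    # h_t = sigma(W [h_{t-1} x_t] + b), with tanh as the non-linearity
    return np.tanh(W @ np.concatenate([h_prev, x_t]) + b)


def encode(xs):
    # Fold the whole sequence into a single vector, starting from a zero state
    h = np.zeros(d_h)
    for x_t in xs:
        h = rnn_step(h, x_t)
    return h


xs = rng.standard_normal((5, d_in))  # a toy sequence of five input vectors
y = encode(xs)
y_reversed = encode(xs[::-1])        # same elements, opposite order
```

<p>The final state depends on the order of the inputs, so the two encodings differ; nothing in the vector itself, however, indicates which components carry which part of the sequence.</p>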
<disp-formula id="E44"><mml:math id="M63"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:msub><mml:mi>f</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x003C3;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mi>f</mml:mi></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mstyle mathvariant='bold'><mml:mi>h</mml:mi></mml:mstyle><mml:mrow><mml:mstyle mathvariant='bold'><mml:mi>t</mml:mi></mml:mstyle><mml:mo>&#x02212;</mml:mo><mml:mstyle mathvariant='bold'><mml:mn>1</mml:mn></mml:mstyle></mml:mrow></mml:msub><mml:mtext>&#x000A0;</mml:mtext><mml:msub><mml:mstyle mathvariant='bold'><mml:mi>x</mml:mi></mml:mstyle><mml:mstyle mathvariant='bold'><mml:mi>t</mml:mi></mml:mstyle></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>f</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>i</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x003C3;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mstyle mathvariant='bold'><mml:mi>h</mml:mi></mml:mstyle><mml:mrow><mml:mstyle mathvariant='bold'><mml:mi>t</mml:mi></mml:mstyle><mml:mo>&#x02212;</mml:mo><mml:mstyle mathvariant='bold'><mml:mn>1</mml:mn></mml:mstyle></mml:mrow></mml:msub><mml:mtext>&#x000A0;</mml:mtext><mml:msub><mml:mstyle mathvariant='bold'><mml:mi>x</mml:mi></mml:mstyle><mml:mstyle mathvariant='bold'><mml:mi>t</mml:mi></mml:mstyle></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>o</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x003C3;</mml:mi><mml:mo 
stretchy='false'>(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mi>o</mml:mi></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mstyle mathvariant='bold'><mml:mi>h</mml:mi></mml:mstyle><mml:mrow><mml:mstyle mathvariant='bold'><mml:mi>t</mml:mi></mml:mstyle><mml:mo>&#x02212;</mml:mo><mml:mstyle mathvariant='bold'><mml:mn>1</mml:mn></mml:mstyle></mml:mrow></mml:msub><mml:mtext>&#x000A0;</mml:mtext><mml:msub><mml:mstyle mathvariant='bold'><mml:mi>x</mml:mi></mml:mstyle><mml:mstyle mathvariant='bold'><mml:mi>t</mml:mi></mml:mstyle></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>o</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mstyle mathvariant='bold'><mml:mover accent='true'><mml:mi>c</mml:mi><mml:mo>&#x002DC;</mml:mo></mml:mover></mml:mstyle><mml:mstyle mathvariant='bold'><mml:mi>t</mml:mi></mml:mstyle></mml:msub><mml:mo>=</mml:mo><mml:mi>tanh</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mi>c</mml:mi></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mstyle mathvariant='bold'><mml:mi>h</mml:mi></mml:mstyle><mml:mrow><mml:mstyle mathvariant='bold'><mml:mi>t</mml:mi></mml:mstyle><mml:mo>&#x02212;</mml:mo><mml:mstyle mathvariant='bold'><mml:mn>1</mml:mn></mml:mstyle></mml:mrow></mml:msub><mml:mtext>&#x000A0;</mml:mtext><mml:msub><mml:mstyle mathvariant='bold'><mml:mi>x</mml:mi></mml:mstyle><mml:mstyle mathvariant='bold'><mml:mi>t</mml:mi></mml:mstyle></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>c</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mstyle mathvariant='bold'><mml:mi>c</mml:mi></mml:mstyle><mml:mstyle 
mathvariant='bold'><mml:mi>t</mml:mi></mml:mstyle></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>&#x02299;</mml:mo><mml:msub><mml:mstyle mathvariant='bold'><mml:mi>c</mml:mi></mml:mstyle><mml:mrow><mml:mstyle mathvariant='bold'><mml:mi>t</mml:mi></mml:mstyle><mml:mo>&#x02212;</mml:mo><mml:mstyle mathvariant='bold'><mml:mn>1</mml:mn></mml:mstyle></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>i</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>&#x02299;</mml:mo><mml:msub><mml:mstyle mathvariant='bold'><mml:mover accent='true'><mml:mi>c</mml:mi><mml:mo>&#x002DC;</mml:mo></mml:mover></mml:mstyle><mml:mstyle mathvariant='bold'><mml:mi>t</mml:mi></mml:mstyle></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>o</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>&#x02299;</mml:mo><mml:mi>tanh</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mstyle mathvariant='bold'><mml:mi>c</mml:mi></mml:mstyle><mml:mstyle mathvariant='bold'><mml:mi>t</mml:mi></mml:mstyle></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x02299; stands for element-wise multiplication, and the parameters of the model are the matrices <italic>W</italic><sub><italic>f</italic></sub>, <italic>W</italic><sub><italic>i</italic></sub>, <italic>W</italic><sub><italic>o</italic></sub>, <italic>W</italic><sub><italic>c</italic></sub> and the bias vectors <italic>b</italic><sub><italic>f</italic></sub>, <italic>b</italic><sub><italic>i</italic></sub>, <italic>b</italic><sub><italic>o</italic></sub>, <italic>b</italic><sub><italic>c</italic></sub>.</p>
<p>Generally, the interpretation offered for recurrent neural networks is <italic>functional</italic> or &#x0201C;<italic>psychological</italic>&#x0201D; and does not concern the content of the intermediate vectors. For example, the gates and states of an LSTM can be interpreted as follows:</p>
<list list-type="bullet">
<list-item><p><italic>f</italic><sub><italic>t</italic></sub> is the <italic>forget gate</italic>: at each step, it takes into consideration the new input and the output computed so far to decide which information in the internal state must be <italic>forgotten</italic> (that is, set to 0);</p></list-item>
<list-item><p><italic>i</italic><sub><italic>t</italic></sub> is the <italic>input gate</italic>: it decides which positions in the internal state will be updated, and by how much;</p></list-item>
<list-item><p><inline-formula><mml:math id="M64"><mml:mover accent="true"><mml:mrow><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:math></inline-formula> is the proposed new internal state, which is then combined with the previous internal state through the forget and input gates to produce the actual update;</p></list-item>
<list-item><p><italic>o</italic><sub><italic>t</italic></sub> is the <italic>output gate</italic>: it decides how to modulate the internal state to produce the output.</p></list-item>
</list>
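<p>As a rough illustration, the gate equations above can be sketched as a single LSTM step in plain Python. This is a minimal sketch under stated assumptions, not a reference implementation: the names lstm_step, affine, and random_params, the dictionary layout of the parameters, the random initialization, and the toy dimensions are all hypothetical choices made for this example and do not come from the cited works.</p>
<preformat>
```python
import math
import random


def sigmoid(z):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Compute W v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(params, h_prev, c_prev, x_t):
    """One LSTM step; returns the new output h_t and internal state c_t."""
    hx = h_prev + x_t  # list concatenation, i.e. [h_{t-1} x_t]
    f_t = [sigmoid(z) for z in affine(params["W_f"], params["b_f"], hx)]  # forget gate
    i_t = [sigmoid(z) for z in affine(params["W_i"], params["b_i"], hx)]  # input gate
    o_t = [sigmoid(z) for z in affine(params["W_o"], params["b_o"], hx)]  # output gate
    c_tilde = [math.tanh(z) for z in affine(params["W_c"], params["b_c"], hx)]  # proposed state
    # c_t = f_t * c_{t-1} + i_t * c_tilde, element-wise
    c_t = [f * c + i * ct for f, c, i, ct in zip(f_t, c_prev, i_t, c_tilde)]
    # h_t = o_t * tanh(c_t), element-wise
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def random_params(hidden, inputs, seed=0):
    """Hypothetical helper: small random weights and zero biases (an assumption)."""
    rng = random.Random(seed)
    params = {}
    for gate in "fioc":
        params["W_" + gate] = [
            [rng.uniform(-0.5, 0.5) for _ in range(hidden + inputs)]
            for _ in range(hidden)
        ]
        params["b_" + gate] = [0.0] * hidden
    return params


# Toy usage: run three 2-dimensional inputs through a 3-dimensional LSTM.
params = random_params(hidden=3, inputs=2)
h, c = [0.0] * 3, [0.0] * 3
for x in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    h, c = lstm_step(params, h, c, x)
```
</preformat>
<p>In this sketch, the forget and input gates only rescale the previous state and the proposed state before they are summed, which mirrors the functional reading of the gates given in the list above; nothing in the code assigns a meaning to the individual components of the vectors.</p>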
<p>These <italic>models-that-compose</italic> have high performance on final tasks but are definitely not interpretable.</p>
</sec>
<sec>
<title>5.3.2. Recursive Neural Network</title>
<p>The last class of <italic>models-that-compose</italic> that we present is the class of <italic>recursive neural networks</italic> (Socher et al., <xref ref-type="bibr" rid="B61">2012</xref>). These networks are applied to data structures such as trees and are in fact applied recursively on the structure. Generally, the aim of the network is a final task such as <italic>sentiment analysis</italic> or <italic>paraphrase detection</italic>.</p>
<p>A recursive neural network is then a basic block that is recursively applied on trees like the one in <xref ref-type="fig" rid="F3">Figure 3</xref>. The formal definition is the following:</p>
<disp-formula id="E45"><mml:math id="M65"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>p</mml:mtext></mml:mstyle><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>U</mml:mi><mml:mo>,</mml:mo><mml:mi>V</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>u</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:mi>V</mml:mi><mml:mstyle mathvariant="bold"><mml:mtext>u</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mi>U</mml:mi><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>g</mml:mi><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:mi>W</mml:mi><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:mtable style="text-align:axis;" equalrows="false" columnlines="none none none none none none none none none" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:mi>V</mml:mi><mml:mstyle mathvariant="bold"><mml:mtext>u</mml:mtext></mml:mstyle></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>U</mml:mi><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle></mml:mtd></mml:mtr></mml:mtable></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>g</italic> is a component-wise sigmoid function or tanh, and <italic>W</italic> is a matrix that maps the concatenation vector <inline-formula><mml:math id="M66"><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:mtable style="text-align:axis;" equalrows="false" columnlines="none none none none none none none none none" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:mi>V</mml:mi><mml:mstyle mathvariant="bold"><mml:mtext>u</mml:mtext></mml:mstyle></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>U</mml:mi><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle></mml:mtd></mml:mtr></mml:mtable></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:math></inline-formula> to a vector with the same dimension as the input vectors <bold>u</bold> and <bold>v</bold>.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>A simple binary tree.</p></caption>
<graphic xlink:href="frobt-06-00153-g0003.tif"/>
</fig>
<p>This method deals naturally with recursion: given a binary parse tree of a sentence <italic>s</italic>, the algorithm creates vector and matrix representations for each node, starting from the terminal nodes. Words are represented by distributed representations or local representations. For example, the tree in <xref ref-type="fig" rid="F3">Figure 3</xref> is processed by the recursive network in the following way. First, the network is applied to the pair <italic>(animal, extracts)</italic> and <italic>f</italic><sub><italic>U</italic>,<italic>V</italic></sub>(<bold>animal</bold>, <bold>extracts</bold>) is obtained. Then, the network is applied to <italic>eat</italic> and the previous result, so that <italic>f</italic><sub><italic>U</italic>,<italic>V</italic></sub>(<bold>eat</bold>, <italic>f</italic><sub><italic>U</italic>,<italic>V</italic></sub>(<bold>animal</bold>, <bold>extracts</bold>)) is obtained, and so on.</p>
<p>Recursive neural networks are not easily interpretable, even though they are quite similar to the additive <italic>compositional distributional semantic models</italic> such as those presented in Section 5.1.1. In fact, it is the non-linear function <italic>g</italic> that makes the final vectors less interpretable.</p>
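<p>The bottom-up processing described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: the names compose and encode, the nested-tuple encoding of the tree, the choice of tanh for <italic>g</italic>, the random matrices and word vectors, and the toy dimension are hypothetical, and only the subtree mentioned in the text (<italic>eat</italic> combined with <italic>animal extracts</italic>) is encoded rather than the full tree of <xref ref-type="fig" rid="F3">Figure 3</xref>.</p>
<preformat>
```python
import math
import random

DIM = 4  # toy dimension of word and phrase vectors (an assumption)
rng = random.Random(0)


def random_matrix(rows, cols):
    """Hypothetical helper: a matrix of small random weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(M, v):
    """Matrix-vector product, with M given as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def compose(U, V, W, u, v):
    """p = g(W [Vu; Uv]) with g = tanh, following the definition in the text."""
    stacked = matvec(V, u) + matvec(U, v)  # concatenation of Vu and Uv
    return [math.tanh(z) for z in matvec(W, stacked)]


def encode(tree, embeddings, U, V, W):
    """Encode a binary tree given as a word or as a (left, right) tuple."""
    if isinstance(tree, str):
        return embeddings[tree]  # terminal node: look up the word vector
    left, right = tree
    return compose(
        U, V, W,
        encode(left, embeddings, U, V, W),
        encode(right, embeddings, U, V, W),
    )


U = random_matrix(DIM, DIM)
V = random_matrix(DIM, DIM)
W = random_matrix(DIM, 2 * DIM)  # maps the 2*DIM concatenation back to DIM
embeddings = {
    word: [rng.uniform(-1.0, 1.0) for _ in range(DIM)]
    for word in ("eat", "animal", "extracts")
}

# f(eat, f(animal, extracts)): the inner pair is composed first.
sentence_vector = encode(("eat", ("animal", "extracts")), embeddings, U, V, W)
```
</preformat>
<p>Because the same matrices are reused at every node, the vector of a phrase has the same dimension as the vector of a word, which is what allows the block to be applied again to its own output.</p>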
</sec>
<sec>
<title>5.3.3. Attention Neural Network</title>
<p>Attention neural networks (Vaswani et al., <xref ref-type="bibr" rid="B65">2017</xref>; Devlin et al., <xref ref-type="bibr" rid="B17">2019</xref>) are an extremely successful approach for combining distributed representations of sequences of symbols. Yet, these models are very simple. In fact, these attention models are basically gigantic multi-layer perceptrons applied to distributed representations of discrete symbols. The key point is that these gigantic multi-layer perceptrons are first trained on generic tasks, and these pre-trained models are then used in specific tasks by training the last layers. From the point of view of sequence-level interpretability, these models are still under investigation, as any concatenative compositionality is scattered across the overall network.</p>
</sec>
</sec>
</sec>
<sec sec-type="conclusions" id="s6">
<title>6. Conclusions</title>
<p>In the 1990s, the hot debate on neural networks was whether or not distributed representations are <italic>only an implementation</italic> of discrete symbolic representations. The question behind this debate is in fact crucial to understanding whether neural networks may exploit something more than systems strictly based on discrete symbolic representations. The question is again becoming extremely relevant, since natural language is by construction a discrete symbolic representation and, nowadays, deep neural networks are solving many tasks.</p>
<p>We made this survey to revitalize the debate. In fact, this is the right time to focus on this fundamental question. As we show, distributed representations have a not-surprising link with discrete symbolic representations. In our opinion, by shedding light on this debate, this survey will help to devise new deep neural networks that can exploit existing and novel symbolic models of classical natural language processing tasks. We believe that a clearer understanding of the strict link between distributed/distributional representations and symbols may lead to radically new deep learning networks.</p>
</sec>
<sec id="s7">
<title>Author Contributions</title>
<p>All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.</p>
<sec>
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
</sec>
</body>
<back>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Achlioptas</surname> <given-names>D.</given-names></name></person-group> (<year>2003</year>). <article-title>Database-friendly random projections: Johnson-Lindenstrauss with binary coins</article-title>. <source>J. Comput. Syst. Sci.</source> <volume>66</volume>, <fpage>671</fpage>&#x02013;<lpage>687</lpage>. <pub-id pub-id-type="doi">10.1016/S0022-0000(03)00025-4</pub-id></citation></ref>
<ref id="B2">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Agirre</surname> <given-names>E.</given-names></name> <name><surname>Cer</surname> <given-names>D.</given-names></name> <name><surname>Diab</surname> <given-names>M.</given-names></name> <name><surname>Gonzalez-Agirre</surname> <given-names>A.</given-names></name> <name><surname>Guo</surname> <given-names>W.</given-names></name></person-group> (<year>2013</year>). <article-title>*SEM 2013 shared task: semantic textual similarity,</article-title> in <source>Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity</source> (<publisher-loc>Atlanta, GA</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>32</fpage>&#x02013;<lpage>43</lpage>.</citation></ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bahdanau</surname> <given-names>D.</given-names></name> <name><surname>Cho</surname> <given-names>K.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name></person-group> (<year>2015</year>). <article-title>Neural machine translation by jointly learning to align and translate,</article-title> in <source>Proceedings of the 3rd International Conference on Learning Representations (ICLR)</source>.</citation></ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Baroni</surname> <given-names>M.</given-names></name> <name><surname>Bernardi</surname> <given-names>R.</given-names></name> <name><surname>Zamparelli</surname> <given-names>R.</given-names></name></person-group> (<year>2014</year>). <article-title>Frege in space: a program of compositional distributional semantics</article-title>. <source>Linguist. Issues Lang. Technol.</source> <volume>9</volume>, <fpage>241</fpage>&#x02013;<lpage>346</lpage>.</citation></ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Baroni</surname> <given-names>M.</given-names></name> <name><surname>Lenci</surname> <given-names>A.</given-names></name></person-group> (<year>2010</year>). <article-title>Distributional memory: a general framework for corpus-based semantics</article-title>. <source>Comput. Linguist.</source> <volume>36</volume>, <fpage>673</fpage>&#x02013;<lpage>721</lpage>. <pub-id pub-id-type="doi">10.1162/coli_a_00016</pub-id></citation></ref>
<ref id="B6">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Baroni</surname> <given-names>M.</given-names></name> <name><surname>Zamparelli</surname> <given-names>R.</given-names></name></person-group> (<year>2010</year>). <article-title>Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space,</article-title> in <source>Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing</source> (<publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>1183</fpage>&#x02013;<lpage>1193</lpage>.</citation></ref>
<ref id="B7">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bingham</surname> <given-names>E.</given-names></name> <name><surname>Mannila</surname> <given-names>H.</given-names></name></person-group> (<year>2001</year>). <article-title>Random projection in dimensionality reduction: applications to image and text data,</article-title> in <source>Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source> (<publisher-loc>San Francisco</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>245</fpage>&#x02013;<lpage>250</lpage>.</citation></ref>
<ref id="B8">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Blutner</surname> <given-names>R.</given-names></name> <name><surname>Hendriks</surname> <given-names>P.</given-names></name> <name><surname>de Hoop</surname> <given-names>H.</given-names></name></person-group> (<year>2003</year>). <article-title>A new hypothesis on compositionality,</article-title> in <source>Proceedings of the Joint International Conference on Cognitive Science</source> (<publisher-loc>Sydney, NSW</publisher-loc>).</citation></ref>
<ref id="B9">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chalmers</surname> <given-names>D. J.</given-names></name></person-group> (<year>1992</year>). <source>Syntactic Transformations on Distributed Representations.</source> <publisher-loc>Dordrecht</publisher-loc>: <publisher-name>Springer</publisher-name>.</citation></ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chetlur</surname> <given-names>S.</given-names></name> <name><surname>Woolley</surname> <given-names>C.</given-names></name> <name><surname>Vandermersch</surname> <given-names>P.</given-names></name> <name><surname>Cohen</surname> <given-names>J.</given-names></name> <name><surname>Tran</surname> <given-names>J.</given-names></name> <name><surname>Catanzaro</surname> <given-names>B.</given-names></name> <etal/></person-group>. (<year>2014</year>). <article-title>cuDNN: efficient primitives for deep learning</article-title>. <source>arXiv (Preprint). arXiv:1410.0759</source>.</citation></ref>
<ref id="B11">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chomsky</surname> <given-names>N.</given-names></name></person-group> (<year>1957</year>). <source>Aspect of Syntax Theory</source>. <publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>MIT Press</publisher-name>.</citation></ref>
<ref id="B12">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Clark</surname> <given-names>S.</given-names></name> <name><surname>Coecke</surname> <given-names>B.</given-names></name> <name><surname>Sadrzadeh</surname> <given-names>M.</given-names></name></person-group> (<year>2008</year>). <article-title>A compositional distributional model of meaning,</article-title> in <source>Proceedings of the Second Symposium on Quantum Interaction (QI-2008)</source> (<publisher-loc>Oxford</publisher-loc>), <fpage>133</fpage>&#x02013;<lpage>140</lpage>.</citation></ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Coecke</surname> <given-names>B.</given-names></name> <name><surname>Sadrzadeh</surname> <given-names>M.</given-names></name> <name><surname>Clark</surname> <given-names>S.</given-names></name></person-group> (<year>2010</year>). <article-title>Mathematical foundations for a compositional distributional model of meaning</article-title>. <source>arXiv:1003.4394</source>.</citation></ref>
<ref id="B14">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Cotterell</surname> <given-names>R.</given-names></name> <name><surname>Poliak</surname> <given-names>A.</given-names></name> <name><surname>Van Durme</surname> <given-names>B.</given-names></name> <name><surname>Eisner</surname> <given-names>J.</given-names></name></person-group> (<year>2017</year>). <article-title>Explaining and generalizing skip-gram through exponential family principal component analysis,</article-title> in <source>Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers</source> (<publisher-loc>Valencia</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>175</fpage>&#x02013;<lpage>181</lpage>.</citation></ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cui</surname> <given-names>H.</given-names></name> <name><surname>Ganger</surname> <given-names>G. R.</given-names></name> <name><surname>Gibbons</surname> <given-names>P. B.</given-names></name></person-group> (<year>2015</year>). <source>Scalable Deep Learning on Distributed GPUS with a GPU-Specialized Parameter Server.</source> Technical report, CMU PDL Technical Report (CMU-PDL-15-107).</citation></ref>
<ref id="B16">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Dagan</surname> <given-names>I.</given-names></name> <name><surname>Roth</surname> <given-names>D.</given-names></name> <name><surname>Sammons</surname> <given-names>M.</given-names></name> <name><surname>Zanzotto</surname> <given-names>F. M.</given-names></name></person-group> (<year>2013</year>). <source>Recognizing Textual Entailment: Models and Applications</source>. <publisher-loc>San Rafael, CA</publisher-loc>: <publisher-name>Morgan &#x00026; Claypool Publishers</publisher-name>.</citation></ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Devlin</surname> <given-names>J.</given-names></name> <name><surname>Chang</surname> <given-names>M.</given-names></name> <name><surname>Lee</surname> <given-names>K.</given-names></name> <name><surname>Toutanova</surname> <given-names>K.</given-names></name></person-group> (<year>2019</year>). <article-title>BERT: pre-training of deep bidirectional transformers for language understanding,</article-title> in <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>, <fpage>4171</fpage>&#x02013;<lpage>4186</lpage>.</citation></ref>
<ref id="B18">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ferrone</surname> <given-names>L.</given-names></name> <name><surname>Zanzotto</surname> <given-names>F. M.</given-names></name></person-group> (<year>2014</year>). <article-title>Towards syntax-aware compositional distributional semantic models,</article-title> in <source>Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers</source> (<publisher-loc>Dublin</publisher-loc>: <publisher-name>Dublin City University and Association for Computational Linguistics</publisher-name>), <fpage>721</fpage>&#x02013;<lpage>730</lpage>.</citation></ref>
<ref id="B19">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ferrone</surname> <given-names>L.</given-names></name> <name><surname>Zanzotto</surname> <given-names>F. M.</given-names></name> <name><surname>Carreras</surname> <given-names>X.</given-names></name></person-group> (<year>2015</year>). <article-title>Decoding distributed tree structures,</article-title> in <source>Statistical Language and Speech Processing - Third International Conference, SLSP 2015</source> (<publisher-loc>Budapest</publisher-loc>), <fpage>73</fpage>&#x02013;<lpage>83</lpage>.</citation></ref>
<ref id="B20">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Firth</surname> <given-names>J. R.</given-names></name></person-group> (<year>1957</year>). <source>Papers in Linguistics.</source> <publisher-loc>London</publisher-loc>: <publisher-name>Oxford University Press</publisher-name>.</citation></ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fodor</surname> <given-names>I.</given-names></name></person-group> (<year>2002</year>). <source>A Survey of Dimension Reduction Techniques.</source> Technical report. Lawrence Livermore National Lab., CA, USA.</citation></ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fodor</surname> <given-names>J. A.</given-names></name> <name><surname>Pylyshyn</surname> <given-names>Z. W.</given-names></name></person-group> (<year>1988</year>). <article-title>Connectionism and cognitive architecture: a critical analysis</article-title>. <source>Cognition</source> <volume>28</volume>, <fpage>3</fpage>&#x02013;<lpage>71</lpage>.<pub-id pub-id-type="pmid">2450716</pub-id></citation></ref>
<ref id="B23">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Frege</surname> <given-names>G.</given-names></name></person-group> (<year>1884</year>). <source>Die Grundlagen der Arithmetik (The Foundations of Arithmetic): eine logisch-mathematische Untersuchung &#x000FC;ber den Begriff der Zahl</source>. <publisher-loc>Breslau</publisher-loc>: <publisher-name>W. Koebner</publisher-name>.</citation></ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gelder</surname> <given-names>T. V.</given-names></name></person-group> (<year>1990</year>). <article-title>Compositionality: a connectionist variation on a classical theme</article-title>. <source>Cogn. Sci.</source> <volume>14</volume>, <fpage>355</fpage>&#x02013;<lpage>384</lpage>.</citation></ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Goldberg</surname> <given-names>Y.</given-names></name> <name><surname>Levy</surname> <given-names>O.</given-names></name></person-group> (<year>2014</year>). <article-title>word2vec explained: deriving Mikolov et al.&#x00027;s negative-sampling word-embedding method</article-title>. <source>arXiv (Preprint). arXiv:1402.3722</source>.</citation></ref>
<ref id="B26">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Goodfellow</surname> <given-names>I.</given-names></name> <name><surname>Pouget-Abadie</surname> <given-names>J.</given-names></name> <name><surname>Mirza</surname> <given-names>M.</given-names></name> <name><surname>Xu</surname> <given-names>B.</given-names></name> <name><surname>Warde-Farley</surname> <given-names>D.</given-names></name> <name><surname>Ozair</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2014</year>). <article-title>Generative adversarial nets,</article-title> in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Montreal, QC</publisher-loc>), <fpage>2672</fpage>&#x02013;<lpage>2680</lpage>.</citation></ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Graves</surname> <given-names>A.</given-names></name></person-group> (<year>2013</year>). <article-title>Generating sequences with recurrent neural networks</article-title>. <source>arXiv:1308.0850</source>.</citation></ref>
<ref id="B28">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Grefenstette</surname> <given-names>E.</given-names></name> <name><surname>Sadrzadeh</surname> <given-names>M.</given-names></name></person-group> (<year>2011</year>). <article-title>Experimental support for a categorical compositional distributional model of meaning,</article-title> in <source>Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP &#x00027;11</source> (<publisher-loc>Stroudsburg, PA</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>1394</fpage>&#x02013;<lpage>1404</lpage>.</citation></ref>
<ref id="B29">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Guevara</surname> <given-names>E.</given-names></name></person-group> (<year>2010</year>). <article-title>A regression model of adjective-noun compositionality in distributional semantics,</article-title> in <source>Proceedings of the 2010 Workshop on GEometrical Models of Natural Language Semantics</source> (<publisher-loc>Uppsala</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>33</fpage>&#x02013;<lpage>37</lpage>.</citation></ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Harris</surname> <given-names>Z.</given-names></name></person-group> (<year>1954</year>). <article-title>Distributional structure</article-title>. <source>Word</source> <volume>10</volume>, <fpage>146</fpage>&#x02013;<lpage>162</lpage>.</citation></ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Ren</surname> <given-names>S.</given-names></name> <name><surname>Sun</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). <article-title>Identity mappings in deep residual networks</article-title>. <source>arXiv (preprint) arXiv:1603.05027</source>. <pub-id pub-id-type="doi">10.1007/978-3-319-46493-0_38</pub-id></citation></ref>
<ref id="B32">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Hinton</surname> <given-names>G. E.</given-names></name> <name><surname>McClelland</surname> <given-names>J. L.</given-names></name> <name><surname>Rumelhart</surname> <given-names>D. E.</given-names></name></person-group> (<year>1986</year>). <article-title>Distributed representations,</article-title> in <source>Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations</source>, eds <person-group person-group-type="editor"><name><surname>Rumelhart</surname> <given-names>D. E.</given-names></name> <name><surname>McClelland</surname> <given-names>J. L.</given-names></name></person-group> (<publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>MIT Press</publisher-name>), <fpage>77</fpage>&#x02013;<lpage>109</lpage>.</citation></ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hochreiter</surname> <given-names>S.</given-names></name> <name><surname>Schmidhuber</surname> <given-names>J.</given-names></name></person-group> (<year>1997</year>). <article-title>Long short-term memory</article-title>. <source>Neural Comput.</source> <volume>9</volume>, <fpage>1735</fpage>&#x02013;<lpage>1780</lpage>.<pub-id pub-id-type="pmid">9377276</pub-id></citation></ref>
<ref id="B34">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jacovi</surname> <given-names>A.</given-names></name> <name><surname>Shalom</surname> <given-names>O. S.</given-names></name> <name><surname>Goldberg</surname> <given-names>Y.</given-names></name></person-group> (<year>2018</year>). <article-title>Understanding convolutional neural networks for text classification,</article-title> in <source>Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</source> (<publisher-loc>Brussels</publisher-loc>), <fpage>56</fpage>&#x02013;<lpage>65</lpage>.</citation></ref>
<ref id="B35">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jang</surname> <given-names>K.-R.</given-names></name> <name><surname>Kim</surname> <given-names>S.-B.</given-names></name> <name><surname>Corp</surname> <given-names>N.</given-names></name></person-group> (<year>2018</year>). <article-title>Interpretable word embedding contextualization,</article-title> in <source>Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</source> (<publisher-loc>Brussels</publisher-loc>), <fpage>341</fpage>&#x02013;<lpage>343</lpage>.</citation></ref>
<ref id="B36">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Johnson</surname> <given-names>W.</given-names></name> <name><surname>Lindenstrauss</surname> <given-names>J.</given-names></name></person-group> (<year>1984</year>). <article-title>Extensions of Lipschitz mappings into a Hilbert space</article-title>. <source>Contemp. Math.</source> <volume>26</volume>, <fpage>189</fpage>&#x02013;<lpage>206</lpage>.</citation></ref>
<ref id="B37">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kalchbrenner</surname> <given-names>N.</given-names></name> <name><surname>Blunsom</surname> <given-names>P.</given-names></name></person-group> (<year>2013</year>). <article-title>Recurrent convolutional neural networks for discourse compositionality,</article-title> in <source>Proceedings of the 2013 Workshop on Continuous Vector Space Models and Their Compositionality</source> (<publisher-loc>Sofia</publisher-loc>).</citation></ref>
<ref id="B38">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Krizhevsky</surname> <given-names>A.</given-names></name> <name><surname>Sutskever</surname> <given-names>I.</given-names></name> <name><surname>Hinton</surname> <given-names>G. E.</given-names></name></person-group> (<year>2012</year>). <article-title>ImageNet classification with deep convolutional neural networks,</article-title> in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Lake Tahoe, NV</publisher-loc>), <fpage>1097</fpage>&#x02013;<lpage>1105</lpage>.</citation></ref>
<ref id="B39">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Landauer</surname> <given-names>T. K.</given-names></name> <name><surname>Dumais</surname> <given-names>S. T.</given-names></name></person-group> (<year>1997</year>). <article-title>A solution to Plato&#x00027;s problem: the latent semantic analysis theory of acquisition, induction, and representation of knowledge</article-title>. <source>Psychol. Rev.</source> <volume>104</volume>, <fpage>211</fpage>&#x02013;<lpage>240</lpage>.</citation></ref>
<ref id="B40">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>LeCun</surname> <given-names>Y.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Hinton</surname> <given-names>G.</given-names></name></person-group> (<year>2015</year>). <article-title>Deep learning</article-title>. <source>Nature</source> <volume>521</volume>, <fpage>436</fpage>&#x02013;<lpage>444</lpage>.<pub-id pub-id-type="pmid">26017442</pub-id></citation></ref>
<ref id="B41">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liou</surname> <given-names>C.-Y.</given-names></name> <name><surname>Cheng</surname> <given-names>W.-C.</given-names></name> <name><surname>Liou</surname> <given-names>J.-W.</given-names></name> <name><surname>Liou</surname> <given-names>D.-R.</given-names></name></person-group> (<year>2014</year>). <article-title>Autoencoder for words</article-title>. <source>Neurocomputing</source> <volume>139</volume>, <fpage>84</fpage>&#x02013;<lpage>96</lpage>. <pub-id pub-id-type="doi">10.1016/j.neucom.2013.09.055</pub-id></citation></ref>
<ref id="B42">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lipton</surname> <given-names>Z. C.</given-names></name></person-group> (<year>2018</year>). <article-title>The mythos of model interpretability</article-title>. <source>Commun. ACM</source> <volume>61</volume>, <fpage>36</fpage>&#x02013;<lpage>43</lpage>. <pub-id pub-id-type="doi">10.1145/3233231</pub-id></citation></ref>
<ref id="B43">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Markovsky</surname> <given-names>I.</given-names></name></person-group> (<year>2011</year>). <source>Low Rank Approximation: Algorithms, Implementation, Applications</source>. <publisher-name>Springer</publisher-name>.</citation></ref>
<ref id="B44">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Masci</surname> <given-names>J.</given-names></name> <name><surname>Meier</surname> <given-names>U.</given-names></name> <name><surname>Cire&#x0015F;an</surname> <given-names>D.</given-names></name> <name><surname>Schmidhuber</surname> <given-names>J.</given-names></name></person-group> (<year>2011</year>). <article-title>Stacked convolutional auto-encoders for hierarchical feature extraction,</article-title> in <source>International Conference on Artificial Neural Networks</source> (<publisher-name>Springer</publisher-name>), <fpage>52</fpage>&#x02013;<lpage>59</lpage>.</citation></ref>
<ref id="B45">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Mikolov</surname> <given-names>T.</given-names></name> <name><surname>Chen</surname> <given-names>K.</given-names></name> <name><surname>Corrado</surname> <given-names>G.</given-names></name> <name><surname>Dean</surname> <given-names>J.</given-names></name></person-group> (<year>2013</year>). <article-title>Efficient estimation of word representations in vector space,</article-title> in <source>Proceedings of the International Conference on Learning Representations (ICLR)</source>.</citation></ref>
<ref id="B46">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Mitchell</surname> <given-names>J.</given-names></name> <name><surname>Lapata</surname> <given-names>M.</given-names></name></person-group> (<year>2008</year>). <article-title>Vector-based models of semantic composition,</article-title> in <source>Proceedings of ACL-08: HLT</source> (<publisher-loc>Columbus, OH</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>236</fpage>&#x02013;<lpage>244</lpage>.</citation></ref>
<ref id="B47">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mitchell</surname> <given-names>J.</given-names></name> <name><surname>Lapata</surname> <given-names>M.</given-names></name></person-group> (<year>2010</year>). <article-title>Composition in distributional models of semantics</article-title>. <source>Cogn. Sci.</source> <volume>34</volume>, <fpage>1388</fpage>&#x02013;<lpage>1429</lpage>. <pub-id pub-id-type="doi">10.1111/j.1551-6709.2010.01106.x</pub-id><pub-id pub-id-type="pmid">21564253</pub-id></citation></ref>
<ref id="B48">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Montague</surname> <given-names>R.</given-names></name></person-group> (<year>1974</year>). <article-title>English as a formal language,</article-title> in <source>Formal Philosophy: Selected Papers of Richard Montague</source>, ed R. Thomason (<publisher-loc>New Haven</publisher-loc>: <publisher-name>Yale University Press</publisher-name>), <fpage>188</fpage>&#x02013;<lpage>221</lpage>.</citation></ref>
<ref id="B49">
<citation citation-type="thesis"><person-group person-group-type="author"><name><surname>Neumann</surname> <given-names>J.</given-names></name></person-group> (<year>2001</year>). <source>Holistic processing of hierarchical structures in connectionist networks</source> (Ph.D. thesis). University of Edinburgh, Edinburgh.</citation></ref>
<ref id="B50">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pad&#x000F3;</surname> <given-names>S.</given-names></name> <name><surname>Lapata</surname> <given-names>M.</given-names></name></person-group> (<year>2007</year>). <article-title>Dependency-based construction of semantic space models</article-title>. <source>Comput. Linguist.</source> <volume>33</volume>, <fpage>161</fpage>&#x02013;<lpage>199</lpage>. <pub-id pub-id-type="doi">10.1162/coli.2007.33.2.161</pub-id></citation></ref>
<ref id="B51">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pearson</surname> <given-names>K.</given-names></name></person-group> (<year>1901</year>). <article-title>On lines and planes of closest fit to systems of points in space</article-title>. <source>Lond. Edinburgh Dublin Philos. Mag. J. Sci.</source> <volume>2</volume>, <fpage>559</fpage>&#x02013;<lpage>572</lpage>.</citation></ref>
<ref id="B52">
<citation citation-type="thesis"><person-group person-group-type="author"><name><surname>Plate</surname> <given-names>T. A.</given-names></name></person-group> (<year>1994</year>). <source>Distributed representations and nested compositional structure</source> (Ph.D. thesis). University of Toronto, Toronto, Canada.</citation></ref>
<ref id="B53">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Plate</surname> <given-names>T. A.</given-names></name></person-group> (<year>1995</year>). <article-title>Holographic reduced representations</article-title>. <source>IEEE Trans. Neural Netw.</source> <volume>6</volume>, <fpage>623</fpage>&#x02013;<lpage>641</lpage>.<pub-id pub-id-type="pmid">18263348</pub-id></citation></ref>
<ref id="B54">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rosenblatt</surname> <given-names>F.</given-names></name></person-group> (<year>1958</year>). <article-title>The perceptron: a probabilistic model for information storage and organization in the brain</article-title>. <source>Psychol. Rev.</source> <volume>65</volume>, <fpage>386</fpage>&#x02013;<lpage>408</lpage>.<pub-id pub-id-type="pmid">13602029</pub-id></citation></ref>
<ref id="B55">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Rothenh&#x000E4;usler</surname> <given-names>K.</given-names></name> <name><surname>Sch&#x000FC;tze</surname> <given-names>H.</given-names></name></person-group> (<year>2009</year>). <article-title>Unsupervised classification with dependency based word spaces,</article-title> in <source>Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, GEMS &#x00027;09</source> (<publisher-loc>Stroudsburg, PA</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>17</fpage>&#x02013;<lpage>24</lpage>.</citation></ref>
<ref id="B56">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sahlgren</surname> <given-names>M.</given-names></name></person-group> (<year>2005</year>). <article-title>An introduction to random indexing,</article-title> in <source>Proceedings of the Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering TKE</source> (<publisher-loc>Copenhagen</publisher-loc>).</citation></ref>
<ref id="B57">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Salton</surname> <given-names>G.</given-names></name></person-group> (<year>1989</year>). <source>Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer</source>. <publisher-loc>Boston, MA</publisher-loc>: <publisher-name>Addison-Wesley</publisher-name>.</citation></ref>
<ref id="B58">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schmidhuber</surname> <given-names>J.</given-names></name></person-group> (<year>2015</year>). <article-title>Deep learning in neural networks: an overview</article-title>. <source>Neural Netw.</source> <volume>61</volume>, <fpage>85</fpage>&#x02013;<lpage>117</lpage>. <pub-id pub-id-type="doi">10.1016/j.neunet.2014.09.003</pub-id><pub-id pub-id-type="pmid">25462637</pub-id></citation></ref>
<ref id="B59">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schuster</surname> <given-names>M.</given-names></name> <name><surname>Paliwal</surname> <given-names>K.</given-names></name></person-group> (<year>1997</year>). <article-title>Bidirectional recurrent neural networks</article-title>. <source>IEEE Trans. Signal Process.</source> <volume>45</volume>, <fpage>2673</fpage>&#x02013;<lpage>2681</lpage>.</citation></ref>
<ref id="B60">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Socher</surname> <given-names>R.</given-names></name> <name><surname>Huang</surname> <given-names>E. H.</given-names></name> <name><surname>Pennington</surname> <given-names>J.</given-names></name> <name><surname>Ng</surname> <given-names>A. Y.</given-names></name> <name><surname>Manning</surname> <given-names>C. D.</given-names></name></person-group> (<year>2011</year>). <article-title>Dynamic pooling and unfolding recursive autoencoders for paraphrase detection,</article-title> in <source>Advances in Neural Information Processing Systems 24</source> (<publisher-loc>Granada</publisher-loc>).</citation></ref>
<ref id="B61">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Socher</surname> <given-names>R.</given-names></name> <name><surname>Huval</surname> <given-names>B.</given-names></name> <name><surname>Manning</surname> <given-names>C. D.</given-names></name> <name><surname>Ng</surname> <given-names>A. Y.</given-names></name></person-group> (<year>2012</year>). <article-title>Semantic compositionality through recursive matrix-vector spaces,</article-title> in <source>Proceedings of the 2012 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source> (<publisher-loc>Jeju</publisher-loc>).</citation></ref>
<ref id="B62">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sorzano</surname> <given-names>C. O. S.</given-names></name> <name><surname>Vargas</surname> <given-names>J.</given-names></name> <name><surname>Pascual-Montano</surname> <given-names>A.</given-names></name></person-group> (<year>2014</year>). <article-title>A survey of dimensionality reduction techniques</article-title>. <source>arXiv (Preprint). arXiv:1403.2877</source>.</citation></ref>
<ref id="B63">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Turney</surname> <given-names>P. D.</given-names></name></person-group> (<year>2006</year>). <article-title>Similarity of semantic relations</article-title>. <source>Comput. Linguist.</source> <volume>32</volume>, <fpage>379</fpage>&#x02013;<lpage>416</lpage>. <pub-id pub-id-type="doi">10.1162/coli.2006.32.3.379</pub-id></citation></ref>
<ref id="B64">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Turney</surname> <given-names>P. D.</given-names></name> <name><surname>Pantel</surname> <given-names>P.</given-names></name></person-group> (<year>2010</year>). <article-title>From frequency to meaning: vector space models of semantics</article-title>. <source>J. Artif. Intell. Res.</source> <volume>37</volume>, <fpage>141</fpage>&#x02013;<lpage>188</lpage>. <pub-id pub-id-type="doi">10.1613/jair.2934</pub-id></citation></ref>
<ref id="B65">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Vaswani</surname> <given-names>A.</given-names></name> <name><surname>Shazeer</surname> <given-names>N.</given-names></name> <name><surname>Parmar</surname> <given-names>N.</given-names></name> <name><surname>Uszkoreit</surname> <given-names>J.</given-names></name> <name><surname>Jones</surname> <given-names>L.</given-names></name> <name><surname>Gomez</surname> <given-names>A. N.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Attention is all you need,</article-title> in <source>Advances in Neural Information Processing Systems 30</source>, eds <person-group person-group-type="editor"><name><surname>Guyon</surname> <given-names>I.</given-names></name> <name><surname>Luxburg</surname> <given-names>U. V.</given-names></name> <name><surname>Bengio</surname> <given-names>S.</given-names></name> <name><surname>Wallach</surname> <given-names>H.</given-names></name> <name><surname>Fergus</surname> <given-names>R.</given-names></name> <name><surname>Vishwanathan</surname> <given-names>S.</given-names></name> <name><surname>Garnett</surname> <given-names>R.</given-names></name></person-group> (<publisher-loc>Long Beach, CA</publisher-loc>: <publisher-name>Curran Associates, Inc</publisher-name>.), <fpage>5998</fpage>&#x02013;<lpage>6008</lpage>.</citation></ref>
<ref id="B66">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Vincent</surname> <given-names>P.</given-names></name> <name><surname>Larochelle</surname> <given-names>H.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Manzagol</surname> <given-names>P.-A.</given-names></name></person-group> (<year>2008</year>). <article-title>Extracting and composing robust features with denoising autoencoders,</article-title> in <source>Proceedings of the 25th International Conference on Machine Learning</source> (<publisher-loc>Helsinki</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>1096</fpage>&#x02013;<lpage>1103</lpage>.</citation></ref>
<ref id="B67">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vincent</surname> <given-names>P.</given-names></name> <name><surname>Larochelle</surname> <given-names>H.</given-names></name> <name><surname>Lajoie</surname> <given-names>I.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Manzagol</surname> <given-names>P.-A.</given-names></name></person-group> (<year>2010</year>). <article-title>Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion</article-title>. <source>J. Mach. Learn. Res.</source> <volume>11</volume>, <fpage>3371</fpage>&#x02013;<lpage>3408</lpage>.</citation></ref>
<ref id="B68">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Vinyals</surname> <given-names>O.</given-names></name> <name><surname>Kaiser</surname> <given-names>&#x00141;.</given-names></name> <name><surname>Koo</surname> <given-names>T.</given-names></name> <name><surname>Petrov</surname> <given-names>S.</given-names></name> <name><surname>Sutskever</surname> <given-names>I.</given-names></name> <name><surname>Hinton</surname> <given-names>G.</given-names></name></person-group> (<year>2015a</year>). <article-title>Grammar as a foreign language,</article-title> in <source>Advances in Neural Information Processing Systems 28</source>, eds <person-group person-group-type="editor"><name><surname>Cortes</surname> <given-names>C.</given-names></name> <name><surname>Lawrence</surname> <given-names>N. D.</given-names></name> <name><surname>Lee</surname> <given-names>D. D.</given-names></name> <name><surname>Sugiyama</surname> <given-names>M.</given-names></name> <name><surname>Garnett</surname> <given-names>R.</given-names></name></person-group> (<publisher-loc>Montreal, QC</publisher-loc>: <publisher-name>Curran Associates, Inc</publisher-name>.), <fpage>2755</fpage>&#x02013;<lpage>2763</lpage>.</citation></ref>
<ref id="B69">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Vinyals</surname> <given-names>O.</given-names></name> <name><surname>Toshev</surname> <given-names>A.</given-names></name> <name><surname>Bengio</surname> <given-names>S.</given-names></name> <name><surname>Erhan</surname> <given-names>D.</given-names></name></person-group> (<year>2015b</year>). <article-title>Show and tell: a neural image caption generator,</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Boston, MA</publisher-loc>), <fpage>3156</fpage>&#x02013;<lpage>3164</lpage>.</citation></ref>
<ref id="B70">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Weiss</surname> <given-names>D.</given-names></name> <name><surname>Alberti</surname> <given-names>C.</given-names></name> <name><surname>Collins</surname> <given-names>M.</given-names></name> <name><surname>Petrov</surname> <given-names>S.</given-names></name></person-group> (<year>2015</year>). <article-title>Structured training for neural network transition-based parsing</article-title>. <source>arXiv (Preprint). arXiv:1506.06158</source>. <pub-id pub-id-type="doi">10.3115/v1/P15-1032</pub-id></citation></ref>
<ref id="B71">
<citation citation-type="thesis"><person-group person-group-type="author"><name><surname>Werbos</surname> <given-names>P.</given-names></name></person-group> (<year>1974</year>). <source>Beyond regression: new tools for prediction and analysis in the behavioral sciences</source> (Ph.D. thesis). Harvard University, Cambridge, MA.</citation></ref>
<ref id="B72">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xu</surname> <given-names>K.</given-names></name> <name><surname>Ba</surname> <given-names>J.</given-names></name> <name><surname>Kiros</surname> <given-names>R.</given-names></name> <name><surname>Cho</surname> <given-names>K.</given-names></name> <name><surname>Courville</surname> <given-names>A.</given-names></name> <name><surname>Salakhutdinov</surname> <given-names>R.</given-names></name> <etal/></person-group>. (<year>2015</year>). <article-title>Show, attend and tell: neural image caption generation with visual attention,</article-title> in <source>Proceedings of the 32nd International Conference on Machine Learning, PMLR</source>, Vol. <volume>37</volume>, <fpage>2048</fpage>&#x02013;<lpage>2057</lpage>.</citation></ref>
<ref id="B73">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zanzotto</surname> <given-names>F. M.</given-names></name> <name><surname>Dell&#x00027;Arciprete</surname> <given-names>L.</given-names></name></person-group> (<year>2012</year>). <article-title>Distributed tree kernels,</article-title> in <source>Proceedings of International Conference on Machine Learning</source> (<publisher-loc>Edinburgh</publisher-loc>).</citation></ref>
<ref id="B74">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zanzotto</surname> <given-names>F. M.</given-names></name> <name><surname>Ferrone</surname> <given-names>L.</given-names></name> <name><surname>Baroni</surname> <given-names>M.</given-names></name></person-group> (<year>2015</year>). <article-title>When the whole is not greater than the combination of its parts: a decompositional look at compositional distributional semantics</article-title>. <source>Comput. Linguist.</source> <volume>41</volume>, <fpage>165</fpage>&#x02013;<lpage>173</lpage>. <pub-id pub-id-type="doi">10.1162/COLI_a_00215</pub-id></citation></ref>
<ref id="B75">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zanzotto</surname> <given-names>F. M.</given-names></name> <name><surname>Korkontzelos</surname> <given-names>I.</given-names></name> <name><surname>Fallucchi</surname> <given-names>F.</given-names></name> <name><surname>Manandhar</surname> <given-names>S.</given-names></name></person-group> (<year>2010</year>). <article-title>Estimating linear models for compositional distributional semantics,</article-title> in <source>Proceedings of the 23rd International Conference on Computational Linguistics (COLING)</source> (<publisher-loc>Beijing</publisher-loc>).</citation></ref>
<ref id="B76">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zeiler</surname> <given-names>M. D.</given-names></name> <name><surname>Fergus</surname> <given-names>R.</given-names></name></person-group> (<year>2014a</year>). <article-title>Visualizing and understanding convolutional networks,</article-title> in <source>Computer Vision &#x02013; ECCV 2014</source>, eds <person-group person-group-type="editor"><name><surname>Fleet</surname> <given-names>D.</given-names></name> <name><surname>Pajdla</surname> <given-names>T.</given-names></name> <name><surname>Schiele</surname> <given-names>B.</given-names></name> <name><surname>Tuytelaars</surname> <given-names>T.</given-names></name></person-group> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>), <fpage>818</fpage>&#x02013;<lpage>833</lpage>.</citation></ref>
<ref id="B77">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zeiler</surname> <given-names>M. D.</given-names></name> <name><surname>Fergus</surname> <given-names>R.</given-names></name></person-group> (<year>2014b</year>). <article-title>Visualizing and understanding convolutional networks,</article-title> in <source>European Conference on Computer Vision</source> (<publisher-loc>Zurich</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>818</fpage>&#x02013;<lpage>833</lpage>.</citation></ref>
<ref id="B78">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zou</surname> <given-names>W. Y.</given-names></name> <name><surname>Socher</surname> <given-names>R.</given-names></name> <name><surname>Cer</surname> <given-names>D. M.</given-names></name> <name><surname>Manning</surname> <given-names>C. D.</given-names></name></person-group> (<year>2013</year>). <article-title>Bilingual word embeddings for phrase-based machine translation,</article-title> in <source>Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source> (<publisher-loc>Seattle, WA</publisher-loc>), <fpage>1393</fpage>&#x02013;<lpage>1398</lpage>.</citation></ref>
</ref-list>
<fn-group>
<fn id="fn0001"><p><sup>1</sup>We can usually think of this as a timestep, but not all applications of recurrent neural networks have a temporal interpretation.</p></fn>
</fn-group>
</back>
</article>