<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Psychol.</journal-id>
<journal-title>Frontiers in Psychology</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Psychol.</abbrev-journal-title>
<issn pub-type="epub">1664-1078</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fpsyg.2017.01551</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Psychology</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Recurrent Convolutional Neural Networks: A Better Model of Biological Object Recognition</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Spoerer</surname> <given-names>Courtney J.</given-names></name>
<xref ref-type="author-notes" rid="fn001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/399453/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>McClure</surname> <given-names>Patrick</given-names></name>
<uri xlink:href="http://loop.frontiersin.org/people/360873/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Kriegeskorte</surname> <given-names>Nikolaus</given-names></name>
<uri xlink:href="http://loop.frontiersin.org/people/1330/overview"/>
</contrib>
</contrib-group>
<aff><institution>Medical Research Council Cognition and Brain Sciences Unit, University of Cambridge</institution> <country>Cambridge, United Kingdom</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: James L. McClelland, University of Pennsylvania, United States</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Fred H. Hamker, Technische Universit&#x000E4;t Chemnitz, Germany; Ko Sakai, University of Tsukuba, Japan</p></fn>
<fn fn-type="corresp" id="fn001"><p>&#x0002A;Correspondence: Courtney J. Spoerer <email>courtney.spoerer&#x00040;mrc-cbu.cam.ac.uk</email></p></fn>
<fn fn-type="other" id="fn002"><p>This article was submitted to Perception Science, a section of the journal Frontiers in Psychology</p></fn></author-notes>
<pub-date pub-type="epub">
<day>12</day>
<month>09</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="collection">
<year>2017</year>
</pub-date>
<volume>8</volume>
<elocation-id>1551</elocation-id>
<history>
<date date-type="received">
<day>03</day>
<month>05</month>
<year>2017</year>
</date>
<date date-type="accepted">
<day>25</day>
<month>08</month>
<year>2017</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2017 Spoerer, McClure and Kriegeskorte.</copyright-statement>
<copyright-year>2017</copyright-year>
<copyright-holder>Spoerer, McClure and Kriegeskorte</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract><p>Feedforward neural networks provide the dominant model of how the brain performs visual object recognition. However, these networks lack the lateral and feedback connections, and the resulting recurrent neuronal dynamics, of the ventral visual pathway in the human and non-human primate brain. Here we investigate recurrent convolutional neural networks with bottom-up (B), lateral (L), and top-down (T) connections. Combining these types of connections yields four architectures (B, BT, BL, and BLT), which we systematically test and compare. We hypothesized that recurrent dynamics might improve recognition performance in the challenging scenario of partial occlusion. We introduce two novel occluded object recognition tasks to test the efficacy of the models, <italic>digit clutter</italic> (where multiple target digits occlude one another) and <italic>digit debris</italic> (where target digits are occluded by digit fragments). We find that recurrent neural networks outperform feedforward control models (approximately matched in parametric complexity) at recognizing objects, both in the absence of occlusion and in all occlusion conditions. Recurrent networks were also found to be more robust to the inclusion of additive Gaussian noise. Recurrent neural networks are better in two respects: (1) they are more neurobiologically realistic than their feedforward counterparts; (2) they are better in terms of their ability to recognize objects, especially under challenging conditions. This work shows that computer vision can benefit from using recurrent convolutional architectures and suggests that the ubiquitous recurrent connections in biological brains are essential for task performance.</p></abstract>
<kwd-group>
<kwd>object recognition</kwd>
<kwd>occlusion</kwd>
<kwd>top-down processing</kwd>
<kwd>convolutional neural network</kwd>
<kwd>recurrent neural network</kwd>
</kwd-group>
<contract-num rid="cn001">MC-A060- 5PR20</contract-num>
<contract-num rid="cn002">ERC-2010-StG 261352</contract-num>
<contract-sponsor id="cn001">Medical Research Council<named-content content-type="fundref-id">10.13039/501100000265</named-content></contract-sponsor>
<contract-sponsor id="cn002">European Research Council<named-content content-type="fundref-id">10.13039/501100000781</named-content></contract-sponsor>
<counts>
<fig-count count="9"/>
<table-count count="5"/>
<equation-count count="15"/>
<ref-count count="49"/>
<page-count count="14"/>
<word-count count="8745"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>The primate visual system is highly efficient at object recognition, requiring only brief presentations of the stimulus to perform the task (Potter, <xref ref-type="bibr" rid="B33">1976</xref>; Thorpe et al., <xref ref-type="bibr" rid="B42">1996</xref>; Keysers et al., <xref ref-type="bibr" rid="B20">2001</xref>). Within 150 ms of stimulus onset, neurons in inferior temporal cortex (IT) encode object information in a form that is robust to transformations in scale and position (Hung et al., <xref ref-type="bibr" rid="B17">2005</xref>; Isik et al., <xref ref-type="bibr" rid="B18">2014</xref>), and is predictive of human behavioral responses (Majaj et al., <xref ref-type="bibr" rid="B29">2015</xref>).</p>
<p>This rapid processing lends support to the idea that invariant object recognition can be explained through a feedforward process (DiCarlo et al., <xref ref-type="bibr" rid="B9">2012</xref>), a claim that has been supported by the recent successes of feedforward neural networks in computer vision (e.g., Krizhevsky et al., <xref ref-type="bibr" rid="B24">2012</xref>) and the usefulness of these networks as models of primate visual processing (Wallis and Rolls, <xref ref-type="bibr" rid="B43">1997</xref>; Riesenhuber and Poggio, <xref ref-type="bibr" rid="B34">1999</xref>; Serre et al., <xref ref-type="bibr" rid="B37">2007</xref>; Yamins et al., <xref ref-type="bibr" rid="B46">2013</xref>, <xref ref-type="bibr" rid="B47">2014</xref>; Khaligh-Razavi and Kriegeskorte, <xref ref-type="bibr" rid="B21">2014</xref>; G&#x000FC;&#x000E7;l&#x000FC; and van Gerven, <xref ref-type="bibr" rid="B14">2015</xref>).</p>
<p>The success of feedforward models of visual object recognition has resulted in feedback processing being underexplored in this domain. However, both anatomical and functional evidence suggests that feedback connections play a role in object recognition. For instance, it is well known that the ventral visual pathway contains similar densities of feedforward and feedback connections (Felleman and Van Essen, <xref ref-type="bibr" rid="B11">1991</xref>; Sporns and Zwi, <xref ref-type="bibr" rid="B39">2004</xref>; Markov et al., <xref ref-type="bibr" rid="B30">2014</xref>), and functional evidence from primate and human electrophysiology experiments shows that processing of object information unfolds over time, beyond what would be interpreted as feedforward processing (Sugase et al., <xref ref-type="bibr" rid="B40">1999</xref>; Brincat and Connor, <xref ref-type="bibr" rid="B3">2006</xref>; Freiwald and Tsao, <xref ref-type="bibr" rid="B12">2010</xref>; Carlson et al., <xref ref-type="bibr" rid="B4">2013</xref>; Cichy et al., <xref ref-type="bibr" rid="B6">2014</xref>; Clarke et al., <xref ref-type="bibr" rid="B7">2015</xref>). Some reports of robust object representations, normally attributed to feedforward processing (Isik et al., <xref ref-type="bibr" rid="B18">2014</xref>; Majaj et al., <xref ref-type="bibr" rid="B29">2015</xref>), occur within temporal delays that are consistent with fast local recurrent processing (Wyatte et al., <xref ref-type="bibr" rid="B45">2014</xref>). This suggests that we need to move beyond the standard feedforward model if we are to gain a complete understanding of visual object recognition within the brain.</p>
<p>Fast local recurrent processing is temporally dissociable from attentional effects in frontal and parietal areas, and is thought to be particularly important in recognition of degraded objects (for a review see Wyatte et al., <xref ref-type="bibr" rid="B45">2014</xref>). In particular, object recognition in the presence of occlusion is thought to engage recurrent processing. This is supported by the finding that recognition under these conditions produces delayed behavioral and neural responses, and recognition can be disrupted by masking, which is thought to interfere with recurrent processing (Johnson and Olshausen, <xref ref-type="bibr" rid="B19">2005</xref>; Wyatte et al., <xref ref-type="bibr" rid="B44">2012</xref>; Tang et al., <xref ref-type="bibr" rid="B41">2014</xref>). Furthermore, competitive processing, which is thought to be supported by lateral recurrent connectivity (Adesnik and Scanziani, <xref ref-type="bibr" rid="B1">2010</xref>), aids recognition of occluded objects (Kolankeh et al., <xref ref-type="bibr" rid="B22">2015</xref>). Scene information can also be decoded from areas of early visual cortex that correspond to occluded regions of the visual field (Smith and Muckli, <xref ref-type="bibr" rid="B38">2010</xref>), further supporting the claim that feedback processing is engaged when there is occlusion in the visual input.</p>
<p>Occluded object recognition has been investigated using neural network models in previous work, which found an important role for feedback connections when stimuli were partially occluded (O&#x00027;Reilly et al., <xref ref-type="bibr" rid="B32">2013</xref>). However, the type of occlusion used in these simulations, and previous experimental work, has involved fading out or deleting parts of images (Smith and Muckli, <xref ref-type="bibr" rid="B38">2010</xref>; Wyatte et al., <xref ref-type="bibr" rid="B44">2012</xref>; Tang et al., <xref ref-type="bibr" rid="B41">2014</xref>). This does not correspond well to vision in natural environments where occlusion is generated by objects occluding one another. Moreover, deleting parts of objects, as opposed to occluding them, leads to poorer accuracies and differences in early event-related potentials (ERPs) that could indicate different effects on local recurrent processing (Johnson and Olshausen, <xref ref-type="bibr" rid="B19">2005</xref>). Therefore, it is important to investigate the effects of actual object occlusion in neural networks to complement prior work on deletion.</p>
<p>In scenes where objects occlude one another it is important to correctly assign border ownership for successful recognition. Border ownership can be thought of as indicating which object is the occluder and which object is being occluded. Border ownership cells require information from outside their classical receptive field and border ownership signals are delayed relative to the initial feedforward input, which both suggest the involvement of recurrent processing (Craft et al., <xref ref-type="bibr" rid="B8">2007</xref>). A number of computational models have been developed to explain border ownership cells. What is common amongst these models is the presence of lateral or top-down connections (Zhaoping, <xref ref-type="bibr" rid="B49">2005</xref>; Sakai and Nishimura, <xref ref-type="bibr" rid="B36">2006</xref>; Craft et al., <xref ref-type="bibr" rid="B8">2007</xref>). The importance of recurrent processing for developing selectivity to border ownership further suggests that recurrence has an important role for recognizing occluded objects.</p>
<p>To test the effects of occlusion, we developed a new generative model for occlusion stimuli. The images contain parameterized, computer-generated digits in randomly jittered positions (optionally, the size and orientation can also be randomly varied). The code for generating these images is made available at <ext-link ext-link-type="uri" xlink:href="https://github.com/cjspoerer/digitclutter">https://github.com/cjspoerer/digitclutter</ext-link>. The task is to correctly identify these digits. Different forms of occlusion are added to these images, including occlusion from non-targets and from other targets present in the image; we refer to these as <italic>digit debris</italic> and <italic>digit clutter</italic>, respectively. The first form of occlusion, digit debris, simulates situations where targets are occluded by other objects that are task irrelevant. The second case, digit clutter, simulates occlusion where the objective is to account for the occlusion without suppressing the occluder, which is itself a target. This stimulus set has a number of benefits. Firstly, the underlying task is relatively simple to solve, which allows us to study the effects of occlusion and recurrence with small-scale neural networks. Therefore, any challenges to the network will only result from the introduction of occlusion. Additionally, as the stimuli are procedurally generated, they can be produced in large quantities, which enables the training of the networks.</p>
<p>Recurrent processing is sometimes thought of as cleaning up noise, where occlusion is a special case of noise. A simple case of noise is additive Gaussian noise, but we hypothesize that recurrence is unlikely to show benefits in these conditions. Consider the case of detecting simple visual features that show no variation, e.g., edges of different orientations. An optimal linear filter can be learnt for detecting these features. This linear filter would remain optimal under independent, additive Gaussian noise, as the expected value of the input and output will remain the same under repeated presentations. Whilst this result does not exactly hold for the case of non-linear filters that are normally used in neural networks, we might expect similar results. Therefore, we would expect no specific benefit of recurrence in the presence of additive Gaussian noise. If this is true, we can infer that the role of recurrence is not for performing object recognition in noisy conditions, generally. Otherwise, it would support the conclusion that recurrence is useful across a wider range of challenging conditions.</p>
<p>In this work, we investigate object recognition using convolutional neural networks. We extend the idea of the convolutional architecture to networks with bottom-up (B), lateral (L), and top-down (T) connections in a similar fashion to previous work (Liang and Hu, <xref ref-type="bibr" rid="B27">2015</xref>; Liao and Poggio, <xref ref-type="bibr" rid="B28">2016</xref>). These connections roughly correspond to processing information from lower and higher regions in the ventral visual hierarchy (bottom-up and top-down connections), and processing information from within a region (lateral connections). We choose to use the convolutional architecture as it is a parameter efficient method for building large neural networks that can perform real-world tasks (LeCun et al., <xref ref-type="bibr" rid="B25">2015</xref>). It is directly inspired by biology, with restricted receptive fields and feature detectors that replicate across the visual field (Hubel and Wiesel, <xref ref-type="bibr" rid="B16">1968</xref>) and advances based on this architecture have produced useful models for visual neuroscience (Kriegeskorte, <xref ref-type="bibr" rid="B23">2015</xref>). The interchange between biology and engineering is important for the progress of both fields (Hassabis et al., <xref ref-type="bibr" rid="B15">2017</xref>). By using convolutional neural networks as the basis of our models, we aim to maximize the transfer of knowledge from these more biologically motivated experiments to applications in computer vision, and by using recurrent connections, we hope that our models will contribute to a better understanding of recurrent connections in biological vision whilst maintaining the benefits of scalability from convolutional architectures.</p>
<p>To test whether recurrent neural networks perform better than feedforward networks at occluded object recognition, we trained and tested a range of networks to perform a digit recognition task under varying levels of occlusion. Any difference in performance reflects the degree to which networks learn the underlying task of recognizing the target digits, and handle the occlusion when recognizing the digit. To differentiate between these two cases we also look at how well networks trained on occluded object recognition generalize to object recognition without occlusion. We also test whether recurrence shows an advantage for standard object recognition and when dealing with noisy inputs, more generally, by measuring object recognition performance with and without the presence of additive Gaussian noise. Finally, to study whether any benefit of recurrence extends to occluded object recognition where the occluder is also a target, the networks are tested on multiple digit recognition tasks where the targets overlap.</p>
</sec>
<sec sec-type="materials and methods" id="s2">
<title>2. Materials and methods</title>
<sec>
<title>2.1. Generative model for stimuli</title>
<p>To investigate the effect of occlusion in object recognition, we opt to use a task that could be solved trivially without the presence of occlusion: computer-generated digit recognition. Each digit uses the same font, color, and size. The only variable is the position of the digit, which is drawn from a uniform random distribution. This means that the only invariance problem that needs to be solved is translation invariance, which is effectively built into the convolutional networks we use. Therefore, we restrict ourselves to only altering the level of occlusion to increase task difficulty. This means we need to use some challenging occlusion scenarios to differentiate between the models. However, this allows us to isolate the effects of occlusion and, by keeping the overall task relatively simple, we can use small networks, allowing us to train them across a wide range of conditions.</p>
<p>We generate occlusion using two methods: by scattering debris across the image (digit debris), and by presenting overlapping digits within a scene, which the network has to recognize simultaneously (digit clutter).</p>
<p>For digit debris, we obtain debris from fragments of each of the possible targets, taking random crops from randomly selected digits. Each of these fragments is then added to a mask that is overlaid on the target digit (Figure <xref ref-type="fig" rid="F1">1</xref>). As a result, the visual features of non-target objects, which the network has to ignore, are present in the scene. These conditions mean that summing the overall visual features present for each digit becomes a less reliable strategy for inferring the target digit. This is in contrast to deletion where there is only a removal of features that belong to the target.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>The process for generating stimuli for digit debris. First the target digit is generated. Random crops of all possible targets are taken to create a mask of debris, which is applied to the target as an occluder.</p></caption>
<graphic xlink:href="fpsyg-08-01551-g0001.tif"/>
</fig>
<p>However, within natural visual scenes, occlusion is generated by other whole objects. These objects might also be of interest to the observer. In this scenario, simply ignoring the occluding objects would not make sense. In digit clutter, these cases are simulated by generating images with multiple digits that are sequentially placed in an image, where their positions are also drawn from a uniform random distribution. This generates a series of digits that overlap, producing a relative depth order. The task of the network is then to recognize all digits that are present.</p>
<p>Design of these images was performed at high resolution (512 &#x000D7; 512 pixels) and, for computational simplicity, the images were resized to a low resolution (32 &#x000D7; 32 pixels) when presented to the network.</p>
<p>In these experiments we use stimulus sets that vary in either the number of digits in a scene&#x02014;three digits, four digits, or five digits&#x02014;or the number of fragments that make up the debris&#x02014;10 fragments (light debris), 30 fragments (moderate debris), or 50 fragments (heavy debris). Examples from these stimulus sets are shown in Figure <xref ref-type="fig" rid="F2">2</xref>. This allows us to measure how the performance of the networks differs across these task types and levels of occlusion.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>High resolution examples from the stimulus sets used in these experiments. The top row shows digit debris stimuli for each of the three conditions tested here, with 10, 30, and 50 fragments. The bottom row shows digit clutter stimuli with 3, 4, and 5 digits.</p></caption>
<graphic xlink:href="fpsyg-08-01551-g0002.tif"/>
</fig>
<p>For each of these image sets, we randomly generated a training set of 100,000 images and a validation set of 10,000 images, which were used for determining the hyperparameters and learning regime. All analyses were performed on an independent test set of 10,000 images.</p>
<p>All images underwent pixel-wise normalization prior to being passed to the network. For an input pixel <italic>x</italic> in position <italic>i, j</italic>, this is defined as:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mrow><mml:msub><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02212;</mml:mo><mml:msub><mml:mover accent='true'><mml:mi>x</mml:mi><mml:mo>&#x000AF;</mml:mo></mml:mover><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula>
<p>where <italic>x</italic><sub><italic>i, j</italic></sub> is the raw pixel value, <inline-formula><mml:math id="M16"><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mo>&#x00304;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the mean pixel value and <italic>s</italic><sub><italic>x</italic><sub><italic>i, j</italic></sub></sub> is the standard deviation of pixel values. The mean and standard deviation are computed for each specific position across the whole of the training data.</p>
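<p>The normalization in Equation (1) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code used for the experiments: the function names are hypothetical, the data are random placeholders, the population standard deviation is assumed, and leaving zero-variance positions unscaled is an added safeguard that the text does not specify.</p>
<preformat>
```python
import numpy as np

def fit_pixel_stats(train_images):
    """Per-position mean and standard deviation over an (N, H, W) training set."""
    return train_images.mean(axis=0), train_images.std(axis=0)

def normalize(images, mean, std):
    """Apply Equation (1) at every pixel position.

    Positions with zero variance are left unscaled (an assumption of this
    sketch, to avoid division by zero).
    """
    safe_std = np.where(std > 0, std, 1.0)
    return (images - mean) / safe_std

rng = np.random.default_rng(0)
train = rng.random((1000, 32, 32))   # placeholder data, not the real stimuli
mean, std = fit_pixel_stats(train)   # statistics come from the training set only
train_norm = normalize(train, mean, std)
```
</preformat>
<p>The same training-set statistics would then be reused to normalize the validation and test images.</p>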
<p>To test the hypothesis that the benefit of recurrence is not simply for cleaning up noise, we also test the network on object recognition where the input has additive Gaussian noise. To prevent ceiling performance, we use the MNIST handwritten digit recognition data set (LeCun et al., <xref ref-type="bibr" rid="B26">1998</xref>). The MNIST data set contains 60,000 images in total that are divided into a training set of 50,000 images, a validation set of 5,000 images, and a testing set of 10,000 images.</p>
<p>We add Gaussian noise to these images after normalization, which allows an easy interpretation in terms of signal-to-noise ratio. In this case, we use Gaussian noise with standard deviations of 1 and 2, which produces images with a signal-to-noise ratio (SNR) of 1 and 0.5, respectively.</p>
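<p>As a rough sketch of this noise manipulation, the snippet below adds independent Gaussian noise to images that are assumed to be normalized already. The function names are hypothetical and the input is a random stand-in for normalized data. The quoted SNR values are consistent with defining SNR as the ratio of the signal standard deviation (1 after normalization) to the noise standard deviation, which is the definition assumed here.</p>
<preformat>
```python
import numpy as np

def add_gaussian_noise(normalized_images, noise_sd, rng):
    """Add independent zero-mean Gaussian noise with standard deviation noise_sd."""
    noise = rng.normal(0.0, noise_sd, size=normalized_images.shape)
    return normalized_images + noise

def snr(signal_sd, noise_sd):
    """Signal-to-noise ratio as a ratio of standard deviations (assumed definition)."""
    return signal_sd / noise_sd

rng = np.random.default_rng(0)
images = rng.standard_normal((100, 28, 28))  # stand-in for normalized images
noisy = add_gaussian_noise(images, 2.0, rng)
```
</preformat>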
</sec>
<sec>
<title>2.2. Models</title>
<p>In these experiments we use a range of convolutional neural networks (for an introduction to this architecture, see Goodfellow et al., <xref ref-type="bibr" rid="B13">2016</xref>). These networks can be categorized by the particular combination of bottom-up, lateral, and top-down connections that are present. As it does not make sense to construct the networks without bottom-up connections (as information from the input cannot reach higher layers), we are left with four possible architectures with the following connections: bottom-up only (B), bottom-up and top-down (BT), bottom-up and lateral (BL), and bottom-up, lateral and top-down (BLT). Each of these architectures is illustrated schematically in Figure <xref ref-type="fig" rid="F3">3</xref>.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Schematic diagrams for each of the architectures used. Arrows indicate bottom-up (blue), lateral (green), and top-down (red) convolutions.</p></caption>
<graphic xlink:href="fpsyg-08-01551-g0003.tif"/>
</fig>
<p>Adding top-down or lateral connections to feedforward models introduces cycles into the graphical structure of the network. The presence of cycles in these networks allows recurrent computations to take place, introducing internally generated temporal dynamics to the models. In comparison, temporal dynamics of feedforward networks can only be driven by changes in the input. The effect of recurrent connections can be seen through the unrolling of the computational graph across time steps. In these experiments, we run our models for four time steps and the resulting graph for BLT is illustrated in Figure <xref ref-type="fig" rid="F4">4</xref>.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>The computational graph of BLT unrolled over four time steps. The shaded boxes indicate hidden layers that receive purely feedforward input (blue) and those that receive both feedforward and recurrent input (purple).</p></caption>
<graphic xlink:href="fpsyg-08-01551-g0004.tif"/>
</fig>
<p>As the recurrent networks (BT, BL, and BLT) have additional connections compared to purely feedforward networks (B), they also have a larger number of free parameters (Table <xref ref-type="table" rid="T1">1</xref>). To control for this difference, we test two variants of B that have a more similar number of parameters to the recurrent networks. The first control increases the number of features that can be learned by the bottom-up connections and the second control increases the size of individual features (known as the kernel size). These are referred to as B-F and B-K, respectively. Conceptually, B-K is a more appropriate control compared to B-F, as it effectively increases the number of connections that each unit has, holding everything else constant. In comparison, B-F increases the number of units within a layer, altering the layer&#x00027;s representational power, in addition to changing the number of parameters. However, B-F is more closely parameter matched to some of the recurrent models, which motivates the inclusion of B-F in our experiments.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Brief descriptions of the models used in these experiments including the number of learnable parameters and the number of units in each model.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Model</bold></th>
<th valign="top" align="center"><bold>Kernel size</bold></th>
<th valign="top" align="center"><bold>No. Features</bold></th>
<th valign="top" align="center"><bold>No. parameters</bold></th>
<th valign="top" align="center"><bold>No. units</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">B</td>
<td valign="top" align="center">3 &#x000D7; 3</td>
<td valign="top" align="center">32</td>
<td valign="top" align="center">9,898</td>
<td valign="top" align="center">40,970</td>
</tr>
<tr>
<td valign="top" align="left">B-F</td>
<td valign="top" align="center">3 &#x000D7; 3</td>
<td valign="top" align="center">64</td>
<td valign="top" align="center">38,218</td>
<td valign="top" align="center">81,930</td>
</tr>
<tr>
<td valign="top" align="left">B-K</td>
<td valign="top" align="center">5 &#x000D7; 5</td>
<td valign="top" align="center">32</td>
<td valign="top" align="center">26,794</td>
<td valign="top" align="center">40,970</td>
</tr>
<tr>
<td valign="top" align="left">BT</td>
<td valign="top" align="center">3 &#x000D7; 3</td>
<td valign="top" align="center">32</td>
<td valign="top" align="center">19,114</td>
<td valign="top" align="center">40,970</td>
</tr>
<tr>
<td valign="top" align="left">BL</td>
<td valign="top" align="center">3 &#x000D7; 3</td>
<td valign="top" align="center">32</td>
<td valign="top" align="center">28,330</td>
<td valign="top" align="center">40,970</td>
</tr>
<tr>
<td valign="top" align="left">BLT</td>
<td valign="top" align="center">3 &#x000D7; 3</td>
<td valign="top" align="center">32</td>
<td valign="top" align="center">37,546</td>
<td valign="top" align="center">40,970</td>
</tr>
</tbody>
</table>
</table-wrap>
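<p>The counts in Table 1 can be reproduced with some simple arithmetic, sketched below. The bookkeeping rests on assumptions that are not stated in this part of the text but that yield exactly the tabulated values: a single-channel 32 by 32 input, one bias per feature map on the bottom-up connections only, a readout with one weight per feature and class over 10 classes, lateral connections in both hidden layers, and a single top-down connection from the second hidden layer to the first. The function names are hypothetical.</p>
<preformat>
```python
def conv_params(kernel, in_features, out_features, bias=True):
    """Weights (plus optional biases) of one convolutional connection."""
    n_weights = kernel * kernel * in_features * out_features
    return n_weights + (out_features if bias else 0)

def count_params(kernel, features, lateral=False, top_down=False,
                 in_channels=1, n_classes=10):
    # Bottom-up connections into the two hidden layers, with biases.
    total = conv_params(kernel, in_channels, features)
    total += conv_params(kernel, features, features)
    # Readout: one weight per feature and class, plus biases (assumed form).
    total += features * n_classes + n_classes
    if lateral:   # one lateral connection per hidden layer, no extra bias
        total += 2 * conv_params(kernel, features, features, bias=False)
    if top_down:  # a single top-down connection from layer 2 to layer 1
        total += conv_params(kernel, features, features, bias=False)
    return total

def count_units(features, size=32, n_classes=10):
    # Layer 1 at full resolution, layer 2 after 2 x 2 pooling, plus the readout.
    return features * size * size + features * (size // 2) ** 2 + n_classes

models = {
    "B": count_params(3, 32),
    "B-F": count_params(3, 64),
    "B-K": count_params(5, 32),
    "BT": count_params(3, 32, top_down=True),
    "BL": count_params(3, 32, lateral=True),
    "BLT": count_params(3, 32, lateral=True, top_down=True),
}
```
</preformat>
<p>Under these assumptions each recurrent connection adds 3 &#x000D7; 3 &#x000D7; 32 &#x000D7; 32 = 9,216 weights, which accounts for the differences between B, BT, BL, and BLT in the table.</p>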
<sec>
<title>2.2.1. Architecture overview</title>
<p>All of the models tested consist of two hidden recurrent convolutional layers (described in Section 2.2.2) followed by a readout layer (described in Section 2.2.3). Bottom-up and lateral connections are implemented as standard convolutional layers with a 1 &#x000D7; 1 stride. The feedforward inputs between the hidden layers go through a max pooling operation, with a 2 &#x000D7; 2 stride and a 2 &#x000D7; 2 kernel. This has the effect of reducing the height and width of a layer by a factor of two. As a result, we cannot use standard convolutions for top-down connections, as the size of the top-down input from the second hidden layer would not match the size of the first hidden layer. To increase the size of the top-down input, we use transposed convolution (also known as deconvolution; Zeiler et al., <xref ref-type="bibr" rid="B48">2010</xref>) with an output stride of 2 &#x000D7; 2. This deconvolution increases the size of the top-down input so that it matches the size of the first hidden layer. The connectivity of this layer can be understood as a normal convolutional layer with 2 &#x000D7; 2 stride where the input and output sides of the layer have been switched.</p>
<p>As feedforward networks do not have any internal dynamics and the stimuli are static, feedforward networks only run for one time step. Each of the recurrent networks is run for four time steps. This is implemented as a computational graph unrolled over time (Figure <xref ref-type="fig" rid="F4">4</xref>), where the weights for particular connections are shared across each time step. The input is also replicated at each time point.</p>
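<p>The size mismatch that motivates the transposed convolution can be illustrated on a single feature map. The following NumPy sketch is illustrative only: it handles one channel, the function names are hypothetical, and cropping the output to exactly twice the input size is an assumption about padding that the text does not specify.</p>
<preformat>
```python
import numpy as np

def max_pool_2x2(x):
    """2 x 2 max pooling with a 2 x 2 stride on an (H, W) map (H, W even)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def transposed_conv_2x(x, kernel):
    """Stride-2 transposed convolution of an (H, W) map with a (k, k) kernel.

    Each input unit adds a scaled copy of the kernel to the output, with
    copies placed two pixels apart. The result is cropped to (2H, 2W); which
    border is cropped is an assumption of this sketch (requires k of 2 or more).
    """
    h, w = x.shape
    k = kernel.shape[0]
    full = np.zeros((2 * (h - 1) + k, 2 * (w - 1) + k))
    for i in range(h):
        for j in range(w):
            full[2 * i:2 * i + k, 2 * j:2 * j + k] += x[i, j] * kernel
    return full[:2 * h, :2 * w]

rng = np.random.default_rng(0)
layer1 = rng.random((32, 32))              # one feature map of hidden layer 1
layer2 = max_pool_2x2(layer1)              # 16 x 16 after pooling
top_down = transposed_conv_2x(layer2, rng.random((3, 3)))  # back to 32 x 32
```
</preformat>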
<p>To train the network, error is backpropagated through time for each time point (Section 2.2.4), which means that the network is trained to converge as soon as possible, rather than at the final time step. However, when measuring the accuracy, we use the predictions at the final time step as this generally produces the highest accuracy.</p>
</sec>
<sec>
<title>2.2.2. Recurrent convolutional layers</title>
<p>The key component of these models is the recurrent convolutional layer (RCL). The inputs to these layers are denoted by <bold>h</bold><sub>(&#x003C4;, <italic>m, i, j</italic>)</sub>, which represents the vectorized input from a patch centered on location <italic>i, j</italic>, in layer <italic>m</italic>, computed at time step &#x003C4;, across all feature maps (indexed by <italic>k</italic>). We define <bold>h</bold><sub>(&#x003C4;, 0, <italic>i, j</italic>)</sub> as the input image to the network.</p>
<p>For B, the lack of recurrent connections reduces RCLs to a standard convolutional layer where the pre-activation at time step &#x003C4; for a unit in layer <italic>m</italic>, in feature map <italic>k</italic>, in position <italic>i, j</italic> is defined as:</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M2"><mml:mrow><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>&#x003C4;</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>w</mml:mi></mml:mstyle><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow><mml:mi>b</mml:mi></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mstyle mathvariant="sans-serif"><mml:mi>T</mml:mi></mml:mstyle></mml:msup><mml:mtext>&#x000A0;</mml:mtext><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>h</mml:mi></mml:mstyle><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x003C4;</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<p>where &#x003C4; &#x0003D; 0 (as B only runs for a single time step), the convolutional kernel for bottom-up connections is given in vectorized format by <inline-formula><mml:math id="M17"><mml:msubsup><mml:mrow><mml:mstyle class="text"><mml:mtext class="textit" mathvariant="bold-italic">w</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, and the bias for feature map <italic>k</italic> in layer <italic>m</italic> is given by <italic>b</italic><sub><italic>m, k</italic></sub>.</p>
<p>In BL, lateral inputs are added to the pre-activation, giving:</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M3"><mml:mrow><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>&#x003C4;</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>w</mml:mi></mml:mstyle><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow><mml:mi>b</mml:mi></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mstyle mathvariant="sans-serif"><mml:mi>T</mml:mi></mml:mstyle></mml:msup><mml:mtext>&#x000A0;</mml:mtext><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>h</mml:mi></mml:mstyle><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x003C4;</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>w</mml:mi></mml:mstyle><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow><mml:mi>l</mml:mi></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mstyle mathvariant="sans-serif"><mml:mi>T</mml:mi></mml:mstyle></mml:msup><mml:mtext>&#x000A0;</mml:mtext><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>h</mml:mi></mml:mstyle><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x003C4;</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo 
stretchy='false'>)</mml:mo></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<p>The term for lateral inputs <inline-formula><mml:math id="M18"><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mstyle class="text"><mml:mtext class="textit" mathvariant="bold-italic">w</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mstyle mathvariant="sans-serif"><mml:mi>T</mml:mi></mml:mstyle></mml:mrow></mml:msup><mml:mtext>&#x000A0;</mml:mtext><mml:msub><mml:mrow><mml:mstyle class="text"><mml:mtext class="textit" mathvariant="bold-italic">h</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003C4;</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> uses the same indexing conventions as the bottom-up inputs in Equation (2), where <inline-formula><mml:math id="M19"><mml:msubsup><mml:mrow><mml:mstyle class="text"><mml:mtext class="textit" mathvariant="bold-italic">w</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is the lateral convolutional kernel in vectorized format. As the lateral inputs depend on outputs computed at time step &#x003C4;&#x02212;1, they are undefined for the first time step, when &#x003C4; &#x0003D; 0. Therefore, when &#x003C4; &#x0003D; 0, we set the recurrent inputs to a vector of zeros. This rule applies to all recurrent inputs, including top-down inputs.</p>
<p>In BT, we add top-down inputs to the pre-activation instead of lateral inputs. This gives:</p>
<disp-formula id="E4"><label>(4)</label><mml:math id="M4"><mml:mrow><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>&#x003C4;</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>w</mml:mi></mml:mstyle><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow><mml:mi>b</mml:mi></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mstyle mathvariant="sans-serif"><mml:mi>T</mml:mi></mml:mstyle></mml:msup><mml:mtext>&#x000A0;</mml:mtext><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>h</mml:mi></mml:mstyle><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x003C4;</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>w</mml:mi></mml:mstyle><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow><mml:mi>t</mml:mi></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mstyle mathvariant="sans-serif"><mml:mi>T</mml:mi></mml:mstyle></mml:msup><mml:mtext>&#x000A0;</mml:mtext><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>h</mml:mi></mml:mstyle><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x003C4;</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo 
stretchy='false'>)</mml:mo></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
<p>Where the top-down term is <inline-formula><mml:math id="M20"><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mstyle class="text"><mml:mtext class="textit" mathvariant="bold-italic">w</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mstyle mathvariant="sans-serif"><mml:mi>T</mml:mi></mml:mstyle></mml:mrow></mml:msup><mml:mtext>&#x000A0;</mml:mtext><mml:msub><mml:mrow><mml:mstyle class="text"><mml:mtext class="textit" mathvariant="bold-italic">h</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003C4;</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula>, and <inline-formula><mml:math id="M26"><mml:msubsup><mml:mrow><mml:mstyle class="text"><mml:mtext class="textit" mathvariant="bold-italic">w</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is the top-down convolutional kernel in vectorized format. In our models, top-down connections can only be received from other hidden layers. As a result, top-down inputs are only given when <italic>m</italic> &#x0003D; 1; otherwise they are set to a vector of zeros. The same rule applies to the top-down inputs in BLT.</p>
<p>Finally, we can add both lateral and top-down inputs to the pre-activation, which generates the layers we use in BLT:</p>
<disp-formula id="E5"><label>(5)</label><mml:math id="M5"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>&#x003C4;</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>w</mml:mi></mml:mstyle><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow><mml:mi>b</mml:mi></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mstyle mathvariant="sans-serif"><mml:mi>T</mml:mi></mml:mstyle></mml:msup><mml:mtext>&#x000A0;</mml:mtext><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>h</mml:mi></mml:mstyle><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x003C4;</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>w</mml:mi></mml:mstyle><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow><mml:mi>l</mml:mi></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mstyle mathvariant="sans-serif"><mml:mi>T</mml:mi></mml:mstyle></mml:msup><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>h</mml:mi></mml:mstyle><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x003C4;</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo 
stretchy='false'>)</mml:mo></mml:mrow></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mo>+</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>w</mml:mi></mml:mstyle><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow><mml:mi>t</mml:mi></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mstyle mathvariant="sans-serif"><mml:mi>T</mml:mi></mml:mstyle></mml:msup><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>h</mml:mi></mml:mstyle><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x003C4;</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The output, <italic>h</italic><sub>&#x003C4;, <italic>m, i, j, k</italic></sub>, is calculated using the same operations for all layers. The pre-activation <italic>z</italic><sub>&#x003C4;, <italic>m, i, j, k</italic></sub> is passed through a layer of rectified linear units (ReLUs), and local response normalization (Krizhevsky et al., <xref ref-type="bibr" rid="B24">2012</xref>).</p>
<p>ReLUs are defined as:</p>
<disp-formula id="E6"><label>(6)</label><mml:math id="M6"><mml:mrow><mml:msub><mml:mi>&#x003C3;</mml:mi><mml:mi>z</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>&#x003C4;</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mtext>max</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mo>&#x0007B;</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>&#x003C4;</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0007D;</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:math></disp-formula>
<p>and local response normalization is defined for input <italic>x</italic><sub>&#x003C4;, <italic>m, i, j, k</italic></sub> as:</p>
<disp-formula id="E7"><label>(7)</label><mml:math id="M7"><mml:mrow><mml:mi>&#x003C9;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>&#x003C4;</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>&#x003C4;</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mtext>&#x000A0;</mml:mtext><mml:msup><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>c</mml:mi><mml:mo>+</mml:mo><mml:mi>&#x003B1;</mml:mi><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:msup><mml:mi>k</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo>=</mml:mo><mml:mtext>max</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mi>n</mml:mi><mml:mo>/</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mtext>min</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mi>n</mml:mi><mml:mo>/</mml:mo><mml:mn>2</mml:mn><mml:mo 
stretchy='false'>)</mml:mo></mml:mrow></mml:munderover><mml:mrow><mml:msubsup><mml:mi>x</mml:mi><mml:mrow><mml:mi>&#x003C4;</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:msup><mml:mi>k</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup></mml:mrow><mml:mn>2</mml:mn></mml:msubsup></mml:mrow></mml:mstyle></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mi>&#x003B2;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math></disp-formula>
<p>For local response normalization, we use <italic>n</italic> &#x0003D; 5, <italic>c</italic> &#x0003D; 1, &#x003B1; &#x0003D; 10<sup>&#x02212;4</sup>, and &#x003B2; &#x0003D; 0.5 throughout. This has the effect of inducing competition across the <italic>n</italic> closest features within a spatial location. The features are ordered arbitrarily and this ordering is held constant.</p>
<p>The output of layer <italic>m</italic> at time step &#x003C4; is then given by:</p>
<disp-formula id="E8"><label>(8)</label><mml:math id="M8"><mml:mrow><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>&#x003C4;</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x003C9;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>&#x003C3;</mml:mi><mml:mi>z</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>&#x003C4;</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:math></disp-formula>
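<p>Equations (6)&#x02013;(8) can be sketched in plain Python for the feature vector at a single spatial location. This is a minimal illustration, not the code used to train the models. The function names are hypothetical, <italic>n</italic>/2 is read as integer division, and the upper summation limit is taken to be the index of the last feature map, as in Krizhevsky et al. (<xref ref-type="bibr" rid="B24">2012</xref>); both readings are assumptions of this sketch.</p>
<preformat><![CDATA[
```python
def relu(z):
    """Equation (6): rectified linear unit."""
    return max(0.0, z)


def local_response_norm(x, n=5, c=1.0, alpha=1e-4, beta=0.5):
    """Equation (7) for the list x of feature values at one location.

    x[k] is the value of feature map k. Each value is divided by a term
    that grows with the squared activity of its n nearest feature maps.
    """
    num_features = len(x)
    half = n // 2  # assumption: n/2 is integer division
    out = []
    for k in range(num_features):
        lo = max(0, k - half)
        hi = min(num_features - 1, k + half)  # assumption: last map index
        sum_sq = sum(x[kp] ** 2 for kp in range(lo, hi + 1))
        out.append(x[k] * (c + alpha * sum_sq) ** (-beta))
    return out


def layer_output(z):
    """Equation (8): ReLU followed by local response normalization."""
    return local_response_norm([relu(z_k) for z_k in z])
```
]]></preformat>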
</sec>
<sec>
<title>2.2.3. Readout layer</title>
<p>In the final layer of each time step, a readout is calculated for each class. This is performed in three steps. First, a global max pooling layer returns the maximum output value for each feature map. Second, the output of the global max pooling layer is used as input to a fully connected layer with 10 output units. Third, these outputs are passed through a sigmoid non-linearity, &#x003C3;<sub><italic>y</italic></sub>(<italic>x</italic>), defined as:</p>
<disp-formula id="E9"><label>(9)</label><mml:math id="M9"><mml:mrow><mml:msub><mml:mi>&#x003C3;</mml:mi><mml:mi>y</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mi>x</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula>
<p>This has the effect of bounding the output between 0 and 1. The response of each of these outputs can be interpreted as the probability that each target is present or not.</p>
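<p>The three readout steps can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names and the nested-list data layout are hypothetical, and the bias term in the fully connected layer is an assumption, as this section does not say whether one is used.</p>
<preformat><![CDATA[
```python
import math


def sigmoid(x):
    """Equation (9), written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def readout(feature_maps, weights, biases):
    """Global max pooling, fully connected layer, then sigmoid.

    feature_maps: one 2-D list (rows of values) per feature map.
    weights[c][k]: weight from pooled feature k to output unit c.
    biases[c]: bias of output unit c (assumed; not stated in the text).
    """
    pooled = [max(max(row) for row in fmap) for fmap in feature_maps]
    outputs = []
    for w_row, b in zip(weights, biases):
        pre_activation = sum(w * p for w, p in zip(w_row, pooled)) + b
        outputs.append(sigmoid(pre_activation))
    return outputs
```
]]></preformat>
<p>In the models described here, <italic>weights</italic> would have 10 rows, one per digit class, and each returned value would lie between 0 and 1.</p>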
</sec>
<sec>
<title>2.2.4. Learning</title>
<p>At each time step, the networks give an output from the readout layer, which we denote <inline-formula><mml:math id="M100"><mml:mrow><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover></mml:mstyle><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>, where we interpret each output as the probability that a particular target is present or not.</p>
<p>In training, the objective is to match this output to a ground truth <bold>y</bold>, which uses binary encoding such that its elements <italic>y</italic><sub><italic>i</italic></sub> are defined as:</p>
<disp-formula id="E10"><label>(10)</label><mml:math id="M10"><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable columnalign='left'><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mn>1</mml:mn></mml:mtd><mml:mtd columnalign='left'><mml:mrow><mml:mtext>if&#x000A0;</mml:mtext><mml:mi>i</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:msup><mml:mi>y</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup></mml:mstyle></mml:mrow></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mn>0</mml:mn></mml:mtd><mml:mtd columnalign='left'><mml:mrow><mml:mtext>otherwise</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mrow></mml:mrow></mml:math></disp-formula>
<p>Where <italic>y&#x00027;</italic> is the list of target digits present.</p>
<p>We used cross-entropy to calculate the error between <inline-formula><mml:math id="M101"><mml:mrow><mml:msub><mml:mrow><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo stretchy='true'>&#x0005E;</mml:mo></mml:mover></mml:mrow><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> and <italic>y</italic>, which is summed across all time steps:</p>
<disp-formula id="E11"><label>(11)</label><mml:math id="M11"><mml:mrow><mml:mi>E</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo stretchy='true'>&#x0005E;</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>y</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mtext>&#x0200A;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x0200A;</mml:mtext><mml:mn>0</mml:mn></mml:mrow><mml:mi>T</mml:mi></mml:munderover><mml:mrow><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mtext>&#x0200A;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x0200A;</mml:mtext><mml:mn>0</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:munderover><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mstyle></mml:mrow></mml:mstyle><mml:mo>&#x000B7;</mml:mo><mml:mi>log</mml:mi><mml:msub><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x02212;</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x000B7;</mml:mo><mml:mi>log</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x02212;</mml:mo><mml:msub><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math></disp-formula>
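<p>As a sketch of Equation (11), the error can be accumulated over every time step and every output unit. The function name is hypothetical, and the clipping of predictions away from exactly 0 and 1 is an addition for numerical safety that is not part of the equation.</p>
<preformat><![CDATA[
```python
import math


def summed_cross_entropy(predictions, target, eps=1e-12):
    """Cross-entropy between readouts and target, summed over time.

    predictions[t][i]: readout for class i at time step t, in (0, 1).
    target[i]: 1 if class i is present, otherwise 0 (Equation 10).
    """
    total = 0.0
    for y_hat_t in predictions:
        for y_i, p in zip(target, y_hat_t):
            p = min(max(p, eps), 1.0 - eps)  # added safeguard, see lead-in
            total -= y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p)
    return total
```
]]></preformat>
<p>Because every time step contributes to the sum, a network that reaches the correct readout early incurs a lower total error than one that is correct only at the final step.</p>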
<p>L2-regularization is included, with a coefficient of &#x003BB; &#x0003D; 0.0005, making the overall loss function:</p>
<disp-formula id="E12"><label>(12)</label><mml:math id="M12"><mml:mrow><mml:mi mathcolor="black" mathvariant="-tex-caligraphic">L</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo stretchy='true'>&#x0005E;</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>y</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mi>E</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo stretchy='true'>&#x0005E;</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>y</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo><mml:mo>+</mml:mo><mml:mi>&#x003BB;</mml:mi><mml:mo>&#x0007C;</mml:mo><mml:mo>&#x0007C;</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>w</mml:mi></mml:mstyle><mml:mo>&#x0007C;</mml:mo><mml:msub><mml:mo>&#x0007C;</mml:mo><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:math></disp-formula>
<p>Where <bold>w</bold> is the vector of all trainable parameters in the model.</p>
<p>This loss function was then used to train the networks by changing the parameters at the end of each mini-batch of 100 images according to the momentum update rule:</p>
<disp-formula id="E13"><label>(13)</label><mml:math id="M13"><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mrow><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x003BC;</mml:mi><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mi>n</mml:mi></mml:msub><mml:mo>&#x02212;</mml:mo><mml:mi>&#x003B5;</mml:mi><mml:mfrac><mml:mrow><mml:mo>&#x02202;</mml:mo><mml:mi mathcolor="black" mathvariant="-tex-caligraphic">L</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo stretchy='true'>&#x0005E;</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>y</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x02202;</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>w</mml:mi></mml:mstyle></mml:mrow></mml:mfrac></mml:math></disp-formula>
<disp-formula id="E14"><label>(14)</label><mml:math id="M14"><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>w</mml:mi></mml:mstyle><mml:mrow><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>w</mml:mi></mml:mstyle><mml:mi>n</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mrow><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></disp-formula>
<p>Where <italic>n</italic> is the iteration index, &#x003BC; is the momentum coefficient, and &#x003B5; is the learning rate. We use &#x003BC; &#x0003D; 0.9 for all models and set &#x003B5; by the following learning rate decay rule:</p>
<disp-formula id="E15"><label>(15)</label><mml:math id="M15"><mml:mrow><mml:msub><mml:mi>&#x003B5;</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x003B7;</mml:mi><mml:msup><mml:mi>&#x003B4;</mml:mi><mml:mrow><mml:mfrac><mml:mi>e</mml:mi><mml:mi>d</mml:mi></mml:mfrac></mml:mrow></mml:msup></mml:mrow></mml:math></disp-formula>
<p>Where &#x003B7; is the initial learning rate, &#x003B4; is the decay rate, <italic>e</italic> is the epoch (a whole iteration through all training images), and <italic>d</italic> is the decay step. In our experiments we use &#x003B7; &#x0003D; 0.1, &#x003B4; &#x0003D; 0.1, and <italic>d</italic> &#x0003D; 40. All networks were trained for 100 epochs. The parameters for the training regime were optimized manually using the validation set.</p>
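<p>Equations (13)&#x02013;(15) can be sketched as below. The function names are hypothetical, parameters are represented as flat lists, and <italic>e</italic>/<italic>d</italic> is treated as a real-valued ratio, so that the learning rate decays smoothly; a stepwise reading (integer division) is equally compatible with the equation as written.</p>
<preformat><![CDATA[
```python
def learning_rate(epoch, eta=0.1, delta=0.1, decay_step=40):
    """Equation (15): learning rate for a given epoch.

    Assumes epoch / decay_step is real-valued (smooth decay).
    """
    return eta * delta ** (epoch / decay_step)


def momentum_step(w, v, grad, epsilon, mu=0.9):
    """Equations (13) and (14): one momentum update.

    w, v, grad: lists holding parameters, velocities, and the gradient
    of the loss with respect to each parameter.
    """
    v_next = [mu * v_i - epsilon * g_i for v_i, g_i in zip(v, grad)]
    w_next = [w_i + v_i for w_i, v_i in zip(w, v_next)]
    return w_next, v_next
```
]]></preformat>
<p>With the values given above, the learning rate under this reading falls by a factor of 10 every 40 epochs, from 0.1 at the start of training to 0.01 at epoch 40.</p>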
</sec>
</sec>
<sec>
<title>2.3. Analyzing model performance</title>
<sec>
<title>2.3.1. Comparing model accuracy</title>
<p>We measured the performance of the networks by calculating the accuracy across the test set. For digit clutter tasks with multiple labels, we took the top-<italic>n</italic> class outputs as the network predictions, where <italic>n</italic> is the number of digits present in that task.</p>
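<p>Selecting the top-<italic>n</italic> class outputs can be sketched as follows. The function name is hypothetical, and the sketch covers only the selection step; how a set of predictions is scored against the set of target digits is not specified here and is left out.</p>
<preformat><![CDATA[
```python
def top_n_classes(outputs, n):
    """Return the indices of the n largest readout values as a set."""
    ranked = sorted(range(len(outputs)), key=lambda i: outputs[i], reverse=True)
    return set(ranked[:n])
```
]]></preformat>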
<p>Accuracy was compared within image sets by performing pairwise McNemar&#x00027;s tests between all of the trained models (McNemar, <xref ref-type="bibr" rid="B31">1947</xref>). McNemar&#x00027;s test uses the variability in performance across stimuli as the basis for statistical inference (Dietterich, <xref ref-type="bibr" rid="B10">1998</xref>). It does not require repeated training from different random seeds, which is both computationally expensive and redundant, as networks converge on highly similar performance levels. By avoiding the need to retrain networks from different random initializations, we are able to explore a variety of qualitatively different architectures and infer differences between them.</p>
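<p>A minimal sketch of McNemar&#x00027;s test for a pair of networks is given below. The function name is hypothetical, and the continuity correction is an assumption, as the text does not say which variant of the statistic was used. The <italic>p</italic>-value uses the fact that the upper tail of a &#x003C7;<sup>2</sup> distribution with one degree of freedom equals erfc(&#x0221A;(<italic>x</italic>/2)).</p>
<preformat><![CDATA[
```python
import math


def mcnemar_test(correct_a, correct_b, continuity_correction=True):
    """McNemar's test on paired per-image outcomes of two networks.

    correct_a[i], correct_b[i]: True if the network classified image i
    correctly. Only images on which the networks disagree contribute.
    Returns (chi-square statistic, p-value) with 1 degree of freedom.
    """
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    discordant = only_a + only_b
    if discordant == 0:
        return 0.0, 1.0
    diff = abs(only_a - only_b)
    if continuity_correction:  # assumed variant, see lead-in
        diff = max(diff - 1, 0)
    statistic = diff ** 2 / discordant
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value
```
]]></preformat>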
<p>To mitigate the increased risk of false positives due to multiple comparisons, we control the false discovery rate (the expected proportion of false positives among the positive outcomes) at 0.05 for each group of pairwise tests using the Benjamini-Hochberg procedure (Benjamini and Hochberg, <xref ref-type="bibr" rid="B2">1995</xref>).</p>
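<p>The Benjamini-Hochberg procedure can be sketched as follows for one group of pairwise tests. The function name is hypothetical. The <italic>p</italic>-values are sorted, the largest rank <italic>r</italic> whose <italic>p</italic>-value is at most <italic>qr</italic>/<italic>m</italic> is found (with <italic>m</italic> tests and FDR level <italic>q</italic>), and every test up to that rank is declared significant.</p>
<preformat><![CDATA[
```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking which tests are significant
    when the false discovery rate is controlled at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= q * rank / m:
            max_rank = rank  # keep the largest rank passing its threshold
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            significant[idx] = True
    return significant
```
]]></preformat>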
</sec>
<sec>
<title>2.3.2. Comparing model robustness</title>
<p>To understand whether networks have varying levels of robustness to increased task difficulty (i.e., increased levels of debris, clutter, and Gaussian noise), we test for differences in the increase in error between all networks as task difficulty increases.</p>
<p>To achieve this, we fit a linear model to the error rates for each network separately, with the difficulty levels as predictors (e.g., light debris = 1, moderate debris = 2, heavy debris = 3). We extract the slope parameters from the linear models for a pair of networks and test if the difference in these slope parameters significantly differs from zero, by using a permutation test.</p>
<p>To construct a null distribution for the permutation test, we randomly shuffle predictions for a single image between a pair of networks. Error rates are then calculated for these shuffled predictions. A linear model is fit to these sampled error rates, for each model separately, and the difference between the slope parameters is entered into the null distribution. This procedure is run 10,000 times to approximate the null distribution. The <italic>p</italic>-value for this test is obtained by making a two-tailed comparison between the observed value for the difference in slope parameters and the null distribution. Based on the uncorrected <italic>p</italic>-values, a threshold is chosen to control the FDR at 0.05.</p>
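<p>The permutation test described above can be sketched as follows. This is one possible reading, not the authors&#x00027; code: the function names are hypothetical, and the shuffling step is interpreted as swapping, independently for each image and with probability 0.5, the outcomes of the two networks. Outcomes are coded as 1 for an error and 0 for a correct response.</p>
<preformat><![CDATA[
```python
import random


def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def slope_of_errors(errors, levels):
    """Slope of the error rate across difficulty levels."""
    rates = [sum(level_errors) / len(level_errors) for level_errors in errors]
    return slope(levels, rates)


def slope_difference_test(errors_a, errors_b, levels, n_permutations=10000, seed=0):
    """Two-tailed permutation test on the difference in slopes.

    errors_a[d][i]: 1 if network A misclassified image i at level d.
    Returns (observed slope difference, p-value).
    """
    rng = random.Random(seed)
    observed = slope_of_errors(errors_a, levels) - slope_of_errors(errors_b, levels)
    extreme = 0
    for _ in range(n_permutations):
        shuffled_a, shuffled_b = [], []
        for level_a, level_b in zip(errors_a, errors_b):
            row_a, row_b = [], []
            for a, b in zip(level_a, level_b):
                if rng.random() < 0.5:  # assumed: swap the pair for this image
                    a, b = b, a
                row_a.append(a)
                row_b.append(b)
            shuffled_a.append(row_a)
            shuffled_b.append(row_b)
        null = slope_of_errors(shuffled_a, levels) - slope_of_errors(shuffled_b, levels)
        if abs(null) >= abs(observed):
            extreme += 1
    return observed, extreme / n_permutations
```
]]></preformat>
<p>The resulting uncorrected <italic>p</italic>-values, one per pair of networks, would then be passed to the FDR procedure.</p>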
</sec>
</sec>
</sec>
<sec sec-type="results" id="s3">
<title>3. Results</title>
<sec>
<title>3.1. Recognition of digits under debris</title>
<sec>
<title>3.1.1. Learning to recognize digits occluded by debris</title>
<p>Networks were trained and tested on recognizing digits under debris to test for a particular benefit of recurrence when recognizing objects under structured occlusion. We used three image sets containing different levels of debris: 10 fragments (light debris), 30 fragments (moderate debris), and 50 fragments (heavy debris). For every model, the error rate increased as the level of debris in the image increased (Figure <xref ref-type="fig" rid="F5">5</xref>).</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Classification error for all models on single digit detection under varying levels of debris. Examples of the images used to train and test the networks are also shown. Matrices to the right indicate significant results of pairwise McNemar tests. Comparisons are across models and within image sets. Black boxes indicate significant differences at <italic>p</italic> &#x0003C; 0.05 when controlling the expected false discovery rate at 0.05.</p></caption>
<graphic xlink:href="fpsyg-08-01551-g0005.tif"/>
</fig>
<p>Under light and moderate debris, all but one of the pairwise differences were found to be significant (FDR = 0.05), with no significant difference between BL and BLT for light (&#x003C7;<sup>2</sup>(1, <italic>N</italic> &#x0003D; 10,000) &#x0003D; 0.04, <italic>p</italic> &#x0003D; 0.835) or moderate debris (&#x003C7;<sup>2</sup>(1, <italic>N</italic> &#x0003D; 10,000) &#x0003D; 0.00, <italic>p</italic> &#x0003D; 0.960). Of the feedforward models, B-K was the best performing. The error rates for each of the models are shown in Table <xref ref-type="table" rid="T2">2</xref>.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Classification error for all of the models on single digit detection with varying levels of debris.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Image set</bold></th>
<th valign="top" align="center"><bold>B (%)</bold></th>
<th valign="top" align="center"><bold>B-F (%)</bold></th>
<th valign="top" align="center"><bold>B-K (%)</bold></th>
<th valign="top" align="center"><bold>BT (%)</bold></th>
<th valign="top" align="center"><bold>BL (%)</bold></th>
<th valign="top" align="center"><bold>BLT (%)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Light debris</td>
<td valign="top" align="center">6.24</td>
<td valign="top" align="center">4.23</td>
<td valign="top" align="center">1.73</td>
<td valign="top" align="center">1.30</td>
<td valign="top" align="center">0.77</td>
<td valign="top" align="center">0.80</td>
</tr>
<tr>
<td valign="top" align="left">Moderate debris</td>
<td valign="top" align="center">40.73</td>
<td valign="top" align="center">31.16</td>
<td valign="top" align="center">11.68</td>
<td valign="top" align="center">7.31</td>
<td valign="top" align="center">3.72</td>
<td valign="top" align="center">3.70</td>
</tr>
<tr>
<td valign="top" align="left">Heavy debris</td>
<td valign="top" align="center">75.63</td>
<td valign="top" align="center">68.49</td>
<td valign="top" align="center">29.58</td>
<td valign="top" align="center">17.01</td>
<td valign="top" align="center">11.13</td>
<td valign="top" align="center">9.32</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Under heavy debris, all pairwise differences were significant (FDR = 0.05). This includes the difference between BL and BLT, which was not significant under light and moderate debris; here BLT outperformed BL. This suggests that at lower levels of occlusion, feedforward and lateral connections are sufficient for good performance. However, top-down connections become beneficial when the task involves recognizing digits under heavier levels of debris.</p>
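<p>For readers who wish to reproduce this style of pairwise comparison, the following Python sketch computes a McNemar statistic from the two discordant counts and then applies a step-up false discovery rate procedure. It is an illustration under stated assumptions, not the analysis code: the statistic is computed without continuity correction, the FDR control is assumed to follow Benjamini and Hochberg (<xref ref-type="bibr" rid="B2">1995</xref>), and the function names and the example counts are hypothetical.</p>
<preformat><![CDATA[
```python
import math


def mcnemar(only_a_correct, only_b_correct):
    """McNemar chi-squared statistic (1 df, no continuity correction) and p-value.

    only_a_correct: images that network A classified correctly and B did not.
    only_b_correct: images that network B classified correctly and A did not.
    """
    discordant = only_a_correct + only_b_correct
    if discordant == 0:
        return 0.0, 1.0
    statistic = (only_a_correct - only_b_correct) ** 2 / discordant
    # Survival function of a chi-squared variable with 1 df:
    # P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


def benjamini_hochberg(p_values, q=0.05):
    """Mark which tests are significant when controlling the FDR at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose p-value is under its threshold.
        if p_values[index] <= q * rank / m:
            largest_rank = rank
    significant = [False] * m
    for index in order[:largest_rank]:
        significant[index] = True
    return significant


# Hypothetical discordant counts for one pair of networks.
statistic, p_value = mcnemar(30, 50)
```
]]></preformat>
<p>In this sketch, the <italic>p</italic>-values from all pairwise comparisons within an image set would be collected into one list and passed to <monospace>benjamini_hochberg</monospace> together.</p>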
</sec>
<sec>
<title>3.1.2. Learning to recognize unoccluded digits when trained with occlusion</title>
<p>To test if the networks learn a good model of the digit when trained to recognize the digit under debris, we test the performance of networks when recognizing unoccluded digits.</p>
<p>When networks were trained to recognize digits under heavy debris, and tested to recognize unoccluded digits, we found all pairwise differences to be significant (FDR &#x0003D; 0.05, Figure <xref ref-type="fig" rid="F6">6</xref>). The best performing network was B-K, followed by recurrent networks. B and B-F performed much worse than all of the other networks (Table <xref ref-type="table" rid="T3">3</xref>).</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Classification error for all models trained under heavy debris conditions and tested with or without debris. Examples of the images used to train and test the networks are also shown. Matrices to the right indicate significant results of pairwise McNemar tests. Comparisons are across models and within image sets. Black boxes indicate significant differences at <italic>p</italic> &#x0003C; 0.05 when controlling the expected false discovery rate at 0.05.</p></caption>
<graphic xlink:href="fpsyg-08-01551-g0006.tif"/>
</fig>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Classification error for all of the models on single digit detection when trained on heavy debris and tested without debris.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Image set</bold></th>
<th valign="top" align="center"><bold>B (%)</bold></th>
<th valign="top" align="center"><bold>B-F (%)</bold></th>
<th valign="top" align="center"><bold>B-K (%)</bold></th>
<th valign="top" align="center"><bold>BT (%)</bold></th>
<th valign="top" align="center"><bold>BL (%)</bold></th>
<th valign="top" align="center"><bold>BLT (%)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Tested on debris</td>
<td valign="top" align="center">75.63</td>
<td valign="top" align="center">68.49</td>
<td valign="top" align="center">29.58</td>
<td valign="top" align="center">17.01</td>
<td valign="top" align="center">11.13</td>
<td valign="top" align="center">9.32</td>
</tr>
<tr>
<td valign="top" align="left">Tested without debris</td>
<td valign="top" align="center">79.37</td>
<td valign="top" align="center">69.88</td>
<td valign="top" align="center">0.34</td>
<td valign="top" align="center">3.10</td>
<td valign="top" align="center">2.11</td>
<td valign="top" align="center">0.55</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>These results show that feedforward networks (specifically B-K) can perform very well at recognizing the digit without occlusion, when trained to recognize digits under occlusion. This suggests that they have learnt a good model of the underlying task of digit recognition. However, B-K performs worse than the recurrent models when recognizing the target under occlusion. This indicates that B-K has difficulty recognizing the digit under occlusion rather than a problem learning to perform the task of digit recognition given the occluded training images. In comparison, recurrent networks show much lower error rates when recognizing the target under occlusion.</p>
</sec>
</sec>
<sec>
<title>3.2. Recognition of multiple digits</title>
<p>To examine the ability of the networks to handle occlusion when the occluder is not a distractor, the networks were trained and tested on their ability to recognize multiple overlapping digits. In this case, when testing for significance, we used a variant of McNemar&#x00027;s test that corrects for dependence between predictions (Durkalski et al., <xref ref-type="bibr" rid="B10a">2003</xref>), which can arise when identifying multiple digits in the same image.</p>
<p>When recognizing three digits simultaneously, recurrent networks generally outperformed feedforward networks (Figure <xref ref-type="fig" rid="F7">7</xref>), with the exception of BT and B-K where no significant difference was found [&#x003C7;<sup>2</sup>(1, <italic>N</italic> &#x0003D; 30,000) &#x0003D; 3.82, <italic>p</italic> &#x0003D; 0.05]. All other differences were found to be significant (FDR = 0.05). The error rates for all models are shown in Table <xref ref-type="table" rid="T4">4</xref>. A similar pattern is found when recognizing both four and five digits simultaneously. However, in both four and five digit tasks, all pairwise differences were found to be significant, with B-K outperforming BT (Figure <xref ref-type="fig" rid="F7">7</xref>). This suggests that, whilst recurrent networks generally perform better at this task, they do not exclusively outperform feedforward models.</p>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>Classification error for all models on multiple digit detection with varying numbers of digits. Examples of the images used to train and test the networks are also shown. Matrices to the right indicate significant results of pairwise McNemar tests. Comparisons are across models and within image sets. Black boxes indicate significant differences at <italic>p</italic> &#x0003C; 0.05 when controlling the expected false discovery rate at 0.05.</p></caption>
<graphic xlink:href="fpsyg-08-01551-g0007.tif"/>
</fig>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Classification error for all of the models on multiple digit recognition with varying numbers of targets.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Image set</bold></th>
<th valign="top" align="center"><bold>B (%)</bold></th>
<th valign="top" align="center"><bold>B-F (%)</bold></th>
<th valign="top" align="center"><bold>B-K (%)</bold></th>
<th valign="top" align="center"><bold>BT (%)</bold></th>
<th valign="top" align="center"><bold>BL (%)</bold></th>
<th valign="top" align="center"><bold>BLT (%)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">3 digits</td>
<td valign="top" align="center">9.35</td>
<td valign="top" align="center">6.30</td>
<td valign="top" align="center">3.74</td>
<td valign="top" align="center">3.97</td>
<td valign="top" align="center">2.45</td>
<td valign="top" align="center">1.85</td>
</tr>
<tr>
<td valign="top" align="left">4 digits</td>
<td valign="top" align="center">15.95</td>
<td valign="top" align="center">12.37</td>
<td valign="top" align="center">9.43</td>
<td valign="top" align="center">10.88</td>
<td valign="top" align="center">6.69</td>
<td valign="top" align="center">5.94</td>
</tr>
<tr>
<td valign="top" align="left">5 digits</td>
<td valign="top" align="center">19.57</td>
<td valign="top" align="center">16.50</td>
<td valign="top" align="center">13.97</td>
<td valign="top" align="center">15.80</td>
<td valign="top" align="center">12.31</td>
<td valign="top" align="center">11.50</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>3.3. MNIST with Gaussian noise</title>
<p>To test the hypothesis that the benefit of recurrence does not extend to dealing with noise in general, we test the performance of the networks on MNIST with unstructured additive Gaussian noise.</p>
<p>The error rates for all models were found to grow as the amount of noise increased (Table <xref ref-type="table" rid="T5">5</xref>). Recurrent networks performed significantly better than the feedforward models on MNIST (FDR = 0.05). This supports the idea that recurrent networks are better not only at recognition under challenging conditions, but also at more standard object recognition tasks.</p>
<table-wrap position="float" id="T5">
<label>Table 5</label>
<caption><p>Classification error for all of the models on MNIST with varying levels of Gaussian noise.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Image set</bold></th>
<th valign="top" align="center"><bold>B (%)</bold></th>
<th valign="top" align="center"><bold>B-F (%)</bold></th>
<th valign="top" align="center"><bold>B-K (%)</bold></th>
<th valign="top" align="center"><bold>BT (%)</bold></th>
<th valign="top" align="center"><bold>BL (%)</bold></th>
<th valign="top" align="center"><bold>BLT (%)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">No Noise</td>
<td valign="top" align="center">2.99</td>
<td valign="top" align="center">2.42</td>
<td valign="top" align="center">1.43</td>
<td valign="top" align="center">0.95</td>
<td valign="top" align="center">0.95</td>
<td valign="top" align="center">0.95</td>
</tr>
<tr>
<td valign="top" align="left">SNR &#x0003D; 1</td>
<td valign="top" align="center">13.01</td>
<td valign="top" align="center">10.59</td>
<td valign="top" align="center">4.04</td>
<td valign="top" align="center">1.82</td>
<td valign="top" align="center">2.01</td>
<td valign="top" align="center">1.96</td>
</tr>
<tr>
<td valign="top" align="left">SNR &#x0003D; 0.5</td>
<td valign="top" align="center">39.04</td>
<td valign="top" align="center">35.15</td>
<td valign="top" align="center">17.44</td>
<td valign="top" align="center">8.69</td>
<td valign="top" align="center">11.51</td>
<td valign="top" align="center">8.85</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>All pairwise differences were found to be significant between feedforward models. Recurrent networks continued to outperform feedforward networks with the addition of Gaussian noise (Figure <xref ref-type="fig" rid="F8">8</xref>).</p>
<fig id="F8" position="float">
<label>Figure 8</label>
<caption><p>Classification error for all models on recognition in MNIST with and without Gaussian noise. Examples of the images used to train and test the networks are also shown. Matrices to the right indicate significant results of pairwise McNemar tests. Comparisons are across models and within image sets. Black boxes indicate significant differences at <italic>p</italic> &#x0003C; 0.05 when controlling the expected false discovery rate at 0.05.</p></caption>
<graphic xlink:href="fpsyg-08-01551-g0008.tif"/>
</fig>
<p>At the highest noise level (SNR = 0.5), BL was found to perform significantly worse than both BT [&#x003C7;<sup>2</sup>(1, <italic>N</italic> &#x0003D; 10,000) &#x0003D; 61.69, <italic>p</italic> &#x0003C; 0.01] and BLT [&#x003C7;<sup>2</sup>(1, <italic>N</italic> &#x0003D; 10,000) &#x0003D; 55.12, <italic>p</italic> &#x0003C; 0.01]. This means that top-down connections might be more useful than lateral connections for recognizing digits under high levels of additive Gaussian noise.</p>
</sec>
<sec>
<title>3.4. Robustness under challenging conditions</title>
<p>When testing for robustness to increasing levels of debris and Gaussian noise, we found that recurrent networks were always more robust than the feedforward networks. This relationship was not found in the case of clutter. Only one network, BT, was found to be significantly less robust to increases in clutter, and all other networks were found to have similar levels of robustness (Figure <xref ref-type="fig" rid="F9">9</xref>).</p>
<fig id="F9" position="float">
<label>Figure 9</label>
<caption><p>Pairwise differences in model robustness to increased task difficulty. Arrows indicate the more robust model out of the pair tested.</p></caption>
<graphic xlink:href="fpsyg-08-01551-g0009.tif"/>
</fig>
<p>Within feedforward networks, B-K was always the most robust to debris and noise, and B-F was always more robust than B. Within recurrent networks, BLT was the most robust to debris and BL was more robust to debris than BT. However, BLT and BT were more robust than BL to Gaussian noise.</p>
<p>These results suggest that, when debris or Gaussian noise is added, recurrent models take smaller hits to the error rate than feedforward networks. However, when clutter is added, recurrent networks (though still better in absolute performance) take similar hits to the error rate.</p>
<p>More specifically, in the scenarios tested here, lateral recurrence seems to have greater benefit when handling debris, whereas top-down connections improve robustness to Gaussian noise. By utilizing both lateral and top-down connections, BLT is more robust to both increasing levels of debris and increasing levels of Gaussian noise.</p>
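<p>As a worked illustration of the robustness measure, the slope of error rate against debris level can be recomputed from the values in Table <xref ref-type="table" rid="T2">2</xref>, using the coding light = 1, moderate = 2, heavy = 3 given in the Methods. The Python sketch below only reproduces the descriptive slopes (in percentage points per difficulty level); it does not perform the permutation test. The variable names are hypothetical.</p>
<preformat><![CDATA[
```python
# Error rates (%) from Table 2, ordered light, moderate, heavy debris.
table_2 = {
    "B": [6.24, 40.73, 75.63],
    "B-F": [4.23, 31.16, 68.49],
    "B-K": [1.73, 11.68, 29.58],
    "BT": [1.30, 7.31, 17.01],
    "BL": [0.77, 3.72, 11.13],
    "BLT": [0.80, 3.70, 9.32],
}


def slope(error_rates):
    """Least-squares slope of error rate against difficulty levels 1..k."""
    k = len(error_rates)
    xs = range(1, k + 1)
    x_mean = sum(xs) / k
    y_mean = sum(error_rates) / k
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, error_rates))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den


slopes = {name: slope(rates) for name, rates in table_2.items()}

# Smaller slope means a smaller increase in error per debris level.
ranking = sorted(slopes, key=slopes.get)

for name in ranking:
    print(f"{name}: {slopes[name]:.2f} percentage points per level")
```
]]></preformat>
<p>With three equally spaced levels the slope reduces to half the difference between the heavy and light error rates, so the ordering of slopes can be checked by hand against Table <xref ref-type="table" rid="T2">2</xref>.</p>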
</sec>
</sec>
<sec sec-type="discussion" id="s4">
<title>4. Discussion</title>
<p>We found support for the hypothesis that recurrence helps when recognizing objects in a range of challenging conditions, as well as aiding recognition in more standard scenarios. The benefit of recurrence for object recognition in challenging conditions appears to be particularly strong in the case of occlusion generated by a non-target and the addition of Gaussian noise, with recurrent networks appearing more robust. In the multiple digit recognition tasks, where the occlusion is generated by other targets, the best performing networks are still recurrent. However, recurrent networks are not more robust than feedforward networks to an increased number of digits.</p>
<p>Of the feedforward models, B-K is always the best performing and can outperform some recurrent models in some tasks, as in the case of multiple digit recognition. One potential explanation is that B-K incorporates some of the benefits of recurrence by having a larger receptive field. This is because recurrence increases the effective receptive field of a unit, which receives input from neighboring units. This may also explain why BT tends to be the worst performing recurrent model (and is outperformed by B-K) in some tasks. BT has no lateral connections to integrate information from neighboring units directly; instead, this information has to pass through a higher layer first. The difference in performance between BT and BL may also tell us about what tasks benefit more directly from incorporating information from outside the classical receptive field (where BL shows an advantage) as opposed to specifically utilizing information from more abstract features (where BT shows an advantage). In these experiments, BLT is the best performing network across all tasks, showing that it is able to utilize the benefits of both lateral and top-down connections.</p>
<p>We find evidence to suggest that feedforward networks have particular difficulty recognizing objects under occlusion generated by debris, and not just learning the task of recognizing digits when trained with heavily occluded objects (Section 3.1.2). This gives specific support to the hypothesis that recurrent processing helps in occluded object recognition.</p>
<p>Recurrent networks also outperformed the parameter matched controls on object recognition tasks where no occlusion was present (Section 3.3). This is consistent with previous work that has shown that recurrent networks, similar in architecture to the BL networks used here, perform strongly compared to other feedforward models with larger numbers of parameters (Liang and Hu, <xref ref-type="bibr" rid="B27">2015</xref>). Therefore, some level of recurrence may be beneficial in standard object recognition, an idea that is supported by neural evidence that shows object information unfolding over time, even without the presence of occlusion (Sugase et al., <xref ref-type="bibr" rid="B40">1999</xref>; Brincat and Connor, <xref ref-type="bibr" rid="B3">2006</xref>; Freiwald and Tsao, <xref ref-type="bibr" rid="B12">2010</xref>; Carlson et al., <xref ref-type="bibr" rid="B4">2013</xref>; Cichy et al., <xref ref-type="bibr" rid="B6">2014</xref>; Clarke et al., <xref ref-type="bibr" rid="B7">2015</xref>).</p>
<p>This work suggests that networks with recurrent connections generally show performance gains relative to feedforward models when performing a broad spectrum of object recognition tasks. However, it does not indicate which of these models best describes human object recognition. Future comparisons to neural or behavioral data will be needed to test the efficacy of these models. For example, as these models are recurrent and unfold over time, they can be used to predict human recognition dynamics for the same stimuli, such as reaction time distributions and the order in which digits are reported in the multiple digit recognition tasks.</p>
<p>Furthermore, we can study whether the activation patterns of these networks predict neural dynamics of object recognition. This is similar to previous work that has attempted to explain neural dynamics of representations using individual layers of deep feedforward networks (Cichy et al., <xref ref-type="bibr" rid="B5">2016</xref>), but by using the recurrent models we can directly relate temporal dynamics in the model to temporal dynamics in the brain. For instance, in tasks with multiple targets (such as those in Section 3.2) we can look at the target representations over recurrent iterations and layers in the model, and compare this to the spatiotemporal dynamics of multiple object representations in neural data. Testing these models against this experimental data will allow us to better understand the importance of lateral and top-down connections, in these models, for explaining neural data.</p>
<p>In addition, whilst we know that adding recurrent connections leads to performance gains in these models, we do not know the exact function of these recurrent connections. For instance, in the case of occlusion, the recurrent connections might complete some of the missing information from occluded regions of the input image, which would be consistent with experimental evidence in cases where parts of the image have been deleted (Smith and Muckli, <xref ref-type="bibr" rid="B38">2010</xref>; O&#x00027;Reilly et al., <xref ref-type="bibr" rid="B32">2013</xref>). Alternatively, as our occluders contain visual features that could be potentially misleading, recurrent connections may have more of an effect of suppressing the network&#x00027;s representation of the occluders through competitive processing (Adesnik and Scanziani, <xref ref-type="bibr" rid="B1">2010</xref>; Kolankeh et al., <xref ref-type="bibr" rid="B22">2015</xref>). Recurrent connectivity could also learn to produce border ownership cells that would help in identifying occluders in the image (Zhaoping, <xref ref-type="bibr" rid="B49">2005</xref>; Sakai and Nishimura, <xref ref-type="bibr" rid="B36">2006</xref>; Craft et al., <xref ref-type="bibr" rid="B8">2007</xref>), which would help suppress occluders in tasks where occluders are non-targets. If these networks are to be useful models of visual processing, then it is important that future work attempts to understand the underlying processes taking place.</p>
<p>It could be argued that BLT performs the best due to the larger number of parameters it can learn. However, we know that the performance of these networks is not only explained by the number of learnable parameters, as B-F has the largest number of parameters of the models tested (Table <xref ref-type="table" rid="T1">1</xref>) and performs poorly in all tasks relative to the recurrent models. Finding exactly parameter matched controls for these models that are conceptually sound is difficult. As discussed earlier (Section 2.2), altering the kernel size of the feedforward models is the best control, but this provides a relatively coarse-grained way to match the number of parameters. Altering the number of learnt features allows more fine-tuned controls for the number of parameters, but this also changes the number of units in the network, which is undesirable. We believe that the models used here represent a good compromise between exact parameter matching and the number of units in each model.</p>
<p>This research suggests that recurrent convolutional neural networks can outperform their feedforward counterparts across a diverse set of object recognition tasks and that they show greater robustness in a range of challenging scenarios, including occlusion. This builds on previous work showing a benefit of recurrent connections in non-convolutional networks where parts of target objects are deleted (O&#x00027;Reilly et al., <xref ref-type="bibr" rid="B32">2013</xref>). This work represents initial steps for using recurrent convolutional neural networks as models of visual object recognition. Scaling up these networks and training them on large sets of natural images (e.g., Russakovsky et al., <xref ref-type="bibr" rid="B35">2015</xref>) will also be important for developing models that mirror processing in the visual system more closely. Future work with these networks will allow us to capture temporal aspects of visual object recognition that are currently neglected in most models, whilst incorporating the important spatial aspects that have been established by prior work (DiCarlo et al., <xref ref-type="bibr" rid="B9">2012</xref>). Modeling these temporal properties will lead to a more complete understanding of visual object recognition in the brain.</p>
</sec>
<sec id="s5">
<title>Author contributions</title>
<p>CS, PM, and NK designed the models. CS and NK designed the stimulus set. CS carried out the experiments and analyses. CS, PM, and NK wrote the paper.</p>
<sec>
<title>Conflict of interest statement</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
</sec>
</body>
<back>
<ack><p>This research was funded by the UK Medical Research Council (Programme MC-A060-5PR20), by a European Research Council Starting Grant (ERC-2010-StG 261352), and by the Human Brain Project (EU grant 604102 Context-sensitive multisensory object recognition: a deep network model constrained by multi-level, multi-species data).</p>
</ack>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Adesnik</surname> <given-names>H.</given-names></name> <name><surname>Scanziani</surname> <given-names>M.</given-names></name></person-group> (<year>2010</year>). <article-title>Lateral competition for cortical space by layer-specific horizontal circuits</article-title>. <source>Nature</source> <volume>464</volume>, <fpage>1155</fpage>&#x02013;<lpage>1160</lpage>. <pub-id pub-id-type="doi">10.1038/nature08935</pub-id><pub-id pub-id-type="pmid">20414303</pub-id></citation></ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Benjamini</surname> <given-names>Y.</given-names></name> <name><surname>Hochberg</surname> <given-names>Y.</given-names></name></person-group> (<year>1995</year>). <article-title>Controlling the false discovery rate: a practical and powerful approach to multiple testing</article-title>. <source>J. R. Stat. Soc. B</source> <volume>57</volume>, <fpage>289</fpage>&#x02013;<lpage>300</lpage>.</citation></ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Brincat</surname> <given-names>S. L.</given-names></name> <name><surname>Connor</surname> <given-names>C. E.</given-names></name></person-group> (<year>2006</year>). <article-title>Dynamic shape synthesis in posterior inferotemporal cortex</article-title>. <source>Neuron</source> <volume>49</volume>, <fpage>17</fpage>&#x02013;<lpage>24</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuron.2005.11.026</pub-id><pub-id pub-id-type="pmid">16387636</pub-id></citation></ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Carlson</surname> <given-names>T.</given-names></name> <name><surname>Tovar</surname> <given-names>D. A.</given-names></name> <name><surname>Alink</surname> <given-names>A.</given-names></name> <name><surname>Kriegeskorte</surname> <given-names>N.</given-names></name></person-group> (<year>2013</year>). <article-title>Representational dynamics of object vision: the first 1000 ms</article-title>. <source>J. Vis.</source> <volume>13</volume>, <fpage>1</fpage>&#x02013;<lpage>19</lpage>. <pub-id pub-id-type="doi">10.1167/13.10.1</pub-id><pub-id pub-id-type="pmid">23908380</pub-id></citation></ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cichy</surname> <given-names>R. M.</given-names></name> <name><surname>Khosla</surname> <given-names>A.</given-names></name> <name><surname>Pantazis</surname> <given-names>D.</given-names></name> <name><surname>Torralba</surname> <given-names>A.</given-names></name> <name><surname>Oliva</surname> <given-names>A.</given-names></name></person-group> (<year>2016</year>). <article-title>Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence</article-title>. <source>Sci. Rep.</source> <volume>6</volume>:<fpage>27755</fpage>. <pub-id pub-id-type="doi">10.1038/srep27755</pub-id><pub-id pub-id-type="pmid">27282108</pub-id></citation></ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cichy</surname> <given-names>R. M.</given-names></name> <name><surname>Pantazis</surname> <given-names>D.</given-names></name> <name><surname>Oliva</surname> <given-names>A.</given-names></name></person-group> (<year>2014</year>). <article-title>Resolving human object recognition in space and time</article-title>. <source>Nat. Neurosci.</source> <volume>17</volume>, <fpage>455</fpage>&#x02013;<lpage>462</lpage>. <pub-id pub-id-type="doi">10.1038/nn.3635</pub-id><pub-id pub-id-type="pmid">24464044</pub-id></citation></ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Clarke</surname> <given-names>A.</given-names></name> <name><surname>Devereux</surname> <given-names>B. J.</given-names></name> <name><surname>Randall</surname> <given-names>B.</given-names></name> <name><surname>Tyler</surname> <given-names>L. K.</given-names></name></person-group> (<year>2015</year>). <article-title>Predicting the time course of individual objects with MEG</article-title>. <source>Cereb. Cortex</source> <volume>25</volume>, <fpage>3602</fpage>&#x02013;<lpage>3612</lpage>. <pub-id pub-id-type="doi">10.1093/cercor/bhu203</pub-id><pub-id pub-id-type="pmid">25209607</pub-id></citation></ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Craft</surname> <given-names>E.</given-names></name> <name><surname>Sch&#x000FC;tze</surname> <given-names>H.</given-names></name> <name><surname>Niebur</surname> <given-names>E.</given-names></name> <name><surname>Von Der Heydt</surname> <given-names>R.</given-names></name></person-group> (<year>2007</year>). <article-title>A neural model of figure&#x02013;ground organization</article-title>. <source>J. Neurophysiol.</source> <volume>97</volume>, <fpage>4310</fpage>&#x02013;<lpage>4326</lpage>. <pub-id pub-id-type="doi">10.1152/jn.00203.2007</pub-id><pub-id pub-id-type="pmid">17442769</pub-id></citation></ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>DiCarlo</surname> <given-names>J. J.</given-names></name> <name><surname>Zoccolan</surname> <given-names>D.</given-names></name> <name><surname>Rust</surname> <given-names>N. C.</given-names></name></person-group> (<year>2012</year>). <article-title>How does the brain solve visual object recognition?</article-title> <source>Neuron</source> <volume>73</volume>, <fpage>415</fpage>&#x02013;<lpage>434</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuron.2012.01.010</pub-id><pub-id pub-id-type="pmid">22325196</pub-id></citation></ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dietterich</surname> <given-names>T. G.</given-names></name></person-group> (<year>1998</year>). <article-title>Approximate statistical tests for comparing supervised classification learning algorithms</article-title>. <source>Neural Comput.</source> <volume>10</volume>, <fpage>1895</fpage>&#x02013;<lpage>1923</lpage>. <pub-id pub-id-type="doi">10.1162/089976698300017197</pub-id><pub-id pub-id-type="pmid">9744903</pub-id></citation></ref>
<ref id="B10a">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Durkalski</surname> <given-names>V. L.</given-names></name> <name><surname>Palesch</surname> <given-names>Y. Y.</given-names></name> <name><surname>Lipsitz</surname> <given-names>S. R.</given-names></name> <name><surname>Rust</surname> <given-names>P. F.</given-names></name></person-group> (<year>2003</year>). <article-title>Analysis of clustered matched-pair data</article-title>. <source>Stat. Med.</source> <volume>22</volume>, <fpage>2417</fpage>&#x02013;<lpage>2428</lpage>. <pub-id pub-id-type="doi">10.1002/sim.1438</pub-id></citation></ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Felleman</surname> <given-names>D. J.</given-names></name> <name><surname>Van Essen</surname> <given-names>D. C.</given-names></name></person-group> (<year>1991</year>). <article-title>Distributed hierarchical processing in the primate cerebral cortex</article-title>. <source>Cereb. Cortex</source> <volume>1</volume>, <fpage>1</fpage>&#x02013;<lpage>47</lpage>. <pub-id pub-id-type="doi">10.1093/cercor/1.1.1</pub-id><pub-id pub-id-type="pmid">1822724</pub-id></citation></ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Freiwald</surname> <given-names>W. A.</given-names></name> <name><surname>Tsao</surname> <given-names>D. Y.</given-names></name></person-group> (<year>2010</year>). <article-title>Functional compartmentalization and viewpoint generalization within the macaque face-processing system</article-title>. <source>Science</source> <volume>330</volume>, <fpage>845</fpage>&#x02013;<lpage>851</lpage>. <pub-id pub-id-type="doi">10.1126/science.1194908</pub-id><pub-id pub-id-type="pmid">21051642</pub-id></citation></ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Goodfellow</surname> <given-names>I.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Courville</surname> <given-names>A.</given-names></name></person-group> (<year>2016</year>). <article-title>Convolutional networks</article-title>, in <source>Deep Learning (MIT Press)</source>, Chapter <volume>9</volume>, <fpage>330</fpage>&#x02013;<lpage>372</lpage>. Available online at: <ext-link ext-link-type="uri" xlink:href="http://www.deeplearningbook.org">http://www.deeplearningbook.org</ext-link>.</citation></ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>G&#x000FC;&#x000E7;l&#x000FC;</surname> <given-names>U.</given-names></name> <name><surname>van Gerven</surname> <given-names>M. A.</given-names></name></person-group> (<year>2015</year>). <article-title>Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream</article-title>. <source>J. Neurosci.</source> <volume>35</volume>, <fpage>10005</fpage>&#x02013;<lpage>10014</lpage>. <pub-id pub-id-type="doi">10.1523/JNEUROSCI.5023-14.2015</pub-id><pub-id pub-id-type="pmid">26157000</pub-id></citation></ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hassabis</surname> <given-names>D.</given-names></name> <name><surname>Kumaran</surname> <given-names>D.</given-names></name> <name><surname>Summerfield</surname> <given-names>C.</given-names></name> <name><surname>Botvinick</surname> <given-names>M.</given-names></name></person-group> (<year>2017</year>). <article-title>Neuroscience-inspired artificial intelligence</article-title>. <source>Neuron</source> <volume>95</volume>, <fpage>245</fpage>&#x02013;<lpage>258</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuron.2017.06.011</pub-id><pub-id pub-id-type="pmid">28728020</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hubel</surname> <given-names>D. H.</given-names></name> <name><surname>Wiesel</surname> <given-names>T. N.</given-names></name></person-group> (<year>1968</year>). <article-title>Receptive fields and functional architecture of monkey striate cortex</article-title>. <source>J. Physiol.</source> <volume>195</volume>, <fpage>215</fpage>&#x02013;<lpage>243</lpage>. <pub-id pub-id-type="doi">10.1113/jphysiol.1968.sp008455</pub-id><pub-id pub-id-type="pmid">4966457</pub-id></citation></ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hung</surname> <given-names>C. P.</given-names></name> <name><surname>Kreiman</surname> <given-names>G.</given-names></name> <name><surname>Poggio</surname> <given-names>T.</given-names></name> <name><surname>DiCarlo</surname> <given-names>J. J.</given-names></name></person-group> (<year>2005</year>). <article-title>Fast readout of object identity from macaque inferior temporal cortex</article-title>. <source>Science</source> <volume>310</volume>, <fpage>863</fpage>&#x02013;<lpage>866</lpage>. <pub-id pub-id-type="doi">10.1126/science.1117593</pub-id><pub-id pub-id-type="pmid">16272124</pub-id></citation></ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Isik</surname> <given-names>L.</given-names></name> <name><surname>Meyers</surname> <given-names>E. M.</given-names></name> <name><surname>Leibo</surname> <given-names>J. Z.</given-names></name> <name><surname>Poggio</surname> <given-names>T.</given-names></name></person-group> (<year>2014</year>). <article-title>The dynamics of invariant object recognition in the human visual system</article-title>. <source>J. Neurophysiol.</source> <volume>111</volume>, <fpage>91</fpage>&#x02013;<lpage>102</lpage>. <pub-id pub-id-type="doi">10.1152/jn.00394.2013</pub-id><pub-id pub-id-type="pmid">24089402</pub-id></citation></ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Johnson</surname> <given-names>J. S.</given-names></name> <name><surname>Olshausen</surname> <given-names>B. A.</given-names></name></person-group> (<year>2005</year>). <article-title>The recognition of partially visible natural objects in the presence and absence of their occluders</article-title>. <source>Vis. Res.</source> <volume>45</volume>, <fpage>3262</fpage>&#x02013;<lpage>3276</lpage>. <pub-id pub-id-type="doi">10.1016/j.visres.2005.06.007</pub-id><pub-id pub-id-type="pmid">16043208</pub-id></citation></ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Keysers</surname> <given-names>C.</given-names></name> <name><surname>Xiao</surname> <given-names>D.-K.</given-names></name> <name><surname>F&#x000F6;ldi&#x000E1;k</surname> <given-names>P.</given-names></name> <name><surname>Perrett</surname> <given-names>D. I.</given-names></name></person-group> (<year>2001</year>). <article-title>The speed of sight</article-title>. <source>J. Cogn. Neurosci.</source> <volume>13</volume>, <fpage>90</fpage>&#x02013;<lpage>101</lpage>. <pub-id pub-id-type="doi">10.1162/089892901564199</pub-id><pub-id pub-id-type="pmid">11224911</pub-id></citation></ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Khaligh-Razavi</surname> <given-names>S.-M.</given-names></name> <name><surname>Kriegeskorte</surname> <given-names>N.</given-names></name></person-group> (<year>2014</year>). <article-title>Deep supervised, but not unsupervised, models may explain IT cortical representation</article-title>. <source>PLoS Comput. Biol.</source> <volume>10</volume>:<fpage>e1003915</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pcbi.1003915</pub-id></citation></ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kolankeh</surname> <given-names>A. K.</given-names></name> <name><surname>Teichmann</surname> <given-names>M.</given-names></name> <name><surname>Hamker</surname> <given-names>F. H.</given-names></name></person-group> (<year>2015</year>). <article-title>Competition improves robustness against loss of information</article-title>. <source>Front. Comput. Neurosci.</source> <volume>9</volume>:<fpage>35</fpage>. <pub-id pub-id-type="doi">10.3389/fncom.2015.00035</pub-id></citation></ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kriegeskorte</surname> <given-names>N.</given-names></name></person-group> (<year>2015</year>). <article-title>Deep neural networks: a new framework for modeling biological vision and brain information processing</article-title>. <source>Annu. Rev. Vis. Sci.</source> <volume>1</volume>, <fpage>417</fpage>&#x02013;<lpage>446</lpage>. <pub-id pub-id-type="doi">10.1146/annurev-vision-082114-035447</pub-id><pub-id pub-id-type="pmid">28532370</pub-id></citation></ref>
<ref id="B24">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Krizhevsky</surname> <given-names>A.</given-names></name> <name><surname>Sutskever</surname> <given-names>I.</given-names></name> <name><surname>Hinton</surname> <given-names>G. E.</given-names></name></person-group> (<year>2012</year>). <article-title>ImageNet classification with deep convolutional neural networks</article-title>, in <source>Advances in Neural Information Processing Systems 25</source>, eds <person-group person-group-type="editor"><name><surname>Pereira</surname> <given-names>F.</given-names></name> <name><surname>Burges</surname> <given-names>C. J. C.</given-names></name> <name><surname>Bottou</surname> <given-names>L.</given-names></name> <name><surname>Weinberger</surname> <given-names>K. Q.</given-names></name></person-group> (<publisher-loc>South Lake Tahoe, CA</publisher-loc>: <publisher-name>Curran Associates, Inc.</publisher-name>), <fpage>1097</fpage>&#x02013;<lpage>1105</lpage>.</citation></ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>LeCun</surname> <given-names>Y.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Hinton</surname> <given-names>G.</given-names></name></person-group> (<year>2015</year>). <article-title>Deep learning</article-title>. <source>Nature</source> <volume>521</volume>, <fpage>436</fpage>&#x02013;<lpage>444</lpage>. <pub-id pub-id-type="doi">10.1038/nature14539</pub-id><pub-id pub-id-type="pmid">26017442</pub-id></citation></ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>LeCun</surname> <given-names>Y.</given-names></name> <name><surname>Bottou</surname> <given-names>L.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Haffner</surname> <given-names>P.</given-names></name></person-group> (<year>1998</year>). <article-title>Gradient-based learning applied to document recognition</article-title>. <source>Proc. IEEE</source> <volume>86</volume>, <fpage>2278</fpage>&#x02013;<lpage>2324</lpage>.</citation></ref>
<ref id="B27">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Liang</surname> <given-names>M.</given-names></name> <name><surname>Hu</surname> <given-names>X.</given-names></name></person-group> (<year>2015</year>). <article-title>Recurrent convolutional neural network for object recognition</article-title>, in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Boston, MA</publisher-loc>), <fpage>3367</fpage>&#x02013;<lpage>3375</lpage>.</citation></ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liao</surname> <given-names>Q.</given-names></name> <name><surname>Poggio</surname> <given-names>T.</given-names></name></person-group> (<year>2016</year>). <article-title>Bridging the gaps between residual learning, recurrent neural networks and visual cortex</article-title>. <source>arXiv preprint arXiv:1604.03640</source>.</citation></ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Majaj</surname> <given-names>N. J.</given-names></name> <name><surname>Hong</surname> <given-names>H.</given-names></name> <name><surname>Solomon</surname> <given-names>E. A.</given-names></name> <name><surname>DiCarlo</surname> <given-names>J. J.</given-names></name></person-group> (<year>2015</year>). <article-title>Simple learned weighted sums of inferior temporal neuronal firing rates accurately predict human core object recognition performance</article-title>. <source>J. Neurosci.</source> <volume>35</volume>, <fpage>13402</fpage>&#x02013;<lpage>13418</lpage>. <pub-id pub-id-type="doi">10.1523/JNEUROSCI.5181-14.2015</pub-id><pub-id pub-id-type="pmid">26424887</pub-id></citation></ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Markov</surname> <given-names>N. T.</given-names></name> <name><surname>Vezoli</surname> <given-names>J.</given-names></name> <name><surname>Chameau</surname> <given-names>P.</given-names></name> <name><surname>Falchier</surname> <given-names>A.</given-names></name> <name><surname>Quilodran</surname> <given-names>R.</given-names></name> <name><surname>Huissoud</surname> <given-names>C.</given-names></name> <etal/></person-group>. (<year>2014</year>). <article-title>Anatomy of hierarchy: feedforward and feedback pathways in macaque visual cortex</article-title>. <source>J. Comp. Neurol.</source> <volume>522</volume>, <fpage>225</fpage>&#x02013;<lpage>259</lpage>. <pub-id pub-id-type="doi">10.1002/cne.23458</pub-id><pub-id pub-id-type="pmid">23983048</pub-id></citation></ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>McNemar</surname> <given-names>Q.</given-names></name></person-group> (<year>1947</year>). <article-title>Note on the sampling error of the difference between correlated proportions or percentages</article-title>. <source>Psychometrika</source> <volume>12</volume>, <fpage>153</fpage>&#x02013;<lpage>157</lpage>. <pub-id pub-id-type="doi">10.1007/BF02295996</pub-id></citation></ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>O&#x00027;Reilly</surname> <given-names>R. C.</given-names></name> <name><surname>Wyatte</surname> <given-names>D.</given-names></name> <name><surname>Herd</surname> <given-names>S.</given-names></name> <name><surname>Mingus</surname> <given-names>B.</given-names></name> <name><surname>Jilk</surname> <given-names>D. J.</given-names></name></person-group> (<year>2013</year>). <article-title>Recurrent processing during object recognition</article-title>. <source>Front. Psychol.</source> <volume>4</volume>:<fpage>124</fpage>. <pub-id pub-id-type="doi">10.3389/fpsyg.2013.00124</pub-id><pub-id pub-id-type="pmid">23554596</pub-id></citation></ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Potter</surname> <given-names>M. C.</given-names></name></person-group> (<year>1976</year>). <article-title>Short-term conceptual memory for pictures</article-title>. <source>J. Exp. Psychol. Hum. Learn. Mem.</source> <volume>2</volume>, <fpage>509</fpage>&#x02013;<lpage>522</lpage>. <pub-id pub-id-type="doi">10.1037/0278-7393.2.5.509</pub-id><pub-id pub-id-type="pmid">1003124</pub-id></citation></ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Riesenhuber</surname> <given-names>M.</given-names></name> <name><surname>Poggio</surname> <given-names>T.</given-names></name></person-group> (<year>1999</year>). <article-title>Hierarchical models of object recognition in cortex</article-title>. <source>Nat. Neurosci.</source> <volume>2</volume>, <fpage>1019</fpage>&#x02013;<lpage>1025</lpage>. <pub-id pub-id-type="doi">10.1038/14819</pub-id><pub-id pub-id-type="pmid">10526343</pub-id></citation></ref>
<ref id="B35">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Russakovsky</surname> <given-names>O.</given-names></name> <name><surname>Deng</surname> <given-names>J.</given-names></name> <name><surname>Su</surname> <given-names>H.</given-names></name> <name><surname>Krause</surname> <given-names>J.</given-names></name> <name><surname>Satheesh</surname> <given-names>S.</given-names></name> <name><surname>Ma</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2015</year>). <article-title>ImageNet large scale visual recognition challenge</article-title>. <source>Int. J. Comput. Vis.</source> <volume>115</volume>, <fpage>211</fpage>&#x02013;<lpage>252</lpage>. <pub-id pub-id-type="doi">10.1007/s11263-015-0816-y</pub-id></citation></ref>
<ref id="B36">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sakai</surname> <given-names>K.</given-names></name> <name><surname>Nishimura</surname> <given-names>H.</given-names></name></person-group> (<year>2006</year>). <article-title>Surrounding suppression and facilitation in the determination of border ownership</article-title>. <source>J. Cogn. Neurosci.</source> <volume>18</volume>, <fpage>562</fpage>&#x02013;<lpage>579</lpage>. <pub-id pub-id-type="doi">10.1162/jocn.2006.18.4.562</pub-id><pub-id pub-id-type="pmid">16768360</pub-id></citation></ref>
<ref id="B37">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Serre</surname> <given-names>T.</given-names></name> <name><surname>Oliva</surname> <given-names>A.</given-names></name> <name><surname>Poggio</surname> <given-names>T.</given-names></name></person-group> (<year>2007</year>). <article-title>A feedforward architecture accounts for rapid categorization</article-title>. <source>Proc. Natl. Acad. Sci. U.S.A.</source> <volume>104</volume>, <fpage>6424</fpage>&#x02013;<lpage>6429</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.0700622104</pub-id><pub-id pub-id-type="pmid">17404214</pub-id></citation></ref>
<ref id="B38">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Smith</surname> <given-names>F. W.</given-names></name> <name><surname>Muckli</surname> <given-names>L.</given-names></name></person-group> (<year>2010</year>). <article-title>Nonstimulated early visual areas carry information about surrounding context</article-title>. <source>Proc. Natl. Acad. Sci. U.S.A.</source> <volume>107</volume>, <fpage>20099</fpage>&#x02013;<lpage>20103</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.1000233107</pub-id><pub-id pub-id-type="pmid">21041652</pub-id></citation></ref>
<ref id="B39">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sporns</surname> <given-names>O.</given-names></name> <name><surname>Zwi</surname> <given-names>J. D.</given-names></name></person-group> (<year>2004</year>). <article-title>The small world of the cerebral cortex</article-title>. <source>Neuroinformatics</source> <volume>2</volume>, <fpage>145</fpage>&#x02013;<lpage>162</lpage>. <pub-id pub-id-type="doi">10.1385/NI:2:2:145</pub-id><pub-id pub-id-type="pmid">15319512</pub-id></citation></ref>
<ref id="B40">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sugase</surname> <given-names>Y.</given-names></name> <name><surname>Yamane</surname> <given-names>S.</given-names></name> <name><surname>Ueno</surname> <given-names>S.</given-names></name> <name><surname>Kawano</surname> <given-names>K.</given-names></name></person-group> (<year>1999</year>). <article-title>Global and fine information coded by single neurons in the temporal visual cortex</article-title>. <source>Nature</source> <volume>400</volume>, <fpage>869</fpage>&#x02013;<lpage>873</lpage>. <pub-id pub-id-type="pmid">10476965</pub-id></citation></ref>
<ref id="B41">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tang</surname> <given-names>H.</given-names></name> <name><surname>Buia</surname> <given-names>C.</given-names></name> <name><surname>Madhavan</surname> <given-names>R.</given-names></name> <name><surname>Crone</surname> <given-names>N. E.</given-names></name> <name><surname>Madsen</surname> <given-names>J. R.</given-names></name> <name><surname>Anderson</surname> <given-names>W. S.</given-names></name> <etal/></person-group>. (<year>2014</year>). <article-title>Spatiotemporal dynamics underlying object completion in human ventral visual cortex</article-title>. <source>Neuron</source> <volume>83</volume>, <fpage>736</fpage>&#x02013;<lpage>748</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuron.2014.06.017</pub-id><pub-id pub-id-type="pmid">25043420</pub-id></citation></ref>
<ref id="B42">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Thorpe</surname> <given-names>S.</given-names></name> <name><surname>Fize</surname> <given-names>D.</given-names></name> <name><surname>Marlot</surname> <given-names>C.</given-names></name></person-group> (<year>1996</year>). <article-title>Speed of processing in the human visual system</article-title>. <source>Nature</source> <volume>381</volume>, <fpage>520</fpage>&#x02013;<lpage>522</lpage>. <pub-id pub-id-type="pmid">8632824</pub-id></citation></ref>
<ref id="B43">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wallis</surname> <given-names>G.</given-names></name> <name><surname>Rolls</surname> <given-names>E. T.</given-names></name></person-group> (<year>1997</year>). <article-title>Invariant face and object recognition in the visual system</article-title>. <source>Prog. Neurobiol.</source> <volume>51</volume>, <fpage>167</fpage>&#x02013;<lpage>194</lpage>. <pub-id pub-id-type="doi">10.1016/S0301-0082(96)00054-8</pub-id><pub-id pub-id-type="pmid">9247963</pub-id></citation></ref>
<ref id="B44">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wyatte</surname> <given-names>D.</given-names></name> <name><surname>Curran</surname> <given-names>T.</given-names></name> <name><surname>O&#x00027;Reilly</surname> <given-names>R.</given-names></name></person-group> (<year>2012</year>). <article-title>The limits of feedforward vision: recurrent processing promotes robust object recognition when objects are degraded</article-title>. <source>J. Cogn. Neurosci.</source> <volume>24</volume>, <fpage>2248</fpage>&#x02013;<lpage>2261</lpage>. <pub-id pub-id-type="doi">10.1162/jocn_a_00282</pub-id><pub-id pub-id-type="pmid">22905822</pub-id></citation></ref>
<ref id="B45">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wyatte</surname> <given-names>D.</given-names></name> <name><surname>Jilk</surname> <given-names>D. J.</given-names></name> <name><surname>O&#x00027;Reilly</surname> <given-names>R. C.</given-names></name></person-group> (<year>2014</year>). <article-title>Early recurrent feedback facilitates visual object recognition under challenging conditions</article-title>. <source>Front. Psychol.</source> <volume>5</volume>:<fpage>674</fpage>. <pub-id pub-id-type="doi">10.3389/fpsyg.2014.00674</pub-id><pub-id pub-id-type="pmid">25071647</pub-id></citation></ref>
<ref id="B46">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yamins</surname> <given-names>D. L.</given-names></name> <name><surname>Hong</surname> <given-names>H.</given-names></name> <name><surname>Cadieu</surname> <given-names>C.</given-names></name> <name><surname>DiCarlo</surname> <given-names>J. J.</given-names></name></person-group> (<year>2013</year>). <article-title>Hierarchical modular optimization of convolutional networks achieves representations similar to macaque IT and human ventral stream</article-title>, in <source>Advances in Neural Information Processing Systems 26</source>, eds <person-group person-group-type="editor"><name><surname>Burges</surname> <given-names>C. J. C.</given-names></name> <name><surname>Bottou</surname> <given-names>L.</given-names></name> <name><surname>Welling</surname> <given-names>M.</given-names></name> <name><surname>Ghahramani</surname> <given-names>Z.</given-names></name> <name><surname>Weinberger</surname> <given-names>K. Q.</given-names></name></person-group> (<publisher-loc>South Lake Tahoe, CA</publisher-loc>: <publisher-name>Curran Associates, Inc.</publisher-name>), <fpage>3093</fpage>&#x02013;<lpage>3101</lpage>.</citation></ref>
<ref id="B47">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yamins</surname> <given-names>D. L.</given-names></name> <name><surname>Hong</surname> <given-names>H.</given-names></name> <name><surname>Cadieu</surname> <given-names>C. F.</given-names></name> <name><surname>Solomon</surname> <given-names>E. A.</given-names></name> <name><surname>Seibert</surname> <given-names>D.</given-names></name> <name><surname>DiCarlo</surname> <given-names>J. J.</given-names></name></person-group> (<year>2014</year>). <article-title>Performance-optimized hierarchical models predict neural responses in higher visual cortex</article-title>. <source>Proc. Natl. Acad. Sci. U.S.A.</source> <volume>111</volume>, <fpage>8619</fpage>&#x02013;<lpage>8624</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.1403112111</pub-id><pub-id pub-id-type="pmid">24812127</pub-id></citation></ref>
<ref id="B48">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zeiler</surname> <given-names>M. D.</given-names></name> <name><surname>Krishnan</surname> <given-names>D.</given-names></name> <name><surname>Taylor</surname> <given-names>G. W.</given-names></name> <name><surname>Fergus</surname> <given-names>R.</given-names></name></person-group> (<year>2010</year>). <article-title>Deconvolutional networks</article-title>, in <source>Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on</source> (<publisher-loc>San Francisco, CA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2528</fpage>&#x02013;<lpage>2535</lpage>.</citation></ref>
<ref id="B49">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhaoping</surname> <given-names>L.</given-names></name></person-group> (<year>2005</year>). <article-title>Border ownership from intracortical interactions in visual area V2</article-title>. <source>Neuron</source> <volume>47</volume>, <fpage>143</fpage>&#x02013;<lpage>153</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuron.2005.04.005</pub-id><pub-id pub-id-type="pmid">15996554</pub-id></citation></ref>
</ref-list>
</back>
</article>