<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Neuroinform.</journal-id>
<journal-title>Frontiers in Neuroinformatics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Neuroinform.</abbrev-journal-title>
<issn pub-type="epub">1662-5196</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fninf.2021.679838</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Neuroscience</subject>
<subj-group>
<subject>Technology and Code</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title><monospace>THINGSvision</monospace>: A Python Toolbox for Streamlining the Extraction of Activations From Deep Neural Networks</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Muttenthaler</surname> <given-names>Lukas</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1265668/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Hebart</surname> <given-names>Martin N.</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c002"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/38119/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Vision and Computational Cognition Group, Max Planck Institute for Human Cognitive and Brain Sciences</institution>, <addr-line>Leipzig</addr-line>, <country>Germany</country></aff>
<aff id="aff2"><sup>2</sup><institution>Machine Learning Group, Technical University of Berlin</institution>, <addr-line>Berlin</addr-line>, <country>Germany</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Ludovico Minati, Tokyo Institute of Technology, Japan</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Matthew Brett, University of Birmingham, United Kingdom; Dimitris Pinotsis, City University of London, United Kingdom; Pieter Simoens, Ghent University, Belgium</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Lukas Muttenthaler <email>muttenthaler&#x00040;cbs.mpg.de</email></corresp>
<corresp id="c002">Martin N. Hebart <email>hebart&#x00040;cbs.mpg.de</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>22</day>
<month>09</month>
<year>2021</year>
</pub-date>
<pub-date pub-type="collection">
<year>2021</year>
</pub-date>
<volume>15</volume>
<elocation-id>679838</elocation-id>
<history>
<date date-type="received">
<day>12</day>
<month>03</month>
<year>2021</year>
</date>
<date date-type="accepted">
<day>10</day>
<month>08</month>
<year>2021</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2021 Muttenthaler and Hebart.</copyright-statement>
<copyright-year>2021</copyright-year>
<copyright-holder>Muttenthaler and Hebart</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license> </permissions>
<abstract><p>Over the past decade, deep neural network (DNN) models have received a lot of attention due to their near-human object classification performance and their excellent prediction of signals recorded from biological visual systems. To better understand the function of these networks and relate them to hypotheses about brain activity and behavior, researchers need to extract the activations to images across different DNN layers. The abundance of different DNN variants, however, can often be unwieldy, and the task of extracting DNN activations from different layers may be non-trivial and error-prone for someone without a strong computational background. Thus, researchers in the fields of cognitive science and computational neuroscience would benefit from a library or package that supports a user in the extraction task. <monospace>THINGSvision</monospace> is a new Python module that aims at closing this gap by providing a simple and unified tool for extracting layer activations for a wide range of pretrained and randomly-initialized neural network architectures, even for users with little to no programming experience. We demonstrate the general utility of <monospace>THINGSvision</monospace> by relating extracted DNN activations to a number of functional MRI and behavioral datasets using representational similarity analysis, which can be performed as an integral part of the toolbox. Together, <monospace>THINGSvision</monospace> enables researchers across diverse fields to extract features in a streamlined manner for their custom image dataset, thereby improving the ease of relating DNNs, brain activity, and behavior, and improving the reproducibility of findings in these research fields.</p></abstract>
<kwd-group>
<kwd>deep neural network</kwd>
<kwd>computational neuroscience</kwd>
<kwd>Python (programming language)</kwd>
<kwd>artificial intelligence</kwd>
<kwd>feature extraction</kwd>
<kwd>computer vision</kwd>
</kwd-group>
<contract-sponsor id="cn001">Max-Planck-Gesellschaft<named-content content-type="fundref-id">10.13039/501100004189</named-content></contract-sponsor>
<counts>
<fig-count count="2"/>
<table-count count="1"/>
<equation-count count="5"/>
<ref-count count="52"/>
<page-count count="12"/>
<word-count count="8874"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>In recent years, deep neural networks (DNNs) have sparked a lot of interest in the connected fields of cognitive science, computational neuroscience, and artificial intelligence. This is mainly owing to their power as arbitrary function approximators (LeCun et al., <xref ref-type="bibr" rid="B29">2015</xref>), their near-human performance on object recognition and natural language understanding tasks (e.g., Russakovsky et al., <xref ref-type="bibr" rid="B39">2015</xref>; Wang et al., <xref ref-type="bibr" rid="B50">2018</xref>, <xref ref-type="bibr" rid="B49">2019</xref>), and, most crucially, the fact that their latent representations often show a close correspondence to brain recordings and behavioral measurements (G&#x000FC;&#x000E7;l&#x000FC; and van Gerven, <xref ref-type="bibr" rid="B12">2014</xref>; Khaligh-Razavi and Kriegeskorte, <xref ref-type="bibr" rid="B20">2014</xref>; Yamins et al., <xref ref-type="bibr" rid="B52">2014</xref>; Kriegeskorte, <xref ref-type="bibr" rid="B23">2015</xref>; Kietzmann et al., <xref ref-type="bibr" rid="B21">2018</xref>; Schrimpf et al., <xref ref-type="bibr" rid="B41">2018</xref>, <xref ref-type="bibr" rid="B42">2020b</xref>; King et al., <xref ref-type="bibr" rid="B22">2019</xref>).</p>
<p>One important limiting factor for a much broader interdisciplinary adoption of DNNs as computational models lies in the difficulty of extracting layer activations for DNNs. This difficulty is twofold. First, the number of existing models is enormous and increases by the day. Due to this diversity, an extraction strategy that is suited for one model may not apply to any other model. Second, for users without a strong programming background it can be non-trivial to extract features while being confident that no mistakes were made in the process, for example during image preprocessing, layer selection, or when matching images to their extracted activations. Beyond these difficulties, even experienced programmers would benefit from an efficient and validated toolbox that streamlines the extraction process and prevents errors. Together, this suggests that researchers in cognitive science and computational neuroscience would benefit from a readily available package for the streamlined extraction of neural network activations.</p>
<p>With <monospace>THINGSvision</monospace>, we provide a Python toolbox that enables researchers to extract features for most state-of-the-art neural network models for existing or custom image datasets with just a few lines of code. While feature extraction may not seem to be a difficult task for someone with a strong computational background, this toolbox is primarily aimed at supporting those researchers who are inexperienced with Python programming and deep neural network architectures, but interested in the analysis of their representations. However, we believe that even computer scientists will benefit from a publicly available toolbox that is well-maintained and efficiently written. Thus, we regard <monospace>THINGSvision</monospace> as a tool that can be used across research domains.</p>
<p>In the remainder of this article, we introduce and motivate the main functionalities of the library and how to use them. We start by providing an overview of the collection of neural network models for which features can be extracted. The code for <monospace>THINGSvision</monospace> is publicly available as a Python package under the MIT license at <ext-link ext-link-type="uri" xlink:href="https://github.com/ViCCo-Group/THINGSvision">https://github.com/ViCCo-Group/THINGSvision</ext-link>.</p>
<sec>
<title>1.1. Model Collection</title>
<p>All neural network models that are part of <monospace>THINGSvision</monospace> are built in <monospace>PyTorch</monospace> (Paszke et al., <xref ref-type="bibr" rid="B34">2019</xref>) or <monospace>TensorFlow</monospace> (Abadi et al., <xref ref-type="bibr" rid="B1">2015</xref>), which are the two most commonly used deep learning frameworks. We include every neural network model that is part of <monospace>PyTorch</monospace>&#x00027;s publicly available model-zoo, <ext-link ext-link-type="uri" xlink:href="https://pytorch.org/vision/0.8/models.html"><monospace>torchvision</monospace></ext-link>, and <ext-link ext-link-type="uri" xlink:href="https://www.tensorflow.org/api_docs/python/tf/keras/applications"><monospace>TensorFlow</monospace></ext-link>&#x00027;s model zoo, including many DNN models commonly used in research such as AlexNet (Krizhevsky et al., <xref ref-type="bibr" rid="B26">2012</xref>), VGG-16 and VGG-19 (Simonyan and Zisserman, <xref ref-type="bibr" rid="B43">2015</xref>), and ResNet (He et al., <xref ref-type="bibr" rid="B15">2016</xref>). Whenever a new vision architecture is added to <monospace>torchvision</monospace> or <monospace>TensorFlow</monospace>&#x00027;s model zoo, <monospace>THINGSvision</monospace> is designed to automatically make it available, as well.</p>
<p>In addition to models from the <monospace>torchvision</monospace> and <monospace>TensorFlow</monospace> library, we provide both feedforward and recurrent variants of CORnet, a recent DNN model that was inspired by the architecture of the non-human primate visual system and that leverages recurrence to more closely resemble biological processing mechanisms (Kubilius et al., <xref ref-type="bibr" rid="B28">2018</xref>, <xref ref-type="bibr" rid="B27">2019</xref>). At the time of writing, CORnet-S is the best performing computational model on the BrainScore benchmark (Schrimpf et al., <xref ref-type="bibr" rid="B41">2018</xref>, <xref ref-type="bibr" rid="B42">2020b</xref>), a composition of various neural and behavioral benchmarks aimed at assessing the degree to which a DNN is a good model of cortical visual object processing.</p>
<p>Moreover, we include both versions of CLIP (Radford et al., <xref ref-type="bibr" rid="B37">2021</xref>), a multimodal DNN model developed by OpenAI that is based on the Transformer architecture (Vaswani et al., <xref ref-type="bibr" rid="B48">2017</xref>), which has surpassed the performance of previous recurrent and convolutional neural networks on a wide range of core natural language processing and image recognition tasks. CLIP&#x00027;s training procedure makes it possible to simultaneously extract both image and text features for visual concepts and their natural language counterparts. CLIP exists as an advanced, multimodal version of ResNet50 (He et al., <xref ref-type="bibr" rid="B15">2016</xref>) and the so-called Vision-Transformer, <monospace>ViT</monospace> (Dosovitskiy et al., <xref ref-type="bibr" rid="B9">2021</xref>). We additionally provide the possibility to upload model weights pretrained on custom image datasets beyond ImageNet.</p>
<p>To facilitate the reproducibility of computational analyses across research groups and fields, it is crucial to not only make code pertaining to the proposed analysis pipeline publicly available but additionally offer a general and well-documented framework that can easily be adopted by others (Peng, <xref ref-type="bibr" rid="B35">2011</xref>; Esteban et al., <xref ref-type="bibr" rid="B10">2018</xref>; Rush, <xref ref-type="bibr" rid="B38">2018</xref>; Van Lissa et al., <xref ref-type="bibr" rid="B47">2020</xref>). This is why we aimed to follow established software engineering practices, such as the PEP 8 style guidelines, during development. We regard <monospace>THINGSvision</monospace> as a toolbox that aims at promoting both the interpretability and comparability of research at the intersection of cognitive science, computational neuroscience, and artificial intelligence. Instead of simply providing an unwieldy collection of existing computational models, we decided to focus on models whose functional composition has been demonstrated to be similar to the primate visual system (Kriegeskorte, <xref ref-type="bibr" rid="B23">2015</xref>; Kietzmann et al., <xref ref-type="bibr" rid="B21">2018</xref>) and models that are widely adopted by the research community.</p>
</sec>
</sec>
<sec id="s2">
<title>2. Method</title>
<p><monospace>THINGSvision</monospace> is a toolbox that was written in the high-level programming language Python and, therefore, requires Python version 3.7 or later to be installed on a user&#x00027;s machine. The toolbox leverages three of the most widely used packages in the context of machine learning research and numerical analysis, namely PyTorch (Paszke et al., <xref ref-type="bibr" rid="B34">2019</xref>), TensorFlow (Abadi et al., <xref ref-type="bibr" rid="B1">2015</xref>), and NumPy (Harris et al., <xref ref-type="bibr" rid="B14">2020</xref>). Since all relevant NumPy operations were made an integral part of <monospace>THINGSvision</monospace>, it is not necessary to import NumPy or any other Python package explicitly.</p>
<p>To extract features from a neural network model for a custom set of images, users are first required to select a model and additionally define whether the model&#x00027;s weights were pretrained on ImageNet (Deng et al., <xref ref-type="bibr" rid="B8">2009</xref>; Russakovsky et al., <xref ref-type="bibr" rid="B39">2015</xref>) or randomly initialized. If the comparison is aimed at investigating the correspondence between learned representations of a model and brain or behavior, we recommend using pretrained weights. If the comparison is aimed at investigating how architectural constraints alone can lead to similar representations in models and brain or behavior, then representations from randomly initialized weights carry valuable additional information irrespective of learning (Yamins et al., <xref ref-type="bibr" rid="B52">2014</xref>; G&#x000FC;&#x000E7;l&#x000FC; and van Gerven, <xref ref-type="bibr" rid="B13">2015</xref>; Schrimpf et al., <xref ref-type="bibr" rid="B40">2020a</xref>; Storrs et al., <xref ref-type="bibr" rid="B45">2020b</xref>). Second, input and output folders as well as the number of samples to be processed in parallel in so-called mini-batches are passed to a function that converts the user&#x00027;s images into a PyTorch dataset. This dataset subsequently serves as the input to a function that extracts features for the selected module (e.g., the penultimate layer). The above operations are performed with the following lines of code, which essentially encompass the basic flow of <monospace>THINGSvision</monospace>&#x00027;s extraction pipeline.</p>
<p><inline-graphic xlink:href="fninf-15-679838-i0001.tif"/></p>
<p>Note that at this point it appears crucial to stress the difference between a layer and a module. Module is a more general reference to the individual parts of a model. A module can refer to non-linearities, pooling operations, batch normalization and convolutional or fully-connected layers, whereas a layer usually refers to an entire model block, such as the composition of the latter set of modules or a single layer (e.g., fully-connected or convolutional). We will, however, use the two terms interchangeably in the remainder of this article whenever a module refers to a layer. Moreover, extracting features is used interchangeably with extracting network activations.</p>
<p><xref ref-type="fig" rid="F1">Figure 1</xref> depicts a high-level overview of how feature extraction is streamlined in <monospace>THINGSvision</monospace>. Given that a user provides the system path to an image dataset, the input to a neural network model is a three-dimensional array, <italic>I</italic> &#x02208; &#x0211D;<sup><italic>H</italic>&#x000D7;<italic>W</italic>&#x000D7;<italic>C</italic></sup>, which is the numerical representation of an image with height <italic>H</italic>, width <italic>W</italic>, and <italic>C</italic> channels. Assuming that a user wants to apply the flattening operation to the activations from the selected module, the output corresponding to each input is a one-dimensional vector, <italic>z</italic> &#x02208; &#x0211D;<sup><italic>KHW</italic></sup>, where <italic>K</italic> denotes the number of feature maps of the selected module and <italic>H</italic> and <italic>W</italic> here refer to the height and width of those feature maps.</p>
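<p>The flattening step can be illustrated with a minimal sketch that uses plain Python lists as a stand-in for real activations. The function name <monospace>flatten_activation</monospace> and the toy dimensions are illustrative assumptions and are not part of <monospace>THINGSvision</monospace>.</p>
<preformat>
```python
from itertools import chain


def flatten_activation(activation):
    """Flatten a K x H x W nested list (one image) into a list of length K*H*W."""
    rows = chain.from_iterable(activation)      # K*H rows, each of length W
    return list(chain.from_iterable(rows))      # K*H*W scalar values


# Toy activation (assumed sizes): K=2 feature maps, each of height H=3 and width W=4.
K, H, W = 2, 3, 4
activation = [[[100 * k + 10 * h + w for w in range(W)] for h in range(H)]
              for k in range(K)]

z = flatten_activation(activation)
print(len(z) == K * H * W)  # the flattened vector has K*H*W entries
```
</preformat>
<p>Stacking one such vector per image row-wise would yield a feature matrix with one row per image, which is the layout assumed in the remainder of this sketch-level illustration.</p>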
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p><monospace>THINGSvision</monospace> feature extraction pipeline for an example convolutional neural network architecture. Images and activations in early layers of the model are represented as four-dimensional arrays. The first dimension represents the batch size, i.e., the number of images in a subsample of the data. For simplicity, in this example this number is set to two. The second dimension refers to the channel-dimension, and the last two dimensions represent the height and width of an image or feature map, respectively.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fninf-15-679838-g0001.tif"/>
</fig>
<p>In the following paragraphs, we will explain both operations and the variables necessary for feature extraction in more detail. We start by introducing variables that we deem helpful for structuring the extraction workflow.</p>
<sec>
<title>2.1. Variables</title>
<p>Before leveraging <monospace>THINGSvision</monospace>&#x00027;s full functionality, a user is advised to assign values to seven variables, which, for simplicity, we name after their corresponding keyword arguments: <monospace>root</monospace>, <monospace>model_name</monospace>, <monospace>pretrained</monospace>, <monospace>batch_size</monospace>, <monospace>out_path</monospace>, <monospace>file_format</monospace>, and <monospace>device</monospace>. Note that this is not a necessity, since the values pertaining to those variables can simply be passed as input arguments to the respective functions. It does, however, improve readability and, in our opinion, contributes to a better workflow. Moreover, there is the option to additionally assign a value to the variable <monospace>module_name</monospace>, whose significance we will explain in section 2.2.2. The above variables, their data types, example assignments, and short descriptions are displayed in <xref ref-type="table" rid="T1">Table 1</xref>. We will explain the details of these variables in the remainder of this section. We want to stress that our variable assignments are arbitrary examples rather than a general recommendation. The exact values depend on the specific needs of a user. More advanced users can simply jump to section 2.2.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Overview of the variables that are relevant for <monospace>THINGSvision</monospace>&#x00027;s feature extraction pipeline and that facilitate a user&#x00027;s workflow.</p></caption>
<graphic xlink:href="fninf-15-679838-i0025.tif"/>
</table-wrap>
<sec>
<title>2.1.1. Root</title>
<p>We recommend starting with the assignment of the <monospace>root</monospace> variable. This variable should point to the system directory where a user&#x00027;s image set is stored.</p>
<p><inline-graphic xlink:href="fninf-15-679838-i0002.tif"/></p>
</sec>
<sec>
<title>2.1.2. Model Name</title>
<p>Next, a user is required to specify the name of the neural network model whose features corresponding to the images in <monospace>root</monospace> ought to be extracted. The model&#x00027;s name can be defined as one of the available neural network models in <ext-link ext-link-type="uri" xlink:href="https://pytorch.org/vision/0.8/models.html"><monospace>torchvision</monospace></ext-link> or <ext-link ext-link-type="uri" xlink:href="https://www.tensorflow.org/api_docs/python/tf/keras/applications"><monospace>TensorFlow</monospace></ext-link>. Conveniently, as soon as a new model is added to <monospace>torchvision</monospace> or <monospace>TensorFlow</monospace>, it will also be included in <monospace>THINGSvision</monospace>, since we inherit from both <monospace>torchvision</monospace> and <monospace>TensorFlow</monospace>. For simplicity, we use <monospace>alexnet</monospace> throughout the remainder of the article, as shown in <xref ref-type="table" rid="T1">Table 1</xref>.</p>
<p><inline-graphic xlink:href="fninf-15-679838-i0003.tif"/></p>
</sec>
<sec>
<title>2.1.3. Pretrained</title>
<p>As a subsequent step, a user needs to specify whether to load a pretrained model (i.e., pretrained on <monospace>ImageNet</monospace>) into memory, or whether to solely load the parameters of a model that has not yet been trained on any publicly available dataset (so-called randomly initialized networks). The latter may be relevant for architectural comparisons when one is concerned not with the knowledge of a model but with its architecture. In the current example, we assume that the user is interested in a model&#x00027;s knowledge and not its functional composition, which is why we set the variable <monospace>pretrained</monospace> to <monospace>True</monospace>. Note that <monospace>pretrained</monospace> must be assigned a <monospace>Boolean</monospace> value (see <xref ref-type="table" rid="T1">Table 1</xref>).</p>
<p><inline-graphic xlink:href="fninf-15-679838-i0004.tif"/></p>
</sec>
<sec>
<title>2.1.4. Batch Size</title>
<p>Modern neural network architectures process several images at a time in batches. To make the extraction of neural network activations more time efficient, <monospace>THINGSvision</monospace> follows this processing choice, sampling <italic>B</italic> images in parallel. The choice of <italic>B</italic> is therefore a trade-off between processing time and memory usage (GPU memory or RAM). For users who are not concerned with extraction speed, we recommend setting <italic>B</italic> to 32. In our example <italic>B</italic> is set to 64 (see <xref ref-type="table" rid="T1">Table 1</xref>).</p>
<p><inline-graphic xlink:href="fninf-15-679838-i0005.tif"/></p>
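<p>To make the notion of mini-batches concrete, the following sketch splits a list of image names into consecutive chunks of at most <italic>B</italic> items. It is a generic illustration rather than the data loader used by <monospace>THINGSvision</monospace>; the helper <monospace>iter_batches</monospace> and the file names are hypothetical.</p>
<preformat>
```python
def iter_batches(items, batch_size):
    """Yield consecutive chunks of at most `batch_size` items, preserving order."""
    if not batch_size >= 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical file names standing in for a user's image set.
images = [f"img_{i:03d}.png" for i in range(150)]
batch_sizes = [len(batch) for batch in iter_batches(images, batch_size=64)]
print(batch_sizes)  # the last batch may be smaller than batch_size
```
</preformat>
<p>A larger batch size reduces the number of iterations but requires more memory per iteration, which is the trade-off described above.</p>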
</sec>
<sec>
<title>2.1.5. Backend</title>
<p>A user can specify whether to load a neural network model built in <monospace>PyTorch</monospace> (<inline-formula><mml:math id="M20"><mml:mrow><mml:mstyle mathcolor="#ff3939"><mml:mtext>&#x02018;pt&#x02019;</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula>) or <monospace>TensorFlow</monospace> (<inline-formula><mml:math id="M21"><mml:mrow><mml:mstyle mathcolor="#ff3939"><mml:mtext>&#x02018;tf&#x02019;</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula>).</p>
<p><inline-graphic xlink:href="fninf-15-679838-i0006.tif"/></p>
</sec>
<sec>
<title>2.1.6. Device</title>
<p>A user can choose between using a CPU and a GPU if a GPU is available. The advantage of leveraging a GPU lies in its faster computation. Note that GPU usage is possible only if a machine is equipped with an NVIDIA GPU.</p>
<p><inline-graphic xlink:href="fninf-15-679838-i0007.tif"/></p>
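<p>One common way to choose the device programmatically is to ask <monospace>PyTorch</monospace> whether a CUDA-capable GPU is usable and to fall back to the CPU otherwise. The sketch below assumes that the device is expressed as the string <monospace>cuda</monospace> or <monospace>cpu</monospace>; the helper <monospace>pick_device</monospace> is hypothetical and also works when <monospace>PyTorch</monospace> is not installed.</p>
<preformat>
```python
def pick_device():
    """Return 'cuda' if PyTorch is importable and reports a usable GPU, else 'cpu'."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to run on a GPU in this sketch.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```
</preformat>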
</sec>
<sec>
<title>2.1.7. Module Name</title>
<p>The variable <monospace>module_name</monospace> refers to the part of the model from which network activations should be extracted. In case a user is familiar with the architecture of the neural network model for which features should be extracted, the variable <monospace>module_name</monospace> can be set manually (e.g., <monospace>features.10</monospace>). There is, however, the possibility to first inspect the model architecture through an additional function call, and subsequently select a module based on the output of this function. The function prompts a user to select a module, which is then assigned to <monospace>module_name</monospace> in the form of a string. In section 2.2.2, we will explain in more detail how this can be done.</p>
<p><inline-graphic xlink:href="fninf-15-679838-i0008.tif"/></p>
</sec>
<sec>
<title>2.1.8. Output Directory</title>
<p>Before saving features to disk, a user is required to specify the directory where image features should be stored. For simplicity, in <xref ref-type="table" rid="T1">Table 1</xref> we define <monospace>out_path</monospace> as a succession of previously defined variables.</p>
<p><inline-graphic xlink:href="fninf-15-679838-i0009.tif"/></p>
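<p>Composing the output directory from previously defined variables could look like the following sketch. The values of <monospace>model_name</monospace> and <monospace>module_name</monospace> follow the running example, whereas the folder layout itself is only an assumption made for illustration.</p>
<preformat>
```python
import os

# Example values; the folder layout below is an illustrative assumption.
model_name = "alexnet"
module_name = "features.10"

# Build the path from its parts so that it is valid on any operating system.
out_path = os.path.join(".", model_name, module_name, "features")
print(out_path)
```
</preformat>
<p>Encoding the model and module in the path keeps features extracted from different models or modules in separate folders.</p>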
</sec>
<sec>
<title>2.1.9. File Format</title>
<p>A user can specify the <monospace>file_format</monospace> in which the image features are stored. This variable can be set to <monospace>hdf5</monospace>, <monospace>txt</monospace>, <monospace>mat</monospace>, or <monospace>npy</monospace>. If subsequent analyses are performed in Python, we recommend setting <monospace>file_format</monospace> to <monospace>npy</monospace>, as storing large matrices in <monospace>npy</monospace> format is both more memory and time efficient than doing the same in <monospace>txt</monospace> format. This is because the <monospace>npy</monospace> format was specifically designed for storing <monospace>NumPy</monospace> arrays (Harris et al., <xref ref-type="bibr" rid="B14">2020</xref>).</p>
<p><inline-graphic xlink:href="fninf-15-679838-i0010.tif"/></p>
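<p>The size difference between binary and text storage can be illustrated with the standard library alone. The sketch below is only a stand-in: it uses Python&#x00027;s <monospace>array</monospace> module with raw 32-bit floats instead of the actual <monospace>npy</monospace> format, and the values are arbitrary.</p>
<preformat>
```python
from array import array

# 1,000 arbitrary values stored as single-precision floats (typecode "f").
values = array("f", [i / 7 for i in range(1000)])

# Binary storage needs a fixed number of bytes per value.
binary_size = len(values.tobytes())

# Text storage writes every value as a decimal string, one per line.
text_size = len("\n".join(repr(v) for v in values).encode("utf-8"))

print(binary_size, text_size)
```
</preformat>
<p>Text files additionally have to be parsed back into numbers when they are loaded, which is one reason why binary formats are usually faster to read.</p>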
</sec>
</sec>
<sec>
<title>2.2. Model and Modules</title>
<sec>
<title>2.2.1. Loading Models</title>
<p>With the previously defined variables in place, a user can now start loading a model into a computer&#x00027;s memory. Since <monospace>model_name</monospace> is set to <monospace>alexnet</monospace> and <monospace>pretrained</monospace> to <monospace>True</monospace>, we load an AlexNet model pretrained on <monospace>ImageNet</monospace> and the corresponding image transformations (which are used in section 2.3) into memory with the following line,</p>
<p><inline-graphic xlink:href="fninf-15-679838-i0011.tif"/></p>
</sec>
<sec>
<title>2.2.2. Selecting Modules</title>
<p>Before extracting DNN features for an image dataset, a user is required to select the part of the model for which features should be extracted. In case a user is familiar with the architecture of a specific neural network model, they can simply assign a value to the variable <monospace>module_name</monospace> (see section 2.1.7). If a user is, however, unfamiliar with the specific architecture of a neural network model, we recommend visualizing the composition of the model&#x00027;s modules through the following function call,</p>
<p><inline-graphic xlink:href="fninf-15-679838-i0012.tif"/></p>
<p>The output of this call, in the case of <monospace>alexnet</monospace>, looks as follows,</p>
<p><inline-graphic xlink:href="fninf-15-679838-i0013.tif"/></p>
<p>For users unfamiliar with details of neural network architectures, this output may look confusing, given that it is well-known that AlexNet consists only of 8 layers. Note, however, that the above terminal output displays the individual modules of AlexNet as well as their specific attributes, such as how many features their inputs and outputs have, or whether a layer is followed by a rectifier non-linearity or pooling operation. Note further that the modules are enumerated in the order in which they appear within the model&#x00027;s composition. This is crucial for the module selection step. During this step, <monospace>THINGSvision</monospace> prompts a user to &#x0201C;enter the part of the model for which a user would like to extract image features.&#x0201D; The user&#x00027;s input is automatically assigned to the variable <monospace>module_name</monospace> in the form of a string. In order to extract features from layers that correspond to early areas of the primate visual system, we recommend selecting convolutional or pooling modules, and linear layers for later areas that encode high-level features.</p>
<p>It is important to stress that each model in <monospace>PyTorch</monospace> or <monospace>TensorFlow</monospace> is represented by a tree structure, where the name of the model refers to the root of the tree (e.g., AlexNet). To access a module, a user is required to compose the string variable <monospace>module_name</monospace> from both the name of one of the nodes that directly follow the tree&#x00027;s root (e.g., <monospace>features</monospace>, <monospace>avgpool</monospace>, <monospace>classifier</monospace>) and the number of the module to be selected, separated by a period (e.g., <monospace>features.5</monospace>). This approach to module selection applies to all models that are part of <monospace>THINGSvision</monospace>, although how to compose the string variable <monospace>module_name</monospace> differs between <monospace>PyTorch</monospace> and <monospace>TensorFlow</monospace>. Throughout this article, we use <monospace>PyTorch</monospace> module naming.</p>
<p>In this example, we select the module numbered 10 within AlexNet&#x00027;s <monospace>features</monospace> node (i.e., <monospace>features.10</monospace>), which corresponds to the fifth convolutional layer in AlexNet (see above). Hence, features will be extracted exclusively for this module.</p>
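<p>The way a period-separated name addresses a module in the model tree can be sketched with nested dictionaries. The toy tree below only lists module types and is an assumption modeled on the module order of the <monospace>torchvision</monospace> AlexNet; the helpers <monospace>get_module</monospace> and <monospace>numbered</monospace> are hypothetical and not part of <monospace>THINGSvision</monospace>.</p>
<preformat>
```python
def get_module(tree, module_name):
    """Walk a nested dict along a period-separated name such as 'features.10'."""
    node = tree
    for part in module_name.split("."):
        node = node[part]
    return node


def numbered(kinds):
    """Enumerate module types from 0, using string keys as in dotted names."""
    return {str(index): kind for index, kind in enumerate(kinds)}


# Toy stand-in for a model tree (module types only, no weights).
toy_alexnet = {
    "features": numbered([
        "Conv2d", "ReLU", "MaxPool2d",
        "Conv2d", "ReLU", "MaxPool2d",
        "Conv2d", "ReLU",
        "Conv2d", "ReLU",
        "Conv2d", "ReLU", "MaxPool2d",
    ]),
    "avgpool": "AdaptiveAvgPool2d",
    "classifier": numbered([
        "Dropout", "Linear", "ReLU", "Dropout", "Linear", "ReLU", "Linear",
    ]),
}

selected = get_module(toy_alexnet, "features.10")
# Count the convolutional modules among features.0 ... features.10.
convs_so_far = sum(
    toy_alexnet["features"][str(i)] == "Conv2d" for i in range(11)
)
print(selected, convs_so_far)
```
</preformat>
<p>Because numbering starts at zero and non-linearities and pooling operations are counted as modules too, the module number is larger than the number of the convolutional layer it refers to.</p>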
</sec>
</sec>
<sec>
<title>2.3. Dataset and Data Loader</title>
<p>Through a dedicated dataset class, <monospace>THINGSvision</monospace> can deal with various types of image data (<monospace>.eps</monospace>, <monospace>.jpg</monospace>, <monospace>.jpeg</monospace>, <monospace>.png</monospace>, <monospace>.PNG</monospace>, <monospace>.tif</monospace>, <monospace>.tiff</monospace>) and is able to transform the images into a ready-to-use <monospace>PyTorch</monospace> or <monospace>TensorFlow</monospace> dataset. System paths to images can follow the folder structure <monospace>./root/class/img_xy.png</monospace> or <monospace>./root/img_xy.png</monospace>, where the former directory contains sub-folders for the respective image classes. A dataset is subsequently wrapped with a <monospace>PyTorch</monospace> or <monospace>TensorFlow</monospace> iterator to enable batch-wise feature extraction. The above is done with,</p>
<p><inline-graphic xlink:href="fninf-15-679838-i0014.tif"/></p>
<p><monospace>THINGSvision</monospace> automatically sorts image files alphabetically (i.e., A-Z or 0-9). Sorting, however, depends on a machine&#x00027;s operating system. An alphabetic sort differs across Windows, macOS, and Ubuntu, which is why we provide the possibility to sort the data according to a list of file names, manually defined by the user. The features will, subsequently, be extracted in the order of the provided file names.</p>
<p>This list must follow the <monospace>List[str]</monospace> data structure (i.e., containing <monospace>strings</monospace>), such as <monospace>[aardvark/aardvark_01.jpg, aardvark/aardvark_02.jpg, ...]</monospace> or <monospace>[aardvark.jpg, anchor.jpg, ...]</monospace>, depending on whether the dataset tree consists of subfolders for classes (see above). The list of file names can be passed as an optional argument as follows,</p>
<p><inline-graphic xlink:href="fninf-15-679838-i0015.tif"/></p>
<p>We use the variable <monospace>dl</monospace> here since it is a commonly used abbreviation for &#x0201C;data loader.&#x0201D; It is, moreover, necessary to pass <monospace>out_path</monospace> to the above function so that a <monospace>txt</monospace> file listing the image names in the order in which features are extracted is saved to <monospace>out_path</monospace>. This ensures that a user can easily match the rows of a feature matrix to the image names, as shown in <xref ref-type="fig" rid="F1">Figure 1</xref>.</p>
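<p>The ordering logic described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not the toolbox&#x00027;s implementation: the helpers <monospace>order_files</monospace> and <monospace>save_file_names</monospace> as well as the output file name <monospace>file_names.txt</monospace> are hypothetical.</p>

```python
import os
import tempfile


def order_files(found, file_names=None):
    """Return image paths in extraction order (hypothetical helper).

    Without `file_names`, fall back to a plain sort; with it, follow the
    user-defined list so that the order no longer depends on the platform.
    """
    if file_names is None:
        return sorted(found)
    found_set = set(found)
    missing = [name for name in file_names if name not in found_set]
    if missing:
        raise FileNotFoundError(f"not in dataset: {missing}")
    return list(file_names)


def save_file_names(out_path, ordered, fname="file_names.txt"):
    """Write one image name per line, i.e., one line per feature-matrix row."""
    os.makedirs(out_path, exist_ok=True)
    target = os.path.join(out_path, fname)
    with open(target, "w") as f:
        f.write("\n".join(ordered))
    return target


found = ["dog/dog_01.jpg", "aardvark/aardvark_02.jpg", "aardvark/aardvark_01.jpg"]
user_order = ["dog/dog_01.jpg", "aardvark/aardvark_01.jpg", "aardvark/aardvark_02.jpg"]
ordered = order_files(found, user_order)

with tempfile.TemporaryDirectory() as out_path:
    with open(save_file_names(out_path, ordered)) as f:
        saved = f.read().splitlines()
```

<p>Line <italic>i</italic> of the saved file then names the image whose features occupy row <italic>i</italic> of the feature matrix.</p>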
</sec>
<sec>
<title>2.4. Features</title>
<p>The following section is meant for readers who are curious about what is going on under the hood of <monospace>THINGSvision</monospace>&#x00027;s feature extraction pipeline and who want to get a better grasp of the dimensions depicted in <xref ref-type="fig" rid="F1">Figure 1</xref>. Readers who are familiar with matrices and tensors may want to skip this section and jump directly to Section 2.4.2, since the following paragraphs are not crucial for using the toolbox. We use mathematical notation to denote images (inputs) and features (outputs).</p>
<sec>
<title>2.4.1. Extracting Features</title>
<p>When all variables necessary for feature extraction are set, the user can extract image features for a specific (here, the fifth convolutional) layer in AlexNet (i.e., <monospace>features.10</monospace>). <xref ref-type="fig" rid="F1">Figure 1</xref> shows <monospace>THINGSvision</monospace>&#x00027;s feature extraction pipeline for two example images. The algorithm first searches for the images in the <monospace>root</monospace> folder, subsequently converts them into a ready-to-use dataset, and then passes sub-samples of the data in the form of mini-batches as inputs to the network. For simplicity and to demonstrate the extraction procedure, <xref ref-type="fig" rid="F1">Figure 1</xref> displays an example of a simplified convolutional neural network architecture. Recall that an image is numerically represented as a three-dimensional array, usually in the following format.</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>I</mml:mi><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>H</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>W</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>C</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>H</italic> = height, <italic>W</italic> = width, <italic>C</italic> = channels. <italic>C</italic> = 1 or 3, depending on whether images are represented in grayscale or RGB format. In <monospace>PyTorch</monospace>, however, image batches, denoted as <italic>X</italic>, are represented as four-dimensional tensors,</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M2"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>X</mml:mi><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>B</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>C</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>H</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>B</italic> = <monospace>batch_size</monospace>, and all other dimensions are permuted. Note that this is not the case for <monospace>TensorFlow</monospace>, where image dimensions are not permuted. In the example in <xref ref-type="fig" rid="F1">Figure 1</xref>, <italic>B</italic> &#x0003D; 2, since two images are concurrently processed. The channel dimension now represents the tensor&#x00027;s second dimension (inside the toolbox, it is the first dimension, since Python starts indexing at 0) to more easily apply convolutions to input images. Hence, features at the level of the selected module, denoted as <italic>Z</italic>, are represented as four-dimensional tensors in the format,</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M3"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>Z</mml:mi><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>B</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>K</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>H</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where the channel parameter <italic>C</italic> is replaced with <italic>K</italic> referring to the number of feature maps within a representation. Here, <italic>K</italic> &#x0003D; 256, and <italic>H</italic> and <italic>W</italic> are significantly smaller than at the input level. For most analyses in computational neuroscience, researchers are required to flatten this four-dimensional tensor into a two-dimensional matrix of the format,</p>
<disp-formula id="E4"><label>(4)</label><mml:math id="M4"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>B</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>K</mml:mi><mml:mi>H</mml:mi><mml:mi>W</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>i.e., one vector per image representation in a batch, which is what we demonstrate in the following example. We provide a keyword argument, called <monospace>flatten_acts</monospace>, that communicates to the function to automatically perform the previous step during feature extraction (see the <italic>flatten</italic> operation in <xref ref-type="fig" rid="F1">Figure 1</xref>). A user must simply set the argument to <monospace>True</monospace> as follows,</p>
<p><inline-graphic xlink:href="fninf-15-679838-i0016.tif"/></p>
<p>The final, two-dimensional, feature matrix is of the form,</p>
<disp-formula id="E5"><label>(5)</label><mml:math id="M5"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>K</mml:mi><mml:mi>H</mml:mi><mml:mi>W</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>N</italic> corresponds to the number of images in the dataset. In addition to the feature matrix, <monospace>extract_features</monospace> returns a target vector of size <italic>N</italic> &#x000D7; 1 corresponding to the image classes. A user can decide whether to save or ignore this target vector, depending on the subsequent analyses. Note that flattening a tensor is not necessary for feature extraction to work. If a user wants the original four-dimensional tensor, <monospace>flatten_acts</monospace> must be set to <monospace>False</monospace>. A flattened representation may be desirable when the neural network representations are to be compared against representations extracted from brain or behavior. Such comparisons typically rely on multiple linear regression or correlation coefficients, which cannot operate on multidimensional arrays directly. However, if the goal is to compare activations between different model architectures or leverage interpretability techniques to inspect feature maps, then the tensor should be left in its original four-dimensional shape.</p>
<p>To offer a user more flexibility and control over the feature extraction procedure, we do not provide a default value for this keyword argument. Since a user may want to store a four-dimensional tensor in <monospace>txt</monospace> format to disk, <monospace>THINGSvision</monospace> comes (1) with a function that slices a four-dimensional tensor into multiple two-dimensional matrices, and (2) a corresponding function that merges the slices back into their original shape at the time of loading the features back into memory.</p>
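<p>The shapes in Equations (1)&#x02013;(5) can be traced with a short <monospace>NumPy</monospace> sketch. It uses random arrays in place of real images and activations; the sizes and the helper <monospace>toy_activations</monospace> are assumptions made for illustration only, and no network is involved.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy RGB images in H x W x C layout (Equation 1); sizes are illustrative.
images = [rng.random((8, 8, 3)) for _ in range(2)]

# Stack into a batch and permute to B x C x H x W (Equation 2).
X = np.stack(images).transpose(0, 3, 1, 2)


def toy_activations(batch, K=4, H=3, W=3):
    """Random stand-in for a module's output of shape B x K x H x W (Equation 3)."""
    return rng.random((batch.shape[0], K, H, W))


# Pretend the dataset has N = 4 images that arrive in two mini-batches.
batches = [X, X]
Z_batches = [toy_activations(b) for b in batches]

# Flatten every batch to B x KHW (Equation 4), then stack to N x KHW (Equation 5).
Z_flat = np.concatenate([Z.reshape(Z.shape[0], -1) for Z in Z_batches], axis=0)

print(X.shape, Z_batches[0].shape, Z_flat.shape)
```

<p>Each row of <monospace>Z_flat</monospace> holds the flattened representation of one image, which is the format expected by regression or correlation analyses.</p>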
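<p>One possible way to slice a four-dimensional tensor into two-dimensional matrices for <monospace>txt</monospace> storage, and to merge the slices back afterwards, is sketched below. The text does not specify the slicing scheme used by the toolbox, so the reshape along the last dimension, the helpers <monospace>save_4d_as_txt</monospace> and <monospace>load_4d_from_txt</monospace>, and the file names are assumptions.</p>

```python
import os
import tempfile

import numpy as np


def save_4d_as_txt(tensor, out_dir):
    """Store the shape and a 2D view of the tensor as two txt files (hypothetical)."""
    os.makedirs(out_dir, exist_ok=True)
    np.savetxt(os.path.join(out_dir, "shape.txt"), np.array(tensor.shape), fmt="%d")
    # Collapse the first three dimensions so that every row has W entries.
    np.savetxt(os.path.join(out_dir, "slices.txt"), tensor.reshape(-1, tensor.shape[-1]))


def load_4d_from_txt(out_dir):
    """Read the 2D slices and restore the original four-dimensional shape."""
    shape = np.loadtxt(os.path.join(out_dir, "shape.txt"), dtype=int)
    slices = np.loadtxt(os.path.join(out_dir, "slices.txt"))
    return slices.reshape(tuple(shape))


Z = np.random.default_rng(0).random((2, 4, 3, 3))
with tempfile.TemporaryDirectory() as tmp:
    save_4d_as_txt(Z, tmp)
    Z_restored = load_4d_from_txt(tmp)
```

<p>Storing the shape alongside the slices is what makes the merge step unambiguous.</p>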
</sec>
<sec>
<title>2.4.2. Saving Features</title>
<p>To save network activations (no matter from which part of the model) in a flattened format, the following function can be called,</p>
<p><inline-graphic xlink:href="fninf-15-679838-i0017.tif"/></p>
<p>When features are extracted from any of the convolutional layers of the model, the output is a four-dimensional tensor. Since it is not trivial to save four-dimensional tensors in <monospace>txt</monospace> format to be readily used for subsequent analyses of a model&#x00027;s feature maps, a user is required to set the file format argument to <monospace>hdf5</monospace>, <monospace>npy</monospace>, or <monospace>mat</monospace>, all of which enable the saving of four-dimensional tensors in their original shape.</p>
<p>When storing network activations from convolutional layers in their flattened format, it is possible to run into <monospace>MemoryErrors</monospace>. We account for this potential caveat by splitting two-dimensional matrices into <italic>k</italic> equally large splits whenever that happens. The default value of <italic>k</italic> is set to 10. If 10 splits are not sufficient to counteract the memory issues, a user can change this value to a larger number. We recommend trying multiples of 10, such as</p>
<p><inline-graphic xlink:href="fninf-15-679838-i0018.tif"/></p>
<p>To merge the array splits back into a single, two-dimensional, feature matrix, a user can call,</p>
<p><inline-graphic xlink:href="fninf-15-679838-i0019.tif"/></p>
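<p>The idea of splitting a feature matrix into <italic>k</italic> parts and merging them again can be sketched with <monospace>NumPy</monospace>. The helpers <monospace>save_splits</monospace> and <monospace>merge_splits</monospace> and the file naming scheme are hypothetical and only illustrate the principle; note that <monospace>np.array_split</monospace> yields parts of nearly equal size when the number of rows is not divisible by <italic>k</italic>.</p>

```python
import os
import tempfile

import numpy as np


def save_splits(features, out_dir, n_splits=10):
    """Write the rows of a 2D matrix to n_splits separate .npy files (hypothetical)."""
    os.makedirs(out_dir, exist_ok=True)
    for i, part in enumerate(np.array_split(features, n_splits, axis=0)):
        # Zero-padded indices keep the files in row order when sorted by name.
        np.save(os.path.join(out_dir, f"features_{i:04d}.npy"), part)


def merge_splits(out_dir):
    """Load all splits in name order and stack them into one matrix."""
    files = sorted(
        f for f in os.listdir(out_dir)
        if f.startswith("features_") and f.endswith(".npy")
    )
    return np.concatenate([np.load(os.path.join(out_dir, f)) for f in files], axis=0)


features = np.random.default_rng(0).random((25, 6))
with tempfile.TemporaryDirectory() as tmp:
    save_splits(features, tmp, n_splits=10)
    n_files = len(os.listdir(tmp))
    merged = merge_splits(tmp)
```

<p>Because the splits are taken along the image dimension, the row order of the merged matrix matches that of the original.</p>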
</sec>
</sec>
<sec>
<title>2.5. Representational Similarity Analysis</title>
<p>Representational Similarity Analysis (RSA), a technique that originated in cognitive computational neuroscience, can be used to relate object representations from different measurement modalities (e.g., fMRI or behavior) and different computational models with each other (Kriegeskorte et al., <xref ref-type="bibr" rid="B24">2008a</xref>,<xref ref-type="bibr" rid="B25">b</xref>). RSA is based on representational dissimilarity matrices (RDMs), which capture the representational geometry present in a given system (e.g., in the brain or a DNN), thereby abstracting away from the underlying multivariate pattern. Rather than directly comparing measurements, RSA compares representational similarities between two systems by way of their RDMs. RDMs are symmetric, square matrices, where the rows and columns are indexed by the different conditions or objects. Hence, RSA is a convenient analysis tool to compare visual object representations obtained from different DNNs.</p>
<p>The dissimilarity between each object pair (e.g., two images) is computed within the row space of an RDM. Dissimilarity is quantified as the distance between two objects in the measured representational space, defined by the chosen distance metric. The user can choose between the Euclidean distance (<monospace>euclidean</monospace>), the correlation distance (<monospace>correlation</monospace>), the cosine distance (<monospace>cosine</monospace>) and a radial basis function applied to pairwise distances (<monospace>gaussian</monospace>). Equivalent object representations show a dissimilarity score close to 0. For the <monospace>correlation</monospace> and <monospace>cosine</monospace> distances, the dissimilarity score is bounded above by 2, whereas there is no theoretical upper limit for the <monospace>euclidean</monospace> distance.</p>
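<p>As a rough illustration of these distance metrics, the following <monospace>NumPy</monospace> sketch computes an RDM from a feature matrix whose rows are objects. The function <monospace>toy_rdm</monospace> is a hypothetical stand-in and not the toolbox&#x00027;s own routine; the <monospace>gaussian</monospace> option is omitted because its exact parameterization is not given in the text.</p>

```python
import numpy as np


def toy_rdm(F, method="correlation"):
    """Pairwise dissimilarities between the rows (objects) of F (hypothetical helper)."""
    F = np.asarray(F, dtype=float)
    if method == "correlation":
        # 1 - Pearson r between rows; values lie in [0, 2].
        return 1.0 - np.corrcoef(F)
    if method == "cosine":
        unit = F / np.linalg.norm(F, axis=1, keepdims=True)
        return 1.0 - unit @ unit.T
    if method == "euclidean":
        diff = F[:, None, :] - F[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1))
    raise ValueError(f"unknown method: {method}")


F = np.array([
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],  # perfectly anti-correlated with the first row
    [1.0, 2.0, 4.0],
])
rdm_corr = toy_rdm(F, "correlation")
rdm_cos = toy_rdm(F, "cosine")
rdm_euc = toy_rdm(F, "euclidean")
```

<p>In this toy example, the first two rows are perfectly anti-correlated, so their correlation distance reaches the upper bound of 2, while every diagonal entry is 0 up to floating-point error.</p>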
<p>Since RDMs are symmetric around their main diagonal, it is simple to compare them by correlating their lower or upper triangles. We include both the possibility to compute and visualize an RDM and to correlate the upper triangles of two distinct RDMs. Computing an RDM based on a Pearson correlation distance matrix is as simple as calling</p>
<p><inline-graphic xlink:href="fninf-15-679838-i0020.tif"/></p>
<p>Note that similarities are computed between conditions or objects, not features. To compute the representational similarity between two distinct RDMs, a user can make the following call,</p>
<p><inline-graphic xlink:href="fninf-15-679838-i0021.tif"/></p>
<p>The default <monospace>correlation</monospace> value is the Pearson correlation coefficient, but this can be changed to <monospace>spearman</monospace> if a user assumes that the similarities are not on a ratio scale and therefore require the computation of a Spearman rank correlation (Nili et al., <xref ref-type="bibr" rid="B33">2014</xref>; Arbuckle et al., <xref ref-type="bibr" rid="B2">2019</xref>). To visualize an RDM and automatically save the output image (in <monospace>.png</monospace> or <monospace>.jpg</monospace> format) to disk, one may call</p>
<p><inline-graphic xlink:href="fninf-15-679838-i0022.tif"/></p>
<p>The default value of <monospace>format</monospace> is set to <monospace>.png</monospace> but can easily be changed to <monospace>.jpg</monospace>. Note that <monospace>.jpg</monospace> uses lossy image compression, whereas <monospace>.png</monospace> is lossless; hence, with <monospace>.png</monospace> no information is lost during compression. Therefore, the <monospace>format</monospace> argument influences both the size and the final resolution of the RDM image representation. The <monospace>dpi</monospace> value is set to 200 to guarantee a high image resolution, even if <monospace>.jpg</monospace> is selected.</p>
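<p>The comparison of two RDMs via their upper triangles, as described earlier in this section, can be sketched as follows. The functions <monospace>upper_triangle</monospace>, <monospace>ordinal_ranks</monospace>, and <monospace>toy_correlate_rdms</monospace> are hypothetical and serve only as an illustration; in particular, the rank transform below breaks ties by position rather than averaging them, which is a simplification of the Spearman correlation.</p>

```python
import numpy as np


def upper_triangle(rdm):
    """Entries above the main diagonal, which is excluded because it is all zeros."""
    rows, cols = np.triu_indices(rdm.shape[0], k=1)
    return rdm[rows, cols]


def ordinal_ranks(values):
    """Ordinal ranks; ties are broken by position, not averaged (simplification)."""
    return np.argsort(np.argsort(values)).astype(float)


def toy_correlate_rdms(rdm_a, rdm_b, correlation="pearson"):
    """Correlate the upper triangles of two RDMs (hypothetical helper)."""
    a, b = upper_triangle(rdm_a), upper_triangle(rdm_b)
    if correlation == "spearman":
        a, b = ordinal_ranks(a), ordinal_ranks(b)
    elif correlation != "pearson":
        raise ValueError(f"unknown correlation: {correlation}")
    return float(np.corrcoef(a, b)[0, 1])


# Toy RDM: Euclidean distances between five random points.
points = np.random.default_rng(0).random((5, 3))
rdm_a = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1))
rdm_b = rdm_a ** 3  # monotone but non-linear transform of rdm_a

r_pearson = toy_correlate_rdms(rdm_a, rdm_b, "pearson")
r_spearman = toy_correlate_rdms(rdm_a, rdm_b, "spearman")
```

<p>Since <monospace>rdm_b</monospace> is a monotone transform of <monospace>rdm_a</monospace>, the rank-based coefficient is 1, whereas the Pearson coefficient is sensitive to the non-linearity.</p>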
</sec>
</sec>
<sec id="s3">
<title>3. Results and Applications</title>
<p>To demonstrate the usefulness of <monospace>THINGSvision</monospace>, in the following, we present analyses of the image representations of different deep neural network architectures and compare them against representations obtained from behavioral experiments (section 3.1.1) and functional MRI responses to higher visual cortex (section 3.1.2). To qualitatively inspect the DNN representations, we compute and visualize representational dissimilarity matrices (RDMs) within the framework of representational similarity analysis (RSA), as introduced in section 2.5. Moreover, we calculate the Pearson correlation coefficients between human and DNN representations to quantify their similarities, and show how this can easily be done with <monospace>THINGSvision</monospace>. We measure the correspondence between layer activations and human brain or behavioral representations as the Pearson&#x00027;s correlation coefficient, in line with the recent finding that the linearity assumption holds for functional MRI data which validates the use of an interval rather than an ordinal scale (Arbuckle et al., <xref ref-type="bibr" rid="B2">2019</xref>).</p>
<p>In addition to results for pretrained models, we compare randomly initialized models against human brain and behavioral representations. This reveals the degree to which the architecture by itself, without any prior knowledge (e.g., through training), may perform above chance and which model achieves the highest correspondence to behavioral or brain representations under these circumstances. Indeed, a comparison to randomly-initialized networks is increasingly used as a baseline for comparisons (e.g., Yamins et al., <xref ref-type="bibr" rid="B52">2014</xref>; G&#x000FC;&#x000E7;l&#x000FC; and van Gerven, <xref ref-type="bibr" rid="B13">2015</xref>; Cichy et al., <xref ref-type="bibr" rid="B5">2016</xref>; Schrimpf et al., <xref ref-type="bibr" rid="B40">2020a</xref>; Storrs et al., <xref ref-type="bibr" rid="B45">2020b</xref>).</p>
<p>Note that this section should not be regarded as an investigation in its own right. It is supposed to demonstrate the usefulness and versatility of the toolbox. This is the main reason why we do not make any claims about hypotheses and how to test them. RSA is just one out of many potential applications, a subset of which is mentioned in section 4.</p>
<sec>
<title>3.1. The Penultimate Layer</title>
<p>The correspondence of a DNN&#x00027;s penultimate layer to human behavioral representations has been studied extensively and is therefore often used when investigating the representations of abstract visual concepts in neural network models (e.g., Mur et al., <xref ref-type="bibr" rid="B32">2013</xref>; Bankson et al., <xref ref-type="bibr" rid="B3">2018</xref>; Jozwik et al., <xref ref-type="bibr" rid="B18">2018</xref>; Peterson et al., <xref ref-type="bibr" rid="B36">2018</xref>; Battleday et al., <xref ref-type="bibr" rid="B4">2019</xref>; Cichy et al., <xref ref-type="bibr" rid="B6">2019</xref>). To the best of our knowledge, our study is the first to compare visual object representations extracted from CLIP (Radford et al., <xref ref-type="bibr" rid="B37">2021</xref>) against the representations of well-known vision models that have previously shown a close correspondence to neural recordings of the primate visual system. We computed RDMs based on the Pearson correlation distance for seven models, namely AlexNet (Krizhevsky et al., <xref ref-type="bibr" rid="B26">2012</xref>), VGG16 and VGG19 with batch normalization (Simonyan and Zisserman, <xref ref-type="bibr" rid="B43">2015</xref>), which show a close correspondence to brain and behavior (Schrimpf et al., <xref ref-type="bibr" rid="B41">2018</xref>, <xref ref-type="bibr" rid="B42">2020b</xref>), ResNet50 (He et al., <xref ref-type="bibr" rid="B15">2016</xref>), BrainScore&#x00027;s current leader CORnet-S (Kubilius et al., <xref ref-type="bibr" rid="B28">2018</xref>, <xref ref-type="bibr" rid="B27">2019</xref>; Schrimpf et al., <xref ref-type="bibr" rid="B42">2020b</xref>), and OpenAI&#x00027;s CLIP variants CLIP-RN and CLIP-ViT (Radford et al., <xref ref-type="bibr" rid="B37">2021</xref>). 
The comparison was done for six different image datasets that included functional MRI of the human visual system and behavior (Mur et al., <xref ref-type="bibr" rid="B32">2013</xref>; Bankson et al., <xref ref-type="bibr" rid="B3">2018</xref>; Cichy et al., <xref ref-type="bibr" rid="B6">2019</xref>; Mohsenzadeh et al., <xref ref-type="bibr" rid="B31">2019</xref>; Hebart et al., <xref ref-type="bibr" rid="B17">2020</xref>). For the neuroimaging datasets, participants viewed different images of objects while performing an oddball detection task in an MRI scanner. For the behavioral datasets, participants completed similarity judgments using the multiarrangement task (Mur et al., <xref ref-type="bibr" rid="B32">2013</xref>; Bankson et al., <xref ref-type="bibr" rid="B3">2018</xref>) or a triplet odd-one-out task (Hebart et al., <xref ref-type="bibr" rid="B17">2020</xref>).</p>
<p>Note that Bankson et al. (<xref ref-type="bibr" rid="B3">2018</xref>) exploited two different datasets which we label with &#x0201C;(1)&#x0201D; and &#x0201C;(2)&#x0201D; in <xref ref-type="fig" rid="F2">Figure 2</xref>. The numbers of images per dataset are as follows: Kriegeskorte et al. (<xref ref-type="bibr" rid="B25">2008b</xref>), Mur et al. (<xref ref-type="bibr" rid="B32">2013</xref>), Cichy et al. (<xref ref-type="bibr" rid="B7">2014</xref>): 92; Bankson et al. (<xref ref-type="bibr" rid="B3">2018</xref>): 84 each; Cichy et al. (<xref ref-type="bibr" rid="B5">2016</xref>, <xref ref-type="bibr" rid="B6">2019</xref>): 118; Mohsenzadeh et al. (<xref ref-type="bibr" rid="B31">2019</xref>): 156; Hebart et al. (<xref ref-type="bibr" rid="B16">2019</xref>, <xref ref-type="bibr" rid="B17">2020</xref>): 1854. For each of these datasets except for Mohsenzadeh et al. (<xref ref-type="bibr" rid="B31">2019</xref>), we additionally computed RDMs for group averages obtained from behavioral experiments. Furthermore, we computed RDMs for brain voxel activities obtained from fMRI recordings for the datasets used in Cichy et al. (<xref ref-type="bibr" rid="B7">2014</xref>), Cichy et al. (<xref ref-type="bibr" rid="B5">2016</xref>), and Mohsenzadeh et al. (<xref ref-type="bibr" rid="B31">2019</xref>), based on voxels inside a mask covering higher visual cortex.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p><bold>(A)</bold> RDMs for penultimate layer representations of different pretrained neural network models, for group averages of behavioral judgments, and for fMRI responses to higher visual cortex. For Mohsenzadeh et al. (<xref ref-type="bibr" rid="B31">2019</xref>), no behavioral experiments had been conducted. For both datasets in Bankson et al. (<xref ref-type="bibr" rid="B3">2018</xref>), and for Hebart et al. (<xref ref-type="bibr" rid="B17">2020</xref>), no fMRI recordings were available. For display purposes, Hebart et al. (<xref ref-type="bibr" rid="B17">2020</xref>) was downsampled to 200 conditions. RDMs were reordered according to an unsupervised clustering. <bold>(B,C)</bold> Pearson correlation coefficients for comparisons between neural network representations extracted from the penultimate layer and behavioral representations <bold>(B)</bold> and representations corresponding to fMRI responses of higher visual cortex <bold>(C)</bold>. Activations were extracted from pretrained and randomly initialized models.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fninf-15-679838-g0002.tif"/>
</fig>
<p><xref ref-type="fig" rid="F2">Figure 2A</xref> visualizes all RDMs. We clustered RDMs pertaining to group averages of behavioral judgments into five object clusters and sorted the RDMs corresponding to object representations extracted from DNNs according to the obtained cluster labels. The image datasets used in Kriegeskorte et al. (<xref ref-type="bibr" rid="B25">2008b</xref>), Mur et al. (<xref ref-type="bibr" rid="B32">2013</xref>), Cichy et al. (<xref ref-type="bibr" rid="B7">2014</xref>), and Mohsenzadeh et al. (<xref ref-type="bibr" rid="B31">2019</xref>) were already sorted according to object categories, which is why we did not perform a clustering on RDMs for those datasets. The number of clusters was chosen arbitrarily. The reordering was done to highlight the similarities and differences in RDMs.</p>
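<p>Reordering an RDM according to cluster labels amounts to applying the same permutation to its rows and columns. The sketch below assumes that cluster labels are already available (the clustering method itself is not specified here), and the helper <monospace>reorder_rdm</monospace> is a hypothetical name used for illustration.</p>

```python
import numpy as np


def reorder_rdm(rdm, labels):
    """Permute rows and columns so that objects with the same label are adjacent."""
    # A stable sort keeps the original order of objects within each cluster.
    order = np.argsort(labels, kind="stable")
    return rdm[np.ix_(order, order)], order


# Toy symmetric RDM for four objects with a zero diagonal.
rdm = np.array([
    [0.0, 0.9, 0.1, 0.8],
    [0.9, 0.0, 0.7, 0.2],
    [0.1, 0.7, 0.0, 0.6],
    [0.8, 0.2, 0.6, 0.0],
])
labels = np.array([1, 0, 1, 0])  # assumed output of some clustering step

reordered, order = reorder_rdm(rdm, labels)
```

<p>The permutation obtained from one RDM (e.g., the behavioral one) can be applied to the RDMs of other systems so that all matrices share the same object order.</p>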
<sec>
<title>3.1.1. Behavioral Correspondences</title>
<sec>
<title>3.1.1.1. Pretrained Weights</title>
<p>Across all compared DNN models, CORnet-S and CLIP-RN showed the overall closest correspondence to behavioral representations. CORnet-S, however, was the only model that performed well across all datasets. CLIP-RN showed a high Pearson correlation (ranging from 0.40 to 0.60) with behavioral representations across most datasets, with Mur et al. (<xref ref-type="bibr" rid="B32">2013</xref>) being the only exception, for which both CLIP versions performed poorly. Interestingly, for one of the datasets in Bankson et al. (<xref ref-type="bibr" rid="B3">2018</xref>), VGG16 with batch normalization (Simonyan and Zisserman, <xref ref-type="bibr" rid="B43">2015</xref>) outperformed both CORnet-S and CLIP-RN (see <xref ref-type="fig" rid="F2">Figure 2B</xref>). AlexNet consistently performed the worst for behavioral fits. Note that the broadest coverage of visual stimuli is provided by Hebart et al. (<xref ref-type="bibr" rid="B16">2019</xref>, <xref ref-type="bibr" rid="B17">2020</xref>), which should therefore be seen as the most representative result (rightmost column in <xref ref-type="fig" rid="F2">Figure 2B</xref>).</p>
</sec>
<sec>
<title>3.1.1.2. Random Weights</title>
<p>Another interesting finding is that for randomly-initialized weights, CLIP-RN is the poorest performing model in four out of five datasets (see bars in <xref ref-type="fig" rid="F2">Figure 2B</xref> corresponding to lower correlation coefficients). Here, AlexNet seems to be the best performing model across datasets, although it achieved the lowest correspondence to behavioral representations when leveraging a pretrained version (see <xref ref-type="fig" rid="F2">Figure 2B</xref>). This indicates the possibility of complex interactions between model architectures and training objectives that require further investigations which <monospace>THINGSvision</monospace> may facilitate.</p>
</sec>
</sec>
<sec>
<title>3.1.2. Brain Correspondences</title>
<p>We performed a similar analysis as above, but this time leveraging RDMs corresponding to fMRI responses to examine the correlation between model and brain representations of higher visual cortex. We first report results obtained from analyses with pretrained models.</p>
<sec>
<title>3.1.2.1. Pretrained Weights</title>
<p>While AlexNet (Krizhevsky et al., <xref ref-type="bibr" rid="B26">2012</xref>) showed the worst correspondence to human behavior in four out of five datasets (see <xref ref-type="fig" rid="F2">Figure 2B</xref>), AlexNet correlated strongly with representations extracted from fMRI responses to higher visual cortex, except for the dataset used in Cichy et al. (<xref ref-type="bibr" rid="B5">2016</xref>) (see <xref ref-type="fig" rid="F2">Figure 2C</xref>). This is interesting, given that among the entire set of analyzed deep neural network models AlexNet shows the poorest performance on ImageNet (Russakovsky et al., <xref ref-type="bibr" rid="B39">2015</xref>). This result contradicts findings from previous studies arguing that object recognition performance is correlated with correspondences to fMRI recordings (Yamins et al., <xref ref-type="bibr" rid="B52">2014</xref>; Schrimpf et al., <xref ref-type="bibr" rid="B42">2020b</xref>). This time, CORnet-S and CLIP-RN performed well for the datasets used in Cichy et al. (<xref ref-type="bibr" rid="B5">2016</xref>) and in Mohsenzadeh et al. (<xref ref-type="bibr" rid="B31">2019</xref>), but were among the poorest performing DNNs for Cichy et al. (<xref ref-type="bibr" rid="B7">2014</xref>). Note, however, that the dataset used in Cichy et al. (<xref ref-type="bibr" rid="B7">2014</xref>) is highly structured and contains a large number of faces and similar images, something AlexNet might pick up more easily in its image features but something that is not reflected in human behavior (Grootswagers and Robinson, <xref ref-type="bibr" rid="B11">2021</xref>).</p>
</sec>
<sec>
<title>3.1.2.2. Random Weights</title>
<p>When comparing representations corresponding to network activations from models with random weights, there appears to be no consistent pattern as to which model correlated most strongly with brain representations of higher visual cortex, although VGG16 and CORnet-S were the only two models that yielded a Pearson correlation coefficient &#x0003E; 0 across datasets. Note, however, that for each model we extracted network activations from the penultimate layer. Results might look different when extracting activations from earlier layers of the networks or when reweighting the DNN features prior to RSA (Kaniuth and Hebart, <xref ref-type="bibr" rid="B19">2020</xref>; Storrs et al., <xref ref-type="bibr" rid="B44">2020a</xref>). We leave further investigations to future studies, as our analyses should only demonstrate the applicability of our toolbox.</p>
</sec>
</sec>
<sec>
<title>3.1.3. Model Comparison</title>
<p>Although CORnet-S and CLIP-RN achieved the overall highest correspondence to both behavioral and human brain representations, our results indicate that more recent, deeper neural network models are not necessarily preferred over previous, shallower models, at least when exclusively leveraging the penultimate layer of a network. Their correspondences appear to be highly dataset-dependent. Although a pretrained version of AlexNet correlated poorly with representations obtained from behavioral experiments (see <xref ref-type="fig" rid="F2">Figure 2B</xref>), there are datasets where AlexNet showed close correspondence to brain representations (see <xref ref-type="fig" rid="F2">Figure 2C</xref>). Similarly, VGG16 was mostly outperformed by CLIP-RN, but in one out of five datasets it yielded a higher correlation with behavioral representations than CLIP-RN.</p>
</sec>
</sec>
</sec>
<sec sec-type="discussion" id="s4">
<title>4. Discussion</title>
<p>Here we introduce <monospace>THINGSvision</monospace>, a Python toolbox for extracting activations from hidden layers of a wide range of deep neural network models. We designed <monospace>THINGSvision</monospace> to facilitate research at the intersection of cognitive science, computational neuroscience, and artificial intelligence.</p>
<p>Recently, an API was released (Mehrer et al., <xref ref-type="bibr" rid="B30">2021</xref>) that enables the extraction of image features from AlexNet and vNet without the requirement to install any library, making it a highly user-friendly contribution to the field. Apart from requiring an installation of Python, <monospace>THINGSvision</monospace> provides a comparably simple way to extract network activations, yet for a much broader set of DNNs and with a higher degree of flexibility and control over the extraction procedure. <monospace>THINGSvision</monospace> can easily be integrated with any other computational analysis pipeline performed in Python or Matlab. We additionally allow for a streamlined comparison of visual object representations obtained from various DNNs employing representational similarity analysis.</p>
<p>We demonstrated the usefulness of <monospace>THINGSvision</monospace> through the application of RSA and the quantification of correspondences between representations extracted from models and human behavior (or brains). Please note that the extracted network activations are not only useful for visualizing and comparing network activations through frameworks such as RSA, but for any downstream application, including regression onto brain data (Yamins et al., <xref ref-type="bibr" rid="B52">2014</xref>; G&#x000FC;&#x000E7;l&#x000FC; and van Gerven, <xref ref-type="bibr" rid="B13">2015</xref>), feature selectivity analysis (e.g., Xu et al., <xref ref-type="bibr" rid="B51">2021</xref>), or fine-tuning of individual layers for external tasks (e.g., Khaligh-Razavi and Kriegeskorte, <xref ref-type="bibr" rid="B20">2014</xref>; Tajbakhsh et al., <xref ref-type="bibr" rid="B46">2016</xref>).</p>
<p><monospace>THINGSvision</monospace> enabled us to investigate object representations of CLIP (Radford et al., <xref ref-type="bibr" rid="B37">2021</xref>) against representations extracted from other neural network models as well as representations from behavioral experiments and fMRI responses to higher visual cortex. To understand why Transformer layers and multimodal training objectives help to achieve strong correspondences to behavioral representations (see <xref ref-type="fig" rid="F2">Figure 2B</xref>), further studies are encouraged to investigate the representations of CLIP and its differences to previous DNN architectures with unimodal objectives.</p>
<p>We hope that <monospace>THINGSvision</monospace> will serve as a useful tool that supports researchers in carrying out such analyses, and we intend to extend the set of models and functionalities that are integral to <monospace>THINGSvision</monospace> over the coming years as a function of advancements and demands in the field.</p>
</sec>
<sec sec-type="data-availability" id="s5">
<title>Data Availability Statement</title>
<p>Publicly available datasets were analyzed in this study. These data can be found here: <ext-link ext-link-type="uri" xlink:href="https://osf.io/jum2f/">https://osf.io/jum2f/</ext-link>, <ext-link ext-link-type="uri" xlink:href="http://twinsetfusion.csail.mit.edu/">http://twinsetfusion.csail.mit.edu/</ext-link>, <ext-link ext-link-type="uri" xlink:href="http://userpage.fu-berlin.de/rmcichy/nn_project_page/main.html">http://userpage.fu-berlin.de/rmcichy/nn_project_page/main.html</ext-link>, <ext-link ext-link-type="uri" xlink:href="http://brainmodels.csail.mit.edu/">http://brainmodels.csail.mit.edu/</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fpsyg.2013.00128/full">https://www.frontiersin.org/articles/10.3389/fpsyg.2013.00128/full</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://www.sciencedirect.com/science/article/pii/S1053811919302083">https://www.sciencedirect.com/science/article/pii/S1053811919302083</ext-link>. The code for <monospace>THINGSvision</monospace> is also publicly available: <ext-link ext-link-type="uri" xlink:href="https://github.com/ViCCo-Group/THINGSvision">https://github.com/ViCCo-Group/THINGSvision</ext-link>.</p>
</sec>
<sec id="s6">
<title>Author Contributions</title>
<p>LM designed the toolbox and programmed the software. LM and MH collected the data, analyzed and visualized the data, and wrote the manuscript. MH supervised the study and acquired the funding. Both authors agreed with the final version of the manuscript.</p>
</sec>
<sec sec-type="funding-information" id="s7">
<title>Funding</title>
<p>This work was supported by a Max Planck Research Group grant of the Max Planck Society awarded to MH.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s8">
<title>Publisher&#x00027;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec> </body>
<back>
<ack><p>The authors would like to thank Katja Seeliger, Oliver Contier, and Philipp Kaniuth for useful comments on earlier versions of this paper, and in particular Hannes Hansen, who helped run a wide range of tests and improve the continuous integration of the toolbox.</p>
</ack>
<sec sec-type="supplementary-material" id="s9">
<title>Supplementary Material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fninf.2021.679838/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fninf.2021.679838/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Data_Sheet_1.pdf" id="SM1" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Abadi</surname> <given-names>M.</given-names></name> <name><surname>Agarwal</surname> <given-names>A.</given-names></name> <name><surname>Barham</surname> <given-names>P.</given-names></name> <name><surname>Brevdo</surname> <given-names>E.</given-names></name> <name><surname>Chen</surname> <given-names>Z.</given-names></name> <name><surname>Citro</surname> <given-names>C.</given-names></name> <etal/></person-group>. (<year>2015</year>). <source>TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="http://tensorflow.org">tensorflow.org</ext-link></citation>
</ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Arbuckle</surname> <given-names>S. A.</given-names></name> <name><surname>Yokoi</surname> <given-names>A.</given-names></name> <name><surname>Pruszynski</surname> <given-names>J. A.</given-names></name> <name><surname>Diedrichsen</surname> <given-names>J.</given-names></name></person-group> (<year>2019</year>). <article-title>Stability of representational geometry across a wide range of fMRI activity levels</article-title>. <source>Neuroimage</source> <volume>186</volume>, <fpage>155</fpage>&#x02013;<lpage>163</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuroimage.2018.11.002</pub-id><pub-id pub-id-type="pmid">30395930</pub-id></citation></ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bankson</surname> <given-names>B.</given-names></name> <name><surname>Hebart</surname> <given-names>M.</given-names></name> <name><surname>Groen</surname> <given-names>I.</given-names></name> <name><surname>Baker</surname> <given-names>C.</given-names></name></person-group> (<year>2018</year>). <article-title>The temporal evolution of conceptual object representations revealed through models of behavior, semantics and deep neural networks</article-title>. <source>Neuroimage</source> <volume>178</volume>, <fpage>172</fpage>&#x02013;<lpage>182</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuroimage.2018.05.037</pub-id><pub-id pub-id-type="pmid">29777825</pub-id></citation></ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Battleday</surname> <given-names>R. M.</given-names></name> <name><surname>Peterson</surname> <given-names>J. C.</given-names></name> <name><surname>Griffiths</surname> <given-names>T. L.</given-names></name></person-group> (<year>2019</year>). <article-title>Capturing human categorization of natural images at scale by combining deep networks and cognitive models</article-title>. <source>Nat. Commun</source>. <volume>11</volume>:<fpage>5418</fpage>. <pub-id pub-id-type="doi">10.1038/s41467-020-18946-z</pub-id><pub-id pub-id-type="pmid">33110085</pub-id></citation></ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cichy</surname> <given-names>R. M.</given-names></name> <name><surname>Khosla</surname> <given-names>A.</given-names></name> <name><surname>Pantazis</surname> <given-names>D.</given-names></name> <name><surname>Torralba</surname> <given-names>A.</given-names></name> <name><surname>Oliva</surname> <given-names>A.</given-names></name></person-group> (<year>2016</year>). <article-title>Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence</article-title>. <source>Sci. Rep</source>. <volume>6</volume>:<fpage>27755</fpage>. <pub-id pub-id-type="doi">10.1038/srep27755</pub-id><pub-id pub-id-type="pmid">27282108</pub-id></citation></ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cichy</surname> <given-names>R. M.</given-names></name> <name><surname>Kriegeskorte</surname> <given-names>N.</given-names></name> <name><surname>Jozwik</surname> <given-names>K. M.</given-names></name> <name><surname>van den Bosch</surname> <given-names>J. J.</given-names></name> <name><surname>Charest</surname> <given-names>I.</given-names></name></person-group> (<year>2019</year>). <article-title>The spatiotemporal neural dynamics underlying perceived similarity for real-world objects</article-title>. <source>Neuroimage</source> <volume>194</volume>, <fpage>12</fpage>&#x02013;<lpage>24</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuroimage.2019.03.031</pub-id><pub-id pub-id-type="pmid">30894333</pub-id></citation></ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cichy</surname> <given-names>R. M.</given-names></name> <name><surname>Pantazis</surname> <given-names>D.</given-names></name> <name><surname>Oliva</surname> <given-names>A.</given-names></name></person-group> (<year>2014</year>). <article-title>Resolving human object recognition in space and time</article-title>. <source>Nat. Neurosci</source>. <volume>17</volume>:<fpage>455</fpage>. <pub-id pub-id-type="doi">10.1038/nn.3635</pub-id><pub-id pub-id-type="pmid">24464044</pub-id></citation></ref>
<ref id="B8">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Deng</surname> <given-names>J.</given-names></name> <name><surname>Dong</surname> <given-names>W.</given-names></name> <name><surname>Socher</surname> <given-names>R.</given-names></name> <name><surname>Li</surname> <given-names>L.</given-names></name> <name><surname>Li</surname> <given-names>K.</given-names></name> <name><surname>Li</surname> <given-names>F.</given-names></name></person-group> (<year>2009</year>). <article-title>&#x0201C;Imagenet: a large-scale hierarchical image database,&#x0201D;</article-title> in <source>2009 IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Miami, FL</publisher-loc>), <fpage>248</fpage>&#x02013;<lpage>255</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2009.5206848</pub-id></citation>
</ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dosovitskiy</surname> <given-names>A.</given-names></name> <name><surname>Beyer</surname> <given-names>L.</given-names></name> <name><surname>Kolesnikov</surname> <given-names>A.</given-names></name> <name><surname>Weissenborn</surname> <given-names>D.</given-names></name> <name><surname>Zhai</surname> <given-names>X.</given-names></name> <name><surname>Unterthiner</surname> <given-names>T.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>&#x0201C;An image is worth 16x16 words: Transformers for image recognition at scale,&#x0201D;</article-title> in <source>9th International Conference on Learning Representations</source>, ICLR 2021, Virtual Event, Austria.</citation>
</ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Esteban</surname> <given-names>O.</given-names></name> <name><surname>Markiewicz</surname> <given-names>C. J.</given-names></name> <name><surname>Blair</surname> <given-names>R. W.</given-names></name> <name><surname>Moodie</surname> <given-names>C. A.</given-names></name> <name><surname>Isik</surname> <given-names>A. I.</given-names></name> <name><surname>Erramuzpe</surname> <given-names>A.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>fMRIPrep: a robust preprocessing pipeline for functional MRI</article-title>. <source>Nat. Methods</source> <volume>16</volume>, <fpage>111</fpage>&#x02013;<lpage>116</lpage>. <pub-id pub-id-type="doi">10.1038/s41592-018-0235-4</pub-id><pub-id pub-id-type="pmid">30532080</pub-id></citation></ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Grootswagers</surname> <given-names>T.</given-names></name> <name><surname>Robinson</surname> <given-names>A. K.</given-names></name></person-group> (<year>2021</year>). <article-title>Overfitting the literature to one set of stimuli and data</article-title>. <source>Front. Hum. Neurosci</source>. <volume>15</volume>:<fpage>386</fpage>. <pub-id pub-id-type="doi">10.3389/fnhum.2021.682661</pub-id><pub-id pub-id-type="pmid">34305552</pub-id></citation></ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>G&#x000FC;&#x000E7;l&#x000FC;</surname> <given-names>U.</given-names></name> <name><surname>van Gerven</surname> <given-names>M. A. J.</given-names></name></person-group> (<year>2014</year>). <article-title>Unsupervised feature learning improves prediction of human brain activity in response to natural images</article-title>. <source>PLoS Comput. Biol</source>. <volume>10</volume>:<fpage>e1003724</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pcbi.1003724</pub-id><pub-id pub-id-type="pmid">25101625</pub-id></citation></ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>G&#x000FC;&#x000E7;l&#x000FC;</surname> <given-names>U.</given-names></name> <name><surname>van Gerven</surname> <given-names>M. A. J.</given-names></name></person-group> (<year>2015</year>). <article-title>Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream</article-title>. <source>J. Neurosci</source>. <volume>35</volume>, <fpage>10005</fpage>&#x02013;<lpage>10014</lpage>. <pub-id pub-id-type="doi">10.1523/JNEUROSCI.5023-14.2015</pub-id><pub-id pub-id-type="pmid">26157000</pub-id></citation></ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Harris</surname> <given-names>C. R.</given-names></name> <name><surname>Millman</surname> <given-names>K. J.</given-names></name> <name><surname>van der Walt</surname> <given-names>S.</given-names></name> <name><surname>Gommers</surname> <given-names>R.</given-names></name> <name><surname>Virtanen</surname> <given-names>P.</given-names></name> <name><surname>Cournapeau</surname> <given-names>D.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>Array programming with NumPy</article-title>. <source>Nature</source> <volume>585</volume>, <fpage>357</fpage>&#x02013;<lpage>362</lpage>. <pub-id pub-id-type="doi">10.1038/s41586-020-2649-2</pub-id><pub-id pub-id-type="pmid">32939066</pub-id></citation></ref>
<ref id="B15">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Ren</surname> <given-names>S.</given-names></name> <name><surname>Sun</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Deep residual learning for image recognition,&#x0201D;</article-title> in <source>2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016</source> (<publisher-loc>Las Vegas, NV</publisher-loc>: <publisher-name>IEEE Computer Society</publisher-name>), <fpage>770</fpage>&#x02013;<lpage>778</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2016.90</pub-id><pub-id pub-id-type="pmid">32166560</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hebart</surname> <given-names>M. N.</given-names></name> <name><surname>Dickter</surname> <given-names>A. H.</given-names></name> <name><surname>Kidder</surname> <given-names>A.</given-names></name> <name><surname>Kwok</surname> <given-names>W. Y.</given-names></name> <name><surname>Corriveau</surname> <given-names>A.</given-names></name> <name><surname>Van Wicklin</surname> <given-names>C.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>THINGS: a database of 1,854 object concepts and more than 26,000 naturalistic object images</article-title>. <source>PLoS ONE</source> <volume>14</volume>:<fpage>e0223792</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0223792</pub-id><pub-id pub-id-type="pmid">31613926</pub-id></citation></ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hebart</surname> <given-names>M. N.</given-names></name> <name><surname>Zheng</surname> <given-names>C. Y.</given-names></name> <name><surname>Pereira</surname> <given-names>F.</given-names></name> <name><surname>Baker</surname> <given-names>C. I.</given-names></name></person-group> (<year>2020</year>). <article-title>Revealing the multidimensional mental representations of natural objects underlying human similarity judgements</article-title>. <source>Nat. Hum. Behav</source>. <volume>4</volume>, <fpage>1173</fpage>&#x02013;<lpage>1185</lpage>. <pub-id pub-id-type="doi">10.1038/s41562-020-00951-3</pub-id><pub-id pub-id-type="pmid">33046861</pub-id></citation></ref>
<ref id="B18">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jozwik</surname> <given-names>K. M.</given-names></name> <name><surname>Kriegeskorte</surname> <given-names>N.</given-names></name> <name><surname>Cichy</surname> <given-names>R. M.</given-names></name> <name><surname>Mur</surname> <given-names>M.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Deep convolutional neural networks, features, and categories perform similarly at explaining primate high-level visual representations,&#x0201D;</article-title> in <source>2018 Conference on Cognitive Computational Neuroscience</source> (<publisher-loc>Philadelphia, PA</publisher-loc>), <fpage>1</fpage>&#x02013;<lpage>4</lpage>. <pub-id pub-id-type="doi">10.32470/CCN.2018.1232-0</pub-id></citation>
</ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kaniuth</surname> <given-names>P.</given-names></name> <name><surname>Hebart</surname> <given-names>M. N.</given-names></name></person-group> (<year>2020</year>). <article-title>Tuned representational similarity analysis: improving the fit between computational models of vision and brain data</article-title>. <source>J. Vis</source>. <volume>20</volume>:<fpage>1076</fpage>. <pub-id pub-id-type="doi">10.1167/jov.20.11.1076</pub-id></citation>
</ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Khaligh-Razavi</surname> <given-names>S.-M.</given-names></name> <name><surname>Kriegeskorte</surname> <given-names>N.</given-names></name></person-group> (<year>2014</year>). <article-title>Deep supervised, but not unsupervised, models may explain IT cortical representation</article-title>. <source>PLoS Comput. Biol</source>. <volume>10</volume>:<fpage>e1003915</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pcbi.1003915</pub-id><pub-id pub-id-type="pmid">25375136</pub-id></citation></ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kietzmann</surname> <given-names>T. C.</given-names></name> <name><surname>McClure</surname> <given-names>P.</given-names></name> <name><surname>Kriegeskorte</surname> <given-names>N.</given-names></name></person-group> (<year>2018</year>). <article-title>Deep neural networks in computational neuroscience</article-title>. bioRxiv <italic>[Preprint]</italic>. <pub-id pub-id-type="doi">10.1101/133504</pub-id><pub-id pub-id-type="pmid">29292134</pub-id></citation></ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>King</surname> <given-names>M. L.</given-names></name> <name><surname>Groen</surname> <given-names>I. I.</given-names></name> <name><surname>Steel</surname> <given-names>A.</given-names></name> <name><surname>Kravitz</surname> <given-names>D. J.</given-names></name> <name><surname>Baker</surname> <given-names>C. I.</given-names></name></person-group> (<year>2019</year>). <article-title>Similarity judgments and cortical visual responses reflect different properties of object and scene categories in naturalistic images</article-title>. <source>Neuroimage</source> <volume>197</volume>, <fpage>368</fpage>&#x02013;<lpage>382</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuroimage.2019.04.079</pub-id><pub-id pub-id-type="pmid">31054350</pub-id></citation></ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kriegeskorte</surname> <given-names>N.</given-names></name></person-group> (<year>2015</year>). <article-title>Deep neural networks: a new framework for modeling biological vision and brain information processing</article-title>. <source>Annu. Rev. Vis. Sci</source>. <volume>1</volume>, <fpage>417</fpage>&#x02013;<lpage>446</lpage>. <pub-id pub-id-type="doi">10.1146/annurev-vision-082114-035447</pub-id><pub-id pub-id-type="pmid">28532370</pub-id></citation></ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kriegeskorte</surname> <given-names>N.</given-names></name> <name><surname>Mur</surname> <given-names>M.</given-names></name> <name><surname>Bandettini</surname> <given-names>P. A.</given-names></name></person-group> (<year>2008a</year>). <article-title>Representational similarity analysis-connecting the branches of systems neuroscience</article-title>. <source>Front. Syst. Neurosci</source>. <volume>2</volume>:<fpage>4</fpage>. <pub-id pub-id-type="doi">10.3389/neuro.06.004.2008</pub-id><pub-id pub-id-type="pmid">19104670</pub-id></citation></ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kriegeskorte</surname> <given-names>N.</given-names></name> <name><surname>Mur</surname> <given-names>M.</given-names></name> <name><surname>Ruff</surname> <given-names>D. A.</given-names></name> <name><surname>Kiani</surname> <given-names>R.</given-names></name> <name><surname>Bodurka</surname> <given-names>J.</given-names></name> <name><surname>Esteky</surname> <given-names>H.</given-names></name> <etal/></person-group>. (<year>2008b</year>). <article-title>Matching categorical object representations in inferior temporal cortex of man and monkey</article-title>. <source>Neuron</source> <volume>60</volume>, <fpage>1126</fpage>&#x02013;<lpage>1141</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuron.2008.10.043</pub-id><pub-id pub-id-type="pmid">19109916</pub-id></citation></ref>
<ref id="B26">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Krizhevsky</surname> <given-names>A.</given-names></name> <name><surname>Sutskever</surname> <given-names>I.</given-names></name> <name><surname>Hinton</surname> <given-names>G. E.</given-names></name></person-group> (<year>2012</year>). <article-title>&#x0201C;Imagenet classification with deep convolutional neural networks,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source>, eds F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (<publisher-loc>Lake Tahoe, NV</publisher-loc>: <publisher-name>Curran Associates, Inc.</publisher-name>), <fpage>1097</fpage>&#x02013;<lpage>1105</lpage>.</citation>
</ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kubilius</surname> <given-names>J.</given-names></name> <name><surname>Schrimpf</surname> <given-names>M.</given-names></name> <name><surname>Hong</surname> <given-names>H.</given-names></name> <name><surname>Majaj</surname> <given-names>N. J.</given-names></name> <name><surname>Rajalingham</surname> <given-names>R.</given-names></name> <name><surname>Issa</surname> <given-names>E. B.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>&#x0201C;Brain-like object recognition with high-performing shallow recurrent ANNs,&#x0201D;</article-title> in <source>Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019</source>, eds H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d&#x00027;Alch&#x000E9;-Buc, E. B. Fox, and R. Garnett (Vancouver, BC), <fpage>12785</fpage>&#x02013;<lpage>12796</lpage>.</citation>
</ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kubilius</surname> <given-names>J.</given-names></name> <name><surname>Schrimpf</surname> <given-names>M.</given-names></name> <name><surname>Nayebi</surname> <given-names>A.</given-names></name> <name><surname>Bear</surname> <given-names>D.</given-names></name> <name><surname>Yamins</surname> <given-names>D. L. K.</given-names></name> <name><surname>DiCarlo</surname> <given-names>J. J.</given-names></name></person-group> (<year>2018</year>). <article-title>CORnet: modeling the neural mechanisms of core object recognition</article-title>. bioRxiv <italic>[Preprint]</italic>. <pub-id pub-id-type="doi">10.1101/408385</pub-id></citation>
</ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>LeCun</surname> <given-names>Y.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Hinton</surname> <given-names>G.</given-names></name></person-group> (<year>2015</year>). <article-title>Deep learning</article-title>. <source>Nature</source> <volume>521</volume>, <fpage>436</fpage>&#x02013;<lpage>444</lpage>. <pub-id pub-id-type="doi">10.1038/nature14539</pub-id><pub-id pub-id-type="pmid">26017442</pub-id></citation></ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mehrer</surname> <given-names>J.</given-names></name> <name><surname>Spoerer</surname> <given-names>C. J.</given-names></name> <name><surname>Jones</surname> <given-names>E. C.</given-names></name> <name><surname>Kriegeskorte</surname> <given-names>N.</given-names></name> <name><surname>Kietzmann</surname> <given-names>T. C.</given-names></name></person-group> (<year>2021</year>). <article-title>An ecologically motivated image dataset for deep learning yields better models of human vision</article-title>. <source>Proc. Natl. Acad. Sci. U.S.A</source>. <volume>118</volume>:<fpage>e2011417118</fpage>. <pub-id pub-id-type="doi">10.1073/pnas.2011417118</pub-id><pub-id pub-id-type="pmid">33593900</pub-id></citation></ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mohsenzadeh</surname> <given-names>Y.</given-names></name> <name><surname>Mullin</surname> <given-names>C.</given-names></name> <name><surname>Lahner</surname> <given-names>B.</given-names></name> <name><surname>Cichy</surname> <given-names>R. M.</given-names></name> <name><surname>Oliva</surname> <given-names>A.</given-names></name></person-group> (<year>2019</year>). <article-title>Reliability and generalizability of similarity-based fusion of MEG and fMRI data in human ventral and dorsal visual streams</article-title>. <source>Vision</source> <volume>3</volume>:<fpage>8</fpage>. <pub-id pub-id-type="doi">10.3390/vision3010008</pub-id><pub-id pub-id-type="pmid">31735809</pub-id></citation></ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mur</surname> <given-names>M.</given-names></name> <name><surname>Meys</surname> <given-names>M.</given-names></name> <name><surname>Bodurka</surname> <given-names>J.</given-names></name> <name><surname>Goebel</surname> <given-names>R.</given-names></name> <name><surname>Bandettini</surname> <given-names>P.</given-names></name> <name><surname>Kriegeskorte</surname> <given-names>N.</given-names></name></person-group> (<year>2013</year>). <article-title>Human object-similarity judgments reflect and transcend the primate-IT object representation</article-title>. <source>Front. Psychol</source>. <volume>4</volume>:<fpage>128</fpage>. <pub-id pub-id-type="doi">10.3389/fpsyg.2013.00128</pub-id><pub-id pub-id-type="pmid">23525516</pub-id></citation></ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nili</surname> <given-names>H.</given-names></name> <name><surname>Wingfield</surname> <given-names>C.</given-names></name> <name><surname>Walther</surname> <given-names>A.</given-names></name> <name><surname>Su</surname> <given-names>L.</given-names></name> <name><surname>Marslen-Wilson</surname> <given-names>W.</given-names></name> <name><surname>Kriegeskorte</surname> <given-names>N.</given-names></name></person-group> (<year>2014</year>). <article-title>A toolbox for representational similarity analysis</article-title>. <source>PLoS Comput. Biol</source>. <volume>10</volume>:<fpage>e1003553</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pcbi.1003553</pub-id><pub-id pub-id-type="pmid">24743308</pub-id></citation></ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Paszke</surname> <given-names>A.</given-names></name> <name><surname>Gross</surname> <given-names>S.</given-names></name> <name><surname>Massa</surname> <given-names>F.</given-names></name> <name><surname>Lerer</surname> <given-names>A.</given-names></name> <name><surname>Bradbury</surname> <given-names>J.</given-names></name> <name><surname>Chanan</surname> <given-names>G.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>&#x0201C;Pytorch: an imperative style, high-performance deep learning library,&#x0201D;</article-title> in <source>Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019</source>, eds H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d&#x00027;Alch&#x000E9;-Buc, E. B. Fox, and R. Garnett (Vancouver, BC), <fpage>8024</fpage>&#x02013;<lpage>8035</lpage>.</citation>
</ref>
<ref id="B35">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Peng</surname> <given-names>R. D.</given-names></name></person-group> (<year>2011</year>). <article-title>Reproducible research in computational science</article-title>. <source>Science</source> <volume>334</volume>, <fpage>1226</fpage>&#x02013;<lpage>1227</lpage>. <pub-id pub-id-type="doi">10.1126/science.1213847</pub-id><pub-id pub-id-type="pmid">22144613</pub-id></citation></ref>
<ref id="B36">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Peterson</surname> <given-names>J. C.</given-names></name> <name><surname>Abbott</surname> <given-names>J. T.</given-names></name> <name><surname>Griffiths</surname> <given-names>T. L.</given-names></name></person-group> (<year>2018</year>). <article-title>Evaluating (and improving) the correspondence between deep neural networks and human representations</article-title>. <source>Cogn. Sci</source>. <volume>42</volume>, <fpage>2648</fpage>&#x02013;<lpage>2669</lpage>. <pub-id pub-id-type="doi">10.1111/cogs.12670</pub-id><pub-id pub-id-type="pmid">30178468</pub-id></citation></ref>
<ref id="B37">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Radford</surname> <given-names>A.</given-names></name> <name><surname>Kim</surname> <given-names>J. W.</given-names></name> <name><surname>Hallacy</surname> <given-names>C.</given-names></name> <name><surname>Ramesh</surname> <given-names>A.</given-names></name> <name><surname>Goh</surname> <given-names>G.</given-names></name> <name><surname>Agarwal</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>Learning transferable visual models from natural language supervision</article-title>. <source>arXiv preprint arXiv:2103.00020</source>.</citation>
</ref>
<ref id="B38">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Rush</surname> <given-names>A.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;The annotated transformer,&#x0201D;</article-title> in <source>Proceedings of Workshop for NLP Open Source Software (NLP-OSS)</source> (<publisher-loc>Melbourne, VIC</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>52</fpage>&#x02013;<lpage>60</lpage>. <pub-id pub-id-type="doi">10.18653/v1/W18-2509</pub-id></citation>
</ref>
<ref id="B39">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Russakovsky</surname> <given-names>O.</given-names></name> <name><surname>Deng</surname> <given-names>J.</given-names></name> <name><surname>Su</surname> <given-names>H.</given-names></name> <name><surname>Krause</surname> <given-names>J.</given-names></name> <name><surname>Satheesh</surname> <given-names>S.</given-names></name> <name><surname>Ma</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2015</year>). <article-title>ImageNet large scale visual recognition challenge</article-title>. <source>Int. J. Comput. Vis</source>. <volume>115</volume>, <fpage>211</fpage>&#x02013;<lpage>252</lpage>. <pub-id pub-id-type="doi">10.1007/s11263-015-0816-y</pub-id></citation>
</ref>
<ref id="B40">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schrimpf</surname> <given-names>M.</given-names></name> <name><surname>Blank</surname> <given-names>I.</given-names></name> <name><surname>Tuckute</surname> <given-names>G.</given-names></name> <name><surname>Kauf</surname> <given-names>C.</given-names></name> <name><surname>Hosseini</surname> <given-names>E. A.</given-names></name> <name><surname>Kanwisher</surname> <given-names>N.</given-names></name> <etal/></person-group>. (<year>2020a</year>). <article-title>The neural architecture of language: Integrative reverse-engineering converges on a model for predictive processing</article-title>. bioRxiv <italic>[Preprint]</italic>. <pub-id pub-id-type="doi">10.1101/2020.06.26.174482</pub-id></citation>
</ref>
<ref id="B41">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schrimpf</surname> <given-names>M.</given-names></name> <name><surname>Kubilius</surname> <given-names>J.</given-names></name> <name><surname>Hong</surname> <given-names>H.</given-names></name> <name><surname>Majaj</surname> <given-names>N. J.</given-names></name> <name><surname>Rajalingham</surname> <given-names>R.</given-names></name> <name><surname>Issa</surname> <given-names>E. B.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>Brain-score: Which artificial neural network for object recognition is most brain-like?</article-title> bioRxiv <italic>[Preprint]</italic>. <pub-id pub-id-type="doi">10.1101/407007</pub-id></citation>
</ref>
<ref id="B42">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schrimpf</surname> <given-names>M.</given-names></name> <name><surname>Kubilius</surname> <given-names>J.</given-names></name> <name><surname>Lee</surname> <given-names>M. J.</given-names></name> <name><surname>Murty</surname> <given-names>N. A. R.</given-names></name> <name><surname>Ajemian</surname> <given-names>R.</given-names></name> <name><surname>DiCarlo</surname> <given-names>J. J.</given-names></name></person-group> (<year>2020b</year>). <article-title>Integrative benchmarking to advance neurally mechanistic models of human intelligence</article-title>. <source>Neuron</source> <volume>108</volume>, <fpage>413</fpage>&#x02013;<lpage>423</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuron.2020.07.040</pub-id><pub-id pub-id-type="pmid">32918861</pub-id></citation></ref>
<ref id="B43">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Simonyan</surname> <given-names>K.</given-names></name> <name><surname>Zisserman</surname> <given-names>A.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;Very deep convolutional networks for large-scale image recognition,&#x0201D;</article-title> in <source>3rd International Conference on Learning Representations, ICLR 2015</source>, eds Y. Bengio and Y. LeCun (San Diego, CA), <fpage>1</fpage>&#x02013;<lpage>14</lpage>.</citation>
</ref>
<ref id="B44">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Storrs</surname> <given-names>K. R.</given-names></name> <name><surname>Khaligh-Razavi</surname> <given-names>S.-M.</given-names></name> <name><surname>Kriegeskorte</surname> <given-names>N.</given-names></name></person-group> (<year>2020a</year>). <article-title>Noise ceiling on the cross-validated performance of reweighted models of representational dissimilarity: Addendum to Khaligh-Razavi &#x00026; Kriegeskorte (2014)</article-title>. bioRxiv <italic>[Preprint]</italic>. <pub-id pub-id-type="doi">10.1101/2020.03.23.003046</pub-id></citation>
</ref>
<ref id="B45">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Storrs</surname> <given-names>K. R.</given-names></name> <name><surname>Kietzmann</surname> <given-names>T. C.</given-names></name> <name><surname>Walther</surname> <given-names>A.</given-names></name> <name><surname>Mehrer</surname> <given-names>J.</given-names></name> <name><surname>Kriegeskorte</surname> <given-names>N.</given-names></name></person-group> (<year>2020b</year>). <article-title>Diverse deep neural networks all predict human IT well, after training and fitting</article-title>. bioRxiv <italic>[Preprint]</italic>. <pub-id pub-id-type="doi">10.1101/2020.05.07.082743</pub-id><pub-id pub-id-type="pmid">34272948</pub-id></citation></ref>
<ref id="B46">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tajbakhsh</surname> <given-names>N.</given-names></name> <name><surname>Shin</surname> <given-names>J. Y.</given-names></name> <name><surname>Gurudu</surname> <given-names>S. R.</given-names></name> <name><surname>Hurst</surname> <given-names>R. T.</given-names></name> <name><surname>Kendall</surname> <given-names>C. B.</given-names></name> <name><surname>Gotway</surname> <given-names>M. B.</given-names></name> <etal/></person-group>. (<year>2016</year>). <article-title>Convolutional neural networks for medical image analysis: full training or fine tuning?</article-title> <source>IEEE Trans. Med. Imaging</source> <volume>35</volume>, <fpage>1299</fpage>&#x02013;<lpage>1312</lpage>. <pub-id pub-id-type="doi">10.1109/TMI.2016.2535302</pub-id><pub-id pub-id-type="pmid">26978662</pub-id></citation></ref>
<ref id="B47">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Van Lissa</surname> <given-names>C. J.</given-names></name> <name><surname>Brandmaier</surname> <given-names>A. M.</given-names></name> <name><surname>Brinkman</surname> <given-names>L.</given-names></name> <name><surname>Lamprecht</surname> <given-names>A.-L.</given-names></name> <name><surname>Peikert</surname> <given-names>A.</given-names></name> <name><surname>Struiksma</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>WORCS: a workflow for open reproducible code in science</article-title>. <source>Data Sci</source>. <volume>4</volume>, <fpage>29</fpage>&#x02013;<lpage>49</lpage>. <pub-id pub-id-type="doi">10.3233/DS-210031</pub-id></citation>
</ref>
<ref id="B48">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vaswani</surname> <given-names>A.</given-names></name> <name><surname>Shazeer</surname> <given-names>N.</given-names></name> <name><surname>Parmar</surname> <given-names>N.</given-names></name> <name><surname>Uszkoreit</surname> <given-names>J.</given-names></name> <name><surname>Jones</surname> <given-names>L.</given-names></name> <name><surname>Gomez</surname> <given-names>A. N.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>&#x0201C;Attention is all you need,&#x0201D;</article-title> in <source>Annual Conference on Neural Information Processing Systems 2017</source>, eds I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Long Beach, CA), <fpage>5998</fpage>&#x02013;<lpage>6008</lpage>.</citation>
</ref>
<ref id="B49">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>A.</given-names></name> <name><surname>Pruksachatkun</surname> <given-names>Y.</given-names></name> <name><surname>Nangia</surname> <given-names>N.</given-names></name> <name><surname>Singh</surname> <given-names>A.</given-names></name> <name><surname>Michael</surname> <given-names>J.</given-names></name> <name><surname>Hill</surname> <given-names>F.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>&#x0201C;SuperGLUE: a stickier benchmark for general-purpose language understanding systems,&#x0201D;</article-title> in <source>Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019</source>, eds H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d&#x00027;Alch&#x000E9;-Buc, E. B. Fox, and R. Garnett (Vancouver, BC), <fpage>3261</fpage>&#x02013;<lpage>3275</lpage>.</citation>
</ref>
<ref id="B50">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>A.</given-names></name> <name><surname>Singh</surname> <given-names>A.</given-names></name> <name><surname>Michael</surname> <given-names>J.</given-names></name> <name><surname>Hill</surname> <given-names>F.</given-names></name> <name><surname>Levy</surname> <given-names>O.</given-names></name> <name><surname>Bowman</surname> <given-names>S. R.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;GLUE: a multi-task benchmark and analysis platform for natural language understanding,&#x0201D;</article-title> in <source>Proceedings of the Workshop: Analyzing and Interpreting Neural Networks for NLP</source>, eds T. Linzen, G. Chrupala, and A. Alishahi (<publisher-loc>Brussels</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>353</fpage>&#x02013;<lpage>355</lpage>.</citation>
</ref>
<ref id="B51">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xu</surname> <given-names>S.</given-names></name> <name><surname>Zhang</surname> <given-names>Y.</given-names></name> <name><surname>Zhen</surname> <given-names>Z.</given-names></name> <name><surname>Liu</surname> <given-names>J.</given-names></name></person-group> (<year>2021</year>). <article-title>The face module emerged in a deep convolutional neural network selectively deprived of face experience</article-title>. <source>Front. Comput. Neurosci</source>. <volume>15</volume>:<fpage>626259</fpage>. <pub-id pub-id-type="doi">10.3389/fncom.2021.626259</pub-id><pub-id pub-id-type="pmid">34093154</pub-id></citation></ref>
<ref id="B52">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yamins</surname> <given-names>D. L. K.</given-names></name> <name><surname>Hong</surname> <given-names>H.</given-names></name> <name><surname>Cadieu</surname> <given-names>C. F.</given-names></name> <name><surname>Solomon</surname> <given-names>E. A.</given-names></name> <name><surname>Seibert</surname> <given-names>D.</given-names></name> <name><surname>DiCarlo</surname> <given-names>J. J.</given-names></name></person-group> (<year>2014</year>). <article-title>Performance-optimized hierarchical models predict neural responses in higher visual cortex</article-title>. <source>Proc. Natl. Acad. Sci. U.S.A</source>. <volume>111</volume>, <fpage>8619</fpage>&#x02013;<lpage>8624</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.1403112111</pub-id><pub-id pub-id-type="pmid">24812127</pub-id></citation></ref>
</ref-list>
</back>
</article>
