<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Neurosci.</journal-id>
<journal-title>Frontiers in Neuroscience</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Neurosci.</abbrev-journal-title>
<issn pub-type="epub">1662-453X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fnins.2022.775457</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Neuroscience</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>MONETA: A Processing-In-Memory-Based Hardware Platform for the Hybrid Convolutional Spiking Neural Network With Online Learning</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Kim</surname> <given-names>Daehyun</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1047446/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Chakraborty</surname> <given-names>Biswadeep</given-names></name>
<uri xlink:href="http://loop.frontiersin.org/people/1303762/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>She</surname> <given-names>Xueyuan</given-names></name>
<uri xlink:href="http://loop.frontiersin.org/people/1109590/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Lee</surname> <given-names>Edward</given-names></name>
</contrib>
<contrib contrib-type="author">
<name><surname>Kang</surname> <given-names>Beomseok</given-names></name>
<uri xlink:href="http://loop.frontiersin.org/people/1477730/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Mukhopadhyay</surname> <given-names>Saibal</given-names></name>
<uri xlink:href="http://loop.frontiersin.org/people/1120696/overview"/>
</contrib>
</contrib-group>
<aff><institution>Department of Electrical and Computer Engineering, Georgia Institute of Technology</institution>, <addr-line>Atlanta, GA</addr-line>, <country>United States</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Irem Boybat, IBM Research, Switzerland</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Abhronil Sengupta, The Pennsylvania State University (PSU), United States; Deliang Fan, Arizona State University, United States</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Daehyun Kim <email>daehyun.kim&#x00040;gatech.edu</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Neuromorphic Engineering, a section of the journal Frontiers in Neuroscience</p></fn></author-notes>
<pub-date pub-type="epub">
<day>11</day>
<month>04</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>16</volume>
<elocation-id>775457</elocation-id>
<history>
<date date-type="received">
<day>14</day>
<month>09</month>
<year>2021</year>
</date>
<date date-type="accepted">
<day>07</day>
<month>03</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2022 Kim, Chakraborty, She, Lee, Kang and Mukhopadhyay.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Kim, Chakraborty, She, Lee, Kang and Mukhopadhyay</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license></permissions>
<abstract>
<p>We present a processing-in-memory (PIM)-based hardware platform, referred to as MONETA, for on-chip acceleration of inference and learning in hybrid convolutional spiking neural networks. MONETA uses 8T static random-access memory (SRAM)-based PIM cores for vector matrix multiplication (VMM), augmented with spike-time-dependent-plasticity (STDP)-based weight updates. A spiking neural network (SNN)-focused data flow is presented to minimize data movement in MONETA while ensuring learning accuracy. MONETA supports on-line and on-chip training on the PIM architecture. The STDP-trained convolutional neural network within an SNN (ConvSNN) with the proposed data flow, 4-bit input precision, and 8-bit weight precision shows only 1.63% lower accuracy on CIFAR-10 than the software implementation of STDP. Further, the proposed architecture is used to accelerate a hybrid SNN architecture that couples off-chip supervised (backpropagation through time) and on-chip unsupervised (STDP) training. We also evaluate the hybrid network architecture with the proposed data flow. On the CIFAR-10 dataset, the accuracy of this hybrid network is 10.84% higher than the STDP-trained result and 1.4% higher than the backpropagation-trained ConvSNN result. Physical design of MONETA in 65 nm complementary metal-oxide-semiconductor (CMOS) technology shows power efficiencies of 18.69 tera operations per second (TOPS)/W, 7.25 TOPS/W, and 10.41 TOPS/W for the inference mode, learning mode, and hybrid learning mode, respectively.</p></abstract>
<kwd-group>
<kwd>spiking neural network (SNN)</kwd>
<kwd>processing-in-memory (PIM)</kwd>
<kwd>convolutional spiking neural network</kwd>
<kwd>on-line learning</kwd>
<kwd>on-chip learning</kwd>
<kwd>spike-time-dependent plasticity (STDP)</kwd>
<kwd>AI accelerator</kwd>
<kwd>hybrid network</kwd>
</kwd-group>
<contract-num rid="cn001">HR001118C0096</contract-num>
<contract-sponsor id="cn001">Defense Advanced Research Projects Agency<named-content content-type="fundref-id">10.13039/100000185</named-content></contract-sponsor>
<counts>
<fig-count count="16"/>
<table-count count="3"/>
<equation-count count="0"/>
<ref-count count="47"/>
<page-count count="17"/>
<word-count count="10250"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Spiking neural networks (SNNs) (Maass, <xref ref-type="bibr" rid="B27">1997</xref>; Gerstner and Kistler, <xref ref-type="bibr" rid="B15">2002</xref>) with spike-time-dependent-plasticity (STDP)-based unsupervised learning provide a bio-inspired and energy-efficient alternative to deep learning (Kim et al., <xref ref-type="bibr" rid="B20">2020</xref>; Panda et al., <xref ref-type="bibr" rid="B31">2020</xref>). There is growing interest in developing specialized hardware accelerators for SNNs (Akopyan et al., <xref ref-type="bibr" rid="B1">2015</xref>; Buhler et al., <xref ref-type="bibr" rid="B3">2017</xref>; Davies et al., <xref ref-type="bibr" rid="B10">2018</xref>; Chen et al., <xref ref-type="bibr" rid="B6">2019</xref>; Park et al., <xref ref-type="bibr" rid="B32">2019</xref>; Chuang et al., <xref ref-type="bibr" rid="B9">2020</xref>). However, the majority of prior accelerators have focused on fully connected SNNs and shallow networks. Deep convolutional neural network (CNN) architectures incorporated within SNNs, hereafter referred to as ConvSNN, can improve the accuracy of SNNs on complex problems (Cao et al., <xref ref-type="bibr" rid="B4">2015</xref>; Tavanaei et al., <xref ref-type="bibr" rid="B45">2016</xref>; Kheradpisheh et al., <xref ref-type="bibr" rid="B18">2018</xref>; Lee et al., <xref ref-type="bibr" rid="B24">2019</xref>). As the complexity of a ConvSNN increases, it requires more synaptic weights and generates larger input/output feature maps, all of which can increase data movement. Processing-in-memory (PIM) has emerged as a key approach to reducing data movement and enhancing the energy efficiency of CNNs (Chi et al., <xref ref-type="bibr" rid="B8">2016</xref>; Shafiee et al., <xref ref-type="bibr" rid="B37">2016</xref>; Imani et al., <xref ref-type="bibr" rid="B17">2019</xref>; Long et al., <xref ref-type="bibr" rid="B26">2020</xref>; Sze et al., <xref ref-type="bibr" rid="B44">2020</xref>). 
However, to the best of our knowledge, there has been no prior work on a PIM-based accelerator for ConvSNN with on-chip learning.</p>
<p>This article presents, for the first time, a PIM-based accelerator, hereafter referred to as MONETA, that accelerates ConvSNN with on-chip STDP learning. The overall architecture of MONETA includes SRAM-based PIM cores for computing synapse responses, all-digital modules for computing the membrane potentials of neurons, and STDP-based weight updates that are centrally managed but locally applied. The SRAM-based PIM cores augment the sequential-access PIM used in DNN acceleration, such as the design presented by Long et al. (<xref ref-type="bibr" rid="B26">2020</xref>), with STDP-based weight update modules for parallel updates of synaptic weights (Kim et al., <xref ref-type="bibr" rid="B19">2020</xref>). The novelty of MONETA lies in a data flow optimized for resource efficiency while implementing inference and learning in a PIM-based ConvSNN.</p>
<p>In a traditional CNN, the output feature map (OFM) tensor of a layer is obtained from the total input feature map (TIFM) tensor and the filter weights (<xref ref-type="fig" rid="F1">Figure 1A</xref>). In ConvSNN, we first generate a tensor for the membrane potentials of all neurons (<italic>TV</italic><sub>mem</sub>), followed by the output spikes (OFMs) (<xref ref-type="fig" rid="F1">Figure 1B</xref>). However, because input pixels are encoded as spike trains, multiple time steps (spike cycles) are necessary to process one image using ConvSNN. Hence, the TIFM of each layer must be processed multiple times to generate the <italic>TV</italic><sub>mem</sub> in each spike cycle, which leads to a large on-chip buffer for the <italic>TV</italic><sub>mem</sub> tensor and significant off-chip (from DRAM) and on-chip (from the <italic>TV</italic><sub>mem</sub> buffer) data movement. Although Narayanan et al. analyzed the temporal aspects of SNNs for logic-based engines (Narayanan et al., <xref ref-type="bibr" rid="B29">2020</xref>), they did not optimize the data flow by simultaneously considering data movement and learning accuracy in ConvSNN.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>The computational models of <bold>(A)</bold> a convolutional neural network (CNN) and <bold>(B)</bold> a convolutional neural network within a spiking neural network (ConvSNN).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-775457-g0001.tif"/>
</fig>
<p>We propose a novel data flow for the PIM-based processing of the TIFM. We read an input feature map (IFM) from the TIFM tensor, process the IFM using PIM, and generate the <italic>V</italic><sub>mem</sub> for the output neurons. Processing each IFM sequentially over all spike cycles eliminates repeated reading of the TIFM from DRAM and on-chip storage of <italic>TV</italic><sub>mem</sub>. However, the sequential processing of IFMs introduces a bias in STDP learning, because IFMs processed earlier influence the filter weights more strongly than those processed later. We propose a central STDP controller that ensures each filter is updated based on the IFM that results in the maximum <italic>V</italic><sub>mem</sub> of the firing neuron, rather than on the IFMs that were processed earlier in the sequence. In summary, our approach minimizes data movement during inference while preserving the accuracy of the STDP learning process.</p>
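<p>The selection rule enforced by the central STDP controller can be sketched as a small table keyed by filter. The class and method names below are hypothetical, and the sketch only models the bookkeeping (which IFM each filter should learn from), not the hardware table or the weight update itself. In the usage example, a naive policy that updated on the first firing IFM would pick position (0, 0) for filter 0. The table instead keeps (0, 1), where the firing neuron&#x00027;s <italic>V</italic><sub>mem</sub> is larger.</p>
<preformat preformat-type="code">
```python
class FilterUpdateTable:
    """Hypothetical software model of the central STDP controller's table.

    For every filter (output depth index) it remembers which IFM position
    produced the largest V_mem among neurons that fired, so the later weight
    update does not simply favor the IFMs that were processed first.
    """

    def __init__(self, n_filters):
        self.best_vmem = [None] * n_filters
        self.best_ifm = [None] * n_filters

    def observe(self, ifm_pos, filter_idx, vmem, fired):
        """Record one neuron's V_mem during the inference phase."""
        if not fired:
            return  # only firing neurons are candidates for an STDP update
        best = self.best_vmem[filter_idx]
        if best is None or vmem > best:
            self.best_vmem[filter_idx] = vmem
            self.best_ifm[filter_idx] = ifm_pos

    def update_plan(self):
        """(filter, IFM position) pairs to apply in the weight update phase."""
        return [(k, pos) for k, pos in enumerate(self.best_ifm) if pos is not None]


table = FilterUpdateTable(n_filters=3)
# IFMs arrive in sequence; filter 0 fires at (0, 0) first, then more strongly at (0, 1).
table.observe((0, 0), 0, vmem=1.2, fired=True)
table.observe((0, 1), 0, vmem=1.5, fired=True)
table.observe((0, 2), 0, vmem=0.8, fired=False)  # did not fire: ignored
table.observe((1, 0), 2, vmem=1.1, fired=True)
print(table.update_plan())  # [(0, (0, 1)), (2, (1, 0))]
```
</preformat>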
<p>The accuracy of the accelerator is evaluated on the MNIST, CIFAR-10, and CIFAR-100 datasets. On CIFAR-10, the accuracy with weights trained by the standard STDP model is 67.88%. With our modified STDP model, the accuracy is 66.25%, which is 1.63% lower than the standard STDP model-based result. These results demonstrate that on-chip and on-line STDP learning can be achieved with only a small accuracy loss. Average power efficiencies of 18.69 TOPS/W and 7.25 TOPS/W are observed for inference and learning, respectively.</p>
<p>Along with a fully STDP-trained ConvSNN, the proposed architecture is also used to accelerate inference and on-line learning of a hybrid ConvSNN architecture that couples layers trained with supervised learning (off-chip) and layers learned with STDP (on-chip). The concept of hybridization combining supervised training and STDP was first introduced for a DNN (She et al., <xref ref-type="bibr" rid="B38">2021</xref>). Subsequently, Chakraborty et al. applied the same concept of hybridization to SNNs (Chakraborty et al., <xref ref-type="bibr" rid="B5">2021</xref>). In this article, we present a hardware platform that accelerates ConvSNN using the same concept of hybridization.</p>
<p>In addition to homogeneous networks, MONETA also supports hybrid ConvSNNs. Half of the layers can be trained on-line using the STDP algorithm, while the other half use externally programmed fixed weights. These fixed weights are trained off-chip by supervised learning. STDP uses unsupervised local learning to extract low-level features under spatial correlation. On the other hand, surrogate-gradient-based backpropagation (BP) in ConvSNN enables global learning across low-level pixel-to-pixel interactions (Wu et al., <xref ref-type="bibr" rid="B47">2018</xref>). It thus aids high-level detection and classification, similar to an SGD-trained CNN model. By integrating global features from supervised training and local features from STDP learning, the hybrid network is also more robust to local uncorrelated perturbations in pixels while extracting the correct feature representation from the overall image. Consequently, hybridizing surrogate-gradient training and STDP enables robust image classification and improves on the accuracy of the baseline backpropagated ConvSNN model.</p>
<p>Based on the hybrid network simulation, MONETA achieves 1.40% higher accuracy (77.83%) than supervised learning alone (76.43%) on the CIFAR-10 dataset. In addition, the average power efficiency in the hybrid on-line learning mode is 10.41 TOPS/W. This efficiency is higher than that of the on-line learning mode but lower than that of the inference mode, because half of the layers operate in inference mode and the other half operate in learning mode.</p>
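<p>The following back-of-envelope sketch combines the two single-mode figures to show why a hybrid workload lands between them. It assumes, purely for illustration, that efficiency is total operations divided by total energy and that a hypothetical fraction <monospace>inference_fraction</monospace> of the operations runs in inference mode. The function name and the default 50/50 split are assumptions, not the authors&#x00027; power model.</p>
<preformat preformat-type="code">
```python
def combined_tops_per_watt(eff_inference, eff_learning, inference_fraction=0.5):
    """Blend two efficiencies (TOPS/W) by splitting the total operations.

    Hypothetical model: energy per operation is 1 / efficiency in each mode,
    and the combined efficiency is total ops / total energy. With a 50/50
    split of operations this reduces to the harmonic mean.
    """
    if not (1.0 >= inference_fraction >= 0.0):
        raise ValueError("inference_fraction must lie in [0, 1]")
    energy_per_op = (inference_fraction / eff_inference
                     + (1.0 - inference_fraction) / eff_learning)
    return 1.0 / energy_per_op


# Reported single-mode efficiencies from the text (TOPS/W).
hybrid = combined_tops_per_watt(18.69, 7.25)
print(f"{hybrid:.2f} TOPS/W")  # 10.45 TOPS/W under the equal-split assumption
```
</preformat>
<p>Under this equal-split assumption, the estimate is about 10.45 TOPS/W, which is near the reported 10.41 TOPS/W. The measured value also depends on how many operations each half of the network actually performs.</p>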
</sec>
<sec id="s2">
<title>2. Background</title>
<sec>
<title>2.1. ConvSNN and Unsupervised Learning Using STDP</title>
<p>The spiking CNN uses the same structure as a traditional CNN (<xref ref-type="fig" rid="F2">Figure 2</xref>). However, its inputs are binary spikes, and the magnitude of each input is encoded in the spike train. For example, the value of an image pixel is encoded in the frequency of the spikes. A spiking neuron computes its membrane potential <italic>V</italic><sub>mem</sub> by accumulating the spike levels multiplied by the synaptic weights, following leaky integrate-and-fire (LIF) dynamics (<xref ref-type="fig" rid="F3">Figure 3</xref>). When <italic>V</italic><sub>mem</sub> exceeds a threshold <italic>V</italic><sub>th</sub>, the neuron fires (generates an output spike) and <italic>V</italic><sub>mem</sub> is reset to <italic>V</italic><sub>reset</sub>.</p>
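<p>The following Python sketch illustrates the rate coding and LIF behavior described above. It is a minimal example and does not model MONETA&#x00027;s neuron module. The Bernoulli spike generator, the multiplicative leak factor <monospace>leak</monospace>, and the default values of <monospace>v_th</monospace> and <monospace>v_reset</monospace> are assumptions chosen for readability.</p>
<preformat preformat-type="code">
```python
import random


def rate_encode(pixel, n_cycles, seed=0):
    """Encode a pixel intensity in [0, 1] as a binary spike train.

    Assumed scheme: in each spike cycle the pixel spikes with probability
    equal to its intensity, so brighter pixels spike more often.
    """
    rng = random.Random(seed)
    return [1 if pixel > rng.random() else 0 for _ in range(n_cycles)]


def lif_neuron(input_trains, weights, v_th=1.0, v_reset=0.0, leak=0.9):
    """Simulate one leaky integrate-and-fire neuron over all spike cycles.

    input_trains[i][t] is the spike (0 or 1) of input i at cycle t, and
    weights[i] is the synaptic weight of input i.
    """
    v_mem = v_reset
    output = []
    for t in range(len(input_trains[0])):
        # Weighted sum of the input spikes arriving in this cycle.
        current = sum(w * train[t] for w, train in zip(weights, input_trains))
        v_mem = leak * v_mem + current  # leak, then integrate
        if v_mem > v_th:                # fire when V_mem exceeds V_th
            output.append(1)
            v_mem = v_reset             # reset after the output spike
        else:
            output.append(0)
    return output


trains = [rate_encode(1.0, 4), rate_encode(0.0, 4)]
print(lif_neuron(trains, weights=[0.6, 0.6]))  # [0, 1, 0, 1]
```
</preformat>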
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>The architecture of the CNN.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-775457-g0002.tif"/>
</fig>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p><bold>(A)</bold> Leaky integrate and fire (LIF) neuron computational model. <bold>(B)</bold> Stochastic spike-time-dependent-plasticity (STDP) model.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-775457-g0003.tif"/>
</fig>
<p>When a neuron fires (i.e., generates an output spike), the synaptic weights connected to the firing neuron are updated following a stochastic STDP model (<xref ref-type="fig" rid="F3">Figure 3B</xref>) (She et al., <xref ref-type="bibr" rid="B39">2019</xref>). The firing of a neuron also inhibits the firing of other neurons. There are two types of inhibition: cross-depth inhibition and lateral inhibition. <xref ref-type="fig" rid="F4">Figure 4A</xref> shows cross-depth inhibition, in which the firing of a neuron inhibits the firing of all other neurons located at the same (x, y) coordinates across all depths (along the &#x0201C;z&#x0201D;-axis) of <italic>TV</italic><sub>mem</sub>. Cross-depth inhibition can be easily implemented within a single PIM array and its neuron set (<xref ref-type="fig" rid="F4">Figure 4B</xref>). In lateral inhibition, the firing of a neuron inhibits all other neurons located at the same <italic>z</italic> coordinate (<xref ref-type="fig" rid="F4">Figure 4C</xref>).</p>
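<p>The two inhibition rules can be stated as index sets over the <italic>TV</italic><sub>mem</sub> tensor. The sketch below assumes a tensor of shape (rows, columns, depth) indexed as (x, y, z), and the function name <monospace>inhibited_neurons</monospace> is hypothetical. Cross-depth inhibition touches only one (x, y) position. This matches the observation that it maps onto a single PIM array and its neuron set.</p>
<preformat preformat-type="code">
```python
def inhibited_neurons(fired, shape, mode):
    """Return the set of (x, y, z) neurons inhibited when the neuron fired spikes.

    fired: (x, y, z) index of the firing neuron in TV_mem.
    shape: (rows, cols, depth) of the TV_mem tensor.
    mode:  "cross-depth" or "lateral".
    """
    x0, y0, z0 = fired
    rows, cols, depth = shape
    if mode == "cross-depth":
        # Same (x, y) position, every other depth (z) index.
        return {(x0, y0, z) for z in range(depth) if z != z0}
    if mode == "lateral":
        # Same depth (z) index, every other (x, y) position.
        return {(x, y, z0) for x in range(rows) for y in range(cols)
                if (x, y) != (x0, y0)}
    raise ValueError(f"unknown inhibition mode: {mode}")


print(sorted(inhibited_neurons((1, 1, 0), (3, 3, 4), "cross-depth")))
# [(1, 1, 1), (1, 1, 2), (1, 1, 3)]
print(len(inhibited_neurons((1, 1, 0), (3, 3, 4), "lateral")))  # 8
```
</preformat>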
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p><bold>(A)</bold> Cross-depth inhibition <bold>(B)</bold> cross-depth inhibition on the memory array <bold>(C)</bold> lateral inhibition.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-775457-g0004.tif"/>
</fig>
</sec>
<sec>
<title>2.2. CNN Mapping for PIM Architecture</title>
<p><xref ref-type="fig" rid="F1">Figure 1A</xref> shows the basic terminology for CNN hardware (Chen et al., <xref ref-type="bibr" rid="B7">2017</xref>). In layer &#x0003C;n&#x0003E;, the size of the TIFM is <italic>I</italic><sub><italic>R</italic></sub>&#x000D7; <italic>I</italic><sub><italic>C</italic></sub>&#x000D7; <italic>I</italic><sub><italic>D</italic></sub>, the size of each filter is <italic>F</italic><sub><italic>R</italic></sub>&#x000D7; <italic>F</italic><sub><italic>C</italic></sub>&#x000D7; <italic>I</italic><sub><italic>D</italic></sub>, and the size of the total output feature map (TOFM) is <italic>O</italic><sub><italic>R</italic></sub>&#x000D7; <italic>O</italic><sub><italic>C</italic></sub>&#x000D7; <italic>O</italic><sub><italic>D</italic></sub>. The number of filters is equal to the TOFM&#x00027;s depth (<italic>O</italic><sub><italic>D</italic></sub>). Each IFM, whose size is <italic>F</italic><sub><italic>R</italic></sub>&#x000D7; <italic>F</italic><sub><italic>C</italic></sub>&#x000D7; <italic>I</italic><sub><italic>D</italic></sub>, is multiplied by each filter, generating an OFM of size 1 &#x000D7; 1 &#x000D7; 1 per filter. The stride is denoted by <italic>S</italic>. <xref ref-type="fig" rid="F5">Figure 5</xref> shows the method of mapping a CNN onto memory for the PIM architecture (Peng et al., <xref ref-type="bibr" rid="B33">2021</xref>). The filter weights are divided along the x- and y-axes into slices of size 1 &#x000D7; 1 &#x000D7; <italic>I</italic><sub><italic>D</italic></sub>, which are distributed across different memory arrays. In addition, each filter is placed in a different column. To calculate the OFM, the IFM is divided and sent to the memory sub-arrays. The multiplication between the synapse matrix and the input vector is computed in each array, and the outputs are summed to compute the OFM.</p>
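<p>A small pure-Python model can check the mapping above numerically. The sketch assumes nested lists indexed as ifm[r][c][d] and filters[k][r][c][d], ignores bit precision and stride, and uses hypothetical function names. The random integers in 4-bit and 8-bit ranges serve only as test data. The model shows that splitting the filters into <italic>F</italic><sub><italic>R</italic></sub> &#x000D7; <italic>F</italic><sub><italic>C</italic></sub> sub-arrays of <italic>I</italic><sub><italic>D</italic></sub> rows by <italic>O</italic><sub><italic>D</italic></sub> columns and summing the per-array VMM outputs reproduces the direct dot products.</p>
<preformat preformat-type="code">
```python
import random


def direct_ofm(ifm, filters):
    """Reference result: dot product of one FR x FC x ID IFM with each filter."""
    fr, fc, depth = len(ifm), len(ifm[0]), len(ifm[0][0])
    return [sum(ifm[r][c][d] * f[r][c][d]
                for r in range(fr) for c in range(fc) for d in range(depth))
            for f in filters]


def map_filters_to_arrays(filters):
    """Split the filters along x and y into FR * FC sub-arrays.

    Sub-array (r, c) has ID rows and OD columns: row d, column k stores
    weight filters[k][r][c][d], so each filter occupies its own column.
    """
    fr, fc, depth = len(filters[0]), len(filters[0][0]), len(filters[0][0][0])
    return {(r, c): [[f[r][c][d] for f in filters] for d in range(depth)]
            for r in range(fr) for c in range(fc)}


def pim_ofm(ifm, arrays):
    """Each sub-array runs a VMM on its 1 x 1 x ID input slice; outputs are summed."""
    n_filters = len(next(iter(arrays.values()))[0])
    ofm = [0] * n_filters
    for (r, c), array in arrays.items():
        vec = ifm[r][c]  # 1 x 1 x ID slice of the IFM routed to this sub-array
        for k in range(n_filters):
            ofm[k] += sum(vec[d] * array[d][k] for d in range(len(vec)))
    return ofm


rng = random.Random(0)
FR, FC, ID, OD = 3, 3, 4, 5
ifm = [[[rng.randint(0, 15) for _ in range(ID)] for _ in range(FC)] for _ in range(FR)]
filters = [[[[rng.randint(-128, 127) for _ in range(ID)] for _ in range(FC)]
            for _ in range(FR)] for _ in range(OD)]
arrays = map_filters_to_arrays(filters)
print(len(arrays), pim_ofm(ifm, arrays) == direct_ofm(ifm, filters))  # 9 True
```
</preformat>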
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>CNN mapping on the memory for processing-in-memory (PIM) architecture.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-775457-g0005.tif"/>
</fig>
</sec>
<sec>
<title>2.3. Prior SNN Accelerator Hardware</title>
<p>Various types of SNN-based accelerators have been introduced in recent years. Buhler et al. (<xref ref-type="bibr" rid="B3">2017</xref>) built an analog-neuron-based accelerator for compact and energy-efficient design. However, their accelerator targets the spiking locally competitive algorithm. Chen et al. (<xref ref-type="bibr" rid="B6">2019</xref>) presented a large-scale neuromorphic processor with 4,096 neurons and 1M synapses. Their design uses binary activations, but the hardware is not optimized for ConvSNN (Chen et al., <xref ref-type="bibr" rid="B6">2019</xref>). Park et al. (<xref ref-type="bibr" rid="B32">2019</xref>) presented a ConvSNN-based accelerator. However, they used only the stochastic gradient descent algorithm for learning to improve accuracy. In addition, Chuang et al. introduced a ConvSNN accelerator that uses a 2D systolic array with efficient data reuse, but their design does not include on-chip training (Chuang et al., <xref ref-type="bibr" rid="B9">2020</xref>).</p>
<p>Our design accelerates ConvSNN using a PIM architecture with on-chip STDP learning. ConvSNN requires a more complicated hardware design than a multilayer-perceptron-based SNN, but it achieves higher accuracy and lower weight memory usage on complex image datasets such as CIFAR-10. The PIM architecture does not require a separate VMM calculation module, because the VMM is calculated in the SRAM array. In this sense, the PIM architecture reduces data transmission, as the weights need not be transferred to another module. In addition, the STDP learning rule enables efficient learning for large-scale models or on-line learning because it is an unsupervised, local learning rule. We propose a modified STDP algorithm that can be efficiently accelerated on the PIM architecture.</p>
</sec>
<sec>
<title>2.4. Hybrid Spiking Neural Network</title>
<p>SNN training methodologies can be broadly classified into three types: (1) conversion from artificial to spiking models (Diehl et al., <xref ref-type="bibr" rid="B14">2015</xref>; Sengupta et al., <xref ref-type="bibr" rid="B35">2019</xref>), (2) surrogate-gradient-descent-based backpropagation with spikes (Lee et al., <xref ref-type="bibr" rid="B22">2018</xref>; Wu et al., <xref ref-type="bibr" rid="B47">2018</xref>; Neftci et al., <xref ref-type="bibr" rid="B30">2019</xref>), and (3) unsupervised STDP-based learning (Diehl and Cook, <xref ref-type="bibr" rid="B13">2015</xref>; Srinivasan et al., <xref ref-type="bibr" rid="B43">2018</xref>). Each technique has its own advantages and disadvantages. ANN-to-SNN conversion yields state-of-the-art accuracy, even on complex datasets such as ImageNet (Deng et al., <xref ref-type="bibr" rid="B11">2009</xref>). It can also be used to convert complex architectures such as VGGNet (Simonyan and Zisserman, <xref ref-type="bibr" rid="B41">2014</xref>), ResNet (He et al., <xref ref-type="bibr" rid="B16">2016</xref>), and RetinaNet (Miquel et al., <xref ref-type="bibr" rid="B28">2021</xref>). However, the latency incurred to process a rate-coded image is very high (Pfeiffer and Pfeil, <xref ref-type="bibr" rid="B34">2018</xref>; Lee et al., <xref ref-type="bibr" rid="B23">2020</xref>). Surrogate-gradient-based methods address the latency concerns but lag behind conversion in accuracy on larger and more complex tasks. Unsupervised STDP training also suffers from accuracy deficiencies. As pointed out by Panda et al. (<xref ref-type="bibr" rid="B31">2020</xref>), the accuracy loss due to vanishing spike propagation and input pixel-to-spike coding is an innate property of SNN design that can be mitigated to a certain extent but cannot be completely eliminated. 
To achieve accuracy competitive with that of an ANN, previous works have taken a hybrid approach with a partly artificial and partly spiking neural architecture (Panda et al., <xref ref-type="bibr" rid="B31">2020</xref>; She et al., <xref ref-type="bibr" rid="B40">2020</xref>). As discussed by Ledinauskas et al. (<xref ref-type="bibr" rid="B21">2020</xref>), SNNs obtained by conversion must use only rate encoding, which might reduce their expressive capacity. Another drawback of conversion with rate-based encoding is that SNN inference requires forward-propagation time steps on the order of thousands. This severely limits the computation speed and energy-efficiency benefits of SNNs. A large number of spikes is necessary to reduce the uncertainty of the spiking frequency values. Moreover, several ANN architectural features are restricted before conversion (e.g., batch normalization cannot be used) (Diehl et al., <xref ref-type="bibr" rid="B14">2015</xref>; Sengupta et al., <xref ref-type="bibr" rid="B35">2019</xref>). This limits ANN performance and, in turn, the upper bound of SNN performance. Because of these limitations, we use a surrogate-gradient-based method to train SNNs directly instead of converting ANN parameters to SNNs.</p>
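<p>A simple statistical model makes the link between spike count and rate uncertainty concrete. Assume, for illustration only, that each time step emits a spike independently with probability <italic>p</italic> (a Bernoulli rate code). Then the standard error of the estimated rate after <italic>T</italic> steps is sqrt(<italic>p</italic>(1 &#x02212; <italic>p</italic>)/<italic>T</italic>). The hypothetical helpers below compute the number of steps needed for a target error and check the formula with a quick simulation.</p>
<preformat preformat-type="code">
```python
import math
import random
from fractions import Fraction


def steps_for_rate_error(p, target_std):
    """Smallest T with sqrt(p * (1 - p) / T) no larger than target_std.

    Assumes independent Bernoulli spikes per time step (a simplified rate code).
    """
    return math.ceil(p * (1 - p) / target_std ** 2)


def empirical_rate_std(p, n_steps, n_trials=500, seed=0):
    """Spread of the rate estimate across repeated simulated spike trains."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_trials):
        spikes = sum(1 for _ in range(n_steps) if p > rng.random())
        rates.append(spikes / n_steps)
    mean = sum(rates) / n_trials
    return math.sqrt(sum((r - mean) ** 2 for r in rates) / n_trials)


# Resolving a mid-range pixel (p = 1/2) to within 1% needs thousands of steps.
print(steps_for_rate_error(Fraction(1, 2), Fraction(1, 100)))  # 2500
print(round(empirical_rate_std(0.5, 2500), 3))  # roughly 0.01
```
</preformat>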
<p>Hence, following the work of Chakraborty et al. (<xref ref-type="bibr" rid="B5">2021</xref>), we use a hybrid network consisting of surrogate-gradient-based backpropagated ConvSNN modules together with an unsupervised STDP-trained ConvSNN module. <xref ref-type="fig" rid="F6">Figure 6</xref> shows architectural block diagrams of the different types of neural networks. <xref ref-type="fig" rid="F6">Figures 6A,B</xref> show the homogeneous network architectures that use STDP and backpropagation, respectively. <xref ref-type="fig" rid="F6">Figure 6C</xref> shows the hybrid network architecture, whose weights come from different training algorithms. The hybrid network consists of spiking layers placed in parallel to form different spiking convolution modules. The first spiking convolution module and half of the third spiking convolution module (shown in blue in <xref ref-type="fig" rid="F6">Figures 6B,C</xref>) are the backpropagated spiking modules. The second spiking convolution module and the other half of the third spiking convolution module (shown in orange in <xref ref-type="fig" rid="F6">Figures 6A,C</xref>) are trained with the unsupervised STDP algorithm. The STDP spiking convolution module is placed in parallel with the backpropagated module to enable robust extraction of local and low-level features. Further, to ensure that low-level feature extraction also benefits from global learning, which is the hallmark of gradient backpropagation, several backpropagated ConvSNN layers of similar size are used in parallel with the STDP ConvSNN module. The output feature maps of the two parallel modules are kept at the same height and width and are concatenated along the depth to form the input tensor of the final ConvSNN layers. This final ConvSNN module is responsible for higher-level feature detection as well as the final classification. The main CNN module can be designed based on existing deep learning models. 
The concatenation of features from the backpropagation-based and STDP-based ConvSNN modules helps integrate global and local learning.</p>
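<p>The depth-wise concatenation step can be illustrated with a small helper. This hypothetical sketch works on nested lists indexed as [row][col][depth]. It enforces the stated requirement that both modules produce feature maps with the same height and width, and the branch depths in the example are arbitrary.</p>
<preformat preformat-type="code">
```python
def concat_depth(fmap_a, fmap_b):
    """Concatenate two feature maps [row][col][depth] along the depth axis."""
    if len(fmap_a) != len(fmap_b) or len(fmap_a[0]) != len(fmap_b[0]):
        raise ValueError("feature maps must share height and width")
    return [[pix_a + pix_b for pix_a, pix_b in zip(row_a, row_b)]
            for row_a, row_b in zip(fmap_a, fmap_b)]


# Hypothetical shapes: backpropagated branch gives depth 3, STDP branch depth 2.
bp_out = [[[1, 1, 1] for _ in range(4)] for _ in range(4)]
stdp_out = [[[0, 0] for _ in range(4)] for _ in range(4)]
merged = concat_depth(bp_out, stdp_out)
print(len(merged), len(merged[0]), len(merged[0][0]))  # 4 4 5
```
</preformat>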
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Architectural block diagrams of the <bold>(A)</bold> STDP-only network, <bold>(B)</bold> backpropagation-only network, and <bold>(C)</bold> hybrid network.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-775457-g0006.tif"/>
</fig>
<p>Other types of hybridization also exist in prior works. Lee et al. (<xref ref-type="bibr" rid="B22">2018</xref>) use STDP-based unsupervised pre-training followed by supervised fine-tuning to improve accuracy. Other works show ANN-SNN hybridization that uses both ANNs and SNNs (Deng et al., <xref ref-type="bibr" rid="B12">2020</xref>; Singh et al., <xref ref-type="bibr" rid="B42">2020</xref>; Wang et al., <xref ref-type="bibr" rid="B46">2021</xref>). In contrast, hybridization in this article means using only SNNs whose weights are obtained with different training algorithms (pre-trained backpropagation and STDP-based on-line learning).</p>
</sec>
</sec>
<sec id="s3">
<title>3. Hardware Architecture</title>
<p>The overall MONETA architecture consists of synaptic cores, neuron modules, and a central STDP controller (<xref ref-type="fig" rid="F7">Figure 7</xref>). The synaptic core calculates the <italic>V</italic><sub>mem</sub> for each filter based on the IFMs and the weights. The synaptic array inside the synaptic core functions as a digital PIM core and calculates the vector matrix multiplication (VMM) of the IFMs and synaptic weights. The results generated by the synaptic array are accumulated in the neuron module, which generates output spikes from the accumulated <italic>V</italic><sub>mem</sub> using the LIF model. The central STDP controller contains the filter-update table and the training control module, which control the synaptic core and the STDP-based weight update.</p>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>The MONETA system architecture overview.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-775457-g0007.tif"/>
</fig>
<p>STDP learning is performed using distributed weight update modules embedded in each synaptic core and a central STDP controller (<xref ref-type="fig" rid="F7">Figure 7</xref>). The weight update module reads the synaptic weights, computes the update using stochastic STDP (Kim et al., <xref ref-type="bibr" rid="B19">2020</xref>), and writes the updated weights back. The central STDP controller manages the filter-update table and the learning process.</p>
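<p>A weight update module of this kind performs a read-modify-write operation on each synapse. The sketch below illustrates that pattern for 8-bit weights, which it assumes are unsigned (0 to 255). The update rule is a simplified stand-in and not necessarily the stochastic STDP rule used in MONETA: if the input spiked, the weight is potentiated with a hypothetical probability <monospace>p_pot</monospace>; otherwise it is depressed with probability <monospace>p_dep</monospace>. Each update moves the weight by one step and saturates at the range limits.</p>
<preformat preformat-type="code">
```python
import random

W_MIN, W_MAX = 0, 255  # assumed unsigned 8-bit weight range


def stochastic_stdp_update(memory, row_addrs, input_spiked,
                           p_pot=0.5, p_dep=0.25, step=1, rng=None):
    """Read-modify-write update of one filter's weights held in memory.

    memory[addr] stores an 8-bit weight. input_spiked[i] tells whether the
    input feeding row_addrs[i] spiked before the output spike (potentiation
    candidate) or not (depression candidate).
    """
    if rng is None:
        rng = random.Random(0)
    for addr, spiked in zip(row_addrs, input_spiked):
        w = memory[addr]                      # read
        if spiked and p_pot > rng.random():
            w = min(W_MAX, w + step)          # potentiate, saturating at W_MAX
        elif not spiked and p_dep > rng.random():
            w = max(W_MIN, w - step)          # depress, saturating at W_MIN
        memory[addr] = w                      # write back
    return memory


weights = [255, 0, 100, 100]
print(stochastic_stdp_update(weights, [0, 1, 2, 3], [True, False, True, False],
                             p_pot=1.0, p_dep=1.0))  # [255, 0, 101, 99]
```
</preformat>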
<p>There are two phases in our design: the inference phase and the weight update phase. The inference mode consists of the inference phase only, whereas the learning mode consists of both phases. More precisely, during the inference phase of the learning mode, the central STDP controller collects data in the filter-update table while the other modules operate as they do in the inference mode. After completing inference for the scheduled cycles, MONETA starts the weight update phase and updates the weights.</p>
<sec>
<title>3.1. Proposed SNN Inference Methodology</title>
<p>An SNN receives its input as spikes. The spike frequency is set by each pixel&#x00027;s brightness and ranges from <italic>f</italic><sub>spike-min</sub> to <italic>f</italic><sub>spike-max</sub>. Assume that <inline-formula><mml:math id="M2"><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">spike</mml:mtext></mml:mstyle></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">spike-max</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> is the unit time-step and <italic>T</italic><sub>total</sub> is the total exposure time for an input image. ConvSNN (<xref ref-type="fig" rid="F1">Figure 1B</xref>) then receives all the IFMs, including the input image, for <inline-formula><mml:math id="M3"><mml:msub><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">spike-cycle</mml:mtext></mml:mstyle></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">total</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">spike</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> spike cycles and, in each cycle, computes <italic>V</italic><sub>mem</sub> for all neurons, i.e., the entire <italic>TV</italic><sub>mem</sub> tensor, based on the LIF neuron policy (<xref ref-type="fig" rid="F3">Figure 3A</xref>). Every <italic>V</italic><sub>mem</sub> value in the <italic>TV</italic><sub>mem</sub> tensor that exceeds the threshold generates an output spike in the OFM.</p>
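<p>A minimal Python sketch of these timing relations follows. It computes <italic>N</italic><sub>spike-cycle</sub> from <italic>T</italic><sub>total</sub> and <italic>f</italic><sub>spike-max</sub> and rate-codes pixel brightness into one binary spike per spike cycle. The linear brightness-to-frequency map, the per-cycle random spike draw, the example timing values, and the helper names <monospace>num_spike_cycles</monospace> and <monospace>encode_pixels</monospace> are illustrative assumptions, not the encoder used in MONETA.</p>
<preformat preformat-type="code">
```python
import random


def num_spike_cycles(t_total_s, f_spike_max_hz):
    """N_spike-cycle = T_total / T_spike, where T_spike = 1 / f_spike-max."""
    t_spike_s = 1.0 / f_spike_max_hz
    return round(t_total_s / t_spike_s)


def encode_pixels(pixels, f_min_hz, f_max_hz, n_cycles, seed=0):
    """Rate-code brightness values in [0, 1] into per-cycle binary spike trains.

    Assumption: brightness maps linearly onto [f_min, f_max], and in each
    spike cycle a pixel spikes with probability f * T_spike = f / f_max.
    """
    rng = random.Random(seed)
    trains = []
    for brightness in pixels:
        f_hz = f_min_hz + brightness * (f_max_hz - f_min_hz)
        p_spike = f_hz / f_max_hz
        trains.append([1 if p_spike > rng.random() else 0 for _ in range(n_cycles)])
    return trains


# Hypothetical values: 100 ms exposure, f_spike-min = 0 Hz, f_spike-max = 1 kHz
n = num_spike_cycles(0.1, 1000)
spikes = encode_pixels([1.0, 0.5, 0.0], 0, 1000, n)
print(n, [sum(t) for t in spikes])
```
</preformat>
<p>With these values, a fully bright pixel spikes in every one of the 100 cycles and a pixel at <italic>f</italic><sub>spike-min</sub> = 0 never spikes; intermediate brightness spikes roughly in proportion to its frequency.</p>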
<sec>
<title>3.1.1. Sequential Processing of Spike Cycles</title>
<p>Ideally, at spike cycle <italic>i</italic>, we generate a <italic>TV</italic><sub>mem</sub> tensor, which is used along with the TIFM tensor to compute <italic>TV</italic><sub>mem</sub> for cycle <italic>i</italic>&#x0002B;1. Hence, in each spike cycle, we must compute all <italic>V</italic><sub>mem</sub> values in the <italic>TV</italic><sub>mem</sub> tensor by processing all the IFMs in the TIFM tensor (<xref ref-type="fig" rid="F8">Figure 8A</xref>). Because all the IFMs are multiplied by the same weight matrix, processing them in parallel would require duplicating the weight memory by a factor of <inline-formula><mml:math id="M4"><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">R</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">S</mml:mtext></mml:mstyle></mml:mrow></mml:mfrac><mml:mo>&#x000D7;</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">C</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">S</mml:mtext></mml:mstyle></mml:mrow></mml:mfrac></mml:math></inline-formula> (where S is the stride), which is infeasible to store on-chip. Hence, we must process the IFMs serially, one per accelerator clock cycle (<italic>f</italic><sub>CLK</sub> &#x0003D; 1 GHz in our design), as shown in <xref ref-type="fig" rid="F8">Figure 8A</xref>. For each spike cycle, we serially read all IFMs, multiply each IFM with all the filters in one accelerator clock cycle, and serially compute all the elements of the <italic>TV</italic><sub>mem</sub> tensor. This is similar to operating a normal CNN. 
However, in ConvSNN the same TIFM tensor must be processed repeatedly for <italic>N</italic><sub>spike-cycle</sub> spike cycles. This approach therefore requires either reading the same IFMs from off-chip memory in every spike cycle, resulting in a significant (<inline-formula><mml:math id="M5"><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">total</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">spike</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow></mml:mfrac><mml:mo>&#x000D7;</mml:mo></mml:math></inline-formula>) increase in data movement, or storing the TIFM tensor on-chip in a large buffer. Moreover, a global on-chip buffer of size <italic>O</italic><sub>R</sub>&#x000D7; <italic>O</italic><sub>C</sub>&#x000D7; <italic>O</italic><sub>D</sub> is needed to store the <italic>TV</italic><sub>mem</sub> tensor generated over an entire spike cycle. While processing spike cycle <italic>i</italic>&#x0002B;1, the individual PIM blocks must read the <italic>TV</italic><sub>mem</sub> tensor written to this global buffer in spike cycle <italic>i</italic>, which increases on-chip data movement between the PIM cores.</p>
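<p>The overheads of this ordering can be made concrete with simple arithmetic. The sketch below counts the weight-duplication factor (<italic>I</italic><sub>R</sub>/S) &#x000D7; (<italic>I</italic><sub>C</sub>/S), the off-chip IFM reads when the TIFM tensor is re-read in every spike cycle, and the number of entries in the global <italic>TV</italic><sub>mem</sub> buffer. The helper name and the layer dimensions are hypothetical.</p>
<preformat preformat-type="code">
```python
def sequential_spike_cycle_costs(i_r, i_c, stride, o_r, o_c, o_d, n_spike_cycle):
    """Overheads of processing spike cycles sequentially (Figure 8A ordering).

    Hypothetical helper; assumes I_R and I_C are divisible by the stride.
    """
    n_ifms = (i_r // stride) * (i_c // stride)
    return {
        # weight copies needed to process all IFMs of one spike cycle in parallel
        "weight_copies": n_ifms,
        # off-chip IFM reads when the TIFM tensor is not buffered on-chip
        "offchip_ifm_reads": n_ifms * n_spike_cycle,
        # V_mem entries held in the global TV_mem buffer between spike cycles
        "global_vmem_entries": o_r * o_c * o_d,
    }


# Hypothetical 32 x 32 layer, stride 1, 64 filters, 100 spike cycles
print(sequential_spike_cycle_costs(32, 32, 1, 32, 32, 64, 100))
```
</preformat>
<p>For these values the sketch reports 1,024 weight copies, 102,400 off-chip IFM reads (compared with 1,024 if each IFM were read only once), and 65,536 global buffer entries.</p>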
<fig id="F8" position="float">
<label>Figure 8</label>
<caption><p>Inference approaches: <bold>(A)</bold> sequential processing of spike cycles: all input feature maps (IFMs) in the total IFM (TIFM) tensor are processed serially for a given spike cycle to generate the entire <italic>TV</italic><sub>mem</sub> tensor. <bold>(B)</bold> sequential processing of IFMs: each IFM is processed serially for <inline-formula><mml:math id="M1"><mml:msub><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">spike-cycle</mml:mtext></mml:mstyle></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">total</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">spike</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> spike cycles before the next IFM is processed.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-775457-g0008.tif"/>
</fig>
</sec>
<sec>
<title>3.1.2. Sequential Processing of IFMs</title>
<p>We propose to reorder the IFM processing as shown in <xref ref-type="fig" rid="F8">Figure 8B</xref>. We first read one IFM and serially compute all <italic>V</italic><sub>mem</sub> values generated by that IFM over all <italic>N</italic><sub>spike-cycle</sub> spike cycles. Note that all these <italic>V</italic><sub>mem</sub> values can now be computed in <italic>N</italic><sub>spike-cycle</sub> accelerator clock cycles. Moreover, because the IFM remains constant, the <italic>V</italic><sub>mem</sub> values for successive spike cycles can be accumulated locally within the PIM block, eliminating the need for a global <italic>TV</italic><sub>mem</sub> buffer and the associated data movement. In addition, serially processing all spike cycles for a given IFM eliminates repeated reads of the entire TIFM tensor, thereby reducing off-chip data movement.</p>
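<p>To illustrate why this reordering does not change inference results, the sketch below runs both loop orders on a toy layer in which each IFM drives a single neuron with a constant input current (its VMM result), then compares the spike trains and the number of IFM reads. The LIF update form, the leak and threshold values, and all function names are assumptions for illustration; weight updates are not modeled here.</p>
<preformat preformat-type="code">
```python
def lif_update(v_mem, current, leak, v_th):
    """One spike cycle of an assumed LIF update: integrate, leak, fire and reset."""
    v_mem = v_mem + current - leak * v_mem
    if v_mem > v_th:
        return 0.0, 1
    return v_mem, 0


def order_spike_cycles_first(currents, n_cycles, leak=0.1, v_th=1.0):
    """Figure 8A: outer loop over spike cycles, V_mem kept in a global buffer."""
    tv_mem = [0.0] * len(currents)              # global TV_mem buffer
    trains = [[] for _ in currents]
    ifm_reads = 0
    for _ in range(n_cycles):
        for k, current in enumerate(currents):  # every IFM is re-read each cycle
            ifm_reads += 1
            tv_mem[k], spike = lif_update(tv_mem[k], current, leak, v_th)
            trains[k].append(spike)
    return trains, ifm_reads


def order_ifms_first(currents, n_cycles, leak=0.1, v_th=1.0):
    """Figure 8B: outer loop over IFMs, V_mem accumulated locally."""
    trains, ifm_reads = [], 0
    for current in currents:                    # each IFM is read once
        ifm_reads += 1
        v_mem, train = 0.0, []                  # local register in the PIM block
        for _ in range(n_cycles):
            v_mem, spike = lif_update(v_mem, current, leak, v_th)
            train.append(spike)
        trains.append(train)
    return trains, ifm_reads


currents = [0.3, 0.6, 1.2]                      # hypothetical per-IFM VMM results
a, reads_a = order_spike_cycles_first(currents, 20)
b, reads_b = order_ifms_first(currents, 20)
print(a == b, reads_a, reads_b)
```
</preformat>
<p>The script prints <monospace>True 60 3</monospace>: both orders yield identical spike trains, but the reordered loop reads each IFM once instead of once per spike cycle.</p>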
</sec>
</sec>
<sec>
<title>3.2. Hardware Support for Inference</title>
<sec>
<title>3.2.1. Synaptic Core</title>
<p>The synaptic core performs distributed computation of <italic>V</italic><sub>mem</sub> and generates the output spikes. For inference, it uses synaptic arrays (weight storage), routers, and neuron modules. Because the weight matrix is distributed across multiple synaptic cores, each synaptic core holds a subarray of dimension <inline-formula><mml:math id="M6"><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">D</mml:mtext></mml:mstyle></mml:mrow></mml:msub><mml:mo>&#x000D7;</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>O</mml:mi></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">D</mml:mtext></mml:mstyle></mml:mrow></mml:msub><mml:mo>&#x000D7;</mml:mo><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">weight&#x00027;s bit-width</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">&#x00023; of synaptic cores</mml:mtext></mml:mstyle></mml:mrow></mml:mfrac></mml:math></inline-formula> and calculates the matrix multiplication results for <inline-formula><mml:math id="M7"><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>O</mml:mi></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">D</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">&#x00023; of synaptic cores</mml:mtext></mml:mstyle></mml:mrow></mml:mfrac></mml:math></inline-formula> filters.</p>
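<p>The per-core partition can be computed as sketched below. The 8-bit weights and 8 synaptic cores match the configuration described elsewhere in this article, while reading <italic>I</italic><sub>D</sub> = 576 as a flattened 3 &#x000D7; 3 &#x000D7; 64 input patch is a hypothetical example, as is the helper name.</p>
<preformat preformat-type="code">
```python
def synaptic_core_partition(i_d, o_d, n_cores, weight_bits=8):
    """Return (subarray rows, subarray columns) and filters per synaptic core.

    Subarray = I_D x (O_D * weight_bits / n_cores); filters = O_D / n_cores.
    Assumes O_D is divisible by the number of cores.
    """
    if o_d % n_cores:
        raise ValueError("O_D must be divisible by the number of synaptic cores")
    filters_per_core = o_d // n_cores
    return (i_d, filters_per_core * weight_bits), filters_per_core


# Hypothetical layer: I_D = 3 * 3 * 64 = 576 rows, O_D = 64 filters, 8 cores
print(synaptic_core_partition(576, 64, 8))      # ((576, 64), 8)
```
</preformat>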
</sec>
<sec>
<title>3.2.2. Synaptic Array</title>
<p>The synaptic array multiplies the IFMs by the synaptic weights to generate partial sums of the VMM result. The synaptic arrays use a PIM design based on sequential (row-by-row) read access to multiply IFMs and weights. The hierarchical network-on-chip (H-NoC) router then adds the partial sums and sends the VMM result to the neuron module. The synaptic array is implemented with an SRAM array, peripherals, and drivers (<xref ref-type="fig" rid="F9">Figure 9</xref>). Each 8-bit synaptic weight is stored in 8 consecutive SRAM cells in a row, and the leftmost SRAM cell holds the sign bit.</p>
<fig id="F9" position="float">
<label>Figure 9</label>
<caption><p>The synaptic array architecture.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-775457-g0009.tif"/>
</fig>
<p>The synaptic array receives the input spikes of the IFM on the row-wise word-line (RWL) port. Because the input spikes are sent in row-by-row order, the RWL peripheral uses a counter-based decoder to send them sequentially to the 8T-SRAM array. When the sense amplifier output for a CBL is 1, the CBL peripheral sends a partial sum of 1 to the H-NoC router and pre-discharges the CBL. The H-NoC connects the synaptic arrays, accumulates the partial sums, and sends the VMM result to the neuron module (Long et al., <xref ref-type="bibr" rid="B25">2019</xref>).</p>
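<p>The row-by-row VMM can be modeled functionally as below. Each row whose RWL carries a spike contributes, for every output column, a partial sum of 1 from each weight bit that is 1; the sketch scales these partial sums by bit position before accumulating them, standing in for the H-NoC. Treating the leftmost (sign) cell as the two&#x00027;s-complement bit of weight -128 and performing the bit-position scaling in the accumulator are our assumptions; the text does not specify either.</p>
<preformat preformat-type="code">
```python
# Bit weights of the 8 cells of one synapse, leftmost cell first.
# Assumption: two's-complement encoding, so the sign cell is worth -128.
PLACE_VALUES = [-128, 64, 32, 16, 8, 4, 2, 1]


def weight_to_cells(w):
    """Split an int8 weight into 8 SRAM cell values, leftmost = sign bit."""
    u = w % 256                                # two's-complement byte
    return [(u >> (7 - i)) % 2 for i in range(8)]


def pim_vmm(spikes, weights):
    """Row-by-row VMM: spikes[r] is 0 or 1, weights[r][c] is an int8 weight."""
    n_cols = len(weights[0])
    acc = [0] * n_cols                         # accumulator per output (H-NoC)
    for r, spike in enumerate(spikes):         # decoder drives one RWL per cycle
        if not spike:
            continue                           # no spike: no CBL is charged
        for c in range(n_cols):
            for i, cell in enumerate(weight_to_cells(weights[r][c])):
                if cell:                       # CBL charged, sense amp reads 1
                    acc[c] += PLACE_VALUES[i]  # partial sum 1, scaled by bit position
    return acc


spikes = [1, 0, 1]
weights = [[3, -2], [5, 7], [-128, 127]]
print(pim_vmm(spikes, weights))                # [-125, 125], the spike-weighted sum
```
</preformat>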
</sec>
<sec>
<title>3.2.3. Neuron Module</title>
<p>The neuron module receives the VMM result from the synaptic arrays, calculates <italic>V</italic><sub>mem</sub>, and generates the output spike. <xref ref-type="fig" rid="F10">Figure 10</xref> shows the neuron module architecture. The neuron module consists of <inline-formula><mml:math id="M8"><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>O</mml:mi></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">D</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">&#x00023; of synaptic cores</mml:mtext></mml:mstyle></mml:mrow></mml:mfrac></mml:math></inline-formula> neuron cells, a <italic>V</italic><sub>mem</sub> comparator, and a synapse update selector. Each neuron cell updates <italic>V</italic><sub>mem</sub> according to the LIF neuron dynamics and generates an output spike when <italic>V</italic><sub>mem</sub> exceeds <italic>V</italic><sub>th</sub>. The <italic>V</italic><sub>mem</sub> comparator and the synapse update selector are disabled during inference; these modules are discussed in section 3.4.</p>
<fig id="F10" position="float">
<label>Figure 10</label>
<caption><p>The LIF Neuron Module architecture.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-775457-g0010.tif"/>
</fig>
<p>To generate the output spike, the neuron cell receives the VMM result and updates <italic>V</italic><sub>mem</sub> according to the LIF neuron dynamics. Inside the <italic>V</italic><sub>mem</sub> calculation module (<xref ref-type="fig" rid="F10">Figure 10</xref>), the VMM result is added to the current <italic>V</italic><sub>mem</sub> while the compute enable signal is asserted. This <italic>V</italic><sub>mem</sub> accumulation takes <italic>I</italic><sub><italic>D</italic></sub> cycles, because the complete VMM computation requires <italic>I</italic><sub><italic>D</italic></sub> cycles with row-by-row access to the synaptic array. The leakage calculator then computes the leakage from the current <italic>V</italic><sub>mem</sub> and subtracts it from the output of the <italic>V</italic><sub>mem</sub> accumulator to produce the updated <italic>V</italic><sub>mem</sub>. Finally, if the updated <italic>V</italic><sub>mem</sub> is larger than <italic>V</italic><sub>th</sub>, the neuron cell generates an output spike and resets <italic>V</italic><sub>mem</sub> to 0. The updated <italic>V</italic><sub>mem</sub> is stored in the <italic>V</italic><sub>mem</sub> register inside the neuron cell for use in the next time step.</p>
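<p>The neuron cell sequence (accumulate over <italic>I</italic><sub><italic>D</italic></sub> cycles, subtract the leakage computed from the current <italic>V</italic><sub>mem</sub>, then fire and reset) can be sketched with integer arithmetic as follows. The shift-based leakage function, the threshold, and the input stream are hypothetical, since the text does not give the exact leakage formula.</p>
<preformat preformat-type="code">
```python
def neuron_cell_step(v_mem, vmm_partials, v_th, leak_shift=3):
    """One time step of the LIF neuron cell (integer sketch).

    vmm_partials: the I_D values delivered row by row while compute enable is high.
    Assumption: leakage = V_mem >> leak_shift, computed from the V_mem at the
    start of the step.
    """
    acc = v_mem
    for partial in vmm_partials:               # I_D accumulation cycles
        acc += partial
    updated = acc - (v_mem >> leak_shift)      # subtract leakage from the accumulator output
    if updated > v_th:
        return 0, 1                            # fire and reset V_mem to 0
    return updated, 0                          # keep V_mem in the register for the next step


v, out = 0, []
for step_partials in ([3, 4], [5, 1], [6, 2], [0, 1]):  # hypothetical VMM stream
    v, spike = neuron_cell_step(v, step_partials, v_th=15)
    out.append(spike)
print(out)                                     # [0, 0, 1, 0]
```
</preformat>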
</sec>
</sec>
<sec>
<title>3.3. Proposed PIM-Friendly STDP Learning Methodology</title>
<p>We argue that the proposed sequential processing of IFMs can bias STDP learning. In ConvSNN with cross-depth inhibition, each depth controls the weight update of one filter tensor. Suppose a neuron at location (<italic>x</italic><sub><italic>k</italic></sub>, <italic>y</italic><sub><italic>k</italic></sub>, <italic>z</italic><sub><italic>k</italic></sub>) fires; it then inhibits the firing of all other neurons across the depth, i.e., all neurons at (<italic>x</italic><sub><italic>k</italic></sub>, <italic>y</italic><sub><italic>k</italic></sub>) at every location along the <italic>z</italic>-axis. Ideally, the <italic>V</italic><sub>mem</sub> values of all the neurons in the same depth of the <italic>TV</italic><sub>mem</sub> tensor are calculated simultaneously, so for a given depth, the neuron with the maximum <italic>V</italic><sub>mem</sub> over the entire TIFM fires and controls the weight update of the associated filter. When IFMs are processed sequentially, however, the STDP-based updates of the filter weights depend on the order in which the IFMs are processed. For example, with the order shown in <xref ref-type="fig" rid="F8">Figure 8B</xref>, the IFM in the earliest position (the top-left segment of the TIFM tensor) can cause a neuron at a given depth to fire and thereby change the associated filter weights. The <italic>V</italic><sub>mem</sub> computation for later IFMs is then performed with the already modified filter weights, so these IFMs have less impact on overall learning. This leads to an undesired sequential bias in STDP learning.</p>
<p>We address this problem by ensuring that, at each depth, the neuron with the maximum <italic>V</italic><sub>mem</sub> over all IFMs controls the STDP-based update of the corresponding filter weights (shown in <xref ref-type="fig" rid="F1">Figure 1B</xref>). To do so, we maintain a central filter-update table that stores, for each filter, a running maximum of <italic>V</italic><sub>mem</sub> and the corresponding IFM number (<xref ref-type="fig" rid="F11">Figure 11A</xref>). While processing the &#x0201C;<italic>i</italic>th&#x0201D; IFM over all spike cycles, we compute <italic>V</italic><sub>mem</sub>, fire a neuron (as required), and reset <italic>V</italic><sub>mem</sub> for all other cross-depth neurons, but we do not initiate a weight update. Instead, we estimate the <italic>V</italic><sub>mem</sub> values of the neurons at all depths due to the &#x0201C;<italic>i</italic>th&#x0201D; IFM. If, at a given depth, the <italic>V</italic><sub>mem</sub> generated by the &#x0201C;<italic>i</italic>th&#x0201D; IFM is higher than the maximum <italic>V</italic><sub>mem</sub> stored in the table for the corresponding filter, we update the table to record that the &#x0201C;<italic>i</italic>th&#x0201D; IFM produces the maximum <italic>V</italic><sub>mem</sub> for this filter. The table is complete once all the IFMs have been processed. We then present all the IFMs once more and update the filter weights according to the filter-update table (<xref ref-type="fig" rid="F11">Figure 11B</xref>). The overhead is the cost of processing the TIFM tensor twice: once to generate the filter-update table and once to update the weights (<xref ref-type="fig" rid="F3">Figure 3B</xref>). In return, our PIM-friendly STDP learning trains the weights with the STDP algorithm without substantial IFM movement and without the bias caused by sequential IFM processing.</p>
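<p>The two-pass procedure can be sketched in Python as follows. The first pass keeps a running maximum of <italic>V</italic><sub>mem</sub> per filter together with the IFM number that produced it; the second pass replays the IFMs and lets only the recorded IFM update each filter, so the winner no longer depends on processing order. The data layout, the tie-breaking rule (the earlier IFM wins), the example values, and the callback <monospace>update_filter</monospace> are hypothetical, and the STDP update itself is abstracted away.</p>
<preformat preformat-type="code">
```python
def build_filter_update_table(vmem_per_ifm):
    """First pass over the TIFM.

    vmem_per_ifm[i][f] is the V_mem that IFM i produces at depth (filter) f.
    Returns {filter: (max V_mem, IFM number)}; ties keep the earlier IFM.
    """
    table = {}
    for ifm, vmems in enumerate(vmem_per_ifm):
        for filt, v in enumerate(vmems):
            if filt not in table or v > table[filt][0]:
                table[filt] = (v, ifm)
    return table


def weight_update_pass(n_ifms, table, update_filter):
    """Second pass: replay the IFMs; each filter is updated only by its winner."""
    winners = {ifm: [] for ifm in range(n_ifms)}
    for filt, (_, ifm) in table.items():
        winners[ifm].append(filt)
    for ifm in range(n_ifms):
        for filt in sorted(winners[ifm]):
            update_filter(filt, ifm)          # e.g., an STDP update of filter filt


vmem = [[0.2, 0.9], [0.7, 0.1], [0.5, 0.95]]  # hypothetical V_mem values
table = build_filter_update_table(vmem)
log = []
weight_update_pass(len(vmem), table, lambda f, i: log.append((f, i)))
print(table, log)
```
</preformat>
<p>Here filter 0 is updated only by IFM 1 and filter 1 only by IFM 2, regardless of the order in which the first pass visits the IFMs.</p>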
<fig id="F11" position="float">
<label>Figure 11</label>
<caption><p><bold>(A)</bold> Filter-update table <bold>(B)</bold> relation between IFM number and target filter based on the filter-update table.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-775457-g0011.tif"/>
</fig>
</sec>
<sec>
<title>3.4. Hardware Support for Learning</title>
<sec>
<title>3.4.1. Synaptic Core</title>
<p>In the learning mode, the synaptic core uses synaptic arrays (weight storage), routers, neuron modules, and weight update modules. The weight update modules are power-gated in the inference mode and used only in the learning mode. The synaptic arrays and neuron modules calculate <italic>V</italic><sub>mem</sub> and generate the output spike. When an output spike is generated, the weights are updated under the control of the central STDP controller.</p>
</sec>
<sec>
<title>3.4.2. Synaptic Array</title>
<p>The SRAM array in each synaptic array is an 8T-SRAM array. 8T-SRAM supports transposable read and write, which enables parallel weight updates (Seo et al., <xref ref-type="bibr" rid="B36">2011</xref>; Kim et al., <xref ref-type="bibr" rid="B20">2020</xref>). <xref ref-type="fig" rid="F9">Figure 9</xref> shows the synaptic array and its connections with other modules. Each 8T-SRAM cell consists of a 6T-SRAM cell and two PMOS transistors, M1 and M2. The 6T-SRAM cell stores one bit of the synapse weight, and M1 and M2 connect the RWL, the stored weight bit, and the column-wise bitline (CBL) for the matrix multiplication. When the RWL carries a spike and the stored weight bit is 1, the CBL is charged.</p>
<p>During the weight update phase, the synaptic array performs the same inference operation until the neuron module generates an output spike. When the output spike is generated, the synaptic array receives the synapse number from the neuron module and decodes it into column-wise wordlines (CWLs) that read the data stored in the 6T-SRAM cells of the 8T-SRAM array. A total of 8 CWLs are asserted sequentially, one per clock, to read the 8-bit synapse weight. Each CWL runs vertically through the 6T-SRAM cells, and the data are read out horizontally on RBL and RBLB. The RBL peripheral reads one bit of the synapse weight per clock and sends it to the weight update module. After the weight update module computes the new synaptic weight, the RBL peripheral receives the updated weight and writes it back to the 6T-SRAM cells.</p>
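<p>A functional sketch of this read-modify-write sequence follows. It reads the eight cells of one weight over eight clocks, converts them to an integer, applies an update function, and writes the result back one bit per clock. The two&#x00027;s-complement interpretation of the sign cell and the callback interface are assumptions, and the circuit-level behavior of RBL and RBLB is not modeled.</p>
<preformat preformat-type="code">
```python
def cells_to_weight(cells):
    """Reassemble an int8 weight; leftmost cell is the sign bit (two's complement assumed)."""
    u = 0
    for bit in cells:
        u = u * 2 + bit
    return u - 256 if cells[0] else u


def weight_to_cells(w):
    """Split an int8 weight into 8 cell values, leftmost = sign bit."""
    u = w % 256
    return [(u >> (7 - i)) % 2 for i in range(8)]


def update_synapse(cells, update):
    """Read-modify-write of one synapse through the transposed port.

    cells: the 8 6T-SRAM cells of one weight; update: new-weight function.
    """
    weight_queue = []
    for clk in range(8):                    # CWL[clk] asserted, bit sensed on RBL/RBLB
        weight_queue.append(cells[clk])
    new_cells = weight_to_cells(update(cells_to_weight(weight_queue)))
    for clk in range(8):                    # RBL peripheral writes one bit per clock
        cells[clk] = new_cells[clk]
    return cells


cells = weight_to_cells(-3)
update_synapse(cells, lambda w: w + 1)      # hypothetical potentiation step
print(cells_to_weight(cells))               # -2
```
</preformat>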
</sec>
<sec>
<title>3.4.3. Neuron Module</title>
<p>In the learning mode, the <italic>V</italic><sub>mem</sub> comparator and the synapse update selector are also used. During the inference phase of the learning mode, the neuron module uses the <italic>V</italic><sub>mem</sub> comparator to find the maximum <italic>V</italic><sub>mem</sub> and, for each IFM, sends it together with the corresponding filter number to the central STDP controller. In the weight update phase, the neuron module receives the active filter number from the central STDP controller, and only the selected filter calculates <italic>V</italic><sub>mem</sub>. The neuron cell of the selected filter calculates <italic>V</italic><sub>mem</sub> and generates an output spike when <italic>V</italic><sub>mem</sub> exceeds the threshold voltage. When it generates the output spike, the neuron cell resets <italic>V</italic><sub>mem</sub> to 0 and holds the <italic>V</italic><sub>mem</sub> calculation while the synaptic array updates the weights. The synapse update selector receives the output spike, generates the synapse number, and sends it to the synaptic array.</p>
</sec>
<sec>
<title>3.4.4. Weight Update Module</title>
<p><xref ref-type="fig" rid="F12">Figure 12A</xref> shows the architecture of the weight update module. The weight update module calculates the updated weights from the current weights and the timing information using the stochastic STDP rule. The timing information determines the probability of potentiation or depotentiation (<xref ref-type="fig" rid="F3">Figure 3B</xref>). The configuration register (configs register) stores the programmable configurations for the timing queue control and the stochastic STDP rule. The pseudo-random number generator (PRNG), implemented with linear-feedback shift registers (LFSRs), generates the random number (RND) that decides whether the weights are updated. A counter controls when the spike history queue is shifted.</p>
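<p>As an illustration of the PRNG, the sketch below implements a generic 16-bit maximal-length Fibonacci LFSR. The article does not state the LFSR width, taps, or seed used in MONETA; the polynomial and seed here are a common textbook choice, and taking the low 8 bits of the state as RND is an assumption.</p>
<preformat preformat-type="code">
```python
def lfsr16(seed=0xACE1):
    """16-bit Fibonacci LFSR with taps 16, 14, 13, 11 (x^16 + x^14 + x^13 + x^11 + 1).

    A nonzero seed cycles through all 65535 nonzero states; seed 0 would lock up.
    """
    if seed == 0:
        raise ValueError("LFSR seed must be nonzero")
    state = seed
    while True:
        feedback = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) % 2
        state = (state >> 1) + feedback * 0x8000   # shift right, feedback into the MSB
        yield state


gen = lfsr16()
rnd = [next(gen) % 256 for _ in range(4)]          # e.g., 8-bit RND values
print(rnd)
```
</preformat>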
<fig id="F12" position="float">
<label>Figure 12</label>
<caption><p><bold>(A)</bold> The weight update module architecture. <bold>(B)</bold> The spike history queue function graph. <bold>(C)</bold> The weight calculator architecture. <bold>(D)</bold> The update decision module&#x00027;s state machine for stochastic STDP.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-775457-g0012.tif"/>
</fig>
<p>The spike history queue receives the RWL signal from the synaptic array and stores the input spike history (<xref ref-type="fig" rid="F12">Figure 12B</xref>). T[3] is connected to the RWL and is set to 1 when the RWL is 1. When the counter reaches the threshold time (T<sub>threshold</sub>), the queue is shifted from T[3:1] to T[2:0] and T[3] is reset to 0. The spike history queue thus stores the recent input spike history, and the period it covers can be controlled by changing T<sub>threshold</sub> in the configs register.</p>
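<p>The queue behavior can be modeled cycle by cycle as below, with index 3 holding the newest history bit and index 0 the oldest. The text does not specify whether the RWL is sampled before or after a shift occurring in the same clock; the sketch samples it first, which is an assumption, as are the class name and the example threshold.</p>
<preformat preformat-type="code">
```python
class SpikeHistoryQueue:
    """Four-entry input spike history: t[3] is the newest bit, t[0] the oldest."""

    def __init__(self, t_threshold):
        self.t = [0, 0, 0, 0]
        self.t_threshold = t_threshold       # programmed through the configs register
        self.counter = 0

    def clock(self, rwl):
        if rwl:
            self.t[3] = 1                    # T[3] is set when the RWL carries a spike
        self.counter += 1
        if self.counter == self.t_threshold:
            self.counter = 0
            # push T[3:1] into T[2:0] and clear T[3]
            self.t = [self.t[1], self.t[2], self.t[3], 0]


q = SpikeHistoryQueue(t_threshold=2)
for rwl in [1, 0, 0, 0, 1]:
    q.clock(rwl)
print(q.t)                                   # [0, 1, 0, 1]
```
</preformat>
<p>The spike seen in the first clock has moved two positions toward T[0] after two shifts, and the spike in the fifth clock sits in T[3].</p>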
<p>The weight queue receives the current weight from the synaptic array one bit per clock until it has received all 8 bits of the synapse weight. The weight calculator then calculates the updated weight from the spike history and the current weight (<xref ref-type="fig" rid="F12">Figure 12C</xref>). The weight calculator includes the update decision module and the weight computation module. The update decision module determines whether to update the weight according to the stochastic STDP rule. When the update (UP) signal is 1 and all bits of T[3:0] are 0, the weight computation module decreases the weight. When the UP signal is 1 and at least one bit of T[3:0] is 1, the weight computation module increases the weight. When the UP signal is 0, the weight computation module leaves the weight unchanged. After calculating the updated weight, the weight update module sends it to the synaptic array one bit per clock.</p>
<p>As shown in <xref ref-type="fig" rid="F12">Figure 12D</xref>, the state machine of the update decision module implements the stochastic STDP rule. The update decision module receives the input spike history, the RND, and the configurations (configs). The configurations include the potentiation thresholds (P1, P2, P3, and P4) and the depotentiation threshold (PD). The spike history determines which potentiation or depotentiation threshold is used. UP is set to 1 when the RND is smaller than the selected threshold and to 0 when the RND is equal to or larger than the threshold.</p>
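<p>The decision rule and the weight computation described above are combined in the following sketch. How the spike history selects among P1 to P4, the unit step size, the saturation at the int8 limits, and the threshold values are assumptions for illustration; the text specifies only that UP is 1 when RND is below the selected threshold, that an all-zero history then leads to depotentiation, and that any set history bit leads to potentiation.</p>
<preformat preformat-type="code">
```python
def select_threshold(t, cfg):
    """Pick the threshold from the spike history t (t[3] newest).

    Hypothetical mapping: the newest set bit selects P1 (T[3]) ... P4 (T[0]);
    an empty history selects PD.
    """
    for pos, name in ((3, "P1"), (2, "P2"), (1, "P3"), (0, "P4")):
        if t[pos]:
            return cfg[name]
    return cfg["PD"]


def stochastic_stdp(weight, t, rnd, cfg, w_min=-128, w_max=127):
    """Return the new weight; a step of 1 with int8 saturation is assumed."""
    threshold = select_threshold(t, cfg)
    if not threshold > rnd:                 # RND >= threshold: UP = 0
        return weight                       # weight unchanged
    if any(t):                              # UP = 1 and some T bit set: potentiate
        return min(weight + 1, w_max)
    return max(weight - 1, w_min)           # UP = 1 and T[3:0] all 0: depotentiate


cfg = {"P1": 200, "P2": 150, "P3": 100, "P4": 50, "PD": 30}   # hypothetical 8-bit thresholds
print(stochastic_stdp(10, [0, 0, 0, 1], 120, cfg),   # P1 = 200 > 120: potentiate
      stochastic_stdp(10, [0, 0, 0, 0], 120, cfg))   # PD = 30 is not > 120: unchanged
```
</preformat>
<p>The script prints <monospace>11 10</monospace>. Because RND is drawn afresh for each update, a recent input spike (larger threshold) leads to potentiation more often than an older one.</p>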
</sec>
<sec>
<title>3.4.5. Central STDP Controller</title>
<p>The central STDP controller includes an SRAM that stores the filter update table, a <italic>V</italic><sub>mem</sub> comparator that finds the maximum <italic>V</italic><sub>mem</sub> across MONETA, and the control logic for MONETA. During the learning mode, the central STDP controller coordinates the design and determines which weights to update. <xref ref-type="fig" rid="F13">Figure 13</xref> shows the architecture of the central STDP controller.</p>
<fig id="F13" position="float">
<label>Figure 13</label>
<caption><p>The central STDP controller architecture.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-775457-g0013.tif"/>
</fig>
<p>During the inference phase of the learning mode, MONETA fills the filter update table. To generate its entries, the synaptic cores receive the IFM and calculate the membrane potential. During this process, the neuron modules calculate <italic>V</italic><sub>mem</sub> and send the maximum <italic>V</italic><sub>mem</sub> and the corresponding filter number to the central STDP controller. The central STDP controller compares the current IFM&#x00027;s maximum <italic>V</italic><sub>mem</sub> with the maximum <italic>V</italic><sub>mem</sub> of earlier IFMs stored in the filter update table. If the current IFM&#x00027;s maximum <italic>V</italic><sub>mem</sub> is larger, the table entry is overwritten with the current IFM number and the new maximum <italic>V</italic><sub>mem</sub>. When the weight update phase starts, the central STDP controller reads, for each filter, the IFM that produced the maximum <italic>V</italic><sub>mem</sub> from off-chip memory. The IFM is applied to the target synaptic core to re-compute the corresponding <italic>V</italic><sub>mem</sub>, generate the output spikes, and update the weights using the weight update modules. Once all the filters have been updated for a TIFM, the filter update table is reset.</p>
<p><xref ref-type="fig" rid="F14">Figure 14</xref> shows the data flow of the central STDP controller. The central STDP controller receives the mode signal from the user to determine the mode of the synaptic cores and controls their function by sending them the phase and mode signals. During the inference phase of the learning mode, the central STDP controller receives the maximum <italic>V</italic><sub>mem</sub> and the filter number from the synaptic cores, and the current IFM number from the off-chip memory, to generate the filter update table. In the weight update phase, the control logic requests an IFM number from the filter update table in the SRAM with the request (Req) signal. The filter update table sends the target IFM number to the off-chip memory and the target filter number to the target synaptic core. The received-IFM signal from the off-chip memory indicates whether the IFM has been delivered during the weight update phase; because the central STDP controller sends non-sequential IFM requests to the off-chip memory, MONETA must remain idle until the IFM arrives. The control logic also receives the filter number and then generates the mode signal for the target synaptic core. Only the target synaptic core is enabled by the mode signal to update its weights; the other synaptic cores stay idle because their synapse weights do not need updating. This process is repeated for all the filters in the filter update table.</p>
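<p>The two phases of the controller can be summarized behaviorally as follows. The sketch fills the filter update table from per-IFM reports, serves each filter by fetching its recorded IFM and enabling only the target core, and then resets the table. The callback interfaces, the ascending filter order, and the example reports are assumptions; the Req handshake, timing, and the mode and phase signals are not modeled.</p>
<preformat preformat-type="code">
```python
def central_stdp_controller(reports, fetch_ifm, update_core):
    """Two-phase control flow of the central STDP controller (behavioral sketch).

    reports: (ifm_number, filter_number, max_vmem) tuples sent by the neuron
    modules during the inference phase. fetch_ifm(ifm) models the
    non-sequential off-chip request; update_core(filt, ifm_data) models enabling
    only the target synaptic core. Both callbacks are hypothetical interfaces.
    """
    table = {}                                    # filter -> (max V_mem, IFM number)
    for ifm, filt, vmax in reports:               # inference phase: fill the table
        if filt not in table or vmax > table[filt][0]:
            table[filt] = (vmax, ifm)
    served = []
    for filt, (_, ifm) in sorted(table.items()):  # weight update phase
        ifm_data = fetch_ifm(ifm)                 # controller idles until the IFM arrives
        update_core(filt, ifm_data)               # other synaptic cores stay idle
        served.append((filt, ifm))
    table.clear()                                 # reset once the TIFM is finished
    return served


fetched = []


def fetch(ifm):
    fetched.append(ifm)                           # record the off-chip request order
    return [0] * 4                                # placeholder IFM payload


reports = [(0, 2, 0.5), (1, 2, 0.9), (2, 0, 0.4), (3, 2, 0.7)]
served = central_stdp_controller(reports, fetch, lambda filt, data: None)
print(served, fetched)                            # [(0, 2), (2, 1)] [2, 1]
```
</preformat>
<p>Only two off-chip IFM fetches are issued, one per filter that appears in the table, and they are not in IFM order, which is why the controller must wait for the received-IFM signal.</p>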
<fig id="F14" position="float">
<label>Figure 14</label>
<caption><p>The data flow graph of the central STDP controller.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-775457-g0014.tif"/>
</fig>
</sec>
</sec>
<sec>
<title>3.5. Hybrid Network With Coupled Supervised and Unsupervised Learning</title>
<p>As discussed in section 2.4, hybrid networks can improve the accuracy of the STDP network. In this article, we therefore use a hybrid supervised-unsupervised learning methodology similar to that of Chakraborty et al. (<xref ref-type="bibr" rid="B5">2021</xref>). The supervised part uses surrogate gradient-based training of SNNs (Wu et al., <xref ref-type="bibr" rid="B47">2018</xref>; Neftci et al., <xref ref-type="bibr" rid="B30">2019</xref>). The supervised weights are trained off-chip and then loaded onto the synaptic arrays, and these layers are set to the inference mode (marked blue in <xref ref-type="fig" rid="F6">Figure 6C</xref>). The unsupervised part uses our modified STDP-based learning; these layers are set to the learning mode, and their weights are trained on-chip. Because our design supports online learning, half of the convolutional layers have fixed weights obtained by supervised learning, and the other half have flexible weights obtained by unsupervised online learning (marked orange in <xref ref-type="fig" rid="F6">Figure 6C</xref>).</p>
</sec>
</sec>
<sec id="s4">
<title>4. Simulation Results</title>
<sec>
<title>4.1. Configurations of Simulated ConvSNN</title>
<p>As discussed before, we use both homogeneous and hybrid networks. Hybrid networks with supervised training can improve the accuracy of the STDP network, so we simulate a hybrid supervised-unsupervised learning methodology similar to that of Chakraborty et al. (<xref ref-type="bibr" rid="B5">2021</xref>).</p>
<p><bold>Configurations:</bold> To evaluate our hybrid spiking neural network on image classification with the MNIST and CIFAR-10 datasets, we compare the following network types:</p>
<list list-type="bullet">
<list-item><p><bold>Standard STDP model (Type 1):</bold> we use the 4-layer ConvSNN model trained with the standard STDP rule (Bi and Poo, <xref ref-type="bibr" rid="B2">1998</xref>).</p></list-item>
<list-item><p><bold>PIM-friendly STDP model (Type 2):</bold> we use the 4-layer ConvSNN model and train it with the modified STDP rule explained in section 3.3.</p></list-item>
<list-item><p><bold>Fully Backpropagated ConvSNN model (Type 3):</bold> for this model, we use another backpropagated ConvSNN block in place of the STDP ConvSNN block (orange block in <xref ref-type="fig" rid="F6">Figure 6</xref>), so the entire model is trained with a surrogate gradient, without any unsupervised STDP block.</p></list-item>
<list-item><p><bold>Hybrid model with standard STDP model (Type 4):</bold> for this model, we use the hybrid network shown in <xref ref-type="fig" rid="F6">Figure 6C</xref>, but train the STDP ConvSNN block (orange block) with the standard STDP learning rule.</p></list-item>
<list-item><p><bold>Hybrid model with PIM-friendly STDP model (Type 5):</bold> this is the proposed model, which hybridizes STDP-based and backpropagation-based ConvSNN blocks. The STDP block is trained with the modified STDP rule discussed in section 3.3.</p></list-item>
</list>
<p><bold>Types 1&#x02013;3</bold> are based on the homogeneous network architecture, and all the weights in each of these types are trained with a single training algorithm. <bold>Types 4&#x02013;5</bold> are based on the hybrid network architecture discussed in section 2.4. In <bold>Types 4&#x02013;5</bold>, half of the weights are trained with the backpropagation algorithm and the other half with a type-specific STDP algorithm (standard STDP for Type 4 and PIM-friendly STDP for Type 5).</p>
</sec>
<sec>
<title>4.2. Hardware Architectures for Simulation</title>
<p><xref ref-type="table" rid="T1">Table 1</xref> shows the simulated ConvSNN architecture, with four convolutional (CONV) layers and one fully-connected (FC) layer. We use 8-bit precision for the weights and 4-bit precision for the input spikes. The total on-chip memory for the synaptic cores is determined by the filter size of the <bold>CONV4</bold> layer in the homogeneous network architecture; therefore, two MONETA chips are needed for the <bold>CONV4</bold> layer in the hybrid network. We divide the total capacity into 8 synaptic cores, each with nine 128 &#x000D7; 128 synaptic arrays. We assume that on-chip STDP is performed layer by layer, because the OFMs of one layer are used to train the next layer. Note that the memory capacity is sufficient to map <bold>CONV1</bold>, <bold>CONV2</bold>, and <bold>CONV3</bold> onto the chip simultaneously during inference. We assume that the <bold>FC</bold> layer is implemented off-chip and connected to MONETA.</p>
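<p>The on-chip capacity follows from simple arithmetic: 8 cores &#x000D7; 9 arrays &#x000D7; 128 &#x000D7; 128 cells gives 1,179,648 bits (144 KiB) per chip. The sketch below also checks whether a layer fits, assuming one bit per SRAM cell and 8-bit weights stored as <italic>F</italic><sub>R</sub> &#x000D7; <italic>F</italic><sub>C</sub> &#x000D7; <italic>I</italic><sub>D</sub> &#x000D7; <italic>F</italic><sub>D</sub> values. The CONV2 input depth of 64 is an assumption, since <xref ref-type="table" rid="T1">Table 1</xref> lists only the filter dimensions.</p>
<preformat preformat-type="code">
```python
ARRAY_BITS = 128 * 128               # one 128 x 128 synaptic array, 1 bit per cell
CHIP_BITS = 8 * 9 * ARRAY_BITS       # 8 synaptic cores x nine arrays each


def layer_weight_bits(f_r, f_c, i_d, f_d, weight_bits=8):
    """Weight storage of one CONV layer: F_R x F_C x I_D x F_D weights."""
    return f_r * f_c * i_d * f_d * weight_bits


def chips_needed(bits):
    return -(-bits // CHIP_BITS)     # ceiling division


# Hypothetical: CONV2 of Table 1 (3 x 3 x 64 filters) reading a 64-deep input
conv2_bits = layer_weight_bits(3, 3, 64, 64)
print(CHIP_BITS, conv2_bits, chips_needed(conv2_bits))   # 1179648 294912 1
```
</preformat>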
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Architecture and parameters of the tested convolutional spiking neural network (ConvSNN) models.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Layers</bold></th>
<th valign="top" align="center" colspan="3"><bold>Homogeneous network architecture</bold></th>
<th valign="top" align="center" colspan="6" style="border-bottom: thin solid #000000;"><bold>Hybrid network architecture</bold></th>
</tr>
<tr>
<th/>
<th valign="top" align="center" colspan="3" style="border-bottom: thin solid #000000;"></th>
<th valign="top" align="center" colspan="3" style="border-bottom: thin solid #000000;"><bold>BP ConvSNN block</bold></th>
<th valign="top" align="center" colspan="3" style="border-bottom: thin solid #000000;"><bold>STDP ConvSNN block</bold></th>
</tr>
 <tr>
<th/>
<th valign="top" align="center"><italic><bold>F</bold></italic><sub><bold>R</bold></sub></th>
<th valign="top" align="center"><italic><bold>F</bold></italic><sub><bold>C</bold></sub></th>
<th valign="top" align="center"><italic><bold>F</bold></italic><sub><bold>D</bold></sub></th>
<th valign="top" align="center"><italic><bold>F</bold></italic><sub><bold>R</bold></sub></th>
<th valign="top" align="center"><italic><bold>F</bold></italic><sub><bold>C</bold></sub></th>
<th valign="top" align="center"><italic><bold>F</bold></italic><sub><bold>D</bold></sub></th>
<th valign="top" align="center"><italic><bold>F</bold></italic><sub><bold>R</bold></sub></th>
<th valign="top" align="center"><italic><bold>F</bold></italic><sub><bold>C</bold></sub></th>
<th valign="top" align="center"><italic><bold>F</bold></italic><sub><bold>D</bold></sub></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">CONV1</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">64</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">64</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">64</td>
</tr>
<tr>
<td valign="top" align="left">CONV2</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">64</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">64</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">64</td>
</tr>
<tr>
<td valign="top" align="left">CONV3</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">128</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">128</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">128</td>
</tr>
<tr>
<td valign="top" align="left">CONV4</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">128</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">128</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">128</td>
</tr>
<tr>
<td valign="top" align="left">FC</td>
<td valign="top" align="center"><italic><bold>F</bold></italic><sub><bold>R</bold></sub></td>
<td valign="top" align="center"><italic><bold>F</bold></italic><sub><bold>C</bold></sub></td>
<td valign="top" align="center"><italic><bold>F</bold></italic><sub><bold>D</bold></sub></td>
<td valign="top" align="center" colspan="2"><italic><bold>F</bold></italic><sub><bold>R</bold></sub></td>
<td valign="top" align="center" colspan="2"><italic><bold>F</bold></italic><sub><bold>C</bold></sub></td>
<td valign="top" align="center" colspan="2"><italic><bold>F</bold></italic><sub><bold>D</bold></sub></td>
</tr>
<tr>
<td/>
<td valign="top" align="center">1</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">512</td>
<td valign="top" align="center" colspan="2">1</td>
<td valign="top" align="center" colspan="2">1</td>
<td valign="top" align="center" colspan="2">1,024</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The hardware architecture of MONETA, with 8 synaptic cores and one central STDP controller, is implemented in 65 nm CMOS (<xref ref-type="fig" rid="F15">Figure 15</xref>). We used Virtuoso for the full-custom layout of the 128 &#x000D7; 128 SRAM sub-arrays and Innovus for automatic place and route (PnR) of the other logic blocks. Each synaptic core occupies 1.394 mm<sup>2</sup>, and the central STDP controller occupies 0.025 mm<sup>2</sup>. The throughput and power of the design are estimated from the layout after parasitic extraction.</p>
<fig id="F15" position="float">
<label>Figure 15</label>
<caption><p>Overview of the physical design of MONETA.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-775457-g0015.tif"/>
</fig>
</sec>
<sec>
<title>4.3. Accuracy Analysis</title>
<p>The ConvSNN shown in <xref ref-type="table" rid="T1">Table 1</xref> is simulated using ParallelSpikeSim, an open-source, GPU-accelerated SNN simulator (She et al., <xref ref-type="bibr" rid="B39">2019</xref>). The MNIST, CIFAR-10, and CIFAR-100 datasets are used for accuracy evaluation. All synapses use 8-bit weights. The CONV layers based on unsupervised learning are trained with STDP to cluster the inputs, and the CONV layers based on supervised learning are trained with the BPTT algorithm. The final FC layer is trained with stochastic gradient descent (SGD) to label the clusters with the appropriate classes. Image pixel intensities are converted into input spike frequencies in the range of 10&#x02013;50 Hz. We assume <inline-formula><mml:math id="M9"><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">total</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">unit</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mn>100</mml:mn></mml:math></inline-formula>, i.e., each image is presented to the network for 100 time steps. Each layer is trained on the entire training set for 5 epochs.</p>
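<p>One common way to realize the 10&#x02013;50 Hz pixel-to-rate mapping over 100 time steps is Bernoulli (Poisson-like) rate coding. The sketch below is a hypothetical illustration, not the encoder used in ParallelSpikeSim. The linear intensity mapping, the 1 ms simulation time step <monospace>dt</monospace>, and the function name <monospace>rate_encode</monospace> are all assumptions.</p>

```python
import numpy as np


def rate_encode(image, t_steps=100, f_min=10.0, f_max=50.0, dt=1e-3, seed=0):
    """Convert pixel intensities in [0, 1] into a binary spike train.

    Returns an array of shape (t_steps, *image.shape) holding 0 or 1.
    dt (seconds per time step) is an assumed value, not from the paper.
    """
    rng = np.random.default_rng(seed)
    img = np.clip(np.asarray(image, dtype=np.float64), 0.0, 1.0)
    rate_hz = f_min + (f_max - f_min) * img  # linear map to 10-50 Hz
    p_spike = rate_hz * dt                   # firing probability per step
    draws = rng.random((t_steps,) + img.shape)
    return (p_spike > draws).astype(np.uint8)


# A bright pixel (intensity 1.0) fires at about 50 Hz, i.e. about 5 spikes in 100 steps
spikes = rate_encode(np.ones((28, 28)))
print(spikes.shape, spikes.mean())
```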
<p><xref ref-type="table" rid="T2">Table 2</xref> shows the accuracy of each ConvSNN configuration. <bold>Type 2</bold> follows the data flow of MONETA (<xref ref-type="fig" rid="F8">Figures 8B</xref>, <xref ref-type="fig" rid="F11">11</xref>). On CIFAR-10, <bold>Type 2</bold> is only 1.63% less accurate than <bold>Type 1</bold>, a fully parallel software implementation of the network (as shown in <xref ref-type="fig" rid="F8">Figure 8A</xref>) using 8-bit precision. As mentioned before, the fully parallel implementation incurs <inline-formula><mml:math id="M10"><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">total</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">unit</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mn>100</mml:mn></mml:math></inline-formula> times more data movement than our design. The accuracy of the ConvSNN accelerated by MONETA (<bold>Type 2</bold>) is 10.18% lower than that of a spiking neural network with the same layer configuration trained by backpropagation (<bold>Type 3</bold>). On CIFAR-10 and CIFAR-100, however, both hybrid networks outperform the purely supervised network. The hybrid network combining a backpropagation-based ConvSNN block with PIM-friendly STDP learning (<bold>Type 5</bold>) is 1.4% more accurate than the fully backpropagation-trained ConvSNN (<bold>Type 3</bold>) on CIFAR-10, and only 1.11% less accurate than the hybrid network using the standard STDP model (<bold>Type 4</bold>).</p>
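<p>The CIFAR-10 accuracy gaps quoted above follow directly from the CIFAR-10 column of <xref ref-type="table" rid="T2">Table 2</xref>; the short check below recomputes them (the dictionary and helper are only a convenience for this check).</p>

```python
# CIFAR-10 accuracies (%) for network Types 1-5, copied from Table 2
cifar10 = {1: 67.88, 2: 66.25, 3: 76.43, 4: 78.94, 5: 77.83}


def gap(a, b):
    """Accuracy of Type a minus Type b, in percentage points (2 decimals)."""
    return round(cifar10[a] - cifar10[b], 2)


print(gap(1, 2))  # 1.63: fully parallel vs. MONETA data flow
print(gap(3, 2))  # 10.18: BP vs. PIM-friendly STDP
print(gap(5, 3))  # 1.4: hybrid (PIM-friendly STDP) vs. BP
print(gap(4, 5))  # 1.11: hybrid standard STDP vs. hybrid PIM-friendly STDP
```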
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Simulated network types and the results.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Type</bold></th>
<th valign="top" align="left"><bold>Learning algorithm</bold></th>
<th valign="top" align="center"><bold>Required</bold><break/> <bold>parameters</bold><break/> <bold>(Kb)</bold></th>
<th valign="top" align="center"><bold>Inference</bold><break/> <bold>throughput</bold><break/> <bold>(TOPS)</bold></th>
<th valign="top" align="center"><bold>Inference</bold><break/> <bold>Energy</bold><break/> <bold>efficiency</bold><break/> <bold>(TOPS/W)</bold></th>
<th valign="top" align="center"><bold>On-line</bold> <break/> <bold>learning</bold><break/> <bold>throughput</bold><break/> <bold>(TOPS)</bold></th>
<th valign="top" align="center"><bold>Learning</bold><break/> <bold>energy</bold><break/> <bold>efficiency</bold><break/> <bold>(TOPS/W)</bold></th>
<th valign="top" align="center"><bold>Accuracy</bold><break/> <bold>(CIFAR-100)</bold></th>
<th valign="top" align="center"><bold>Accuracy</bold><break/> <bold>(CIFAR-10)</bold></th>
<th valign="top" align="center"><bold>Accuracy</bold><break/> <bold>(MNIST)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left"><bold>1</bold></td>
<td valign="top" align="left">Standard STDP</td>
<td valign="top" align="center" rowspan="3">1,152</td>
<td valign="top" align="center" rowspan="3">2.304</td>
<td valign="top" align="center" rowspan="5">18.69</td>
<td valign="top" align="center">N/A</td>
<td valign="top" align="center">N/A</td>
<td valign="top" align="center">54.25</td>
<td valign="top" align="center">67.88</td>
<td valign="top" align="center">90.89</td>
</tr>
<tr>
<td valign="top" align="left"><bold>2</bold></td>
<td valign="top" align="left">PIM-friendly STDP</td>
<td valign="top" align="center">2.2</td>
<td valign="top" align="center">7.25</td>
<td valign="top" align="center">52.19</td>
<td valign="top" align="center">66.25</td>
<td valign="top" align="center">90.13</td>
</tr>
<tr>
<td valign="top" align="left"><bold>3</bold></td>
<td valign="top" align="left">Backpropagation (BP)</td>
<td valign="top" align="center">N/A</td>
<td valign="top" align="center">N/A</td>
<td valign="top" align="center">62.12</td>
<td valign="top" align="center">76.43</td>
<td valign="top" align="center">92.55</td>
</tr>
<tr>
<td valign="top" align="left"><bold>4</bold></td>
<td valign="top" align="left">BP &#x0002B; Standard STDP</td>
<td valign="top" align="center" rowspan="2">2,304</td>
<td valign="top" align="center" rowspan="2">4.608</td>
<td valign="top" align="center">N/A</td>
<td valign="top" align="center">N/A</td>
<td valign="top" align="center">63.86</td>
<td valign="top" align="center">78.94</td>
<td valign="top" align="center">93.16</td>
</tr>
<tr>
<td valign="top" align="left"><bold>5</bold></td>
<td valign="top" align="left">BP &#x0002B; PIM-friendly STDP</td>
<td valign="top" align="center">4.4</td>
<td valign="top" align="center">10.41</td>
<td valign="top" align="center">62.31</td>
<td valign="top" align="center">77.83</td>
<td valign="top" align="center">92.07</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>We note that the accuracy of the backpropagation-trained SNN reported in this article is lower than the state of the art; for example, 99.59% accuracy has been reported on the MNIST dataset (Lee et al., <xref ref-type="bibr" rid="B23">2020</xref>). This is primarily because we reduced the training time of the network with backpropagation: instead of the 100 epochs used by Lee et al. (2020), we trained the network for only 20 epochs. Furthermore, we did not apply the pre-processing and fine-tuning methods normally used in state-of-the-art BP training, as these techniques are applied off-chip and are unrelated to the PIM-based hardware implementation of ConvSNN.</p>
</sec>
<sec>
<title>4.4. Throughput Analysis</title>
<p><xref ref-type="table" rid="T2">Table 2</xref> shows the peak throughput of MONETA in tera-operations per second (TOPS). The throughput of each synaptic array is determined by the number of parallel multiply-and-accumulate operations (the number of weights stored in a row). The total throughput of all synaptic arrays is therefore (number of weights per row) &#x000D7; (number of synaptic arrays) &#x000D7; frequency. The H-NoC sums the partial outputs of the synaptic arrays at the same rate, i.e., (number of weights per row) &#x000D7; (number of synaptic arrays) &#x000D7; frequency. The neuron modules compute the membrane potentials of the neurons at a throughput of (number of weights per row) &#x000D7; frequency. The total throughput is obtained by considering the parallel operation of all synaptic cores. Our design has 16 weights per word line, 9 synaptic arrays per core, 8 synaptic cores, and a frequency of 1 GHz. Hence, the total throughput of one MONETA chip is 2.304 TOPS in inference mode. In on-line learning mode, the throughput is reduced by the time spent on training. Because a weight update takes 17 cycles, the throughput becomes 2.304 <inline-formula><mml:math id="M11"><mml:mo>&#x000D7;</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x0002B;</mml:mo><mml:mn>17</mml:mn><mml:mo>&#x000D7;</mml:mo><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">(output spike rate)</mml:mtext></mml:mstyle></mml:mrow></mml:mfrac></mml:math></inline-formula> TOPS. For example, the learning-mode throughput of the <bold>CONV4</bold> layer is 2.2 TOPS with an output spike rate of 0.0028.</p>
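<p>The throughput arithmetic can be reproduced with the short sketch below. Note that 16 &#x000D7; 9 &#x000D7; 8 &#x000D7; 1 GHz equals 1.152 &#x000D7; 10<sup>12</sup> multiply-accumulates per second; the sketch assumes that each multiply-accumulate counts as two operations (<monospace>OPS_PER_MAC</monospace>, an assumption not stated in the text), which is what reproduces the reported 2.304 TOPS. The energy efficiencies use the single-chip <bold>CONV4</bold> power figures given in section 4.5.</p>

```python
WEIGHTS_PER_ROW = 16  # 128 columns / 8-bit weights
ARRAYS_PER_CORE = 9
CORES = 8
FREQ_HZ = 1e9
OPS_PER_MAC = 2       # assumption: multiply and accumulate counted separately
UPDATE_CYCLES = 17    # cycles per STDP weight update


def inference_tops():
    """Peak inference throughput of one chip in TOPS."""
    return WEIGHTS_PER_ROW * ARRAYS_PER_CORE * CORES * OPS_PER_MAC * FREQ_HZ / 1e12


def learning_tops(output_spike_rate):
    """Inference throughput derated by the cycles spent on weight updates."""
    return inference_tops() / (1 + UPDATE_CYCLES * output_spike_rate)


def tops_per_watt(tops, power_mw):
    """Energy efficiency from throughput (TOPS) and power (mW)."""
    return tops / (power_mw / 1e3)


inf = inference_tops()
lrn = learning_tops(0.0028)  # CONV4 output spike rate
print(round(inf, 3), round(lrn, 1))  # 2.304 2.2
print(round(tops_per_watt(inf, 123.28), 2),
      round(tops_per_watt(lrn, 303.52), 2))  # 18.69 7.25
```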
<p>In MONETA, each time step corresponds to one clock cycle (1 ns at 1 GHz), and 100 time steps are used for each image. The total number of clock cycles required to process one image in each layer is <inline-formula><mml:math id="M12"><mml:mn>100</mml:mn><mml:mo>&#x000D7;</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">R</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:mfrac><mml:mo>&#x000D7;</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">C</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:mfrac><mml:mo>&#x000D7;</mml:mo><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">D</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:math></inline-formula>, where <italic>S</italic> is the stride and <italic>I</italic><sub>D</sub> represents the number of rows in the synaptic array. At 1 GHz, the resulting image processing rates of <bold>CONV1</bold>&#x02013;<bold>CONV4</bold> are 13.02 K, 2.44 K, 9.77 K, and 19.53 K frames per second (fps), respectively.</p>
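<p>The sketch below evaluates the cycle-count formula above. The per-layer inputs are hypothetical: the text does not state them, but assuming 32 &#x000D7; 32 &#x000D7; 3 CIFAR images, a stride of 2 in every layer, and <italic>I</italic><sub>D</sub> equal to the input depth of each layer gives one consistent set of values that reproduces the reported frame rates.</p>

```python
CLOCK_HZ = 1e9    # one time step per clock cycle
TIME_STEPS = 100  # time steps per image


def cycles_per_image(i_r, i_c, i_d, stride):
    """Clock cycles per image: 100 x (I_R / S) x (I_C / S) x I_D."""
    return TIME_STEPS * (i_r // stride) * (i_c // stride) * i_d


# Hypothetical (I_R, I_C, I_D) per layer: 32x32x3 input, stride 2 everywhere,
# I_D taken as the input depth of the layer
LAYERS = {
    "CONV1": (32, 32, 3),
    "CONV2": (16, 16, 64),
    "CONV3": (8, 8, 64),
    "CONV4": (4, 4, 128),
}


def frame_rates(stride=2):
    """Images per second for each layer at 1 GHz."""
    return {name: CLOCK_HZ / cycles_per_image(r, c, d, stride)
            for name, (r, c, d) in LAYERS.items()}


for name, fps in frame_rates().items():
    print(f"{name}: {fps / 1e3:.2f} K fps")
```

<p>Under these assumptions the loop prints 13.02, 2.44, 9.77, and 19.53 K fps for <bold>CONV1</bold>&#x02013;<bold>CONV4</bold>.</p>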
</sec>
<sec>
<title>4.5. Area and Power Analysis</title>
<p>A single MONETA chip consumes 123.28 mW in inference mode and 303.52 mW in learning mode at 1 GHz with a 1 V supply. These values are calculated for <bold>CONV4</bold>, the layer at which MONETA's power is highest, using <bold>CONV4</bold>&#x00027;s input and output spike activity ratios (0.0092 and 0.0028, respectively). Note that <bold>CONV4</bold> of the hybrid network requires two MONETA chips to store its parameters, so its total power is twice that of the homogeneous network (246.56 mW in inference mode and 607.04 mW in learning mode).</p>
<p><xref ref-type="fig" rid="F16">Figure 16A</xref> shows the power breakdown of the synaptic core&#x00027;s inference mode. The weight update module is idle during the inference mode by clock gating. The 8T-SRAM array consumes the 21.86 pJ for matrix multiplication calculation, 4.48 pJ for transposable weight read, and 12.34 pJ for transposable weight write. The SRAM-based computation naturally transforms sparsity in neuron firing (i.e., zero values in IFM) to power saving during inference. If an input spike is absent in a cycle, the SRAM power for that cycle is zero as word lines are not activated. Moreover, as we use single ended sensing in 8T-SRAM, there is no bit-line discharge when the values of the corresponding bit are &#x0201C;0.&#x0201D; Hence, the SRAM contributes very little power to the overall operation. The power is dominated by the <italic>V</italic><sub>mem</sub> calculation in the neuron module. This is because there exists an inherent leakage component in the membrane potential computation (<italic>a</italic>&#x0002B;<italic>bV</italic><sub>mem</sub> in LIF dynamics in <xref ref-type="fig" rid="F3">Figure 3</xref>) that causes the membrane potential to reduce when there are no input spikes. Hence, the neuron module needs to perform the leakage computation in each clock. However, the power in the synaptic array and the H-NoC reduces significantly due to low spiking activity (=0.0092). <xref ref-type="fig" rid="F16">Figure 16B</xref> shows the power distribution of the synaptic cores in the training mode. It shows the much higher power consumption compared to the inference mode, mainly because of the complex weight update module (Kim et al., <xref ref-type="bibr" rid="B19">2020</xref>).</p>
<fig id="F16" position="float">
<label>Figure 16</label>
<caption><p>Power breakdown of MONETA. <bold>(A)</bold> inference mode and <bold>(B)</bold> training mode.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-775457-g0016.tif"/>
</fig>
<p>The central STDP controller consists of a 32 &#x000D7; 128 SRAM, control logic, and a <italic>V</italic><sub>mem</sub> comparator. Its read and write energies are 1.14 pJ and 3.09 pJ, respectively, and its bus width is 32 bits. It operates at 1 GHz, matching the clock frequency of the synaptic cores. Overall, the central STDP controller occupies a much smaller area (0.025 mm<sup>2</sup>) and consumes much less power (0.141 mW during inference and 0.154 mW during learning) than the synaptic cores.</p>
</sec>
<sec>
<title>4.6. Comparison With Prior Works</title>
<p><xref ref-type="table" rid="T3">Table 3</xref> compares MONETA with a set of recent SNN accelerators (Buhler et al., <xref ref-type="bibr" rid="B3">2017</xref>; Chen et al., <xref ref-type="bibr" rid="B6">2019</xref>; Park et al., <xref ref-type="bibr" rid="B32">2019</xref>; Chuang et al., <xref ref-type="bibr" rid="B9">2020</xref>). Note that all designs use different SNN architectures for evaluation, and most prior designs were evaluated on MNIST, whereas our work is evaluated on MNIST, CIFAR-10, and CIFAR-100. Our design supports both inference and STDP learning (fully for the homogeneous network and partially for the hybrid network).</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Comparison with other works.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Reference</bold></th>
<th valign="top" align="center" colspan="4" style="border-bottom: thin solid #000000;"><bold>This work</bold></th>
<th valign="top" align="center"><bold>DAC&#x02018;20</bold></th>
<th valign="top" align="center"><bold>ISSCC&#x02018;19</bold></th>
<th valign="top" align="center"><bold>JSSC&#x02018;19</bold></th>
<th valign="top" align="center"><bold>VLSI&#x02018;17</bold></th>
</tr>
<tr>
<th/>
<th valign="top" align="center" colspan="2" style="border-bottom: thin solid #000000;"><bold>Homogeneous</bold></th>
<th valign="top" align="center" colspan="2" style="border-bottom: thin solid #000000;"><bold>Hybrid</bold></th>
<th/>
<th/>
<th/>
<th/>
</tr>
 <tr>
<th/>
<th valign="top" align="center"><bold>Inference</bold></th>
<th valign="top" align="center"><bold>Learning</bold></th>
<th valign="top" align="center"><bold>Inference</bold></th>
<th valign="top" align="center"><bold>Learning</bold></th>
<th/>
<th/>
<th/>
<th/>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Technology (nm)</td>
<td valign="top" align="center" colspan="4">65</td>
<td valign="top" align="center">90</td>
<td valign="top" align="center">65</td>
<td valign="top" align="center">10</td>
<td valign="top" align="center">40</td>
</tr>
<tr>
<td valign="top" align="left">Algorithm</td>
<td valign="top" align="center" colspan="4">ConvSNN</td>
<td valign="top" align="center">ConvSNN</td>
<td valign="top" align="center">SNN</td>
<td valign="top" align="center">SNN</td>
<td valign="top" align="center">SNN</td>
</tr>
<tr>
<td valign="top" align="left">On-chip<break/> Training</td>
<td valign="top" align="center" colspan="4">Yes</td>
<td valign="top" align="center">No</td>
<td valign="top" align="center">Yes</td>
<td valign="top" align="center">Yes</td>
<td valign="top" align="center">Yes</td>
</tr>
<tr>
<td valign="top" align="left">Voltage (V)</td>
<td valign="top" align="center" colspan="4">1.0</td>
<td valign="top" align="center">1.0</td>
<td valign="top" align="center">0.8</td>
<td valign="top" align="center">0.9</td>
<td valign="top" align="center">0.9</td>
</tr>
<tr>
<td valign="top" align="left">Frequency (MHz)</td>
<td valign="top" align="center" colspan="4">1,000</td>
<td valign="top" align="center">100</td>
<td valign="top" align="center">20</td>
<td valign="top" align="center">506</td>
<td valign="top" align="center">250</td>
</tr>
<tr>
<td valign="top" align="left">Synapse Bits</td>
<td valign="top" align="center" colspan="4">8</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">7</td>
<td valign="top" align="center">7</td>
<td valign="top" align="center">4</td>
</tr>
<tr>
<td valign="top" align="left">Area (mm<sup>2</sup>)</td>
<td valign="top" align="center" colspan="2">11.155</td>
<td valign="top" align="center" colspan="2">22.23</td>
<td valign="top" align="center">2.07</td>
<td valign="top" align="center">10.08</td>
<td valign="top" align="center">1.72</td>
<td valign="top" align="center">1.31</td>
</tr>
<tr>
<td valign="top" align="left">TOPS/mm<sup>2</sup></td>
<td valign="top" align="center">0.207</td>
<td valign="top" align="center">0.197</td>
<td valign="top" align="center">0.207</td>
<td valign="top" align="center">0.197</td>
<td valign="top" align="center">0.312</td>
<td valign="top" align="center">0.008</td>
<td valign="top" align="center">0.015</td>
<td valign="top" align="center">0.227</td>
</tr>
<tr>
<td valign="top" align="left">Power (mW)</td>
<td valign="top" align="center">123.28</td>
<td valign="top" align="center">303.52</td>
<td valign="top" align="center">246.56</td>
<td valign="top" align="center">423.83</td>
<td valign="top" align="center">45.71</td>
<td valign="top" align="center">23.6</td>
<td valign="top" align="center">208.3</td>
<td valign="top" align="center">87</td>
</tr>
<tr>
<td valign="top" align="left">TOPS/W</td>
<td valign="top" align="center">18.69</td>
<td valign="top" align="center">7.25</td>
<td valign="top" align="center">18.69</td>
<td valign="top" align="center">10.41</td>
<td valign="top" align="center">14.1</td>
<td valign="top" align="center">3.42</td>
<td valign="top" align="center">0.12</td>
<td valign="top" align="center">3.43</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Our throughput is higher than that of prior works, mainly because of highly parallel in-memory computation and the higher operating frequency (1 GHz). The PIM architecture eliminates the arithmetic units used in prior designs, enabling a much higher operating speed at a similar supply voltage. Owing to the PIM architecture, MONETA achieves a higher compute density (TOPS/mm<sup>2</sup>) than prior works with similar bit precision. Its area efficiency is similar to that of the 4-bit SNN in 40 nm CMOS (0.207 vs. 0.227 TOPS/mm<sup>2</sup>), even though our design is realized in 65 nm CMOS. However, compared with the binary SNN design, MONETA has roughly one-third lower area efficiency (0.207 vs. 0.312 TOPS/mm<sup>2</sup>; note that the binary SNN was implemented in 90 nm CMOS). Further, MONETA achieves higher energy efficiency (TOPS/W) than the other designs during inference, and higher energy efficiency during learning than the prior designs that support on-chip training. This is mainly because PIM-based operation naturally translates the sparsity in neuron firing into power reduction, as discussed before.</p>
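<p>The compute-density comparison can be recomputed from <xref ref-type="table" rid="T3">Table 3</xref>. The sketch below derives MONETA's inference density from its 2.304 TOPS and 11.155 mm<sup>2</sup> and divides it by each prior design's reported TOPS/mm<sup>2</sup>; the dictionary labels are informal shorthand for the table columns.</p>

```python
# Inference compute density (TOPS/mm^2) of prior designs, from Table 3
prior = {
    "DAC20 (1-bit)": 0.312,
    "ISSCC19 (7-bit)": 0.008,
    "JSSC19 (7-bit)": 0.015,
    "VLSI17 (4-bit)": 0.227,
}

# MONETA, homogeneous network: 2.304 TOPS on 11.155 mm^2
moneta = 2.304 / 11.155

for name, density in prior.items():
    print(f"{name}: MONETA / prior = {moneta / density:.2f}")
```

<p>The ratio is about 0.66 against the binary design (roughly one-third lower density) and about 0.91 against the 4-bit design, while MONETA exceeds the 7-bit designs by more than an order of magnitude.</p>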
</sec>
</sec>
<sec sec-type="conclusions" id="s5">
<title>5. Conclusion</title>
<p>This article presents a PIM-based hybrid ConvSNN acceleration platform with on-chip STDP-based weight updates. We present an optimized data flow for sequential processing of input feature maps that reduces off-chip data movement while preserving the learning accuracy of the STDP process. Algorithmic simulations show accuracy comparable to a pure software implementation on the MNIST and CIFAR-10 datasets. We also present a hybrid architecture that flexibly combines supervised (backpropagation) and unsupervised (STDP) weight training with on-line learning. Power and throughput analysis of the 65 nm CMOS physical design shows high throughput and energy efficiency. The programming model and compiler infrastructure necessary to map an arbitrary ConvSNN onto MONETA are important future work.</p>
</sec>
<sec sec-type="data-availability" id="s6">
<title>Data Availability Statement</title>
<p>Publicly available datasets were analyzed in this study. These data can be found at: MNIST: <ext-link ext-link-type="uri" xlink:href="http://yann.lecun.com/exdb/mnist/">http://yann.lecun.com/exdb/mnist/</ext-link>; CIFAR-10 and CIFAR-100: <ext-link ext-link-type="uri" xlink:href="https://www.cs.toronto.edu/~kriz/cifar.html">https://www.cs.toronto.edu/&#x0007E;kriz/cifar.html</ext-link>.</p>
</sec>
<sec id="s7">
<title>Author Contributions</title>
<p>DK developed the main concepts and algorithm, generated the RTL design and layout, and performed all hardware analysis. BC and XS developed the algorithm for a hybrid network and performed the software simulation for accuracy analysis. All authors assisted in developing the concept and writing this article. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec sec-type="funding-information" id="s8">
<title>Funding</title>
<p>This research was supported by the DARPA ERI 3DSoC Program under Award HR001118C0096.</p>
</sec>
<sec id="s9"> 
<title>Author Disclaimer</title>
<p>The views and conclusions included in this article are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of DARPA or the U.S. Government.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s10">
<title>Publisher&#x00027;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec> </body>
<back>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Akopyan</surname> <given-names>F.</given-names></name> <name><surname>Sawada</surname> <given-names>J.</given-names></name> <name><surname>Cassidy</surname> <given-names>A.</given-names></name> <name><surname>Alvarez-Icaza</surname> <given-names>R.</given-names></name> <name><surname>Arthur</surname> <given-names>J.</given-names></name> <name><surname>Merolla</surname> <given-names>P.</given-names></name> <etal/></person-group>. (<year>2015</year>). <article-title>TrueNorth: design and tool flow of a 65 mW 1 million neuron programmable neurosynaptic chip</article-title>. <source>IEEE Trans. Comput.-Aided Des. Integr. Circ. Syst.</source> <volume>34</volume>, <fpage>1537</fpage>&#x02013;<lpage>1557</lpage>. <pub-id pub-id-type="doi">10.1109/TCAD.2015.2474396</pub-id></citation>
</ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bi</surname> <given-names>G.-Q.</given-names></name> <name><surname>Poo</surname> <given-names>M.-M.</given-names></name></person-group> (<year>1998</year>). <article-title>Synaptic modifications in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic cell type</article-title>. <source>J. Neurosci.</source> <volume>18</volume>, <fpage>10464</fpage>&#x02013;<lpage>10472</lpage>.<pub-id pub-id-type="pmid">9852584</pub-id></citation></ref>
<ref id="B3">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Buhler</surname> <given-names>F. N.</given-names></name> <name><surname>Brown</surname> <given-names>P.</given-names></name> <name><surname>Li</surname> <given-names>J.</given-names></name> <name><surname>Chen</surname> <given-names>T.</given-names></name> <name><surname>Zhang</surname> <given-names>Z.</given-names></name> <name><surname>Flynn</surname> <given-names>M. P.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;A 3.43TOPS/W 48.9pJ/pixel 50.1nJ/classification 512 analog neuron sparse coding neural network with on-chip learning and classification in 40nm CMOS,&#x0201D;</article-title> in <source>2017 Symposium on VLSI Circuits</source> (<publisher-loc>Kyoto</publisher-loc>), C<fpage>30</fpage>&#x02013;C<lpage>31</lpage>.</citation>
</ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cao</surname> <given-names>Y.</given-names></name> <name><surname>Chen</surname> <given-names>Y.</given-names></name> <name><surname>Khosla</surname> <given-names>D.</given-names></name></person-group> (<year>2015</year>). <article-title>Spiking deep convolutional neural networks for energy-efficient object recognition</article-title>. <source>Int. J. Comput. Vis.</source> <volume>113</volume>, <fpage>54</fpage>&#x02013;<lpage>66</lpage>. <pub-id pub-id-type="doi">10.1007/s11263-014-0788-3</pub-id></citation>
</ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chakraborty</surname> <given-names>B.</given-names></name> <name><surname>She</surname> <given-names>X.</given-names></name> <name><surname>Mukhopadhyay</surname> <given-names>S.</given-names></name></person-group> (<year>2021</year>). <article-title>A fully spiking hybrid neural network for energy-efficient object detection</article-title>. <source>IEEE Trans. Image Process</source>. <volume>30</volume>, <fpage>9014</fpage>&#x02013;<lpage>9029</lpage>. <pub-id pub-id-type="doi">10.1109/TIP.2021.3122092</pub-id><pub-id pub-id-type="pmid">34705647</pub-id></citation></ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>G. K.</given-names></name> <name><surname>Kumar</surname> <given-names>R.</given-names></name> <name><surname>Sumbul</surname> <given-names>H. E.</given-names></name> <name><surname>Knag</surname> <given-names>P. C.</given-names></name> <name><surname>Krishnamurthy</surname> <given-names>R. K.</given-names></name></person-group> (<year>2019</year>). <article-title>A 4096-neuron 1M-synapse 3.8-pJ/SOP spiking neural network with on-chip STDP learning and sparse weights in 10-nm FinFET CMOS</article-title>. <source>IEEE J. Solid-State Circ.</source> <volume>54</volume>, <fpage>992</fpage>&#x02013;<lpage>1002</lpage>. <pub-id pub-id-type="doi">10.1109/JSSC.2018.2884901</pub-id></citation>
</ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>Y.</given-names></name> <name><surname>Krishna</surname> <given-names>T.</given-names></name> <name><surname>Emer</surname> <given-names>J.</given-names></name> <name><surname>Sze</surname> <given-names>V.</given-names></name></person-group> (<year>2017</year>). <article-title>Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks</article-title>. <source>IEEE J. Solid-State Circ.</source> <volume>52</volume>, <fpage>127</fpage>&#x02013;<lpage>138</lpage>. <pub-id pub-id-type="doi">10.1109/JSSC.2016.2616357</pub-id></citation>
</ref>
<ref id="B8">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chi</surname> <given-names>P.</given-names></name> <name><surname>Li</surname> <given-names>S.</given-names></name> <name><surname>Xu</surname> <given-names>C.</given-names></name> <name><surname>Zhang</surname> <given-names>T.</given-names></name> <name><surname>Zhao</surname> <given-names>J.</given-names></name> <name><surname>Liu</surname> <given-names>Y.</given-names></name> <etal/></person-group>. (<year>2016</year>). <article-title>&#x0201C;PRIME: a novel processing-in-memory architecture for neural network computation in ReRAM-based main memory,&#x0201D;</article-title> in <source>Proceedings of the 43rd International Symposium on Computer Architecture ISCA &#x00027;16</source> (<publisher-loc>Seoul</publisher-loc>: <publisher-name>IEEE Press</publisher-name>), <fpage>27</fpage>&#x02013;<lpage>39</lpage>.</citation>
</ref>
<ref id="B9">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chuang</surname> <given-names>P. Y.</given-names></name> <name><surname>Tan</surname> <given-names>P.-Y.</given-names></name> <name><surname>Wu</surname> <given-names>C.-W.</given-names></name> <name><surname>Lu</surname> <given-names>J.-M.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;A 90nm 103.14 TOPS/W binary-weight spiking neural network CMOS ASIC for real-time object classification,&#x0201D;</article-title> in <source>2020 57th ACM/IEEE Design Automation Conference (DAC)</source> (<publisher-loc>San Francisco, CA</publisher-loc>), <fpage>1</fpage>&#x02013;<lpage>6</lpage>.</citation>
</ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Davies</surname> <given-names>M.</given-names></name> <name><surname>Srinivasa</surname> <given-names>N.</given-names></name> <name><surname>Lin</surname> <given-names>T.-H.</given-names></name> <name><surname>Chinya</surname> <given-names>G.</given-names></name> <name><surname>Cao</surname> <given-names>Y.</given-names></name> <name><surname>Choday</surname> <given-names>S. H.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>Loihi: a neuromorphic manycore processor with on-chip learning</article-title>. <source>IEEE Micro</source> <volume>38</volume>, <fpage>82</fpage>&#x02013;<lpage>99</lpage>. <pub-id pub-id-type="doi">10.1109/MM.2018.112130359</pub-id></citation>
</ref>
<ref id="B11">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Deng</surname> <given-names>J.</given-names></name> <name><surname>Dong</surname> <given-names>W.</given-names></name> <name><surname>Socher</surname> <given-names>R.</given-names></name> <name><surname>Li</surname> <given-names>L.-J.</given-names></name> <name><surname>Li</surname> <given-names>K.</given-names></name> <name><surname>Fei-Fei</surname> <given-names>L.</given-names></name></person-group> (<year>2009</year>). <article-title>&#x0201C;ImageNet: a large-scale hierarchical image database,&#x0201D;</article-title> in <source>2009 IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Miami, FL</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>248</fpage>&#x02013;<lpage>255</lpage>.</citation>
</ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Deng</surname> <given-names>L.</given-names></name> <name><surname>Wang</surname> <given-names>G.</given-names></name> <name><surname>Li</surname> <given-names>G.</given-names></name> <name><surname>Li</surname> <given-names>S.</given-names></name> <name><surname>Liang</surname> <given-names>L.</given-names></name> <name><surname>Zhu</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>Tianjic: a unified and scalable chip bridging spike-based and continuous neural computation</article-title>. <source>IEEE J. Solid-State Circ.</source> <volume>55</volume>, <fpage>2228</fpage>&#x02013;<lpage>2246</lpage>. <pub-id pub-id-type="doi">10.1109/JSSC.2020.2970709</pub-id></citation>
</ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Diehl</surname> <given-names>P. U.</given-names></name> <name><surname>Cook</surname> <given-names>M.</given-names></name></person-group> (<year>2015</year>). <article-title>Unsupervised learning of digit recognition using spike-timing-dependent plasticity</article-title>. <source>Front. Comput. Neurosci.</source> <volume>9</volume>, <fpage>99</fpage>. <pub-id pub-id-type="doi">10.3389/fncom.2015.00099</pub-id><pub-id pub-id-type="pmid">26941637</pub-id></citation></ref>
<ref id="B14">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Diehl</surname> <given-names>P. U.</given-names></name> <name><surname>Neil</surname> <given-names>D.</given-names></name> <name><surname>Binas</surname> <given-names>J.</given-names></name> <name><surname>Cook</surname> <given-names>M.</given-names></name> <name><surname>Liu</surname> <given-names>S.-C.</given-names></name> <name><surname>Pfeiffer</surname> <given-names>M.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing,&#x0201D;</article-title> in <source>2015 International Joint Conference on Neural Networks (IJCNN)</source> (<publisher-loc>Killarney</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>8</lpage>.</citation>
</ref>
<ref id="B15">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Gerstner</surname> <given-names>W.</given-names></name> <name><surname>Kistler</surname> <given-names>W.</given-names></name></person-group> (<year>2002</year>). <source>Spiking Neuron Models: Single Neurons, Populations, Plasticity</source>. <publisher-loc>Cambridge</publisher-loc>: <publisher-name>Cambridge University Press</publisher-name>.</citation>
</ref>
<ref id="B16">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Ren</surname> <given-names>S.</given-names></name> <name><surname>Sun</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Deep residual learning for image recognition,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Las Vegas, NV</publisher-loc>), <fpage>770</fpage>&#x02013;<lpage>778</lpage>.<pub-id pub-id-type="pmid">32166560</pub-id></citation></ref>
<ref id="B17">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Imani</surname> <given-names>M.</given-names></name> <name><surname>Gupta</surname> <given-names>S.</given-names></name> <name><surname>Kim</surname> <given-names>Y.</given-names></name> <name><surname>Rosing</surname> <given-names>T.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;FloatPIM: in-memory acceleration of deep neural network training with high precision,&#x0201D;</article-title> in <source>2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA)</source> (<publisher-loc>Phoenix, AZ</publisher-loc>), <fpage>802</fpage>&#x02013;<lpage>815</lpage>.</citation>
</ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kheradpisheh</surname> <given-names>S. R.</given-names></name> <name><surname>Ganjtabesh</surname> <given-names>M.</given-names></name> <name><surname>Thorpe</surname> <given-names>S. J.</given-names></name> <name><surname>Masquelier</surname> <given-names>T.</given-names></name></person-group> (<year>2018</year>). <article-title>STDP-based spiking deep convolutional neural networks for object recognition</article-title>. <source>Neural Netw.</source> <volume>99</volume>, <fpage>56</fpage>&#x02013;<lpage>67</lpage>. <pub-id pub-id-type="doi">10.1016/j.neunet.2017.12.005</pub-id><pub-id pub-id-type="pmid">29328958</pub-id></citation></ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kim</surname> <given-names>D.</given-names></name> <name><surname>She</surname> <given-names>X.</given-names></name> <name><surname>Rahman</surname> <given-names>N. M.</given-names></name> <name><surname>Chekuri</surname> <given-names>V. C. K.</given-names></name> <name><surname>Mukhopadhyay</surname> <given-names>S.</given-names></name></person-group> (<year>2020</year>). <article-title>Processing-in-memory-based on-chip learning with spike-time-dependent plasticity in 65-nm CMOS</article-title>. <source>IEEE Solid-State Circ. Lett.</source> <volume>3</volume>, <fpage>278</fpage>&#x02013;<lpage>281</lpage>. <pub-id pub-id-type="doi">10.1109/LSSC.2020.3013448</pub-id></citation>
</ref>
<ref id="B20">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kim</surname> <given-names>S.</given-names></name> <name><surname>Park</surname> <given-names>S.</given-names></name> <name><surname>Na</surname> <given-names>B.</given-names></name> <name><surname>Yoon</surname> <given-names>S.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Spiking-YOLO: spiking neural network for energy-efficient object detection,&#x0201D;</article-title> in <source>Proceedings of the AAAI Conference on Artificial Intelligence</source> (<publisher-loc>New York, NY</publisher-loc>), vol. <volume>34</volume>, <fpage>11270</fpage>&#x02013;<lpage>11277</lpage>.</citation>
</ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ledinauskas</surname> <given-names>E.</given-names></name> <name><surname>Ruseckas</surname> <given-names>J.</given-names></name> <name><surname>Jur&#x00161;&#x00117;nas</surname> <given-names>A.</given-names></name> <name><surname>Bura&#x0010D;as</surname> <given-names>G.</given-names></name></person-group> (<year>2020</year>). <article-title>Training deep spiking neural networks</article-title>. <source>arXiv preprint</source> arXiv:2006.04436. <pub-id pub-id-type="doi">10.48550/ARXIV.2006.04436</pub-id></citation>
</ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lee</surname> <given-names>C.</given-names></name> <name><surname>Panda</surname> <given-names>P.</given-names></name> <name><surname>Srinivasan</surname> <given-names>G.</given-names></name> <name><surname>Roy</surname> <given-names>K.</given-names></name></person-group> (<year>2018</year>). <article-title>Training deep spiking convolutional neural networks with STDP-based unsupervised pre-training followed by supervised fine-tuning</article-title>. <source>Front. Neurosci.</source> <volume>12</volume>, <fpage>435</fpage>. <pub-id pub-id-type="doi">10.3389/fnins.2018.00435</pub-id><pub-id pub-id-type="pmid">30123103</pub-id></citation></ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lee</surname> <given-names>C.</given-names></name> <name><surname>Sarwar</surname> <given-names>S. S.</given-names></name> <name><surname>Panda</surname> <given-names>P.</given-names></name> <name><surname>Srinivasan</surname> <given-names>G.</given-names></name> <name><surname>Roy</surname> <given-names>K.</given-names></name></person-group> (<year>2020</year>). <article-title>Enabling spike-based backpropagation for training deep neural network architectures</article-title>. <source>Front. Neurosci.</source> <volume>14</volume>, <fpage>119</fpage>. <pub-id pub-id-type="doi">10.3389/fnins.2020.00119</pub-id><pub-id pub-id-type="pmid">32180697</pub-id></citation></ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lee</surname> <given-names>C.</given-names></name> <name><surname>Srinivasan</surname> <given-names>G.</given-names></name> <name><surname>Panda</surname> <given-names>P.</given-names></name> <name><surname>Roy</surname> <given-names>K.</given-names></name></person-group> (<year>2019</year>). <article-title>Deep spiking convolutional neural network trained with unsupervised spike-timing-dependent plasticity</article-title>. <source>IEEE Trans. Cogn. Develop. Syst.</source> <volume>11</volume>, <fpage>384</fpage>&#x02013;<lpage>394</lpage>. <pub-id pub-id-type="doi">10.1109/TCDS.2018.2833071</pub-id></citation>
</ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Long</surname> <given-names>Y.</given-names></name></person-group> (<year>2019</year>). <article-title>A ferroelectric FET-based processing-in-memory architecture for DNN acceleration</article-title>. <source>IEEE J. Exp. Solid-State Comput. Dev. Circ.</source> <volume>5</volume>, <fpage>113</fpage>&#x02013;<lpage>122</lpage>. <pub-id pub-id-type="doi">10.1109/JXCDC.2019.2923745</pub-id></citation>
</ref>
<ref id="B26">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Long</surname> <given-names>Y.</given-names></name> <name><surname>Lee</surname> <given-names>E.</given-names></name> <name><surname>Kim</surname> <given-names>D.</given-names></name> <name><surname>Mukhopadhyay</surname> <given-names>S.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Q-PIM: a genetic algorithm based flexible DNN quantization method and application to processing-in-memory platform,&#x0201D;</article-title> in <source>2020 57th ACM/IEEE Design Automation Conference (DAC)</source> (<publisher-loc>San Francisco, CA</publisher-loc>), <fpage>1</fpage>&#x02013;<lpage>6</lpage>.</citation>
</ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Maass</surname> <given-names>W.</given-names></name></person-group> (<year>1997</year>). <article-title>Networks of spiking neurons: the third generation of neural network models</article-title>. <source>Neural Netw.</source> <volume>10</volume>, <fpage>1659</fpage>&#x02013;<lpage>1671</lpage>.</citation>
</ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Miquel</surname> <given-names>J. R.</given-names></name> <name><surname>Tolu</surname> <given-names>S.</given-names></name> <name><surname>Sch&#x000F6;ller</surname> <given-names>F. E.</given-names></name> <name><surname>Galeazzi</surname> <given-names>R.</given-names></name></person-group> (<year>2021</year>). <article-title>RetinaNet object detector based on analog-to-spiking neural network conversion</article-title>. <source>arXiv preprint</source> arXiv:2106.05624. <pub-id pub-id-type="doi">10.48550/ARXIV.2106.05624</pub-id></citation>
</ref>
<ref id="B29">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Narayanan</surname> <given-names>S.</given-names></name> <name><surname>Taht</surname> <given-names>K.</given-names></name> <name><surname>Balasubramonian</surname> <given-names>R.</given-names></name> <name><surname>Giacomin</surname> <given-names>E.</given-names></name> <name><surname>Gaillardon</surname> <given-names>P.-E.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;SpinalFlow: an architecture and dataflow tailored for spiking neural networks,&#x0201D;</article-title> in <source>2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA)</source> (<publisher-loc>Valencia</publisher-loc>), <fpage>349</fpage>&#x02013;<lpage>362</lpage>.</citation>
</ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Neftci</surname> <given-names>E. O.</given-names></name> <name><surname>Mostafa</surname> <given-names>H.</given-names></name> <name><surname>Zenke</surname> <given-names>F.</given-names></name></person-group> (<year>2019</year>). <article-title>Surrogate gradient learning in spiking neural networks: bringing the power of gradient-based optimization to spiking neural networks</article-title>. <source>IEEE Signal Process. Mag.</source> <volume>36</volume>, <fpage>51</fpage>&#x02013;<lpage>63</lpage>. <pub-id pub-id-type="doi">10.1109/MSP.2019.2931595</pub-id></citation>
</ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Panda</surname> <given-names>P.</given-names></name> <name><surname>Aketi</surname> <given-names>S. A.</given-names></name> <name><surname>Roy</surname> <given-names>K.</given-names></name></person-group> (<year>2020</year>). <article-title>Toward scalable, efficient, and accurate deep spiking neural networks with backward residual connections, stochastic softmax, and hybridization</article-title>. <source>Front. Neurosci.</source> <volume>14</volume>, <fpage>653</fpage>. <pub-id pub-id-type="doi">10.3389/fnins.2020.00653</pub-id><pub-id pub-id-type="pmid">32694977</pub-id></citation></ref>
<ref id="B32">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Park</surname> <given-names>J.</given-names></name> <name><surname>Lee</surname> <given-names>J.</given-names></name> <name><surname>Jeon</surname> <given-names>D.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;7.6 A 65nm 236.5nJ/classification neuromorphic processor with 7.5% energy overhead on-chip learning using direct spike-only feedback,&#x0201D;</article-title> in <source>2019 IEEE International Solid- State Circuits Conference - (ISSCC)</source> (<publisher-loc>San Francisco, CA</publisher-loc>), <fpage>140</fpage>&#x02013;<lpage>142</lpage>.</citation>
</ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Peng</surname> <given-names>X.</given-names></name> <name><surname>Huang</surname> <given-names>S.</given-names></name> <name><surname>Jiang</surname> <given-names>H.</given-names></name> <name><surname>Lu</surname> <given-names>A.</given-names></name> <name><surname>Yu</surname> <given-names>S.</given-names></name></person-group> (<year>2021</year>). <article-title>DNN&#x0002B;NeuroSim V2.0: An end-to-end benchmarking framework for compute-in-memory accelerators for on-chip training</article-title>. <source>IEEE Trans. Comput. Aided Des. Integr. Circuits Syst.</source> <volume>40</volume>, <fpage>2306</fpage>&#x02013;<lpage>2319</lpage>. <pub-id pub-id-type="doi">10.1109/TCAD.2020.3043731</pub-id></citation>
</ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pfeiffer</surname> <given-names>M.</given-names></name> <name><surname>Pfeil</surname> <given-names>T.</given-names></name></person-group> (<year>2018</year>). <article-title>Deep learning with spiking neurons: opportunities and challenges</article-title>. <source>Front. Neurosci.</source> <volume>12</volume>, <fpage>774</fpage>. <pub-id pub-id-type="doi">10.3389/fnins.2018.00774</pub-id><pub-id pub-id-type="pmid">30410432</pub-id></citation></ref>
<ref id="B35">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sengupta</surname> <given-names>A.</given-names></name> <name><surname>Ye</surname> <given-names>Y.</given-names></name> <name><surname>Wang</surname> <given-names>R.</given-names></name> <name><surname>Liu</surname> <given-names>C.</given-names></name> <name><surname>Roy</surname> <given-names>K.</given-names></name></person-group> (<year>2019</year>). <article-title>Going deeper in spiking neural networks: VGG and residual architectures</article-title>. <source>Front. Neurosci.</source> <volume>13</volume>, <fpage>95</fpage>. <pub-id pub-id-type="doi">10.3389/fnins.2019.00095</pub-id><pub-id pub-id-type="pmid">30899212</pub-id></citation></ref>
<ref id="B36">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Seo</surname> <given-names>J.</given-names></name> <name><surname>Brezzo</surname> <given-names>B.</given-names></name> <name><surname>Liu</surname> <given-names>Y.</given-names></name> <name><surname>Parker</surname> <given-names>B. D.</given-names></name> <name><surname>Esser</surname> <given-names>S. K.</given-names></name> <name><surname>Montoye</surname> <given-names>R. K.</given-names></name> <etal/></person-group>. (<year>2011</year>). <article-title>&#x0201C;A 45nm CMOS neuromorphic chip with a scalable architecture for learning in networks of spiking neurons,&#x0201D;</article-title> in <source>2011 IEEE Custom Integrated Circuits Conference (CICC)</source> (<publisher-loc>San Jose, CA</publisher-loc>), <fpage>1</fpage>&#x02013;<lpage>4</lpage>.</citation>
</ref>
<ref id="B37">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Shafiee</surname> <given-names>A.</given-names></name> <name><surname>Nag</surname> <given-names>A.</given-names></name> <name><surname>Muralimanohar</surname> <given-names>N.</given-names></name> <name><surname>Balasubramonian</surname> <given-names>R.</given-names></name> <name><surname>Paul Strachan</surname> <given-names>J.</given-names></name> <name><surname>Hu</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2016</year>). <article-title>&#x0201C;ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars,&#x0201D;</article-title> in <source>Proceedings of the 43rd International Symposium on Computer Architecture ISCA &#x00027;16</source> (<publisher-loc>Seoul</publisher-loc>: <publisher-name>IEEE Press</publisher-name>), <fpage>14</fpage>&#x02013;<lpage>26</lpage>.</citation>
</ref>
<ref id="B38">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>She</surname> <given-names>X.</given-names></name> <name><surname>Long</surname> <given-names>Y.</given-names></name> <name><surname>Kim</surname> <given-names>D.</given-names></name> <name><surname>Mukhopadhyay</surname> <given-names>S.</given-names></name></person-group> (<year>2021</year>). <article-title>Scienet: deep learning with spike-assisted contextual information extraction</article-title>. <source>Pattern Recogn.</source> <volume>118</volume>, <fpage>108002</fpage>. <pub-id pub-id-type="doi">10.1016/j.patcog.2021.108002</pub-id></citation>
</ref>
<ref id="B39">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>She</surname> <given-names>X.</given-names></name> <name><surname>Long</surname> <given-names>Y.</given-names></name> <name><surname>Mukhopadhyay</surname> <given-names>S.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Fast and low-precision learning in GPU-accelerated spiking neural network,&#x0201D;</article-title> in <source>2019 Design, Automation &#x00026; Test in Europe Conference &#x00026; Exhibition (DATE)</source> (<publisher-loc>Florence</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>450</fpage>&#x02013;<lpage>455</lpage>.</citation>
</ref>
<ref id="B40">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>She</surname> <given-names>X.</given-names></name> <name><surname>Saha</surname> <given-names>P.</given-names></name> <name><surname>Kim</surname> <given-names>D.</given-names></name> <name><surname>Long</surname> <given-names>Y.</given-names></name> <name><surname>Mukhopadhyay</surname> <given-names>S.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;SAFE-DNN: a deep neural network with spike assisted feature extraction for noise robust inference,&#x0201D;</article-title> in <source>2020 International Joint Conference on Neural Networks (IJCNN)</source> (<publisher-loc>Glasgow</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>8</lpage>.</citation>
</ref>
<ref id="B41">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Simonyan</surname> <given-names>K.</given-names></name> <name><surname>Zisserman</surname> <given-names>A.</given-names></name></person-group> (<year>2014</year>). <article-title>Very deep convolutional networks for large-scale image recognition</article-title>. <source>arXiv preprint</source> arXiv:1409.1556. <pub-id pub-id-type="doi">10.48550/ARXIV.1409.1556</pub-id></citation>
</ref>
<ref id="B42">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Singh</surname> <given-names>S.</given-names></name> <name><surname>Sarma</surname> <given-names>A.</given-names></name> <name><surname>Jao</surname> <given-names>N.</given-names></name> <name><surname>Pattnaik</surname> <given-names>A.</given-names></name> <name><surname>Lu</surname> <given-names>S.</given-names></name> <name><surname>Yang</surname> <given-names>K.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>&#x0201C;NEBULA: a neuromorphic spin-based ultra-low power architecture for SNNs and ANNs,&#x0201D;</article-title> in <source>2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA)</source> (<publisher-loc>Valencia</publisher-loc>), <fpage>363</fpage>&#x02013;<lpage>376</lpage>.</citation>
</ref>
<ref id="B43">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Srinivasan</surname> <given-names>G.</given-names></name> <name><surname>Panda</surname> <given-names>P.</given-names></name> <name><surname>Roy</surname> <given-names>K.</given-names></name></person-group> (<year>2018</year>). <article-title>STDP-based unsupervised feature learning using convolution-over-time in spiking neural networks for energy-efficient neuromorphic computing</article-title>. <source>ACM J. Emerg. Technol. Comput. Syst. (JETC)</source> <volume>14</volume>, <fpage>1</fpage>&#x02013;<lpage>12</lpage>. <pub-id pub-id-type="doi">10.1145/3266229</pub-id></citation>
</ref>
<ref id="B44">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sze</surname> <given-names>V.</given-names></name> <name><surname>Chen</surname> <given-names>Y.-H.</given-names></name> <name><surname>Yang</surname> <given-names>T.-J.</given-names></name> <name><surname>Emer</surname> <given-names>J. S.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Efficient processing of deep neural networks,&#x0201D;</article-title> in <source>Synthesis Lectures on Computer Architecture</source>, vol. <volume>15</volume> (<publisher-loc>San Rafael, CA</publisher-loc>: <publisher-name>Morgan and Claypool</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>341</lpage>.</citation>
</ref>
<ref id="B45">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Tavanaei</surname> <given-names>A.</given-names></name> <name><surname>Maida</surname> <given-names>A. S.</given-names></name></person-group> (<year>2016</year>). <article-title>Bio-inspired spiking convolutional neural network using layer-wise sparse coding and STDP learning</article-title>. <source>arXiv [Preprint]</source>. arXiv:1611.03000. Available online at: <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/pdf/1611.03000.pdf">https://arxiv.org/pdf/1611.03000.pdf</ext-link></citation>
</ref>
<ref id="B46">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>G.</given-names></name> <name><surname>Ma</surname> <given-names>S.</given-names></name> <name><surname>Wu</surname> <given-names>Y.</given-names></name> <name><surname>Pei</surname> <given-names>J.</given-names></name> <name><surname>Zhao</surname> <given-names>R.</given-names></name> <name><surname>Shi</surname> <given-names>L.</given-names></name></person-group> (<year>2021</year>). <article-title>End-to-end implementation of various hybrid neural networks on a cross-paradigm neuromorphic chip</article-title>. <source>Front. Neurosci.</source> <volume>15</volume>, <fpage>45</fpage>. <pub-id pub-id-type="doi">10.3389/fnins.2021.615279</pub-id><pub-id pub-id-type="pmid">33603643</pub-id></citation></ref>
<ref id="B47">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>Y.</given-names></name> <name><surname>Deng</surname> <given-names>L.</given-names></name> <name><surname>Li</surname> <given-names>G.</given-names></name> <name><surname>Zhu</surname> <given-names>J.</given-names></name> <name><surname>Shi</surname> <given-names>L.</given-names></name></person-group> (<year>2018</year>). <article-title>Spatio-temporal backpropagation for training high-performance spiking neural networks</article-title>. <source>Front. Neurosci.</source> <volume>12</volume>, <fpage>331</fpage>. <pub-id pub-id-type="doi">10.3389/fnins.2018.00331</pub-id><pub-id pub-id-type="pmid">29875621</pub-id></citation></ref>
</ref-list> 
</back>
</article> 