<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Archiving and Interchange DTD v2.3 20070202//EN" "archivearticle.dtd">
<article article-type="methods-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Electron.</journal-id>
<journal-title>Frontiers in Electronics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Electron.</abbrev-journal-title>
<issn pub-type="epub">2673-5857</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">1129675</article-id>
<article-id pub-id-type="doi">10.3389/felec.2023.1129675</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Electronics</subject>
<subj-group>
<subject>Methods</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Straightforward data transfer in a blockwise dataflow for an analog RRAM-based CIM system</article-title>
<alt-title alt-title-type="left-running-head">Liu et al.</alt-title>
<alt-title alt-title-type="right-running-head">
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/felec.2023.1129675">10.3389/felec.2023.1129675</ext-link>
</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Liu</surname>
<given-names>Yuyi</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/2097709/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Gao</surname>
<given-names>Bin</given-names>
</name>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/953766/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Yao</surname>
<given-names>Peng</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/1205041/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Liu</surname>
<given-names>Qi</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/2276449/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Zhang</surname>
<given-names>Qingtian</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/212820/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Wu</surname>
<given-names>Dong</given-names>
</name>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Tang</surname>
<given-names>Jianshi</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/581323/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Qian</surname>
<given-names>He</given-names>
</name>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Wu</surname>
<given-names>Huaqiang</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/499923/overview"/>
</contrib>
</contrib-group>
<aff>
<institution>School of Integrated Circuits</institution>, <institution>Beijing National Research Center for Information Science and Technology (BNRist)</institution>, <institution>Tsinghua University</institution>, <addr-line>Beijing</addr-line>, <country>China</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/939955/overview">Can Li</ext-link>, The University of Hong Kong, Hong Kong SAR, China</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1342086/overview">Xiaoming Chen</ext-link>, Chinese Academy of Sciences (CAS), China</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1341086/overview">Bing Li</ext-link>, Capital Normal University, China</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Bin Gao, <email>gaob1@tsinghua.edu.cn</email>
</corresp>
</author-notes>
<pub-date pub-type="epub">
<day>17</day>
<month>04</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="collection">
<year>2023</year>
</pub-date>
<volume>4</volume>
<elocation-id>1129675</elocation-id>
<history>
<date date-type="received">
<day>22</day>
<month>12</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>03</day>
<month>04</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2023 Liu, Gao, Yao, Liu, Zhang, Wu, Tang, Qian and Wu.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Liu, Gao, Yao, Liu, Zhang, Wu, Tang, Qian and Wu</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>Analog resistive random-access memory (RRAM)-based computation-in-memory (CIM) technology is promising for constructing artificial intelligence (AI) with high energy efficiency and excellent scalability. However, the large overhead of analog-to-digital converters (ADCs) is a key limitation. In this work, we propose a novel LINKAGE architecture that eliminates PE-level ADCs and leverages an analog data transfer module to implement inter-array data processing. A blockwise dataflow is further proposed to accelerate convolutional neural networks (CNNs), speeding up compute-intensive layers and solving the unbalanced pipeline problem. To obtain accurate and reliable benchmark results, key component modules, such as straightforward link (SFL) modules and Tile-level ADCs, are designed in standard 28&#xa0;nm CMOS technology. The evaluation shows that LINKAGE outperforms the conventional ADC/DAC-based architecture with a 2.07&#xd7;&#x223c;11.22&#xd7; improvement in throughput, a 2.45&#xd7;&#x223c;7.00&#xd7; improvement in energy efficiency, and a 22%&#x2013;51% reduction in area overhead while maintaining accuracy. Our LINKAGE architecture can achieve 22.9&#x223c;24.4 TOPS/W energy efficiency (4b-IN/4b-W) and 1.82&#x223c;4.53 TOPS throughput with the blockwise method. This work demonstrates a new method for significantly improving the energy efficiency of CIM chips, which can be applied to general CNNs/FCNNs.</p>
</abstract>
<kwd-group>
<kwd>computation-in-memory</kwd>
<kwd>resistive random-access memory</kwd>
<kwd>computing-intensive</kwd>
<kwd>straightforward link</kwd>
<kwd>energy efficiency</kwd>
<kwd>throughput</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1 Introduction</title>
<p>Deep neural networks (DNNs) have seen explosive growth in many AI applications over the last few years, such as computer vision and speech recognition (<xref ref-type="bibr" rid="B18">Sze et al., 2017</xref>). Many domain-specific DNN accelerators have been designed for edge applications, where massive data need to be transferred between computing and memory units. The memory access latency in von Neumann architecture is difficult to improve, and this largely limits its energy efficiency (<xref ref-type="bibr" rid="B18">Sze et al., 2017</xref>; <xref ref-type="bibr" rid="B19">Xu et al., 2018</xref>). Due to their advantages of high density, multilevel capability, and CMOS compatibility, analog RRAM-based computation-in-memory (CIM) chips have been widely investigated as promising candidates to improve energy efficiency and reduce memory bandwidth requirements (<xref ref-type="bibr" rid="B25">Zidan et al., 2018</xref>; <xref ref-type="bibr" rid="B23">Zhang et al., 2020</xref>). However, although the RRAM array has high computation density and energy efficiency, the overhead of the digital-to-analog converter (DAC) and analog-to-digital converter (ADC) between arrays greatly limits the system energy efficiency. For example, it has been shown that 2-bit DAC power consumption accounts for up to 24% of total consumption and 8-bit ADC accounts for as much as 61% (<xref ref-type="bibr" rid="B12">Liu et al., 2020</xref>; <xref ref-type="bibr" rid="B15">Shafiee et al., 2016</xref>). In addition, the energy cost of data movements due to the loaded and stored intermediate digital data also limits the system energy efficiency; this cost can reach 83% of the total cost in PRIME (<xref ref-type="bibr" rid="B3">Chi et al., 2016</xref>).</p>
<p>To address these issues, several recent works have proposed using RRAM to perform in-RRAM partial sum accumulation and adopting RRAM as an analog local buffer to enhance analog data locality, such as in CASCADE (<xref ref-type="bibr" rid="B4">Chou et al., 2019</xref>). Other works have proposed applying analog CMOS components between RRAM arrays to reduce the ADC/DAC overhead (<xref ref-type="bibr" rid="B1">Bayat et al., 2018</xref>; <xref ref-type="bibr" rid="B10">Kiani et al., 2021</xref>; <xref ref-type="bibr" rid="B24">Zhou et al., 2021</xref>). However, these methods are mainly suitable for fully connected neural networks (FCNNs), whereas convolutional layers are compute-bound. Although a convolutional operation can be represented by general matrix&#x2012;matrix multiplications (GEMMs) <italic>via</italic> the Im2Col operation (<xref ref-type="bibr" rid="B14">Qin et al., 2020</xref>), the next convolutional layer cannot start its operation until enough outputs have been generated by the previous convolutional layer, so the data must be aggregated in a buffer between convolutional layers. However, the components proposed in <xref ref-type="bibr" rid="B1">Bayat et al. (2018</xref>), <xref ref-type="bibr" rid="B10">Kiani et al. (2021</xref>), and <xref ref-type="bibr" rid="B24">Zhou et al. (2021</xref>) cannot buffer intermediate data. Therefore, the energy efficiency drops significantly when these methods are generalized to CNNs. In addition, the shift-add process can be moved before the AD conversion and conducted in the analog domain, effectively eliminating the digital shift-add module (<xref ref-type="bibr" rid="B7">Jiang et al., 2022a</xref>). TIMELY achieves up to an 18.2&#xd7; improvement in energy efficiency over ISAAC (<xref ref-type="bibr" rid="B15">Shafiee et al., 2016</xref>), whose energy efficiency is approximately 300 GOPS/W. <xref ref-type="bibr" rid="B21">Yun et al. (2021</xref>) proposed value-aware ADC bypass techniques and improved the overall system energy efficiency by up to 3.43&#xd7; in 8-bit precision networks compared to ISAAC. BRAHMS (<xref ref-type="bibr" rid="B17">Song et al., 2021</xref>) reorders the activation and pooling functions in front of AD conversions and forms fused operators to eliminate useless AD conversions, achieving a 6.64&#xd7; average energy reduction over an ISAAC-like design with 4-bit RRAM precision. ENNA (<xref ref-type="bibr" rid="B8">Jiang et al., 2022b</xref>; <xref ref-type="bibr" rid="B6">Jiang et al., 2023</xref>) is a CIM architecture based on an ADC-free sub-array design with a pulse-width-modulation (PWM)-based input encoding scheme to improve throughput. To address the overhead of peripheral circuits and local access in analog RRAM-based CIM systems, we present the straightforward link in the analog domain and generalizable (LINKAGE) architecture. The key contributions of this paper are as follows:<list list-type="simple">
<list-item>
<p>&#x2022; The proposed LINKAGE architecture eliminates PE-level ADCs and leverages an analog data transfer module to implement inter-array data processing. It exploits a straightforward link module that can save the inter-array analog data to the local analog domain and directly transfer analog data to the next layer.</p>
</list-item>
<list-item>
<p>&#x2022; For CNNs, we propose a blockwise method to speed up compute-intensive layers and solve the unbalanced pipeline problem.</p>
</list-item>
<list-item>
<p>&#x2022; To obtain accurate and reliable evaluation results, the key component modules are designed in standard 28&#xa0;nm CMOS technology. Our LINKAGE architecture can achieve 22.9&#x223c;24.4 TOPS/W energy efficiency and 1.82&#x223c;4.53 TOPS throughput (4b-IN/4b-W/4b-O) with the blockwise method.</p>
</list-item>
</list>
</p>
</sec>
<sec id="s2">
<title>2 Background</title>
<sec id="s2-1">
<title>2.1 CNN and data reuse</title>
<p>There are three forms of data reuse in the processing of CNNs (<xref ref-type="bibr" rid="B2">Chen et al., 2017</xref>), as shown in <xref ref-type="fig" rid="F1">Figure 1A</xref>. The first form is input feature map (IFM) reuse: each IFM can be reused by M kernels to generate M output feature map (OFM) channels. The second form is kernel reuse: each kernel can be reused by multiple IFMs. The third form is convolution reuse: each kernel weight is reused E &#xd7; E times in one IFM, and each IFM pixel is reused K &#xd7; K times in one kernel, where E and K are the sizes of each IFM plane and each kernel plane, respectively. The next patch of an IFM simply needs to update &#x23;stride&#xd7;&#x23;K&#xd7;&#x23;Channel input pixels. To maximize energy efficiency and minimize memory bandwidth, the goal is to exploit all three forms of data reuse in the analog RRAM-based CIM system. For IFM reuse and kernel reuse, the IFMs are encoded as voltage-level-based inputs on the bit-lines (BLs), and the kernel weights are represented as the conductance matrices of the RRAM arrays. Because RRAM is non-volatile, the conductance matrices can be retained on the chip at all times, so IFMs and kernels are naturally reused in the RRAM-based CIM system. Furthermore, we propose a blockwise mapping method and dataflow to take advantage of convolution reuse, reducing data communication and redundant data production.</p>
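<p>The reuse counts above can be tallied in a short sketch. This is our illustration, not code from the paper; the function name and the example layer sizes are hypothetical, with E and K taken as the feature-map and kernel plane sizes as defined in the text.</p>

```python
# Hedged sketch: counting the three forms of data reuse for one convolutional
# layer. E: feature-map plane size, K: kernel plane size, C: input channels,
# M: number of kernels (output channels). All shapes are illustrative.
def reuse_counts(E, K, C, M, stride=1):
    ifm_reuse = M                # each IFM is reused by M kernels
    weight_reuse = E * E         # each kernel weight is reused E*E times per IFM
    pixel_reuse = K * K          # each IFM pixel is reused up to K*K times per kernel
    new_inputs = stride * K * C  # #stride x #K x #Channel pixels updated per patch
    return ifm_reuse, weight_reuse, pixel_reuse, new_inputs

# Example: a 3x3 kernel, 64 input channels, 128 kernels, 32x32 feature map.
counts = reuse_counts(E=32, K=3, C=64, M=128)
```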
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>
<bold>(A)</bold> Illustration of data reuse and <bold>(B)</bold> dataflow bottleneck in CNNs.</p>
</caption>
<graphic xlink:href="felec-04-1129675-g001.tif"/>
</fig>
</sec>
<sec id="s2-2">
<title>2.2 Blocked pipeline in CNNs</title>
<p>For convolutional neural networks, the input feature map shrinks as the network deepens while the convolution kernel dimension grows. This results in a low computation/weight ratio for deeper layers and a high computation/weight ratio for shallower layers. The first few layers of a convolutional neural network are compute-intensive, as shown in <xref ref-type="fig" rid="F1">Figure 1B</xref>. When all layers of the VGG16 network are mapped only once onto the CIM chip, the number of operations in the first two layers is much larger than that in the other layers, while the number of arrays required by the later layers is much larger than that of the first two layers. The process is therefore blocked in the first few compute-intensive layers, leading to an unbalanced pipeline and ultimately a throughput bottleneck. To improve system throughput, the pipeline should be balanced; ISAAC and PipeLayer, for example, replicate the weights of the first few layers of the network to improve intra-layer parallelism. In short, the first two layers of convolutional neural networks have the highest computation/weight ratio, large feature maps, and large amounts of interlayer data transmission. To accelerate these two most compute-intensive layers, a blockwise mapping method and dataflow are proposed to solve the unbalanced pipeline problem. In addition, the proposed blockwise method can reuse the analog data stored in local analog buffers, greatly relaxing the data transmission constraints.</p>
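<p>The computation/weight imbalance can be made concrete with a back-of-the-envelope sketch. The layer shapes below are VGG16-style assumptions chosen for illustration, not measurements from this work.</p>

```python
# Illustrative sketch of the computation/weight ratio: each weight fires once
# per output pixel, so the ratio equals the output plane area E*E.
def conv_ratio(C, K, M, E):
    """C: in-channels, K: kernel side, M: out-channels, E: output side."""
    weights = C * K * K * M
    macs = weights * E * E   # multiply-accumulate operations per inference
    return macs / weights    # computation/weight ratio = E * E

early_ratio = conv_ratio(C=3, K=3, M=64, E=224)    # first VGG16-style conv
deep_ratio = conv_ratio(C=512, K=3, M=512, E=14)   # late VGG16-style conv
```

The early layer performs roughly 256 times more work per stored weight than the late one, which is why mapping every layer only once leaves the pipeline blocked at the front.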
</sec>
</sec>
<sec id="s3">
<title>3 Proposed implementation</title>
<sec id="s3-1">
<title>3.1 Design of ADC-less RRAM PE</title>
<p>
<xref ref-type="fig" rid="F2">Figure 2A</xref> shows the design of the ADC-less RRAM processing element (PE). One PE consists of a 576 &#xd7; 128 1T1R RRAM array, BL analog buffers, current subtractors, straightforward links (SFLs), and a digital control module. In the 1T1R array, word-lines (WLs) and source-lines (SLs) run horizontally, and BLs run vertically, perpendicular to the SLs. Neural network weights are represented as the conductance of RRAM cells. Each weight is represented by two RRAM cells because an RRAM cell cannot represent a negative weight directly; each 1T1R cell stores a 4-bit value, and the difference between the two cells&#x2019; conductances gives the signed weight value. There are two reasons for this PE size. First, 576 is a multiple of 3 &#xd7; 3, so the PE size can match the size of a convolutional layer and maximize array utilization. Second, regarding IR drop, an RRAM array with 128 columns and 576 rows incurs only a small accuracy loss (<xref ref-type="bibr" rid="B22">Zhang et al., 2019</xref>). The PE has analog inputs and analog outputs. The analog inputs are voltage-level-based inputs encoded by DACs at a higher level of the hierarchy. First, the inputs are applied to the RRAM array through BL analog buffers. Each buffer is a unity gain buffer (UGB), a single-ended operational amplifier (OpAmp) with negative unity feedback. The OpAmp is a two-stage amplifier with a class-AB output stage, as shown in <xref ref-type="fig" rid="F2">Figure 2C</xref>. UGBs are adopted to stabilize the analog voltage and drive the BLs of the RRAM array. Then, the vector-matrix multiplication (VMM) between the voltage-level-based inputs and the weight matrix is implemented by the RRAM array, generating analog current outputs. Current subtractors (I-SUB) based on the current mirror structure (<xref ref-type="bibr" rid="B20">Xue et al., 2019</xref>) subtract the two current outputs. Finally, the subtracted currents are converted to analog voltages through charge-based conversion. The analog voltages are stored in the ADC-less RRAM PE temporarily and applied to the next RRAM array as read voltages. The conversion, temporary local analog storage, and driving are all realized by the proposed SFL module.</p>
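<p>The differential weight mapping and current subtraction can be sketched numerically. This is a behavioral illustration under our own assumptions (ideal linear cells, no IR drop or noise), not the authors' circuit model; the conductance normalization is hypothetical.</p>

```python
import numpy as np

# Minimal sketch: a signed weight W is split across two non-negative
# conductance matrices Gp and Gn, and the PE output is the difference of
# the two column currents, I = V @ Gp - V @ Gn, as formed by the I-SUBs.
rng = np.random.default_rng(0)
W = rng.uniform(-1.0, 1.0, size=(576, 128))  # target signed weight matrix
Gp = np.where(W > 0, W, 0.0)                 # "positive" RRAM cells
Gn = np.where(W < 0, -W, 0.0)                # "negative" RRAM cells
V = rng.uniform(0.0, 0.2, size=576)          # read voltages (max 0.2 V)
I = V @ Gp - V @ Gn                          # current-subtractor outputs
```

In the ideal case the subtracted currents recover the signed VMM, `V @ W`, which is what the SFL then converts to a voltage.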
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>
<bold>(A)</bold> ADC-less RRAM PE design; <bold>(B)</bold> schematic representation of the straightforward link (SFL). The SFL consists of a capacitor, a unity gain buffer (UGB), and a ReLU module; <bold>(C)</bold> amplifier in the UGB or BL analog buffer; <bold>(D)</bold> the ReLU module comprises a sense amplifier (SA) and a 2-to-1 MUX. <bold>(E)</bold> The SA is a latch-based voltage mode SA.</p>
</caption>
<graphic xlink:href="felec-04-1129675-g002.tif"/>
</fig>
<p>
<xref ref-type="fig" rid="F2">Figure 2B</xref> demonstrates the design of the SFL. It comprises a capacitor for charge-based current-to-voltage conversion, a UGB acting as a voltage buffer that stabilizes the voltage and drives the BLs of the next RRAM array, and a rectified linear unit (ReLU) module for the activation function. <xref ref-type="fig" rid="F2">Figure 2D</xref> shows the schematic representation of the ReLU module, composed of a sense amplifier (SA, as shown in <xref ref-type="fig" rid="F2">Figure 2E</xref>) and a 2-to-1 multiplexer (MUX). The capacitor serves not only as a converter but also as an analog memory (C<sub>MEM</sub>). The retention time of C<sub>MEM</sub> is on the order of milliseconds, comparable to DRAM, so the charge on C<sub>MEM</sub> does not decay appreciably within 100&#xa0;ns. The capacitance value must ensure that the range of integrated voltages does not exceed the maximum read voltage, 0.2&#xa0;V. In one 576 &#xd7; 128 1T1R RRAM array, algorithm evaluations show that the subtracted currents fall within &#xb1;11&#xa0;&#x3bc;A. Therefore, the capacitance value is 550&#xa0;fF, with an integration time of 10&#xa0;ns.</p>
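<p>The C<sub>MEM</sub> sizing follows directly from the charge-integration relation V = I&#xb7;t/C; the short check below verifies the numbers stated above (worst-case 11 &#x3bc;A over 10 ns on 550 fF reaching exactly the 0.2 V read-voltage limit).</p>

```python
# Check of the C_MEM sizing from the text: the integrated voltage
# V = I * t / C must not exceed the 0.2 V maximum read voltage.
I_max = 11e-6    # A, worst-case magnitude of the subtracted current
t_int = 10e-9    # s, integration time (phase PH2)
C_mem = 550e-15  # F, chosen capacitance
V_swing = I_max * t_int / C_mem  # one-sided voltage swing in volts
```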
<p>We design the SFL module in a standard 28-nm CMOS process through Cadence Virtuoso. The simulation waveforms are shown in <xref ref-type="fig" rid="F3">Figure 3</xref>. In the first phase (PH1), only switch (SW) 1 is turned on and the C<sub>MEM</sub> is reset to 350&#xa0;mV. In the second phase (PH2), only SW 2 is turned on and the C<sub>MEM</sub> is integrated by the subtracted current to realize the charge-based current-to-voltage conversion. In the third phase (PH3), only SW 3 is turned on and the temporarily stored voltages across the C<sub>MEM</sub> are rectified by the ReLU module and then drive the next RRAM array. The latency of the three phases is 30&#xa0;ns, and each phase occupies for 10&#xa0;ns. The key metric of SFL is to transfer the analog data to the next array accurately. Because analog voltages are sensitive to noise and interference, the SFL can be affected by process, voltage, and temperature (PVT) variations. We simulate the SFL under different process corners, temperatures, and supply voltages. The waveforms shown in <xref ref-type="fig" rid="F3">Figure 3</xref> demonstrate that the integrated voltages across the C<sub>MEM</sub> are transferred to the next array in all PVT situations. There is a deviation of about 0.4&#x2013;0.5&#xa0;mV. The total integrated noise of C<sub>MEM</sub> and UGB is 0.12&#xa0;mV<sub>rms</sub> and 0.53&#xa0;mV<sub>rms</sub>, respectively. The UGB offset is &#x2212;65&#xa0;&#x3bc;V, and the charge injected to C<sub>MEM</sub> is 6.29 &#xd7; 10<sup>&#x2212;19</sup>&#xa0;C (about 1.1&#xa0;&#x3bc;V). We consider all the PVT, noise, and interference in the neural network accuracy simulation and benchmark on the ResNet18/CIFAR-10. Because the one-sided swing of SFLs is 200&#xa0;mV (@ tt, 27&#xb0;C, and varies under different corners), the total deviation, 0.4&#x2013;0.5&#xa0;mV, is relatively little compared to the swing. 
In addition, random noise is considered during the offline training of the neural network, so the neural network can resist circuit noise after noise-aware training. Finally, neural networks also have a certain tolerance for noise. The accuracy can be maintained under noise and interference of the SFLs (as shown in <xref ref-type="table" rid="T1">Table 1</xref>).</p>
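<p>Noise-aware evaluation of this kind can be sketched as a weight-perturbation pass. The Gaussian noise model and the scaling by the maximum weight magnitude are our assumptions for illustration (mirroring the wnoise = 0.05 setting reported in Table 2), not the paper's exact training recipe.</p>

```python
import numpy as np

# Hedged sketch of noise-aware inference: Gaussian perturbations are added to
# the weights before the forward pass, followed by ReLU as in the SFL module.
# The noise model (std = wnoise * max|W|) is an assumption for illustration.
def noisy_relu_layer(x, W, wnoise=0.05, rng=None):
    rng = rng or np.random.default_rng(0)
    span = np.abs(W).max()
    W_pert = W + rng.normal(0.0, wnoise * span, size=W.shape)
    return np.maximum(x @ W_pert, 0.0)  # ReLU clamps negatives to zero

out = noisy_relu_layer(np.ones(4), np.eye(4))
```

Training with such perturbations exposes the network to the same disturbance it will see on-chip, which is why the accuracy in Table 1 stays near the software baseline across corners.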
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>Simulation waveforms of the proposed SFL module.</p>
</caption>
<graphic xlink:href="felec-04-1129675-g003.tif"/>
</fig>
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>Benchmark of ResNet18/CIFAR-10, considering the PVT, noise, and interference.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="center">Corner</th>
<th align="center">tt, 27&#xb0;C</th>
<th align="center">ff, 0&#xb0;C</th>
<th align="center">ff, 27&#xb0;C</th>
<th align="center">ff, 80&#xb0;C</th>
<th align="center">ss, 0&#xb0;C</th>
<th align="center">ss, 27&#xb0;C</th>
<th align="center">ss, 80&#xb0;C</th>
<th align="center">tt, 27&#xb0;C, 0.9V &#x2b; 10%</th>
<th align="center">tt, 27&#xb0;C, 0.9V-10%</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">Swing (mV)</td>
<td align="center">200.01</td>
<td align="center">212.15</td>
<td align="center">210.57</td>
<td align="center">207.68</td>
<td align="center">189.53</td>
<td align="center">188.13</td>
<td align="center">187.31</td>
<td align="center">200.85</td>
<td align="center">201.89</td>
</tr>
<tr>
<td align="center">Accuracy (%)</td>
<td align="center">90.66</td>
<td align="center">90.73</td>
<td align="center">90.69</td>
<td align="center">90.80</td>
<td align="center">90.67</td>
<td align="center">90.64</td>
<td align="center">90.65</td>
<td align="center">90.63</td>
<td align="center">90.74</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>tt means typical nmos and typical pmos; ff means fast nmos and fast pmos; ss means slow nmos and slow pmos.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</sec>
<sec id="s3-2">
<title>3.2 Hierarchical architecture</title>
<p>There are two levels in the LINKAGE hierarchy: the PE level and the Tile level. One Tile consists of ADC-less RRAM PEs. <xref ref-type="fig" rid="F4">Figure 4</xref> shows the Tile level of the LINKAGE architecture. There are two consecutive ADC-less PE stages, processing two consecutive layers of a neural network. The PE level leverages an analog data transfer module to implement inter-array data processing. Although LINKAGE eliminates ADCs, DACs, and local buffers at the PE level, it still needs digital quantization modules at the Tile level to provide scalability for a large-scale hierarchical system design: the analog outputs must be digitized and stored in global buffers for the various neural network layers. A Tile of LINKAGE is designed for two consecutive layers. The first layer is mapped to the PEs of the first stage in the Tile, and the second layer is mapped to the PEs of the second stage. The pipelines in LINKAGE are organized in pairs, with two layers occupying one pipeline stage. The second layer performs data conversion using Tile-level ADCs to prepare the data for the next pipeline stage. In addition, large NN layers are split and mapped to multiple PEs. For the summation required after splitting, the currents can be summed by connecting them to the same node.</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>Tile level of LINKAGE architecture.</p>
</caption>
<graphic xlink:href="felec-04-1129675-g004.tif"/>
</fig>
<p>Early layers are more compute-intensive, especially the first two. To speed up the two most compute-intensive layers, we propose a blockwise mapping method and dataflow for LINKAGE to solve the unbalanced pipeline and communication-bound problems. First, the IFMs of the first two layers have the largest planar size, so the long latency to prepare enough data for the next layers severely blocks the pipeline. Second, the huge volume of inter-layer data causes communication-bound problems. The proposed blockwise method reuses the stored inter-layer data and largely reduces the amount of data that needs to be digitized.</p>
<p>
<xref ref-type="fig" rid="F5">Figure 5</xref> illustrates the blockwise mapping method and dataflow. We assume that two consecutive convolutional layers are (C, K<sub>1</sub>, K<sub>1</sub>, N) and (N, K<sub>2</sub>, K<sub>2</sub>, M), where K<sub>1</sub> and K<sub>2</sub> are the kernel sizes, C is the input channel of the first layer, N is the output channel of the first layer and the input channel of the second layer, and M is the output channel of the second layer. At each time step, the input block moves by one stride in the IFM, as shown in <xref ref-type="fig" rid="F5">Figure 5A</xref>. One subblock has the size (C, K<sub>1</sub>, K<sub>1</sub>), the same as one kernel of the first layer. Each subblock is unfolded into a C &#xd7; K<sub>1</sub> &#xd7; K<sub>1</sub> vector to be input to an RRAM array, as shown in <xref ref-type="fig" rid="F5">Figure 5B</xref>, and the array outputs N results. Similarly, the second layer needs N &#xd7; K<sub>2</sub> &#xd7; K<sub>2</sub> data; otherwise, it cannot start the complete VMM operation. To construct a balanced flow, the first layer is replicated K<sub>2</sub> &#xd7; K<sub>2</sub> times on RRAM arrays. An input block correspondingly contains K<sub>2</sub> &#xd7; K<sub>2</sub> subblocks, and the subblocks are staggered by one stride, as shown in <xref ref-type="fig" rid="F5">Figure 5C</xref>. In the first time step, the K<sub>2</sub> &#xd7; K<sub>2</sub> subblocks are unfolded into K<sub>2</sub> &#xd7; K<sub>2</sub> vectors and all input to the RRAM arrays simultaneously. The first layer outputs N &#xd7; K<sub>2</sub> &#xd7; K<sub>2</sub> results, which are stored in the C<sub>MEM</sub> of the SFLs. Next, the results are transferred and input to the second layer through the SFLs, and the second layer outputs M results. At the next time step, the input block moves by one stride. K<sub>2</sub> &#xd7; (K<sub>2</sub> &#x2212; 1) subblocks in this input block are the same as in the last time step. The N &#xd7; K<sub>2</sub> &#xd7; (K<sub>2</sub> &#x2212; 1) results of these subblocks are already stored in the C<sub>MEM</sub> of the SFLs, so they need not be recalculated. Only the K<sub>2</sub> new subblocks are calculated, and N &#xd7; K<sub>2</sub> &#xd7; 1 results are updated in C<sub>MEM</sub>. Therefore, N &#xd7; K<sub>2</sub> &#xd7; K<sub>2</sub> outputs can be obtained simultaneously in one time step. The N &#xd7; K<sub>2</sub> &#xd7; K<sub>2</sub> inputs of the second layer are organized through MUX sets, each MUX controlled by a mod-K<sub>2</sub> synchronous counter.</p>
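<p>The per-step saving of the blockwise update rule can be tallied directly. This is a counting sketch of the scheme described above, with N and K<sub>2</sub> chosen as illustrative values.</p>

```python
# Sketch of the blockwise update rule: per time step only the K2 new
# subblocks are computed; the other K2*(K2-1) subblock results are reused
# from the analog memory C_MEM without recomputation.
def per_step_results(N, K2):
    needed = N * K2 * K2        # results required by the second layer
    computed = N * K2 * 1       # fresh column of K2 subblocks this step
    reused = needed - computed  # results served straight from C_MEM
    return computed, reused

computed, reused = per_step_results(N=64, K2=3)  # (192, 384)
```

For a 3 &#xd7; 3 second-layer kernel, two-thirds of the second layer's inputs come from C<sub>MEM</sub> at every step, which is where the reduction in data to be recomputed and digitized comes from.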
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption>
<p>The blockwise mapping method and dataflow for two consecutive convolutional layers. <bold>(A)</bold> At each time step, the input block moves by one stride in the IFM; <bold>(B)</bold> Each subblock is unfolded and input to an ADC-less RRAM PE; <bold>(C)</bold> Only the new K<sub>2</sub> subblocks will be calculated and N &#xd7; K<sub>2</sub> &#xd7; 1 results are updated into C<sub>MEM</sub>.</p>
</caption>
<graphic xlink:href="felec-04-1129675-g005.tif"/>
</fig>
<p>
<xref ref-type="fig" rid="F6">Figures 6A, C</xref> illustrate the dataflow of the first two consecutive convolutional layers at the architectural level. For the first layer, Conv1, input data are loaded from the input buffers to the DACs. The N &#xd7; K<sub>2</sub> &#xd7; K<sub>2</sub> outputs are obtained simultaneously from the RRAM arrays in the first stage. The output currents are converted to analog voltages by the SFLs and stored in the local analog domain. The analog voltages can be directly applied to the RRAM arrays in the second stage through the SFLs. Then, the results of the second layer, Conv2, are output through the ADCs and stored in a first-in first-out (FIFO) buffer. It should be noted that the Tile of LINKAGE used to compute convolutional layers is designed for exactly two consecutive layers; otherwise, the number of replications would increase exponentially. Therefore, the convolutional layers are computed in pairs in the LINKAGE architecture. For the basic block of ResNet (<xref ref-type="bibr" rid="B22">Zhang et al., 2019</xref>), if there is a shortcut layer, its workflow is marked with a dotted line (as shown in <xref ref-type="fig" rid="F6">Figure 6C</xref>). The currents of the shortcut and Conv2 can be summed by connecting the two currents to the same node. <xref ref-type="fig" rid="F6">Figures 6B, D</xref> illustrate the dataflow of the other pairs of consecutive layers. These layers are less compute-intensive than the first two, so they do not adopt array replication or the blockwise method.</p>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption>
<p>Dataflow between PEs for <bold>(A, C)</bold> two consecutive convolutional layers and <bold>(B, D)</bold> other two consecutive layers.</p>
</caption>
<graphic xlink:href="felec-04-1129675-g006.tif"/>
</fig>
</sec>
</sec>
<sec id="s4">
<title>4 Evaluations</title>
<sec id="s4-1">
<title>4.1 Experimental setup</title>
<p>To provide a fair comparison, we build a baseline design with a conventional ADC-based PE, as shown in <xref ref-type="fig" rid="F7">Figure 7</xref>. This CIM architecture consists of local buffers, DACs, RRAM arrays, ADCs, shift adders, controllers, and special function units (SFUs), such as pooling units and ReLU. The analog output currents of the RRAM arrays are converted to digital values through ADCs. Then, the partial outputs are combined by a shift-adder, and the results can be stored in a local buffer as inputs to other Tiles. The DAC is shared by the RRAM arrays. An 8-bit DAC would incur an enormous overhead and be impractical to design, so we choose a 4-bit DAC and retrain the NN with lower bit-width weights and activations: the inputs, outputs, and weights are all 4 bits. As shown in <xref ref-type="table" rid="T2">Table 2</xref>, the accuracy loss is within 2%. We design the 4-bit ADC and 4-bit DAC modules in a standard 28-nm CMOS process. The 4-bit DAC consists of an R-ladder and clamping buffers: the R-ladder generates discrete voltages, and the clamping buffers are UGBs used to clamp the 2<sup>4</sup> analog voltage levels. In addition, each row in an RRAM array needs a BL analog buffer to drive the input voltage. One ADC is shared by eight SLs in the baseline design.</p>
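<p>The 4-bit quantization used in the retraining can be sketched as a uniform round-to-nearest quantizer. The symmetric scaling scheme below is our assumption for illustration; the paper does not specify its exact quantizer or retraining recipe.</p>

```python
import numpy as np

# Hedged sketch of uniform 4-bit quantization for weights/activations.
# The symmetric scale (max|x| mapped to 2^bits - 1 steps) is an assumption.
def quantize(x, bits=4):
    levels = 2 ** bits - 1          # 15 steps for 4-bit precision
    scale = np.abs(x).max() / levels
    if scale == 0:
        return np.zeros_like(x)     # all-zero input stays zero
    return np.round(x / scale) * scale

x = np.linspace(-1.0, 1.0, 101)
xq = quantize(x, bits=4)
max_err = np.abs(x - xq).max()      # bounded by half a quantization step
```

Round-to-nearest keeps the worst-case error at half a step, which is consistent with the small (under 2%) accuracy loss observed after retraining at 4 bits.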
<fig id="F7" position="float">
<label>FIGURE 7</label>
<caption>
<p>Baseline design with conventional ADC-based PE.</p>
</caption>
<graphic xlink:href="felec-04-1129675-g007.tif"/>
</fig>
<table-wrap id="T2" position="float">
<label>TABLE 2</label>
<caption>
<p>The accuracy of different neural networks.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="center">Benchmark</th>
<th align="center">FCNN</th>
<th align="center">VGG-8</th>
<th align="center">ResNet18</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">Dataset</td>
<td align="center">MNIST</td>
<td align="center">CIFAR-10</td>
<td align="center">CIFAR-10</td>
</tr>
<tr>
<td align="center">Accuracy (software baseline)</td>
<td align="center">97.88%</td>
<td align="center">88.90%</td>
<td align="center">91.46%</td>
</tr>
<tr>
<td align="center">Accuracy (w_bit &#x3d; 4, a_bit &#x3d; 4, wnoise &#x3d; 0.05)</td>
<td align="center">97.38%</td>
<td align="center">87.43%</td>
<td align="center">89.96%</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4-2">
<title>4.2 Benchmark results and discussion</title>
<p>To keep the analysis close to a real prototype, we build an end-to-end CIM simulator with an integrated framework from the device to the algorithm. The simulator includes the noise-aware offline training algorithms, the complete circuit and architecture design of the RRAM neural process unit, and the non-idealities of RRAM (<xref ref-type="bibr" rid="B13">Liu and Gao, 2021</xref>). The performance of each module (measured from the circuit design) is integrated at the PE level in the LINKAGE hierarchy. For each neural network, the performance and energy efficiency are evaluated according to the network structure and the LINKAGE architecture. <xref ref-type="fig" rid="F8">Figure 8</xref> shows the benchmark results on FCNN/MNIST, VGG-8/CIFAR-10, ResNet-18/CIFAR-10, and ResNet-50/CIFAR-10. Across these tasks, the proposed SFL-based designs achieve 2.45&#xd7;&#x223c;3.17&#xd7; higher energy efficiency and 1.67&#xd7;&#x223c;4.30&#xd7; higher throughput than the baseline designs. Because the IFMs are processed continuously between arrays without intermediate quantization, latency is reduced and the workload of the Tile-level ADCs is decreased. Without the blockwise method, our LINKAGE architecture achieves 8.51&#x223c;10.35 TOPS/W energy efficiency (4b-IN/4b-W) and 0.68&#x223c;1.73 TOPS throughput. The blockwise method further improves the energy efficiency by 2.21&#xd7;&#x223c;2.54&#xd7;: with it, the LINKAGE architecture achieves 22.9&#x223c;24.4 TOPS/W energy efficiency and 1.82&#x223c;4.53 TOPS throughput. SFL-based designs reduce the area by 22%&#x2013;51% (<xref ref-type="fig" rid="F9">Figure 9</xref>), benefiting mainly from a substantial reduction in the total number of BL buffers and Tile-level ADCs. In addition, the blockwise method more than doubles the energy efficiency with little area overhead.</p>
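Since TOPS/W is simply tera-operations per joule, the efficiency figures above can be related to per-inference operation counts and energy. A minimal sketch with hypothetical numbers (chosen only to land near the reported range; these are not measured values from this work):

```python
def tops_per_watt(ops_per_inference, joules_per_inference):
    """Energy efficiency in TOPS/W == tera-operations per joule."""
    return ops_per_inference / joules_per_inference / 1e12

# Hypothetical per-inference numbers, for illustration only:
no_blockwise = tops_per_watt(2.3e9, 2.5e-4)   # ~9.2 TOPS/W
blockwise    = tops_per_watt(2.3e9, 1.0e-4)   # ~23 TOPS/W
improvement  = blockwise / no_blockwise       # ~2.5x
```

With the same operation count per inference, the improvement factor reduces to the ratio of per-inference energies, which is why cutting intermediate quantization energy translates directly into the 2.21&#xd7;&#x223c;2.54&#xd7; gains reported above.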
<fig id="F8" position="float">
<label>FIGURE 8</label>
<caption>
<p>Performance benchmarks on different DNN models (4-bit weight/4-bit input configuration).</p>
</caption>
<graphic xlink:href="felec-04-1129675-g008.tif"/>
</fig>
<fig id="F9" position="float">
<label>FIGURE 9</label>
<caption>
<p>Area overhead of different DNN models.</p>
</caption>
<graphic xlink:href="felec-04-1129675-g009.tif"/>
</fig>
<p><xref ref-type="table" rid="T3">Table 3</xref> compares LINKAGE with other related RRAM-based CIM macros. These works also propose ADC-less solutions to the ADC overhead problem. We list their array-level interfacing solutions, energy efficiency, and recognition accuracy. In all of these works, the accuracy is maintained close to the software baseline. For an intuitive comparison, the energy efficiency of each macro is normalized to a 1-bit &#xd7; 1-bit multiply-and-accumulate (MAC) operation (input bits &#xd7; weight bits &#xd7; energy efficiency, or MAC bits &#xd7; energy efficiency). As shown in <xref ref-type="table" rid="T3">Table 3</xref>, this work has the highest normalized energy efficiency.</p>
<table-wrap id="T3" position="float">
<label>TABLE 3</label>
<caption>
<p>Comparison table with recent RRAM-based CIM macros.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Related work</th>
<th align="center">DAC&#x2019;22 (<xref ref-type="bibr" rid="B17">Song et al., 2021</xref>)</th>
<th align="center">ISCA&#x2019;20 (<xref ref-type="bibr" rid="B11">Li et al., 2020</xref>)</th>
<th align="center">MICRO&#x2019;19 (<xref ref-type="bibr" rid="B4">Chou et al., 2019</xref>)</th>
<th align="center">TCAS-1&#x2032;22 (<xref ref-type="bibr" rid="B6">Jiang et al., 2022b</xref>)</th>
<th align="center">This work</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Array size</td>
<td align="center">128&#x00D7;128</td>
<td align="center">256&#x00D7;256</td>
<td align="center">64&#x00D7;64</td>
<td align="center">256&#x00D7;256</td>
<td align="center">576&#x00D7;128</td>
</tr>
<tr>
<td align="left">Weight precision</td>
<td align="center">2-bit/4-bit</td>
<td align="center">4-bit</td>
<td align="center">1-bit</td>
<td align="center">2-bit</td>
<td align="center">4-bit</td>
</tr>
<tr>
<td align="left">Array-level interface</td>
<td align="center">2-bit ARCAM/4-bit:ADC</td>
<td align="center">8-bit DTC/TDC</td>
<td align="center">DAC/TIA</td>
<td align="center">4-bit PWM-based DAC/edge capacitor</td>
<td align="center">DAC/capacitor</td>
</tr>
<tr>
<td align="left">Energy efficiency (TOPS/W)</td>
<td align="center">5.51 (2-bit)/2.52 (4-bit)</td>
<td align="center">21 (8-bit MAC)</td>
<td align="center">1.33</td>
<td align="center">26.97</td>
<td align="center">22.9&#x2013;24.4</td>
</tr>
<tr>
<td align="left">Accuracy</td>
<td align="center">Baseline</td>
<td align="center">&#x2264; 0.1% loss</td>
<td align="center">90% @ MLP-2, w/6-bit BL resolution</td>
<td align="center">Baseline (7-bit Tile-level ADC)</td>
<td align="center">Baseline (4-bit Tile-level ADC)</td>
</tr>
<tr>
<td align="left">Normalized TOPS/W</td>
<td align="center">22.0/40.32</td>
<td align="center">168</td>
<td align="center">7.98</td>
<td align="center">215.76</td>
<td align="center">366.4</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>Normalized: for 1-bit &#xd7; 1-bit MAC operation.</p>
</fn>
</table-wrap-foot>
</table-wrap>
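The normalization stated in the table footnote (MAC bits &#xd7; energy efficiency) can be checked directly against the listed entries; a small sketch using the values as reported in Table 3:

```python
def normalized_tops_w(tops_w, bit_product):
    """Normalize macro energy efficiency to a 1-bit x 1-bit MAC:
    (input bits x weight bits, or MAC bits) x TOPS/W."""
    return tops_w * bit_product

# Entries as listed in Table 3:
dac22_4b  = normalized_tops_w(2.52, 4 * 4)   # 40.32 (4-bit x 4-bit)
isca20    = normalized_tops_w(21.0, 8)       # 168   (8-bit MAC)
micro19   = normalized_tops_w(1.33, 6)       # 7.98  (6-bit BL resolution)
tcas22    = normalized_tops_w(26.97, 4 * 2)  # 215.76 (4-bit x 2-bit)
this_work = normalized_tops_w(22.9, 4 * 4)   # 366.4 (4-bit x 4-bit)
```

Each product reproduces the "Normalized TOPS/W" row of the table, confirming that this work's 366.4 is the highest of the compared macros.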
</sec>
</sec>
<sec sec-type="conclusion" id="s5">
<title>5 Conclusion</title>
<p>In this work, we propose a CIM architecture design that eliminates PE-level ADCs. It exploits a straightforward link module that stores inter-array analog data in the local analog domain and transfers them directly to the next array. Furthermore, for CNNs, we propose a blockwise dataflow to speed up compute-intensive layers and solve the unbalanced-pipeline problem. To obtain accurate and reliable evaluation results, the PE-level modules are designed in a standard 28-nm CMOS technology. Our LINKAGE architecture achieves 22.9&#x223c;24.4 TOPS/W energy efficiency and 1.82&#x223c;4.53 TOPS throughput (4b-IN/4b-W/4b-O) with the blockwise method. The evaluation results demonstrate that the LINKAGE architecture can significantly improve the energy efficiency of CIM chips. In addition, LINKAGE provides a new type of PE and extends the search space in CIM design.</p>
</sec>
</body>
<back>
<sec sec-type="data-availability" id="s6">
<title>Data availability statement</title>
<p>The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.</p>
</sec>
<sec id="s7">
<title>Author contributions</title>
<p>YL proposed the LINKAGE architecture and performed most analyses. PY and BG checked its feasibility. PY, QL, and DW checked the function of the straightforward link module. HQ and HW conducted the studies on the RRAM arrays. QZ performed some analyses on algorithms. YL, BG, and JT prepared the manuscript. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec id="s8">
<title>Funding</title>
<p>This work is supported by Natural Science Foundation of China (92064001, 62025111), OPPO-THU Joint Project, IoT Intelligent Microsystem Center of Tsinghua University-China Mobile Joint Research Institute, and Beijing Advanced Innovation Center for Integrated Circuits.</p>
</sec>
<sec sec-type="COI-statement" id="s9">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s10">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bayat</surname>
<given-names>F. M.</given-names>
</name>
<name>
<surname>Prezioso</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Chakrabarti</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Nili</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Kataeva</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Strukov</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Implementation of multilayer perceptron network with highly uniform passive memristive crossbar circuits</article-title>. <source>Nat. Commun.</source> <volume>9</volume> (<issue>1</issue>), <fpage>2331</fpage>. <pub-id pub-id-type="doi">10.1038/s41467-018-04482-4</pub-id>
</citation>
</ref>
<ref id="B2">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>Y. H.</given-names>
</name>
<name>
<surname>Krishna</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Emer</surname>
<given-names>J. S.</given-names>
</name>
<name>
<surname>Sze</surname>
<given-names>V.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks</article-title>. <source>IEEE J. Solid-State Circuits</source> <volume>52</volume>, <fpage>127</fpage>&#x2013;<lpage>138</lpage>. <pub-id pub-id-type="doi">10.1109/JSSC.2016.2616357</pub-id>
</citation>
</ref>
<ref id="B3">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Chi</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Y.</given-names>
</name>
<etal/>
</person-group> (<year>2016</year>). &#x201c;<article-title>PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory</article-title>,&#x201d; in <conf-name>ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)</conf-name>, <fpage>27</fpage>&#x2013;<lpage>39</lpage>. <pub-id pub-id-type="doi">10.1109/ISCA.2016.13</pub-id>
</citation>
</ref>
<ref id="B4">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Chou</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Botimer</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Z.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>CASCADE: Connecting RRAMs to extend analog dataflow in an end-to-end in-memory processing paradigm</article-title>,&#x201d; in <conf-name>52nd Annual IEEE/ACM International Symposium on Microarchitecture</conf-name>, <fpage>114</fpage>&#x2013;<lpage>125</lpage>. <pub-id pub-id-type="doi">10.1145/3352460.3358328</pub-id>
</citation>
</ref>
<ref id="B5">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>He</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Ren</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Deep residual learning for image recognition</article-title>,&#x201d; in <conf-name>IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</conf-name>, <fpage>770</fpage>&#x2013;<lpage>778</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2016.90</pub-id>
</citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jiang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Cosemans</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Catthoor</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2022a</year>). <article-title>Analog-to-Digital converter design exploration for compute-in-memory accelerators</article-title>. <source>IEEE Des. Test.</source> <volume>39</volume>, <fpage>48</fpage>&#x2013;<lpage>55</lpage>. <pub-id pub-id-type="doi">10.1109/MDAT.2021.3050715</pub-id>
</citation>
</ref>
<ref id="B8">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Jiang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2022b</year>). &#x201c;<article-title>A 40nm analog-input ADC-free compute-in-memory RRAM macro with pulse-width modulation between sub-arrays</article-title>,&#x201d; in <conf-name>IEEE Symposium on VLSI Technology and Circuits</conf-name>, <fpage>266</fpage>&#x2013;<lpage>267</lpage>. <pub-id pub-id-type="doi">10.1109/VLSITechnologyandCir46769.2022.9830211</pub-id>
</citation>
</ref>
<ref id="B6">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jiang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>ENNA: An efficient neural network accelerator design based on ADC-free compute-in-memory subarrays</article-title>. <source>IEEE Trans. Circuits Syst. I Regul. Pap.</source> <volume>70</volume>, <fpage>353</fpage>&#x2013;<lpage>363</lpage>. <pub-id pub-id-type="doi">10.1109/TCSI.2022.3208755</pub-id>
</citation>
</ref>
<ref id="B10">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kiani</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Yin</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>J. J.</given-names>
</name>
<name>
<surname>Xia</surname>
<given-names>Q.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>A fully hardware-based memristive multilayer neural network</article-title>. <source>Sci. Adv.</source> <volume>7</volume>, <fpage>19</fpage>. <pub-id pub-id-type="doi">10.1126/sciadv.abj4801</pub-id>
</citation>
</ref>
<ref id="B11">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Xie</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>TIMELY: Pushing Data Movements and Interfaces in PIM Accelerators Towards Local and in Time Domain</article-title>,&#x201d; in <conf-name>ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA)</conf-name>, <fpage>832</fpage>&#x2013;<lpage>845</lpage>. <pub-id pub-id-type="doi">10.1109/ISCA45697.2020.00073</pub-id>
</citation>
</ref>
<ref id="B12">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Gao</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Yao</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Pang</surname>
<given-names>Y.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). &#x201c;<article-title>A fully integrated analog ReRAM based 78.4TOPS/W compute-in-memory chip with fully parallel MAC computing</article-title>,&#x201d; in <conf-name>2020 IEEE International Solid- State Circuits Conference (ISSCC)</conf-name>, <fpage>500</fpage>&#x2013;<lpage>502</lpage>. <pub-id pub-id-type="doi">10.1109/ISSCC19947.2020.9062953</pub-id>
</citation>
</ref>
<ref id="B13">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Gao</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>System and technology Co-optimization for RRAM based computation-in-memory chip</article-title>,&#x201d; in <conf-name>International Conference on IC Design and Technology (ICICDT) (IEEE)</conf-name>, <fpage>1</fpage>&#x2013;<lpage>4</lpage>. <pub-id pub-id-type="doi">10.1109/ICICDT51558.2021.9626398</pub-id>
</citation>
</ref>
<ref id="B14">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Qin</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Samajdar</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Kwon</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Nadella</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Srinivasan</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Das</surname>
<given-names>D.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). &#x201c;<article-title>SIGMA: A sparse and irregular GEMM accelerator with flexible interconnects for DNN training</article-title>,&#x201d; in <conf-name>IEEE International Symposium on High Performance Computer Architecture (HPCA)</conf-name>, <fpage>58</fpage>&#x2013;<lpage>70</lpage>. <pub-id pub-id-type="doi">10.1109/HPCA47549.2020.00015</pub-id>
</citation>
</ref>
<ref id="B15">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Shafiee</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Nag</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Muralimanohar</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Balasubramonian</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Strachan</surname>
<given-names>J. P.</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>M.</given-names>
</name>
<etal/>
</person-group> (<year>2016</year>). &#x201c;<article-title>ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars</article-title>,&#x201d; in <conf-name>ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)</conf-name>, <fpage>14</fpage>&#x2013;<lpage>26</lpage>. <pub-id pub-id-type="doi">10.1109/ISCA.2016.12</pub-id>
</citation>
</ref>
<ref id="B17">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Song</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Han</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>BRAHMS: Beyond conventional RRAM-based neural network accelerators using hybrid analog memory system</article-title>,&#x201d; in <conf-name>58th ACM/IEEE Design Automation Conference (DAC)</conf-name>, <fpage>1033</fpage>&#x2013;<lpage>1038</lpage>. <pub-id pub-id-type="doi">10.1109/DAC18074.2021.9586247</pub-id>
</citation>
</ref>
<ref id="B18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sze</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Y.-H.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>T.-J.</given-names>
</name>
<name>
<surname>Emer</surname>
<given-names>J. S.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Efficient processing of deep neural networks: A tutorial and survey</article-title>. <source>Proc. IEEE</source> <volume>105</volume>, <fpage>2295</fpage>&#x2013;<lpage>2329</lpage>. <pub-id pub-id-type="doi">10.1109/JPROC.2017.2761740</pub-id>
</citation>
</ref>
<ref id="B19">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Xu</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Ding</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>S. X.</given-names>
</name>
<name>
<surname>Niemier</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Cong</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>Y.</given-names>
</name>
<etal/>
</person-group> (<year>2018</year>). <article-title>Scaling for edge inference of deep neural networks</article-title>. <source>Nat. Electron</source> <volume>1</volume>, <fpage>216</fpage>&#x2013;<lpage>222</lpage>. <pub-id pub-id-type="doi">10.1038/s41928-018-0059-3</pub-id>
</citation>
</ref>
<ref id="B20">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Xue</surname>
<given-names>C.-X.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>W.-H.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>J.-S.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>J.-F.</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>W.-Y.</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>W.-E.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). &#x201c;<article-title>A 1Mb multibit ReRAM computing-in-memory macro with 14.6ns parallel MAC computing time for CNN based AI edge processors</article-title>,&#x201d; in <conf-name>IEEE International Solid- State Circuits Conference (ISSCC)</conf-name>, <fpage>388</fpage>&#x2013;<lpage>390</lpage>. <pub-id pub-id-type="doi">10.1109/ISSCC.2019.8662395</pub-id>
</citation>
</ref>
<ref id="B21">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Yun</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Shin</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Kang</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>L.-S.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Optimizing ADC utilization through value-aware bypass in ReRAM-based DNN accelerator</article-title>,&#x201d; in <conf-name>58th ACM/IEEE Design Automation Conference (DAC)</conf-name>, <fpage>1087</fpage>&#x2013;<lpage>1092</lpage>. <pub-id pub-id-type="doi">10.1109/DAC18074.2021.9586140</pub-id>
</citation>
</ref>
<ref id="B22">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Peng</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Gao</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>He</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Y.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). &#x201c;<article-title>Design guidelines of RRAM based neural-processing-unit: A joint device-circuit-algorithm analysis</article-title>,&#x201d; in <conf-name>56th Annual Design Automation Conference (DAC)</conf-name>, <fpage>1</fpage>&#x2013;<lpage>6</lpage>. <pub-id pub-id-type="doi">10.1145/3316781.3317797</pub-id>
</citation>
</ref>
<ref id="B23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Gao</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Yao</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Chang</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Neuro-inspired computing chips</article-title>. <source>Nat. Electron</source> <volume>3</volume>, <fpage>371</fpage>&#x2013;<lpage>382</lpage>. <pub-id pub-id-type="doi">10.1038/s41928-020-0435-7</pub-id>
</citation>
</ref>
<ref id="B26">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Ni</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Wen</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Zou</surname>
<given-names>Y.</given-names>
</name>
<etal/>
</person-group> (<year>2016</year>). <article-title>DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients</article-title>. <comment>arXiv preprint arXiv: 1606.06160</comment>
</citation>
</ref>
<ref id="B24">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Fang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>Y.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>An energy efficient computing-in-memory accelerator with 1T2R cell and fully analog processing for edge AI applications</article-title>. <source>IEEE Trans. Circuits Syst. II Express Briefs</source> <volume>68</volume>, <fpage>2932</fpage>&#x2013;<lpage>2936</lpage>. <pub-id pub-id-type="doi">10.1109/TCSII.2021.3065697</pub-id>
</citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zidan</surname>
<given-names>M. A.</given-names>
</name>
<name>
<surname>Strachan</surname>
<given-names>J. P.</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>W. D.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>The future of electronics based on memristive systems</article-title>. <source>Nat. Electron</source> <volume>1</volume>, <fpage>22</fpage>&#x2013;<lpage>29</lpage>. <pub-id pub-id-type="doi">10.1038/s41928-017-0006-8</pub-id>
</citation>
</ref>
</ref-list>
</back>
</article>