<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Neurosci.</journal-id>
<journal-title>Frontiers in Neuroscience</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Neurosci.</abbrev-journal-title>
<issn pub-type="epub">1662-453X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fnins.2021.749811</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Neuroscience</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Gradient Decomposition Methods for Training Neural Networks With Non-ideal Synaptic Devices</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Zhao</surname> <given-names>Junyun</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="author-notes" rid="fn002"><sup>&#x2020;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1487942/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Huang</surname> <given-names>Siyuan</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="author-notes" rid="fn002"><sup>&#x2020;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/716382/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Yousuf</surname> <given-names>Osama</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1425296/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Gao</surname> <given-names>Yutong</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1534560/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Hoskins</surname> <given-names>Brian D.</given-names></name>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/716347/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Adam</surname> <given-names>Gina C.</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1129644/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Department of Computer Science, George Washington University</institution>, <addr-line>Washington, DC</addr-line>, <country>United States</country></aff>
<aff id="aff2"><sup>2</sup><institution>Department of Electrical and Computer Engineering, George Washington University</institution>, <addr-line>Washington, DC</addr-line>, <country>United States</country></aff>
<aff id="aff3"><sup>3</sup><institution>Physical Measurement Laboratory, National Institute of Standards and Technology</institution>, <addr-line>Gaithersburg, MD</addr-line>, <country>United States</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Alexantrou Serb, University of Southampton, United Kingdom</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Wei Wang, Technion &#x2013; Israel Institute of Technology, Israel; Seyoung Kim, Pohang University of Science and Technology, South Korea</p></fn>
<corresp id="c001">&#x002A;Correspondence: Gina C. Adam, <email>ginaadam@gwu.edu</email></corresp>
<fn fn-type="equal" id="fn002"><p><sup>&#x2020;</sup>These authors have contributed equally to this work and share first authorship</p></fn>
<fn fn-type="other" id="fn004"><p>This article was submitted to Neuromorphic Engineering, a section of the journal Frontiers in Neuroscience</p></fn>
</author-notes>
<pub-date pub-type="epub">
<day>22</day>
<month>11</month>
<year>2021</year>
</pub-date>
<pub-date pub-type="collection">
<year>2021</year>
</pub-date>
<volume>15</volume>
<elocation-id>749811</elocation-id>
<history>
<date date-type="received">
<day>30</day>
<month>07</month>
<year>2021</year>
</date>
<date date-type="accepted">
<day>20</day>
<month>10</month>
<year>2021</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x00A9; 2021 Zhao, Huang, Yousuf, Gao, Hoskins and Adam.</copyright-statement>
<copyright-year>2021</copyright-year>
<copyright-holder>Zhao, Huang, Yousuf, Gao, Hoskins and Adam</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>While promising for high-capacity machine learning accelerators, memristor devices have non-idealities that prevent software-equivalent accuracies when used for online training. This work uses a combination of Mini-Batch Gradient Descent (MBGD) to average gradients, stochastic rounding to avoid vanishing weight updates, and decomposition methods to keep the memory overhead low during mini-batch training. Since the weight update has to be transferred to the memristor matrices efficiently, we also investigate the impact of reconstructing the gradient matrices both internally (<italic>rank-seq</italic>) and externally (<italic>rank-sum</italic>) to the memristor array. Our results show that streaming batch principal component analysis (streaming batch PCA) and non-negative matrix factorization (NMF) decomposition algorithms can achieve near-MBGD accuracy in a memristor-based multi-layer perceptron trained on the MNIST (Modified National Institute of Standards and Technology) database with only 3 to 10 ranks at significant memory savings. Moreover, NMF <italic>rank-seq</italic> outperforms streaming batch PCA <italic>rank-seq</italic> at low ranks, making it more suitable for hardware implementation in future memristor-based accelerators.</p>
</abstract>
<kwd-group>
<kwd>non-negative matrix factorization</kwd>
<kwd>gradient data decomposition</kwd>
<kwd>principal component analysis</kwd>
<kwd>memristor</kwd>
<kwd>non-idealities</kwd>
<kwd>ReRAM</kwd>
</kwd-group>
<contract-sponsor id="cn001">Office of Naval Research <named-content content-type="fundref-id">10.13039/100000006</named-content></contract-sponsor>
<contract-sponsor id="cn002">George Washington University <named-content content-type="fundref-id">10.13039/100007108</named-content></contract-sponsor>
<contract-sponsor id="cn003">National Institute of Standards and Technology <named-content content-type="fundref-id">10.13039/100000161</named-content></contract-sponsor>
<counts>
<fig-count count="10"/>
<table-count count="2"/>
<equation-count count="5"/>
<ref-count count="65"/>
<page-count count="15"/>
<word-count count="11693"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="S1">
<title>Introduction</title>
<p>As artificial intelligence (AI) applications become ubiquitous in medical care, autonomous driving, robotics, and other fields, accuracy requirements and neural network complexity increase in tandem, requiring extensive hardware support for training. For example, GPT-3 is made up of &#x2248;175 billion parameters and requires 285,000 central processing unit (CPU) cores and 10,000 graphics processing units (GPUs) to be trained on tens of billions of web pages and book texts (<xref ref-type="bibr" rid="B37">Langston, 2020</xref>). Moreover, the use of such significant computing resources has major financial and environmental impacts (<xref ref-type="bibr" rid="B45">Nugent and Molter, 2014</xref>; <xref ref-type="bibr" rid="B61">Strubell et al., 2020</xref>). New neuroinspired hardware alternatives are necessary for keeping up with increasing demands on complexity and energy efficiency.</p>
<p>Emerging non-volatile memory (NVM) technologies, such as oxygen vacancy-driven resistive switches, also known as ReRAM or memristors (<xref ref-type="bibr" rid="B10">Chang et al., 2011</xref>; <xref ref-type="bibr" rid="B65">Wong et al., 2012</xref>; <xref ref-type="bibr" rid="B13">Chen, 2020</xref>), can combine data processing and storage. Memristor matrices (crossbar arrays) use physical principles to enable efficient parallel multiply-accumulate (MAC) operations (<xref ref-type="bibr" rid="B30">Hu et al., 2018</xref>). This in-memory computing paradigm can achieve a substantial increase in speed and energy efficiency (<xref ref-type="bibr" rid="B9">Ceze et al., 2016</xref>) without the bottleneck caused by traditional complementary metal-oxide-semiconductor (CMOS) transistor-based von Neumann architectures. However, due to the inherent operational stochasticity of memristors in addition to manufacturing yield and reproducibility challenges, this emerging technology suffers from non-idealities. Thus, the accuracy of a neural network implemented with non-ideal memristor synaptic weights is not software-equivalent. To alleviate the undesirable effects of these devices, it is necessary to engineer better devices and improve the existing training algorithms.</p>
<p>This work investigates the use of Mini-Batch Gradient Descent (MBGD) for high-accuracy training of neural networks with non-ideal memristor-based weights, together with the use of gradient decomposition methods to alleviate the memory overhead due to the storage of gradient information between batch updates. An initial investigation (<xref ref-type="bibr" rid="B18">Gao et al., 2020</xref>) showed that MBGD with moderate batch sizes (e.g., 128) can overcome the low accuracy of stochastic gradient descent (SGD) for a one-hidden-layer perceptron network implemented with non-ideal synaptic weights trained on the MNIST dataset. Accuracies of up to 86.5% were obtained for a batch size of 128, compared with only 50.9% for SGD. Although these results are promising, they are still far from the software equivalency of 96.5% at our studied network size.</p>
<p>Moreover, MBGD is memory-intensive&#x2014;particularly at higher batch sizes&#x2014;since the gradient information needs to be stored before the batch update. We propose using a hardware co-processor to compress MBGD gradient data and work in tandem with the resistive array to support efficient array-level updates (<xref ref-type="fig" rid="F1">Figure 1A</xref>). The first step toward this goal, and the key question addressed by this paper, is which decomposition algorithm should be mapped to a hardware co-processor to best support the training, particularly in neural networks implemented with non-ideal devices. Several common low-rank decomposition methods are available and have been used extensively in the computer science literature to pre-process datasets, remove noise, and reduce the number of network parameters (<xref ref-type="bibr" rid="B19">Garipov et al., 2016</xref>; <xref ref-type="bibr" rid="B55">Schein et al., 2016</xref>). Our prior work (<xref ref-type="bibr" rid="B31">Huang et al., 2020a</xref>,b) proposed streaming batch Principal Component Analysis (PCA) and showed that an accurate gradient matrix can be recomposed with as few as 3 to 10 ranks, depending on the dataset complexity. Tests on CIFAR-10, CIFAR-100, and ImageNet showed near-equivalent accuracy to MBGD at significant memory savings. However, in that work, non-ideal neural networks were not investigated.</p>
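The memory argument can be illustrated with a short NumPy sketch (not from the paper's codebase; the dimensions and the synthetic rank-5 error signal are illustrative). A mini-batch gradient accumulated from rank-1 outer products has low effective rank, so a rank-<italic>k</italic> truncated SVD reconstructs it while storing only k(m + n) values instead of m &#x00D7; n:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, B, k = 784, 100, 128, 10
r = 5  # hypothetical true rank of the backpropagated error signal

acts = rng.standard_normal((B, m))                                # activations
errs = rng.standard_normal((B, r)) @ rng.standard_normal((r, n))  # low-rank errors
G = acts.T @ errs / B                                             # m x n batch gradient

# rank-k truncated SVD reconstruction of the gradient
U, s, Vt = np.linalg.svd(G, full_matrices=False)
G_k = (U[:, :k] * s[:k]) @ Vt[:k]

rel_err = np.linalg.norm(G - G_k) / np.linalg.norm(G)
mem_ratio = k * (m + n) / (m * n)   # factor storage vs. full gradient storage
```

Here `rel_err` is near machine precision because the gradient's rank does not exceed `k`, while the factors occupy roughly 11% of the full matrix's storage.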
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption><p>Training co-processor for decomposition. <bold>(A)</bold> Sketch showing an integrated system &#x2013; a digital training co-processor will implement the best identified algorithm in an efficient way to support neural network training on non-ideal analog arrays. <bold>(B)</bold> Computational complexity for Mini-Batch Gradient Descent (MBGD), streaming batch principal component analysis (PCA), and non-negative matrix factorization (NMF), as well as recomposition methods &#x2013; rank summation (rank-sum) vs. rank-by-rank update (rank-seq). The respective eigenvectors are color coded.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-15-749811-g001.tif"/>
</fig>
<p>In this study, we investigate the device-algorithm interaction, which highlights the importance of hyperparameter optimization and stochastic rounding for overcoming the low-bit precision coding of the memristor weights. We propose an expansion of MBGD to larger batch sizes in conjunction with two gradient decomposition methods, streaming batch PCA and non-negative matrix factorization (NMF), and recomposition methods based on rank summation (rank-sum) vs. rank-by-rank update (rank-seq), applied to a network with realistic memristor hardware models. For an m &#x00D7; n gradient matrix with batch size B, the MBGD cost is approximated at 2Bmn. By comparison, streaming batch PCA and NMF have asymptotic complexities of k(m+n) and k<sup>2</sup>(m+n)<sup>2</sup>, respectively (see <xref ref-type="fig" rid="F1">Figure 1B</xref>). The issue of gradient recomposition to support weight updating is also investigated: rank-sum would require additional overhead on the training co-processor, while for rank-seq it is possible to envision a series of rank-1 array-level updates that perform the recomposition on the array itself. However, it is important to point out that these decomposition algorithms have high complexity, requiring QR decompositions or iterative calculations when implemented at the algorithmic level and executed on a CPU. Dedicated hardware decomposers can be envisioned that support streaming operation on data flows.</p>
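As a brief illustration of why stochastic rounding matters for low-bit-precision weights (the conductance step and update magnitude below are hypothetical, not device measurements), deterministic rounding to the nearest device step discards any update smaller than half a step, whereas stochastic rounding preserves such updates in expectation:

```python
import numpy as np

def stochastic_round(x, step, rng):
    """Round x to a multiple of `step`, rounding up with probability equal to
    the fractional remainder, so small updates survive on average."""
    scaled = np.asarray(x) / step
    low = np.floor(scaled)
    prob_up = scaled - low
    return (low + (rng.random(scaled.shape) < prob_up)) * step

rng = np.random.default_rng(0)
step = 0.1                                  # hypothetical conductance step
updates = np.full(100_000, 0.03)            # each update is below step / 2

det = np.round(updates / step) * step       # deterministic: every update vanishes
sto = stochastic_round(updates, step, rng)  # stochastic: mean is preserved
```

With deterministic rounding the accumulated weight change is exactly zero; with stochastic rounding the mean applied update stays close to the intended 0.03.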
<p>The remainder of the paper is organized as follows. Section 2 provides background information related to memristors and their applicability to neural networks, as well as an overview of decomposition algorithms. Section 3 describes the methodological details, the simulation environment, and the algorithms used. Section 4 presents the evaluation of the proposed methodology on MNIST and its comparison with SGD and MBGD. Section 5 concludes with a discussion of the results.</p>
</sec>
<sec id="S2">
<title>Related Work</title>
<sec id="S2.SS1">
<title>Resistive Switching Phenomena and Memristor Technology</title>
<p>The resistive switching phenomenon was discovered in aluminum oxide in the early 1960s (<xref ref-type="bibr" rid="B26">Hickmott, 1962</xref>) and in other materials in the following decades (<xref ref-type="bibr" rid="B3">Argall, 1968</xref>; <xref ref-type="bibr" rid="B16">Dearnaley et al., 1970</xref>; <xref ref-type="bibr" rid="B48">Oxley, 1977</xref>; <xref ref-type="bibr" rid="B50">Pagnia and Sotnik, 1988</xref>). Due to the focus on silicon integrated circuits at the time, the technological potential of this phenomenon was not explored until the early 2000s, sparked by industry&#x2019;s interest in the one transistor and one memristor (1T1R) cell for digital memories (<xref ref-type="bibr" rid="B4">Baek et al., 2004</xref>; <xref ref-type="bibr" rid="B57">Seo et al., 2004</xref>; <xref ref-type="bibr" rid="B54">Rohde et al., 2005</xref>).</p>
<p>These devices have a simple structure: the upper and lower layers are metal electrodes, and the middle layer is a dielectric, typically a transition metal oxide. The device behavior is driven by complex multi-physics phenomena and is not yet fully understood. However, the prevailing model is based on the formation and reshaping of conductive filaments. When a voltage pulse is applied, the electronic and ionic conduction driven by local Joule heating causes the filament to reshape, thus changing the device resistance and programming the weight. When the voltage is removed and the local Joule heating stops, the ions in the structure &#x201C;freeze&#x201D; in place, retaining the filament shape and its associated resistance/weight state and providing memory to the system.</p>
<p>Due to the inherently stochastic nature of the ionic movement under Joule heating, the devices exhibit non-ideal characteristics, such as programming variability from cycle to cycle and from device to device, asymmetry between the resistance increase (turn OFF &#x2013; long-term depression) and the resistance decrease (turn ON &#x2013; long-term potentiation), read noise, and a limited ON/OFF ratio or number of accessible resistance states. Other device indicators, such as device yield and retention, also impact their practical applicability (<xref ref-type="bibr" rid="B21">Gokmen and Vlasov, 2016</xref>; <xref ref-type="bibr" rid="B42">Lin et al., 2019</xref>).</p>
</sec>
<sec id="S2.SS2">
<title>Memristor-Based Neural Network Training</title>
<p>A memristor crossbar can efficiently implement vector-matrix multiplication using Ohm&#x2019;s law for the input voltage to synaptic weight conductance multiplication and Kirchhoff&#x2019;s law for the addition of the resulting currents. Together, these principles give rise to a vector dot product, which is the fundamental operation needed for fully-connected neural network layers (<xref ref-type="bibr" rid="B53">Prezioso et al., 2015</xref>). However, memristor non-idealities make the training process difficult (<xref ref-type="bibr" rid="B1">Adam et al., 2018</xref>). Therefore, the classification accuracies of <italic>in-situ</italic> training using non-volatile-memory hardware have generally been lower than those of software-based training.</p>
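The crossbar physics described above can be sketched numerically in a few lines (an idealized illustration that ignores all device non-idealities; the conductance and voltage ranges are arbitrary): each device contributes a current I = G &#x00B7; V (Ohm's law), and each column wire sums the currents of its devices (Kirchhoff's current law), which is exactly a vector-matrix product.

```python
import numpy as np

rng = np.random.default_rng(1)
rows, cols = 4, 3
G = rng.uniform(1e-6, 1e-4, size=(rows, cols))  # device conductances (siemens)
V = rng.uniform(0.0, 0.2, size=rows)            # read voltages applied to rows (volts)

I_dev = V[:, None] * G     # Ohm's law: current through each device
I_col = I_dev.sum(axis=0)  # Kirchhoff's law: each column sums its device currents
```

The column current vector `I_col` coincides with the dot product `V @ G`, which is why a single read of the array performs a full MAC operation in parallel.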
<p>Several approaches have been used to mitigate these memristor device non-idealities. At the software level, binary neural networks (<xref ref-type="bibr" rid="B12">Chen et al., 2018</xref>) can use the devices as ON/OFF switches to reduce the impact of variability and conductance quantization. Alternatively, stochastic networks can exploit inherent cycle-to-cycle variability (<xref ref-type="bibr" rid="B52">Payvand et al., 2019</xref>; <xref ref-type="bibr" rid="B59">She et al., 2019</xref>). At the hardware level, more complex multi-memristor cells can be used (<xref ref-type="bibr" rid="B7">Boybat et al., 2018</xref>) to overcome asymmetry, limited bit precision and device variability at the expense of increased hardware overhead. Feedback circuitry can also be used to set the device to a well-defined value and mitigate the cycle-to-cycle variability of the devices (<xref ref-type="bibr" rid="B58">Serb et al., 2015</xref>). These solutions can be similarly applied to other types of emerging non-volatile memory technologies such as phase-change memory (<xref ref-type="bibr" rid="B36">Kim et al., 2019</xref>), magnetoresistive memory (<xref ref-type="bibr" rid="B27">Hirtzlin et al., 2019</xref>), ferroelectric-based memories (<xref ref-type="bibr" rid="B6">Berdan et al., 2020</xref>), among others.</p>
<p>A recent solution proposed by <xref ref-type="bibr" rid="B2">Ambrogio et al. (2018)</xref> has shown that batch analog systems can achieve training performance equivalent to that of software, but only at the cost of doubling the memory and exerting additional effort in closed-loop training. Their proposed accelerator uses an analog short-term memory based on capacitors and transistors for fast and highly linear programming during training, with only infrequent transfers to an analog long-term memory based on phase-change devices. The capacitive short-term memory is used to correct problems due to imperfections in programming the long-term phase-change memories (<xref ref-type="bibr" rid="B25">Haensch et al., 2018</xref>). This approach, which combines the advantages of two device technologies, is feasible. However, it relies on duplicate short-term and long-term memories, and any imperfections of the short-term memory also need to be managed in hardware. A working prototype has not yet been demonstrated. Nevertheless, understanding how to leverage alternative algorithms and architectures is critical, since evidence suggests that certain algorithms, like batch update, are more resilient to the non-idealities of various devices (<xref ref-type="bibr" rid="B35">Kataeva et al., 2015</xref>; <xref ref-type="bibr" rid="B18">Gao et al., 2020</xref>; <xref ref-type="bibr" rid="B20">Gokmen and Haensch, 2020</xref>).</p>
</sec>
<sec id="S2.SS3">
<title>Matrix Decomposition Algorithms</title>
<p>Rather than using a duplicative short-term memory, linear algebra techniques can be used to compress gradient data and support efficient array-level updates. Principal component analysis (PCA), a commonly used decomposition method, projects high-dimensional data into a low-dimensional subspace that maximizes the variance in the data, found by computing and analyzing the underlying eigenspectrum. Streaming PCA (<xref ref-type="bibr" rid="B46">Oja, 1982</xref>), streaming history PCA (<xref ref-type="bibr" rid="B8">Burrello et al., 2019</xref>; <xref ref-type="bibr" rid="B28">Hoskins et al., 2019</xref>), and streaming batch PCA (<xref ref-type="bibr" rid="B32">Huang et al., 2020b</xref>) were all developed from the core PCA algorithm. Streaming batch PCA can extract an approximation of a full matrix from samples of its constituent parts by combining bi-iterative stochastic power iterations (<xref ref-type="bibr" rid="B62">Vogels et al., 2019</xref>) with QR factorization to produce low-rank approximations of stochastic rectangular matrices. This method reduces the gradient storage and processing requirements of MBGD; the batch gradient itself is composed of randomly generated rank-1 matrices, the outer products of forward-propagated activations and backpropagated errors.</p>
<p>However, streaming batch PCA places no restriction on the sign of the data elements, so negative values can appear in the matrix factorization. Even if all the values are strictly positive, such as in an image, the decomposition may include negative terms. This oscillatory behavior, while usually harmless in software, causes challenges when the computation is done at the physical level: for instance, summation on memristor devices, whose programming behavior is not inherently reversible. By contrast, the Non-Negative Matrix Factorization (NMF) algorithm (<xref ref-type="bibr" rid="B49">Paatero and Tapper, 1994</xref>; <xref ref-type="bibr" rid="B63">Wang et al., 2015</xref>) calculates the decomposition under non-negativity constraints, which results in purely additive features.</p>
<p>The NMF decomposition is particularly meaningful when the gradient information is mapped on a memristor matrix for physical recomposition. NMF can decrease the overlap between ranks, eliminating the oscillatory behavior during summation that exists in a standard PCA decomposition. This is crucial for devices that do not have a linear and symmetric weight update.</p>
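A minimal multiplicative-update NMF sketch (a generic Lee-Seung-style formulation, not necessarily the variant used in this work) makes the additive property concrete: both factors stay non-negative, so recomposing the matrix rank by rank requires only conductance increases on the array.

```python
import numpy as np

def nmf(V, k, iters=300, eps=1e-9, seed=0):
    """Factor a non-negative matrix V as W @ H with W, H >= 0, using
    classic multiplicative updates (a generic sketch, not the paper's code)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # update right factor
        W *= (V @ H.T) / (W @ H @ H.T + eps)  # update left factor
    return W, H

rng = np.random.default_rng(1)
V = rng.random((20, 5)) @ rng.random((5, 30))  # non-negative, rank-5 target
W, H = nmf(V, k=5)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

Because every entry of `W` and `H` is non-negative, each rank-1 term `W[:, j:j+1] @ H[j:j+1]` is a purely additive contribution, which is what a rank-seq update on a non-reversible device can exploit.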
<p>The streaming batch PCA and NMF decomposition algorithms will be used in the following sections to approximate the MBGD gradient and train a fully connected network to classify MNIST handwritten digits with high accuracy, despite device non-idealities.</p>
</sec>
</sec>
<sec id="S3">
<title>Method Details</title>
<sec id="S3.SS1">
<title>Streaming Batch Principal Component Analysis</title>
<p>Streaming batch PCA, or SBPCA (<xref ref-type="bibr" rid="B32">Huang et al., 2020b</xref>), is used to decompose the gradient information from MBGD. It compresses batch data during neural network training through rank-<italic>k</italic> outer product updates. Streaming batch PCA can expedite gradient descent training and decrease the memory cost by generating a stochastic low-rank approximation of the gradient. Gradient descent reduces the error between the predicted and actual values of the neural network by updating the parameters to minimize the loss function,</p>
<disp-formula id="S3.Ex1"><mml:math id="M1" display="block"><mml:mrow><mml:mrow><mml:msub><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo>-</mml:mo><mml:mrow><mml:mpadded width="+3.3pt"><mml:mi mathvariant="normal">&#x03B1;</mml:mi></mml:mpadded><mml:mo rspace="5.8pt">&#x002A;</mml:mo><mml:mrow><mml:msub><mml:mo>&#x2207;</mml:mo><mml:mi mathvariant="normal">&#x0398;</mml:mi></mml:msub><mml:mo>&#x2061;</mml:mo><mml:mi>l</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>
<p>where &#x0398;<sub><italic>p</italic></sub> is the weight matrix of layer <italic>p</italic>, &#x03B1; is the learning rate, <italic>l</italic>(&#x0398;) is the loss function, and <inline-formula><mml:math id="INEQ3"><mml:mrow><mml:mrow><mml:msub><mml:mo>&#x2207;</mml:mo><mml:mi mathvariant="normal">&#x0398;</mml:mi></mml:msub><mml:mo>&#x2061;</mml:mo><mml:mi>l</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mo>&#x2061;</mml:mo><mml:mi>l</mml:mi></mml:mrow><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mo>&#x2061;</mml:mo><mml:msub><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mi>p</mml:mi></mml:msub></mml:mrow></mml:mfrac></mml:mrow></mml:math></inline-formula> is the gradient.</p>
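In code, the update rule above is a single element-wise subtraction per layer (a minimal sketch with illustrative weight and gradient values):

```python
import numpy as np

alpha = 0.1                                # learning rate (illustrative)
theta = np.array([[0.5, -0.2],
                  [0.3,  0.8]])            # layer weight matrix
grad = np.array([[0.1,  0.0],
                 [-0.2, 0.4]])             # gradient of the loss w.r.t. theta

theta = theta - alpha * grad               # gradient descent step
```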
<p>To extract significant batch gradient data, average out the noise due to non-ideal memristor weights, and improve the network accuracy, a streaming low-rank approximation of <inline-formula><mml:math id="INEQ4"><mml:mrow><mml:msubsup><mml:mover accent="true"><mml:mo>&#x2207;</mml:mo><mml:mo>^</mml:mo></mml:mover><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>B</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>&#x2062;</mml:mo><mml:mi>l</mml:mi></mml:mrow></mml:math></inline-formula> is obtained by streaming batch PCA. The gradient is approximated for a batch of size <italic>B</italic> and the top <italic>k</italic> most important ranks as follows:</p>
<disp-formula id="S3.Ex2"><mml:math id="M2" display="block"><mml:mrow><mml:mrow><mml:mrow><mml:msubsup><mml:mover accent="true"><mml:mo>&#x2207;</mml:mo><mml:mo>^</mml:mo></mml:mover><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>B</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>&#x2062;</mml:mo><mml:mi>l</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mo>&#x22C5;</mml:mo><mml:mover accent="true"><mml:mi mathvariant="normal">&#x03A3;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mo>&#x22C5;</mml:mo><mml:mpadded width="+6.6pt"><mml:msup><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mi>T</mml:mi></mml:msup></mml:mpadded></mml:mrow></mml:mrow><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="INEQ5"><mml:mrow><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="INEQ6"><mml:mrow><mml:mpadded width="+5pt"><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover></mml:mpadded><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> denote the left singular matrix and right singular matrix, respectively. <inline-formula><mml:math id="INEQ7"><mml:mrow><mml:mover accent="true"><mml:mi mathvariant="normal">&#x03A3;</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>g</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mover accent="true"><mml:mi>&#x03C3;</mml:mi><mml:mo>&#x2192;</mml:mo></mml:mover><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> is a diagonal matrix, which has on its diagonal the corresponding singular values <inline-formula><mml:math id="INEQ8"><mml:mover accent="true"><mml:mi>&#x03C3;</mml:mi><mml:mo>&#x2192;</mml:mo></mml:mover></mml:math></inline-formula> for the top <italic>k</italic> ranks. 
In the Streaming Batch PCA algorithm, the input <inline-formula><mml:math id="INEQ9"><mml:mrow><mml:mover accent="true"><mml:mi>x</mml:mi><mml:mo>&#x2192;</mml:mo></mml:mover><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> and the error <inline-formula><mml:math id="INEQ10"><mml:mrow><mml:mover accent="true"><mml:mi mathvariant="normal">&#x03B4;</mml:mi><mml:mo>&#x2192;</mml:mo></mml:mover><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> help to update <inline-formula><mml:math id="INEQ11"><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover></mml:math></inline-formula> and <inline-formula><mml:math id="INEQ12"><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover></mml:math></inline-formula>. Based on Oja&#x2019;s rule and stochastic power iterations (<xref ref-type="bibr" rid="B47">Oja, 1992</xref>; <xref ref-type="bibr" rid="B31">Huang et al., 2020a</xref>), <inline-formula><mml:math id="INEQ13"><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover></mml:math></inline-formula> and <inline-formula><mml:math id="INEQ14"><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover></mml:math></inline-formula> are updated separately and bi-iteratively in a streaming fashion with an averaged block size <italic>b &#x003C; B</italic>, followed by re-orthogonalization via QR factorization. Our QR factorization is defined to have non-increasing values on the diagonal of the <italic>R</italic> matrix. 
For updating <inline-formula><mml:math id="INEQ15"><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover></mml:math></inline-formula>, we use</p>
<disp-formula id="S3.Ex3"><mml:math id="M3" display="block"><mml:mrow><mml:mrow><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mo>&#x2190;</mml:mo><mml:mrow><mml:mi>Q</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>R</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:mfrac><mml:mi>i</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac><mml:mo>&#x22C5;</mml:mo><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac><mml:mo>&#x22C5;</mml:mo><mml:mpadded width="+5pt"><mml:mfrac><mml:mrow><mml:msup><mml:mover accent="true"><mml:mi>x</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2062;</mml:mo><mml:mover accent="true"><mml:mi mathvariant="normal">&#x03B4;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mo>&#x2062;</mml:mo><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mo>&#x2062;</mml:mo><mml:msup><mml:mover accent="true"><mml:mi mathvariant="normal">&#x03A3;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mrow><mml:mi>b</mml:mi></mml:mfrac></mml:mpadded></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:mrow><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="INEQ16"><mml:mfrac><mml:mi>i</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac></mml:math></inline-formula> represents the convergence coefficient and <inline-formula><mml:math id="INEQ17"><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover></mml:math></inline-formula> decays with each QR factorization, running from <italic>i = 1</italic> until reaching <italic>i = B/b</italic>. The update of <inline-formula><mml:math id="INEQ18"><mml:mover accent="true"><mml:mi mathvariant="normal">&#x03A3;</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover></mml:math></inline-formula> is similar,</p>
<disp-formula id="S3.Ex4"><mml:math id="M4" display="block"><mml:mrow><mml:mrow><mml:mover accent="true"><mml:mi mathvariant="normal">&#x03A3;</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mo>&#x2190;</mml:mo><mml:mrow><mml:mrow><mml:mfrac><mml:mi>i</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac><mml:mo>&#x22C5;</mml:mo><mml:mover accent="true"><mml:mi mathvariant="normal">&#x03A3;</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:msub><mml:mo largeop="true" symmetric="true">&#x2211;</mml:mo><mml:mrow><mml:mtext>rows</mml:mtext></mml:mrow></mml:msub><mml:mfrac><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover accent="true"><mml:mtext>x</mml:mtext><mml:mo>^</mml:mo></mml:mover><mml:mo>&#x2062;</mml:mo><mml:mover accent="true"><mml:mtext>X</mml:mtext><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo largeop="true" mathsize="160%" stretchy="false" symmetric="true">&#x2299;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mover accent="true"><mml:mi mathvariant="normal">&#x03B4;</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mo>&#x2062;</mml:mo><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mrow><mml:mi>b</mml:mi></mml:mfrac></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>
<p>where &#x2299; is the Hadamard (elementwise) matrix product.</p>
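<p>The two streaming updates above can be sketched in NumPy as follows. This is a minimal illustrative sketch, not the authors' implementation: the function name and dimensions are ours, the scale estimate is kept as the diagonal of &#x3A3; in vector form, the symmetric update of the right factor is omitted, and the positive-diagonal sign fix after QR is a simplification of the non-increasing-diagonal convention stated above.</p>

```python
import numpy as np

def streaming_batch_pca_step(X_hat, Delta_hat, Sigma_hat, x_blk, d_blk, i):
    """One streaming-block update (illustrative sketch).

    X_hat     : (m, k) left-eigenvector estimate
    Delta_hat : (n, k) right-eigenvector estimate (its symmetric update is omitted)
    Sigma_hat : (k,)   diagonal of the scale matrix Sigma
    x_blk     : (b, m) block of layer inputs; d_blk : (b, n) block of errors
    i         : block counter, running from 1 to B/b
    """
    b = x_blk.shape[0]
    # X <- QR[ i/(i+1) * X + 1/(i+1) * x^T d Delta Sigma^{-1} / b ]
    G = (x_blk.T @ d_blk @ Delta_hat) / Sigma_hat / b          # (m, k)
    Q, R = np.linalg.qr(i / (i + 1) * X_hat + G / (i + 1))
    Q = Q * np.sign(np.diag(R))   # sign convention on diag(R) (simplified)
    # Sigma <- i/(i+1) * Sigma + 1/(i+1) * sum_rows[(x X) o (d Delta)] / b
    s = ((x_blk @ Q) * (d_blk @ Delta_hat)).sum(axis=0) / b
    return Q, i / (i + 1) * Sigma_hat + s / (i + 1)
```

Using the freshly re-orthogonalized factor inside the scale update is one possible ordering; the bi-iterative scheme admits either choice.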
<p>From the standpoint of computational complexity, Streaming Batch PCA with <italic>k</italic> ranks requires <italic>4Bk(m+n)+<inline-formula><mml:math id="INEQ20"><mml:mrow><mml:mfrac><mml:mi>B</mml:mi><mml:mi>b</mml:mi></mml:mfrac><mml:mo>&#x22C5;</mml:mo></mml:mrow></mml:math></inline-formula>4k(m+n)</italic> floating point operations (FLOPs), where the <inline-formula><mml:math id="INEQ21"><mml:mrow><mml:mfrac><mml:mi>B</mml:mi><mml:mi>b</mml:mi></mml:mfrac><mml:mo>&#x22C5;</mml:mo></mml:mrow></mml:math></inline-formula>4k(m+n) term accounts for the B/b (batch size over block size) QR factorizations. Overall, the complexity tends to scale as k(m+n), yielding a reduced computational load compared with MBGD. However, the recomposition complexity scales as kmn, which means that reconstructing the gradient approximation is more computationally expensive than extracting the leading eigenvectors; the recomposition is therefore the most expensive part of the algorithm.</p>
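<p>For concreteness, the decomposition and recomposition FLOP counts can be compared directly with illustrative dimensions (the values below are arbitrary examples, not taken from our experiments):</p>

```python
# Illustrative FLOP comparison; m, n, k, B, b are arbitrary example values.
m, n, k, B, b = 512, 512, 4, 64, 8

# Streaming Batch PCA decomposition: 4Bk(m+n) + (B/b)*4k(m+n)
decomp = 4 * B * k * (m + n) + (B // b) * 4 * k * (m + n)
# Gradient recomposition: ~2kmn
recomp = 2 * k * m * n
print(decomp, recomp)  # 1179648 2097152 -- the kmn recomposition dominates
```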
</sec>
<sec id="S3.SS2">
<title>Non-negative Matrix Factorization</title>
<p>The Non-Negative Matrix Factorization (NMF) (<xref ref-type="bibr" rid="B39">Lee and Seung, 1999</xref>) algorithm decomposes a non-negative matrix into two non-negative left and right matrices <inline-formula><mml:math id="INEQ22"><mml:mrow><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mo>&#x2208;</mml:mo><mml:msubsup><mml:mi>&#x211D;</mml:mi><mml:mo>+</mml:mo><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="INEQ23"><mml:mrow><mml:mrow><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mo>&#x2208;</mml:mo><mml:msubsup><mml:mi>&#x211D;</mml:mi><mml:mo>+</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo>,</mml:mo></mml:mrow></mml:math></inline-formula> respectively.</p>
<p>However, the gradient &#x2207;<sub>&#x0398;</sub><italic>l</italic> is not non-negative. In our NMF algorithm, we therefore start with a batch size <italic>B</italic> approximation of &#x2207;<sub>&#x0398;</sub><italic>l</italic> and then use the rectified linear unit (ReLU) activation function, whose unilateral inhibition, ReLU(<italic>v</italic>) = <italic>max</italic>(<italic>v</italic>, 0), restricts the sign of the gradient. The goal is to approximate the positive and negative parts separately with two sets of <italic>k</italic>-rank matrices such that <inline-formula><mml:math id="INEQ27"><mml:mrow><mml:mrow><mml:mover accent="true"><mml:mo>&#x2207;</mml:mo><mml:mo>^</mml:mo></mml:mover><mml:mo>&#x2062;</mml:mo><mml:msub><mml:mi>l</mml:mi><mml:mi>P</mml:mi></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mi>P</mml:mi></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mi>P</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="INEQ28"><mml:mrow><mml:mrow><mml:mover accent="true"><mml:mo>&#x2207;</mml:mo><mml:mo>^</mml:mo></mml:mover><mml:mo>&#x2062;</mml:mo><mml:msub><mml:mi>l</mml:mi><mml:mi>N</mml:mi></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mi>N</mml:mi></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mi>N</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:math></inline-formula>. 
Four matrices <inline-formula><mml:math id="INEQ29"><mml:mrow><mml:mrow><mml:msub><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mi>P</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mi>N</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msubsup><mml:mi>&#x211D;</mml:mi><mml:mo>+</mml:mo><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="INEQ30"><mml:mrow><mml:mrow><mml:msub><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mi>P</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mi>N</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msubsup><mml:mi>&#x211D;</mml:mi><mml:mo>+</mml:mo><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula> are randomly initialized at the beginning of training from a Gaussian distribution whose standard deviation is the square root of the mean gradient value divided by the rank <italic>k</italic>, <inline-formula><mml:math id="INEQ31"><mml:msqrt><mml:mfrac><mml:mrow><mml:msub><mml:mo>&#x2207;</mml:mo><mml:mi mathvariant="normal">&#x0398;</mml:mi></mml:msub><mml:mo>&#x2061;</mml:mo><mml:msub><mml:mover accent="true"><mml:mi>l</mml:mi><mml:mo>&#x00AF;</mml:mo></mml:mover><mml:mi>P</mml:mi></mml:msub></mml:mrow><mml:mi>k</mml:mi></mml:mfrac></mml:msqrt></mml:math></inline-formula> and <inline-formula><mml:math id="INEQ32"><mml:msqrt><mml:mfrac><mml:mrow><mml:msub><mml:mo>&#x2207;</mml:mo><mml:mi mathvariant="normal">&#x0398;</mml:mi></mml:msub><mml:mo>&#x2061;</mml:mo><mml:msub><mml:mover 
accent="true"><mml:mi>l</mml:mi><mml:mo>&#x00AF;</mml:mo></mml:mover><mml:mi>N</mml:mi></mml:msub></mml:mrow><mml:mi>k</mml:mi></mml:mfrac></mml:msqrt></mml:math></inline-formula>. Then, we use a modified version of the Fast HALS (Hierarchical Alternating Least Squares) (<xref ref-type="bibr" rid="B14">Cichocki and Phan, 2009</xref>) algorithm to alternately update the left and right matrices. To do the minimization, we assume a pair of loss functions of the form <inline-formula><mml:math id="INEQ33"><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mo>&#x2062;</mml:mo><mml:msubsup><mml:mrow><mml:mo fence="true">||</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mo>&#x2207;</mml:mo><mml:mi mathvariant="normal">&#x0398;</mml:mi></mml:msub><mml:mo>&#x2061;</mml:mo><mml:msubsup><mml:mi>l</mml:mi><mml:mi>P</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>k</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo>-</mml:mo><mml:mrow><mml:msub><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mrow><mml:mi>P</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2062;</mml:mo><mml:msubsup><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mi>P</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>k</mml:mi></mml:mrow><mml:mi>T</mml:mi></mml:msubsup></mml:mrow></mml:mrow><mml:mo fence="true">||</mml:mo></mml:mrow><mml:mi mathvariant="normal">F</mml:mi><mml:mn>2</mml:mn></mml:msubsup></mml:mrow></mml:math></inline-formula>, where <italic>k</italic> is the rank and <italic>F</italic> is the Frobenius norm, with one loss function for the positive matrix and a similar one for the negative matrix. 
This product of the left (<inline-formula><mml:math id="INEQ34"><mml:msub><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mrow><mml:mi>P</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>) and right (<inline-formula><mml:math id="INEQ35"><mml:mrow><mml:msubsup><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mi>P</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>k</mml:mi></mml:mrow><mml:mi>T</mml:mi></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> matrices best approximates the non-negative gradient when these loss functions are minimized. During the non-negative part iteration update, the quantities <inline-formula><mml:math id="INEQ36"><mml:mrow><mml:msubsup><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mi>P</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:mo>&#x2062;</mml:mo><mml:msub><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mi>P</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="INEQ37"><mml:mrow><mml:msub><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mi>P</mml:mi></mml:msub><mml:mo>&#x2062;</mml:mo><mml:msubsup><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mi>P</mml:mi><mml:mi>T</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula> are calculated. 
The diagonal matrices <inline-formula><mml:math id="INEQ38"><mml:mrow><mml:msub><mml:mi>D</mml:mi><mml:mi>X</mml:mi></mml:msub><mml:mo>&#x2190;</mml:mo><mml:mrow><mml:mi>Diag</mml:mi><mml:mo>&#x2062;</mml:mo><mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mi>P</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:mo>&#x2062;</mml:mo><mml:msub><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mi>P</mml:mi></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="INEQ39"><mml:mrow><mml:msub><mml:mi>D</mml:mi><mml:mi mathvariant="normal">&#x25B3;</mml:mi></mml:msub><mml:mo>&#x2190;</mml:mo><mml:mrow><mml:mi>Diag</mml:mi><mml:mo>&#x2062;</mml:mo><mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mi>P</mml:mi></mml:msub><mml:mo>&#x2062;</mml:mo><mml:msubsup><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mi>P</mml:mi><mml:mi>T</mml:mi></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:math></inline-formula> are calculated to scale the updates (<xref ref-type="bibr" rid="B15">Cichocki et al., 2009</xref>). 
Similar quantities are calculated for <inline-formula><mml:math id="INEQ40"><mml:msub><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mi>N</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula><mml:math id="INEQ41"><mml:msub><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mi>N</mml:mi></mml:msub></mml:math></inline-formula>. With this basic framework in mind, we can iteratively, for as many as <italic>P</italic> cycles, update the positive (or negative) decomposition by</p>
<p><inline-formula><mml:math id="INEQ42"><mml:mrow><mml:msub><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mi>P</mml:mi></mml:msub><mml:mo>&#x2190;</mml:mo><mml:mrow><mml:mtext>ReLU</mml:mtext><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mi>P</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mrow><mml:msub><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mi>P</mml:mi></mml:msub><mml:mo>&#x2062;</mml:mo><mml:msub><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mi>P</mml:mi></mml:msub><mml:mo>&#x2062;</mml:mo><mml:msubsup><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mi>P</mml:mi><mml:mi>T</mml:mi></mml:msubsup></mml:mrow></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:msub><mml:mo>&#x2207;</mml:mo><mml:mi mathvariant="normal">&#x0398;</mml:mi></mml:msub><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:msub><mml:mover accent="true"><mml:mi>l</mml:mi><mml:mo>&#x00AF;</mml:mo></mml:mover><mml:mi>P</mml:mi></mml:msub><mml:mo>&#x2062;</mml:mo><mml:msubsup><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mi>P</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>k</mml:mi></mml:mrow><mml:mi>T</mml:mi></mml:msubsup></mml:mrow></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x2062;</mml:mo><mml:msub><mml:mi>D</mml:mi><mml:mi mathvariant="normal">&#x25B3;</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:math></inline-formula>, and</p>
<disp-formula id="S3.Ex5"><mml:math id="M5" display="block"><mml:mrow><mml:msub><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mi>P</mml:mi></mml:msub><mml:mo>&#x2190;</mml:mo><mml:mtext>ReLU</mml:mtext><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mi>P</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo>-</mml:mo><mml:msub><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mi>P</mml:mi></mml:msub><mml:msubsup><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mi>P</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:msub><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mi>P</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mo>&#x2207;</mml:mo><mml:mi mathvariant="normal">&#x0398;</mml:mi></mml:msub><mml:msub><mml:mover accent="true"><mml:mi>l</mml:mi><mml:mo>&#x00AF;</mml:mo></mml:mover><mml:mi>P</mml:mi></mml:msub><mml:msub><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mrow><mml:mi>P</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:mtext>X</mml:mtext></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula>
<p>The number of iterations, <italic>P</italic>, depends on the desired level of convergence as well as on the initialization. It can be reduced by streaming the current best estimates for <inline-formula><mml:math id="INEQ43"><mml:mrow><mml:msub><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mi>P</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mi>P</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="INEQ44"><mml:mrow><mml:msub><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mi>N</mml:mi></mml:msub><mml:mo rspace="7.5pt">,</mml:mo><mml:msub><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mi>N</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> from batch to batch after the first random initialization, as we do in our case. In this work, we used a fixed 200 iterations to understand the impact of the NMF factorization on training. We also studied performing these operations with a single iteration to see how streaming would impact training (see Section 3 and <xref ref-type="supplementary-material" rid="FS1">Supplementary Figure 2</xref>).</p>
<p>After convergence, the new left gradient matrix <inline-formula><mml:math id="INEQ45"><mml:mrow><mml:mrow><mml:mover accent="true"><mml:mo>&#x2207;</mml:mo><mml:mo>^</mml:mo></mml:mover><mml:mo>&#x2062;</mml:mo><mml:msub><mml:mi>l</mml:mi><mml:mi>P</mml:mi></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mi>P</mml:mi></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mi>P</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:math></inline-formula> and right matrix <inline-formula><mml:math id="INEQ46"><mml:mrow><mml:mrow><mml:mover accent="true"><mml:mo>&#x2207;</mml:mo><mml:mo>^</mml:mo></mml:mover><mml:mo>&#x2062;</mml:mo><mml:msub><mml:mi>l</mml:mi><mml:mi>N</mml:mi></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mi>N</mml:mi></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mi>N</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:math></inline-formula> would be generated. 
At the end, the low-rank matrix approximation is <inline-formula><mml:math id="INEQ47"><mml:mrow><mml:mrow><mml:msubsup><mml:mover accent="true"><mml:mo>&#x2207;</mml:mo><mml:mo>^</mml:mo></mml:mover><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>B</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>&#x2062;</mml:mo><mml:mi>l</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mover accent="true"><mml:mo>&#x2207;</mml:mo><mml:mo>^</mml:mo></mml:mover><mml:mo>&#x2062;</mml:mo><mml:msub><mml:mi>l</mml:mi><mml:mi>P</mml:mi></mml:msub></mml:mrow><mml:mo>-</mml:mo><mml:mrow><mml:mover accent="true"><mml:mo>&#x2207;</mml:mo><mml:mo>^</mml:mo></mml:mover><mml:mo>&#x2062;</mml:mo><mml:msub><mml:mi>l</mml:mi><mml:mi>N</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:mrow></mml:math></inline-formula>. It is important to understand that, while this method produces a potentially optimal and non-oscillating decomposition, it still relies on summing and reconstructing the batch gradient. This makes it much more computationally complex than the Streaming Batch PCA algorithm. However, its memory overhead could be improved and its hardware mapping will be explored in the future. For this work, we are primarily interested in the impact of the decomposition on training.</p>
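<p>The positive/negative split and the alternating ReLU-clipped updates can be sketched in NumPy. This is an illustrative simplification, not our exact implementation: the function name, the uniform absolute-value initialization, the small regularizer <italic>eps</italic> guarding the diagonal inverses, and the simultaneous (rather than strictly column-wise Fast HALS) update order are all assumptions of the sketch.</p>

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def nmf_gradient_approx(G, k, iters=200, eps=1e-8, seed=0):
    """Approximate a gradient matrix G (m x n) as X_P @ D_P - X_N @ D_N,
    with all four factors kept non-negative by ReLU clipping."""
    rng = np.random.default_rng(seed)
    m, n = G.shape
    parts = []
    for Gs in (relu(G), relu(-G)):             # positive / negative parts
        scale = np.sqrt(Gs.mean() / k + eps)   # ~ sqrt(mean gradient / rank)
        X = np.abs(rng.normal(0.0, scale, (m, k)))
        D = np.abs(rng.normal(0.0, scale, (k, n)))
        for _ in range(iters):                 # alternating HALS-style updates
            dd = D @ D.T                       # (k, k)
            X = relu(X + (Gs @ D.T - X @ dd) / (np.diag(dd) + eps))
            xx = X.T @ X                       # (k, k)
            D = relu(D + (X.T @ Gs - xx @ D) / (np.diag(xx)[:, None] + eps))
        parts.append(X @ D)
    return parts[0] - parts[1]
```

The final subtraction of the two reconstructed parts mirrors the low-rank approximation given above.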
<p>Assuming the sequential least squares minimization (e.g., HALS) is done in <italic>p</italic> iterations, the FLOPs required for NMF scales with <inline-formula><mml:math id="INEQ48"><mml:mrow><mml:mn>3</mml:mn><mml:mi>m</mml:mi><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>2</mml:mn><mml:mi>m</mml:mi><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>2</mml:mn><mml:mi>n</mml:mi><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>2</mml:mn><mml:mi>m</mml:mi><mml:mi>n</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>n</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mn>2</mml:mn><mml:mi>p</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>k</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mi>m</mml:mi><mml:mo>+</mml:mo><mml:mi>n</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup><mml:mo>-</mml:mo><mml:mi>m</mml:mi><mml:mo>-</mml:mo><mml:mi>n</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mi>m</mml:mi><mml:mi>n</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>m</mml:mi><mml:mo>+</mml:mo><mml:mi>n</mml:mi><mml:mo>-</mml:mo><mml:mn>4</mml:mn><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mn>4</mml:mn><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mpadded width="+3.3pt"><mml:mi>m</mml:mi></mml:mpadded><mml:mo rspace="5.8pt">+</mml:mo><mml:mpadded width="+3.3pt"><mml:mi>n</mml:mi></mml:mpadded><mml:mo>+</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:math></inline-formula>. 
The <inline-formula><mml:math id="INEQ49"><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msup><mml:mi>k</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>m</mml:mi><mml:mo>+</mml:mo><mml:mi>n</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup><mml:mo>-</mml:mo><mml:mi>m</mml:mi><mml:mo>-</mml:mo><mml:mi>n</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>+</mml:mo><mml:mi>n</mml:mi></mml:mrow><mml:mo>-</mml:mo><mml:mn>4</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mn>4</mml:mn><mml:mo>&#x2062;</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>m</mml:mi><mml:mo>+</mml:mo><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> calculations are for the <inline-formula><mml:math id="INEQ50"><mml:msub><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mi>P</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula><mml:math id="INEQ51"><mml:msub><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mi>P</mml:mi></mml:msub></mml:math></inline-formula> or <inline-formula><mml:math id="INEQ52"><mml:msub><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mi>N</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula><mml:math 
id="INEQ53"><mml:msub><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mi>N</mml:mi></mml:msub></mml:math></inline-formula> updates in one iteration. As noted previously, the overall computational complexity scales as k<sup>2</sup>(m+n)<sup>2</sup>, making it currently more computationally complex than MBGD. However, reducing this cost would be advantageous, since the NMF algorithm performs better when training networks rank by rank, i.e., when using the <italic>rank-seq</italic> operation discussed below.</p>
</sec>
<sec id="S3.SS3">
<title>Rank Gradient Recomposition Methods</title>
<p>The contrast between the oscillatory behavior of the streaming batch PCA and the additivity of the NMF decomposition methods becomes significant when considering the memristor weight updates in hardware. How these updates are performed is important for understanding the impact of the choice of algorithm on performance. One option is to perform the gradient summation across the ranks of interest outside the analog memory crossbar array before transfer. During training, individual samples are used to update the compressed <italic>k</italic>-rank representation of the gradient <inline-formula><mml:math id="INEQ54"><mml:mrow><mml:msubsup><mml:mover accent="true"><mml:mo>&#x2207;</mml:mo><mml:mo>^</mml:mo></mml:mover><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>B</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mi>l</mml:mi></mml:mrow></mml:math></inline-formula> based on the calculated <inline-formula><mml:math id="INEQ55"><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover></mml:math></inline-formula>, <inline-formula><mml:math id="INEQ56"><mml:mover accent="true"><mml:mi mathvariant="normal">&#x03A3;</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover></mml:math></inline-formula>, and <inline-formula><mml:math id="INEQ57"><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover></mml:math></inline-formula>. 
At the end of a training batch, the gradient is recomposed and then added to the matrix in total <inline-formula><mml:math id="INEQ58"><mml:mrow><mml:mrow><mml:msubsup><mml:mover accent="true"><mml:mo>&#x2207;</mml:mo><mml:mo>^</mml:mo></mml:mover><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>B</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>&#x2062;</mml:mo><mml:mi>l</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mo>&#x22C5;</mml:mo><mml:mover accent="true"><mml:mi mathvariant="normal">&#x03A3;</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mo>&#x22C5;</mml:mo><mml:msup><mml:mover accent="true"><mml:mi mathvariant="normal">&#x25B3;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mi>T</mml:mi></mml:msup></mml:mrow></mml:mrow></mml:math></inline-formula> by sequentially updating each weight one by one. We call this approach the <italic>rank-sum</italic> update and summarize it in <xref ref-type="fig" rid="F2">Figure 2</xref>.</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption><p>Transfer principle of the gradient approximation information to the memristor array for the rank summation outside the array (rank-sum). The training co-processor would have to support rank-k gradient recomposition, thus increasing the hardware overhead.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-15-749811-g002.tif"/>
</fig>
<p>However, <italic>rank-sum</italic> is inefficient since (a) the data must be multiplied out and summed outside the array and (b) the result must be transferred one by one into each of the individual memristor devices. The estimated computational complexity of this operation, as noted in <xref ref-type="fig" rid="F1">Figure 1</xref>, is 2kmn. A more efficient implementation for pipelining requires the gradient summation inside the array, using the update properties of the memristor devices. After producing an approximation of the gradient, the weight matrix is updated rank by rank, and the gradient is summed on the memory devices using outer product update operations. Outer product operations can be done in multiple ways, either using pulses on the rows and columns (<xref ref-type="bibr" rid="B35">Kataeva et al., 2015</xref>; <xref ref-type="bibr" rid="B21">Gokmen and Vlasov, 2016</xref>) or by relying on an exponential dependence on the applied bias on the rows and columns to multiply out the gradient (<xref ref-type="bibr" rid="B35">Kataeva et al., 2015</xref>). Because of the limited row/column access, outer product operations restrict the updates to rank-1 updates. 
Consequently, <inline-formula><mml:math id="INEQ59"><mml:mrow><mml:msubsup><mml:mover accent="true"><mml:mo>&#x2207;</mml:mo><mml:mo>^</mml:mo></mml:mover><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>B</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>&#x2062;</mml:mo><mml:mi>l</mml:mi></mml:mrow></mml:math></inline-formula> is a rank-1 matrix for the <italic>j</italic>th rank from the matrix product for the column <italic>j</italic> in <inline-formula><mml:math id="INEQ60"><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover></mml:math></inline-formula>, <inline-formula><mml:math id="INEQ61"><mml:mover accent="true"><mml:mi mathvariant="normal">&#x03A3;</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover></mml:math></inline-formula>, and <inline-formula><mml:math id="INEQ62"><mml:mrow><mml:mover accent="true"><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mo>:</mml:mo><mml:mrow><mml:mrow><mml:msubsup><mml:mover accent="true"><mml:mo>&#x2207;</mml:mo><mml:mo>^</mml:mo></mml:mover><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>B</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>&#x2062;</mml:mo><mml:mi>l</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2062;</mml:mo><mml:msub><mml:mover accent="true"><mml:mi mathvariant="normal">&#x03A3;</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mi>j</mml:mi></mml:msub><mml:mo>&#x2062;</mml:mo><mml:mmultiscripts><mml:mover accent="true"><mml:mi 
mathvariant="normal">&#x0394;</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:none/><mml:mi>T</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:none/></mml:mmultiscripts></mml:mrow></mml:mrow></mml:mrow></mml:math></inline-formula>. The column number is less than or equal to rank <italic>k</italic>. Unlike the <italic>rank-sum</italic> method, the <italic>rank-seq</italic> method does not pre-sum <inline-formula><mml:math id="INEQ63"><mml:mrow><mml:msubsup><mml:mover accent="true"><mml:mo>&#x2207;</mml:mo><mml:mo>^</mml:mo></mml:mover><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>B</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>&#x2062;</mml:mo><mml:mi>l</mml:mi></mml:mrow></mml:math></inline-formula> for all the ranks <italic>k</italic>. The matrix <inline-formula><mml:math id="INEQ64"><mml:mrow><mml:msubsup><mml:mover accent="true"><mml:mo>&#x2207;</mml:mo><mml:mo>^</mml:mo></mml:mover><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>B</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>&#x2062;</mml:mo><mml:mi>l</mml:mi></mml:mrow></mml:math></inline-formula> is used to calculate the necessary updates for rank <italic>j</italic> to be transferred to the memristor matrix where the gradient is recomposed at the physical level. We call this method the <italic>rank-seq</italic> update and show its principles in <xref ref-type="fig" rid="F3">Figure 3</xref>.</p>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption><p>Transfer principle of the gradient approximation information to the memristor array via rank-by-rank (rank-seq) transfer. The gradient recomposition can be done by physical summation of rank-1 updates at the array level, thus reducing the hardware overhead for the training co-processor.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-15-749811-g003.tif"/>
</fig>
<p>It is worth pointing out that in a traditional floating-point software implementation, the two algorithms are equivalent within rounding error. However, when the gradient information needs to be transferred to a non-ideal memristor circuit, the two methods differ. <italic>Rank-sum</italic> updates the gradient information to the memristor crossbar only once, while <italic>rank-seq</italic> needs <italic>k</italic> updates for <italic>k</italic> ranks. Each update to a non-ideal memristor is accompanied by a loss in gradient precision, which is why <italic>rank-seq</italic> is expected to have lower accuracy than <italic>rank-sum</italic> for non-overlapping ranks. However, <italic>rank-seq</italic> is more efficient since it requires less digital computation and hardware overhead.</p>
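The precision trade-off between the two transfer methods can be illustrated with a toy numerical sketch. This is not the NeuroSim implementation: the additive-noise write model, the dimensions, and the noise level below are illustrative assumptions.

```python
# Toy sketch of rank-sum vs. rank-seq transfers under imperfect writes.
# Assumption: each physical transfer to the array injects additive noise.
import numpy as np

rng = np.random.default_rng(0)

def noisy_write(W, dW, sigma=0.01):
    """One imperfect crossbar transfer: every write injects device noise."""
    return W + dW + sigma * rng.standard_normal(W.shape)

m, n, k = 100, 100, 5
X = rng.standard_normal((m, k))      # columns of the X factor
S = rng.random(k)                    # diagonal of the Sigma factor
D = rng.standard_normal((n, k))      # columns of the Delta factor

G = (X * S) @ D.T                    # recomposed rank-k gradient estimate

# rank-sum: recompose the gradient digitally, transfer to the array once.
W_sum = noisy_write(np.zeros((m, n)), G)

# rank-seq: transfer k rank-1 outer products; the array sums them physically.
W_seq = np.zeros((m, n))
for j in range(k):
    W_seq = noisy_write(W_seq, S[j] * np.outer(X[:, j], D[:, j]))

err_sum = np.linalg.norm(W_sum - G)  # one noise injection
err_seq = np.linalg.norm(W_seq - G)  # k noise injections
```

With k independent writes, the expected squared transfer error of rank-seq is roughly k times that of rank-sum, consistent with the precision loss discussed above.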
</sec>
<sec id="S3.SS4">
<title>Stochastic Rounding</title>
<p>As part of the gradient transfer, the accuracy of the quantization of the weight update is also investigated in relation to the device properties. Although in theory the memristor has analog programmability to any desired state between the ON and OFF states, in practice the device has low bit precision. The reason is that each state can naturally decay and can be impacted by read disturb or by the programming of neighboring devices (<xref ref-type="bibr" rid="B42">Lin et al., 2019</xref>). Therefore, the number of conductance levels that can be reliably accessed and distinguished from each other is limited. The resulting quantization of the weight update introduces errors due to the lower bit precision. Since the memristor conductance change is related to the number of applied pulses (an integer), the respective weight modification needs to be rounded appropriately to a lower bit precision. Rounding-to-nearest is the method commonly used (<xref ref-type="bibr" rid="B11">Chen et al., 2017</xref>). However, it appears to cause premature convergence to sub-optimal accuracies at higher batch sizes, because small gradients combined with low bit precision round the delta weights to zero.</p>
<p>In this work, stochastic rounding is investigated instead to overcome this quantization-induced vanishing gradient issue in limited-precision weights. Stochastic rounding, proposed in the 1950s and 1960s (<xref ref-type="bibr" rid="B17">Forsythe, 1950</xref>; <xref ref-type="bibr" rid="B5">Barnes et al., 1951</xref>; <xref ref-type="bibr" rid="B33">Hull and Swenson, 1966</xref>), can be particularly useful in deep network training with low bit precision arithmetic (<xref ref-type="bibr" rid="B24">Gupta et al., 2015</xref>). A real value <italic>r</italic> which lies between floor value (<italic>r</italic><sub>1</sub>) and ceiling value (<italic>r</italic><sub>2</sub>) is stochastically rounded up to <italic>r</italic><sub>2</sub> with probability (<italic>r</italic>-<italic>r</italic><sub>1</sub>)/(<italic>r</italic><sub>2</sub>-<italic>r</italic><sub>1</sub>) and down to <italic>r</italic><sub>1</sub> with probability (<italic>r</italic><sub>2</sub>-<italic>r</italic>)/(<italic>r</italic><sub>2</sub>-<italic>r</italic><sub>1</sub>). The average error of this rounding method is zero, since the expected value of the result of stochastically rounding <italic>r</italic> is <italic>r</italic> itself. Using this stochastic rounding method, some of the sub-bit information that is discarded by a deterministic rounding scheme can be maintained.</p>
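A minimal sketch of this rounding rule for integer pulse counts (with <italic>r</italic><sub>2</sub> - <italic>r</italic><sub>1</sub> = 1) follows; the function name and the empirical-mean check are illustrative, not the simulator's code.

```python
# Stochastic rounding to an integer: round up with probability equal to the
# fractional part, so the expected value of the result equals the input.
import math
import random

def stochastic_round(r):
    """Round r down to floor(r), or up to floor(r) + 1 with
    probability (r - floor(r))."""
    r1 = math.floor(r)
    return r1 + 1 if random.random() < (r - r1) else r1

# Zero-mean error check: the mean of many roundings of 0.3 approaches 0.3,
# preserving sub-bit information on average.
random.seed(42)
mean = sum(stochastic_round(0.3) for _ in range(100_000)) / 100_000
```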
</sec>
</sec>
<sec sec-type="results" id="S4">
<title>Results</title>
<sec id="S4.SS1">
<title>Network Structure and Simulation Environment</title>
<p>A multi-layer perceptron trained on the MNIST dataset is chosen. It achieves high accuracy in software, and its weight matrices map directly to memristor crossbars, making it suitable for exploring device-algorithm interactions. The impact of the proposed methods can thus be quantified without interfering effects from a training optimizer, a potentially unoptimized deep network, or an overly challenging dataset. The network structure is 400 (input layer) - 100 (hidden layer) - 10 (output layer). The hardware mapping and training on the MNIST dataset are available in NeuroSim V3.0, an open-source integrated simulation framework based on C++ for benchmarking synaptic devices and array architectures via system-level learning accuracy and hardware performance indicators (<xref ref-type="bibr" rid="B11">Chen et al., 2017</xref>). As part of this work, modules for MBGD, streaming batch PCA and NMF, as well as the two weight transfer methods <italic>rank-sum</italic> and <italic>rank-seq</italic>, were implemented and integrated with the existing NeuroSim V3.0 capabilities.</p>
<p>The algorithmic flow between the modules and the device models used are shown in <xref ref-type="fig" rid="F4">Figure 4</xref>.</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption><p>The code structure and device models <bold>(A)</bold> expanded NeuroSim v3.0 with added modules and functions highlighted with a double border. <bold>(B)</bold> Ideal memristor with linear symmetric and reproducible weight update and a large ON / OFF ratio vs. <bold>(C)</bold> Non-ideal memristor model with smaller ON / OFF ratio, weight update nonlinearity, and variability (5 cycles shown).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-15-749811-g004.tif"/>
</fig>
<p>The gradient information obtained during backpropagation is decomposed according to the desired method. The desired weight update is calculated in the form of pulses to update the conductance in hardware. This paper uses the ideal device model and the non-ideal (real) device model with the 1T1R configuration of NeuroSim V3.0 to avoid leakage effects. The ideal device model assumes a reproducible linear relationship between the applied number of pulses and the obtained conductance (<xref ref-type="fig" rid="F4">Figure 4B</xref>). In the non-ideal device model, there is non-linearity between the applied pulses and the conductance, which leads to imperfect weight programming and variability in the operation. The nonlinearity values for long-term potentiation (LTP) and long-term depression (LTD) are 2.40 and &#x2212;4.88, respectively. The cycle-to-cycle variation is 3.5%. This stochasticity is sufficiently large that sending an &#x201C;increase weight&#x201D; pulse can even randomly lead to a &#x201C;decreased weight&#x201D; and vice versa (<xref ref-type="fig" rid="F4">Figure 4C</xref>). Other hardware parameters are the default values of NeuroSim; for example, the read noise is 0, and the minimum and maximum conductance are &#x223C;3 nS and 38 nS, respectively. These default values are extracted from fitting experimental weight update data derived from Ag:a-Si devices (<xref ref-type="bibr" rid="B34">Jo et al., 2010</xref>; <xref ref-type="bibr" rid="B11">Chen et al., 2017</xref>). For this work, a device with 500 levels is assumed (approximately 9-bit precision). Each change in level is assumed to correspond to one update pulse, with 500 pulses ultimately putting the device in the fully OFF or fully ON state.</p>
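The qualitative shape of such a nonlinear pulse-to-conductance curve can be sketched with a saturating exponential of the kind NeuroSim fits to device data. The fitting constant <italic>A</italic> and its relation to the reported nonlinearity values (2.40 / &#x2212;4.88) are illustrative assumptions here, not the simulator's exact parameterization; only G_MIN, G_MAX, and the 500 levels come from the text.

```python
# Sketch of an LTP (potentiation) pulse-to-conductance curve with saturation.
# A is an assumed fitting constant controlling the nonlinearity.
import math

G_MIN, G_MAX = 3e-9, 38e-9     # ~3 nS and 38 nS, as in the text
P_MAX = 500                    # 500 conductance levels

def ltp_conductance(p, A=150.0):
    """Conductance after p potentiation pulses (0 <= p <= P_MAX)."""
    B = (G_MAX - G_MIN) / (1 - math.exp(-P_MAX / A))
    return B * (1 - math.exp(-p / A)) + G_MIN

# An ideal device would traverse the range linearly; the nonlinear curve
# front-loads the conductance change, so half the pulses overshoot half-range.
half_way = ltp_conductance(P_MAX // 2)
linear_half = G_MIN + 0.5 * (G_MAX - G_MIN)
```

The front-loaded response is what makes identical update pulses produce unequal weight changes, the imperfect programming described above.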
</sec>
<sec id="S4.SS2">
<title>Rounding Effects of the Weight Update</title>
<p><xref ref-type="fig" rid="F5">Figure 5</xref> shows the training of the MLP network with software (64-bit floating-point precision), the ideal memristor device model (500 levels, 9-bit) and the real device model (500 levels, 9-bit with cycle-to-cycle variability and non-linearity). The MNIST testing accuracies for the regular round-to-nearest truncation vs. stochastic rounding are determined across various batch sizes via a logarithmic search of the learning rate domain. It can be observed that a network implemented with limited-precision memristor devices, but no other non-idealities, achieves an SGD accuracy of 96.5%, similar to a traditional software floating-point implementation. However, the quantization of the weight update shrinks the learning rate window dramatically. Whereas the floating-point implementation can achieve an accuracy &#x003E; 95% for any learning rate between 0.001 and 1, the low-precision memristor-based network can only train with a learning rate between 0.1 and 1 (<xref ref-type="fig" rid="F5">Figure 5A</xref>). When stochastic rounding is used, the learning rate window for the quantized memristor model widens significantly, resembling the floating-point implementation (<xref ref-type="fig" rid="F5">Figure 5B</xref>). This result highlights the importance of hyperparameter optimization and hardware-sensitive rounding in these low-precision networks.</p>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption><p>The MBGD results for rounding to nearest and stochastic rounding. Learning rate windows for ideal and non-ideal device models and batch sizes using <bold>(A)</bold> rounding-to-nearest vs. <bold>(B)</bold> stochastic rounding of the delta weight. Best convergence curves for the optimal learning rate for each batch size using <bold>(C)</bold> rounding-to-nearest vs. <bold>(D)</bold> stochastic rounding. <bold>(E)</bold> Optimal learning rate corresponding to the highest accuracy vs. the batch size for non-ideal devices. It is observed that the learning rate increases supra-linearly with the batch size. <bold>(F)</bold> Best accuracy at the optimal learning rate vs. the batch size. In <bold>(E)</bold> and <bold>(F)</bold>, the colorful plot markers represent results at convergence. The faded gray markers for the rounding-to-nearest case represent the inability of the network to train properly due to vanishing gradients in this low precision model. The values for the batch sizes 1 to 4096 are obtained using the realistic memristor model. Representative results for software and ideal device model are also included for comparison.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-15-749811-g005.tif"/>
</fig>
<p>Learning rate optimization was used to obtain these best accuracy results. The convergence curves for different device models, different batch sizes, and the two rounding methods were run for learning rates spanning nearly eight orders of magnitude, from 10<sup>&#x2013;6</sup> (0.000001) to 10<sup>1.6</sup> (&#x2248; 40). To optimize the search, this range was explored in logarithmic steps. The learning rates corresponding to the best accuracy for each test set are plotted in <xref ref-type="fig" rid="F5">Figure 5E</xref>. As the batch size increased, the value of the optimal learning rate also increased. The learning rates for the round-to-nearest method are higher than those for the stochastic rounding method, despite their accuracies being similar. This might be because stochastic rounding applied to these limited-bit-precision systems can still, on average over many operations, keep track of some sub-bit information. Stochastic rounding applied across the weights in the array can statistically preserve more gradient information and carry it over to the next backpropagation iterations (<xref ref-type="bibr" rid="B24">Gupta et al., 2015</xref>). By comparison, the round-to-nearest truncation discards such gradient information.</p>
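The logarithmic search can be sketched as follows; the step size of 0.2 decades matches the exponents reported in Table 1, while `train_and_eval` is a stand-in (an assumption) for a full simulated training run.

```python
# Logarithmic learning-rate grid search over ~10^-6 .. 10^1.6.
import numpy as np

def lr_grid(lo_exp=-6.0, hi_exp=1.6, step=0.2):
    """Log-spaced learning rates, one per `step` decades."""
    exps = np.arange(lo_exp, hi_exp + 1e-9, step)
    return 10.0 ** exps

def search(train_and_eval, rates):
    """Run one training/evaluation per rate; keep the best accuracy."""
    accs = [train_and_eval(lr) for lr in rates]
    best = int(np.argmax(accs))
    return rates[best], accs[best]

# Toy stand-in for a training run: accuracy peaks near lr = 1.
def toy(lr):
    return 95.0 - np.log10(lr) ** 2

best_lr, best_acc = search(toy, lr_grid())
```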
<p>Overall, it can be observed that the accuracy increases almost linearly with the log of the batch size for medium batch sizes (up to 128) for both round-to-nearest and stochastic rounding (<xref ref-type="fig" rid="F5">Figure 5F</xref>). It plateaus at higher batch sizes, converging to the MBGD floating-point software accuracy (<xref ref-type="table" rid="T1">Table 1</xref>). For our implementation, large batch sizes are similarly needed in MBGD to overcome the gradient noise due to the non-ideal memristor synaptic weights. These results show that ideal memristor behavior, while desirable, is not needed for this small multi-layer perceptron. The effects are likely to be even more apparent in larger fixed-precision networks due to compounding effects, as seen in related work (<xref ref-type="bibr" rid="B24">Gupta et al., 2015</xref>). Existing memristors can be used successfully despite their non-idealities, and neural networks implemented with real memristor models can achieve software equivalency using appropriate algorithmic methods for training.</p>
<table-wrap position="float" id="T1">
<label>TABLE 1</label>
<caption><p>Best accuracy observed between the rounding methods at different batch sizes.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left"/>
<td valign="top" align="center"/>
<td valign="top" align="center" colspan="2">MBGD Rounding-to-nearest<hr/></td>
<td valign="top" align="center" colspan="2">MBGD Stochastic Rounding<hr/></td>
</tr>
<tr>
<td valign="top" align="left">Synaptic weight</td>
<td valign="top" align="center">B</td>
<td valign="top" align="center">Best LR</td>
<td valign="top" align="center">Best Accuracy (%)</td>
<td valign="top" align="center">Best LR</td>
<td valign="top" align="center">Best Accuracy (%)</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">64-FP (benchmark)</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">10<sup>&#x2013;0.8</sup> = 0.1585</td>
<td valign="top" align="center">96.81</td>
<td valign="top" align="left" colspan="2">Same</td>
</tr>
<tr>
<td valign="top" align="left">Ideal device</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">10<sup>&#x2013;0.2</sup> = 0.6310</td>
<td valign="top" align="center">96.53</td>
<td valign="top" align="center">10<sup>&#x2013;0.4</sup> = 0.3981</td>
<td valign="top" align="center">96.5</td>
</tr>
<tr>
<td valign="top" align="left">Real device</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">10<sup>&#x2013;1.4</sup> = 0.0398</td>
<td valign="top" align="center">48.17</td>
<td valign="top" align="center">10<sup>&#x2013;4</sup> = 0.0001</td>
<td valign="top" align="center">48.45</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">2</td>
<td valign="top" align="center">10<sup>&#x2013;1.2</sup> = 0.0631</td>
<td valign="top" align="center">51.71</td>
<td valign="top" align="center">10<sup>&#x2013;3.8</sup> = 0.0002</td>
<td valign="top" align="center">53.34</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">4</td>
<td valign="top" align="center">10<sup>&#x2013;0.1</sup> = 0.1000</td>
<td valign="top" align="center">56.68</td>
<td valign="top" align="center">10<sup>&#x2013;3.4</sup> = 0.0004</td>
<td valign="top" align="center">60.28</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">8</td>
<td valign="top" align="center">10<sup>&#x2013;0.8</sup> = 0.1585</td>
<td valign="top" align="center">65.88</td>
<td valign="top" align="center">10<sup>&#x2013;3.4</sup> = 0.0004</td>
<td valign="top" align="center">61.13</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">16</td>
<td valign="top" align="center">10<sup>&#x2013;0.6</sup> = 0.2512</td>
<td valign="top" align="center">74.02</td>
<td valign="top" align="center">10<sup>&#x2013;2.6</sup> = 0.0025</td>
<td valign="top" align="center">69.34</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">32</td>
<td valign="top" align="center">10<sup>&#x2013;0.2</sup> = 0.6310</td>
<td valign="top" align="center">80.58</td>
<td valign="top" align="center">10<sup>&#x2013;2.2</sup> = 0.0063</td>
<td valign="top" align="center">78.24</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">64</td>
<td valign="top" align="center">10<sup>1</sup> = 1.0000</td>
<td valign="top" align="center">85.50</td>
<td valign="top" align="center">10<sup>&#x2013;1.6</sup> = 0.0251</td>
<td valign="top" align="center">83.35</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">128</td>
<td valign="top" align="center">10<sup>0.2</sup> = 1.5849</td>
<td valign="top" align="center">88.24</td>
<td valign="top" align="center">10<sup>&#x2013;1.2</sup> = 0.0631</td>
<td valign="top" align="center">86.49</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">256</td>
<td valign="top" align="center">10<sup>0.2</sup> = 1.5849</td>
<td valign="top" align="center">87.48</td>
<td valign="top" align="center">10<sup>0.2</sup> = 0.2512</td>
<td valign="top" align="center">88.53</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">512</td>
<td valign="top" align="center">10<sup>0.2</sup> = 1.5849</td>
<td valign="top" align="center">85.43</td>
<td valign="top" align="center">10<sup>&#x2013;0.2</sup> = 0.6310</td>
<td valign="top" align="center">90.06</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">1024</td>
<td valign="top" align="center">10<sup>0</sup> = 1.0000</td>
<td valign="top" align="center">28.38</td>
<td valign="top" align="center">10<sup>0</sup> = 1.0000</td>
<td valign="top" align="center">91.28</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">2048</td>
<td valign="top" align="center">10<sup>&#x2013;0.6</sup> = 0.2512</td>
<td valign="top" align="center">31.20</td>
<td valign="top" align="center">10<sup>0.4</sup> = 2.5119</td>
<td valign="top" align="center">91.99</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">4096</td>
<td valign="top" align="center">10<sup>&#x2013;0.2</sup> = 0.6310</td>
<td valign="top" align="center">29.99</td>
<td valign="top" align="center">10<sup>0.6</sup> = 3.9811</td>
<td valign="top" align="center">91.93</td>
</tr>
<tr>
<td valign="top" align="left">64-FP</td>
<td valign="top" align="center">4096</td>
<td valign="top" align="center">10<sup>0.2</sup> = 1.5849</td>
<td valign="top" align="center">93.42</td>
<td valign="top" align="left" colspan="2">Same</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">8192</td>
<td valign="top" align="center">10<sup>0.2</sup> = 1.5849</td>
<td valign="top" align="center">90.73</td>
<td valign="top" align="left" colspan="2">Same</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="S4.SS3">
<title>Streaming Batch PCA With Ideal vs. Non-ideal Weights</title>
<p>An in-depth investigation was done to explore how the accuracy changes with the rank, batch size, and transfer method, and how the streaming batch PCA algorithm differs from full-rank MBGD. Two batch sizes were investigated: 128 and 4096. The learning rates used for this streaming batch PCA investigation correspond to the best accuracies obtained for these batch sizes with MBGD. The decomposition method was applied to both layers at the same rank. The ranks investigated were 1, 3, and 10.</p>
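The effect of the rank can be sketched with a truncated SVD, the batch analogue of the streaming PCA used here (the streaming algorithm itself, and the synthetic gradient with a decaying spectrum, are not reproduced; both are assumptions for illustration). A rank-<italic>k</italic> factorization stores only k(m + n + 1) values instead of the full m &#x00D7; n gradient.

```python
# Low-rank approximation of a 400 x 100 batch gradient (the first-layer
# shape of the 400-100-10 network) at ranks 1, 3, and 10.
import numpy as np

rng = np.random.default_rng(1)

def low_rank_approx(G, k):
    """Best rank-k approximation of G via truncated SVD."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# Synthetic gradient with a decaying spectrum (illustrative assumption).
m, n = 400, 100
G = sum((0.5 ** j) * np.outer(rng.standard_normal(m), rng.standard_normal(n))
        for j in range(20))

errs = {k: np.linalg.norm(G - low_rank_approx(G, k)) / np.linalg.norm(G)
        for k in (1, 3, 10)}
# Higher rank captures more of the gradient, mirroring the rank ordering
# seen in the accuracy results.
```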
<p><xref ref-type="fig" rid="F6">Figure 6</xref> summarizes the <italic>rank-sum</italic> results. As expected, the accuracy of the rank 1 results was lower than that of rank 3 and rank 10 for both batch sizes 128 and 4096 and for both device models. For the ideal device model, the performance slightly decreases for MBGD at high batch size. However, the performance for non-ideal devices increases with the batch size. The rank-3 decomposition performs well by comparison with MBGD, particularly at the larger batch size. The convergence performance of rank 10 is at the same level as that of MBGD for the <italic>rank-sum</italic> transfer method. Additionally, with increasing rank, the convergence curve tends to smoothen and converge somewhat faster, achieving the desired accuracy in &#x2248; 25 epochs. The investigation of the impact of block size <italic>b</italic> is included as <xref ref-type="supplementary-material" rid="FS1">Supplementary Figure 1</xref>.</p>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption><p>The rank-sum streaming batch PCA results for a multi-layer perceptron with ideal vs. non-ideal memristive weights at different ranks and batch sizes. <bold>(A,B)</bold> Convergence curves using the rank-sum weight transfer method at batch size = 128; and similarly <bold>(C,D)</bold> at batch size = 4096. The results show that a low-rank gradient decomposition can approximate the MBGD results fairly well, particularly for ranks 3 and 10. For all these experiments, the rounding method is stochastic and block size = 32.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-15-749811-g006.tif"/>
</fig>
<p>While the <italic>rank-sum</italic> weight transfer method works very well for streaming batch PCA and achieves close to MBGD performance, its full hardware implementation would be difficult since the gradient approximation needs to be recomposed externally prior to being transferred to the memristor matrix. By contrast, gradient recomposition of <italic>rank-seq</italic> requires minimal hardware overhead. The results for <italic>rank-seq</italic> are summarized in <xref ref-type="fig" rid="F7">Figure 7</xref>.</p>
<fig id="F7" position="float">
<label>FIGURE 7</label>
<caption><p>The rank-seq streaming batch PCA results for a multi-layer perceptron with ideal vs. non-ideal memristive weights at different ranks and batch sizes. <bold>(A,B)</bold> Convergence curves using the rank-seq weight transfer method at batch size = 128; and similarly <bold>(C,D)</bold> at batch size = 4096. The results show that the streaming batch PCA using rank-seq transfer method cannot approximate the MBGD results, even at high ranks. For all these experiments, the rounding method is stochastic and block size = 32.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-15-749811-g007.tif"/>
</fig>
<p>For the ideal device, the accuracies are similar for both rank-sum and rank-seq. By comparison, for the non-ideal devices, accuracies of &#x2248; 70% are obtained for ranks 3 and 10 at both batch sizes 128 and 4096. This is 15 to 20 percentage points lower than the <italic>rank-sum</italic> and full-rank MBGD results. These results show that streaming batch PCA using the <italic>rank-seq</italic> transfer method cannot approximate the MBGD results, even at high ranks. This is likely because the principal components of streaming batch PCA can have positive and negative elements, creating an oscillatory effect during the programming of the non-ideal memristive weights (<xref ref-type="bibr" rid="B56">Scholz et al., 2008</xref>). This effect is observed indirectly in the noisy convergence curves.</p>
</sec>
<sec id="S4.SS4">
<title>Non-negative Matrix Factorization With Ideal vs. Non-ideal Weights</title>
<p><xref ref-type="fig" rid="F8">Figure 8</xref> summarizes the <italic>rank-sum</italic> NMF results for the different ranks at the two batch sizes and compares them with full-rank MBGD. For the ideal device, rank 1 has lower performance, but rank 3 and rank 10 can approximate MBGD well, particularly at batch size 128. For non-ideal devices, the NMF can approximate the gradient information fairly well, particularly at rank 10. Rank 1 has extremely poor performance, similar to SGD. Rank 3 performs well and can converge, but its accuracy is still &#x2248; 5% to 10% lower than the equivalent MBGD result at the respective batch size. It is also worth noting that a decline in the accuracies of these lower ranks can be observed as the training progresses. A higher rank is needed to achieve satisfactory accuracy and training stability. The result of rank 10 was only 1% to 2% lower than that of the MBGD algorithm. One reason for the high accuracy at rank 10 is that, because the second layer has only 10 neurons, rank 10 is actually equivalent to full-rank training in the last layer, though not in the first layer.</p>
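For reference, the classical NMF multiplicative updates (Lee and Seung, 2000) can be sketched as below. This is a generic illustration, not the implementation integrated into NeuroSim; in particular, applying NMF to a signed gradient requires additional sign handling that is omitted here, and the matrix sizes and iteration count are assumptions.

```python
# Non-negative matrix factorization via Lee & Seung multiplicative updates:
# both factors stay non-negative because every update multiplies by a
# non-negative ratio.
import numpy as np

rng = np.random.default_rng(2)

def nmf(V, k, iters=200, eps=1e-9):
    """Factor non-negative V (m x n) as W @ H with W: m x k, H: k x n."""
    m, n = V.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update H, V held fixed
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update W, V held fixed
    return W, H

V = rng.random((40, 30))                        # toy non-negative matrix
W, H = nmf(V, k=10)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```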
<fig id="F8" position="float">
<label>FIGURE 8</label>
<caption><p>The rank-sum Non-Negative Matrix Factorization (NMF) results for a multi-layer perceptron with ideal vs. non-ideal memristive weights at different ranks and batch sizes. <bold>(A,B)</bold> Convergence curves using the rank-sum weight transfer method at batch size = 128; and similarly <bold>(C,D)</bold> at batch size = 4096. The results show that the NMF can approximate the MBGD results fairly well, particularly for ranks 3 and 10. For all these experiments, the rounding method is stochastic.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-15-749811-g008.tif"/>
</fig>
<p>The results for the <italic>rank-seq</italic> transfer method applied to NMF are shown in <xref ref-type="fig" rid="F9">Figure 9</xref>. For the ideal device, the accuracies are similar to rank-sum, as expected. By comparison, for the non-ideal devices, rank 3 achieves &#x2248; 70% accuracy at batch size 128 and &#x2248; 80% accuracy at batch size 4096. For rank 10, the NMF algorithm performs within 2% to 3% of the MBGD results for the respective batch size. Overall, the <italic>rank-seq</italic> results are similar to the <italic>rank-sum</italic> ones at the equivalent rank. This is likely due to the fact that there is minimal overlap between the ranks for this additive decomposition method (<xref ref-type="bibr" rid="B38">Lee and Seung, 2000</xref>).</p>
<fig id="F9" position="float">
<label>FIGURE 9</label>
<caption><p>The rank-seq NMF results for a multi-layer perceptron with ideal vs. non-ideal memristive weights at different ranks and batch sizes. <bold>(A,B)</bold> Convergence curves using the rank-seq weight transfer method at batch size = 128; and similarly <bold>(C,D)</bold> at batch size = 4096. The results show that the NMF using rank-seq transfer method can approximate the MBGD results well at high ranks. For all these experiments, the rounding method is stochastic.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-15-749811-g009.tif"/>
</fig>
</sec>
<sec id="S4.SS5">
<title>Comparison Between the Algorithms</title>
<p>The streaming batch PCA shows the most efficient compression of the batch gradient information. It obtains better accuracies than NMF for all ranks and batch sizes when the <italic>rank-sum</italic> transfer method is used. Streaming batch PCA <italic>rank-sum</italic> at rank 10 achieves an accuracy of &#x2248; 91.5% for batch size 4096, equivalent to MBGD. This result is around 5 percentage points lower than the traditional 64-bit floating-point algorithmic implementation for MNIST training at batch size 1 (SGD), which is the target/benchmark result for this work. This result, summarized in <xref ref-type="fig" rid="F10">Figure 10</xref> and <xref ref-type="table" rid="T2">Table 2</xref>, shows that decomposition methods in conjunction with large batch size MBGD training can overcome memristive synaptic device non-idealities and achieve close to software-equivalent accuracies.</p>
<fig id="F10" position="float">
<label>FIGURE 10</label>
<caption><p>Comparison of streaming batch PCA and NMF results for rank-sum and rank-seq for a multi-layer perceptron with non-ideal memristive weights at different ranks and batch sizes. The results show that streaming batch PCA using rank-sum slightly outperforms NMF rank-sum and can approximate the MBGD results very well at high ranks. By comparison, NMF rank-seq significantly outperforms streaming batch PCA rank-seq at higher ranks (e.g., 3 and 10) and can approximate the MBGD results well. For all these experiments, the rounding method is stochastic, batch size = 4096 and block size = 32.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-15-749811-g010.tif"/>
</fig>
<table-wrap position="float" id="T2">
<label>TABLE 2</label>
<caption><p>Summary of the best results for different ranks, batch sizes and truncation methods for streaming batch PCA vs. NMF.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left"/>
<td valign="top" align="center"/>
<td valign="top" align="center" colspan="2">Streaming Batch PCA<hr/></td>
<td valign="top" align="center" colspan="2">NMF<hr/></td>
</tr>
<tr>
<td valign="top" align="center" colspan="2">Data type</td>
<td valign="top" align="center">Rank-sum accuracy (%)</td>
<td valign="top" align="center">Rank-seq accuracy (%)</td>
<td valign="top" align="center">Rank-sum accuracy (%)</td>
<td valign="top" align="center">Rank-seq accuracy (%)</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="center" colspan="2">SGD</td>
<td valign="top" align="center" colspan="4">51.04</td>
</tr>
<tr>
<td valign="top" align="center" colspan="2">MBGD 128</td>
<td valign="top" align="center" colspan="4">86.49</td>
</tr>
<tr>
<td valign="top" align="center" colspan="2">MBGD 4096</td>
<td valign="top" align="center" colspan="4">91.94</td>
</tr>
<tr>
<td valign="top" align="left">Batchsize 128 Block 32</td>
<td valign="top" align="center">Rank1</td>
<td valign="top" align="center">58.60</td>
<td valign="top" align="center">58.08</td>
<td valign="top" align="center">24.15</td>
<td valign="top" align="center">31.02</td>
</tr>
<tr>
<td valign="top" align="left"/>
<td valign="top" align="center">Rank3</td>
<td valign="top" align="center">80.04</td>
<td valign="top" align="center">61.97</td>
<td valign="top" align="center">71.87</td>
<td valign="top" align="center">70.06</td>
</tr>
<tr>
<td valign="top" align="left"/>
<td valign="top" align="center">Rank10</td>
<td valign="top" align="center">87.30</td>
<td valign="top" align="center">64.76</td>
<td valign="top" align="center">83.26</td>
<td valign="top" align="center">82.03</td>
</tr>
<tr>
<td valign="top" align="left">Batchsize 4096 Block 32</td>
<td valign="top" align="center">Rank1</td>
<td valign="top" align="center">46.37</td>
<td valign="top" align="center">43.71</td>
<td valign="top" align="center">37.38</td>
<td valign="top" align="center">29.09</td>
</tr>
<tr>
<td valign="top" align="left"/>
<td valign="top" align="center">Rank3</td>
<td valign="top" align="center">82.33</td>
<td valign="top" align="center">64.71</td>
<td valign="top" align="center">82.15</td>
<td valign="top" align="center">71.27</td>
</tr>
<tr>
<td valign="top" align="left"/>
<td valign="top" align="center">Rank10</td>
<td valign="top" align="center">90.65</td>
<td valign="top" align="center">71.78</td>
<td valign="top" align="center">89.93</td>
<td valign="top" align="center">88.87</td>
</tr>
<tr>
<td valign="top" align="left">Batchsize 4096 Block 128</td>
<td valign="top" align="center">Rank1</td>
<td valign="top" align="center">37.49</td>
<td valign="top" align="center">51.84</td>
<td valign="top" align="center">25.26</td>
<td valign="top" align="center">26.02</td>
</tr>
<tr>
<td valign="top" align="left"/>
<td valign="top" align="center">Rank3</td>
<td valign="top" align="center">81.28</td>
<td valign="top" align="center">63.80</td>
<td valign="top" align="center">81.44</td>
<td valign="top" align="center">76.10</td>
</tr>
<tr>
<td valign="top" align="left"/>
<td valign="top" align="center">Rank10</td>
<td valign="top" align="center">90.78</td>
<td valign="top" align="center">69.52</td>
<td valign="top" align="center">90.12</td>
<td valign="top" align="center">89.39</td>
</tr>
<tr>
<td valign="top" align="left">Batchsize 4096 Block 512</td>
<td valign="top" align="center">Rank1</td>
<td valign="top" align="center">40.30</td>
<td valign="top" align="center">52.37</td>
<td valign="top" align="center">18.63</td>
<td valign="top" align="center">27.37</td>
</tr>
<tr>
<td valign="top" align="left"/>
<td valign="top" align="center">Rank3</td>
<td valign="top" align="center">85.31</td>
<td valign="top" align="center">63.19</td>
<td valign="top" align="center">76.83</td>
<td valign="top" align="center">76.10</td>
</tr>
<tr>
<td valign="top" align="left"/>
<td valign="top" align="center">Rank10</td>
<td valign="top" align="center">91.03</td>
<td valign="top" align="center">71.41</td>
<td valign="top" align="center">90.33</td>
<td valign="top" align="center">88.17</td>
</tr>
<tr>
<td valign="top" align="left">Batchsize 4096 Block 1024</td>
<td valign="top" align="center">Rank1</td>
<td valign="top" align="center">45.29</td>
<td valign="top" align="center">48.37</td>
<td valign="top" align="center">27.48</td>
<td valign="top" align="center">25.69</td>
</tr>
<tr>
<td valign="top" align="left"/>
<td valign="top" align="center">Rank3</td>
<td valign="top" align="center">84.40</td>
<td valign="top" align="center">65.81</td>
<td valign="top" align="center">75.77</td>
<td valign="top" align="center">77.79</td>
</tr>
<tr>
<td valign="top" align="left"/>
<td valign="top" align="center">Rank10</td>
<td valign="top" align="center">91.48</td>
<td valign="top" align="center">72.10</td>
<td valign="top" align="center">90.28</td>
<td valign="top" align="center">89.08</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn><p><italic>Realistic device model used for all these results.</italic></p></fn>
</table-wrap-foot>
</table-wrap>
<p>However, streaming batch PCA has its challenges. The main problem is that it operates on the eigenspace of the gradient of the entire synaptic weight matrix, statistically capturing the directions of largest variance, so the principal components carry both positive and negative entries with no clear physical interpretation for the negative values. Transferring this gradient information into the memristor matrix by mapping it to a number of update pulses (open loop transfer) is therefore challenging, since sign changes within a component lead to inefficient oscillatory programming. For this reason, rank-by-rank weight transfer (<italic>rank-seq</italic>) underperforms <italic>rank-sum</italic> for streaming batch PCA. It is important to point out that oscillatory behavior <italic>per se</italic> can be supported by resistive crossbar arrays via successive increases and decreases in conductance. The devices can be tuned to the desired precision, but doing so may require very long pulse trains, which is undesirable from a speed perspective for devices with non-linearity and variability. If positive and negative updates to a weight are needed in rapid succession, device programming becomes very inefficient. In comparison, NMF calculates an approximate matrix factorization with separate positive and negative gradient components, so the successive updates do not overlap in sign.</p>
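As an illustrative sketch (not the authors' implementation), the sign-separated NMF update can be prototyped in a few lines of NumPy: the batch gradient is split into its non-negative positive and negative parts, each part is factorized with multiplicative-update NMF, and every resulting rank 1 component then programs the devices in a single polarity. All names and sizes below are hypothetical.

```python
import numpy as np

def nmf(V, rank, iters=500, eps=1e-9, seed=0):
    """Multiplicative-update NMF (Lee-Seung): V ~ W @ H with W, H non-negative."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical batch-averaged gradient with entries of both signs.
rng = np.random.default_rng(1)
G = rng.standard_normal((16, 12))

# Factor the positive and negative parts separately: every rank 1
# component then updates the devices in one direction only.
rank = 3
Wp, Hp = nmf(np.maximum(G, 0.0), rank)
Wn, Hn = nmf(np.maximum(-G, 0.0), rank)
G_approx = Wp @ Hp - Wn @ Hn
```

Because each of the 2 × rank outer products has a fixed sign, no component requires alternating potentiation/depression pulse sequences on any device.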
<p>By avoiding overlapping ranks, NMF has superior performance at high ranks compared with streaming batch PCA. For example, at rank 3, NMF <italic>rank-seq</italic> outperforms streaming batch PCA <italic>rank-seq</italic> by &#x2248; 5%. At rank 10, the gap is 17%. The best <italic>rank-seq</italic> accuracy is obtained by NMF at rank 10 (88.87%), less than 2% below the best <italic>rank-sum</italic> accuracy, obtained via streaming batch PCA at rank 10 (90.65%). In practice, this means that the NMF factorization produces a near-optimal set of efficient rank 1 update operations for training memristor neural networks.</p>
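The pulse-count argument behind this comparison can be made concrete with a toy open-loop programming model (a hypothetical sketch; <monospace>pulses_needed</monospace> and its step size are illustrative, not a device characterization):

```python
import numpy as np

def pulses_needed(delta, step=0.01):
    """Open-loop cost of one device update: |delta|/step identical-polarity
    pulses (toy model; real devices add non-linearity and variability)."""
    return int(np.ceil(np.abs(delta) / step))

def transfer_rank_sum(rank1_updates):
    """rank-sum: accumulate the rank 1 components digitally, then program
    each device once toward the summed target."""
    total = sum(rank1_updates)  # element-wise sum of update matrices
    return np.vectorize(pulses_needed)(total).sum()

def transfer_rank_seq(rank1_updates):
    """rank-seq: program the crossbar once per rank 1 component; opposing
    signs across components each cost their own pulse train."""
    return sum(np.vectorize(pulses_needed)(u).sum() for u in rank1_updates)

rng = np.random.default_rng(0)
updates = [np.outer(rng.standard_normal(8), rng.standard_normal(6))
           for _ in range(2)]

# Sign cancellations that are free in the digital sum must be programmed
# physically in rank-seq, so it never needs fewer pulses than rank-sum.
assert transfer_rank_seq(updates) >= transfer_rank_sum(updates)
```

Under this model, sign-consistent components (as produced by NMF) close the pulse-count gap between the two transfer modes, since nothing cancels between successive rank 1 updates.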
<p>The main drawback of applying the proposed methodology is related to accuracy. MBGD at large batch sizes has lower accuracy than at smaller batch sizes (<xref ref-type="bibr" rid="B23">Goyal et al., 2017</xref>; <xref ref-type="bibr" rid="B22">Golmant et al., 2018</xref>). Furthermore, low rank decompositions of the MBGD gradient information can negatively affect accuracy when large networks and complex datasets are used for training. The results of this work show that it is possible to obtain low rank decomposition accuracies within 2% to 3% of the MBGD accuracies when large batch sizes are used. This slight accuracy penalty comes with the potential advantage of greatly reduced storage requirements for the gradient information relative to the network parameters. This tradeoff needs further investigation, taking accuracy targets, hardware overhead, network layer sizes, and other hyperparameters into consideration.</p>
<p>However, the full potential of these decomposition methods can only be explored on dedicated hardware co-processors. For example, the streaming batch PCA algorithm requires computationally intensive QR factorization (<xref ref-type="bibr" rid="B31">Huang et al., 2020a</xref>). The NMF algorithm requires explicit calculation of the full batch matrix to obtain the separate non-negative components, so optimized NMF algorithms mappable to hardware co-processors, e.g., streaming variants, still need to be developed (see Section 2 and <xref ref-type="supplementary-material" rid="FS1">Supplementary Figure 2</xref>). These limitations can be overcome in dedicated hardware accelerators, e.g., based on systolic arrays. A discussion of the hardware considerations is included in the <xref ref-type="supplementary-material" rid="FS1">Supplementary Material</xref>. Assessing energy efficiency and speed will require hardware models of the decomposition modules, integrated with the existing circuit and device models as part of a comprehensive design verification framework (<xref ref-type="bibr" rid="B29">Hoskins et al., 2021</xref>).</p>
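A minimal sketch of the QR-based streaming update, in the spirit of an Oja-style subspace rule (a generic streaming-PCA formulation, not necessarily the exact algorithm of Huang et al., 2020a; the stream and dimensions are synthetic):

```python
import numpy as np

def streaming_pca_update(Q, x, lr=0.1):
    """One streaming step: Hebbian nudge of the basis toward sample x,
    then QR re-orthonormalization (the costly kernel noted in the text)."""
    Q = Q + lr * np.outer(x, x @ Q)
    Q, _ = np.linalg.qr(Q)
    return Q

rng = np.random.default_rng(0)
d, rank = 20, 3
# Synthetic stream whose variance is concentrated in a 3-D subspace.
subspace = np.linalg.qr(rng.standard_normal((d, rank)))[0]
Q = np.linalg.qr(rng.standard_normal((d, rank)))[0]
for _ in range(2000):
    x = subspace @ rng.standard_normal(rank) + 0.01 * rng.standard_normal(d)
    Q = streaming_pca_update(Q, x)

# When aligned, the Frobenius norm of subspace.T @ Q approaches
# sqrt(rank), about 1.73 here.
overlap = np.linalg.norm(subspace.T @ Q)
```

The QR call dominates the per-sample cost, which is why a systolic-array implementation of this kernel is attractive for a hardware co-processor.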
</sec>
<sec id="S4.SS6">
<title>Applicability and Scale-Up Potential</title>
<p>In general, the proposed algorithms should be broadly applicable to any family of weight matrices trained by gradient descent. These simulation results highlight the potential of low-rank gradient decompositions for neural networks using memristor weights and are a first step toward training co-processors that support the scale-up of machine learning models in such hardware. Several recent works demonstrate the applicability of memristor crossbars to recurrent and convolutional neural networks (<xref ref-type="bibr" rid="B40">Li et al., 2019</xref>; <xref ref-type="bibr" rid="B64">Wang et al., 2019</xref>; <xref ref-type="bibr" rid="B41">Lin et al., 2020</xref>). The same decomposition and implementation principles could be applied to fully connected recurrent layers, and in a convolutional network the fully connected layers performing the classification can benefit from these decomposition methods. It is therefore possible to consider applying the proposed methods to deeper, more complex networks.</p>
<p>For spiking neural networks, this property can prove important since gradient-based methods have recently regained popularity for training such networks, especially through the use of surrogate gradient methods (<xref ref-type="bibr" rid="B44">Neftci et al., 2019</xref>). An increasingly common practice, despite the lack of biological plausibility, is to use mini-batch GPU acceleration to train spiking networks more rapidly (<xref ref-type="bibr" rid="B43">Neftci et al., 2017</xref>; <xref ref-type="bibr" rid="B51">Payvand et al., 2020</xref>). While researchers anticipate that future hardware will be able to train more efficiently using a batch size of 1 (<xref ref-type="bibr" rid="B60">Stewart et al., 2020</xref>), a batch size of 1 has also frequently been proposed as ideal for memristor-based artificial neural networks because of the memory overhead associated with gradient data. However, as shown in this work, low batch size training leads to catastrophically poor performance, and larger batch sizes are needed to improve the training of non-ideal hysteretic devices.</p>
<p>Our approach to compressing gradient-based information, as presented here, could be an important step toward developing biologically plausible batch averaging during long-term learning. The methods can be adapted to require only local neuronal information, leading to approaches resilient to local nanodevice non-idealities. Compression algorithms similar to the ones studied here, e.g., Oja&#x2019;s learning rule (<xref ref-type="bibr" rid="B46">Oja, 1982</xref>), were initially introduced as biologically plausible means of learning incoming data. They could therefore be used in a realistic way to efficiently learn surrogate gradients during the training of spiking neural networks.</p>
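For reference, Oja's rule in its original single-neuron form can be sketched as follows; the input stream, noise level, and learning rate are hypothetical. The update uses only quantities local to the neuron (its input x and output y), which is what makes it a candidate for biologically plausible gradient compression:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
# Hypothetical input stream dominated by one direction `pc`.
pc = rng.standard_normal(d)
pc /= np.linalg.norm(pc)

w = rng.standard_normal(d)
w /= np.linalg.norm(w)
lr = 0.05
for _ in range(5000):
    x = pc * rng.standard_normal() + 0.05 * rng.standard_normal(d)
    y = w @ x                    # neuron output (local information only)
    w += lr * y * (x - y * w)    # Oja's rule: Hebbian term with decay

# w converges, up to sign, to the leading principal component
# of the input stream, with its norm self-normalizing toward 1.
alignment = abs(w @ pc)
```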
</sec>
</sec>
<sec sec-type="conclusion" id="S5">
<title>Conclusion</title>
<p>This paper investigated mini-batch training and gradient decomposition methods to overcome the hardware non-idealities of memristor-based neural networks. By testing two decomposition methods (streaming batch PCA and NMF) and two weight transfer methods (<italic>rank-sum</italic> and <italic>rank-seq</italic>) for different memristor device models and ranks, we showed that their combination is a feasible approach for training fully connected networks implemented with non-ideal devices via rank 1 updates. Our results indicate that stochastic rounding can overcome the loss of precision caused by quantizing vanishingly small gradients, which is of particular importance for synaptic devices with low bit precision, such as memristors. While both low-rank decomposition methods produced accuracies close to those of full-rank MBGD, the choice of the update method was particularly significant for transferring the gradient information to the memristor matrix hardware. Overall, NMF produced a less efficient compression of the batch gradient than streaming batch PCA. However, we speculate that it was better suited for rank-by-rank transfer to the memristor crossbar since all the gradient components were additive, thus eliminating the effect of device update hysteresis, though this needs further investigation. The <italic>rank-seq</italic> NMF is more in line with the physical constraints of memristor synaptic weights and may represent the optimal set of rank 1 updates for training a memristor array in an open loop fashion.</p>
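As a brief illustration of the stochastic rounding point (a generic sketch, not the simulator code used in this work): when the per-device update step is larger than a typical gradient element, nearest rounding silently zeroes the update, while unbiased stochastic rounding preserves it in expectation.

```python
import numpy as np

def stochastic_round(x, step, rng=None):
    """Unbiased rounding to a grid of spacing `step`: round up with
    probability proportional to proximity, so E[result] equals x."""
    rng = rng or np.random.default_rng(0)
    lo = np.floor(x / step) * step
    p_up = (x - lo) / step
    return lo + step * (rng.random(np.shape(x)) < p_up)

step = 0.1                        # hypothetical minimum device update
g = np.full(100_000, 0.03)        # gradients smaller than one step
nearest = np.round(g / step) * step        # every element rounds to 0
stochastic = stochastic_round(g, step)     # mean stays near 0.03
```

Across many devices and updates, the stochastically rounded values average back to the true gradient, which is why small gradients are not lost on low-precision synaptic hardware.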
<p>Future work will focus on expanding these results to deeper networks, including other types of layers, such as recurrent layers, and on applicability to spiking neural networks. In addition, other hardware-aware decomposition methods will be investigated. This methodology can be applied to neural networks implemented with other types of non-volatile memory devices, such as phase change memory and flash technology. Ultimately, the goal is to test the proposed algorithms in full hardware implementations of memristor-based accelerators that demonstrate software equivalency despite device non-idealities.</p>
</sec>
<sec sec-type="data-availability" id="S6">
<title>Data Availability Statement</title>
<p>The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.</p>
</sec>
<sec id="S7">
<title>Author Contributions</title>
<p>JZ and SH developed the decomposition code modules and performed the simulations. OY helped with the execution time analysis. YG helped with the mini-batch gradient descent code. BH provided guidance and support. GA supervised the work. All authors participated in data analysis, discussed the results, and co-edited the manuscript.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="pudiscl1">
<title>Publisher&#x2019;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
</body>
<back>
<sec sec-type="funding-information" id="S8">
<title>Funding</title>
<p>The authors acknowledge funding support from the ONR/DARPA grant N00014-20-1-2031, the GW University Facilitating Fund and NIST.</p>
</sec>
<sec id="S9" sec-type="supplementary-material">
<title>Supplementary Material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fnins.2021.749811/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fnins.2021.749811/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Data_Sheet_1.pdf" id="FS1" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Adam</surname> <given-names>G. C.</given-names></name> <name><surname>Khiat</surname> <given-names>A.</given-names></name> <name><surname>Prodromakis</surname> <given-names>T.</given-names></name></person-group> (<year>2018</year>). <article-title>Challenges hindering memristive neuromorphic hardware from going mainstream.</article-title> <source><italic>Nat. Commun.</italic></source> <volume>9</volume> <fpage>1</fpage>&#x2013;<lpage>4</lpage>. <pub-id pub-id-type="doi">10.1038/s41467-018-07565-4</pub-id> <pub-id pub-id-type="pmid">30531798</pub-id></citation></ref>
<ref id="B2"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ambrogio</surname> <given-names>S.</given-names></name> <name><surname>Narayanan</surname> <given-names>P.</given-names></name> <name><surname>Tsai</surname> <given-names>H.</given-names></name> <name><surname>Shelby</surname> <given-names>R. M.</given-names></name> <name><surname>Boybat</surname> <given-names>I.</given-names></name> <name><surname>Di Nolfo</surname> <given-names>C.</given-names></name><etal/></person-group> (<year>2018</year>). <article-title>Equivalent-accuracy accelerated neural-network training using analogue memory.</article-title> <source><italic>Nature</italic></source> <volume>558</volume> <fpage>60</fpage>&#x2013;<lpage>67</lpage>. <pub-id pub-id-type="doi">10.1038/s41586-018-0180-5</pub-id> <pub-id pub-id-type="pmid">29875487</pub-id></citation></ref>
<ref id="B3"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Argall</surname> <given-names>F.</given-names></name></person-group> (<year>1968</year>). <article-title>Switching phenomena in titanium oxide thin films.</article-title> <source><italic>Solid State Electron.</italic></source> <volume>11</volume> <fpage>535</fpage>&#x2013;<lpage>541</lpage>. <pub-id pub-id-type="doi">10.1016/0038-1101(68)90092-0</pub-id></citation></ref>
<ref id="B4"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Baek</surname> <given-names>I.</given-names></name> <name><surname>Lee</surname> <given-names>M.</given-names></name> <name><surname>Seo</surname> <given-names>S.</given-names></name> <name><surname>Lee</surname> <given-names>M.</given-names></name> <name><surname>Seo</surname> <given-names>D.</given-names></name> <name><surname>Suh</surname> <given-names>D.-S.</given-names></name><etal/></person-group> (<year>2004</year>). &#x201C;<article-title>Highly scalable nonvolatile resistive memory using simple binary oxide driven by asymmetric unipolar voltage pulses</article-title>,&#x201D; in <source><italic>Proceedings of the IEDM Technical Digest. IEEE International Electron Devices Meeting, 2004</italic></source>, (<publisher-loc>San Francisco, CA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>587</fpage>&#x2013;<lpage>590</lpage>. <pub-id pub-id-type="doi">10.1109/IEDM.2004.1419228</pub-id></citation></ref>
<ref id="B5"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Barnes</surname> <given-names>R.</given-names></name> <name><surname>Cooke-Yarborough</surname> <given-names>E.</given-names></name> <name><surname>Thomas</surname> <given-names>D.</given-names></name></person-group> (<year>1951</year>). <article-title>An electronic digital computor using cold cathode counting tubes for storage.</article-title> <source><italic>Electron. Eng.</italic></source> <volume>23</volume> <fpage>286</fpage>&#x2013;<lpage>291</lpage>.</citation></ref>
<ref id="B6"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Berdan</surname> <given-names>R.</given-names></name> <name><surname>Marukame</surname> <given-names>T.</given-names></name> <name><surname>Ota</surname> <given-names>K.</given-names></name> <name><surname>Yamaguchi</surname> <given-names>M.</given-names></name> <name><surname>Saitoh</surname> <given-names>M.</given-names></name> <name><surname>Fujii</surname> <given-names>S.</given-names></name><etal/></person-group> (<year>2020</year>). <article-title>Low-power linear computation using nonlinear ferroelectric tunnel junction memristors.</article-title> <source><italic>Nat. Electron.</italic></source> <volume>3</volume> <fpage>1</fpage>&#x2013;<lpage>8</lpage>. <pub-id pub-id-type="doi">10.1038/s41928-020-0405-0</pub-id></citation></ref>
<ref id="B7"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Boybat</surname> <given-names>I.</given-names></name> <name><surname>Le Gallo</surname> <given-names>M.</given-names></name> <name><surname>Nandakumar</surname> <given-names>S.</given-names></name> <name><surname>Moraitis</surname> <given-names>T.</given-names></name> <name><surname>Parnell</surname> <given-names>T.</given-names></name> <name><surname>Tuma</surname> <given-names>T.</given-names></name><etal/></person-group> (<year>2018</year>). <article-title>Neuromorphic computing with multi-memristive synapses.</article-title> <source><italic>Nat. Commun.</italic></source> <volume>9</volume> <fpage>1</fpage>&#x2013;<lpage>12</lpage>. <pub-id pub-id-type="doi">10.1038/s41467-018-04933-y</pub-id> <pub-id pub-id-type="pmid">29955057</pub-id></citation></ref>
<ref id="B8"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Burrello</surname> <given-names>A.</given-names></name> <name><surname>Marchioni</surname> <given-names>A.</given-names></name> <name><surname>Brunelli</surname> <given-names>D.</given-names></name> <name><surname>Benini</surname> <given-names>L.</given-names></name></person-group> (<year>2019</year>). &#x201C;<article-title>Embedding principal component analysis for data reduction in structural health monitoring on low-cost iot gateways</article-title>,&#x201D; in <source><italic>Proceedings of the 16th ACM International Conference on Computing Frontiers</italic></source>, (<publisher-loc>Alghero</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>235</fpage>&#x2013;<lpage>239</lpage>. <pub-id pub-id-type="doi">10.1145/3310273.3322822</pub-id></citation></ref>
<ref id="B9"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ceze</surname> <given-names>L.</given-names></name> <name><surname>Hasler</surname> <given-names>J.</given-names></name> <name><surname>Likharev</surname> <given-names>K.</given-names></name> <name><surname>Seo</surname> <given-names>J.-S.</given-names></name> <name><surname>Sherwood</surname> <given-names>T.</given-names></name> <name><surname>Strukov</surname> <given-names>D.</given-names></name><etal/></person-group> (<year>2016</year>). &#x201C;<article-title>Nanoelectronic neurocomputing: status and prospects</article-title>,&#x201D; in <source><italic>Proceedings of the 2016 74th Annual Device Research Conference (DRC)</italic></source>, (<publisher-loc>Newark, DE</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x2013;<lpage>2</lpage>. <pub-id pub-id-type="doi">10.1109/DRC.2016.7548506</pub-id></citation></ref>
<ref id="B10"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chang</surname> <given-names>M.-F.</given-names></name> <name><surname>Chiu</surname> <given-names>P.-F.</given-names></name> <name><surname>Wu</surname> <given-names>W.-C.</given-names></name> <name><surname>Chuang</surname> <given-names>C.-H.</given-names></name> <name><surname>Sheu</surname> <given-names>S.-S.</given-names></name></person-group> (<year>2011</year>). &#x201C;<article-title>Challenges and trends in low-power 3D die-stacked IC designs using RAM, memristor logic, and resistive memory (ReRAM)</article-title>,&#x201D; in <source><italic>Proceedings of the 2011 9th IEEE International Conference on ASIC</italic></source>, (<publisher-loc>Xiamen</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>299</fpage>&#x2013;<lpage>302</lpage>.</citation></ref>
<ref id="B11"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>P.-Y.</given-names></name> <name><surname>Peng</surname> <given-names>X.</given-names></name> <name><surname>Yu</surname> <given-names>S.</given-names></name></person-group> (<year>2017</year>). &#x201C;<article-title>NeuroSim+: an integrated device-to-algorithm framework for benchmarking synaptic devices and array architectures</article-title>,&#x201D; in <source><italic>Proceedings of the 2017 IEEE International Electron Devices Meeting (IEDM)</italic></source>, (<publisher-loc>San Francisco, CA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>6.1.1</fpage>&#x2013;<lpage>6.1.4</lpage>.</citation></ref>
<ref id="B12"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>W.-H.</given-names></name> <name><surname>Li</surname> <given-names>K.-X.</given-names></name> <name><surname>Lin</surname> <given-names>W.-Y.</given-names></name> <name><surname>Hsu</surname> <given-names>K.-H.</given-names></name> <name><surname>Li</surname> <given-names>P.-Y.</given-names></name> <name><surname>Yang</surname> <given-names>C.-H.</given-names></name><etal/></person-group> (<year>2018</year>). &#x201C;<article-title>A 65nm 1Mb nonvolatile computing-in-memory ReRAM macro with sub-16ns multiply-and-accumulate for binary DNN AI edge processors</article-title>,&#x201D; in <source><italic>Proceedings of the 2018 IEEE International Solid-State Circuits Conference-(ISSCC)</italic></source>, (<publisher-loc>San Francisco, CA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>494</fpage>&#x2013;<lpage>496</lpage>.</citation></ref>
<ref id="B13"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>Y.</given-names></name></person-group> (<year>2020</year>). <article-title>ReRAM: history, status, and future.</article-title> <source><italic>IEEE Trans. Electron Devices</italic></source> <volume>67</volume> <fpage>1420</fpage>&#x2013;<lpage>1433</lpage>. <pub-id pub-id-type="doi">10.1109/TED.2019.2961505</pub-id></citation></ref>
<ref id="B14"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cichocki</surname> <given-names>A.</given-names></name> <name><surname>Phan</surname> <given-names>A.-H.</given-names></name></person-group> (<year>2009</year>). <article-title>Fast local algorithms for large scale nonnegative matrix and tensor factorizations.</article-title> <source><italic>IEICE Trans. Fundam. Electron. Commun. Comput. Sci.</italic></source> <volume>92</volume> <fpage>708</fpage>&#x2013;<lpage>721</lpage>. <pub-id pub-id-type="doi">10.1587/transfun.E92.A.708</pub-id></citation></ref>
<ref id="B15"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cichocki</surname> <given-names>A.</given-names></name> <name><surname>Zdunek</surname> <given-names>R.</given-names></name> <name><surname>Phan</surname> <given-names>A. H.</given-names></name> <name><surname>Amari</surname> <given-names>S.-I.</given-names></name></person-group> (<year>2009</year>). <source><italic>Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-Way Data Analysis and Blind Source Separation.</italic></source> <publisher-loc>Hoboken, NJ</publisher-loc>: <publisher-name>John Wiley &#x0026; Sons</publisher-name>. <pub-id pub-id-type="doi">10.1002/9780470747278</pub-id></citation></ref>
<ref id="B16"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dearnaley</surname> <given-names>G.</given-names></name> <name><surname>Stoneham</surname> <given-names>A.</given-names></name> <name><surname>Morgan</surname> <given-names>D.</given-names></name></person-group> (<year>1970</year>). <article-title>Electrical phenomena in amorphous oxide films.</article-title> <source><italic>Rep. Prog. Phys.</italic></source> <volume>33</volume>:<issue>1129</issue>. <pub-id pub-id-type="doi">10.1088/0034-4885/33/3/306</pub-id></citation></ref>
<ref id="B17"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Forsythe</surname> <given-names>G. E.</given-names></name></person-group> (<year>1950</year>). <article-title>Round-off errors in numerical integration on automatic machinery-preliminary report.</article-title> <source><italic>Bull. Am. Math. Soc.</italic></source> <volume>56</volume> <fpage>61</fpage>&#x2013;<lpage>61</lpage>.</citation></ref>
<ref id="B18"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gao</surname> <given-names>Y.</given-names></name> <name><surname>Wu</surname> <given-names>S.</given-names></name> <name><surname>Adam</surname> <given-names>G. C.</given-names></name></person-group> (<year>2020</year>). &#x201C;<article-title>Batch training for neuromorphic systems with device non-idealities</article-title>,&#x201D; in <source><italic>International Conference on Neuromorphic Systems 2020</italic></source>, (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>1</fpage>&#x2013;<lpage>4</lpage>. <pub-id pub-id-type="doi">10.1145/3407197.3407208</pub-id></citation></ref>
<ref id="B19"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Garipov</surname> <given-names>T.</given-names></name> <name><surname>Podoprikhin</surname> <given-names>D.</given-names></name> <name><surname>Novikov</surname> <given-names>A.</given-names></name> <name><surname>Vetrov</surname> <given-names>D.</given-names></name></person-group> (<year>2016</year>). <article-title>Ultimate tensorization: compressing convolutional and fc layers alike.</article-title> <source><italic>arXiv</italic></source> <comment>[Preprint] arXiv:1611.03214,</comment></citation></ref>
<ref id="B20"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gokmen</surname> <given-names>T.</given-names></name> <name><surname>Haensch</surname> <given-names>W.</given-names></name></person-group> (<year>2020</year>). <article-title>Algorithm for training neural networks on resistive device arrays.</article-title> <source><italic>Front. Neurosci.</italic></source> <volume>14</volume>:<issue>103</issue>. <pub-id pub-id-type="doi">10.3389/fnins.2020.00103</pub-id> <pub-id pub-id-type="pmid">32174807</pub-id></citation></ref>
<ref id="B21"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gokmen</surname> <given-names>T.</given-names></name> <name><surname>Vlasov</surname> <given-names>Y.</given-names></name></person-group> (<year>2016</year>). <article-title>Acceleration of deep neural network training with resistive cross-point devices: design considerations.</article-title> <source><italic>Front. Neurosci.</italic></source> <volume>10</volume>:<issue>333</issue>. <pub-id pub-id-type="doi">10.3389/fnins.2016.00333</pub-id> <pub-id pub-id-type="pmid">27493624</pub-id></citation></ref>
<ref id="B22"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Golmant</surname> <given-names>N.</given-names></name> <name><surname>Vemuri</surname> <given-names>N.</given-names></name> <name><surname>Yao</surname> <given-names>Z.</given-names></name> <name><surname>Feinberg</surname> <given-names>V.</given-names></name> <name><surname>Gholami</surname> <given-names>A.</given-names></name> <name><surname>Rothauge</surname> <given-names>K.</given-names></name><etal/></person-group> (<year>2018</year>). <article-title>On the computational inefficiency of large batch sizes for stochastic gradient descent.</article-title> <source><italic>arXiv</italic></source> <comment>[Preprint] arXiv:1811.12941,</comment></citation></ref>
<ref id="B23"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Goyal</surname> <given-names>P.</given-names></name> <name><surname>Doll&#x00E1;r</surname> <given-names>P.</given-names></name> <name><surname>Girshick</surname> <given-names>R.</given-names></name> <name><surname>Noordhuis</surname> <given-names>P.</given-names></name> <name><surname>Wesolowski</surname> <given-names>L.</given-names></name> <name><surname>Kyrola</surname> <given-names>A.</given-names></name><etal/></person-group> (<year>2017</year>). <article-title>Accurate, large minibatch SGD: training imagenet in 1 hour.</article-title> <source><italic>arXiv</italic></source> <comment>[Preprint] arXiv:1706.02677,</comment></citation></ref>
<ref id="B24"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gupta</surname> <given-names>S.</given-names></name> <name><surname>Agrawal</surname> <given-names>A.</given-names></name> <name><surname>Gopalakrishnan</surname> <given-names>K.</given-names></name> <name><surname>Narayanan</surname> <given-names>P.</given-names></name></person-group> (<year>2015</year>). &#x201C;<article-title>Deep learning with limited numerical precision</article-title>,&#x201D; in <source><italic>Proceedings of the 32nd International Conference on Machine Learning: PMLR</italic></source>, (<publisher-loc>Lille</publisher-loc>: <publisher-name>JMLR.org</publisher-name>), <fpage>1737</fpage>&#x2013;<lpage>1746</lpage>.</citation></ref>
<ref id="B25"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Haensch</surname> <given-names>W.</given-names></name> <name><surname>Gokmen</surname> <given-names>T.</given-names></name> <name><surname>Puri</surname> <given-names>R.</given-names></name></person-group> (<year>2018</year>). <article-title>The next generation of deep learning hardware: analog computing.</article-title> <source><italic>Proc. IEEE</italic></source> <volume>107</volume> <fpage>108</fpage>&#x2013;<lpage>122</lpage>. <pub-id pub-id-type="doi">10.1109/JPROC.2018.2871057</pub-id></citation></ref>
<ref id="B26"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hickmott</surname> <given-names>T.</given-names></name></person-group> (<year>1962</year>). <article-title>Low-frequency negative resistance in thin anodic oxide films.</article-title> <source><italic>J. Appl. Phys.</italic></source> <volume>33</volume> <fpage>2669</fpage>&#x2013;<lpage>2682</lpage>. <pub-id pub-id-type="doi">10.1063/1.1702530</pub-id></citation></ref>
<ref id="B27"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hirtzlin</surname> <given-names>T.</given-names></name> <name><surname>Penkovsky</surname> <given-names>B.</given-names></name> <name><surname>Klein</surname> <given-names>J.-O.</given-names></name> <name><surname>Locatelli</surname> <given-names>N.</given-names></name> <name><surname>Vincent</surname> <given-names>A. F.</given-names></name> <name><surname>Bocquet</surname> <given-names>M.</given-names></name><etal/></person-group> (<year>2019</year>). &#x201C;<article-title>Implementing binarized neural networks with magnetoresistive RAM without error correction</article-title>,&#x201D; in <source><italic>Proceedings of the 2019 IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH)</italic></source>, (<publisher-loc>Qingdao</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x2013;<lpage>5</lpage>. <pub-id pub-id-type="doi">10.1109/NANOARCH47378.2019.181300</pub-id></citation></ref>
<ref id="B28"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hoskins</surname> <given-names>B. D.</given-names></name> <name><surname>Daniels</surname> <given-names>M. W.</given-names></name> <name><surname>Huang</surname> <given-names>S.</given-names></name> <name><surname>Madhavan</surname> <given-names>A.</given-names></name> <name><surname>Adam</surname> <given-names>G. C.</given-names></name> <name><surname>Zhitenev</surname> <given-names>N.</given-names></name><etal/></person-group> (<year>2019</year>). <article-title>Streaming batch eigenupdates for hardware neural networks.</article-title> <source><italic>Front. Neurosci.</italic></source> <volume>13</volume>:<issue>793</issue>. <pub-id pub-id-type="doi">10.3389/fnins.2019.00793</pub-id> <pub-id pub-id-type="pmid">31447628</pub-id></citation></ref>
<ref id="B29"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hoskins</surname> <given-names>B. D.</given-names></name> <name><surname>Ma</surname> <given-names>W.</given-names></name> <name><surname>Fream</surname> <given-names>M.</given-names></name> <name><surname>Liu</surname> <given-names>M.</given-names></name> <name><surname>Daniels</surname> <given-names>M. W.</given-names></name> <name><surname>Madsen</surname> <given-names>R.</given-names></name><etal/></person-group> (<year>2021</year>). &#x201C;<article-title>Design for verification in a resistive neural network prototype</article-title>,&#x201D; in <source><italic>Proceedings of the International Conference on Neuromorphic Systems (ICONS) July 27&#x2013;29, 2021</italic></source>, <publisher-loc>Knoxville, TN</publisher-loc>. <pub-id pub-id-type="doi">10.1145/3477145.3477260</pub-id></citation></ref>
<ref id="B30"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hu</surname> <given-names>M.</given-names></name> <name><surname>Graves</surname> <given-names>C. E.</given-names></name> <name><surname>Li</surname> <given-names>C.</given-names></name> <name><surname>Li</surname> <given-names>Y.</given-names></name> <name><surname>Ge</surname> <given-names>N.</given-names></name> <name><surname>Montgomery</surname> <given-names>E.</given-names></name><etal/></person-group> (<year>2018</year>). <article-title>Memristor-based analog computation and neural network classification with a dot product engine.</article-title> <source><italic>Adv. Mater.</italic></source> <volume>30</volume>:<issue>1705914</issue>. <pub-id pub-id-type="doi">10.1002/adma.201705914</pub-id> <pub-id pub-id-type="pmid">29318659</pub-id></citation></ref>
<ref id="B31"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>S.</given-names></name> <name><surname>Hoskins</surname> <given-names>B. D.</given-names></name> <name><surname>Daniels</surname> <given-names>M. W.</given-names></name> <name><surname>Stiles</surname> <given-names>M. D.</given-names></name> <name><surname>Adam</surname> <given-names>G. C.</given-names></name></person-group> (<year>2020a</year>). <article-title>Memory-efficient training with streaming dimensionality reduction.</article-title> <source><italic>arXiv</italic></source> <comment>[Preprint] arXiv:2004.12041,</comment></citation></ref>
<ref id="B32"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>S.</given-names></name> <name><surname>Hoskins</surname> <given-names>B. D.</given-names></name> <name><surname>Daniels</surname> <given-names>M. W.</given-names></name> <name><surname>Stiles</surname> <given-names>M. D.</given-names></name> <name><surname>Adam</surname> <given-names>G. C.</given-names></name></person-group> (<year>2020b</year>). <article-title>Streaming batch gradient tracking for neural network training (student abstract).</article-title> <source><italic>Proc. AAAI Conf. Artif. Intell.</italic></source> <volume>34</volume> <fpage>13813</fpage>&#x2013;<lpage>13814</lpage>.</citation></ref>
<ref id="B33"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hull</surname> <given-names>T. E.</given-names></name> <name><surname>Swenson</surname> <given-names>J. R.</given-names></name></person-group> (<year>1966</year>). <article-title>Tests of probabilistic models for propagation of roundoff errors.</article-title> <source><italic>Commun. ACM</italic></source> <volume>9</volume> <fpage>108</fpage>&#x2013;<lpage>113</lpage>. <pub-id pub-id-type="doi">10.1145/365170.365212</pub-id></citation></ref>
<ref id="B34"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jo</surname> <given-names>S. H.</given-names></name> <name><surname>Chang</surname> <given-names>T.</given-names></name> <name><surname>Ebong</surname> <given-names>I.</given-names></name> <name><surname>Bhadviya</surname> <given-names>B. B.</given-names></name> <name><surname>Mazumder</surname> <given-names>P.</given-names></name> <name><surname>Lu</surname> <given-names>W.</given-names></name></person-group> (<year>2010</year>). <article-title>Nanoscale memristor device as synapse in neuromorphic systems.</article-title> <source><italic>Nano Lett.</italic></source> <volume>10</volume> <fpage>1297</fpage>&#x2013;<lpage>1301</lpage>. <pub-id pub-id-type="doi">10.1021/nl904092h</pub-id> <pub-id pub-id-type="pmid">20192230</pub-id></citation></ref>
<ref id="B35"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kataeva</surname> <given-names>I.</given-names></name> <name><surname>Merrikh-Bayat</surname> <given-names>F.</given-names></name> <name><surname>Zamanidoost</surname> <given-names>E.</given-names></name> <name><surname>Strukov</surname> <given-names>D.</given-names></name></person-group> (<year>2015</year>). &#x201C;<article-title>Efficient training algorithms for neural networks based on memristive crossbar circuits</article-title>,&#x201D; in <source><italic>Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN)</italic></source>, (<publisher-loc>Killarney</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x2013;<lpage>8</lpage>. <pub-id pub-id-type="doi">10.1109/IJCNN.2015.7280785</pub-id></citation></ref>
<ref id="B36"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kim</surname> <given-names>W.</given-names></name> <name><surname>Bruce</surname> <given-names>R.</given-names></name> <name><surname>Masuda</surname> <given-names>T.</given-names></name> <name><surname>Fraczak</surname> <given-names>G.</given-names></name> <name><surname>Gong</surname> <given-names>N.</given-names></name> <name><surname>Adusumilli</surname> <given-names>P.</given-names></name><etal/></person-group> (<year>2019</year>). &#x201C;<article-title>Confined PCM-based analog synaptic devices offering low resistance-drift and 1000 programmable states for deep learning</article-title>,&#x201D; in <source><italic>Proceedings of the 2019 Symposium on VLSI Technology</italic></source>, (<publisher-loc>Kyoto</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>T66</fpage>&#x2013;<lpage>T67</lpage>. <pub-id pub-id-type="doi">10.23919/VLSIT.2019.8776551</pub-id></citation></ref>
<ref id="B37"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Langston</surname> <given-names>J.</given-names></name></person-group> (<year>2020</year>). <source><italic>Microsoft Announces New Supercomputer, Lays Out Vision for Future AI Work. Microsoft.</italic></source> <comment>Available online at</comment>: <ext-link ext-link-type="uri" xlink:href="https://blogs.microsoft.com/ai/openai-azure-supercomputer/">https://blogs.microsoft.com/ai/openai-azure-supercomputer/</ext-link> <comment>(accessed August 27, 2020)</comment>.</citation></ref>
<ref id="B38"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lee</surname> <given-names>D. D.</given-names></name> <name><surname>Seung</surname> <given-names>H. S.</given-names></name></person-group> (<year>2000</year>). &#x201C;<article-title>Algorithms for non-negative matrix factorization</article-title>,&#x201D; in <source><italic>Proceedings of the 13th International Conference on Neural Information Processing Systems</italic></source>, (<publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>MIT Press</publisher-name>), <fpage>535</fpage>&#x2013;<lpage>541</lpage>.</citation></ref>
<ref id="B39"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lee</surname> <given-names>D. D.</given-names></name> <name><surname>Seung</surname> <given-names>H. S.</given-names></name></person-group> (<year>1999</year>). <article-title>Learning the parts of objects by non-negative matrix factorization.</article-title> <source><italic>Nature</italic></source> <volume>401</volume> <fpage>788</fpage>&#x2013;<lpage>791</lpage>. <pub-id pub-id-type="doi">10.1038/44565</pub-id> <pub-id pub-id-type="pmid">10548103</pub-id></citation></ref>
<ref id="B40"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>C.</given-names></name> <name><surname>Wang</surname> <given-names>Z.</given-names></name> <name><surname>Rao</surname> <given-names>M.</given-names></name> <name><surname>Belkin</surname> <given-names>D.</given-names></name> <name><surname>Song</surname> <given-names>W.</given-names></name> <name><surname>Jiang</surname> <given-names>H.</given-names></name><etal/></person-group> (<year>2019</year>). <article-title>Long short-term memory networks in memristor crossbar arrays.</article-title> <source><italic>Nat. Mach. Intell.</italic></source> <volume>1</volume> <fpage>49</fpage>&#x2013;<lpage>57</lpage>. <pub-id pub-id-type="doi">10.1038/s42256-018-0001-4</pub-id></citation></ref>
<ref id="B41"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>P.</given-names></name> <name><surname>Li</surname> <given-names>C.</given-names></name> <name><surname>Wang</surname> <given-names>Z.</given-names></name> <name><surname>Li</surname> <given-names>Y.</given-names></name> <name><surname>Jiang</surname> <given-names>H.</given-names></name> <name><surname>Song</surname> <given-names>W.</given-names></name><etal/></person-group> (<year>2020</year>). <article-title>Three-dimensional memristor circuits as complex neural networks.</article-title> <source><italic>Nat. Electron.</italic></source> <volume>3</volume> <fpage>225</fpage>&#x2013;<lpage>232</lpage>. <pub-id pub-id-type="doi">10.1038/s41928-020-0397-9</pub-id></citation></ref>
<ref id="B42"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>Y.-H.</given-names></name> <name><surname>Wang</surname> <given-names>C.-H.</given-names></name> <name><surname>Lee</surname> <given-names>M.-H.</given-names></name> <name><surname>Lee</surname> <given-names>D.-Y.</given-names></name> <name><surname>Lin</surname> <given-names>Y.-Y.</given-names></name> <name><surname>Lee</surname> <given-names>F.-M.</given-names></name><etal/></person-group> (<year>2019</year>). <article-title>Performance impacts of analog ReRAM non-ideality on neuromorphic computing.</article-title> <source><italic>IEEE Trans. Electron Devices</italic></source> <volume>66</volume> <fpage>1289</fpage>&#x2013;<lpage>1295</lpage>. <pub-id pub-id-type="doi">10.1109/TED.2019.2894273</pub-id></citation></ref>
<ref id="B43"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Neftci</surname> <given-names>E. O.</given-names></name> <name><surname>Augustine</surname> <given-names>C.</given-names></name> <name><surname>Paul</surname> <given-names>S.</given-names></name> <name><surname>Detorakis</surname> <given-names>G.</given-names></name></person-group> (<year>2017</year>). <article-title>Event-driven random back-propagation: enabling neuromorphic deep learning machines.</article-title> <source><italic>Front. Neurosci.</italic></source> <volume>11</volume>:<issue>324</issue>. <pub-id pub-id-type="doi">10.3389/fnins.2017.00324</pub-id> <pub-id pub-id-type="pmid">28680387</pub-id></citation></ref>
<ref id="B44"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Neftci</surname> <given-names>E. O.</given-names></name> <name><surname>Mostafa</surname> <given-names>H.</given-names></name> <name><surname>Zenke</surname> <given-names>F.</given-names></name></person-group> (<year>2019</year>). <article-title>Surrogate gradient learning in spiking neural networks: bringing the power of gradient-based optimization to spiking neural networks.</article-title> <source><italic>IEEE Signal Process. Mag.</italic></source> <volume>36</volume> <fpage>51</fpage>&#x2013;<lpage>63</lpage>. <pub-id pub-id-type="doi">10.1109/MSP.2019.2931595</pub-id></citation></ref>
<ref id="B45"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nugent</surname> <given-names>M. A.</given-names></name> <name><surname>Molter</surname> <given-names>T. W.</given-names></name></person-group> (<year>2014</year>). <article-title>AHaH computing&#x2013;from metastable switches to attractors to machine learning.</article-title> <source><italic>PLoS One</italic></source> <volume>9</volume>:<issue>e85175</issue>. <pub-id pub-id-type="doi">10.1371/journal.pone.0085175</pub-id> <pub-id pub-id-type="pmid">24520315</pub-id></citation></ref>
<ref id="B46"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Oja</surname> <given-names>E.</given-names></name></person-group> (<year>1982</year>). <article-title>Simplified neuron model as a principal component analyzer.</article-title> <source><italic>J. Math. Biol.</italic></source> <volume>15</volume> <fpage>267</fpage>&#x2013;<lpage>273</lpage>. <pub-id pub-id-type="doi">10.1007/BF00275687</pub-id> <pub-id pub-id-type="pmid">7153672</pub-id></citation></ref>
<ref id="B47"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Oja</surname> <given-names>E.</given-names></name></person-group> (<year>1992</year>). <article-title>Principal components, minor components, and linear neural networks.</article-title> <source><italic>Neural Netw.</italic></source> <volume>5</volume> <fpage>927</fpage>&#x2013;<lpage>935</lpage>. <pub-id pub-id-type="doi">10.1016/S0893-6080(05)80089-9</pub-id></citation></ref>
<ref id="B48"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Oxley</surname> <given-names>D. P.</given-names></name></person-group> (<year>1977</year>). <article-title>Electroforming, switching and memory effects in oxide thin films.</article-title> <source><italic>Electrocomp. Sci. Technol.</italic></source> <volume>3</volume> <fpage>217</fpage>&#x2013;<lpage>224</lpage>. <pub-id pub-id-type="doi">10.1155/APEC.3.217</pub-id></citation></ref>
<ref id="B49"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Paatero</surname> <given-names>P.</given-names></name> <name><surname>Tapper</surname> <given-names>U.</given-names></name></person-group> (<year>1994</year>). <article-title>Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values.</article-title> <source><italic>Environmetrics</italic></source> <volume>5</volume> <fpage>111</fpage>&#x2013;<lpage>126</lpage>. <pub-id pub-id-type="doi">10.1002/env.3170050203</pub-id></citation></ref>
<ref id="B50"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pagnia</surname> <given-names>H.</given-names></name> <name><surname>Sotnik</surname> <given-names>N.</given-names></name></person-group> (<year>1988</year>). <article-title>Bistable switching in electroformed metal&#x2013;insulator&#x2013;metal devices.</article-title> <source><italic>Phys. Status Solidi</italic></source> <volume>108</volume> <fpage>11</fpage>&#x2013;<lpage>65</lpage>.</citation></ref>
<ref id="B51"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Payvand</surname> <given-names>M.</given-names></name> <name><surname>Fouda</surname> <given-names>M. E.</given-names></name> <name><surname>Kurdahi</surname> <given-names>F.</given-names></name> <name><surname>Eltawil</surname> <given-names>A. M.</given-names></name> <name><surname>Neftci</surname> <given-names>E. O.</given-names></name></person-group> (<year>2020</year>). <article-title>On-chip error-triggered learning of multi-layer memristive spiking neural networks.</article-title> <source><italic>IEEE J. Emerg. Sel. Top. Circ. Syst.</italic></source> <volume>10</volume> <fpage>522</fpage>&#x2013;<lpage>535</lpage>. <pub-id pub-id-type="doi">10.1109/JETCAS.2020.3040248</pub-id></citation></ref>
<ref id="B52"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Payvand</surname> <given-names>M.</given-names></name> <name><surname>Nair</surname> <given-names>M. V.</given-names></name> <name><surname>M&#x00FC;ller</surname> <given-names>L. K.</given-names></name> <name><surname>Indiveri</surname> <given-names>G.</given-names></name></person-group> (<year>2019</year>). <article-title>A neuromorphic systems approach to in-memory computing with non-ideal memristive devices: from mitigation to exploitation.</article-title> <source><italic>Faraday Discuss.</italic></source> <volume>213</volume> <fpage>487</fpage>&#x2013;<lpage>510</lpage>. <pub-id pub-id-type="doi">10.1039/C8FD00114F</pub-id> <pub-id pub-id-type="pmid">30357205</pub-id></citation></ref>
<ref id="B53"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Prezioso</surname> <given-names>M.</given-names></name> <name><surname>Merrikh-Bayat</surname> <given-names>F.</given-names></name> <name><surname>Hoskins</surname> <given-names>B.</given-names></name> <name><surname>Adam</surname> <given-names>G. C.</given-names></name> <name><surname>Likharev</surname> <given-names>K. K.</given-names></name> <name><surname>Strukov</surname> <given-names>D. B.</given-names></name></person-group> (<year>2015</year>). <article-title>Training and operation of an integrated neuromorphic network based on metal-oxide memristors.</article-title> <source><italic>Nature</italic></source> <volume>521</volume> <fpage>61</fpage>&#x2013;<lpage>64</lpage>. <pub-id pub-id-type="doi">10.1038/nature14441</pub-id> <pub-id pub-id-type="pmid">25951284</pub-id></citation></ref>
<ref id="B54"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rohde</surname> <given-names>C.</given-names></name> <name><surname>Choi</surname> <given-names>B. J.</given-names></name> <name><surname>Jeong</surname> <given-names>D. S.</given-names></name> <name><surname>Choi</surname> <given-names>S.</given-names></name> <name><surname>Zhao</surname> <given-names>J.-S.</given-names></name> <name><surname>Hwang</surname> <given-names>C. S.</given-names></name></person-group> (<year>2005</year>). <article-title>Identification of a determining parameter for resistive switching of TiO<sub>2</sub> thin films.</article-title> <source><italic>Appl. Phys. Lett.</italic></source> <volume>86</volume>:<issue>262907</issue>. <pub-id pub-id-type="doi">10.1063/1.1968416</pub-id></citation></ref>
<ref id="B55"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schein</surname> <given-names>A.</given-names></name> <name><surname>Zhou</surname> <given-names>M.</given-names></name> <name><surname>Blei</surname> <given-names>D.</given-names></name> <name><surname>Wallach</surname> <given-names>H.</given-names></name></person-group> (<year>2016</year>). &#x201C;<article-title>Bayesian Poisson Tucker decomposition for learning the structure of international relations</article-title>,&#x201D; in <source><italic>Proceedings of the 33rd International Conference on Machine Learning: PMLR June 19&#x2013;24, 2016</italic></source>, <publisher-loc>New York, NY</publisher-loc>, <fpage>2810</fpage>&#x2013;<lpage>2819</lpage>.</citation></ref>
<ref id="B56"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Scholz</surname> <given-names>M.</given-names></name> <name><surname>Fraunholz</surname> <given-names>M.</given-names></name> <name><surname>Selbig</surname> <given-names>J.</given-names></name></person-group> (<year>2008</year>). &#x201C;<article-title>Nonlinear principal component analysis: neural network models and applications</article-title>,&#x201D; in <source><italic>Principal manifolds for data visualization and dimension reduction</italic></source>, <role>eds</role> <person-group person-group-type="editor"><name><surname>Gorban</surname> <given-names>A. N.</given-names></name> <name><surname>K&#x00E9;gl</surname> <given-names>B.</given-names></name> <name><surname>Wunsch</surname> <given-names>D. C.</given-names></name> <name><surname>Zinovyev</surname> <given-names>A. Y.</given-names></name></person-group> (<publisher-loc>Berlin</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>44</fpage>&#x2013;<lpage>67</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-540-73750-6_2</pub-id></citation></ref>
<ref id="B57"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Seo</surname> <given-names>S.</given-names></name> <name><surname>Lee</surname> <given-names>M.</given-names></name> <name><surname>Seo</surname> <given-names>D.</given-names></name> <name><surname>Jeoung</surname> <given-names>E.</given-names></name> <name><surname>Suh</surname> <given-names>D.-S.</given-names></name> <name><surname>Joung</surname> <given-names>Y.</given-names></name><etal/></person-group> (<year>2004</year>). <article-title>Reproducible resistance switching in polycrystalline NiO films.</article-title> <source><italic>Appl. Phys. Lett.</italic></source> <volume>85</volume> <fpage>5655</fpage>&#x2013;<lpage>5657</lpage>. <pub-id pub-id-type="doi">10.1063/1.1831560</pub-id></citation></ref>
<ref id="B58"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Serb</surname> <given-names>A.</given-names></name> <name><surname>Redman-White</surname> <given-names>W.</given-names></name> <name><surname>Papavassiliou</surname> <given-names>C.</given-names></name> <name><surname>Prodromakis</surname> <given-names>T.</given-names></name></person-group> (<year>2015</year>). <article-title>Practical determination of individual element resistive states in selectorless RRAM arrays.</article-title> <source><italic>IEEE Trans. Circ. Syst. I Regul. Papers</italic></source> <volume>63</volume> <fpage>827</fpage>&#x2013;<lpage>835</lpage>. <pub-id pub-id-type="doi">10.1109/TCSI.2015.2476296</pub-id></citation></ref>
<ref id="B59"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>She</surname> <given-names>X.</given-names></name> <name><surname>Long</surname> <given-names>Y.</given-names></name> <name><surname>Mukhopadhyay</surname> <given-names>S.</given-names></name></person-group> (<year>2019</year>). &#x201C;<article-title>Improving robustness of ReRAM-based spiking neural network accelerator with stochastic spike-timing-dependent-plasticity</article-title>,&#x201D; in <source><italic>Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN)</italic></source>, (<publisher-loc>Budapest</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x2013;<lpage>8</lpage>. <pub-id pub-id-type="doi">10.1109/IJCNN.2019.8851825</pub-id></citation></ref>
<ref id="B60"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Stewart</surname> <given-names>K.</given-names></name> <name><surname>Orchard</surname> <given-names>G.</given-names></name> <name><surname>Shrestha</surname> <given-names>S. B.</given-names></name> <name><surname>Neftci</surname> <given-names>E.</given-names></name></person-group> (<year>2020</year>). &#x201C;<article-title>On-chip few-shot learning with surrogate gradient descent on a neuromorphic processor</article-title>,&#x201D; in <source><italic>Proceedings of the 2020 2nd IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS)</italic></source>, (<publisher-loc>Genova</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>223</fpage>&#x2013;<lpage>227</lpage>. <pub-id pub-id-type="doi">10.1109/AICAS48895.2020.9073961</pub-id></citation></ref>
<ref id="B61"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Strubell</surname> <given-names>E.</given-names></name> <name><surname>Ganesh</surname> <given-names>A.</given-names></name> <name><surname>McCallum</surname> <given-names>A.</given-names></name></person-group> (<year>2020</year>). <article-title>Energy and policy considerations for modern deep learning research.</article-title> <source><italic>Proc. AAAI Conf. Artif. Intell.</italic></source> <volume>34</volume> <fpage>13693</fpage>&#x2013;<lpage>13696</lpage>. <pub-id pub-id-type="doi">10.1609/aaai.v34i09.7123</pub-id></citation></ref>
<ref id="B62"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vogels</surname> <given-names>T.</given-names></name> <name><surname>Karimireddy</surname> <given-names>S. P.</given-names></name> <name><surname>Jaggi</surname> <given-names>M.</given-names></name></person-group> (<year>2019</year>). <article-title>PowerSGD: practical low-rank gradient compression for distributed optimization.</article-title> <source><italic>Adv. Neural Inform. Process. Syst.</italic></source> <volume>32</volume> <fpage>14236</fpage>&#x2013;<lpage>14245</lpage>.</citation></ref>
<ref id="B63"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>D.</given-names></name> <name><surname>Gao</surname> <given-names>X.</given-names></name> <name><surname>Wang</surname> <given-names>X.</given-names></name></person-group> (<year>2015</year>). <article-title>Semi-supervised nonnegative matrix factorization via constraint propagation.</article-title> <source><italic>IEEE Trans. Cybern.</italic></source> <volume>46</volume> <fpage>233</fpage>&#x2013;<lpage>244</lpage>. <pub-id pub-id-type="doi">10.1109/TCYB.2015.2399533</pub-id> <pub-id pub-id-type="pmid">25706980</pub-id></citation></ref>
<ref id="B64"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>Z.</given-names></name> <name><surname>Li</surname> <given-names>C.</given-names></name> <name><surname>Lin</surname> <given-names>P.</given-names></name> <name><surname>Rao</surname> <given-names>M.</given-names></name> <name><surname>Nie</surname> <given-names>Y.</given-names></name> <name><surname>Song</surname> <given-names>W.</given-names></name><etal/></person-group> (<year>2019</year>). <article-title>In situ training of feed-forward and recurrent convolutional memristor networks.</article-title> <source><italic>Nat. Mach. Intell.</italic></source> <volume>1</volume> <fpage>434</fpage>&#x2013;<lpage>442</lpage>. <pub-id pub-id-type="doi">10.1038/s42256-019-0089-1</pub-id></citation></ref>
<ref id="B65"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wong</surname> <given-names>H.-S. P.</given-names></name> <name><surname>Lee</surname> <given-names>H.-Y.</given-names></name> <name><surname>Yu</surname> <given-names>S.</given-names></name> <name><surname>Chen</surname> <given-names>Y.-S.</given-names></name> <name><surname>Wu</surname> <given-names>Y.</given-names></name> <name><surname>Chen</surname> <given-names>P.-S.</given-names></name><etal/></person-group> (<year>2012</year>). <article-title>Metal&#x2013;oxide RRAM.</article-title> <source><italic>Proc. IEEE</italic></source> <volume>100</volume> <fpage>1951</fpage>&#x2013;<lpage>1970</lpage>. <pub-id pub-id-type="doi">10.1109/JPROC.2012.2190369</pub-id></citation></ref>
</ref-list>
</back>
</article>