<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Artif. Intell.</journal-id>
<journal-title>Frontiers in Artificial Intelligence</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Artif. Intell.</abbrev-journal-title>
<issn pub-type="epub">2624-8212</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/frai.2025.1628943</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Artificial Intelligence</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Spectral momentum integration: hybrid optimization of frequency and time domain gradients</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Huang</surname> <given-names>Zhigao</given-names></name>
<role content-type="https://credit.niso.org/contributor-roles/conceptualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/formal-analysis/"/>
<role content-type="https://credit.niso.org/contributor-roles/visualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Chen</surname> <given-names>Musheng</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/3064205/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/funding-acquisition/"/>
<role content-type="https://credit.niso.org/contributor-roles/software/"/>
<role content-type="https://credit.niso.org/contributor-roles/supervision/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Zheng</surname> <given-names>Shiyan</given-names></name>
<role content-type="https://credit.niso.org/contributor-roles/formal-analysis/"/>
<role content-type="https://credit.niso.org/contributor-roles/methodology/"/>
<role content-type="https://credit.niso.org/contributor-roles/project-administration/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
</contrib-group>
<aff><institution>Department of Physics and Information Engineering, Quanzhou Normal University, Quanzhou</institution>, <addr-line>Fujian</addr-line>, <country>China</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Mark Eisen, Johns Hopkins University, United States</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Tarun Kumar Vashishth, IIMT University, India</p>
<p>Domingos F. Oliveira, Mandume Ya Ndemufayo University, Angola</p>
<p>Nur Alamsyah, Universitas Informatika dan Bisnis Indonesia, Indonesia</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Musheng Chen <email>07015&#x00040;qztc.edu.cn</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>07</day>
<month>08</month>
<year>2025</year>
</pub-date>
<pub-date pub-type="collection">
<year>2025</year>
</pub-date>
<volume>8</volume>
<elocation-id>1628943</elocation-id>
<history>
<date date-type="received">
<day>16</day>
<month>06</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>16</day>
<month>07</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2025 Huang, Chen and Zheng.</copyright-statement>
<copyright-year>2025</copyright-year>
<copyright-holder>Huang, Chen and Zheng</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>We propose Spectral Momentum Integration (SMI), an optimization enhancement that processes gradients in both frequency and time domains. SMI applies the Fast Fourier Transform to selectively filter gradient frequency components before blending them with original gradients using an adaptive scheduling mechanism. Experiments on a character-level language model demonstrate that SMI can achieve inference acceleration while maintaining model performance. Our approach integrates with existing optimizers without modifying model architecture, though it introduces computational overhead and hyperparameter complexity. While our current validation is limited to small-scale experiments, SMI provides a proof-of-concept for incorporating frequency-domain processing into neural network optimization, suggesting potential for broader applications pending large-scale validation.</p></abstract>
<kwd-group>
<kwd>deep learning</kwd>
<kwd>optimization</kwd>
<kwd>Fast Fourier Transform</kwd>
<kwd>gradient processing</kwd>
<kwd>spectral filtering</kwd>
<kwd>inference acceleration</kwd>
</kwd-group>
<counts>
<fig-count count="5"/>
<table-count count="10"/>
<equation-count count="18"/>
<ref-count count="60"/>
<page-count count="18"/>
<word-count count="10238"/>
</counts>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Machine Learning and Artificial Intelligence</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1 Introduction</title>
<p>Deep learning optimization has evolved significantly in recent years, with adaptive gradient methods such as Adam (<xref ref-type="bibr" rid="B20">Kingma and Ba, 2014</xref>), AdamW (<xref ref-type="bibr" rid="B27">Loshchilov and Hutter, 2017</xref>), and more recent variants like Adafactor (<xref ref-type="bibr" rid="B46">Shazeer and Stern, 2018</xref>) and Apollo (<xref ref-type="bibr" rid="B29">Ma et al., 2020</xref>) becoming standard tools for training neural networks. Despite these advances, optimization efficiency remains a critical challenge in the era of increasingly large models (<xref ref-type="bibr" rid="B6">Brown et al., 2020</xref>; <xref ref-type="bibr" rid="B41">Radford et al., 2021</xref>; <xref ref-type="bibr" rid="B50">Touvron et al., 2023</xref>; <xref ref-type="bibr" rid="B38">OpenAI, 2023</xref>; <xref ref-type="bibr" rid="B48">Team et al., 2023</xref>). Current optimizers primarily operate in the time domain, processing gradients based on their magnitudes and historical momentum, a paradigm that has remained largely unchanged since the introduction of momentum-based methods (<xref ref-type="bibr" rid="B40">Polyak, 1964</xref>; <xref ref-type="bibr" rid="B36">Nesterov, 1983</xref>).</p>
<p>Recent studies have highlighted the limitations of conventional optimizers in dealing with gradient noise. <xref ref-type="bibr" rid="B24">Liu et al. (2020)</xref> demonstrated that gradient noise can significantly impede convergence, while <xref ref-type="bibr" rid="B60">Zhuang et al. (2020)</xref> showed that uncertainties in gradient estimates lead to erratic training dynamics. <xref ref-type="bibr" rid="B59">Zhang and Zhang (2022)</xref> further identified that gradient-based optimizers often struggle to differentiate between informative signal and stochastic noise, especially in complex loss landscapes typical of large neural networks (<xref ref-type="bibr" rid="B21">Li et al., 2018</xref>; <xref ref-type="bibr" rid="B12">Fort and Dziugaite, 2019</xref>).</p>
<p>The frequency domain offers a complementary perspective for gradient analysis that has received limited attention in optimization literature. While spectral properties of neural networks have been studied in contexts such as generalization (<xref ref-type="bibr" rid="B42">Rahaman et al., 2019</xref>; <xref ref-type="bibr" rid="B54">Xu et al., 2019a</xref>), initialization (<xref ref-type="bibr" rid="B57">Yang et al., 2022</xref>), and pruning (<xref ref-type="bibr" rid="B52">Wang et al., 2020</xref>), direct application to gradient processing in optimization algorithms remains underexplored.</p>
<p>Related frequency-domain approaches in optimization include Augmented RMSProp (<xref ref-type="bibr" rid="B31">Martinez et al., 2022</xref>), which decomposes gradients into high- and low-frequency components but lacks adaptive blending mechanisms, and SignGD (<xref ref-type="bibr" rid="B5">Bernstein et al., 2018</xref>), which can be interpreted as focusing on phase while disregarding magnitude information (<xref ref-type="bibr" rid="B3">Balles and Hennig, 2020</xref>). Spectral methods have also been explored in neural architecture search (<xref ref-type="bibr" rid="B33">Mellor et al., 2021</xref>) and network compression (<xref ref-type="bibr" rid="B52">Wang et al., 2020</xref>), though these focus on network analysis rather than optimization dynamics.</p>
<p>More broadly, frequency-domain analysis has been applied to understanding training dynamics (<xref ref-type="bibr" rid="B12">Fort and Dziugaite, 2019</xref>; <xref ref-type="bibr" rid="B21">Li et al., 2018</xref>) and loss landscape properties (<xref ref-type="bibr" rid="B13">Garipov et al., 2018</xref>; <xref ref-type="bibr" rid="B37">Nguyen and Hein, 2017</xref>), but direct manipulation of gradients in the frequency domain for optimization purposes remains rare. Our work builds on these foundations while exploring direct frequency-domain gradient processing.</p>
<p>The frequency domain representation of gradients contains valuable information about different spatial scales of parameter updates. Low-frequency components typically correspond to broader, more structural changes in the parameter space, while high-frequency components often represent noise or fine details (<xref ref-type="bibr" rid="B55">Xu et al., 2019b</xref>; <xref ref-type="bibr" rid="B28">Loshchilov and Hutter, 2019</xref>). This natural decomposition aligns with observations from information geometry (<xref ref-type="bibr" rid="B58">Zhang et al., 2019</xref>; <xref ref-type="bibr" rid="B51">Tsuji et al., 2022</xref>) and manifold perspectives of optimization (<xref ref-type="bibr" rid="B30">Martens, 2020</xref>; <xref ref-type="bibr" rid="B4">Bernacchia et al., 2022</xref>), suggesting that different frequency bands contribute differently to the optimization process.</p>
<p>Conventional optimizers treat all frequency components equally, which can lead to suboptimal parameter updates. Adaptively weighting these components could potentially improve convergence, especially in the presence of noisy gradients (<xref ref-type="bibr" rid="B9">Defazio and Mishchenko, 2022</xref>) or when navigating complex loss landscapes (<xref ref-type="bibr" rid="B56">Yang et al., 2021</xref>). Recent work by <xref ref-type="bibr" rid="B23">Liu et al. (2022)</xref> demonstrated that gradient components at different scales exhibit varying levels of informativeness throughout training, but did not explore frequency-domain solutions. Similarly, <xref ref-type="bibr" rid="B53">Wang et al. (2022)</xref> showed that selective dampening of certain gradient components can improve stability, though their approach remained in the time domain.</p>
<p>In this paper, we introduce Spectral Momentum Integration (SMI), an optimization enhancement that incorporates frequency-domain gradient processing alongside traditional time-domain methods. While existing optimizers operate exclusively in the time domain, our approach explores the potential benefits of processing gradients in both domains simultaneously. SMI applies the Fast Fourier Transform (FFT) to represent gradients in the frequency domain (essentially decomposing gradients into different &#x0201C;frequency patterns&#x0201D;), selectively filters frequency components based on their magnitudes (keeping the most important patterns while removing noise), and then combines the filtered spectral gradients with the original time-domain gradients using a time-dependent blending coefficient. This substantially differs from prior work such as Augmented RMSProp (<xref ref-type="bibr" rid="B31">Martinez et al., 2022</xref>), which lacks adaptive integration mechanisms, and from traditional adaptive methods like Adam (<xref ref-type="bibr" rid="B20">Kingma and Ba, 2014</xref>), which cannot distinguish between informative signals and noise in frequency space.</p>
<p>Our key contributions include:</p>
<list list-type="bullet">
<list-item><p>A <bold>spectral optimizer wrapper</bold> that enhances gradient-based optimizers without modifying model architecture, demonstrating 15% inference speedup with 4.5% training overhead in our small-scale experiments.</p></list-item>
<list-item><p>A <bold>frequency-domain filtering technique</bold> that preserves important spectral components while reducing noise, employing quantile-based adaptive thresholding.</p></list-item>
<list-item><p>An <bold>adaptive blending mechanism</bold> with cosine scheduling that outperforms linear approaches in our experiments, reducing loss variance by 43.5%.</p></list-item>
<list-item><p><bold>Empirical evidence</bold> on a small-scale model showing that frequency-domain gradient processing can improve parameter quality for inference, achieving 8% faster convergence alongside 15% inference acceleration in our experimental setting.</p></list-item>
</list>
<p>Our work explores connections between signal processing principles and deep learning optimization, building on spectral analysis approaches in computer vision (<xref ref-type="bibr" rid="B11">Durall et al., 2020</xref>; <xref ref-type="bibr" rid="B17">Huang et al., 2022</xref>) and time series processing (<xref ref-type="bibr" rid="B22">Liang et al., 2022</xref>; <xref ref-type="bibr" rid="B43">Rao et al., 2022</xref>). The approach may complement recent advancements in second-order methods (<xref ref-type="bibr" rid="B1">Anil et al., 2021</xref>; <xref ref-type="bibr" rid="B26">Liu et al., 2023</xref>), distributed optimization (<xref ref-type="bibr" rid="B18">Jiang et al., 2020</xref>; <xref ref-type="bibr" rid="B47">Tang et al., 2023</xref>), and large-scale training methods (<xref ref-type="bibr" rid="B16">Hoffmann et al., 2022</xref>), though such combinations require further investigation.</p>
<p>We validate our approach through experiments on a character-level language model trained on the Shakespeare dataset. Results demonstrate that SMI with cosine scheduling and 75% frequency preservation not only accelerates inference by 15% but also provides 8% faster convergence and more stable training dynamics compared to standard AdamW optimization. These findings align with recent observations on the relationship between optimization trajectories and model efficiency (<xref ref-type="bibr" rid="B8">Cheng et al., 2023</xref>; <xref ref-type="bibr" rid="B7">Chen et al., 2023</xref>), suggesting that frequency-domain information can guide optimizers toward parameter configurations that enable more efficient computation.</p>
<p>The remainder of this paper is organized as follows: Section 2 reviews related work in optimization and frequency-domain processing. Section 3 details our proposed Spectral Momentum Integration approach. Section 4 describes the experimental setup, followed by results and analysis in Section 5. Finally, Section 7 concludes with a discussion of implications and future work.</p>
</sec>
<sec id="s2">
<title>2 Related work</title>
<sec>
<title>2.1 Neural network optimization</title>
<p>Gradient-based optimization forms the foundation of deep neural network training. Stochastic Gradient Descent (SGD) (<xref ref-type="bibr" rid="B45">Robbins and Monro, 1951</xref>) and its variants with momentum (<xref ref-type="bibr" rid="B40">Polyak, 1964</xref>) have been standard approaches for decades. More recently, adaptive optimization methods such as AdaGrad (<xref ref-type="bibr" rid="B10">Duchi et al., 2011</xref>), RMSProp (<xref ref-type="bibr" rid="B49">Tieleman and Hinton, 2012</xref>), and Adam (<xref ref-type="bibr" rid="B20">Kingma and Ba, 2014</xref>) have gained popularity for their ability to automatically adjust learning rates for each parameter.</p>
<p>Adam (<xref ref-type="bibr" rid="B20">Kingma and Ba, 2014</xref>) combines momentum and adaptive learning rates, making it widely used across various deep learning applications. AdamW (<xref ref-type="bibr" rid="B27">Loshchilov and Hutter, 2017</xref>) improved upon Adam by decoupling weight decay regularization from the gradient update. Subsequent works like RAdam (<xref ref-type="bibr" rid="B25">Liu et al., 2019</xref>) addressed the warmup instability issues in Adam by rectifying the adaptive learning rate.</p>
<p>Gradient clipping (<xref ref-type="bibr" rid="B39">Pascanu et al., 2013</xref>) is another important technique for stabilizing training, particularly for recurrent neural networks. It prevents gradient explosions by scaling gradients when their norm exceeds a threshold. While effective, these methods all operate in the time domain and do not explicitly consider the frequency characteristics of gradients.</p>
</sec>
<sec>
<title>2.2 Frequency domain analysis in deep learning</title>
<p>The analysis of neural networks in the frequency domain has gained attention in recent years. <xref ref-type="bibr" rid="B42">Rahaman et al. (2019)</xref> demonstrated the &#x0201C;spectral bias&#x0201D; of neural networks, showing that they tend to learn low-frequency functions before high-frequency ones. This finding suggests that frequency-aware optimization might better align with the natural learning dynamics of neural networks.</p>
<p>The Fast Fourier Transform (FFT) has been applied in deep learning primarily to accelerate convolution operations (<xref ref-type="bibr" rid="B32">Mathieu et al., 2013</xref>). <xref ref-type="bibr" rid="B44">Rippel et al. (2015)</xref> proposed spectral representations for convolutional networks, demonstrating improved computational efficiency and parameter interpretability.</p>
<p>In the context of generative models, spectral normalization (<xref ref-type="bibr" rid="B35">Miyato et al., 2018</xref>) has been introduced to stabilize GAN training by normalizing the spectral norm of weight matrices. While related to our work in terms of spectral analysis, this approach focuses on weight normalization rather than gradient processing.</p>
</sec>
<sec>
<title>2.3 Inference acceleration techniques</title>
<p>Various techniques have been developed to improve neural network inference speed. Model compression approaches such as pruning, quantization, and Huffman coding (<xref ref-type="bibr" rid="B14">Han et al., 2015</xref>) reduce model size and computational requirements. Knowledge distillation (<xref ref-type="bibr" rid="B15">Hinton et al., 2015</xref>) transfers knowledge from larger teacher models to smaller student models, improving efficiency without significant performance drops.</p>
<p>Mixed precision training (<xref ref-type="bibr" rid="B34">Micikevicius et al., 2017</xref>) uses lower precision representations (e.g., FP16) to accelerate computation while maintaining numerical stability. These approaches typically modify model structure or representation, whereas our method focuses on improving parameter quality during training to achieve faster inference without structural changes.</p>
<p>The novelty of our approach lies in the integration of frequency-domain analysis directly into the optimization process. While previous works have separately explored spectral properties of neural networks and various optimization techniques, SMI uniquely combines these perspectives to create an enhanced optimizer that leverages both time and frequency domain information.</p>
</sec>
</sec>
<sec id="s3">
<title>3 Method</title>
<p>Our Spectral Momentum Integration (SMI) approach enhances existing optimizers by incorporating frequency-domain processing of gradients. The core idea is to filter gradients in the frequency domain to emphasize important spectral components and then blend these filtered gradients with the original gradients using an adaptive weighting scheme.</p>
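<p>The filter-then-blend idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not necessarily the implementation used in the experiments: the function names (<monospace>spectral_filter</monospace>, <monospace>cosine_blend_weight</monospace>, <monospace>smi_gradient</monospace>), the one-dimensional real FFT over the flattened gradient, the shape and direction of the cosine schedule, and the <monospace>alpha_max</monospace> parameter are all hypothetical choices made for readability.</p>
<preformat><![CDATA[
```python
import numpy as np


def spectral_filter(grad, keep_ratio=0.75):
    """Zero out the smallest-magnitude frequency components of a gradient.

    keep_ratio is the fraction of components preserved; the exact
    thresholding rule used here is an assumption for illustration.
    """
    flat = grad.ravel()
    spectrum = np.fft.rfft(flat)                 # real FFT of the flattened gradient
    magnitude = np.abs(spectrum)
    # Quantile-based threshold: components below it are treated as noise.
    threshold = np.quantile(magnitude, 1.0 - keep_ratio)
    filtered = np.where(magnitude >= threshold, spectrum, 0.0)
    # Inverse transform back to the time domain with the original length.
    return np.fft.irfft(filtered, n=flat.size).reshape(grad.shape)


def cosine_blend_weight(step, total_steps, alpha_max=1.0):
    """Hypothetical cosine schedule rising smoothly from 0 to alpha_max."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return alpha_max * 0.5 * (1.0 - np.cos(np.pi * progress))


def smi_gradient(grad, step, total_steps, keep_ratio=0.75):
    """Blend the spectrally filtered gradient with the original gradient."""
    alpha = cosine_blend_weight(step, total_steps)
    return alpha * spectral_filter(grad, keep_ratio) + (1.0 - alpha) * grad


rng = np.random.default_rng(0)
g = rng.normal(size=(8, 16))                     # stand-in for one parameter's gradient
blended = smi_gradient(g, step=50, total_steps=100)
```
]]></preformat>
<p>In this sketch the blended gradient would be handed to the base optimizer in place of the raw gradient. With <monospace>keep_ratio=1.0</monospace> the filter leaves the gradient unchanged, and at step 0 the blend reduces to the original gradient, so the wrapper degrades gracefully to the base optimizer.</p>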
<sec>
<title>3.1 Theoretical foundation</title>
<sec>
<title>3.1.1 Signal processing foundation</title>
<p>The effectiveness of SMI is grounded in fundamental signal processing theory and optimization dynamics. When transforming gradients to the frequency domain, we achieve a decomposition that separates structural information from noise:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover accent='true'><mml:mi>G</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="script">F</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>G</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>S</italic><sub><italic>p</italic></sub> represents structural signal components and <italic>N</italic><sub><italic>p</italic></sub> represents noise or less informative components. This decomposition enables more precise gradient filtering than is possible in the time domain alone.</p>
<p>From an information-theoretic perspective, the frequency-domain representation provides an orthogonal basis in which signal and noise components can be separated more readily. Parseval&#x00027;s theorem ensures that, for a unitary normalization of the transform, energy is preserved across domains:</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M2"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mo>&#x02225;</mml:mo><mml:msub><mml:mrow><mml:mi>G</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mo>&#x02225;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mo>&#x02225;</mml:mo><mml:msub><mml:mrow><mml:mover accent='true'><mml:mi>G</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mo>&#x02225;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Together with the invertibility of the Fourier transform, this energy conservation property means that no information is lost during the transformation; it is only redistributed across frequency components.</p>
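<p>Equation 2 can be checked numerically. The sketch below assumes NumPy and uses a random vector as a stand-in for a flattened gradient. It shows that the identity holds for the unitary (<monospace>norm="ortho"</monospace>) transform, whereas NumPy&#x00027;s default unnormalized FFT scales the spectral energy by the number of elements.</p>
<preformat><![CDATA[
```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=256)                      # stand-in for a flattened gradient G_p

g_hat_unitary = np.fft.fft(g, norm="ortho")   # unitary DFT: energy is preserved
g_hat_default = np.fft.fft(g)                 # default scaling: energy grows by len(g)

energy_time = np.sum(g ** 2)
energy_unitary = np.sum(np.abs(g_hat_unitary) ** 2)
energy_default = np.sum(np.abs(g_hat_default) ** 2)

# The inverse transform recovers the gradient, so nothing is lost.
g_back = np.fft.ifft(g_hat_unitary, norm="ortho").real
```
]]></preformat>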
</sec>
<sec>
<title>3.1.2 Optimization theory connection</title>
<p>During neural network training, gradients contain information across different frequency scales. Low-frequency components typically correspond to broader structural changes in the parameter space, while high-frequency components often represent either fine-grained local adjustments or noise (<xref ref-type="bibr" rid="B42">Rahaman et al., 2019</xref>; <xref ref-type="bibr" rid="B55">Xu et al., 2019b</xref>; <xref ref-type="bibr" rid="B28">Loshchilov and Hutter, 2019</xref>).</p>
<p>We can formalize this as a gradient decomposition theorem:</p>
<p>Theorem 3.1. [Gradient Frequency Decomposition] For any gradient tensor <inline-formula><mml:math id="M3"><mml:msub><mml:mrow><mml:mi>G</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, the frequency domain representation <inline-formula><mml:math id="M4"><mml:msub><mml:mrow><mml:mover accent='true'><mml:mi>G</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="script">F</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>G</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> admits a natural decomposition into low-frequency structural components <inline-formula><mml:math id="M5"><mml:msubsup><mml:mrow><mml:mover accent='true'><mml:mi>G</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula> and high-frequency detail components <inline-formula><mml:math id="M6"><mml:msubsup><mml:mrow><mml:mover accent='true'><mml:mi>G</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>H</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula> such that:</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M7"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover accent='true'><mml:mi>G</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mover accent='true'><mml:mi>G</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:msubsup><mml:mrow><mml:mover accent='true'><mml:mi>G</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>H</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M8"><mml:msubsup><mml:mrow><mml:mover accent='true'><mml:mi>G</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula> captures global optimization directions and <inline-formula><mml:math id="M9"><mml:msubsup><mml:mrow><mml:mover accent='true'><mml:mi>G</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>H</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula> captures local noise and fine details.</p>
<p>Conventional optimizers process all frequency components equally, potentially allowing noise to interfere with learning. Our frequency filtering approach preferentially retains components with larger magnitudes, which typically carry more information about the loss landscape structure.</p>
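<p>The decomposition in Equation 3 follows from the linearity of the Fourier transform once a boundary between low and high frequencies is chosen. The sketch below is illustrative only: the cutoff index <monospace>cutoff</monospace> and the synthetic gradient (a slow sinusoid plus white noise) are assumptions, not quantities taken from the experiments.</p>
<preformat><![CDATA[
```python
import numpy as np

rng = np.random.default_rng(0)
n = 128
t = np.arange(n)
# Synthetic "gradient": a slow sinusoid (structure) plus white noise.
g = np.sin(2 * np.pi * 2 * t / n) + 0.1 * rng.normal(size=n)

g_hat = np.fft.rfft(g)
cutoff = 8                                    # hypothetical low/high boundary
low_mask = np.arange(g_hat.size) < cutoff

g_hat_low = np.where(low_mask, g_hat, 0.0)    # low-frequency part of the spectrum
g_hat_high = np.where(low_mask, 0.0, g_hat)   # high-frequency part of the spectrum

# Linearity: the two parts sum to the full spectrum, and their inverse
# transforms sum to the original gradient.
g_low = np.fft.irfft(g_hat_low, n=n)
g_high = np.fft.irfft(g_hat_high, n=n)
```
]]></preformat>
<p>For this synthetic signal almost all of the energy sits in the low-frequency part, which is the situation in which discarding part of the spectrum removes mostly noise. Whether real gradients behave this way is an empirical question.</p>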
</sec>
<sec>
<title>3.1.3 Inference acceleration mechanism</title>
<p>While our experiments demonstrate inference acceleration, the underlying mechanisms require careful interpretation based on our limited experimental scope:</p>
<sec>
<title>3.1.3.1 Observed spectral regularization effect</title>
<p>SMI appears to act as an implicit spectral regularizer. By filtering gradient frequencies, we hypothesize that it encourages parameters to have specific spectral properties:</p>
<disp-formula id="E4"><label>(4)</label><mml:math id="M10"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo class="qopname">min</mml:mo></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:mi>L</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003BB;</mml:mi><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>p</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M11"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>p</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> represents a hypothesized implicit spectral regularization term. However, the exact form and magnitude of this regularization require further theoretical and empirical investigation.</p>
</sec>
<sec>
<title>3.1.3.2 Empirical complexity reduction</title>
<p>Our experiments suggest that filtered gradients may guide optimization toward parameter configurations with lower effective complexity:</p>
<disp-formula id="E5"><label>(5)</label><mml:math id="M12"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">C</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>e</mml:mi><mml:mi>f</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>S</mml:mi><mml:mi>M</mml:mi><mml:mi>I</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02272;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">C</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>e</mml:mi><mml:mi>f</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mi>a</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M13"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">C</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>e</mml:mi><mml:mi>f</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represents effective computational complexity during the forward pass. This relationship is observed empirically but lacks theoretical guarantees.</p>
</sec>
<sec>
<title>3.1.3.3 Activation pattern hypothesis</title>
<p>We observe that frequency-filtered gradients correlate with more efficient activation patterns, quantified through activation sparsity:</p>
<disp-formula id="E6"><label>(6)</label><mml:math id="M14"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">A</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>p</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>S</mml:mi><mml:mi>M</mml:mi><mml:mi>I</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo>&#x1D53C;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mo>&#x02225;</mml:mo><mml:mtext class="textrm" mathvariant="normal">ReLU</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>S</mml:mi><mml:mi>M</mml:mi><mml:mi>I</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:mi>x</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mo>&#x02225;</mml:mo></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>/</mml:mo><mml:mo>&#x02225;</mml:mo><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>S</mml:mi><mml:mi>M</mml:mi><mml:mi>I</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:mi>x</mml:mi><mml:msub><mml:mrow><mml:mo>&#x02225;</mml:mo></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
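<p>Equation 6 measures the fraction of nonzero pre-activations that remain nonzero after the ReLU, so lower values correspond to sparser activations. While higher activation sparsity may translate to computational savings during inference, the causal relationship and generalizability of this observation require validation on larger models and diverse architectures.</p>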
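<p>As an illustration only, the ratio in Equation 6 can be estimated over a small batch as sketched here. The function names, the toy weight matrix, and the inputs are hypothetical and are not taken from our implementation; plain Python lists stand in for tensors.</p>
<preformat>
```python
def active_fraction(pre_activations):
    """Ratio of the L0 norm after ReLU to the L0 norm before ReLU."""
    nonzero = [z for z in pre_activations if z != 0.0]
    if not nonzero:
        return 0.0
    # ReLU keeps exactly the strictly positive entries nonzero.
    active = [z for z in nonzero if z > 0.0]
    return len(active) / len(nonzero)


def expected_active_fraction(weight, inputs):
    """Average the per-input ratio over a batch (the expectation in Eq. 6)."""
    ratios = []
    for x in inputs:
        z = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
        ratios.append(active_fraction(z))
    return sum(ratios) / len(ratios)


# Hypothetical 4 x 2 weight matrix and a batch of two inputs.
W = [[1.0, -1.0], [0.5, 0.5], [-1.0, 0.0], [0.0, 2.0]]
batch = [[1.0, 2.0], [2.0, -1.0]]
print(expected_active_fraction(W, batch))
```
</preformat>
<p>For this toy batch, half of the pre-activations are positive for each input, so the estimate is 0.5.</p>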
</sec>
</sec>
<sec>
<title>3.1.4 Convergence analysis</title>
<p>We provide a preliminary convergence analysis for SMI, though rigorous theoretical guarantees require further investigation:</p>
<p>Theorem 3.2 (SMI Convergence, Preliminary). Under standard assumptions (Lipschitz smoothness, bounded gradients), SMI with an appropriately chosen blending schedule &#x003B1;<sub><italic>t</italic></sub> is expected to converge to a stationary point. The convergence rate is conjectured to be <inline-formula><mml:math id="M15"><mml:mi>O</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>/</mml:mo><mml:msqrt><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msqrt></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, matching that of the base optimizer, though a formal proof is pending.</p>
<sec>
<title>3.1.4.1 Proof sketch</title>
<p>The key intuition is that, if the filtered gradients remain approximately unbiased estimators of the true gradient direction in expectation, frequency-domain denoising may reduce their variance. However, frequency filtering can itself introduce bias, and quantifying this bias and its impact on convergence guarantees requires rigorous mathematical analysis that goes beyond our current scope.</p>
</sec>
<sec>
<title>3.1.4.2 Open questions</title>
<p>Several theoretical questions remain: (1) Under what conditions does frequency filtering preserve gradient unbiasedness? (2) How does the choice of filtering parameters affect convergence rates? (3) What are the optimal blending schedules for different problem classes? These questions represent important directions for future theoretical work.</p>
</sec>
</sec>
</sec>
<sec>
<title>3.2 Overview</title>
<p>SMI operates as a wrapper around any gradient-based optimizer, intercepting gradients after the backward pass but before the parameter update step. The wrapper performs several key operations:</p>
<list list-type="simple">
<list-item><p>1) Transform gradients from the time domain to the frequency domain using Fast Fourier Transform (FFT).</p></list-item>
<list-item><p>2) Calculate and update exponential moving averages (EMA) of the frequency magnitudes.</p></list-item>
<list-item><p>3) Apply a magnitude-based threshold to filter frequency components.</p></list-item>
<list-item><p>4) Transform filtered gradients back to the time domain using inverse FFT.</p></list-item>
<list-item><p>5) Blend filtered gradients with original gradients using a time-dependent mixing coefficient.</p></list-item>
<list-item><p>6) Pass the blended gradients to the base optimizer for parameter updates.</p></list-item>
</list>
<p>This process allows SMI to selectively preserve important frequency components while reducing noise, which in our experiments is associated with improved training dynamics and faster inference.</p>
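<p>The wrapper pattern described in the steps above can be sketched generically as follows. This is a minimal sketch rather than our implementation: the class name <monospace>GradientWrapper</monospace>, the <monospace>sgd_step</monospace> function, and the placeholder transform (which simply halves each gradient in place of the spectral processing) are all hypothetical.</p>
<preformat>
```python
class GradientWrapper:
    """Hypothetical wrapper: transform gradients, then delegate the update."""

    def __init__(self, base_step, transform):
        self.base_step = base_step    # update rule of the base optimizer
        self.transform = transform    # gradient processing (e.g. filter + blend)
        self.t = 0                    # iteration counter for the schedule

    def step(self, params, grads):
        self.t += 1
        processed = [self.transform(g, self.t) for g in grads]
        return self.base_step(params, processed)


def sgd_step(params, grads, lr=0.1):
    """Plain gradient descent, standing in for any base optimizer."""
    return [p - lr * g for p, g in zip(params, grads)]


# Placeholder transform: halves each gradient instead of spectral filtering.
wrapper = GradientWrapper(sgd_step, transform=lambda g, t: 0.5 * g)
print(wrapper.step([1.0, 2.0], [1.0, -2.0]))
```
</preformat>
<p>The base optimizer only ever sees the processed gradients, which is why the approach does not depend on the optimizer's internal update rule.</p>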
<sec>
<title>3.2.1 Practical implementation notes</title>
<p>SMI can be easily integrated with existing optimizers through a simple wrapper class. The key implementation considerations include: (1) ensuring gradient tensors are properly reshaped for 2D FFT operations, (2) managing complex number arithmetic in PyTorch using <monospace>torch.fft</monospace> functions, (3) efficient memory management for spectral history buffers, and (4) proper device placement for GPU acceleration. For models with irregular tensor shapes, padding or alternative reshaping strategies may be required.</p>
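<p>One possible reshaping convention is sketched here, under the assumption that the last axis is kept as <italic>d</italic> and all leading axes are flattened into <italic>n</italic>. The helper name <monospace>matrix_shape</monospace> is hypothetical, and other conventions may suit particular architectures better.</p>
<preformat>
```python
from math import prod


def matrix_shape(shape):
    """Map an arbitrary tensor shape to a 2D (n, d) shape for a 2D FFT.

    Assumption: the last axis becomes d and all leading axes are
    flattened into n; a 1D tensor is treated as a single row.
    """
    dims = tuple(shape)
    if len(dims) == 0:
        return (1, 1)
    if len(dims) == 1:
        return (1, dims[0])
    return (prod(dims[:-1]), dims[-1])


print(matrix_shape((768, 3072)))    # linear layer weight
print(matrix_shape((64, 3, 3, 3)))  # convolution kernel
print(matrix_shape((768,)))         # bias vector
```
</preformat>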
</sec>
</sec>
<sec>
<title>3.3 Spectral gradient processing</title>
<p>For each parameter tensor <italic>p</italic> with gradient &#x02207;<sub><italic>p</italic></sub><italic>L</italic>, we first reshape it to a 2D matrix to enable FFT processing:</p>
<disp-formula id="E7"><label>(7)</label><mml:math id="M16"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>G</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mtext class="textrm" mathvariant="normal">reshape</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mi>L</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>n</italic> represents the flattened spatial dimensions and <italic>d</italic> is the channel dimension. We then apply the 2D Fast Fourier Transform (which decomposes the gradient into its frequency components, similar to how a musical chord can be decomposed into individual notes):</p>
<disp-formula id="E8"><label>(8)</label><mml:math id="M17"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover accent='true'><mml:mi>G</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="script">F</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>G</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x02102;</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The magnitude spectrum is computed as:</p>
<disp-formula id="E9"><label>(9)</label><mml:math id="M18"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mover accent='true'><mml:mi>G</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>To maintain a stable estimate of frequency importance, we track an exponential moving average (EMA) of these magnitudes:</p>
<disp-formula id="E10"><label>(10)</label><mml:math id="M19"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>H</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>&#x003B2;</mml:mi><mml:msubsup><mml:mrow><mml:mi>H</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msubsup><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x003B2; is the EMA decay factor (ranging from 0.9 to 0.99 in our experiments).</p>
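<p>Equations 8 to 10 can be illustrated on a toy 2 &#x000D7; 2 gradient as follows. This is a sketch under stated assumptions: a naive pure-Python 2D DFT replaces <monospace>torch.fft.fft2</monospace> so that the example is self-contained, the function names are hypothetical, and the history is seeded with the first magnitude spectrum, as in Algorithm 1.</p>
<preformat>
```python
import cmath


def dft2(matrix):
    """Naive 2D discrete Fourier transform of a real n x d matrix."""
    n, d = len(matrix), len(matrix[0])
    out = [[0j] * d for _ in range(n)]
    for u in range(n):
        for v in range(d):
            total = 0j
            for r in range(n):
                for c in range(d):
                    angle = -2.0 * cmath.pi * (u * r / n + v * c / d)
                    total += matrix[r][c] * cmath.exp(1j * angle)
            out[u][v] = total
    return out


def magnitude(spectrum):
    """Element-wise modulus of the complex spectrum (Equation 9)."""
    return [[abs(z) for z in row] for row in spectrum]


def ema_update(history, current, beta=0.99):
    """Equation 10; the first call seeds the history with the magnitudes."""
    if history is None:
        return [row[:] for row in current]
    return [
        [beta * h + (1.0 - beta) * m for h, m in zip(h_row, m_row)]
        for h_row, m_row in zip(history, current)
    ]


history = None
for grad in ([[1.0, 2.0], [3.0, 4.0]], [[2.0, 2.0], [2.0, 2.0]]):
    history = ema_update(history, magnitude(dft2(grad)), beta=0.9)
print(history)
```
</preformat>
<p>The naive transform costs far more than an FFT and is only suitable for small illustrations; it is used here purely to keep the arithmetic visible.</p>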
<p><xref ref-type="fig" rid="F1">Figure 1</xref> illustrates the overall flow of our Spectral Momentum Integration algorithm, showing how gradients are processed through both frequency and time domains before being combined for the final parameter updates.</p>
<fig position="float" id="F1">
<label>Figure 1</label>
<caption><p>Spectral Momentum Integration workflow showing gradient processing through frequency and time domains.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-08-1628943-g0001.tif">
<alt-text>Flowchart depicting a process starting with (a) Original Gradient. It flows to (b) FFT Transformation, then (c) Frequency Filtering. From (c), arrows lead to (d) EMA Spectral History and (e) Inverse FFT. (e) connects to (f) Gradient Blending. (a) also connects to (g) Alpha Scheduling, which influences (f). Finally, (f) connects to (h) Final Gradient. Equations for alpha and the final gradient are shown.</alt-text>
</graphic>
</fig>
</sec>
<sec>
<title>3.4 Frequency filtering and gradient blending</title>
<p>Based on the EMA of magnitude spectrum <italic>H</italic><sub><italic>p</italic></sub>, we filter frequency components using a threshold &#x003C4;<sub><italic>p</italic></sub>, which is the <italic>q</italic>-th quantile of <italic>H</italic><sub><italic>p</italic></sub> (essentially keeping only the strongest frequency components while discarding the weakest ones):</p>
<disp-formula id="E11"><label>(11)</label><mml:math id="M20"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mtext class="textrm" mathvariant="normal">quantile</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>H</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>q</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>q</italic> ranges from 0.25 to 0.5 in our experiments, corresponding to retaining the top 75% to 50% of frequency components (similar to noise reduction in audio processing). The binary mask is computed as:</p>
<disp-formula id="E12"><label>(12)</label><mml:math id="M21"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mo>&#x003A9;</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mn>&#x1D7D9;</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>H</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02265;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The filtered spectrum is obtained by element-wise multiplication of the original spectrum with the mask:</p>
<disp-formula id="E13"><label>(13)</label><mml:math id="M22"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>G</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover accent='true'><mml:mi>G</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02299;</mml:mo><mml:msub><mml:mrow><mml:mo>&#x003A9;</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
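<p>The thresholding in Equations 11 to 13 can be sketched as follows. The quantile here uses linear interpolation between order statistics, which is an assumption about the interpolation rule; the function names and the toy history and spectrum values are hypothetical.</p>
<preformat>
```python
import math


def quantile(values, q):
    """q-th quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] * (1.0 - weight) + ordered[upper] * weight


def spectral_mask(history, q=0.25):
    """Binary mask keeping components whose history reaches the threshold."""
    flat = [h for row in history for h in row]
    tau = quantile(flat, q)
    return [[1.0 if h >= tau else 0.0 for h in row] for row in history]


def apply_mask(spectrum, mask):
    """Element-wise product of the complex spectrum and the mask."""
    return [
        [z * m for z, m in zip(z_row, m_row)]
        for z_row, m_row in zip(spectrum, mask)
    ]


history = [[9.8, 1.8], [3.6, 0.0]]                  # toy EMA magnitudes
spectrum = [[8 + 0j, 1 - 1j], [0 + 2j, 0.5 + 0j]]   # toy current spectrum
mask = spectral_mask(history, q=0.5)
print(mask)
print(apply_mask(spectrum, mask))
```
</preformat>
<p>With <italic>q</italic> &#x0003D; 0.5 the two components with the largest history values are kept in this toy case; with <italic>q</italic> &#x0003D; 0.25 three of the four would be kept, matching the 75% retention described in the text.</p>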
<p>This filtered spectrum is then transformed back to the time domain using inverse FFT:</p>
<disp-formula id="E14"><label>(14)</label><mml:math id="M23"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>G</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="script">F</mml:mi></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>G</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Finally, we blend the filtered and original gradients using a time-dependent mixing coefficient &#x003B1;<sub><italic>t</italic></sub>:</p>
<disp-formula id="E15"><label>(15)</label><mml:math id="M24"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup><mml:mi>L</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mi>L</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
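<p>Equations 14 and 15 can be traced on a toy example as follows. This is a sketch only: a naive inverse 2D DFT replaces <monospace>torch.fft.ifft2</monospace>, the real part is taken as in Algorithm 1, the function names are hypothetical, and the range check on the mixing coefficient is our assumption.</p>
<preformat>
```python
import cmath


def idft2_real(spectrum):
    """Naive inverse 2D DFT of an n x d spectrum; returns the real part."""
    n, d = len(spectrum), len(spectrum[0])
    out = [[0.0] * d for _ in range(n)]
    for r in range(n):
        for c in range(d):
            total = 0j
            for u in range(n):
                for v in range(d):
                    angle = 2.0 * cmath.pi * (u * r / n + v * c / d)
                    total += spectrum[u][v] * cmath.exp(1j * angle)
            out[r][c] = (total / (n * d)).real
    return out


def blend(filtered, original, alpha_t):
    """Equation 15: convex combination of filtered and original gradients."""
    if alpha_t > 1.0 or 0.0 > alpha_t:
        raise ValueError("alpha_t is assumed to lie in [0, 1]")
    return [
        [alpha_t * f + (1.0 - alpha_t) * g for f, g in zip(f_row, g_row)]
        for f_row, g_row in zip(filtered, original)
    ]


# Spectrum of [[1, 2], [3, 4]] with the (0, 1) component masked out.
filtered_spectrum = [[10 + 0j, 0j], [-4 + 0j, 0j]]
original = [[1.0, 2.0], [3.0, 4.0]]
filtered = idft2_real(filtered_spectrum)
final = blend(filtered, original, alpha_t=0.5)
print(filtered)
print(final)
```
</preformat>
<p>Removing the masked component averages each row of the toy gradient, and the blend then moves each entry halfway toward that average.</p>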
<p>The mixing coefficient &#x003B1;<sub><italic>t</italic></sub> varies throughout training according to a schedule. In our experiments, we explore both linear and cosine schedules:</p>
<sec>
<title>3.4.1 Linear schedule</title>
<disp-formula id="E16"><label>(16)</label><mml:math id="M25"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x003B1;</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mfrac><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
</sec>
<sec>
<title>3.4.2 Cosine schedule</title>
<disp-formula id="E17"><label>(17)</label><mml:math id="M26"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x003B1;</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mo class="qopname">cos</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003C0;</mml:mi><mml:mo>&#x000B7;</mml:mo><mml:mfrac><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>t</italic> is the current iteration, <italic>T</italic> is the total number of iterations, &#x003B1; is the initial value (typically 0.1), and &#x003B1;<sub><italic>end</italic></sub> is the final value (typically ranging from 0.5 to 0.9).</p>
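<p>Both schedules can be written directly from Equations 16 and 17, as in the following sketch. The function and argument names are hypothetical; the default endpoints 0.1 and 0.5 are the values listed as inputs to Algorithm 1.</p>
<preformat>
```python
import math


def linear_alpha(t, total, alpha_start=0.1, alpha_end=0.5):
    """Equation 16: linear interpolation from alpha_start to alpha_end."""
    return alpha_start + (alpha_end - alpha_start) * (t / total)


def cosine_alpha(t, total, alpha_start=0.1, alpha_end=0.5):
    """Equation 17: cosine ramp from alpha_start to alpha_end."""
    progress = (1.0 - math.cos(math.pi * t / total)) / 2.0
    return alpha_start + (alpha_end - alpha_start) * progress


for t in (0, 250, 500, 750, 1000):
    print(t, round(linear_alpha(t, 1000), 4), round(cosine_alpha(t, 1000), 4))
```
</preformat>
<p>The two schedules share their endpoints and midpoint; the cosine schedule grows more slowly at the start of training and more quickly in the middle.</p>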
</sec>
</sec>
<sec>
<title>3.5 Computational complexity analysis</title>
<p>The computational overhead of SMI consists of several components that scale differently with model size:</p>
<sec>
<title>3.5.1 FFT operations</title>
<p>For each parameter tensor of size <italic>n</italic>&#x000D7;<italic>d</italic>, the 2D FFT requires <italic>O</italic>(<italic>nd</italic>log(<italic>nd</italic>)) operations. For typical transformer layers, this overhead is manageable and often parallelizable.</p>
</sec>
<sec>
<title>3.5.2 Memory overhead</title>
<p>SMI requires additional memory to store the spectral history <italic>H</italic><sub><italic>p</italic></sub> for each parameter, which approximately doubles the memory required for optimizer states. This cost can be reduced by storing the history in lower precision (mixed-precision storage).</p>
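<p>A rough estimate of the history storage can be sketched as follows, assuming one real value is stored per gradient entry (the shape of the magnitude spectrum in Equation 9). The helper name and the layer shapes are hypothetical.</p>
<preformat>
```python
from math import prod


def history_bytes(shapes, bytes_per_element=4):
    """Bytes for one real-valued history buffer per parameter tensor."""
    return sum(prod(shape) for shape in shapes) * bytes_per_element


# Hypothetical weight shapes for one transformer block.
shapes = [(768, 768), (768, 3072), (3072, 768)]
fp32 = history_bytes(shapes, bytes_per_element=4)
fp16 = history_bytes(shapes, bytes_per_element=2)
print(fp32, fp16)
```
</preformat>
<p>Halving the bytes per element halves the history storage, which is the saving that lower-precision storage is meant to provide.</p>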
</sec>
<sec>
<title>3.5.3 Scalability analysis</title>
<p>The total computational overhead scales as:</p>
<disp-formula id="E18"><label>(18)</label><mml:math id="M27"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mi>S</mml:mi><mml:mi>M</mml:mi><mml:mi>I</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mi>a</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B3;</mml:mi><mml:mo>&#x000B7;</mml:mo><mml:mo class="qopname">log</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>P</italic> is the number of parameters, &#x003B3; is a small constant (&#x02248;0.05), and <italic>T</italic><sub><italic>base</italic></sub> is the baseline training time.</p>
</sec>
<sec>
<title>3.5.4 Large-scale model considerations</title>
<p>For very large models (100M&#x0002B; parameters), several critical factors emerge, which are discussed in the following subsections.</p>
</sec>
<sec>
<title>3.5.5 Memory bandwidth bottleneck</title>
<p>At extreme scales (1B&#x0002B; parameters), the primary limitation shifts from computational complexity to memory bandwidth. FFT operations require reading and writing large tensors multiple times, potentially saturating memory bandwidth before computational resources. The spectral history storage approximately doubles optimizer state memory requirements, requiring careful analysis of memory-to-computation ratios.</p>
</sec>
<sec>
<title>3.5.6 Distributed training implications</title>
<p>FFT operations introduce additional complexity in distributed training scenarios:</p>
<list list-type="bullet">
<list-item><p><bold>Data parallel:</bold> FFT operations remain local to each GPU, maintaining scalability but requiring synchronized spectral history updates.</p></list-item>
<list-item><p><bold>Model parallel:</bold> Cross-device parameter tensors complicate FFT application, potentially requiring tensor gathering or specialized distributed FFT implementations.</p></list-item>
<list-item><p><bold>Pipeline parallel:</bold> Gradient synchronization timing may be affected by FFT processing latency.</p></list-item>
</list>
</sec>
<sec>
<title>3.5.7 Hardware optimization opportunities</title>
<p>Modern hardware offers several optimization paths:</p>
<list list-type="bullet">
<list-item><p><bold>GPU acceleration:</bold> optimized cuFFT libraries can significantly reduce overhead, particularly for regularly-shaped transformer parameters.</p></list-item>
<list-item><p><bold>Tensor core utilization:</bold> mixed-precision FFT operations can leverage specialized hardware units.</p></list-item>
<list-item><p><bold>Memory hierarchy:</bold> intelligent caching of spectral histories in faster memory tiers can mitigate bandwidth limitations.</p></list-item>
</list>
</sec>
<sec>
<title>3.5.8 Scaling trade-offs</title>
<p>The cost-benefit analysis evolves with model scale:</p>
<list list-type="bullet">
<list-item><p><bold>Training cost:</bold> 5&#x02013;15% overhead becomes significant for multi-million dollar training runs.</p></list-item>
<list-item><p><bold>Inference benefits:</bold> 15% inference acceleration provides substantial value for deployed models with high query volumes.</p></list-item>
<list-item><p><bold>Break-even analysis:</bold> for models deployed for extensive inference workloads, training overhead is typically amortized within weeks of deployment.</p></list-item>
</list>
</sec>
</sec>
<sec>
<title>3.6 Hyperparameter selection guidelines and method complexity</title>
<p>SMI introduces several hyperparameters that require careful tuning, representing a significant complexity burden:</p>
<sec>
<title>3.6.1 Frequency threshold (<italic>q</italic>)</title>
<list list-type="bullet">
<list-item><p><bold>Recommended default:</bold> <italic>q</italic> &#x0003D; 0.25 (75% retention) for most transformer models.</p></list-item>
<list-item><p><bold>For noisy tasks:</bold> Use <italic>q</italic> &#x0003D; 0.5 (50% retention) when dealing with very noisy gradients or small batch sizes.</p></list-item>
<list-item><p><bold>For stable tasks:</bold> Use <italic>q</italic> &#x0003D; 0.15 (85% retention) for well-conditioned problems with large batch sizes.</p></list-item>
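<list-item><p><bold>Avoid:</bold> aggressive thresholds that retain only a small fraction of components (large <italic>q</italic>), since over-filtering can destroy important gradient information. Under Equation 11, small values such as <italic>q</italic> &#x0003C; 0.1 retain more than 90% of components and leave the gradient nearly unfiltered.</p></list-item>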
<list-item><p><bold>Tuning strategy:</bold> Start conservatively with <italic>q</italic> &#x0003D; 0.25 and adjust based on training loss smoothness.</p></list-item>
</list>
</sec>
<sec>
<title>3.6.2 EMA decay (&#x003B2;)</title>
<list list-type="bullet">
<list-item><p><bold>Recommended default:</bold> &#x003B2; &#x0003D; 0.99 for stable training with most optimizers (Adam, AdamW).</p></list-item>
<list-item><p><bold>For dynamic gradients:</bold> use &#x003B2; &#x0003D; 0.95 for tasks with rapidly changing gradient patterns.</p></list-item>
<list-item><p><bold>For very stable gradients:</bold> use &#x003B2; &#x0003D; 0.999 for well-conditioned optimization landscapes.</p></list-item>
<list-item><p><bold>Integration guideline:</bold> match or slightly exceed the &#x003B2;<sub>2</sub> parameter of the base optimizer.</p></list-item>
<list-item><p><bold>Memory consideration:</bold> higher &#x003B2; requires longer warmup periods but provides more robust filtering.</p></list-item>
</list>
</sec>
<sec>
<title>3.6.3 Alpha scheduling</title>
<list list-type="bullet">
<list-item><p><bold>Recommended:</bold> cosine schedule from 0.1 to 0.5 for most applications.</p></list-item>
<list-item><p><bold>Conservative start:</bold> linear 0.1 &#x02192; 0.3 for sensitive models or initial experiments.</p></list-item>
<list-item><p><bold>Aggressive setting:</bold> cosine 0.1 &#x02192; 0.7 only for very noisy gradients with careful monitoring.</p></list-item>
<list-item><p><bold>Safety guidelines:</bold> always start with conservative values and monitor training loss variance.</p></list-item>
</list>
</sec>
<sec>
<title>3.6.4 Integration with popular optimizers</title>
<list list-type="bullet">
<list-item><p><bold>AdamW:</bold> use &#x003B2; &#x0003D; 0.99, <italic>q</italic> &#x0003D; 0.25, cosine &#x003B1; 0.1 &#x02192; 0.5 (most tested combination).</p></list-item>
<list-item><p><bold>Adam:</bold> similar to AdamW but consider slightly higher &#x003B2; &#x0003D; 0.995 for stability.</p></list-item>
<list-item><p><bold>SGD with momentum:</bold> use &#x003B2; &#x0003D; 0.9, <italic>q</italic> &#x0003D; 0.3, linear &#x003B1; 0.1 &#x02192; 0.4 for better compatibility.</p></list-item>
<list-item><p><bold>Lion:</bold> experimental&#x02014;start with very conservative <italic>q</italic> &#x0003D; 0.4, &#x003B1; 0.1 &#x02192; 0.3.</p></list-item>
</list>
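<p>The combinations listed above can be collected into a small configuration table, as in the following sketch. The names <monospace>DEFAULTS</monospace>, <monospace>OVERRIDES</monospace>, and <monospace>preset_for</monospace> are hypothetical. Where the list gives no value (for example &#x003B2; and the schedule type for Lion), the sketch falls back to the AdamW settings, which is our assumption rather than a tested recommendation.</p>
<preformat>
```python
# AdamW settings, used as the fallback for values the list leaves open.
DEFAULTS = {"beta": 0.99, "q": 0.25, "schedule": "cosine", "alpha": (0.1, 0.5)}

# Per-optimizer values stated in the list; missing keys use DEFAULTS.
OVERRIDES = {
    "AdamW": {},
    "Adam": {"beta": 0.995},
    "SGD": {"beta": 0.9, "q": 0.3, "schedule": "linear", "alpha": (0.1, 0.4)},
    "Lion": {"q": 0.4, "alpha": (0.1, 0.3)},
}


def preset_for(optimizer_name):
    """Return the SMI settings suggested for a given base optimizer."""
    if optimizer_name not in OVERRIDES:
        raise KeyError("no preset listed for " + optimizer_name)
    return {**DEFAULTS, **OVERRIDES[optimizer_name]}


for name in OVERRIDES:
    print(name, preset_for(name))
```
</preformat>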
</sec>
<sec>
<title>3.6.5 Method limitations</title>
<list list-type="bullet">
<list-item><p><bold>Hyperparameter complexity:</bold> the method introduces 3&#x02013;4 additional hyperparameters that require careful tuning, increasing optimization complexity.</p></list-item>
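<list-item><p><bold>Computational overhead:</bold> while modest (4.5%), the additional cost of the spectral processing can become significant for large-scale training runs.</p></list-item>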
<list-item><p><bold>Memory requirements:</bold> storing spectral history approximately doubles memory usage for optimizer states.</p></list-item>
<list-item><p><bold>Architecture constraints:</bold> FFT operations work best with regularly-shaped tensors, potentially limiting applicability to certain architectures.</p></list-item>
</list>
</sec>
</sec>
<sec>
<title>3.7 Algorithm</title>
<p>The complete Spectral Momentum Integration algorithm is presented in <xref ref-type="table" rid="T11">Algorithm 1</xref>. This algorithm can be implemented as a wrapper around any gradient-based optimizer, making it easily applicable to existing training pipelines.</p>
<table-wrap position="float" id="T11">
<label>Algorithm 1</label>
<caption><p>Spectral momentum integration (enhanced implementation).</p></caption>
<table frame="hsides" rules="groups">
<tbody>
<tr>
<td align="left" valign="top"><monospace>1: <bold>Input:</bold> Base optimizer <italic>O</italic>, initial <italic>&#x003B1;</italic> &#x0003D; 0.1, final <italic>&#x003B1;</italic><sub><italic>end</italic></sub> &#x0003D; 0.5, total iterations <italic>T</italic>, EMA decay <italic>&#x003B2;</italic> &#x0003D; 0.99, frequency threshold <italic>q</italic> &#x0003D; 0.25</monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace>2: <bold>Memory Allocation:</bold> Initialize spectral history buffers <italic>H</italic><sub><italic>p</italic></sub> &#x0003D; None for all parameters <italic>p</italic></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace>3: <bold>Setup:</bold> <italic>t</italic> &#x02190; 0, device &#x02190; current GPU/CPU device</monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace>4: <bold>while</bold> not converged <bold>do</bold></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace>5: &#x02003; <italic>t</italic> &#x02190; <italic>t</italic> &#x0002B; 1</monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace>6: &#x02003; Calculate gradients &#x02207;<sub><italic>p</italic></sub><italic>L</italic> for all parameters <italic>p</italic> via backpropagation</monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace>7: &#x02003; <bold>for</bold> each parameter tensor <italic>p</italic> with gradient &#x02207;<sub><italic>p</italic></sub><italic>L</italic> <bold>do</bold></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace>8: &#x02003;&#x02003; <bold>Tensor Preprocessing:</bold></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace>9: &#x02003;&#x02003;&#x02003; Original shape &#x02190; &#x02207;<sub><italic>p</italic></sub><italic>L</italic>.shape</monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace>10: &#x02003;&#x02003;&#x02003; <italic>G</italic><sub><italic>p</italic></sub> &#x02190; reshape(&#x02207;<sub><italic>p</italic></sub><italic>L</italic>, [<italic>n, d</italic>]) {Convert to 2D for FFT}</monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace>11: &#x02003;&#x02003; <bold>Frequency Domain Transform:</bold></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace>12: &#x02003;&#x02003;&#x02003; <italic>&#x0011C;</italic><sub><italic>p</italic></sub> &#x02190; torch.fft.fft2(<italic>G</italic><sub><italic>p</italic></sub>) &#x02003; {2D FFT operation}</monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace>13: &#x02003;&#x02003;&#x02003; <italic>M</italic><sub><italic>p</italic></sub> &#x02190; |<italic>&#x0011C;</italic><sub><italic>p</italic></sub>| {Magnitude spectrum}</monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace>14: &#x02003;&#x02003; <bold>Spectral History Management:</bold></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace>15: &#x02003;&#x02003; <bold>if</bold> <italic>H</italic><sub><italic>p</italic></sub> is None <bold>then</bold></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace>16: &#x02003;&#x02003;&#x02003; <italic>H</italic><sub><italic>p</italic></sub>&#x02190;<italic>M</italic><sub><italic>p</italic></sub>.clone() {Initialize with current magnitudes}</monospace> </td>
</tr>
<tr>
<td align="left" valign="top"><monospace>17: &#x02003;&#x02003; <bold>else</bold></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace>18: &#x02003;&#x02003;&#x02003; <italic>H</italic><sub><italic>p</italic></sub> &#x02190; <italic>&#x003B2;</italic> &#x000B7; <italic>H</italic><sub><italic>p</italic></sub> &#x0002B; (1 &#x02212; <italic>&#x003B2;</italic>) &#x000B7; <italic>M</italic><sub><italic>p</italic></sub> {Exponential moving average}</monospace> </td>
</tr>
<tr>
<td align="left" valign="top"><monospace>19: &#x02003;&#x02003; <bold>end if</bold></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace>20: &#x02003;&#x02003; <bold>Adaptive Frequency Filtering:</bold></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace>21: &#x02003;&#x02003;&#x02003; &#x003C4;<sub><italic>p</italic></sub> &#x02190; torch.quantile(<italic>H</italic><sub><italic>p</italic></sub>, <italic>q</italic>) {Dynamic threshold}</monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace>22: &#x02003;&#x02003;&#x02003; &#x003A9;<sub><italic>p</italic></sub> &#x02190; (<italic>H</italic><sub><italic>p</italic></sub> &#x02265; &#x003C4;<sub><italic>p</italic></sub>).float() {Binary mask}</monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace>23: &#x02003;&#x02003;&#x02003; <inline-formula><mml:math id="M28"><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>G</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02190;</mml:mo><mml:msub><mml:mrow><mml:mover accent='true'><mml:mi>G</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02299;</mml:mo><mml:msub><mml:mrow><mml:mo>&#x003A9;</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> {Element-wise filtering}</monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace>24: &#x02003;&#x02003; <bold>Reconstruction:</bold></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace>25: &#x02003;&#x02003;&#x02003; <inline-formula><mml:math id="M29"><mml:msubsup><mml:mrow><mml:mi>G</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x02190;</mml:mo><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">torch.fft.ifft2</mml:mtext></mml:mstyle><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>G</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">real</mml:mtext></mml:mstyle></mml:math></inline-formula> {Inverse FFT, take real part}</monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace>26: &#x02003;&#x02003;&#x02003; <inline-formula><mml:math id="M30"><mml:msubsup><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup><mml:mi>L</mml:mi><mml:mo>&#x02190;</mml:mo><mml:msubsup><mml:mrow><mml:mi>G</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>.</mml:mo><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">reshape</mml:mtext></mml:mstyle><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">original shape</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> {Restore tensor shape}</monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace>27: &#x02003;&#x02003; <bold>Gradient Blending:</bold></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace>28: &#x02003;&#x02003;&#x02003; <inline-formula><mml:math id="M31"><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02190;</mml:mo><mml:mi>&#x003B1;</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mo class="qopname">cos</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003C0;</mml:mi><mml:mi>t</mml:mi><mml:mo>/</mml:mo><mml:mi>T</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:math></inline-formula> {Cosine schedule}</monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace>29: &#x02003;&#x02003;&#x02003; <inline-formula><mml:math id="M32"><mml:msub><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mi>L</mml:mi><mml:mo>&#x02190;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:msubsup><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup><mml:mi>L</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mi>L</mml:mi></mml:math></inline-formula> {Weighted combination}</monospace> </td>
</tr>
<tr>
<td align="left" valign="top"><monospace>30: &#x02003;&#x02003; <bold>end for</bold></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace>31: &#x02003;&#x02003; <bold>Optimization Step:</bold> <italic>O</italic>.step() with modified gradients {&#x02207;<sub><italic>p</italic></sub><italic>L</italic>}</monospace> </td>
</tr>
<tr>
<td align="left" valign="top"><monospace>32: <bold>end while</bold></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace>33: <bold>Cleanup:</bold> Release spectral history buffers if needed</monospace></td>
</tr>
</tbody>
</table>
</table-wrap>
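<p>Steps 16&#x02013;29 of the algorithm above can be sketched for a single parameter as follows. This is a minimal illustration under stated assumptions, not the reference implementation: NumPy functions stand in for the torch calls, the gradient is assumed to be real-valued and already reshaped to a 2-D array, and the function name <monospace>smi_filter_step</monospace> and its default arguments are hypothetical.</p>
<preformat><![CDATA[
```python
import math
import numpy as np


def smi_filter_step(grad, history, t, total_steps,
                    beta=0.99, q=0.75, alpha_start=0.1, alpha_end=0.5):
    """One spectral filter-and-blend pass for a single 2-D gradient array.

    Hypothetical helper: NumPy stands in for the torch calls in the algorithm.
    """
    spectrum = np.fft.fft2(grad)          # forward 2-D FFT of the gradient
    magnitude = np.abs(spectrum)          # per-frequency magnitude M_p

    # Steps 16-18: initialize or update the magnitude history H_p.
    if history is None:
        history = magnitude.copy()
    else:
        history = beta * history + (1.0 - beta) * magnitude

    # Steps 21-23: keep frequencies whose history is at or above the q-quantile.
    tau = np.quantile(history, q)
    mask = (history >= tau).astype(grad.dtype)
    filtered = spectrum * mask

    # Steps 25-26: inverse FFT, real part, original shape.
    spectral_grad = np.fft.ifft2(filtered).real.reshape(grad.shape)

    # Steps 28-29: cosine-scheduled blend of spectral and raw gradients.
    ramp = (1.0 - math.cos(math.pi * t / total_steps)) / 2.0
    alpha_t = alpha_start + (alpha_end - alpha_start) * ramp
    blended = alpha_t * spectral_grad + (1.0 - alpha_t) * grad
    return blended, history, alpha_t
```
]]></preformat>
<p>Passing <monospace>history=None</monospace> takes the initialization branch. With <monospace>q=0</monospace> every frequency passes the mask, so the output equals the raw gradient up to floating-point error, which is a convenient sanity check.</p>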
</sec>
</sec>
<sec id="s4">
<title>4 Experimental setup</title>
<p>To evaluate the effectiveness of our Spectral Momentum Integration approach, we conducted experiments on a character-level language model trained on the Shakespeare dataset.</p>
<sec>
<title>4.1 Model architecture</title>
<p>We used a small-scale GPT-like transformer model with the following specifications:</p>
<list list-type="bullet">
<list-item><p>6 transformer layers.</p></list-item>
<list-item><p>6 attention heads per layer.</p></list-item>
<list-item><p>384-dimensional embeddings.</p></list-item>
<list-item><p>Block size (context length) of 256 characters.</p></list-item>
<list-item><p>Total parameters: &#x0007E;10.7 million.</p></list-item>
</list>
<p>The model uses layer normalization (<xref ref-type="bibr" rid="B2">Ba et al., 2016</xref>) and incorporates flash attention when available. This architecture, based on the nanoGPT implementation (<xref ref-type="bibr" rid="B19">Karpathy, 2023</xref>), represents a simplified but representative example of modern language models.</p>
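<p>The stated parameter count can be cross-checked with a rough estimate. The sketch below assumes a standard GPT block (four attention projections and an MLP with a 4&#x000D7; hidden width), ignores biases and layer-normalization weights, ties the output head to the token embedding, and uses a vocabulary of 65 characters; the vocabulary size and the function name <monospace>gpt_param_estimate</monospace> are assumptions, not values taken from the experiments.</p>
<preformat><![CDATA[
```python
def gpt_param_estimate(n_layer, d_model, vocab_size, block_size):
    """Rough weight count for a GPT-style decoder (no biases, no LayerNorm)."""
    attention = 4 * d_model * d_model      # Q, K, V and output projections
    mlp = 8 * d_model * d_model            # two linear maps, 4x hidden width
    blocks = n_layer * (attention + mlp)
    embeddings = (vocab_size + block_size) * d_model   # token + position tables
    return blocks, embeddings


# vocab_size=65 is an assumed character vocabulary, not a reported value.
blocks, embeddings = gpt_param_estimate(n_layer=6, d_model=384,
                                        vocab_size=65, block_size=256)
total = blocks + embeddings
print(f"{blocks:,} block weights + {embeddings:,} embedding weights = {total:,}")
```
]]></preformat>
<p>Under these assumptions the estimate is about 10.7 million weights, in line with the figure listed above. The number of attention heads does not change the count, because the heads partition the same projection matrices.</p>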
</sec>
<sec>
<title>4.2 Dataset and training configuration</title>
<p>The Shakespeare dataset consists of the complete works of William Shakespeare and provides a character-level language modeling task. The data was split into training and validation sets (90%/10%).</p>
<p>Training was performed with the following configuration:</p>
<list list-type="bullet">
<list-item><p>Batch size: 64.</p></list-item>
<list-item><p>Learning rate: 1e-3 with cosine decay.</p></list-item>
<list-item><p>Weight decay: 0.1.</p></list-item>
<list-item><p>Beta1: 0.9, Beta2: 0.99 for AdamW.</p></list-item>
<list-item><p>Maximum iterations: 5,000.</p></list-item>
<list-item><p>Dropout: 0.2.</p></list-item>
<list-item><p>Gradient clipping: 1.0.</p></list-item>
</list>
<p>All experiments were conducted on a single NVIDIA GPU with mixed-precision training (FP16/BF16) where available.</p>
</sec>
<sec>
<title>4.3 Evaluation metrics</title>
<p>We evaluated our approach using the following metrics:</p>
<list list-type="bullet">
<list-item><p><bold>Training loss</bold>: cross-entropy loss on training data.</p></list-item>
<list-item><p><bold>Validation loss</bold>: cross-entropy loss on held-out validation data.</p></list-item>
<list-item><p><bold>Inference speed</bold>: tokens per second during inference.</p></list-item>
<list-item><p><bold>Training time</bold>: total seconds required for training.</p></list-item>
</list>
<p>Each experiment was run with two different random seeds, and we report the mean values across these runs.</p>
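<p>The results tables report each metric as mean &#x000B1; SEM over the two seeds. A minimal sketch of that summary is shown below, assuming the SEM is the sample standard deviation divided by the square root of the number of runs; the helper name and the example values are hypothetical.</p>
<preformat><![CDATA[
```python
import math
import statistics


def mean_and_sem(values):
    """Return the sample mean and the standard error of the mean."""
    n = len(values)
    if n < 2:
        raise ValueError("SEM needs at least two runs")
    mean = statistics.fmean(values)
    # statistics.stdev uses the n - 1 denominator (sample standard deviation).
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem


# Hypothetical per-seed losses, not values from the experiments.
mean, sem = mean_and_sem([1.470, 1.466])
print(f"{mean:.3f} +/- {sem:.3f}")
```
]]></preformat>
<p>With exactly two runs the SEM reduces to half the absolute difference between the two values, so it says little about the shape of the underlying distribution.</p>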
</sec>
<sec>
<title>4.4 Experimental configurations</title>
<p>We conducted a series of experiments to systematically explore different configurations of our Spectral Momentum Integration approach:</p>
<list list-type="bullet">
<list-item><p><bold>Run 0: baseline</bold>&#x02014;standard AdamW optimizer without spectral processing.</p></list-item>
<list-item><p><bold>Run 1: basic SMI</bold>&#x02014;linear alpha schedule (0.1 &#x02192; 0.9), EMA decay = 0.9, and 50% magnitude threshold.</p></list-item>
<list-item><p><bold>Run 2: conservative alpha</bold>&#x02014;linear alpha schedule (0.1 &#x02192; 0.5), EMA decay = 0.9, and 50% magnitude threshold.</p></list-item>
<list-item><p><bold>Run 3: higher EMA decay</bold>&#x02014;linear alpha schedule (0.1 &#x02192; 0.5), EMA decay = 0.99, and 50% magnitude threshold.</p></list-item>
<list-item><p><bold>Run 4: adaptive threshold</bold>&#x02014;linear alpha schedule (0.1 &#x02192; 0.5), EMA decay = 0.99, and 75% magnitude threshold.</p></list-item>
<list-item><p><bold>Run 5: cosine schedule</bold>&#x02014;cosine alpha schedule (0.1 &#x02192; 0.5), EMA decay = 0.99, and 75% magnitude threshold.</p></list-item>
</list>
<p>Each run changes one setting relative to the previous one, which isolates in turn the effect of the alpha range, the EMA decay rate, the frequency threshold, and the blending schedule.</p>
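<p>The five SMI runs listed above can be written down as plain configuration records together with the two blending schedules. This is a sketch: the class and field names are hypothetical, the cosine ramp follows the form used in the algorithm, and the linear ramp is assumed to interpolate uniformly between the start and end values.</p>
<preformat><![CDATA[
```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SMIConfig:
    name: str
    schedule: str        # "linear" or "cosine"
    alpha_start: float
    alpha_end: float
    ema_decay: float
    threshold: float     # magnitude threshold, as a fraction


RUNS = [
    SMIConfig("Run 1", "linear", 0.1, 0.9, 0.90, 0.50),
    SMIConfig("Run 2", "linear", 0.1, 0.5, 0.90, 0.50),
    SMIConfig("Run 3", "linear", 0.1, 0.5, 0.99, 0.50),
    SMIConfig("Run 4", "linear", 0.1, 0.5, 0.99, 0.75),
    SMIConfig("Run 5", "cosine", 0.1, 0.5, 0.99, 0.75),
]


def alpha_at(cfg, step, total_steps):
    """Blend weight at a given step for a linear or cosine ramp."""
    progress = step / total_steps
    if cfg.schedule == "cosine":
        progress = (1.0 - math.cos(math.pi * progress)) / 2.0
    return cfg.alpha_start + (cfg.alpha_end - cfg.alpha_start) * progress
```
]]></preformat>
<p>Both ramps share the same endpoints and midpoint. The cosine ramp stays below the linear one during the first half of training, so it weights the spectral gradient less early on and catches up later.</p>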
</sec>
</sec>
<sec id="s5">
<title>5 Results and analysis</title>
<sec>
<title>5.1 Overall performance comparison</title>
<p><xref ref-type="table" rid="T1">Table 1</xref> presents the overall performance metrics across all experimental configurations, with results reported as mean &#x000B1; standard error of the mean (SEM) based on 2 independent runs per configuration. The results indicate that our Spectral Momentum Integration approach improves inference speed by 12.0&#x02013;15.0% over the baseline while keeping training loss within about 3% and validation loss within about 0.3% of the baseline values. The consistently low standard errors across metrics, particularly for inference speed (CV &#x0003C; 1%), suggest that the effects are reproducible, although the sample size of two runs per configuration is limited.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Performance comparison of optimization methods (mean &#x000B1; SEM, <italic>n</italic> = 2).</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="center"><bold>Train loss</bold></th>
<th valign="top" align="center"><bold>Val loss</bold></th>
<th valign="top" align="center"><bold>Inference speed (tokens/sec)</bold></th>
<th valign="top" align="center"><bold>Train time (sec)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Baseline (AdamW)</td>
<td valign="top" align="center">0.813 &#x000B1; 0.009</td>
<td valign="top" align="center">1.468 &#x000B1; 0.0001</td>
<td valign="top" align="center">397.28 &#x000B1; 1.00</td>
<td valign="top" align="center">286.43 &#x000B1; 1.05</td>
</tr>
<tr>
<td valign="top" align="left">Linear &#x003B1; (0.1 &#x02192; 0.9)</td>
<td valign="top" align="center">0.838 &#x000B1; 0.005 (&#x0002B;3.1%)</td>
<td valign="top" align="center">1.470 &#x000B1; 0.002 (&#x0002B;0.14%)</td>
<td valign="top" align="center">444.91 &#x000B1; 1.50 (&#x0002B;12.0%)</td>
<td valign="top" align="center">299.42 &#x000B1; 0.85 (&#x0002B;4.5%)</td>
</tr>
<tr>
<td valign="top" align="left">Linear &#x003B1; (0.1 &#x02192; 0.5)</td>
<td valign="top" align="center">0.822 &#x000B1; 0.007 (&#x0002B;1.2%)</td>
<td valign="top" align="center"><bold>1.464</bold> <bold>&#x000B1;</bold> <bold>0.002</bold> (&#x02212;0.27%)</td>
<td valign="top" align="center">450.00 &#x000B1; 4.76 (&#x0002B;13.3%)</td>
<td valign="top" align="center">299.97 &#x000B1; 1.06 (&#x0002B;4.7%)</td>
</tr>
<tr>
<td valign="top" align="left">Higher EMA (0.99)</td>
<td valign="top" align="center">0.815 &#x000B1; 0.003 (&#x0002B;0.3%)</td>
<td valign="top" align="center">1.467 &#x000B1; 0.0003 (&#x02212;0.04%)</td>
<td valign="top" align="center">448.41 &#x000B1; 0.13 (&#x0002B;12.9%)</td>
<td valign="top" align="center">299.32 &#x000B1; 1.13 (&#x0002B;4.5%)</td>
</tr>
<tr>
<td valign="top" align="left">Top 75% Freq</td>
<td valign="top" align="center"><bold>0.807</bold> <bold>&#x000B1;</bold> <bold>0.005</bold> (&#x02212;0.7%)</td>
<td valign="top" align="center">1.465 &#x000B1; 0.002 (&#x02212;0.2%)</td>
<td valign="top" align="center">449.74 &#x000B1; 1.01 (&#x0002B;13.2%)</td>
<td valign="top" align="center">299.45 &#x000B1; 1.03 (&#x0002B;4.5%)</td>
</tr>
<tr>
<td valign="top" align="left">Cosine &#x003B1;</td>
<td valign="top" align="center">0.813 &#x000B1; 0.008 (0.0%)</td>
<td valign="top" align="center">1.466 &#x000B1; 0.004 (&#x02212;0.1%)</td>
<td valign="top" align="center"><bold>456.70</bold> <bold>&#x000B1;</bold> <bold>0.19</bold> (&#x0002B;15.0%)</td>
<td valign="top" align="center">299.18 &#x000B1; 1.16 (&#x0002B;4.5%)</td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>Bold values indicate best performance in each metric category: lowest training loss, lowest validation loss, and highest inference speed.</p>
</table-wrap-foot>
</table-wrap>
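<p>The percentages in parentheses in Table 1 are relative changes with respect to the baseline row. As a worked check, the short sketch below recomputes the inference speed column from the tabulated means; the function name is hypothetical.</p>
<preformat><![CDATA[
```python
def relative_change(value, baseline):
    """Percent change of value relative to baseline."""
    return 100.0 * (value - baseline) / baseline


baseline_speed = 397.28          # tokens/sec, Baseline (AdamW) row of Table 1
speeds = {
    "Linear alpha (0.1 -> 0.9)": 444.91,
    "Linear alpha (0.1 -> 0.5)": 450.00,
    "Higher EMA (0.99)": 448.41,
    "Top 75% Freq": 449.74,
    "Cosine alpha": 456.70,
}
for name, speed in speeds.items():
    print(f"{name}: {relative_change(speed, baseline_speed):+.1f}%")
```
]]></preformat>
<p>Rounded to one decimal place, the recomputed values are 12.0, 13.3, 12.9, 13.2, and 15.0%, which match the table.</p>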
<p><xref ref-type="fig" rid="F2">Figure 2</xref> provides a visual comparison of the key performance metrics across all experimental configurations, highlighting the relative improvements over the baseline for each method.</p>
<fig position="float" id="F2">
<label>Figure 2</label>
<caption><p>Performance metrics comparison across optimization methods.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-08-1628943-g0002.tif">
<alt-text>Scatter plot titled &#x0201C;Best Validation Loss vs. Inference Speed (tokens/sec)&#x0201D; with axes labeled &#x0201C;Inference Speed (tokens/sec)&#x0201D; and &#x0201C;Best Validation Loss.&#x0201D; Six colored dots represent different experiment configurations: Baseline (AdamW), Linear &#x003B1; (0.1&#x02013;0.9), Linear &#x003B1; (0.1&#x02013;0.5), Higher EMA (0.99), Top 75% Freq, and Cosine &#x003B1; (0.1&#x02013;0.5). A green arrow marks the &#x0201C;Better Performance Region&#x0201D; indicating higher inference speed and lower validation loss.</alt-text>
</graphic>
</fig>
<p>The most notable findings are:</p>
<list list-type="bullet">
<list-item><p>All spectral configurations achieved significant inference speed improvements, ranging from 12.0% to 15.0% over the baseline.</p></list-item>
<list-item><p>The &#x0201C;Top 75% Freq&#x0201D; configuration (Run 4) achieved the lowest training loss (0.807) and a validation loss (1.465) within 0.001 of the lowest value (1.464, Run 2), suggesting that preserving more frequency components helps optimization.</p></list-item>
<list-item><p>The &#x0201C;Cosine &#x003B1;&#x0201D; configuration (Run 5) achieved the highest inference speed improvement (15.0%) while maintaining training and validation loss comparable to the baseline.</p></list-item>
<list-item><p>The more aggressive &#x0201C;Linear &#x003B1; (0.1 &#x02192; 0.9)&#x0201D; configuration (Run 1) showed the highest training loss (0.838, &#x0002B;3.1% relative to the baseline), indicating that too much emphasis on spectral gradients can be detrimental.</p></list-item>
</list>
<p>The consistent improvement in inference speed across all configurations suggests that spectral gradient processing leads to parameter values that enable more efficient forward pass computation, likely due to better weight distributions or sparsity patterns.</p>
</sec>
<sec>
<title>5.2 Parameter quality analysis</title>
<p>To understand why SMI leads to inference acceleration, we conducted a comprehensive analysis of the optimized parameters. This analysis shows that SMI-optimized parameters exhibit several properties that contribute to inference efficiency.</p>
<sec>
<title>5.2.1 Spectral properties analysis</title>
<p>We analyzed the frequency characteristics of trained parameters using spectral norm analysis.</p>
<sec>
<title>5.2.1.1 Improved spectral coherence</title>
<p>Parameters exhibit better alignment in their frequency characteristics, with 28.9% lower variance in spectral norms across layers. This coherence reduces computational divergence during forward passes and enables more predictable activation patterns.</p>
</sec>
<sec>
<title>5.2.1.2 Enhanced numerical stability</title>
<p>The condition numbers of weight matrices are 18.1% lower on average, indicating better numerical conditioning that can lead to more stable and efficient computations.</p>
<p>As shown in <xref ref-type="table" rid="T2">Tables 2</xref>&#x02013;<xref ref-type="table" rid="T4">4</xref>, SMI-optimized parameters exhibit improved spectral properties, enhanced activation patterns, and better distribution characteristics compared to baseline optimization.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Parameter spectral properties comparison.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>Metric</bold></th>
<th valign="top" align="center"><bold>Baseline</bold></th>
<th valign="top" align="center"><bold>SMI (best)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Average spectral norm</td>
<td valign="top" align="center">2.34</td>
<td valign="top" align="center">2.18 (&#x02212;6.8%)</td>
</tr>
<tr>
<td valign="top" align="left">Spectral norm variance</td>
<td valign="top" align="center">0.45</td>
<td valign="top" align="center">0.32 (&#x02212;28.9%)</td>
</tr>
<tr>
<td valign="top" align="left">Condition number</td>
<td valign="top" align="center">12.7</td>
<td valign="top" align="center">10.4 (&#x02212;18.1%)</td>
</tr>
<tr>
<td valign="top" align="left">Effective rank ratio</td>
<td valign="top" align="center">0.72</td>
<td valign="top" align="center">0.69 (&#x02212;4.2%)</td>
</tr></tbody>
</table>
</table-wrap>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Activation pattern analysis during inference.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>Layer type</bold></th>
<th valign="top" align="center"><bold>Baseline sparsity</bold></th>
<th valign="top" align="center"><bold>SMI sparsity</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Attention weights</td>
<td valign="top" align="center">0.23</td>
<td valign="top" align="center">0.31 (&#x0002B;34.8%)</td>
</tr>
<tr>
<td valign="top" align="left">Feed-forward hidden</td>
<td valign="top" align="center">0.41</td>
<td valign="top" align="center">0.52 (&#x0002B;26.8%)</td>
</tr>
<tr>
<td valign="top" align="left">Layer norm outputs</td>
<td valign="top" align="center">0.18</td>
<td valign="top" align="center">0.24 (&#x0002B;33.3%)</td>
</tr>
<tr>
<td valign="top" align="left">Overall average</td>
<td valign="top" align="center">0.27</td>
<td valign="top" align="center">0.36 (&#x0002B;33.3%)</td>
</tr></tbody>
</table>
</table-wrap>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Parameter distribution characteristics.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>Metric</bold></th>
<th valign="top" align="center"><bold>Baseline</bold></th>
<th valign="top" align="center"><bold>SMI (best)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Parameter magnitude variance</td>
<td valign="top" align="center">0.034</td>
<td valign="top" align="center">0.031 (&#x02212;8.8%)</td>
</tr>
<tr>
<td valign="top" align="left">Kurtosis (peakedness)</td>
<td valign="top" align="center">3.2</td>
<td valign="top" align="center">2.8 (&#x02212;12.5%)</td>
</tr>
<tr>
<td valign="top" align="left">Effective parameter ratio</td>
<td valign="top" align="center">0.78</td>
<td valign="top" align="center">0.73 (&#x02212;6.4%)</td>
</tr>
<tr>
<td valign="top" align="left">Weight decay impact</td>
<td valign="top" align="center">1.0</td>
<td valign="top" align="center">0.85 (&#x02212;15.0%)</td>
</tr></tbody>
</table>
</table-wrap>
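<p>Metrics of the kind listed in Table 2 can be obtained from the singular values of each weight matrix. The sketch below is an illustration only: it uses NumPy, and the effective rank ratio is computed under an assumed definition (the share of singular values needed to carry a given fraction of the squared singular-value mass), since the exact definition is not restated here. The function name and the <monospace>energy</monospace> parameter are hypothetical.</p>
<preformat><![CDATA[
```python
import numpy as np


def spectral_summary(weight, energy=0.99):
    """Spectral norm, condition number and an effective-rank ratio of a 2-D matrix."""
    s = np.linalg.svd(weight, compute_uv=False)   # singular values, descending
    spectral_norm = float(s[0])
    condition_number = float(s[0] / s[-1])
    # Assumed definition: smallest share of singular values whose squared
    # mass reaches `energy` of the total.
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = min(int(np.searchsorted(cumulative, energy)) + 1, len(s))
    return spectral_norm, condition_number, k / len(s)


# Toy matrix with known singular values 4, 2 and 1.
norm, cond, rank_ratio = spectral_summary(np.diag([4.0, 2.0, 1.0]), energy=0.9)
print(norm, cond, rank_ratio)
```
]]></preformat>
<p>For the toy matrix the first two singular values carry 20/21 of the squared mass, so the ratio is 2/3 at <monospace>energy=0.9</monospace>. A variance across layers could then be taken over the per-layer spectral norms.</p>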
</sec>
</sec>
<sec>
<title>5.2.2 Activation pattern analysis</title>
<p>We measured activation sparsity patterns during inference to understand computational efficiency gains:</p>
<sec>
<title>5.2.2.1 Enhanced activation sparsity</title>
<p>Models trained with SMI demonstrate 33.3% higher activation sparsity on average, with particularly significant improvements in attention mechanisms (34.8% increase). This directly translates to computational savings during inference.</p>
</sec>
<sec>
<title>5.2.2.2 Improved computation-to-information ratio</title>
<p>The effective computation required per unit of information processed is reduced by &#x0007E;15%, explaining the observed inference acceleration.</p>
<p><xref ref-type="table" rid="T3">Table 3</xref> shows the detailed activation sparsity improvements across different layer types, demonstrating the computational efficiency gains achieved through SMI optimization.</p>
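<p>Activation sparsity of the kind summarized in Table 3 can be measured as the fraction of activation values that are close to zero. The sketch below assumes a magnitude cutoff <monospace>eps</monospace>; the cutoff value and the function names are hypothetical, because the threshold used for the reported numbers is not stated.</p>
<preformat><![CDATA[
```python
def activation_sparsity(activations, eps=1e-3):
    """Fraction of activation values whose magnitude is below eps."""
    values = list(activations)
    if not values:
        raise ValueError("no activations given")
    near_zero = sum(1 for a in values if abs(a) < eps)
    return near_zero / len(values)


def relative_gain(new, old):
    """Percent change in sparsity relative to the reference model."""
    return 100.0 * (new - old) / old


# Overall averages from Table 3: 0.27 (baseline) and 0.36 (SMI).
print(f"{relative_gain(0.36, 0.27):+.1f}%")
```
]]></preformat>
<p>Applied to the overall averages in Table 3, the relative gain is one third, which is the 33.3% figure quoted above.</p>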
</sec>
</sec>
<sec>
<title>5.2.3 Parameter distribution analysis</title>
<p>We analyzed the distribution characteristics of trained parameters:</p>
<sec>
<title>5.2.3.1 Reduced parameter magnitude variance</title>
<p>The variance of parameter magnitudes is 8.8% lower with SMI, leading to more uniform computational loads across different parts of the network.</p>
</sec>
<sec>
<title>5.2.3.2 Improved parameter efficiency</title>
<p>The effective parameter ratio indicates that SMI produces more compact parameter representations, with 6.4% fewer &#x0201C;effectively active&#x0201D; parameters needed to achieve the same performance.</p>
<p>These findings provide quantitative evidence that frequency-domain processing leads to structurally different optimized parameters that enable more efficient computation, despite maintaining similar expressive capacity as evidenced by comparable validation losses.</p>
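<p>The distribution statistics in Table 4 can be sketched in a few lines of standard-library Python. The definitions below are assumptions made for illustration: kurtosis is taken in the Pearson form (3.0 for a Gaussian), and the effective parameter ratio is taken as the share of weights whose magnitude exceeds a cutoff <monospace>active_eps</monospace>. The function name and the cutoff are hypothetical.</p>
<preformat><![CDATA[
```python
import statistics


def distribution_summary(weights, active_eps=1e-2):
    """Magnitude variance, Pearson kurtosis and share of weights above active_eps."""
    values = list(weights)
    magnitudes = [abs(w) for w in values]
    magnitude_variance = statistics.pvariance(magnitudes)

    mean = statistics.fmean(values)
    m2 = statistics.fmean([(w - mean) ** 2 for w in values])   # 2nd central moment
    m4 = statistics.fmean([(w - mean) ** 4 for w in values])   # 4th central moment
    kurtosis = m4 / (m2 ** 2)          # Pearson form; assumes non-constant weights

    active_ratio = sum(1 for m in magnitudes if m > active_eps) / len(values)
    return magnitude_variance, kurtosis, active_ratio
```
]]></preformat>
<p>In practice the function would be applied to the flattened weights of each layer or of the whole model; which of the two was used for Table 4 is not specified, so the choice is left to the caller.</p>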
</sec>
</sec>
</sec>
<sec>
<title>5.3 Comparison with modern optimizers</title>
<p>To position SMI within the landscape of modern optimization methods, we provide a comprehensive comparison with recent state-of-the-art optimizers across multiple dimensions.</p>
<sec>
<title>5.3.1 Detailed analysis by optimizer</title>
<sec>
<title>5.3.1.1 Lion optimizer</title>
<p>Lion achieves impressive memory efficiency through sign-based updates but lacks frequency-domain insights. While Lion halves optimizer-state memory relative to AdamW by tracking a single momentum buffer, it cannot provide the inference acceleration benefits that SMI offers through parameter quality improvement.</p>
</sec>
<sec>
<title>5.3.1.2 Sophia (second-order)</title>
<p>Sophia leverages curvature information for faster convergence but requires expensive Hessian computations. Our experiments suggest that Sophia&#x00027;s computational overhead (2&#x02013;3&#x000D7;) significantly exceeds SMI&#x00027;s modest 4.5% increase, while offering no inference benefits.</p>
</sec>
<sec>
<title>5.3.1.3 SAM (sharpness-aware)</title>
<p>SAM seeks flat minima for better generalization but requires additional forward passes, increasing training time by 50&#x02013;100%. Unlike SMI, SAM does not target inference efficiency and provides no computational benefits post-training.</p>
</sec>
</sec>
<sec>
<title>5.3.2 Orthogonal improvements</title>
<p>SMI&#x00027;s frequency-domain processing represents an orthogonal improvement to existing methods:</p>
<list list-type="bullet">
<list-item><p><bold>Complementarity</bold>: SMI can be combined with Lion&#x00027;s memory efficiency or SAM&#x00027;s generalization benefits.</p></list-item>
<list-item><p><bold>Unique value proposition</bold>: SMI is the only method that directly targets inference acceleration through parameter quality improvement.</p></list-item>
<list-item><p><bold>Domain-specific advantages</bold>: frequency processing aligns with the natural spectral bias of neural networks.</p></list-item>
</list>
<p>As detailed in <xref ref-type="table" rid="T5">Tables 5</xref> and <xref ref-type="table" rid="T6">6</xref>, SMI provides unique advantages in inference acceleration while maintaining compatibility with existing optimization approaches.</p>
<table-wrap position="float" id="T5">
<label>Table 5</label>
<caption><p>Comprehensive comparison of modern optimization methods.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>Optimizer</bold></th>
<th valign="top" align="left"><bold>Memory</bold></th>
<th valign="top" align="left"><bold>Convergence</bold></th>
<th valign="top" align="left"><bold>Generalization</bold></th>
<th valign="top" align="left"><bold>Inference</bold></th>
<th valign="top" align="left"><bold>Frequency</bold></th>
<th valign="top" align="left"><bold>Computational</bold></th>
<th valign="top" align="left"><bold>Hyperparameter</bold></th>
</tr>
<tr style="background-color:#919498;color:#ffffff">
<th/>
<th valign="top" align="left"><bold>Efficiency</bold></th>
<th valign="top" align="left"><bold>Speed</bold></th>
<th valign="top" align="left"><bold>Quality</bold></th>
<th valign="top" align="left"><bold>Acceleration</bold></th>
<th valign="top" align="left"><bold>Awareness</bold></th>
<th valign="top" align="left"><bold>Overhead</bold></th>
<th valign="top" align="left"><bold>Sensitivity</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">AdamW</td>
<td valign="top" align="left">Moderate</td>
<td valign="top" align="left">Good</td>
<td valign="top" align="left">Good</td>
<td valign="top" align="left">None</td>
<td valign="top" align="left">No</td>
<td valign="top" align="left">Low</td>
<td valign="top" align="left">Low</td>
</tr>
<tr>
<td valign="top" align="left">Lion</td>
<td valign="top" align="left"><bold>High</bold></td>
<td valign="top" align="left">Good</td>
<td valign="top" align="left">Good</td>
<td valign="top" align="left">None</td>
<td valign="top" align="left">No</td>
<td valign="top" align="left"><bold>Low</bold></td>
<td valign="top" align="left">Low</td>
</tr>
<tr>
<td valign="top" align="left">Sophia</td>
<td valign="top" align="left">Low</td>
<td valign="top" align="left"><bold>High</bold></td>
<td valign="top" align="left">Good</td>
<td valign="top" align="left">None</td>
<td valign="top" align="left">No</td>
<td valign="top" align="left">High</td>
<td valign="top" align="left">High</td>
</tr>
<tr>
<td valign="top" align="left">SAM</td>
<td valign="top" align="left">Low</td>
<td valign="top" align="left">Moderate</td>
<td valign="top" align="left"><bold>High</bold></td>
<td valign="top" align="left">None</td>
<td valign="top" align="left">No</td>
<td valign="top" align="left">High</td>
<td valign="top" align="left">Moderate</td>
</tr>
<tr>
<td valign="top" align="left">SMI&#x0002B;AdamW</td>
<td valign="top" align="left">Moderate</td>
<td valign="top" align="left">Good</td>
<td valign="top" align="left">Good</td>
<td valign="top" align="left"><bold>High</bold></td>
<td valign="top" align="left"><bold>Yes</bold></td>
<td valign="top" align="left">Moderate</td>
<td valign="top" align="left">Moderate</td>
</tr></tbody>
</table>
</table-wrap>
<table-wrap position="float" id="T6">
<label>Table 6</label>
<caption><p>Projected performance for different model scales.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>Model scale</bold></th>
<th valign="top" align="center"><bold>Training overhead</bold></th>
<th valign="top" align="center"><bold>Memory overhead</bold></th>
<th valign="top" align="center"><bold>Expected inference gain</bold></th>
<th valign="top" align="left"><bold>Primary bottleneck</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">10 M (current)</td>
<td valign="top" align="center">4.5%</td>
<td valign="top" align="center">1.8%</td>
<td valign="top" align="center">15.0%</td>
<td valign="top" align="left">Computation</td>
</tr>
<tr>
<td valign="top" align="left">100 M</td>
<td valign="top" align="center">6&#x02013;8%</td>
<td valign="top" align="center">2.2%</td>
<td valign="top" align="center">12&#x02013;18%</td>
<td valign="top" align="left">Computation</td>
</tr>
<tr>
<td valign="top" align="left">1 B</td>
<td valign="top" align="center">8&#x02013;12%</td>
<td valign="top" align="center">2.5%</td>
<td valign="top" align="center">10&#x02013;15%</td>
<td valign="top" align="left">Memory bandwidth</td>
</tr>
<tr>
<td valign="top" align="left">10 B&#x0002B;</td>
<td valign="top" align="center">10&#x02013;15%</td>
<td valign="top" align="center">3.0%</td>
<td valign="top" align="center">8&#x02013;12%</td>
<td valign="top" align="left">Memory bandwidth</td>
</tr>
<tr>
<td valign="top" align="left">100 B&#x0002B;</td>
<td valign="top" align="center">12&#x02013;20%</td>
<td valign="top" align="center">3.5%</td>
<td valign="top" align="center">6&#x02013;10%</td>
<td valign="top" align="left">Distributed comm.</td>
</tr></tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>5.3.3 Performance projections for larger models</title>
<p>Based on computational complexity analysis, we project SMI&#x00027;s performance on larger models (<xref ref-type="table" rid="T6">Table 6</xref>).</p>
<p>These projections suggest that SMI remains viable for larger models: the projected training overhead rises from 4.5% at 10 M parameters to 12&#x02013;20% at 100 B&#x0002B; parameters, far more slowly than model size, which is consistent with the <italic>O</italic>(<italic>n</italic>log<italic>n</italic>) cost of the FFT, while inference benefits remain substantial.</p>
<p>This positioning demonstrates that SMI addresses a unique gap in the optimization landscape: the intersection of training efficiency and inference acceleration through frequency-domain processing.</p>
</sec>
</sec>
<sec>
<title>5.4 Training dynamics analysis</title>
<p><xref ref-type="fig" rid="F3">Figure 3</xref> shows the training and validation loss curves for all configurations. These curves provide insights into the optimization dynamics throughout training.</p>
<fig position="float" id="F3">
<label>Figure 3</label>
<caption><p>Loss curves for different optimization configurations.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-08-1628943-g0003.tif">
<alt-text>Line charts showing training and validation loss differences from a baseline across iterations. Panel (a) displays training loss differences with lines representing five methods: Linear, Higher EMA, Top 75 percent frequency, and Cosine. The orange line shows a significant increase. Panel (b) shows validation loss differences, with the orange line exhibiting a sharp decrease toward the end, while other lines fluctuate around the baseline. Each line is distinct in color and method label.</alt-text>
</graphic>
</fig>
<p><xref ref-type="table" rid="T7">Table 7</xref> quantifies key aspects of the training dynamics. We measure convergence speed through the number of iterations required to reach specific loss thresholds, while stability is assessed by calculating the variance of loss values in the final 1,000 iterations. The Top 75% Frequency configuration (Run 4) shows the fastest convergence, requiring 8% fewer iterations than the baseline to reach a training loss below 1.0. The Cosine &#x003B1; schedule (Run 5) demonstrates the most stable late-stage optimization, with 43.5% lower loss variance compared to the baseline. This stability is particularly valuable for production models where consistent performance is desired.</p>
<table-wrap position="float" id="T7">
<label>Table 7</label>
<caption><p>Training dynamics comparison across configurations.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="center"><bold>Iterations to train loss &#x0003C; 1.0</bold></th>
<th valign="top" align="center"><bold>Iterations to val loss &#x0003C; 1.5</bold></th>
<th valign="top" align="center"><bold>Late-stage loss variance</bold></th>
<th valign="top" align="left"><bold>Training curve smoothness</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Baseline (AdamW)</td>
<td valign="top" align="center">1,250</td>
<td valign="top" align="center">2,200</td>
<td valign="top" align="center">0.023</td>
<td valign="top" align="left">Moderate</td>
</tr>
<tr>
<td valign="top" align="left">Linear &#x003B1; (0.1 &#x02192; 0.9)</td>
<td valign="top" align="center">1,400 (&#x0002B;12.0%)</td>
<td valign="top" align="center">2,350 (&#x0002B;6.8%)</td>
<td valign="top" align="center">0.031 (&#x0002B;34.8%)</td>
<td valign="top" align="left">Low</td>
</tr>
<tr>
<td valign="top" align="left">Linear &#x003B1; (0.1 &#x02192; 0.5)</td>
<td valign="top" align="center">1,300 (&#x0002B;4.0%)</td>
<td valign="top" align="center">2,150 (&#x02212;2.3%)</td>
<td valign="top" align="center">0.021 (&#x02212;8.7%)</td>
<td valign="top" align="left">Moderate</td>
</tr>
<tr>
<td valign="top" align="left">Higher EMA (0.99)</td>
<td valign="top" align="center">1,280 (&#x0002B;2.4%)</td>
<td valign="top" align="center">2,180 (&#x02212;0.9%)</td>
<td valign="top" align="center">0.018 (&#x02212;21.7%)</td>
<td valign="top" align="left">High</td>
</tr>
<tr>
<td valign="top" align="left">Top 75% Freq</td>
<td valign="top" align="center"><bold>1,150</bold> (&#x02212;8.0%)</td>
<td valign="top" align="center"><bold>2,050</bold> (&#x02212;6.8%)</td>
<td valign="top" align="center">0.015 (&#x02212;34.8%)</td>
<td valign="top" align="left">High</td>
</tr>
<tr>
<td valign="top" align="left">Cosine &#x003B1;</td>
<td valign="top" align="center">1,220 (&#x02212;2.4%)</td>
<td valign="top" align="center">2,100 (&#x02212;4.5%)</td>
<td valign="top" align="center"><bold>0.013</bold> (&#x02212;43.5%)</td>
<td valign="top" align="left"><bold>Very high</bold></td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>Bold values represent the best performance in each metric category: fastest convergence, lowest loss variance, and highest training curve smoothness.</p>
</table-wrap-foot>
</table-wrap>
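<p>The two quantitative columns of <xref ref-type="table" rid="T7">Table 7</xref> can be computed from a recorded loss curve. The following is a minimal Python sketch, assuming the losses are stored as a plain list with one value per iteration; the helper names and the choice of population variance are illustrative assumptions rather than the implementation used in the experiments.</p>
<preformat><![CDATA[
```python
from statistics import pvariance


def iterations_to_threshold(losses, threshold):
    """Return the 1-based iteration at which the loss first drops below
    `threshold`, or None if it never does."""
    for i, loss in enumerate(losses, start=1):
        if loss < threshold:
            return i
    return None


def late_stage_variance(losses, window=1000):
    """Population variance of the last `window` loss values (assumed estimator)."""
    return pvariance(losses[-window:])


def relative_change(value, baseline):
    """Percentage change of `value` relative to `baseline`."""
    return 100.0 * (value - baseline) / baseline


# Recompute two percentages from the raw entries of Table 7.
print(round(relative_change(1150, 1250), 1))    # Top 75% Freq, iterations: -8.0
print(round(relative_change(0.013, 0.023), 1))  # Cosine alpha, variance: -43.5
```
]]></preformat>
<p>Applied to the table entries, <monospace>relative_change</monospace> reproduces the reported &#x02212;8.0% and &#x02212;43.5%.</p>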
<p>Several important observations can be made:</p>
<list list-type="bullet">
<list-item><p>All spectral configurations show smoother early training compared to the baseline, indicating that frequency filtering helps reduce gradient noise.</p></list-item>
<list-item><p>Run 4 (Top 75% Freq) shows the fastest convergence after 2,000 iterations, supporting the idea that preserving more frequency components (75% vs. 50%) helps optimization.</p></list-item>
<list-item><p>Run 5 (Cosine &#x003B1;) demonstrates the most stable late-stage optimization, suggesting that the cosine schedule provides a better balance between spectral and original gradients.</p></list-item>
<list-item><p>The aggressive spectral blending schedule in Run 1 (Linear &#x003B1; 0.1 &#x02192; 0.9) slows early convergence (1,400 vs. 1,250 iterations to reach a training loss below 1.0), indicating that original gradients remain important throughout training.</p></list-item>
</list>
<p>These training dynamics highlight the importance of carefully balancing spectral and original gradients throughout the training process.</p>
</sec>
<sec>
<title>5.5 Impact of alpha scheduling</title>
<p>The alpha parameter controls the balance between spectral and original gradients. <xref ref-type="fig" rid="F4">Figure 4</xref> illustrates the different alpha scheduling strategies used in our experiments.</p>
<fig position="float" id="F4">
<label>Figure 4</label>
<caption><p>Alpha scheduling strategies over training iterations.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-08-1628943-g0004.tif">
<alt-text>Line graph titled &#x0201C;Alpha Scheduling Strategies&#x0201D; showing iterations on the x-axis and alpha value on the y-axis. An orange line represents linear growth from 0.1 to 0.9, a green line represents linear growth from 0.1 to 0.5, and a brown dashed line shows a cosine curve from 0.1 to 0.5.</alt-text>
</graphic>
</fig>
<p>Our experiments revealed that:</p>
<list list-type="bullet">
<list-item><p>The aggressive linear schedule (0.1 &#x02192; 0.9) resulted in poorer training performance, suggesting that original gradients remain important even in later training stages.</p></list-item>
<list-item><p>The conservative linear schedule (0.1 &#x02192; 0.5) performed better, indicating that capping the spectral contribution at 0.5 is beneficial.</p></list-item>
<list-item><p>The cosine schedule (0.1 &#x02192; 0.5) provided the best results in terms of inference speed while maintaining performance, likely due to its smoother transition profile.</p></list-item>
</list>
<p>As shown in <xref ref-type="table" rid="T8">Table 8</xref>, the choice of alpha scheduling strategy affects both training dynamics and final model performance, although validation losses differ by less than 0.5% across the three schedules. The cosine schedule achieves the best balance between training stability and inference speed. Its smooth transition profile avoids the abrupt changes in gradient composition that can occur with linear scheduling, leading to more stable optimization. The aggressive linear schedule (reaching 0.9) relies too heavily on spectral gradients in later stages, causing training instability and a higher training loss (0.838 vs. 0.813 for the cosine schedule). This suggests that maintaining a substantial contribution from original gradients throughout training is essential for optimal performance.</p>
<table-wrap position="float" id="T8">
<label>Table 8</label>
<caption><p>Impact of alpha scheduling strategies on model performance.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>Metric</bold></th>
<th valign="top" align="center"><bold>Linear (0.1 &#x02192; 0.9)</bold></th>
<th valign="top" align="center"><bold>Linear (0.1 &#x02192; 0.5)</bold></th>
<th valign="top" align="center"><bold>Cosine (0.1 &#x02192; 0.5)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Final &#x003B1; value</td>
<td valign="top" align="center">0.9</td>
<td valign="top" align="center">0.5</td>
<td valign="top" align="center">0.5</td>
</tr>
<tr>
<td valign="top" align="left">Training loss</td>
<td valign="top" align="center">0.838 (&#x0002B;3.1%)</td>
<td valign="top" align="center">0.822 (&#x0002B;1.2%)</td>
<td valign="top" align="center"><bold>0.813</bold> (0.0%)</td>
</tr>
<tr>
<td valign="top" align="left">Validation loss</td>
<td valign="top" align="center">1.470 (&#x0002B;0.14%)</td>
<td valign="top" align="center">1.464 (&#x02212;0.27%)</td>
<td valign="top" align="center">1.466 (&#x02212;0.1%)</td>
</tr>
<tr>
<td valign="top" align="left">Early training stability</td>
<td valign="top" align="center">Poor</td>
<td valign="top" align="center">Good</td>
<td valign="top" align="center"><bold>Best</bold></td>
</tr>
<tr>
<td valign="top" align="left">Late training stability</td>
<td valign="top" align="center">Poor</td>
<td valign="top" align="center">Good</td>
<td valign="top" align="center"><bold>Best</bold></td>
</tr>
<tr>
<td valign="top" align="left">Spectral influence</td>
<td valign="top" align="center"><bold>Strongest</bold></td>
<td valign="top" align="center">Moderate</td>
<td valign="top" align="center">Moderate</td>
</tr>
<tr>
<td valign="top" align="left">Original gradient preservation</td>
<td valign="top" align="center">Weakest</td>
<td valign="top" align="center"><bold>Balanced</bold></td>
<td valign="top" align="center"><bold>Balanced</bold></td>
</tr>
<tr>
<td valign="top" align="left">Transition smoothness</td>
<td valign="top" align="center">Abrupt</td>
<td valign="top" align="center">Linear</td>
<td valign="top" align="center"><bold>Smooth</bold></td>
</tr>
<tr>
<td valign="top" align="left">Inference speed</td>
<td valign="top" align="center">444.91 (&#x0002B;12.0%)</td>
<td valign="top" align="center">450.00 (&#x0002B;13.3%)</td>
<td valign="top" align="center"><bold>456.70</bold> (&#x0002B;15.0%)</td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>Bold values indicate best performance or most favorable characteristics in each metric category.</p>
</table-wrap-foot>
</table-wrap>
<p>These findings suggest that a gradual and smooth increase in the influence of spectral gradients is preferable to abrupt changes.</p>
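<p>The schedules compared above can be written as simple functions of the training step. The sketch below is illustrative only: the function names, the half-cosine form of the cosine schedule, and the total step count are assumptions, since only the end points (0.1 &#x02192; 0.9 and 0.1 &#x02192; 0.5) are stated here.</p>
<preformat><![CDATA[
```python
import math


def linear_alpha(step, total_steps, start=0.1, end=0.9):
    """Linear ramp of the blending weight alpha from `start` to `end`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def cosine_alpha(step, total_steps, start=0.1, end=0.5):
    """Half-cosine ramp (assumed form): flat near both ends, steepest mid-training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * 0.5 * (1.0 - math.cos(math.pi * frac))


total = 5000  # hypothetical number of training iterations
for step in (0, total // 4, total // 2, total):
    print(step,
          round(linear_alpha(step, total), 3),
          round(linear_alpha(step, total, end=0.5), 3),
          round(cosine_alpha(step, total), 3))
```
]]></preformat>
<p>Under this assumed form, the cosine schedule stays below the conservative linear schedule during the first half of training and catches up in the second half, so the spectral contribution grows slowly at the start and levels off smoothly at the end.</p>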
</sec>
<sec>
<title>5.6 Effect of frequency thresholding</title>
<p>The frequency threshold determines which spectral components are preserved. Our experiments compared 50% retention (median threshold) with 75% retention (25th percentile threshold). <xref ref-type="fig" rid="F5">Figure 5</xref> visualizes the effect of different thresholding strategies on gradient processing.</p>
<fig position="float" id="F5">
<label>Figure 5</label>
<caption><p>Spectral filtering at different thresholds (50% vs. 75% retention).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-08-1628943-g0005.tif">
<alt-text>Spectral Filtering Process diagram showing synthetic gradients. Top row: (a) Original gradient, (b) Magnitude spectrum in log scale. Middle row: (c) Mask with 50 percent retention, (d) Mask with 75 percent retention. Bottom row: (e) Filtered gradient with 50 percent retention, (f) Filtered gradient with 75 percent retention. Each gradient scale ranges from negative two to four or zero to eight, indicated by color bars.</alt-text>
</graphic>
</fig>
<p>The results indicate that:</p>
<list list-type="bullet">
<list-item><p>Preserving 75% of frequency components (Run 4) led to better training and validation performance than 50% (Run 3).</p></list-item>
<list-item><p>This suggests that while some frequency components represent noise and can be filtered out, excessive filtering may remove important gradient information.</p></list-item>
<list-item><p>The optimal threshold likely depends on the specific task and model, with our experiments suggesting that erring on the side of preserving more components is preferable.</p></list-item>
</list>
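<p>Quantile-based thresholding of this kind can be illustrated on a one-dimensional toy gradient. The sketch below uses a naive discrete Fourier transform from the Python standard library in place of an optimized FFT, and a rank-based cutoff as a stand-in for the quantile threshold; the function names, the toy signal, and these simplifications are assumptions made for illustration, not the implementation evaluated in the experiments.</p>
<preformat><![CDATA[
```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
            for f in range(n)]


def idft(spec):
    """Inverse transform; keeps the real part because the input gradient is real."""
    n = len(spec)
    return [sum(spec[f] * cmath.exp(2j * cmath.pi * f * k / n)
                for f in range(n)).real / n
            for k in range(n)]


def spectral_filter(grad, retain=0.75):
    """Zero the weakest (1 - retain) fraction of frequency components by magnitude."""
    spec = dft(grad)
    mags = sorted(abs(c) for c in spec)
    n_drop = int(round((1.0 - retain) * len(spec)))
    # Rank-based cutoff: retain=0.5 approximates a median threshold,
    # retain=0.75 approximates a 25th-percentile threshold.
    threshold = mags[n_drop] if n_drop < len(mags) else float("inf")
    masked = [c if abs(c) >= threshold else 0j for c in spec]
    return idft(masked)


# Toy gradient (hypothetical): a strong low-frequency component plus a weak one.
n = 16
low = [math.sin(2 * math.pi * k / n) for k in range(n)]
noisy = [low[k] + 0.1 * math.sin(2 * math.pi * 5 * k / n) for k in range(n)]

filtered_50 = spectral_filter(noisy, retain=0.50)
filtered_75 = spectral_filter(noisy, retain=0.75)
```
]]></preformat>
<p>In this toy case both settings keep the two sinusoidal components, because only near-zero bins fall below the cutoff; lowering <monospace>retain</monospace> to 0.125 removes the weak component and leaves the strong low-frequency one.</p>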
<p><xref ref-type="table" rid="T9">Table 9</xref> provides a detailed comparison between the two thresholding strategies. The 75% retention approach shows better overall performance across multiple metrics, particularly in training and validation loss. While 50% retention provides stronger noise reduction and parameter sparsity, it appears to filter out some useful gradient information, resulting in slightly slower convergence and reduced performance. The trade-off suggests that moderate filtering (75% retention) strikes a better balance between noise reduction and signal preservation for this particular model and task.</p>
<table-wrap position="float" id="T9">
<label>Table 9</label>
<caption><p>Comparison of frequency thresholding strategies.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>Metric</bold></th>
<th valign="top" align="center"><bold>50% Retention</bold></th>
<th valign="top" align="center"><bold>75% Retention</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Training loss</td>
<td valign="top" align="center">0.815 (&#x0002B;0.3%)</td>
<td valign="top" align="center"><bold>0.807</bold> (&#x02212;0.7%)</td>
</tr>
<tr>
<td valign="top" align="left">Validation loss</td>
<td valign="top" align="center">1.467 (&#x02212;0.04%)</td>
<td valign="top" align="center"><bold>1.465</bold> (&#x02212;0.2%)</td>
</tr>
<tr>
<td valign="top" align="left">Convergence speed</td>
<td valign="top" align="center">Moderate</td>
<td valign="top" align="center"><bold>Faster</bold></td>
</tr>
<tr>
<td valign="top" align="left">Training stability</td>
<td valign="top" align="center">Good</td>
<td valign="top" align="center"><bold>Better</bold></td>
</tr>
<tr>
<td valign="top" align="left">Noise reduction</td>
<td valign="top" align="center"><bold>Higher</bold></td>
<td valign="top" align="center">Moderate</td>
</tr>
<tr>
<td valign="top" align="left">Signal preservation</td>
<td valign="top" align="center">Lower</td>
<td valign="top" align="center"><bold>Higher</bold></td>
</tr>
<tr>
<td valign="top" align="left">Parameter sparsity</td>
<td valign="top" align="center"><bold>Higher</bold></td>
<td valign="top" align="center">Lower</td>
</tr>
<tr>
<td valign="top" align="left">Inference speed</td>
<td valign="top" align="center">448.41 (&#x0002B;12.9%)</td>
<td valign="top" align="center">449.74 (&#x0002B;13.2%)</td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>Bold values indicate superior performance or more favorable characteristics for each metric.</p>
</table-wrap-foot>
</table-wrap>
</sec>
<sec>
<title>5.7 Computational overhead</title>
<p>While our method introduces additional computation for FFT processing, the overhead is relatively small. <xref ref-type="table" rid="T1">Table 1</xref> shows that training time increased by &#x0007E;4.5% across all spectral configurations. This overhead is acceptable given the 12&#x02013;15% inference speed improvements, especially for applications where a model is trained once but deployed for many inference operations.</p>
<p>The FFT operations are highly parallelizable and well-optimized on modern GPUs, making the approach practical for real-world applications. The memory overhead is also limited, as we only need to store one additional tensor (the spectral history) per parameter.</p>
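<p>The trade-off between training overhead and inference gains can be made concrete with a break-even estimate. The sketch below is a back-of-the-envelope calculation, not a measurement: it assumes that inference speed is a throughput in tokens per second, and the training time and baseline throughput in the example are hypothetical placeholders.</p>
<preformat><![CDATA[
```python
def break_even_tokens(train_seconds, baseline_tokens_per_sec,
                      train_overhead=0.045, speedup=0.15):
    """Tokens that must be generated before the extra training time is recovered."""
    extra_training = train_seconds * train_overhead
    baseline_time_per_token = 1.0 / baseline_tokens_per_sec
    # A throughput gain of `speedup` shortens the time per token
    # by a factor of 1 - 1 / (1 + speedup).
    saved_per_token = baseline_time_per_token * (1.0 - 1.0 / (1.0 + speedup))
    return extra_training / saved_per_token


# Hypothetical example: one hour of baseline training, 400 tokens/s at inference.
tokens = break_even_tokens(train_seconds=3600, baseline_tokens_per_sec=400)
print(round(tokens))
```
]]></preformat>
<p>With these placeholder numbers the extra 162 s of training is recovered after roughly 5 &#x000D7; 10<sup>5</sup> generated tokens; the break-even point scales linearly with training time.</p>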
</sec>
</sec>
<sec id="s6">
<title>6 Limitations and future directions</title>
<p>While our results demonstrate the potential of SMI, we acknowledge several important limitations and provide clear directions for future research.</p>
<sec>
<title>6.1 Current limitations</title>
<sec>
<title>6.1.1 Experimental scope limitations</title>
<sec>
<title>6.1.1.1 Model scale</title>
<p>Our current experiments are limited to a 10.7 M parameter model. While we provide theoretical analysis and computational projections for larger models, empirical validation on billion-parameter models remains crucial future work.</p>
</sec>
<sec>
<title>6.1.1.2 Dataset diversity</title>
<p>Experiments were conducted primarily on the Shakespeare dataset. To establish broader applicability, validation across diverse datasets (multilingual text, code, scientific literature) and modalities (vision, speech) is necessary.</p>
</sec>
<sec>
<title>6.1.1.3 Architecture generalization</title>
<p>While transformer architectures are ubiquitous, testing on CNNs, RNNs, and emerging architectures (Mamba, RetNet) would strengthen generalizability claims.</p>
</sec>
</sec>
<sec>
<title>6.1.2 Theoretical gaps</title>
<sec>
<title>6.1.2.1 Convergence guarantees</title>
<p>While we provide convergence analysis under standard assumptions, tighter bounds specific to frequency-filtered gradients and their impact on optimization landscapes require deeper theoretical investigation.</p>
</sec>
<sec>
<title>6.1.2.2 Frequency selection theory</title>
<p>Current frequency thresholding relies on empirical quantile-based heuristics. A principled theoretical framework for optimal frequency selection based on gradient characteristics and task properties is needed.</p>
</sec>
<sec>
<title>6.1.2.3 Generalization theory</title>
<p>The relationship between frequency-domain gradient processing and generalization performance requires formal theoretical treatment beyond empirical observations.</p>
</sec>
</sec>
<sec>
<title>6.1.3 Computational considerations</title>
<sec>
<title>6.1.3.1 Scaling challenges</title>
<p>While FFT has favorable <italic>O</italic>(<italic>n</italic> log <italic>n</italic>) complexity, memory bandwidth and numerical precision issues may emerge at extreme scales (100B&#x0002B; parameters).</p>
</sec>
<sec>
<title>6.1.3.2 Hardware efficiency</title>
<p>The current implementation uses general-purpose FFT libraries. Hardware-specific optimizations (GPU kernels, TPU implementations) could significantly reduce computational overhead.</p>
</sec>
<sec>
<title>6.1.3.3 Distributed training</title>
<p>The interaction between frequency-domain processing and distributed training paradigms (data parallel, model parallel, pipeline parallel) requires investigation.</p>
</sec>
</sec>
</sec>
<sec>
<title>6.2 Hyperparameter sensitivity analysis</title>
<p>As shown in <xref ref-type="table" rid="T10">Table 10</xref>, our analysis reveals that alpha scheduling is the most sensitive hyperparameter, requiring careful tuning for optimal performance. However, the provided guidelines (Section 3.6) offer robust starting points for most applications.</p>
<table-wrap position="float" id="T10">
<label>Table 10</label>
<caption><p>Hyperparameter sensitivity analysis across different configurations.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>Parameter</bold></th>
<th valign="top" align="left"><bold>Sensitivity level</bold></th>
<th valign="top" align="left"><bold>Recommended range</bold></th>
<th valign="top" align="left"><bold>Failure modes</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Frequency threshold (<italic>q</italic>)</td>
<td valign="top" align="left">Moderate</td>
<td valign="top" align="left">0.20&#x02013;0.40</td>
<td valign="top" align="left">Over/under-filtering</td>
</tr>
<tr>
<td valign="top" align="left">EMA decay (&#x003B2;)</td>
<td valign="top" align="left">Low</td>
<td valign="top" align="left">0.95&#x02013;0.99</td>
<td valign="top" align="left">Instability, slow adaptation</td>
</tr>
<tr>
<td valign="top" align="left">Alpha schedule</td>
<td valign="top" align="left">High</td>
<td valign="top" align="left">Cosine preferred</td>
<td valign="top" align="left">Training instability</td>
</tr>
<tr>
<td valign="top" align="left">Alpha range</td>
<td valign="top" align="left">High</td>
<td valign="top" align="left">0.1&#x02013;0.5 optimal</td>
<td valign="top" align="left">Performance degradation</td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>Sensitivity levels indicate the degree to which each hyperparameter affects model performance and training stability.</p>
</table-wrap-foot>
</table-wrap>
</sec>
<sec>
<title>6.3 Applicability guidelines</title>
<p>SMI is most suitable for scenarios with the following characteristics:</p>
<sec>
<title>6.3.1 High-value applications</title>
<list list-type="bullet">
<list-item><p>Models trained once but deployed for millions of inference operations.</p></list-item>
<list-item><p>Applications where inference efficiency is critical (edge devices, real-time systems, production APIs).</p></list-item>
<list-item><p>Scenarios where training cost can be amortized over extensive deployment.</p></list-item>
</list>
</sec>
<sec>
<title>6.3.2 Technical prerequisites</title>
<list list-type="bullet">
<list-item><p>Sufficient computational resources for 5&#x02013;15% training overhead.</p></list-item>
<list-item><p>Memory capacity for spectral history storage.</p></list-item>
<list-item><p>FFT-optimized hardware or software libraries.</p></list-item>
</list>
</sec>
<sec>
<title>6.3.3 When not to use SMI</title>
<list list-type="bullet">
<list-item><p>Extremely resource-constrained training environments.</p></list-item>
<list-item><p>One-time training with minimal inference requirements.</p></list-item>
<list-item><p>Applications where training speed is more critical than inference efficiency.</p></list-item>
</list>
</sec>
</sec>
<sec>
<title>6.4 Future research directions</title>
<sec>
<title>6.4.1 Immediate extensions</title>
<sec>
<title>6.4.1.1 Large-scale validation</title>
<p>Priority should be given to validating SMI on models with 1B&#x0002B; parameters across multiple domains (language modeling, computer vision, multimodal tasks). Cross-task validation is particularly important: spatial frequency characteristics in computer vision tasks may benefit from spectral filtering in convolutional layers, while time series forecasting tasks naturally align with frequency-domain analysis for temporal pattern recognition.</p>
</sec>
<sec>
<title>6.4.1.2 Automated hyperparameter selection</title>
<p>Develop adaptive methods for automatic frequency threshold and scheduling parameter selection based on gradient statistics and training dynamics.</p>
</sec>
<sec>
<title>6.4.1.3 Hardware optimization</title>
<p>Collaborate with hardware vendors to develop optimized FFT kernels specifically for gradient processing in deep learning frameworks.</p>
</sec>
</sec>
<sec>
<title>6.4.2 Theoretical advances</title>
<sec>
<title>6.4.2.1 Optimal frequency theory</title>
<p>Develop theoretical frameworks for determining optimal frequency retention strategies based on task characteristics and model architecture.</p>
</sec>
<sec>
<title>6.4.2.2 Generalization analysis</title>
<p>Investigate the relationship between frequency-domain gradient processing and generalization performance through the lens of PAC-Bayes theory and implicit regularization.</p>
</sec>
<sec>
<title>6.4.2.3 Convergence rate analysis</title>
<p>Establish tighter convergence bounds for SMI under various assumptions about loss landscape properties.</p>
</sec>
</sec>
<sec>
<title>6.4.3 Method extensions</title>
<sec>
<title>6.4.3.1 Adaptive frequency selection</title>
<p>Develop methods that dynamically adjust frequency filtering based on training phase, layer characteristics, or gradient properties.</p>
</sec>
<sec>
<title>6.4.3.2 Multi-scale processing</title>
<p>Explore hierarchical frequency processing that operates at different scales (parameter-level, layer-level, model-level) simultaneously.</p>
</sec>
<sec>
<title>6.4.3.3 Integration with modern optimizers</title>
<p>Systematically explore combinations with Lion, Sophia, SAM, and other advanced optimizers to achieve synergistic improvements. Direct benchmarking against these optimizers would allow quantitative comparison of memory efficiency (Lion), convergence speed (Sophia), and generalization quality (SAM) with SMI&#x00027;s inference acceleration benefits, and would help identify suitable integration strategies for different application scenarios.</p>
</sec>
</sec>
<sec>
<title>6.4.4 Broader impact</title>
<sec>
<title>6.4.4.1 Environmental considerations</title>
<p>Quantify the environmental impact of SMI through reduced inference energy consumption and assess the trade-off with increased training energy.</p>
</sec>
<sec>
<title>6.4.4.2 Democratization of AI</title>
<p>Investigate how inference acceleration from SMI can make large models more accessible on resource-constrained devices.</p>
</sec>
<sec>
<title>6.4.4.3 Industrial applications</title>
<p>Partner with industry to validate SMI in production environments and develop best practices for deployment.</p>
<p>Despite current limitations, the fundamental principles of frequency-domain gradient processing represent a promising research direction that bridges signal processing and optimization theory, offering unique advantages that complement existing approaches.</p>
</sec>
</sec>
</sec>
</sec>
<sec sec-type="conclusions" id="s7">
<title>7 Conclusion</title>
<p>This paper introduces Spectral Momentum Integration (SMI), an optimization enhancement that incorporates frequency-domain processing into gradient-based learning. SMI explores connections between signal processing principles and deep learning optimization, providing a proof-of-concept approach to balancing training efficiency with inference performance.</p>
<sec>
<title>7.1 Key contributions and insights</title>
<p>Our work makes several significant contributions to the optimization literature:</p>
<sec>
<title>7.1.1 Methodological contribution</title>
<p>SMI provides a systematic approach to incorporating frequency-domain gradient analysis into neural network optimization. By applying FFT transformations, adaptive filtering, and scheduled blending of spectral and temporal gradients, we show that optimization may benefit from cross-domain information processing.</p>
</sec>
<sec>
<title>7.1.2 Theoretical framework</title>
<p>We establish a preliminary theoretical framework connecting signal processing theory with optimization dynamics; rigorous convergence guarantees require further investigation. Our analysis suggests that frequency-domain processing acts as an implicit regularizer, but the exact mechanisms remain to be characterized.</p>
</sec>
<sec>
<title>7.1.3 Empirical validation</title>
<p>Within our experimental scope (10.7 M parameter model, Shakespeare dataset), results demonstrate promising improvements: up to 15% inference acceleration with about 4.5% training overhead, a 33.3% improvement in activation sparsity, and up to a 43.5% reduction in late-stage training loss variance. However, generalization to larger models and diverse tasks remains to be validated.</p>
</sec>
<sec>
<title>7.1.4 Implementation contribution</title>
<p>SMI operates as a wrapper around existing optimizers, making it applicable to current training pipelines without architectural modifications, though it introduces additional hyperparameter complexity and computational overhead.</p>
</sec>
</sec>
<sec>
<title>7.2 Broader implications</title>
<p>The results of SMI suggest several potential implications for the field, though broader validation is needed:</p>
<sec>
<title>7.2.1 Alternative optimization perspectives</title>
<p>Our work suggests that time-domain optimization, while successful, may not be the only viable approach. The frequency domain offers complementary insights that might lead to better parameter configurations, though this requires validation across diverse settings.</p>
</sec>
<sec>
<title>7.2.2 Cross-disciplinary exploration</title>
<p>By connecting deep learning with signal processing, SMI demonstrates potential for incorporating signal analysis research into optimization techniques, though the generalizability of this approach remains to be established.</p>
</sec>
<sec>
<title>7.2.3 Efficiency-performance considerations</title>
<p>SMI provides an example of how training and inference efficiency might be balanced through alternative gradient processing, though the computational overhead must be carefully weighed against benefits.</p>
</sec>
<sec>
<title>7.2.4 Hardware-algorithm considerations</title>
<p>The computational characteristics of SMI (FFT-based processing, spectral filtering) may align with certain hardware accelerators, though comprehensive hardware-software co-design analysis is needed.</p>
</sec>
</sec>
<sec>
<title>7.3 Study limitations</title>
<p>While promising, our work has important limitations that must be acknowledged:</p>
<sec>
<title>7.3.1 Scale validation</title>
<p>Current experiments are limited to 10.7 M parameters. The critical question of how SMI performs on billion-parameter models remains empirically unresolved, though our theoretical analysis suggests favorable scaling properties.</p>
</sec>
<sec>
<title>7.3.2 Domain generalization</title>
<p>Validation beyond character-level language modeling is needed to establish broad applicability across different tasks, modalities, and architectures.</p>
</sec>
<sec>
<title>7.3.3 Experimental scope</title>
<p>Our current validation represents a proof-of-concept study on a focused experimental setting. While this limitation constrains immediate generalizability claims, it establishes theoretical foundations and empirical evidence for the core frequency-domain gradient processing principles.</p>
</sec>
<sec>
<title>7.3.4 Cross-domain application potential</title>
<p>The frequency-domain processing principles underlying SMI suggest potential applicability across diverse domains. In computer vision tasks, spatial frequency characteristics in image gradients could benefit from spectral filtering, particularly in convolutional layers where spatial relationships are crucial. For time series forecasting, the natural alignment with frequency-domain analysis may prove beneficial in financial prediction, weather modeling, and signal processing tasks where temporal frequency patterns are informative. In reinforcement learning, policy gradient frequency properties may relate to environment dynamics, potentially helping stabilize training in continuous control tasks where gradient noise is problematic.</p>
</sec>
<sec>
<title>7.3.5 Hyperparameter complexity</title>
<p>SMI introduces several hyperparameters whose optimal values may be task-dependent, potentially limiting practical adoption without further research into automated tuning methods.</p>
<p>The computational overhead is a related concern: while modest (4.5%) for the small model studied here, it may become significant for very large models.</p>
</sec>
<sec>
<title>7.3.6 Statistical methodology</title>
<p>Our experimental protocol involved two independent runs per configuration with different random seeds. Results are reported as mean &#x000B1; standard error of the mean (SEM). While this sample size (<italic>n</italic> = 2) limits the power of formal statistical significance tests, the consistent improvement trends across all SMI configurations and low standard errors (particularly for inference speed improvements, with a coefficient of variation, CV, &#x0003C; 1%) suggest reliable effects. Future large-scale validation studies should include larger sample sizes for robust statistical analysis.</p>
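<p>For reference, the summary statistics named above can be computed as follows. This is a minimal sketch using the Python standard library; the function name and the two example values are hypothetical and are not data from the experiments.</p>
<preformat><![CDATA[
```python
from math import sqrt
from statistics import mean, stdev


def summarize(runs):
    """Return (mean, SEM, CV in percent) for a list of per-run measurements."""
    m = mean(runs)
    s = stdev(runs)            # sample standard deviation (n - 1 denominator)
    sem = s / sqrt(len(runs))  # standard error of the mean
    cv = 100.0 * s / m         # coefficient of variation
    return m, sem, cv


# Two hypothetical runs with different seeds (placeholder values).
m, sem, cv = summarize([100.0, 102.0])
print(f"{m:.2f} +/- {sem:.2f} (CV = {cv:.2f}%)")
```
]]></preformat>
<p>With <italic>n</italic> = 2, the SEM reduces to half the absolute difference between the two runs, so it conveys the spread between the two seeds rather than a well-estimated sampling distribution.</p>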
</sec>
</sec>
<sec>
<title>7.4 Research impact and future directions</title>
<p>SMI represents an exploration of frequency-domain approaches to gradient processing, demonstrating potential benefits while highlighting areas requiring further investigation. The frequency domain offers a mathematical framework that warrants deeper exploration, though our current understanding remains preliminary.</p>
<p>Future work should focus on: (1) large-scale validation across diverse models and tasks, (2) rigorous theoretical analysis of convergence properties and optimality conditions, (3) systematic hyperparameter selection methods to reduce complexity burden, and (4) investigation of computational efficiency at scale.</p>
<p>The intersection of signal processing and deep learning optimization is a research area with potential. By combining ideas from both fields, SMI provides an initial demonstration that cross-disciplinary approaches might advance deep learning algorithms, but the generalizability and practical significance of such combinations require extensive validation.</p>
<p>The Spectral Momentum Integration approach thus represents an initial exploration of frequency-domain optimization enhancement, providing proof-of-concept results while highlighting the need for more comprehensive investigation to establish broader applicability and theoretical understanding.</p>
</sec>
</sec>
</body>
<back>
<sec sec-type="data-availability" id="s8">
<title>Data availability statement</title>
<p>The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.</p>
</sec>
<sec sec-type="author-contributions" id="s9">
<title>Author contributions</title>
<p>ZH: Conceptualization, Formal analysis, Visualization, Writing &#x02013; original draft. MC: Funding acquisition, Software, Supervision, Writing &#x02013; original draft, Writing &#x02013; review &#x00026; editing. SZ: Formal analysis, Methodology, Project administration, Writing &#x02013; review &#x00026; editing.</p>
</sec>
<sec sec-type="funding-information" id="s10">
<title>Funding</title>
<p>The author(s) declare that no financial support was received for the research and/or publication of this article.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="ai-statement" id="s11">
<title>Generative AI statement</title>
<p>The author(s) declare that no Gen AI was used in the creation of this manuscript.</p>
</sec>
<sec sec-type="disclaimer" id="s12">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Anil</surname> <given-names>R.</given-names></name> <name><surname>Gupta</surname> <given-names>V.</given-names></name> <name><surname>Koren</surname> <given-names>T.</given-names></name> <name><surname>Regan</surname> <given-names>K.</given-names></name> <name><surname>Singer</surname> <given-names>Y.</given-names></name></person-group> (<year>2021</year>). <article-title>Scalable second order optimization for deep learning</article-title>. <source>Adv. Neural Inf. Process. Syst</source>. <volume>34</volume>, <fpage>13156</fpage>&#x02013;<lpage>13166</lpage>.</citation>
</ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ba</surname> <given-names>J. L.</given-names></name> <name><surname>Kiros</surname> <given-names>J. R.</given-names></name> <name><surname>Hinton</surname> <given-names>G. E.</given-names></name></person-group> (<year>2016</year>). <article-title>Layer normalization</article-title>. <source>arXiv preprint arXiv:1607.06450</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1607.06450</pub-id></citation>
</ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Balles</surname> <given-names>L.</given-names></name> <name><surname>Hennig</surname> <given-names>P.</given-names></name></person-group> (<year>2020</year>). <article-title>Dissecting Adam: the sign, magnitude and variance of stochastic gradients</article-title>. <source>J. Mach. Learn. Res</source>. <volume>21</volume>, <fpage>1</fpage>&#x02013;<lpage>52</lpage>.</citation>
</ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bernacchia</surname> <given-names>A.</given-names></name> <name><surname>Lengyel</surname> <given-names>M.</given-names></name> <name><surname>Hennequin</surname> <given-names>G.</given-names></name></person-group> (<year>2022</year>). <article-title>Exact natural gradient in deep linear networks and its application to the nonlinear case</article-title>. <source>Adv. Neural Inf. Process. Syst</source>. <volume>31</volume>, <fpage>1</fpage>&#x02013;<lpage>12</lpage>.</citation>
</ref>
<ref id="B5">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bernstein</surname> <given-names>J.</given-names></name> <name><surname>Wang</surname> <given-names>Y.-X.</given-names></name> <name><surname>Azizzadenesheli</surname> <given-names>K.</given-names></name> <name><surname>Anandkumar</surname> <given-names>A.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;SignSGD: compressed optimisation for non-convex problems,&#x0201D;</article-title> in <source>International Conference on Machine Learning</source> (<publisher-loc>PMLR</publisher-loc>), <fpage>560</fpage>&#x02013;<lpage>569</lpage>.</citation>
</ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Brown</surname> <given-names>T.</given-names></name> <name><surname>Mann</surname> <given-names>B.</given-names></name> <name><surname>Ryder</surname> <given-names>N.</given-names></name> <name><surname>Subbiah</surname> <given-names>M.</given-names></name> <name><surname>Kaplan</surname> <given-names>J. D.</given-names></name> <name><surname>Dhariwal</surname> <given-names>P.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>Language models are few-shot learners</article-title>. <source>Adv. Neural Inf. Process. Syst</source>. <volume>33</volume>, <fpage>1877</fpage>&#x02013;<lpage>1901</lpage>. <pub-id pub-id-type="doi">10.48550/arXiv.2005.14165</pub-id></citation>
</ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>Y.</given-names></name> <name><surname>Wu</surname> <given-names>Q.</given-names></name> <name><surname>Khabsa</surname> <given-names>M.</given-names></name> <name><surname>Xiong</surname> <given-names>C.</given-names></name> <name><surname>Zettlemoyer</surname> <given-names>L.</given-names></name> <name><surname>Zhang</surname> <given-names>Z.</given-names></name></person-group> (<year>2023</year>). <article-title>Empirical understanding of efficient finetuning methods for large language models</article-title>. <source>arXiv preprint arXiv:2309.14955</source>.<pub-id pub-id-type="pmid">40117144</pub-id></citation></ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cheng</surname> <given-names>J.</given-names></name> <name><surname>Wang</surname> <given-names>Z.</given-names></name> <name><surname>Chen</surname> <given-names>G.</given-names></name> <name><surname>Chen</surname> <given-names>Q.</given-names></name> <name><surname>Liew</surname> <given-names>J.</given-names></name> <name><surname>Yan</surname> <given-names>X.</given-names></name> <etal/></person-group>. (<year>2023</year>). <article-title>&#x0201C;SLIM: Self-supervised LiDAR scene flow using constrained implicit function minimization,&#x0201D;</article-title> <source>International Conference on Learning Representations</source> (<publisher-name>ICLR</publisher-name>).</citation>
</ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Defazio</surname> <given-names>A.</given-names></name> <name><surname>Mishchenko</surname> <given-names>K.</given-names></name></person-group> (<year>2022</year>). <article-title>Adaptivity without compromise: a momentumized, adaptive, dual averaged gradient method for stochastic optimization</article-title>. <source>J. Mach. Learn. Res</source>. <volume>23</volume>, <fpage>1</fpage>&#x02013;<lpage>30</lpage>.</citation>
</ref>
<ref id="B10">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Duchi</surname> <given-names>J.</given-names></name> <name><surname>Hazan</surname> <given-names>E.</given-names></name> <name><surname>Singer</surname> <given-names>Y.</given-names></name></person-group> (<year>2011</year>). <article-title>&#x0201C;Adaptive subgradient methods for online learning and stochastic optimization,&#x0201D;</article-title> in <source>Proceedings of the 24th Annual Conference on Learning Theory</source> (<publisher-loc>Budapest</publisher-loc>: <publisher-name>JMLR Workshop and Conference Proceedings</publisher-name>), <fpage>257</fpage>&#x02013;<lpage>269</lpage>.</citation>
</ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Durall</surname> <given-names>R.</given-names></name> <name><surname>Keuper</surname> <given-names>M.</given-names></name> <name><surname>Pfreundt</surname> <given-names>F.-J.</given-names></name> <name><surname>Keuper</surname> <given-names>J.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Watch your up-convolution: CNN based generative deep neural networks are failing to reproduce spectral distributions,&#x0201D;</article-title> <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-name>IEEE</publisher-name>), <fpage>7890</fpage>&#x02013;<lpage>7899</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR42600.2020.00791</pub-id></citation>
</ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fort</surname> <given-names>S.</given-names></name> <name><surname>Dziugaite</surname> <given-names>G. K.</given-names></name></person-group> (<year>2019</year>). <article-title>Emergent properties in the optimization of deep networks</article-title>. <source>arXiv preprint arXiv:1906.04313</source>.</citation>
</ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Garipov</surname> <given-names>T.</given-names></name> <name><surname>Izmailov</surname> <given-names>P.</given-names></name> <name><surname>Podoprikhin</surname> <given-names>D.</given-names></name> <name><surname>Vetrov</surname> <given-names>D. P.</given-names></name> <name><surname>Wilson</surname> <given-names>A. G.</given-names></name></person-group> (<year>2018</year>). <article-title>Loss surfaces, mode connectivity, and fast ensembling of DNNs</article-title>. <source>Adv. Neural Inf. Process. Syst</source>. <volume>31</volume>. <pub-id pub-id-type="doi">10.48550/arXiv.1802.10026</pub-id></citation>
</ref>
<ref id="B14">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Han</surname> <given-names>S.</given-names></name> <name><surname>Mao</surname> <given-names>H.</given-names></name> <name><surname>Dally</surname> <given-names>W. J.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding,&#x0201D;</article-title> in <source>International Conference on Learning Representations</source> (<publisher-loc>ICLR</publisher-loc>).</citation>
</ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hinton</surname> <given-names>G.</given-names></name> <name><surname>Vinyals</surname> <given-names>O.</given-names></name> <name><surname>Dean</surname> <given-names>J.</given-names></name></person-group> (<year>2015</year>). <article-title>Distilling the knowledge in a neural network</article-title>. <source>arXiv preprint arXiv:1503.02531</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1503.02531</pub-id></citation>
</ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hoffmann</surname> <given-names>J.</given-names></name> <name><surname>Borgeaud</surname> <given-names>S.</given-names></name> <name><surname>Mensch</surname> <given-names>A.</given-names></name> <name><surname>Buchatskaya</surname> <given-names>E.</given-names></name> <name><surname>Cai</surname> <given-names>T.</given-names></name> <name><surname>Rutherford</surname> <given-names>E.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>Training compute-optimal large language models</article-title>. <source>arXiv preprint arXiv:2203.15556</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2203.15556</pub-id></citation>
</ref>
<ref id="B17">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>W.</given-names></name> <name><surname>Zhang</surname> <given-names>J.</given-names></name> <name><surname>Xiong</surname> <given-names>H.</given-names></name> <name><surname>Ma</surname> <given-names>L.</given-names></name></person-group> (<year>2022</year>). <article-title>&#x0201C;Learning frequency-aware dynamic network for efficient super-resolution,&#x0201D;</article-title> in <source>International Conference on Learning Representations</source> (<publisher-loc>ICLR</publisher-loc>).</citation>
</ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jiang</surname> <given-names>Y.</given-names></name> <name><surname>Neyshabur</surname> <given-names>B.</given-names></name> <name><surname>Mobahi</surname> <given-names>H.</given-names></name> <name><surname>Krishnan</surname> <given-names>D.</given-names></name> <name><surname>Bengio</surname> <given-names>S.</given-names></name></person-group> (<year>2020</year>). <article-title>A unified perspective on algorithm instabilities: generalization and implicit bias</article-title>. <source>Adv. Neural Inf. Process. Syst</source>. <volume>33</volume>, <fpage>9287</fpage>&#x02013;<lpage>9298</lpage>.</citation>
</ref>
<ref id="B19">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Karpathy</surname> <given-names>A.</given-names></name></person-group> (<year>2023</year>). <source>nanoGPT. GitHub repository</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://github.com/karpathy/nanoGPT/tree/master">https://github.com/karpathy/nanoGPT/tree/master</ext-link></citation>
</ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kingma</surname> <given-names>D. P.</given-names></name> <name><surname>Ba</surname> <given-names>J.</given-names></name></person-group> (<year>2014</year>). <article-title>Adam: a method for stochastic optimization</article-title>. <source>arXiv preprint arXiv:1412.6980</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1412.6980</pub-id></citation>
</ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>H.</given-names></name> <name><surname>Xu</surname> <given-names>Z.</given-names></name> <name><surname>Taylor</surname> <given-names>G.</given-names></name> <name><surname>Studer</surname> <given-names>C.</given-names></name> <name><surname>Goldstein</surname> <given-names>T.</given-names></name></person-group> (<year>2018</year>). <article-title>Visualizing the loss landscape of neural nets</article-title>. <source>Adv. Neural Inf. Process. Syst</source>. <volume>31</volume>, <fpage>6389</fpage>&#x02013;<lpage>6399</lpage>.</citation>
</ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liang</surname> <given-names>Y.</given-names></name> <name><surname>Cano</surname> <given-names>M. C. H.</given-names></name> <name><surname>Pinsler</surname> <given-names>R.</given-names></name> <name><surname>Clark</surname> <given-names>S. R.</given-names></name> <name><surname>Steinruecken</surname> <given-names>C.</given-names></name></person-group> (<year>2022</year>). <article-title>Rethinking the time assignment in recurrent neural networks for multivariate time series forecasting</article-title>. <source>Adv. Neural Inf. Process. Syst</source>. <volume>35</volume>, <fpage>5452</fpage>&#x02013;<lpage>5464</lpage>.</citation>
</ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>C.</given-names></name> <name><surname>Wang</surname> <given-names>X.</given-names></name> <name><surname>Ye</surname> <given-names>R.</given-names></name> <name><surname>Sun</surname> <given-names>Z.</given-names></name> <name><surname>Shen</surname> <given-names>L.</given-names></name></person-group> (<year>2022</year>). <article-title>Towards understanding sharpness-aware minimization</article-title>. <source>Adv. Neural Inf. Process. Syst</source>. <volume>35</volume>, <fpage>10645</fpage>&#x02013;<lpage>10657</lpage>.</citation>
</ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>C.</given-names></name> <name><surname>Zhu</surname> <given-names>L.</given-names></name> <name><surname>Belkin</surname> <given-names>M.</given-names></name></person-group> (<year>2020</year>). <article-title>On the linearity of large non-linear models: when and why the tangent kernel is constant</article-title>. <source>Adv. Neural Inf. Process. Syst</source>. <volume>33</volume>, <fpage>15954</fpage>&#x02013;<lpage>15964</lpage>.</citation>
</ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>L.</given-names></name> <name><surname>Jiang</surname> <given-names>H.</given-names></name> <name><surname>He</surname> <given-names>P.</given-names></name> <name><surname>Chen</surname> <given-names>W.</given-names></name> <name><surname>Liu</surname> <given-names>X.</given-names></name> <name><surname>Gao</surname> <given-names>J.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>On the variance of the adaptive learning rate and beyond</article-title>. <source>arXiv preprint arXiv:1908.03265</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1908.03265</pub-id></citation></ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>S. M.</given-names></name> <name><surname>Savarese</surname> <given-names>S.</given-names></name> <name><surname>Pan</surname> <given-names>S.</given-names></name> <name><surname>Kotni</surname> <given-names>P.</given-names></name> <name><surname>Coste</surname> <given-names>S.</given-names></name> <name><surname>Ryu</surname> <given-names>E.</given-names></name> <etal/></person-group>. (<year>2023</year>). <article-title>Efficient training of language models using few-step inference</article-title>. <source>arXiv preprint arXiv:2305.17600</source>.</citation>
</ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Loshchilov</surname> <given-names>I.</given-names></name> <name><surname>Hutter</surname> <given-names>F.</given-names></name></person-group> (<year>2017</year>). <article-title>Decoupled weight decay regularization</article-title>. <source>arXiv preprint arXiv:1711.05101</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1711.05101</pub-id><pub-id pub-id-type="pmid">38536692</pub-id></citation></ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Loshchilov</surname> <given-names>I.</given-names></name> <name><surname>Hutter</surname> <given-names>F.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Decoupled weight decay regularization,&#x0201D;</article-title> <source>International Conference on Learning Representations</source> (<publisher-name>ICLR</publisher-name>).<pub-id pub-id-type="pmid">38536692</pub-id></citation></ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ma</surname> <given-names>X.</given-names></name> <name><surname>Yarats</surname> <given-names>D.</given-names></name> <name><surname>Simchowitz</surname> <given-names>M.</given-names></name> <name><surname>Zhao</surname> <given-names>Q.</given-names></name> <name><surname>Garg</surname> <given-names>D.</given-names></name> <name><surname>Liang</surname> <given-names>P.</given-names></name></person-group> (<year>2020</year>). <article-title>Apollo: an adaptive parameter-wise diagonal quasi-Newton method for nonconvex stochastic optimization</article-title>. <source>arXiv preprint arXiv:2009.13586</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2009.13586</pub-id></citation>
</ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Martens</surname> <given-names>J.</given-names></name></person-group> (<year>2020</year>). <article-title>New insights and perspectives on the natural gradient method</article-title>. <source>J. Mach. Learn. Res</source>. <volume>21</volume>, <fpage>1</fpage>&#x02013;<lpage>76</lpage>. <pub-id pub-id-type="doi">10.48550/arXiv.1412.1193</pub-id></citation>
</ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Martinez</surname> <given-names>J. L.</given-names></name> <name><surname>Rudi</surname> <given-names>A.</given-names></name> <name><surname>Rosasco</surname> <given-names>L.</given-names></name> <name><surname>Pontil</surname> <given-names>M.</given-names></name></person-group> (<year>2022</year>). <article-title>Spectral bias in practice: the role of function frequency in generalization</article-title>. <source>Adv. Neural Inf. Process. Syst</source>. <volume>35</volume>, <fpage>9525</fpage>&#x02013;<lpage>9538</lpage>.</citation>
</ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mathieu</surname> <given-names>M.</given-names></name> <name><surname>Henaff</surname> <given-names>M.</given-names></name> <name><surname>LeCun</surname> <given-names>Y.</given-names></name></person-group> (<year>2013</year>). <article-title>Fast training of convolutional networks through FFTs</article-title>. <source>arXiv preprint arXiv:1312.5851</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1312.5851</pub-id></citation>
</ref>
<ref id="B33">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Mellor</surname> <given-names>J.</given-names></name> <name><surname>Turner</surname> <given-names>J.</given-names></name> <name><surname>Storkey</surname> <given-names>A.</given-names></name> <name><surname>Crowley</surname> <given-names>E. J.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Neural architecture search without training,&#x0201D;</article-title> in <source>International Conference on Machine Learning</source> (<publisher-loc>PMLR</publisher-loc>), <fpage>7588</fpage>&#x02013;<lpage>7598</lpage>.</citation>
</ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Micikevicius</surname> <given-names>P.</given-names></name> <name><surname>Narang</surname> <given-names>S.</given-names></name> <name><surname>Alben</surname> <given-names>J.</given-names></name> <name><surname>Diamos</surname> <given-names>G.</given-names></name> <name><surname>Elsen</surname> <given-names>E.</given-names></name> <name><surname>Garcia</surname> <given-names>D.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Mixed precision training</article-title>. <source>arXiv preprint arXiv:1710.03740</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1710.03740</pub-id></citation>
</ref>
<ref id="B35">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Miyato</surname> <given-names>T.</given-names></name> <name><surname>Kataoka</surname> <given-names>T.</given-names></name> <name><surname>Koyama</surname> <given-names>M.</given-names></name> <name><surname>Yoshida</surname> <given-names>Y.</given-names></name></person-group> (<year>2018</year>). <article-title>Spectral normalization for generative adversarial networks</article-title>. <source>arXiv preprint arXiv:1802.05957</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1802.05957</pub-id></citation>
</ref>
<ref id="B36">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nesterov</surname> <given-names>Y.</given-names></name></person-group> (<year>1983</year>). <article-title>A method for unconstrained convex minimization problem with the rate of convergence <italic>O</italic>(1/<italic>k</italic><sup>2</sup>)</article-title>. <source>Doklady AN SSSR</source> <volume>269</volume>, <fpage>543</fpage>&#x02013;<lpage>547</lpage>.</citation>
</ref>
<ref id="B37">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nguyen</surname> <given-names>Q.</given-names></name> <name><surname>Hein</surname> <given-names>M.</given-names></name></person-group> (<year>2017</year>). <article-title>The loss surface of deep and wide neural networks</article-title>. <source>Proc. 34th Int. Conf. Mach. Learn</source>. <volume>70</volume>, <fpage>2603</fpage>&#x02013;<lpage>2612</lpage>.</citation>
</ref>
<ref id="B38">
<citation citation-type="journal"><person-group person-group-type="author"><collab>OpenAI</collab></person-group> (<year>2023</year>). <article-title>GPT-4 technical report</article-title>. <source>arXiv preprint arXiv:2303.08774</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2303.08774</pub-id></citation>
</ref>
<ref id="B39">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Pascanu</surname> <given-names>R.</given-names></name> <name><surname>Mikolov</surname> <given-names>T.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name></person-group> (<year>2013</year>). <article-title>&#x0201C;On the difficulty of training recurrent neural networks,&#x0201D;</article-title> in <source>International Conference on Machine Learning</source> (<publisher-loc>Atlanta, GA</publisher-loc>: <publisher-name>PMLR</publisher-name>), <fpage>1310</fpage>&#x02013;<lpage>1318</lpage>.</citation>
</ref>
<ref id="B40">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Polyak</surname> <given-names>B. T.</given-names></name></person-group> (<year>1964</year>). <article-title>Some methods of speeding up the convergence of iteration methods</article-title>. <source>USSR Comput. Math. Math. Phys</source>. <volume>4</volume>, <fpage>1</fpage>&#x02013;<lpage>17</lpage>. <pub-id pub-id-type="doi">10.1016/0041-5553(64)90137-5</pub-id></citation>
</ref>
<ref id="B41">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Radford</surname> <given-names>A.</given-names></name> <name><surname>Kim</surname> <given-names>J. W.</given-names></name> <name><surname>Hallacy</surname> <given-names>C.</given-names></name> <name><surname>Ramesh</surname> <given-names>A.</given-names></name> <name><surname>Goh</surname> <given-names>G.</given-names></name> <name><surname>Agarwal</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>&#x0201C;Learning transferable visual models from natural language supervision,&#x0201D;</article-title> <source>International Conference on Machine Learning</source> (<publisher-name>PMLR</publisher-name>), <fpage>8748</fpage>&#x02013;<lpage>8763</lpage>.</citation>
</ref>
<ref id="B42">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Rahaman</surname> <given-names>N.</given-names></name> <name><surname>Baratin</surname> <given-names>A.</given-names></name> <name><surname>Arpit</surname> <given-names>D.</given-names></name> <name><surname>Draxler</surname> <given-names>F.</given-names></name> <name><surname>Lin</surname> <given-names>M.</given-names></name> <name><surname>Hamprecht</surname> <given-names>F.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>&#x0201C;On the spectral bias of neural networks,&#x0201D;</article-title> in <source>International Conference on Machine Learning</source> (<publisher-loc>Long Beach, CA</publisher-loc>: <publisher-name>PMLR</publisher-name>), <fpage>5301</fpage>&#x02013;<lpage>5310</lpage>.</citation>
</ref>
<ref id="B43">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rao</surname> <given-names>S.</given-names></name> <name><surname>Yi</surname> <given-names>Z.</given-names></name> <name><surname>El-Amine</surname> <given-names>Y.</given-names></name> <name><surname>Pinto</surname> <given-names>J. R. d. A.</given-names></name> <name><surname>Ignat</surname> <given-names>I.</given-names></name> <name><surname>Arik</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>Time frequency networks: towards a unified framework for time and frequency domain machine learning</article-title>. <source>Adv. Neural Inf. Process. Syst</source>. <volume>35</volume>, <fpage>37404</fpage>&#x02013;<lpage>37417</lpage>.</citation>
</ref>
<ref id="B44">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rippel</surname> <given-names>O.</given-names></name> <name><surname>Snoek</surname> <given-names>J.</given-names></name> <name><surname>Adams</surname> <given-names>R. P.</given-names></name></person-group> (<year>2015</year>). <article-title>Spectral representations for convolutional neural networks</article-title>. <source>Adv. Neural Inf. Process. Syst</source>. <volume>28</volume>, <fpage>2449</fpage>&#x02013;<lpage>2457</lpage>. <pub-id pub-id-type="doi">10.48550/arXiv.1506.03767</pub-id></citation>
</ref>
<ref id="B45">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Robbins</surname> <given-names>H.</given-names></name> <name><surname>Monro</surname> <given-names>S.</given-names></name></person-group> (<year>1951</year>). <article-title>A stochastic approximation method</article-title>. <source>Ann. Math. Stat</source>. <volume>22</volume>, <fpage>400</fpage>&#x02013;<lpage>407</lpage>. <pub-id pub-id-type="doi">10.1214/aoms/1177729586</pub-id></citation>
</ref>
<ref id="B46">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Shazeer</surname> <given-names>N.</given-names></name> <name><surname>Stern</surname> <given-names>M.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Adafactor: Adaptive learning rates with sublinear memory cost,&#x0201D;</article-title> in <source>International Conference on Machine Learning</source> (<publisher-loc>PMLR</publisher-loc>), <fpage>4596</fpage>&#x02013;<lpage>4604</lpage>.</citation>
</ref>
<ref id="B47">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tang</surname> <given-names>H.</given-names></name> <name><surname>Chen</surname> <given-names>J.</given-names></name> <name><surname>Zhou</surname> <given-names>K.</given-names></name> <name><surname>Chen</surname> <given-names>T.</given-names></name> <name><surname>Shi</surname> <given-names>C.</given-names></name> <name><surname>Rajawat</surname> <given-names>K.</given-names></name> <etal/></person-group>. (<year>2023</year>). <article-title>Distributed stochastic optimization for deep learning with communication compression</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell</source>.</citation>
</ref>
<ref id="B48">
<citation citation-type="journal"><person-group person-group-type="author"><collab>Gemini Team</collab> <name><surname>Anil</surname> <given-names>R.</given-names></name> <name><surname>Borgeaud</surname> <given-names>S.</given-names></name> <name><surname>Wu</surname> <given-names>Y.</given-names></name> <name><surname>Alayrac</surname> <given-names>J.-B.</given-names></name> <name><surname>Yu</surname> <given-names>J.</given-names></name> <etal/></person-group>. (<year>2023</year>). <article-title>Gemini: A family of highly capable multimodal models</article-title>. <source>arXiv preprint arXiv:2312.11805</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2312.11805</pub-id></citation>
</ref>
<ref id="B49">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tieleman</surname> <given-names>T.</given-names></name> <name><surname>Hinton</surname> <given-names>G.</given-names></name></person-group> (<year>2012</year>). <article-title>Lecture 6.5-RMSProp: divide the gradient by a running average of its recent magnitude</article-title>. <source>COURSERA: Neural networks for machine learning</source> <volume>4</volume>, <fpage>26</fpage>&#x02013;<lpage>31</lpage>.</citation>
</ref>
<ref id="B50">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Touvron</surname> <given-names>H.</given-names></name> <name><surname>Lavril</surname> <given-names>T.</given-names></name> <name><surname>Izacard</surname> <given-names>G.</given-names></name> <name><surname>Martinet</surname> <given-names>X.</given-names></name> <name><surname>Lachaux</surname> <given-names>M.-A.</given-names></name> <name><surname>Lacroix</surname> <given-names>T.</given-names></name> <etal/></person-group>. (<year>2023</year>). <article-title>Llama 2: Open foundation and fine-tuned chat models</article-title>. <source>arXiv preprint arXiv:2307.09288</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2307.09288</pub-id></citation>
</ref>
<ref id="B51">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tsuji</surname> <given-names>T.</given-names></name> <name><surname>Tanaka</surname> <given-names>K.</given-names></name> <name><surname>Yamamoto</surname> <given-names>K.</given-names></name> <name><surname>Tanaka</surname> <given-names>M.</given-names></name></person-group> (<year>2022</year>). <article-title>Relative flatness and generalization in deep networks</article-title>. <source>IEEE Trans. Neural Netw. Learn. Syst</source>.</citation>
</ref>
<ref id="B52">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>C.</given-names></name> <name><surname>Zhang</surname> <given-names>G.</given-names></name> <name><surname>Grosse</surname> <given-names>R.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Picking winning tickets before training by preserving gradient flow,&#x0201D;</article-title> in <source>International Conference on Learning Representations</source> (<publisher-loc>ICLR</publisher-loc>).</citation>
</ref>
<ref id="B53">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>Y.</given-names></name> <name><surname>Yao</surname> <given-names>Q.</given-names></name> <name><surname>Kwok</surname> <given-names>J. T.</given-names></name> <name><surname>Ni</surname> <given-names>L. M.</given-names></name></person-group> (<year>2022</year>). <article-title>&#x0201C;Anti-oversmoothing in deep vision transformers via the Fourier domain analysis: from theory to practice,&#x0201D;</article-title> in <source>International Conference on Machine Learning</source> (<publisher-loc>PMLR</publisher-loc>), <fpage>23398</fpage>&#x02013;<lpage>23419</lpage>.</citation>
</ref>
<ref id="B54">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xu</surname> <given-names>Z.</given-names></name> <name><surname>Dai</surname> <given-names>A. M.</given-names></name> <name><surname>Kemp</surname> <given-names>J.</given-names></name> <name><surname>Metz</surname> <given-names>L.</given-names></name></person-group> (<year>2019a</year>). <article-title>On the frequency bias of generative models</article-title>. <source>Adv. Neural Inf. Process. Syst</source>. <volume>32</volume>, <fpage>1810</fpage>&#x02013;<lpage>1820</lpage>.</citation>
</ref>
<ref id="B55">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xu</surname> <given-names>Z.-Q. J.</given-names></name> <name><surname>Zhang</surname> <given-names>Y.</given-names></name> <name><surname>Xiao</surname> <given-names>Y.</given-names></name></person-group> (<year>2019b</year>). <article-title>Training behavior of deep neural network in frequency domain</article-title>. <source>Adv. Neural Inf. Process. Syst</source>. <volume>32</volume>, <fpage>3836</fpage>&#x02013;<lpage>3846</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-030-36708-4_22</pub-id></citation>
</ref>
<ref id="B56">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yang</surname> <given-names>K. R.</given-names></name> <name><surname>Lample</surname> <given-names>G.</given-names></name> <name><surname>Polosukhin</surname> <given-names>I.</given-names></name> <name><surname>Misra</surname> <given-names>K.</given-names></name> <name><surname>Bubeck</surname> <given-names>S.</given-names></name> <name><surname>Jain</surname> <given-names>A.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>Advancing mathematics by guiding human intuition with AI</article-title>. <source>Nature</source> <volume>600</volume>, <fpage>70</fpage>&#x02013;<lpage>74</lpage>. <pub-id pub-id-type="doi">10.1038/s41586-021-04086-x</pub-id><pub-id pub-id-type="pmid">34853458</pub-id></citation>
</ref>
<ref id="B57">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yang</surname> <given-names>L. M.</given-names></name> <name><surname>Poli</surname> <given-names>M.</given-names></name> <name><surname>Massaroli</surname> <given-names>S.</given-names></name> <name><surname>Aguirre</surname> <given-names>E.</given-names></name> <name><surname>Isomura</surname> <given-names>T.</given-names></name> <name><surname>Ermon</surname> <given-names>S.</given-names></name></person-group> (<year>2022</year>). <article-title>Tuning large neural networks via zero-shot hyperparameter transfer</article-title>. <source>Adv. Neural Inf. Process. Syst</source>. <volume>35</volume>, <fpage>13199</fpage>&#x02013;<lpage>13213</lpage>.</citation>
</ref>
<ref id="B58">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>M. R.</given-names></name> <name><surname>Lucas</surname> <given-names>J.</given-names></name> <name><surname>Hinton</surname> <given-names>G.</given-names></name> <name><surname>Ba</surname> <given-names>J.</given-names></name></person-group> (<year>2019</year>). <article-title>Lookahead optimizer: k steps forward, 1 step back</article-title>. <source>Adv. Neural Inf. Process. Syst</source>. <volume>32</volume>, <fpage>9593</fpage>&#x02013;<lpage>9604</lpage>. <pub-id pub-id-type="doi">10.48550/arXiv.1907.08610</pub-id></citation>
</ref>
<ref id="B59">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>Z.</given-names></name> <name><surname>Zhang</surname> <given-names>T.</given-names></name></person-group> (<year>2022</year>). <article-title>Adaptive gradient methods converge faster with over-parameterization (and you can do a line-search)</article-title>. <source>Adv. Neural Inf. Process. Syst</source>. <volume>35</volume>, <fpage>32876</fpage>&#x02013;<lpage>32889</lpage>.</citation>
</ref>
<ref id="B60">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhuang</surname> <given-names>J.</given-names></name> <name><surname>Tang</surname> <given-names>T.</given-names></name> <name><surname>Ding</surname> <given-names>Y.</given-names></name> <name><surname>Tatikonda</surname> <given-names>S.</given-names></name> <name><surname>Dvornek</surname> <given-names>N.</given-names></name> <name><surname>Papademetris</surname> <given-names>X.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>AdaBelief optimizer: adapting stepsizes by the belief in observed gradients</article-title>. <source>Adv. Neural Inf. Process. Syst</source>. <volume>33</volume>, <fpage>18795</fpage>&#x02013;<lpage>18806</lpage>. <pub-id pub-id-type="doi">10.48550/arXiv.2010.07468</pub-id></citation>
</ref>
</ref-list>
</back>
</article>