<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Big Data</journal-id>
<journal-title>Frontiers in Big Data</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Big Data</abbrev-journal-title>
<issn pub-type="epub">2624-909X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fdata.2023.1195742</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Big Data</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>PME: pruning-based multi-size embedding for recommender systems</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Liu</surname> <given-names>Zirui</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2256901/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Song</surname> <given-names>Qingquan</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Li</surname> <given-names>Li</given-names></name>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1068845/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Choi</surname> <given-names>Soo-Hyun</given-names></name>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2262647/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Chen</surname> <given-names>Rui</given-names></name>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1068766/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Hu</surname> <given-names>Xia</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1518368/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Computer Science Department, Rice University</institution>, <addr-line>Houston, TX</addr-line>, <country>United States</country></aff>
<aff id="aff2"><sup>2</sup><institution>Linkedin</institution>, <addr-line>Sunnyvale, CA</addr-line>, <country>United States</country></aff>
<aff id="aff3"><sup>3</sup><institution>Samsung Electronics America</institution>, <addr-line>Mountain View, CA</addr-line>, <country>United States</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Bo Han, Hong Kong Baptist University, Hong Kong SAR, China</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Quanming Yao, Tsinghua University, China; Jiangchao Yao, Shanghai Jiao Tong University, China; Zhanke Zhou, Hong Kong Baptist University, Hong Kong SAR, China</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Xia Hu <email>xia.hu&#x00040;rice.edu</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>15</day>
<month>06</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="collection">
<year>2023</year>
</pub-date>
<volume>6</volume>
<elocation-id>1195742</elocation-id>
<history>
<date date-type="received">
<day>28</day>
<month>03</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>25</day>
<month>04</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2023 Liu, Song, Li, Choi, Chen and Hu.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Liu, Song, Li, Choi, Chen and Hu</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license> </permissions>
<abstract>
<p>Embedding is widely used in recommendation models to learn feature representations. However, the traditional embedding technique that assigns a fixed size to all categorical features may be suboptimal for the following reason. In the recommendation domain, the majority of categorical features&#x00027; embeddings can be trained with less capacity without impacting model performance, so storing embeddings of equal length may incur unnecessary memory usage. Existing work that tries to allocate customized sizes for each feature usually either simply scales the embedding size with the feature&#x00027;s popularity or formulates the size allocation problem as an architecture selection problem. Unfortunately, most of these methods either suffer a large performance drop or incur significant extra time cost for searching proper embedding sizes. In this article, instead of formulating the size allocation problem as an architecture selection problem, we approach the problem from a pruning perspective and propose the <bold>P</bold>runing-based <bold>M</bold>ulti-size <bold>E</bold>mbedding (PME) framework. During the search phase, we prune the dimensions that have the least impact on model performance in the embedding to reduce its capacity. Then, we show that the customized size of each token can be obtained by transferring the capacity of its pruned embedding with significantly less search cost. Experimental results validate that PME can efficiently find proper sizes and hence achieve strong performance while significantly reducing the number of parameters in the embedding layer.</p></abstract>
<kwd-group>
<kwd>neural network</kwd>
<kwd>recommender system</kwd>
<kwd>embedding compression</kwd>
<kwd>pruning</kwd>
<kwd>scalability</kwd>
</kwd-group>
<counts>
<fig-count count="11"/>
<table-count count="1"/>
<equation-count count="11"/>
<ref-count count="32"/>
<page-count count="13"/>
<word-count count="8895"/>
</counts>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Machine Learning and Artificial Intelligence</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Embedding feature information into vector representations is crucial for the success of deep learning based recommendation models (Zhang et al., <xref ref-type="bibr" rid="B30">2019</xref>). In practice, the input features to recommender systems are often categorical, such as userID, itemID, and the category of items. For deep learning based recommendation models, these categorical features are mapped to low-dimensional learnable vectors (i.e., embeddings). Then, the learned vectors are fed into the rest of the model to learn the interaction between features. The number of layers in the rest of the recommendation model is typically small (usually less than 10) and independent of the number of categorical features (Cheng et al., <xref ref-type="bibr" rid="B5">2016</xref>; Guo et al., <xref ref-type="bibr" rid="B10">2017</xref>; Lian et al., <xref ref-type="bibr" rid="B17">2018</xref>). In contrast, the size of the embedding matrix grows linearly with the number of categorical features, which can easily be at the scale of millions (Park et al., <xref ref-type="bibr" rid="B22">2018</xref>). As a result, the weight matrix of the embedding layer is often responsible for most of the memory consumption of a deep learning based recommendation model. For example, the embedding layer of Facebook&#x00027;s recommender system contains billions of parameters. Consequently, the embedding layer accounts for more than 99.9% of the memory of the whole model, which can amount to hundreds of gigabytes or even terabytes (Park et al., <xref ref-type="bibr" rid="B22">2018</xref>; Ginart et al., <xref ref-type="bibr" rid="B9">2021</xref>). Unless the embedding layers are compressed, the excessive memory usage of recommendation models is a major obstacle to serving them on-device, where memory is limited.</p>
<p>Traditional embedding compression methods usually focus on compacting the embedding matrix (Markovsky and Usevich, <xref ref-type="bibr" rid="B21">2012</xref>; Wang et al., <xref ref-type="bibr" rid="B28">2017</xref>). Low-rank based methods assume that the weight matrix has low rank and can therefore be decomposed into several smaller matrices (Markovsky and Usevich, <xref ref-type="bibr" rid="B21">2012</xref>). Hashing based methods reduce the number of embedding vectors in the matrix by mapping similar items into the same bucket (Wang et al., <xref ref-type="bibr" rid="B28">2017</xref>). All these methods follow the framework of the standard embedding technique that learns embeddings with equal length for each token.<xref ref-type="fn" rid="fn0001"><sup>1</sup></xref> However, recent advances demonstrate that assigning a fixed embedding size to all tokens may be suboptimal for the following reasons (Joglekar et al., <xref ref-type="bibr" rid="B12">2020</xref>; Zhao et al., <xref ref-type="bibr" rid="B31">2020a</xref>,<xref ref-type="bibr" rid="B32">b</xref>; Ginart et al., <xref ref-type="bibr" rid="B9">2021</xref>). In the recommendation domain, usually a few head tokens dominate the data, while the majority of tokens (i.e., long-tail tokens) are rarely observed (Park and Tuzhilin, <xref ref-type="bibr" rid="B23">2008</xref>). Moreover, a token&#x00027;s popularity is correlated with the importance of its representation to model performance (Joglekar et al., <xref ref-type="bibr" rid="B12">2020</xref>; Zhao et al., <xref ref-type="bibr" rid="B31">2020a</xref>; Ginart et al., <xref ref-type="bibr" rid="B9">2021</xref>). Thus, a fixed embedding size may either lose information about head tokens or waste parameters on long-tail tokens (Kang et al., <xref ref-type="bibr" rid="B13">2020</xref>; Zhao et al., <xref ref-type="bibr" rid="B32">2020b</xref>).
In practice, a large enough embedding size is usually chosen to ensure model performance, which incurs unnecessary memory usage for storing long-tail tokens&#x00027; embeddings.</p>
<p>To overcome this drawback of equal-length embeddings, several recent studies propose to allocate more capacity (i.e., a larger embedding size) to important tokens and less capacity to unimportant ones (Joglekar et al., <xref ref-type="bibr" rid="B12">2020</xref>; Kang et al., <xref ref-type="bibr" rid="B13">2020</xref>; Zhao et al., <xref ref-type="bibr" rid="B31">2020a</xref>,<xref ref-type="bibr" rid="B32">b</xref>; Ginart et al., <xref ref-type="bibr" rid="B9">2021</xref>). These studies can be roughly divided into two categories. Some propose to explicitly scale a token&#x00027;s embedding size with its frequency according to heuristic rules designed by human experts (Kang et al., <xref ref-type="bibr" rid="B13">2020</xref>; Ginart et al., <xref ref-type="bibr" rid="B9">2021</xref>). However, such an allocation strategy may be suboptimal since the importance of a token is not purely decided by its popularity. Inspired by neural architecture search (NAS), another line of research formulates the embedding size allocation problem as an architecture selection problem, which selects the embedding size for each token from several predefined options (Joglekar et al., <xref ref-type="bibr" rid="B12">2020</xref>; Zhao et al., <xref ref-type="bibr" rid="B31">2020a</xref>,<xref ref-type="bibr" rid="B32">b</xref>). Due to the extremely large search space, the search process incurs a significant computational cost. Although the number of parameters in the embedding layer is significantly reduced, these methods still either suffer a large performance drop or introduce significant extra time cost for searching embedding sizes.</p>
<p>In this article, we approach the embedding size allocation problem from a pruning perspective. Our work is motivated by the observation that the majority of tokens&#x00027; embeddings can be trained with less capacity without impacting model performance (Joglekar et al., <xref ref-type="bibr" rid="B12">2020</xref>). Therefore, during the search phase, instead of selecting from a set of candidate embedding sizes, we prune the dimensions of each token&#x00027;s embedding that have the least impact on model performance to reduce its capacity. Then, we build a multi-size embedding table for training without sacrificing model performance, where the customized size of each token is obtained by transferring the capacity of its pruned embedding. Moreover, we show that the unimportant parameters in the embedding layer can be identified and pruned at initialization, which significantly reduces the time cost of searching the customized sizes. Consequently, our framework can reduce the memory occupied by the embedding layer during both the training and inference phases without sacrificing model performance. Our contributions are summarized as follows:</p>
<list list-type="bullet">
<list-item><p>We rigorously show that the embedding size allocation problem can be converted to a pruning problem. Based on this reformulation, we propose a pruning-based multi-size embedding (PME) framework to search the customized embedding size for each token.</p></list-item>
<list-item><p>In our framework, the embedding layer is pruned during the search process without training it. Thus, the time cost of the search process is significantly reduced. Once the embedding layer is pruned, we build the multi-size embedding table for training by transferring the capacity of each token&#x00027;s pruned embedding. Our framework can reduce the memory occupied by the embedding layer during both the training and inference phases.</p></list-item>
<list-item><p>We show that our framework can match or improve the performance of several recommendation models using significantly fewer parameters. For example, for Autoint&#x0002B; (Song et al., <xref ref-type="bibr" rid="B26">2019</xref>), we show that PME can significantly improve the Logloss and AUC while using 40&#x000D7; fewer parameters for the click-through rate prediction task on the Criteo dataset.</p></list-item>
</list></sec>
<sec id="s2">
<title>2. Preliminary and problem statement</title>
<sec>
<title>2.1. Notations</title>
<p>We denote matrices with uppercase bold letters (e.g., <bold>V</bold>), vectors with lowercase bold letters (e.g., <bold>v</bold>), and scalars with lowercase letters (e.g., <italic>v</italic>). We use <bold>V</bold><sub><italic>i</italic>, :</sub> to represent the <italic>i</italic><sup>th</sup> row of <bold>V</bold>, and <bold>V</bold><sub><italic>i,j</italic></sub> to denote the entry at the <italic>i</italic><sup>th</sup> row and <italic>j</italic><sup>th</sup> column of <bold>V</bold>. We denote the standard <italic>L</italic><sub>0</sub> norm as ||&#x000B7;||<sub>0</sub>. The operation <bold>V</bold> &#x0003D; concat(<bold>V</bold><sub>1</sub>, <bold>V</bold><sub>2</sub>) represents row-wise concatenation of the matrices <bold>V</bold><sub>1</sub> and <bold>V</bold><sub>2</sub> into a new matrix <bold>V</bold>. We use &#x02115; &#x0003D; {0, 1, 2, 3, &#x022EF;&#x02009;} to denote the set of natural numbers including zero. We use &#x02299; to denote the Hadamard product.</p></sec>
<sec>
<title>2.2. Preliminary</title>
<p>Recommender systems involve a large number of categorical feature fields, such as userIDs, itemIDs, and the category of items. Let <bold>x</bold> &#x0003D; [<bold>x</bold><sub>1</sub>; <bold>x</bold><sub>2</sub>; &#x022EF;&#x02009;;<bold>x</bold><sub><italic>M</italic></sub>] be an input instance with <italic>M</italic> feature fields, where <bold>x</bold><sub><italic>i</italic></sub> is the one-hot vector corresponding to the <italic>i</italic><sup>th</sup> field. Suppose the vocabulary size of the <italic>i</italic><sup>th</sup> field is <italic>n</italic><sub><italic>i</italic></sub>, i.e., there are <italic>n</italic><sub><italic>i</italic></sub> unique tokens (i.e., categorical features) in the <italic>i</italic><sup>th</sup> field. Each token <italic>x</italic><sub><italic>i</italic></sub> is mapped into a low-dimensional vector <inline-formula><mml:math id="M1"><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> by <bold>v</bold><sub><italic>i</italic></sub> &#x0003D; <bold>V</bold><sub><italic>i</italic></sub><sup>&#x022A4;</sup><bold>x</bold><sub><italic>i</italic></sub>, where <inline-formula><mml:math id="M2"><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>V</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is the embedding matrix of the <italic>i</italic><sup>th</sup> field and <italic>d</italic> is the embedding size.
For notational convenience, let <bold>V</bold> &#x0003D; concat(<bold>V</bold><sub>1</sub>, &#x022EF;&#x02009;, <bold>V</bold><sub><italic>M</italic></sub>) be the embedding matrix consisting of all tokens&#x00027; embeddings. Consider a deep learning based recommender system &#x003D5; parameterized by <bold>V</bold> and &#x00398;, where &#x00398; denotes all model parameters other than those in <bold>V</bold>. We denote the prediction corresponding to <bold>x</bold> as <italic>&#x00177;</italic> &#x0003D; &#x003D5;(<bold>x</bold>|<bold>V</bold>, &#x00398;). We aim to minimize the loss <inline-formula><mml:math id="M32"><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow></mml:math></inline-formula>(<bold>V</bold>, &#x00398;; <inline-formula><mml:math id="M33"><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:math></inline-formula>) &#x0003D; &#x1D53C;<sub>(<bold>x</bold>, <italic>y</italic>)&#x0007E;<inline-formula><mml:math id="M34"><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:math></inline-formula></sub>&#x02113;(&#x003D5;(<bold>x</bold>|<bold>V</bold>, &#x00398;), <italic>y</italic>) over a dataset <inline-formula><mml:math id="M35"><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:math></inline-formula> &#x0003D; {(<bold>x</bold>, <italic>y</italic>)}, where &#x02113; is a loss function such as Logloss.</p></sec>
<sec>
<title>2.3. Multi-size embedding</title>
<p>The multi-size embedding framework allows each token in the vocabulary to have an embedding of a different size (Joglekar et al., <xref ref-type="bibr" rid="B12">2020</xref>; Ginart et al., <xref ref-type="bibr" rid="B9">2021</xref>). By allocating an appropriate size for each token, the multi-size embedding framework can significantly reduce the total number of parameters in the embedding layer while maintaining the quality of learned representations (Joglekar et al., <xref ref-type="bibr" rid="B12">2020</xref>). Although multi-size embedding has these advantages over the standard single-size embedding, applying it requires solving the following problem. Suppose there are <italic>n</italic> tokens in the vocabulary and the total number of parameters in the multi-size embedding table is limited to a predefined budget <italic>k</italic>. How should we choose the size <italic>d</italic><sub><italic>i</italic></sub> of each token <italic>i</italic> under the budget constraint so that the loss is minimized with the learned <italic>d</italic><sub><italic>i</italic></sub>-dimensional embedding vector <inline-formula><mml:math id="M3"><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>? We formally define this embedding size allocation problem in Problem 1.</p>
<p>Problem 1 (Embedding size allocation problem). Given a maximum embedding size <italic>d</italic> and a predefined parameter budget <italic>k</italic>, let <inline-formula><mml:math id="M4"><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> be a <italic>d</italic><sub><italic>i</italic></sub>-dimensional embedding representing token <italic>i</italic>. For element-wise operations between embeddings to work, embeddings of different sizes are padded with zeros to equal length <italic>d</italic> and then projected. Namely, <inline-formula><mml:math id="M5"><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> is padded with <italic>e</italic><sub><italic>i</italic></sub> trailing zeros such that <italic>d</italic><sub><italic>i</italic></sub>&#x0002B;<italic>e</italic><sub><italic>i</italic></sub> &#x0003D; <italic>d</italic>, leading to a padded vector <inline-formula><mml:math id="M6"><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>.
We define <bold>d</bold> &#x0003D; [<italic>d</italic><sub>1</sub>, &#x022EF;&#x02009;, <italic>d</italic><sub><italic>n</italic></sub>]. Let <inline-formula><mml:math id="M7"><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>V</mml:mtext></mml:mstyle></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> be the single-size embedding matrix consisting of all projected <italic>d</italic>-dimensional embeddings, i.e., <inline-formula><mml:math id="M8"><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>V</mml:mtext></mml:mstyle></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mo>:</mml:mo></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>P</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, where <inline-formula><mml:math id="M9"><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>P</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is a learnable projection matrix associated with token <italic>i</italic>. The embedding size allocation problem is to solve the following optimization problem:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M10"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo class="qopname">min</mml:mo></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>d</mml:mtext></mml:mstyle></mml:mrow></mml:munder></mml:mstyle><mml:mi mathvariant="bold-script">L</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>V</mml:mtext></mml:mstyle></mml:mrow><mml:mo class="qopname">^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>d</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mi>&#x00398;</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>d</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>;</mml:mo><mml:mi mathvariant="bold-script">D</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E2"><label>(2)</label><mml:math id="M11"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>s</mml:mi><mml:mo>.</mml:mo><mml:mi>t</mml:mi><mml:mo>.</mml:mo><mml:mtext>&#x000A0;&#x000A0;</mml:mtext></mml:mtd><mml:mtd><mml:msup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>V</mml:mtext></mml:mstyle></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>d</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mi>&#x00398;</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>d</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo class="qopname">argmin</mml:mo></mml:mrow><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>V</mml:mtext></mml:mstyle></mml:mrow><mml:mo class="qopname">^</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mi>&#x00398;</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:mi mathvariant="bold-script">L</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>V</mml:mtext></mml:mstyle></mml:mrow><mml:mo class="qopname">^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>d</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mi>&#x00398;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>d</mml:mtext></mml:mstyle></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mrow><mml:mo>;</mml:mo><mml:mi mathvariant="bold-script">D</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E3"><label>(3)</label><mml:math id="M12"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02264;</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E4"><label>(4)</label><mml:math id="M13"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mo>&#x02200;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x022EF;</mml:mo><mml:mspace width="0.3em" class="thinspace"/><mml:mo>,</mml:mo><mml:mi>n</mml:mi></mml:mrow><mml:mo>}</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:mi>&#x02115;</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02264;</mml:mo><mml:mi>d</mml:mi><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
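<p>As a concrete reading of constraints (3) and (4), the following minimal Python sketch checks whether a candidate size vector <bold>d</bold> is feasible and reports how much smaller the resulting table is than a single-size table with <italic>n</italic> &#x000D7; <italic>d</italic> parameters. The function names, the toy size vector, and the budget value are illustrative assumptions rather than part of PME, and the sketch does not attempt to solve the optimization problem itself.</p>
<preformat>
```python
import numpy as np

def is_feasible(sizes, d_max, budget):
    """Check constraints (3) and (4): integer sizes in [0, d_max], total within budget."""
    sizes = np.asarray(sizes)
    if not np.issubdtype(sizes.dtype, np.integer):
        return False
    in_range = bool(np.all(sizes >= 0) and np.all(d_max >= sizes))
    within_budget = bool(budget >= sizes.sum())
    return in_range and within_budget

def compression_ratio(sizes, d_max):
    """Parameters of the single-size table (n * d_max) divided by those of the multi-size table."""
    sizes = np.asarray(sizes)
    return sizes.size * d_max / sizes.sum()

# Hypothetical toy vocabulary: two head tokens keep many dimensions, six tail tokens keep few.
sizes = np.array([16, 12, 2, 2, 1, 1, 1, 1])
feasible = is_feasible(sizes, d_max=16, budget=40)
ratio = compression_ratio(sizes, d_max=16)
```
</preformat>
<p>In this toy setting the sizes sum to 36, so the allocation fits a budget of 40 and the multi-size table stores 36 parameters instead of 128.</p>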
<p><xref ref-type="fig" rid="F1">Figure 1</xref> illustrates our multi-size embedding framework. The backbone recommendation model in <xref ref-type="fig" rid="F1">Figure 1</xref> refers to the rest of the model excluding the embedding layer. Although the projected embeddings have the same size as the uncompressed ones, we only retrieve and project the embeddings of tokens in the current mini-batch. Because the mini-batch size bounds the number of retrieved embeddings, the memory used by these projected embeddings is negligible compared with the significant reduction in the number of parameters of the multi-size embedding table.</p>
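<p>The retrieve, pad, and project steps can be sketched as follows. This is a minimal illustration under stated assumptions and not the implementation used in this article: it assumes a single field, a flat storage layout with per-token offsets, and a row-vector convention in which the padded embedding multiplies the projection matrix from the left. All variable names and the toy sizes are hypothetical.</p>
<preformat>
```python
import numpy as np

rng = np.random.default_rng(0)

d_max = 4                                  # maximum embedding size d
sizes = np.array([4, 2, 1, 0, 3])          # hypothetical per-token sizes d_i
offsets = np.concatenate(([0], np.cumsum(sizes)))
flat_table = rng.normal(size=sizes.sum())  # only sum(d_i) parameters are stored
P = rng.normal(size=(d_max, d_max))        # projection shared by the (single) field

def lookup(token_ids):
    """Retrieve, zero-pad, and project the embeddings of one mini-batch."""
    padded = np.zeros((len(token_ids), d_max))
    for row, tok in enumerate(token_ids):
        start, stop = offsets[tok], offsets[tok + 1]
        padded[row, : stop - start] = flat_table[start:stop]
    # Row-vector convention: each output row is a combination of rows of P.
    return padded @ P

batch = lookup([0, 3, 1])
```
</preformat>
<p>Only the rows requested by the mini-batch are expanded to length <italic>d</italic>, so the dense intermediate has batch size &#x000D7; <italic>d</italic> entries regardless of the vocabulary size.</p>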
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>The multi-size embedding framework in our article. For element-wise operations to work (e.g., dot-product in factorization machines), the retrieved embeddings are padded with zeros to equal length and then passed through a field-specific projection.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-06-1195742-g0001.tif"/>
</fig>
<p>Following the studies by Zhao et al. (<xref ref-type="bibr" rid="B31">2020a</xref>) and Ginart et al. (<xref ref-type="bibr" rid="B9">2021</xref>), in our article, the projection matrix <bold>P</bold> in Problem 1 is shared among tokens in the same field to learn field-level structures. We note that this approach also has a nice algebraic explanation: the degrees of freedom of token <italic>i</italic>&#x00027;s representation are limited by <italic>d</italic><sub><italic>i</italic></sub> since</p>
<disp-formula id="E5"><label>(5)</label><mml:math id="M14"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mstyle mathvariant="bold"><mml:mtext>P</mml:mtext></mml:mstyle><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup></mml:mtd><mml:mtd><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mtable style="text-align:axis;" equalrows="false" columnlines="none none none none none none none none none" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>p</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>-</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>&#x022EF;</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>p</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mtable style="text-align:axis;" equalrows="false" columnlines="none none none none none none none none none" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>;</mml:mo><mml:mo>&#x022EF;</mml:mo><mml:mspace width="0.3em" 
class="thinspace"/><mml:mo>;</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>;</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mstyle displaystyle="true"><mml:munder accentunder="false"><mml:mrow><mml:mn>0</mml:mn><mml:mo>;</mml:mo><mml:mo>&#x022EF;</mml:mo><mml:mspace width="0.3em" class="thinspace"/><mml:mo>;</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mo>&#x0FE38;</mml:mo></mml:munder></mml:mstyle></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munder></mml:mstyle></mml:mtd></mml:mtr></mml:mtable></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munderover></mml:mstyle><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>p</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>In each field, a token allocated a larger <italic>d</italic><sub><italic>i</italic></sub> has a more expressive embedding, since it is represented using more basis vectors from the row space of <bold>P</bold>. Thus, the multi-size embedding framework in Problem 1 can control the capacity of each token&#x00027;s representation by allocating different embedding sizes.</p>
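<p>To make Equation (5) concrete, the following minimal Python sketch takes the final expression of Equation (5) at face value and computes the weighted sum of basis vectors for a zero-padded embedding. The basis values, the helper names <monospace>pad</monospace> and <monospace>project</monospace>, and the toy sizes are illustrative assumptions and are not part of PME.</p>
<preformat>
```python
# Toy illustration of Equation (5); every number here is made up.
d = 4  # maximal embedding size

# Hypothetical basis vectors p_1..p_d of the shared projection matrix P.
basis = [
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [3.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 1.0],
]


def pad(values):
    """Zero-pad a size-d_i embedding to the maximal size d."""
    return list(values) + [0.0] * (d - len(values))


def project(padded):
    """Return sum_j padded[j] * p_j, the right-hand side of Equation (5)."""
    out = [0.0] * d
    for coeff, p_j in zip(padded, basis):
        for t in range(d):
            out[t] += coeff * p_j[t]
    return out


# A token with d_i = 2: only p_1 and p_2 contribute to its representation.
small = project(pad([0.5, -1.0]))

# Changing p_3 and p_4 does not move the projected vector.
basis[2] = [9.0, 9.0, 9.0, 9.0]
basis[3] = [7.0, 7.0, 7.0, 7.0]
small_again = project(pad([0.5, -1.0]))
```
</preformat>
<p>Because the padded coordinates are zero, the projected vector of this token depends only on its first <italic>d</italic><sub><italic>i</italic></sub> basis vectors, which is the sense in which its representation has at most <italic>d</italic><sub><italic>i</italic></sub> degrees of freedom.</p>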
<p>Solving Problem 1 is computationally hard for two reasons. First, in the recommendation domain, the vocabulary size can easily reach the million level (Covington et al., <xref ref-type="bibr" rid="B6">2016</xref>). Second, since embedding sizes must be integers, the problem is combinatorial and the search space is too large to optimize over directly. Finding the optimal embedding sizes for millions of tokens in a discrete search space requires a large amount of computational resources.</p>
<p>In the next section, we show that this combinatorial optimization problem can be converted to a pruning problem, which can be approximately solved at significantly lower cost.</p></sec></sec>
<sec sec-type="methods" id="s3">
<title>3. Methodology</title>
<p><xref ref-type="fig" rid="F2">Figure 2</xref> gives an overview of our proposed framework. We first search for a customized embedding size for each token in a separate search process before training. The key intuition of our method is that the optimal capacity of a token can be obtained by pruning unimportant dimensions in its embedding. In particular, given a standard single-size embedding layer, we prune the dimensions of each token&#x00027;s embedding that have the least impact on model performance to reduce its capacity. Then, the customized size of each token can be obtained by transferring the capacity of its pruned embedding (Section 3.1). We then derive our proposed pruning-based multi-size embedding framework, which prunes the embedding layer at initialization (Section 3.2). In this way, the time cost of the search process is significantly reduced.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Overview of the PME framework.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-06-1195742-g0002.tif"/>
</fig>
<p>In practice, a multi-size table is implemented as multiple two-dimensional embedding matrices, each with a different size. Since the searched size could be any integer no larger than the maximal size <italic>d</italic>, we need to initialize at most <italic>d</italic> two-dimensional matrices, which adds time cost to the retrieval process. To reduce the extra time cost of retrieving from a multi-size table, we optimize the retrieval process based on group-wise operations (Section 3.3).</p>
<sec>
<title>3.1. Size allocation as a pruning problem</title>
<p>The success of multi-size embedding frameworks suggests that the embeddings of long-tail tokens can be trained with less capacity without impacting model performance (Joglekar et al., <xref ref-type="bibr" rid="B12">2020</xref>; Ginart et al., <xref ref-type="bibr" rid="B9">2021</xref>). This implies that there exist redundant parameters in the single-size embedding. It is intuitive to start pruning from the parameters that have the least impact on model performance, which is equivalent to reducing the embedding size. For example, as shown in <xref ref-type="fig" rid="F3">Figure 3</xref>, the second value in embedding <bold>v</bold><sub>1</sub> is pruned and set to zero, giving an effective embedding size of <italic>d</italic><sub>1</sub> &#x0003D; <italic>d</italic>&#x02212;1. The actual size of the pruned embedding equals the number of remaining parameters.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>An example illustrating the pruning-based multi-size embedding. After pruning, we build the multi-size embedding table for training, where the size of each token is set to the number of remaining parameters in its pruned embedding. We note that some tokens may be cut off entirely from the vocabulary (such as <bold>v</bold><sub>3</sub> in this example), and they are mapped to unlearnable zero vectors.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-06-1195742-g0003.tif"/>
</fig>
<p>Informally, by setting token <italic>i</italic>&#x00027;s allocated size <italic>d</italic><sub><italic>i</italic></sub> to the number of remaining parameters, the capacity of its pruned embedding will be transferred to <inline-formula><mml:math id="M15"><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> in Problem 1. We formalize this statement by showing that, under mild assumptions, the optimal solution of Problem 1 can be constructed from the pruned embeddings (Proposition 1). We first define the redundant parameter identification problem.</p>
<p>Problem 2 (Redundant parameter identification problem). Given an overparameterized embedding matrix <bold>V</bold>&#x02208;&#x0211D;<sup><italic>n</italic>&#x000D7;<italic>d</italic></sup>, the redundant parameter identification problem aims to solve the following constrained optimization problem:</p>
<disp-formula id="E6"><label>(6)</label><mml:math id="M16"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo class="qopname">min</mml:mo></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>V</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mi>&#x00398;</mml:mi><mml:mo>,</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>C</mml:mtext></mml:mstyle></mml:mrow></mml:munder></mml:mstyle><mml:mi>L</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>V</mml:mtext></mml:mstyle><mml:mo>&#x02299;</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>C</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mi>&#x00398;</mml:mi><mml:mo>;</mml:mo><mml:mi mathvariant="bold-script">D</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E7"><label>(7)</label><mml:math id="M17"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>s</mml:mi><mml:mo>.</mml:mo><mml:mi>t</mml:mi><mml:mo>.</mml:mo><mml:mtext>&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mstyle mathvariant="bold"><mml:mtext>C</mml:mtext></mml:mstyle><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>|</mml:mo><mml:mo>|</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>C</mml:mtext></mml:mstyle><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>&#x02264;</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <bold>C</bold> is an auxiliary matrix of binary &#x0201C;gates&#x0201D; that indicate whether each parameter in <bold>V</bold> is present, and <italic>k</italic> is the parameter budget, i.e., the maximum number of non-zero entries kept in <bold>V</bold> (the number of gates that are &#x0201C;on&#x0201D;). The redundant parameters are identified by the zeros (the gates that are &#x0201C;off&#x0201D;) in <bold>C</bold>.</p>
<p>Proposition 1 (Proof in <xref ref-type="supplementary-material" rid="SM1">Appendix 1</xref>). If the projection matrix in Problem 1 is shared between tokens in each field, the optimal solution of Problem 1 can be constructed from one solution to Problem 2.</p>
<p>The solution <bold>d</bold> to Problem 1 can be obtained by setting the size of each token to the number of remaining parameters in its pruned embedding. We note that the <bold>d</bold> constructed in this way satisfies all constraints in Problem 1. First, according to Equation (7), at most <italic>k</italic> parameters remain in the pruned embedding matrix in total, so the constructed <bold>d</bold> meets the budget constraint in Equation (3). Second, the constructed <bold>d</bold> naturally meets the maximal size constraint in Equation (4), since the number of remaining parameters in each pruned embedding is no more than <italic>d</italic>.</p>
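<p>The construction of <bold>d</bold> from the gate matrix <bold>C</bold> can be sketched in a few lines of Python. The gate values, the budget, and the helper name <monospace>sizes_from_mask</monospace> below are hypothetical and chosen only to mirror the kind of example shown in <xref ref-type="fig" rid="F3">Figure 3</xref>; the sketch checks the budget and maximal size constraints on this toy input.</p>
<preformat>
```python
# Hypothetical binary gate matrix C for n = 4 tokens and maximal size d = 4.
C = [
    [1, 0, 1, 1],  # second dimension pruned, so this token keeps d - 1 values
    [1, 1, 0, 0],
    [0, 0, 0, 0],  # token cut off entirely; it will map to a zero vector
    [1, 1, 1, 1],
]
d = 4
k = 10  # parameter budget (assumed value)


def sizes_from_mask(mask):
    """Allocated size of each token = number of gates left on in its row."""
    return [sum(row) for row in mask]


sizes = sizes_from_mask(C)

# Budget constraint: the total number of kept parameters does not exceed k.
budget_ok = k >= sum(sizes)
# Maximal size constraint: no token is allocated more than d dimensions.
max_size_ok = all(d >= s for s in sizes)
```
</preformat>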
<p>As shown in <xref ref-type="fig" rid="F3">Figure 3</xref>, by Proposition 1 and the above analysis, we build the multi-size embedding table for training, where the customized size of each token equals the capacity of its pruned embedding. In the next subsection, we show that Problem 2 can be approximately solved at significantly lower cost.</p></sec>
<sec>
<title>3.2. Prune embeddings without training them</title>
<p>Most existing methods in the pruning literature attempt to identify redundant parameters from a pretrained reference network, either based on a saliency criterion (Han et al., <xref ref-type="bibr" rid="B11">2016</xref>; Kusupati et al., <xref ref-type="bibr" rid="B15">2020</xref>) or by utilizing sparsity-enforcing penalties (Carreira-Perpi&#x000F1;&#x000E1;n and Idelbayev, <xref ref-type="bibr" rid="B3">2018</xref>). Unfortunately, all these pruning methods require many expensive <italic>pretrain-prune-retrain</italic> cycles and introduce additional hyperparameters. Recent work has explored the possibility of pruning neural networks at initialization (Lee et al., <xref ref-type="bibr" rid="B16">2019</xref>; Wang et al., <xref ref-type="bibr" rid="B27">2020</xref>). Namely, given a desired parameter budget, redundant parameters are pruned once before training, and then the pruned network is trained in the standard way. With this technique, there is no need for network pretraining or complex pruning schedules. Inspired by single-shot network pruning (SNIP) (Lee et al., <xref ref-type="bibr" rid="B16">2019</xref>), we directly prune unimportant parameters in the embedding according to the <italic>connection sensitivity</italic>, which can be obtained by utilizing a full batch of training data. Consequently, the pruning process is disentangled from the above iterative cycle.</p>
<p>The key idea of <italic>connection sensitivity</italic> proposed in SNIP is to preserve the parameters that have the maximum impact on the loss if perturbed. Specifically, the effect of removing parameter <bold>V</bold><sub><italic>i, j</italic></sub> on the loss can be measured as follows:</p>
<disp-formula id="E8"><label>(8)</label><mml:math id="M18"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>&#x00394;</mml:mi><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>V</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mi>&#x00398;</mml:mi><mml:mo>;</mml:mo><mml:mi mathvariant="bold-script">D</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>L</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>1</mml:mtext></mml:mstyle><mml:mo>&#x02299;</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>V</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mi>&#x00398;</mml:mi><mml:mo>;</mml:mo><mml:mi mathvariant="bold-script">D</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:mi>L</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>1</mml:mtext></mml:mstyle><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>e</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02299;</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>V</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mi>&#x00398;</mml:mi><mml:mo>;</mml:mo><mml:mi mathvariant="bold-script">D</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M19"><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>e</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is an indicator matrix of element <bold>V</bold><sub><italic>i, j</italic></sub> (i.e., zeros everywhere except at the <italic>i</italic><sup>th</sup> row and <italic>j</italic><sup>th</sup> column where it is one), and <bold>1</bold>&#x02208;&#x0211D;<sup><italic>n</italic>&#x000D7;<italic>d</italic></sup> is an all-ones matrix. Equation (8) measures the influence of parameter <bold>V</bold><sub><italic>i, j</italic></sub> on the loss in the discrete setting since <bold>C</bold> is binary. Computing &#x00394;<italic>L</italic><sub><italic>i, j</italic></sub> for each <italic>i, j</italic> is prohibitively expensive since it requires an individual forward pass over the dataset for each parameter <bold>V</bold><sub><italic>i, j</italic></sub>. However, by relaxing the binary constraint on <bold>C</bold>, &#x00394;<italic>L</italic><sub><italic>i, j</italic></sub> can be approximated by the derivative of <italic>L</italic> with respect to <bold>C</bold><sub><italic>i, j</italic></sub>, which is called the <italic>connection sensitivity</italic>. Specifically, the <italic>connection sensitivity</italic> <bold>G</bold>(<bold>V</bold>, &#x00398;; <inline-formula><mml:math id="M36"><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:math></inline-formula>) in SNIP can be computed as follows:</p>
<disp-formula id="E9"><label>(9)</label><mml:math id="M20"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>&#x00394;</mml:mi><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>V</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mi>&#x00398;</mml:mi><mml:mo>;</mml:mo><mml:mi mathvariant="bold-script">D</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02248;</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>G</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>V</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mi>&#x00398;</mml:mi><mml:mo>;</mml:mo><mml:mi mathvariant="bold-script">D</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>&#x02202;</mml:mi><mml:mi>L</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>C</mml:mtext></mml:mstyle><mml:mo>&#x02299;</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>V</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mi>&#x00398;</mml:mi><mml:mo>;</mml:mo><mml:mi mathvariant="bold-script">D</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x02202;</mml:mi><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>C</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac><mml:msub><mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>C</mml:mtext></mml:mstyle><mml:mo>=</mml:mo><mml:mstyle 
mathvariant="bold"><mml:mtext>1</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E10"><label>(10)</label><mml:math id="M21"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>&#x02202;</mml:mi><mml:mi>L</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>V</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mi>&#x00398;</mml:mi><mml:mo>;</mml:mo><mml:mi mathvariant="bold-script">D</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x02202;</mml:mi><mml:mstyle mathvariant="bold"><mml:mtext>V</mml:mtext></mml:mstyle></mml:mrow></mml:mfrac><mml:mo>&#x02299;</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>V</mml:mtext></mml:mstyle><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The parameters that would least impact performance if removed can be identified according to the <italic>connection sensitivity</italic>. We list the full algorithm in <xref ref-type="table" rid="T2">Algorithm 1</xref>. There is only one hyperparameter in <xref ref-type="table" rid="T2">Algorithm 1</xref>, namely, the parameter budget <italic>k</italic>, which controls the total number of parameters in the multi-size table. Specifically, we first initialize a standard single-size embedding layer, then calculate the <italic>connection sensitivity</italic> <bold>G</bold>(<bold>V</bold>, &#x00398;; <inline-formula><mml:math id="M37"><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:math></inline-formula>). Once <bold>G</bold>(<bold>V</bold>, &#x00398;; <inline-formula><mml:math id="M38"><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:math></inline-formula>) is obtained, the parameters corresponding to the top-<italic>k</italic> values of |<bold>G</bold>(<bold>V</bold>, &#x00398;; <inline-formula><mml:math id="M39"><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:math></inline-formula>)| are kept. Finally, the allocated size of each token is set to the number of kept dimensions in its pruned embedding.</p>
<table-wrap position="float" id="T2">
<label>Algorithm 1</label>
<caption><p>Pruning-based embedding size search.</p></caption>
<graphic xlink:href="fdata-06-1195742-i0001.tif"/>
</table-wrap>
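<p>The steps described in the text above (initialize, compute the connection sensitivity of Equation (10), keep the top-<italic>k</italic> entries by magnitude, count the kept dimensions per token) can be sketched as follows. This is a minimal illustration under strong assumptions and not the authors&#x00027; implementation: the model is a toy logistic predictor with a single categorical field and a fixed weight vector <monospace>theta</monospace>, the gradient is derived by hand for that toy model, and all names and numbers are hypothetical.</p>
<preformat>
```python
import math


def connection_sensitivity(V, theta, data):
    """Return G = (dL/dV) * V elementwise, as in Equation (10).

    Toy model (an assumption): logit = theta . V[token], L = mean log loss
    over the full batch `data`, a list of (token, label) pairs.
    """
    n, d = len(V), len(V[0])
    grad = [[0.0] * d for _ in range(n)]
    for token, label in data:
        logit = sum(t * v for t, v in zip(theta, V[token]))
        err = 1.0 / (1.0 + math.exp(-logit)) - label  # d(log loss)/d(logit)
        for j in range(d):
            grad[token][j] += err * theta[j] / len(data)
    return [[grad[i][j] * V[i][j] for j in range(d)] for i in range(n)]


def search_sizes(V, theta, data, k):
    """Keep the k entries with the largest |G| and count them per token."""
    G = connection_sensitivity(V, theta, data)
    scored = [(abs(G[i][j]), i) for i in range(len(V)) for j in range(len(V[0]))]
    scored.sort(key=lambda s: s[0], reverse=True)
    sizes = [0] * len(V)
    for _, i in scored[:k]:
        sizes[i] += 1
    return sizes


# Hypothetical initial embeddings for n = 3 tokens with d = 4.
V = [
    [0.1, -0.2, 0.3, 0.05],
    [0.2, 0.1, -0.1, 0.3],
    [0.3, -0.3, 0.2, 0.1],
]
theta = [1.0, -0.5, 0.25, 2.0]
# Token 0 is frequent, token 1 is rare, and token 2 never appears.
data = [(0, 1), (0, 1), (0, 0), (0, 1), (1, 0)]

sizes = search_sizes(V, theta, data, k=5)
```
</preformat>
<p>In this toy setting, a token that never appears in the data receives zero gradient and hence zero sensitivity, so it is among the first to be pruned; the sketch does not claim that real recommendation data behave the same way.</p>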
</sec>
<sec>
<title>3.3. Multi-size table lookup optimization</title>
<p>Most deep learning frameworks do not support embedding tables with multiple sizes. In practice, a multi-size table is implemented as multiple two-dimensional matrices, each with a different size. When retrieving embeddings from a multi-size table, we must identify which matrix contains each token&#x00027;s embedding according to its size.</p>
<p>The time cost of identifying the matrix that contains a token&#x00027;s embedding grows linearly with the number of candidate matrices. In <xref ref-type="table" rid="T2">Algorithm 1</xref>, the searched size of each token can be any integer between 0 and <italic>d</italic>, which means we need to initialize at most <italic>d</italic> two-dimensional matrices. Thus, the retrieval process will be significantly slowed down when <italic>d</italic> is large, which contradicts the goal of efficiency.</p>
<p>Similar to previous studies (Joglekar et al., <xref ref-type="bibr" rid="B12">2020</xref>; Zhao et al., <xref ref-type="bibr" rid="B31">2020a</xref>,<xref ref-type="bibr" rid="B32">b</xref>), we define a candidate size set <inline-formula><mml:math id="M26"><mml:mi mathvariant="bold-script">C</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x022EF;</mml:mo><mml:mspace width="0.3em" class="thinspace"/><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>, where <inline-formula><mml:math id="M27"><mml:mn>0</mml:mn><mml:mo>&#x02264;</mml:mo><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x0003C;</mml:mo><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>&#x0003C;</mml:mo><mml:mo>&#x022EF;</mml:mo><mml:mo>&#x0003C;</mml:mo><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>d</mml:mi></mml:math></inline-formula> are <italic>T</italic> predefined embedding sizes. 
The searched size given by <xref ref-type="table" rid="T2">Algorithm 1</xref> is rounded to its nearest neighbor in <inline-formula><mml:math id="M40"><mml:mrow><mml:mi mathvariant="script">C</mml:mi></mml:mrow></mml:math></inline-formula>. If <inline-formula><mml:math id="M28"><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>, tokens that have been cut off entirely from the vocabulary (e.g., <bold>v</bold><sub>3</sub> in the example of <xref ref-type="fig" rid="F3">Figure 3</xref>) are mapped to a padding index, which is retrieved as an unlearnable zero vector. As shown in <xref ref-type="fig" rid="F2">Figure 2</xref>, to retrieve embeddings for a batch of tokens in different fields, we first split them into <italic>T</italic> groups based on their rounded embedding sizes. Then, we retrieve the embeddings for each group and pad them with zeros to equal length. Finally, we re-arrange these padded embeddings to recover the original order of the input tokens and apply the field-specific projection to them. We note that the padding and retrieval steps can be executed efficiently in parallel. As the number of groups <italic>T</italic> is typically small, we found that this group-wise implementation incurs minimal overhead compared with a standard single-size embedding.</p></sec>
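<p>The group-wise retrieval of Section 3.3 can be sketched in plain Python as follows. This is an illustrative sketch only: it uses dictionaries instead of framework embedding tables, omits the field-specific projection, and breaks rounding ties toward the smaller candidate, which is an assumption because the text does not specify a tie rule. The function names and toy values are hypothetical.</p>
<preformat>
```python
def round_to_candidates(size, candidates):
    """Round a searched size to its nearest neighbour in the candidate set.

    Ties go to the smaller candidate (an assumption, not from the paper).
    """
    return min(candidates, key=lambda c: (abs(c - size), c))


def grouped_lookup(token_ids, rounded_size, tables, d):
    """Retrieve zero-padded embeddings group by group, keeping input order.

    rounded_size: token id -> size in the candidate set
    tables: size -> {token id -> list of `size` floats}; size 0 needs no table
    """
    # 1. Split batch positions into groups by rounded embedding size.
    groups = {}
    for pos, tok in enumerate(token_ids):
        groups.setdefault(rounded_size[tok], []).append(pos)

    # 2. Retrieve each group, pad with zeros to length d, and
    # 3. write results back at their original batch positions.
    out = [None] * len(token_ids)
    for size, positions in groups.items():
        for pos in positions:
            vec = tables[size][token_ids[pos]] if size > 0 else []
            out[pos] = list(vec) + [0.0] * (d - size)
    return out


# Toy setup with d = 4 and candidate sizes {0, 2, 4}.
tables = {
    2: {7: [0.1, 0.2]},
    4: {3: [1.0, 2.0, 3.0, 4.0]},
}
rounded_size = {3: 4, 7: 2, 9: 0}  # token 9 was pruned entirely
batch = grouped_lookup([3, 9, 7, 3], rounded_size, tables, d=4)
```
</preformat>
<p>In a tensor framework, each group would typically be handled by one batched lookup rather than a Python loop; the loop here only makes the split, pad, and re-arrange steps explicit.</p>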
<sec>
<title>3.4. Discussion and limitation</title>
<sec>
<title>3.4.1. Discussion</title>
<p>We recap and discuss the differences between our formulation of the embedding size allocation problem and those in previous studies. There are two main differences.</p>
<p>First, in most previous studies, the size allocation problem is formulated as an architecture selection problem (Joglekar et al., <xref ref-type="bibr" rid="B12">2020</xref>; Zhao et al., <xref ref-type="bibr" rid="B31">2020a</xref>,<xref ref-type="bibr" rid="B32">b</xref>). Consequently, following the paradigm of NAS, the validation set is used for selecting the size, i.e., the objective in Equation (1) is <inline-formula><mml:math id="M29"><mml:msub><mml:mrow><mml:mi mathvariant="bold-script">L</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>V</mml:mtext></mml:mstyle></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>d</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mi>&#x00398;</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>d</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>;</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="bold-script">D</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. In contrast, we formulate the size allocation problem as a pruning problem, which tries to identify the parameters that would least impact the training loss if removed. Only with this formulation can we search embedding sizes without training the model, which significantly improves search efficiency. 
Moreover, the memory usage of embedding layers can be reduced during both the training and inference phases. A detailed discussion about the difference between the formulation based on NAS and the formulation based on pruning is provided in <xref ref-type="supplementary-material" rid="SM1">Appendix 2</xref> (<xref ref-type="supplementary-material" rid="SM1">Supplementary material</xref>).</p>
<p>Second, most previous work constructs several projection matrices for each field, where tokens with the same allocated size share a common projection matrix. In contrast, we propose to construct only one projection matrix for each field, since tokens in the same field have a field-level latent structure (Zhao et al., <xref ref-type="bibr" rid="B31">2020a</xref>; Ginart et al., <xref ref-type="bibr" rid="B9">2021</xref>). Specifically, embeddings with different sizes are padded with zeros to equal length, which makes it possible to apply the field-specific projections. This approach has a nice algebraic explanation (see Equation 5). We note that our approach also enables embeddings of equal length but belonging to different fields to be retrieved simultaneously, which most previous approaches do not readily allow. A detailed analysis is provided in <xref ref-type="supplementary-material" rid="SM1">Appendix 2</xref> (<xref ref-type="supplementary-material" rid="SM1">Supplementary material</xref>).</p></sec>
<sec>
<title>3.4.2. Limitation</title>
<p>The main limitation of PME is that, during the embedding size search phase, the memory usage of embedding layers cannot be reduced. However, we note that most search-based multi-size embedding frameworks also have this problem (Joglekar et al., <xref ref-type="bibr" rid="B12">2020</xref>; Zhao et al., <xref ref-type="bibr" rid="B31">2020a</xref>,<xref ref-type="bibr" rid="B32">b</xref>; Liu et al., <xref ref-type="bibr" rid="B20">2021</xref>). It is necessary to initialize embeddings with the maximal size to evaluate whether the maximal available size in the search space is suitable for a specific token. In this article, we mainly focus on reducing the memory usage of models during the training and inference phases, as well as their storage requirements.</p></sec></sec></sec>
<sec id="s4">
<title>4. Experiment</title>
<p>We verify the effectiveness of our proposed framework by answering the following research questions:</p>
<list list-type="bullet">
<list-item><p><bold>RQ1</bold>. How does PME compare with other embedding compression methods in terms of model performance at different compression rates?</p></list-item>
<list-item><p><bold>RQ2</bold>. What is the additional time cost for searching the embedding size and for training the model, respectively?</p></list-item>
<list-item><p><bold>RQ3</bold>. How sensitive are the searched embedding sizes to the backbone models and to the initialized weights, respectively?</p></list-item>
</list>
<sec>
<title>4.1. Experimental settings</title>
<p>We first introduce the baseline methods for comparison. Then, we introduce the applied datasets and the hyperparameter settings.</p>
<sec>
<title>4.1.1. Baselines</title>
<p>We compare our proposed method with the following five representative embedding compression methods: (1) <bold>SE</bold> (single-size embedding): a standard single-size embedding method that assigns a fixed embedding size to all tokens in the vocabulary. (2) <bold>MDE</bold> (mixed dimension embedding) (Ginart et al., <xref ref-type="bibr" rid="B9">2021</xref>): a multi-size embedding method that scales each token&#x00027;s embedding size with its frequency according to heuristic rules designed by human experts. (3) <bold>QREMB</bold> (quotient-remainder embedding) (Shi et al., <xref ref-type="bibr" rid="B25">2020</xref>): a hashing-based method that reduces the total vocabulary size by storing multiple smaller embedding tables based on a standard remainder-hashing function. (4) <bold>LRF</bold> (low-rank factorization) (Koren et al., <xref ref-type="bibr" rid="B14">2009</xref>): a low-rank method that factorizes the embedding matrix <bold>V</bold>&#x02208;&#x0211D;<sup><italic>n</italic>&#x000D7;<italic>d</italic></sup> as <bold>QR</bold>, where <bold>Q</bold>&#x02208;&#x0211D;<sup><italic>n</italic>&#x000D7;<italic>r</italic></sup>, <bold>R</bold>&#x02208;&#x0211D;<sup><italic>r</italic>&#x000D7;<italic>d</italic></sup>, and <italic>r</italic> is the rank, which satisfies <italic>r</italic>&#x0003C;<italic>d</italic>. (5) <bold>DartsEMB</bold> (Zhao et al., <xref ref-type="bibr" rid="B32">2020b</xref>): a NAS-based multi-size embedding method that relaxes the discrete embedding size allocation problem to a continuous one that can be solved by gradient descent (Liu et al., <xref ref-type="bibr" rid="B19">2019</xref>). 
This method is chosen to represent the performance of NAS-based multi-size embedding methods.<xref ref-type="fn" rid="fn0002"><sup>2</sup></xref> The embedding compression methods are deployed on three representative state-of-the-art recommendation models, DeepFM (Guo et al., <xref ref-type="bibr" rid="B10">2017</xref>), AutoInt&#x0002B; (Song et al., <xref ref-type="bibr" rid="B26">2019</xref>), and Wide and Deep (Cheng et al., <xref ref-type="bibr" rid="B5">2016</xref>), to compare their performance. The hyperparameters of these three recommendation models are detailed in <xref ref-type="supplementary-material" rid="SM1">Appendix 3.2</xref> (<xref ref-type="supplementary-material" rid="SM1">Supplementary material</xref>). Logloss and AUC score are selected as the core metrics for evaluating recommendation model performance.</p></sec>
<sec>
<title>4.1.2. Data preprocessing</title>
<p>We adopt two public benchmark datasets in this article, i.e., <bold>Criteo</bold><xref ref-type="fn" rid="fn0003"><sup>3</sup></xref> and <bold>Avazu</bold>.<xref ref-type="fn" rid="fn0004"><sup>4</sup></xref> The basic statistics of these two datasets are summarized in <xref ref-type="supplementary-material" rid="SM1">Supplementary Table A1</xref> (<xref ref-type="supplementary-material" rid="SM1">Supplementary material</xref>). Both datasets are processed with the method and code provided by Song et al. (<xref ref-type="bibr" rid="B26">2019</xref>). Following Guo et al. (<xref ref-type="bibr" rid="B10">2017</xref>) and Song et al. (<xref ref-type="bibr" rid="B26">2019</xref>), we divide each dataset into training (80%), validation (10%), and test (10%) sets.</p></sec>
<sec>
<title>4.1.3. Hyperparameter settings</title>
<p>Since there is a trade-off between recommendation model performance and the number of parameters in the embedding table, we adjust the hyperparameters of the embedding compression methods so that their numbers of trainable parameters are comparable, which allows a fair comparison of their effectiveness. For <bold>PME</bold>, the embedding size of the full SE embedding table to be pruned is set to 32. As illustrated in Section 3, PME has two hyperparameters, namely, the parameter budget <italic>k</italic> and the candidate embedding size set <inline-formula><mml:math id="M41"><mml:mrow><mml:mi mathvariant="script">C</mml:mi></mml:mrow></mml:math></inline-formula>. The candidate size set <inline-formula><mml:math id="M42"><mml:mrow><mml:mi mathvariant="script">C</mml:mi></mml:mrow></mml:math></inline-formula> is set to {0, 2, 8, 16, 32} across all experiments, i.e., each searched size given by <xref ref-type="table" rid="T2">Algorithm 1</xref> is rounded to its nearest neighbor in <inline-formula><mml:math id="M43"><mml:mrow><mml:mi mathvariant="script">C</mml:mi></mml:mrow></mml:math></inline-formula>. Let <italic>K</italic> denote the total number of parameters in the single-size embedding table before pruning. The parameter budget <italic>k</italic> is set to 2% &#x000D7; <italic>K</italic>, 5% &#x000D7; <italic>K</italic>, and 10% &#x000D7; <italic>K</italic>. Due to the page limit, detailed hyperparameter settings for all other baselines are specified in <xref ref-type="supplementary-material" rid="SM1">Appendix 3.3</xref> (<xref ref-type="supplementary-material" rid="SM1">Supplementary material</xref>). The <bold>compression rate</bold> <italic>cr</italic> is calculated as follows:</p>
<disp-formula id="E11"><mml:math id="M30"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>c</mml:mi><mml:mi>r</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mtext class="textrm" mathvariant="normal">&#x00023; of parameters in the full SE embedding table</mml:mtext></mml:mrow><mml:mrow><mml:mtext class="textrm" mathvariant="normal">&#x00023; of parameters in the compressed embedding table.</mml:mtext></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
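<p>As an illustration of the quantities defined above, the following minimal Python sketch rounds searched sizes to the candidate set {0, 2, 8, 16, 32}, derives a parameter budget from the size of the full table, and evaluates the compression rate. The function names, the tie-breaking rule, and the toy searched sizes are assumptions made for illustration; they are not taken from the PME implementation.</p>
<preformat>
```python
# Illustrative sketch only: helper names and toy numbers are assumptions,
# not part of the PME code base.
CANDIDATE_SIZES = (0, 2, 8, 16, 32)  # candidate size set C from the text


def round_to_candidate(size, candidates=CANDIDATE_SIZES):
    """Round a searched size to its nearest neighbor in the candidate set."""
    # Ties are broken toward the smaller candidate (an assumption).
    return min(candidates, key=lambda c: (abs(c - size), c))


def parameter_budget(num_tokens, full_size, fraction):
    """Budget k as a fraction of K, the parameter count of the full SE table."""
    return round(num_tokens * full_size * fraction)


def compression_rate(full_params, compressed_params):
    """cr = (# parameters in full SE table) / (# parameters in compressed table)."""
    return full_params / compressed_params


if __name__ == "__main__":
    searched = [0.4, 1.3, 5.0, 11.9, 27.5]  # hypothetical searched sizes
    rounded = [round_to_candidate(s) for s in searched]
    full_params = len(searched) * 32  # every token has size 32 before pruning
    print(rounded, compression_rate(full_params, sum(rounded)))
```
</preformat>
<p>Because rounding changes the size of individual tokens, the parameter count after rounding may deviate slightly from the budget <italic>k</italic>, which is one possible reason why compression rates are described as approximate.</p>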
<p>We implement our method using <italic>PyTorch</italic> (Paszke et al., <xref ref-type="bibr" rid="B24">2019</xref>). Each experiment is run on a single NVIDIA GeForce GTX 1080 Ti GPU, with several models trained on it in parallel. To reduce the variance, all reported numbers are averaged over four random trials.</p></sec></sec>
<sec>
<title>4.2. Performance vs. parameter number</title>
<p>To answer <bold>RQ1</bold>, we evaluate model performance with embedding compression methods at different compression rates. In addition, we experimentally analyze the relationship between each token&#x00027;s assigned size and its frequency to understand how PME allocates embedding sizes.</p>
<sec>
<title>4.2.1. Criteo and Avazu results</title>
<p><xref ref-type="fig" rid="F4">Figures 4</xref>, <xref ref-type="fig" rid="F5">5</xref> depict the Logloss of three recommendation models with embedding compression methods on the Criteo and Avazu datasets, respectively. We observe that PME generally outperforms the other baselines at different compression rates. Furthermore, we remark that PME can outperform SE even when SE uses the maximal size on the Criteo dataset. For example, PME improves the Logloss at the 0.001 level while eliminating 97.4% and 95.7% of the parameters in the embedding layer for Autoint&#x0002B; and Wide and Deep on the Criteo dataset, respectively. It is worth pointing out that an improvement of approximately 0.001 in terms of Logloss or AUC is already regarded as practically significant on these CTR prediction tasks (Cheng et al., <xref ref-type="bibr" rid="B5">2016</xref>). The AUC results, shown in <xref ref-type="fig" rid="F6">Figures 6</xref>, <xref ref-type="fig" rid="F7">7</xref>, follow trends similar to the Logloss results. We note that DartsEMB cannot assign zero dimensions to tokens due to its NAS-based formulation. Moreover, DartsEMB cannot directly control the compression rate. Consequently, the only way to control DartsEMB&#x00027;s compression rate is to decrease the maximal available size in its search space. However, decreasing the maximal available size limits the representation capacity of important tokens. Thus, with DartsEMB, it is hard to achieve good performance at a compression rate beyond 10&#x000D7;. In contrast, PME can directly exclude unimportant tokens from the vocabulary by assigning zero dimensions to them. Since the majority of tokens in the vocabulary are unimportant, PME can maintain the model performance even at an extremely high compression rate, such as 40&#x000D7;. Moreover, we emphasize that the memory usage of recommendation models with PME is reduced during both the standard training and inference processes.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Test Logloss of recommendation models at approximately 10&#x000D7;, 20&#x000D7;, and 40&#x000D7; compression rate on Criteo dataset. <bold>(A)</bold> The backbone model is DeepFM. <bold>(B)</bold> The backbone model is Autoint&#x0002B;. <bold>(C)</bold> The backbone model is Wide and Deep.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-06-1195742-g0004.tif"/>
</fig>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Test Logloss of recommendation models at approximately 10&#x000D7;, 20&#x000D7;, and 40&#x000D7; compression rate on Avazu dataset. <bold>(A)</bold> The backbone model is DeepFM. <bold>(B)</bold> The backbone model is Autoint&#x0002B;. <bold>(C)</bold> The backbone model is Wide and Deep.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-06-1195742-g0005.tif"/>
</fig>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Test AUC of recommendation models at approximately 10&#x000D7;, 20&#x000D7;, and 40&#x000D7; compression rate on Criteo dataset. <bold>(A)</bold> The backbone model is DeepFM. <bold>(B)</bold> The backbone model is Autoint&#x0002B;. <bold>(C)</bold> The backbone model is Wide and Deep.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-06-1195742-g0006.tif"/>
</fig>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>Test AUC of recommendation models at approximately 10&#x000D7;, 20&#x000D7;, and 40&#x000D7; compression rate on Avazu dataset. <bold>(A)</bold> The backbone model is DeepFM. <bold>(B)</bold> The backbone model is Autoint&#x0002B;. <bold>(C)</bold> The backbone model is Wide and Deep.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-06-1195742-g0007.tif"/>
</fig></sec>
<sec>
<title>4.2.2. Relationship between frequency and allocated sizes</title>
<p>Recent work hypothesizes that frequent tokens are more important for model performance and hence deserve more capacity, whereas a few parameters are enough for infrequent tokens (Joglekar et al., <xref ref-type="bibr" rid="B12">2020</xref>; Kang et al., <xref ref-type="bibr" rid="B13">2020</xref>; Ginart et al., <xref ref-type="bibr" rid="B9">2021</xref>). Based on this hypothesis, several studies explicitly scale the embedding size with each token&#x00027;s frequency (Kang et al., <xref ref-type="bibr" rid="B13">2020</xref>; Ginart et al., <xref ref-type="bibr" rid="B9">2021</xref>). In contrast, PME learns embedding sizes by transferring the capacity of tokens&#x00027; pruned embeddings without using the frequency information.</p>
<p>To study whether the embedding sizes assigned by PME are related to frequency, we visualize the distribution of each token&#x00027;s embedding size against its frequency on the Criteo dataset in <xref ref-type="fig" rid="F8">Figure 8</xref>, where the backbone model is DeepFM with a 40&#x000D7; compressed embedding layer. Two main observations are summarized as follows: (1) PME generally assigns larger sizes to frequent tokens and smaller sizes to infrequent ones. (2) Several infrequent tokens, whose frequency is less than 10<sup>3</sup>, are assigned a large capacity, and some frequent tokens are assigned a smaller capacity. These two observations are partially aligned with the hypothesis that frequent tokens are more important for model performance and hence deserve more capacity. More importantly, our observations also suggest that a token&#x00027;s capacity should not be decided purely by its popularity. For example, niche items, such as cult films in movie recommendation, are rarely observed in the collected data compared with popular ones; however, the quality of these niche items&#x00027; representations is crucial for personalized recommendations, and hence these items deserve more capacity. Simply scaling embedding sizes with token frequency may sacrifice the quality of these niche items&#x00027; representations. In contrast, PME allocates sizes that preserve the performance of the model with the full embedding as much as possible, and hence may allocate more capacity to tokens whose representations play a decisive role in recommendation performance.</p>
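<p>One simple way to inspect such a relationship numerically is to group tokens by the order of magnitude of their frequency and average the allocated sizes within each group. The sketch below is a hypothetical analysis helper written for illustration; the function name and the toy frequencies and sizes are assumptions and do not reproduce the data behind the figure.</p>
<preformat>
```python
from collections import defaultdict
from math import floor, log10


def mean_size_by_frequency_decade(frequencies, sizes):
    """Group tokens by floor(log10(frequency)) and average their sizes.

    Frequencies are assumed to be positive counts (at least 1).
    """
    buckets = defaultdict(list)
    for freq, size in zip(frequencies, sizes):
        buckets[floor(log10(freq))].append(size)
    return {decade: sum(v) / len(v) for decade, v in sorted(buckets.items())}


if __name__ == "__main__":
    # Hypothetical token frequencies and allocated embedding sizes.
    frequencies = [5, 7, 40, 300, 20000]
    sizes = [0, 2, 2, 8, 32]
    print(mean_size_by_frequency_decade(frequencies, sizes))
```
</preformat>
<p>A mean that grows with the frequency decade would be consistent with observation (1), while a large spread of sizes inside a low-frequency decade would correspond to observation (2).</p>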
<fig id="F8" position="float">
<label>Figure 8</label>
<caption><p>Distribution of tokens&#x00027; allocated embedding sizes across all fields on Criteo dataset. The backbone model is DeepFM. PME generally assigns larger embedding sizes to frequent tokens and smaller sizes to infrequent tokens.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-06-1195742-g0008.tif"/>
</fig></sec></sec>
<sec>
<title>4.3. Efficiency analysis</title>
<p>As shown in <xref ref-type="fig" rid="F2">Figure 2</xref>, the entire pipeline has two phases, namely, the size search phase and the training phase. To answer <bold>RQ2</bold>, we present and analyze the time cost of these two phases, respectively.</p>
<p>For the search phase, we report the search time of PME and DartsEMB in <xref ref-type="table" rid="T1">Table 1</xref>. We note that none of the other baselines has a separate search process. The search cost of PME is approximately 30%&#x0007E;40% of that of DartsEMB. This is mainly because the embedding table in PME is not trained during the search. In contrast, DartsEMB follows the paradigm of neural architecture search, which requires solving a bi-level optimization problem during the search.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Search time (seconds) of PME and DartsEMB on Criteo dataset with different backbone models.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919497;color:#ffffff">
<th valign="top" align="left"><bold>Search time</bold></th>
<th valign="top" align="center"><bold>DeepFM</bold></th>
<th valign="top" align="center"><bold>Autoint&#x0002B;</bold></th>
<th valign="top" align="center"><bold>Wide and Deep</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">DartsEMB</td>
<td valign="top" align="center">801</td>
<td valign="top" align="center">2,404</td>
<td valign="top" align="center">745</td>
</tr> <tr>
<td valign="top" align="left">PME</td>
<td valign="top" align="center">228 (&#x02212;71.5%)</td>
<td valign="top" align="center">1,034 (&#x02212;60.0%)</td>
<td valign="top" align="center">219 (&#x02212;70.6%)</td>
</tr></tbody>
</table>
</table-wrap>
<p>For the training phase, <xref ref-type="fig" rid="F9">Figure 9</xref> displays the training time per epoch of three models with different embedding compression methods. We observe that PME generally reduces the training time by 10%&#x0007E;20% compared with SE, and is comparable to or faster than the other baselines. This speedup may arise because models with PME have significantly fewer trainable parameters, i.e., many tokens are mapped to unlearnable zero vectors during training (see <xref ref-type="fig" rid="F8">Figure 8</xref>). We remark that PME can retrieve tokens&#x00027; embeddings from different fields simultaneously, which cannot be done in DartsEMB (see <xref ref-type="supplementary-material" rid="SM1">Appendix 2</xref> in <xref ref-type="supplementary-material" rid="SM1">Supplementary material</xref>). To summarize, PME not only reduces the memory occupied by the embedding layer during both training and inference, but also speeds up the training process.</p>
<fig id="F9" position="float">
<label>Figure 9</label>
<caption><p>Training time per epoch of recommendation models with different embedding compression methods on Criteo dataset. <bold>(A)</bold> The backbone model is DeepFM. <bold>(B)</bold> The backbone model is Autoint&#x0002B;. <bold>(C)</bold> The backbone model is Wide and Deep.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-06-1195742-g0009.tif"/>
</fig></sec>
<sec>
<title>4.4. Sensitivity analysis</title>
<p>In this subsection, we study the sensitivity of the sizes searched by PME to the backbone model and to the initial weights using the Criteo dataset (<bold>RQ3</bold>).</p>
<sec>
<title>4.4.1. Initialization sensitivity analysis</title>
<p>The Lottery Ticket Hypothesis (LTH) states that randomly initialized networks contain subnetworks (winning tickets) that, when trained in isolation, can reach accuracy comparable to that of the original network (Frankle and Carbin, <xref ref-type="bibr" rid="B8">2019</xref>). LTH suggests that the connections of winning tickets have specific initial weights that make training particularly effective (Frankle and Carbin, <xref ref-type="bibr" rid="B8">2019</xref>).</p>
<p>However, in PME, the allocated size of each token is obtained by transferring only the capacity of its pruned embedding. Moreover, the randomly initialized weights used for identifying redundant parameters are not trained during the search process. According to LTH, the allocated sizes may therefore overfit the particular initial weights used during the search process. To investigate whether the searched sizes are customized for the initial weights used during the search process, following the method of Zhao et al. (<xref ref-type="bibr" rid="B31">2020a</xref>), we calculate the averaged Pearson correlation of the searched sizes obtained with five different random seeds. Here, the searched sizes refer to the output of <xref ref-type="table" rid="T2">Algorithm 1</xref> rather than the rounded sizes, which allows a fine-grained comparison. The results are presented in <xref ref-type="fig" rid="F10">Figure 10</xref>. We note that a Pearson correlation beyond 0.8 is already regarded as a strong correlation (Buda and Jarynowski, <xref ref-type="bibr" rid="B1">2010</xref>; Zhao et al., <xref ref-type="bibr" rid="B31">2020a</xref>).</p>
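<p>The averaged Pearson correlation can be sketched as follows. This minimal Python example assumes that the average is taken over all unordered pairs of runs, which is one plausible reading of the procedure and not necessarily the exact protocol used in the experiments; the function names and the toy size vectors are hypothetical.</p>
<preformat>
```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def averaged_pearson(size_vectors):
    """Mean Pearson correlation over all unordered pairs of runs (assumption)."""
    pairs = list(combinations(size_vectors, 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical searched sizes of four tokens under three random seeds.
    runs = [
        [0.5, 3.0, 12.0, 30.0],
        [0.0, 4.0, 10.0, 31.0],
        [1.0, 2.5, 13.0, 28.0],
    ]
    print(averaged_pearson(runs))
```
</preformat>
<p>Each vector holds the unrounded searched size of every token for one seed, so a value close to one indicates that different initializations lead to nearly the same allocation.</p>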
<fig id="F10" position="float">
<label>Figure 10</label>
<caption><p>Averaged Pearson correlation between searched sizes with different random seeds. As parameters are being pruned, the Pearson correlation converges to one.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-06-1195742-g0010.tif"/>
</fig>
<p>As shown in <xref ref-type="fig" rid="F10">Figure 10</xref>, PME is generally robust to different initializations in terms of Pearson correlation. Moreover, as the parameters are being pruned, the Pearson correlation converges to one. This suggests that under highly limited resource constraints, the allocation strategy of PME is initialization-agnostic.</p></sec>
<sec>
<title>4.4.2. Architecture sensitivity analysis</title>
<p>For PME, the embedding sizes are calculated based on the gradients of the randomly initialized weights. Thus, the backbone model may largely influence the searched embedding sizes, since the gradient flow is determined by the architecture of the backbone model. To investigate whether the searched embedding sizes are sensitive to the backbone model, similar to the initialization sensitivity analysis, <xref ref-type="fig" rid="F11">Figure 11</xref> presents the Pearson correlation of the embedding sizes searched with two representative models, namely, DeepFM and Autoint&#x0002B;.</p>
<fig id="F11" position="float">
<label>Figure 11</label>
<caption><p>Averaged Pearson correlation between searched sizes with DeepFM and Autoint&#x0002B;. Here, we use the searched sizes instead of rounded sizes. As parameters are being pruned, the Pearson correlation converges to one.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-06-1195742-g0011.tif"/>
</fig>
<p>Similarly, as shown in <xref ref-type="fig" rid="F11">Figure 11</xref>, PME is generally robust to backbone models in terms of Pearson correlation. Moreover, as the parameters are being pruned, the Pearson correlation converges to one. This suggests that under highly limited resource constraints, the embedding sizes searched by PME are model-agnostic. We note that both DeepFM and Autoint&#x0002B; with PME can achieve comparable or better performance at high compression rates on the Criteo dataset (see <xref ref-type="fig" rid="F4">Figure 4</xref>). We hypothesize that although the backbone models are different, PME identifies the same group of the most important tokens and allocates more parameters to them.</p></sec></sec></sec>
<sec id="s5">
<title>5. Related work</title>
<p>Many embedding compression methods have been proposed to reduce the memory consumption of the embedding layer. We roughly categorize existing embedding compression methods into four classes as follows.</p>
<sec>
<title>5.1. Multi-size embedding</title>
<p>Multi-size embedding allows tokens in the vocabulary to have embeddings of different sizes. Specifically, mixed dimension embedding (MDE) proposes to adaptively allocate sizes for tokens according to their frequency (Ginart et al., <xref ref-type="bibr" rid="B9">2021</xref>). Neural Input Search (NIS) searches for the embedding sizes using reinforcement learning (Joglekar et al., <xref ref-type="bibr" rid="B12">2020</xref>). Inspired by differentiable architecture search (DARTS) (Liu et al., <xref ref-type="bibr" rid="B19">2019</xref>), AutoEmb makes the embedding size selection process differentiable by incorporating the DARTS method (Zhao et al., <xref ref-type="bibr" rid="B32">2020b</xref>). Similarly, AutoDim proposes to search field-wise embedding sizes by relaxing the discrete embedding size allocation problem to a continuous one that can be solved by gradient descent (Zhao et al., <xref ref-type="bibr" rid="B31">2020a</xref>).</p>
<p>Plug-in Embedding Pruning (PEP) (Liu et al., <xref ref-type="bibr" rid="B20">2021</xref>) also adopts a pruning-based formulation to learn embedding sizes and is the study most closely related to ours, with two main differences. First, PEP uses the sparse matrix format to store the pruned embedding layer and retrains the model with the sparse embedding matrix. In contrast, PME builds a multi-size embedding table for training by transferring the capacity of the tokens&#x00027; pruned embeddings. Second, PEP utilizes Soft Threshold Reparameterization (Kusupati et al., <xref ref-type="bibr" rid="B15">2020</xref>) to prune redundant parameters, which requires expensive <italic>pretrain</italic>-<italic>prune</italic>-<italic>retrain</italic> cycles. In contrast, PME disentangles the pruning process from the iterative cycle by pruning redundant parameters at initialization. We do not compare with PEP for the following two reasons. First, to the best of our knowledge, the official implementation of embedding layers in PyTorch does not support the sparse matrix format, and the official code of PEP has not been released yet. Second, the baseline performance reported in Liu et al. (<xref ref-type="bibr" rid="B20">2021</xref>) has a large gap with ours.</p></sec>
<sec>
<title>5.2. Low-rank approximation</title>
<p>Low-rank approximation assumes that there is a low-rank latent structure in the embedding matrix and decomposes the original matrix into several smaller matrices (Markovsky and Usevich, <xref ref-type="bibr" rid="B21">2012</xref>). TT-Rec uses tensor train decomposition instead of the standard low-rank decomposition to optimize for GPU computations (Yin et al., <xref ref-type="bibr" rid="B29">2021</xref>).</p></sec>
<sec>
<title>5.3. Hashing</title>
<p>Hashing is a widely used technique to reduce storage space by mapping similar tokens into the same bucket and dissimilar tokens into different buckets (Wang et al., <xref ref-type="bibr" rid="B28">2017</xref>). Recently, efforts have also been devoted to jointly learning feature representations and hashing functions to preserve the similarity, and hence minimize the performance gap after compression (Lin et al., <xref ref-type="bibr" rid="B18">2015</xref>; Cao et al., <xref ref-type="bibr" rid="B2">2017</xref>; Wang et al., <xref ref-type="bibr" rid="B28">2017</xref>). Another representative work is ROBE (Desai et al., <xref ref-type="bibr" rid="B7">2022</xref>). Specifically, Desai et al. (<xref ref-type="bibr" rid="B7">2022</xref>) maintain a single array of learned parameters, which is a compressed representation of the embedding table. All embedding tables share the same array of learned parameters. The embeddings are accessed in a blocked manner from the embedding array using GPU-friendly universal hashing.</p></sec>
<sec>
<title>5.4. Quantization</title>
<p>Quantization refers to representing weights or gradients with a small number of bits, e.g., eight bits. In this way, we can effectively shrink the model size and accelerate the inference procedure (Han et al., <xref ref-type="bibr" rid="B11">2016</xref>). Specifically, differentiable product quantization (DPQ) proposes a differentiable quantization framework that enables end-to-end training for embedding compression and achieves significant compression rates on NLP models (Chen et al., <xref ref-type="bibr" rid="B4">2020</xref>). Inspired by DPQ, multi-granular quantized embeddings (MGQEs) generalize the framework of DPQ to the recommendation domain by incorporating the frequency information of tokens (Kang et al., <xref ref-type="bibr" rid="B13">2020</xref>).</p></sec></sec>
<sec sec-type="conclusions" id="s6">
<title>6. Conclusion</title>
<p>In this study, we approach the embedding size allocation problem from a pruning perspective. During the search phase, we prune the dimensions that have the least impact on model performance in the embedding to reduce its capacity. Then, we show that the customized size of each token can be obtained by transferring the capacity of its pruned embedding. Experiments verify that PME can achieve strong performance while significantly reducing the parameter number and can be trained efficiently.</p></sec>
<sec sec-type="data-availability" id="s7">
<title>Data availability statement</title>
<p>The original contributions presented in the study are included in the article/<xref ref-type="supplementary-material" rid="SM1">Supplementary material</xref>, further inquiries can be directed to the corresponding author.</p></sec>
<sec sec-type="author-contributions" id="s8">
<title>Author contributions</title>
<p>ZL, QS, and XH contributed to the whole framework. XH, QS, LL, S-HC, and RC contributed to the revision of the manuscript. All authors contributed to the manuscript and approved the submitted version.</p></sec>
</body>
<back>
<sec sec-type="funding-information" id="s9">
<title>Funding</title>
<p>This work was funded by NSF IIS-2224843 and IIS-1849085.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>QS was employed by LinkedIn. RC, LL, and S-HC were employed by Samsung Electronics America. The remaining authors declare that the study was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s10">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<sec sec-type="disclaimer" id="s11">
<title>Author disclaimer</title>
<p>The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.</p>
</sec>
<sec sec-type="supplementary-material" id="s12">
<title>Supplementary material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fdata.2023.1195742/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fdata.2023.1195742/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Data_Sheet_1.pdf" id="SM1" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<fn-group>
<fn id="fn0001"><p><sup>1</sup>For convenience, we use the term &#x0201C;tokens&#x0201D; to represent elements (e.g., users and items) in the vocabulary.</p></fn>
<fn id="fn0002"><p><sup>2</sup>We do not compare with NIS (Joglekar et al., <xref ref-type="bibr" rid="B12">2020</xref>), since the reinforcement learning based search process is extremely slow in the normal setting.</p></fn>
<fn id="fn0003"><p><sup>3</sup><ext-link ext-link-type="uri" xlink:href="https://www.kaggle.com/c/criteo-display-ad-challenge">https://www.kaggle.com/c/criteo-display-ad-challenge</ext-link></p></fn>
<fn id="fn0004"><p><sup>4</sup><ext-link ext-link-type="uri" xlink:href="https://www.kaggle.com/c/avazu-ctr-prediction">https://www.kaggle.com/c/avazu-ctr-prediction</ext-link></p></fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Buda</surname> <given-names>A.</given-names></name> <name><surname>Jarynowski</surname> <given-names>A.</given-names></name></person-group> (<year>2010</year>). <source>Life Time of Correlations and its Applications</source>. Andrzej Buda Wydawnictwo Niezale&#x0017C;ne.</citation>
</ref>
<ref id="B2">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Cao</surname> <given-names>Z.</given-names></name> <name><surname>Long</surname> <given-names>M.</given-names></name> <name><surname>Wang</surname> <given-names>J.</given-names></name> <name><surname>Yu</surname> <given-names>P. S.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Hashnet: Deep learning to hash by continuation,&#x0201D;</article-title> in <source>IEEE International Conference on Computer Vision, ICCV 2017</source> (<publisher-loc>Venice</publisher-loc>: <publisher-name>IEEE Computer Society</publisher-name>), <fpage>5609</fpage>&#x02013;<lpage>5618</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV.2017.598</pub-id></citation>
</ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Carreira-Perpi&#x000F1;&#x000E1;n</surname> <given-names>M. A.</given-names></name> <name><surname>Idelbayev</surname> <given-names>Y.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;&#x0201C;Learning-compression&#x0201D; algorithms for neural net pruning,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>, <fpage>8532</fpage>&#x02013;<lpage>8541</lpage>.</citation>
</ref>
<ref id="B4">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>T.</given-names></name> <name><surname>Li</surname> <given-names>L.</given-names></name> <name><surname>Sun</surname> <given-names>Y.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Differentiable product quantization for end-to-end embedding compression,&#x0201D;</article-title> in <source>Proceedings of the 37th International Conference on Machine Learning, ICML 2020, Virtual Event (PMLR), vol. 119 of Proceedings of Machine Learning Research</source> (<publisher-loc>Vienna</publisher-loc>), <fpage>1617</fpage>&#x02013;<lpage>1626</lpage>.</citation>
</ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cheng</surname> <given-names>H.-T.</given-names></name> <name><surname>Koc</surname> <given-names>L.</given-names></name> <name><surname>Harmsen</surname> <given-names>J.</given-names></name> <name><surname>Shaked</surname> <given-names>T.</given-names></name> <name><surname>Chandra</surname> <given-names>T.</given-names></name> <name><surname>Aradhye</surname> <given-names>H.</given-names></name> <etal/></person-group>. (<year>2016</year>). <article-title>&#x0201C;Wide &#x00026; deep learning for recommender systems,&#x0201D;</article-title> in <source>Proceedings of the 1st Workshop on Deep Learning for Recommender Systems</source>, <fpage>7</fpage>&#x02013;<lpage>10</lpage>.</citation>
</ref>
<ref id="B6">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Covington</surname> <given-names>P.</given-names></name> <name><surname>Adams</surname> <given-names>J.</given-names></name> <name><surname>Sargin</surname> <given-names>E.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Deep neural networks for youtube recommendations,&#x0201D;</article-title> in <source>Proceedings of the 10th ACM Conference on Recommender Systems</source>, eds <person-group person-group-type="editor"><name><surname>Sen</surname> <given-names>S.</given-names></name> <name><surname>Geyer</surname> <given-names>W.</given-names></name> <name><surname>Freyne</surname> <given-names>J.</given-names></name> <name><surname>Castells</surname> <given-names>P.</given-names></name></person-group> (<publisher-loc>Boston, MA</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>191</fpage>&#x02013;<lpage>198</lpage>. <pub-id pub-id-type="doi">10.1145/2959100.2959190</pub-id></citation>
</ref>
<ref id="B7">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Desai</surname> <given-names>A.</given-names></name> <name><surname>Chou</surname> <given-names>L.</given-names></name> <name><surname>Shrivastava</surname> <given-names>A.</given-names></name></person-group> (<year>2022</year>). <article-title>&#x0201C;Random offset block embedding (ROBE) for compressed embedding tables in deep learning recommendation systems,&#x0201D;</article-title> in <source>Proceedings of Machine Learning and Systems 2022, MLSys 2022</source>, eds <person-group person-group-type="editor"><name><surname>Marculescu</surname> <given-names>D.</given-names></name> <name><surname>Chi</surname> <given-names>Y.</given-names></name> <name><surname>Wu</surname> <given-names>C.</given-names></name></person-group> (<publisher-loc>Santa Clara, CA</publisher-loc>: <publisher-name>mlsys.org</publisher-name>).</citation>
</ref>
<ref id="B8">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Frankle</surname> <given-names>J.</given-names></name> <name><surname>Carbin</surname> <given-names>M.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;The lottery ticket hypothesis: Finding sparse, trainable neural networks,&#x0201D;</article-title> in <source>7th International Conference on Learning Representations, ICLR 2019</source> (<publisher-loc>New Orleans, LA</publisher-loc>: <publisher-name>OpenReview.net</publisher-name>).</citation>
</ref>
<ref id="B9">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ginart</surname> <given-names>A. A.</given-names></name> <name><surname>Naumov</surname> <given-names>M.</given-names></name> <name><surname>Mudigere</surname> <given-names>D.</given-names></name> <name><surname>Yang</surname> <given-names>J.</given-names></name> <name><surname>Zou</surname> <given-names>J.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Mixed dimension embeddings with application to memory-efficient recommendation systems,&#x0201D;</article-title> in <source>IEEE International Symposium on Information Theory, ISIT 2021</source> (<publisher-loc>Melbourne, VIC</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2786</fpage>&#x02013;<lpage>2791</lpage>. <pub-id pub-id-type="doi">10.1109/ISIT45174.2021.9517710</pub-id></citation>
</ref>
<ref id="B10">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Guo</surname> <given-names>H.</given-names></name> <name><surname>Tang</surname> <given-names>R.</given-names></name> <name><surname>Ye</surname> <given-names>Y.</given-names></name> <name><surname>Li</surname> <given-names>Z.</given-names></name> <name><surname>He</surname> <given-names>X.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;DeepFM: A factorization-machine based neural network for CTR prediction,&#x0201D;</article-title> in <source>Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017</source>, ed <person-group person-group-type="editor"><name><surname>Sierra</surname> <given-names>C.</given-names></name></person-group> (<publisher-loc>Melbourne, VIC</publisher-loc>: <publisher-name>ijcai.org</publisher-name>), <fpage>1725</fpage>&#x02013;<lpage>1731</lpage>. <pub-id pub-id-type="doi">10.24963/ijcai.2017/239</pub-id><pub-id pub-id-type="pmid">36072720</pub-id></citation></ref>
<ref id="B11">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Han</surname> <given-names>S.</given-names></name> <name><surname>Mao</surname> <given-names>H.</given-names></name> <name><surname>Dally</surname> <given-names>W. J.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding,&#x0201D;</article-title> in <source>4th International Conference on Learning Representations, ICLR 2016</source>, eds Y. Bengio and Y. LeCun (<publisher-loc>San Juan</publisher-loc>).</citation>
</ref>
<ref id="B12">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Joglekar</surname> <given-names>M. R.</given-names></name> <name><surname>Li</surname> <given-names>C.</given-names></name> <name><surname>Chen</surname> <given-names>M.</given-names></name> <name><surname>Xu</surname> <given-names>T.</given-names></name> <name><surname>Wang</surname> <given-names>X.</given-names></name> <name><surname>Adams</surname> <given-names>J. K.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>&#x0201C;Neural input search for large scale recommendation models,&#x0201D;</article-title> in <source>KDD &#x00027;20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event</source>, eds R. Gupta, Y. Liu, J. Tang, and B. A. Prakash (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>2387</fpage>&#x02013;<lpage>2397</lpage>. <pub-id pub-id-type="doi">10.1145/3394486.3403288</pub-id><pub-id pub-id-type="pmid">36238497</pub-id></citation></ref>
<ref id="B13">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kang</surname> <given-names>W.</given-names></name> <name><surname>Cheng</surname> <given-names>D. Z.</given-names></name> <name><surname>Chen</surname> <given-names>T.</given-names></name> <name><surname>Yi</surname> <given-names>X.</given-names></name> <name><surname>Lin</surname> <given-names>D.</given-names></name> <name><surname>Hong</surname> <given-names>L.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>&#x0201C;Learning multi-granular quantized embeddings for large-vocab categorical features in recommender systems,&#x0201D;</article-title> in <source>Companion of The 2020 Web Conference 2020</source>, eds A. E. F. Seghrouchni, G. Sukthankar, T. Liu, and M. van Steen (<publisher-loc>Taipei</publisher-loc>: <publisher-name>ACM / IW3C2</publisher-name>), <fpage>562</fpage>&#x02013;<lpage>566</lpage>. <pub-id pub-id-type="doi">10.1145/3366424.3383416</pub-id></citation>
</ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Koren</surname> <given-names>Y.</given-names></name> <name><surname>Bell</surname> <given-names>R.</given-names></name> <name><surname>Volinsky</surname> <given-names>C.</given-names></name></person-group> (<year>2009</year>). <article-title>Matrix factorization techniques for recommender systems</article-title>. <source>Computer</source> <volume>42</volume>, <fpage>30</fpage>&#x02013;<lpage>37</lpage>. <pub-id pub-id-type="doi">10.1109/MC.2009.263</pub-id></citation>
</ref>
<ref id="B15">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kusupati</surname> <given-names>A.</given-names></name> <name><surname>Ramanujan</surname> <given-names>V.</given-names></name> <name><surname>Somani</surname> <given-names>R.</given-names></name> <name><surname>Wortsman</surname> <given-names>M.</given-names></name> <name><surname>Jain</surname> <given-names>P.</given-names></name> <name><surname>Kakade</surname> <given-names>S. M.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>&#x0201C;Soft threshold weight reparameterization for learnable sparsity,&#x0201D;</article-title> in <source>Proceedings of the 37th International Conference on Machine Learning, ICML 2020, Virtual Event (PMLR), vol. 119 of Proceedings of Machine Learning Research</source>, <fpage>5544</fpage>&#x02013;<lpage>5555</lpage>.</citation>
</ref>
<ref id="B16">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lee</surname> <given-names>N.</given-names></name> <name><surname>Ajanthan</surname> <given-names>T.</given-names></name> <name><surname>Torr</surname> <given-names>P. H. S.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;SNIP: single-shot network pruning based on connection sensitivity,&#x0201D;</article-title> in <source>7th International Conference on Learning Representations, ICLR 2019</source> (<publisher-loc>New Orleans, LA</publisher-loc>: <publisher-name>OpenReview.net</publisher-name>).</citation>
</ref>
<ref id="B17">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lian</surname> <given-names>J.</given-names></name> <name><surname>Zhou</surname> <given-names>X.</given-names></name> <name><surname>Zhang</surname> <given-names>F.</given-names></name> <name><surname>Chen</surname> <given-names>Z.</given-names></name> <name><surname>Xie</surname> <given-names>X.</given-names></name> <name><surname>Sun</surname> <given-names>G.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;xDeepFM: Combining explicit and implicit feature interactions for recommender systems,&#x0201D;</article-title> in <source>Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery &#x00026; Data Mining, KDD 2018</source>, eds Y. Guo and F. Farooq (<publisher-loc>London</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>1754</fpage>&#x02013;<lpage>1763</lpage>. <pub-id pub-id-type="doi">10.1145/3219819.3220023</pub-id></citation>
</ref>
<ref id="B18">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>K.</given-names></name> <name><surname>Yang</surname> <given-names>H.</given-names></name> <name><surname>Hsiao</surname> <given-names>J.</given-names></name> <name><surname>Chen</surname> <given-names>C.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;Deep learning of binary hash codes for fast image retrieval,&#x0201D;</article-title> in <source>2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2015</source> (<publisher-loc>Boston, MA</publisher-loc>: <publisher-name>IEEE Computer Society</publisher-name>), <fpage>27</fpage>&#x02013;<lpage>35</lpage>. <pub-id pub-id-type="doi">10.1109/CVPRW.2015.7301269</pub-id></citation>
</ref>
<ref id="B19">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>H.</given-names></name> <name><surname>Simonyan</surname> <given-names>K.</given-names></name> <name><surname>Yang</surname> <given-names>Y.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;DARTS: differentiable architecture search,&#x0201D;</article-title> in <source>7th International Conference on Learning Representations, ICLR 2019</source> (<publisher-loc>New Orleans, LA</publisher-loc>: <publisher-name>OpenReview.net</publisher-name>).</citation>
</ref>
<ref id="B20">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>S.</given-names></name> <name><surname>Gao</surname> <given-names>C.</given-names></name> <name><surname>Chen</surname> <given-names>Y.</given-names></name> <name><surname>Jin</surname> <given-names>D.</given-names></name> <name><surname>Li</surname> <given-names>Y.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Learnable embedding sizes for recommender systems,&#x0201D;</article-title> in <source>9th International Conference on Learning Representations, ICLR 2021, Virtual Event</source> (<publisher-name>OpenReview.net</publisher-name>).</citation>
</ref>
<ref id="B21">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Markovsky</surname> <given-names>I.</given-names></name></person-group> (<year>2012</year>). <article-title>&#x0201C;Low rank approximation - algorithms, implementation, applications,&#x0201D;</article-title> in <source>Communications and Control Engineering</source> (<publisher-loc>London</publisher-loc>: <publisher-name>Springer</publisher-name>). <pub-id pub-id-type="doi">10.1007/978-1-4471-2227-2</pub-id><pub-id pub-id-type="pmid">34658507</pub-id></citation></ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Park</surname> <given-names>J.</given-names></name> <name><surname>Naumov</surname> <given-names>M.</given-names></name> <name><surname>Basu</surname> <given-names>P.</given-names></name> <name><surname>Deng</surname> <given-names>S.</given-names></name> <name><surname>Kalaiah</surname> <given-names>A.</given-names></name> <name><surname>Khudia</surname> <given-names>D. S.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>Deep learning inference in Facebook data centers: Characterization, performance optimizations and hardware implications</article-title>. <source>arXiv [Preprint]</source>. arXiv:1811.09886.</citation>
</ref>
<ref id="B23">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Park</surname> <given-names>Y.</given-names></name> <name><surname>Tuzhilin</surname> <given-names>A.</given-names></name></person-group> (<year>2008</year>). <article-title>&#x0201C;The long tail of recommender systems and how to leverage it,&#x0201D;</article-title> in <source>Proceedings of the 2008 ACM Conference on Recommender Systems, RecSys 2008</source>, eds P. Pu, D. G. Bridge, B. Mobasher, and F. Ricci (<publisher-loc>Lausanne</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>11</fpage>&#x02013;<lpage>18</lpage>. <pub-id pub-id-type="doi">10.1145/1454008.1454012</pub-id></citation>
</ref>
<ref id="B24">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Paszke</surname> <given-names>A.</given-names></name> <name><surname>Gross</surname> <given-names>S.</given-names></name> <name><surname>Massa</surname> <given-names>F.</given-names></name> <name><surname>Lerer</surname> <given-names>A.</given-names></name> <name><surname>Bradbury</surname> <given-names>J.</given-names></name> <name><surname>Chanan</surname> <given-names>G.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>&#x0201C;PyTorch: An imperative style, high-performance deep learning library,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019</source>, eds H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d&#x00027;Alch&#x000E9;-Buc, E. B. Fox, and R. Garnett (<publisher-loc>Vancouver, BC</publisher-loc>), <fpage>8024</fpage>&#x02013;<lpage>8035</lpage>.</citation>
</ref>
<ref id="B25">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Shi</surname> <given-names>H. M.</given-names></name> <name><surname>Mudigere</surname> <given-names>D.</given-names></name> <name><surname>Naumov</surname> <given-names>M.</given-names></name> <name><surname>Yang</surname> <given-names>J.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Compositional embeddings using complementary partitions for memory-efficient recommendation systems,&#x0201D;</article-title> in <source>KDD &#x00027;20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event</source>, eds <person-group person-group-type="editor"><name><surname>Gupta</surname> <given-names>R.</given-names></name> <name><surname>Liu</surname> <given-names>Y.</given-names></name> <name><surname>Tang</surname> <given-names>J.</given-names></name> <name><surname>Prakash</surname> <given-names>B. A.</given-names></name></person-group> (<publisher-name>ACM</publisher-name>), <fpage>165</fpage>&#x02013;<lpage>175</lpage>. <pub-id pub-id-type="doi">10.1145/3394486.3403059</pub-id></citation>
</ref>
<ref id="B26">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Song</surname> <given-names>W.</given-names></name> <name><surname>Shi</surname> <given-names>C.</given-names></name> <name><surname>Xiao</surname> <given-names>Z.</given-names></name> <name><surname>Duan</surname> <given-names>Z.</given-names></name> <name><surname>Xu</surname> <given-names>Y.</given-names></name> <name><surname>Zhang</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>&#x0201C;AutoInt: Automatic feature interaction learning via self-attentive neural networks,&#x0201D;</article-title> in <source>Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019</source>, eds W. Zhu, D. Tao, X. Cheng, P. Cui, E. A. Rundensteiner, D. Carmel, Q. He, and J. X. Yu (<publisher-loc>Beijing</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>1161</fpage>&#x02013;<lpage>1170</lpage>. <pub-id pub-id-type="doi">10.1145/3357384.3357925</pub-id></citation>
</ref>
<ref id="B27">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>C.</given-names></name> <name><surname>Zhang</surname> <given-names>G.</given-names></name> <name><surname>Grosse</surname> <given-names>R. B.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Picking winning tickets before training by preserving gradient flow,&#x0201D;</article-title> in <source>8th International Conference on Learning Representations, ICLR 2020</source> (<publisher-loc>Addis Ababa</publisher-loc>: <publisher-name>OpenReview.net</publisher-name>).</citation>
</ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>J.</given-names></name> <name><surname>Zhang</surname> <given-names>T.</given-names></name> <name><surname>Sebe</surname> <given-names>N.</given-names></name> <name><surname>Shen</surname> <given-names>H. T.</given-names></name></person-group> (<year>2017</year>). <article-title>A survey on learning to hash</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell.</source> <volume>40</volume>, <fpage>769</fpage>&#x02013;<lpage>790</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2017.2699960</pub-id><pub-id pub-id-type="pmid">28475044</pub-id></citation></ref>
<ref id="B29">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yin</surname> <given-names>C.</given-names></name> <name><surname>Acun</surname> <given-names>B.</given-names></name> <name><surname>Wu</surname> <given-names>C.</given-names></name> <name><surname>Liu</surname> <given-names>X.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;TT-Rec: Tensor train compression for deep learning recommendation models,&#x0201D;</article-title> in <source>Proceedings of Machine Learning and Systems 2021, MLSys 2021, virtual</source>, eds <person-group person-group-type="editor"><name><surname>Smola</surname> <given-names>A.</given-names></name> <name><surname>Dimakis</surname> <given-names>A.</given-names></name> <name><surname>Stoica</surname> <given-names>I.</given-names></name></person-group> (<publisher-name>mlsys.org</publisher-name>).</citation>
</ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>S.</given-names></name> <name><surname>Yao</surname> <given-names>L.</given-names></name> <name><surname>Sun</surname> <given-names>A.</given-names></name> <name><surname>Tay</surname> <given-names>Y.</given-names></name></person-group> (<year>2019</year>). <article-title>Deep learning based recommender system: a survey and new perspectives</article-title>. <source>ACM Comput. Surv.</source> <volume>52</volume>, <fpage>1</fpage>&#x02013;<lpage>38</lpage>. <pub-id pub-id-type="doi">10.1145/3158369</pub-id></citation>
</ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhao</surname> <given-names>X.</given-names></name> <name><surname>Liu</surname> <given-names>H.</given-names></name> <name><surname>Liu</surname> <given-names>H.</given-names></name> <name><surname>Tang</surname> <given-names>J.</given-names></name> <name><surname>Guo</surname> <given-names>W.</given-names></name> <name><surname>Shi</surname> <given-names>J.</given-names></name> <etal/></person-group>. (<year>2020a</year>). <article-title>Memory-efficient embedding for recommendations</article-title>. <source>arXiv [Preprint]</source>. arXiv:2006.14827.</citation>
</ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhao</surname> <given-names>X.</given-names></name> <name><surname>Wang</surname> <given-names>C.</given-names></name> <name><surname>Chen</surname> <given-names>M.</given-names></name> <name><surname>Zheng</surname> <given-names>X.</given-names></name> <name><surname>Liu</surname> <given-names>X.</given-names></name> <name><surname>Tang</surname> <given-names>J.</given-names></name></person-group> (<year>2020b</year>). <article-title>AutoEMB: automated embedding dimensionality search in streaming recommendations</article-title>. <source>arXiv [Preprint]</source>. arXiv:2002.11252.</citation>
</ref>
</ref-list>
</back>
</article> 