<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Microbiol.</journal-id>
<journal-title>Frontiers in Microbiology</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Microbiol.</abbrev-journal-title>
<issn pub-type="epub">1664-302X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fmicb.2025.1484183</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Microbiology</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Composite quantile regression approach to batch effect correction in microbiome data</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Park</surname> <given-names>Jiwon</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2717780/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/conceptualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/data-curation/"/>
<role content-type="https://credit.niso.org/contributor-roles/formal-analysis/"/>
<role content-type="https://credit.niso.org/contributor-roles/investigation/"/>
<role content-type="https://credit.niso.org/contributor-roles/methodology/"/>
<role content-type="https://credit.niso.org/contributor-roles/software/"/>
<role content-type="https://credit.niso.org/contributor-roles/validation/"/>
<role content-type="https://credit.niso.org/contributor-roles/visualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Park</surname> <given-names>Taesung</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="corresp" rid="c002"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/872605/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/project-administration/"/>
<role content-type="https://credit.niso.org/contributor-roles/supervision/"/>
<role content-type="https://credit.niso.org/contributor-roles/validation/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Interdisciplinary Program of Bioinformatics, Seoul National University</institution>, <addr-line>Seoul</addr-line>, <country>Republic of Korea</country></aff>
<aff id="aff2"><sup>2</sup><institution>Department of Statistics, Seoul National University</institution>, <addr-line>Seoul</addr-line>, <country>Republic of Korea</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Hyun-Seob Song, University of Nebraska-Lincoln, United States</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Na-Rae Lee, Konkuk University, Republic of Korea</p>
<p>Eglantina Kalluci, University of Tirana, Albania</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Jiwon Park <email>jiwonp&#x00040;snu.ac.kr</email></corresp>
<corresp id="c002">Taesung Park <email>tspark&#x00040;stats.snu.ac.kr</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>25</day>
<month>02</month>
<year>2025</year>
</pub-date>
<pub-date pub-type="collection">
<year>2025</year>
</pub-date>
<volume>16</volume>
<elocation-id>1484183</elocation-id>
<history>
<date date-type="received">
<day>21</day>
<month>08</month>
<year>2024</year>
</date>
<date date-type="accepted">
<day>06</day>
<month>02</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2025 Park and Park.</copyright-statement>
<copyright-year>2025</copyright-year>
<copyright-holder>Park and Park</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<sec>
<title>Background</title>
<p>Batch effects refer to data variations that arise from non-biological factors such as experimental conditions, equipment, and external factors. These effects are considered significant issues in the analysis of biological data since they can compromise data consistency and distort actual biological differences, which can severely skew the results of downstream analyses.</p>
</sec>
<sec>
<title>Method</title>
<p>In this study, we introduce a new approach that comprehensively addresses two types of batch effects: &#x0201C;systematic batch effects&#x0201D; which are consistent across all samples in a batch, and &#x0201C;nonsystematic batch effects&#x0201D; which vary depending on the variability of operational taxonomic units (OTUs) within each sample in the same batch. To address systematic batch effects, we apply a negative binomial regression model and correct for consistent batch influences by excluding fixed batch effects. Additionally, to handle nonsystematic batch effects, we employ composite quantile regression. By adjusting the distribution of OTUs to be similar based on a reference batch selected using the Kruskal-Walis test method, we consider the variability at the OTU level.</p>
</sec>
<sec>
<title>Results</title>
<p>The performance of the model is evaluated and compared with existing methods using PERMANOVA R-squared values, Principal Coordinates Analysis (PCoA) plots and Average Silhouette Coefficient calculated with diverse distance-based metrics. The model is applied to three real microbiome datasets: Metagenomic urine control data, Human Immunodeficiency Virus Re-analysis Consortium data, and Men and Women Offering Understanding of Throat HPV study data. The results demonstrate that the model effectively corrects for batch effects across all datasets.</p>
</sec></abstract>
<kwd-group>
<kwd>batch effect</kwd>
<kwd>composite quantile regression</kwd>
<kwd>microbiome</kwd>
<kwd>zero inflation</kwd>
<kwd>over-dispersion</kwd>
</kwd-group>
<counts>
<fig-count count="5"/>
<table-count count="4"/>
<equation-count count="10"/>
<ref-count count="36"/>
<page-count count="10"/>
<word-count count="6073"/>
</counts>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Systems Microbiology</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1 Introduction</title>
<p>The human microbiome, comprising diverse microorganisms across various body sites, is crucial for understanding health and disease dynamics (Ursell et al., <xref ref-type="bibr" rid="B29">2012</xref>; Wang and L&#x000EA;Cao, <xref ref-type="bibr" rid="B31">2020</xref>). Advances in sequencing technologies have deepened our understanding by enabling researchers to not only identify microbial species but also to understand their functional roles within the ecosystem. However, these advancements have also highlighted challenges like batch effects&#x02014;discrepancies across sample batches that can distort biological insights (Leek and Storey, <xref ref-type="bibr" rid="B13">2007</xref>). That is, batch effects refer to systematic and non-systematic discrepancies or variations that arise during the processing of samples in a study (Lazar et al., <xref ref-type="bibr" rid="B12">2013</xref>; Wang and L&#x000EA;Cao, <xref ref-type="bibr" rid="B31">2020</xref>). Systematic batch effects represent the consistent differences across all samples within a batch, and nonsystematic batch effects demonstrate variability that is dependent on the diversity of OTUs present within each individual sample of the same batch (Soneson et al., <xref ref-type="bibr" rid="B25">2014</xref>). Key factors contributing to these effects include variations in sample collection, DNA extraction methods, sequencing protocols, and data analysis techniques. Microbiome data&#x00027;s inherent properties&#x02014;high zero-inflation, over-dispersion&#x02014;exacerbate the impact of batch effects. Zero-inflation indicates the presence of many zeros in the data, which highlights the scarcity or absence of certain microbial species in many samples. Over-dispersion arises from individual variability and technical differences in sequencing depth, complicating data analysis.</p>
<p>While numerous methods exist for correcting batch effects in high-throughput data, many assume continuous data and are not directly applicable to microbiome data, which typically consist of count data. Traditional methods, such as the ComBat algorithm (Johnson et al., <xref ref-type="bibr" rid="B8">2007</xref>), typically assume a Gaussian distribution. Based on this, extensions for count data such as RNA-seq have been developed that utilize a negative binomial distribution. This approach is more aligned with the inherent properties of the data and effectively adjusts for consistent batch patterns. However, while this method is applicable for addressing systematic batch effects, it struggles to fully correct for non-systematic batch influences caused by irregular experimental errors or the unique characteristics of individual samples.</p>
<p>Another approach, Meta-analysis Methods with a Uniform Pipeline for Heterogeneity in microbiome studies (MMUPHin) provides a comprehensive solution for managing heterogeneity in microbiome studies (Ma et al., <xref ref-type="bibr" rid="B19">2022</xref>). This method includes joint normalization and meta-analysis techniques specifically designed for the unique characteristics of microbiome data. Its adaptability is especially effective for handling the diverse and often non-parametric nature of microbiome count data. However, MMUPHin assumes the data to be Zero-inflated Gaussian, which is primarily suitable only for certain transformations of relative abundance data, such as taxon counts normalized by each sample&#x00027;s library size. This assumption limits its applicability, indicating a need for more flexible approaches in certain scenarios. Percentile normalization is a method by which data can be converted to a uniform distribution based on percentiles (Gibbons et al., <xref ref-type="bibr" rid="B6">2018</xref>). This transformation helps mitigate the effects of over-dispersion and the high zero count. However, a key limitation of percentile-normalization lies in its potential to oversimplify complex data structures, possibly leading to the loss of meaningful biological variance. This issue can be particularly critical when dealing with highly diverse microbiome samples. Recently, new methods for batch effect correction have been introduced, tailored specifically to design unique characteristics of microbiome data. Among those approaches, Conditional Quantile Regression (ConQuR) (Ling et al., <xref ref-type="bibr" rid="B16">2021a</xref>,<xref ref-type="bibr" rid="B17">b</xref>) utilizes a conditional quantile regression approach for batch effect correction, enabling the generation of corrected OTUs. ConQuR independently processes each OTU without assuming a specific distribution, offering flexible handling of data. However, when OTUs across batches exhibit significantly different distributions, each quantile may represent different characteristics, potentially compromising the consistency of the analysis results. While using a reference batch to standardize OTU distributions can address fundamental inter-batch differences, the effectiveness of this approach heavily depends on the extent of these differences and the appropriateness of the chosen reference batch. If the reference batch does not adequately represent the characteristics of other batches, this method may introduce additional distortions.</p>
</sec>
<sec sec-type="materials and methods" id="s2">
<title>2 Materials and methods</title>
<sec>
<title>2.1 Microbiome datasets</title>
<sec>
<title>2.1.1 Human immunodeficiency virus re-analysis consortium datasets (HIVRC)</title>
<p>We used the first dataset from multiple individual studies derived through the Human Immunodeficiency Virus Re-analysis Consortium (HIVRC) (Deeks et al., <xref ref-type="bibr" rid="B4">2015</xref>; Tuddenham et al., <xref ref-type="bibr" rid="B28">2020</xref>; Ling et al., <xref ref-type="bibr" rid="B16">2021a</xref>,<xref ref-type="bibr" rid="B17">b</xref>). This integrated dataset, based on fecal samples, provides microbial profiles indicative of the gut microbiome. It comprises 17 individual datasets, including 16S rRNA gene sequences from both HIV-uninfected (HIV-) and HIV-infected (HIV&#x0002B;) patients. Out of the 17 studies within the HIVRC, we selected 4 specific datasets for our analysis, namely those from Noguera-Julian, Pinto-Cardoso, Serrano-Villar, and Vesterbacka (<xref ref-type="table" rid="T1">Table 1</xref>). The selection criteria were based on the study design (either case-control or cross-sectional) and the sample type. These criteria ensured that the selected studies were compatible and relevant for our model validation purposes. Publicly available dataset was used in this study.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Summary of the sample size and covariates in HIVRC datasets.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>Batch ID (study)</bold></th>
<th valign="top" align="center"><bold>Sample size</bold></th>
<th valign="top" align="center"><bold>HIV status = 1(%)</bold></th>
<th valign="top" align="center"><bold>Age [mean(SD)]</bold></th>
<th valign="top" align="center"><bold>Gender = 1 (%)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Noguera-Julian</td>
<td valign="top" align="center">232</td>
<td valign="top" align="center">199 (85.8)</td>
<td valign="top" align="center">42.3 (10.7)</td>
<td valign="top" align="center">171 (73.7)</td>
</tr> <tr>
<td valign="top" align="left">Pinto-Cardoso</td>
<td valign="top" align="center">43</td>
<td valign="top" align="center">33 (76.7)</td>
<td valign="top" align="center">39.9 (10.2)</td>
<td valign="top" align="center">35 (81.4)</td>
</tr> <tr>
<td valign="top" align="left">Serrano-Villar 2017</td>
<td valign="top" align="center">23</td>
<td valign="top" align="center">21 (91.3)</td>
<td valign="top" align="center">42.4 (8.63)</td>
<td valign="top" align="center">23 (100)</td>
</tr> <tr>
<td valign="top" align="left">Vesterbacka</td>
<td valign="top" align="center">62</td>
<td valign="top" align="center">47 (75.8)</td>
<td valign="top" align="center">46.9 (9.61)</td>
<td valign="top" align="center">31 (50)</td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>The provided HIV status includes the number and proportion of patients diagnosed with HIV. The Age variable presents the mean and standard deviation, while the Gender variable reports the number and percentage of male participants.</p>
</table-wrap-foot>
</table-wrap>
</sec>
<sec>
<title>2.1.2 Men and women offering understanding of throat HPV study dataset</title>
<p>We used the second dataset from the Men and Women Offering Understanding of Throat Human Papillomavirus (MOUTH) study (Zhang et al., <xref ref-type="bibr" rid="B35">2022</xref>). This cross-sectional study includes 16S rRNA gene sequences from participants with oncogenic oral HPV infection or HPV seropositive antibodies. The dataset is composed of samples distributed across seven distinct sequencing batches, each identified by a unique Batch ID. The HPV dataset includes various covariates, summarized as follows: sample size, age [mean (SD)], smoking status, sexual orientation, and HPV status. Smoking status is categorized as Never smoker (0), Former smoker (1), and Current smoker (2), with the reported percentages indicating the proportion of participants who have ever smoked (including both former and current smokers). Sexual orientation is categorized as Heterosexual (0), Homosexual (1), and Others (2). HPV status represents the percentage of participants with a positive HPV status. This dataset is divided into 7 batches based on the experimental plates used during sequencing, allowing for the consideration of potential batch effects arising from differences in sample processing. These variables, including smoking status and sexual orientation, are incorporated into the analysis to control for potential confounding factors (<xref ref-type="table" rid="T2">Table 2</xref>). Publicly available dataset was used in this study.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Summary of the sample size and covariates in HPV datasets.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>Batch ID (plate ID)</bold></th>
<th valign="top" align="center"><bold>Sample size</bold></th>
<th valign="top" align="center"><bold>Age [mean (SD)]</bold></th>
<th valign="top" align="center"><bold>Smoking = True (%)</bold></th>
<th valign="top" align="center"><bold>Sexual orientation; Heterosexual = 0, Homosexual = 1, Others = 2</bold></th>
<th valign="top" align="center"><bold>HPV status = <italic>p</italic>os (%)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">p68_s01_JH1_16SV4</td>
<td valign="top" align="center">43</td>
<td valign="top" align="center">54.6 (10.6)</td>
<td valign="top" align="center">19 (44.1)</td>
<td valign="top" align="center">36 (84.7)/3 (7.1)/ 3 (7.1)</td>
<td valign="top" align="center">3 (6.9)</td>
</tr> <tr>
<td valign="top" align="left">p68_s02_JH2_16SV4</td>
<td valign="top" align="center">49</td>
<td valign="top" align="center">52.8 (9.1)</td>
<td valign="top" align="center">18 (36.7)</td>
<td valign="top" align="center">41 (89.1)/1 (2.1)/ 4 (8.6)</td>
<td valign="top" align="center">3 (6.2)</td>
</tr> <tr>
<td valign="top" align="left">p68_s03_JH3_16SV4</td>
<td valign="top" align="center">56</td>
<td valign="top" align="center">54.3 (9.6)</td>
<td valign="top" align="center">25 (44.6)</td>
<td valign="top" align="center">49 (87.5)/3 (5.3)/ 4 (7.1)</td>
<td valign="top" align="center">2 (3.5)</td>
</tr> <tr>
<td valign="top" align="left">p68_s04_JH4_16SV4</td>
<td valign="top" align="center">79</td>
<td valign="top" align="center">54.1 (9.1)</td>
<td valign="top" align="center">32 (40.5)</td>
<td valign="top" align="center">71 (92.2)/4 (5.1)/2 (2.5)</td>
<td valign="top" align="center">3 (3.7)</td>
</tr> <tr>
<td valign="top" align="left">p68_s05_JH5_16SV4</td>
<td valign="top" align="center">89</td>
<td valign="top" align="center">54 (10.1)</td>
<td valign="top" align="center">34 (38.2)</td>
<td valign="top" align="center">83 (93.2)/3 (3.3)/3 (3.3)</td>
<td valign="top" align="center">2 (2.2)</td>
</tr> <tr>
<td valign="top" align="left">p68_s06_JH6_16SV4</td>
<td valign="top" align="center">88</td>
<td valign="top" align="center">53.3 (10.3)</td>
<td valign="top" align="center">38 (43.1)</td>
<td valign="top" align="center">67 (77)/7 (7)/13 (14.9)</td>
<td valign="top" align="center">20 (23.2)</td>
</tr> <tr>
<td valign="top" align="left">p68_s07_JH7_16SV4</td>
<td valign="top" align="center">91</td>
<td valign="top" align="center">56.7 (11.6)</td>
<td valign="top" align="center">43 (47.2)</td>
<td valign="top" align="center">74 (85)/8 (9.1)/5 (5.7)</td>
<td valign="top" align="center">17 (18.6)</td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>The provided smoking indicator represents the percentage of HIV-positive participants (Pinto-Cardoso et al., <xref ref-type="bibr" rid="B22">2017</xref>). The sexual orientation variable reports the number and proportion of participants based on their sexual orientation (0 = heterosexual, 1 = homosexual, 2 = other). The HPV status indicator includes the percentage of HPV-positive participants.</p>
</table-wrap-foot>
</table-wrap>
</sec>
</sec>
<sec>
<title>2.2 Proposed model</title>
<sec>
<title>2.2.1 Negative binomial model</title>
<p>Here, we suppose that the entry <italic>Y</italic><sub><italic>ijg</italic></sub> denotes a count value for the <italic>j</italic><sup><italic>th</italic></sup> OTU of the <italic>i</italic><sup><italic>th</italic></sup> sample from the <italic>g</italic><sup><italic>th</italic></sup> batch, with <italic>i</italic> &#x0003D; 1, &#x02026;, <italic>n</italic>, <italic>j</italic> &#x0003D; 1, &#x02026;, <italic>m</italic>, <italic>g</italic> &#x0003D; 1, &#x02026;, <italic>k</italic>. Covariates such as important biomedical, demographic, genomic, and other information based on prior knowledge are represented as <italic>X</italic><sub><italic>i</italic></sub>, which is the vector of fixed effects and &#x003B2; is the vector of the fixed effects coefficients (Zhang et al., <xref ref-type="bibr" rid="B34">2017</xref>).</p>
<p>We describe the negative binomial regression model used for batch effect adjustment. In our approach, it is employed to estimate the count of OTUs, considering the influence of covariates and batch ID as a fixed effect. By treating the batch ID as a fixed effect, this model can directly account for the consistent differences observed across batches. This model specifically targets the non-zero counts of OTUs, denoted as <italic>Y</italic><sub><italic>ijg</italic></sub> |<italic>Y</italic><sub><italic>ijg</italic></sub> &#x0003E; 0. The counts are assumed to follow a negative binomial distribution (Tran et al., <xref ref-type="bibr" rid="B27">2020</xref>), represented by <italic>Y</italic><sub><italic>ijg</italic></sub> |<italic>Y</italic><sub><italic>ijg</italic></sub> &#x0003E; 0 &#x0007E; <italic>NB</italic>(&#x003BC;<sub><italic>ijg</italic></sub>, &#x003B8;<sub><italic>jg</italic></sub>), where &#x003BC;<sub><italic>ijg</italic></sub> is expected count and &#x003B8;<sub><italic>jg</italic></sub> is the dispersion parameter of the negative binomial distribution. The equation for this model is expressed as follows (Chen and Li, <xref ref-type="bibr" rid="B3">2016</xref>; Dong et al., <xref ref-type="bibr" rid="B5">2020</xref>; Li et al., <xref ref-type="bibr" rid="B14">2023</xref>; Yirga et al., <xref ref-type="bibr" rid="B32">2020</xref>):</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mtable class="eqnarray" columnalign="center"><mml:mtr><mml:mtd><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C3;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mi>X</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mo>&#x003B2;</mml:mo></mml:mstyle></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B3;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:msub><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>In this equation, &#x003C3;<sub><italic>j</italic></sub> represents the baseline expression level for the <italic>j</italic><sup><italic>th</italic></sup> OTU, reflecting the unique characteristics of each OTU. <bold>X</bold><sub><italic>i</italic></sub> is the vector of the covariates for the <italic>i</italic><sup><italic>th</italic></sup> sample, <bold>&#x003B2;</bold><sub><italic>j</italic></sub> is the vector of coefficients for the <italic>j</italic><sup><italic>th</italic></sup> OTU, &#x003B3;<sub><italic>jg</italic></sub> is the mean batch effect for the <italic>j</italic><sup><italic>th</italic></sup> OTU in the <italic>g</italic><sup><italic>th</italic></sup> batch, and log<italic>N</italic><sub><italic>i</italic></sub> represents the library size, i.e., the total counts across all OTUs in the <italic>i</italic><sup><italic>th</italic></sup> sample (Johnson et al., <xref ref-type="bibr" rid="B8">2007</xref>; Zhang et al., <xref ref-type="bibr" rid="B36">2020</xref>; Ramakodi, <xref ref-type="bibr" rid="B23">2021</xref>).</p>
<p>The variance of the negative binomial distribution is defined as:</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M2"><mml:mtable class="eqnarray" columnalign="center"><mml:mtr><mml:mtd><mml:mi>V</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>Y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>Y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0003E;</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>This setup allows us to account for the baseline expression levels, import ant biological covariates, and batch effects in the analysis, providing an adequate structure for modeling OTU counts in microbiome data. To adjust for batch effects, the log-transformed mean parameter log(&#x003BC;<sub><italic>ijg</italic></sub>) is adjusted by subtracting the batch effect term (Van den Berg et al., <xref ref-type="bibr" rid="B30">2006</xref>):</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M3"><mml:mtable class="eqnarray" columnalign="center"><mml:mtr><mml:mtd><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B3;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M4"><mml:msubsup><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> is the adjusted mean parameter. The variance parameter for the <italic>j</italic><sup><italic>th</italic></sup> OTU in the <italic>g</italic><sup><italic>th</italic></sup> batch, &#x003B8;<sub><italic>jg</italic></sub>, is averaged across all batches to obtain a consistent variance parameter:</p>
<disp-formula id="E4"><label>(4)</label><mml:math id="M5"><mml:mtable class="eqnarray" columnalign="center"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac><mml:msub><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>g</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M6"><mml:msubsup><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> is the average variance parameter for the <italic>j</italic><sup><italic>th</italic></sup> OTU and <italic>N</italic><sub><italic>G</italic></sub> is the total number of batches. By averaging these parameters, we can apply a consistent variance across all batches, reducing variability caused by individual batch effects that can be caused by non-interest variables (Johnson et al., <xref ref-type="bibr" rid="B8">2007</xref>; Zhang et al., <xref ref-type="bibr" rid="B36">2020</xref>).</p>
<p>The adjusted counts calculated using the refined parameters <inline-formula><mml:math id="M7"><mml:msubsup><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula><mml:math id="M8"><mml:msubsup><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> are modeled to follow negative binomial distribution <inline-formula><mml:math id="M9"><mml:msubsup><mml:mrow><mml:mi>Y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup><mml:mo>&#x0007E;</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mi>N</mml:mi><mml:mi>B</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:msubsup><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. Adjusted values <inline-formula><mml:math id="M10"><mml:msubsup><mml:mrow><mml:mi>Y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> are then derived by mapping the data from the empirical distribution of the original counts to the batch-free distribution, according to quantile levels. This mapping process ensures that the batch-specific effects are removed, and the adjusted counts are comparable across different batches.</p>
</sec>
<sec>
<title>2.2.2 Logistic regression</title>
<p>We proceed with an additional process to address variability at the OTU level using the adjusted OTU count. It initiates with utilizing logistic regression to estimate the probability of nonzero occurrences within the adjusted OTU table (Zhang and Yi, <xref ref-type="bibr" rid="B33">2020</xref>). This probabilistic model serves to identify samples where the presence of OTU is confirmed. Subsequently, quantile regression is applied to these identified samples (Pendegraft et al., <xref ref-type="bibr" rid="B20">2019</xref>).</p>
<disp-formula id="E5"><label>(5)</label><mml:math id="M11"><mml:mtable class="eqnarray" columnalign="center"><mml:mtr><mml:mtd><mml:mrow><mml:mi>logit</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>P</mml:mi><mml:mi>r</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mi>Y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mo>&#x02217;</mml:mo></mml:msubsup><mml:mo>&#x0003E;</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>X</mml:mi></mml:mstyle><mml:mi>i</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:mstyle mathsize='normal'><mml:mi>&#x003B6;</mml:mi></mml:mstyle><mml:mo>+</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>B</mml:mi></mml:mstyle><mml:mi>i</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:mstyle mathsize='normal'><mml:mi>&#x003C8;</mml:mi></mml:mstyle></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E6"><label>(6)</label><mml:math id="M12"><mml:mtable class="eqnarray" columnalign="center"><mml:mtr><mml:mtd><mml:mo class="qopname">Pr</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mi>Y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:munderover></mml:mstyle><mml:mo>&#x0003E;</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo class="qopname">exp</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>X</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mstyle mathvariant="bold"><mml:mo>&#x003B6;</mml:mo></mml:mstyle><mml:mo>&#x0002B;</mml:mo><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>B</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:mstyle mathvariant="bold"><mml:mo>&#x003C8;</mml:mo></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x0002B;</mml:mo><mml:mo class="qopname">exp</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>X</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mstyle mathvariant="bold"><mml:mo>&#x003B6;</mml:mo></mml:mstyle><mml:mo>&#x0002B;</mml:mo><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>B</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:mstyle mathvariant="bold"><mml:mo>&#x003C8;</mml:mo></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Here, <inline-formula><mml:math id="M13"><mml:msubsup><mml:mrow><mml:mi>Y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> denotes the count of the <italic>j</italic><sup><italic>th</italic></sup> adjusted OTU in the <italic>i</italic><sup><italic>th</italic></sup> sample, <bold>X</bold><sub><bold>i</bold></sub> represents the covariates for this observation, <bold>B</bold><sub><bold>i</bold></sub> is random effect term for the <italic>i</italic><sup><italic>th</italic></sup> sample, capturing the batch effect. <bold>&#x003B6;</bold> is the coefficient vector linked to covariates, and <bold>&#x003C8;</bold> is the coefficient vector associated with batch effects. <italic>q</italic><sub><italic>ij</italic></sub> denotes the probability of nonzero occurrences within the adjusted OTU table, and it is crucial for understanding the likelihood of observing nonzero counts for a given OTU in the adjusted OTU table.</p>
</sec>
<sec>
<title>2.2.3 Composite quantile regression</title>
<p>We employ Composite Quantile Regression (CQR) to robustly model microbiome data, which exhibits zero-inflation and a high frequency of outliers. This method allows for the estimation of conditional values of microbial OTUs across various quantiles. For non-zero microbial counts, the conditional quantile function of the response variable is defined as Ling et al. (<xref ref-type="bibr" rid="B16">2021a</xref>,<xref ref-type="bibr" rid="B17">b</xref>); Koenker and Hallock (<xref ref-type="bibr" rid="B10">2001</xref>):</p>
<disp-formula id="E7"><label>(7)</label><mml:math id="M14"><mml:mtable class="eqnarray" columnalign="center"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mi>Y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mi>X</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:msubsup><mml:mrow><mml:mtext>&#x000A0;&#x000A0;</mml:mtext><mml:mi>Y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup><mml:mo>&#x0003E;</mml:mo><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mi>X</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:mstyle mathvariant="bold"><mml:mo>&#x003B1;</mml:mo></mml:mstyle><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mi>B</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:mstyle mathvariant="bold"><mml:mo>&#x003B4;</mml:mo></mml:mstyle><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>In this equation, <bold>&#x003B1;</bold>(&#x003C4;), <bold>&#x003B4;</bold>(&#x003C4;) is the vector of regression coefficients specific to quantile level <inline-formula><mml:math id="M15"><mml:mi>&#x003C4;</mml:mi><mml:mo>=</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:mfrac><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> with a large <italic>k</italic> (e.g., 5<sup><italic>th</italic></sup>, 10<sup><italic>th</italic></sup>, &#x02026;, 95<sup><italic>th</italic></sup> <italic>percentiles with k</italic> &#x0003D; 19). This allows us to capture the distributional characteristics of the data across various quantiles. This model diverges from traditional quantile regression by adopting a unified set of regression coefficients across all quantiles. This approach facilitates a more streamlined estimation process that leverages the robustness of quantile regression while simplifying the model complexity typically associated with estimating separate coefficients for each quantile.</p>
<disp-formula id="E8"><label>(8)</label><mml:math id="M16"><mml:mtable class="eqnarray" columnalign="center"><mml:mtr><mml:mtd><mml:mrow><mml:msub><mml:mi>&#x02211;</mml:mi><mml:mrow><mml:mi>&#x003C4;</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mstyle displaystyle='true'><mml:msubsup><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>n</mml:mi></mml:msubsup><mml:mrow><mml:mstyle displaystyle='true'><mml:msubsup><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>m</mml:mi></mml:msubsup><mml:mrow><mml:msub><mml:mi>&#x003C1;</mml:mi><mml:mi>&#x003C4;</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mi>Y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mo>&#x02217;</mml:mo></mml:msubsup><mml:mo>&#x02212;</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>X</mml:mi></mml:mstyle><mml:mi>i</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>&#x003B1;</mml:mi></mml:mstyle><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x003C4;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x02212;</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>B</mml:mi></mml:mstyle><mml:mi>i</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>&#x003B4;</mml:mi></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x003C4;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mstyle></mml:mrow></mml:mstyle></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>In this formula, &#x003C1;<sub>&#x003C4;</sub>(<italic>u</italic>) is the check loss function calculated as &#x003C1;<sub>&#x003C4;</sub>(<italic>u</italic>) &#x0003D; <italic>u</italic>[&#x003C4;&#x02212;<italic>I</italic>(<italic>u</italic> &#x0003C; 0)], <italic>T</italic> is the set of target quantiles, <italic>n</italic> is the number of samples, and <italic>m</italic> is the number of OTUs. This formulation reduces the influence of outliers compared to mean regression and provides a deeper understanding of the overall distribution by analyzing data across various quantiles (He et al., <xref ref-type="bibr" rid="B7">2023</xref>).</p>
<p>Then, we incorporate the zero-inflation aspect by accounting for the probability of the response variable being zero. This approach allows us to effectively handle datasets with many zero values by considering the original distribution of the data. We combine the zero probability with the previously estimated quantile function to create a unified model that explains the entire data distribution, including both zero and positive values (Ling, <xref ref-type="bibr" rid="B15">2019</xref>).</p>
<disp-formula id="E9"><label>(9)</label><mml:math id="M17"><mml:mtable class="eqnarray" columnalign="center"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable style="text-align:axis;" equalrows="false" columnlines="none none none none none none none none none" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mn>0</mml:mn><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mo>,</mml:mo><mml:mtext>&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mi>&#x003C4;</mml:mi><mml:mo>&#x0003C;</mml:mo><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mover accent="true"><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mi>&#x003C4;</mml:mi><mml:mo>-</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mo>,</mml:mo><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mi>&#x003C4;</mml:mi><mml:mo>&#x02265;</mml:mo><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
</sec>
<sec>
<title>2.2.4 Reference batch selection</title>
<p>The selection of the reference batch is pivotal in the quantile regression framework as it serves as the baseline to which the distribution of counts (given the OTU is present) are aligned. The reference batch is not chosen based on size or abundance alone, as these factors do not necessarily indicate the quality or representativeness of the batch. To identify potential reference batches that demonstrate homogeneity across multiple microbial datasets, we apply the Kruskal-Wallis test (Kruskal and Wallis, <xref ref-type="bibr" rid="B11">1952</xref>). This non-parametric method assesses whether the median values across different batches are statistically similar, indicating stable baseline conditions across these groups. Batches that do not show significant differences in their medians indicate the possibility of consistent performance across various batches and are shortlisted for further evaluation (Kruskal and Wallis, <xref ref-type="bibr" rid="B11">1952</xref>; Chakraborty et al., <xref ref-type="bibr" rid="B2">2011</xref>). Given the high prevalence of outliers and zero counts in microbiome data, traditional measures of variability such as the standard coefficient of variation (CV) may not provide reliable insights. Instead, we utilize a Robust CV, defined as the ratio of the Median Absolute Deviation (MAD) to the median of the data, multiplied by 100% (Pham-Gia and Hung, <xref ref-type="bibr" rid="B21">2001</xref>):</p>
<disp-formula id="E10"><label>(10)</label><mml:math id="M18"><mml:mtable class="eqnarray" columnalign="center"><mml:mtr><mml:mtd><mml:mi>R</mml:mi><mml:mi>o</mml:mi><mml:mi>b</mml:mi><mml:mi>u</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mi>C</mml:mi><mml:mi>V</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mi>M</mml:mi><mml:mi>A</mml:mi><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>M</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>*</mml:mo><mml:mn>100</mml:mn><mml:mi>%</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>This formula offers a more resilient measure against the skewness and anomalies inherent in microbiome data. Among the batches identified as homogeneous by the Kruskal-Wallis test, the batch exhibiting the lowest Robust CV is selected as the optimal reference. This batch is expected to have the least variability, making it the best candidate for minimizing batch effects in the subsequent analyses. <xref ref-type="supplementary-material" rid="SM1">Supplementary material</xref> including the code and data used in this study, is available at <ext-link ext-link-type="uri" xlink:href="https://github.com/JIWONNP/Composite-Quantile-Regerssion">https://github.com/JIWONNP/Composite-Quantile-Regerssion</ext-link>.</p>
</sec>
</sec>
</sec>
<sec sec-type="results" id="s3">
<title>3 Results</title>
<p>In our study, we conducted an empirical evaluation of batch effect correction methods, including MMUPHin, Percentile Normalization, ConQuR, ComBat and our proposed model. To demonstrate the effectiveness of our proposed model, we used PERMANOVA <italic>R</italic><sup>2</sup> (<xref ref-type="table" rid="T3">Tables 3</xref>, <xref ref-type="table" rid="T4">4</xref>), Average Silhouette Coefficient for quantification and Principal Coordinates Analysis (PCoA) plots (<xref ref-type="fig" rid="F1">Figures 1</xref>&#x02013;<xref ref-type="fig" rid="F5">5</xref>) for visualization. A range of dissimilarity metrics commonly used in microbiome research were employed for this evaluation. We compared the values for each method across different dissimilarity metrics (Song et al., <xref ref-type="bibr" rid="B26">2023</xref>): Bray-Curtis, Aitchison, Canberra, and Manhattan. The outcomes of the Kruskal-Wallis test indicate no significant differences between the distributions of batch groups, suggesting that the negative binomial regression model appropriately formed similar distribution pattern by excluding consistent effects across each batch.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>PERMANOVA <italic>R</italic><sup>2</sup> of HIVRC dataset.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="center" colspan="4"><bold>Distance metric</bold></th>
</tr>
<tr style="background-color:#919498;color:#ffffff">
<th/>
<th valign="top" align="center"><bold>Bray-curtis</bold></th>
<th valign="top" align="center"><bold>Aitchison</bold></th>
<th valign="top" align="center"><bold>Canberra</bold></th>
<th valign="top" align="center"><bold>Manhattan</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Original</td>
<td valign="top" align="center">0.1199<sup>&#x0002A;</sup></td>
<td valign="top" align="center">0.0642<sup>&#x0002A;</sup></td>
<td valign="top" align="center">0.0808<sup>&#x0002A;</sup></td>
<td valign="top" align="center">0.0879<sup>&#x0002A;</sup></td>
</tr> <tr>
<td valign="top" align="left">MMUPHin</td>
<td valign="top" align="center">0.0822<sup>&#x0002A;</sup></td>
<td valign="top" align="center">0.0602<sup>&#x0002A;</sup></td>
<td valign="top" align="center"><bold>0.0779</bold><sup><bold>&#x0002A;</bold></sup></td>
<td valign="top" align="center">0.0295<sup>&#x0002A;</sup></td>
</tr> <tr>
<td valign="top" align="left">Percentile normalization</td>
<td valign="top" align="center">0.1191<sup>&#x0002A;</sup></td>
<td valign="top" align="center">0.0899<sup>&#x0002A;</sup></td>
<td valign="top" align="center">0.1142<sup>&#x0002A;</sup></td>
<td valign="top" align="center">0.1136<sup>&#x0002A;</sup></td>
</tr> <tr>
<td valign="top" align="left">ConQuR</td>
<td valign="top" align="center">0.0149<sup>&#x0002A;</sup></td>
<td valign="top" align="center">0.0901<sup>&#x0002A;</sup></td>
<td valign="top" align="center">0.1628<sup>&#x0002A;</sup></td>
<td valign="top" align="center">0.0087<sup>&#x0002A;</sup></td>
</tr> <tr>
<td valign="top" align="left">ComBat</td>
<td valign="top" align="center">0.0637<sup>&#x0002A;</sup></td>
<td valign="top" align="center"><bold>0.0564</bold><sup><bold>&#x0002A;</bold></sup></td>
<td valign="top" align="center">0.0825<sup>&#x0002A;</sup></td>
<td valign="top" align="center">0.0137<sup>&#x0002A;</sup></td>
</tr> <tr>
<td valign="top" align="left">Proposed model</td>
<td valign="top" align="center"><bold>0.0128</bold><sup><bold>&#x0002A;</bold></sup></td>
<td valign="top" align="center">0.0854<sup>&#x0002A;</sup></td>
<td valign="top" align="center">0.1493<sup>&#x0002A;</sup></td>
<td valign="top" align="center"><bold>0.0065</bold><sup><bold>&#x0002A;</bold></sup></td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>PERMANOVA <italic>R</italic><sup>2</sup> represents the proportion of total variance explained by differences between groups (Anderson, <xref ref-type="bibr" rid="B1">2014</xref>; Kelly et al., <xref ref-type="bibr" rid="B9">2015</xref>). In this study, it was utilized to assess whether differences between groups were reduced by comparing the data before and after batch effect correction. A lower <italic>R</italic><sup>2</sup> value indicates that the batch effect correction was successfully performed. The values in bold represent the cases where batch correction was most effectively performed for each distance metric. Significant values (<italic>P</italic> &#x02264; 0.05) are denoted by an asterisk (<sup>&#x0002A;</sup>).</p>
</table-wrap-foot>
</table-wrap>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>PERMANOVA <italic>R</italic><sup>2</sup> of HPV dataset.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="center" colspan="4"><bold>Distance metrics</bold></th>
</tr>
<tr style="background-color:#919498;color:#ffffff">
<th/>
<th valign="top" align="center"><bold>Bray-Curtis</bold></th>
<th valign="top" align="center"><bold>Aitchison</bold></th>
<th valign="top" align="center"><bold>Canberra</bold></th>
<th valign="top" align="center"><bold>Manhattan</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Original</td>
<td valign="top" align="center">0.0379<sup>&#x0002A;</sup></td>
<td valign="top" align="center">0.0473<sup>&#x0002A;</sup></td>
<td valign="top" align="center">0.0596<sup>&#x0002A;</sup></td>
<td valign="top" align="center">0.0291<sup>&#x0002A;</sup></td>
</tr> <tr>
<td valign="top" align="left">MMUPHin</td>
<td valign="top" align="center">0.023<sup>&#x0002A;</sup></td>
<td valign="top" align="center">0.0449<sup>&#x0002A;</sup></td>
<td valign="top" align="center">0.0677<sup>&#x0002A;</sup></td>
<td valign="top" align="center">0.0156<sup>&#x0002A;</sup></td>
</tr> <tr>
<td valign="top" align="left">Percentile normalization</td>
<td valign="top" align="center">0.071<sup>&#x0002A;</sup></td>
<td valign="top" align="center">0.039<sup>&#x0002A;</sup></td>
<td valign="top" align="center">0.0676<sup>&#x0002A;</sup></td>
<td valign="top" align="center">0.0766<sup>&#x0002A;</sup></td>
</tr> <tr>
<td valign="top" align="left">ConQuR</td>
<td valign="top" align="center">0.0154<sup>&#x0002A;</sup></td>
<td valign="top" align="center">0.0263<sup>&#x0002A;</sup></td>
<td valign="top" align="center">0.0294<sup>&#x0002A;</sup></td>
<td valign="top" align="center">0.0032<sup>&#x0002A;</sup></td>
</tr> <tr>
<td valign="top" align="left">ComBat</td>
<td valign="top" align="center">0.0177<sup>&#x0002A;</sup></td>
<td valign="top" align="center">0.0422<sup>&#x0002A;</sup></td>
<td valign="top" align="center">0.0592<sup>&#x0002A;</sup></td>
<td valign="top" align="center">0.0079<sup>&#x0002A;</sup></td>
</tr> <tr>
<td valign="top" align="left">Proposed model</td>
<td valign="top" align="center"><bold>0.002</bold><sup>&#x0002A;</sup></td>
<td valign="top" align="center"><bold>0.0246</bold><sup>&#x0002A;</sup></td>
<td valign="top" align="center"><bold>0.0264</bold><sup>&#x0002A;</sup></td>
<td valign="top" align="center"><bold>0.0029</bold><sup>&#x0002A;</sup></td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>The values in bold represent the cases where batch correction was most effectively performed for each distance metric. Significant values (<italic>P</italic> &#x02264; 0.05) are denoted by an asterisk (<sup>&#x0002A;</sup>).</p>
</table-wrap-foot>
</table-wrap>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>PCoA plots of HIVRC dataset. Each plot corresponds to a different dissimilarity measure. These metrics capture different aspects of the data&#x00027;s structure in pre-correction data. The colors represent batch ID indicating in which of the individual studies that each sample was included. If spatial patterns among data points from various batches are consistent, this is indicative of effective batch effect correction. Distinct colors represent different batches, and the intermingling of these colors rather than clear segregation indicates a high degree of batch effect correction.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmicb-16-1484183-g0001.tif"/>
</fig>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Comparative PCoA plot across diverse methods of HIVRC dataset. Each row shows data points clustered by batch ID. The first column is the original data without any batch effect correction. The following columns show the data after applying existing methods and proposed model. Each row corresponds to a different dissimilarity metric used in the PCoA algorithm. The batch IDs were coded as follows: &#x0201C;Noguera-Julian&#x0201D; = 1, &#x0201C;Pinto-Cardoso&#x0201D; = 2, &#x0201C;Serrano-Villar 2017&#x0201D; = 3, and &#x0201C;Vesterbacka&#x0201D; = 4. The variation across rows demonstrates how the choice of dissimilarity metric affects the visualization of batch effects and their correction.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmicb-16-1484183-g0002.tif"/>
</fig>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>PCoA plots of HPV dataset. Each plot corresponds to a different dissimilarity measure. These metrics capture different aspects of the data&#x00027;s structure. The colors represent seven batch ID indicating individual PlateID that each sample was included.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmicb-16-1484183-g0003.tif"/>
</fig>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Comparative PCoA plots across diverse methods of HPV dataset. Each row shows data points clustered by 7 batch ID. The first column is the original data without any batch effect correction. The following columns show the data after applying existing methods and proposed model. Each row corresponds to a different dissimilarity metric used in the PCoA algorithm. The batch IDs were coded as follows: &#x0201C;p68_s01_JH1_16SV4&#x0201D; = 1, &#x0201C;p68_s07_JH7_16SV4&#x0201D; = 2, &#x0201C;p68_s02_JH2_16SV4&#x0201D; = 3, and &#x0201C;p68_s03_JH3_16SV4&#x0201D;= 4, &#x0201C;p68_s04_JH4_16SV4&#x0201D; = 5, &#x0201C;p68_s05_JH5_16SV4&#x0201D; = 6, &#x0201C;p68_s06_JH6_16SV4&#x0201D; = 7.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmicb-16-1484183-g0004.tif"/>
</fig>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Average Silhouette coefficient of each dataset. The silhouette coefficient measures the quality of clustering (Llet et al., <xref ref-type="bibr" rid="B18">2004</xref>). When the silhouette coefficient is close to 1, it means the samples are well-separated within their own group and distinctly apart from other groups, implying that batch effects are still present. A silhouette coefficient close to 0 suggests that the boundaries between groups are ambiguous and there are no clear distinctions, indicating that the samples are well-mixed and batch effect correction is most ideal. Conversely, a silhouette coefficient close to &#x02212;1 means that the samples have poor cohesion within their group and are closer to samples in other groups, indicating poor batch effect correction. Four different distance metrics were employed: Bray-curtis (red), Aitchison (green), Canberra (blue), and Manhattan (purple).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmicb-16-1484183-g0005.tif"/>
</fig>
<p>Results from the analysis of the Human Immunodeficiency Virus Re-analysis Consortium dataset are presented below. These were analyzed to evaluate the effectiveness of our proposed model in correcting batch effects and maintaining the integrity of the microbiome data. For Bray-Curtis dissimilarity, the proposed model achieved the lowest <italic>R</italic><sup>2</sup> value of 0.0128<sup>&#x0002A;</sup>, indicating superior performance in batch effect correction compared to other methods. ConQuR also performed well with a <italic>R</italic><sup>2</sup> value of 0.0149<sup>&#x0002A;</sup>, while MMUPHin and ComBat showed moderate performance with <italic>R</italic><sup>2</sup> values of 0.0822<sup>&#x0002A;</sup> and 0.0637<sup>&#x0002A;</sup>, respectively (<xref ref-type="table" rid="T4">Table 4</xref>). Regarding Aitchison dissimilarity, the proposed model had a higher <italic>R</italic><sup>2</sup> value of 0.0854<sup>&#x0002A;</sup>, indicating less effective correction in this specific metric and ComBat demonstrated better performance with R<sup>2</sup> values of 0.0564<sup>&#x0002A;</sup>. ConQuR showed the highest <italic>R</italic><sup>2</sup> value of 0.0901<sup>&#x0002A;</sup>, suggesting the least effective correction in this metric. For Canberra dissimilarity, MMUPHin achieved the lowest <italic>R</italic><sup>2</sup> value of 0.0779<sup>&#x0002A;</sup>, closely followed by ComBat with an <italic>R</italic><sup>2</sup> value of 0.0825<sup>&#x0002A;</sup>. The proposed model had <italic>R</italic><sup>2</sup> value of 0.1493<sup>&#x0002A;</sup>, which was lower than ConQuR&#x00027;s <italic>R</italic><sup>2</sup> value of 0.1628<sup>&#x0002A;</sup>, indicating moderate performance. In the case of Manhattan dissimilarity, the proposed model excelled with the lowest <italic>R</italic><sup>2</sup> value of 0.0065<sup>&#x0002A;</sup>, demonstrating excellent batch effect correction. ConQuR and ComBat showed moderate performance with <italic>R</italic><sup>2</sup> values of 0.0087<sup>&#x0002A;</sup> and 0.0137<sup>&#x0002A;</sup>, respectively. This analysis confirms the effectiveness of various methods in addressing batch effects, with significant values (<italic>P</italic> &#x02264; 0.05) marked by an asterisk (<sup>&#x0002A;</sup>). This calculation was conducted using the &#x0201C;adonis&#x0201D; function of vegan package (2.6&#x02013;6.1) in R.</p>
<p>Overall, our proposed method consistently achieved lower <italic>R</italic><sup>2</sup> values, indicating a significant reduction in batch effects across various dissimilarity measures. This performance underscores the robustness and superior effectiveness of our method in correcting batch effects in microbiome data. MMUPHin and Percentile Normalization exhibited intermediate performance across most dissimilarity measures, while ConQuR and ComBat showed varying effectiveness depending on the metric used. The effectiveness of batch effect correction varies across different distance metrics, as each metric considers different aspects of the data. Bray-Curtis and Manhattan metrics reflect compositional differences between samples. The Bray-Curtis metric focuses on the relative abundance of shared OTUs between samples, highlighting how similarly the samples are composed in terms of the OTUs they have in common. Similarly, the Manhattan metric considers the absolute differences in abundance, treating all changes equally. Therefore, Bray-Curtis and Manhattan metrics effectively capture overall abundance differences and compositional changes in data. On the other hand, Aitchison and Canberra metrics emphasize ratio-based differences and small values. The Aitchison metric, used for compositional data, is sensitive to variations in the relative abundance of OTUs, considering ratio fluctuations. The Canberra metric assigns greater weight to changes in small values, highlighting differences in proportions with smaller magnitudes. Due to these characteristics, in composite quantile regression models, the effectiveness of adjustments can be diminished when using distance metrics like Aitchison and Canberra, depending on the selected regression coefficients (Ricotta, <xref ref-type="bibr" rid="B24">2017</xref>). This comprehensive evaluation highlights the proposed model&#x00027;s ability to handle batch effects more effectively, ensuring reliable analysis of microbiome data.</p>
<p>The proposed model consistently achieved the lowest R<sup>2</sup> values across all distance metrics in the HPV dataset, demonstrating optimal batch effect correction performance with values of 0.002<sup>&#x0002A;</sup>, 0.0246<sup>&#x0002A;</sup>, 0.0264<sup>&#x0002A;</sup>, and 0.0029<sup>&#x0002A;</sup> for the Bray-Curtis, Aitchison, Canberra, and Manhattan distance metrics, respectively (<xref ref-type="table" rid="T4">Table 4</xref>). The prominent results of the proposed model with the Bray-Curtis and Manhattan distance metrics can be attributed to the characteristics of the composite quantile regression model described previously.</p>
</sec>
<sec id="s4">
<title>4 Discussion and conclusion</title>
<p>Our model employs a Negative Binomial approach to model the variability within batch effects, aiming to align batch distributions and exclude generalized effects across batches. We select the reference batch through a methodical process, employing the Kruskal-Wallis test and median absolute deviation to identify batches that consistently reflect the general characteristics of the data. Furthermore, our approach incorporates composite quantile regression which addresses the effects within each batch at the OTU level. This procedure ensures that our model not only corrects for obvious systematic differences but also subtly adjusts for less predictable changes within the microbiome data, leading to a more reliable results of biological data across different batches.</p>
<p>When applied to the HIVRC data, and the HPV study data, this model demonstrated considerable success in reducing batch effects. It showed improved correction results compared to the application of conditional quantile regression independently, consistently across these different datasets. By effectively addressing both systematic and non-systematic batch effects, the proposed method ensures a more reliable and accurate analysis of microbiome data. We also verified that similar results were obtained using an additional dataset (Ewha Women&#x00027;s University Medical Center metagenomic urine control data&#x02212;16S rRNA sequencing data). Given the similarity of observed patterns, we chose to present results from two representative datasets.</p>
<p>Despite the effectiveness of our proposed method, several limitations should be acknowledged. First, microbiome data often contain a high proportion of zeros that can lead to instability in quantile regression at extreme quantiles (e.g., lower 10% and upper 90%) due to the limited number of non-zero data points. This sparsity can affect the robustness and accuracy of the regression estimates at these extreme quantiles. In practice, the dataset with the lowest zero inflation (31%) showed the best results in correcting batch effect based on quantile estimation and based on this, it was determined that the rate of zero inflation could act as a factor that decreases the stability of the model. Furthermore, the performance of the model may vary depending on the metrics or categories of data used. Different datasets or dissimilarity distance measures may influence the effectiveness of the batch effect correction, indicating that the model&#x00027;s performance is not universally consistent across all possible applications. Lastly, the efficacy of our model may be compromised in research designs where the separation of confounding variables from primary variables is challenging. Addressing batch effects in confounded study designs presents considerable difficulties, and careful experimental design can serve as a critical factor in alleviating batch effects. These limitations highlight the need for further refinement of the model to enhance its robustness and applicability across diverse microbiome datasets. Additionally, the model assumes that the count data follow a negative binomial distribution, which may not always be the case in real-world microbiome data. Also, when various trends or patterns clearly emerge across different quantiles of the data, it may be more appropriate to use regression coefficients specialized for each quantile.</p>
</sec>
</body>
<back>
<sec sec-type="data-availability" id="s5">
<title>Data availability statement</title>
<p>Publicly available datasets were analyzed in this study. This data can be found here: (1) HIVRC [<ext-link ext-link-type="uri" xlink:href="https://www.synapse.org/Synapse:syn18406803/wiki/589668">https://www.synapse.org/Synapse:syn18406803/wiki/589668</ext-link>] (ID: syn18406803). (2) HPV [<ext-link ext-link-type="uri" xlink:href="https://www.synapse.org/Synapse:syn26529406/wiki/614952">https://www.synapse.org/Synapse:syn26529406/wiki/614952</ext-link>] (ID: syn26529406).</p>
</sec>
<sec sec-type="ethics-statement" id="s6">
<title>Ethics statement</title>
<p>Ethical approval was not required for the study involving humans in accordance with the local legislation and institutional requirements. Written informed consent to participate in this study was not required from the participants or the participants&#x00027; legal guardians/next of kin in accordance with the national legislation and the institutional requirements. The manuscript presents research on animals that do not require ethical approval for their study.</p>
</sec>
<sec sec-type="author-contributions" id="s7">
<title>Author contributions</title>
<p>JP: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing &#x02013; original draft, Writing &#x02013; review &#x00026; editing. TP: Project administration, Supervision, Validation, Writing &#x02013; review &#x00026; editing.</p>
</sec>
<sec sec-type="funding-information" id="s8">
<title>Funding</title>
<p>The author(s) declare financial support was received for the research, authorship, and/or publication of this article. This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (NRF-2022R1A2C1092497).</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s9">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<sec sec-type="supplementary-material" id="s10">
<title>Supplementary material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fmicb.2025.1484183/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fmicb.2025.1484183/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Data_Sheet_1.pdf" id="SM1" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Anderson</surname> <given-names>M. J.</given-names></name></person-group> (<year>2014</year>). <source>Permutational Multivariate Analysis Of Variance (PERMANOVA).</source> <publisher-loc>Hoboken</publisher-loc>: <publisher-name>Wiley</publisher-name>, <fpage>1</fpage>&#x02013;<lpage>15</lpage>.</citation>
</ref>
<ref id="B2">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chakraborty</surname> <given-names>S.</given-names></name> <name><surname>Balasubramanian</surname> <given-names>V.</given-names></name> <name><surname>Panchanathan</surname> <given-names>S.</given-names></name></person-group> (<year>2011</year>). <article-title>&#x0201C;Optimal batch selection for active learning in multi-label classification,&#x0201D;</article-title> in <source>Proceedings of the 19th ACM International Conference on Multimedia</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>1413</fpage>&#x02013;<lpage>1416</lpage>.</citation>
</ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>E. Z.</given-names></name> <name><surname>Li</surname> <given-names>H.</given-names></name></person-group> (<year>2016</year>). <article-title>A two-part mixed-effects model for analyzing longitudinal microbiome compositional data</article-title>. <source>Bioinformatics</source> <volume>32</volume>, <fpage>2611</fpage>&#x02013;<lpage>2617</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btw308</pub-id><pub-id pub-id-type="pmid">27187200</pub-id></citation></ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Deeks</surname> <given-names>S. G.</given-names></name> <name><surname>Overbaugh</surname> <given-names>J.</given-names></name> <name><surname>Phillips</surname> <given-names>A.</given-names></name> <name><surname>Buchbinder</surname> <given-names>S.</given-names></name></person-group> (<year>2015</year>). <article-title>HIV infection</article-title>. <source>Nature Rev. Dis. Prim.</source> <volume>1</volume>, <fpage>1</fpage>&#x02013;<lpage>22</lpage>. <pub-id pub-id-type="doi">10.1038/nrdp.2015.35</pub-id><pub-id pub-id-type="pmid">27188527</pub-id></citation></ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dong</surname> <given-names>M.</given-names></name> <name><surname>Li</surname> <given-names>L.</given-names></name> <name><surname>Chen</surname> <given-names>M.</given-names></name> <name><surname>Kusalik</surname> <given-names>A.</given-names></name> <name><surname>Xu</surname> <given-names>W.</given-names></name></person-group> (<year>2020</year>). <article-title>Predictive analysis methods for human microbiome data with application to Parkinson&#x00027;s disease</article-title>. <source>PLoS ONE</source> <volume>15</volume>:<fpage>e0237779</fpage> <pub-id pub-id-type="doi">10.1371/journal.pone.0237779</pub-id><pub-id pub-id-type="pmid">32834004</pub-id></citation></ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gibbons</surname> <given-names>S. M.</given-names></name> <name><surname>Duvallet</surname> <given-names>C.</given-names></name> <name><surname>Alm</surname> <given-names>E. J.</given-names></name></person-group> (<year>2018</year>). <article-title>Correcting for batch effects in case-control microbiome studies</article-title>. <source>PLoS Comput. Biol.</source> <volume>14</volume>:<fpage>e1006102</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pcbi.1006102</pub-id><pub-id pub-id-type="pmid">29684016</pub-id></citation></ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>He</surname> <given-names>X.</given-names></name> <name><surname>Pan</surname> <given-names>X.</given-names></name> <name><surname>Tan</surname> <given-names>K. M.</given-names></name> <name><surname>Zhou</surname> <given-names>W. X.</given-names></name></person-group> (<year>2023</year>). <article-title>Smoothed quantile regression with large-scale inference</article-title>. <source>J. Econom.</source> <volume>232</volume>, <fpage>367</fpage>&#x02013;<lpage>388</lpage>. <pub-id pub-id-type="pmid">36776480</pub-id></citation></ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Johnson</surname> <given-names>W. E.</given-names></name> <name><surname>Li</surname> <given-names>C.</given-names></name> <name><surname>Rabinovic</surname> <given-names>A.</given-names></name></person-group> (<year>2007</year>). <article-title>Adjusting batch effects in microarray expression data using empirical Bayes methods</article-title>. <source>Biostatistics</source> <volume>8</volume>, <fpage>118</fpage>&#x02013;<lpage>127</lpage>. <pub-id pub-id-type="doi">10.1093/biostatistics/kxj037</pub-id><pub-id pub-id-type="pmid">16632515</pub-id></citation></ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kelly</surname> <given-names>B. J.</given-names></name> <name><surname>Gross</surname> <given-names>R.</given-names></name> <name><surname>Bittinger</surname> <given-names>K.</given-names></name> <name><surname>Sherrill-Mix</surname> <given-names>S.</given-names></name> <name><surname>Lewis</surname> <given-names>J. D.</given-names></name> <name><surname>Collman</surname> <given-names>R. G.</given-names></name> <etal/></person-group>. (<year>2015</year>). <article-title>Power and sample-size estimation for microbiome studies using pairwise distances and PERMANOVA</article-title>. <source>Bioinformatics</source> <volume>31</volume>, <fpage>2461</fpage>&#x02013;<lpage>2468</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btv183</pub-id><pub-id pub-id-type="pmid">25819674</pub-id></citation></ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Koenker</surname> <given-names>R.</given-names></name> <name><surname>Hallock</surname> <given-names>K. F.</given-names></name></person-group> (<year>2001</year>). <article-title>Quantile regression</article-title>. <source>J. Econ. Perspect.</source> <volume>15</volume>, <fpage>143</fpage>&#x02013;<lpage>156</lpage>. <pub-id pub-id-type="doi">10.1257/jep.15.4.143</pub-id></citation>
</ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kruskal</surname> <given-names>W. H.</given-names></name> <name><surname>Wallis</surname> <given-names>W. A.</given-names></name></person-group> (<year>1952</year>). <article-title>Use of ranks in one-criterion variance analysis</article-title>. <source>J. Am. Stat. Assoc.</source> <volume>47</volume>, <fpage>583</fpage>&#x02013;<lpage>621</lpage>. <pub-id pub-id-type="doi">10.1080/01621459.1952.10483441</pub-id></citation>
</ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lazar</surname> <given-names>C.</given-names></name> <name><surname>Meganck</surname> <given-names>S.</given-names></name> <name><surname>Taminau</surname> <given-names>J.</given-names></name> <name><surname>Steenhoff</surname> <given-names>D.</given-names></name> <name><surname>Coletta</surname> <given-names>A.</given-names></name> <name><surname>Molter</surname> <given-names>C.</given-names></name> <etal/></person-group>. (<year>2013</year>). <article-title>Batch effect removal methods for microarray gene expression data integration: a survey</article-title>. <source>Brief. Bioinformatics</source> <volume>14</volume>, <fpage>469</fpage>&#x02013;<lpage>490</lpage>. <pub-id pub-id-type="doi">10.1093/bib/bbs037</pub-id><pub-id pub-id-type="pmid">22851511</pub-id></citation></ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Leek</surname> <given-names>J. T.</given-names></name> <name><surname>Storey</surname> <given-names>J. D.</given-names></name></person-group> (<year>2007</year>). <article-title>Capturing heterogeneity in gene expression studies by surrogate variable analysis</article-title>. <source>PLoS Genet</source>. <volume>3</volume>:<fpage>e161</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pgen.0030161</pub-id><pub-id pub-id-type="pmid">17907809</pub-id></citation></ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>T.</given-names></name> <name><surname>Zhang</surname> <given-names>Y.</given-names></name> <name><surname>Patil</surname> <given-names>P.</given-names></name> <name><surname>Johnson</surname> <given-names>W. E.</given-names></name></person-group> (<year>2023</year>). <article-title>Overcoming the impacts of two-step batch effect correction on gene expression estimation and inference</article-title>. <source>Biostatistics</source> <volume>24</volume>, <fpage>635</fpage>&#x02013;<lpage>652</lpage>. <pub-id pub-id-type="doi">10.1093/biostatistics/kxab039</pub-id><pub-id pub-id-type="pmid">34893807</pub-id></citation></ref>
<ref id="B15">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ling</surname> <given-names>W.</given-names></name></person-group> (<year>2019</year>). <source>Quantile Regression for Zero-Inflated Outcomes</source> (<publisher-loc>Doctoral dissertation</publisher-loc>). Columbia University, New York, NY, United States.</citation>
</ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ling</surname> <given-names>W.</given-names></name> <name><surname>Zhao</surname> <given-names>N.</given-names></name> <name><surname>Lulla</surname> <given-names>A.</given-names></name> <name><surname>Plantinga</surname> <given-names>A. M.</given-names></name> <name><surname>Fu</surname> <given-names>W.</given-names></name> <name><surname>Zhang</surname> <given-names>A.</given-names></name> <etal/></person-group>. (<year>2021a</year>). <article-title>Batch effects removal for microbiome data via conditional quantile regression (ConQuR)</article-title>. <source>bioRxiv</source>. <pub-id pub-id-type="doi">10.1101/2021.09.23.461592</pub-id><pub-id pub-id-type="pmid">36109499</pub-id></citation></ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ling</surname> <given-names>W.</given-names></name> <name><surname>Zhao</surname> <given-names>N.</given-names></name> <name><surname>Plantinga</surname> <given-names>A. M.</given-names></name> <name><surname>Launer</surname> <given-names>L. J.</given-names></name> <name><surname>Fodor</surname> <given-names>A. A.</given-names></name> <name><surname>Meyer</surname> <given-names>K. A.</given-names></name> <etal/></person-group>. (<year>2021b</year>). <article-title>Powerful and robust non-parametric association testing for microbiome data via a zero-inflated quantile approach (ZINQ)</article-title>. <source>Microbiome</source> <volume>9</volume>, <fpage>1</fpage>&#x02013;<lpage>19</lpage>. <pub-id pub-id-type="doi">10.1186/s40168-021-01129-3</pub-id><pub-id pub-id-type="pmid">34474689</pub-id></citation></ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lleti</surname> <given-names>R.</given-names></name> <name><surname>Ortiz</surname> <given-names>M. C.</given-names></name> <name><surname>Sarabia</surname> <given-names>L. A.</given-names></name> <name><surname>S&#x000E1;nchez</surname> <given-names>M. S.</given-names></name></person-group> (<year>2004</year>). <article-title>Selecting variables for k-means cluster analysis by using a genetic algorithm that optimises the silhouettes</article-title>. <source>Anal. Chim. Acta</source> <volume>515</volume>, <fpage>87</fpage>&#x02013;<lpage>100</lpage>. <pub-id pub-id-type="doi">10.1016/j.aca.2003.12.020</pub-id></citation>
</ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ma</surname> <given-names>S.</given-names></name> <name><surname>Shungin</surname> <given-names>D.</given-names></name> <name><surname>Mallick</surname> <given-names>H.</given-names></name> <name><surname>Schirmer</surname> <given-names>M.</given-names></name> <name><surname>Nguyen</surname> <given-names>L. H.</given-names></name> <name><surname>Kolde</surname> <given-names>R.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>Population structure discovery in meta-analyzed microbial communities and inflammatory bowel disease using MMUPHin</article-title>. <source>Genome Biol.</source> <volume>23</volume>:<fpage>208</fpage>. <pub-id pub-id-type="pmid">36192803</pub-id></citation></ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pendegraft</surname> <given-names>A. H.</given-names></name> <name><surname>Guo</surname> <given-names>B.</given-names></name> <name><surname>Yi</surname> <given-names>N.</given-names></name></person-group> (<year>2019</year>). <article-title>Bayesian hierarchical negative binomial models for multivariable analyses with applications to human microbiome count data</article-title>. <source>PLoS ONE</source> <volume>14</volume>:<fpage>e0220961</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0220961</pub-id><pub-id pub-id-type="pmid">31437194</pub-id></citation></ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pham-Gia</surname> <given-names>T.</given-names></name> <name><surname>Hung</surname> <given-names>T. L.</given-names></name></person-group> (<year>2001</year>). <article-title>The mean and median absolute deviations</article-title>. <source>Math. Comput. Model.</source> <volume>34</volume>, <fpage>921</fpage>&#x02013;<lpage>936</lpage>. <pub-id pub-id-type="doi">10.1016/S0895-7177(01)00109-1</pub-id></citation>
</ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pinto-Cardoso</surname> <given-names>S.</given-names></name> <name><surname>Lozupone</surname> <given-names>C.</given-names></name> <name><surname>Brice&#x000F1;o</surname> <given-names>O.</given-names></name> <name><surname>Alva-Hern&#x000E1;ndez</surname> <given-names>S.</given-names></name> <name><surname>T&#x000E9;llez</surname> <given-names>N.</given-names></name> <name><surname>Adriana</surname> <given-names>A.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Fecal bacterial communities in treated HIV infected individuals on two antiretroviral regimens</article-title>. <source>Sci. Rep.</source> <volume>7</volume>:<fpage>43741</fpage>. <pub-id pub-id-type="doi">10.1038/srep43741</pub-id><pub-id pub-id-type="pmid">28262770</pub-id></citation></ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ramakodi</surname> <given-names>M. P.</given-names></name></person-group> (<year>2021</year>). <article-title>Effect of amplicon sequencing depth in environmental microbiome research</article-title>. <source>Curr. Microbiol.</source> <volume>78</volume>, <fpage>1026</fpage>&#x02013;<lpage>1033</lpage>. <pub-id pub-id-type="doi">10.1007/s00284-021-02345-8</pub-id><pub-id pub-id-type="pmid">33537885</pub-id></citation></ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ricotta</surname> <given-names>C.</given-names></name></person-group> (<year>2017</year>). <article-title>Of beta diversity, variance, evenness, and dissimilarity</article-title>. <source>Ecol. Evol.</source> <volume>7</volume>, <fpage>4835</fpage>&#x02013;<lpage>4843</lpage>.</citation>
</ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Soneson</surname> <given-names>C.</given-names></name> <name><surname>Gerster</surname> <given-names>S.</given-names></name> <name><surname>Delorenzi</surname> <given-names>M.</given-names></name></person-group> (<year>2014</year>). <article-title>Batch effect confounding leads to strong bias in performance estimates obtained by cross-validation</article-title>. <source>PLoS ONE</source> <volume>9</volume>:<fpage>e100335</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0100335</pub-id><pub-id pub-id-type="pmid">24967636</pub-id></citation></ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Song</surname> <given-names>J. S.</given-names></name> <name><surname>Kim</surname> <given-names>J. O. R.</given-names></name> <name><surname>Yoon</surname> <given-names>S. M.</given-names></name> <name><surname>Kwon</surname> <given-names>M. J.</given-names></name> <name><surname>Ki</surname> <given-names>C. S.</given-names></name></person-group> (<year>2023</year>). <article-title>The association between gut microbiome and hypertension varies according to enterotypes: a Korean study</article-title>. <source>Front. Microb.</source> <volume>2</volume>:<fpage>1072059</fpage>. <pub-id pub-id-type="doi">10.3389/frmbi.2023.1072059</pub-id></citation>
</ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tran</surname> <given-names>H. T. N.</given-names></name> <name><surname>Ang</surname> <given-names>K. S.</given-names></name> <name><surname>Chevrier</surname> <given-names>M.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Lee</surname> <given-names>N. Y. S.</given-names></name> <name><surname>Goh</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>A benchmark of batch-effect correction methods for single -cell RNA sequencing data</article-title>. <source>Genome Biol.</source> <volume>21</volume>, <fpage>1</fpage>&#x02013;<lpage>32</lpage>. <pub-id pub-id-type="doi">10.1186/s13059-019-1850-9</pub-id><pub-id pub-id-type="pmid">31948481</pub-id></citation></ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tuddenham</surname> <given-names>S. A.</given-names></name> <name><surname>Koay</surname> <given-names>W. L. A.</given-names></name> <name><surname>Zhao</surname> <given-names>N.</given-names></name> <name><surname>White</surname> <given-names>J. R.</given-names></name> <name><surname>Ghanem</surname> <given-names>K. G.</given-names></name> <name><surname>Sears</surname> <given-names>C. L.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>The impact of human immunodeficiency virus infection on gut microbiota &#x003B1;-diversity: an individual-level meta-analysis</article-title>. <source>Clin. Infect. Dis.</source> <volume>70</volume>, <fpage>615</fpage>&#x02013;<lpage>627</lpage>. <pub-id pub-id-type="pmid">30921452</pub-id></citation></ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ursell</surname> <given-names>L. K.</given-names></name> <name><surname>Metcalf</surname> <given-names>J. L.</given-names></name> <name><surname>Parfrey</surname> <given-names>L. W.</given-names></name> <name><surname>Knight</surname> <given-names>R.</given-names></name></person-group> (<year>2012</year>). <article-title>Defining the human microbiome</article-title>. <source>Nutr. Rev.</source> <volume>70</volume>, <fpage>S38</fpage>&#x02013;<lpage>S44</lpage>. <pub-id pub-id-type="doi">10.1111/j.1753-4887.2012.00493.x</pub-id><pub-id pub-id-type="pmid">22861806</pub-id></citation></ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Van den Berg</surname> <given-names>R. A.</given-names></name> <name><surname>Hoefsloot</surname> <given-names>H. C.</given-names></name> <name><surname>Westerhuis</surname> <given-names>J. A.</given-names></name> <name><surname>Smilde</surname> <given-names>A. K.</given-names></name> <name><surname>Van der Werf</surname> <given-names>M. J.</given-names></name></person-group> (<year>2006</year>). <article-title>Centering, scaling, and transformations: improving the biological information content of metabolomics data</article-title>. <source>BMC Genomics</source> <volume>7</volume>, <fpage>1</fpage>&#x02013;<lpage>15</lpage>. <pub-id pub-id-type="doi">10.1186/1471-2164-7-142</pub-id><pub-id pub-id-type="pmid">16762068</pub-id></citation></ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>Y.</given-names></name> <name><surname>L&#x000EA;Cao</surname> <given-names>K. A.</given-names></name></person-group> (<year>2020</year>). <article-title>Managing batch effects in microbiome data</article-title>. <source>Brief. Bioinformat.</source> <volume>21</volume>, <fpage>1954</fpage>&#x02013;<lpage>1970</lpage>. <pub-id pub-id-type="doi">10.1093/bib/bbz105</pub-id><pub-id pub-id-type="pmid">31776547</pub-id></citation></ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yirga</surname> <given-names>A. A.</given-names></name> <name><surname>Melesse</surname> <given-names>S. F.</given-names></name> <name><surname>Mwambi</surname> <given-names>H. G.</given-names></name> <name><surname>Ayele</surname> <given-names>D. G.</given-names></name></person-group> (<year>2020</year>). <article-title>Negative binomial mixed models for analyzing longitudinal CD4 count data</article-title>. <source>Sci. Rep.</source> <volume>10</volume>:<fpage>16742</fpage>. <pub-id pub-id-type="doi">10.1038/s41598-020-73883-7</pub-id><pub-id pub-id-type="pmid">33028929</pub-id></citation></ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Yi</surname> <given-names>N.</given-names></name></person-group> (<year>2020</year>). <article-title>NBZIMM: negative binomial and zero-inflated mixed models, with application to microbiome/metagenomics data analysis</article-title>. <source>BMC Bioinform.</source> <volume>21</volume>, <fpage>1</fpage>&#x02013;<lpage>19</lpage>. <pub-id pub-id-type="doi">10.1186/s12859-020-03803-z</pub-id><pub-id pub-id-type="pmid">33126862</pub-id></citation></ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Mallick</surname> <given-names>H.</given-names></name> <name><surname>Tang</surname> <given-names>Z.</given-names></name> <name><surname>Zhang</surname> <given-names>L.</given-names></name> <name><surname>Cui</surname> <given-names>X.</given-names></name> <name><surname>Benson</surname> <given-names>A. K.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Negative binomial mixed models for analyzing microbiome count data</article-title>. <source>BMC Bioinformat.</source> <volume>18</volume>, <fpage>1</fpage>&#x02013;<lpage>10</lpage>. <pub-id pub-id-type="doi">10.1186/s12859-016-1441-7</pub-id><pub-id pub-id-type="pmid">28049409</pub-id></citation></ref>
<ref id="B35">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>Y.</given-names></name> <name><surname>D&#x00027;Souza</surname> <given-names>G.</given-names></name> <name><surname>Fakhry</surname> <given-names>C.</given-names></name> <name><surname>Bigelow</surname> <given-names>E. O.</given-names></name> <name><surname>Usyk</surname> <given-names>M.</given-names></name> <name><surname>Burk</surname> <given-names>R. D.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>Oral HPV associated with differences in oral microbiota beta diversity and microbiota abundance</article-title>. <source>J. Infect. Dis</source>. <volume>226</volume>, <fpage>1098</fpage>&#x02013;<lpage>1108</lpage>. <pub-id pub-id-type="doi">10.1093/infdis/jiac010</pub-id><pub-id pub-id-type="pmid">35038733</pub-id></citation></ref>
<ref id="B36">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>Y.</given-names></name> <name><surname>Parmigiani</surname> <given-names>G.</given-names></name> <name><surname>Johnson</surname> <given-names>W. E.</given-names></name></person-group> (<year>2020</year>). <article-title>ComBat-seq: batch effect adjustment for RNA-seq count data</article-title>. <source>NAR Genom. Bioinform</source>. <volume>2</volume>:<fpage>lqaa078</fpage>. <pub-id pub-id-type="doi">10.1093/nargab/lqaa078</pub-id><pub-id pub-id-type="pmid">33015620</pub-id></citation></ref>
</ref-list>
</back>
</article>