<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3-mathml3.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:ali="http://www.niso.org/schemas/ali/1.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="1.3" xml:lang="EN">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Plant Sci.</journal-id>
<journal-title-group>
<journal-title>Frontiers in Plant Science</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Plant Sci.</abbrev-journal-title>
</journal-title-group>
<issn pub-type="epub">1664-462X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fpls.2026.1754458</article-id>
<article-version article-version-type="Version of Record" vocab="NISO-RP-8-2008"/>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Original Research</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Research on urban tree classification method based on YOLO-CNGD</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Zhang</surname><given-names>Cunjin</given-names></name>
<uri xlink:href="https://loop.frontiersin.org/people/3294407/overview"/>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Writing &#x2013; original draft" vocab-term-identifier="https://credit.niso.org/contributor-roles/writing-original-draft/">Writing &#x2013; original draft</role>
</contrib>
<contrib contrib-type="author">
<name><surname>Liu</surname><given-names>Mei</given-names></name>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="investigation" vocab-term-identifier="https://credit.niso.org/contributor-roles/investigation/">Investigation</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Writing &#x2013; review &amp; editing" vocab-term-identifier="https://credit.niso.org/contributor-roles/writing-review-editing/">Writing &#x2013; review &amp; editing</role>
</contrib>
<contrib contrib-type="author">
<name><surname>Liu</surname><given-names>Xinglong</given-names></name>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="methodology" vocab-term-identifier="https://credit.niso.org/contributor-roles/methodology/">Methodology</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Writing &#x2013; review &amp; editing" vocab-term-identifier="https://credit.niso.org/contributor-roles/writing-review-editing/">Writing &#x2013; review &amp; editing</role>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Gu</surname><given-names>Zhixin</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>*</sup></xref>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Writing &#x2013; review &amp; editing" vocab-term-identifier="https://credit.niso.org/contributor-roles/writing-review-editing/">Writing &#x2013; review &amp; editing</role>
</contrib>
</contrib-group>
<aff id="aff1"><institution>Computer and Control Engineering College, Northeast Forestry University</institution>, <city>Harbin</city>,&#xa0;<country country="cn">China</country></aff>
<author-notes>
<corresp id="c001"><label>*</label>Correspondence: Zhixin Gu, <email xlink:href="mailto:gzx@nefu.edu.cn">gzx@nefu.edu.cn</email></corresp>
</author-notes>
<pub-date publication-format="electronic" date-type="pub" iso-8601-date="2026-02-25">
<day>25</day>
<month>02</month>
<year>2026</year>
</pub-date>
<pub-date publication-format="electronic" date-type="collection">
<year>2026</year>
</pub-date>
<volume>17</volume>
<elocation-id>1754458</elocation-id>
<history>
<date date-type="received">
<day>28</day>
<month>11</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>04</day>
<month>02</month>
<year>2026</year>
</date>
<date date-type="rev-recd">
<day>04</day>
<month>02</month>
<year>2026</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2026 Zhang, Liu, Liu and Gu.</copyright-statement>
<copyright-year>2026</copyright-year>
<copyright-holder>Zhang, Liu, Liu and Gu</copyright-holder>
<license>
<ali:license_ref start_date="2026-02-25">https://creativecommons.org/licenses/by/4.0/</ali:license_ref>
<license-p>This is an open-access article distributed under the terms of the <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution License (CC BY)</ext-link>. The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</license-p>
</license>
</permissions>
<abstract>
<p>Accurate classification of urban tree species is fundamental for urban green space management and ecological assessment. To address the challenges of small and overlapping tree crown detection in high-resolution remote sensing imagery, this study proposes YOLO-CNGD, a novel framework based on YOLOv11n. The key enhancements include the integration of the Convolutional Block Attention Module (CBAM) for refined feature representation, the adoption of the Normalized Wasserstein Distance (NWD) loss for robust small-object localization, the incorporation of Deformable Convolution v3 (DCNv3) to adapt to irregular shapes, and the replacement of standard convolutions with GhostConv for a lightweight design. Experiments on a self-built urban tree dataset show that YOLO-CNGD achieves a precision of 94.8%, a recall of 91.1%, and an mAP@0.5 of 93.7%. The model balances accuracy and efficiency, showing great potential for large-scale automated urban tree inventory.</p>
</abstract>
<kwd-group>
<kwd>CBAM attention mechanism</kwd>
<kwd>remote sensing image</kwd>
<kwd>urban tree classification</kwd>
<kwd>YOLO-CNGD</kwd>
<kwd>YOLOv11n deep learning</kwd>
</kwd-group>
<funding-group>
<funding-statement>The author(s) declared that financial support was not received for this work and/or its publication.</funding-statement>
</funding-group>
<counts>
<fig-count count="13"/>
<table-count count="3"/>
<equation-count count="6"/>
<ref-count count="17"/>
<page-count count="15"/>
<word-count count="8808"/>
</counts>
<custom-meta-group>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Sustainable and Intelligent Phytoprotection</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec id="s1" sec-type="intro">
<label>1</label>
<title>Introduction</title>
<p>Combining high-resolution satellite remote sensing imagery with deep learning algorithms enables precise analysis of the species, number, spatial distribution, boundaries, locations, and canopy extents of urban trees. This in turn supports monitoring of urban tree growth and spatial distribution, species-specific distribution assessment, urban greening optimization, and biodiversity conservation.</p>
<p>Over the past five years, the classification and detection of small-target trees in urban remote sensing imagery have progressed considerably. In 2020, <xref ref-type="bibr" rid="B7">Liu et&#xa0;al. (2020)</xref> improved a Convolutional Neural Network (CNN) model under the TensorFlow framework for the automatic identification of seven tree species. They employed the Adaptive Moment Estimation (Adam) optimizer with an exponentially decaying learning rate, used the Rectified Linear Unit (ReLU) activation function, and reduced overfitting by adding L2 regularization to the cross-entropy loss function to penalize large weights and by applying Dropout. In 2022, <xref ref-type="bibr" rid="B6">Liu et&#xa0;al. (2022)</xref> used four types of point cloud deep learning models to classify individual trees from eight species. All models except PointNet achieved a classification accuracy exceeding 0.90.</p>
<p>In 2023, <xref ref-type="bibr" rid="B14">Vermeer et&#xa0;al. (2023)</xref> proposed a deep learning model for tree species classification using only LiDAR data. Based on a U-Net architecture for segmenting LiDAR images, the model was trained with a focal loss function to handle weakly labeled data and achieved an F1-score of 0.70. Also in 2023, <xref ref-type="bibr" rid="B13">Velasquez-Camacho et&#xa0;al. (2023)</xref> applied deep learning algorithms to detect, count, and locate urban trees, successfully identifying 79% of street trees. In the same year, <xref ref-type="bibr" rid="B12">Turkulainen et&#xa0;al. (2023)</xref> acquired RGB, multispectral, and hyperspectral imagery via drones. Their results showed that a 2D-3D-CNN model performed best in classifying infested trees from hyperspectral data, with an F1-score of 0.742.</p>
<p>In 2024, <xref ref-type="bibr" rid="B9">Liu et&#xa0;al. (2024)</xref> proposed the YOLOv7-KCC model for tree species classification. By replacing standard convolutions with CoordConv and integrating the Convolutional Block Attention Module (CBAM) into the network, their model achieved a mean Average Precision (mAP) of 98.91%.</p>
<p>Recent studies in 2024 and 2025 have further highlighted the necessity of deep learning for fine-grained urban forestry. For instance, <xref ref-type="bibr" rid="B11">Satama-Bermeo et&#xa0;al. (2025)</xref> demonstrated that while YOLOv8 and the newly released YOLOv11 exhibit superior detection speeds, their performance on small, overlapping tree crowns in dense urban environments remains suboptimal without specific architectural modifications. Similarly, <xref ref-type="bibr" rid="B17">Zhao and Chen (2025)</xref> emphasized that accurate mapping in heterogeneous urban landscapes (e.g., Nanjing) requires overcoming the feature loss problem common in down-sampling processes. Furthermore, advanced lightweight models like SEMA-YOLO (<xref ref-type="bibr" rid="B8">Liu and Yang, 2025</xref>) and SRM-YOLO (<xref ref-type="bibr" rid="B2">Chen and Wang, 2025</xref>) have been proposed to tackle small object detection in remote sensing, validating the trend towards integrating attention mechanisms and multi-scale adaptations.</p>
<p>Existing models focus primarily on improving classification accuracy and often underutilize the rich spatial and spectral information inherent in high-resolution remote sensing imagery (<xref ref-type="bibr" rid="B1">Ayrey et&#xa0;al., 2017</xref>). Moreover, classifying urban trees in high-latitude cities such as Harbin poses additional challenges. Shadows cast by dense buildings often distort spectral signatures, and the strong phenological similarity between <italic>Salicaceae</italic> species (e.g., poplar and willow) during the peak growing season leads to considerable inter-class confusion. Current state-of-the-art (SOTA) models often lack mechanisms for separating such ambiguous class boundaries under complex illumination. To bridge these gaps, we present YOLO-CNGD, a modified YOLOv11n architecture designed to enhance feature representation for small urban tree crowns by combining an attention mechanism (CBAM), a tailored loss function (NWD), deformable convolutions (DCNv3), and lightweight operations (GhostConv).</p>
</sec>
<sec id="s2" sec-type="materials|methods">
<label>2</label>
<title>Materials and methods</title>
<sec id="s2_1">
<label>2.1</label>
<title>Study subjects and classification criteria</title>
<p>Harbin (45&#xb0;42&#x2032;&#x2013;46&#xb0;28&#x2032;N, 126&#xb0;29&#x2032;&#x2013;130&#xb0;01&#x2032;E) is located in the central-southern part of Heilongjiang Province and falls within the mid-temperate continental monsoon climate zone. The urban tree coverage rate is approximately 40%, and winters are severely cold, with extremely low temperatures (<xref ref-type="bibr" rid="B4">Jin et&#xa0;al., 2019</xref>). The vegetation consists primarily of deciduous and coniferous tree species, dominated by elm, willow, pine, poplar, and birch.</p>
<p>The theoretical basis for species classification integrates geometric characteristics and spectral analysis. <italic>Ulmus</italic> (elm) exhibits an irregularly rounded crown with relatively uniform foliage distribution. <italic>Salix</italic> (willow) is characterized by pendulous branches and an oblong or elliptical crown with a loosely defined margin. <italic>Pinus</italic> (pine) typically presents a conical or umbrella-shaped crown featuring dense foliage and a serrated silhouette, a key identifier of coniferous traits. <italic>Populus</italic> (poplar) displays a tall, columnar crown with a flattened apex. <italic>Betula</italic> (birch) is distinguished by its whitish bark and an ovate crown exhibiting a grayish-green hue.</p>
<p>Manual classification relying solely on geometric features is susceptible to subjective bias and inconsistency. Therefore, spectral feature analysis was performed using TIF-format remote sensing imagery, incorporating the blue, green, red, and near-infrared channels along with the Normalized Difference Vegetation Index (NDVI). NDVI is an indicator for assessing vegetation vigor and coverage, calculated from the reflectance in the near-infrared and red bands of remote sensing images (<xref ref-type="bibr" rid="B10">Rezatofighi et&#xa0;al., 2019</xref>). Its values range from -1 to 1, and an NDVI value greater than 0.6 typically indicates dense and healthy tree cover. The calculation formula is shown in <xref ref-type="disp-formula" rid="eq1"><bold>Equation 1</bold></xref>.</p>
<disp-formula id="eq1"><label>(1)</label>
<mml:math display="block" id="M1"><mml:mrow><mml:mi>NDVI</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>NIR</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>Red</mml:mi></mml:mrow><mml:mrow><mml:mi>NIR</mml:mi><mml:mo>+</mml:mo><mml:mi>Red</mml:mi></mml:mrow></mml:mfrac></mml:mrow></mml:math>
</disp-formula>
<p>Here, NIR represents the reflectance in the near-infrared band, and Red denotes the reflectance in the red band.</p>
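<p>As a minimal sketch, Equation 1 can be evaluated for a single pair of reflectance values as shown below. The function name and the handling of the zero-denominator case are illustrative assumptions, and the example reflectances are simply the midpoints of the elm ranges reported in Table&#xa0;1.</p>
<code language="python"><![CDATA[
```python
def ndvi(nir: float, red: float) -> float:
    """Return (NIR - Red) / (NIR + Red) for two reflectance values."""
    total = nir + red
    if total == 0:
        # Both bands are zero (e.g. a no-data pixel); NDVI is undefined there.
        return float("nan")
    return (nir - red) / total


# Midpoints of the elm ranges in Table 1: red 8%-12%, NIR 45%-55%.
elm_ndvi = ndvi(nir=0.50, red=0.10)
print(round(elm_ndvi, 2))  # 0.67
```
]]></code>
<p>The result agrees with the NDVI value listed for elm, which suggests that the tabulated indices correspond approximately to mid-range reflectances.</p>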
<p>The spectral reflectance characteristics and vegetation indices of the five tree species are summarized in <xref ref-type="table" rid="T1"><bold>Table&#xa0;1</bold></xref>. The wavelength ranges for the spectral bands are defined as follows: blue (450&#x2013;500 nm), green (500&#x2013;600 nm), red (600&#x2013;700 nm), and near-infrared (700&#x2013;900 nm).</p>
<table-wrap id="T1" position="float">
<label>Table&#xa0;1</label>
<caption>
<p>Spectral reflectance ranges and NDVI values of the five tree species.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="middle" align="center">Tree species</th>
<th valign="middle" align="center">Blue light reflectance</th>
<th valign="middle" align="center">Green light reflectance</th>
<th valign="middle" align="center">Red light reflectance</th>
<th valign="middle" align="center">NIR reflectance</th>
<th valign="middle" align="center">NDVI</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="middle" align="center">Elm</td>
<td valign="middle" align="center">8%-12%</td>
<td valign="middle" align="center">15%-20%</td>
<td valign="middle" align="center">8%-12%</td>
<td valign="middle" align="center">45%-55%</td>
<td valign="middle" align="center">0.67</td>
</tr>
<tr>
<td valign="middle" align="center">Willow</td>
<td valign="middle" align="center">5%-10%</td>
<td valign="middle" align="center">15%-25%</td>
<td valign="middle" align="center">5%-10%</td>
<td valign="middle" align="center">50%-60%</td>
<td valign="middle" align="center">0.74</td>
</tr>
<tr>
<td valign="middle" align="center">Pine</td>
<td valign="middle" align="center">5%-10%</td>
<td valign="middle" align="center">10%-15%</td>
<td valign="middle" align="center">5%-10%</td>
<td valign="middle" align="center">40%-50%</td>
<td valign="middle" align="center">0.70</td>
</tr>
<tr>
<td valign="middle" align="center">Poplar</td>
<td valign="middle" align="center">5%-10%</td>
<td valign="middle" align="center">15%-25%</td>
<td valign="middle" align="center">5%-10%</td>
<td valign="middle" align="center">50%-60%</td>
<td valign="middle" align="center">0.74</td>
</tr>
<tr>
<td valign="middle" align="center">Birch</td>
<td valign="middle" align="center">10%-15%</td>
<td valign="middle" align="center">15%-20%</td>
<td valign="middle" align="center">10%-15%</td>
<td valign="middle" align="center">40%-50%</td>
<td valign="middle" align="center">0.58</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Data preprocessing</title>
<p>The remote sensing imagery of the Harbin area, Heilongjiang Province, China, was acquired from Google Earth Engine and exported in GeoTIFF format. The images were captured on August 10, 2022, at a sensor altitude of 280 meters above sea level. The TIF images were converted to PNG format with the Python Pillow library. The final dataset comprises 2,500 images, each measuring 640 &#xd7; 640 pixels.</p>
<p>The image data were annotated using Labelme. This tool saves the annotation information in JSON format, which is structured, easy to store, and machine-readable. The entire JSON file constitutes a dictionary-like structure, where various metadata elements are stored in key-value pairs. For instance, the &#x201c;imageWidth&#x201d; and &#x201c;imageHeight&#x201d; fields represent the width and height of the image in pixels, respectively.</p>
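<p>The following sketch illustrates how such a Labelme record might be read and converted into normalized bounding boxes. The keys <monospace>shapes</monospace>, <monospace>label</monospace>, <monospace>points</monospace>, <monospace>imageWidth</monospace>, and <monospace>imageHeight</monospace> belong to the Labelme JSON format; the class names, their order, the file name, and the choice of a YOLO-style normalized output are assumptions made for illustration and are not stated in the source.</p>
<code language="python"><![CDATA[
```python
import json

# Hypothetical class order; the source does not state the label names used.
CLASSES = ["elm", "willow", "pine", "poplar", "birch"]


def labelme_to_yolo(record: dict) -> list[tuple[int, float, float, float, float]]:
    """Convert Labelme shapes to (class_id, cx, cy, w, h), normalized to [0, 1]."""
    width, height = record["imageWidth"], record["imageHeight"]
    rows = []
    for shape in record["shapes"]:
        xs = [p[0] for p in shape["points"]]
        ys = [p[1] for p in shape["points"]]
        # Works for rectangles (two corner points) and polygons (enclosing box).
        x_min, x_max = min(xs), max(xs)
        y_min, y_max = min(ys), max(ys)
        rows.append((
            CLASSES.index(shape["label"]),
            (x_min + x_max) / 2 / width,
            (y_min + y_max) / 2 / height,
            (x_max - x_min) / width,
            (y_max - y_min) / height,
        ))
    return rows


# A hand-written record with one rectangle; real files contain more fields.
sample = json.loads("""
{
  "shapes": [
    {"label": "pine", "shape_type": "rectangle",
     "points": [[160.0, 320.0], [224.0, 384.0]]}
  ],
  "imagePath": "tile_0001.png",
  "imageHeight": 640,
  "imageWidth": 640
}
""")
print(labelme_to_yolo(sample))  # [(2, 0.3, 0.55, 0.1, 0.1)]
```
]]></code>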
<p>To ensure the reliability of the dataset, we conducted a field verification campaign in August 2022. We randomly selected 500 samples from the annotated dataset, covering all five tree species, and verified their ground-truth categories using a handheld GPS device (accuracy &#xb1;1&#xa0;m) and field photography. This on-site validation showed that the expert visual interpretation was over 96% accurate; the few corrections mainly concerned young <italic>Populus</italic> and <italic>Salix</italic> trees in shaded areas.</p>
<p>To improve generalization and robustness, the data were augmented by image rotation and mirror flipping, which simulate views of the same scene from different orientations. Mirror flipping includes both horizontal and vertical flips, increasing data diversity and exposing the model to a wider range of directional variations.</p>
<p>Because rotation and flipping change the position and orientation of target objects in the image, the corresponding annotations must be updated with the same transformation so that each bounding box remains precisely aligned with its object. An example of the data preprocessing procedure is illustrated in <xref ref-type="fig" rid="f1"><bold>Figure&#xa0;1</bold></xref>. To preserve the integrity of model evaluation and prevent data leakage, the original dataset of 2,500 images was first split into training, validation, and test sets in a ratio of 8:1:1. Data augmentation (rotation and flipping) was then applied exclusively to the training set, expanding it to 15,000 images, while the validation and test sets remained unaugmented to represent real-world conditions.</p>
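<p>The annotation update that accompanies rotation and flipping can be sketched as follows. This is an illustration under stated assumptions rather than the preprocessing code used in the study: boxes are taken to be axis-aligned pixel rectangles, rotations are taken to be counter-clockwise quarter turns, and all function names are hypothetical.</p>
<code language="python"><![CDATA[
```python
Box = tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def hflip(box: Box, width: float) -> Box:
    """Mirror a box left-to-right inside an image of the given width."""
    x1, y1, x2, y2 = box
    return (width - x2, y1, width - x1, y2)


def vflip(box: Box, height: float) -> Box:
    """Mirror a box top-to-bottom inside an image of the given height."""
    x1, y1, x2, y2 = box
    return (x1, height - y2, x2, height - y1)


def rot90_ccw(box: Box, width: float) -> Box:
    """Rotate 90 degrees counter-clockwise: a point (x, y) maps to (y, width - x)."""
    x1, y1, x2, y2 = box
    return (y1, width - x2, y2, width - x1)


def rotate(box: Box, quarter_turns: int, width: float, height: float) -> Box:
    """Apply k counter-clockwise quarter turns, swapping width and height each turn."""
    for _ in range(quarter_turns % 4):
        box = rot90_ccw(box, width)
        width, height = height, width
    return box


crown = (100, 200, 160, 280)            # a 60 x 80 pixel box in a 640 x 640 tile
print(hflip(crown, 640))                # (480, 200, 540, 280)
print(rotate(crown, 1, 640, 640))       # (200, 480, 280, 540)
print(rotate(crown, 2, 640, 640))       # (480, 360, 540, 440)
```
]]></code>
<p>A useful sanity check is that four quarter turns return the original box and that a half turn equals a horizontal flip followed by a vertical flip.</p>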
<fig id="f1" position="float">
<label>Figure&#xa0;1</label>
<caption>
<p>Example of data preprocessing.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-17-1754458-g001.tif">
<alt-text content-type="machine-generated">Six-part satellite image illustration shows an urban parking lot and nearby tennis courts under various geometric transformations: original view, 90-degree, 180-degree, and 270-degree rotations, horizontal flip, and vertical flip, labeled in Chinese.</alt-text>
</graphic></fig>
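<p>The split-then-augment protocol described in this section can be sketched as below. The random seed and the function name are hypothetical, and the number of variants generated per training image is left as a parameter because the source does not state the exact augmentation multiplier.</p>
<code language="python"><![CDATA[
```python
import random


def split_then_augment(image_ids, copies_per_image, seed=0):
    """Split 8:1:1 first, then expand only the training subset."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * 8 // 10
    n_val = len(ids) // 10
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    # Every training image contributes copies_per_image variants (variant 0
    # standing for the untouched original); validation and test images are
    # never augmented, so no transformed copy can leak across subsets.
    train_aug = [(image_id, variant)
                 for image_id in train
                 for variant in range(copies_per_image)]
    return train_aug, val, test


train_aug, val, test = split_then_augment(range(100), copies_per_image=3)
print(len(train_aug), len(val), len(test))  # 240 10 10
```
]]></code>
<p>Splitting before augmenting guarantees that a rotated or flipped copy of a validation or test image never appears in the training set.</p>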
</sec>
<sec id="s2_3">
<label>2.3</label>
<title>Model and methods</title>
<p>The YOLOv11n architecture (<xref ref-type="fig" rid="f2"><bold>Figure&#xa0;2</bold></xref>) primarily consists of three core components: a Backbone, a Neck, and a Head.</p>
<fig id="f2" position="float">
<label>Figure&#xa0;2</label>
<caption>
<p>Architecture of the YOLOv11n model.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-17-1754458-g002.tif">
<alt-text content-type="machine-generated">Flowchart diagram of a neural network for object detection, showing an input layer followed by a backbone with modules labeled CBS, C3k2, SPPF, and C2PSA, then a head with concatenation, upsampling, and detection layers leading to three output tensors sized eighty by eighty by two hundred fifty-five, forty by forty by two hundred fifty-five, and twenty by twenty by two hundred fifty-five.</alt-text>
</graphic></fig>
<p>The backbone network extracts hierarchical feature representations from the input image, and this process begins with a series of Convolution-Batch Normalization-SiLU (CBS) modules. Within these modules, the convolutional layer performs feature extraction, utilizing kernels of varying sizes and strides to capture local information at different scales. The Batch Normalization (BN) layer then normalizes the output of the convolution, accelerating model convergence and enhancing its generalization ability. Finally, the SiLU activation function introduces non-linear transformations, which augment the model&#x2019;s expressive capacity and enable it to learn more complex patterns.</p>
<p>Residual connections learn residual information through skip connections, which mitigates the vanishing gradient problem in deep network training and allows the network to extract deeper features. Group convolution, in turn, processes separate channel groups independently, reducing both the parameter count and the computational overhead. This combination allows the subsequent C3k2 module to extract deep image features efficiently, so the design maintains detection accuracy while improving operational efficiency in resource-constrained environments.</p>
<p>The backbone network further incorporates a Spatial Pyramid Pooling-Fast (SPPF) module and a C2PSA module. The SPPF module is designed to enhance multi-scale feature extraction. The input feature map first undergoes a preliminary transformation via a convolutional layer. It is then passed through three successive max-pooling layers that share the same kernel size and use a stride of 1, so the spatial resolution is preserved while each additional pooling step enlarges the effective receptive field. The three pooled outputs are concatenated with the convolved input, integrating fine-grained details with broader contextual information. The fused features are subsequently refined through another convolutional layer to better suit the object detection task.</p>
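<p>The pooling stage of SPPF can be illustrated with a one-dimensional sketch. In the Ultralytics implementation on which YOLOv11 is based, the three pooling operations are applied in series with the same kernel, which reproduces the effect of pooling with progressively larger windows at lower cost. The kernel size of 5 is the default of that implementation; the one-dimensional simplification and the function names are assumptions made for illustration.</p>
<code language="python"><![CDATA[
```python
def max_pool_1d(row, kernel):
    """Stride-1 max pooling with 'same' padding (output length equals input length)."""
    half = kernel // 2
    return [max(row[max(0, i - half):i + half + 1]) for i in range(len(row))]


def sppf_branches(row, kernel=5):
    """Return the four branches that SPPF concatenates: x, p(x), p(p(x)), p(p(p(x)))."""
    p1 = max_pool_1d(row, kernel)
    p2 = max_pool_1d(p1, kernel)
    p3 = max_pool_1d(p2, kernel)
    return [row, p1, p2, p3]


row = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
x, p1, p2, p3 = sppf_branches(row)
# Chaining 5-wide pools matches single pools with windows of 9 and 13.
print(p2 == max_pool_1d(row, 9))   # True
print(p3 == max_pool_1d(row, 13))  # True
```
]]></code>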
<p>The C2PSA module is constructed based on the Squeeze-and-Excitation (SE) module and the Pyramid Split Attention (PSA) module. The input feature map is first processed by a convolutional layer for initial transformation. A split operation then distributes the features into multiple branches, each of which is processed by a PSA sub-module. A structural diagram of the C2PSA module is presented in <xref ref-type="fig" rid="f3"><bold>Figure&#xa0;3</bold></xref>.</p>
<fig id="f3" position="float">
<label>Figure&#xa0;3</label>
<caption>
<p>Architecture of the C2PSA module.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-17-1754458-g003.tif">
<alt-text content-type="machine-generated">Diagram illustrating three neural network attention modules: SE with sequential GAP, convolution, ReLU, convolution, and Softmax; PSA with parallel convolution-SE paths, concatenation, Softmax, and x_concat; C2PSA with convolution, split into multiple PSA modules, concatenation, and final convolution.</alt-text>
</graphic></fig>
<p>The C2PSA module employs a multi-step, multi-dimensional feature processing strategy to more effectively capture subtle inter-class distinctions. Specifically, feature outputs from multiple parallel PSA submodules are first fused via a concatenation layer. The combined features then pass through a Softmax operation and are further aggregated through feature concatenation (x_concat), thereby enhancing the modeling of spatial dependencies. Subsequently, the fused features are refined and integrated by a convolutional layer. This structured approach enables the C2PSA module to sensitively discern fine-grained differences among target categories.</p>
<p>The primary function of the neck network is to integrate the hierarchical feature maps generated by the backbone, thereby constructing a feature pyramid rich in semantic information and multi-scale representations. Through iterative upsampling and concatenation operations, it progressively enhances the discriminative power and robustness of the features, ultimately providing superior input for the detection head.</p>
<p>The head network takes the fused feature maps from the neck as input. It then processes them through a series of dedicated Detect modules to produce the final detection outcomes. Each Detect module typically consists of convolutional layers for further feature refinement and transformation, followed by a prediction layer that estimates both the class probabilities and spatial coordinates of the objects.</p>
<p>The head network of YOLOv11n generates detection outputs across three distinct scales, with dimensions of 80&#xd7;80, 40&#xd7;40, and 20&#xd7;20 (each with 255 channels). This multi-scale design enables the model to effectively detect objects of varying sizes and categories within an image. By leveraging both fine-grained and high-level semantic information, this approach significantly enhances the comprehensiveness and accuracy of object detection.</p>
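<p>The three output resolutions follow directly from the input size and the downsampling stride of each detection branch. The short sketch below assumes strides of 8, 16, and 32, the values commonly used in YOLO-family detectors; the source does not state them explicitly.</p>
<code language="python"><![CDATA[
```python
def grid_sizes(input_size, strides=(8, 16, 32)):
    """Side length of each detection grid for a square input image."""
    return [input_size // s for s in strides]


def cells_covered(object_size, strides=(8, 16, 32)):
    """How many grid cells an object of the given pixel width spans per scale."""
    return [object_size / s for s in strides]


print(grid_sizes(640))      # [80, 40, 20]
print(cells_covered(16))    # [2.0, 1.0, 0.5]
```
]]></code>
<p>A crown only 16 pixels wide spans two cells on the 80&#xd7;80 grid but half a cell on the 20&#xd7;20 grid, which is why the finest scale matters most for small trees.</p>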
</sec>
<sec id="s2_4">
<label>2.4</label>
<title>YOLO-CNGD: a novel framework for urban tree classification</title>
<p>To enhance feature representation, particularly for small objects, we introduce several modifications to the YOLOv11n architecture. First, the Convolutional Block Attention Module (CBAM) is integrated into the backbone network at layers 5, 7, and 9. This lightweight module sequentially infers attention maps along the channel and spatial dimensions, allowing the model to emphasize informative features while suppressing less useful ones. The resulting C3k2-CBAM module is better able to extract the latent features of small objects, mitigating the information loss caused by low pixel counts. Second, we replace the original EIoU loss function with the Normalized Wasserstein Distance (NWD) loss, which makes bounding box regression less sensitive to the small positional deviations that dominate the error for small objects.</p>
<p>To address the limitations of the standard CBS (Conv-BN-SiLU) module in detecting small targets within remote sensing imagery, we replaced it with a more efficient GBS module at layers 4, 6, 8, and 20 of YOLOv11n. Although the CBS module is a common building block of the architecture, it is not optimal for small objects in complex scenes. The GBS module substitutes GhostConv for the standard convolution, generating part of the output feature maps with inexpensive linear operations instead of full convolutions. This redesign reduces the computational overhead and parameter count, enabling the model to operate efficiently under resource constraints while maintaining a focus on small targets.</p>
<p>To enhance the model&#x2019;s ability to capture geometric features of urban trees, we strategically integrated the Deformable Convolution Network v3 (DCNv3) module at two critical points in the YOLOv11n architecture. The DCNv3 module improves upon standard convolution by introducing learnable offsets, allowing the convolutional kernel to adaptively sample features from irregular shapes and positions, thereby capturing more comprehensive object characteristics. First, within the backbone network at layer 11, we optimized the existing C2PSA module by incorporating DCNv3, resulting in the C2PSA-D3 module. This integration maximizes the benefits of deformable convolution without a substantial increase in computational overhead. Second, we enhanced the model&#x2019;s detection head by replacing standard convolutions with DCNv3, forming the Detect-D3 module. This upgrade provides the small-object detection head with richer feature information. Collectively, these replacements significantly boost the model&#x2019;s capacity to discern tree edges and shapes, leading to superior detection accuracy in challenging scenarios, such as complex backgrounds and occluded conditions. The overall architecture of the proposed YOLO-CNGD model is depicted in <xref ref-type="fig" rid="f4"><bold>Figure&#xa0;4</bold></xref>.</p>
<fig id="f4" position="float">
<label>Figure&#xa0;4</label>
<caption>
<p>Architecture of the YOLO-CNGD model.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-17-1754458-g004.tif">
<alt-text content-type="machine-generated">Flowchart illustrating a neural network architecture with a backbone and head. The backbone consists of multiple modules such as CBS, C3k2, GBS, C3k2-CBAM, SPPF, and C2PSA-D3, followed by paths involving upsampling, concatenation, and detection steps leading to three head outputs sized eighty by eighty by two hundred fifty-five, forty by forty by two hundred fifty-five, and twenty by twenty by two hundred fifty-five.</alt-text>
</graphic></fig>
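<p>The offset sampling that underlies deformable convolution can be illustrated for a single output position. The sketch below is a simplified, single-channel illustration of learnable-offset sampling with bilinear interpolation; it omits the grouping and modulation terms of DCNv3, and in a real layer the offsets are predicted by the network rather than supplied by hand. All names are hypothetical.</p>
<code language="python"><![CDATA[
```python
import math


def bilinear(img, y, x):
    """Sample img (a list of rows) at a fractional position; outside pixels count as 0."""
    h, w = len(img), len(img[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * img[yy][xx]
    return value


def deform_conv_at(img, kernel, cy, cx, offsets):
    """Response of a 3x3 deformable convolution at position (cy, cx).

    offsets[i][j] is the (dy, dx) shift added to the regular grid position
    of kernel tap (i, j).
    """
    out = 0.0
    for i in range(3):
        for j in range(3):
            dy, dx = offsets[i][j]
            out += kernel[i][j] * bilinear(img, cy + i - 1 + dy, cx + j - 1 + dx)
    return out


img = [[5 * y + x for x in range(5)] for y in range(5)]
ones = [[1.0] * 3 for _ in range(3)]
zero = [[(0.0, 0.0)] * 3 for _ in range(3)]
down = [[(0.5, 0.0)] * 3 for _ in range(3)]
print(deform_conv_at(img, ones, 2, 2, zero))  # 108.0 (same as a standard 3x3 sum)
print(deform_conv_at(img, ones, 2, 2, down))  # 130.5 (taps shifted half a pixel down)
```
]]></code>
<p>With all offsets set to zero the operation reduces to an ordinary convolution, so the deformable layer can be viewed as a strict generalization whose sampling grid bends toward irregular crown outlines.</p>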
<sec id="s2_4_1">
<label>2.4.1</label>
<title>The C3k2-CBAM attention module</title>
<p>The Convolutional Block Attention Module (CBAM) enhances model performance by sequentially recalibrating feature maps along the channel and spatial dimensions. However, the standard ReLU activation function used in CBAM outputs zero for all negative inputs, so its gradient vanishes in the negative region and the affected units may stop updating. To mitigate this issue, we replaced all ReLU activations with Leaky ReLU, which retains a small non-zero slope for negative inputs. The architecture of our modified CBAM is illustrated in <xref ref-type="fig" rid="f5"><bold>Figure&#xa0;5</bold></xref>, where the Channel Attention Module (CAM) is highlighted by a red frame and the Spatial Attention Module (SAM) by a blue frame; the two components are applied in sequence.</p>
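<p>The difference between the two activations is easiest to see in their gradients. In the sketch below, the negative slope of 0.01 is an assumed value (a common default); the source does not report the slope that was used.</p>
<code language="python"><![CDATA[
```python
def relu(x):
    return x if x > 0 else 0.0


def relu_grad(x):
    """Derivative of ReLU: exactly zero for every negative input."""
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x


def leaky_relu_grad(x, slope=0.01):
    """Derivative of Leaky ReLU: a small but non-zero slope for negative inputs."""
    return 1.0 if x > 0 else slope


for x in (-2.0, 3.0):
    print(x, relu_grad(x), leaky_relu_grad(x))
# -2.0 0.0 0.01
# 3.0 1.0 1.0
```
]]></code>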
<fig id="f5" position="float">
<label>Figure&#xa0;5</label>
<caption>
<p>Architecture of the modified CBAM attention module.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-17-1754458-g005.tif">
<alt-text content-type="machine-generated">Block diagram illustrating a sequential attention mechanism composed of CAM and SAM modules. The CAM module applies GMP plus GAP, MLP, Conv plus Leaky ReLU, and sigmoid activation to generate feature weights. The SAM module processes feature weights through GMP plus GAP, seven by seven convolution plus Leaky ReLU, one by one convolution, and sigmoid activation to produce the final feature weights and output.</alt-text>
</graphic></fig>
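<p>The channel attention stage can be sketched in a few lines. The code below follows the standard CBAM formulation (a shared two-layer perceptron applied to the globally average-pooled and max-pooled channel descriptors, followed by a sigmoid) with Leaky ReLU in the hidden layer. It is an illustration under these assumptions, not the authors' implementation: the weights are supplied by hand, the slope of 0.01 is assumed, and the spatial attention stage is omitted.</p>
<code language="python"><![CDATA[
```python
import math


def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def channel_attention(feat, w1, w2):
    """feat: C x H x W nested lists; w1: (C/r) x C weights; w2: C x (C/r) weights."""
    flat = [[v for row in ch for v in row] for ch in feat]
    avg = [sum(ch) / len(ch) for ch in flat]   # global average pooling
    mx = [max(ch) for ch in flat]              # global max pooling

    def mlp(vec):
        hidden = [leaky_relu(h) for h in matvec(w1, vec)]
        return matvec(w2, hidden)

    # The same MLP processes both descriptors; the outputs are summed and squashed.
    weights = [sigmoid(a + m) for a, m in zip(mlp(avg), mlp(mx))]
    scaled = [[[v * wt for v in row] for row in ch] for ch, wt in zip(feat, weights)]
    return weights, scaled


feat = [[[1.0, 2.0], [3.0, 4.0]],   # channel 0: strong response
        [[0.0, 0.0], [0.0, 0.0]]]   # channel 1: no response
w1 = [[1.0, -1.0]]                  # reduction to one hidden unit (r = 2)
w2 = [[1.0], [-1.0]]
weights, scaled = channel_attention(feat, w1, w2)
print(weights[0] > weights[1])      # True
```
]]></code>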
<p>The CBAM attention mechanism was incorporated into the model, and the structure of the resulting C3k2-CBAM module is illustrated in <xref ref-type="fig" rid="f6"><bold>Figure&#xa0;6</bold></xref>.</p>
<fig id="f6" position="float">
<label>Figure&#xa0;6</label>
<caption>
<p>Architecture of the C3k2-CBAM module.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-17-1754458-g006.tif">
<alt-text content-type="machine-generated">Block diagram illustrating the C3k2-CBAM module, where input flows through CBS and Split blocks, is separated into three C3k branches and a CBAM, then concatenated and passed through another CBS block.</alt-text>
</graphic></fig>
</sec>
<sec id="s2_4_2">
<label>2.4.2</label>
<title>The normalized Wasserstein distance loss</title>
<p>The original YOLOv11n model employs the EIoU loss for bounding box regression. However, this IoU-based loss is highly sensitive to minor positional deviations for small objects, whose bounding boxes cover only a few pixels and have unstable aspect ratios. To address this limitation, we introduce the Normalized Wasserstein Distance (NWD) loss as a replacement. The NWD metric can be seamlessly integrated into any anchor-based detector as a direct substitute for the conventional IoU metric. Moreover, its dedicated loss function provides stable and effective gradients during training, facilitating faster and more stable convergence.</p>
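<p>The contrast between IoU and NWD for small boxes can be sketched as follows. The code uses the commonly cited NWD formulation, in which each box (cx, cy, w, h) is modelled as a 2-D Gaussian; the normalizing constant <monospace>c</monospace> is dataset dependent, and the value 12.8 used here is a placeholder assumption, not a setting reported in this study.</p>
<code language="python">
```python
import math


def nwd(box_a, box_b, c: float = 12.8) -> float:
    """Normalized Wasserstein distance between two (cx, cy, w, h) boxes.

    Each box is treated as a Gaussian with mean (cx, cy) and covariance
    diag(w^2 / 4, h^2 / 4). The constant c is an assumed placeholder.
    """
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    w2_squared = (
        (cxa - cxb) ** 2
        + (cya - cyb) ** 2
        + ((wa - wb) / 2.0) ** 2
        + ((ha - hb) / 2.0) ** 2
    )
    return math.exp(-math.sqrt(w2_squared) / c)


def nwd_loss(box_a, box_b, c: float = 12.8) -> float:
    """Regression loss derived from NWD: zero for identical boxes."""
    return 1.0 - nwd(box_a, box_b, c)


def iou(box_a, box_b) -> float:
    """Plain IoU for (cx, cy, w, h) boxes, shown only for comparison."""
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    inter_w = max(0.0, min(cxa + wa / 2, cxb + wb / 2) - max(cxa - wa / 2, cxb - wb / 2))
    inter_h = max(0.0, min(cya + ha / 2, cyb + hb / 2) - max(cya - ha / 2, cyb - hb / 2))
    inter = inter_w * inter_h
    union = wa * ha + wb * hb - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    truth = (10.0, 10.0, 6.0, 6.0)
    shifted = (16.0, 10.0, 6.0, 6.0)  # same 6 x 6 box moved 6 pixels right
    print(iou(truth, shifted), nwd(truth, shifted))
```
</code>
<p>In this example the two 6&#xd7;6 boxes no longer overlap, so IoU drops to zero and carries no information about how far apart they are, while NWD remains positive and decreases smoothly with the offset.</p>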
</sec>
<sec id="s2_4_3">
<label>2.4.3</label>
<title>GhostConv: lightweight convolutional module</title>
<p>To address the issue of feature map redundancy, GhostNet introduces a more efficient convolution paradigm that combines a small set of primary filters with inexpensive linear operations. The core component, the GhostConv module, first applies a limited number of primary convolution filters to extract the most critical and representative (intrinsic) features from the input. This step reduces the number of convolutional kernels required, thereby decreasing the model&#x2019;s parameter count. The remaining, largely redundant feature maps (the &#x201c;ghost&#x201d; features) are then generated from the intrinsic features by a series of cost-effective linear transformations instead of traditional convolutions. Finally, the intrinsic features and the ghost features are concatenated to form the final feature maps for subsequent tasks.</p>
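<p>The parameter saving can be estimated with simple arithmetic. The sketch below counts convolution weights only (bias and normalization parameters are ignored); the ratio of 2 and the 5&#xd7;5 depthwise kernel for the cheap operation are assumptions borrowed from common GhostConv implementations, and the channel sizes are illustrative rather than taken from YOLO-CNGD.</p>
<code language="python">
```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of an ordinary k x k convolution."""
    return c_in * c_out * k * k


def ghost_conv_params(c_in: int, c_out: int, k: int,
                      ratio: int = 2, cheap_k: int = 5) -> int:
    """Weights of a GhostConv-style layer.

    A primary k x k convolution produces c_out / ratio intrinsic maps;
    each remaining map comes from a depthwise cheap_k x cheap_k filter.
    Both ratio and cheap_k are assumed defaults.
    """
    if c_out % ratio != 0:
        raise ValueError("c_out must be divisible by ratio")
    intrinsic = c_out // ratio
    primary = c_in * intrinsic * k * k
    cheap = intrinsic * (ratio - 1) * cheap_k * cheap_k
    return primary + cheap


if __name__ == "__main__":
    full = standard_conv_params(64, 128, 3)
    ghost = ghost_conv_params(64, 128, 3)
    print(full, ghost, full / ghost)
```
</code>
<p>With a ratio of 2, the weight count is roughly halved, because the depthwise cheap operation contributes only a small number of additional weights.</p>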
</sec>
<sec id="s2_4_4">
<label>2.4.4</label>
<title>DCNv3</title>
<p>We introduce a dynamic sparse kernel into the model, which allows the sampling locations of the convolutional kernel to adapt dynamically to the input content, deviating from a fixed, regular grid. This is achieved by incorporating learnable offsets, enabling the kernel to adjust its sampling positions based on the actual structure of the target object, thereby capturing features with greater precision. The synergistic combination of dynamic sparsity and adaptive sampling mechanisms maximizes the informational efficiency of the parameters, mitigating redundancy and empowering the model to learn complex data patterns more effectively. The specific integration of this mechanism, via the DCNv3 module, is illustrated in <xref ref-type="fig" rid="f7"><bold>Figure&#xa0;7</bold></xref>.</p>
<fig id="f7" position="float">
<label>Figure&#xa0;7</label>
<caption>
<p>Architectures of the C2PSA-D3 and detect-D3 modules.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-17-1754458-g007.tif">
<alt-text content-type="machine-generated">Flowchart illustrating a deep learning model architecture with two main sections: left section details a pipeline starting with &#x201c;C2PSA-D3,&#x201d; passing through Conv, Split, multiple PSA modules, concat, and ending with DCNv3; right section compares &#x201c;Detect&#x201d; and &#x201c;Detect-D3&#x201d; branches, each splitting into Conv and DW Conv paths followed by either Conv2d or DCNv3 and branching to Box or Cls outputs.</alt-text>
</graphic></fig>
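<p>The adaptive sampling idea can be sketched in plain Python for a single output location. This is a simplified, single-group illustration of deformable sampling with bilinear interpolation, not the DCNv3 operator used in the model; the function names are hypothetical, and the offsets and modulation scalars, which a network would predict from the input, are passed in by hand.</p>
<code language="python">
```python
import math


def bilinear_sample(feature, y, x) -> float:
    """Read a 2-D feature map (list of rows) at a fractional position.

    Positions outside the map contribute zero.
    """
    height, width = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if yi in range(height) and xi in range(width):
                weight = (1.0 - abs(y - yi)) * (1.0 - abs(x - xi))
                value += weight * feature[yi][xi]
    return value


def deformable_conv_at(feature, center, weights, offsets, modulation=None) -> float:
    """Evaluate a 3 x 3 deformable kernel at one location.

    Each of the 9 taps samples at (regular grid position + learned offset)
    and is scaled by its kernel weight and an optional modulation scalar.
    """
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    if modulation is None:
        modulation = [1.0] * len(grid)
    cy, cx = center
    out = 0.0
    for (dy, dx), w, (oy, ox), m in zip(grid, weights, offsets, modulation):
        out += w * m * bilinear_sample(feature, cy + dy + oy, cx + dx + ox)
    return out
```
</code>
<p>With all offsets set to zero the function reduces to an ordinary 3&#xd7;3 convolution at that location; non-zero offsets move the individual taps off the regular grid, which is what lets the kernel follow the shape of a tree crown.</p>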
</sec>
</sec>
<sec id="s2_5">
<label>2.5</label>
<title>Model evaluation methodology</title>
<p>The outcomes of the detection experiments can be categorized into four fundamental cases: true positive (TP), false positive (FP), true negative (TN), and false negative (FN).</p>
<p>1. Precision (P) is defined as the ratio of correctly detected targets to the total number of targets detected by the model (<xref ref-type="disp-formula" rid="eq2"><bold>Equation 2</bold></xref>). In this context, a True Positive (TP) is a tree that is correctly identified as one of the five species, while a False Positive (FP) is an instance that is incorrectly classified as one of the five species.</p>
<disp-formula id="eq2"><label>(2)</label>
<mml:math display="block" id="M2"><mml:mrow><mml:mtext>P</mml:mtext><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mtext>TP</mml:mtext></mml:mrow><mml:mrow><mml:mtext>TP</mml:mtext><mml:mo>+</mml:mo><mml:mtext>FP</mml:mtext></mml:mrow></mml:mfrac></mml:mrow></mml:math>
</disp-formula>
<p>2. Recall (R) is the proportion of actual targets that are correctly identified by the model (<xref ref-type="disp-formula" rid="eq3"><bold>Equation 3</bold></xref>). Here, a False Negative (FN) is a case where the target belongs to one of the five tree species but the model fails to detect it.</p>
<disp-formula id="eq3"><label>(3)</label>
<mml:math display="block" id="M3"><mml:mrow><mml:mtext>R</mml:mtext><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mtext>TP</mml:mtext></mml:mrow><mml:mrow><mml:mtext>TP</mml:mtext><mml:mo>+</mml:mo><mml:mtext>FN</mml:mtext></mml:mrow></mml:mfrac></mml:mrow></mml:math>
</disp-formula>
<p>3. The F1-score is the harmonic mean of Precision and Recall, providing a single metric that balances the trade-off between these two values (<xref ref-type="disp-formula" rid="eq4"><bold>Equation 4</bold></xref>).</p>
<disp-formula id="eq4"><label>(4)</label>
<mml:math display="block" id="M4"><mml:mrow><mml:mtext>F</mml:mtext><mml:mn>1</mml:mn><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mfrac><mml:mn>2</mml:mn><mml:mrow><mml:msup><mml:mtext>R</mml:mtext><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mtext>P</mml:mtext><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mn>2</mml:mn><mml:mo>&#xd7;</mml:mo><mml:mfrac><mml:mrow><mml:mtext>P</mml:mtext><mml:mo>&#xd7;</mml:mo><mml:mtext>R</mml:mtext></mml:mrow><mml:mrow><mml:mtext>P</mml:mtext><mml:mo>+</mml:mo><mml:mtext>R</mml:mtext></mml:mrow></mml:mfrac></mml:mrow></mml:math>
</disp-formula>
<p>4. Average Precision (AP) is defined as the area under the Precision-Recall (P-R) curve, which plots Precision (y-axis) against Recall (x-axis). A robust model maintains high precision as recall increases, resulting in a larger area under the curve and thus a higher AP value. Typically, an Intersection over Union (IoU) threshold of 0.5 is used for this evaluation. A higher AP value indicates better detection performance. The formulas for computing AP and mean Average Precision (mAP) are given in <xref ref-type="disp-formula" rid="eq5">Equations 5</xref> and <xref ref-type="disp-formula" rid="eq6">6</xref>, respectively.</p>
<disp-formula id="eq5"><label>(5)</label>
<mml:math display="block" id="M5"><mml:mrow><mml:mtext>AP</mml:mtext><mml:mo>=</mml:mo><mml:msubsup><mml:mo>&#x222b;</mml:mo><mml:mn>0</mml:mn><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mtext>P</mml:mtext><mml:mo stretchy="false">(</mml:mo><mml:mtext>r</mml:mtext><mml:mo stretchy="false">)</mml:mo><mml:mtext>dr</mml:mtext></mml:mrow></mml:math>
</disp-formula>
<disp-formula id="eq6"><label>(6)</label>
<mml:math display="block" id="M6"><mml:mrow><mml:mtext>mAP</mml:mtext><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:msubsup><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mtext>i</mml:mtext><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mtext>C</mml:mtext></mml:msubsup><mml:mrow><mml:msub><mml:mrow><mml:mtext>AP</mml:mtext></mml:mrow><mml:mtext>i</mml:mtext></mml:msub><mml:mo stretchy="false">/</mml:mo><mml:mtext>C</mml:mtext></mml:mrow></mml:mstyle></mml:mrow></mml:math>
</disp-formula>
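<p>Equations 2 to 6 can be sketched in a few lines of Python. The AP function below uses all-point interpolation with a monotone precision envelope, which is one common way to discretize the integral in Equation 5; whether the evaluation toolkit used in this study applies exactly this scheme is an assumption. The function names are illustrative.</p>
<code language="python">
```python
def precision(tp: int, fp: int) -> float:
    """Equation 2: TP / (TP + FP)."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """Equation 3: TP / (TP + FN)."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(p: float, r: float) -> float:
    """Equation 4: harmonic mean of precision and recall."""
    return 2.0 * p * r / (p + r) if p + r else 0.0


def average_precision(recalls, precisions) -> float:
    """Equation 5, discretized: area under the interpolated P-R curve.

    recalls must be sorted in increasing order, with precisions aligned.
    """
    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]
    # Make precision non-increasing from right to left (envelope).
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    return sum((mrec[i + 1] - mrec[i]) * mpre[i + 1] for i in range(len(mrec) - 1))


def mean_average_precision(per_class_ap) -> float:
    """Equation 6: mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)


if __name__ == "__main__":
    # P and R reported for YOLO-CNGD in Table 2, expressed as fractions.
    print(round(f1_score(0.948, 0.911), 3))
```
</code>
<p>Applying <monospace>f1_score</monospace> to the Precision (94.8%) and Recall (91.1%) reported for YOLO-CNGD in Table 2 gives 0.929 after rounding, which agrees with the F1-score listed there.</p>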
</sec>
<sec id="s2_6">
<label>2.6</label>
<title>Experimental setup</title>
<p>All experiments were conducted on a workstation equipped with an Intel Core i9-12900K CPU and an NVIDIA GeForce RTX 3090 GPU (24 GB). The operating system was Ubuntu 20.04, and the deep learning framework was PyTorch 1.12.1 with CUDA 11.3. The model was trained for 200 epochs using the SGD optimizer. The initial learning rate was set to 0.01, with a momentum of 0.937 and a weight decay of 0.0005. The batch size was set to 32. We used a cosine annealing strategy for learning rate decay. The dataset of 15,000 images was randomly partitioned into training, validation, and test sets in a ratio of 8:1:1.</p>
<p>Regarding model complexity and hardware demands, the YOLO-CNGD model has 3.05 million parameters and requires 8.2 GFLOPs for a single 640&#xd7;640 image inference. The total training process for 200 epochs on the NVIDIA RTX 3090 lasted approximately 4.2 hours. During the inference phase, the model occupies only 1.1 GB of VRAM, making it highly suitable for deployment on edge computing devices with limited hardware resources.</p>
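<p>The learning rate schedule and the data split described above can be sketched as follows. The cosine formula is the standard one; the minimum learning rate of 0 is an assumption, since the final learning rate is not stated, and the helper names are illustrative.</p>
<code language="python">
```python
import math


def cosine_lr(epoch: int, total_epochs: int = 200,
              lr_max: float = 0.01, lr_min: float = 0.0) -> float:
    """Cosine-annealed learning rate; lr_min = 0 is an assumed value."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)


def split_counts(total: int, ratios) -> list:
    """Number of images per subset for an integer ratio such as 8:1:1."""
    unit = sum(ratios)
    counts = [total * r // unit for r in ratios]
    counts[0] += total - sum(counts)  # any remainder goes to the training set
    return counts


if __name__ == "__main__":
    print([cosine_lr(e) for e in (0, 50, 100, 150, 200)])
    print(split_counts(15000, (8, 1, 1)))
```
</code>
<p>For 15,000 images, the 8:1:1 ratio corresponds to 12,000 training, 1,500 validation, and 1,500 test images.</p>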
</sec>
<sec id="s2_7">
<label>2.7</label>
<title>Model evaluation and stability verification</title>
<p>Beyond the hold-out test set, we conducted a 5-fold cross-validation to assess the model&#x2019;s stability and mitigate the impact of random data partitioning. The entire dataset of 15,000 images was randomly divided into 5 equal-sized folds. In each iteration, four folds were combined and then split into training and validation subsets, while the remaining fold was used as the test set. This process was repeated five times with each fold serving as the test set once. The final performance metrics reported are the mean &#xb1; standard deviation calculated across all five test folds. A paired t-test was conducted between YOLO-CNGD and the baseline YOLOv11n, yielding a p-value &lt; 0.05, which confirms that the performance gains are statistically significant. This approach provides a more reliable estimate of model generalizability and allows us to compute confidence intervals for our results.</p>
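<p>The fold construction and the paired test statistic can be sketched with the Python standard library. The random seed and function names are illustrative assumptions, and the ratio used to split the four pooled folds into training and validation subsets is not specified here, so the sketch returns the pool undivided. No fold-level scores from this study are reproduced.</p>
<code language="python">
```python
import math
import random
import statistics


def k_fold_indices(n_samples: int, k: int = 5, seed: int = 0):
    """Yield (train_val_pool, test_fold) index lists for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary assumption
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        pool = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield pool, folds[i]


def paired_t_statistic(scores_a, scores_b) -> float:
    """t statistic of the per-fold differences (degrees of freedom: n - 1)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))
```
</code>
<p>With five folds the statistic has four degrees of freedom, for which the two-sided critical value at the 0.05 level is approximately 2.776; a larger absolute value of the statistic corresponds to p &lt; 0.05.</p>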
</sec>
</sec>
<sec id="s3" sec-type="results">
<label>3</label>
<title>Results</title>
<sec id="s3_1">
<label>3.1</label>
<title>Comparative experimental results</title>
<p>To benchmark the performance of our model, we trained and evaluated several state-of-the-art object detection algorithms, including Faster R-CNN, SSD, RetinaNet, YOLOv3, YOLOv5, YOLOv7, YOLOv8, and YOLOv11n, on the urban tree remote sensing dataset. A comprehensive comparison was conducted using metrics such as model size (number of parameters), detection accuracy (Precision, Recall, F1-score), and comprehensive detection performance (mAP@0.5, mAP@0.5:0.95). The quantitative results are summarized in <xref ref-type="table" rid="T2"><bold>Table&#xa0;2</bold></xref>, and the corresponding performance curves are visualized in <xref ref-type="fig" rid="f8"><bold>Figure&#xa0;8</bold></xref>.</p>
<table-wrap id="T2" position="float">
<label>Table&#xa0;2</label>
<caption>
<p>Performance comparison of different object detection models.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="middle" align="center">Model</th>
<th valign="middle" align="center">Params (M)</th>
<th valign="middle" align="center">Precision (%)</th>
<th valign="middle" align="center">Recall (%)</th>
<th valign="middle" align="center">F1-score</th>
<th valign="middle" align="center">mAP@0.5 (%)</th>
<th valign="middle" align="center">mAP@0.5:0.95 (%)</th>
<th valign="middle" align="center">Inference time (ms)</th>
<th valign="middle" align="center">FPS</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="middle" align="center">Faster-RCNN</td>
<td valign="middle" align="center">40.00</td>
<td valign="middle" align="center">63.7</td>
<td valign="middle" align="center">60.8</td>
<td valign="middle" align="center">0.622</td>
<td valign="middle" align="center">60.5</td>
<td valign="middle" align="center">29.6</td>
<td valign="middle" align="center">41.7</td>
<td valign="middle" align="center">24</td>
</tr>
<tr>
<td valign="middle" align="center">SSD</td>
<td valign="middle" align="center">34.00</td>
<td valign="middle" align="center">77.5</td>
<td valign="middle" align="center">72.3</td>
<td valign="middle" align="center">0.748</td>
<td valign="middle" align="center">74.8</td>
<td valign="middle" align="center">38.2</td>
<td valign="middle" align="center">20.8</td>
<td valign="middle" align="center">48</td>
</tr>
<tr>
<td valign="middle" align="center">RetinaNet</td>
<td valign="middle" align="center">36.60</td>
<td valign="middle" align="center">83.0</td>
<td valign="middle" align="center">81.2</td>
<td valign="middle" align="center">0.820</td>
<td valign="middle" align="center">82.4</td>
<td valign="middle" align="center">45.9</td>
<td valign="middle" align="center">23.8</td>
<td valign="middle" align="center">42</td>
</tr>
<tr>
<td valign="middle" align="center">YOLOv3</td>
<td valign="middle" align="center">61.60</td>
<td valign="middle" align="center">86.1</td>
<td valign="middle" align="center">80.2</td>
<td valign="middle" align="center">0.830</td>
<td valign="middle" align="center">82.3</td>
<td valign="middle" align="center">50.1</td>
<td valign="middle" align="center">11.8</td>
<td valign="middle" align="center">85</td>
</tr>
<tr>
<td valign="middle" align="center">YOLOv5</td>
<td valign="middle" align="center">6.70</td>
<td valign="middle" align="center">88.0</td>
<td valign="middle" align="center">85.6</td>
<td valign="middle" align="center">0.871</td>
<td valign="middle" align="center">86.1</td>
<td valign="middle" align="center">55.2</td>
<td valign="middle" align="center">8.5</td>
<td valign="middle" align="center">118</td>
</tr>
<tr>
<td valign="middle" align="center">YOLOv7</td>
<td valign="middle" align="center">37.20</td>
<td valign="middle" align="center">92.5</td>
<td valign="middle" align="center">83.9</td>
<td valign="middle" align="center">0.875</td>
<td valign="middle" align="center">85.2</td>
<td valign="middle" align="center">51.8</td>
<td valign="middle" align="center">9.5</td>
<td valign="middle" align="center">105</td>
</tr>
<tr>
<td valign="middle" align="center">YOLOv8</td>
<td valign="middle" align="center">11.20</td>
<td valign="middle" align="center">90.0</td>
<td valign="middle" align="center">87.0</td>
<td valign="middle" align="center">0.885</td>
<td valign="middle" align="center">86.9</td>
<td valign="middle" align="center">56.2</td>
<td valign="middle" align="center">8.3</td>
<td valign="middle" align="center">120</td>
</tr>
<tr>
<td valign="middle" align="center">YOLOv11n</td>
<td valign="middle" align="center">2.58</td>
<td valign="middle" align="center">90.7 &#xb1; 0.22</td>
<td valign="middle" align="center">87.2 &#xb1; 0.25</td>
<td valign="middle" align="center">0.894</td>
<td valign="middle" align="center">88.5 &#xb1; 0.21</td>
<td valign="middle" align="center">57.1 &#xb1; 0.18</td>
<td valign="middle" align="center">7.8</td>
<td valign="middle" align="center">128</td>
</tr>
<tr>
<td valign="middle" align="center">YOLO-CNGD (Ours)</td>
<td valign="middle" align="center">3.05</td>
<td valign="middle" align="center">94.8 &#xb1; 0.18</td>
<td valign="middle" align="center">91.1 &#xb1; 0.23</td>
<td valign="middle" align="center">0.929</td>
<td valign="middle" align="center">93.7 &#xb1; 0.15</td>
<td valign="middle" align="center">62.7</td>
<td valign="middle" align="center">8.9</td>
<td valign="middle" align="center">112</td>
</tr>
</tbody>
</table>
</table-wrap>
<fig id="f8" position="float">
<label>Figure&#xa0;8</label>
<caption>
<p>Performance comparison of different models.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-17-1754458-g008.tif">
<alt-text content-type="machine-generated">Bar chart comparing model performance metrics including parameter count, precision, recall, F1-score, and mean average precision across multiple experiments, with each metric represented by a distinct color for comparison.</alt-text>
</graphic></fig>
<p>In addition to detection accuracy, we evaluated the inference speed, a crucial metric for real-time monitoring. On the NVIDIA GeForce RTX 3090 GPU, the proposed YOLO-CNGD achieved an inference speed of 112 FPS (Frames Per Second) with an average inference time of 8.9 ms per image. This performance significantly exceeds the real-time requirement (usually 30 FPS), confirming the model&#x2019;s suitability for rapid large-scale urban forest surveying.</p>
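<p>The relationship between the two speed columns of Table 2 is a simple reciprocal, sketched below under the assumption that FPS is derived from the mean per-image latency with a batch size of one.</p>
<code language="python">
```python
# Inference times (ms per image) as listed in Table 2.
LATENCY_MS = {
    "Faster-RCNN": 41.7,
    "SSD": 20.8,
    "RetinaNet": 23.8,
    "YOLOv3": 11.8,
    "YOLOv5": 8.5,
    "YOLOv7": 9.5,
    "YOLOv8": 8.3,
    "YOLOv11n": 7.8,
    "YOLO-CNGD": 8.9,
}


def fps_from_latency(latency_ms: float) -> float:
    """Frames per second implied by a per-image latency in milliseconds."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    for name, latency in LATENCY_MS.items():
        print(name, round(fps_from_latency(latency)))
```
</code>
<p>For YOLO-CNGD, 1000 / 8.9 is approximately 112 FPS, matching the value reported above.</p>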
<p>To validate the stability of the model, we performed 5-fold cross-validation. The results show that YOLO-CNGD maintains high stability, with an mAP@0.5 of 93.7% &#xb1; 0.15% (95% Confidence Interval: [93.4%, 94.0%]). This narrow confidence interval confirms that the performance improvements are statistically robust and not due to random variation in the dataset split.</p>
<p>The results for YOLOv11n and YOLO-CNGD are reported as mean &#xb1; standard deviation based on 5-fold cross-validation. Other comparative models reflect the best performance from standard training runs consistent with their original literature settings.</p>
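<p>The reported interval bounds can be reproduced from the rounded mean and standard deviation under one assumption, namely that the interval is computed as the mean &#xb1; 1.96 standard deviations. This is an inference from the published numbers rather than a description of the exact procedure used.</p>
<code language="python">
```python
def normal_interval(mean: float, sd: float, z: float = 1.96) -> tuple:
    """Interval mean +/- z * sd (z = 1.96 for a nominal 95% level)."""
    half_width = z * sd
    return (mean - half_width, mean + half_width)


if __name__ == "__main__":
    low, high = normal_interval(93.7, 0.15)  # mAP@0.5 mean and SD, in percent
    print(round(low, 1), round(high, 1))
```
</code>
<p>Rounded to one decimal place, the bounds are 93.4 and 94.0, in agreement with the interval quoted above.</p>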
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Ablation study results</title>
<p>A series of ablation studies was conducted to validate the contribution of each proposed modification, with the quantitative results summarized in <xref ref-type="table" rid="T3"><bold>Table&#xa0;3</bold></xref>. The results demonstrate that our full model, which incorporates the DG (DCNv3 + GhostConv) structure, the NWD loss, and the CBAM attention mechanism, achieves significant performance gains over the baseline YOLOv11n. Specifically, it improves overall Precision by 4.1, Recall by 3.9, mAP@0.5 by 5.2, and mAP@0.5:0.95 by 5.6 percentage points.</p>
<table-wrap id="T3" position="float">
<label>Table&#xa0;3</label>
<caption>
<p>Quantitative results of the ablation study on model components.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="middle" align="center">No.</th>
<th valign="middle" align="center">Model</th>
<th valign="middle" align="center">P</th>
<th valign="middle" align="center">R</th>
<th valign="middle" align="center">mAP@0.5</th>
<th valign="middle" align="center">mAP@0.5:0.95</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="middle" align="center">1</td>
<td valign="middle" align="center">YOLOv11n(baseline)</td>
<td valign="middle" align="center">0.907</td>
<td valign="middle" align="center">0.872</td>
<td valign="middle" align="center">0.885</td>
<td valign="middle" align="center">0.571</td>
</tr>
<tr>
<td valign="middle" align="center">2</td>
<td valign="middle" align="center">YOLOv11n+DG</td>
<td valign="middle" align="center">0.912</td>
<td valign="middle" align="center">0.878</td>
<td valign="middle" align="center">0.893</td>
<td valign="middle" align="center">0.573</td>
</tr>
<tr>
<td valign="middle" align="center">3</td>
<td valign="middle" align="center">YOLOv11n+NWD</td>
<td valign="middle" align="center">0.925</td>
<td valign="middle" align="center">0.886</td>
<td valign="middle" align="center">0.904</td>
<td valign="middle" align="center">0.582</td>
</tr>
<tr>
<td valign="middle" align="center">4</td>
<td valign="middle" align="center">YOLOv11n+CBAM</td>
<td valign="middle" align="center">0.914</td>
<td valign="middle" align="center">0.881</td>
<td valign="middle" align="center">0.893</td>
<td valign="middle" align="center">0.574</td>
</tr>
<tr>
<td valign="middle" align="center">5</td>
<td valign="middle" align="center">YOLOv11n+DG+NWD</td>
<td valign="middle" align="center">0.935</td>
<td valign="middle" align="center">0.903</td>
<td valign="middle" align="center">0.918</td>
<td valign="middle" align="center">0.586</td>
</tr>
<tr>
<td valign="middle" align="center">6</td>
<td valign="middle" align="center">YOLOv11n+DG+CBAM</td>
<td valign="middle" align="center">0.925</td>
<td valign="middle" align="center">0.893</td>
<td valign="middle" align="center">0.911</td>
<td valign="middle" align="center">0.578</td>
</tr>
<tr>
<td valign="middle" align="center">7</td>
<td valign="middle" align="center">YOLOv11n+NWD+CBAM</td>
<td valign="middle" align="center">0.933</td>
<td valign="middle" align="center">0.905</td>
<td valign="middle" align="center">0.926</td>
<td valign="middle" align="center">0.596</td>
</tr>
<tr>
<td valign="middle" align="center">8</td>
<td valign="middle" align="center">YOLOv11n+DG+NWD+CBAM</td>
<td valign="middle" align="center">0.948</td>
<td valign="middle" align="center">0.911</td>
<td valign="middle" align="center">0.937</td>
<td valign="middle" align="center">0.627</td>
</tr>
</tbody>
</table>
</table-wrap>
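<p>The gains over the baseline can be recomputed directly from Table 3. The short script below copies the table values and reports each variant&#x2019;s improvement in percentage points; the dictionary keys are illustrative labels.</p>
<code language="python">
```python
METRICS = ("P", "R", "mAP50", "mAP50_95")

# Values copied from Table 3 (row 1 is the YOLOv11n baseline).
BASELINE = (0.907, 0.872, 0.885, 0.571)
VARIANTS = {
    "DG": (0.912, 0.878, 0.893, 0.573),
    "NWD": (0.925, 0.886, 0.904, 0.582),
    "CBAM": (0.914, 0.881, 0.893, 0.574),
    "DG+NWD": (0.935, 0.903, 0.918, 0.586),
    "DG+CBAM": (0.925, 0.893, 0.911, 0.578),
    "NWD+CBAM": (0.933, 0.905, 0.926, 0.596),
    "DG+NWD+CBAM": (0.948, 0.911, 0.937, 0.627),
}


def gains_in_points(values) -> dict:
    """Improvement over the baseline, in percentage points, per metric."""
    return {
        name: round((value - base) * 100.0, 1)
        for name, value, base in zip(METRICS, values, BASELINE)
    }


if __name__ == "__main__":
    for label, values in VARIANTS.items():
        print(label, gains_in_points(values))
```
</code>
<p>For the full model (row 8), the script yields gains of 4.1, 3.9, 5.2, and 5.6 percentage points in Precision, Recall, mAP@0.5, and mAP@0.5:0.95, respectively.</p>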
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Analysis and discussion</title>
<sec id="s4_1">
<label>4.1</label>
<title>Analysis of comparative experimental results</title>
<p>1. Parameter Count: The number of parameters is a critical indicator of model complexity, directly influencing the computational resources and time required for both training and inference. The data reveals significant disparities in parameter counts across the YOLO series. Notably, YOLOv11n, with a minimal 2.58M parameters, demonstrates superior architectural efficiency for lightweight deployment without imposing excessive hardware demands. In contrast, YOLOv3 possesses a substantially larger parameter count of 61.6M. While this high complexity suggests a greater capacity for learning rich and detailed feature representations, it consequently leads to prolonged training times and an increased risk of overfitting, which can ultimately impair the model&#x2019;s generalization performance.</p>
<p>2. Precision: Precision serves as a core metric for evaluating the accuracy of model predictions, reflecting the correctness of positive identifications. It is calculated as the proportion of true positive instances among all samples predicted as positive, thereby directly indicating the reliability of the model in identifying target categories. Among the eight baseline models compared, YOLOv7 achieved the highest precision (0.925), followed by YOLOv11n (0.907), which attains a comparable precision with far fewer parameters (2.58M versus 37.20M) and thus a low probability of false alarms at minimal model size. In contrast, Faster R-CNN attained a notably lower precision of 0.637, suggesting that it may produce a higher number of false positives during detection.</p>
<p>To further analyze the classification errors, we examined the confusion matrix of the proposed model, as presented in <xref ref-type="fig" rid="f9"><bold>Figure&#xa0;9</bold></xref>. The results reveal that the majority of misclassifications occur between <italic>Salix</italic> (Willow) and <italic>Populus</italic> (Poplar). This confusion is primarily attributed to two factors: first, the high spectral similarity of their leaves in the visible bands makes them difficult to distinguish based on color alone; second, in areas with dense vegetation, shadows often obscure the distinctive crown shapes (e.g., the pendulous branches of willows), leading to edge prediction errors. In contrast, <italic>Pinus</italic> (Pine) exhibits the least confusion with other species due to its unique coniferous texture and needle-like foliage.</p>
<fig id="f9" position="float">
<label>Figure&#xa0;9</label>
<caption>
<p>Confusion matrix of YOLO-CNGD.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-17-1754458-g009.tif">
<alt-text content-type="machine-generated">Confusion matrix for YOLO-CNGD model with one thousand five hundred samples, showing correct and incorrect classifications for five tree types: Elm, Willow, Poplar, Birch, and Pine. Color intensity indicates the number of samples, with most values concentrated along the diagonal, signifying high accuracy. Notable misclassifications occur mainly between Willow and Poplar as indicated by larger off-diagonal values. Color bar on the right represents sample count from zero to three hundred.</alt-text>
</graphic></fig>
<p>The matrix validates the model&#x2019;s performance on the validation set (N = 1500). The rows represent the actual labels, and the columns represent the predicted labels. The diagonal elements indicate correct classifications. The orange-highlighted cells quantitatively demonstrate the primary spectral confusion between <italic>Salix</italic> (Willow) and <italic>Populus</italic> (Poplar).</p>
<p>3. Recall: Recall measures a model&#x2019;s ability to detect all positive instances, defined as the proportion of actual positives correctly identified. YOLOv11n and YOLOv8 achieved outstanding recall scores of 0.872 and 0.870, respectively, indicating their strong capability in capturing target objects and minimizing missed detections. In contrast, Faster R-CNN attained a recall of only 0.608, significantly lower than the top-performing models, suggesting a tendency to overlook a considerable number of actual positive instances during detection.</p>
<p>4. F1-score: As the harmonic mean of precision and recall, the F1-score provides a balanced assessment of a model&#x2019;s prediction accuracy and detection completeness, offering a comprehensive evaluation of overall performance. YOLOv11n achieved the highest F1-score of 0.894, indicating an optimal balance between precision and recall. This result demonstrates its capability to maintain high prediction accuracy while effectively detecting the majority of target instances. In contrast, Faster R-CNN obtained an F1-score of only 0.622, suggesting a higher incidence of missed detections and highlighting the need for further optimization in practical applications.</p>
<p>5. mAP@0.5 and mAP@0.5:0.95: mAP@0.5 represents the mean average precision calculated at an IoU threshold of 0.5, reflecting the model&#x2019;s detection performance under a relatively lenient bounding box matching criterion. This metric emphasizes the model&#x2019;s preliminary capability to identify target presence. YOLOv11n achieved an mAP@0.5 of 0.885, indicating its strong performance in accurately detecting targets under relaxed localization requirements. This makes it particularly suitable for applications where rapid identification of approximate target locations is prioritized over precise boundary delineation. In contrast, mAP@0.5:0.95 is computed across multiple IoU thresholds (from 0.5 to 0.95 with a step size of 0.05), imposing stricter localization accuracy demands and providing a more comprehensive assessment of the model&#x2019;s detection robustness in challenging scenarios. YOLOv11n also attained the highest score of 0.571 in this rigorous metric, demonstrating its consistent ability to maintain high detection precision across varying degrees of target overlap.</p>
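<p>The difference between the two mAP variants can be sketched as follows: mAP@0.5:0.95 averages the mAP obtained at ten IoU thresholds. The function names are illustrative, and the per-threshold mAP values would come from the evaluation itself.</p>
<code language="python">
```python
from statistics import mean


def iou_thresholds(start: float = 0.5, step: float = 0.05, count: int = 10) -> list:
    """IoU thresholds 0.50, 0.55, ..., 0.95 used for mAP@0.5:0.95."""
    return [round(start + step * i, 2) for i in range(count)]


def map_over_thresholds(map_at_threshold: dict) -> float:
    """Average the mAP values keyed by each IoU threshold.

    map_at_threshold maps a threshold (e.g. 0.75) to the mAP measured
    when a detection must reach that IoU to count as a true positive.
    """
    return mean(map_at_threshold[t] for t in iou_thresholds())


if __name__ == "__main__":
    print(iou_thresholds())
```
</code>
<p>Because detections that only loosely overlap the ground truth stop counting as true positives at the higher thresholds, mAP@0.5:0.95 is always less than or equal to mAP@0.5 for the same model.</p>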
<p>In summary, while Faster R-CNN, SSD, and RetinaNet exhibit relatively weaker performance on certain metrics, the YOLO series demonstrates prominent results across multiple aspects. Among them, YOLOv11n achieves outstanding detection accuracy with a notably low parameter count of only 2.58M. It performs excellently in terms of precision (0.907), recall (0.872), F1-score (0.894), as well as mAP@0.5 (0.885) and mAP@0.5:0.95 (0.571). This model strikes an effective balance between lightweight design and reliable detection performance, making it highly suitable for real-world applications that require efficient deployment under limited computational resources without compromising on accuracy. YOLOv11n thus represents a cost-effective and competitive baseline in the field of urban tree detection. To ensure the reliability of the improvements, we repeated the training process multiple times. The proposed YOLO-CNGD consistently outperformed the baseline, and the 5.2 percentage-point improvement in mAP@0.5 is supported by the stability of the training curves and by the paired t-test reported in Section 2.7 (p &lt; 0.05).</p>
<p>We further examined the confusion matrix to analyze specific misclassifications, revealing that the primary inter-class confusion occurs between <italic>Salix</italic> (Willow) and <italic>Populus</italic> (Poplar). As visualized in <xref ref-type="fig" rid="f10"><bold>Figures&#xa0;10A&#x2013;C</bold></xref>, these failure cases are mainly attributed to heavy shadows and dense canopy occlusion in complex urban environments, which render the spectral signatures of these two species indistinguishable. Under such conditions, the characteristic drooping branches of willows are often obscured, leading to the observed misclassifications.</p>
<fig id="f10" position="float">
<label>Figure&#xa0;10</label>
<caption>
<p>Visualization of typical failure cases in remote sensing imagery. The figure highlights three primary scenarios leading to misclassification: <bold>(A)</bold> Shadow Occlusion: A Willow tree (True: 1) is obscured by building shadows, altering its spectral signature and leading to misclassification as Poplar (Pred: 2). <bold>(B)</bold> Dense Canopy Occlusion: Overlapping crowns in dense stands blur the boundaries between an <italic>Elm</italic> and a <italic>Birch</italic>, causing segmentation errors. <bold>(C)</bold> Texture/Spectral Similarity: A young Poplar tree (True: 2) lacks distinct textural features and is spectrally confused with a Willow (Pred: 1). The red boxes indicate the ground truth label, the model&#x2019;s prediction, and the confidence score.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-17-1754458-g010.tif">
<alt-text content-type="machine-generated">Three-panel figure showing challenges in tree identification from aerial images. Panel A, titled &#x201c;Shadow Occlusion,&#x201d; highlights building shadow over trees; result shows willow misclassified as poplar, confidence 0.75. Panel B, &#x201c;Dense Canopy Occlusion,&#x201d; displays overlapping crowns blurring boundaries; elm is misclassified as birch, confidence 0.68. Panel C, &#x201c;Texture/Spectral Similarity,&#x201d; features a young tree with uniform texture; poplar misclassified as willow, confidence 0.72. Each panel uses arrows and text to identify the primary challenge affecting tree classification.</alt-text>
</graphic></fig>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Analysis of ablation study results on classification</title>
<p>To extract more discriminative features from data and enhance the performance of the YOLOv11n model in object detection tasks, we conduct a series of ablation studies. These experiments are designed to systematically investigate the individual contributions of various components and mechanisms to the overall model performance.</p>
<sec id="s4_2_1">
<label>4.2.1</label>
<title>Ablation study on individual components</title>
<p>Ablation study 2: YOLOv11n+DG. This experiment integrated DCNv3 and GhostConv (collectively, the DG module) into the model. The results show a recall increase from 0.872 to 0.878 (+0.6 percentage points) and a rise in mAP@0.5 from 0.885 to 0.893 (+0.8 percentage points), demonstrating the module&#x2019;s effectiveness for small object detection. However, precision saw only a marginal gain to 0.912 (+0.5 percentage points), and mAP@0.5:0.95 increased by a mere 0.002. This indicates that while the DG module enhances feature adaptability, on its own it contributes little to suppressing false positives or to localization under stricter IoU thresholds, suggesting that the dynamic offset mechanism would benefit from further optimization.</p>
<p>Ablation study 3: YOLOv11n + NWD loss function. By optimizing the bounding box matching strategy, the NWD loss function improves localization accuracy. The precision increased from 0.907 to 0.925, a gain of 1.8%, while mAP@0.5 and mAP@0.5:0.95 rose to 0.904 and 0.582, representing improvements of 1.9% and 1.1%, respectively. The recall rate improved at a slower pace, which can be attributed to the stricter suppression of duplicate detections by NWD, leading to the filtering of some low-confidence targets. This enhancement is particularly effective in dense object scenarios.</p>
<p>Ablation study 4: YOLOv11n + CBAM attention mechanism. The introduction of the CBAM module led to an improvement in mAP@0.5:0.95, which increased to 0.574. However, the gains in precision and recall were less pronounced compared to those in Ablation Study 3. This suggests that while the attention mechanism enhances the representation of critical features, its excessive focus on local regions may result in the loss of global contextual information, consequently affecting the detection of some objects. Therefore, CBAM is more suitable when combined with localization optimization modules to achieve a better balance between feature selection and detection robustness.</p>
<p>The Precision-Recall (PR) curve plots precision against recall. The Average Precision (AP), defined as the area under the PR curve, serves as a performance metric, where a larger AP value indicates a better algorithm. The results are shown in <xref ref-type="fig" rid="f11"><bold>Figure&#xa0;11</bold></xref>.</p>
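<p>As a minimal illustration of this definition, the following Python sketch computes AP as the area under a PR curve using all-point interpolation of the precision envelope. It is not the evaluation code used in our experiments; the function name <monospace>average_precision</monospace> and the sample PR points are hypothetical.</p>
<code language="python"><![CDATA[
```python
def average_precision(recalls, precisions):
    """Area under the PR curve using the monotone precision envelope.

    recalls must be sorted in ascending order; both lists have equal length.
    """
    # Add sentinel points so the curve spans recall 0..1.
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Sweep right to left so precision never increases as recall grows.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas over each recall step.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


# Hypothetical PR points for illustration only.
recalls = [0.2, 0.4, 0.6, 0.8, 1.0]
precisions = [1.0, 0.9, 0.9, 0.7, 0.5]
ap = average_precision(recalls, precisions)
print(round(ap, 3))
```
]]></code>
<p>Detection toolkits often sample the envelope at fixed recall steps instead of summing every step, so values from this sketch may differ slightly from reported AP figures.</p>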
<fig id="f11" position="float">
<label>Figure&#xa0;11</label>
<caption>
<p>Precision-recall curve.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-17-1754458-g011.tif">
<alt-text content-type="machine-generated">Precision-recall line chart compares classification performance for five tree classes (birch, elm, willow, pine, and poplar) and the average across all classes. Poplar achieves the highest average precision of zero point nine five six, while pine is lowest at zero point nine one three. Axes are labeled Precision and Recall. A legend in the upper right identifies each colored line by class and score.</alt-text>
</graphic></fig>
<p>Specifically, the per-species analysis demonstrates balanced detection capabilities as illustrated in the PR curve. The model achieved the highest Average Precision (AP) for <italic>Populus</italic> (95.6%) and <italic>Ulmus</italic> (93.3%), followed by <italic>Betula</italic> (92.8%) and <italic>Salix</italic> (92.3%). Even for <italic>Pinus</italic>, which possesses complex needle textures distinct from broad-leaved species, the AP reached 91.3%, indicating the model&#x2019;s robustness across diverse canopy morphological types.</p>
</sec>
<sec id="s4_2_2">
<label>4.2.2</label>
<title>Dual-improvement experiment</title>
<p>Ablation study 5: YOLOv11n+DG+NWD. Combining the DG module (DCNv3 and GhostConv) with the Normalized Wasserstein Distance (NWD) loss yielded a significant performance gain. The model achieved a precision of 0.935, a recall of 0.903, an mAP@0.5 of 0.918, and an mAP@0.5:0.95 of 0.586. DG enhances multi-scale feature extraction with its deformable convolutions, while NWD optimizes bounding box matching. This combination effectively mitigates missed detections and localization errors for small objects, validating the complementarity between the two modules.</p>
<p>Ablation study 6: YOLOv11n+DG+CBAM. Integrating the DG module (DCNv3 and GhostConv) with the Convolutional Block Attention Module (CBAM) raised the model&#x2019;s recall to 0.893 and mAP@0.5 to 0.911. However, this combination yielded only a marginal improvement in mAP@0.5:0.95, which reached 0.578. Although the dynamic adaptability of DG and the feature refinement of CBAM collectively enhanced target detection rates, the absence of a dedicated localization optimization mechanism like NWD resulted in insufficient precision for some bounding boxes. This finding indicates that while attention mechanisms and deformable convolutions improve feature representation, they must be coupled with a more refined localization strategy to achieve comprehensive performance gains.</p>
<p>Ablation study 7: YOLOv11n+NWD+CBAM. The synergistic integration of NWD and CBAM yielded the most balanced performance across comprehensive metrics, achieving an mAP@0.5 of 0.926 and elevating mAP@0.5:0.95 to 0.596. The precise localization capability of NWD, combined with the critical feature refinement provided by CBAM, significantly enhanced detection stability in complex backgrounds. However, the balance between precision and recall, while strong, remained slightly inferior to the final model incorporating all three components. A comparative analysis of the object detection results is presented in <xref ref-type="fig" rid="f12"><bold>Figure&#xa0;12</bold></xref>.</p>
<fig id="f12" position="float">
<label>Figure&#xa0;12</label>
<caption>
<p>Comparison of classification and detection results.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-17-1754458-g012.tif">
<alt-text content-type="machine-generated">Satellite image collage with four panels, each showing aerial views of urban areas with labeled bounding boxes identifying tree species&#x2014;birch, elm, poplar, and pine&#x2014;and confidence scores. Different colors denote species while numerical values indicate detection confidence. Panels are labeled (a), (b), (c), and (d).</alt-text>
</graphic></fig>
</sec>
<sec id="s4_2_3">
<label>4.2.3</label>
<title>Comprehensive improvement experiment</title>
<p>Ablation study 8: YOLOv11n+DG+NWD+CBAM (YOLO-CNGD). Integrating all three components produced the best results on every metric. Compared to the baseline, precision and recall increased by 4.1 and 4.0 percentage points, reaching 0.948 and 0.911, respectively, while mAP@0.5 and mAP@0.5:0.95 rose by 5.2 and 5.6 percentage points to 0.937 and 0.627. The three mechanisms work together (DG&#x2019;s dynamic feature adaptation, NWD&#x2019;s localization refinement, and CBAM&#x2019;s attention guidance) and significantly reduced the miss rate for small objects. The model maintains high precision even at high recall rates, supporting the effectiveness of our multi-dimensional improvement strategy.</p>
<p>A comparative analysis between YOLO-CNGD and YOLOv11n reveals that the baseline YOLOv11n suffers from pronounced missed detections of small-target trees. In contrast, YOLO-CNGD, leveraging its optimized architecture and mechanisms, detects these objects with markedly superior accuracy. This series of ablation studies demonstrates that the synergistic integration of feature adaptation, precise localization, and attention guidance achieves the most robust detection performance, thereby providing an effective solution for object detection in complex scenarios.</p>
</sec>
</sec>
<sec id="s4_3" sec-type="discussion">
<label>4.3</label>
<title>Discussion</title>
<sec id="s4_3_1">
<label>4.3.1</label>
<title>Comparison between YOLO-CNGD and the baseline model</title>
<p>The CBAM attention mechanism is a lightweight module that operates through sequential channel and spatial sub-modules. It refines the input feature map by applying adaptive weights across both channel and spatial dimensions. This dual weighting enhances the interdependencies among features, enabling the network to more effectively focus on and extract the most informative characteristics of the target objects.</p>
<p>Compared with state-of-the-art methods published in 2025, our YOLO-CNGD demonstrates specific advantages in balancing accuracy and efficiency. While <xref ref-type="bibr" rid="B5">Li et&#xa0;al. (2025)</xref> relied on heavy UAV-LiDAR data for classification, our method achieves comparable precision (94.8%) using only cost-effective satellite imagery, making it more scalable for city-wide monitoring. Additionally, unlike the general-purpose YOLO improvements proposed by <xref ref-type="bibr" rid="B3">Jiang and Chen (2025)</xref> for pest detection, our integration of the NWD loss specifically targets the &#x2018;location jitter&#x2019; problem of small urban trees. By replacing standard convolutions with GhostConv, we also address the computational constraints highlighted by <xref ref-type="bibr" rid="B16">Zhang et&#xa0;al. (2025)</xref>, ensuring that our model remains deployable on edge devices. Finally, our method offers a distinct advantage over the pine wilt detection model by <xref ref-type="bibr" rid="B15">Wang et&#xa0;al. (2025)</xref> by focusing on the morphological distinction of healthy tree species in complex mixed forests.</p>
<p>To address the challenges of small object detection, we incorporate the Normalized Wasserstein Distance (NWD) into the YOLO-CNGD model. NWD serves as an alternative to IoU for Non-Maximum Suppression (NMS) and loss calculation. It mitigates the high sensitivity of IoU to bounding box scale and location for small objects. By providing a smoother response to positional deviations, NWD enhances the model&#x2019;s robustness; used alone, it raised precision by 1.8 percentage points over the baseline (Ablation study 3).</p>
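<p>The contrast between IoU and NWD for small boxes can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not our training code: each box is modelled as a 2D Gaussian with mean at the box centre and standard deviations of half the width and height, and the normalizing constant <monospace>c</monospace> is dataset dependent (the value 12.8 below is only a placeholder). The box coordinates are hypothetical.</p>
<code language="python"><![CDATA[
```python
import math


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nwd(a, b, c=12.8):
    """Normalized Wasserstein Distance between two boxes (x1, y1, x2, y2).

    Each box is modelled as a 2D Gaussian with mean (cx, cy) and
    standard deviations (w / 2, h / 2). c is a dataset-dependent constant;
    12.8 is an assumed placeholder value.
    """
    def gauss(box):
        x1, y1, x2, y2 = box
        return ((x1 + x2) / 2, (y1 + y2) / 2, (x2 - x1) / 2, (y2 - y1) / 2)

    # Second-order Wasserstein distance between the two Gaussians.
    w2 = math.dist(gauss(a), gauss(b))
    return math.exp(-w2 / c)


# A 6x6 pixel box and copies shifted by 3 and 6 pixels (hypothetical values).
ref = (10, 10, 16, 16)
for shift in (3, 6):
    moved = (10 + shift, 10, 16 + shift, 16)
    print(shift, round(iou(ref, moved), 3), round(nwd(ref, moved), 3))
```
]]></code>
<p>In this toy case IoU falls to zero once the shift equals the box width, whereas NWD decays gradually, which is the smoother response referred to above.</p>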
<p>We introduce two modules, GhostConv and DCNv3, to enhance the base model. The incorporation of multiple new mechanisms initially led to a substantial increase in parameters, consequently reducing detection speed. To mitigate this, the lightweight GhostConv was adopted to streamline the model&#x2019;s complexity and improve computational efficiency. Meanwhile, DCNv3 was integrated to expand the model&#x2019;s receptive field, enabling it to capture more informative features from large-scale parameters and data. The addition of DCNv3 alone contributed to a performance gain of 0.4% to 4.3%.</p>
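<p>The parameter saving from GhostConv can be approximated with simple arithmetic. The sketch below assumes a Ghost ratio of 2 (half of the output channels come from a regular convolution, the other half from a cheap depthwise convolution) and a 5&#xd7;5 depthwise kernel; both settings and the layer sizes are assumptions for illustration and do not describe the exact layers of YOLO-CNGD. Bias and normalization parameters are ignored.</p>
<code language="python"><![CDATA[
```python
def conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias and BatchNorm ignored)."""
    return c_in * c_out * k * k


def ghost_conv_params(c_in, c_out, k, cheap_k=5):
    """Weights of a Ghost convolution with ratio 2 (bias and BatchNorm ignored).

    Half of the output channels come from a regular k x k convolution;
    the other half are generated from them by a depthwise cheap_k x cheap_k
    convolution. cheap_k=5 is an assumed setting.
    """
    hidden = c_out // 2
    primary = c_in * hidden * k * k      # regular convolution
    cheap = hidden * cheap_k * cheap_k   # depthwise: one filter per channel
    return primary + cheap


# Hypothetical layer: 128 input channels, 256 output channels, 3x3 kernel.
standard = conv_params(128, 256, 3)
ghost = ghost_conv_params(128, 256, 3)
print(standard, ghost, round(ghost / standard, 3))
```
]]></code>
<p>Under these assumptions the Ghost variant needs roughly half the weights of the standard layer, because the depthwise branch adds only a small number of parameters.</p>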
</sec>
<sec id="s4_3_2">
<label>4.3.2</label>
<title>Comparison with state-of-the-art lightweight models</title>
<p>Recent advancements in 2024 and 2025 have introduced several robust lightweight models for remote sensing. For instance, SEMA-YOLO enhances small object detection through shallow-layer enhancement, and SRM-YOLO effectively utilizes multi-scale adaptation. While these models show impressive general performance, they primarily focus on minimizing parameter counts or handling standard small objects (e.g., vehicles or ships) that have rigid boundaries.</p>
<p>In contrast, urban trees present unique challenges such as irregular canopy shapes and severe &#x2018;location jitter&#x2019; caused by wind or shadows. Our comparative analysis suggests that while general-purpose SOTA models like YOLOv8 or even the improved SEMA-YOLO excel in speed, they often lack specific mechanisms to handle the fuzzy boundaries of vegetation. YOLO-CNGD distinguishes itself by integrating DCNv3, which adaptively deforms the convolutional kernel to fit irregular tree crowns&#x2014;a feature absent in standard lightweight improvements. Furthermore, our adoption of NWD loss provides superior stability for locating small, overlapping trees in dense urban environments, offering a more domain-specific solution for precision forestry.</p>
</sec>
<sec id="s4_3_3">
<label>4.3.3</label>
<title>Statistical analysis of tree categories by YOLO-CNGD within a designated area</title>
<p>First, the model&#x2019;s anchor mechanism generates object proposals, whose spatial locations and category constraints provide a precise initialization for semantic segmentation. This funnels the segmentation task into the anchor-defined regions, drastically narrowing the search space and boosting both efficiency and accuracy. Second, the multi-feature fusion network extracts discriminative features (e.g., canopy morphology, visible-band reflectance). These features are leveraged not only for classification but also as deep semantic inputs to the segmentation network. This enhances the distinction of boundaries between tree species and effectively disentangles complex scenarios involving occlusions and overlapping canopies.</p>
<p>During the tree categorization and counting procedure based on YOLO-CNGD, the key information from each bounding box, including the top-left and bottom-right coordinates and the corresponding class label, is extracted and saved into a .txt file. The output format is illustrated in <xref ref-type="fig" rid="f13"><bold>Figure&#xa0;13</bold></xref>.</p>
<fig id="f13" position="float">
<label>Figure&#xa0;13</label>
<caption>
<p>Bounding box file for tree categorization.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-17-1754458-g013.tif">
<alt-text content-type="machine-generated">Rows of numbers are arranged in groups, each row displaying six integers followed by a decimal value and a final single-digit integer, all in a clear sans-serif font against a white background.</alt-text>
</graphic></fig>
<p>The text file contains six columns of data. The first and second columns denote the x and y coordinates of the top-left corner of the bounding box, respectively. The third and fourth columns represent the x and y coordinates of the bottom-right corner. The fifth column is the classification confidence score. The sixth column indicates the tree species category, encoded as follows: 0 for Elm, 1 for Willow, 2 for Poplar, 3 for Birch, and 4 for Pine.</p>
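<p>Given the six-column layout described above, species-specific counting reduces to tallying the last column. The following Python sketch shows one possible way to do this; it assumes whitespace-separated columns, and the function name <monospace>count_species</monospace>, the optional confidence threshold, and the sample rows are hypothetical.</p>
<code language="python"><![CDATA[
```python
from collections import Counter

# Class encoding as described in the text.
SPECIES = {0: "Elm", 1: "Willow", 2: "Poplar", 3: "Birch", 4: "Pine"}


def count_species(lines, min_conf=0.0):
    """Count detections per species from six-column rows:
    x1 y1 x2 y2 confidence class_id (whitespace-separated)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed rows
        conf = float(fields[4])
        class_id = int(fields[5])
        if conf >= min_conf:
            counts[SPECIES[class_id]] += 1
    return counts


# Hypothetical rows for illustration; a real file would be read with open(path).
sample = [
    "120 48 152 83 0.91 2",
    "310 205 338 236 0.87 0",
    "45 400 70 428 0.42 2",
    "",
]
print(count_species(sample))
print(count_species(sample, min_conf=0.5))
```
]]></code>
<p>Raising <monospace>min_conf</monospace> trades completeness for reliability: low-confidence detections are dropped from the tally.</p>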
</sec>
</sec>
<sec id="s4_4">
<label>4.4</label>
<title>Implications for urban forest management and ecological assessment</title>
<p>The accurate, automated classification of urban tree species achieved by YOLO-CNGD transcends mere technical performance, offering tangible value for urban forestry practice and ecological planning. The transition from manual interpretation to AI-driven mapping enables scalable, data-informed decision-making across several key domains.</p>
<sec id="s4_4_1">
<label>4.4.1</label>
<title>From classification maps to biodiversity metrics</title>
<p>The high-resolution species distribution map generated by YOLO-CNGD serves as a foundational layer for quantitative ecological assessment. Moving beyond visual inspection, forestry managers can calculate standardized biodiversity indices&#x2014;such as the Shannon-Wiener Index or Species Richness&#x2014;at varying administrative scales (e.g., block, district, or park). This allows for the objective identification of biodiversity &#x201c;cold spots,&#x201d; such as areas dominated by a single species (e.g., extensive <italic>Populus</italic> monocultures), which may exhibit lower ecological resilience to pests or climate stressors. Conversely, &#x201c;hot spots&#x201d; of high species diversity can be recognized and conserved. This data-driven approach facilitates targeted greening policies, such as strategic enrichment planting in species-poor areas, to enhance overall urban biodiversity and ecosystem stability. Our model&#x2019;s ability to distinguish <italic>Populus</italic> from <italic>Salix</italic> with high precision is particularly valuable for forest health monitoring. In Harbin&#x2019;s urban core, monocultures of <italic>Populus</italic> are highly susceptible to pest outbreaks such as the Asian Longhorned Beetle. By providing geolocated distribution maps, YOLO-CNGD enables managers to identify these &#x2018;vulnerability hotspots&#x2019; and prioritize species diversification to enhance urban ecological resilience.</p>
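<p>The two indices named above can be computed directly from per-area species counts. The sketch below is illustrative only: it uses the standard Shannon-Wiener definition, H&#x2032; = &#x2212;&#x3a3; p<sub>i</sub> ln p<sub>i</sub>, and the block-level counts are hypothetical rather than taken from the Harbin study area.</p>
<code language="python"><![CDATA[
```python
import math


def shannon_index(counts):
    """Shannon-Wiener index H' = -sum(p_i * ln(p_i)) over species with count > 0."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in counts.values() if n > 0)


def species_richness(counts):
    """Number of species with at least one detected tree."""
    return sum(1 for n in counts.values() if n > 0)


# Hypothetical per-block tree counts, not taken from the study area.
mixed_block = {"Elm": 40, "Willow": 40, "Poplar": 40, "Birch": 40, "Pine": 40}
poplar_block = {"Elm": 5, "Willow": 0, "Poplar": 190, "Birch": 5, "Pine": 0}

for name, block in (("mixed", mixed_block), ("poplar-dominated", poplar_block)):
    print(name, species_richness(block), round(shannon_index(block), 3))
```
]]></code>
<p>An evenly mixed block reaches the maximum index for five species (ln 5), while a block dominated by <italic>Populus</italic> scores much lower, which is how a biodiversity cold spot would appear in such an analysis.</p>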
</sec>
<sec id="s4_4_2">
<label>4.4.2</label>
<title>Functional group stratification for ecosystem service estimation</title>
<p>Accurate discrimination between functional groups, particularly conifers (<italic>Pinus</italic>) and broadleaf deciduous trees (e.g., <italic>Ulmus</italic>, <italic>Salix</italic>, <italic>Populus</italic>), is critical for refining urban ecosystem service models. These groups differ significantly in their seasonal dynamics, carbon sequestration rates, and microclimate regulation capacities. For instance, conifers provide year-round visual greenery and particulate matter capture, while deciduous trees may offer superior shading during summer. By enabling precise mapping of these functional types, YOLO-CNGD outputs allow for more stratified and accurate estimation of key services like carbon storage, urban heat island mitigation, and air quality improvement, moving beyond generalized urban canopy cover metrics.</p>
</sec>
<sec id="s4_4_3">
<label>4.4.3</label>
<title>Towards precision arboriculture and individual tree health monitoring</title>
<p>The model&#x2019;s output&#x2014;providing not just a species label but also a geolocated bounding box for each detected tree&#x2014;paves the way for an individual tree health management system. By integrating the classification result with the spectral information for each instance, trees exhibiting signs of stress can be automatically flagged and precisely located on a digital map. This transforms urban forestry from a reactive, area-based maintenance model to a proactive, precision arboriculture paradigm. Forestry crews can efficiently prioritize inspections, irrigation, fertilization, or pest control interventions for specific, at-risk trees, optimizing resource allocation and potentially reducing management costs.</p>
<p>In summary, YOLO-CNGD acts as a powerful analytical tool that converts remote sensing imagery into actionable intelligence. It supports urban forest managers in biodiversity conservation, evidence-based planning for ecosystem services, and the implementation of cost-effective, targeted maintenance strategies.</p>
</sec>
</sec>
<sec id="s4_5">
<label>4.5</label>
<title>Limitations and future work</title>
<p>Despite the promising results, this study has several limitations that need to be addressed in future research. First, regarding dataset diversity, the current dataset is limited to Harbin, China, and comprises images captured on a single date (August 10, 2022). Consequently, the model&#x2019;s generalization capability across different climatic zones and phenological seasons (e.g., autumn leaf coloration or winter defoliation) remains to be validated. Future work will expand the dataset to include multi-temporal and multi-regional imagery to test the model&#x2019;s transferability. Second, regarding failure cases, although the model handles small objects well, performance drops in scenarios with dense canopy occlusion or heavy shadows cast by high-rise buildings. In these cases, the spectral features of the shadowed trees are distorted, leading to misclassifications between species with similar crown shapes (e.g., Willow and Poplar). Third, regarding taxonomic scope, this study focused on five dominant tree species. While sufficient for Harbin&#x2019;s urban core, expanding the class categories to include shrubs and other rare species would enhance the tool&#x2019;s applicability for broader biodiversity surveys. Future work will explore fusing multi-spectral indices (e.g., indices built on red-edge bands) or temporal data to disambiguate spectrally similar species under varying illumination.</p>
</sec>
</sec>
<sec id="s5" sec-type="conclusions">
<label>5</label>
<title>Conclusion</title>
<p>This study focuses on the Harbin region, where remote sensing images were acquired via Google Earth Engine. After data preprocessing, a dedicated dataset containing Elm, Willow, Pine, Poplar, and Birch trees was constructed, with classification and segmentation labels annotated using Labelme. To enhance the model&#x2019;s ability to capture key features, the CBAM attention mechanism was introduced, operating through a dual-pathway structure to improve accuracy. For addressing small object detection challenges, the NWD loss function and DCNv3 were incorporated, thereby boosting detection precision. Additionally, GhostConv was used to replace standard convolutions in the C3k2 module, effectively reducing model parameters. The YOLO-CNGD model was established for urban tree classification. Quantitative evaluations indicate that the proposed framework outperforms the baseline, delivering a precision of 94.8% and an mAP@0.5 of 93.7%, confirming its efficacy for urban forestry tasks. A comprehensive analysis and discussion of the experimental outcomes were conducted. By leveraging the tree category bounding box files, accurate species-specific tree counting within the study area was achieved. This method provides a cost-effective alternative to traditional manual surveys, enabling real-time monitoring of urban green space structure and supporting data-driven decision-making for sustainable urban development.</p>
</sec>
</body>
<back>
<sec id="s6" sec-type="data-availability">
<title>Data availability statement</title>
<p>The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.</p></sec>
<sec id="s7" sec-type="author-contributions">
<title>Author contributions</title>
<p>CZ: Writing &#x2013; original draft. ML: Investigation, Writing &#x2013; review &amp; editing. XL: Methodology, Writing &#x2013; review &amp; editing. ZG: Writing &#x2013; review &amp; editing.</p></sec>
<sec id="s9" sec-type="COI-statement">
<title>Conflict of interest</title>
<p>The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p></sec>
<sec id="s10" sec-type="ai-statement">
<title>Generative AI statement</title>
<p>The author(s) declared that generative AI was not used in the creation of this manuscript.</p>
<p>Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.</p></sec>
<sec id="s11" sec-type="disclaimer">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p></sec>
<ref-list>
<title>References</title>
<ref id="B1">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Ayrey</surname> <given-names>E.</given-names></name>
<name><surname>Fraver</surname> <given-names>S.</given-names></name>
<name><surname>Kershaw</surname> <given-names>J. A.</given-names> <suffix>Jr.</suffix></name>
<name><surname>Kenefic</surname> <given-names>L. S.</given-names></name>
<name><surname>Hayes</surname> <given-names>D.</given-names></name>
<name><surname>Weiskittel</surname> <given-names>A. R.</given-names></name>
<etal/>
</person-group>. (<year>2017</year>). 
<article-title>Layer stacking: A novel algorithm for individual forest tree segmentation from LiDAR point clouds</article-title>. <source>Can. J. Remote Sens.</source> <volume>43</volume>, <fpage>16</fpage>&#x2013;<lpage>27</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1080/07038992.2017.1252907</pub-id>
</mixed-citation>
</ref>
<ref id="B2">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Chen</surname> <given-names>R.</given-names></name>
<name><surname>Wang</surname> <given-names>Z.</given-names></name>
</person-group> (<year>2025</year>). 
<article-title>SRM-YOLO for small object detection in remote sensing images</article-title>. <source>Remote Sens.</source> <volume>17</volume>, <fpage>2099</fpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.3390/rs17243963</pub-id>
</mixed-citation>
</ref>
<ref id="B3">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Jiang</surname> <given-names>H.</given-names></name>
<name><surname>Chen</surname> <given-names>W.</given-names></name>
</person-group> (<year>2025</year>). 
<article-title>YOLOv8 forestry pest recognition based on improved re-parametric convolution</article-title>. <source>Front. Plant Sci.</source> <volume>16</volume>, <elocation-id>1552853</elocation-id>. doi:&#xa0;<pub-id pub-id-type="doi">10.3389/fpls.2025.1552853</pub-id>, PMID: <pub-id pub-id-type="pmid">40134619</pub-id>
</mixed-citation>
</ref>
<ref id="B4">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Jin</surname> <given-names>H.</given-names></name>
<name><surname>Liu</surname> <given-names>H.</given-names></name>
<name><surname>Kang</surname> <given-names>J.</given-names></name>
</person-group> (<year>2019</year>). 
<article-title>Outdoor thermal comfort and adaptation in severe cold area: A longitudinal survey in Harbin, China</article-title>. <source>Building Environ.</source> <volume>148</volume>, <fpage>248</fpage>&#x2013;<lpage>260</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.buildenv.2018.11.011</pub-id>
</mixed-citation>
</ref>
<ref id="B5">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Li</surname> <given-names>X.</given-names></name>
<name><surname>Zhang</surname> <given-names>Y.</given-names></name>
<name><surname>Wang</surname> <given-names>L.</given-names></name>
</person-group> (<year>2025</year>). 
<article-title>Efficient tree species classification using machine and deep learning algorithms based on UAV-LiDAR data in North China</article-title>. <source>Front. Forests Global Change</source> <volume>8</volume>, <elocation-id>1431603</elocation-id>. doi: <pub-id pub-id-type="doi">10.3389/ffgc.2025.1431603</pub-id>
</mixed-citation>
</ref>
<ref id="B6">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Liu</surname> <given-names>B.</given-names></name>
<name><surname>Huang</surname> <given-names>H.</given-names></name>
<name><surname>Su</surname> <given-names>Y.</given-names></name>
<name><surname>Chen</surname> <given-names>S.</given-names></name>
<name><surname>Li</surname> <given-names>Z.</given-names></name>
<name><surname>Chen</surname> <given-names>E.</given-names></name>
<etal/>
</person-group>. (<year>2022</year>). 
<article-title>Tree species classification using ground-based LiDAR data by various point cloud deep learning methods</article-title>. <source>Remote Sens.</source> <volume>14</volume>, <fpage>5733</fpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.3390/rs14225733</pub-id>
</mixed-citation>
</ref>
<ref id="B7">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Liu</surname> <given-names>J.</given-names></name>
<name><surname>Wang</surname> <given-names>X.</given-names></name>
<name><surname>Wang</surname> <given-names>T.</given-names></name>
</person-group> (<year>2020</year>). 
<article-title>Automatic tree species image recognition based on deep learning</article-title>. <source>J. Nanjing Forestry Univ. (Natural Sci. Edition)</source> <volume>44</volume>, <fpage>138</fpage>&#x2013;<lpage>144</lpage>. doi: <pub-id pub-id-type="doi">10.3969/j.issn.1000-2006.201901043</pub-id>
</mixed-citation>
</ref>
<ref id="B8">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Liu</surname> <given-names>C.</given-names></name>
<name><surname>Yang</surname> <given-names>F.</given-names></name>
</person-group> (<year>2025</year>). 
<article-title>SEMA-YOLO: lightweight small object detection in remote sensing image via shallow-layer enhancement and multi-scale adaptation</article-title>. <source>Remote Sens.</source> <volume>17</volume>, <fpage>1917</fpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.3390/rs17111917</pub-id>
</mixed-citation>
</ref>
<ref id="B9">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Liu</surname> <given-names>Y.</given-names></name>
<name><surname>Zhao</surname> <given-names>Q.</given-names></name>
<name><surname>Wang</surname> <given-names>X.</given-names></name>
<name><surname>Sheng</surname> <given-names>Y.</given-names></name>
<name><surname>Tian</surname> <given-names>W.</given-names></name>
<name><surname>Ren</surname> <given-names>Y.</given-names></name>
<etal/>
</person-group>. (<year>2024</year>). 
<article-title>A tree species classification model based on improved YOLOv7 for shelterbelts</article-title>. <source>Front. Plant Sci.</source> <volume>14</volume>, <elocation-id>1265025</elocation-id>. doi:&#xa0;<pub-id pub-id-type="doi">10.3389/fpls.2023.1265025</pub-id>, PMID: <pub-id pub-id-type="pmid">38304457</pub-id>
</mixed-citation>
</ref>
<ref id="B10">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Rezatofighi</surname> <given-names>H.</given-names></name>
<name><surname>Tsoi</surname> <given-names>N.</given-names></name>
<name><surname>Gwak</surname> <given-names>J. Y.</given-names></name>
<name><surname>Sadeghian</surname> <given-names>A.</given-names></name>
<name><surname>Reid</surname> <given-names>I.</given-names></name>
<name><surname>Savarese</surname> <given-names>S.</given-names></name>
</person-group> (<year>2019</year>). 
<article-title>Generalized intersection over union: A metric and a loss for bounding box regression</article-title>. <source>Proc. IEEE/CVF Conf. Comput. Vision Pattern Recognition</source>, <fpage>658</fpage>&#x2013;<lpage>666</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1109/CVPR.2019.00075</pub-id>
</mixed-citation>
</ref>
<ref id="B11">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Satama-Bermeo</surname> <given-names>S.</given-names></name>
<name><surname>Straker</surname> <given-names>J.</given-names></name>
<name><surname>Puliti</surname> <given-names>S.</given-names></name>
</person-group> (<year>2025</year>). 
<article-title>Comparative analysis of YOLOv8 and YOLOv11 on tree detection using UAV RGB and laser scanning data</article-title>. <source>ISPRS Ann. Photogrammetry Remote Sens. Spatial Inf. Sci.</source> <volume>X-2-W2-2025</volume>, <fpage>173</fpage>&#x2013;<lpage>180</lpage>.
</mixed-citation>
</ref>
<ref id="B12">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Turkulainen</surname> <given-names>E.</given-names></name>
<name><surname>Honkavaara</surname> <given-names>E.</given-names></name>
<name><surname>N&#xe4;si</surname> <given-names>R.</given-names></name>
<name><surname>Oliveira</surname> <given-names>R. A.</given-names></name>
<name><surname>Hakala</surname> <given-names>T.</given-names></name>
<name><surname>Junttila</surname> <given-names>S.</given-names></name>
<etal/>
</person-group>. (<year>2023</year>). 
<article-title>Comparison of deep neural networks in the classification of bark beetle-induced spruce damage using UAS images</article-title>. <source>Remote Sens.</source> <volume>15</volume>, <fpage>4928</fpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.3390/rs15204928</pub-id>
</mixed-citation>
</ref>
<ref id="B13">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Velasquez-Camacho</surname> <given-names>L.</given-names></name>
<name><surname>Etxegarai</surname> <given-names>M.</given-names></name>
<name><surname>de-Miguel</surname> <given-names>S.</given-names></name>
</person-group> (<year>2023</year>). 
<article-title>Implementing Deep Learning algorithms for urban tree detection and geolocation with high-resolution aerial, satellite, and ground-level images</article-title>. <source>Computers Environ. Urban Syst.</source> <volume>105</volume>, <fpage>102025</fpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1016/j.compenvurbsys.2023.102025</pub-id>
</mixed-citation>
</ref>
<ref id="B14">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Vermeer</surname> <given-names>M.</given-names></name>
<name><surname>Hay</surname> <given-names>J. A.</given-names></name>
<name><surname>V&#xf6;lgyes</surname> <given-names>D.</given-names></name>
<name><surname>K&#xf3;ma</surname> <given-names>Z.</given-names></name>
<name><surname>Breidenbach</surname> <given-names>J.</given-names></name>
<name><surname>Fantin</surname> <given-names>D. S. M.</given-names></name>
</person-group> (<year>2023</year>). 
<article-title>Lidar-based Norwegian tree species detection using deep learning</article-title>. <source>arXiv preprint arXiv:2311.06066</source>.
</mixed-citation>
</ref>
<ref id="B15">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Wang</surname> <given-names>Y.</given-names></name>
<name><surname>Huang</surname> <given-names>L.</given-names></name>
<name><surname>Li</surname> <given-names>Y.</given-names></name>
</person-group> (<year>2025</year>). 
<article-title>Detection method of Pinus wood infected with pine wilt disease based on improved YOLOv8n</article-title>. <source>For. Grassland Resource Res.</source> <volume>1</volume>, <fpage>114</fpage>&#x2013;<lpage>125</lpage>. doi: <pub-id pub-id-type="doi">10.11707/j.1001-7488.20240112</pub-id>
</mixed-citation>
</ref>
<ref id="B16">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Zhang</surname> <given-names>H.</given-names></name>
<name><surname>Liu</surname> <given-names>M.</given-names></name>
<name><surname>Wu</surname> <given-names>J.</given-names></name>
</person-group> (<year>2025</year>). 
<article-title>Urban tree species classification using multisource satellite remote sensing data and street view imagery</article-title>. <source>Int. J. Digital Earth</source> <volume>18</volume>, <fpage>2439380</fpage>. doi: <pub-id pub-id-type="doi">10.1080/17538947.2024.2439380</pub-id>
</mixed-citation>
</ref>
<ref id="B17">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Zhao</surname> <given-names>Y.</given-names></name>
<name><surname>Chen</surname> <given-names>S.</given-names></name>
</person-group> (<year>2025</year>). 
<article-title>Deep learning-based urban tree species mapping with high-resolution Pl&#xe9;iades imagery in Nanjing, China</article-title>. <source>Remote Sens.</source> <volume>16</volume>, <fpage>783</fpage>. doi: <pub-id pub-id-type="doi">10.3390/rs16050783</pub-id>
</mixed-citation>
</ref>
</ref-list>
<fn-group>
<fn id="n1" fn-type="custom" custom-type="edited-by">
<p>Edited by: <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1432032">Kai Huang</ext-link>, Jiangsu Academy of Agricultural Sciences, China</p></fn>
<fn id="n2" fn-type="custom" custom-type="reviewed-by">
<p>Reviewed by: <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1776906">Emilio Ram&#xed;rez-Juid&#xed;as</ext-link>, Universidad de Sevilla Instituto Universitario de Arquitectura y Ciencias de la Construcci&#xf3;n, Spain</p>
<p><ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/3219793">Nassima Bousahba</ext-link>, University of Chlef, Algeria</p></fn>
</fn-group>
</back>
</article>