AUTHOR=Liu Yifan, Yu Qiudong, Geng Shuze, Guo Shiyi, Liu Ling
TITLE=SSViT-YOLOv11: fusing lightweight YOLO & ViT for coffee fruit maturity detection
JOURNAL=Frontiers in Plant Science
VOLUME=Volume 16 - 2025
YEAR=2025
URL=https://www.frontiersin.org/journals/plant-science/articles/10.3389/fpls.2025.1691643
DOI=10.3389/fpls.2025.1691643
ISSN=1664-462X
ABSTRACT=Accurate identification of coffee fruit maturity is critical for optimizing harvest timing and ensuring bean quality, but manual inspection is time-consuming and subjective. Automated visual detection faces several challenges: subtle color differences among maturity stages, frequent occlusions within fruit clusters, variable lighting, and abundant small-scale targets. In this paper, we propose SSViT-YOLOv11, an improved YOLOv11n-based framework that integrates a Single Scale Vision Transformer (SSViT) into the backbone and refines multi-scale feature fusion to enhance context modeling and small-object representation. In addition, the C3K2 modules in YOLOv11n are combined with Arbitrary Kernel Convolution (AKConv), and multi-scale convolutional attention (MSCA) is added to the head, which improves detection accuracy while making the model more lightweight. Experimental results show that SSViT-YOLOv11 achieves superior performance across multiple evaluation metrics. Specifically, the model attains a precision of 81.1%, a recall of 77.4%, and a mean Average Precision at an IoU threshold of 0.5 (mAP@50) of 84.54%, while running at 23 FPS with only 2.16 million parameters. These results indicate that the proposed model offers a favorable balance of accuracy, inference speed, and model compactness, making it well suited to assisting farmers in coffee fruit maturity assessment.