Ranjan Sapkota

and 12 more

Given the rapid emergence and application of Large Language Models (LLMs) across various scientific fields, insights regarding their applicability in agriculture remain only partially explored. This paper conducts an in-depth review of LLMs in agriculture, focusing on understanding how LLMs can be developed and implemented to optimize agricultural processes, increase efficiency, and reduce costs. Recent studies have explored the capabilities of LLMs in agricultural information processing and decision-making. Nevertheless, a comprehensive understanding of the capabilities, challenges, limitations, and future directions of LLMs in agricultural information processing and application remains in its early stages. Such exploration is essential to provide the community with a broader perspective and clearer understanding of LLMs' applications, serving as a baseline for the current state and trends in the subject matter. To bridge this gap, this survey reviews the progress of LLMs and their utilization in agriculture, with an additional focus on 11 key research questions (RQs), of which four are general and seven are agriculture-focused. By addressing these RQs, this review outlines the current opportunities, challenges, limitations, and future roadmap for LLMs in agriculture. The findings indicate that multi-modal LLMs not only simplify complex agricultural challenges but also significantly enhance decision-making and improve the efficiency of agricultural image processing. These advancements position LLMs as an essential tool for the future of farming. For continued research and understanding, an organized and regularly updated list of papers on LLMs is available at https://github.com/JiajiaLi04/Multi-Modal-LLMs-in-Agriculture.

Ranjan Sapkota

and 5 more

Object detection, specifically fruitlet detection, is a crucial image processing technique in agricultural automation, enabling the accurate identification of fruitlets on orchard trees within images. It is vital for early fruit load management and overall crop management, facilitating the effective deployment of automation and robotics to optimize orchard productivity and resource use. This study systematically evaluated all configurations of the YOLOv8, YOLOv9, YOLOv10, and YOLO11 object detection algorithms for immature green apple (fruitlet) detection in commercial orchards, in terms of precision, recall, mean Average Precision at 50% Intersection over Union (mAP@50), and computational speed, including pre-processing, inference, and post-processing times. Additionally, this research performed and validated in-field counting of fruitlets using an iPhone and machine vision sensors across four apple varieties (Scifresh, Scilate, Honeycrisp, and Cosmic Crisp). This investigation of a total of 22 configurations of YOLOv8, YOLOv9, YOLOv10, and YOLO11 (5 for YOLOv8, 6 for YOLOv9, 6 for YOLOv10, and 5 for YOLO11) revealed that YOLOv9 gelan-base and YOLO11s outperformed all other configurations of YOLOv10, YOLOv9, and YOLOv8 in terms of mAP@50, with scores of 0.935 and 0.933, respectively. Specifically, YOLOv9 gelan-e achieved an mAP@50 of 0.935, outperforming YOLO11s's 0.933, YOLOv10s's 0.924, and YOLOv8s's 0.924. In terms of recall, YOLOv9 gelan-base achieved the highest value among the YOLOv9 configurations (0.899), and YOLO11m performed best among the YOLO11 configurations (0.897). Comparing inference speeds, YOLO11n demonstrated the fastest inference at only 2.4 ms, while the fastest inference speeds among YOLOv10, YOLOv9, and YOLOv8 were 5.5, 11.5, and 4.1 ms for YOLOv10n, YOLOv9 gelan-s, and YOLOv8n, respectively.
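The precision, recall, and mAP@50 figures above all rest on matching predicted boxes to ground-truth boxes at an Intersection-over-Union (IoU) threshold of 0.5. A minimal sketch of that matching step is shown below; the function names are illustrative and not taken from the study's code, and a full mAP computation would additionally average precision over the confidence-ranked precision-recall curve.

```python
def iou(a, b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def precision_recall_at_50(predictions, ground_truths):
    """Greedy one-to-one matching of predictions to ground truths at IoU >= 0.5.

    predictions: list of (box, confidence); higher-confidence boxes match first.
    """
    preds = sorted(predictions, key=lambda p: -p[1])
    matched = set()
    tp = 0
    for box, _conf in preds:
        best, best_iou = None, 0.5  # 0.5 is the mAP@50-style IoU threshold
        for i, gt in enumerate(ground_truths):
            if i in matched:
                continue
            v = iou(box, gt)
            if v >= best_iou:
                best, best_iou = i, v
        if best is not None:
            matched.add(best)
            tp += 1
    fp = len(preds) - tp          # unmatched predictions
    fn = len(ground_truths) - tp  # missed fruitlets
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if ground_truths else 0.0
    return precision, recall

# Toy example: one good detection, one spurious one, one missed fruitlet.
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [((1, 1, 10, 10), 0.9), ((50, 50, 60, 60), 0.8)]
print(precision_recall_at_50(preds, gts))  # (0.5, 0.5)
```

The greedy confidence-ordered matching mirrors the common evaluation convention in which each ground-truth box may be claimed by at most one prediction.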

Ranjan Sapkota

and 1 more

In this study, a robust method for 3D pose estimation of immature green apples (fruitlets) in commercial orchards was developed, utilizing the YOLO11 object pose detection model alongside Vision Transformers (ViT) for depth estimation. For object detection and pose estimation, performance comparisons of YOLO11 (YOLO11n, YOLO11s, YOLO11m, YOLO11l, and YOLO11x) and YOLOv8 (YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x) were made under identical hyperparameter settings across all configurations. Likewise, for RGB to RGB-D mapping, the Dense Prediction Transformer (DPT) and Depth Anything V2 were investigated. It was observed that YOLO11n surpassed all configurations of YOLO11 and YOLOv8 in terms of box precision and pose precision, achieving scores of 0.91 and 0.915, respectively. Conversely, YOLOv8n exhibited the highest box and pose recall scores of 0.905 and 0.925, respectively. Regarding the mean average precision at 50% intersection over union (mAP@50), YOLO11s led all configurations with a box mAP@50 score of 0.94, while YOLOv8n achieved the highest pose mAP@50 score of 0.96. In terms of image processing speed, YOLO11n outperformed all configurations with an impressive inference speed of 2.7 ms, significantly faster than the quickest YOLOv8 configuration, YOLOv8n, which processed images in 7.8 ms. This demonstrates a substantial improvement in inference speed over previous iterations, particularly evident when comparing YOLO11n and YOLOv8n. Subsequent integration of ViTs for depth estimation of the green fruits' poses revealed that Depth Anything V2 outperformed the Dense Prediction Transformer in 3D pose length validation, achieving the lowest Root Mean Square Error (RMSE) of 1.52 and Mean Absolute Error (MAE) of 1.28, demonstrating exceptional precision in estimating immature green fruit lengths. DPT followed with an RMSE of 3.29 and an MAE of 2.62.
In contrast, measurements derived from Intel RealSense point clouds exhibited the highest discrepancies from the ground truth, with an RMSE of 9.98 and an MAE of 7.74. These findings emphasize the effectiveness of YOLO11 in detecting and estimating the pose of immature green fruits, illustrating how Vision Transformers like Depth Anything V2 adeptly convert RGB images into RGB-D data, thus enhancing the precision of 3D pose estimation while reducing the computational requirements for future robotic thinning applications in commercial orchards.
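The length-validation pipeline described above can be sketched in a few lines: each detected keypoint pair is back-projected from pixel coordinates and estimated depth into 3D using a pinhole camera model, the fruit length is taken as the Euclidean distance between the two 3D keypoints, and the estimates are scored against ground truth with RMSE and MAE. The intrinsics, keypoints, and numbers below are purely illustrative assumptions, not values from the study.

```python
import math

def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with depth (m) to a 3D point."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def fruit_length(p_a, p_b):
    """Euclidean distance between two 3D keypoints (e.g., calyx and stem ends)."""
    return math.dist(p_a, p_b)

def rmse_mae(estimates, ground_truth):
    """RMSE and MAE between estimated and measured lengths (same units)."""
    errs = [e - g for e, g in zip(estimates, ground_truth)]
    rmse = math.sqrt(sum(e * e for e in errs) / len(errs))
    mae = sum(abs(e) for e in errs) / len(errs)
    return rmse, mae

# Illustrative intrinsics and a fruitlet 0.5 m from the camera.
fx = fy = 600.0
cx, cy = 320.0, 240.0
top = backproject(320, 200, 0.50, fx, fy, cx, cy)
bot = backproject(320, 236, 0.50, fx, fy, cx, cy)
print(fruit_length(top, bot))  # ≈ 0.03 m, i.e. a 30 mm fruitlet
```

Because depth multiplies the pixel offsets in `backproject`, any bias in the predicted depth map scales directly into the 3D length estimate, which is why the depth model's quality dominates the RMSE/MAE comparison reported above.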