VineStereo: An Optimized Deep Stereo Network and Dataset for Thin Grapevine Structures


Introduction

Agricultural robotics faces significant challenges in tasks that require precise manipulation of thin structures, such as grapevine pruning. Accurate perception of these delicate, intricate structures is critical, but it remains hindered by the scarcity of agriculture-specific datasets and the limitations of existing stereo-matching algorithms. Grapevine pruning, essential for optimizing yield and crop quality, is labor-intensive, and shortages of skilled labor make it increasingly difficult to carry out. Addressing these issues is crucial for advancing automated solutions in viticulture and agricultural robotics.

Objective

To develop an edge-aware stereo-matching framework tailored for thin structure detection and perception, facilitating automated grapevine pruning. This research aims to overcome dataset scarcity by leveraging NeRF-Supervised Deep Stereo techniques for efficient data generation, ensuring the reliable and accurate reconstruction of thin structures in agricultural environments.

Methodology

  1. Data Generation:
    • Utilized NeRF-Supervised Deep Stereo (NSDS) to generate synthetic stereo image datasets with "pseudo" ground truth depth maps, reducing the need for extensive real-world data collection.
    • Generated a comprehensive dataset of 20,000 stereo image pairs reflecting diverse grapevine structures and conditions.
  2. Edge-Aware Stereo-Matching:
    • Introduced edge-awareness to the RAFT-Stereo framework by incorporating Sobel edge filtering and depth thresholding to enhance thin structure perception.
    • Fine-tuned the stereo-matching network using the synthetic datasets, with tailored masking strategies in GRU layers to improve depth estimation accuracy for thin structures.
  3. 3D Reconstruction:
    • Applied iterative point cloud registration using the colored Iterative Closest Point (ICP) algorithm to merge depth maps from multiple viewpoints, creating high-fidelity 3D reconstructions of grapevines.
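
As a rough illustration of the edge-awareness idea in step 2, the sketch below combines a Sobel gradient magnitude with a depth threshold to produce a binary mask of likely thin-structure pixels. It is a minimal, dependency-free sketch and not the project's implementation: the function names (`sobel_magnitude`, `thin_structure_mask`), the threshold values, and the choice to combine the two cues with a logical AND are all assumptions made for this example.

```python
# Illustrative sketch only: the thresholds and the way the two cues are
# combined are assumptions, not values taken from the VineStereo work.

SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_magnitude(image):
    """Return the Sobel gradient magnitude of a 2D grayscale image (list of rows).

    Border pixels are left at 0.0 because the 3x3 kernel does not fit there.
    """
    height, width = len(image), len(image[0])
    magnitude = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0.0
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * pixel
                    gy += SOBEL_Y[dy][dx] * pixel
            magnitude[y][x] = (gx * gx + gy * gy) ** 0.5
    return magnitude


def thin_structure_mask(image, depth, edge_threshold, max_depth):
    """Mark pixels that lie on a strong edge AND are closer than max_depth."""
    edges = sobel_magnitude(image)
    return [
        [edges[y][x] >= edge_threshold and 0.0 < depth[y][x] <= max_depth
         for x in range(len(image[0]))]
        for y in range(len(image))
    ]


# Toy 5x6 scene: a two-pixel-wide bright vertical "cane" (columns 2 and 3)
# at 1 m, in front of a dark background at 5 m.
image = [[0.0, 0.0, 1.0, 1.0, 0.0, 0.0] for _ in range(5)]
depth = [[5.0, 5.0, 1.0, 1.0, 5.0, 5.0] for _ in range(5)]
mask = thin_structure_mask(image, depth, edge_threshold=1.0, max_depth=2.0)
```

In this toy scene the Sobel response fires on both sides of each intensity edge, including background pixels next to the cane; the depth threshold then discards the background side, so only the foreground cane pixels remain in the mask. A mask of this kind could, for example, be used to weight or select pixels during fine-tuning, although how the mask enters the GRU layers is not specified here.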

Results

Conclusion

VineStereo represents a significant advancement in the automation of agricultural tasks, particularly grapevine pruning. By integrating NeRF-Supervised Deep Stereo techniques with edge-aware stereo-matching, this research addresses critical challenges in dataset scarcity and fine structure perception. The proposed framework enables reliable 3D reconstructions and automated pruning in viticulture, paving the way for broader applications in agricultural robotics. Future work will focus on integrating bud detection and cut point localization to further refine robotic pruning capabilities.


Contribution

I was the co-lead researcher on the VineStereo project. I contributed to the development of the edge-aware stereo-matching framework. My responsibilities included generating data by training NeRF models and fine-tuning the RAFT-Stereo network for thin structure detection. I also contributed extensively to paper writing and visualization generation. This research was conducted at the Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA.


Visuals

Visual 1
Figure 1: (a) Disparity map generated using the original RAFT-Stereo network; (b) disparity map generated using our optimized edge-aware RAFT-Stereo.
Visual 2
Figure 2: Overview of the VineStereo matching pipeline. Dedicated encoders process the stereo images and feed a GRU-based network, which, along with disparity data and encoder-processed edge maps, refines the output into a detailed disparity map highlighting thin structures, as shown in the color-coded depth representation on the right.
Visual 3
Figure 3: Real-world setup
Visual 4
Figure 4: Qualitative Results: Depth Map
Visual 5
Figure 5: Qualitative Results: Point Cloud ground truth (left) vs ours (right)
Visual 6
Figure 6: Qualitative Results: In-the-wild Vine Point Cloud