---
summary: VGGT (Visual Geometry Grounded Transformer) integration and usage
tags: [vision, vggt]
updated: 2025-08-21
audience: [dev]
---

# šŸ” VGGT

VGGT (Visual Geometry Grounded Transformer) is integrated into Tatbot to provide dense 3D reconstruction and automatic camera pose estimation from multi-view RGB images. This complements the existing AprilTag-based calibration and RealSense depth sensing.

## Overview

VGGT enhances the vision system by:

- **Dense 3D Reconstruction**: Generates high-quality point clouds from RGB-only images
- **Markerless Pose Estimation**: Estimates camera poses without requiring AprilTags
- **Cross-Node GPU Processing**: Runs on dedicated GPU nodes (ook) while cameras are on sensor nodes (hog)
- **Scale Alignment**: Automatically aligns VGGT outputs with metric measurements using AprilTag references
- **COLMAP Integration**: Exports standard COLMAP format for compatibility with external tools

## Architecture

### Components

```text
src/tatbot/
ā”œā”€ā”€ cam/vggt_runner.py       # Core VGGT processing utilities
ā”œā”€ā”€ tools/gpu/vggt_recon.py  # Remote GPU reconstruction tool
ā”œā”€ā”€ tools/robot/sense.py     # Enhanced with VGGT integration
└── viz/vggt_compare.py      # Comparison visualization tool
```

### Data Flow

1. **Image Capture** (hog): Multi-camera RGB images saved to NFS
2. **GPU Processing** (ook): VGGT model processes images → poses + dense points
3. **Scale Alignment** (ook): Align VGGT scale with AprilTag reference measurements
4. **File Storage** (NFS): Results saved in COLMAP + PLY formats
5. **Visualization** (any): Compare VGGT vs RealSense reconstructions

## Usage

### Enable VGGT in Sense Tool

```json
{
  "tool": "sense",
  "input": {
    "scene": "default",
    "enable_vggt": true,
    "vggt_conf_threshold": 5.0,
    "vggt_use_ba": false,
    "calibrate_extrinsics": true
  }
}
```

**Parameters:**

- `enable_vggt`: Enable VGGT reconstruction (default: false)
- `vggt_conf_threshold`: Confidence threshold for point filtering (default: 5.0)
- `vggt_use_ba`: Enable bundle adjustment refinement (default: false)
- `vggt_image_count`: Number of images per camera (default: 1)

### Direct Reconstruction

You can also run VGGT reconstruction directly on GPU nodes:

```json
{
  "tool": "vggt_reconstruct",
  "input": {
    "image_dir": "/nfs/tatbot/recordings/sense-default-20241215_143022/images",
    "output_pointcloud_path": "/nfs/tatbot/recordings/sense-default-20241215_143022/pointclouds/vggt_dense.ply",
    "output_frustums_path": "/nfs/tatbot/recordings/sense-default-20241215_143022/metadata/vggt_frustums.json",
    "output_colmap_dir": "/nfs/tatbot/recordings/sense-default-20241215_143022/colmap",
    "scene": "default",
    "vggt_conf_threshold": 5.0,
    "shared_camera": false
  }
}
```

### Compare Viz

View VGGT reconstruction alongside RealSense data:

```bash
uv run python -m tatbot.viz.vggt_compare --dataset_dir=/nfs/tatbot/recordings/sense-default-20241215_143022
```

Or via MCP tool:

```json
{
  "tool": "vggt_compare_viz",
  "input": {
    "dataset_dir": "/nfs/tatbot/recordings/sense-default-20241215_143022",
    "show_vggt": true,
    "show_rs": true,
    "show_vggt_frustums": true,
    "show_apriltag_frustums": true
  }
}
```

## Output

VGGT integration generates the following files in sense datasets:

```text
/nfs/tatbot/recordings/sense-{scene}-{timestamp}/
ā”œā”€ā”€ images/                      # RGB images for VGGT processing
│   ā”œā”€ā”€ realsense1.png
│   ā”œā”€ā”€ realsense2.png
│   └── ipcamera1.png
ā”œā”€ā”€ pointclouds/
│   ā”œā”€ā”€ realsense1_000000.ply    # RealSense depth point clouds
│   ā”œā”€ā”€ realsense2_000000.ply
│   └── vggt_dense.ply           # VGGT dense reconstruction
ā”œā”€ā”€ colmap/                      # COLMAP format camera data
│   ā”œā”€ā”€ cameras.txt              # Camera intrinsics
│   ā”œā”€ā”€ images.txt               # Camera poses
│   └── points3D.txt             # 3D points (optional)
└── metadata/
    ā”œā”€ā”€ apriltag_frustums.json   # AprilTag camera poses
    ā”œā”€ā”€ vggt_frustums.json       # VGGT camera poses
    ā”œā”€ā”€ vggt_confidence.npz      # Depth confidence maps
    └── metrics.json             # Reconstruction metrics
```
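The `colmap/` directory uses the standard COLMAP text format, so the estimated poses can be inspected without any Tatbot code. Below is a minimal sketch (not part of the codebase) that parses `images.txt` and prints each camera's center in world coordinates; the dataset path is the example used throughout this page:

```python
from pathlib import Path

import numpy as np


def quat_to_rotmat(qw: float, qx: float, qy: float, qz: float) -> np.ndarray:
    """Unit quaternion (COLMAP order: qw, qx, qy, qz) -> 3x3 rotation matrix."""
    w, x, y, z = np.array([qw, qx, qy, qz]) / np.linalg.norm([qw, qx, qy, qz])
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ])


def load_camera_centers(images_txt: Path) -> dict[str, np.ndarray]:
    """Return {image_name: camera_center_in_world} from a COLMAP images.txt."""
    centers: dict[str, np.ndarray] = {}
    for line in images_txt.read_text().splitlines():
        fields = line.split()
        # Pose lines have exactly 10 fields: ID QW QX QY QZ TX TY TZ CAM_ID NAME.
        # Comment lines ("# ...") and 2D-point lines are skipped.
        if len(fields) != 10 or fields[0].startswith("#"):
            continue
        qw, qx, qy, qz = map(float, fields[1:5])          # camera-from-world rotation
        tvec = np.array([float(v) for v in fields[5:8]])  # camera-from-world translation
        R = quat_to_rotmat(qw, qx, qy, qz)
        centers[fields[9]] = -R.T @ tvec                  # camera center in world frame
    return centers


if __name__ == "__main__":
    dataset = Path("/nfs/tatbot/recordings/sense-default-20241215_143022")
    for name, center in load_camera_centers(dataset / "colmap" / "images.txt").items():
        print(f"{name}: camera center (m) = {center}")
```

Because COLMAP stores camera-from-world poses, the center is recovered as `-Ršŗ t`; this is the same inversion referred to in the Coordinate Systems section below when converting to Tatbot's world-from-camera convention.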
## Implementation

### Model Management

VGGT uses the HuggingFace model cache for efficient storage:

- **Model**: `facebook/VGGT-1B` (1 billion parameters)
- **Storage**: `~/.cache/huggingface/` on each GPU node
- **Loading**: Automatic download on first use, cached thereafter
- **Memory**: ~8GB VRAM required for inference

### Scale Alignment

VGGT outputs are scale-ambiguous and require metric alignment:

1. **AprilTag Method** (Preferred): Compares camera baseline distances between VGGT and AprilTag poses
2. **Median Baseline**: Robust scale factor = `median(AprilTag_baselines) / median(VGGT_baselines)`
3. **Validation**: Scale factors outside the 1e-3 to 1e3 range are reset to 1.0
4. **Application**: Scale applied only to camera poses (not double-scaled with world points)
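The median-baseline rule above can be written down in a few lines. This is an illustrative sketch rather than the actual implementation in `cam/vggt_runner.py`; the function names are hypothetical:

```python
import itertools

import numpy as np


def median_baseline(centers: np.ndarray) -> float:
    """Median pairwise distance between camera centers, shape (N, 3)."""
    dists = [np.linalg.norm(a - b) for a, b in itertools.combinations(centers, 2)]
    return float(np.median(dists))


def vggt_scale_factor(apriltag_centers: np.ndarray, vggt_centers: np.ndarray) -> float:
    """Scale that maps VGGT's arbitrary units onto metric AprilTag baselines.

    Factors outside [1e-3, 1e3] are treated as alignment failures and reset to
    1.0, matching the validation rule described above.
    """
    scale = median_baseline(apriltag_centers) / median_baseline(vggt_centers)
    if not (1e-3 <= scale <= 1e3):
        return 1.0
    return scale


if __name__ == "__main__":
    # Toy example: the VGGT reconstruction is ~2x smaller than the metric AprilTag layout.
    apriltag = np.array([[0.0, 0.0, 0.0], [0.4, 0.0, 0.0], [0.0, 0.4, 0.0]])
    vggt = apriltag / 2.0
    print(vggt_scale_factor(apriltag, vggt))  # -> 2.0
```

As noted in step 4, the pipeline applies the resulting factor to the camera poses only, to avoid double-scaling against the world points.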
### Coordinate Systems

- **VGGT Output**: Camera-from-world extrinsics (OpenCV convention)
- **Tatbot Internal**: World-from-camera poses (for consistency)
- **Conversion**: Handled automatically by `vggt_runner.py` utilities
- **COLMAP Export**: Standard COLMAP format with quaternion poses

### Cross-Node Communication

- **Protocol**: MCP (Model Context Protocol) over HTTP
- **Payload**: File paths only (~1KB), not raw data
- **Timeout**: 15 minutes for GPU processing
- **Retry**: 2 attempts with exponential backoff
- **Fallback**: Graceful degradation if VGGT fails

## Configuration

### Nodes

- **GPU Node** (`ook`): CUDA-capable GPU with ≄8GB VRAM
- **Sensor Node** (`hog`): Camera connections (RealSense + IP cameras)
- **Storage**: NFS mount for cross-node file access

### Cameras

VGGT works with any RGB cameras defined in the scene:

- **IP Cameras**: Full resolution RGB frames
- **RealSense**: Color stream (depth stream used separately)
- **Format**: PNG images saved to `images/` subdirectory
- **Naming**: `{camera_name}.png` format for consistent processing

### GPU

Enable VGGT capability in `src/conf/mcp/ook.yaml`:

```yaml
extras: ["gpu"]  # Enables GPU-required tools
tools:
  - vggt_reconstruct
  - convert_strokelist_to_batch
```

### Settings

Optional VGGT-specific settings in `src/conf/cam/vggt.yaml`:

```yaml
preferred_gpu_node: ook
model_path: null        # Use HuggingFace cache
resolution: 518         # VGGT internal resolution
conf_threshold: 5.0
use_ba: false           # Bundle adjustment
return_point_map: true
```

## Performance

### Timing

- **Image Loading**: ~2-5 seconds for a typical multi-camera setup
- **VGGT Inference**: ~30-60 seconds on an RTX-class GPU
- **Scale Alignment**: ~1 second
- **File I/O**: ~5-10 seconds for PLY/COLMAP export
- **Total**: ~1-2 minutes for a complete reconstruction

### Resources

- **GPU Memory**: ~8GB VRAM during inference
- **Storage**: ~50-100MB per reconstruction (dense point cloud)
- **Network**: Minimal (<1KB MCP payloads)

### Quality

- **Point Density**: 10x+ more points than RealSense depth
- **Pose Accuracy**: Typically within 2-5mm of AprilTag calibration
- **Coverage**: Full 360° reconstruction from multi-view images

## Integration

### Existing Vision Pipeline

- **Complementary**: Works alongside AprilTag calibration and RealSense depth
- **Validation**: VGGT poses compared against AprilTag ground truth
- **Enhancement**: Dense reconstruction supplements sparse RealSense data

### Surface Mapping

- **Mesh Input**: VGGT point clouds can be used for skin mesh reconstruction
- **Stroke Mapping**: Enhanced 3D surfaces improve 2D→3D stroke mapping accuracy
- **Geodesic Tracing**: Denser meshes enable more accurate surface path planning

### Visualization

- **Comparison Tool**: Side-by-side VGGT vs RealSense point cloud analysis
- **Camera Poses**: Visual comparison of AprilTag vs VGGT camera frustums
- **Quality Assessment**: Real-time pose accuracy and point density metrics

## Troubleshooting

### Issues

**VGGT Processing Fails:**

- Check GPU memory availability (`nvidia-smi`)
- Verify the HuggingFace model cache download completed
- Ensure sufficient disk space for the model (~4GB)

**Scale Alignment Problems:**

- Verify AprilTag detection succeeded in the images
- Check that AprilTag and VGGT detect the same cameras
- Examine scale factor bounds in `metrics.json`

**Cross-Node Communication:**

- Verify the MCP server is running on the GPU node
- Check network connectivity between hog and ook
- Ensure the NFS mount is accessible from both nodes

**Poor Reconstruction Quality:**

- Increase the number of viewpoints (more camera positions)
- Improve lighting conditions for better image quality
- Adjust the confidence threshold for point filtering
- Consider bundle adjustment for pose refinement

### Debugging

Monitor VGGT processing via logs:

```bash
# On GPU node
tail -f /nfs/tatbot/mcp-logs/ook.log

# Metrics file
cat /nfs/tatbot/recordings/{dataset}/metadata/metrics.json
```

Key metrics to check (a minimal script for this check is sketched at the end of this page):

- `vggt_scale_factor`: Should be reasonable (0.1 - 10.0)
- `vggt_point_count`: Higher is generally better
- `mean_cam_center_err_m`: Pose accuracy vs the AprilTag reference

## Roadmap

- **Bundle Adjustment**: Integration with pycolmap for pose refinement
- **Multi-Scale Processing**: Support for different VGGT resolution settings
- **Temporal Consistency**: Video-based VGGT for improved accuracy
- **Direct Mesh Output**: Skip the PLY intermediate and generate meshes directly
- **Real-Time Processing**: Optimize for faster reconstruction cycles
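For reference, the `metrics.json` sanity check mentioned in the Debugging section can be scripted. A minimal sketch (not part of the codebase), assuming only the key names listed there, with thresholds mirroring the 0.1-10.0 range given above:

```python
import json
import sys
from pathlib import Path


def check_metrics(metrics_path: Path) -> bool:
    """Print VGGT reconstruction metrics and warn on out-of-range values."""
    metrics = json.loads(metrics_path.read_text())
    ok = True

    scale = metrics.get("vggt_scale_factor")
    if scale is not None and not (0.1 <= scale <= 10.0):
        print(f"WARNING: vggt_scale_factor={scale} is outside the expected 0.1-10.0 range")
        ok = False

    points = metrics.get("vggt_point_count", 0)
    print(f"vggt_point_count: {points}")

    err = metrics.get("mean_cam_center_err_m")
    if err is not None:
        print(f"mean_cam_center_err_m: {err * 1000:.1f} mm vs AprilTag reference")

    return ok


if __name__ == "__main__":
    # Usage: python check_metrics.py /nfs/tatbot/recordings/sense-default-20241215_143022
    dataset = Path(sys.argv[1])
    sys.exit(0 if check_metrics(dataset / "metadata" / "metrics.json") else 1)
```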