πŸ” VGGTΒΆ

VGGT (Visual Geometry Grounded Transformer) is integrated into Tatbot to provide dense 3D reconstruction and automatic camera pose estimation from multi-view RGB images. It complements the existing AprilTag-based calibration and RealSense depth sensing.

OverviewΒΆ

VGGT enhances the vision system by:

  • Dense 3D Reconstruction: Generates high-quality point clouds from RGB-only images

  • Markerless Pose Estimation: Estimates camera poses without requiring AprilTags

  • Cross-Node GPU Processing: Runs on dedicated GPU nodes (ook) while cameras are on sensor nodes (hog)

  • Scale Alignment: Automatically aligns VGGT outputs with metric measurements using AprilTag references

  • COLMAP Integration: Exports standard COLMAP format for compatibility with external tools

ArchitectureΒΆ

ComponentsΒΆ

src/tatbot/
β”œβ”€β”€ cam/vggt_runner.py           # Core VGGT processing utilities
β”œβ”€β”€ tools/gpu/vggt_recon.py      # Remote GPU reconstruction tool
β”œβ”€β”€ tools/robot/sense.py         # Enhanced with VGGT integration
└── viz/vggt_compare.py          # Comparison visualization tool

Data FlowΒΆ

  1. Image Capture (hog): Multi-camera RGB images saved to NFS

  2. GPU Processing (ook): VGGT model processes images β†’ poses + dense points

  3. Scale Alignment (ook): Align VGGT scale with AprilTag reference measurements

  4. File Storage (NFS): Results saved in COLMAP + PLY formats

  5. Visualization (any): Compare VGGT vs RealSense reconstructions (this flow is sketched in code below)
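
The flow above can be summarized in code. The following is a minimal sketch, not the actual sense.py implementation: the call_tool argument stands in for whatever MCP client function tatbot uses to dispatch a request to the GPU node, and the payload keys mirror the vggt_reconstruct input shown later on this page.

from pathlib import Path

def run_vggt_for_dataset(dataset_dir: str, call_tool) -> dict:
    # Step 1 has already saved RGB images to <dataset_dir>/images on NFS.
    dataset = Path(dataset_dir)
    payload = {
        # Only NFS paths cross the network, never raw image data.
        "image_dir": str(dataset / "images"),
        "output_pointcloud_path": str(dataset / "pointclouds" / "vggt_dense.ply"),
        "output_frustums_path": str(dataset / "metadata" / "vggt_frustums.json"),
        "output_colmap_dir": str(dataset / "colmap"),
        "scene": "default",
        "vggt_conf_threshold": 5.0,
    }
    # Steps 2-4: the GPU node (ook) runs inference and scale alignment,
    # then writes COLMAP/PLY results back to NFS.
    result = call_tool("vggt_reconstruct", payload)
    # Step 5: any node can now load vggt_dense.ply / vggt_frustums.json for comparison.
    return result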

UsageΒΆ

Enable VGGT in Sense ToolΒΆ

{
  "tool": "sense",
  "input": {
    "scene": "default",
    "enable_vggt": true,
    "vggt_conf_threshold": 5.0,
    "vggt_use_ba": false,
    "calibrate_extrinsics": true
  }
}

Parameters:

  • enable_vggt: Enable VGGT reconstruction (default: false)

  • vggt_conf_threshold: Confidence threshold for point filtering (default: 5.0)

  • vggt_use_ba: Enable bundle adjustment refinement (default: false)

  • vggt_image_count: Number of images per camera (default: 1)

Direct ReconstructionΒΆ

You can also run VGGT reconstruction directly on GPU nodes:

{
  "tool": "vggt_reconstruct",
  "input": {
    "image_dir": "/nfs/tatbot/recordings/sense-default-20241215_143022/images",
    "output_pointcloud_path": "/nfs/tatbot/recordings/sense-default-20241215_143022/pointclouds/vggt_dense.ply",
    "output_frustums_path": "/nfs/tatbot/recordings/sense-default-20241215_143022/metadata/vggt_frustums.json",
    "output_colmap_dir": "/nfs/tatbot/recordings/sense-default-20241215_143022/colmap",
    "scene": "default",
    "vggt_conf_threshold": 5.0,
    "shared_camera": false
  }
}

Comparison VisualizationΒΆ

View VGGT reconstruction alongside RealSense data:

uv run python -m tatbot.viz.vggt_compare --dataset_dir=/nfs/tatbot/recordings/sense-default-20241215_143022

Or via MCP tool:

{
  "tool": "vggt_compare_viz",
  "input": {
    "dataset_dir": "/nfs/tatbot/recordings/sense-default-20241215_143022",
    "show_vggt": true,
    "show_rs": true,
    "show_vggt_frustums": true,
    "show_apriltag_frustums": true
  }
}

OutputΒΆ

VGGT integration generates the following files in sense datasets:

/nfs/tatbot/recordings/sense-{scene}-{timestamp}/
β”œβ”€β”€ images/                      # RGB images for VGGT processing
β”‚   β”œβ”€β”€ realsense1.png
β”‚   β”œβ”€β”€ realsense2.png
β”‚   └── ipcamera1.png
β”œβ”€β”€ pointclouds/
β”‚   β”œβ”€β”€ realsense1_000000.ply    # RealSense depth point clouds
β”‚   β”œβ”€β”€ realsense2_000000.ply
β”‚   └── vggt_dense.ply           # VGGT dense reconstruction
β”œβ”€β”€ colmap/                      # COLMAP format camera data
β”‚   β”œβ”€β”€ cameras.txt              # Camera intrinsics
β”‚   β”œβ”€β”€ images.txt               # Camera poses
β”‚   └── points3D.txt             # 3D points (optional)
└── metadata/
    β”œβ”€β”€ apriltag_frustums.json   # AprilTag camera poses
    β”œβ”€β”€ vggt_frustums.json       # VGGT camera poses
    β”œβ”€β”€ vggt_confidence.npz      # Depth confidence maps
    └── metrics.json             # Reconstruction metrics

ImplementationΒΆ

Model ManagementΒΆ

VGGT uses the HuggingFace model cache for efficient storage:

  • Model: facebook/VGGT-1B (1 billion parameters)

  • Storage: ~/.cache/huggingface/ on each GPU node

  • Loading: Automatic download on first use, cached thereafter

  • Memory: ~8GB VRAM required for inference

Scale AlignmentΒΆ

VGGT outputs are scale-ambiguous and require metric alignment (a minimal sketch of the alignment follows this list):

  1. AprilTag Method (Preferred): Compares camera baseline distances between VGGT and AprilTag poses

  2. Median Baseline: Robust scale factor = median(AprilTag_baselines) / median(VGGT_baselines)

  3. Validation: Scale factors outside the 1e-3 to 1e3 range are reset to 1.0

  4. Application: Scale applied only to camera poses (not double-scaled with world points)
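
The baseline-ratio alignment can be sketched as follows, assuming camera centers are available as Nx3 arrays in each frame; the function names here are illustrative, not the real vggt_runner.py API:

import numpy as np

def _pairwise_baselines(centers):
    # Distances between every pair of camera centers (needs >= 2 cameras).
    c = np.asarray(centers, dtype=float)
    i, j = np.triu_indices(len(c), k=1)
    return np.linalg.norm(c[i] - c[j], axis=1)

def vggt_scale_factor(vggt_centers, apriltag_centers):
    scale = float(
        np.median(_pairwise_baselines(apriltag_centers))
        / np.median(_pairwise_baselines(vggt_centers))
    )
    # Validation bound from step 3 above.
    if not (1e-3 <= scale <= 1e3):
        scale = 1.0
    return scale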

Coordinate SystemsΒΆ

  • VGGT Output: Camera-from-world extrinsics (OpenCV convention)

  • Tatbot Internal: World-from-camera poses (for consistency)

  • Conversion: Handled automatically by vggt_runner.py utilities

  • COLMAP Export: Standard COLMAP format with quaternion poses

Cross-Node CommunicationΒΆ

  • Protocol: MCP (Model Context Protocol) over HTTP

  • Payload: File paths only (~1KB), not raw data

  • Timeout: 15 minutes for GPU processing

  • Retry: 2 attempts with exponential backoff

  • Fallback: Graceful degradation if VGGT fails

ConfigurationΒΆ

NodesΒΆ

  • GPU Node (ook): CUDA-capable GPU with β‰₯8GB VRAM

  • Sensor Node (hog): Camera connections (RealSense + IP cameras)

  • Storage: NFS mount for cross-node file access

CamerasΒΆ

VGGT works with any RGB cameras defined in the scene:

  • IP Cameras: Full resolution RGB frames

  • RealSense: Color stream (depth stream used separately)

  • Format: PNG images saved to images/ subdirectory

  • Naming: {camera_name}.png format for consistent processing

GPUΒΆ

Enable VGGT capability in src/conf/mcp/ook.yaml:

extras: ["gpu"]  # Enables GPU-required tools
tools:
  - vggt_reconstruct
  - convert_strokelist_to_batch

SettingsΒΆ

Optional VGGT-specific settings in src/conf/cam/vggt.yaml:

preferred_gpu_node: ook
model_path: null  # Use HuggingFace cache
resolution: 518   # VGGT internal resolution
conf_threshold: 5.0
use_ba: false     # Bundle adjustment
return_point_map: true

PerformanceΒΆ

TimingΒΆ

  • Image Loading: ~2-5 seconds for typical multi-camera setup

  • VGGT Inference: ~30-60 seconds on RTX-class GPU

  • Scale Alignment: ~1 second

  • File I/O: ~5-10 seconds for PLY/COLMAP export

  • Total: ~1-2 minutes for complete reconstruction

ResourcesΒΆ

  • GPU Memory: ~8GB VRAM during inference

  • Storage: ~50-100MB per reconstruction (dense point cloud)

  • Network: Minimal (<1KB MCP payloads)

QualityΒΆ

  • Point Density: 10x+ more points than RealSense depth

  • Pose Accuracy: Typically within 2-5mm of AprilTag calibration

  • Coverage: Full 360Β° reconstruction from multi-view images

IntegrationΒΆ

Existing Vision PipelineΒΆ

  • Complementary: Works alongside AprilTag calibration and RealSense depth

  • Validation: VGGT poses compared against AprilTag ground truth

  • Enhancement: Dense reconstruction supplements sparse RealSense data

Surface MappingΒΆ

  • Mesh Input: VGGT point clouds can be used for skin mesh reconstruction

  • Stroke Mapping: Enhanced 3D surfaces improve 2Dβ†’3D stroke mapping accuracy

  • Geodesic Tracing: Denser meshes enable more accurate surface path planning

VisualizationΒΆ

  • Comparison Tool: Side-by-side VGGT vs RealSense point cloud analysis

  • Camera Poses: Visual comparison of AprilTag vs VGGT camera frustums

  • Quality Assessment: Real-time pose accuracy and point density metrics

TroubleshootingΒΆ

IssuesΒΆ

VGGT Processing Fails:

  • Check GPU memory availability (nvidia-smi)

  • Verify HuggingFace model cache download completed

  • Ensure sufficient disk space for model (~4GB)

Scale Alignment Problems:

  • Verify AprilTag detection succeeded in images

  • Check that AprilTag and VGGT detect same cameras

  • Examine scale factor bounds in metrics.json

Cross-Node Communication:

  • Verify MCP server running on GPU node

  • Check network connectivity between hog and ook

  • Ensure NFS mount accessible from both nodes

Poor Reconstruction Quality:

  • Increase number of viewpoints (more camera positions)

  • Improve lighting conditions for better image quality

  • Adjust confidence threshold for point filtering

  • Consider bundle adjustment for pose refinement

DebuggingΒΆ

Monitor VGGT processing via logs:

# On GPU node
tail -f /nfs/tatbot/mcp-logs/ook.log

# Metrics file
cat /nfs/tatbot/recordings/{dataset}/metadata/metrics.json

Key metrics to check (a small check script is sketched after the list):

  • vggt_scale_factor: Should be reasonable (0.1 - 10.0)

  • vggt_point_count: Higher is generally better

  • mean_cam_center_err_m: Pose accuracy vs AprilTag reference

RoadmapΒΆ

  • Bundle Adjustment: Integration with pycolmap for pose refinement

  • Multi-Scale Processing: Support for different VGGT resolution settings

  • Temporal Consistency: Video-based VGGT for improved accuracy

  • Direct Mesh Output: Skip PLY intermediate and generate meshes directly

  • Real-Time Processing: Optimize for faster reconstruction cycles