πŸ” VGGTΒΆ

VGGT (Visual Geometry Grounded Transformer) is integrated into Tatbot to provide dense 3D reconstruction and automatic camera pose estimation from multi-view RGB images. It complements the existing AprilTag-based calibration and RealSense depth sensing.

OverviewΒΆ

VGGT enhances the vision system by:

  • Dense 3D Reconstruction: Generates high-quality point clouds from RGB-only images

  • Markerless Pose Estimation: Estimates camera poses without requiring AprilTags

  • Cross-Node GPU Processing: Runs on dedicated GPU nodes (ook) while cameras are on sensor nodes (hog)

  • Scale Alignment: Automatically aligns VGGT outputs with metric measurements using AprilTag references

  • COLMAP Integration: Exports standard COLMAP format for compatibility with external tools

ArchitectureΒΆ

ComponentsΒΆ

src/tatbot/
β”œβ”€β”€ cam/vggt_runner.py           # Core VGGT processing utilities
β”œβ”€β”€ tools/gpu/vggt_recon.py      # Remote GPU reconstruction tool
β”œβ”€β”€ tools/robot/sense.py         # Enhanced with VGGT integration
└── viz/vggt_compare.py          # Comparison visualization tool

Data FlowΒΆ

  1. Image Capture (hog): Multi-camera RGB images saved to NFS

  2. GPU Processing (ook): VGGT model processes images β†’ poses + dense points

  3. Scale Alignment (ook): Align VGGT scale with AprilTag reference measurements

  4. File Storage (NFS): Results saved in COLMAP + PLY formats

  5. Visualization (any): Compare VGGT vs RealSense reconstructions (this flow is sketched in code below)
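
The flow above can be summarized in code. The following is a minimal sketch, not the actual sense.py implementation: the call_tool argument stands in for whatever MCP client function tatbot uses to dispatch a request to the GPU node, and the payload keys mirror the vggt_reconstruct input shown later on this page.

from pathlib import Path

def run_vggt_for_dataset(dataset_dir: str, call_tool) -> dict:
    # Step 1 has already saved RGB images to <dataset_dir>/images on NFS.
    dataset = Path(dataset_dir)
    payload = {
        # Only NFS paths cross the network, never raw image data.
        "image_dir": str(dataset / "images"),
        "output_pointcloud_path": str(dataset / "pointclouds" / "vggt_dense.ply"),
        "output_frustums_path": str(dataset / "metadata" / "vggt_frustums.json"),
        "output_colmap_dir": str(dataset / "colmap"),
        "scene": "default",
        "vggt_conf_threshold": 5.0,
    }
    # Steps 2-4: the GPU node (ook) runs inference and scale alignment,
    # then writes COLMAP/PLY results back to NFS.
    result = call_tool("vggt_reconstruct", payload)
    # Step 5: any node can now load vggt_dense.ply / vggt_frustums.json for comparison.
    return result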

UsageΒΆ

Enable VGGT in Sense ToolΒΆ

{
  "tool": "sense",
  "input": {
    "scene": "default",
    "enable_vggt": true,
    "vggt_conf_threshold": 5.0,
    "vggt_use_ba": false,
    "calibrate_extrinsics": true
  }
}

Parameters:

  • enable_vggt: Enable VGGT reconstruction (default: false)

  • vggt_conf_threshold: Confidence threshold for point filtering (default: 5.0)

  • vggt_use_ba: Enable bundle adjustment refinement (default: false)

  • vggt_image_count: Number of images per camera (default: 1)

Direct ReconstructionΒΆ

You can also run VGGT reconstruction directly on GPU nodes:

{
  "tool": "vggt_reconstruct",
  "input": {
    "image_dir": "/nfs/tatbot/recordings/sense-default-20241215_143022/images",
    "output_pointcloud_path": "/nfs/tatbot/recordings/sense-default-20241215_143022/pointclouds/vggt_dense.ply",
    "output_frustums_path": "/nfs/tatbot/recordings/sense-default-20241215_143022/metadata/vggt_frustums.json",
    "output_colmap_dir": "/nfs/tatbot/recordings/sense-default-20241215_143022/colmap",
    "scene": "default",
    "vggt_conf_threshold": 5.0,
    "shared_camera": false
  }
}

Comparison VisualizationΒΆ

View VGGT reconstruction alongside RealSense data:

uv run python -m tatbot.viz.vggt_compare --dataset_dir=/nfs/tatbot/recordings/sense-default-20241215_143022

Or via MCP tool:

{
  "tool": "vggt_compare_viz",
  "input": {
    "dataset_dir": "/nfs/tatbot/recordings/sense-default-20241215_143022",
    "show_vggt": true,
    "show_rs": true,
    "show_vggt_frustums": true,
    "show_apriltag_frustums": true
  }
}

OutputΒΆ

VGGT integration generates the following files in sense datasets:

/nfs/tatbot/recordings/sense-{scene}-{timestamp}/
β”œβ”€β”€ images/                      # RGB images for VGGT processing
β”‚   β”œβ”€β”€ realsense1.png
β”‚   β”œβ”€β”€ realsense2.png
β”‚   └── ipcamera1.png
β”œβ”€β”€ pointclouds/
β”‚   β”œβ”€β”€ realsense1_000000.ply    # RealSense depth point clouds
β”‚   β”œβ”€β”€ realsense2_000000.ply
β”‚   └── vggt_dense.ply           # VGGT dense reconstruction
β”œβ”€β”€ colmap/                      # COLMAP format camera data
β”‚   β”œβ”€β”€ cameras.txt              # Camera intrinsics
β”‚   β”œβ”€β”€ images.txt               # Camera poses
β”‚   └── points3D.txt             # 3D points (optional)
└── metadata/
    β”œβ”€β”€ apriltag_frustums.json   # AprilTag camera poses
    β”œβ”€β”€ vggt_frustums.json       # VGGT camera poses
    β”œβ”€β”€ vggt_confidence.npz      # Depth confidence maps
    └── metrics.json             # Reconstruction metrics

ImplementationΒΆ

Model ManagementΒΆ

VGGT uses the HuggingFace model cache for efficient storage:

  • Model: facebook/VGGT-1B (1 billion parameters)

  • Storage: ~/.cache/huggingface/ on each GPU node

  • Loading: Automatic download on first use, cached thereafter

  • Memory: ~8GB VRAM required for inference

Scale AlignmentΒΆ

VGGT outputs are scale-ambiguous and require metric alignment (a minimal sketch of the alignment follows this list):

  1. AprilTag Method (Preferred): Compares camera baseline distances between VGGT and AprilTag poses

  2. Median Baseline: Robust scale factor = median(AprilTag_baselines) / median(VGGT_baselines)

  3. Validation: Scale factors outside the 1e-3 to 1e3 range are reset to 1.0

  4. Application: Scale applied only to camera poses (not double-scaled with world points)
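
The baseline-ratio alignment can be sketched as follows, assuming camera centers are available as Nx3 arrays in each frame; the function names here are illustrative, not the real vggt_runner.py API:

import numpy as np

def _pairwise_baselines(centers):
    # Distances between every pair of camera centers (needs >= 2 cameras).
    c = np.asarray(centers, dtype=float)
    i, j = np.triu_indices(len(c), k=1)
    return np.linalg.norm(c[i] - c[j], axis=1)

def vggt_scale_factor(vggt_centers, apriltag_centers):
    scale = float(
        np.median(_pairwise_baselines(apriltag_centers))
        / np.median(_pairwise_baselines(vggt_centers))
    )
    # Validation bound from step 3 above.
    if not (1e-3 <= scale <= 1e3):
        scale = 1.0
    return scale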

Coordinate SystemsΒΆ

  • VGGT Output: Camera-from-world extrinsics (OpenCV convention)

  • Tatbot Internal: World-from-camera poses (for consistency)

  • Conversion: Handled automatically by vggt_runner.py utilities

  • COLMAP Export: Standard COLMAP format with quaternion poses

Cross-Node CommunicationΒΆ

  • Protocol: MCP (Model Context Protocol) over HTTP

  • Payload: File paths only (~1KB), not raw data

  • Timeout: 15 minutes for GPU processing

  • Retry: 2 attempts with exponential backoff

  • Fallback: Graceful degradation if VGGT fails

ConfigurationΒΆ

NodesΒΆ

  • GPU Node (ook): CUDA-capable GPU with β‰₯8GB VRAM

  • Sensor Node (hog): Camera connections (RealSense + IP cameras)

  • Storage: NFS mount for cross-node file access

CamerasΒΆ

VGGT works with any RGB cameras defined in the scene:

  • IP Cameras: Full resolution RGB frames

  • RealSense: Color stream (depth stream used separately)

  • Format: PNG images saved to images/ subdirectory

  • Naming: {camera_name}.png format for consistent processing

GPUΒΆ

Enable VGGT capability in src/conf/mcp/ook.yaml:

extras: ["gpu"]  # Enables GPU-required tools
tools:
  - vggt_reconstruct
  - convert_strokelist_to_batch

SettingsΒΆ

Optional VGGT-specific settings in src/conf/cam/vggt.yaml:

preferred_gpu_node: ook
model_path: null  # Use HuggingFace cache
resolution: 518   # VGGT internal resolution
conf_threshold: 5.0
use_ba: false     # Bundle adjustment
return_point_map: true

PerformanceΒΆ

TimingΒΆ

  • Image Loading: ~2-5 seconds for typical multi-camera setup

  • VGGT Inference: ~30-60 seconds on RTX-class GPU

  • Scale Alignment: ~1 second

  • File I/O: ~5-10 seconds for PLY/COLMAP export

  • Total: ~1-2 minutes for complete reconstruction

ResourcesΒΆ

  • GPU Memory: ~8GB VRAM during inference

  • Storage: ~50-100MB per reconstruction (dense point cloud)

  • Network: Minimal (<1KB MCP payloads)

QualityΒΆ

  • Point Density: 10x+ more points than RealSense depth

  • Pose Accuracy: Typically within 2-5mm of AprilTag calibration

  • Coverage: Full 360Β° reconstruction from multi-view images

IntegrationΒΆ

Existing Vision PipelineΒΆ

  • Complementary: Works alongside AprilTag calibration and RealSense depth

  • Validation: VGGT poses compared against AprilTag ground truth

  • Enhancement: Dense reconstruction supplements sparse RealSense data

Surface MappingΒΆ

  • Mesh Input: VGGT point clouds can be used for skin mesh reconstruction

  • Stroke Mapping: Enhanced 3D surfaces improve 2Dβ†’3D stroke mapping accuracy

  • Geodesic Tracing: Denser meshes enable more accurate surface path planning

VisualizationΒΆ

  • Comparison Tool: Side-by-side VGGT vs RealSense point cloud analysis

  • Camera Poses: Visual comparison of AprilTag vs VGGT camera frustums

  • Quality Assessment: Real-time pose accuracy and point density metrics

TroubleshootingΒΆ

IssuesΒΆ

VGGT Processing Fails:

  • Check GPU memory availability (nvidia-smi)

  • Verify HuggingFace model cache download completed

  • Ensure sufficient disk space for model (~4GB)

Scale Alignment Problems:

  • Verify AprilTag detection succeeded in images

  • Check that AprilTag and VGGT detect same cameras

  • Examine scale factor bounds in metrics.json

Cross-Node Communication:

  • Verify MCP server running on GPU node

  • Check network connectivity between hog and ook

  • Ensure NFS mount accessible from both nodes

Poor Reconstruction Quality:

  • Increase number of viewpoints (more camera positions)

  • Improve lighting conditions for better image quality

  • Adjust confidence threshold for point filtering

  • Consider bundle adjustment for pose refinement

DebuggingΒΆ

Monitor VGGT processing via logs:

# On GPU node
tail -f /nfs/tatbot/mcp-logs/ook.log

# Metrics file
cat /nfs/tatbot/recordings/{dataset}/metadata/metrics.json

Key metrics to check (a small check script is sketched after the list):

  • vggt_scale_factor: Should be reasonable (0.1 - 10.0)

  • vggt_point_count: Higher is generally better

  • mean_cam_center_err_m: Pose accuracy vs AprilTag reference

RoadmapΒΆ

  • Bundle Adjustment: Integration with pycolmap for pose refinement

  • Multi-Scale Processing: Support for different VGGT resolution settings

  • Temporal Consistency: Video-based VGGT for improved accuracy

  • Direct Mesh Output: Skip PLY intermediate and generate meshes directly

  • Real-Time Processing: Optimize for faster reconstruction cycles