VGGT
VGGT (Visual Geometry Grounded Transformer) is integrated into Tatbot to provide dense 3D reconstruction and automatic camera pose estimation from multi-view RGB images. This complements the existing AprilTag-based calibration and RealSense depth sensing.
Overview
VGGT enhances the vision system by:
Dense 3D Reconstruction: Generates high-quality point clouds from RGB-only images
Markerless Pose Estimation: Estimates camera poses without requiring AprilTags
Cross-Node GPU Processing: Runs on dedicated GPU nodes (ook) while cameras are on sensor nodes (hog)
Scale Alignment: Automatically aligns VGGT outputs with metric measurements using AprilTag references
COLMAP Integration: Exports standard COLMAP format for compatibility with external tools
Architecture
Components
src/tatbot/
├── cam/vggt_runner.py       # Core VGGT processing utilities
├── tools/gpu/vggt_recon.py  # Remote GPU reconstruction tool
├── tools/robot/sense.py     # Enhanced with VGGT integration
└── viz/vggt_compare.py      # Comparison visualization tool
Data Flow
Image Capture (hog): Multi-camera RGB images saved to NFS
GPU Processing (ook): VGGT model processes images → poses + dense points
Scale Alignment (ook): Align VGGT scale with AprilTag reference measurements
File Storage (NFS): Results saved in COLMAP + PLY formats
Visualization (any): Compare VGGT vs RealSense reconstructions
Usage
Enable VGGT in Sense Tool
{
"tool": "sense",
"input": {
"scene": "default",
"enable_vggt": true,
"vggt_conf_threshold": 5.0,
"vggt_use_ba": false,
"calibrate_extrinsics": true
}
}
Parameters:
enable_vggt: Enable VGGT reconstruction (default: false)
vggt_conf_threshold: Confidence threshold for point filtering (default: 5.0)
vggt_use_ba: Enable bundle adjustment refinement (default: false)
vggt_image_count: Number of images per camera (default: 1)
Direct Reconstruction
You can also run VGGT reconstruction directly on GPU nodes:
{
"tool": "vggt_reconstruct",
"input": {
"image_dir": "/nfs/tatbot/recordings/sense-default-20241215_143022/images",
"output_pointcloud_path": "/nfs/tatbot/recordings/sense-default-20241215_143022/pointclouds/vggt_dense.ply",
"output_frustums_path": "/nfs/tatbot/recordings/sense-default-20241215_143022/metadata/vggt_frustums.json",
"output_colmap_dir": "/nfs/tatbot/recordings/sense-default-20241215_143022/colmap",
"scene": "default",
"vggt_conf_threshold": 5.0,
"shared_camera": false
}
}
Compare Viz
View VGGT reconstruction alongside RealSense data:
uv run python -m tatbot.viz.vggt_compare --dataset_dir=/nfs/tatbot/recordings/sense-default-20241215_143022
Or via MCP tool:
{
"tool": "vggt_compare_viz",
"input": {
"dataset_dir": "/nfs/tatbot/recordings/sense-default-20241215_143022",
"show_vggt": true,
"show_rs": true,
"show_vggt_frustums": true,
"show_apriltag_frustums": true
}
}
Output
VGGT integration generates the following files in sense datasets:
/nfs/tatbot/recordings/sense-{scene}-{timestamp}/
├── images/                    # RGB images for VGGT processing
│   ├── realsense1.png
│   ├── realsense2.png
│   └── ipcamera1.png
├── pointclouds/
│   ├── realsense1_000000.ply  # RealSense depth point clouds
│   ├── realsense2_000000.ply
│   └── vggt_dense.ply         # VGGT dense reconstruction
├── colmap/                    # COLMAP format camera data
│   ├── cameras.txt            # Camera intrinsics
│   ├── images.txt             # Camera poses
│   └── points3D.txt           # 3D points (optional)
└── metadata/
    ├── apriltag_frustums.json # AprilTag camera poses
    ├── vggt_frustums.json     # VGGT camera poses
    ├── vggt_confidence.npz    # Depth confidence maps
    └── metrics.json           # Reconstruction metrics
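Because the COLMAP export uses the standard text format, external tools can consume it directly and poses are easy to inspect ad hoc. A minimal sketch of reading camera poses from the exported images.txt (the helper and the .png name check are illustrative, based on the naming convention above, not tatbot utilities):
from pathlib import Path

import numpy as np

def read_colmap_poses(images_txt: Path) -> dict[str, tuple[np.ndarray, np.ndarray]]:
    """Read (quaternion, translation) per image from a COLMAP images.txt."""
    poses = {}
    for line in images_txt.read_text().splitlines():
        fields = line.strip().split()
        # Pose lines have 10 fields ending in the image name; comment lines
        # and the alternating 2D-point lines are skipped.
        if len(fields) == 10 and fields[-1].lower().endswith(".png"):
            qwxyz = np.array([float(x) for x in fields[1:5]])  # QW QX QY QZ
            txyz = np.array([float(x) for x in fields[5:8]])   # TX TY TZ
            poses[fields[-1]] = (qwxyz, txyz)
    return poses

poses = read_colmap_poses(
    Path("/nfs/tatbot/recordings/sense-default-20241215_143022/colmap/images.txt"))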
Implementation
Model Management
VGGT uses the HuggingFace model cache for efficient storage:
Model: facebook/VGGT-1B (1 billion parameters)
Storage: ~/.cache/huggingface/ on each GPU node
Loading: Automatic download on first use, cached thereafter
Memory: ~8GB VRAM required for inference
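Loading follows the upstream vggt package's documented pattern; a minimal sketch (the import path reflects the public facebookresearch/vggt release and may differ across versions):
import torch
from vggt.models.vggt import VGGT

# Downloads to ~/.cache/huggingface/ on first use, then loads from cache.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = VGGT.from_pretrained("facebook/VGGT-1B").to(device)
model.eval()  # inference only; ~8GB VRAM at runtime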
Scale Alignment
VGGT outputs are scale-ambiguous and require metric alignment:
AprilTag Method (Preferred): Compares camera baseline distances between VGGT and AprilTag poses
Median Baseline: Robust scale factor = median(AprilTag_baselines) / median(VGGT_baselines); see the sketch after this list
Validation: Scale factors outside the 1e-3 to 1e3 range are reset to 1.0
Application: The scale is applied to camera poses only, so world points are not scaled twice
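A minimal sketch of the median-baseline rule, assuming matched (N, 3) arrays of camera centers from both pipelines (not the exact tatbot implementation):
from itertools import combinations

import numpy as np

def median_baseline_scale(apriltag_centers: np.ndarray, vggt_centers: np.ndarray) -> float:
    """Scale factor mapping VGGT's arbitrary scale onto metric AprilTag poses."""
    def baselines(centers: np.ndarray) -> np.ndarray:
        # Pairwise camera-to-camera distances for all camera pairs.
        return np.array([np.linalg.norm(centers[i] - centers[j])
                         for i, j in combinations(range(len(centers)), 2)])

    scale = float(np.median(baselines(apriltag_centers)) /
                  np.median(baselines(vggt_centers)))
    # Mirror the validation rule above: implausible factors fall back to 1.0.
    if not (1e-3 <= scale <= 1e3):
        scale = 1.0
    return scale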
Coordinate Systems
VGGT Output: Camera-from-world extrinsics (OpenCV convention)
Tatbot Internal: World-from-camera poses (for consistency)
Conversion: Handled automatically by vggt_runner.py utilities
COLMAP Export: Standard COLMAP format with quaternion poses
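The conversion itself is a rigid-transform inverse. A sketch of the math that the vggt_runner.py utilities handle (the function name here is illustrative):
import numpy as np

def world_from_camera(extrinsic_cw: np.ndarray) -> np.ndarray:
    """Invert an OpenCV camera-from-world extrinsic (3x4 or 4x4).

    For X_cam = R @ X_world + t the inverse is R_wc = R.T and
    t_wc = -R.T @ t; t_wc is the camera center in world coordinates.
    """
    R, t = extrinsic_cw[:3, :3], extrinsic_cw[:3, 3]
    pose = np.eye(4)
    pose[:3, :3] = R.T
    pose[:3, 3] = -R.T @ t
    return pose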
Cross-Node Communication
Protocol: MCP (Model Context Protocol) over HTTP
Payload: File paths only (~1KB), not raw data
Timeout: 15 minutes for GPU processing
Retry: 2 attempts with exponential backoff
Fallback: Graceful degradation if VGGT fails
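A sketch of that policy using a plain HTTP client (the endpoint URL and response shape are assumptions; tatbot's actual MCP client differs):
import time

import requests

MCP_URL = "http://ook:8000/mcp"  # hypothetical endpoint; see node config

def call_vggt_reconstruct(payload: dict, attempts: int = 2) -> dict:
    """Send a path-only payload (~1KB); images and point clouds stay on NFS."""
    for attempt in range(attempts):
        try:
            resp = requests.post(MCP_URL, json=payload, timeout=15 * 60)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == attempts - 1:
                raise  # caller degrades gracefully to RealSense-only data
            time.sleep(2 ** attempt)  # exponential backoff between attempts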
Configuration
Nodes
GPU Node (ook): CUDA-capable GPU with ≥8GB VRAM
Sensor Node (hog): Camera connections (RealSense + IP cameras)
Storage: NFS mount for cross-node file access
Cameras
VGGT works with any RGB cameras defined in the scene:
IP Cameras: Full resolution RGB frames
RealSense: Color stream (depth stream used separately)
Format: PNG images saved to the images/ subdirectory
Naming: {camera_name}.png for consistent processing
GPU
Enable VGGT capability in src/conf/mcp/ook.yaml:
extras: ["gpu"] # Enables GPU-required tools
tools:
- vggt_reconstruct
- convert_strokelist_to_batch
Settings
Optional VGGT-specific settings live in src/conf/cam/vggt.yaml:
preferred_gpu_node: ook
model_path: null # Use HuggingFace cache
resolution: 518 # VGGT internal resolution
conf_threshold: 5.0
use_ba: false # Bundle adjustment
return_point_map: true
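A sketch of reading these settings with OmegaConf (an assumption; tatbot's actual config loading may differ), falling back to the defaults above when the file is absent:
from omegaconf import OmegaConf

defaults = OmegaConf.create({
    "preferred_gpu_node": "ook",
    "model_path": None,       # null -> use the HuggingFace cache
    "resolution": 518,
    "conf_threshold": 5.0,
    "use_ba": False,
    "return_point_map": True,
})
try:
    cfg = OmegaConf.merge(defaults, OmegaConf.load("src/conf/cam/vggt.yaml"))
except FileNotFoundError:
    cfg = defaults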
Performance
Timing
Image Loading: ~2-5 seconds for typical multi-camera setup
VGGT Inference: ~30-60 seconds on RTX-class GPU
Scale Alignment: ~1 second
File I/O: ~5-10 seconds for PLY/COLMAP export
Total: ~1-2 minutes for complete reconstruction
Resources
GPU Memory: ~8GB VRAM during inference
Storage: ~50-100MB per reconstruction (dense point cloud)
Network: Minimal (<1KB MCP payloads)
Quality
Point Density: 10x+ more points than RealSense depth
Pose Accuracy: Typically within 2-5mm of AprilTag calibration (see the sketch after this list)
Coverage: Full 360° reconstruction from multi-view images
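The pose-accuracy figure corresponds to the mean_cam_center_err_m metric reported in metrics.json; a sketch of how such a number can be computed from matched camera centers (not the exact tatbot code):
import numpy as np

def mean_camera_center_error_m(apriltag_centers: np.ndarray,
                               vggt_centers: np.ndarray) -> float:
    """Mean Euclidean distance in meters between matched camera centers."""
    return float(np.linalg.norm(apriltag_centers - vggt_centers, axis=1).mean())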
Integration
Existing Vision Pipeline
Complementary: Works alongside AprilTag calibration and RealSense depth
Validation: VGGT poses compared against AprilTag ground truth
Enhancement: Dense reconstruction supplements sparse RealSense data
Surface Mapping
Mesh Input: VGGT point clouds can be used for skin mesh reconstruction
Stroke Mapping: Enhanced 3D surfaces improve 2D→3D stroke mapping accuracy
Geodesic Tracing: Denser meshes enable more accurate surface path planning
Visualization
Comparison Tool: Side-by-side VGGT vs RealSense point cloud analysis
Camera Poses: Visual comparison of AprilTag vs VGGT camera frustums
Quality Assessment: Real-time pose accuracy and point density metrics
Troubleshooting
Issues
VGGT Processing Fails:
Check GPU memory availability (nvidia-smi)
Verify the HuggingFace model download completed
Ensure sufficient disk space for the model (~4GB)
Scale Alignment Problems:
Verify AprilTag detection succeeded in images
Check that AprilTag and VGGT report poses for the same cameras
Examine scale factor bounds in metrics.json
Cross-Node Communication:
Verify MCP server running on GPU node
Check network connectivity between hog and ook
Ensure NFS mount accessible from both nodes
Poor Reconstruction Quality:
Increase number of viewpoints (more camera positions)
Improve lighting conditions for better image quality
Adjust confidence threshold for point filtering
Consider bundle adjustment for pose refinement
Debugging
Monitor VGGT processing via logs:
# On GPU node
tail -f /nfs/tatbot/mcp-logs/ook.log
# Metrics file
cat /nfs/tatbot/recordings/{dataset}/metadata/metrics.json
Key metrics to check:
vggt_scale_factor: Should be reasonable (0.1 to 10.0)
vggt_point_count: Higher is generally better
mean_cam_center_err_m: Pose accuracy vs the AprilTag reference
Roadmap
Bundle Adjustment: Integration with pycolmap for pose refinement
Multi-Scale Processing: Support for different VGGT resolution settings
Temporal Consistency: Video-based VGGT for improved accuracy
Direct Mesh Output: Skip PLY intermediate and generate meshes directly
Real-Time Processing: Optimize for faster reconstruction cycles