🎓 VLA Policy Training Guide for Tatbot Robot¶
This guide provides comprehensive documentation for finetuning Vision-Language-Action (VLA) policies, evaluating training with WandB, and performing inference on the Tatbot robot with RealSense cameras using the LeRobot framework.
Overview¶
The LeRobot framework supports multiple Vision-Language-Action policies that can be trained on robotic manipulation tasks and deployed on real hardware. This guide focuses on two main VLA policies:
SmolVLA: A lightweight vision-language-action model designed for affordable and efficient robotics
π0 (Pi0): A vision-language-action flow model for general robot control
Available VLA Policies¶
SmolVLA¶
Paper: https://arxiv.org/abs/2506.01844
Location: src/lerobot/policies/smolvla/
Main Files:
modeling_smolvla.py: Model implementation
configuration_smolvla.py: Configuration class
smolvlm_with_expert.py: VLM with expert module
π0 (Pi0)¶
Paper: https://www.physicalintelligence.company/download/pi0.pdf
Location: src/lerobot/policies/pi0/
Main Files:
modeling_pi0.py: Model implementation
configuration_pi0.py: Configuration class
paligemma_with_expert.py: PaliGemma with expert module
Environment Setup¶
Install Dependencies¶
# Basic installation
pip install -e .
# SmolVLA specific dependencies (includes transformers, accelerate, safetensors)
pip install -e ".[smolvla]"
# Pi0 specific dependencies (includes transformers)
pip install -e ".[pi0]"
# For Tatbot robot support
pip install -e ".[tatbot]"
# For RealSense camera support
pip install -e ".[intelrealsense]"
# WandB logging is already included in the base installation
# Complete installation for Tatbot with VLA policies
pip install -e ".[tatbot,intelrealsense,smolvla,pi0]"
Dataset Preparation¶
Using Existing Datasets¶
LeRobot provides access to various datasets through HuggingFace Hub:
from lerobot.datasets.lerobot_dataset import LeRobotDataset
# Load a dataset
dataset = LeRobotDataset("lerobot/aloha_sim_insertion_human")
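A loaded LeRobotDataset indexes like a standard PyTorch dataset, so a quick inspection shows what each frame contains (key names vary per dataset):
# Each item is a dict of tensors; print the available keys and shapes
print(f"{len(dataset)} frames across {dataset.num_episodes} episodes")
sample = dataset[0]
for key, value in sample.items():
    if hasattr(value, "shape"):
        print(key, tuple(value.shape))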
Creating Custom Datasets for Tatbot¶
For the Tatbot robot, you’ll need to record data with proper camera configuration:
# Dataset recording configuration for Tatbot
delta_timestamps = {
    "observation.image": [-0.1, 0.0],  # Previous and current frame
    "observation.state": [-0.1, 0.0],  # Previous and current state
    "action": [-0.1, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4],
}
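These offsets (in seconds) are consumed when constructing the dataset; each returned frame then stacks the requested past and future timesteps. A minimal sketch, assuming a recorded repo named your_username/tatbot_task:
from lerobot.datasets.lerobot_dataset import LeRobotDataset

# Frames now include stacked observations and a 16-step action horizon
dataset = LeRobotDataset(
    "your_username/tatbot_task",
    delta_timestamps=delta_timestamps,
)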
Training VLA Policies¶
Training SmolVLA¶
From Scratch¶
lerobot-train \
--policy.type=smolvla \
--dataset.repo_id=your_dataset_repo \
--batch_size=64 \
--steps=200000 \
--wandb.enable=true \
--wandb.project=tatbot_smolvla \
--output_dir=outputs/train/smolvla_tatbot
From Pretrained Model¶
lerobot-train \
--policy.path=lerobot/smolvla_base \
--dataset.repo_id=your_dataset_repo \
--batch_size=64 \
--steps=100000 \
--wandb.enable=true \
--wandb.project=tatbot_smolvla_finetune
Training π0 (Pi0)¶
From Scratch¶
lerobot-train \
--policy.type=pi0 \
--dataset.repo_id=your_dataset_repo \
--batch_size=32 \
--steps=200000 \
--wandb.enable=true \
--wandb.project=tatbot_pi0 \
--output_dir=outputs/train/pi0_tatbot
From Pretrained Model¶
lerobot-train \
--policy.path=lerobot/pi0 \
--dataset.repo_id=your_dataset_repo \
--batch_size=32 \
--steps=100000 \
--wandb.enable=true \
--wandb.project=tatbot_pi0_finetune
Key Training Configuration Parameters¶
SmolVLA Configuration (configuration_smolvla.py)¶
# Key parameters
n_obs_steps: int = 1
chunk_size: int = 50
n_action_steps: int = 50
max_state_dim: int = 32
max_action_dim: int = 32
resize_imgs_with_padding: tuple = (512, 512)
# Training settings
optimizer_lr: float = 1e-4
optimizer_grad_clip_norm: float = 10
scheduler_warmup_steps: int = 1_000
scheduler_decay_steps: int = 30_000
# Model settings
vlm_model_name: str = "HuggingFaceTB/SmolVLM2-500M-Video-Instruct"
freeze_vision_encoder: bool = True
train_expert_only: bool = True
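Any of these fields can be overridden from the command line instead of editing the file; lerobot-train exposes them under the --policy.<field> prefix (the values below are illustrative):
lerobot-train \
--policy.type=smolvla \
--dataset.repo_id=your_dataset_repo \
--policy.optimizer_lr=5e-5 \
--policy.freeze_vision_encoder=true \
--policy.train_expert_only=true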
Pi0 Configuration (configuration_pi0.py)¶
# Key parameters (similar structure to SmolVLA)
n_obs_steps: int = 1
chunk_size: int = 50
n_action_steps: int = 50
Evaluation with WandB¶
WandB Setup¶
The training script automatically integrates with WandB through src/lerobot/utils/wandb_utils.py:
# Key WandB configuration in training
--wandb.enable=true \
--wandb.project=your_project_name \
--wandb.entity=your_wandb_entity \
--wandb.notes="Training notes" \
--wandb.mode=online # or offline for local logging
Tracked Metrics¶
The following metrics are automatically logged to WandB:
Training Metrics: loss, gradient norm, learning rate, update speed
Evaluation Metrics: success rate, reward sum, evaluation speed
System Metrics: GPU utilization, memory usage
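Beyond the live dashboard, logged metrics can be pulled back for offline analysis with the public WandB API (a short sketch; substitute your own entity and project):
import wandb

# Query runs from the training project and read their summary metrics
api = wandb.Api()
for run in api.runs("your_wandb_entity/tatbot_smolvla"):
    print(run.name, run.state, run.summary.get("loss"))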
Evaluation Script Usage¶
lerobot-eval \
--policy.path=outputs/train/smolvla_tatbot/checkpoints/last/pretrained_model \
--env.type=tatbot \
--eval.batch_size=10 \
--eval.n_episodes=10 \
--device=cuda
Robot Inference on Tatbot¶
Tatbot Configuration¶
The Tatbot robot configuration is defined in src/lerobot/robots/tatbot/:
Key Components (tatbot.py)¶
Dual arm setup with left and right arms
RealSense camera integration
IP camera support
Thread pool executor for parallel operations
Configuration Structure (config_tatbot.py)¶
@dataclass
class TatbotConfig(RobotConfig):
    rs_cameras: dict[str, CameraConfig]  # RealSense cameras
    ip_cameras: dict[str, CameraConfig]  # IP cameras
    ip_address_l: str  # Left arm IP
    ip_address_r: str  # Right arm IP
    arm_l_config_filepath: str  # Left arm YAML config
    arm_r_config_filepath: str  # Right arm YAML config
    home_pos_l: list[float]  # Left arm home position
    home_pos_r: list[float]  # Right arm home position
    goal_time: float  # Default travel time
    connection_timeout: float  # Connection timeout
RealSense Camera Setup¶
RealSense cameras are configured in src/lerobot/cameras/realsense/:
from lerobot.cameras.realsense import RealSenseCamera, RealSenseCameraConfig
from lerobot.cameras import ColorMode, Cv2Rotation
# Configure RealSense camera
config = RealSenseCameraConfig(
    serial_number_or_name="your_camera_serial",
    fps=30,
    width=1280,
    height=720,
    color_mode=ColorMode.BGR,
    rotation=Cv2Rotation.NO_ROTATION,
    use_depth=True,  # Enable depth capture
)
camera = RealSenseCamera(config)
camera.connect()
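A quick smoke test after connecting, assuming the read()/disconnect() methods shared by LeRobot camera classes:
# Grab one color frame to verify the stream
frame = camera.read()
print(frame.shape)  # expected (720, 1280, 3) for the config above

camera.disconnect()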
Inference Script¶
import torch
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy
# OR for Pi0:
# from lerobot.policies.pi0.modeling_pi0 import PI0Policy
from lerobot.robots.tatbot.tatbot import Tatbot
from lerobot.robots.tatbot.config_tatbot import TatbotConfig
from lerobot.cameras.realsense import RealSenseCameraConfig
# Load trained policy (SmolVLA example)
policy = SmolVLAPolicy.from_pretrained("outputs/train/smolvla_tatbot/checkpoints/last/pretrained_model")
# OR for Pi0:
# policy = PI0Policy.from_pretrained("outputs/train/pi0_tatbot/checkpoints/last/pretrained_model")
policy.eval()
policy.to("cuda")
# Initialize Tatbot
config = TatbotConfig(
    rs_cameras={
        "cam_left": RealSenseCameraConfig(serial_number_or_name="left_serial"),
        "cam_right": RealSenseCameraConfig(serial_number_or_name="right_serial"),
    },
    ip_address_l="192.168.1.10",
    ip_address_r="192.168.1.11",
    # ... other config parameters
)
robot = Tatbot(config)
robot.connect()
# Main inference loop
try:
    while True:
        # Get observations from robot
        observation = robot.get_observation()
        # Get action from policy
        with torch.no_grad():
            action = policy.select_action(observation)
        # Execute action on robot
        robot.send_action(action)
except KeyboardInterrupt:
    robot.disconnect()
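Depending on the robot class, get_observation() may return numpy arrays keyed by feature name, while the policy expects batched torch tensors on its device. A hypothetical adapter for that case (names and conversions are illustrative, not part of the Tatbot API):
import torch

def to_policy_inputs(observation: dict, device: str = "cuda") -> dict:
    """Illustrative helper: batch raw observations and move them to the GPU."""
    batch = {}
    for key, value in observation.items():
        tensor = torch.as_tensor(value)
        if tensor.dtype == torch.uint8:
            # Images: HWC uint8 -> CHW float in [0, 1]
            tensor = tensor.permute(2, 0, 1).float() / 255.0
        batch[key] = tensor.unsqueeze(0).to(device)  # add batch dimension
    return batch
With such a helper, the loop above would call policy.select_action(to_policy_inputs(observation)) instead.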
Code Examples¶
Complete Training Pipeline¶
# train_vla_tatbot.py
from lerobot.configs.train import TrainPipelineConfig
from lerobot.scripts.train import train

# Configure training
config = TrainPipelineConfig(
    policy_type="smolvla",
    dataset_repo_id="your_tatbot_dataset",
    output_dir="outputs/train/tatbot_vla",
    steps=100000,
    batch_size=32,
    eval_freq=5000,
    save_freq=10000,
    log_freq=100,
    wandb_enable=True,
    wandb_project="tatbot_vla_training",
)

# Run training
train(config)
Custom Dataset Recording¶
# record_tatbot_dataset.py
from lerobot.robots.tatbot.tatbot import Tatbot
from lerobot.datasets.lerobot_dataset import LeRobotDataset

# `config` is a TatbotConfig, as shown in the inference example above
robot = Tatbot(config)
robot.connect()

# Record episodes
dataset = LeRobotDataset.create(
    repo_id="your_username/tatbot_task",
    fps=30,
    robot=robot,
)
# Record data...
dataset.push_to_hub()
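Once pushed, the recording can be reloaded like any hub dataset and verified before pointing lerobot-train at it via --dataset.repo_id=your_username/tatbot_task:
from lerobot.datasets.lerobot_dataset import LeRobotDataset

# Reload and verify the recorded dataset
dataset = LeRobotDataset("your_username/tatbot_task")
print(f"{dataset.num_episodes} episodes, {len(dataset)} frames")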
Troubleshooting¶
Common Issues and Solutions¶
RealSense Camera Not Found
# Find available cameras
lerobot-find-cameras realsense
CUDA Out of Memory
Reduce batch_size
Enable gradient accumulation
Use mixed precision training with --policy.use_amp=true
Slow Data Loading
Increase number of dataloader workers
Use local dataset cache
Optimize image preprocessing
WandB Connection Issues
Use offline mode: --wandb.mode=offline
Sync later with: wandb sync outputs/train/your_run
Robot Connection Timeout
Check network connectivity
Verify IP addresses in config
Increase connection_timeout parameter
Additional Resources¶
Key Files for Reference¶
Training Script: src/lerobot/scripts/train.py
Evaluation Script: src/lerobot/scripts/eval.py
WandB Utils: src/lerobot/utils/wandb_utils.py
Tatbot Robot: src/lerobot/robots/tatbot/tatbot.py
RealSense Camera: src/lerobot/cameras/realsense/camera_realsense.py
SmolVLA Policy: src/lerobot/policies/smolvla/modeling_smolvla.py
Pi0 Policy: src/lerobot/policies/pi0/modeling_pi0.py
Documentation References¶
Training with Script: examples/4_train_policy_with_script.md
Policy README files in respective directories
Camera configuration guides in docs/source/cameras.mdx
Citation¶
If you use SmolVLA:
@article{shukor2025smolvla,
  title={SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics},
  author={Shukor, Mustafa and others},
  journal={arXiv preprint arXiv:2506.01844},
  year={2025}
}
For π0: Refer to Physical Intelligence paper at https://www.physicalintelligence.company/download/pi0.pdf