🎓 VLA Policy Training Guide for Tatbot Robot

This guide provides comprehensive documentation for finetuning Vision-Language-Action (VLA) policies, evaluating training with WandB, and performing inference on the Tatbot robot with RealSense cameras using the LeRobot framework.

Table of Contents

  1. Overview

  2. Available VLA Policies

  3. Environment Setup

  4. Dataset Preparation

  5. Training VLA Policies

  6. Evaluation with WandB

  7. Robot Inference on Tatbot

  8. Code Examples

  9. Troubleshooting

Overview

The LeRobot framework supports multiple Vision-Language-Action policies that can be trained on robotic manipulation tasks and deployed on real hardware. This guide focuses on two main VLA policies:

  • SmolVLA: A lightweight vision-language-action model designed for affordable, efficient robotics

  • π0 (Pi0): A vision-language-action flow model for general robot control

Available VLA Policies

SmolVLA

  • Paper: https://arxiv.org/abs/2506.01844

  • Location: src/lerobot/policies/smolvla/

  • Main Files:

    • modeling_smolvla.py: Model implementation

    • configuration_smolvla.py: Configuration class

    • smolvlm_with_expert.py: VLM with expert module

π0 (Pi0)

  • Paper: https://www.physicalintelligence.company/download/pi0.pdf

  • Location: src/lerobot/policies/pi0/

  • Main Files:

    • modeling_pi0.py: Model implementation

    • configuration_pi0.py: Configuration class

    • paligemma_with_expert.py: PaliGemma with expert module

Environment Setup

Install Dependencies

# Basic installation
pip install -e .

# SmolVLA specific dependencies (includes transformers, accelerate, safetensors)
pip install -e ".[smolvla]"

# Pi0 specific dependencies (includes transformers)
pip install -e ".[pi0]"

# For Tatbot robot support
pip install -e ".[tatbot]"

# For RealSense camera support
pip install -e ".[intelrealsense]"

# WandB logging is included in the base installation; no extra dependency is needed

# Complete installation for Tatbot with VLA policies
pip install -e ".[tatbot,intelrealsense,smolvla,pi0]"

Dataset Preparation

Using Existing Datasets

LeRobot provides access to various datasets through HuggingFace Hub:

from lerobot.datasets.lerobot_dataset import LeRobotDataset

# Load a dataset
dataset = LeRobotDataset("lerobot/aloha_sim_insertion_human")

Creating Custom Datasets for Tatbot

For the Tatbot robot, you’ll need to record your own demonstrations. The delta_timestamps below define the temporal window of each training sample: which past frames and states are stacked with the current observation, and how many future actions form the supervision target:

# Temporal window (delta_timestamps) for Tatbot training samples
delta_timestamps = {
    "observation.image": [-0.1, 0.0],  # Previous and current frame
    "observation.state": [-0.1, 0.0],  # Previous and current state
    "action": [-0.1, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4],
}
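
These delta_timestamps are passed when loading the dataset, so every sample comes back with the corresponding stack of frames, states, and future actions already assembled (the repo_id below is a placeholder):

from lerobot.datasets.lerobot_dataset import LeRobotDataset

# Each sample will contain the frames/states/actions at the offsets above
dataset = LeRobotDataset(
    "your_username/tatbot_task",
    delta_timestamps=delta_timestamps,
)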

Training VLA Policies

Training SmolVLA

From Scratch

lerobot-train \
    --policy.type=smolvla \
    --dataset.repo_id=your_dataset_repo \
    --batch_size=64 \
    --steps=200000 \
    --wandb.enable=true \
    --wandb.project=tatbot_smolvla \
    --output_dir=outputs/train/smolvla_tatbot

From Pretrained Model

lerobot-train \
    --policy.path=lerobot/smolvla_base \
    --dataset.repo_id=your_dataset_repo \
    --batch_size=64 \
    --steps=100000 \
    --wandb.enable=true \
    --wandb.project=tatbot_smolvla_finetune

Training π0 (Pi0)

From Scratch

lerobot-train \
    --policy.type=pi0 \
    --dataset.repo_id=your_dataset_repo \
    --batch_size=32 \
    --steps=200000 \
    --wandb.enable=true \
    --wandb.project=tatbot_pi0 \
    --output_dir=outputs/train/pi0_tatbot

From Pretrained Model

lerobot-train \
    --policy.path=lerobot/pi0 \
    --dataset.repo_id=your_dataset_repo \
    --batch_size=32 \
    --steps=100000 \
    --wandb.enable=true \
    --wandb.project=tatbot_pi0_finetune

Key Training Configuration Parameters

SmolVLA Configuration (configuration_smolvla.py)

# Key parameters
n_obs_steps: int = 1
chunk_size: int = 50
n_action_steps: int = 50
max_state_dim: int = 32
max_action_dim: int = 32
resize_imgs_with_padding: tuple = (512, 512)

# Training settings
optimizer_lr: float = 1e-4
optimizer_grad_clip_norm: float = 10
scheduler_warmup_steps: int = 1_000
scheduler_decay_steps: int = 30_000

# Model settings
vlm_model_name: str = "HuggingFaceTB/SmolVLM2-500M-Video-Instruct"
freeze_vision_encoder: bool = True
train_expert_only: bool = True
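
The same fields can be overridden programmatically; a minimal sketch constructing the config in Python, using the field names from configuration_smolvla.py above:

from lerobot.policies.smolvla.configuration_smolvla import SmolVLAConfig

# Override only what differs from the defaults listed above
config = SmolVLAConfig(
    chunk_size=50,
    n_action_steps=50,
    freeze_vision_encoder=True,
    train_expert_only=True,
)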

Pi0 Configuration (configuration_pi0.py)

# Key parameters (similar structure to SmolVLA)
n_obs_steps: int = 1
chunk_size: int = 50
n_action_steps: int = 50

Evaluation with WandB

WandB Setup

The training script automatically integrates with WandB through src/lerobot/utils/wandb_utils.py:

# Key WandB configuration in training
--wandb.enable=true \
--wandb.project=your_project_name \
--wandb.entity=your_wandb_entity \
--wandb.notes="Training notes" \
--wandb.mode=online  # or offline for local logging
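
Authentication is required once per machine before metrics can be uploaded; either run wandb login in a shell, or use the equivalent Python API:

import wandb

# Prompts for an API key, or reads WANDB_API_KEY from the environment
wandb.login()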

Tracked Metrics

The following metrics are automatically logged to WandB:

  • Training Metrics: loss, gradient norm, learning rate, update speed

  • Evaluation Metrics: success rate, reward sum, evaluation speed

  • System Metrics: GPU utilization, memory usage
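
After training, the standard WandB public API can pull these metrics for programmatic comparison across runs; a sketch in which the entity/project/run path and the exact metric key are placeholders that depend on your setup:

import wandb

api = wandb.Api()
run = api.run("your_wandb_entity/tatbot_smolvla/run_id")  # placeholder run path

# Returns a pandas DataFrame of the logged scalars
history = run.history(keys=["train/loss"])  # metric key depends on logger naming
print(history.tail())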

Evaluation Script Usage

lerobot-eval \
    --policy.path=outputs/train/smolvla_tatbot/checkpoints/last/pretrained_model \
    --env.type=tatbot \
    --eval.batch_size=10 \
    --eval.n_episodes=10 \
    --policy.device=cuda

Robot Inference on Tatbot

Tatbot Configuration

The Tatbot robot configuration is defined in src/lerobot/robots/tatbot/:

Key Components (tatbot.py)

  • Dual arm setup with left and right arms

  • RealSense camera integration

  • IP camera support

  • Thread pool executor for parallel operations

Configuration Structure (config_tatbot.py)

@dataclass
class TatbotConfig(RobotConfig):
    rs_cameras: dict[str, CameraConfig]  # RealSense cameras
    ip_cameras: dict[str, CameraConfig]  # IP cameras
    ip_address_l: str  # Left arm IP
    ip_address_r: str  # Right arm IP
    arm_l_config_filepath: str  # Left arm YAML config
    arm_r_config_filepath: str  # Right arm YAML config
    home_pos_l: list[float]  # Left arm home position
    home_pos_r: list[float]  # Right arm home position
    goal_time: float  # Default travel time
    connection_timeout: float  # Connection timeout

RealSense Camera Setup

RealSense cameras are configured in src/lerobot/cameras/realsense/:

from lerobot.cameras.realsense import RealSenseCamera, RealSenseCameraConfig
from lerobot.cameras import ColorMode, Cv2Rotation

# Configure RealSense camera
config = RealSenseCameraConfig(
    serial_number_or_name="your_camera_serial",
    fps=30,
    width=1280,
    height=720,
    color_mode=ColorMode.BGR,
    rotation=Cv2Rotation.NO_ROTATION,
    use_depth=True  # Enable depth capture
)

camera = RealSenseCamera(config)
camera.connect()
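
Once connected, frames can be grabbed synchronously with read() or in the background with async_read(); a minimal usage sketch:

# Grab one color frame as a numpy array (BGR, per the config above)
frame = camera.read()
print(frame.shape)  # e.g. (720, 1280, 3)

camera.disconnect()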

Inference Script

import torch
from lerobot.cameras.realsense import RealSenseCameraConfig
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy
# OR for Pi0:
# from lerobot.policies.pi0.modeling_pi0 import PI0Policy
from lerobot.robots.tatbot.tatbot import Tatbot
from lerobot.robots.tatbot.config_tatbot import TatbotConfig

# Load trained policy (SmolVLA example)
policy = SmolVLAPolicy.from_pretrained("outputs/train/smolvla_tatbot/checkpoints/last/pretrained_model")
# OR for Pi0:
# policy = PI0Policy.from_pretrained("outputs/train/pi0_tatbot/checkpoints/last/pretrained_model")
policy.eval()
policy.to("cuda")

# Initialize Tatbot
config = TatbotConfig(
    rs_cameras={
        "cam_left": RealSenseCameraConfig(serial_number_or_name="left_serial"),
        "cam_right": RealSenseCameraConfig(serial_number_or_name="right_serial")
    },
    ip_address_l="192.168.1.10",
    ip_address_r="192.168.1.11",
    # ... other config parameters
)

robot = Tatbot(config)
robot.connect()

# Main inference loop
policy.reset()  # clear the policy's internal action queue before a new episode
try:
    while True:
        # Get observations from robot (camera images and joint states)
        observation = robot.get_observation()

        # Convert to batched tensors on the policy device (see helper below)
        observation = prepare_observation(observation)

        # Get action from policy
        with torch.no_grad():
            action = policy.select_action(observation)

        # Execute action on robot
        robot.send_action(action)

finally:
    # Always release the arms and cameras, even on errors or Ctrl+C
    robot.disconnect()
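
robot.get_observation() returns numpy arrays, while select_action() expects batched torch tensors on the policy's device. The helper referenced above is a hypothetical sketch, not part of LeRobot; the key names and image layout depend on your robot and dataset features:

import numpy as np

def prepare_observation(obs: dict, device: str = "cuda") -> dict:
    """Convert a raw observation dict of numpy arrays to batched torch tensors."""
    prepared = {}
    for key, value in obs.items():
        tensor = torch.from_numpy(np.asarray(value)).float()
        if "image" in key:
            # HWC uint8 -> CHW float in [0, 1]
            tensor = tensor.permute(2, 0, 1) / 255.0
        prepared[key] = tensor.unsqueeze(0).to(device)  # add batch dimension
    return prepared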

Code Examples

Complete Training Pipeline

# train_vla_tatbot.py
# Note: TrainPipelineConfig is built from nested dataclasses (dataset, policy,
# wandb), not flat keyword arguments; field names below follow
# src/lerobot/configs/train.py and src/lerobot/configs/default.py.
from lerobot.configs.default import DatasetConfig, WandBConfig
from lerobot.configs.train import TrainPipelineConfig
from lerobot.policies.smolvla.configuration_smolvla import SmolVLAConfig
from lerobot.scripts.train import train

# Configure training
config = TrainPipelineConfig(
    policy=SmolVLAConfig(),
    dataset=DatasetConfig(repo_id="your_tatbot_dataset"),
    output_dir="outputs/train/tatbot_vla",
    steps=100_000,
    batch_size=32,
    eval_freq=5_000,
    save_freq=10_000,
    log_freq=100,
    wandb=WandBConfig(enable=True, project="tatbot_vla_training"),
)

# Run training (the lerobot-train CLI shown earlier is the supported entry
# point; calling train() directly is a sketch and may require bypassing the
# CLI argument parser in some LeRobot versions)
train(config)

Custom Dataset Recording

# record_tatbot_dataset.py
from lerobot.robots.tatbot.tatbot import Tatbot
from lerobot.datasets.lerobot_dataset import LeRobotDataset

robot = Tatbot(config)  # config: a TatbotConfig, as in the inference example above
robot.connect()

# Record episodes
# Create a new dataset (in recent LeRobot versions, create() takes an explicit
# features dict describing the observation/action schema instead of a robot)
dataset = LeRobotDataset.create(
    repo_id="your_username/tatbot_task",
    fps=30,
    robot=robot
)

# Record data...
dataset.push_to_hub()
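
The elided recording step pairs each observation with the executed action before an episode is flushed; a hedged sketch of one episode, where episode_length and teleop_device are hypothetical and the add_frame/save_episode signatures vary across LeRobot versions:

# Illustrative episode loop
for _ in range(episode_length):
    observation = robot.get_observation()
    action = teleop_device.get_action()  # hypothetical teleoperation source
    robot.send_action(action)
    dataset.add_frame({**observation, "action": action, "task": "tatbot_task"})

dataset.save_episode()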

Troubleshooting

Common Issues and Solutions

  1. RealSense Camera Not Found

    # Find available cameras
    lerobot-find-cameras realsense
    
  2. CUDA Out of Memory

    • Reduce batch_size

    • Enable gradient accumulation

    • Use mixed precision training with --policy.use_amp=true

  3. Slow Data Loading

    • Increase number of dataloader workers

    • Use local dataset cache

    • Optimize image preprocessing

  4. WandB Connection Issues

    • Use offline mode: --wandb.mode=offline

    • Sync later with: wandb sync outputs/train/your_run

  5. Robot Connection Timeout

    • Check network connectivity

    • Verify IP addresses in config

    • Increase connection_timeout parameter

Additional Resources

Key Files for Reference

  • Training Script: src/lerobot/scripts/train.py

  • Evaluation Script: src/lerobot/scripts/eval.py

  • WandB Utils: src/lerobot/utils/wandb_utils.py

  • Tatbot Robot: src/lerobot/robots/tatbot/tatbot.py

  • RealSense Camera: src/lerobot/cameras/realsense/camera_realsense.py

  • SmolVLA Policy: src/lerobot/policies/smolvla/modeling_smolvla.py

  • Pi0 Policy: src/lerobot/policies/pi0/modeling_pi0.py

Documentation References

  • Training with Script: examples/4_train_policy_with_script.md

  • Policy README files in respective directories

  • Camera configuration guides in docs/source/cameras.mdx

Citation

If you use SmolVLA:

@article{shukor2025smolvla,
  title={SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics},
  author={Shukor, Mustafa and others},
  journal={arXiv preprint arXiv:2506.01844},
  year={2025}
}

For π0: Refer to Physical Intelligence paper at https://www.physicalintelligence.company/download/pi0.pdf