Monitoring
A single wallboard on rpi1 (Raspberry Pi 5), backed by Prometheus + Grafana, shows live CPU, memory, disk, network, and GPU metrics for all nodes in one unified fleet overview dashboard, with minimal overhead and using only FOSS.
Executive Summary
We standardize on Prometheus + Grafana with per-node exporters:
Host metrics (all nodes): Prometheus Node Exporter (tiny static binary).
NVIDIA dGPU (ook/oop): DCGM Exporter (official NVIDIA).
Jetson (ojo / AGX Orin): jetson-stats (jtop)-based exporter that exposes system + GPU.
Intel Arc/iGPU (hog): Intel GPU exporter that parses intel_gpu_top -J.
Raspberry Pi SoC (rpi1, rpi2): lightweight rpi_exporter for temps/voltages/clock.
Prometheus + Grafana run centrally on eek (System76 Meerkat). rpi1 displays Grafana in kiosk mode.
Fleet & Roles
| Host | Hardware | Role(s) | Exporters |
|---|---|---|---|
| eek | System76 Meerkat (i5-1340P) | Prometheus + Grafana server; NFS | |
| ook | Acer Nitro V 15 (i7-13620H + RTX 4050) | Wi-Fi NAT; GPU compute | |
| oop | Desktop PC (RTX 3090) | GPU compute; development | |
| hog | GEEKOM GT1 Mega (Core Ultra 9 + Intel Arc) | Robot control, RealSense | |
| ojo | Jetson AGX Orin | Agent inference | jetson-stats node exporter (includes CPU/mem/GPU) |
| rpi1 | Raspberry Pi 5 (8GB) | Wallboard; app frontend | |
| rpi2 | Raspberry Pi 5 (8GB) | DNS/DHCP | |
Jetson note: the selected exporter exposes both host and GPU metrics on port :9100; do not run a separate node_exporter there to avoid port conflicts.
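The port-conflict concern in the Jetson note can be checked mechanically. The following is a minimal sketch, assuming a hand-written list of (host, exporter, port) assignments; the `ASSIGNMENTS` list and the `find_port_conflicts` helper are hypothetical and are not part of the repo.

```python
from collections import defaultdict

# Hypothetical assignments mirroring the ports named in this document.
ASSIGNMENTS = [
    ("ojo", "jetson-stats-node-exporter", 9100),
    ("ook", "node_exporter", 9100),
    ("ook", "dcgm-exporter", 9400),
    ("hog", "node_exporter", 9100),
    ("hog", "intel-gpu-exporter", 8080),
    ("rpi1", "node_exporter", 9100),
    ("rpi1", "rpi_exporter", 9110),
]


def find_port_conflicts(assignments):
    """Return {(host, port): [exporters]} for every endpoint claimed more than once."""
    by_endpoint = defaultdict(list)
    for host, exporter, port in assignments:
        by_endpoint[(host, port)].append(exporter)
    return {ep: names for ep, names in by_endpoint.items() if len(names) > 1}


if __name__ == "__main__":
    for (host, port), names in find_port_conflicts(ASSIGNMENTS).items():
        print(f"conflict on {host}:{port}: {', '.join(names)}")
```

Adding `("ojo", "node_exporter", 9100)` to the list would be reported as a conflict with the Jetson exporter, which is the situation the note warns against.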
High-Level Architecture
[ rpi1 ] ──> Chromium/Grafana Kiosk (read-only)
             http://eek:3000/d/tatbot-compute/tatbot-compute?kiosk=tv&refresh=5s

┌────────────────────────── [ eek ] ──────────────────────────┐
│ Prometheus (9090) ← scrape 15s        Grafana OSS (3000)    │
│ retention: 7d or 2GB (whichever first)                      │
└─────────────────────────────────────────────────────────────┘
       ▲              ▲             ▲            ▲             ▲
       │              │             │            │             │
 node_exporter  DCGM exporter   Intel GPU   rpi_exporter  jetson-stats
     :9100          :9400         :8080        :9110         :9100
 [eek/ook/hog]    [ook/oop]       [hog]     [rpi1/rpi2]      [ojo]
Design Decisions
Centralize Prometheus + Grafana on eek to keep compute nodes light.
Keep exporters tiny:
  Node Exporter is a single static binary; use systemd (no container overhead).
  DCGM exporter runs as a small container only on the NVIDIA hosts.
  Jetson uses a jtop-based exporter (CPU/mem/temps/GPU) with one port.
  Intel GPU exporters wrap intel_gpu_top -J and expose Prometheus metrics.
  rpi_exporter reads VC hardware directly (no vcgencmd), very low overhead.
Modest scrape interval: global.scrape_interval: 15s.
Bounded storage: --storage.tsdb.retention.time=7d and --storage.tsdb.retention.size=2GB.
Version pinning everywhere (Docker tags, PyPI packages, binary releases).
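To sanity-check the retention caps against the scrape interval, the Prometheus storage documentation gives a rough sizing rule: disk space is approximately retention seconds times ingested samples per second times bytes per sample, with roughly 1 to 2 bytes per sample on average. The sketch below applies that rule. The active series count is an illustrative assumption, not a measurement from this fleet.

```python
def estimate_tsdb_bytes(active_series, scrape_interval_s, retention_days,
                        bytes_per_sample=2.0):
    """Rough TSDB size: retention_seconds * samples_per_second * bytes_per_sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 86400
    return samples_per_second * retention_seconds * bytes_per_sample


if __name__ == "__main__":
    # Assumption: about 10,500 active series across the fleet (hypothetical figure).
    size = estimate_tsdb_bytes(active_series=10_500, scrape_interval_s=15,
                               retention_days=7)
    print(f"estimated TSDB size: {size / 1e9:.2f} GB")
```

Under these assumed numbers the estimate stays below the 2GB size cap, so the 7d time limit would normally take effect first. A much larger series count would make the size cap the binding limit instead.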
Inventory (Single Source of Truth)
Inventory lives at ~/tatbot/config/monitoring/inventory.yml (versions, scrape interval, and nodes). IPs should match src/conf/nodes.yaml. Keep them in sync and regenerate Prometheus config (see "Generate Prometheus Config").
Why versions here? This makes inventory.yml the true source of truth for both topology and versions. The agent can template Docker tags, download URLs, and PyPI requirements from these fields to produce deterministic configs and installers.
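As an illustration of templating from pinned versions, here is a minimal sketch. The `VERSIONS` mapping stands in for the versions block of inventory.yml, whose real schema is not shown in this document, so the key names and helper functions are assumptions. The version strings themselves are the ones used elsewhere on this page.

```python
# Hypothetical stand-in for the "versions" block of inventory.yml.
VERSIONS = {
    "node_exporter": "1.9.1",
    "dcgm_exporter": "4.4.0-4.5.0-ubi9",
    "jetson_stats": "4.3.2",
}


def node_exporter_url(version, arch):
    """Build the GitHub release tarball URL for a pinned node_exporter version."""
    name = f"node_exporter-{version}.linux-{arch}"
    return ("https://github.com/prometheus/node_exporter/releases/download/"
            f"v{version}/{name}.tar.gz")


def docker_image(repo, tag):
    """Build a pinned image reference such as repo:tag."""
    return f"{repo}:{tag}"


def pip_requirement(package, version):
    """Build an exact-version pip requirement string."""
    return f"{package}=={version}"


if __name__ == "__main__":
    print(node_exporter_url(VERSIONS["node_exporter"], "amd64"))
    print(docker_image("nvidia/dcgm-exporter", VERSIONS["dcgm_exporter"]))
    print(pip_requirement("jetson-stats", VERSIONS["jetson_stats"]))
```

Keeping every version in one mapping means a bump touches a single file, and the generated commands stay consistent across nodes.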
Per-Node Installation
The CLI agent commits config/service files; a human runs the commands below (root ssh). Replace versions with those from inventory.yml if you change them.
Prereqs by node
eek: Docker Engine + compose plugin (see "Docker Compose Setup (eek)").
ook/oop (NVIDIA dGPU): Docker Engine + NVIDIA Container Toolkit (see "NVIDIA dGPU (ook/oop): DCGM Exporter"); NVIDIA driver working (nvidia-smi).
hog (Intel GPU): Docker Engine (see "Intel Arc/iGPU (hog): Intel GPU Exporter").
ojo (Jetson): Python3/pip; install jetson-stats + exporter (see "Jetson (ojo): jtop/jetson-stats Exporter").
rpi1/rpi2: None beyond systemd and curl.
Common host metrics (node_exporter): eek, ook, hog, rpi1, rpi2
Quick installer (recommended):
cd ~/tatbot && git pull && sudo bash scripts/monitor/install.sh
Manual install: on each host (repo at ~/tatbot), download and install Node Exporter v1.9.1 using the tarball that matches the host architecture.
eek, ook, hog (Intel/AMD x86_64):
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.9.1/node_exporter-1.9.1.linux-amd64.tar.gz
tar -xzf node_exporter-1.9.1.linux-amd64.tar.gz -C /tmp
sudo useradd --no-create-home --shell /usr/sbin/nologin nodeexp || true
sudo install -o nodeexp -g nodeexp -m 0755 /tmp/node_exporter-1.9.1.linux-amd64/node_exporter /usr/local/bin/node_exporter
sudo install -o root -g root -m 0644 ~/tatbot/config/monitoring/exporters/$(hostname)/node_exporter.service /etc/systemd/system/
sudo systemctl daemon-reload && sudo systemctl enable --now node_exporter
curl -sS --no-progress-meter http://localhost:9100/metrics | head -n 20
rpi1, rpi2 (Raspberry Pi 5 ARM64):
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.9.1/node_exporter-1.9.1.linux-arm64.tar.gz
tar -xzf node_exporter-1.9.1.linux-arm64.tar.gz -C /tmp
sudo useradd --no-create-home --shell /usr/sbin/nologin nodeexp || true
sudo install -o nodeexp -g nodeexp -m 0755 /tmp/node_exporter-1.9.1.linux-arm64/node_exporter /usr/local/bin/node_exporter
sudo install -o root -g root -m 0644 ~/tatbot/config/monitoring/exporters/$(hostname)/node_exporter.service /etc/systemd/system/
sudo systemctl daemon-reload && sudo systemctl enable --now node_exporter
curl -sS --no-progress-meter http://localhost:9100/metrics | head -n 20
Do not install node_exporter on ojo (Jetson). It runs jetson-stats-node-exporter on :9100 (see "Jetson (ojo): jtop/jetson-stats Exporter").
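The two manual procedures differ only in the architecture suffix of the tarball. The following sketch picks that suffix from the machine type reported by Python's `platform.machine()`. The `RELEASE_ARCH` mapping and helper names are hypothetical, and only the two architectures used in this fleet are covered.

```python
import platform

# Machine strings as reported by the OS, mapped to the release suffixes used above.
RELEASE_ARCH = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def release_arch(machine=None):
    """Return the release suffix for a machine string (defaults to this host)."""
    machine = (machine or platform.machine()).lower()
    try:
        return RELEASE_ARCH[machine]
    except KeyError:
        raise ValueError(f"unsupported architecture: {machine}") from None


def tarball_name(version, machine=None):
    """Return the node_exporter tarball file name for a version and machine type."""
    return f"node_exporter-{version}.linux-{release_arch(machine)}.tar.gz"


if __name__ == "__main__":
    print(tarball_name("1.9.1"))
```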
NVIDIA dGPU (ook/oop): DCGM Exporter (container, pinned tag)
Prereqs (Docker engine + NVIDIA Container Toolkit on Ubuntu 24.04):
sudo apt-get update
sudo apt-get install -y docker.io
sudo usermod -aG docker $USER && newgrp docker
sudo systemctl enable --now docker
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Test GPU access inside container (optional):
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
Run DCGM exporter (pinned tag from inventory) on each NVIDIA host (ook, oop):
docker run -d --restart=always --gpus all --cap-add SYS_ADMIN --net host \
--name dcgm-exporter -e DCGM_EXPORTER_LISTEN=":9400" \
nvidia/dcgm-exporter:4.4.0-4.5.0-ubi9
Or use the systemd unit in the repo: ~/tatbot/config/monitoring/exporters/ook/dcgm-exporter.service (for ook) or ~/tatbot/config/monitoring/exporters/oop/dcgm-exporter.service (for oop), then:
sudo systemctl daemon-reload && sudo systemctl enable --now dcgm-exporter
Verify (ook): curl -sS --no-progress-meter http://192.168.1.90:9400/metrics | head -n 20
Verify (oop): curl -sS --no-progress-meter http://192.168.1.51:9400/metrics | head -n 20
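Beyond eyeballing the first 20 lines, the GPU utilization samples can be pulled out of the exporter's text output. The following is a minimal sketch of a parser for simple sample lines in the Prometheus text exposition format. It skips comment lines and does not handle optional timestamps specially. The sample payload is illustrative rather than captured output, and the helper names are hypothetical.

```python
import re

# metric_name, optional {labels}, then the sample value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)')


def parse_samples(text):
    """Parse exposition text into (name, raw_labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match:
            samples.append((match.group(1), match.group(2) or "",
                            float(match.group(3))))
    return samples


def gpu_utilization(text, metric="DCGM_FI_DEV_GPU_UTIL"):
    """Return (raw_labels, value) for each sample of the given metric."""
    return [(labels, value) for name, labels, value in parse_samples(text)
            if name == metric]


# Illustrative payload; real output carries more metrics and labels.
EXAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",modelName="NVIDIA GeForce RTX 4050 Laptop GPU"} 37
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 52
"""

if __name__ == "__main__":
    for labels, value in gpu_utilization(EXAMPLE):
        print(labels, value)
```

To use it against a live host, fetch http://192.168.1.90:9400/metrics (for example with `urllib.request.urlopen`) and pass the decoded body to `gpu_utilization`.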
Jetson (ojo): jtop/jetson-stats Exporter (no jtop.service dependency required)
sudo -H pip3 install "jetson-stats==4.3.2"
sudo -H pip3 install "jetson-stats-node-exporter==0.1.2"
sudo install -m 0644 ~/tatbot/config/monitoring/exporters/ojo/jetson-stats-node-exporter.service /etc/systemd/system/
sudo systemctl daemon-reload && sudo systemctl enable --now jetson-stats-node-exporter
curl -sS --no-progress-meter http://192.168.1.96:9100/metrics | head -n 20
We removed the Requires=jtop.service dependency. The exporter uses the Python API from jetson-stats directly and does not require the jtop systemd service to be running. See "Systemd Unit Files (Exporters)" for the corrected unit file.
Intel Arc/iGPU (hog): Intel GPU Exporter
hog needs Node Exporter (see "Common host metrics (node_exporter)") plus Intel GPU monitoring.
First install Docker if not present:
sudo apt-get update
sudo apt-get install -y docker.io
sudo systemctl enable --now docker
sudo usermod -aG docker $USER
# Log out and back in for group changes
Then run Intel GPU exporter (replace the tag with your pinned tag from inventory):
docker run -d --restart=always --net host --name intel-gpu-exporter --privileged \
-v /sys:/sys:ro -v /dev/dri:/dev/dri \
restreamio/intel-prometheus:latest # replace 'latest' with a pinned tag
curl -sS --no-progress-meter http://192.168.1.88:8080/metrics | head -n 20
Raspberry Pi SoC telemetry: rpi1, rpi2
Download and install rpi_exporter for ARM64:
cd /tmp
wget https://github.com/lukasmalkmus/rpi_exporter/releases/download/v0.4.0/rpi_exporter-0.4.0.linux-arm64.tar.gz
tar -xzf rpi_exporter-0.4.0.linux-arm64.tar.gz -C /tmp
sudo install -o root -g root -m 0755 /tmp/rpi_exporter /usr/local/bin/rpi_exporter
sudo install -m 0644 ~/tatbot/config/monitoring/exporters/$(hostname)/rpi_exporter.service /etc/systemd/system/
sudo systemctl daemon-reload && sudo systemctl enable --now rpi_exporter
curl -sS --no-progress-meter http://$(hostname):9110/metrics | head -n 20
rpi_exporter flag change: service files in this repo now use --web.listen-address=:9110 (older docs sometimes show -addr). If you had a prior unit with -addr, update it to --web.listen-address or copy the unit from config/monitoring/exporters/<host>/rpi_exporter.service.
Quick installer option:
cd ~/tatbot && git pull && sudo bash scripts/monitor/install.sh
Prometheus & Grafana on eek (Pinned Images)
Docker Compose Setup (eek)
Install Docker with modern compose plugin:
# Remove legacy docker-compose if installed
sudo apt-get remove -y docker-compose || true
# Add Docker's official repository
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(. /etc/os-release; echo $VERSION_CODENAME) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
# Install Docker Engine + Compose plugin
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin
sudo systemctl enable --now docker
sudo usermod -aG docker $USER && newgrp docker
# Verify
docker compose version
Start Prometheus + Grafana:
cd ~/tatbot && make -C config/monitoring up
Prometheus Config
File: ~/tatbot/config/monitoring/prometheus/prometheus.yml (generated from inventory). Alert rules: ~/tatbot/config/monitoring/prometheus/rules/.
(Optional) Starter Alerts
See: ~/tatbot/config/monitoring/prometheus/rules/edge.rules.yml.
Grafana Provisioning & Dashboards (Tatbot Compute)
Provision Prometheus Datasource
File: ~/tatbot/config/monitoring/grafana/provisioning/datasources/prometheus.yaml.
Provision Dashboards
File: ~/tatbot/config/monitoring/grafana/provisioning/dashboards/dashboards.yaml.
Included dashboards (JSON files under grafana/dashboards/)
Tatbot Compute: tatbot-compute.json (installed in repo, uid=tatbot-compute)
The Tatbot Compute dashboard is the default kiosk target and summarizes CPU, Memory, Disk, Network, and GPU across all hosts. You can drill into the detailed dashboards when needed.
Tatbot Compute Dashboard
Installed at ~/tatbot/config/monitoring/grafana/dashboards/tatbot-compute.json (uid=tatbot-compute). Intel exporter metric names may vary (igpu_*).
rpi1 Wallboard (Kiosk)
Quick Start: Monitoring Kiosk Script
Run on rpi1 (wallboard display node) to launch the monitoring dashboard in kiosk mode:
# Start monitoring kiosk (default: eek:3000, 5s refresh)
cd ~/tatbot && bash scripts/monitor/kiosk.sh
# Custom refresh interval
cd ~/tatbot && bash scripts/monitor/kiosk.sh eek 10
# Specific IP address
cd ~/tatbot && bash scripts/monitor/kiosk.sh 192.168.1.97
When to run: Once per boot, or when you want to restart the wallboard display.
Manual Kiosk URL
Point Chromium (or grafana-kiosk) at:
http://eek:3000/d/tatbot-compute/tatbot-compute?kiosk=tv&refresh=5s
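If the host or refresh interval changes, the URL can be assembled rather than edited by hand. The following is a minimal sketch using only the standard library. The `kiosk_url` helper and its defaults are hypothetical and simply reproduce the URL above.

```python
from urllib.parse import urlencode


def kiosk_url(host="eek", port=3000, uid="tatbot-compute",
              slug="tatbot-compute", refresh_s=5):
    """Build the Grafana dashboard URL with kiosk mode and auto-refresh enabled."""
    query = urlencode({"kiosk": "tv", "refresh": f"{refresh_s}s"})
    return f"http://{host}:{port}/d/{uid}/{slug}?{query}"


if __name__ == "__main__":
    print(kiosk_url())
    print(kiosk_url(host="192.168.1.97", refresh_s=10))
```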
Optional: grafana-kiosk systemd Service
Unit file: ~/tatbot/config/monitoring/exporters/rpi1/grafana-kiosk.service.
Enable: sudo systemctl daemon-reload && sudo systemctl enable --now grafana-kiosk
Systemd Unit Files (Exporters)
Node Exporter: ~/tatbot/config/monitoring/exporters/<host>/node_exporter.service
DCGM exporter (docker): ~/tatbot/config/monitoring/exporters/ook/dcgm-exporter.service, ~/tatbot/config/monitoring/exporters/oop/dcgm-exporter.service
Jetson exporter: ~/tatbot/config/monitoring/exporters/ojo/jetson-stats-node-exporter.service
Intel GPU exporter (docker): ~/tatbot/config/monitoring/exporters/hog/intel-gpu-exporter.service
rpi_exporter: ~/tatbot/config/monitoring/exporters/rpi{1,2}/rpi_exporter.service
Grafana kiosk (optional): ~/tatbot/config/monitoring/exporters/rpi1/grafana-kiosk.service
Repo Layout
config/monitoring/
├─ inventory.yml
├─ compose/
│  └─ docker-compose.yml
├─ prometheus/
│  ├─ prometheus.yml
│  └─ rules/
│     └─ edge.rules.yml
├─ grafana/
│  ├─ provisioning/
│  │  ├─ datasources/prometheus.yaml
│  │  └─ dashboards/dashboards.yaml
│  └─ dashboards/
│     └─ tatbot-compute.json
└─ exporters/
   ├─ eek/node_exporter.service
   ├─ ook/{node_exporter.service,dcgm-exporter.service}
   ├─ oop/{node_exporter.service,dcgm-exporter.service}
   ├─ hog/{node_exporter.service,intel-gpu-exporter.service}
   ├─ ojo/jetson-stats-node-exporter.service
   └─ rpi{1,2}/{node_exporter.service,rpi_exporter.service}
scripts/
└─ monitor/
   ├─ server.sh   # Start/verify monitoring stack on eek
   ├─ kiosk.sh    # Start kiosk display on rpi1
   └─ clean.sh    # Clean monitoring volumes and cache
src/tatbot/utils/
└─ gen_prom_config.py   # Generate Prometheus config from inventory
Generate Prometheus Config
Prometheus targets are generated from inventory.yml.
From repo root:
python3 src/tatbot/utils/gen_prom_config.py
Or:
make -C config/monitoring gen-prom
Output: config/monitoring/prometheus/prometheus.yml.
After changes, restart the stack on eek:
make -C config/monitoring restart
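For orientation, the generation step might look roughly like the sketch below. This is not the actual gen_prom_config.py: the inventory structure, the key names, and the `render_prometheus_yml` helper are all assumptions made for illustration. The IPs and ports are the ones used elsewhere on this page.

```python
# Hypothetical inventory shape; the real inventory.yml schema may differ.
INVENTORY = {
    "scrape_interval": "15s",
    "nodes": [
        {"name": "ook", "ip": "192.168.1.90", "exporters": {"node": 9100, "dcgm": 9400}},
        {"name": "oop", "ip": "192.168.1.51", "exporters": {"dcgm": 9400}},
        {"name": "ojo", "ip": "192.168.1.96", "exporters": {"jetson": 9100}},
    ],
}


def render_prometheus_yml(inventory):
    """Render a minimal prometheus.yml with one scrape job per exporter type."""
    jobs = {}
    for node in inventory["nodes"]:
        for job, port in node["exporters"].items():
            jobs.setdefault(job, []).append(f"{node['ip']}:{port}")

    lines = [
        "global:",
        f"  scrape_interval: {inventory['scrape_interval']}",
        "scrape_configs:",
    ]
    for job in sorted(jobs):
        lines.append(f"  - job_name: {job}")
        lines.append("    static_configs:")
        lines.append("      - targets:")
        for target in jobs[job]:
            lines.append(f"          - '{target}'")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_prometheus_yml(INVENTORY), end="")
```

Grouping targets by exporter type keeps one job per metric family, which lets dashboards filter on the job label.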
Monitoring Server Management
Single Entry Point Script
Run on eek to start/verify the complete monitoring system:
# Start and verify monitoring system
cd ~/tatbot && ./scripts/monitor/server.sh
# Restart services and verify
cd ~/tatbot && ./scripts/monitor/server.sh --restart
Cache Cleanup Script
Run on any node to clean all cached monitoring data:
# Clean cache (works on any node)
cd ~/tatbot && ./scripts/monitor/clean.sh
# For complete reset: clean + restart
cd ~/tatbot && ./scripts/monitor/clean.sh && ./scripts/monitor/server.sh
server.sh features:
Verifies it's running on eek (monitoring server host)
Optionally restarts Prometheus + Grafana containers
Performs comprehensive diagnostics on all nodes
Tests connectivity, services, and HTTP endpoints
Provides installation commands for missing exporters
Shows detailed Prometheus target status
clean.sh features:
Stops and removes Docker containers/volumes (on eek)
Clears browser cache and temp profiles
Removes log files and temporary data
Works on any node (cleans local cache)
Safe to run anytime for fresh start
Manual Verification Checklist
If running manual verification:
curl http://eek:9090/-/ready returns "Prometheus Server is Ready."
http://eek:9090/targets (in a browser) shows all targets UP; from a shell, curl http://eek:9090/api/v1/targets reports "health":"up" for every active target.
curl http://ook:9400/metrics includes DCGM_FI_DEV_GPU_UTIL.
curl http://oop:9400/metrics includes DCGM_FI_DEV_GPU_UTIL.
curl http://ojo:9100/metrics includes Jetson system + GPU metrics.
curl http://192.168.1.88:8080/metrics includes Intel igpu_* metrics.
Grafana at http://eek:3000/ shows dashboards, including Tatbot Compute.
rpi1 displays the Tatbot Compute URL with ?kiosk=tv&refresh=5s.
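The target check can also be scripted against the Prometheus HTTP API, whose /api/v1/targets endpoint returns JSON with an `activeTargets` list carrying a `health` field per target. The following is a minimal sketch. The helper names are hypothetical, and the sample payload is a trimmed illustration rather than real output.

```python
import json
from urllib.request import urlopen


def unhealthy_targets(payload):
    """Return (job, scrape_url, health) for every active target that is not up."""
    data = json.loads(payload)
    if data.get("status") != "success":
        raise ValueError(f"unexpected API status: {data.get('status')!r}")
    return [
        (target["labels"].get("job"), target["scrapeUrl"], target["health"])
        for target in data["data"]["activeTargets"]
        if target["health"] != "up"
    ]


def fetch_targets(base_url="http://eek:9090"):
    """Fetch the raw JSON body from the targets endpoint (needs network access)."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as response:
        return response.read().decode("utf-8")


# Trimmed, illustrative response body.
EXAMPLE = json.dumps({
    "status": "success",
    "data": {"activeTargets": [
        {"labels": {"job": "dcgm"},
         "scrapeUrl": "http://192.168.1.90:9400/metrics", "health": "up"},
        {"labels": {"job": "jetson"},
         "scrapeUrl": "http://192.168.1.96:9100/metrics", "health": "down"},
    ]},
})
```

On eek, `unhealthy_targets(fetch_targets())` returns an empty list when every exporter is being scraped successfully.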
Security & Operations
LAN-only exposure; firewall Prometheus (9090) and Grafana (3000) to your subnet.
Grafana allows anonymous Viewer access; disable it when not needed.
Prometheus retention is capped by both time and size to avoid disk exhaustion.
If historical retention needs grow, consider remote_write to VictoriaMetrics later.
Back up Grafana's /var/lib/grafana (provisioned dashboards are already in Git).
References (Selected)
Prometheus downloads & retention flags
https://prometheus.io/download/
https://prometheus.io/docs/prometheus/latest/storage/
https://prometheus.io/docs/prometheus/latest/migration/
Node Exporter release (v1.9.1)
https://github.com/prometheus/node_exporter/releases
NVIDIA DCGM exporter
Docs: https://docs.nvidia.com/datacenter/dcgm/latest/gpu-telemetry/dcgm-exporter.html
Tags: https://hub.docker.com/r/nvidia/dcgm-exporter/tags
Jetson Stats & exporter
jetson-stats (PyPI): https://pypi.org/project/jetson-stats/
jetson-stats-node-exporter (PyPI): https://pypi.org/project/jetson-stats-node-exporter/
Project: https://github.com/laminair/jetson_stats_node_exporter
Intel GPU exporters
restreamio/intel-prometheus: https://hub.docker.com/r/restreamio/intel-prometheus
go-intel-gpu-exporter: https://pkg.go.dev/gitlab.com/leandrosansilva/go-intel-gpu-exporter
Example metrics (igpu_*): https://github.com/onedr0p/intel-gpu-exporter
Grafana OSS (v12.x) & kiosk
Releases: https://github.com/grafana/grafana/releases
Docker: https://grafana.com/docs/grafana/latest/setup-grafana/installation/docker/
grafana-kiosk: https://github.com/grafana/grafana-kiosk
Reference dashboards
Node Exporter Full (1860): https://grafana.com/grafana/dashboards/1860
NVIDIA DCGM (12239): https://grafana.com/grafana/dashboards/12239
NVIDIA Jetson (14493 / 21727): https://grafana.com/grafana/dashboards/14493-nvidia-jetson/
Intel GPU Metrics (23251): https://grafana.com/grafana/dashboards/23251-intel-gpu-metrics/