Troubleshooting

GPU not detected

Symptom: hwledger probe returns empty GPU list.

Diagnosis:

bash

hwledger probe --json | jq .gpus
# Returns: []

Fixes:

Check driver: nvidia-smi / rocm-smi / system_profiler SPDisplaysDataType (macOS)
Check compute capability: NVIDIA requires compute capability 3.0+ (Kepler or newer)

Verify env vars:

bash

echo $CUDA_VISIBLE_DEVICES  # Should not be empty
export CUDA_VISIBLE_DEVICES="0"  # Force GPU 0

macOS Metal: M1/M2/M3 only. Intel Macs not supported.

Metal framework missing (macOS)

Symptom: Error on macOS with M-chip: "Metal framework not found".

Fix:

bash

brew install metal-tools
# Restart Terminal

NVML library not found

Symptom: NVIDIA GPU detected but: "libnvidia-ml.so not found".

Fixes:

bash

# Linux
sudo apt-get install libnvidia-compute-XXX  # Replace XXX with CUDA version
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# macOS
brew install nvidia-cuda-toolkit

0 GB free VRAM

Symptom: hwledger plan --model mistral-7b fails: "Insufficient VRAM".

Diagnosis: Other process hogging GPU memory.

Fixes:

bash

# Check what's using VRAM
nvidia-smi  # Look for Process column

# Kill process
nvidia-smi kill PID

# Or clear all CUDA cache
nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader | \
  while read pid mem; do kill -9 $pid 2>/dev/null; done

# Clear VRAM (nuclear option)
sudo nvidia-smi --gpu-reset  # Requires driver reload

Model ingest hangs

Symptom: hwledger ingest --model mistral-7b stalls indefinitely.

Diagnosis: Network issue, HuggingFace API rate-limited, or missing git-lfs.

Fixes:

bash

# Check network
curl -I https://huggingface.co/  # Should return 200

# Install git-lfs
brew install git-lfs  # macOS
sudo apt-get install git-lfs  # Linux

# Try with explicit cache dir + verbose
hwledger ingest --model mistral-7b \
  --cache-dir /tmp/hf_cache \
  --log-level debug

Inference timeout

Symptom: hwledger run --model llama-70b input.json times out after 300 seconds.

Fixes:

Increase timeout: --timeout 600
Reduce context: --context 4096 (instead of 32K)
Reduce batch: --batch 1 (instead of 4)
Use quantization: --quant int4 to reduce memory pressure

Fleet server won't start

Symptom: hwledger server fails: "Address already in use".

Fix:

bash

# Find what's listening on port 5443
lsof -i :5443
# Kill it
kill -9 PID

# Or use different port
hwledger server --listen 0.0.0.0:5444

Agent can't reach server

Symptom: Agent heartbeat fails: "Connection refused" or "CERTIFICATE_VERIFY_FAILED".

Diagnosis:

bash

# Check connectivity
curl -v https://fleet.example.com:5443/health  # Should work with valid cert

# Check agent config
cat ~/.config/hwledger/agent.toml | grep server_addr

# Check cert
openssl s_client -connect fleet.example.com:5443 -showcerts

Fixes:

Check server is running: pgrep hwledger-server or systemctl status
Check firewall: sudo iptables -L / Security Group (AWS/Azure)
Check DNS: nslookup fleet.example.com
Check cert expiration: openssl x509 -in ~/.config/hwledger/server.cert.pem -text -noout | grep -A2 Validity

SSH fallback auth fails

Symptom: hwledger fleet register-ssh --host user@remote.box fails: "Permission denied".

Fixes:

Test SSH manually: ssh -i ~/.ssh/id_ed25519 user@remote.box nvidia-smi
Check key permissions: chmod 600 ~/.ssh/id_ed25519
Add to SSH agent: ssh-add ~/.ssh/id_ed25519
Check remote user: ensure user can run nvidia-smi without sudo

Audit verify fails

Symptom: hwledger audit --verify fails: "Hash mismatch at event N".

Diagnosis: Ledger corrupted or tampered.

Fix:

bash

# Export last known-good backup
hwledger audit --export backup.json --since "2026-04-01T00:00:00Z"

# Reset to known state (WARNING: loses recent events)
rm ~/.cache/hwledger/fleet.db
hwledger server  # Recreates empty DB

Troubleshooting ​

GPU not detected ​

Metal framework missing (macOS) ​

NVML library not found ​

0 GB free VRAM ​

Model ingest hangs ​

Inference timeout ​

Fleet server won't start ​

Agent can't reach server ​

SSH fallback auth fails ​

Audit verify fails ​

Related ​

Troubleshooting

GPU not detected

Metal framework missing (macOS)

NVML library not found

0 GB free VRAM

Model ingest hangs

Inference timeout

Fleet server won't start

Agent can't reach server

SSH fallback auth fails

Audit verify fails

Related