## Troubleshooting

### GPU not detected

Symptom: `hwledger probe` returns an empty GPU list.

Diagnosis:

```bash
hwledger probe --json | jq .gpus
# Returns: []
```

Fixes:
- Check the driver: `nvidia-smi` (NVIDIA), `rocm-smi` (AMD), or `system_profiler SPDisplaysDataType` (macOS)
- Check compute capability: NVIDIA requires compute capability 3.0+ (Kepler or newer)
- Verify environment variables:

  ```bash
  echo $CUDA_VISIBLE_DEVICES        # Should not be empty
  export CUDA_VISIBLE_DEVICES="0"   # Force GPU 0
  ```

- macOS Metal: M1/M2/M3 only; Intel Macs are not supported.
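Before digging further, it can help to confirm which vendor tool is even installed. A minimal sketch (the `detect_gpu_tool` helper is ours, not part of hwledger):

```bash
# Print the first GPU management tool found on PATH, in the order the
# fixes above mention them; prints "none" if nothing is installed.
detect_gpu_tool() {
  for tool in nvidia-smi rocm-smi system_profiler; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      return 0
    fi
  done
  echo "none"
}

detect_gpu_tool
```

If this prints `none`, the empty probe result is a driver-installation problem, not an hwledger problem.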
### Metal framework missing (macOS)

Symptom: Error on macOS with an M-series chip: "Metal framework not found".

Fix:

```bash
brew install metal-tools
# Restart Terminal
```

### NVML library not found

Symptom: NVIDIA GPU detected, but: "libnvidia-ml.so not found".
Fixes:

```bash
# Linux
sudo apt-get install libnvidia-compute-XXX   # Replace XXX with the driver branch (e.g. 535)
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# macOS
brew install nvidia-cuda-toolkit
```

### 0 GB free VRAM
Symptom: `hwledger plan --model mistral-7b` fails: "Insufficient VRAM".

Diagnosis: Another process is hogging GPU memory.
Fixes:

```bash
# Check what's using VRAM
nvidia-smi   # Look for the Processes table

# Kill a specific process
kill -9 PID

# Or kill every process holding CUDA memory
nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader | \
  while IFS=', ' read -r pid mem; do kill -9 "$pid" 2>/dev/null; done

# Clear VRAM (nuclear option)
sudo nvidia-smi --gpu-reset   # Requires driver reload
```

### Model ingest hangs
Symptom: `hwledger ingest --model mistral-7b` stalls indefinitely.

Diagnosis: Network issue, HuggingFace API rate limiting, or missing git-lfs.
Fixes:

```bash
# Check network
curl -I https://huggingface.co/   # Should return 200 or a redirect

# Install git-lfs and activate its hooks
brew install git-lfs           # macOS
sudo apt-get install git-lfs   # Linux
git lfs install

# Retry with an explicit cache dir + verbose logging
hwledger ingest --model mistral-7b \
  --cache-dir /tmp/hf_cache \
  --log-level debug
```

### Inference timeout
Symptom: `hwledger run --model llama-70b input.json` times out after 300 seconds.

Fixes:

- Increase the timeout: `--timeout 600`
- Reduce context: `--context 4096` (instead of 32K)
- Reduce the batch size: `--batch 1` (instead of 4)
- Use quantization: `--quant int4` to reduce memory pressure
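Taken together, the fixes above might look like the following invocation (flag values are illustrative starting points; the command is echoed for review rather than executed):

```bash
# Print the combined mitigation command; drop the `echo` to run it.
echo hwledger run --model llama-70b input.json \
  --timeout 600 \
  --context 4096 \
  --batch 1 \
  --quant int4
```

Apply one flag at a time if you want to know which constraint was actually the bottleneck.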
### Fleet server won't start

Symptom: `hwledger server` fails: "Address already in use".

Fix:

```bash
# Find what's listening on port 5443
lsof -i :5443

# Kill it
kill -9 PID

# Or use a different port
hwledger server --listen 0.0.0.0:5444
```

### Agent can't reach server
Symptom: Agent heartbeat fails: "Connection refused" or "CERTIFICATE_VERIFY_FAILED".

Diagnosis:

```bash
# Check connectivity
curl -v https://fleet.example.com:5443/health   # Should succeed with a valid cert

# Check agent config
grep server_addr ~/.config/hwledger/agent.toml

# Check cert
openssl s_client -connect fleet.example.com:5443 -showcerts
```

Fixes:
- Check the server is running: `pgrep hwledger-server` or `systemctl status`
- Check the firewall: `sudo iptables -L` / security group rules (AWS/Azure)
- Check DNS: `nslookup fleet.example.com`
- Check cert expiration: `openssl x509 -in ~/.config/hwledger/server.cert.pem -text -noout | grep -A2 Validity`
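The cert-expiration check can be scripted with openssl's `-checkend` flag. A sketch (the `check_cert_expiry` helper is hypothetical, not an hwledger command; the cert path is the default from above):

```bash
# Warn when a PEM certificate is close to expiry.
# $1 = cert path, $2 = threshold in days.
check_cert_expiry() {
  cert="$1"; days="$2"
  # -checkend exits 0 iff the cert is still valid that many seconds from now
  if openssl x509 -in "$cert" -checkend $((days * 86400)) -noout >/dev/null 2>&1; then
    echo "ok: valid for at least $days more days"
  else
    echo "warn: expires within $days days (or cert unreadable)"
  fi
}

check_cert_expiry ~/.config/hwledger/server.cert.pem 7
```

Running this from cron gives early warning before agents start failing with `CERTIFICATE_VERIFY_FAILED`.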
### SSH fallback auth fails

Symptom: `hwledger fleet register-ssh --host user@remote.box` fails: "Permission denied".

Fixes:
- Test SSH manually: `ssh -i ~/.ssh/id_ed25519 user@remote.box nvidia-smi`
- Check key permissions: `chmod 600 ~/.ssh/id_ed25519`
- Add the key to the SSH agent: `ssh-add ~/.ssh/id_ed25519`
- Check the remote user: ensure they can run `nvidia-smi` without sudo
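The key-permission check is easy to automate. A sketch assuming GNU `stat` (the `check_key_perms` helper is ours, not part of hwledger; on macOS/BSD use `stat -f '%Lp'` instead of `stat -c '%a'`):

```bash
# Verify a private key has mode 600 before registering over SSH.
check_key_perms() {
  perms=$(stat -c '%a' "$1" 2>/dev/null)   # GNU stat
  if [ "$perms" = "600" ]; then
    echo "ok: $1 is 600"
  else
    echo "warn: $1 is ${perms:-missing}; run: chmod 600 $1"
  fi
}

check_key_perms ~/.ssh/id_ed25519
```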
### Audit verify fails

Symptom: `hwledger audit --verify` fails: "Hash mismatch at event N".

Diagnosis: The ledger is corrupted or has been tampered with.

Fix:

```bash
# Export the last known-good events
hwledger audit --export backup.json --since "2026-04-01T00:00:00Z"

# Reset to a clean state (WARNING: loses recent events)
rm ~/.cache/hwledger/fleet.db
hwledger server   # Recreates an empty DB
```
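For intuition about why one corrupted event fails verification from that point onward, here is an illustrative hash chain in the spirit of what `audit --verify` checks (the event strings and `|` separator are made up; the real ledger format is hwledger-internal):

```bash
# Each event's hash covers the previous hash, so altering any event
# changes every hash after it -- a "hash mismatch at event N".
prev="genesis"
for event in "probe gpu0" "ingest mistral-7b" "run job-42"; do
  prev=$(printf '%s|%s' "$prev" "$event" | sha256sum | cut -d' ' -f1)
  printf '%s -> %s\n' "$event" "$prev"
done
```

This is why the only safe recoveries are restoring from a backup or resetting the database: a broken link cannot be repaired in place without recomputing every hash after it.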