# Cloud GPU Setup Guide for OrthoRoute
**Complete instructions for running OrthoRoute headless routing on Vast.ai or other cloud GPU providers**
**Last Updated:** November 15, 2025
---
## Step 1: Rent GPU Instance on Vast.ai
### Recommended Specifications
**For boards with <2,000 nets:**
- GPU: RTX 4090 (24 GB VRAM)
- Cost: ~$0.40/hr
- Sufficient for most boards
**For boards with 2,000-8,000 nets:**
- GPU: RTX 6000 Ada (48 GB VRAM) or A100 80GB
- Cost: ~$0.80-1.50/hr
- Needed for large backplanes
**For boards with >8,000 nets:**
- GPU: H100 80GB or A100 80GB
- Cost: ~$1.50-2.50/hr
- Maximum capacity
### On Vast.ai Website
1. Go to https://vast.ai/console/create/
2. **Filter instances:**
- GPU Type: RTX 4090, RTX 6000 Ada, or A100
- VRAM: ≥ 24 GB (48+ GB for large boards)
- Disk Space: ≥ 20 GB
- CUDA Version: 12.x or later
3. **Sort by price** ($/hr)
4. **Click "Rent"** on suitable instance
5. **Select:**
- Image: `pytorch/pytorch:latest` (has CUDA + Python pre-installed)
- Or: `nvidia/cuda:12.2.0-devel-ubuntu22.04`
6. **Click "Create"**
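If you prefer the terminal, Vast.ai also ships a `vastai` CLI that can search offers before you rent. This is only a sketch: the query fields and units below are assumptions based on the CLI's search syntax, so verify them with `vastai search offers --help` before relying on it.
```bash
# Optional: browse offers from the terminal with the vastai CLI
pip3 install vastai
vastai set api-key YOUR_API_KEY   # API key from your Vast.ai account settings
# Cheapest single-GPU RTX 4090 offers, sorted by $/hr
# (field names are assumptions - check `vastai search offers --help`)
vastai search offers 'num_gpus=1 gpu_name=RTX_4090 disk_space>=20' -o 'dph'
```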
### Get SSH Connection Info
After instance starts (30-60 seconds):
1. Click on instance in dashboard
2. Copy SSH command shown (looks like):
```bash
ssh -p 12345 root@ssh.vast.ai -L 8080:localhost:8080
```
3. Or use direct IP if shown
---
## Step 2: Connect and Setup Environment
### SSH into Instance
```bash
# Use the SSH command from Vast.ai dashboard
ssh -p 12345 root@ssh.vast.ai
```
**You should see a prompt like:**
```
root@C.27877234:~#
```
### Install System Dependencies
```bash
# Update package manager
apt-get update
# Install git and basic tools
apt-get install -y git tmux htop
# Verify CUDA is available
nvidia-smi
# Should show GPU info (e.g., RTX 4090, 24GB VRAM)
# Verify Python version
python3 --version
# Should be Python 3.8 or later
```
---
## Step 3: Clone OrthoRoute Repository
```bash
# Navigate to workspace
cd /workspace
# Clone repository
git clone https://github.com/bbenchoff/OrthoRoute.git
cd OrthoRoute
# Verify files
ls -la
# Should see: main.py, orthoroute/, logs/, etc.
```
**If using a private repository:**
```bash
# Option 1: Use HTTPS with token
git clone https://YOUR_TOKEN@github.com/YourUsername/OrthoRoute.git
# Option 2: Use SSH (need to add SSH key to GitHub first)
git clone git@github.com:YourUsername/OrthoRoute.git
```
---
## Step 4: Install Python Dependencies
### Check CUDA Version
```bash
nvcc --version
# Note the CUDA version (e.g., 12.2, 12.4, etc.)
```
### Install CuPy (GPU acceleration library)
**For CUDA 12.x:**
```bash
pip3 install cupy-cuda12x
```
**For CUDA 11.x:**
```bash
pip3 install cupy-cuda11x
```
**Verify CuPy installation:**
```bash
python3 -c "import cupy as cp; print(cp.__version__); print('GPU Available:', cp.cuda.is_available())"
# Should print: GPU Available: True
```
### Install Other Dependencies
```bash
# Install NumPy and SciPy
pip3 install numpy scipy
# Verify installations
python3 -c "import numpy; import scipy; print('NumPy:', numpy.__version__, 'SciPy:', scipy.__version__)"
```
**Complete dependency list:**
```bash
pip3 install cupy-cuda12x numpy scipy
```
**Note:** Don't install PyQt6 (GUI not needed for headless mode).
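Before committing to a long run, it's worth a one-shot sanity check that CuPy can actually allocate and compute on the GPU. A minimal sketch using only the libraries installed above; the expected sum is just known arithmetic:
```bash
python3 - <<'EOF'
# One-shot sanity check: library versions plus a tiny GPU computation
import numpy, scipy
import cupy as cp
print("NumPy:", numpy.__version__, "SciPy:", scipy.__version__, "CuPy:", cp.__version__)
print("GPU available:", cp.cuda.is_available())
# Sum 0..999 on the GPU; should print 499500.0
x = cp.arange(1000, dtype=cp.float64)
print("GPU sum check:", float(x.sum()))
EOF
```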
---
## Step 5: Upload Your ORP File
### From Your Local Machine
**Using SCP:**
```bash
# On your local machine (not on the Vast instance):
scp -P 12345 MainController.ORP root@ssh.vast.ai:/workspace/OrthoRoute/
# Replace:
# 12345 - with your actual port from Vast.ai
# MainController.ORP - with your actual ORP filename
```
**Verify upload:**
```bash
# Back on the Vast instance:
cd /workspace/OrthoRoute
ls -lh *.ORP
# Should show your ORP file
```
### Alternative: Upload to Cloud Storage First
If the ORP file is large:
```bash
# On your local machine: upload to a temporary file host, e.g.
#   curl -F "file=@MainController.ORP" https://file.io
# The response contains a one-time download URL
# On the Vast instance: download it using the URL you got back
wget https://file.io/XXXXXX -O MainController.ORP
```
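If a big upload keeps getting interrupted, `rsync` over SSH can resume instead of restarting from zero. A sketch using the same example port and filename as above (rsync must be installed on the instance: `apt-get install -y rsync`):
```bash
# On your local machine: resumable upload over SSH
rsync -avP -e "ssh -p 12345" MainController.ORP \
    root@ssh.vast.ai:/workspace/OrthoRoute/
```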
---
## Step 6: Run OrthoRoute Headless Mode
### Using tmux (Recommended - survives SSH disconnects)
```bash
# Start new tmux session
tmux new -s routing
# Inside tmux, run OrthoRoute
cd /workspace/OrthoRoute
python3 main.py headless MainController.ORP
# Detach from tmux (keeps running in background):
# Press: Ctrl+b, then d
# Later, reattach to see progress:
tmux attach -t routing
# Kill session when done:
tmux kill-session -t routing
```
### Direct Run (Simpler but dies if SSH disconnects)
```bash
cd /workspace/OrthoRoute
python3 main.py headless MainController.ORP
```
### With Options
```bash
# Increase iterations for complex boards
python3 main.py headless MainController.ORP --max-iterations 150
# Force CPU mode if GPU runs out of memory
python3 main.py headless MainController.ORP --cpu-only
# Custom output filename
python3 main.py headless MainController.ORP -o CustomName.ORS
```
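You can also launch the run already detached, so a single command does the tmux setup and survives an immediate SSH drop. A sketch combining the same tmux and `main.py` invocations shown above:
```bash
# Start routing in a detached tmux session with one command
tmux new-session -d -s routing \
    'cd /workspace/OrthoRoute && python3 main.py headless MainController.ORP'
# Confirm it's running, then attach whenever you want to watch
tmux ls
tmux attach -t routing
```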
---
## Step 7: Monitor Progress
### Watch Live Console Output
**If using tmux:**
```bash
tmux attach -t routing
```
**If running directly:**
The output is already showing in your terminal.
### Tail Log Files
```bash
# In a second SSH session or tmux pane:
cd /workspace/OrthoRoute
# Watch for warnings in the latest log file
tail -f logs/run_*.log | grep "WARNING"
# Or just iteration summaries:
tail -f logs/run_*.log | grep "ITER.*nets="
# Or with watch command:
watch -n 2 'tail -5 logs/run_*.log'
```
### Monitor GPU Usage
```bash
# Watch GPU utilization every 5 seconds
nvidia-smi -l 5
# Or with watch:
watch -n 5 nvidia-smi
```
**What to look for:**
- GPU Utilization: Should be 80-100%
- GPU Memory: Should be stable (not growing infinitely)
- Power Usage: Should be near max (e.g., 350W for RTX 4090)
### Check Disk Space
```bash
# Iteration 1 on 8K nets creates LARGE log files
df -h
# If disk getting full, you can compress or delete old logs:
gzip logs/old_run_*.log
```
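If you'd rather have one screen that shows everything, a small `watch` loop can combine the newest log line, GPU load, and disk usage. A sketch that assumes the `logs/run_*.log` naming used above:
```bash
# Combined status view, refreshed every 30 seconds
watch -n 30 '
  tail -n 1 $(ls -t logs/run_*.log | head -1)
  nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader
  df -h /workspace | tail -1
'
```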
---
## Step 8: Handle Common Issues
### Out of Memory Error
**Error:**
```
cupy.cuda.memory.OutOfMemoryError: Out of memory allocating X bytes
```
**Solutions:**
**A) Upgrade to larger GPU:**
- Kill current job: `pkill -f main.py`
- Destroy instance on Vast.ai
- Rent instance with more VRAM (48+ GB)
- Restart from Step 1
**B) Use CPU mode:**
```bash
pkill -f main.py
python3 main.py headless MainController.ORP --cpu-only
```
**C) Reduce batch size** (requires code change - not recommended)
### Process Killed / SSH Disconnected
**If you weren't using tmux:**
- Routing stopped when SSH died
- Must restart from scratch
**If you were using tmux:**
```bash
# Reconnect to Vast instance
ssh -p 12345 root@ssh.vast.ai
# Reattach to tmux session
tmux attach -t routing
# Routing should still be running!
```
### Instance Becomes Unresponsive
**If SSH hangs or times out:**
- Instance might have crashed
- Check Vast.ai dashboard - instance status
- If "stopped", you'll need to restart
- Unfortunately, routing progress is lost (no checkpointing yet)
### Logs Too Large
**8K net routing can create 10+ GB log files:**
```bash
# Check log size
du -h logs/
# Compress old logs to save space
gzip logs/run_*.log
# Or delete very old logs
rm logs/run_2025111*.log
```
---
## Step 9: Download Results
### When Routing Completes
**You'll see:**
```
================================================================================
ROUTING COMPLETE!
================================================================================
Solution file: MainController.ORS
...
```
### Download ORS File to Local Machine
**Using SCP (from your local machine):**
```bash
scp -P 12345 root@ssh.vast.ai:/workspace/OrthoRoute/MainController.ORS ./
# Replace:
# 12345 - your Vast.ai port
# MainController.ORS - your actual ORS filename
# ./ - current directory (or specify path)
```
**Using cloud storage:**
```bash
# On Vast instance: Upload to file sharing service
curl -F "file=@MainController.ORS" https://file.io
# Returns download URL
# On local machine: Download
wget https://file.io/XXXXXX -O MainController.ORS
```
**Verify file integrity:**
```bash
# On local machine, check file is valid gzip:
gzip -t MainController.ORS && echo "File OK" || echo "File corrupted"
# Check file size (should be ~500KB - 5MB):
ls -lh MainController.ORS
```
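For extra assurance that the transfer wasn't truncated, compare checksums on both ends; the two hashes must match exactly. A sketch using standard coreutils:
```bash
# On the Vast instance:
sha256sum /workspace/OrthoRoute/MainController.ORS
# On your local machine, after downloading:
sha256sum MainController.ORS
```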
---
## Step 10: Import into KiCad
**On your local machine:**
1. Open KiCad with your board
2. Launch OrthoRoute plugin
3. Press **Ctrl+I** (or File → Import Solution)
4. Select `MainController.ORS`
5. Review routing in preview
6. Click **"Apply to KiCad"** to commit traces/vias
---
## Complete Example Session
### Session Recording
```bash
# === ON LOCAL MACHINE ===
# 1. Export board
# (In KiCad OrthoRoute plugin: Ctrl+E → save MainController.ORP)
# 2. Upload to Vast
scp -P 12345 MainController.ORP root@ssh.vast.ai:/workspace/
# === ON VAST.AI INSTANCE ===
# 3. SSH in
ssh -p 12345 root@ssh.vast.ai
# 4. Setup
cd /workspace
git clone https://github.com/YourUser/OrthoRoute.git
cd OrthoRoute
pip3 install cupy-cuda12x numpy scipy
# 5. Verify GPU
nvidia-smi
python3 -c "import cupy; print('GPU:', cupy.cuda.is_available())"
# 6. Start tmux session
tmux new -s routing
# 7. Run routing
python3 main.py headless MainController.ORP
# 8. Detach from tmux (Ctrl+b, then d)
# 9. Monitor progress (optional)
tail -f logs/run_*.log | grep "ITER.*nets="
# 10. Wait for completion (check back in 4-8 hours)
# 11. Download result
exit # Exit SSH
# === BACK ON LOCAL MACHINE ===
# 12. Download ORS file
scp -P 12345 root@ssh.vast.ai:/workspace/OrthoRoute/MainController.ORS ./
# 13. Import into KiCad (Ctrl+I)
# 14. Destroy Vast instance (stop billing)
# (In Vast.ai dashboard: click Destroy)
```
---
## Cost Estimation
### Typical Costs by Board Size
**Small board (100-500 nets):**
- Time: 10-30 minutes
- GPU: RTX 4090 @ $0.40/hr
- **Cost: $0.20**
**Medium board (500-2,000 nets):**
- Time: 30 minutes - 2 hours
- GPU: RTX 4090 @ $0.40/hr
- **Cost: $0.80**
**Large board (2,000-8,000 nets):**
- Time: 4-12 hours
- GPU: RTX 6000 Ada (48GB) @ $0.80/hr
- **Cost: $6-10**
**Huge board (8,000+ nets):**
- Time: 12-24 hours
- GPU: A100 80GB @ $1.50/hr
- **Cost: $18-36**
**vs. buying RTX 4090:** ~$1,600
**Break-even:** ~40 large routing jobs (or never, if you value your time)
---
## Tips & Tricks
### 1. Use tmux ALWAYS
```bash
# Start every session with:
tmux new -s routing
# Detach: Ctrl+b, then d
# Reattach: tmux attach -t routing
```
**Why:** If SSH disconnects, routing keeps going. Saved me countless times.
### 2. Monitor Without Attaching
```bash
# See what's happening in tmux without attaching:
tmux capture-pane -t routing -p | tail -20
```
### 3. Multiple Sessions for Monitoring
```bash
# Session 1: Routing
tmux new -s routing
python3 main.py headless board.ORP
# Detach (Ctrl+b, d)
# Session 2: Monitoring
tmux new -s monitor
tail -f logs/run_*.log | grep "ITER.*nets="
# Detach (Ctrl+b, d)
# Switch between:
tmux attach -t routing
tmux attach -t monitor
```
### 4. Estimate Time Remaining
```bash
# From iteration timestamps, calculate rate:
# Example: ITER 10 at 10:30, ITER 20 at 11:45
# = 10 iterations in 75 minutes
# = 7.5 min/iteration
# If ~80 iterations are needed in total: (80 - 20) × 7.5 = 450 min = 7.5 hours
```
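A rough version of that arithmetic can be scripted against the running process. This is a sketch that assumes the `ITER ... nets=` log pattern used in the monitoring commands above and a single routing process:
```bash
# Rough minutes-per-iteration estimate from the log and process uptime
LOG=$(ls -t logs/run_*.log | head -1)
DONE=$(grep -c "ITER.*nets=" "$LOG")
PID=$(pgrep -f "main.py headless" | head -1)
SECS=$(ps -o etimes= -p "$PID")
if [ "$DONE" -gt 0 ]; then
  echo "$DONE iterations in $((SECS / 60)) min (~$((SECS / 60 / DONE)) min/iteration)"
fi
```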
### 5. Verify GPU is Being Used
```bash
# Run this DURING routing:
nvidia-smi
# Look for:
# GPU Util: 95-100%
# Memory Usage: 20-30 GB (should be high)
# Process: python3 main.py headless ...
```
**If GPU Util is 0%:** Routing is using CPU (slow!) - check CuPy installation.
### 6. Pre-test Small Board
Before routing huge board:
```bash
# Test with small ORP first:
python3 main.py headless TestBackplane.ORP
# Should complete in 20-30 min
# Verifies: GPU works, dependencies correct, no issues
```
### 7. Compress Logs to Save Disk
```bash
# While routing is running (in another terminal):
cd /workspace/OrthoRoute/logs
# Compress logs from earlier runs only - don't gzip the log the current run is still writing
find . -name 'run_*.log' -mmin +60 -exec gzip {} \;
# Or auto-compress with cron (same rule: only logs untouched for an hour):
(crontab -l; echo "*/30 * * * * find /workspace/OrthoRoute/logs -name '*.log' -mmin +60 -exec gzip {} \; 2>/dev/null") | crontab -
```
---
## Troubleshooting
### "No module named 'cupy'"
**Problem:** CuPy not installed
**Fix:**
```bash
pip3 install cupy-cuda12x
```
### "CUDA initialization failed"
**Problem:** CUDA runtime mismatch
**Fix:**
```bash
# Check CUDA version
nvcc --version
# Install matching CuPy:
# CUDA 11.x: pip3 install cupy-cuda11x
# CUDA 12.x: pip3 install cupy-cuda12x
```
### "Permission denied" when cloning repo
**Problem:** Private repository
**Fix:**
```bash
# Generate SSH key on Vast instance:
ssh-keygen -t ed25519 -C "vast-gpu"
cat ~/.ssh/id_ed25519.pub
# Copy output, add to GitHub → Settings → SSH Keys
# Or use personal access token:
git clone https://YOUR_TOKEN@github.com/user/repo.git
```
### Routing uses CPU instead of GPU
**Check:**
```bash
python3 -c "import cupy; print('Available:', cupy.cuda.is_available())"
```
**If False:**
- CuPy not installed correctly
- CUDA version mismatch
- GPU drivers not loaded
**Force GPU mode:**
```bash
python3 main.py headless board.ORP --use-gpu
```
### Instance runs out of disk space
**Check space:**
```bash
df -h
```
**If <5 GB free:**
```bash
# Compress logs
gzip logs/*.log
# Or delete old logs
rm logs/run_2025111*.log
# Or mount external storage (Vast.ai option)
```
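If it isn't obvious what is eating the disk, sorting by size usually points at the culprit. A quick sketch using standard coreutils:
```bash
# Largest files/directories under the logs folder
du -ah /workspace/OrthoRoute/logs | sort -rh | head -10
```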
### Routing takes forever on CPU
**If forced to use `--cpu-only`:**
- 8K net board could take 48-72 hours
- Consider renting bigger GPU instead
- Or reduce grid resolution in ORP file
---
## Optimization Tips
### 1. Choose Right GPU for Your Board
| Board Size | Nets | VRAM Needed | Recommended GPU | Cost/hr |
|------------|------|-------------|-----------------|---------|
| Small | <500 | 8 GB | RTX 3080 | $0.25 |
| Medium | 500-2K | 16 GB | RTX 4090 | $0.40 |
| Large | 2K-6K | 24 GB | RTX 4090 | $0.40 |
| Huge | 6K-10K | 48 GB | RTX 6000 Ada | $0.80 |
| Massive | 10K+ | 80 GB | A100 80GB | $1.50 |
### 2. Batch Multiple Boards
```bash
# Route multiple boards in one session:
python3 main.py headless Board1.ORP
python3 main.py headless Board2.ORP
python3 main.py headless Board3.ORP
# Or in parallel (if enough VRAM):
python3 main.py headless Board1.ORP &
python3 main.py headless Board2.ORP &
wait
```
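For more than a couple of boards, a simple loop saves retyping. As a sketch, this routes every `.ORP` in the directory sequentially using the same CLI shown above:
```bash
# Route every ORP file in the directory, one after another
for orp in *.ORP; do
    echo "=== Routing $orp ==="
    python3 main.py headless "$orp"
done
```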
### 3. Auto-shutdown When Done
```bash
# Add to end of routing script:
python3 main.py headless board.ORP && shutdown -h now
# Instance stops automatically when complete
# Minimizes billing
```
---
## Quick Reference Card
**Setup:**
```bash
ssh -p PORT root@ssh.vast.ai
cd /workspace
git clone https://github.com/user/OrthoRoute.git
cd OrthoRoute
pip3 install cupy-cuda12x numpy scipy
```
**Upload file:**
```bash
# From local machine:
scp -P PORT board.ORP root@ssh.vast.ai:/workspace/OrthoRoute/
```
**Run routing:**
```bash
tmux new -s routing
python3 main.py headless board.ORP
# Ctrl+b, d to detach
```
**Monitor:**
```bash
tail -f logs/run_*.log | grep "ITER.*nets="
nvidia-smi -l 5
```
**Download result:**
```bash
# From local machine:
scp -P PORT root@ssh.vast.ai:/workspace/OrthoRoute/board.ORS ./
```
**Import to KiCad:**
```
Ctrl+I → select board.ORS → Apply to KiCad
```
---
## Expected Timeline (8K Net Board)
```
00:00 - Start instance, SSH in
00:05 - Clone repo, install dependencies
00:10 - Upload ORP file (depends on internet speed)
00:15 - Start routing in tmux
02:30 - Iteration 1 completes (greedy routing)
04:00 - Iteration 20 completes
08:00 - Iteration 50 completes
12:00 - Iteration 75 completes
14:00 - Convergence! (iteration 85-95)
14:05 - Download ORS file
14:10 - Destroy instance
Total: ~14 hours runtime, ~$12-15 cost
```
---
## Vast.ai Specific Notes
### Instance States
- **Loading:** Starting up (1-2 min)
- **Running:** Active and billable
- **Stopped:** Paused (not billable, but loses data)
- **Destroyed:** Terminated (stops billing)
### Billing
- Billed per **second** of runtime
- Continues billing until you **Destroy** instance
- Check dashboard frequently when job completes
### Data Persistence
- `/workspace` directory persists across stops
- `~/.ssh`, `/tmp` do NOT persist
- **Always destroy** when done (or you keep paying)
### Port Forwarding
SSH command includes port forwarding:
```bash
ssh -p 12345 root@ssh.vast.ai -L 8080:localhost:8080
```
You can ignore the `-L 8080:localhost:8080` part for headless routing.
---
## Other Cloud Providers
### RunPod
**Similar setup:**
```bash
# SSH command from RunPod dashboard
ssh root@X.X.X.X -p 22
# Rest is identical to Vast.ai
```
**Differences:**
- Easier UI
- Slightly more expensive (~$0.50/hr for RTX 4090)
- Better reliability
- Jupyter notebook support (not needed for headless)
### Lambda Labs
**Setup:**
```bash
ssh ubuntu@instance.lambdalabs.com
sudo apt-get install python3-pip
# Rest same as Vast.ai
```
**Differences:**
- More expensive (~$1.10/hr for A100)
- Very reliable
- Better for production workloads
- Fixed pricing (no bidding)
---
## Security Notes
### Protect Your ORP Files
ORP files contain your entire board design:
- Pad positions
- Net connectivity
- Design rules
**Don't:**
- Upload to public GitHub
- Share ORP files publicly
- Leave on instance after destroying
**Do:**
- Use private repositories
- Delete ORP/ORS from instance before destroying:
```bash
rm /workspace/OrthoRoute/*.ORP
rm /workspace/OrthoRoute/*.ORS
```
- Download and backup ORS files locally
### SSH Key Security
**Generate unique key for cloud instances:**
```bash
ssh-keygen -t ed25519 -f ~/.ssh/vast_key
# Use ~/.ssh/vast_key instead of default key
# If compromised, only affects cloud instances
```
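On your local machine, a `~/.ssh/config` entry also keeps the key, port, and host in one place; the `vast` alias and port below are placeholders for your instance's values:
```
# ~/.ssh/config on your local machine
Host vast
    HostName ssh.vast.ai
    Port 12345
    User root
    IdentityFile ~/.ssh/vast_key
```
With that in place, the earlier commands shorten to `ssh vast` and `scp MainController.ORP vast:/workspace/OrthoRoute/`.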
---
## Post-Processing
### After Downloading ORS
**1. Verify file:**
```bash
ls -lh MainController.ORS
# Should be ~500KB - 5MB depending on board size
```
**2. Import to KiCad:**
- Ctrl+I in OrthoRoute plugin
- Select ORS file
- Review in preview
**3. Run DRC:**
- Check for violations
- Expect ~300-500 via barrel conflicts (known limitation)
- Zero trace-trace violations (should be clean)
**4. Manual cleanup (if needed):**
- Fix barrel conflicts by moving vias 0.1-0.2mm
- Typically 30-60 minutes for large boards
---
## FAQ
**Q: Can I close my laptop while routing?**
A: Yes, if you're using tmux! Routing continues in the cloud.
**Q: How do I know when it's done?**
A: Check tmux session or log files. Or set up email notification (advanced).
**Q: What if I run out of money mid-routing?**
A: Vast.ai stops instance, routing lost. Add credits before starting.
**Q: Can I pause and resume?**
A: Not currently. Checkpointing is a planned feature but not implemented.
**Q: GPU seems idle during routing?**
A: Check nvidia-smi. If 0%, CuPy isn't working. Use `--cpu-only` as fallback.
**Q: Can I route multiple boards in parallel?**
A: Yes, if enough VRAM. 2 small boards on 1 GPU works. Large boards need dedicated GPU.
---
**Last Updated:** November 15, 2025
**Tested On:** Vast.ai, RunPod, Lambda Labs
**GPU Tested:** RTX 4090, RTX 6000 Ada, A100 80GB
**Status:** Production-ready