
Cloud GPU Setup Guide for OrthoRoute

Complete instructions for running OrthoRoute's headless router on Vast.ai or other cloud GPU providers.

Last Updated: November 15, 2025


Step 1: Rent GPU Instance on Vast.ai

For boards with <2,000 nets:

  • GPU: RTX 4090 (24 GB VRAM)
  • Cost: ~$0.40/hr
  • Sufficient for most boards

For boards with 2,000-8,000 nets:

  • GPU: RTX 6000 Ada (48 GB VRAM) or A100 80GB
  • Cost: ~$0.80-1.50/hr
  • Needed for large backplanes

For boards with >8,000 nets:

  • GPU: H100 80GB or A100 80GB
  • Cost: ~$1.50-2.50/hr
  • Maximum capacity

On Vast.ai Website

  1. Go to https://vast.ai/console/create/
  2. Filter instances:
    • GPU Type: RTX 4090, RTX 6000 Ada, or A100
    • VRAM: ≥ 24 GB (48+ GB for large boards)
    • Disk Space: ≥ 20 GB
    • CUDA Version: 12.x or later
  3. Sort by price ($/hr)
  4. Click "Rent" on suitable instance
  5. Select:
    • Image: pytorch/pytorch:latest (has CUDA + Python pre-installed)
    • Or: nvidia/cuda:12.2.0-devel-ubuntu22.04
  6. Click "Create"

Get SSH Connection Info

After the instance starts (usually within 1-2 minutes):

  1. Click on instance in dashboard
  2. Copy SSH command shown (looks like):
    ssh -p 12345 root@ssh.vast.ai -L 8080:localhost:8080
    
  3. Or use direct IP if shown

Step 2: Connect and Setup Environment

SSH into Instance

# Use the SSH command from Vast.ai dashboard
ssh -p 12345 root@ssh.vast.ai

You should see a prompt like:

root@C.27877234:~#

Install System Dependencies

# Update package manager
apt-get update

# Install git and basic tools
apt-get install -y git tmux htop

# Verify CUDA is available
nvidia-smi
# Should show GPU info (e.g., RTX 4090, 24GB VRAM)

# Verify Python version
python3 --version
# Should be Python 3.8 or later

Step 3: Clone OrthoRoute Repository

# Navigate to workspace
cd /workspace

# Clone repository
git clone https://github.com/bbenchoff/OrthoRoute.git
cd OrthoRoute

# Verify files
ls -la
# Should see: main.py, orthoroute/, logs/, etc.

If using a private repository:

# Option 1: Use HTTPS with token
git clone https://YOUR_TOKEN@github.com/YourUsername/OrthoRoute.git

# Option 2: Use SSH (need to add SSH key to GitHub first)
git clone git@github.com:YourUsername/OrthoRoute.git

Step 4: Install Python Dependencies

Check CUDA Version

nvcc --version
# Note the CUDA version (e.g., 12.2, 12.4, etc.)

Install CuPy (GPU acceleration library)

For CUDA 12.x:

pip3 install cupy-cuda12x

For CUDA 11.x:

pip3 install cupy-cuda11x

Verify CuPy installation:

python3 -c "import cupy as cp; print(cp.__version__); print('GPU Available:', cp.cuda.is_available())"
# Should print: GPU Available: True

Install Other Dependencies

# Install NumPy and SciPy
pip3 install numpy scipy

# Verify installations
python3 -c "import numpy; import scipy; print('NumPy:', numpy.__version__, 'SciPy:', scipy.__version__)"

Complete dependency list:

pip3 install cupy-cuda12x numpy scipy

Note: Don't install PyQt6 (GUI not needed for headless mode).
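
If you would rather not match the CuPy wheel to the CUDA version by hand, the minimal sketch below picks it automatically. It assumes nvcc is available (as in the check above) and that only the CUDA 11.x/12.x cases matter here:

# Minimal sketch: install the CuPy wheel matching the detected CUDA major version
CUDA_MAJOR=$(nvcc --version | sed -n 's/.*release \([0-9]*\)\..*/\1/p')
if [ "$CUDA_MAJOR" = "12" ]; then
    pip3 install cupy-cuda12x numpy scipy
elif [ "$CUDA_MAJOR" = "11" ]; then
    pip3 install cupy-cuda11x numpy scipy
else
    echo "Unexpected CUDA version: $CUDA_MAJOR - install CuPy manually"
fi

# Confirm the GPU is visible to CuPy before committing to a long run
python3 -c "import cupy as cp; print('GPU Available:', cp.cuda.is_available())"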


Step 5: Upload Your ORP File

From Your Local Machine

Using SCP:

# On your local machine (not on the Vast instance):
scp -P 12345 MainController.ORP root@ssh.vast.ai:/workspace/OrthoRoute/

# Replace:
#   12345 - with your actual port from Vast.ai
#   MainController.ORP - with your actual ORP filename

Verify upload:

# Back on the Vast instance:
cd /workspace/OrthoRoute
ls -lh *.ORP
# Should show your ORP file

Alternative: Upload to Cloud Storage First

If ORP file is large:

# On local machine: Upload to temporary host
# curl -F "file=@MainController.ORP" https://file.io
# Gets back a URL

# On Vast instance: Download
wget https://file.io/XXXXXX -O MainController.ORP
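
For large ORP files over a flaky connection, rsync can resume a partial upload instead of starting over. A minimal sketch, run from your local machine with your own port and filename substituted:

# Resumable upload with progress; re-run the same command to pick up where it left off
rsync -avP -e "ssh -p 12345" MainController.ORP root@ssh.vast.ai:/workspace/OrthoRoute/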

Step 6: Run OrthoRoute Headless Mode

# Start new tmux session
tmux new -s routing

# Inside tmux, run OrthoRoute
cd /workspace/OrthoRoute
python3 main.py headless MainController.ORP

# Detach from tmux (keeps running in background):
# Press: Ctrl+b, then d

# Later, reattach to see progress:
tmux attach -t routing

# Kill session when done:
tmux kill-session -t routing
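
If you prefer to launch the job already detached (no manual Ctrl+b, d), tmux can start the session in the background. A minimal sketch, assuming the same paths and filename as above:

# Start the routing job in a detached tmux session in one shot
tmux new-session -d -s routing \
    'cd /workspace/OrthoRoute && python3 main.py headless MainController.ORP'

# Confirm it is running
tmux ls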

Direct Run (Simpler but dies if SSH disconnects)

cd /workspace/OrthoRoute
python3 main.py headless MainController.ORP

With Options

# Increase iterations for complex boards
python3 main.py headless MainController.ORP --max-iterations 150

# Force CPU mode if GPU runs out of memory
python3 main.py headless MainController.ORP --cpu-only

# Custom output filename
python3 main.py headless MainController.ORP -o CustomName.ORS

Step 7: Monitor Progress

Watch Live Console Output

If using tmux:

tmux attach -t routing

If running directly: output is already showing in your terminal.

Tail Log Files

# In a second SSH session or tmux pane:
cd /workspace/OrthoRoute

# Watch latest log file
tail -f logs/run_*.log | grep "WARNING"

# Or just iteration summaries:
tail -f logs/run_*.log | grep "ITER.*nets="

# Or with watch command:
watch -n 2 'tail -5 logs/run_*.log'

Monitor GPU Usage

# Watch GPU utilization every 5 seconds
nvidia-smi -l 5

# Or with watch:
watch -n 5 nvidia-smi

What to look for:

  • GPU Utilization: Should be 80-100%
  • GPU Memory: Should be stable (not growing infinitely)
  • Power Usage: Should be near max (e.g., 350W for RTX 4090)
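
To log those numbers over time instead of watching them interactively, nvidia-smi's query mode works well. A minimal sketch, run in a second tmux pane or SSH session:

# Append one CSV line of GPU stats every 30 seconds
while true; do
    nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,power.draw \
               --format=csv,noheader >> gpu_stats.csv
    sleep 30
done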

Check Disk Space

# Iteration 1 on 8K nets creates LARGE log files
df -h

# If disk getting full, you can compress or delete old logs:
gzip logs/old_run_*.log

Step 8: Handle Common Issues

Out of Memory Error

Error:

cupy.cuda.memory.OutOfMemoryError: Out of memory allocating X bytes

Solutions:

A) Upgrade to larger GPU:

  • Kill current job: pkill -f main.py
  • Destroy instance on Vast.ai
  • Rent instance with more VRAM (48+ GB)
  • Restart from Step 1

B) Use CPU mode:

pkill -f main.py
python3 main.py headless MainController.ORP --cpu-only

C) Reduce batch size (requires code change - not recommended)
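
Before re-launching after an out-of-memory error, it can help to confirm how much VRAM is actually free, since a crashed run can keep memory pinned until its process fully exits. A minimal sketch:

# Make sure no stale routing process is still holding GPU memory
pkill -f main.py
sleep 5

# Report free VRAM in MiB; if this is far below the card's capacity, something is still running
nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits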

Process Killed / SSH Disconnected

If you weren't using tmux:

  • Routing stopped when SSH died
  • Must restart from scratch

If you were using tmux:

# Reconnect to Vast instance
ssh -p 12345 root@ssh.vast.ai

# Reattach to tmux session
tmux attach -t routing

# Routing should still be running!

Instance Becomes Unresponsive

If SSH hangs or times out:

  • Instance might have crashed
  • Check Vast.ai dashboard - instance status
  • If "stopped", you'll need to restart
  • Unfortunately, routing progress lost (no checkpointing yet)

Logs Too Large

8K net routing can create 10+ GB log files:

# Check log size
du -h logs/

# Compress old logs to save space
gzip logs/run_*.log

# Or delete very old logs
rm logs/run_2025111*.log

Step 9: Download Results

When Routing Completes

You'll see:

================================================================================
ROUTING COMPLETE!
================================================================================
Solution file: MainController.ORS
...

Download ORS File to Local Machine

Using SCP (from your local machine):

scp -P 12345 root@ssh.vast.ai:/workspace/OrthoRoute/MainController.ORS ./

# Replace:
#   12345 - your Vast.ai port
#   MainController.ORS - your actual ORS filename
#   ./ - current directory (or specify path)

Using cloud storage:

# On Vast instance: Upload to file sharing service
curl -F "file=@MainController.ORS" https://file.io
# Returns download URL

# On local machine: Download
wget https://file.io/XXXXXX -O MainController.ORS

Verify file integrity:

# On local machine, check file is valid gzip:
gzip -t MainController.ORS && echo "File OK" || echo "File corrupted"

# Check file size (should be ~500KB - 5MB):
ls -lh MainController.ORS
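
If you want a stronger check than file size, comparing checksums on both ends confirms the transfer was not truncated. A minimal sketch:

# On the Vast instance, before downloading:
sha256sum /workspace/OrthoRoute/MainController.ORS

# On your local machine, after downloading:
sha256sum MainController.ORS
# The two hashes should match exactly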

Step 10: Import into KiCad

On your local machine:

  1. Open KiCad with your board
  2. Launch OrthoRoute plugin
  3. Press Ctrl+I (or File → Import Solution)
  4. Select MainController.ORS
  5. Review routing in preview
  6. Click "Apply to KiCad" to commit traces/vias

Complete Example Session

Session Recording

# === ON LOCAL MACHINE ===

# 1. Export board
# (In KiCad OrthoRoute plugin: Ctrl+E → save MainController.ORP)

# 2. Upload to Vast
scp -P 12345 MainController.ORP root@ssh.vast.ai:/workspace/

# === ON VAST.AI INSTANCE ===

# 3. SSH in
ssh -p 12345 root@ssh.vast.ai

# 4. Setup
cd /workspace
git clone https://github.com/YourUser/OrthoRoute.git
cd OrthoRoute
mv ../MainController.ORP .   # bring the uploaded ORP into the repo directory
pip3 install cupy-cuda12x numpy scipy

# 5. Verify GPU
nvidia-smi
python3 -c "import cupy; print('GPU:', cupy.cuda.is_available())"

# 6. Start tmux session
tmux new -s routing

# 7. Run routing
python3 main.py headless MainController.ORP

# 8. Detach from tmux (Ctrl+b, then d)

# 9. Monitor progress (optional)
tail -f logs/run_*.log | grep "ITER.*nets="

# 10. Wait for completion (check back in 4-8 hours)

# 11. Download result
exit  # Exit SSH

# === BACK ON LOCAL MACHINE ===

# 12. Download ORS file
scp -P 12345 root@ssh.vast.ai:/workspace/OrthoRoute/MainController.ORS ./

# 13. Import into KiCad (Ctrl+I)

# 14. Destroy Vast instance (stop billing)
# (In Vast.ai dashboard: click Destroy)

Cost Estimation

Typical Costs by Board Size

Small board (100-500 nets):

  • Time: 10-30 minutes
  • GPU: RTX 4090 @ $0.40/hr
  • Cost: $0.20

Medium board (500-2,000 nets):

  • Time: 30 minutes - 2 hours
  • GPU: RTX 4090 @ $0.40/hr
  • Cost: $0.80

Large board (2,000-8,000 nets):

  • Time: 4-12 hours
  • GPU: RTX 6000 Ada (48GB) @ $0.80/hr
  • Cost: $6-10

Huge board (8,000+ nets):

  • Time: 12-24 hours
  • GPU: A100 80GB @ $1.50/hr
  • Cost: $18-36

vs. buying RTX 4090: ~$1,600

Break-even: somewhere between ~45 of the biggest routing jobs and a couple hundred large ones (or never, if you value your time)


Tips & Tricks

1. Use tmux ALWAYS

# Start every session with:
tmux new -s routing

# Detach: Ctrl+b, then d
# Reattach: tmux attach -t routing

Why: If SSH disconnects, routing keeps going. Saved me countless times.

2. Monitor Without Attaching

# See what's happening in tmux without attaching:
tmux capture-pane -t routing -p | tail -20

3. Multiple Sessions for Monitoring

# Window 1: Routing
tmux new -s routing
python3 main.py headless board.ORP

# Detach (Ctrl+b, d)

# Window 2: Monitoring
tmux new -s monitor
tail -f logs/run_*.log | grep "ITER.*nets="

# Detach (Ctrl+b, d)

# Switch between:
tmux attach -t routing
tmux attach -t monitor

4. Estimate Time Remaining

# From iteration timestamps, calculate rate:
# Example: ITER 10 at 10:30, ITER 20 at 11:45
# = 10 iterations in 75 minutes
# = 7.5 min/iteration
# If need 80 iterations total: (80-20) × 7.5 = 450 min = 7.5 hours
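
The same arithmetic as a tiny shell helper, so you don't have to redo it by hand. The three arguments are iterations completed, elapsed minutes, and total iterations expected:

# Usage: eta <iterations_done> <elapsed_minutes> <total_iterations_expected>
eta() {
    awk -v d="$1" -v m="$2" -v t="$3" \
        'BEGIN { r = m / d; printf "~%.1f min/iteration, ~%.1f hours remaining\n", r, (t - d) * r / 60 }'
}

eta 20 150 80   # e.g. 20 iterations done in 150 minutes, expecting 80 total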

5. Verify GPU is Being Used

# Run this DURING routing:
nvidia-smi

# Look for:
#   GPU Util: 95-100%
#   Memory Usage: 20-30 GB (should be high)
#   Process: python3 main.py headless ...

If GPU Util is 0%: Routing is using CPU (slow!) - check CuPy installation.

6. Pre-test Small Board

Before routing huge board:

# Test with small ORP first:
python3 main.py headless TestBackplane.ORP

# Should complete in 20-30 min
# Verifies: GPU works, dependencies correct, no issues

7. Compress Logs to Save Disk

# While routing is running (in another terminal):
cd /workspace/OrthoRoute/logs
gzip run_2025*.log  # Compress old logs

# Or auto-compress with cron (note: the glob will also catch the log of the
# currently running job, so only leave this enabled between runs):
(crontab -l; echo "*/30 * * * * gzip /workspace/OrthoRoute/logs/*.log 2>/dev/null") | crontab -

Troubleshooting

"No module named 'cupy'"

Problem: CuPy not installed

Fix:

pip3 install cupy-cuda12x

"CUDA initialization failed"

Problem: CUDA runtime mismatch

Fix:

# Check CUDA version
nvcc --version

# Install matching CuPy:
# CUDA 11.x: pip3 install cupy-cuda11x
# CUDA 12.x: pip3 install cupy-cuda12x

"Permission denied" when cloning repo

Problem: Private repository

Fix:

# Generate SSH key on Vast instance:
ssh-keygen -t ed25519 -C "vast-gpu"
cat ~/.ssh/id_ed25519.pub
# Copy output, add to GitHub → Settings → SSH Keys

# Or use personal access token:
git clone https://YOUR_TOKEN@github.com/user/repo.git

Routing uses CPU instead of GPU

Check:

python3 -c "import cupy; print('Available:', cupy.cuda.is_available())"

If False:

  • CuPy not installed correctly
  • CUDA version mismatch
  • GPU drivers not loaded

Force GPU mode:

python3 main.py headless board.ORP --use-gpu

Instance runs out of disk space

Check space:

df -h

If <5 GB free:

# Compress logs
gzip logs/*.log

# Or delete old logs
rm logs/run_2025111*.log

# Or mount external storage (Vast.ai option)

Routing takes forever on CPU

If forced to use --cpu-only:

  • 8K net board could take 48-72 hours
  • Consider renting bigger GPU instead
  • Or reduce grid resolution in ORP file

Optimization Tips

1. Choose Right GPU for Your Board

Board Size   Nets      VRAM Needed   Recommended GPU   Cost/hr
Small        <500      8 GB          RTX 3080          $0.25
Medium       500-2K    16 GB         RTX 4090          $0.40
Large        2K-6K     24 GB         RTX 4090          $0.40
Huge         6K-10K    48 GB         RTX 6000 Ada      $0.80
Massive      10K+      80 GB         A100 80GB         $1.50

2. Batch Multiple Boards

# Route multiple boards in one session:
python3 main.py headless Board1.ORP
python3 main.py headless Board2.ORP
python3 main.py headless Board3.ORP

# Or in parallel (if enough VRAM):
python3 main.py headless Board1.ORP &
python3 main.py headless Board2.ORP &
wait
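
A loop version of the sequential case lets a queue of boards run unattended overnight. A minimal sketch, assuming all the ORP files sit in the repo directory (the per-board console log name is just an example):

# Route every ORP in the directory, one after another, capturing each run's console output
for board in *.ORP; do
    echo "=== Routing $board ==="
    python3 main.py headless "$board" 2>&1 | tee "${board%.ORP}_console.log"
done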

3. Auto-shutdown When Done

# Add to end of routing script:
python3 main.py headless board.ORP && shutdown -h now

# Instance stops automatically when complete
# Minimizes billing
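
If you want the instance to stop even when routing fails (so a crash overnight doesn't keep billing), a minimal sketch that halts either way, with a short grace window and a scratch file recording the exit code:

# Run routing, note how it exited, then shut down regardless of success
python3 main.py headless board.ORP
echo "Routing exited with code $?" > routing_exit_code.txt
sleep 300   # 5 minutes to reattach and cancel the shutdown if you happen to be watching
shutdown -h now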

Quick Reference Card

Setup:

ssh -p PORT root@ssh.vast.ai
cd /workspace
git clone https://github.com/user/OrthoRoute.git
cd OrthoRoute
pip3 install cupy-cuda12x numpy scipy

Upload file:

# From local machine:
scp -P PORT board.ORP root@ssh.vast.ai:/workspace/OrthoRoute/

Run routing:

tmux new -s routing
python3 main.py headless board.ORP
# Ctrl+b, d to detach

Monitor:

tail -f logs/run_*.log | grep "ITER.*nets="
nvidia-smi -l 5

Download result:

# From local machine:
scp -P PORT root@ssh.vast.ai:/workspace/OrthoRoute/board.ORS ./

Import to KiCad:

Ctrl+I → select board.ORS → Apply to KiCad

Expected Timeline (8K Net Board)

00:00 - Start instance, SSH in
00:05 - Clone repo, install dependencies
00:10 - Upload ORP file (depends on internet speed)
00:15 - Start routing in tmux
02:30 - Iteration 1 completes (greedy routing)
04:00 - Iteration 20 completes
08:00 - Iteration 50 completes
12:00 - Iteration 75 completes
14:00 - Convergence! (iteration 85-95)
14:05 - Download ORS file
14:10 - Destroy instance

Total: ~14 hours runtime, ~$12-15 cost

Vast.ai Specific Notes

Instance States

  • Loading: Starting up (1-2 min)
  • Running: Active and billable
  • Stopped: Paused (GPU billing stops, though storage may still be billed; only /workspace data survives, see Data Persistence below)
  • Destroyed: Terminated (stops billing)

Billing

  • Billed per second of runtime
  • Continues billing until you Destroy instance
  • Check dashboard frequently when job completes

Data Persistence

  • /workspace directory persists across stops
  • ~/.ssh, /tmp do NOT persist
  • Always destroy when done (or you keep paying)

Port Forwarding

SSH command includes port forwarding:

ssh -p 12345 root@ssh.vast.ai -L 8080:localhost:8080

You can ignore the -L 8080:localhost:8080 part for headless routing.


Other Cloud Providers

RunPod

Similar setup:

# SSH command from RunPod dashboard
ssh root@X.X.X.X -p 22

# Rest is identical to Vast.ai

Differences:

  • Easier UI
  • Slightly more expensive (~$0.50/hr for RTX 4090)
  • Better reliability
  • Jupyter notebook support (not needed for headless)

Lambda Labs

Setup:

ssh ubuntu@instance.lambdalabs.com
sudo apt-get install python3-pip
# Rest same as Vast.ai

Differences:

  • More expensive (~$1.10/hr for A100)
  • Very reliable
  • Better for production workloads
  • Fixed pricing (no bidding)

Security Notes

Protect Your ORP Files

ORP files contain your entire board design:

  • Pad positions
  • Net connectivity
  • Design rules

Don't:

  • Upload to public GitHub
  • Share ORP files publicly
  • Leave on instance after destroying

Do:

  • Use private repositories
  • Delete ORP/ORS from instance before destroying:
    rm /workspace/OrthoRoute/*.ORP
    rm /workspace/OrthoRoute/*.ORS
    
  • Download and backup ORS files locally

SSH Key Security

Generate unique key for cloud instances:

ssh-keygen -t ed25519 -f ~/.ssh/vast_key
# Use ~/.ssh/vast_key instead of default key
# If compromised, only affects cloud instances
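
An ~/.ssh/config entry on your local machine saves retyping the port and key path every time. A minimal sketch, substituting your own port:

# Add to ~/.ssh/config on your local machine
Host vast
    HostName ssh.vast.ai
    Port 12345
    User root
    IdentityFile ~/.ssh/vast_key

# Then connect and copy files with just:
#   ssh vast
#   scp board.ORP vast:/workspace/OrthoRoute/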

Post-Processing

After Downloading ORS

1. Verify file:

ls -lh MainController.ORS
# Should be ~500KB - 5MB depending on board size

2. Import to KiCad:

  • Ctrl+I in OrthoRoute plugin
  • Select ORS file
  • Review in preview

3. Run DRC:

  • Check for violations
  • Expect ~300-500 via barrel conflicts (known limitation)
  • Zero trace-trace violations (should be clean)

4. Manual cleanup (if needed):

  • Fix barrel conflicts by moving vias 0.1-0.2mm
  • Typically 30-60 minutes for large boards

FAQ

Q: Can I close my laptop while routing? A: Yes, if using tmux! Routing continues on the cloud.

Q: How do I know when it's done? A: Check tmux session or log files. Or set up email notification (advanced).
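
One low-tech notification option is a small watcher loop on the instance that pings a webhook when the solution file appears. A minimal sketch, where the webhook URL is a placeholder for whatever service you use (Slack, ntfy, etc.):

# Hypothetical webhook URL - replace with your own notification endpoint
WEBHOOK_URL="https://example.com/your-webhook"
while [ ! -f /workspace/OrthoRoute/MainController.ORS ]; do sleep 300; done
curl -s -X POST -d "OrthoRoute routing finished" "$WEBHOOK_URL"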

Q: What if I run out of money mid-routing? A: Vast.ai stops instance, routing lost. Add credits before starting.

Q: Can I pause and resume? A: Not currently. Checkpointing is a planned feature but not implemented.

Q: GPU seems idle during routing? A: Check nvidia-smi. If 0%, CuPy isn't working. Use --cpu-only as fallback.

Q: Can I route multiple boards in parallel? A: Yes, if enough VRAM. 2 small boards on 1 GPU works. Large boards need dedicated GPU.


Last Updated: November 15, 2025
Tested On: Vast.ai, RunPod, Lambda Labs
GPU Tested: RTX 4090, RTX 6000 Ada, A100 80GB
Status: Production-ready