
Cloud GPU Setup Guide for OrthoRoute

Complete instructions for running OrthoRoute's headless router on Vast.ai or other cloud GPU providers.

Last Updated: November 15, 2025


Step 1: Rent GPU Instance on Vast.ai

For boards with <2,000 nets:

  • GPU: RTX 4090 (24 GB VRAM)
  • Cost: ~$0.40/hr
  • Sufficient for most boards

For boards with 2,000-8,000 nets:

  • GPU: RTX 6000 Ada (48 GB VRAM) or A100 80GB
  • Cost: ~$0.80-1.50/hr
  • Needed for large backplanes

For boards with >8,000 nets:

  • GPU: H100 80GB or A100 80GB
  • Cost: ~$1.50-2.50/hr
  • Maximum capacity

On Vast.ai Website

  1. Go to https://vast.ai/console/create/
  2. Filter instances:
    • GPU Type: RTX 4090, RTX 6000 Ada, or A100
    • VRAM: ≥ 24 GB (48+ GB for large boards)
    • Disk Space: ≥ 20 GB
    • CUDA Version: 12.x or later
  3. Sort by price ($/hr)
  4. Click "Rent" on suitable instance
  5. Select:
    • Image: pytorch/pytorch:latest (has CUDA + Python pre-installed)
    • Or: nvidia/cuda:12.2.0-devel-ubuntu22.04
  6. Click "Create"

Get SSH Connection Info

After the instance starts (usually within 1-2 minutes):

  1. Click on instance in dashboard
  2. Copy SSH command shown (looks like):
    ssh -p 12345 root@ssh.vast.ai -L 8080:localhost:8080
    
  3. Or use direct IP if shown

Step 2: Connect and Setup Environment

SSH into Instance

# Use the SSH command from Vast.ai dashboard
ssh -p 12345 root@ssh.vast.ai

You should see a prompt like:

root@C.27877234:~#

Install System Dependencies

# Update package manager
apt-get update

# Install git and basic tools
apt-get install -y git tmux htop

# Verify CUDA is available
nvidia-smi
# Should show GPU info (e.g., RTX 4090, 24GB VRAM)

# Verify Python version
python3 --version
# Should be Python 3.8 or later

Step 3: Clone OrthoRoute Repository

# Navigate to workspace
cd /workspace

# Clone repository
git clone https://github.com/bbenchoff/OrthoRoute.git
cd OrthoRoute

# Verify files
ls -la
# Should see: main.py, orthoroute/, logs/, etc.

If using a private repository:

# Option 1: Use HTTPS with token
git clone https://YOUR_TOKEN@github.com/YourUsername/OrthoRoute.git

# Option 2: Use SSH (need to add SSH key to GitHub first)
git clone git@github.com:YourUsername/OrthoRoute.git

Step 4: Install Python Dependencies

Check CUDA Version

nvcc --version
# Note the CUDA version (e.g., 12.2, 12.4, etc.)

Install CuPy (GPU acceleration library)

For CUDA 12.x:

pip3 install cupy-cuda12x

For CUDA 11.x:

pip3 install cupy-cuda11x

Verify CuPy installation:

python3 -c "import cupy as cp; print(cp.__version__); print('GPU Available:', cp.cuda.is_available())"
# Should print: GPU Available: True

Install Other Dependencies

# Install NumPy and SciPy
pip3 install numpy scipy

# Verify installations
python3 -c "import numpy; import scipy; print('NumPy:', numpy.__version__, 'SciPy:', scipy.__version__)"

Complete dependency list:

pip3 install cupy-cuda12x numpy scipy

Note: Don't install PyQt6 (GUI not needed for headless mode).
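
If you would rather not match the CuPy wheel to the CUDA version by hand, the minimal sketch below picks it automatically. It assumes nvcc is available (as in the check above) and that only the CUDA 11.x/12.x cases matter here:

# Minimal sketch: install the CuPy wheel matching the detected CUDA major version
CUDA_MAJOR=$(nvcc --version | sed -n 's/.*release \([0-9]*\)\..*/\1/p')
if [ "$CUDA_MAJOR" = "12" ]; then
    pip3 install cupy-cuda12x numpy scipy
elif [ "$CUDA_MAJOR" = "11" ]; then
    pip3 install cupy-cuda11x numpy scipy
else
    echo "Unexpected CUDA version: $CUDA_MAJOR - install CuPy manually"
fi

# Confirm the GPU is visible to CuPy before committing to a long run
python3 -c "import cupy as cp; print('GPU Available:', cp.cuda.is_available())"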


Step 5: Upload Your ORP File

From Your Local Machine

Using SCP:

# On your local machine (not on the Vast instance):
scp -P 12345 MainController.ORP root@ssh.vast.ai:/workspace/OrthoRoute/

# Replace:
#   12345 - with your actual port from Vast.ai
#   MainController.ORP - with your actual ORP filename

Verify upload:

# Back on the Vast instance:
cd /workspace/OrthoRoute
ls -lh *.ORP
# Should show your ORP file

Alternative: Upload to Cloud Storage First

If ORP file is large:

# On local machine: Upload to temporary host
# curl -F "file=@MainController.ORP" https://file.io
# Gets back a URL

# On Vast instance: Download
wget https://file.io/XXXXXX -O MainController.ORP
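
For large ORP files over a flaky connection, rsync can resume a partial upload instead of starting over. A minimal sketch, run from your local machine with your own port and filename substituted:

# Resumable upload with progress; re-run the same command to pick up where it left off
rsync -avP -e "ssh -p 12345" MainController.ORP root@ssh.vast.ai:/workspace/OrthoRoute/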

Step 6: Run OrthoRoute Headless Mode

# Start new tmux session
tmux new -s routing

# Inside tmux, run OrthoRoute
cd /workspace/OrthoRoute
python3 main.py headless MainController.ORP

# Detach from tmux (keeps running in background):
# Press: Ctrl+b, then d

# Later, reattach to see progress:
tmux attach -t routing

# Kill session when done:
tmux kill-session -t routing
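
If you prefer to launch the job already detached (no manual Ctrl+b, d), tmux can start the session in the background. A minimal sketch, assuming the same paths and filename as above:

# Start the routing job in a detached tmux session in one shot
tmux new-session -d -s routing \
    'cd /workspace/OrthoRoute && python3 main.py headless MainController.ORP'

# Confirm it is running
tmux ls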

Direct Run (Simpler but dies if SSH disconnects)

cd /workspace/OrthoRoute
python3 main.py headless MainController.ORP

With Options

# Increase iterations for complex boards
python3 main.py headless MainController.ORP --max-iterations 150

# Force CPU mode if GPU runs out of memory
python3 main.py headless MainController.ORP --cpu-only

# Custom output filename
python3 main.py headless MainController.ORP -o CustomName.ORS

Step 7: Monitor Progress

Watch Live Console Output

If using tmux:

tmux attach -t routing

If running directly: output is already showing in your terminal.

Tail Log Files

# In a second SSH session or tmux pane:
cd /workspace/OrthoRoute

# Watch latest log file
tail -f logs/run_*.log | grep "WARNING"

# Or just iteration summaries:
tail -f logs/run_*.log | grep "ITER.*nets="

# Or with watch command:
watch -n 2 'tail -5 logs/run_*.log'

Monitor GPU Usage

# Watch GPU utilization every 5 seconds
nvidia-smi -l 5

# Or with watch:
watch -n 5 nvidia-smi

What to look for:

  • GPU Utilization: Should be 80-100%
  • GPU Memory: Should be stable (not growing infinitely)
  • Power Usage: Should be near max (e.g., 350W for RTX 4090)
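
To log those numbers over time instead of watching them interactively, nvidia-smi's query mode works well. A minimal sketch, run in a second tmux pane or SSH session:

# Append one CSV line of GPU stats every 30 seconds
while true; do
    nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,power.draw \
               --format=csv,noheader >> gpu_stats.csv
    sleep 30
done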

Check Disk Space

# Iteration 1 on 8K nets creates LARGE log files
df -h

# If disk getting full, you can compress or delete old logs:
gzip logs/old_run_*.log

Step 8: Handle Common Issues

Out of Memory Error

Error:

cupy.cuda.memory.OutOfMemoryError: Out of memory allocating X bytes

Solutions:

A) Upgrade to larger GPU:

  • Kill current job: pkill -f main.py
  • Destroy instance on Vast.ai
  • Rent instance with more VRAM (48+ GB)
  • Restart from Step 1

B) Use CPU mode:

pkill -f main.py
python3 main.py headless MainController.ORP --cpu-only

C) Reduce batch size (requires code change - not recommended)
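
Before re-launching after an out-of-memory error, it can help to confirm how much VRAM is actually free, since a crashed run can keep memory pinned until its process fully exits. A minimal sketch:

# Make sure no stale routing process is still holding GPU memory
pkill -f main.py
sleep 5

# Report free VRAM in MiB; if this is far below the card's capacity, something is still running
nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits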

Process Killed / SSH Disconnected

If you weren't using tmux:

  • Routing stopped when SSH died
  • Must restart from scratch

If you were using tmux:

# Reconnect to Vast instance
ssh -p 12345 root@ssh.vast.ai

# Reattach to tmux session
tmux attach -t routing

# Routing should still be running!

Instance Becomes Unresponsive

If SSH hangs or times out:

  • Instance might have crashed
  • Check Vast.ai dashboard - instance status
  • If "stopped", you'll need to restart
  • Unfortunately, routing progress lost (no checkpointing yet)

Logs Too Large

8K net routing can create 10+ GB log files:

# Check log size
du -h logs/

# Compress old logs to save space
gzip logs/run_*.log

# Or delete very old logs
rm logs/run_2025111*.log

Step 9: Download Results

When Routing Completes

You'll see:

================================================================================
ROUTING COMPLETE!
================================================================================
Solution file: MainController.ORS
...

Download ORS File to Local Machine

Using SCP (from your local machine):

scp -P 12345 root@ssh.vast.ai:/workspace/OrthoRoute/MainController.ORS ./

# Replace:
#   12345 - your Vast.ai port
#   MainController.ORS - your actual ORS filename
#   ./ - current directory (or specify path)

Using cloud storage:

# On Vast instance: Upload to file sharing service
curl -F "file=@MainController.ORS" https://file.io
# Returns download URL

# On local machine: Download
wget https://file.io/XXXXXX -O MainController.ORS

Verify file integrity:

# On local machine, check file is valid gzip:
gzip -t MainController.ORS && echo "File OK" || echo "File corrupted"

# Check file size (should be ~500KB - 5MB):
ls -lh MainController.ORS
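
If you want a stronger check than file size, comparing checksums on both ends confirms the transfer was not truncated. A minimal sketch:

# On the Vast instance, before downloading:
sha256sum /workspace/OrthoRoute/MainController.ORS

# On your local machine, after downloading:
sha256sum MainController.ORS
# The two hashes should match exactly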

Step 10: Import into KiCad

On your local machine:

  1. Open KiCad with your board
  2. Launch OrthoRoute plugin
  3. Press Ctrl+I (or File → Import Solution)
  4. Select MainController.ORS
  5. Review routing in preview
  6. Click "Apply to KiCad" to commit traces/vias

Complete Example Session

Session Recording

# === ON LOCAL MACHINE ===

# 1. Export board
# (In KiCad OrthoRoute plugin: Ctrl+E → save MainController.ORP)

# 2. Upload to Vast
scp -P 12345 MainController.ORP root@ssh.vast.ai:/workspace/

# === ON VAST.AI INSTANCE ===

# 3. SSH in
ssh -p 12345 root@ssh.vast.ai

# 4. Setup
cd /workspace
git clone https://github.com/YourUser/OrthoRoute.git
cd OrthoRoute
mv ../MainController.ORP .   # bring the uploaded ORP into the repo directory
pip3 install cupy-cuda12x numpy scipy

# 5. Verify GPU
nvidia-smi
python3 -c "import cupy; print('GPU:', cupy.cuda.is_available())"

# 6. Start tmux session
tmux new -s routing

# 7. Run routing
python3 main.py headless MainController.ORP

# 8. Detach from tmux (Ctrl+b, then d)

# 9. Monitor progress (optional)
tail -f logs/run_*.log | grep "ITER.*nets="

# 10. Wait for completion (check back in 4-8 hours)

# 11. Download result
exit  # Exit SSH

# === BACK ON LOCAL MACHINE ===

# 12. Download ORS file
scp -P 12345 root@ssh.vast.ai:/workspace/OrthoRoute/MainController.ORS ./

# 13. Import into KiCad (Ctrl+I)

# 14. Destroy Vast instance (stop billing)
# (In Vast.ai dashboard: click Destroy)

Cost Estimation

Typical Costs by Board Size

Small board (100-500 nets):

  • Time: 10-30 minutes
  • GPU: RTX 4090 @ $0.40/hr
  • Cost: $0.20

Medium board (500-2,000 nets):

  • Time: 30 minutes - 2 hours
  • GPU: RTX 4090 @ $0.40/hr
  • Cost: $0.80

Large board (2,000-8,000 nets):

  • Time: 4-12 hours
  • GPU: RTX 6000 Ada (48GB) @ $0.80/hr
  • Cost: $6-10

Huge board (8,000+ nets):

  • Time: 12-24 hours
  • GPU: A100 80GB @ $1.50/hr
  • Cost: $18-36

vs. buying RTX 4090: ~$1,600

Break-even: somewhere between ~45 of the biggest routing jobs and a couple hundred large ones (or never, if you value your time)


Tips & Tricks

1. Use tmux ALWAYS

# Start every session with:
tmux new -s routing

# Detach: Ctrl+b, then d
# Reattach: tmux attach -t routing

Why: If SSH disconnects, routing keeps going. Saved me countless times.

2. Monitor Without Attaching

# See what's happening in tmux without attaching:
tmux capture-pane -t routing -p | tail -20

3. Multiple Sessions for Monitoring

# Window 1: Routing
tmux new -s routing
python3 main.py headless board.ORP

# Detach (Ctrl+b, d)

# Window 2: Monitoring
tmux new -s monitor
tail -f logs/run_*.log | grep "ITER.*nets="

# Detach (Ctrl+b, d)

# Switch between:
tmux attach -t routing
tmux attach -t monitor

4. Estimate Time Remaining

# From iteration timestamps, calculate rate:
# Example: ITER 10 at 10:30, ITER 20 at 11:45
# = 10 iterations in 75 minutes
# = 7.5 min/iteration
# If need 80 iterations total: (80-20) × 7.5 = 450 min = 7.5 hours
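
The same arithmetic as a tiny shell helper, so you don't have to redo it by hand. The three arguments are iterations completed, elapsed minutes, and total iterations expected:

# Usage: eta <iterations_done> <elapsed_minutes> <total_iterations_expected>
eta() {
    awk -v d="$1" -v m="$2" -v t="$3" \
        'BEGIN { r = m / d; printf "~%.1f min/iteration, ~%.1f hours remaining\n", r, (t - d) * r / 60 }'
}

eta 20 150 80   # e.g. 20 iterations done in 150 minutes, expecting 80 total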

5. Verify GPU is Being Used

# Run this DURING routing:
nvidia-smi

# Look for:
#   GPU Util: 95-100%
#   Memory Usage: 20-30 GB (should be high)
#   Process: python3 main.py headless ...

If GPU Util is 0%: Routing is using CPU (slow!) - check CuPy installation.

6. Pre-test Small Board

Before routing huge board:

# Test with small ORP first:
python3 main.py headless TestBackplane.ORP

# Should complete in 20-30 min
# Verifies: GPU works, dependencies correct, no issues

7. Compress Logs to Save Disk

# While routing is running (in another terminal):
cd /workspace/OrthoRoute/logs
gzip run_2025*.log  # Compress old logs

# Or auto-compress with cron (note: the glob will also catch the log of the
# currently running job, so only leave this enabled between runs):
(crontab -l; echo "*/30 * * * * gzip /workspace/OrthoRoute/logs/*.log 2>/dev/null") | crontab -

Troubleshooting

"No module named 'cupy'"

Problem: CuPy not installed

Fix:

pip3 install cupy-cuda12x

"CUDA initialization failed"

Problem: CUDA runtime mismatch

Fix:

# Check CUDA version
nvcc --version

# Install matching CuPy:
# CUDA 11.x: pip3 install cupy-cuda11x
# CUDA 12.x: pip3 install cupy-cuda12x

"Permission denied" when cloning repo

Problem: Private repository

Fix:

# Generate SSH key on Vast instance:
ssh-keygen -t ed25519 -C "vast-gpu"
cat ~/.ssh/id_ed25519.pub
# Copy output, add to GitHub → Settings → SSH Keys

# Or use personal access token:
git clone https://YOUR_TOKEN@github.com/user/repo.git

Routing uses CPU instead of GPU

Check:

python3 -c "import cupy; print('Available:', cupy.cuda.is_available())"

If False:

  • CuPy not installed correctly
  • CUDA version mismatch
  • GPU drivers not loaded

Force GPU mode:

python3 main.py headless board.ORP --use-gpu

Instance runs out of disk space

Check space:

df -h

If <5 GB free:

# Compress logs
gzip logs/*.log

# Or delete old logs
rm logs/run_2025111*.log

# Or mount external storage (Vast.ai option)

Routing takes forever on CPU

If forced to use --cpu-only:

  • 8K net board could take 48-72 hours
  • Consider renting bigger GPU instead
  • Or reduce grid resolution in ORP file

Optimization Tips

1. Choose Right GPU for Your Board

Board Size   Nets      VRAM Needed   Recommended GPU   Cost/hr
Small        <500      8 GB          RTX 3080          $0.25
Medium       500-2K    16 GB         RTX 4090          $0.40
Large        2K-6K     24 GB         RTX 4090          $0.40
Huge         6K-10K    48 GB         RTX 6000 Ada      $0.80
Massive      10K+      80 GB         A100 80GB         $1.50

2. Batch Multiple Boards

# Route multiple boards in one session:
python3 main.py headless Board1.ORP
python3 main.py headless Board2.ORP
python3 main.py headless Board3.ORP

# Or in parallel (if enough VRAM):
python3 main.py headless Board1.ORP &
python3 main.py headless Board2.ORP &
wait
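
A loop version of the sequential case lets a queue of boards run unattended overnight. A minimal sketch, assuming all the ORP files sit in the repo directory (the per-board console log name is just an example):

# Route every ORP in the directory, one after another, capturing each run's console output
for board in *.ORP; do
    echo "=== Routing $board ==="
    python3 main.py headless "$board" 2>&1 | tee "${board%.ORP}_console.log"
done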

3. Auto-shutdown When Done

# Add to end of routing script:
python3 main.py headless board.ORP && shutdown -h now

# Instance stops automatically when complete
# Minimizes billing
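
If you want the instance to stop even when routing fails (so a crash overnight doesn't keep billing), a minimal sketch that halts either way, with a short grace window and a scratch file recording the exit code:

# Run routing, note how it exited, then shut down regardless of success
python3 main.py headless board.ORP
echo "Routing exited with code $?" > routing_exit_code.txt
sleep 300   # 5 minutes to reattach and cancel the shutdown if you happen to be watching
shutdown -h now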

Quick Reference Card

Setup:

ssh -p PORT root@ssh.vast.ai
cd /workspace
git clone https://github.com/user/OrthoRoute.git
cd OrthoRoute
pip3 install cupy-cuda12x numpy scipy

Upload file:

# From local machine:
scp -P PORT board.ORP root@ssh.vast.ai:/workspace/OrthoRoute/

Run routing:

tmux new -s routing
python3 main.py headless board.ORP
# Ctrl+b, d to detach

Monitor:

tail -f logs/run_*.log | grep "ITER.*nets="
nvidia-smi -l 5

Download result:

# From local machine:
scp -P PORT root@ssh.vast.ai:/workspace/OrthoRoute/board.ORS ./

Import to KiCad:

Ctrl+I → select board.ORS → Apply to KiCad

Expected Timeline (8K Net Board)

00:00 - Start instance, SSH in
00:05 - Clone repo, install dependencies
00:10 - Upload ORP file (depends on internet speed)
00:15 - Start routing in tmux
02:30 - Iteration 1 completes (greedy routing)
04:00 - Iteration 20 completes
08:00 - Iteration 50 completes
12:00 - Iteration 75 completes
14:00 - Convergence! (iteration 85-95)
14:05 - Download ORS file
14:10 - Destroy instance

Total: ~14 hours runtime, ~$12-15 cost

Vast.ai Specific Notes

Instance States

  • Loading: Starting up (1-2 min)
  • Running: Active and billable
  • Stopped: Paused (GPU billing stops, though storage may still be billed; only /workspace data survives, see Data Persistence below)
  • Destroyed: Terminated (stops billing)

Billing

  • Billed per second of runtime
  • Continues billing until you Destroy instance
  • Check dashboard frequently when job completes

Data Persistence

  • /workspace directory persists across stops
  • ~/.ssh, /tmp do NOT persist
  • Always destroy when done (or you keep paying)

Port Forwarding

SSH command includes port forwarding:

ssh -p 12345 root@ssh.vast.ai -L 8080:localhost:8080

You can ignore the -L 8080:localhost:8080 part for headless routing.


Other Cloud Providers

RunPod

Similar setup:

# SSH command from RunPod dashboard
ssh root@X.X.X.X -p 22

# Rest is identical to Vast.ai

Differences:

  • Easier UI
  • Slightly more expensive (~$0.50/hr for RTX 4090)
  • Better reliability
  • Jupyter notebook support (not needed for headless)

Lambda Labs

Setup:

ssh ubuntu@instance.lambdalabs.com
sudo apt-get install python3-pip
# Rest same as Vast.ai

Differences:

  • More expensive (~$1.10/hr for A100)
  • Very reliable
  • Better for production workloads
  • Fixed pricing (no bidding)

Security Notes

Protect Your ORP Files

ORP files contain your entire board design:

  • Pad positions
  • Net connectivity
  • Design rules

Don't:

  • Upload to public GitHub
  • Share ORP files publicly
  • Leave on instance after destroying

Do:

  • Use private repositories
  • Delete ORP/ORS from instance before destroying:
    rm /workspace/OrthoRoute/*.ORP
    rm /workspace/OrthoRoute/*.ORS
    
  • Download and backup ORS files locally

SSH Key Security

Generate unique key for cloud instances:

ssh-keygen -t ed25519 -f ~/.ssh/vast_key
# Use ~/.ssh/vast_key instead of default key
# If compromised, only affects cloud instances
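
An ~/.ssh/config entry on your local machine saves retyping the port and key path every time. A minimal sketch, substituting your own port:

# Add to ~/.ssh/config on your local machine
Host vast
    HostName ssh.vast.ai
    Port 12345
    User root
    IdentityFile ~/.ssh/vast_key

# Then connect and copy files with just:
#   ssh vast
#   scp board.ORP vast:/workspace/OrthoRoute/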

Post-Processing

After Downloading ORS

1. Verify file:

ls -lh MainController.ORS
# Should be ~500KB - 5MB depending on board size

2. Import to KiCad:

  • Ctrl+I in OrthoRoute plugin
  • Select ORS file
  • Review in preview

3. Run DRC:

  • Check for violations
  • Expect ~300-500 via barrel conflicts (known limitation)
  • Zero trace-trace violations (should be clean)

4. Manual cleanup (if needed):

  • Fix barrel conflicts by moving vias 0.1-0.2mm
  • Typically 30-60 minutes for large boards

FAQ

Q: Can I close my laptop while routing? A: Yes, if using tmux! Routing continues on the cloud.

Q: How do I know when it's done? A: Check tmux session or log files. Or set up email notification (advanced).
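
One low-tech notification option is a small watcher loop on the instance that pings a webhook when the solution file appears. A minimal sketch, where the webhook URL is a placeholder for whatever service you use (Slack, ntfy, etc.):

# Hypothetical webhook URL - replace with your own notification endpoint
WEBHOOK_URL="https://example.com/your-webhook"
while [ ! -f /workspace/OrthoRoute/MainController.ORS ]; do sleep 300; done
curl -s -X POST -d "OrthoRoute routing finished" "$WEBHOOK_URL"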

Q: What if I run out of money mid-routing? A: Vast.ai stops instance, routing lost. Add credits before starting.

Q: Can I pause and resume? A: Not currently. Checkpointing is a planned feature but not implemented.

Q: GPU seems idle during routing? A: Check nvidia-smi. If 0%, CuPy isn't working. Use --cpu-only as fallback.

Q: Can I route multiple boards in parallel? A: Yes, if enough VRAM. 2 small boards on 1 GPU works. Large boards need dedicated GPU.


Last Updated: November 15, 2025
Tested On: Vast.ai, RunPod, Lambda Labs
GPU Tested: RTX 4090, RTX 6000 Ada, A100 80GB
Status: Production-ready