
Scaling goals

  • Increase throughput without destabilizing tail latency
  • Maintain request success rate during demand spikes
  • Keep credit settlement and node discovery consistent

Worker scaling strategy

1. Scale out first — Add more workers before increasing per-node concurrency.

2. Tune concurrency — Raise max_concurrency incrementally while tracking P95 latency.

3. Pin workloads — Route model families to dedicated node pools when possible.
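The scale-out-first policy above can be sketched as a small decision function. The thresholds (a P95 budget, a concurrency step size, and a per-node concurrency ceiling) are hypothetical placeholders, not values from this system — tune them against your own latency targets.

```python
from dataclasses import dataclass

@dataclass
class WorkerPoolState:
    workers: int
    max_concurrency: int
    p95_latency_ms: float

# Hypothetical thresholds; tune per deployment.
P95_BUDGET_MS = 500.0
CONCURRENCY_STEP = 2
CONCURRENCY_CEILING = 32

def next_action(state: WorkerPoolState) -> str:
    """Scale out first; only raise per-node concurrency while P95 holds."""
    if state.p95_latency_ms > P95_BUDGET_MS:
        # Latency budget exceeded: add workers rather than pushing nodes harder.
        return "add_worker"
    if state.max_concurrency + CONCURRENCY_STEP <= CONCURRENCY_CEILING:
        # Latency is healthy and headroom exists: raise concurrency one step.
        return "raise_concurrency"
    return "hold"
```

The key design choice is that a latency breach always resolves to adding workers, so per-node concurrency only climbs while the P95 budget is intact.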

Registry scaling strategy

Component   Recommendation
API layer   Run multiple stateless replicas behind a load balancer.
Database    Use managed PostgreSQL with automated backups and read replicas.
Cache       Add Redis for node discovery and session lookup hot paths.
Queue       Use durable queues for asynchronous settlement and retries.
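The cache row can be made concrete with a cache-aside sketch for the node discovery hot path. This is an in-memory stand-in for Redis (a dict with TTLs, so it stays self-contained); the `NodeDiscoveryCache` name, the 30-second TTL, and the `fetch_from_db` callback are all illustrative assumptions, not part of the registry's actual API.

```python
import time

class NodeDiscoveryCache:
    """Cache-aside node lookup: serve hits from a TTL'd in-memory map
    (standing in for Redis) and fall through to the database on a miss."""

    def __init__(self, fetch_from_db, ttl_seconds=30.0):
        self._fetch = fetch_from_db      # callback that queries PostgreSQL
        self._ttl = ttl_seconds
        self._store = {}                 # node_id -> (expires_at, record)

    def get(self, node_id):
        entry = self._store.get(node_id)
        if entry is not None and entry[0] > time.monotonic():
            return entry[1]              # fresh cache hit
        record = self._fetch(node_id)    # miss or expired: hit the database
        self._store[node_id] = (time.monotonic() + self._ttl, record)
        return record
```

With a short TTL, repeated discovery lookups within the window hit the cache and only the first one touches the database.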

Autoscaling signals

Use a combination of:
  • queue depth
  • in-flight request count
  • P95 latency
  • GPU utilization
Avoid these pitfalls:
  • Scaling only by CPU can under-provision GPU-bound workloads.
  • Large concurrency jumps can increase timeout rates and reduce total throughput.
  • Mixing dissimilar models in one pool can create noisy-neighbor effects.
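One way to combine the signals above is to compute a replica estimate per signal and take the maximum, so a saturated GPU or a latency breach can trigger a scale-out even when CPU looks idle. Everything here — the `desired_replicas` name, per-replica capacity, P95 budget, GPU utilization target, and the 2x step cap — is a hypothetical sketch, not this system's autoscaler.

```python
import math

def desired_replicas(current, queue_depth, in_flight, p95_ms, gpu_util,
                     per_replica_capacity=8, p95_budget_ms=500.0,
                     gpu_target=0.75):
    """Blend queue depth, in-flight count, P95 latency, and GPU utilization
    into a target replica count by taking the max per-signal estimate."""
    # Replicas needed to absorb queued plus in-flight work.
    by_load = math.ceil((queue_depth + in_flight) / per_replica_capacity)
    # Replicas needed to bring GPU utilization back to its target.
    by_gpu = math.ceil(current * gpu_util / gpu_target)
    want = max(by_load, by_gpu, 1)
    if p95_ms > p95_budget_ms:
        # A latency breach forces at least one additional replica.
        want = max(want, current + 1)
    # Cap growth at 2x per step: large jumps raise timeout rates.
    return min(want, current * 2)
```

Taking the max across signals is what prevents the CPU-only blind spot: any single saturated dimension is enough to grow the pool, while the 2x cap keeps each adjustment incremental.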
Last modified on February 21, 2026