Scaling

This guide covers scaling your RootCause.ai deployment for production workloads. Proper scaling ensures responsive performance under load while optimizing resource costs.


Component Scaling Guidelines

Different components have different scaling characteristics:

| Component | Scaling Type | Bottleneck |
| --- | --- | --- |
| Platform (UI) | Horizontal | Concurrent users |
| Data Service | Horizontal | API requests |
| ML Jobs | Horizontal | Discovery/simulation queue |
| MongoDB | Vertical + Sharding | Data volume |
| PostgreSQL | Vertical | Temporal workflows |
| Redis | Vertical | Cache size |
| RabbitMQ | Horizontal | Message throughput |


Replica Counts

Development / Testing

```yaml
platform:
  replicaCount: 1

dataService:
  replicaCount: 1

mlJobs:
  replicaCount: 2
```

Production (Small)

Up to 50 concurrent users, moderate simulation load:
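A sketch in the same values layout as the development example, using the 3/5/10 split that the performance benchmarks table below recommends for 50 users:

```yaml
platform:
  replicaCount: 3

dataService:
  replicaCount: 5

mlJobs:
  replicaCount: 10
```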

Production (Medium)

Up to 200 concurrent users, heavy simulation load:
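Matching the 5/10/20 split from the performance benchmarks table below (200 users, heavy ML):

```yaml
platform:
  replicaCount: 5

dataService:
  replicaCount: 10

mlJobs:
  replicaCount: 20
```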

Production (Large)

Enterprise scale:
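No single replica count fits every enterprise deployment; the values below are illustrative starting points, best paired with the autoscaling configuration later in this guide:

```yaml
platform:
  replicaCount: 8      # illustrative; tune against observed load

dataService:
  replicaCount: 15

mlJobs:
  replicaCount: 30
```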


Resource Limits

Platform (Web UI)
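A starting point, assuming the chart accepts a standard Kubernetes `resources` block per component (all values illustrative):

```yaml
platform:
  resources:
    requests:
      cpu: 250m
      memory: 512Mi
    limits:
      cpu: 500m
      memory: 1Gi
```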

Data Service
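The data service handles API traffic and typically needs more headroom than the UI. Illustrative values, under the same assumed `resources` convention:

```yaml
dataService:
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: "1"
      memory: 2Gi
```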

ML Jobs

ML Jobs are CPU and memory intensive:
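A sketch with correspondingly larger requests (values illustrative; size against your actual discovery and simulation workloads):

```yaml
mlJobs:
  resources:
    requests:
      cpu: "2"
      memory: 4Gi
    limits:
      cpu: "4"
      memory: 8Gi
```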

For very large models (100+ variables):
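Raise the memory ceiling substantially; the figures below are illustrative starting points:

```yaml
mlJobs:
  resources:
    requests:
      cpu: "4"
      memory: 16Gi
    limits:
      cpu: "8"
      memory: 32Gi
```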


Horizontal Pod Autoscaling

Enable automatic scaling based on load:

Platform HPA
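A standard `autoscaling/v2` manifest; the Deployment name `platform` is an assumption and should match your release (`kubectl get deploy`):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: platform-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: platform          # assumed Deployment name
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```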

Data Service HPA
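Same pattern, with a wider range to absorb API bursts (Deployment name assumed):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: data-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: data-service      # assumed Deployment name
  minReplicas: 5
  maxReplicas: 15
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```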

ML Jobs HPA
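CPU-based scaling works as a first approximation for ML workers (Deployment name assumed):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-jobs-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-jobs           # assumed Deployment name
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75
```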

Or scale based on RabbitMQ queue depth:
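Queue-depth scaling requires an external scaler such as KEDA. A sketch using KEDA's RabbitMQ trigger; the Deployment name, queue name, and `TriggerAuthentication` are assumptions for your environment:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ml-jobs-queue
spec:
  scaleTargetRef:
    name: ml-jobs                  # assumed Deployment name
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: rabbitmq
      metadata:
        queueName: discovery-jobs  # assumed queue name
        mode: QueueLength
        value: "5"                 # target pending jobs per replica
      authenticationRef:
        name: rabbitmq-auth        # TriggerAuthentication holding the connection string
```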


Database Scaling

MongoDB

MongoDB uses replica sets. For higher throughput:
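A sketch assuming the bundled chart follows the common `replicaCount` convention for replica set members; reads can then be spread across secondaries with an appropriate read preference:

```yaml
mongodb:
  replicaCount: 3        # primary + 2 secondaries (assumed values key)
  resources:
    requests:
      cpu: "2"
      memory: 8Gi
```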

For very large datasets, consider sharding.

PostgreSQL

PostgreSQL scales vertically:
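Illustrative values, assuming a standard `resources` block under a `postgresql` key:

```yaml
postgresql:
  resources:
    requests:
      cpu: "2"
      memory: 8Gi
    limits:
      cpu: "4"
      memory: 16Gi
```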

For read-heavy workloads, add replicas:
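A sketch assuming a Bitnami-style read-replica values structure (key names are an assumption; check your chart):

```yaml
postgresql:
  readReplicas:
    replicaCount: 2      # assumed key for streaming read replicas
```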

Redis
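Redis scales vertically, bounded by cache size (see the component table above). A sketch assuming a standard `resources` block under a `redis` key, with a maxmemory policy so the cache evicts rather than grows unbounded:

```yaml
redis:
  resources:
    requests:
      cpu: 500m
      memory: 2Gi
    limits:
      memory: 4Gi
  # assumed chart support for extra server flags (Bitnami-style):
  extraFlags:
    - --maxmemory 3gb
    - --maxmemory-policy allkeys-lru
```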


Node Affinity

Ensure ML Jobs run on appropriate nodes:
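A sketch using a node selector plus a matching toleration; the `workload-type: ml` label and `dedicated=ml` taint are assumptions you would apply to your node pool:

```yaml
mlJobs:
  nodeSelector:
    workload-type: ml          # assumed node label
  tolerations:
    - key: dedicated
      operator: Equal
      value: ml
      effect: NoSchedule
```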

For GPU workloads (local LLM):
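Request GPUs via the extended resource exposed by the NVIDIA device plugin; the node label is an assumption:

```yaml
mlJobs:
  resources:
    limits:
      nvidia.com/gpu: 1        # requires the NVIDIA device plugin on the cluster
  nodeSelector:
    accelerator: nvidia        # assumed label on GPU nodes
```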


Pod Disruption Budgets

Ensure availability during updates:
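A standard `policy/v1` PodDisruptionBudget keeps at least one replica up during voluntary disruptions such as node drains; the pod labels are an assumption and should match your release:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: platform-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: platform   # assumed pod labels
```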


Monitoring Scaling Decisions

Track these metrics to inform scaling:

Platform/Data Service:

  • Request latency (P50, P95, P99)

  • Request rate (RPS)

  • Error rate

  • CPU/Memory utilization

ML Jobs:

  • Queue depth (pending jobs)

  • Job duration

  • Success/failure rate

  • CPU/Memory utilization

Databases:

  • Connection count

  • Query latency

  • Replication lag

  • Disk I/O
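These metrics can feed alerts that prompt a scaling review. A sketch of a Prometheus rule on ML job queue depth, assuming the RabbitMQ Prometheus plugin is enabled and the queue name matches your deployment:

```yaml
groups:
  - name: rootcause-scaling
    rules:
      - alert: MLJobQueueBacklog
        expr: rabbitmq_queue_messages_ready{queue="discovery-jobs"} > 50   # assumed queue name
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "ML job queue backlog; consider raising mlJobs replicas"
```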


Cost Optimization

Right-size resources:

  1. Start with recommended values

  2. Monitor actual usage for 1-2 weeks

  3. Adjust requests to match P95 usage

  4. Set limits at 2x requests
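Applying steps 3 and 4 to the data service might look like this (values illustrative):

```yaml
dataService:
  resources:
    requests:        # set to observed P95 usage
      cpu: 500m
      memory: 1Gi
    limits:          # 2x requests, per the rule above
      cpu: "1"
      memory: 2Gi
```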

Use spot/preemptible instances for ML Jobs:
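ML jobs tolerate interruption better than the stateful components, making them good spot candidates. A GKE-flavored sketch; other clouds use their own labels and taints:

```yaml
mlJobs:
  nodeSelector:
    cloud.google.com/gke-spot: "true"
  tolerations:
    - key: cloud.google.com/gke-spot
      operator: Equal
      value: "true"
      effect: NoSchedule
```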

Scale down during off-hours:
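One approach is a CronJob that scales the ML worker Deployment down each evening (pair it with a mirror job that scales back up in the morning). The Deployment name, service account, and image are assumptions; the service account needs RBAC permission to scale deployments:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-ml-jobs
spec:
  schedule: "0 20 * * 1-5"             # 20:00 on weekdays
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: scaler   # assumed SA with scale permissions
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command: ["kubectl", "scale", "deployment/ml-jobs", "--replicas=2"]
```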


Performance Benchmarks

Typical performance expectations:

| Workload | Recommended Config (platform / data service / ML jobs replicas) | Expected Performance |
| --- | --- | --- |
| Small (10 users, light ML) | 2/2/3 replicas | <500ms API response |
| Medium (50 users, moderate ML) | 3/5/10 replicas | <1s API, <10min discovery |
| Large (200 users, heavy ML) | 5/10/20 replicas | <2s API, <30min discovery |

Discovery time scales with:

  • Number of variables (exponential impact)

  • Number of rows (linear impact)

  • Data complexity (non-linear)


Next Steps

With scaling configured:

  • Review Upgrading for zero-downtime updates

  • Set up monitoring and alerting

  • Implement backup procedures
