Upgrades & Operations

Day-2 operations for the RootCause Platform. This doc covers upgrades, rollback, scaling, configuration changes, and troubleshooting.

Who this is for: Data scientists can handle upgrades and routine operations (Sections 1-3). Infrastructure teams should review the backup, rollback, scaling, and troubleshooting sections.


1. How updates work

The RootCause Operator automatically checks the OCI registry for new chart versions. When an update is available, it appears on the Releases page in the Admin UI.

Key points:

  • You can jump directly to any version — no need to step through intermediate releases

  • Dependencies and platform are upgraded together — the operator manages both Helm releases as a unit

  • Updates are applied through the Admin UI — no Helm commands needed for routine upgrades


2. Upgrading

The operator does not require backups before upgrading. However, we recommend these steps as best practice — they take a few minutes and give you a recovery path if anything unexpected happens.

Recommended backup checklist:

# Export the CR spec (captures your full configuration)
kubectl get rootcauseinstallation rootcause -n rootcause -o yaml > rc-backup-$(date +%Y%m%d).yaml

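# PostgreSQL dump (contains platform data and LiteLLM model config)
# No -it here: an allocated TTY can corrupt output that is redirected to a file
kubectl exec postgres-0 -n rootcause -- \
  pg_dumpall -U postgres > postgres-backup-$(date +%Y%m%d).sql

# MongoDB dump (written to /backup inside the pod, not to your local machine;
# $(date ...) is expanded by your local shell before the command runs)
kubectl exec perceptura-mongo-0 -n rootcause -- \
  mongodump --out=/backup/$(date +%Y%m%d)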

Also recommended:

  • Screenshot or export your LiteLLM model configuration (models, API keys, fallback settings) — this lives in PostgreSQL and is covered by the dump above, but a screenshot is faster to reference if you need to re-enter settings manually

  • Check release notes for the target version

  • Test in a staging environment if you have one

Applying an upgrade

Via the Admin UI (recommended):

  1. Open the Admin UI

  2. Go to the Releases page

  3. Select the target version

  4. Click Request upgrade

  5. Watch the Deployment status panel — the phase progresses through Reconciling to Ready

Via CLI (for automation or GitOps workflows):
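A minimal sketch of the CLI path, assuming the target chart version is stored at spec.version in the installation CR. That field path is an assumption and is not confirmed by this guide; check the CR export from your backup for the real one. The resource name and namespace are the ones used in the backup command above, and rc_request_version is a hypothetical helper name.

```shell
# Sketch only: spec.version is an ASSUMED field path, verify it against your CR.
# Usage: rc_request_version <target-version>
rc_request_version() {
  target_version="$1"
  # A JSON merge patch changes only the named field and leaves the rest of the spec alone
  kubectl patch rootcauseinstallation rootcause -n rootcause \
    --type merge \
    -p "{\"spec\":{\"version\":\"${target_version}\"}}"
}
```

In a GitOps workflow you would commit the same field change to the CR manifest in your repository instead of patching the live object.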

The operator detects the spec change and reconciles automatically.

Monitor the upgrade:
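One way to follow the upgrade from a terminal is to poll status.phase on the installation CR until it reports Ready. The sketch below assumes the same resource name and namespace as the backup command in this section. rc_wait_ready is a hypothetical helper name, and the timeout and interval defaults are arbitrary.

```shell
# Poll status.phase until it is Ready, or give up after a timeout.
# Usage: rc_wait_ready [timeout-seconds] [poll-interval-seconds]
rc_wait_ready() {
  timeout_s="${1:-600}"
  interval_s="${2:-10}"
  elapsed=0
  phase=""
  while [ "$elapsed" -lt "$timeout_s" ]; do
    phase=$(kubectl get rootcauseinstallation rootcause -n rootcause \
      -o jsonpath='{.status.phase}')
    echo "phase: ${phase:-unknown}"
    [ "$phase" = "Ready" ] && return 0
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
  echo "timed out after ${timeout_s}s (last phase: ${phase:-unknown})" >&2
  return 1
}
```

If the function times out, status.lastError and the operator logs (see Troubleshooting) are the next places to look.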

Verify: Run the same smoke test from the Deployment guide (Step 9) — all pods running, Admin UI accessible, platform login works, LLM responds, data loads.


3. Rolling back

The operator treats rollback the same as any other version change: select a previous version and apply it, and the operator deploys that version.

This works because the operator manages the full deployment state. It doesn't need to "undo" anything.

Steps

  1. Open the Admin UI > Releases page

  2. Select the previous version you want to return to

  3. Click Request upgrade (yes, "upgrade" — the operator doesn't distinguish between moving forward or backward)

  4. Watch the Deployment status panel

  5. Verify with the smoke test

When implicit rollback isn't enough

In rare cases — for example, if an upgrade included a database schema migration that isn't backward-compatible — selecting the previous version may not fully restore the old behavior. If this happens and you took backups before upgrading:

Restore PostgreSQL:
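A minimal sketch, assuming the dump was produced by the pg_dumpall command in the backup checklist and is available on your local machine. rc_restore_postgres is a hypothetical helper name. Roles and databases that already exist will produce errors during replay unless the dump was taken with pg_dumpall --clean.

```shell
# Replay a pg_dumpall file into the PostgreSQL pod via psql.
# -i keeps stdin open for the redirect; -t is omitted because a TTY would mangle the stream.
# Usage: rc_restore_postgres postgres-backup-YYYYMMDD.sql
rc_restore_postgres() {
  dump_file="$1"
  [ -r "$dump_file" ] || { echo "cannot read $dump_file" >&2; return 1; }
  kubectl exec -i postgres-0 -n rootcause -- \
    psql -U postgres -d postgres < "$dump_file"
}
```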

Restore MongoDB:
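A minimal sketch, assuming the dump directory written by the backup step (/backup/<YYYYMMDD> inside the pod) is still present, which depends on /backup being on persistent storage. rc_restore_mongo is a hypothetical helper name, and any authentication flags your MongoDB deployment requires are not shown.

```shell
# Restore a mongodump directory that already exists inside the MongoDB pod.
# --drop removes each collection before restoring it, so the result matches the dump.
# Usage: rc_restore_mongo <YYYYMMDD>
rc_restore_mongo() {
  backup_date="$1"
  kubectl exec perceptura-mongo-0 -n rootcause -- \
    mongorestore --drop "/backup/${backup_date}"
}
```

Leave out --drop if you would rather merge the dump into existing collections than replace them.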

Re-enter LiteLLM configuration from your screenshot/export if needed.

Then select the previous version on the Releases page and apply it.


4. Scaling

Replica counts and resource limits for platform components are configured in the Components section of the Admin UI bootstrap wizard. Change values and click Apply configuration; the operator then reconciles.

Replica counts by workload size

| Component | Dev/Test | Small prod (50 users) | Medium prod (200 users) | Large prod (200+ users) |
|---|---|---|---|---|
| Platform | 1 | 2 | 3 | 5 |
| Data Service | 1 | 3 | 5 | 10 |
| ML Jobs | 2 | 5 | 10 | 20 |

Resource limits per component

Platform (Web UI):

| Resource | Request | Limit |
|---|---|---|
| CPU | 500m | 2 |
| Memory | 1 Gi | 4 Gi |

Data Service:

| Resource | Request | Limit |
|---|---|---|
| CPU | 1 | 4 |
| Memory | 2 Gi | 8 Gi |

ML Jobs:

| Resource | Request | Limit |
|---|---|---|
| CPU | 2 | 8 |
| Memory | 4 Gi | 16 Gi |

For very large models (100+ variables), ML Jobs may need more:

| Resource | Request | Limit |
|---|---|---|
| CPU | 4 | 16 |
| Memory | 8 Gi | 32 Gi |

Horizontal Pod Autoscaling (HPA)

HPA can be configured via Raw Helm Overrides in the Advanced section of the wizard.

CPU-based autoscaling (platform and data service):

Queue-depth-based autoscaling (ML Jobs):

Scaling databases

Databases scale through the Dependencies section of the wizard or via Raw Helm Overrides:

| Database | How to scale | Production recommendation |
|---|---|---|
| MongoDB | Increase replica set members (1 → 3 for HA) and storage size | 3 replicas, 100 Gi+ storage |
| PostgreSQL | Vertical scaling (increase CPU/memory). Add read replicas for read-heavy workloads. | 2-4 CPU, 4-8 Gi RAM, 10 Gi+ storage |
| Redis | Vertical scaling (increase memory) | 1-4 CPU, 2-8 Gi RAM |

Node affinity and GPU

For dedicated node pools or GPU workloads, use Resource Patches in the Advanced section:


5. Changing configuration

Change any setting in the bootstrap wizard and click Apply configuration. The operator detects the spec change (generation bump) and reconciles — upgrading both Helm releases with the updated values.

Changes take effect within minutes. The Deployment status panel shows the reconciliation progress.

Force reconcile: If the installation is stuck or you need to re-apply values without changing the spec, click Force reconcile in the Deployment status panel. Via CLI:


6. Managing secrets

The Secrets page in the Admin UI lists all namespace secrets. You can view, create, edit, and delete secrets. Missing secrets required by the platform are highlighted.

Rotating credentials:

  1. Update the secret value on the Secrets page (or via kubectl)

  2. Restart the affected pods so they pick up the new value:
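The restart can be done with kubectl rollout restart, the same command this guide uses under Upgrade-specific issues. The sketch below also waits for the rollout to finish; rc_restart is a hypothetical helper name and the 5-minute timeout is arbitrary.

```shell
# Restart a Deployment so its pods re-read the updated secret, then wait for the rollout.
# Usage: rc_restart <deployment-name>
rc_restart() {
  deployment="$1"
  kubectl rollout restart deployment "$deployment" -n rootcause &&
    kubectl rollout status deployment "$deployment" -n rootcause --timeout=5m
}
```

For workloads that run as StatefulSets, replace deployment with statefulset in both commands.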


7. Managing users

With managed FusionAuth

Use the Users page in the Admin UI to add, edit, or remove users. The operator manages FusionAuth accounts and the platform admin email list automatically.

With external OIDC/SAML

Manage user accounts in your identity provider (Entra ID, Okta, etc.). In the Admin UI, use the Platform admin emails section on the Users page to control which users have admin access. These emails are stored in the CR and mounted as PLATFORM_ADMIN_EMAILS.


8. Uninstalling

On the Bootstrap page, scroll to Danger Zone, check the confirmation box, and click Request undeploy. The operator will:

  1. Uninstall the platform Helm release

  2. Uninstall the dependencies Helm release

  3. Clean up operator-managed secrets

The installation CR is preserved in the Uninstalled phase. You can redeploy by updating the spec and clicking Apply configuration; the operator will then do a fresh deployment.

Warning: Undeploying destroys all data in the deployed databases (PostgreSQL, MongoDB, Redis). Take backups before undeploying if you need to preserve data.


9. Troubleshooting

Check operator logs
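A minimal sketch for tailing the operator's logs. This guide does not state the operator's Deployment name or namespace, so both are parameters here. The rootcause default for the namespace is an assumption, and rc_operator_logs is a hypothetical helper name.

```shell
# Show the most recent operator log lines.
# Usage: rc_operator_logs <operator-deployment-name> [namespace]
rc_operator_logs() {
  operator_deploy="$1"            # placeholder: look it up with `kubectl get deployments -A`
  operator_ns="${2:-rootcause}"   # ASSUMED default; pass the real namespace if it differs
  kubectl logs "deployment/${operator_deploy}" -n "$operator_ns" --tail=200
}
```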

Check installation status

Key fields:

| Field | What it tells you |
|---|---|
| status.phase | Current state: Ready, Reconciling, Degraded, Uninstalled, Blocked |
| status.lastError | Error message if the phase is Degraded |
| status.conditions | Detailed condition status for each subsystem |

Check Helm releases
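A minimal sketch for inspecting the two operator-managed Helm releases, assuming they are installed in the rootcause namespace. The release names are not given in this guide, so take them from the list output. Both helper names are hypothetical.

```shell
# List every release in the namespace, including failed or pending ones.
# Usage: rc_helm_releases
rc_helm_releases() {
  helm list -n rootcause --all
}

# Show the last 10 revisions of one release, with status and description.
# Usage: rc_helm_history <release-name>
rc_helm_history() {
  release="$1"
  helm history "$release" -n rootcause --max 10
}
```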

Check pod logs
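A minimal sketch that prints logs from a pod's current container and from the previous one, which is where the startup error usually sits for a pod in CrashLoopBackOff. rc_pod_logs is a hypothetical helper name; get pod names from kubectl get pods -n rootcause.

```shell
# Print recent logs for a pod, then logs from its previous container instance.
# Usage: rc_pod_logs <pod-name>
rc_pod_logs() {
  pod="$1"
  echo "--- current container ---"
  kubectl logs "$pod" -n rootcause --tail=100
  echo "--- previous container (if it restarted) ---"
  # --previous fails when the container has never restarted; that is not an error here
  kubectl logs "$pod" -n rootcause --tail=100 --previous || true
}
```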

Common issues

| Issue | Cause | Fix |
|---|---|---|
| ImagePullBackOff | Missing or invalid pull secret | Verify regcred exists with correct credentials |
| Degraded with Helm error | Chart values validation failure | Check status.lastError and fix the configuration in the wizard |
| FusionAuth stuck in Maintenance Mode | silentMode not enabled | Ensure operator version includes the silent mode fix (0.1.14+) |
| OIDC login returns invalid_redirect_uri | Wildcard validation not enabled | Ensure operator version includes authorizedURLValidationPolicy: AllowWildcards (0.1.16+) |
| PVC stuck in Pending | No matching storage class | Set storageClass in the bootstrap wizard to match your cluster |
| Pods stuck in Pending | Insufficient cluster resources | Scale up nodes or reduce replica counts |
| CrashLoopBackOff | Application error on startup | Check pod logs with kubectl logs <pod-name> --previous |
| Upgrade stuck in Reconciling | Operator unable to complete reconciliation | Check operator logs, then try Force reconcile |

Upgrade-specific issues

| Issue | Cause | Fix |
|---|---|---|
| Pods not restarting after upgrade | Old pods still running | kubectl rollout restart deployment <name> -n rootcause |
| LiteLLM models missing after reinstall | Model config stored in PostgreSQL, not in CR | Re-enter model configuration from your backup/screenshot |
| Configuration errors after upgrade | New chart version has different value schema | Check status.lastError, update wizard fields to match new schema |


Quick reference: useful commands
