# Upgrades & Operations

Day-2 operations for the RootCause Platform. This doc covers upgrades, rollback, scaling, configuration changes, and troubleshooting.

**Who this is for:** Data scientists can handle upgrades and routine operations (Sections 1-3). Infrastructure teams should review the backup, rollback, scaling, and troubleshooting sections.

***

## 1. How updates work

The RootCause Operator automatically checks the OCI registry for new chart versions. When an update is available, it appears on the **Releases** page in the Admin UI.

Key points:

* **You can jump directly to any version** — no need to step through intermediate releases
* **Dependencies and platform are upgraded together** — the operator manages both Helm releases as a unit
* **Updates are applied through the Admin UI** — no Helm commands needed for routine upgrades

***

## 2. Upgrading

### Before you upgrade (recommended)

The operator does not require backups before upgrading. However, we recommend these steps as best practice — they take a few minutes and give you a recovery path if anything unexpected happens.

**Recommended backup checklist:**

```bash
# Export the CR spec (captures your full configuration)
kubectl get rootcauseinstallation rootcause -n rootcause -o yaml > rc-backup-$(date +%Y%m%d).yaml

# PostgreSQL dump (contains platform data and LiteLLM model config)
# Note: no -t flag -- a TTY would add carriage returns to the redirected dump
kubectl exec postgres-0 -n rootcause -- \
  pg_dumpall -U postgres > postgres-backup-$(date +%Y%m%d).sql

# MongoDB dump (written inside the pod), then copy it off the pod
kubectl exec perceptura-mongo-0 -n rootcause -- \
  mongodump --out=/backup/$(date +%Y%m%d)
kubectl cp rootcause/perceptura-mongo-0:/backup/$(date +%Y%m%d) ./mongo-backup-$(date +%Y%m%d)
```

Also recommended:

* Screenshot or export your LiteLLM model configuration (models, API keys, fallback settings) — this lives in PostgreSQL and is covered by the dump above, but a screenshot is faster to reference if you need to re-enter settings manually
* Check release notes for the target version
* Test in a staging environment if you have one

### Applying an upgrade

**Via the Admin UI (recommended):**

1. Open the Admin UI
2. Go to the **Releases** page
3. Select the target version
4. Click **Request upgrade**
5. Watch the **Deployment status** panel — the phase progresses through `Reconciling` to `Ready`

**Via CLI** (for automation or GitOps workflows):

```bash
# Update the chart versions in the CR spec
kubectl patch rootcauseinstallation rootcause -n rootcause \
  --type=merge -p '{"spec":{"release":{"dependenciesChartVersion":"<new-version>","platformChartVersion":"<new-version>"}}}'
```

The operator detects the spec change and reconciles automatically.
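
For GitOps workflows, the same change can be committed declaratively instead of patched. A sketch of the relevant portion of the CR manifest, using the field names from the patch command above (the rest of the spec is omitted here):

```yaml
spec:
  release:
    dependenciesChartVersion: "<new-version>"
    platformChartVersion: "<new-version>"
```

Commit this change to the repository your GitOps tool applies to the cluster; the operator reconciles exactly as it would after a manual patch.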

**Monitor the upgrade:**

```bash
# Watch pods restart
kubectl get pods -n rootcause -w

# Check installation status
kubectl get rootcauseinstallation -n rootcause -o jsonpath='{.status.phase}'
```

**Verify:** Run the same smoke test from the Deployment guide (Step 9) — all pods running, Admin UI accessible, platform login works, LLM responds, data loads.

***

## 3. Rolling back

The operator treats rollback the same as any other version change. To go back to a previous version, you select it and apply it — the operator deploys whichever version you tell it to.

This works because the operator manages the full deployment state. It doesn't need to "undo" anything — it simply deploys the version you specify.

### Steps

1. Open the Admin UI > **Releases** page
2. Select the previous version you want to return to
3. Click **Request upgrade** (yes, "upgrade" — the operator does not distinguish between upgrades and downgrades)
4. Watch the **Deployment status** panel
5. Verify with the smoke test

### When implicit rollback isn't enough

In rare cases — for example, if an upgrade included a database schema migration that isn't backward-compatible — selecting the previous version may not fully restore the old behavior. If this happens and you took backups before upgrading:

**Restore PostgreSQL:**

```bash
kubectl exec -i postgres-0 -n rootcause -- \
  psql -U postgres < postgres-backup-<date>.sql
```

**Restore MongoDB:**

```bash
kubectl exec -it perceptura-mongo-0 -n rootcause -- \
  mongorestore --drop /backup/<date>
```

**Re-enter LiteLLM configuration** from your screenshot/export if needed.

Then select the previous version on the Releases page and apply it.

***

## 4. Scaling

All scaling is configured in the **Components** section of the Admin UI bootstrap wizard. Change values and click **Apply configuration** — the operator reconciles.

### Replica counts by workload size

| Component    | Dev/Test | Small prod (up to 50 users) | Medium prod (50–200 users) | Large prod (200+ users) |
| ------------ | -------- | --------------------- | ----------------------- | ----------------------- |
| Platform     | 1        | 2                     | 3                       | 5                       |
| Data Service | 1        | 3                     | 5                       | 10                      |
| ML Jobs      | 2        | 5                     | 10                      | 20                      |

### Resource limits per component

**Platform (Web UI):**

|        | Request | Limit |
| ------ | ------- | ----- |
| CPU    | 500m    | 2     |
| Memory | 1 Gi    | 4 Gi  |

**Data Service:**

|        | Request | Limit |
| ------ | ------- | ----- |
| CPU    | 1       | 4     |
| Memory | 2 Gi    | 8 Gi  |

**ML Jobs:**

|        | Request | Limit |
| ------ | ------- | ----- |
| CPU    | 2       | 8     |
| Memory | 4 Gi    | 16 Gi |

For very large models (100+ variables), ML Jobs may need more:

|        | Request | Limit |
| ------ | ------- | ----- |
| CPU    | 4       | 16    |
| Memory | 8 Gi    | 32 Gi |
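
The request figures above can be combined with the replica counts from the sizing table for a back-of-envelope capacity check. A minimal sketch in shell arithmetic, assuming the standard (non-large-model) per-component requests, with CPU in millicores and memory in Gi:

```shell
#!/usr/bin/env bash
# Total CPU/memory *requests* for a given replica profile, using the
# per-component request values from the tables above.
total_requests() {
  local platform=$1 data=$2 ml=$3
  # CPU in millicores: 500m / 1000m / 2000m per replica respectively
  local cpu=$(( platform * 500 + data * 1000 + ml * 2000 ))
  # Memory in Gi: 1 / 2 / 4 per replica respectively
  local mem=$(( platform * 1 + data * 2 + ml * 4 ))
  echo "${cpu}m CPU, ${mem}Gi memory"
}

total_requests 3 5 10   # medium prod profile: prints "26500m CPU, 53Gi memory"
```

Remember these are requests only; the cluster also needs headroom up to the limits for bursty components.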

### Horizontal Pod Autoscaling (HPA)

HPA can be configured via **Raw Helm Overrides** in the Advanced section of the wizard.

**CPU-based autoscaling (platform and data service):**

```yaml
platform:
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70

dataService:
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 15
    targetCPUUtilizationPercentage: 60
```

**Queue-depth-based autoscaling (ML Jobs):**

```yaml
mlJobs:
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 30
    metrics:
      - type: External
        external:
          metric:
            name: rabbitmq_queue_messages
            selector:
              matchLabels:
                queue: ml-jobs
          target:
            type: Value
            value: 10
```
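
With a `Value`-type target like the one above, the HPA scales roughly in proportion to the metric: desired replicas is the ceiling of currentReplicas × currentMetric / target, clamped to the min/max bounds. A sketch of that arithmetic (this is standard Kubernetes HPA behavior for `Value` targets, not something specific to this chart):

```shell
#!/usr/bin/env bash
# desired = ceil(current_replicas * metric / target), clamped to [min, max]
hpa_desired() {
  local current=$1 metric=$2 target=$3 min=$4 max=$5
  local desired=$(( (current * metric + target - 1) / target ))  # ceiling division
  (( desired < min )) && desired=$min
  (( desired > max )) && desired=$max
  echo "$desired"
}

hpa_desired 3 100 10 3 30   # deep queue: scales to the max of 30
hpa_desired 3 5 10 3 30     # shallow queue: clamped to the min of 3
```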

### Scaling databases

Databases scale through the Dependencies section of the wizard or via Raw Helm Overrides:

| Database   | How to scale                                                                        | Production recommendation           |
| ---------- | ----------------------------------------------------------------------------------- | ----------------------------------- |
| MongoDB    | Increase replica set members (1 → 3 for HA) and storage size                        | 3 replicas, 100 Gi+ storage         |
| PostgreSQL | Vertical scaling (increase CPU/memory). Add read replicas for read-heavy workloads. | 2-4 CPU, 4-8 Gi RAM, 10 Gi+ storage |
| Redis      | Vertical scaling (increase memory)                                                  | 1-4 CPU, 2-8 Gi RAM                 |
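
As an illustration, a Raw Helm Override for MongoDB HA might look like the fragment below. The exact value keys depend on the dependencies chart's schema, so treat every key here as an assumption and check the chart's values reference before applying:

```yaml
mongodb:
  replicaCount: 3   # assumed key: replica set members
  persistence:
    size: 100Gi     # assumed key: per-member storage
```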

### Node affinity and GPU

For dedicated node pools or GPU workloads, use Resource Patches in the Advanced section:

```yaml
# Pin ML Jobs to dedicated compute nodes
mlJobs:
  nodeSelector:
    node-type: compute
```

Or, for GPU support with local LLMs (use one block or the other; a single YAML document cannot repeat the top-level `mlJobs` key):

```yaml
mlJobs:
  nodeSelector:
    accelerator: nvidia-gpu
  resources:
    limits:
      nvidia.com/gpu: 1
```

***

## 5. Changing configuration

Change any setting in the bootstrap wizard and click **Apply configuration**. The operator detects the spec change (generation bump) and reconciles — upgrading both Helm releases with the updated values.

Changes take effect within minutes. The Deployment status panel shows the reconciliation progress.

**Force reconcile:** If the installation is stuck or you need to re-apply values without changing the spec, click **Force reconcile** in the Deployment status panel. Via CLI:

```bash
kubectl patch rootcauseinstallation rootcause -n rootcause \
  --type=merge -p '{"spec":{"actions":{"forceReconcileNonce":"'$(uuidgen)'"}}}'
```

***

## 6. Managing secrets

The **Secrets** page in the Admin UI lists all namespace secrets. You can view, create, edit, and delete secrets. Missing secrets required by the platform are highlighted.

**Rotating credentials:**

1. Update the secret value on the Secrets page (or via `kubectl`)
2. Restart the affected pods so they pick up the new value:

```bash
# Example: restart data-service after updating LLM provider API keys
kubectl rollout restart deployment rootcause-platform-data-service -n rootcause
```
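
When editing secrets with `kubectl` directly, remember that values in a Secret manifest are base64-encoded (the Admin UI handles this for you). A quick sketch of the round trip, using a made-up key value:

```shell
# Encode a new value before writing it into a Secret manifest
new_key="sk-example-123"          # hypothetical API key
encoded=$(printf '%s' "$new_key" | base64)

# Decoding recovers the original, as in the quick-reference commands later in this doc
printf '%s' "$encoded" | base64 -d
```

Use `printf '%s'` rather than `echo` so a trailing newline does not end up inside the stored value.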

***

## 7. Managing users

### With managed FusionAuth

Use the **Users** page in the Admin UI to add, edit, or remove users. The operator manages FusionAuth accounts and the platform admin email list automatically.

### With external OIDC/SAML

Manage user accounts in your identity provider (EntraID, Okta, etc.). In the Admin UI, use the **Platform admin emails** section on the Users page to control which users have admin access. These emails are stored in the CR and mounted as `PLATFORM_ADMIN_EMAILS`.

***

## 8. Uninstalling

On the Bootstrap page, scroll to **Danger Zone**, check the confirmation box, and click **Request undeploy**. The operator will:

1. Uninstall the platform Helm release
2. Uninstall the dependencies Helm release
3. Clean up operator-managed secrets

The installation CR is preserved in `Uninstalled` phase. You can redeploy by updating the spec and clicking **Apply configuration** — the operator will do a fresh deployment.

> **Warning:** Undeploying destroys all data in the deployed databases (PostgreSQL, MongoDB, Redis). Take backups before undeploying if you need to preserve data.

***

## 9. Troubleshooting

### Check operator logs

```bash
kubectl logs deploy/rootcause-operator-controller -n rootcause
```

### Check installation status

```bash
kubectl get rootcauseinstallation -n rootcause -o yaml
```

Key fields:

| Field               | What it tells you                                                           |
| ------------------- | --------------------------------------------------------------------------- |
| `status.phase`      | Current state: `Ready`, `Reconciling`, `Degraded`, `Uninstalled`, `Blocked` |
| `status.lastError`  | Error message if the phase is `Degraded`                                    |
| `status.conditions` | Detailed condition status for each subsystem                                |

### Check Helm releases

```bash
helm list -n rootcause
```

### Check pod logs

```bash
# Platform logs
kubectl logs -l app=platform -n rootcause --tail=50

# Data service logs
kubectl logs -l app=data-service -n rootcause --tail=50

# Check logs from a crashed container
kubectl logs <pod-name> -n rootcause --previous
```

### Common issues

| Issue                                     | Cause                                      | Fix                                                                                        |
| ----------------------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------ |
| `ImagePullBackOff`                        | Missing or invalid pull secret             | Verify `regcred` exists with correct credentials                                           |
| `Degraded` with Helm error                | Chart values validation failure            | Check `status.lastError` and fix the configuration in the wizard                           |
| FusionAuth stuck in Maintenance Mode      | `silentMode` not enabled                   | Ensure operator version includes the silent mode fix (0.1.14+)                             |
| OIDC login returns `invalid_redirect_uri` | Wildcard validation not enabled            | Ensure operator version includes `authorizedURLValidationPolicy: AllowWildcards` (0.1.16+) |
| PVC stuck in Pending                      | No matching storage class                  | Set `storageClass` in the bootstrap wizard to match your cluster                           |
| Pods stuck in Pending                     | Insufficient cluster resources             | Scale up nodes or reduce replica counts                                                    |
| `CrashLoopBackOff`                        | Application error on startup               | Check pod logs with `kubectl logs <pod-name> --previous`                                   |
| Upgrade stuck in `Reconciling`            | Operator unable to complete reconciliation | Check operator logs, then try Force reconcile                                              |

### Upgrade-specific issues

| Issue                                  | Cause                                        | Fix                                                                |
| -------------------------------------- | -------------------------------------------- | ------------------------------------------------------------------ |
| Pods not restarting after upgrade      | Old pods still running                       | `kubectl rollout restart deployment <name> -n rootcause`           |
| LiteLLM models missing after reinstall | Model config stored in PostgreSQL, not in CR | Re-enter model configuration from your backup/screenshot           |
| Configuration errors after upgrade     | New chart version has different value schema | Check `status.lastError`, update wizard fields to match new schema |

***

## Quick reference: useful commands

```bash
# Get Admin UI master password
kubectl get secret rootcause-bootstrap-auth -n rootcause \
  -o jsonpath='{.data.ADMIN_PASSWORD}' | base64 -d

# Get LiteLLM master password
kubectl get secret rootcause-dependencies-litellm-secrets \
  -n rootcause -o jsonpath='{.data.LITELLM_MASTER_KEY}' | base64 -d

# Export CR spec (backup your configuration)
kubectl get rootcauseinstallation rootcause -n rootcause -o yaml > rc-backup.yaml

# Watch all pods
kubectl get pods -n rootcause -w

# Check installation phase
kubectl get rootcauseinstallation -n rootcause -o jsonpath='{.status.phase}'

# Force reconcile
kubectl patch rootcauseinstallation rootcause -n rootcause \
  --type=merge -p '{"spec":{"actions":{"forceReconcileNonce":"'$(uuidgen)'"}}}'

# Check events (sorted by time)
kubectl get events -n rootcause --sort-by='.lastTimestamp'
```

