# Deployment Guide

This guide takes you from a prepared Kubernetes cluster to a running RootCause Platform. It assumes your infrastructure team has completed the checklist in the Requirements doc.

Four CLI steps bootstrap the operator; everything else happens in the Admin UI.

***

## Before you start

Confirm you have:

* [ ] Kubernetes cluster running (1.26+) with `kubectl` cluster-admin access
* [ ] Helm 3.12+ installed
* [ ] Ingress controller deployed with wildcard DNS pointing to its IP
* [ ] Object storage buckets created with credentials ready
* [ ] Container registry credentials from RootCause (GitLab deploy token)
* [ ] DNS base domain and subdomains planned
* [ ] TLS certificate available
* [ ] Identity provider details ready (if using external OIDC or SAML)
* [ ] LLM provider API keys ready (at least two providers recommended — see Step 7)

If anything is missing, go back to the **Requirements doc** and hand the checklist to your infrastructure team.

***

## Step 1: Create the namespace

```bash
kubectl create namespace rootcause
```

All RootCause components will be deployed into this namespace.

**Verify:**

```bash
kubectl get namespace rootcause
```

***

## Step 2: Create registry credentials

The operator and platform images are hosted on GitLab Container Registry. Create an image pull secret:

```bash
kubectl create secret docker-registry regcred \
  -n rootcause \
  --docker-server=registry.gitlab.com \
  --docker-username=<your-username> \
  --docker-password=<your-deploy-token>
```

Use the GitLab deploy token provided by RootCause. It needs `read_registry` scope. The operator reuses this secret for both image pulls and OCI chart pulls — no separate chart registry secret is needed.

**Verify:**

```bash
kubectl get secret regcred -n rootcause
```
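
If you want to confirm the token itself before anything tries to pull, you can log in to the registry directly. An optional check, assuming OCI registry support (added in Helm 3.8, so already covered by the Helm 3.12+ prerequisite):

```bash
# Optional: verify the deploy token authenticates against the registry
helm registry login registry.gitlab.com \
  --username <your-username> \
  --password <your-deploy-token>
```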

***

## Step 3: Install the MongoDB Community Operator

The platform uses the MongoDB Community Operator to manage MongoDB instances:

```bash
helm install community-operator community-operator \
  --repo https://mongodb.github.io/helm-charts \
  -n rootcause \
  --set operator.watchNamespace=rootcause
```

> **Skip this step** if a MongoDB Community Operator is already installed cluster-wide. It should be configured to watch all namespaces.

**Verify:**

```bash
kubectl get pods -n rootcause -l name=mongodb-kubernetes-operator
```

You should see one pod in `Running` state.
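
If you are scripting the install, you can block until the operator is up instead of polling. A sketch using the same label selector as the verify command above:

```bash
# Wait up to two minutes for the MongoDB operator pod to become Ready
kubectl wait --for=condition=Ready pod \
  -l name=mongodb-kubernetes-operator \
  -n rootcause --timeout=120s
```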

***

## Step 4: Install the RootCause Operator

```bash
helm install rootcause-operator \
  oci://registry.gitlab.com/perceptura/client-deployments/releases-platform/charts/rootcause-operator \
  -n rootcause
```

The chart defaults `imagePullSecrets` to `regcred`, so no extra flags are needed.

**Verify:**

```bash
kubectl get pods -n rootcause
```

You should see:

```
rootcause-operator-controller-...   1/1   Running
rootcause-operator-admin-...        1/1   Running
mongodb-kubernetes-operator-...     1/1   Running
```
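
The same check can be automated. A sketch that waits on every deployment in the namespace:

```bash
# Wait up to five minutes for all deployments to report Available
kubectl wait --for=condition=Available deployment --all \
  -n rootcause --timeout=300s
```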

***

## Step 5: Access the Admin UI

Port-forward the Admin UI to your local machine:

```bash
kubectl port-forward -n rootcause svc/rootcause-operator-admin 3000:3000
```

Open <http://localhost:3000> in your browser.

### Log in

The operator generates a master password during installation. Retrieve it:

```bash
kubectl get secret rootcause-bootstrap-auth -n rootcause \
  -o jsonpath='{.data.ADMIN_PASSWORD}' | base64 -d
```

Enter this password on the login page.

***

## Step 6: Configure via the Bootstrap Wizard

The Admin UI presents a wizard with several sections. Walk through each one.

### Release Versions

* **Dependencies chart version** and **Platform chart version**: Use the latest versions unless RootCause support has told you otherwise.

### Telemetry

* **Client ID**: Your organization identifier (provided by RootCause, e.g., `acme-corp`).

### Storage

Configure your object storage backend:

| Field               | Azure                                                                                      | AWS                            | GCP                         |
| ------------------- | ------------------------------------------------------------------------------------------ | ------------------------------ | --------------------------- |
| **Storage backend** | Azure Blob Storage                                                                         | S3                             | Google Cloud Storage        |
| **Storage class**   | `managed-csi`                                                                              | `gp3`                          | `standard` or `premium-rwo` |
| **Credentials**     | Storage account name, connection string, account key, endpoint suffix (`core.windows.net`) | Access key, secret key, region | Service account JSON key    |

Create three buckets/containers:

| Bucket        | Purpose                       |
| ------------- | ----------------------------- |
| Datasets      | Uploaded data files           |
| Digital twins | Generated digital twin models |
| ML models     | Trained ML model artifacts    |
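
How you create the buckets depends on your cloud. A minimal sketch for AWS, assuming the AWS CLI is configured and using illustrative bucket names (substitute your own naming convention; the rough equivalents are `az storage container create` on Azure and `gcloud storage buckets create` on GCP):

```bash
# Create the three buckets the platform expects (names are examples)
aws s3 mb s3://rootcause-datasets
aws s3 mb s3://rootcause-digital-twins
aws s3 mb s3://rootcause-ml-models
```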

### Networking

| Field             | What to enter                                                                    |
| ----------------- | -------------------------------------------------------------------------------- |
| **Base domain**   | Your base domain (e.g., `rootcause.example.com`)                                 |
| **Ingress class** | `nginx` for most deployments. `azure/application-gateway` for Azure App Gateway. |
| **Subdomains**    | Platform, Auth, and LiteLLM subdomains (e.g., `platform`, `auth`, `litellm`)     |

#### Ingress annotations

Add annotations under **All ingresses (base)** based on your ingress controller:

**nginx:**

| Key                                        | Value  |
| ------------------------------------------ | ------ |
| `nginx.ingress.kubernetes.io/ssl-redirect` | `true` |

For TLS with a pre-existing wildcard certificate, select **Manual (bring your own secret)** under TLS mode and provide the secret name.

**Azure Application Gateway:**

| Key                                          | Value   |
| -------------------------------------------- | ------- |
| `appgw.ingress.kubernetes.io/ssl-redirect`   | `false` |
| `appgw.ingress.kubernetes.io/use-private-ip` | `true`  |

For each ingress (Platform, Auth, LiteLLM), also add:

| Key                                                 | Value                     |
| --------------------------------------------------- | ------------------------- |
| `appgw.ingress.kubernetes.io/appgw-ssl-certificate` | *(your certificate name)* |

> **Azure Application Gateway note:** App Gateway does not resolve loopback external URLs from within the cluster. You will need a `hostAliases` resource patch in the Advanced section — see [Azure Application Gateway patches](#azure-application-gateway-patches) below.

### Identity & Access

Choose based on the decision you made in the Requirements doc:

**External OIDC (recommended)**

| Field               | What to enter                                                                               |
| ------------------- | ------------------------------------------------------------------------------------------- |
| Authentication mode | OIDC                                                                                        |
| Deploy FusionAuth   | No                                                                                          |
| Issuer              | Your IdP's issuer URL                                                                       |
| Client ID           | Your application's client ID                                                                |
| Client secret       | Your application's client secret                                                            |
| Well-known URL      | Your IdP's OpenID configuration URL                                                         |
| Logout URL          | Your IdP's logout endpoint (include a `post_logout_redirect_uri` back to your platform URL) |

> **Azure EntraID specifics:** The issuer URL is `https://login.microsoftonline.com/<tenant-id>/v2.0`. In the Azure Portal, ensure your app registration has:
>
> 1. **Redirect URI**: `https://<platform-subdomain>.<base-domain>/api/auth/callback/login` (type: Web)
> 2. **Token configuration**: Include `email`, `profile`, and `openid` scopes
> 3. **API permissions**: `openid`, `profile`, `email` (Microsoft Graph, delegated)
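
Before saving this wizard section, it's worth confirming the well-known URL resolves. A quick sanity check, assuming `curl` and `jq` are available:

```bash
# Fetch the OpenID configuration and confirm the issuer matches what you entered
curl -s "https://login.microsoftonline.com/<tenant-id>/v2.0/.well-known/openid-configuration" | jq .issuer
```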

**External SAML**

| Field                | What to enter                                                 |
| -------------------- | ------------------------------------------------------------- |
| Authentication mode  | SAML                                                          |
| Deploy FusionAuth    | No                                                            |
| *(remaining fields)* | Metadata URL or XML, entity ID, and certificate from your IdP |

**Managed FusionAuth (POC / no existing IdP)**

| Field                | What to enter     |
| -------------------- | ----------------- |
| Authentication mode  | Built-in (no SSO) |
| Deploy FusionAuth    | Yes               |
| FusionAuth API key   | *(leave blank)*   |
| FusionAuth Tenant ID | *(leave blank)*   |

Leave credentials blank. The operator auto-provisions everything: API keys, tenant, application, and OIDC configuration.

### Dependencies

All infrastructure dependencies are deployed by the operator by default:

| Dependency | Default | When to change                                                                                    |
| ---------- | ------- | ------------------------------------------------------------------------------------------------- |
| PostgreSQL | Deploy  | Use external if your organization provides a managed PostgreSQL instance                          |
| MongoDB    | Deploy  | Typically always deployed. Use 1 replica for testing, 3 for production.                           |
| Redis      | Deploy  | Typically always deployed. Use 1 replica for testing, 3 for production.                           |
| Temporal   | Deploy  | Always deployed                                                                                   |
| LiteLLM    | Deploy  | Set to External only if you already have a LiteLLM instance. Enter your existing URL and API key. |

### Components

Configure replicas and resources per component. Defaults work for testing. For production, use these as a starting point:

| Component    | Replicas | CPU request | Memory request |
| ------------ | -------- | ----------- | -------------- |
| Platform     | 2-3      | 500m        | 1Gi            |
| Data Service | 2-3      | 1           | 2Gi            |
| Data Fusion  | 2-3      | —           | —              |
| ML Jobs      | 3-5      | 2           | 4Gi            |

See the **Upgrades & Operations** doc for detailed scaling guidance.

### Advanced

For most deployments, you can skip this section. It's here for edge cases.

* **Resource patches**: Deep-merge patches into rendered Kubernetes manifests (annotations, tolerations, node selectors, `hostAliases`, etc.)
* **Raw Helm overrides**: Free-form YAML merged over computed values for any chart setting not covered by the wizard

#### Azure Application Gateway patches

Azure Application Gateway does not resolve external URLs from within the cluster. The platform pod needs a `hostAliases` entry to route auth traffic to the Application Gateway's IP directly.

Add a **Deployment** resource patch:

| Field       | Value              |
| ----------- | ------------------ |
| Chart       | platform           |
| Kind        | deployment         |
| Object name | rootcause-platform |

Patch content:

```yaml
spec:
  template:
    spec:
      hostAliases:
      - ip: "<app-gateway-public-ip>"
        hostnames:
        - "<auth-subdomain>.<base-domain>"
```
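
Once the operator reconciles, you can confirm the patch landed. A quick check, reusing the object name from the table above:

```bash
# Should print the hostAliases entry added by the patch
kubectl get deployment rootcause-platform -n rootcause \
  -o jsonpath='{.spec.template.spec.hostAliases}'
```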

App Gateway also requires explicit path definitions. Add these in **Raw Helm Overrides**:

**Platform chart overrides:**

```yaml
platform:
  ingress:
    hosts:
      - host: <platform-subdomain>.<base-domain>
        paths:
          - path: /
            pathType: Exact
            port: 80
          - path: /*
            pathType: Prefix
            port: 80
```

**Dependencies chart overrides:**

```yaml
fusionauth:
  ingress:
    paths:
      - path: /*
        pathType: Prefix
      - path: /
        pathType: Exact
```

> Without both path types, App Gateway may return 502 errors on some requests.

#### Node selector (dedicated node pools)

If your cluster uses a dedicated node pool for RootCause, add a node selector:

| Key                | Value                |
| ------------------ | -------------------- |
| *(your label key)* | *(your label value)* |

For example, `usage: rootcause` to match an AKS node pool label.
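
To confirm the label actually exists on your nodes before relying on it, assuming the example `usage: rootcause` label:

```bash
# Lists only the nodes that carry the label; empty output means no match
kubectl get nodes -l usage=rootcause
```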

### Deploy

Review the **Deployment Summary** at the bottom of the wizard, then click **Apply configuration**.

The operator will:

1. Create required secrets (storage credentials, platform config)
2. Create the `RootCauseInstallation` custom resource
3. Deploy infrastructure dependencies (PostgreSQL, Redis, MongoDB, Temporal, LiteLLM, and optionally FusionAuth)
4. Deploy platform services (Platform, Data Service, Data Fusion, ML Jobs)

Watch the **Deployment status** panel on the right. The phase progresses from `Reconciling` to `Ready`, typically within 2-5 minutes.
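
You can follow the same rollout from a terminal. A simple sketch:

```bash
# Watch pods come up as the operator deploys dependencies and services
kubectl get pods -n rootcause -w
```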

***

## Step 7: Configure LLM models

After the platform is deployed, configure your LLM models in LiteLLM. The platform requires at least two models to be configured (e.g., one OpenAI, one Anthropic). This provides fallback availability — if one provider is down, the platform uses the other.

### Access the LiteLLM UI

Get the LiteLLM master password (the username is `admin`):

```bash
kubectl get secret rootcause-dependencies-litellm-secrets \
  -n rootcause -o jsonpath='{.data.LITELLM_MASTER_KEY}' | base64 -d
```

Navigate to `https://<litellm-subdomain>.<base-domain>/ui` and log in.

> If you configured an external LiteLLM instance in the wizard, skip to Step 8. Configure models in your existing LiteLLM instead.
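
Optionally, confirm the proxy is reachable before logging in. A hedged check, assuming LiteLLM's standard liveness endpoint is exposed:

```bash
# Returns a small response if the proxy is alive
curl -s https://<litellm-subdomain>.<base-domain>/health/liveliness
```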

### Add credentials

1. Go to **Models + Endpoints > LLM Credentials**
2. For each provider: enter a name, select the provider, and paste your API key
3. Recommended minimum: one OpenAI credential and one Anthropic credential

### Create models

1. Go to **Models + Endpoints > Add Model**
2. Select the provider, give the model a name, and choose the credentials you just created
3. Click **Test connection** to verify the model responds
4. Repeat for each model

### Set up fallbacks

Fallbacks ensure the platform always has a working LLM, even during provider outages:

1. Go to **Settings > Router Settings > Add Fallbacks**
2. Select your primary model
3. Add one or more fallback models in order of preference
4. Click **Add Fallback**

> **Tip:** Match fallbacks by weight class — if your primary is GPT-4, fall back to Claude Sonnet, not to a smaller model. The exact model doesn't matter as much as matching capability.

### Create an API key

1. Go to **Virtual Keys > Create New Key**
2. Give it a name and select which models it can access
3. Save the generated key — this is the `LITELLM_API_KEY` the platform uses internally
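
Optionally, smoke-test the new virtual key from the command line. A minimal sketch against LiteLLM's OpenAI-compatible endpoint, substituting your own subdomain, key, and one of the model names you created above:

```bash
# Send a one-word prompt through the proxy; a JSON completion means the key works
curl -s https://<litellm-subdomain>.<base-domain>/v1/chat/completions \
  -H "Authorization: Bearer <your-virtual-key>" \
  -H "Content-Type: application/json" \
  -d '{"model": "<model-name>", "messages": [{"role": "user", "content": "ping"}]}'
```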

> **Important:** LiteLLM model configuration is stored in PostgreSQL, not in Helm values or the CR. It is **not preserved** across clean reinstalls. If you ever need to migrate or reinstall, screenshot or export your model configuration first.

***

## Step 8: Create platform users

Navigate to the **Users** page in the Admin UI.

**With managed FusionAuth:**

1. Click **+ Add user**
2. Enter email, password, first name, last name
3. Click **Create user**

The operator creates a FusionAuth account, registers it to the platform application, and adds the email to the admin list automatically.

**With external OIDC or SAML:**

Create users in your external identity provider (EntraID, Okta, etc.), then add their email addresses in the **Platform admin emails** section on the Users page. These emails are stored in the CR and mounted as `PLATFORM_ADMIN_EMAILS` on the platform deployment.
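
To confirm the emails reached the deployment, you can inspect its environment. A sketch that assumes the variable is set directly on the container (rather than injected from a secret) and reuses the deployment name from the patch example earlier:

```bash
# Should print PLATFORM_ADMIN_EMAILS if the emails were mounted as expected
kubectl get deployment rootcause-platform -n rootcause \
  -o jsonpath='{.spec.template.spec.containers[0].env}' | grep PLATFORM_ADMIN_EMAILS
```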

***

## Step 9: Log in and verify

Navigate to `https://<platform-subdomain>.<base-domain>`.

* With FusionAuth: click "Continue with OIDC SSO" and log in with the credentials from Step 8
* With external OIDC/SAML: you'll be redirected to your identity provider

### Post-install smoke test

Run through these checks to confirm everything is working:

| # | Check                 | How                                                                      |
| - | --------------------- | ------------------------------------------------------------------------ |
| 1 | All pods running      | `kubectl get pods -n rootcause` — all should be `Running` or `Completed` |
| 2 | Admin UI accessible   | <http://localhost:3000> loads (with port-forward active)                 |
| 3 | Platform login works  | Navigate to platform URL, complete login flow, reach the home page       |
| 4 | LLM responds          | Open the LiteLLM playground, send a test prompt, get a response          |
| 5 | Data works end-to-end | Create a workspace, open a Data View, confirm it loads                   |

If any check fails, see the **Troubleshooting** section in the Upgrades & Operations doc.

***

## Appendix: Migrating from legacy Helm deployment

If you are migrating from an older Helm-based deployment (pre-operator), follow these steps before starting at Step 1 above.

### Before you begin

1. **Save LiteLLM model configuration** — screenshot or export all models, API keys, and routing rules from your existing LiteLLM UI. These are stored in PostgreSQL and will not survive the migration.
2. **Export current Helm values** — `helm get values rootcause-platform -n <namespace>` and `helm get values rootcause-dependencies -n <namespace>`. You'll use these as reference when filling in the bootstrap wizard.
3. **Backup databases** (recommended):
   * `mongodump --host <host> --out ./mongo-backup`
   * `pg_dump -h <host> -U postgres -d appdb > postgres-backup.sql`

### Clean up existing installation

Choose one method:

**Method A: Clean uninstall** (preserves namespace, use if namespace is shared)

```bash
# Uninstall Helm releases — platform first, then dependencies
helm uninstall rootcause-platform -n <namespace>
helm uninstall rootcause-dependencies -n <namespace>

# Remove any operator resources if present
helm uninstall rootcause-operator -n <namespace> 2>/dev/null
helm uninstall community-operator -n <namespace> 2>/dev/null

# Clean up remaining resources
kubectl delete secrets --all -n <namespace>
kubectl delete configmaps --all -n <namespace>
kubectl delete jobs --all -n <namespace>

# Delete persistent volume claims (THIS DESTROYS ALL DATA)
kubectl delete pvc --all -n <namespace>

# Verify namespace is clean
kubectl get all,pvc,secrets,configmaps -n <namespace>
```

**Method B: Delete and recreate namespace** (simpler, but destroys everything in the namespace)

```bash
kubectl delete namespace <namespace>
kubectl wait --for=delete namespace/<namespace> --timeout=120s 2>/dev/null
```

### After cleanup

1. Start the deployment guide from **Step 1**
2. When filling in the bootstrap wizard, use your exported Helm values as reference
3. After deployment, re-enter your LiteLLM model configuration from the screenshot you saved
4. Verify with the smoke test in Step 9

This is a one-time procedure. Once migrated to the operator, all future updates happen through the Admin UI.

***

## Next step

Proceed to the **Upgrades & Operations** doc for day-2 operations: applying updates, rolling back, scaling, and troubleshooting.
