# Data Views

Real-world data rarely comes in the exact shape you need for analysis. Sales data lives in one system, customer attributes in another, and product information somewhere else. Even within a single dataset, you often need to filter, aggregate, or transform before the data is ready for causal discovery.

Data Views solve this problem. They're a flexible layer between your raw data and your analysis—letting you join tables, apply filters, create calculated fields, and reshape data without ever touching the originals. When your source data updates, your Data Views update too.

Think of Data Views as saved recipes for data preparation. Define the transformations once, and RootCause.ai applies them automatically whenever you need the data. This means your analyses stay consistent and reproducible, even as underlying data changes.

(SCREENSHOT: Data View editor showing a visual pipeline of operations connecting source datasets to output)

***

### Why Data Views Matter

**For Causal Discovery**

Causal discovery works best with a single, coherent dataset that represents your analytical domain. If you're analyzing customer behavior, you want customer attributes, transactions, support interactions, and outcomes all joined together. Data Views let you build this unified view.

**For Data Quality**

Raw data often contains issues—duplicates, missing values, outliers. Data Views let you clean data consistently, so these issues don't propagate into your analyses.

**For Reproducibility**

When you build a Data View, you're documenting exactly how you prepared the data. Anyone on your team can see the transformations, understand the logic, and reproduce the results.

***

### Creating a Data View

1. Navigate to **Data** → **Data Views**
2. Click **New Data View**
3. Add one or more **Sources** (datasets or other Data Views)
4. Apply **Operations** to transform the data
5. Preview the result to verify it looks correct
6. Save your Data View

(SCREENSHOT: New Data View button and empty editor state with "Add Source" prompt)

***

### Adding Sources

A Data View starts with one or more sources. The source determines what data flows into your pipeline.

**Dataset Source**

The most common starting point. Select a dataset from your workspace, and all of its columns and rows become available for transformation.

**Data View Source**

Use another Data View as a source. This lets you chain transformations—build a "cleaned customers" view, then use it as a source for more specific analyses. It's like building with Lego blocks.

**Ontology Concept Source**

Pull data based on an Ontology Concept, automatically gathering data tagged with that concept from every dataset where it appears. Useful for building unified views across multiple data sources.

(SCREENSHOT: Source selector panel showing Dataset, Data View, and Concept options)

***

### Operations Pipeline

Operations transform your data step by step. Each operation takes the output of the previous step as its input, creating a pipeline.

(SCREENSHOT: Operations pipeline showing multiple connected steps with data flowing through them)

**Data Transformation Operations:**

| Operation   | Description                             | Example Use Case                             |
| ----------- | --------------------------------------- | -------------------------------------------- |
| SQL Query   | Execute custom SQL queries on your data | Complex calculations, window functions       |
| Filter      | Filter rows based on conditions         | Remove test accounts, focus on recent data   |
| Aggregate   | Group and aggregate data by columns     | Sum revenue by customer, count events by day |
| Join        | Combine data from two sources           | Link orders to customer attributes           |
| Concatenate | Stack data sources vertically           | Combine monthly files into one dataset       |
| Pivot       | Transform rows into columns             | Turn event types into separate columns       |
| Unpivot     | Transform columns into rows             | Normalize wide-format survey data            |
| Sort        | Sort data by one or more columns        | Order by date for time series analysis       |
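
These operations don't require any code inside the product, but it can help to see what they do. As a rough pandas analogy (illustrative only, not RootCause.ai's API), here is how a Pivot step could turn event types into separate columns, matching the example use case above, and how Unpivot reverses it:

```python
import pandas as pd

# Long-format event log: one row per (customer, event type)
events = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "event_type": ["click", "purchase", "click", "purchase"],
    "count": [5, 1, 3, 0],
})

# Pivot: event types become separate columns, one row per customer
wide = events.pivot(index="customer_id", columns="event_type", values="count")

# Unpivot (melt) reverses it back to long format
long_again = wide.reset_index().melt(
    id_vars="customer_id", var_name="event_type", value_name="count"
)
```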

**Column Operations:**

| Operation    | Description                                                          |
| ------------ | -------------------------------------------------------------------- |
| Split Column | Split a column into multiple columns (e.g., full name → first, last) |
| Drop Columns | Remove unwanted columns to simplify the dataset                      |

**Data Cleaning Operations:**

| Operation         | Description                                                         |
| ----------------- | ------------------------------------------------------------------- |
| Remove Duplicates | Remove duplicate rows                                               |
| Merge Duplicates  | Merge duplicate rows with a strategy (keep first, sum values, etc.) |
| Impute Nulls      | Fill missing values with mean, median, or custom values             |
| Drop Nulls        | Remove rows with missing values in specified columns                |
| Replace Values    | Replace specific values (fix typos, standardize categories)         |
| Replace Type      | Change column data types                                            |
| Clip Outliers     | Cap extreme values at a threshold so outliers don't skew analysis   |
| Normalize Numeric | Scale numeric values to a standard range                            |
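
As a sketch of what the cleaning steps do, here is a pandas analogy (again illustrative only, not the product's API) combining Remove Duplicates, Impute Nulls, and Clip Outliers on a hypothetical table:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "age": [34, 34, None, 29],
    "spend": [120.0, 120.0, 80.0, 10_000.0],
})

# Remove Duplicates: drop exact duplicate rows
df = df.drop_duplicates()

# Impute Nulls: fill missing ages with the column mean
df["age"] = df["age"].fillna(df["age"].mean())

# Clip Outliers: cap spend at the 95th percentile instead of dropping rows
cap = df["spend"].quantile(0.95)
df["spend"] = df["spend"].clip(upper=cap)
```

Order matters here: deduplicating first keeps the repeated row from biasing the imputed mean and the outlier threshold.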

**Category Operations:**

| Operation               | Description                                                            |
| ----------------------- | ---------------------------------------------------------------------- |
| Merge Categories        | Combine similar categories (e.g., "USA", "US", "United States" → "US") |
| Fill Missing Categories | Fill missing category values with a default                            |
| Map Categories          | Map categories to new values (rename, recode)                          |
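
Conceptually, Merge Categories and Fill Missing Categories amount to mapping known variants onto canonical values and defaulting the rest. A minimal pandas sketch (an analogy, not the product's API):

```python
import pandas as pd

countries = pd.Series(["USA", "US", "United States", "Canada", None])

# Merge Categories: map known variants onto one canonical value
canonical = {"USA": "US", "United States": "US"}
countries = countries.replace(canonical)

# Fill Missing Categories: assign a default to unlabelled rows
countries = countries.fillna("Unknown")
```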

**Time Series Operations:**

| Operation             | Description                                               |
| --------------------- | --------------------------------------------------------- |
| Interpolate Time      | Fill gaps in time series data                             |
| Time Series Aggregate | Aggregate data by time periods (daily → weekly → monthly) |
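
The two time series operations correspond to filling gaps and resampling to coarser periods. A pandas sketch with a hypothetical daily sales series (illustrative only; RootCause.ai runs these steps for you):

```python
import pandas as pd

# Daily sales with a gap on Jan 3
idx = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04"])
sales = pd.Series([10.0, 12.0, 16.0], index=idx)

# Interpolate Time: fill the gap with a time-weighted estimate
daily = sales.resample("D").asfreq().interpolate(method="time")

# Time Series Aggregate: roll daily values up to weekly totals
weekly = daily.resample("W").sum()
```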

***

### Common Workflows

**Example: Building a Customer 360 View**

You have customer data in three places: demographics in your CRM, transactions in your database, and support tickets in your helpdesk. Here's how to combine them:

1. Add all three datasets as sources
2. Join demographics to transactions on `customer_id`
3. Aggregate support tickets to get `ticket_count` per customer
4. Join the aggregated tickets to your customer data
5. Result: one row per customer with demographics, purchase history, and support metrics
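
The same five steps can be sketched in pandas with hypothetical table and column names (RootCause.ai builds this visually; the code is just an analogy for what the pipeline computes):

```python
import pandas as pd

# Hypothetical sources standing in for CRM, database, and helpdesk
demographics = pd.DataFrame({"customer_id": [1, 2], "region": ["EU", "US"]})
transactions = pd.DataFrame({"customer_id": [1, 1, 2], "total": [50.0, 30.0, 20.0]})
tickets = pd.DataFrame({"customer_id": [1, 2, 2], "ticket_id": [10, 11, 12]})

# Step 2: join demographics to transactions on customer_id
orders = demographics.merge(transactions, on="customer_id", how="left")

# Step 3: aggregate support tickets to ticket_count per customer
ticket_counts = (
    tickets.groupby("customer_id").size().rename("ticket_count").reset_index()
)

# Steps 4-5: one row per customer with demographics, spend, and support metrics
customer_360 = (
    orders.groupby(["customer_id", "region"], as_index=False)["total"].sum()
    .merge(ticket_counts, on="customer_id", how="left")
)
```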

(SCREENSHOT: Customer 360 Data View showing joined sources with preview of combined columns)

**Filtering Data**

Maybe you only want to analyze active customers from the last year:

1. Add a Filter operation
2. Define conditions: `status = 'active' AND last_purchase_date >= '2024-01-01'`
3. Combine conditions with AND/OR logic as needed
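
The filter condition above behaves like a boolean mask over the rows. As a hedged pandas sketch (hypothetical column names, not the product's API):

```python
import pandas as pd

customers = pd.DataFrame({
    "status": ["active", "churned", "active"],
    "last_purchase_date": pd.to_datetime(["2024-03-01", "2024-02-01", "2023-06-15"]),
})

# Equivalent of: status = 'active' AND last_purchase_date >= '2024-01-01'
active_recent = customers[
    (customers["status"] == "active")
    & (customers["last_purchase_date"] >= "2024-01-01")
]
```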

**Aggregating Data**

To analyze at a higher level (e.g., monthly revenue by region):

1. Add an Aggregate operation
2. Group by: `region`, `month`
3. Aggregate: `SUM(revenue)`, `COUNT(orders)`
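
In dataframe terms, the Aggregate operation is a group-by followed by the chosen aggregations. A minimal pandas analogy (illustrative only, with invented sample data):

```python
import pandas as pd

orders = pd.DataFrame({
    "region": ["EU", "EU", "US"],
    "month": ["2024-01", "2024-01", "2024-01"],
    "revenue": [100.0, 50.0, 75.0],
})

# Group by region and month, then SUM(revenue) and COUNT(orders)
monthly = orders.groupby(["region", "month"], as_index=False).agg(
    revenue=("revenue", "sum"),
    orders=("revenue", "size"),
)
```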

**Custom SQL**

For complex transformations, use the SQL Query operation. You get full SQL power:

```sql
SELECT 
  customer_id,
  DATE_TRUNC('month', order_date) as month,
  SUM(total) as monthly_revenue,
  COUNT(*) as order_count,
  AVG(total) as avg_order_value
FROM source_data
GROUP BY customer_id, DATE_TRUNC('month', order_date)
```
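
If you want to sanity-check this logic outside the product, the query can be sketched in pandas (an illustrative equivalent with made-up data, reusing the column names from the SQL above):

```python
import pandas as pd

source_data = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "order_date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03"]),
    "total": [40.0, 60.0, 25.0],
})

# Same shape as the SQL: truncate to month, then aggregate per customer
result = (
    source_data
    .assign(month=source_data["order_date"].dt.to_period("M").dt.to_timestamp())
    .groupby(["customer_id", "month"], as_index=False)
    .agg(
        monthly_revenue=("total", "sum"),
        order_count=("total", "size"),
        avg_order_value=("total", "mean"),
    )
)
```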

(SCREENSHOT: SQL Query operation editor with syntax highlighting and preview results)

***

### What Happens in the Background

Understanding the execution model helps you build efficient Data Views:

1. **Plan Building** – RootCause.ai analyzes your operations and builds an execution plan
2. **Optimization** – Operations are optimized for performance (reordering filters, pushing down predicates)
3. **Execution** – The pipeline runs against your source data
4. **Schema Computation** – Output columns and types are determined
5. **Preview Generation** – A sample is generated so you can verify results
6. **Caching** – Results are cached for quick access

When source data changes (e.g., a dataset sync), Data Views automatically reflect the new data—no manual refresh needed.

***

### Using Data Views

**For Causal Analysis**

Data Views are the input for Digital Twins. When you create a Digital Twin, you select a Data View that contains all the variables you want to analyze. Go to **Intelligence** → **Digital Twin** and select your prepared Data View.

**For Reports**

Reference Data Views in your reports to include live, transformed data. Charts and tables pull from the Data View, so they update when the underlying data changes.

**For Export**

Sometimes you need to take your transformed data elsewhere. Export a Data View as CSV, Parquet, or JSON for use in other tools, sharing with stakeholders, or backup.

(SCREENSHOT: Export dialog showing format options and download button)

***

### Best Practices

**Keep It Simple**

Start with fewer operations and add complexity as needed. A Data View with 20 operations is hard to debug when something goes wrong. If you need complex logic, consider breaking it into multiple Data Views that chain together.

**Name Meaningfully**

Future you will thank present you. Names like `Customer_360_Monthly_Active` tell you what's inside. Names like `Data View 1` tell you nothing.

**Document Purpose**

Add descriptions explaining what the Data View contains and why. "Customer attributes joined with last 12 months of transactions, filtered to active customers only" is much more useful than a blank description.

**Test Incrementally**

Preview results after each operation. It's much easier to fix an issue when you know exactly which step introduced it.

***

### Next Steps

With your data prepared in a Data View, you're ready to add meaning:

* Tag columns with [Ontology Concepts](https://docs.rootcause.ai/user-guide/data-management/ontology-concepts) to link related data across your workspace
* Create a [Digital Twin](https://docs.rootcause.ai/user-guide/digital-twin/creating-digital-twin) to discover the causal relationships hidden in your data
