# Evaluation Tab

The Evaluation tab shows how well your Digital Twin actually works. Causal discovery finds relationships, but how accurately can the model predict outcomes? How well does it capture the true patterns in your data?

Before trusting simulation results, you should verify the model performs well. The Evaluation tab provides comprehensive metrics at both the global level (overall model quality) and the node level (how well each variable is predicted).

(SCREENSHOT: Evaluation tab overview showing global metrics and node performance charts)

***

### Why Evaluation Matters

A Digital Twin is only as useful as it is accurate. If the model predicts certain variables poorly, simulations involving those variables may be unreliable.

Evaluation helps you:

* **Trust results** – Know which predictions to rely on
* **Identify weak spots** – Find variables that need better data or modeling
* **Compare versions** – See if changes improved or hurt performance
* **Guide refinement** – Focus effort where it will help most

***

### Version Comparison

The Evaluation tab supports comparing multiple versions side-by-side:

**Selecting Versions**

1. Use the version multi-select dropdown
2. Check the versions you want to compare
3. Metrics update to show comparisons

**Why Compare?**

* See if configuration changes improved performance
* Track model quality over time
* Identify which version is best for production

(SCREENSHOT: Version comparison selector with multiple versions checked)

***

### Global Metrics

Top-level metrics summarize overall model quality:

**Average Accuracy**

For categorical and boolean variables: the average percentage of correct predictions across all such nodes.

* Higher is better
* 100% would mean perfect prediction
* Compare against a baseline (e.g., always predicting the most common class)
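
As a quick illustration of the baseline comparison, here is a minimal NumPy sketch; the labels are illustrative stand-ins, not values from the app:

```python
import numpy as np

# Illustrative actual and predicted labels for one categorical node.
actual    = np.array(["churn", "stay", "stay", "stay", "churn", "stay"])
predicted = np.array(["churn", "stay", "stay", "churn", "churn", "stay"])

# Model accuracy: fraction of correct predictions.
accuracy = np.mean(actual == predicted)

# Majority-class baseline: always predict the most common class.
_, counts = np.unique(actual, return_counts=True)
baseline = counts.max() / counts.sum()

print(f"model accuracy:    {accuracy:.0%}")   # 83%
print(f"majority baseline: {baseline:.0%}")   # 67%
```

A model that barely beats the majority baseline adds little information, even if its raw accuracy looks high.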

**Average MSE (Mean Squared Error)**

For numeric variables: average squared prediction error.

* Lower is better
* Sensitive to large errors (outliers penalized heavily)
* Expressed in the squared units of your data

**Average MAE (Mean Absolute Error)**

For numeric variables: average absolute prediction error.

* Lower is better
* More robust to outliers than MSE
* In the same units as your data

**Average MAPE (Mean Absolute Percentage Error)** – *Temporal only*

For time series forecasts: average percentage error across forecast horizons.

* Lower is better
* Expressed as a percentage
* Useful for comparing across different scales
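
The three error metrics above differ only in how they aggregate the prediction errors. A minimal NumPy sketch with illustrative values:

```python
import numpy as np

# Illustrative actuals and predictions for one numeric node.
actual    = np.array([100.0, 150.0, 200.0, 250.0])
predicted = np.array([110.0, 140.0, 210.0, 240.0])

errors = actual - predicted

mse  = np.mean(errors ** 2)                    # squared units; punishes large errors
mae  = np.mean(np.abs(errors))                 # same units as the data
mape = np.mean(np.abs(errors / actual)) * 100  # percentage; comparable across scales

print(f"MSE:  {mse:.1f}")    # 100.0
print(f"MAE:  {mae:.1f}")    # 10.0
print(f"MAPE: {mape:.1f}%")  # 6.4%
```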

**Log Likelihood**

Overall goodness-of-fit of the probabilistic model.

* Higher (less negative) is better
* Measures how likely the observed data is under the model
* Useful for comparing model versions
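
To make the idea concrete (this is an illustration, not the app's internal computation), here is the log likelihood of a small sample under a fitted Gaussian, using SciPy:

```python
import numpy as np
from scipy.stats import norm

# Illustrative observations for one numeric node.
data = np.array([4.8, 5.1, 5.0, 4.9, 5.2])

# Fit a simple Gaussian model to the observations.
mu, sigma = data.mean(), data.std()

# Log likelihood: sum of the log densities of each observation under the model.
log_likelihood = norm.logpdf(data, loc=mu, scale=sigma).sum()
print(f"log likelihood: {log_likelihood:.2f}")  # higher (less negative) = better fit
```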

(SCREENSHOT: Global metrics cards showing key statistics)

***

### Node-Level Metrics

The detailed table shows how well each variable is predicted:

**Classification Metrics** (for Category/Boolean nodes):

| Metric            | Description                                   | Better |
| ----------------- | --------------------------------------------- | ------ |
| Accuracy          | % of correct predictions                      | Higher |
| Precision         | Of predicted positives, % actually positive   | Higher |
| Recall            | Of actual positives, % correctly predicted    | Higher |
| F1 Score          | Harmonic mean of precision and recall         | Higher |
| Weighted Accuracy | Accuracy weighted by class frequency          | Higher |
| AUC               | Area under ROC curve (discrimination ability) | Higher |
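
All of these are standard classification metrics. A sketch of how they are conventionally computed with scikit-learn, using illustrative labels and probabilities:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, balanced_accuracy_score, roc_auc_score)

# Illustrative ground truth and predicted probabilities for a boolean node.
actual = [1, 0, 0, 1, 0, 0, 1, 0]
probs  = [0.9, 0.2, 0.4, 0.6, 0.1, 0.7, 0.8, 0.3]
preds  = [1 if p >= 0.5 else 0 for p in probs]

print("accuracy: ", accuracy_score(actual, preds))
print("precision:", precision_score(actual, preds))
print("recall:   ", recall_score(actual, preds))
print("F1:       ", f1_score(actual, preds))
print("balanced: ", balanced_accuracy_score(actual, preds))  # one common class-weighted accuracy
print("AUC:      ", roc_auc_score(actual, probs))            # uses probabilities, not hard labels
```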

**Regression Metrics** (for Numeric nodes):

| Metric | Description                           | Better |
| ------ | ------------------------------------- | ------ |
| MSE    | Mean squared error                    | Lower  |
| MAE    | Mean absolute error                   | Lower  |
| R²     | Fraction of variance explained by the model (at most 1) | Higher |
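
R² compares the model's residuals against a mean-only baseline. A NumPy sketch with illustrative values:

```python
import numpy as np

actual    = np.array([10.0, 20.0, 30.0, 40.0])
predicted = np.array([12.0, 18.0, 33.0, 38.0])

ss_res = np.sum((actual - predicted) ** 2)      # residual sum of squares
ss_tot = np.sum((actual - actual.mean()) ** 2)  # total variance around the mean

r2 = 1 - ss_res / ss_tot
print(f"R²: {r2:.3f}")  # 0.958

# Sanity check: predicting the mean for every row yields R² = 0.
mean_only = np.full_like(actual, actual.mean())
print(1 - np.sum((actual - mean_only) ** 2) / ss_tot)  # 0.0
```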

**Temporal Metrics** (for Numeric nodes in time series):

| Metric | Description                                | Better |
| ------ | ------------------------------------------ | ------ |
| MAPE   | Mean absolute percentage error per horizon | Lower  |
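
A sketch of the per-horizon computation, assuming actuals and forecasts arranged as (forecast origin × horizon) arrays; the shapes and values are illustrative:

```python
import numpy as np

# Rows are forecast origins, columns are horizons 1..3.
actual    = np.array([[100.0, 105.0, 110.0],
                      [200.0, 195.0, 205.0]])
predicted = np.array([[ 98.0, 110.0, 125.0],
                      [204.0, 185.0, 230.0]])

# MAPE computed separately at each horizon (column-wise mean).
mape_by_horizon = np.mean(np.abs((actual - predicted) / actual), axis=0) * 100

for h, m in enumerate(mape_by_horizon, start=1):
    print(f"horizon {h}: MAPE = {m:.1f}%")  # typically grows with the horizon
```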

(SCREENSHOT: Node-level metrics table showing all variables with their performance)

***

### Understanding the Metrics

**Accuracy vs. Precision vs. Recall**

* **Accuracy**: Overall correctness
* **Precision**: When the model says "yes," how often is it right?
* **Recall**: Of all actual "yes" cases, how many did the model find?

For imbalanced data (rare events), precision and recall are often more informative than accuracy.
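
A quick illustration of why: with a rare event, a model that never predicts the event can still score high accuracy while finding none of the cases.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Illustrative rare-event data: 2 positives out of 100.
actual = np.array([1] * 2 + [0] * 98)

# A degenerate "model" that never predicts the event.
predicted = np.zeros(100, dtype=int)

print("accuracy: ", accuracy_score(actual, predicted))                    # 0.98
print("recall:   ", recall_score(actual, predicted))                      # 0.0
print("precision:", precision_score(actual, predicted, zero_division=0))  # 0.0
```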

**R² Interpretation**

R² is at most 1; it typically falls between 0 and 1 but can be negative for very poor models:

* **R² = 1.0** – Perfect prediction
* **R² = 0.5** – Model explains 50% of variance
* **R² = 0.0** – Model does no better than predicting the mean

**MAPE by Horizon**

For temporal models, MAPE typically increases with forecast horizon:

* Near-term forecasts are usually more accurate
* Distant forecasts have more uncertainty
* The table shows MAPE at the first and last horizons

(SCREENSHOT: Node expanded to show per-class accuracy or MAPE by horizon chart)

***

### Performance Charts

Visual summaries help identify patterns:

**Top Performing Categorical Nodes**

Bar chart showing the 5 categorical variables with highest accuracy. These are your model's strengths.

**Top Performing Numeric Nodes**

Bar chart showing:

* For static twins: 5 variables with highest R²
* For temporal twins: 5 variables with lowest average MAPE

(SCREENSHOT: Performance bar charts for categorical and numeric nodes)

***

### Expanding Node Details

Click any row in the metrics table to see a detailed breakdown:

**For Categorical Nodes:**

* Per-class accuracy chart
* Shows how well each category is predicted
* Identifies if certain classes are harder to predict

**For Numeric Nodes (Temporal):**

* MAPE by forecast horizon chart
* Shows how accuracy degrades over time
* Helps set appropriate forecast windows

(SCREENSHOT: Expanded node showing per-class or per-horizon breakdown)

***

### Exporting Metrics

Export the evaluation data for external analysis or reporting:

1. Click **Export**
2. Choose format (CSV or JSON)
3. Download the node metrics table

Useful for:

* Including in reports
* Tracking over time
* Comparing across models
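
For example, a CSV export can feed a simple tracking script. A sketch using pandas; the file name and column names here are hypothetical, so match them to the headers in your actual export:

```python
import pandas as pd

# Load an exported node-metrics table (hypothetical file name).
metrics = pd.read_csv("evaluation_metrics.csv")

# Flag numeric nodes below an R² threshold you consider acceptable.
# "type", "node", "r2", and "mae" are hypothetical column names.
weak = metrics[(metrics["type"] == "numeric") & (metrics["r2"] < 0.5)]
print(weak[["node", "r2", "mae"]].sort_values("r2"))
```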

(SCREENSHOT: Export dropdown with format options)

***

### Interpreting Poor Performance

If a variable has poor metrics:

**Possible Causes:**

* **Missing important causes** – The variable's true drivers aren't in the model
* **Poor data quality** – Noise, errors, or missing values
* **Wrong model type** – A time-dependent variable may need a temporal model
* **Insufficient data** – Not enough examples to learn patterns

**What to Do:**

1. Check if important predictors are included
2. Review data quality for that variable
3. Consider adding more data or features
4. Accept higher uncertainty in simulations involving this variable

***

### Best Practices

**Evaluate Before Simulating**

Always check evaluation metrics before trusting simulation results. A model that can't predict well won't simulate well either.

**Focus on Key Variables**

Pay most attention to metrics for variables you'll simulate or predict. Less-important variables can have lower performance without affecting your use case.

**Compare Versions Thoughtfully**

When comparing versions, look for consistent improvements. A version that improves some metrics but hurts others needs careful consideration.

**Set Performance Thresholds**

Decide what "good enough" means for your use case. Perfect prediction isn't always necessary or achievable.

***

### Next Steps

With model quality understood:

* If satisfied, proceed to [Simulations](https://docs.rootcause.ai/user-guide/digital-twin/tabs/simulation-tab)
* If issues found, refine configuration in [Config Tab](https://docs.rootcause.ai/user-guide/digital-twin/tabs/config-tab)
* For poor-performing variables, check relationships in [Relationships Tab](https://docs.rootcause.ai/user-guide/digital-twin/tabs/relationships-tab)
