Evaluation Tab

The Evaluation tab shows how well your Digital Twin actually works. Causal discovery finds relationships, but how accurately can the model predict outcomes? How well does it capture the true patterns in your data?

Before trusting simulation results, you should verify the model performs well. The Evaluation tab provides comprehensive metrics at both the global level (overall model quality) and the node level (how well each variable is predicted).

(SCREENSHOT: Evaluation tab overview showing global metrics and node performance charts)


Why Evaluation Matters

A Digital Twin is only as useful as its accuracy. If the model poorly predicts certain variables, simulations involving those variables may be unreliable.

Evaluation helps you:

  • Trust results – Know which predictions to rely on

  • Identify weak spots – Find variables that need better data or modeling

  • Compare versions – See if changes improved or hurt performance

  • Guide refinement – Focus effort where it will help most


Version Comparison

The Evaluation tab supports comparing multiple versions side-by-side:

Selecting Versions

  1. Use the version multi-select dropdown

  2. Check the versions you want to compare

  3. Metrics update to show comparisons

Why Compare?

  • See if configuration changes improved performance

  • Track model quality over time

  • Identify which version is best for production

(SCREENSHOT: Version comparison selector with multiple versions checked)


Global Metrics

Top-level metrics summarize overall model quality:

Average Accuracy

For categorical and boolean variables: the average percentage of correct predictions across all such nodes.

  • Higher is better

  • 100% would mean perfect prediction

  • Compare against a baseline (e.g., always predicting the most common class)
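To make the baseline comparison concrete, here is a minimal sketch (with made-up labels, not the product's actual implementation) of accuracy versus a majority-class baseline:

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline(y_true):
    """Accuracy of always predicting the most common class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

# Hypothetical labels for a categorical node
y_true = ["churn", "stay", "stay", "stay", "churn", "stay"]
y_pred = ["churn", "stay", "churn", "stay", "stay", "stay"]

print(accuracy(y_true, y_pred))   # 4/6 ≈ 0.667
print(majority_baseline(y_true))  # "stay" appears 4/6 ≈ 0.667
```

In this toy example the model's accuracy only matches the baseline, which is exactly the situation the comparison is meant to catch.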

Average MSE (Mean Squared Error)

For numeric variables: average squared prediction error.

  • Lower is better

  • Sensitive to large errors (outliers penalized heavily)

  • In the units of your data squared

Average MAE (Mean Absolute Error)

For numeric variables: average absolute prediction error.

  • Lower is better

  • More robust to outliers than MSE

  • In the same units as your data
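The different outlier sensitivity of MSE and MAE is easy to see in a small sketch (hypothetical values, for illustration only):

```python
def mse(y_true, y_pred):
    """Mean squared error: large errors are penalized heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error: in the same units as the data."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [100.0, 102.0, 98.0, 110.0]
y_pred = [101.0, 101.0, 99.0, 100.0]  # last prediction is off by 10

print(mse(y_true, y_pred))  # (1 + 1 + 1 + 100) / 4 = 25.75
print(mae(y_true, y_pred))  # (1 + 1 + 1 + 10) / 4  = 3.25
```

A single large error dominates the MSE but only nudges the MAE, which is why the two are reported side by side.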

Average MAPE (Mean Absolute Percentage Error) – Temporal only

For time series forecasts: average percentage error across forecast horizons.

  • Lower is better

  • Expressed as a percentage

  • Useful for comparing across different scales
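A minimal sketch of the MAPE calculation (hypothetical values; note that MAPE is undefined when an actual value is zero):

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent.
    Assumes no true value is zero (division by the actual)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [200.0, 250.0, 400.0]
y_pred = [190.0, 275.0, 380.0]
print(mape(y_true, y_pred))  # (5% + 10% + 5%) / 3 ≈ 6.67%
```

Because each error is expressed relative to the actual value, MAPE lets you compare a variable in the hundreds against one in the millions.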

Log Likelihood

Overall goodness-of-fit of the probabilistic model.

  • Higher (less negative) is better

  • Measures how likely the observed data is under the model

  • Useful for comparing model versions
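As a sketch of the idea (a simple Gaussian model, not the probabilistic model the product actually fits), log likelihood rewards a model that places high probability on the observed data:

```python
import math

def gaussian_log_likelihood(data, mu, sigma):
    """Sum of log N(x | mu, sigma) over the observations."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2)
        - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in data
    )

data = [9.8, 10.1, 10.0, 9.9, 10.2]

# A model centered near the data scores higher (less negative) ...
print(gaussian_log_likelihood(data, mu=10.0, sigma=0.2))
# ... than a badly mis-specified one.
print(gaussian_log_likelihood(data, mu=12.0, sigma=0.2))
```

Absolute log-likelihood values are hard to interpret on their own; the useful signal is the difference between model versions fit to the same data.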

(SCREENSHOT: Global metrics cards showing key statistics)


Node-Level Metrics

The detailed table shows how well each variable is predicted:

Classification Metrics (for Category/Boolean nodes):

Metric              Description                                     Better
Accuracy            % of correct predictions                        Higher
Precision           Of predicted positives, % actually positive     Higher
Recall              Of actual positives, % correctly predicted      Higher
F1 Score            Harmonic mean of precision and recall           Higher
Weighted Accuracy   Accuracy weighted by class frequency            Higher
AUC                 Area under ROC curve (discrimination ability)   Higher

Regression Metrics (for Numeric nodes):

Metric   Description                             Better
MSE      Mean squared error                      Lower
MAE      Mean absolute error                     Lower
R²       Variance explained by the model (0–1)   Higher

Temporal Metrics (for Numeric nodes in time series):

Metric   Description                                  Better
MAPE     Mean absolute percentage error per horizon   Lower

(SCREENSHOT: Node-level metrics table showing all variables with their performance)


Understanding the Metrics

Accuracy vs. Precision vs. Recall

  • Accuracy: Overall correctness

  • Precision: When the model says "yes," how often is it right?

  • Recall: Of all actual "yes" cases, how many did the model find?

For imbalanced data (rare events), precision and recall are often more informative than accuracy.

R² Interpretation

R² typically ranges from 0 to 1, and can be negative for a model that predicts worse than simply using the mean:

  • R² = 1.0 – Perfect prediction

  • R² = 0.5 – Model explains 50% of variance

  • R² = 0.0 – Model does no better than predicting the mean
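These three cases can be reproduced with the standard R² formula (a minimal sketch with made-up values):

```python
def r_squared(y_true, y_pred):
    """R² = 1 - (residual sum of squares / total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [2.0, 4.0, 6.0, 8.0]
print(r_squared(y_true, y_true))                # 1.0: perfect prediction
print(r_squared(y_true, [5.0] * 4))             # 0.0: just predicting the mean
print(r_squared(y_true, [8.0, 2.0, 8.0, 2.0]))  # -3.0: worse than the mean
```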

MAPE by Horizon

For temporal models, MAPE typically increases with forecast horizon:

  • Near-term forecasts are usually more accurate

  • Distant forecasts have more uncertainty

  • The table shows MAPE at first and last horizons
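The per-horizon breakdown amounts to computing MAPE separately at each forecast step, as in this sketch (hypothetical data for two series):

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

# actuals[h][i] = value of series i at forecast horizon h + 1 (made-up numbers)
actuals   = [[100.0, 200.0], [110.0, 210.0], [120.0, 220.0]]
forecasts = [[ 98.0, 204.0], [104.0, 222.0], [102.0, 250.0]]

for h, (a, f) in enumerate(zip(actuals, forecasts), start=1):
    print(f"horizon {h}: MAPE = {mape(a, f):.1f}%")
```

In this example the error grows from roughly 2% at horizon 1 to over 14% at horizon 3, the typical degradation pattern the chart makes visible.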

(SCREENSHOT: Node expanded to show per-class accuracy or MAPE by horizon chart)


Performance Charts

Visual summaries help identify patterns:

Top Performing Categorical Nodes

Bar chart showing the 5 categorical variables with highest accuracy. These are your model's strengths.

Top Performing Numeric Nodes

Bar chart showing:

  • For static twins: 5 variables with highest R²

  • For temporal twins: 5 variables with lowest average MAPE

(SCREENSHOT: Performance bar charts for categorical and numeric nodes)


Expanding Node Details

Click any row in the metrics table to see a detailed breakdown:

For Categorical Nodes:

  • Per-class accuracy chart

  • Shows how well each category is predicted

  • Identifies if certain classes are harder to predict

For Numeric Nodes (Temporal):

  • MAPE by forecast horizon chart

  • Shows how accuracy degrades over time

  • Helps set appropriate forecast windows

(SCREENSHOT: Expanded node showing per-class or per-horizon breakdown)


Exporting Metrics

Export the evaluation data for external analysis or reporting:

  1. Click Export

  2. Choose format (CSV or JSON)

  3. Download the node metrics table

Useful for:

  • Including in reports

  • Tracking over time

  • Comparing across models
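As a sketch of external analysis, an exported CSV could be filtered for weak nodes like this (the column names and values here are hypothetical; your export's columns may differ):

```python
import csv
import io

# Stand-in for the downloaded node-metrics CSV (hypothetical columns).
exported = """node,type,accuracy,mse
churn,category,0.91,
revenue,numeric,,12.5
"""

with io.StringIO(exported) as f:
    rows = list(csv.DictReader(f))

# Flag categorical nodes below an accuracy threshold you choose.
weak = [r["node"] for r in rows
        if r["accuracy"] and float(r["accuracy"]) < 0.95]
print(weak)  # ['churn']
```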

(SCREENSHOT: Export dropdown with format options)


Interpreting Poor Performance

If a variable has poor metrics:

Possible Causes:

  • Missing important causes – The variable's true drivers aren't in the model

  • Poor data quality – Noise, errors, or missing values

  • Wrong model type – Maybe need temporal for a time-dependent variable

  • Insufficient data – Not enough examples to learn patterns

What to Do:

  1. Check if important predictors are included

  2. Review data quality for that variable

  3. Consider adding more data or features

  4. Accept higher uncertainty in simulations involving this variable


Best Practices

Evaluate Before Simulating

Always check evaluation metrics before trusting simulation results. A model that can't predict well won't simulate well either.

Focus on Key Variables

Pay most attention to metrics for variables you'll simulate or predict. Less-important variables can have lower performance without affecting your use case.

Compare Versions Thoughtfully

When comparing versions, look for consistent improvements. A version that improves some metrics but hurts others needs careful consideration.

Set Performance Thresholds

Decide what "good enough" means for your use case. Perfect prediction isn't always necessary or achievable.


Next Steps

With model quality understood, you can move on to simulation knowing which predictions to trust.
