# Evaluation Tab

The Evaluation tab shows how well your Digital Twin actually works. Causal discovery finds relationships, but how accurately can the model predict outcomes? How well does it capture the true patterns in your data?

Before trusting simulation results, you should verify the model performs well. The Evaluation tab provides comprehensive metrics at both the global level (overall model quality) and the node level (how well each variable is predicted).

(SCREENSHOT: Evaluation tab overview showing global metrics and node performance charts)

***

### Why Evaluation Matters

A Digital Twin is only as useful as it is accurate. If the model predicts certain variables poorly, simulations involving those variables may be unreliable.

Evaluation helps you:

* **Trust results** – Know which predictions to rely on
* **Identify weak spots** – Find variables that need better data or modeling
* **Compare versions** – See if changes improved or hurt performance
* **Guide refinement** – Focus effort where it will help most

***

### Version Comparison

The Evaluation tab supports comparing multiple versions side-by-side:

**Selecting Versions**

1. Use the version multi-select dropdown
2. Check the versions you want to compare
3. Metrics update to show comparisons

**Why Compare?**

* See if configuration changes improved performance
* Track model quality over time
* Identify which version is best for production

(SCREENSHOT: Version comparison selector with multiple versions checked)

***

### Global Metrics

Top-level metrics summarize overall model quality:

**Average Accuracy**

For categorical and boolean variables: the average percentage of correct predictions across all such nodes.

* Higher is better
* 100% would mean perfect prediction
* Compare against a baseline (e.g., always predicting the most common class)
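
As a quick illustration of the baseline comparison, here is a minimal NumPy sketch; the labels are illustrative stand-ins, not values from the app:

```python
import numpy as np

# Illustrative actual and predicted labels for one categorical node.
actual    = np.array(["churn", "stay", "stay", "stay", "churn", "stay"])
predicted = np.array(["churn", "stay", "stay", "churn", "churn", "stay"])

# Model accuracy: fraction of correct predictions.
accuracy = np.mean(actual == predicted)

# Majority-class baseline: always predict the most common class.
_, counts = np.unique(actual, return_counts=True)
baseline = counts.max() / counts.sum()

print(f"model accuracy:    {accuracy:.0%}")   # 83%
print(f"majority baseline: {baseline:.0%}")   # 67%
```

A model that barely beats the majority baseline adds little information, even if its raw accuracy looks high.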

**Average MSE (Mean Squared Error)**

For numeric variables: average squared prediction error.

* Lower is better
* Sensitive to large errors (outliers penalized heavily)
* Expressed in the squared units of your data

**Average MAE (Mean Absolute Error)**

For numeric variables: average absolute prediction error.

* Lower is better
* More robust to outliers than MSE
* In the same units as your data

**Average MAPE (Mean Absolute Percentage Error)** – *Temporal only*

For time series forecasts: average percentage error across forecast horizons.

* Lower is better
* Expressed as a percentage
* Useful for comparing across different scales
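
The three error metrics above differ only in how they aggregate the prediction errors. A minimal NumPy sketch with illustrative values:

```python
import numpy as np

# Illustrative actuals and predictions for one numeric node.
actual    = np.array([100.0, 150.0, 200.0, 250.0])
predicted = np.array([110.0, 140.0, 210.0, 240.0])

errors = actual - predicted

mse  = np.mean(errors ** 2)                    # squared units; punishes large errors
mae  = np.mean(np.abs(errors))                 # same units as the data
mape = np.mean(np.abs(errors / actual)) * 100  # percentage; comparable across scales

print(f"MSE:  {mse:.1f}")    # 100.0
print(f"MAE:  {mae:.1f}")    # 10.0
print(f"MAPE: {mape:.1f}%")  # 6.4%
```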

**Log Likelihood**

Overall goodness-of-fit of the probabilistic model.

* Higher (less negative) is better
* Measures how likely the observed data is under the model
* Useful for comparing model versions
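
To make the idea concrete (this is an illustration, not the app's internal computation), here is the log likelihood of a small sample under a fitted Gaussian, using SciPy:

```python
import numpy as np
from scipy.stats import norm

# Illustrative observations for one numeric node.
data = np.array([4.8, 5.1, 5.0, 4.9, 5.2])

# Fit a simple Gaussian model to the observations.
mu, sigma = data.mean(), data.std()

# Log likelihood: sum of the log densities of each observation under the model.
log_likelihood = norm.logpdf(data, loc=mu, scale=sigma).sum()
print(f"log likelihood: {log_likelihood:.2f}")  # higher (less negative) = better fit
```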

(SCREENSHOT: Global metrics cards showing key statistics)

***

### Node-Level Metrics

The detailed table shows how well each variable is predicted:

**Classification Metrics** (for Category/Boolean nodes):

| Metric            | Description                                   | Better |
| ----------------- | --------------------------------------------- | ------ |
| Accuracy          | % of correct predictions                      | Higher |
| Precision         | Of predicted positives, % actually positive   | Higher |
| Recall            | Of actual positives, % correctly predicted    | Higher |
| F1 Score          | Harmonic mean of precision and recall         | Higher |
| Weighted Accuracy | Accuracy weighted by class frequency          | Higher |
| AUC               | Area under ROC curve (discrimination ability) | Higher |
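
All of these are standard classification metrics. A sketch of how they are conventionally computed with scikit-learn, using illustrative labels and probabilities:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, balanced_accuracy_score, roc_auc_score)

# Illustrative ground truth and predicted probabilities for a boolean node.
actual = [1, 0, 0, 1, 0, 0, 1, 0]
probs  = [0.9, 0.2, 0.4, 0.6, 0.1, 0.7, 0.8, 0.3]
preds  = [1 if p >= 0.5 else 0 for p in probs]

print("accuracy: ", accuracy_score(actual, preds))
print("precision:", precision_score(actual, preds))
print("recall:   ", recall_score(actual, preds))
print("F1:       ", f1_score(actual, preds))
print("balanced: ", balanced_accuracy_score(actual, preds))  # one common class-weighted accuracy
print("AUC:      ", roc_auc_score(actual, probs))            # uses probabilities, not hard labels
```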

**Regression Metrics** (for Numeric nodes):

| Metric | Description                           | Better |
| ------ | ------------------------------------- | ------ |
| MSE    | Mean squared error                    | Lower  |
| MAE    | Mean absolute error                   | Lower  |
| R²     | Fraction of variance explained by the model (at most 1) | Higher |
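
R² compares the model's residuals against a mean-only baseline. A NumPy sketch with illustrative values:

```python
import numpy as np

actual    = np.array([10.0, 20.0, 30.0, 40.0])
predicted = np.array([12.0, 18.0, 33.0, 38.0])

ss_res = np.sum((actual - predicted) ** 2)      # residual sum of squares
ss_tot = np.sum((actual - actual.mean()) ** 2)  # total variance around the mean

r2 = 1 - ss_res / ss_tot
print(f"R²: {r2:.3f}")  # 0.958

# Sanity check: predicting the mean for every row yields R² = 0.
mean_only = np.full_like(actual, actual.mean())
print(1 - np.sum((actual - mean_only) ** 2) / ss_tot)  # 0.0
```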

**Temporal Metrics** (for Numeric nodes in time series):

| Metric | Description                                | Better |
| ------ | ------------------------------------------ | ------ |
| MAPE   | Mean absolute percentage error per horizon | Lower  |
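
A sketch of the per-horizon computation, assuming actuals and forecasts arranged as (forecast origin × horizon) arrays; the shapes and values are illustrative:

```python
import numpy as np

# Rows are forecast origins, columns are horizons 1..3.
actual    = np.array([[100.0, 105.0, 110.0],
                      [200.0, 195.0, 205.0]])
predicted = np.array([[ 98.0, 110.0, 125.0],
                      [204.0, 185.0, 230.0]])

# MAPE computed separately at each horizon (column-wise mean).
mape_by_horizon = np.mean(np.abs((actual - predicted) / actual), axis=0) * 100

for h, m in enumerate(mape_by_horizon, start=1):
    print(f"horizon {h}: MAPE = {m:.1f}%")  # typically grows with the horizon
```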

(SCREENSHOT: Node-level metrics table showing all variables with their performance)

***

### Understanding the Metrics

**Accuracy vs. Precision vs. Recall**

* **Accuracy**: Overall correctness
* **Precision**: When the model says "yes," how often is it right?
* **Recall**: Of all actual "yes" cases, how many did the model find?

For imbalanced data (rare events), precision and recall are often more informative than accuracy.
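
A quick illustration of why: with a rare event, a model that never predicts the event can still score high accuracy while finding none of the cases.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Illustrative rare-event data: 2 positives out of 100.
actual = np.array([1] * 2 + [0] * 98)

# A degenerate "model" that never predicts the event.
predicted = np.zeros(100, dtype=int)

print("accuracy: ", accuracy_score(actual, predicted))                    # 0.98
print("recall:   ", recall_score(actual, predicted))                      # 0.0
print("precision:", precision_score(actual, predicted, zero_division=0))  # 0.0
```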

**R² Interpretation**

R² is at most 1; it typically falls between 0 and 1 but can be negative for very poor models:

* **R² = 1.0** – Perfect prediction
* **R² = 0.5** – Model explains 50% of variance
* **R² = 0.0** – Model does no better than predicting the mean

**MAPE by Horizon**

For temporal models, MAPE typically increases with forecast horizon:

* Near-term forecasts are usually more accurate
* Distant forecasts have more uncertainty
* The table shows MAPE at the first and last horizons

(SCREENSHOT: Node expanded to show per-class accuracy or MAPE by horizon chart)

***

### Performance Charts

Visual summaries help identify patterns:

**Top Performing Categorical Nodes**

Bar chart showing the 5 categorical variables with highest accuracy. These are your model's strengths.

**Top Performing Numeric Nodes**

Bar chart showing:

* For static twins: 5 variables with highest R²
* For temporal twins: 5 variables with lowest average MAPE

(SCREENSHOT: Performance bar charts for categorical and numeric nodes)

***

### Expanding Node Details

Click any row in the metrics table to see a detailed breakdown:

**For Categorical Nodes:**

* Per-class accuracy chart
* Shows how well each category is predicted
* Identifies if certain classes are harder to predict

**For Numeric Nodes (Temporal):**

* MAPE by forecast horizon chart
* Shows how accuracy degrades over time
* Helps set appropriate forecast windows

(SCREENSHOT: Expanded node showing per-class or per-horizon breakdown)

***

### Exporting Metrics

Export the evaluation data for external analysis or reporting:

1. Click **Export**
2. Choose format (CSV or JSON)
3. Download the node metrics table

Useful for:

* Including in reports
* Tracking over time
* Comparing across models
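
For example, a CSV export can feed a simple tracking script. A sketch using pandas; the file name and column names here are hypothetical, so match them to the headers in your actual export:

```python
import pandas as pd

# Load an exported node-metrics table (hypothetical file name).
metrics = pd.read_csv("evaluation_metrics.csv")

# Flag numeric nodes below an R² threshold you consider acceptable.
# "type", "node", "r2", and "mae" are hypothetical column names.
weak = metrics[(metrics["type"] == "numeric") & (metrics["r2"] < 0.5)]
print(weak[["node", "r2", "mae"]].sort_values("r2"))
```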

(SCREENSHOT: Export dropdown with format options)

***

### Interpreting Poor Performance

If a variable has poor metrics:

**Possible Causes:**

* **Missing important causes** – The variable's true drivers aren't in the model
* **Poor data quality** – Noise, errors, or missing values
* **Wrong model type** – A time-dependent variable may need a temporal model
* **Insufficient data** – Not enough examples to learn patterns

**What to Do:**

1. Check if important predictors are included
2. Review data quality for that variable
3. Consider adding more data or features
4. Accept higher uncertainty in simulations involving this variable

***

### Best Practices

**Evaluate Before Simulating**

Always check evaluation metrics before trusting simulation results. A model that can't predict well won't simulate well either.

**Focus on Key Variables**

Pay most attention to metrics for variables you'll simulate or predict. Less-important variables can have lower performance without affecting your use case.

**Compare Versions Thoughtfully**

When comparing versions, look for consistent improvements. A version that improves some metrics but hurts others needs careful consideration.

**Set Performance Thresholds**

Decide what "good enough" means for your use case. Perfect prediction isn't always necessary or achievable.

***

### Next Steps

With model quality understood:

* If satisfied, proceed to [Simulations](https://docs.rootcause.ai/user-guide/digital-twin/tabs/simulation-tab)
* If issues found, refine configuration in [Config Tab](https://docs.rootcause.ai/user-guide/digital-twin/tabs/config-tab)
* For poor-performing variables, check relationships in [Relationships Tab](https://docs.rootcause.ai/user-guide/digital-twin/tabs/relationships-tab)
