Evaluation Tab
The Evaluation tab shows how well your Digital Twin actually works. Causal discovery finds relationships, but how accurately can the model predict outcomes? How well does it capture the true patterns in your data?
Before trusting simulation results, you should verify the model performs well. The Evaluation tab provides comprehensive metrics at both the global level (overall model quality) and the node level (how well each variable is predicted).
(SCREENSHOT: Evaluation tab overview showing global metrics and node performance charts)
Why Evaluation Matters
A Digital Twin is only as useful as its accuracy. If the model poorly predicts certain variables, simulations involving those variables may be unreliable.
Evaluation helps you:
Trust results – Know which predictions to rely on
Identify weak spots – Find variables that need better data or modeling
Compare versions – See if changes improved or hurt performance
Guide refinement – Focus effort where it will help most
Version Comparison
The Evaluation tab supports comparing multiple versions side-by-side:
Selecting Versions
Use the version multi-select dropdown
Check the versions you want to compare
Metrics update to show comparisons
Why Compare?
See if configuration changes improved performance
Track model quality over time
Identify which version is best for production
(SCREENSHOT: Version comparison selector with multiple versions checked)
Global Metrics
Top-level metrics summarize overall model quality:
Average Accuracy
For categorical and boolean variables: the average percentage of correct predictions across all such nodes.
Higher is better
100% would mean perfect prediction
Compare against a baseline (e.g., always predicting the most common class)
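The baseline comparison can be made concrete with a small sketch in plain Python (the function names are illustrative, not part of the product):

```python
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy of always predicting the most common class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# If 90% of customers stay, a model at 90% accuracy adds nothing:
actual    = ["stay"] * 9 + ["churn"]
predicted = ["stay"] * 10          # always guesses the majority class
print(baseline_accuracy(actual))   # 0.9
print(accuracy(predicted, actual)) # 0.9, no better than the baseline
```

A model is only demonstrably useful when its accuracy clears this majority-class baseline.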
Average MSE (Mean Squared Error)
For numeric variables: average squared prediction error.
Lower is better
Sensitive to large errors (outliers penalized heavily)
Expressed in the squared units of your data
Average MAE (Mean Absolute Error)
For numeric variables: average absolute prediction error.
Lower is better
More robust to outliers than MSE
In the same units as your data
Average MAPE (Mean Absolute Percentage Error) – Temporal only
For time series forecasts: average percentage error across forecast horizons.
Lower is better
Expressed as a percentage
Useful for comparing across different scales
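The three error metrics above reduce to short formulas. A minimal sketch in plain Python (illustrative names, not the product's API):

```python
def mse(predicted, actual):
    """Mean squared error: penalizes large errors heavily."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

def mae(predicted, actual):
    """Mean absolute error: in the same units as the data."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def mape(predicted, actual):
    """Mean absolute percentage error: scale-free; assumes actual values != 0."""
    return 100 * sum(abs((p - a) / a) for p, a in zip(predicted, actual)) / len(actual)

actual    = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
print(mse(predicted, actual))   # (100 + 100 + 900) / 3, about 366.67
print(mae(predicted, actual))   # (10 + 10 + 30) / 3, about 16.67
print(mape(predicted, actual))  # (10% + 5% + 10%) / 3, about 8.33
```

Note how the single 30-unit error dominates MSE but not MAE; that is the outlier sensitivity described above.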
Log Likelihood
Overall goodness-of-fit of the probabilistic model.
Higher (less negative) is better
Measures how likely the observed data is under the model
Useful for comparing model versions
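The exact computation depends on the fitted model, but the idea can be sketched: sum the log of the probability the model assigned to each observed outcome (a hypothetical example, not the product's internals):

```python
import math

def log_likelihood(probs_of_observed):
    """Sum of log probabilities the model assigned to what actually happened.
    A probability near 1 contributes ~0; a surprising outcome (probability
    near 0) contributes a large negative value."""
    return sum(math.log(p) for p in probs_of_observed)

# Model A is more confident in the correct outcomes than model B:
model_a = [0.9, 0.8, 0.95]
model_b = [0.5, 0.4, 0.6]
print(log_likelihood(model_a))  # about -0.38 (higher, better fit)
print(log_likelihood(model_b))  # about -2.12
```

This is why "higher (less negative) is better": a well-fit model is rarely surprised by the data.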
(SCREENSHOT: Global metrics cards showing key statistics)
Node-Level Metrics
The detailed table shows how well each variable is predicted:
Classification Metrics (for Category/Boolean nodes):
Accuracy – % of correct predictions (higher is better)
Precision – Of predicted positives, % actually positive (higher is better)
Recall – Of actual positives, % correctly predicted (higher is better)
F1 Score – Harmonic mean of precision and recall (higher is better)
Weighted Accuracy – Accuracy weighted by class frequency (higher is better)
AUC – Area under the ROC curve (discrimination ability; higher is better)
Regression Metrics (for Numeric nodes):
MSE – Mean squared error (lower is better)
MAE – Mean absolute error (lower is better)
R² – Variance explained by the model, 0–1 (higher is better)
Temporal Metrics (for Numeric nodes in time series):
MAPE – Mean absolute percentage error per horizon (lower is better)
(SCREENSHOT: Node-level metrics table showing all variables with their performance)
Understanding the Metrics
Accuracy vs. Precision vs. Recall
Accuracy: Overall correctness
Precision: When the model says "yes," how often is it right?
Recall: Of all actual "yes" cases, how many did the model find?
For imbalanced data (rare events), precision and recall are often more informative than accuracy.
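To see why accuracy alone misleads on imbalanced data, consider a small sketch (illustrative Python, not the product's code):

```python
def precision_recall(predicted, actual, positive="yes"):
    """Precision and recall for one positive class, from raw counts."""
    tp = sum(p == positive and a == positive for p, a in zip(predicted, actual))
    fp = sum(p == positive and a != positive for p, a in zip(predicted, actual))
    fn = sum(p != positive and a == positive for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 98 negatives, 2 positives: a model that never says "yes" is 98% accurate
actual    = ["no"] * 98 + ["yes"] * 2
predicted = ["no"] * 100
precision, recall = precision_recall(predicted, actual, positive="yes")
print(precision, recall)  # 0.0 0.0 -- accuracy hides a useless model
```

Here accuracy is 98%, yet the model never finds a single positive case; precision and recall expose that immediately.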
R² Interpretation
R² typically falls between 0 and 1, and can be negative when a model predicts worse than the mean:
R² = 1.0 – Perfect prediction
R² = 0.5 – Model explains 50% of variance
R² = 0.0 – Model does no better than predicting the mean
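The definition behind these interpretations, R² = 1 − SS_res / SS_tot, can be sketched in a few lines (illustrative, not the product's implementation):

```python
def r_squared(predicted, actual):
    """1 - (residual sum of squares / total sum of squares)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for p, a in zip(predicted, actual))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0, 4.0]
print(r_squared(actual, actual))               # 1.0, perfect prediction
print(r_squared([2.5] * 4, actual))            # 0.0, just predicting the mean
print(r_squared([4.0, 3.0, 2.0, 1.0], actual)) # negative, worse than the mean
```

The last case shows where negative values come from: the model's errors exceed those of the constant-mean baseline.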
MAPE by Horizon
For temporal models, MAPE typically increases with forecast horizon:
Near-term forecasts are usually more accurate
Distant forecasts have more uncertainty
The table shows MAPE at first and last horizons
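Per-horizon MAPE can be sketched as follows, assuming one forecast array per series (illustrative Python, not the product's implementation):

```python
def mape_by_horizon(forecasts, actuals):
    """forecasts/actuals: one list per series, each holding values by horizon.
    Returns MAPE (%) at each horizon, averaged across series."""
    n_horizons = len(actuals[0])
    result = []
    for h in range(n_horizons):
        errs = [abs((f[h] - a[h]) / a[h]) for f, a in zip(forecasts, actuals)]
        result.append(100 * sum(errs) / len(errs))
    return result

# Two series forecast three steps ahead; errors grow with horizon:
actuals   = [[100.0, 100.0, 100.0], [200.0, 200.0, 200.0]]
forecasts = [[102.0, 105.0, 110.0], [204.0, 210.0, 220.0]]
print(mape_by_horizon(forecasts, actuals))  # [2.0, 5.0, 10.0]
```

The rising values illustrate the typical pattern: 2% error one step out, 10% three steps out.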
(SCREENSHOT: Node expanded to show per-class accuracy or MAPE by horizon chart)
Performance Charts
Visual summaries help identify patterns:
Top Performing Categorical Nodes
Bar chart showing the 5 categorical variables with highest accuracy. These are your model's strengths.
Top Performing Numeric Nodes
Bar chart showing:
For static twins: 5 variables with highest R²
For temporal twins: 5 variables with lowest average MAPE
(SCREENSHOT: Performance bar charts for categorical and numeric nodes)
Expanding Node Details
Click any row in the metrics table to see a detailed breakdown:
For Categorical Nodes:
Per-class accuracy chart
Shows how well each category is predicted
Identifies if certain classes are harder to predict
For Numeric Nodes (Temporal):
MAPE by forecast horizon chart
Shows how accuracy degrades over time
Helps set appropriate forecast windows
(SCREENSHOT: Expanded node showing per-class or per-horizon breakdown)
Exporting Metrics
Export the evaluation data for external analysis or reporting:
Click Export
Choose format (CSV or JSON)
Download the node metrics table
Useful for:
Including in reports
Tracking over time
Comparing across models
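Once exported, the metrics file is easy to post-process, for example to flag weak nodes for a report. A sketch assuming a CSV export with hypothetical column names (`node`, `type`, `accuracy`, `mae`); check the header of your actual file:

```python
import csv
import io

# Hypothetical export contents; the real schema may differ.
exported = """node,type,accuracy,mae
Churn,category,0.91,
Revenue,numeric,,12.4
"""

# Flag categorical nodes whose accuracy falls below a chosen threshold.
weak = [row["node"] for row in csv.DictReader(io.StringIO(exported))
        if row["accuracy"] and float(row["accuracy"]) < 0.95]
print(weak)  # ['Churn']
```

The same pattern works for tracking metrics over time: append each export to a dated file and plot the columns you care about.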
(SCREENSHOT: Export dropdown with format options)
Interpreting Poor Performance
If a variable has poor metrics:
Possible Causes:
Missing important causes – The variable's true drivers aren't in the model
Poor data quality – Noise, errors, or missing values
Wrong model type – A time-dependent variable may need a temporal model
Insufficient data – Not enough examples to learn patterns
What to Do:
Check if important predictors are included
Review data quality for that variable
Consider adding more data or features
Accept higher uncertainty in simulations involving this variable
Best Practices
Evaluate Before Simulating
Always check evaluation metrics before trusting simulation results. A model that can't predict well won't simulate well either.
Focus on Key Variables
Pay most attention to metrics for variables you'll simulate or predict. Less-important variables can have lower performance without affecting your use case.
Compare Versions Thoughtfully
When comparing versions, look for consistent improvements. A version that improves some metrics but hurts others needs careful consideration.
Set Performance Thresholds
Decide what "good enough" means for your use case. Perfect prediction isn't always necessary or achievable.
Next Steps
With model quality understood:
If satisfied, proceed to Simulations
If issues found, refine configuration in Config Tab
For poor-performing variables, check relationships in Relationships Tab