ML Predictions (Coming Soon)
Expanse Pro introduces ML-powered predictions trained on the telemetry collected in Expanse Core. These models learn from your historical job data to provide actionable insights before job submission.
Resource Prediction
Problem: Users guess resource requirements, leading to over-provisioning (wasted allocation) or under-provisioning (job failures, OOM kills).
Solution: Expanse predicts optimal resource allocation based on code characteristics and input data shapes.
$ expanse run solver --predict-resources
Predicted resources for 'solver':
Nodes: 4 (requested: 8) ▼ 50% reduction possible
Memory: 94 GB (requested: 128 GB)
Walltime: 2h 34m (requested: 4h)
Confidence: 87%
Recommendation: Reduce node count to 4. Historical jobs with similar
code patterns and input shapes completed successfully with this allocation.
Proceed with prediction? [Y/n/original]
Impact:
- Reduced queue wait times (smaller jobs schedule faster)
- Higher cluster utilisation
- Lower allocation waste
- Fewer OOM failures from under-provisioning
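To give a feel for how a predictor like this could work, here is a minimal Python sketch of a nearest-neighbour lookup over historical jobs with similar input sizes. The `HistoricalJob` fields, the similarity metric, and the 10% memory headroom are all illustrative assumptions, not the actual Expanse model.

```python
from dataclasses import dataclass

@dataclass
class HistoricalJob:
    """One completed job from the telemetry store (hypothetical schema)."""
    input_elements: int   # e.g. mesh size
    nodes_used: int
    peak_mem_gb: float
    walltime_min: int

def predict_resources(history, input_elements, k=5):
    """Sketch: find the k past successful jobs with the most similar input
    size, then take the max over those neighbours as a conservative
    allocation, plus 10% memory headroom (an assumed safety policy)."""
    neighbours = sorted(
        history, key=lambda j: abs(j.input_elements - input_elements)
    )[:k]
    return (
        max(j.nodes_used for j in neighbours),
        max(j.peak_mem_gb for j in neighbours) * 1.1,
        max(j.walltime_min for j in neighbours),
    )
```

In practice a learned model over richer features (code fingerprints, input data shapes) would replace the single-feature distance, but the neighbour-lookup framing matches the "historical jobs with similar code patterns" recommendation shown above.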
Queue Time Prediction
Problem: Users submit jobs without knowing when they'll start, making it hard to plan work or choose optimal submission times.
Solution: Expanse predicts expected queue wait time based on current cluster state and historical patterns.
$ expanse run solver --predict-queue
Queue time prediction for 'solver' on archer2:
Current queue depth: 247 jobs
Requested resources: 4 nodes, 2h walltime
Estimated wait time: 45 min - 1h 15min
Confidence interval: 80%
Better times to submit:
- Tomorrow 06:00: ~10 min wait (historically low usage)
- Friday 18:00: ~5 min wait (weekend lull)
Submit now? [Y/n/schedule]
Impact:
- Better researcher time management
- Option to schedule jobs for optimal queue times
- Reduced uncertainty and frustration
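One simple way to produce an interval like the "45 min - 1h 15min" estimate above is to take empirical quantiles of past waits under comparable conditions. The sketch below assumes hypothetical telemetry tuples of (queue depth at submission, nodes requested, observed wait); the comparability thresholds are illustrative.

```python
def estimate_wait(history, queue_depth, nodes):
    """Sketch: empirical wait-time interval from historical submissions
    made under comparable conditions. `history` is a list of
    (queue_depth, nodes, wait_min) tuples (hypothetical fields).
    Returns a (low, high) interval in minutes spanning the 10th-90th
    percentile of comparable past waits, i.e. ~80% coverage."""
    comparable = sorted(
        w for d, n, w in history
        if abs(d - queue_depth) <= 50 and n <= nodes * 2
    )
    if not comparable:
        return None
    lo = comparable[int(0.10 * (len(comparable) - 1))]
    hi = comparable[int(0.90 * (len(comparable) - 1))]
    return lo, hi
```

The "better times to submit" suggestions would come from running the same estimate against historical conditions for each candidate submission slot.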
Failure Prediction
Problem: Jobs fail after hours of queue wait and partial execution, wasting compute hours and researcher time.
Solution: Expanse predicts failure probability before submission and suggests preventive fixes.
$ expanse run solver --validate
Pre-flight validation for 'solver':
✓ Node configuration valid
✓ Input dependencies satisfied
✓ Cluster permissions verified
⚠ FAILURE RISK: HIGH (78%)
Predicted failure mode: OUT_OF_MEMORY at ~47 minutes
Evidence:
- Code pattern similar to jobs that OOM'd (embedding distance: 0.12)
- Input mesh size (2.3M elements) exceeds safe threshold for 128GB
- Historical jobs with this pattern: 23/29 failed with OOM
Suggested fixes:
1. Increase memory: --mem=256GB (recommended)
2. Reduce mesh resolution in preprocessing
3. Enable out-of-core solver mode: --solver-flags="--ooc"
Proceed anyway? [Y/n/fix]
Impact:
- Fewer wasted compute hours
- Faster iteration cycles
- Actionable fix suggestions, not just warnings
- Learn from collective failure patterns across the community
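The "historical jobs with this pattern: 23/29 failed" evidence above suggests a neighbour-based estimate: find past jobs whose code-pattern embeddings are close to the candidate's, and report their empirical failure rate. A minimal sketch, with hypothetical embedding and telemetry structures:

```python
import math

def failure_probability(history, embedding, max_dist=0.2):
    """Sketch: estimate failure risk as the empirical failure rate among
    historical jobs whose code-pattern embeddings lie within `max_dist`
    (Euclidean) of the candidate job's. `history` is a list of
    (embedding, failed) pairs (hypothetical schema).
    Returns (probability, n_similar)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    similar = [failed for emb, failed in history
               if dist(emb, embedding) <= max_dist]
    if not similar:
        return None, 0
    return sum(similar) / len(similar), len(similar)
```

A production model would likely also condition on resource request and input size (as the OOM evidence above does), and attach a predicted failure mode to each neighbour cluster rather than a bare probability.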
Custom Models for Your Lab
With sufficient historical data (~6-12 months), Expanse can train custom prediction models specific to your workloads:
- Lab-specific patterns: Models tuned to your particular codes and workflows
- Higher accuracy: Custom models outperform generic ones for your use cases
- Private training: For enterprise deployments, training happens entirely on-prem
ML predictions will roll out progressively as we collect sufficient training data. Resource prediction (lowest data requirement) will be available first, followed by queue time and failure prediction.