ML Predictions (Coming Soon)

Expanse Pro introduces ML-powered predictions trained on the telemetry collected in Expanse Core. These models learn from your historical job data to provide actionable insights before job submission.

Resource Prediction

Problem: Users guess resource requirements, leading to over-provisioning (wasted allocation) or under-provisioning (job failures, OOM kills).

Solution: Expanse predicts optimal resource allocation based on code characteristics and input data shapes.

$ expanse run solver --predict-resources

Predicted resources for 'solver':
Nodes: 4 (requested: 8) - 50% reduction possible
Memory: 94 GB (requested: 128 GB)
Walltime: 2h 34m (requested: 4h)
Confidence: 87%

Recommendation: Reduce node count to 4. Historical jobs with similar
code patterns and input shapes completed successfully with this allocation.

Proceed with prediction? [Y/n/original]

Impact:

  • Reduced queue wait times (smaller jobs schedule faster)
  • Higher cluster utilisation
  • Lower allocation waste
  • Fewer OOM failures from under-provisioning
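
To make the mechanism more concrete, here is a minimal sketch of how a resource prediction of this kind could be produced from job telemetry. The feature layout, the gradient-boosted regressor, and the 15% safety margin are illustrative assumptions, not the actual Expanse Pro model.

# Illustrative sketch only: predict peak memory (GB) for a new job from
# historical telemetry. Feature layout, model choice, and safety margin
# are assumptions, not the actual Expanse Pro implementation.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical history: one row per completed job.
# Columns: input size (millions of elements), node count, code-pattern cluster id.
X_hist = np.array([
    [1.2, 8, 3],
    [2.3, 8, 3],
    [0.6, 4, 1],
    [1.8, 4, 3],
])
peak_mem_gb = np.array([61.0, 118.0, 22.0, 90.0])  # observed peak memory per job

model = GradientBoostingRegressor(n_estimators=200, max_depth=3)
model.fit(X_hist, peak_mem_gb)

# Predict for the job about to be submitted, then add headroom so that
# under-provisioning (and therefore OOM) stays unlikely.
new_job = np.array([[2.0, 4, 3]])
predicted = model.predict(new_job)[0]
recommended = predicted * 1.15  # 15% headroom; the margin policy is an assumption
print(f"Predicted peak memory: {predicted:.0f} GB, recommend {recommended:.0f} GB")

In a sketch like this, the same approach could be repeated for node count and walltime, with the confidence score derived from how tightly similar historical jobs cluster around the prediction.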

Queue Time Prediction

Problem: Users submit jobs without knowing when they'll start, making it hard to plan work or choose optimal submission times.

Solution: Expanse predicts expected queue wait time based on current cluster state and historical patterns.

$ expanse run solver --predict-queue

Queue time prediction for 'solver' on archer2:
Current queue depth: 247 jobs
Requested resources: 4 nodes, 2h walltime

Estimated wait time: 45 min - 1h 15min
Confidence interval: 80%

Better times to submit:
- Tomorrow 06:00: ~10 min wait (historically low usage)
- Friday 18:00: ~5 min wait (weekend lull)

Submit now? [Y/n/schedule]

Impact:

  • Better researcher time management
  • Option to schedule jobs for optimal queue times
  • Reduced uncertainty and frustration
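
As a rough illustration of how an interval such as "45 min - 1h 15min (80% confidence)" could be produced, the sketch below fits two quantile regressors on historical queue data. The feature set (queue depth, requested nodes, walltime, hour of day) and the quantile-regression approach are assumptions made for illustration, not a description of Expanse's actual model.

# Illustrative sketch: estimate an 80% wait-time interval via quantile
# regression. Features and model choice are assumptions, not Expanse's model.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical history: [queue depth, requested nodes, walltime (h), hour of day]
X_hist = np.array([
    [310, 8, 4, 14],
    [120, 4, 2, 6],
    [250, 4, 2, 11],
    [40, 2, 1, 18],
])
wait_min = np.array([95.0, 12.0, 55.0, 5.0])  # observed queue wait in minutes

lower = GradientBoostingRegressor(loss="quantile", alpha=0.10).fit(X_hist, wait_min)
upper = GradientBoostingRegressor(loss="quantile", alpha=0.90).fit(X_hist, wait_min)

# Current cluster state for the job being submitted right now.
now = np.array([[247, 4, 2, 13]])
print(f"Estimated wait: {lower.predict(now)[0]:.0f}-{upper.predict(now)[0]:.0f} min (80% interval)")

Evaluating the same pair of models over a grid of future submission times is one way the "better times to submit" suggestions could be generated.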

Failure Prediction

Problem: Jobs fail after hours of queue wait and partial execution, wasting compute time and researcher productivity.

Solution: Expanse predicts failure probability before submission and suggests preventive fixes.

$ expanse run solver --validate

Pre-flight validation for 'solver':

✓ Node configuration valid
✓ Input dependencies satisfied
✓ Cluster permissions verified

⚠ FAILURE RISK: HIGH (78%)

Predicted failure mode: OUT_OF_MEMORY at ~47 minutes

Evidence:
- Code pattern similar to jobs that OOM'd (embedding distance: 0.12)
- Input mesh size (2.3M elements) exceeds safe threshold for 128GB
- Historical jobs with this pattern: 23/29 failed with OOM

Suggested fixes:
1. Increase memory: --mem=256GB (recommended)
2. Reduce mesh resolution in preprocessing
3. Enable out-of-core solver mode: --solver-flags="--ooc"

Proceed anyway? [Y/n/fix]

Impact:

  • Fewer wasted compute hours
  • Faster iteration cycles
  • Actionable fix suggestions, not just warnings
  • Learn from collective failure patterns across the community
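
The evidence in the validation output (an embedding distance and a 23/29 failure count among similar jobs) suggests a nearest-neighbour view of risk. The sketch below illustrates that idea only; the embeddings, the neighbour count, and the risk threshold are all assumptions, not Expanse's actual logic.

# Illustrative sketch: estimate failure risk as the OOM rate among the k most
# similar historical jobs in an embedding space. Embeddings, k, and the
# threshold are assumptions, not the actual Expanse Pro implementation.
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical embeddings of past jobs (e.g. derived from code and input shape)
# and whether each one failed with OUT_OF_MEMORY.
rng = np.random.default_rng(0)
job_embeddings = rng.normal(size=(200, 16))
oom_failed = rng.random(200) < 0.3

index = NearestNeighbors(n_neighbors=29).fit(job_embeddings)

def failure_risk(new_embedding):
    """Fraction of the nearest historical jobs that ended in OOM."""
    _, idx = index.kneighbors(new_embedding.reshape(1, -1))
    return float(oom_failed[idx[0]].mean())

risk = failure_risk(job_embeddings[0] + 0.05)
if risk > 0.5:
    print(f"FAILURE RISK: HIGH ({risk:.0%}) - consider requesting more memory")

Mapping a predicted failure mode to concrete fixes (more memory, coarser mesh, out-of-core mode) would sit on top of an estimator like this rather than inside it.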

Custom Models for Your Lab

With sufficient historical data (~6-12 months), Expanse can train custom prediction models specific to your workloads:

  • Lab-specific patterns: Models tuned to your particular codes and workflows
  • Higher accuracy: Custom models outperform generic ones for your use cases
  • Private training: For enterprise deployments, training happens entirely on-prem
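
As a rough sketch of what a lab-specific model could mean in practice, the comparison below trains one regressor on cluster-wide history and another only on a single lab's jobs, then evaluates both on that lab's held-out jobs. The data shapes and the metric are assumptions made purely for illustration.

# Illustrative sketch: compare a generic model with one trained only on a
# single lab's jobs. Data layout and metric are assumptions for illustration.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))                        # hypothetical job features
y = X @ rng.normal(size=6) + rng.normal(size=1000)    # hypothetical peak memory
lab = rng.random(1000) < 0.2                          # mask: one lab's jobs

X_tr, X_te, y_tr, y_te, lab_tr, lab_te = train_test_split(
    X, y, lab, test_size=0.3, random_state=0)

generic = GradientBoostingRegressor().fit(X_tr, y_tr)
custom = GradientBoostingRegressor().fit(X_tr[lab_tr], y_tr[lab_tr])

# Evaluate both models on the lab's held-out jobs only.
print("generic MAE:", mean_absolute_error(y_te[lab_te], generic.predict(X_te[lab_te])))
print("custom MAE: ", mean_absolute_error(y_te[lab_te], custom.predict(X_te[lab_te])))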

Expanse Pro Timeline

ML predictions will roll out progressively as we collect sufficient training data. Resource prediction (lowest data requirement) will be available first, followed by queue time and failure prediction.