ML Predictions (Coming Soon)

Expanse Pro introduces ML-powered predictions trained on the telemetry collected in Expanse Core. These models learn from your historical job data to provide actionable insights before job submission.

Resource Prediction

Problem: Users guess resource requirements, leading to over-provisioning (wasted allocation) or under-provisioning (job failures, OOM kills).

Solution: Expanse predicts optimal resource allocation based on code characteristics and input data shapes.

$ expanse run solver --predict-resources

Predicted resources for 'solver':
Nodes: 4 (requested: 8) - 50% reduction possible
Memory: 94 GB (requested: 128 GB)
Walltime: 2h 34m (requested: 4h)
Confidence: 87%

Recommendation: Reduce node count to 4. Historical jobs with similar
code patterns and input shapes completed successfully with this allocation.

Proceed with prediction? [Y/n/original]

Impact:

  • Reduced queue wait times (smaller jobs schedule faster)
  • Higher cluster utilisation
  • Lower allocation waste
  • Fewer OOM failures from under-provisioning
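
To make the mechanism more concrete, here is a minimal sketch of how a resource prediction of this kind could be produced from job telemetry. The feature layout, the gradient-boosted regressor, and the 15% safety margin are illustrative assumptions, not the actual Expanse Pro model.

# Illustrative sketch only: predict peak memory (GB) for a new job from
# historical telemetry. Feature layout, model choice, and safety margin
# are assumptions, not the actual Expanse Pro implementation.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical history: one row per completed job.
# Columns: input size (millions of elements), node count, code-pattern cluster id.
X_hist = np.array([
    [1.2, 8, 3],
    [2.3, 8, 3],
    [0.6, 4, 1],
    [1.8, 4, 3],
])
peak_mem_gb = np.array([61.0, 118.0, 22.0, 90.0])  # observed peak memory per job

model = GradientBoostingRegressor(n_estimators=200, max_depth=3)
model.fit(X_hist, peak_mem_gb)

# Predict for the job about to be submitted, then add headroom so that
# under-provisioning (and therefore OOM) stays unlikely.
new_job = np.array([[2.0, 4, 3]])
predicted = model.predict(new_job)[0]
recommended = predicted * 1.15  # 15% headroom; the margin policy is an assumption
print(f"Predicted peak memory: {predicted:.0f} GB, recommend {recommended:.0f} GB")

In a sketch like this, the same approach could be repeated for node count and walltime, with the confidence score derived from how tightly similar historical jobs cluster around the prediction.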

Queue Time Prediction

Problem: Users submit jobs without knowing when they'll start, making it hard to plan work or choose optimal submission times.

Solution: Expanse predicts expected queue wait time based on current cluster state and historical patterns.

$ expanse run solver --predict-queue

Queue time prediction for 'solver' on archer2:
Current queue depth: 247 jobs
Requested resources: 4 nodes, 2h walltime

Estimated wait time: 45 min - 1h 15min
Confidence interval: 80%

Better times to submit:
- Tomorrow 06:00: ~10 min wait (historically low usage)
- Friday 18:00: ~5 min wait (weekend lull)

Submit now? [Y/n/schedule]

Impact:

  • Better researcher time management
  • Option to schedule jobs for optimal queue times
  • Reduced uncertainty and frustration
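
As a rough illustration of how an interval such as "45 min - 1h 15min (80% confidence)" could be produced, the sketch below fits two quantile regressors on historical queue data. The feature set (queue depth, requested nodes, walltime, hour of day) and the quantile-regression approach are assumptions made for illustration, not a description of Expanse's actual model.

# Illustrative sketch: estimate an 80% wait-time interval via quantile
# regression. Features and model choice are assumptions, not Expanse's model.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical history: [queue depth, requested nodes, walltime (h), hour of day]
X_hist = np.array([
    [310, 8, 4, 14],
    [120, 4, 2, 6],
    [250, 4, 2, 11],
    [40, 2, 1, 18],
])
wait_min = np.array([95.0, 12.0, 55.0, 5.0])  # observed queue wait in minutes

lower = GradientBoostingRegressor(loss="quantile", alpha=0.10).fit(X_hist, wait_min)
upper = GradientBoostingRegressor(loss="quantile", alpha=0.90).fit(X_hist, wait_min)

# Current cluster state for the job being submitted right now.
now = np.array([[247, 4, 2, 13]])
print(f"Estimated wait: {lower.predict(now)[0]:.0f}-{upper.predict(now)[0]:.0f} min (80% interval)")

Evaluating the same pair of models over a grid of future submission times is one way the "better times to submit" suggestions could be generated.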

Failure Prediction

Problem: Jobs fail after hours of queue wait and partial execution, wasting compute time and researcher productivity.

Solution: Expanse predicts failure probability before submission and suggests preventive fixes.

$ expanse run solver --validate

Pre-flight validation for 'solver':

✓ Node configuration valid
✓ Input dependencies satisfied
✓ Cluster permissions verified

⚠ FAILURE RISK: HIGH (78%)

Predicted failure mode: OUT_OF_MEMORY at ~47 minutes

Evidence:
- Code pattern similar to jobs that OOM'd (embedding distance: 0.12)
- Input mesh size (2.3M elements) exceeds safe threshold for 128GB
- Historical jobs with this pattern: 23/29 failed with OOM

Suggested fixes:
1. Increase memory: --mem=256GB (recommended)
2. Reduce mesh resolution in preprocessing
3. Enable out-of-core solver mode: --solver-flags="--ooc"

Proceed anyway? [Y/n/fix]

Impact:

  • Fewer wasted compute hours
  • Faster iteration cycles
  • Actionable fix suggestions, not just warnings
  • Learn from collective failure patterns across the community
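
The evidence in the validation output (an embedding distance and a 23/29 failure count among similar jobs) suggests a nearest-neighbour view of risk. The sketch below illustrates that idea only; the embeddings, the neighbour count, and the risk threshold are all assumptions, not Expanse's actual logic.

# Illustrative sketch: estimate failure risk as the OOM rate among the k most
# similar historical jobs in an embedding space. Embeddings, k, and the
# threshold are assumptions, not the actual Expanse Pro implementation.
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical embeddings of past jobs (e.g. derived from code and input shape)
# and whether each one failed with OUT_OF_MEMORY.
rng = np.random.default_rng(0)
job_embeddings = rng.normal(size=(200, 16))
oom_failed = rng.random(200) < 0.3

index = NearestNeighbors(n_neighbors=29).fit(job_embeddings)

def failure_risk(new_embedding):
    """Fraction of the nearest historical jobs that ended in OOM."""
    _, idx = index.kneighbors(new_embedding.reshape(1, -1))
    return float(oom_failed[idx[0]].mean())

risk = failure_risk(job_embeddings[0] + 0.05)
if risk > 0.5:
    print(f"FAILURE RISK: HIGH ({risk:.0%}) - consider requesting more memory")

Mapping a predicted failure mode to concrete fixes (more memory, coarser mesh, out-of-core mode) would sit on top of an estimator like this rather than inside it.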

Custom Models for Your Lab

With sufficient historical data (~6-12 months), Expanse can train custom prediction models specific to your workloads:

  • Lab-specific patterns: Models tuned to your particular codes and workflows
  • Higher accuracy: Custom models outperform generic ones for your use cases
  • Private training: For enterprise deployments, training happens entirely on-prem
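
As a rough sketch of what a lab-specific model could mean in practice, the comparison below trains one regressor on cluster-wide history and another only on a single lab's jobs, then evaluates both on that lab's held-out jobs. The data shapes and the metric are assumptions made purely for illustration.

# Illustrative sketch: compare a generic model with one trained only on a
# single lab's jobs. Data layout and metric are assumptions for illustration.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))                        # hypothetical job features
y = X @ rng.normal(size=6) + rng.normal(size=1000)    # hypothetical peak memory
lab = rng.random(1000) < 0.2                          # mask: one lab's jobs

X_tr, X_te, y_tr, y_te, lab_tr, lab_te = train_test_split(
    X, y, lab, test_size=0.3, random_state=0)

generic = GradientBoostingRegressor().fit(X_tr, y_tr)
custom = GradientBoostingRegressor().fit(X_tr[lab_tr], y_tr[lab_tr])

# Evaluate both models on the lab's held-out jobs only.
print("generic MAE:", mean_absolute_error(y_te[lab_te], generic.predict(X_te[lab_te])))
print("custom MAE: ", mean_absolute_error(y_te[lab_te], custom.predict(X_te[lab_te])))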

Expanse Pro Timeline

ML predictions will roll out progressively as we collect sufficient training data. Resource prediction (lowest data requirement) will be available first, followed by queue time and failure prediction.