Data Collection & Privacy

Expanse collects telemetry while you use Expanse Core to enable ML-powered predictions in Expanse Pro. Our approach is privacy-first: we extract only mathematical fingerprints, designed so they cannot be reversed to recover your source, ensuring your code and sensitive data never leave your network in identifiable form.

What We Collect

At each job submission and completion, Expanse captures:

On Submit

Data Point           Description
Queue state          Current queue depth, wait times, node availability
Cluster topology     Hardware configuration, interconnect, available resources
Job characteristics  Requested resources, node count, walltime, partition
Code embedding       Privacy-preserving mathematical fingerprint of the source code (see below)
Input shapes         Dimensions and types of input data (not the data itself)
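
As an illustration only, a submit-time record carrying these data points might look like the sketch below; every field name is a hypothetical placeholder, not Expanse's actual wire schema.

# Hypothetical submit-time record -- keys are illustrative only
submit_record = {
    "queue_state": {"depth": 42, "median_wait_s": 1800, "free_nodes": 12},
    "cluster_topology": {"nodes": 128, "cores_per_node": 64,
                         "gpus_per_node": 4, "interconnect": "infiniband"},
    "job_characteristics": {"nodes": 8, "walltime_s": 7200, "partition": "gpu"},
    "code_embedding": [0.234, -0.891, 0.445],  # truncated; 512-dim in practice
    "input_shapes": [("float64", (10000, 10000))],  # shapes only, never data
}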

On Completion

Data Point            Description
Runtime               Actual execution time
Peak memory           Maximum memory utilisation
Exit status           Success, failure, timeout, OOM, etc.
Resource utilisation  CPU/GPU usage over time (aggregated)
Output shapes         Dimensions of produced outputs
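
Similarly, an illustrative completion-time record (again with hypothetical keys) might pair with the submit record like this:

# Hypothetical completion-time record -- keys are illustrative only
completion_record = {
    "runtime_s": 6420.5,                     # actual execution time
    "peak_memory_gb": 412.0,                 # maximum memory utilisation
    "exit_status": "success",                # success | failure | timeout | oom
    "gpu_util": {"mean": 0.83, "p95": 0.97}, # aggregated, no time-series
    "output_shapes": [("float64", (10000,))],
}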

Privacy-First Architecture

Your source code and raw telemetry never leave your network. Instead, Expanse processes everything on-premises before transmission.

┌──────────────────────────────────────────────────────────────────┐
│                    YOUR NETWORK (On-Premises)                    │
│                                                                  │
│   Source Code ──→ Embedding Model ──→ Mathematical Vector        │
│                                                                  │
│   Telemetry ─────→ Aggregation ─────→ Statistical Summary        │
│                                              │                   │
│                                        Anonymisation             │
│                                              │                   │
│                                              ▼                   │
│                             ┌─────────────────────────────────┐  │
│                             │  Masked Payload (irreversible)  │  │
│                             └─────────────────────────────────┘  │
│                                              │                   │
└──────────────────────────────────────────────┼───────────────────┘
                                               │ HTTPS
                                               ▼
                                    ┌─────────────────────┐
                                    │   Expanse Servers   │
                                    │    (ML Training)    │
                                    └─────────────────────┘

Code Embedding (Not Code)

We use learned embeddings to create mathematical fingerprints of code:

  • One-way transformation: The embedding cannot be reversed to reconstruct source code
  • Semantic similarity: Similar algorithms produce similar vectors, enabling pattern learning
  • No identifiable content: Variable names, comments, and literal values are stripped before embedding

# Your actual code (NEVER transmitted)
def compute_stress_tensor(mesh, forces):
    """Proprietary FEA implementation"""
    stiffness = assemble_stiffness_matrix(mesh)
    return np.linalg.solve(stiffness, forces)

# What we transmit (mathematical vector)
[0.234, -0.891, 0.445, 0.112, ..., -0.667]  # 512-dim embedding

The embedding captures computational patterns (matrix assembly, linear solve) without revealing implementation details.
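
To make the stripping step concrete, here is a minimal sketch of how identifiers, literals, and docstrings could be normalised with Python's standard ast module before embedding; the normalise function and the VAR/FUNC placeholders are illustrative assumptions, not Expanse's actual pipeline.

import ast

class _Strip(ast.NodeTransformer):
    """Replace names and literal values so only code structure remains."""

    def visit_FunctionDef(self, node):
        node.name = "FUNC"                  # drop the function's name
        self.generic_visit(node)
        return node

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="VAR", ctx=node.ctx), node)

    def visit_Constant(self, node):
        # Literals (including docstrings) are replaced; comments never
        # survive parsing in the first place.
        return ast.copy_location(ast.Constant(value=0), node)

def normalise(source: str) -> str:
    """Strip identifiable content, keeping the computational skeleton."""
    # A production normaliser would also cover attributes, arguments, etc.
    tree = _Strip().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))

# normalise(...) feeds the on-prem embedding model, which emits the vector.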

Telemetry Aggregation

Raw metrics are aggregated before transmission (see the sketch after this list):

  • No time-series: Only statistical summaries (mean, p50, p95, max)
  • No identifiers: Job IDs, user names, and paths are hashed or removed
  • Differential privacy: Noise added to prevent individual job identification
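
A minimal sketch of that aggregation step, assuming a Laplace mechanism for the noise; the sensitivity and epsilon values here are illustrative assumptions, not Expanse's published parameters.

import hashlib
import statistics
import numpy as np

def aggregate(job_id: str, samples: list[float], epsilon: float = 1.0) -> dict:
    """Collapse a raw metric stream into a privacy-preserving summary."""
    qs = statistics.quantiles(samples, n=100)   # 99 percentile cut points
    summary = {
        "mean": statistics.fmean(samples),
        "p50": qs[49],
        "p95": qs[94],
        "max": max(samples),
    }
    # Identifiers are hashed, never sent raw.
    summary["job"] = hashlib.sha256(job_id.encode()).hexdigest()[:16]
    # Laplace noise with scale = sensitivity / epsilon (sensitivity 1.0 assumed).
    for key, noise in zip(("mean", "p50", "p95", "max"),
                          np.random.laplace(0.0, 1.0 / epsilon, 4)):
        summary[key] += float(noise)
    return summary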

Giving Back to the Community

Your anonymised data contributions help improve predictions for everyone:

  • Collective learning: Models trained on diverse workloads generalise better
  • Academic collaboration: We partner with EPCC and universities to advance HPC research
  • Open benchmarks: Aggregated, anonymised statistics published for community benefit

In return, you receive:

  • Continuously improving prediction models
  • Custom model fine-tuning for your specific workload patterns (Expanse Pro)
  • Early access to new ML features

Academic Transparency

We are committed to transparency in our data collection and ML methods:

Publication in Progress

We are preparing an academic paper detailing our privacy-preserving telemetry collection methods, embedding techniques, and differential privacy guarantees. This will be submitted for peer review and published openly.

The paper will cover:

  • Formal privacy guarantees and threat model analysis
  • Embedding model architecture and irreversibility proofs
  • Benchmark comparisons with traditional (non-private) approaches

Deployment Options

Option             Availability         Description
Masked Embeddings  Academic & Standard  Code/telemetry embedded on-prem; only vectors transmitted
Aggregate-Only     Standard             Only statistical summaries, no embeddings
Fully On-Prem      Enterprise           All ML training happens on your infrastructure, zero egress

For academic partners, masked embeddings are the default, providing strong privacy while enabling collective model improvement. Enterprise customers requiring zero data egress can deploy fully on-premises ML training infrastructure.
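
As an illustration of how a site might select among these modes, a hypothetical client-side configuration could look like this; the keys and values are assumptions for the sketch, not a documented Expanse API.

# Hypothetical deployment configuration -- keys/values are illustrative only.
TELEMETRY_CONFIG = {
    "mode": "masked_embeddings",   # or "aggregate_only" / "fully_on_prem"
    "embedding_dim": 512,          # size of the on-prem embedding vectors
    "transmit": ["embeddings", "summaries"],  # "fully_on_prem" sends nothing
}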