Data Collection & Privacy
Expanse collects telemetry while you run Expanse Core, enabling the ML-powered predictions in Expanse Pro. Our approach is privacy-first: we extract only mathematical fingerprints that cannot be reverse-engineered, so your code and sensitive data never leave your network in identifiable form.
What We Collect
At each job submission and completion, Expanse captures:
On Submit
| Data Point | Description |
|---|---|
| Queue state | Current queue depth, wait times, node availability |
| Cluster topology | Hardware configuration, interconnect, available resources |
| Job characteristics | Requested resources, node count, walltime, partition |
| Code embedding | Privacy-preserving mathematical fingerprint of source code (see below) |
| Input shapes | Dimensions and types of input data (not the data itself) |
On Completion
| Data Point | Description |
|---|---|
| Runtime | Actual execution time |
| Peak memory | Maximum memory utilisation |
| Exit status | Success, failure, timeout, OOM, etc. |
| Resource utilisation | CPU/GPU usage over time (aggregated) |
| Output shapes | Dimensions of produced outputs |
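For illustration, a single submit/completion pair could be captured as two records like the ones below; all field names and values are hypothetical, not Expanse's actual schema.

```python
# Hypothetical records for one job; field names and values are illustrative only.
submit_record = {
    "queue_state":      {"depth": 42, "median_wait_s": 1800, "idle_nodes": 16},
    "cluster_topology": {"nodes": 512, "gpus_per_node": 4, "interconnect": "infiniband"},
    "job_request":      {"nodes": 8, "walltime_s": 14400, "partition": "gpu"},
    "code_embedding":   [0.234, -0.891, 0.445],             # truncated 512-dim vector
    "input_shapes":     [("float64", (1_000_000, 3))],      # dtypes and dimensions, never values
}

completion_record = {
    "runtime_s":       9120,
    "peak_memory_gib": 212.4,
    "exit_status":     "success",                           # success | failure | timeout | oom
    "utilisation":     {"cpu_mean": 0.81, "gpu_p95": 0.97}, # aggregated, no time series
    "output_shapes":   [("float64", (1_000_000, 6))],
}
```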
Privacy-First Architecture
Your source code and raw telemetry never leave your network. Instead, Expanse processes everything on-premises before transmission.
┌─────────────────────────────────────────────────────────────┐
│ YOUR NETWORK (On-Premises) │
│ │
│ Source Code ──→ Embedding Model ──→ Mathematical Vector │
│ │ │ │
│ Telemetry ─────→ Aggregation ─────→ Statistical Summary │
│ │ │ │
│ Anonymisation │ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────────────┐ │
│ │ Masked Payload (irreversible) │ │
│ └─────────────────────────────────┘ │
│ │ │
└──────────────────────────────┼──────────────────────────────┘
│ HTTPS
▼
┌─────────────────────┐
│ Expanse Servers │
│ (ML Training) │
└─────────────────────┘
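As a rough sketch of the final hop in the diagram, the masked payload is the only thing posted over HTTPS; the endpoint URL and field names below are placeholders, not the real Expanse client.

```python
# Minimal sketch: ship an already-masked payload over HTTPS with the standard library.
# The ingest URL is a placeholder; the real Expanse client is not shown here.
import json
import urllib.request

masked_payload = {
    "code_embedding": [0.234, -0.891, 0.445],                   # irreversible vector only
    "telemetry_summary": {"runtime_s": 9120, "peak_memory_gib": 212.4},
}

request = urllib.request.Request(
    "https://ingest.example.invalid/v1/telemetry",              # placeholder endpoint
    data=json.dumps(masked_payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(request)  # uncomment to send; HTTPS is the only egress path
```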
Code Embedding (Not Code)
We use learned embeddings to create mathematical fingerprints of code:
- One-way transformation: The embedding cannot be reversed to reconstruct source code
- Semantic similarity: Similar algorithms produce similar vectors, enabling pattern learning
- No identifiable content: Variable names, comments, and literal values are stripped before embedding
```python
# Your actual code (NEVER transmitted)
import numpy as np

def compute_stress_tensor(mesh, forces):
    """Proprietary FEA implementation"""
    stiffness = assemble_stiffness_matrix(mesh)
    return np.linalg.solve(stiffness, forces)

# What we transmit (mathematical vector)
[0.234, -0.891, 0.445, 0.112, ..., -0.667]  # 512-dim embedding
```
The embedding captures computational patterns (matrix assembly, linear solve) without revealing implementation details.
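The exact embedding pipeline is not shown in this guide, but the stripping step above can be sketched with Python's `ast` module; `_Anonymise` and `anonymise` are hypothetical helpers, not the Expanse client.

```python
# Sketch of identifier/comment/literal stripping prior to embedding (assumption:
# a simple AST rewrite; the production pipeline may differ).
import ast

class _Anonymise(ast.NodeTransformer):
    """Rename identifiers, drop docstrings, and blank literals, keeping structure."""

    def __init__(self):
        self._names = {}

    def _alias(self, name):
        return self._names.setdefault(name, f"v{len(self._names)}")

    def visit_FunctionDef(self, node):
        node.name = self._alias(node.name)
        # Drop a leading docstring if present (comments are already lost by ast.parse).
        if (node.body and isinstance(node.body[0], ast.Expr)
                and isinstance(node.body[0].value, ast.Constant)):
            node.body = node.body[1:] or [ast.Pass()]
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._alias(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._alias(node.id)
        return node

    def visit_Constant(self, node):
        node.value = 0  # blank out literal values
        return node

def anonymise(source: str) -> str:
    """Return structure-only source text for a local embedding model to consume."""
    tree = _Anonymise().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+
```

Applied to the snippet above, this yields structure-only code along the lines of `def v0(v1, v2)` whose body reduces to `v3 = v4(v1)` and `return v5.linalg.solve(v3, v2)`: the computational pattern survives, the names do not.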
Telemetry Aggregation
Raw metrics are aggregated before transmission:
- No time-series: Only statistical summaries (mean, p50, p95, max)
- No identifiers: Job IDs, user names, and paths are hashed or removed
- Differential privacy: Noise added to prevent individual job identification
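A rough illustration of those three steps (summarise, hash, add calibrated noise), assuming per-second CPU samples and a simple Laplace mechanism; `aggregate_job` and its `sensitivity`/`epsilon` parameters are hypothetical, not Expanse's implementation.

```python
# Rough sketch of on-prem aggregation with hashed IDs and Laplace noise.
import hashlib
import random
import statistics

def laplace_noise(scale: float) -> float:
    """Laplace(0, scale) noise, drawn as the difference of two exponentials."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def aggregate_job(job_id: str, cpu_samples: list[float],
                  sensitivity: float = 1.0, epsilon: float = 1.0) -> dict:
    """Reduce a raw CPU time series to a noisy statistical summary."""
    cuts = statistics.quantiles(cpu_samples, n=100)      # percentile cut points 1..99
    summary = {
        "job_hash": hashlib.sha256(job_id.encode()).hexdigest(),  # no raw identifiers
        "cpu_mean": statistics.fmean(cpu_samples),
        "cpu_p50": cuts[49],
        "cpu_p95": cuts[94],
        "cpu_max": max(cpu_samples),
    }
    scale = sensitivity / epsilon                        # standard Laplace calibration
    return {key: value + laplace_noise(scale) if isinstance(value, float) else value
            for key, value in summary.items()}
```

The `sensitivity / epsilon` scale is the textbook Laplace-mechanism calibration; the formal guarantees Expanse actually provides are the subject of the paper described under Academic Transparency.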
Giving Back to the Community
Your anonymised data contributions help improve predictions for everyone:
- Collective learning: Models trained on diverse workloads generalise better
- Academic collaboration: We partner with EPCC and universities to advance HPC research
- Open benchmarks: Aggregated, anonymised statistics published for community benefit
In return, you receive:
- Continuously improving prediction models
- Custom model fine-tuning for your specific workload patterns (Expanse Pro)
- Early access to new ML features
Academic Transparency
We are committed to transparency in our data collection and ML methods.
We are preparing an academic paper detailing our privacy-preserving telemetry collection methods, embedding techniques, and differential privacy guarantees. This will be submitted for peer review and published openly.
The paper will cover:
- Formal privacy guarantees and threat model analysis
- Embedding model architecture and irreversibility proofs
- Benchmark comparisons with traditional (non-private) approaches
Deployment Options
| Option | Availability | Description |
|---|---|---|
| Masked Embeddings | Academic & Standard | Code/telemetry embedded on-prem, only vectors transmitted |
| Aggregate-Only | Standard | Only statistical summaries, no embeddings |
| Fully On-Prem | Enterprise | All ML training happens on your infrastructure, zero egress |
For academic partners, masked embeddings are the default, providing strong privacy while enabling collective model improvement. Enterprise customers requiring zero data egress can deploy fully on-premises ML training infrastructure.