Data Transfer & Filesystem
Expanse provides a tiered filesystem abstraction that enables seamless data sharing across languages (Python, C, Fortran) and clusters. Data flows automatically between nodes without manual file management.
Same-Cluster Optimisation
When producer and consumer nodes run on the same cluster:
Producer writes: artifacts/preprocess/mesh.arrow
↓
Symlink created: inputs/mesh.arrow → artifacts/preprocess/mesh.arrow
↓
Consumer reads: inputs/mesh.arrow (zero-copy via symlink)
No data is copied, as both nodes reference the same file on disk.
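For illustration only, the linking step could be implemented roughly as follows; link_input and the directory names are hypothetical, not the actual Expanse internals:
import os
def link_input(artifact_path: str, inputs_dir: str, input_name: str) -> str:
    """Illustrative: expose a producer's artifact to a consumer via symlink."""
    os.makedirs(inputs_dir, exist_ok=True)
    link_path = os.path.join(inputs_dir, input_name)
    if os.path.lexists(link_path):
        os.remove(link_path)  # replace a stale link if one exists
    os.symlink(os.path.abspath(artifact_path), link_path)  # no data copied
    return link_path
# e.g. link_input("artifacts/preprocess/mesh.arrow", "inputs", "mesh.arrow")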
Cross-Cluster Transfer
When nodes run on different clusters, Expanse handles transfer automatically:
# Node on local machine produces data
- name: preprocess
  ref: nodes/preprocess
  cluster: local
# Node on ARCHER2 consumes it, and transfer happens automatically
- name: solver
  ref: nodes/solver
  cluster: archer2
Transfer priority:
- Globus: Preferred for large files between HPC centres with Globus endpoints
- rsync over SSH: Fallback for smaller transfers or when Globus is unavailable
- scp: Final fallback
A /dev/shm optimisation applies when producer and consumer run on the same physical compute node within a cluster; this enables shared-memory transfers with zero filesystem overhead.
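For illustration, the priority order can be pictured as a simple selection routine; the Endpoint type and its fields below are hypothetical, not Expanse internals:
from dataclasses import dataclass
@dataclass
class Endpoint:
    cluster: str
    host: str
    has_globus_endpoint: bool = False
    reachable_over_ssh: bool = True
def choose_transfer_method(src: Endpoint, dst: Endpoint) -> str:
    """Illustrative priority logic; not the actual Expanse scheduler."""
    if src.cluster == dst.cluster and src.host == dst.host:
        return "shm"      # same physical compute node: /dev/shm shared memory
    if src.cluster == dst.cluster:
        return "symlink"  # same cluster: zero-copy symlink (see above)
    if src.has_globus_endpoint and dst.has_globus_endpoint:
        return "globus"   # preferred for large files between HPC centres
    if dst.reachable_over_ssh:
        return "rsync"    # rsync over SSH fallback
    return "scp"          # final fallback
# choose_transfer_method(Endpoint("local", "laptop"), Endpoint("archer2", "ln01", True)) -> "rsync"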
expanse_io API
Each language runtime provides a consistent API for reading inputs and writing outputs. Nodes should use these APIs rather than direct file I/O to ensure proper Arrow format handling and artifact registration.
Python
from expanse_io import read_input, write_output, read_json, write_json
# Read array input from previous node
mesh = read_input("mesh") # Returns numpy.ndarray
params = read_input("params") # Shape/dtype from Arrow schema
# Read JSON input
config = read_json("config") # Returns dict
# Write array output for downstream nodes
write_output("solution", result) # Accepts numpy.ndarray
write_output("residuals", errors)
# Write JSON output
write_json("metrics", {"loss": 0.05, "accuracy": 0.98})
Fortran
use expanse_io
implicit none
real(8), allocatable :: mesh(:,:), solution(:,:)
integer :: shape(2), ierr
! Read array input
call expanse_read_real64("mesh", mesh, shape, ierr)
if (ierr /= 0) stop "Failed to read mesh"
! Process...
! Write array output
call expanse_write_real64("solution", solution, shape, ierr)
! Also available:
! expanse_read_real32 / expanse_write_real32 (float)
! expanse_read_int32 / expanse_write_int32 (integer)
! expanse_read_int64 / expanse_write_int64 (long)
C
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include "expanse_io.h"
double *mesh;
double *solution;   /* allocated and filled in the processing step below */
int64_t shape[2];
int err;
// Read array input
err = expanse_read_real64("mesh", &mesh, shape);
if (err != 0) { fprintf(stderr, "Failed to read mesh\n"); exit(1); }
// Process...
// Write array output
err = expanse_write_real64("solution", solution, shape, 2);
// Also available:
// expanse_read_real32 / expanse_write_real32
// expanse_read_int32 / expanse_write_int32
// expanse_read_int64 / expanse_write_int64
Supported Data Types
| Type String | Python | Fortran | C |
|---|---|---|---|
| array[float64, N] | np.float64 | real(8) | double |
| array[float32, N] | np.float32 | real(4) | float |
| array[int64, N] | np.int64 | integer(8) | int64_t |
| array[int32, N] | np.int32 | integer(4) | int32_t |
| json | dict | N/A | N/A |
| file | Path string | Path string | Path string |
The N in type strings indicates dimensionality: array[float64, 1] is a 1D array, array[float64, 2] is 2D, etc.
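As an illustration of how a type string relates to a concrete array, the helper below parses one into a NumPy dtype and dimensionality; parse_type_string is a hypothetical example, not part of expanse_io:
import re
import numpy as np
def parse_type_string(type_string: str):
    """Illustrative: map e.g. 'array[float64, 2]' to (np.float64, 2)."""
    match = re.fullmatch(r"array\[(float64|float32|int64|int32),\s*(\d+)\]", type_string)
    if match is None:
        raise ValueError(f"not an array type string: {type_string}")
    return np.dtype(match.group(1)), int(match.group(2))
# parse_type_string("array[float64, 2]") -> (dtype('float64'), 2)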
Environment Variables
The runtime libraries use these environment variables (set automatically by Expanse):
| Variable | Description |
|---|---|
| EXPANSE_INPUTS | Directory containing input Arrow files (symlinked from producers) |
| EXPANSE_ARTIFACT_DIR | Directory where this node should write its Arrow outputs |
| EXPANSE_OUTPUTS | Directory for user-visible result files (copied to results/) |
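As a rough sketch of how a runtime library might resolve paths from these variables (assuming the .arrow naming shown elsewhere on this page; these helpers are illustrative, not the expanse_io implementation):
import os
from pathlib import Path
def input_path(name: str) -> Path:
    """Illustrative: where an input Arrow file would be looked up."""
    return Path(os.environ["EXPANSE_INPUTS"]) / f"{name}.arrow"
def artifact_path(name: str) -> Path:
    """Illustrative: where an output Arrow artifact would be written."""
    return Path(os.environ["EXPANSE_ARTIFACT_DIR"]) / f"{name}.arrow"
def result_path(filename: str) -> Path:
    """Illustrative: destination for user-visible results declared via path:."""
    return Path(os.environ["EXPANSE_OUTPUTS"]) / filename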
Project Data Folder
The data/ folder at your project root holds static input files that nodes can reference:
my-simulation/
├── data/
│   ├── initial_mesh.vtk          # Static input mesh
│   ├── parameters.json           # Simulation parameters
│   └── boundary_conditions/
│       ├── inlet.csv
│       └── outlet.csv
├── nodes/
└── workflows/
Reference project data in node inputs using the data/ prefix:
# nodes/solver/node.yaml
inputs:
  - name: mesh
    from: data/initial_mesh.vtk       # Static project file
    type: file
  - name: config
    from: data/parameters.json
    type: json
  - name: boundaries
    from: preprocessor/boundaries     # Dynamic from previous node
    type: array[float64, 2]
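Inside the solver node, these inputs could then be consumed as sketched below, assuming (per the types table) that a file-typed input is delivered as a path string; parsing the .vtk mesh is left as a placeholder:
import numpy as np
from expanse_io import read_input, read_json
mesh_path = read_input("mesh")         # file input: path string to data/initial_mesh.vtk (assumed behaviour)
config = read_json("config")           # json input: dict of simulation parameters
boundaries = read_input("boundaries")  # array[float64, 2]: numpy.ndarray from the preprocessor node
# ... load the mesh from mesh_path with your own VTK reader, then solve ...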
The data/ folder is:
- Read-only from the node's perspective
- Version-controlled with your project
- Transferred automatically to remote clusters when needed
- Useful for: initial conditions, configuration files, reference datasets, lookup tables
Results Folder
The results/ folder collects outputs you want to keep after workflow completion. Mark outputs as results using the path: field:
# nodes/solver/node.yaml
outputs:
  - name: solution
    type: array[float64, 2]
    path: final_solution.bin          # ← Copied to results/
  - name: convergence
    type: array[float64, 1]
    path: convergence_history.csv     # ← Copied to results/
  - name: internal_state
    type: array[float64, 2]
    # No path: field; stays in artifacts/, not copied to results/
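The node code itself is unchanged by the path: field; it writes every output through write_output, and the YAML above alone determines which artifacts are copied to results/. A sketch with placeholder arrays:
import numpy as np
from expanse_io import write_output
solution = np.zeros((128, 128))        # placeholder final field
convergence = np.logspace(0, -6, 50)   # placeholder residual history
state = np.zeros((128, 128))           # placeholder internal state
write_output("solution", solution)          # copied to results/final_solution.bin
write_output("convergence", convergence)    # copied to results/convergence_history.csv
write_output("internal_state", state)       # stays in artifacts/ only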
After workflow completion:
my-simulation/
├── artifacts/                        # Raw Arrow files (can be cleaned up)
│   └── solver/
│       ├── solution.arrow
│       ├── convergence.arrow
│       └── internal_state.arrow
└── results/                          # User-facing outputs (kept)
    ├── final_solution.bin
    └── convergence_history.csv
Key Distinctions
| Folder | Purpose | Lifecycle |
|---|---|---|
| artifacts/ | Intermediate Arrow files for inter-node data flow | Ephemeral; can be cleaned after workflow |
| results/ | Final outputs you want to keep and inspect | Persistent; version-controlled or archived |
| data/ | Static inputs to the workflow | Persistent; checked into version control |
Accessing Results
# List results after workflow completion
ls results/
# Results are plain files in the format you specified
# No Arrow wrapper, ready for your analysis tools
head results/convergence_history.csv
The path: field also controls the output filename and format. Expanse will convert from Arrow to the appropriate format based on the file extension where possible.
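For example, the convergence history could be loaded directly with standard tools (assuming it contains one residual value per line):
import numpy as np
# Plain CSV, no Arrow wrapper: load it like any other text file
history = np.loadtxt("results/convergence_history.csv", delimiter=",")
print(history.shape, history.min())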