Data Transfer & Filesystem

Expanse provides a tiered filesystem abstraction that enables seamless data sharing across languages (Python, C, Fortran) and clusters. Data flows automatically between nodes without manual file management.

Same-Cluster Optimisation

When producer and consumer nodes run on the same cluster:

Producer writes:   artifacts/preprocess/mesh.arrow
Symlink created:   inputs/mesh.arrow → artifacts/preprocess/mesh.arrow
Consumer reads:    inputs/mesh.arrow   (zero-copy via symlink)

No data is copied, as both nodes reference the same file on disk.
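
You can confirm this from inside a consumer node by resolving the staged input path. The snippet below is a minimal sketch: it assumes an input named mesh staged as mesh.arrow, and it only inspects the filesystem using the EXPANSE_INPUTS variable described under Environment Variables below.

import os

# EXPANSE_INPUTS points at the directory of staged inputs (set by Expanse)
staged = os.path.join(os.environ["EXPANSE_INPUTS"], "mesh.arrow")  # assumed filename

print(os.path.islink(staged))     # True when the same-cluster symlink path was taken
print(os.path.realpath(staged))   # resolves to .../artifacts/preprocess/mesh.arrow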

Cross-Cluster Transfer

When nodes run on different clusters, Expanse handles transfer automatically:

# Node on local machine produces data
- name: preprocess
  ref: nodes/preprocess
  cluster: local

# Node on ARCHER2 consumes it, and transfer happens automatically
- name: solver
  ref: nodes/solver
  cluster: archer2

Transfer priority (see the sketch after this list):

  1. Globus: Preferred for large files between HPC centres with Globus endpoints
  2. rsync over SSH: Fallback for smaller transfers or when Globus is unavailable
  3. scp: Final fallback
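
Conceptually, the selection works like the sketch below. Every helper name and the size threshold are hypothetical stand-ins, not part of the Expanse API; they only illustrate the documented fallback order.

# Illustrative only: the helpers below are hypothetical stubs, not Expanse APIs.
LARGE_FILE_THRESHOLD = 1 << 30                     # assumed cutoff for "large" (1 GiB)

def globus_available(src, dst): return False       # stub: would check endpoint config
def ssh_reachable(dst): return True                # stub: would probe an SSH connection

def choose_transport(src, dst, size_bytes):
    """Mirror the documented priority: Globus, then rsync over SSH, then scp."""
    if globus_available(src, dst) and size_bytes >= LARGE_FILE_THRESHOLD:
        return "globus"
    if ssh_reachable(dst):
        return "rsync"
    return "scp"

print(choose_transport("local", "archer2", 2 << 30))   # "rsync" with these stubs
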
Coming Soon

/dev/shm optimisation for nodes running on the same physical compute node within a cluster; this will enable shared-memory transfers with zero filesystem overhead.

expanse_io API

Each language runtime provides a consistent API for reading inputs and writing outputs. Nodes should use these APIs rather than direct file I/O to ensure proper Arrow format handling and artifact registration.

Python

from expanse_io import read_input, write_output, read_json, write_json

# Read array input from previous node
mesh = read_input("mesh") # Returns numpy.ndarray
params = read_input("params") # Shape/dtype from Arrow schema

# Read JSON input
config = read_json("config") # Returns dict

# Write array output for downstream nodes
write_output("solution", result) # Accepts numpy.ndarray
write_output("residuals", errors)

# Write JSON output
write_json("metrics", {"loss": 0.05, "accuracy": 0.98})

Fortran

use expanse_io

real(8), allocatable :: mesh(:,:), solution(:,:)
integer :: shape(2), ierr

! Read array input
call expanse_read_real64("mesh", mesh, shape, ierr)
if (ierr /= 0) stop "Failed to read mesh"

! Process...

! Write array output
call expanse_write_real64("solution", solution, shape, ierr)

! Also available:
! expanse_read_real32 / expanse_write_real32 (float)
! expanse_read_int32 / expanse_write_int32 (integer)
! expanse_read_int64 / expanse_write_int64 (long)

C

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include "expanse_io.h"

double *mesh;
double *solution;
int64_t shape[2];
int err;

// Read array input
err = expanse_read_real64("mesh", &mesh, shape);
if (err != 0) { fprintf(stderr, "Failed to read mesh\n"); exit(1); }

// Process...

// Write array output (last argument: number of dimensions)
err = expanse_write_real64("solution", solution, shape, 2);

// Also available:
// expanse_read_real32 / expanse_write_real32
// expanse_read_int32 / expanse_write_int32
// expanse_read_int64 / expanse_write_int64

Supported Data Types

Type String        | Python      | Fortran     | C
array[float64, N]  | np.float64  | real(8)     | double
array[float32, N]  | np.float32  | real(4)     | float
array[int64, N]    | np.int64    | integer(8)  | int64_t
array[int32, N]    | np.int32    | integer(4)  | int32_t
json               | dict        | N/A         | N/A
file               | Path string | Path string | Path string

The N in type strings indicates dimensionality: array[float64, 1] is a 1D array, array[float64, 2] is 2D, etc.
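
In Python, for example, the dimensionality comes straight from the array you pass to write_output. A minimal sketch (the output names here are illustrative):

import numpy as np
from expanse_io import write_output

history = np.zeros(100, dtype=np.float64)        # matches array[float64, 1]
field = np.zeros((64, 64), dtype=np.float64)     # matches array[float64, 2]

write_output("history", history)
write_output("field", field)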

Environment Variables

The runtime libraries use these environment variables (set automatically by Expanse):

Variable             | Description
EXPANSE_INPUTS       | Directory containing input Arrow files (symlinked from producers)
EXPANSE_ARTIFACT_DIR | Directory where this node should write its Arrow outputs
EXPANSE_OUTPUTS      | Directory for user-visible result files (copied to results/)
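
You normally never set these yourself, but they can be handy when debugging a node. A minimal sketch that prints the directories a node was given:

import os

# All three variables are set by Expanse before the node starts
for var in ("EXPANSE_INPUTS", "EXPANSE_ARTIFACT_DIR", "EXPANSE_OUTPUTS"):
    print(f"{var} = {os.environ.get(var, '<not set>')}")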

Project Data Folder

The data/ folder at your project root holds static input files that nodes can reference:

my-simulation/
├── data/
│   ├── initial_mesh.vtk        # Static input mesh
│   ├── parameters.json         # Simulation parameters
│   └── boundary_conditions/
│       ├── inlet.csv
│       └── outlet.csv
├── nodes/
└── workflows/

Reference project data in node inputs using the data/ prefix:

# nodes/solver/node.yaml
inputs:
  - name: mesh
    from: data/initial_mesh.vtk     # Static project file
    type: file
  - name: config
    from: data/parameters.json
    type: json
  - name: boundaries
    from: preprocessor/boundaries   # Dynamic from previous node
    type: array[float64, 2]

The data/ folder is:

  • Read-only from the node's perspective
  • Version-controlled with your project
  • Transferred automatically to remote clusters when needed
  • Useful for: initial conditions, configuration files, reference datasets, lookup tables
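
Inside the node, these inputs are read with the normal expanse_io calls. A minimal Python sketch for the json and array inputs above (the file-typed mesh input arrives as a path string, per the Supported Data Types table, and is omitted here; the viscosity key is purely illustrative):

from expanse_io import read_input, read_json

config = read_json("config")            # parsed from data/parameters.json
boundaries = read_input("boundaries")   # 2-D float64 array from the preprocessor node

nu = config.get("viscosity", 1.0e-3)    # illustrative parameter name, not from the docs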

Results Folder

The results/ folder collects outputs you want to keep after workflow completion. Mark outputs as results using the path: field:

# nodes/solver/node.yaml
outputs:
  - name: solution
    type: array[float64, 2]
    path: final_solution.bin          # ← Copied to results/
  - name: convergence
    type: array[float64, 1]
    path: convergence_history.csv     # ← Copied to results/
  - name: internal_state
    type: array[float64, 2]
    # No path: field, stays in artifacts/, not copied to results/
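
From the node's point of view nothing changes: every output is written through expanse_io, and Expanse copies only the path-tagged ones into results/. A minimal Python sketch matching the outputs above (the array shapes are placeholders):

import numpy as np
from expanse_io import write_output

solution = np.zeros((128, 128))       # placeholder data for illustration
convergence = np.zeros(500)
internal_state = np.zeros((128, 128))

write_output("solution", solution)                # copied to results/final_solution.bin
write_output("convergence", convergence)          # copied to results/convergence_history.csv
write_output("internal_state", internal_state)    # stays in artifacts/ only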

After workflow completion:

my-simulation/
├── artifacts/                  # Raw Arrow files (can be cleaned up)
│   └── solver/
│       ├── solution.arrow
│       ├── convergence.arrow
│       └── internal_state.arrow
└── results/                    # User-facing outputs (kept)
    ├── final_solution.bin
    └── convergence_history.csv

Key Distinctions

Folder     | Purpose                                            | Lifecycle
artifacts/ | Intermediate Arrow files for inter-node data flow  | Ephemeral; can be cleaned after workflow
results/   | Final outputs you want to keep and inspect         | Persistent; version-controlled or archived
data/      | Static inputs to the workflow                      | Persistent; checked into version control

Accessing Results

# List results after workflow completion
ls results/

# Results are plain files in the format you specified
# No Arrow wrapper, ready for your analysis tools
head results/convergence_history.csv

The path: field also controls the output filename and format. Expanse will convert from Arrow to the appropriate format based on the file extension where possible.
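
Because results are plain files, they drop straight into your usual analysis tools. A minimal sketch that loads the convergence history from the example above (the column layout is an assumption):

import numpy as np

# Plain CSV, no Arrow wrapper; assumes one value per line for the 1-D history
history = np.loadtxt("results/convergence_history.csv", delimiter=",")
print(history.shape, float(history.min()), float(history.max()))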