Data Transfer & Filesystem

Expanse provides a tiered filesystem abstraction that enables seamless data sharing across languages (Python, C, Fortran) and clusters. Data flows automatically between nodes without manual file management.

Same-Cluster Optimisation

When producer and consumer nodes run on the same cluster:

Producer writes:   artifacts/preprocess/mesh.arrow
Symlink created:   inputs/mesh.arrow → artifacts/preprocess/mesh.arrow
Consumer reads:    inputs/mesh.arrow   (zero-copy via symlink)

No data is copied, as both nodes reference the same file on disk.
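
You can confirm this from inside a consumer node by resolving the staged input path. The snippet below is a minimal sketch: it assumes an input named mesh staged as mesh.arrow, and it only inspects the filesystem using the EXPANSE_INPUTS variable described under Environment Variables below.

import os

# EXPANSE_INPUTS points at the directory of staged inputs (set by Expanse)
staged = os.path.join(os.environ["EXPANSE_INPUTS"], "mesh.arrow")  # assumed filename

print(os.path.islink(staged))     # True when the same-cluster symlink path was taken
print(os.path.realpath(staged))   # resolves to .../artifacts/preprocess/mesh.arrow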

Cross-Cluster Transfer

When nodes run on different clusters, Expanse handles transfer automatically:

# Node on local machine produces data
- name: preprocess
  ref: nodes/preprocess
  cluster: local

# Node on ARCHER2 consumes it, and transfer happens automatically
- name: solver
  ref: nodes/solver
  cluster: archer2

Transfer priority (see the sketch after this list):

  1. Globus: Preferred for large files between HPC centres with Globus endpoints
  2. rsync over SSH: Fallback for smaller transfers or when Globus is unavailable
  3. scp: Final fallback
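
Conceptually, the selection works like the sketch below. Every helper name and the size threshold are hypothetical stand-ins, not part of the Expanse API; they only illustrate the documented fallback order.

# Illustrative only: the helpers below are hypothetical stubs, not Expanse APIs.
LARGE_FILE_THRESHOLD = 1 << 30                     # assumed cutoff for "large" (1 GiB)

def globus_available(src, dst): return False       # stub: would check endpoint config
def ssh_reachable(dst): return True                # stub: would probe an SSH connection

def choose_transport(src, dst, size_bytes):
    """Mirror the documented priority: Globus, then rsync over SSH, then scp."""
    if globus_available(src, dst) and size_bytes >= LARGE_FILE_THRESHOLD:
        return "globus"
    if ssh_reachable(dst):
        return "rsync"
    return "scp"

print(choose_transport("local", "archer2", 2 << 30))   # "rsync" with these stubs
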
Coming Soon

/dev/shm optimisation for nodes running on the same physical compute node within a cluster; this will enable shared-memory transfers with zero filesystem overhead.

expanse_io API

Each language runtime provides a consistent API for reading inputs and writing outputs. Nodes should use these APIs rather than direct file I/O to ensure proper Arrow format handling and artifact registration.

Python

from expanse_io import read_input, write_output, read_json, write_json

# Read array input from previous node
mesh = read_input("mesh") # Returns numpy.ndarray
params = read_input("params") # Shape/dtype from Arrow schema

# Read JSON input
config = read_json("config") # Returns dict

# Write array output for downstream nodes
write_output("solution", result) # Accepts numpy.ndarray
write_output("residuals", errors)

# Write JSON output
write_json("metrics", {"loss": 0.05, "accuracy": 0.98})

Fortran

use expanse_io

real(8), allocatable :: mesh(:,:), solution(:,:)
integer :: shape(2), ierr

! Read array input
call expanse_read_real64("mesh", mesh, shape, ierr)
if (ierr /= 0) stop "Failed to read mesh"

! Process...

! Write array output
call expanse_write_real64("solution", solution, shape, ierr)

! Also available:
! expanse_read_real32 / expanse_write_real32 (float)
! expanse_read_int32 / expanse_write_int32 (integer)
! expanse_read_int64 / expanse_write_int64 (long)

C

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include "expanse_io.h"

double *mesh;
double *solution;
int64_t shape[2];
int err;

// Read array input
err = expanse_read_real64("mesh", &mesh, shape);
if (err != 0) { fprintf(stderr, "Failed to read mesh\n"); exit(1); }

// Process...

// Write array output (last argument: number of dimensions)
err = expanse_write_real64("solution", solution, shape, 2);

// Also available:
// expanse_read_real32 / expanse_write_real32
// expanse_read_int32 / expanse_write_int32
// expanse_read_int64 / expanse_write_int64

Supported Data Types

Type String        | Python      | Fortran     | C
array[float64, N]  | np.float64  | real(8)     | double
array[float32, N]  | np.float32  | real(4)     | float
array[int64, N]    | np.int64    | integer(8)  | int64_t
array[int32, N]    | np.int32    | integer(4)  | int32_t
json               | dict        | N/A         | N/A
file               | Path string | Path string | Path string

The N in type strings indicates dimensionality: array[float64, 1] is a 1D array, array[float64, 2] is 2D, etc.
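
In Python, for example, the dimensionality comes straight from the array you pass to write_output. A minimal sketch (the output names here are illustrative):

import numpy as np
from expanse_io import write_output

history = np.zeros(100, dtype=np.float64)        # matches array[float64, 1]
field = np.zeros((64, 64), dtype=np.float64)     # matches array[float64, 2]

write_output("history", history)
write_output("field", field)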

Environment Variables

The runtime libraries use these environment variables (set automatically by Expanse):

Variable             | Description
EXPANSE_INPUTS       | Directory containing input Arrow files (symlinked from producers)
EXPANSE_ARTIFACT_DIR | Directory where this node should write its Arrow outputs
EXPANSE_OUTPUTS      | Directory for user-visible result files (copied to results/)
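
You normally never set these yourself, but they can be handy when debugging a node. A minimal sketch that prints the directories a node was given:

import os

# All three variables are set by Expanse before the node starts
for var in ("EXPANSE_INPUTS", "EXPANSE_ARTIFACT_DIR", "EXPANSE_OUTPUTS"):
    print(f"{var} = {os.environ.get(var, '<not set>')}")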

Project Data Folder

The data/ folder at your project root holds static input files that nodes can reference:

my-simulation/
├── data/
│   ├── initial_mesh.vtk        # Static input mesh
│   ├── parameters.json         # Simulation parameters
│   └── boundary_conditions/
│       ├── inlet.csv
│       └── outlet.csv
├── nodes/
└── workflows/

Reference project data in node inputs using the data/ prefix:

# nodes/solver/node.yaml
inputs:
  - name: mesh
    from: data/initial_mesh.vtk     # Static project file
    type: file
  - name: config
    from: data/parameters.json
    type: json
  - name: boundaries
    from: preprocessor/boundaries   # Dynamic from previous node
    type: array[float64, 2]

The data/ folder is:

  • Read-only from the node's perspective
  • Version-controlled with your project
  • Transferred automatically to remote clusters when needed
  • Useful for: initial conditions, configuration files, reference datasets, lookup tables
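
Inside the node, these inputs are read with the normal expanse_io calls. A minimal Python sketch for the json and array inputs above (the file-typed mesh input arrives as a path string, per the Supported Data Types table, and is omitted here; the viscosity key is purely illustrative):

from expanse_io import read_input, read_json

config = read_json("config")            # parsed from data/parameters.json
boundaries = read_input("boundaries")   # 2-D float64 array from the preprocessor node

nu = config.get("viscosity", 1.0e-3)    # illustrative parameter name, not from the docs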

Results Folder

The results/ folder collects outputs you want to keep after workflow completion. Mark outputs as results using the path: field:

# nodes/solver/node.yaml
outputs:
  - name: solution
    type: array[float64, 2]
    path: final_solution.bin          # ← Copied to results/
  - name: convergence
    type: array[float64, 1]
    path: convergence_history.csv     # ← Copied to results/
  - name: internal_state
    type: array[float64, 2]
    # No path: field, stays in artifacts/, not copied to results/
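
From the node's point of view nothing changes: every output is written through expanse_io, and Expanse copies only the path-tagged ones into results/. A minimal Python sketch matching the outputs above (the array shapes are placeholders):

import numpy as np
from expanse_io import write_output

solution = np.zeros((128, 128))       # placeholder data for illustration
convergence = np.zeros(500)
internal_state = np.zeros((128, 128))

write_output("solution", solution)                # copied to results/final_solution.bin
write_output("convergence", convergence)          # copied to results/convergence_history.csv
write_output("internal_state", internal_state)    # stays in artifacts/ only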

After workflow completion:

my-simulation/
├── artifacts/                  # Raw Arrow files (can be cleaned up)
│   └── solver/
│       ├── solution.arrow
│       ├── convergence.arrow
│       └── internal_state.arrow
└── results/                    # User-facing outputs (kept)
    ├── final_solution.bin
    └── convergence_history.csv

Key Distinctions

Folder     | Purpose                                            | Lifecycle
artifacts/ | Intermediate Arrow files for inter-node data flow  | Ephemeral; can be cleaned after workflow
results/   | Final outputs you want to keep and inspect         | Persistent; version-controlled or archived
data/      | Static inputs to the workflow                      | Persistent; checked into version control

Accessing Results

# List results after workflow completion
ls results/

# Results are plain files in the format you specified
# No Arrow wrapper, ready for your analysis tools
head results/convergence_history.csv

The path: field also controls the output filename and format. Expanse will convert from Arrow to the appropriate format based on the file extension where possible.
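
Because results are plain files, they drop straight into your usual analysis tools. A minimal sketch that loads the convergence history from the example above (the column layout is an assumption):

import numpy as np

# Plain CSV, no Arrow wrapper; assumes one value per line for the 1-D history
history = np.loadtxt("results/convergence_history.csv", delimiter=",")
print(history.shape, float(history.min()), float(history.max()))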