Low Memory Strategies

Overview

Large-scale Self-Consistent Field Theory (SCFT) simulations are fundamentally memory-bound. The storage of propagator fields $q(\mathbf{r}, s)$ typically dominates the memory footprint, scaling as $\mathcal{O}(P \cdot M \cdot N_s)$, where $P$ is the number of propagators, $M$ is the spatial grid size, and $N_s$ is the number of contour steps.

For a typical $256^3$ grid simulation:

  • Grid points ($M$): $\approx 1.67 \times 10^7$
  • Propagator Memory: $> 26$ GB (standard storage)
  • Bottleneck: Exceeds the VRAM of most consumer and workstation GPUs (10-24 GB).
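
These figures follow directly from the $\mathcal{O}(P \cdot M \cdot N_s)$ scaling. A back-of-the-envelope sketch (Python; the propagator count and contour resolution here are illustrative assumptions, not Polyorder.jl defaults):

```python
def propagator_memory_gb(num_propagators, grid_points, contour_steps,
                         bytes_per_value=8):
    """Estimate propagator storage: O(P * M * N_s) values in double precision."""
    return num_propagators * grid_points * contour_steps * bytes_per_value / 1e9

M = 256**3  # ~1.67e7 spatial grid points
# Assuming, say, two propagators (forward/backward) and ~100 contour steps:
mem = propagator_memory_gb(2, M, 100)
print(f"{M:.3g} grid points -> {mem:.1f} GB")  # ~26.8 GB, beyond most GPU VRAM
```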

Polyorder.jl implements two orthogonal, state-of-the-art strategies to overcome this barrier, enabling massive simulations on single GPUs:

  1. Checkpointing (Time-Domain Compression): Optimizes storage along the contour integration path ($N_s$).
  2. Symmetric Storage (Space-Domain Compression): Exploits crystallographic symmetries to compress spatial data ($M$).

Combined, these methods can reduce memory usage by over 99%, allowing $256^3$ or even $512^3$ simulations on constrained hardware.

Performance Profiles

Polyorder.jl manages these strategies through high-level Performance Profiles. The following table maps profiles to the underlying memory algorithms:

| Profile | Strategy 1: Checkpointing | Strategy 2: Symmetry | Precomputed Physics | Target Use Case |
|---|---|---|---|---|
| :fast | ❌ Full | ❌ None | ✅ Yes | Small/Medium CPU |
| :balanced | ❌ Full | ✅ Auto | ✅ Yes | Default (CPU/GPU) |
| :compact | ❌ Full | ✅ Auto | ❌ No (On-the-fly) | Medium GPU ($128^3$) |
| :minimal | ✅ Active | ✅ Auto | ❌ No (On-the-fly) | Max Scale ($256^3$) |

  • Precomputed Physics: Storing operators like exp(-k²) speeds up the simulation but costs memory. Profiles like :compact and :minimal calculate these on-the-fly to save space.
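
The trade-off can be sketched outside Julia (NumPy; the grid shape, cell lengths, and contour step `ds` are made-up values): the precomputed variant keeps a full 3-D operator array alive for the whole run, while the on-the-fly variant rebuilds it each step from cheap 1-D frequency axes.

```python
import numpy as np

def k_squared(shape, lengths):
    """|k|^2 on an FFT grid, built from 1-D frequency axes (cheap to store)."""
    axes = [2 * np.pi * np.fft.fftfreq(n, d=L / n) for n, L in zip(shape, lengths)]
    kx, ky, kz = np.meshgrid(*axes, indexing="ij", sparse=True)
    return kx**2 + ky**2 + kz**2

shape, lengths, ds = (32, 32, 32), (4.0, 4.0, 4.0), 0.01

# Precomputed: one full 3-D array kept alive for the whole simulation.
op_cached = np.exp(-k_squared(shape, lengths) * ds)

# On-the-fly: rebuilt inside each step; only the 1-D axes persist.
def step_operator():
    return np.exp(-k_squared(shape, lengths) * ds)

assert np.allclose(op_cached, step_operator())
```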

Strategy 1: Checkpointing (Time-Domain)

Concept

Standard SCFT solvers store the propagator $q(\mathbf{r}, s)$ at every contour step $s$ to compute the density $\phi(\mathbf{r})$ and stress tensor. Checkpointing trades a small amount of computation for massive memory savings by storing only a sparse set of "checkpoint" frames.

When a specific step $s$ is needed:

  1. The solver retrieves the nearest preceding checkpoint.
  2. It re-integrates (propagates) forward to $s$.
  3. Intermediate steps are cached in a temporary buffer.

Polyorder.jl uses an analytical optimal checkpoint distribution to minimize the peak memory footprint, scaling storage from $\mathcal{O}(N_s)$ to $\mathcal{O}(\sqrt{N_s})$.
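
A toy model (Python; not the Polyorder.jl implementation) makes the scaling concrete: storing every k-th slice as a checkpoint, plus a replay buffer of at most k slices, is minimized near $k \approx \sqrt{N_s}$.

```python
import math

def checkpoint_memory(n_steps):
    """Slices held in memory under period-k checkpointing: ceil(n/k)
    checkpoints plus a replay buffer of at most k intermediate slices."""
    k = max(1, round(math.sqrt(n_steps)))  # near-optimal period
    n_checkpoints = math.ceil(n_steps / k)
    return n_checkpoints + k               # vs. n_steps for full storage

for n in (100, 400, 1600):
    print(n, "steps:", checkpoint_memory(n), "slices instead of", n)
```

Doubling $N_s$ four-fold only doubles the memory, i.e., $\mathcal{O}(\sqrt{N_s})$.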

Usage

The most reliable way to enable periodic checkpointing is to use the :minimal profile:

```julia
# 1. Using Profile (Recommended)
# Enables Checkpointing + Symmetry + On-the-fly MDE
scft = NoncyclicChainSCFT(system, w; profile=:minimal)

# 2. Advanced Manual Control
# Use checkpointing WITHOUT the other :minimal-profile restrictions
scft = NoncyclicChainSCFT(system, w, ds;
    low_memory=true,
    mde_options=(; shared_cache=true)
)
```

Performance Impact:

  • Memory: ~30-50% reduction for typical chains ($N_s \approx 100$).
  • Compute: Up to ~1× additional propagation work (each contour step is re-integrated at most once) due to checkpoint replay.
  • Allocation: Zero runtime allocation (pre-allocated caches).

Comparison with State-of-the-Art (PSCF+)

Recent work (e.g., PSCF+, JCTC 2025) introduced the "Slice" algorithm, which uses an arithmetic sequence partitioning (slices of size 1, 2, 3...) to achieve $\mathcal{O}(\sqrt{N_s})$ memory scaling.

Polyorder.jl achieves the same asymptotic $\mathcal{O}(\sqrt{N_s})$ scaling using an Optimal Periodic Checkpointing distribution but introduces a critical architectural innovation: Shared Cache Pools.

| Feature | PSCF+ "Slice" | Polyorder.jl (This Work) |
|---|---|---|
| Scaling | $\mathcal{O}(\sqrt{N_s})$ | $\mathcal{O}(\sqrt{N_s})$ |
| Strategy | Arithmetic sequence | Analytic optimal period |
| Cache Memory | Per-propagator buffer | Shared pool (multiple propagators reuse the same memory) |
| Impact | Reduces $N_s$ term | Reduces both the $N_s$ term and the prefactor of the cache overhead |

By sharing caches between forward/backward propagators and even different polymer blocks, Polyorder.jl minimizes the "working memory" required for recomputation, which is often the new bottleneck once the main trajectory storage is compressed.
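
The shared-pool idea can be sketched in a few lines (Python; the acquire/release API is hypothetical, for illustration only). Buffers are allocated lazily and returned to the pool, so propagators that replay one after another reuse the same memory:

```python
import numpy as np

class SharedCachePool:
    """Hand out pre-allocated scratch buffers; idle buffers are reused
    rather than allocating one replay buffer per propagator."""
    def __init__(self, shape, dtype=np.float64):
        self.shape, self.dtype = shape, dtype
        self.free = []
        self.allocated = 0
    def acquire(self):
        if self.free:
            return self.free.pop()      # reuse an idle buffer
        self.allocated += 1
        return np.empty(self.shape, self.dtype)
    def release(self, buf):
        self.free.append(buf)

pool = SharedCachePool((64, 64, 64))
for _ in range(4):                      # four propagators replay in turn
    buf = pool.acquire()
    buf.fill(0.0)                       # stand-in for re-integration work
    pool.release(buf)
print(pool.allocated)                   # -> 1: all four shared one buffer
```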


Strategy 2: Symmetric Storage (Space-Domain)

Scientific Innovation

This strategy introduces a novel approach to memory compression by leveraging Space Group Theory. In ordered block copolymer phases (e.g., Gyroid, Sphere, Lamellar), the density and propagator fields respect the symmetry of the underlying crystallographic space group.

Instead of storing the field values at all $M$ grid points, we only need to store the values for the symmetry-unique points (stars) in the irreducible wedge of the Wigner-Seitz cell.

  • Full Grid: $M$ points.
  • Symmetric Basis: $N_{\text{stars}} \approx M / |G|$, where $|G|$ is the order of the space group.
  • Compression Ratio: Proportional to the number of symmetry operations (e.g., cubic groups like $Ia\bar{3}d$ provide ~96x compression).

Polyorder.jl implements a fully featured SymmetricStorage backend that handles this mapping transparently, performing mathematical operations directly in the compressed domain or expanding on-the-fly (getindex) with zero-allocation buffers.
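
A minimal model of such a backend (Python; ToySymmetricStorage is a stand-in, not the Polyorder.jl type) stores one value per symmetry orbit plus an index map from full-grid positions to orbit representatives, expanding on demand via getindex-style indexing:

```python
import numpy as np

class ToySymmetricStorage:
    """Store one value per symmetry orbit; expand via an index map."""
    def __init__(self, orbit_of):
        # orbit_of[i] = orbit index of full-grid point i
        self.orbit_of = np.asarray(orbit_of)
        self.values = np.zeros(self.orbit_of.max() + 1)
    def __getitem__(self, i):           # on-the-fly expansion of one point
        return self.values[self.orbit_of[i]]
    def full(self):                     # materialize the full grid
        return self.values[self.orbit_of]

# Simplest case, 1-D mirror symmetry: point i is equivalent to n-1-i.
n = 8
orbit_of = [min(i, n - 1 - i) for i in range(n)]
field = ToySymmetricStorage(orbit_of)
field.values[:] = [1.0, 2.0, 3.0, 4.0]  # only 4 stored values for 8 points
print(field[6], field.full())           # field[6] == field[1] == 2.0
```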

Usage

Symmetric storage is automatically enabled by default in :balanced, :compact, and :minimal profiles for periodic systems with a defined UnitCell.

```julia
# 1. Automatic (Default)
# Uses symmetric storage if a crystal system is detected
scft = NoncyclicChainSCFT(system, w; profile=:balanced)

# 2. Manual Override (Pre-computed)
# If you want to reuse a specific symmetry specification
spec = Polyorder.symmetry_spec(w)
scft = NoncyclicChainSCFT(system, w; symmetry=spec)

# 3. Disabling Symmetry
# Use the :fast profile or specific flags if you need to break symmetry
scft = NoncyclicChainSCFT(system, w; profile=:fast)
```

Symmetric storage assumes the fields strictly obey the space group symmetries. It is ideal for computing known phases (e.g., Gyroid stability).


Benchmarks: Breaking the Limit

The combination of these strategies enables simulations previously impossible on single-GPU hardware.

Memory Scaling (Gyroid, Space Group 230)

AB Diblock, $N=100$, 4 Propagators

| Resolution | Grid Size ($M$) | Standard Storage | Checkpointing | Symmetric Storage | Reduction |
|---|---|---|---|---|---|
| Low | $48^3$ | 172 MB | 120 MB | 2.25 MB | 98.7% |
| Standard | $96^3$ | 1.35 GB | 669 MB | 17.8 MB | 98.7% |
| High | $128^3$ | 3.1 GB | 1.5 GB | 42 MB | 98.7% |
| Extreme | $256^3$ | 25.5 GB (OOM) | 8.5 GB | 336 MB | 98.7% |
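
The Reduction column is just the ratio of the Symmetric Storage and Standard Storage figures, e.g.:

```python
def reduction_pct(standard_mb, compressed_mb):
    """Percentage of memory saved relative to standard storage."""
    return 100 * (1 - compressed_mb / standard_mb)

print(round(reduction_pct(172, 2.25), 1))        # 48^3 row -> 98.7
print(round(reduction_pct(25.5 * 1000, 336), 1)) # 256^3 row -> 98.7
```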

256³ GPU Case Study

Consider a $256^3$ Gyroid simulation on an NVIDIA RTX 2080 Ti (11 GB VRAM):

  • Without Optimization: Requires >46 GB. Impossible to run.
  • With Symmetric Storage + Optimizations:
    • Propagator Memory: 336 MB
    • Total Working Memory: ~5.6 GB
    • Status: Successful Run

```
Tests Summary (RTX 2080 Ti):
  memory_demo: ✓ PASSED
  correctness: ✓ PASSED (Matches CPU reference to 1e-15)
  large_grid:  ✓ PASSED (256³ Gyroid solved in 5.6GB)
```

Hardware-Optimized Solvers

In addition to storage compression, Polyorder.jl employs hardware-aware optimizations:

  • Solver Sharing: Propagators with identical physical parameters share MDE solver instances and FFT buffers.
  • Fused Kernels: Custom CUDA kernels for density integration and Anderson mixing to minimize memory bandwidth.
  • Zero-Allocation Loops: All iterative steps use pre-allocated buffers, ensuring stable memory usage over thousands of steps.
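
The zero-allocation pattern, sketched with NumPy for concreteness (the buffer names are illustrative): every scratch array is allocated once before the loop and written in place, so memory usage stays flat over thousands of iterations.

```python
import numpy as np

grid = (64, 64, 64)
q = np.random.rand(*grid)          # stand-in propagator slice
w = np.random.rand(*grid)          # stand-in field
scratch = np.empty(grid)           # allocated once, before the loop

for _ in range(200):               # iterative SCFT-style update loop
    np.multiply(q, w, out=scratch) # writes into scratch, no new allocation
    np.exp(scratch, out=scratch)   # in-place again: still zero allocations
# memory usage is flat regardless of the iteration count
```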