Low Memory Mode
Polyorder.jl provides a Low Memory Mode utilizing Checkpointing and Shared Caches to significantly reduce memory usage during SCFT simulations. This enables running high-resolution 3D simulations (e.g., $128^3$ or larger) on memory-constrained hardware (e.g., GPUs with 11-24 GB VRAM) that would otherwise exceed memory limits.
Note: This mode trades compute time for memory. In exchange for massive memory savings (typically 50-80% reduction), the SCFT iteration time effectively doubles (~2x slowdown) due to propagator recomputation.
Prerequisites
Low Memory Mode is most useful in one of the following scenarios:
- Large-scale 3D simulations: Grid sizes of $128^3$ or larger.
- Memory-constrained hardware: Running on consumer/gaming GPUs or workstations with limited RAM.
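To judge whether a run falls into these scenarios, a back-of-the-envelope estimate of the propagator history size can help. The sketch below is not part of Polyorder.jl: the helper name propagator_history_bytes, the number of contour steps (200), and the number of stored propagators (2) are all assumptions chosen for illustration, and a real solver holds additional arrays (fields, densities, FFT workspaces) on top of this.

```julia
# Rough size of the propagator history in full storage mode.
# Hypothetical helper; every default here is an illustrative assumption.
function propagator_history_bytes(grid::NTuple{3,Int}, n_steps::Int;
                                  T::Type=Float64, n_propagators::Int=2)
    slice_bytes = prod(grid) * sizeof(T)   # one propagator at one contour step
    return slice_bytes * (n_steps + 1) * n_propagators
end

to_gib(bytes) = bytes / 2^30

full64 = propagator_history_bytes((128, 128, 128), 200)
full32 = propagator_history_bytes((128, 128, 128), 200; T=Float32)

println("Float64: ", round(to_gib(full64); digits=2), " GiB")
println("Float32: ", round(to_gib(full32); digits=2), " GiB")
```

Under these assumptions a $128^3$ grid needs roughly 6.3 GiB for the propagator history alone in Float64, which is already a large share of an 11 GB GPU.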
Usage
Basic Usage
To enable Low Memory Mode, simply set the low_memory=true keyword argument when creating your NoncyclicChainSCFT solver. This automatically enables both checkpointing and shared cache optimizations with auto-tuned parameters.
using Polyorder
# 1. Define typical system and fields
system = ...
w = ...
ds = ...
# 2. Enable Low Memory Mode
scft = NoncyclicChainSCFT(system, w, ds;
low_memory=true, # <--- Enables checkpointing and the shared cache
init=:randn
)
# 3. Run as usual
Polyorder.solve!(scft)
Advanced Configuration
You can customize the checkpointing behavior using the mde_options NamedTuple.
scft = NoncyclicChainSCFT(system, w, ds;
low_memory=true,
mde_options=(;
shared_cache=true, # Use shared cache pool (recommended: true)
k=0 # Checkpoint interval (0 = auto-tune)
)
)
- shared_cache: If true (default), multiple propagators share a single recomputation buffer. This saves significantly more memory but requires that propagators are not accessed simultaneously (which holds for standard SCFT).
- k: The checkpointing interval. k=0 (default) calculates the optimal $k$ analytically to minimize memory usage. You can manually set a positive integer $k$ if needed (e.g., k=10 stores a checkpoint at every 10th step).
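To see why an interval near $\sqrt{N_s}$ is a natural choice, consider a simplified one-level checkpointing model. This model is an assumption for illustration; the auto-tuner in Polyorder.jl may use a different cost function. With $N_s$ contour steps and interval $k$, the solver stores about $N_s/k$ checkpoints plus a recomputation buffer of $k$ slices, so the number of stored slices is

$$
M(k) \approx \frac{N_s}{k} + k, \qquad
\frac{dM}{dk} = -\frac{N_s}{k^2} + 1 = 0
\;\Rightarrow\; k^\ast = \sqrt{N_s}, \qquad
M(k^\ast) \approx 2\sqrt{N_s}.
$$

For example, with $N_s = 100$ this model gives $k^\ast = 10$ and about 20 stored slices instead of 100.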
CPU vs GPU Support
Low Memory Mode is fully device-agnostic:
- CPU: Reduces RAM usage, allowing you to run massive grids or multiple concurrent jobs on a workstation.
- GPU: Critical for running large 3D simulations on limited VRAM.
Example: GPU + Low Memory
Combine GPU arrays with low_memory=true to maximize grid size on your graphics card.
using CUDA, Polyorder
# 1. Setup system and lattice
# create lattice (lat), polymer system (sys), and contour step (ds) here...
# 2. Create GPU-backed field
w_gpu = AuxiliaryField(CUDA.zeros(Float64, 128, 128, 128), lat)
# 3. Enable low memory checkpointing
scft_gpu = NoncyclicChainSCFT(sys, w_gpu, ds;
low_memory=true, # Critical for 128^3 on consumer GPUs
init=:randn
)
solve!(scft_gpu)
How It Works
This mode utilizes optimal checkpointing to reduce the storage complexity of propagator history from $\mathcal{O}(N_s)$ to $\mathcal{O}(\sqrt{N_s})$.
- Checkpointing: Instead of storing the full propagator history ($N_s$ steps), we store "checkpoints" at intervals of $k$.
- Recomputation: Intermediate steps between checkpoints are recomputed on-the-fly when needed during density calculation. This effectively means solving the modified diffusion equations (MDEs) twice per iteration, leading to a ~2x increase in compute time.
- Shared Cache: A pool of temporary buffers is used for recomputation, shared across different propagators (e.g., forward/backward).
- Optimized Implementation: We use pre-allocated buffers and in-place broadcasting (.=) for recomputation, ensuring that the allocation overhead per iteration is negligible (< 4%).
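The steps above can be sketched with a toy example in plain Julia. This is a minimal illustration of one-level checkpointing with a reused segment buffer, not the Polyorder.jl implementation: toy_step is a hypothetical stand-in for one MDE contour step, and all function names are invented for this sketch.

```julia
# Hypothetical stand-in for one MDE contour step (simple periodic smoothing).
toy_step(q) = 0.5 .* (circshift(q, 1) .+ circshift(q, -1))

# Forward pass: keep only every k-th slice (slice 0 is always stored).
function forward_checkpoints(q0, n_steps, k)
    checkpoints = Dict(0 => copy(q0))
    q = copy(q0)
    for s in 1:n_steps
        q = toy_step(q)
        s % k == 0 && (checkpoints[s] = copy(q))
    end
    return checkpoints
end

# Refill a pre-allocated buffer with slices s0, s0+1, ..., s0+len in place.
function recompute_segment!(buffer, checkpoints, s0, len)
    buffer[1] .= checkpoints[s0]
    for j in 1:len
        buffer[j + 1] .= toy_step(buffer[j])
    end
    return buffer
end

# Reference: full storage of every slice.
function full_history(q0, n_steps)
    history = [copy(q0)]
    for _ in 1:n_steps
        push!(history, toy_step(history[end]))
    end
    return history
end

# Check that segment-wise recomputation reproduces the full history.
function matches_full_history(q0, n_steps, k)
    cps = forward_checkpoints(q0, n_steps, k)
    hist = full_history(q0, n_steps)
    buffer = [similar(q0) for _ in 1:(k + 1)]  # one buffer reused for all segments
    ok = true
    for s0 in 0:k:(n_steps - 1)
        len = min(k, n_steps - s0)
        recompute_segment!(buffer, cps, s0, len)
        for j in 0:len
            ok &= buffer[j + 1] == hist[s0 + j + 1]
        end
    end
    return ok, length(cps)
end

q0 = collect(range(0.0, 1.0; length=16))
ok, n_checkpoints = matches_full_history(q0, 100, 10)
println("checkpoints stored: ", n_checkpoints, ", matches full history: ", ok)
```

Each segment is recomputed once from its checkpoint, so the total work is about one extra forward pass, which is where the ~2x cost comes from. The same buffer serves every segment, mirroring the shared cache idea.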
Troubleshooting
Runtime Performance
Since propagator steps are recomputed during density integration, you should expect the simulation to run roughly 2x slower compared to the full storage mode. This is the inherent trade-off for the massive memory reduction.
GPU Memory Fragmentation
On GPUs, running multiple solvers sequentially can sometimes lead to memory fragmentation. If you encounter Out-Of-Memory (OOM) errors even with low_memory=true, try the following:
- Garbage Collection: Run GC.gc(); CUDA.reclaim() between solves to free up cached GPU memory.
- Fresh Session: Run a single large simulation in a fresh Julia session to ensure maximum contiguous memory availability.
- Precision: Use Float32 instead of Float64 for your fields. This reduces memory usage by another 50% at the cost of some numerical precision.
w_gpu = AuxiliaryField(CUDA.zeros(Float32, 128, 128, 128), lat)