GPU Acceleration

Polyorder.jl supports GPU acceleration through a device-agnostic design built on AcceleratedKernels.jl and KernelAbstractions.jl. The same core SCFT algorithms run on both CPU and GPU without code duplication; you only change the underlying array type.

Prerequisites

To use GPU acceleration, you need:

  1. A CUDA-capable NVIDIA GPU.
  2. Properly installed NVIDIA drivers.
  3. CUDA.jl installed in your Julia environment.

Installation

First, ensure CUDA.jl is installed. In the Julia REPL:

using Pkg
Pkg.add("CUDA")

Polyorder.jl automatically detects CUDA.jl and AcceleratedKernels.jl to enable GPU functionality.
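Before running a simulation, you can verify that the GPU is usable. CUDA.functional and CUDA.versioninfo are part of CUDA.jl's public API:

using CUDA

# Check that a usable GPU and driver were found
if CUDA.functional()
    CUDA.versioninfo()   # print driver, runtime and device details
else
    @warn "CUDA is not functional; GPU arrays will not be available"
end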

Example: Running SCFT on GPU

To run on a GPU, you simply initialize the AuxiliaryField with a CuArray (from CUDA.jl). Polyorder.jl detects the GPU array and automatically uses AcceleratedKernels.jl to dispatch operations to the device.

using Polyorder
using CUDA
using Random

# 1. Define the polymer system (exact same code for CPU and GPU)
system = AB_system(χN=20.0)

# 2. Define the simulation cell (lattice)
lattice = BravaisLattice(UnitCell(Tetragonal(), 4.0, 1.6))

# --- GPU Initialization ---

# Create an initial GPU array (e.g., zeros) with the desired grid size
gpu_data = CUDA.zeros(Float64, 48, 48, 16)

# Create the AuxiliaryField backed by the GPU array
w_gpu = AuxiliaryField(gpu_data, lattice; specie=:A)

# Create the SCFT solver
scft_gpu = NoncyclicChainSCFT(system, w_gpu; 
    init=:randn, 
    rng=Xoshiro(1234)
)

# Run the solver
# All heavy computations (propagators, FFTs, field updates) happen on the GPU
Polyorder.solve!(scft_gpu)
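Once the solver has converged, results can be copied back to host memory with Array for analysis or plotting. This sketch assumes the field data is reachable through the same .data field used elsewhere in this page:

# Transfer the converged field back to CPU memory for post-processing
w_host = Array(w_gpu.data)

# Reductions run directly on the GPU array; only the scalar result is transferred
println("mean field value: ", sum(w_gpu.data) / length(w_gpu.data))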

How It Works

Polyorder.jl uses a device-agnostic approach:

  • Field Storage: The type of array used for fields (w, ϕ) determines the execution backend. Using CuArray triggers GPU execution.
  • Kernel Dispatch: AcceleratedKernels.jl and KernelAbstractions.jl are used to write kernels that compile for both CPU and GPU.
  • FFTs: The AbstractFFTs interface automatically uses CUDA.CUFFT when acting on CuArrays.
  • Linear Algebra: Updaters like Anderson Mixing use generic linear algebra that works seamlessly on GPU arrays.
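The array-type-driven dispatch can be illustrated with a minimal KernelAbstractions.jl kernel. This is a generic sketch of the pattern, not code taken from Polyorder.jl:

using KernelAbstractions

# A backend-agnostic kernel: the array type selects where it runs
@kernel function saxpy!(y, a, x)
    i = @index(Global)
    y[i] = a * x[i] + y[i]
end

function run_saxpy!(y, a, x)
    backend = get_backend(y)   # CPU() for Array, CUDABackend() for CuArray
    saxpy!(backend)(y, a, x; ndrange = length(y))
    KernelAbstractions.synchronize(backend)
    return y
end

# The same call works for a CPU Array or a CuArray
run_saxpy!(ones(8), 2.0, ones(8))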

Performance Tips

Precision (Float64 vs Float32)

SCFT simulations typically use Float64 (double precision) for accuracy, but most consumer GPUs deliver much higher throughput in Float32 (single precision).

  • Default: AuxiliaryField(..., init=:randn) creates Float64 arrays. CuArray(...) preserves the element type.
  • Faster Option: If single precision is sufficient for your problem, convert the field to Float32 before creating the AuxiliaryField.
# w_temp is an existing CPU-side AuxiliaryField; move its data to the GPU in Float32
w_gpu_data_f32 = CuArray{Float32}(w_temp.data)
w_gpu_f32 = AuxiliaryField(w_gpu_data_f32, lattice)
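Alternatively, allocate in single precision from the start and skip the conversion; this reuses the constructors shown earlier:

# Allocate the grid directly in Float32
gpu_data_f32 = CUDA.zeros(Float32, 48, 48, 16)
w_f32 = AuxiliaryField(gpu_data_f32, lattice; specie=:A)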

Grid Sizes

FFT performance on GPUs is highly sensitive to grid dimensions. Sizes that are powers of 2 (e.g., 32, 64, 128) or products of small primes are significantly faster.

Use best_N_fft to find an optimal grid size:

N = best_N_fft(desired_size; pow2=true)
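For a quick sanity check without the helper, Julia's built-in Base.nextprod finds the smallest "smooth" size with the given prime factors:

# Smallest FFT-friendly size ≥ 51 using factors 2, 3, 5 and 7
nextprod((2, 3, 5, 7), 51)   # → 54 (= 2 × 3^3)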

Troubleshooting

Out of Memory (OOM)

SCFT requires storing multiple copies of the propagator, plus the history arrays kept by accelerated updaters such as Anderson mixing. If you run out of GPU memory:

  1. Reduce the grid size.
  2. Use Float32 precision.
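CUDA.jl also ships tools for inspecting and freeing device memory, which help when diagnosing OOM errors. These calls belong to CUDA.jl, not Polyorder.jl:

using CUDA

CUDA.memory_status()   # print used vs. available GPU memory
GC.gc()                # collect unreferenced Julia objects holding GPU buffers
CUDA.reclaim()         # return cached memory from CUDA.jl's pool to the driver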