GPU Acceleration

Lux crypto libraries support GPU acceleration through Metal (Apple Silicon) and CUDA (NVIDIA) backends.

Overview

In the benchmarks below, GPU acceleration delivers 8x to 50x speedups on computationally intensive cryptographic operations:

BLS Pairings: Accelerated elliptic curve operations
NTT/FFT: Fast polynomial multiplication for lattice crypto
FHE Operations: Homomorphic encryption with GPU parallelism

Backend Selection

import "github.com/luxfi/crypto/gpu"

// Check available backends
backends := gpu.AvailableBackends()
// Returns: ["metal", "cuda", "cpu"]

// Select backend (auto-selects best available)
ctx := gpu.NewContext(gpu.BackendAuto)

// Or specify explicitly (pick one; plain assignment, since ctx is already declared)
ctx = gpu.NewContext(gpu.BackendMetal)  // Apple Silicon
ctx = gpu.NewContext(gpu.BackendCUDA)   // NVIDIA
ctx = gpu.NewContext(gpu.BackendCPU)    // SIMD fallback

Supported Operations

Array Operations

import "github.com/luxfi/crypto/gpu"

// Create arrays on GPU
a := gpu.NewArray([]float64{1, 2, 3, 4})
b := gpu.NewArray([]float64{5, 6, 7, 8})

// Element-wise operations
c := gpu.Add(a, b)
d := gpu.Mul(a, b)

// Matrix operations
m1 := gpu.NewMatrix([][]float64{{1, 2}, {3, 4}})
m2 := gpu.NewMatrix([][]float64{{5, 6}, {7, 8}})
result := gpu.MatMul(m1, m2)
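
For checking GPU results against known values, a plain CPU reference is handy. The sketch below uses only the standard library and the same inputs as the snippet above; the helper names `addVec`, `mulVec`, and `matMul` are hypothetical and are not part of the `gpu` package.

```go
package main

import "fmt"

// addVec returns the element-wise sum of a and b (equal lengths assumed).
func addVec(a, b []float64) []float64 {
	out := make([]float64, len(a))
	for i := range a {
		out[i] = a[i] + b[i]
	}
	return out
}

// mulVec returns the element-wise product of a and b (equal lengths assumed).
func mulVec(a, b []float64) []float64 {
	out := make([]float64, len(a))
	for i := range a {
		out[i] = a[i] * b[i]
	}
	return out
}

// matMul returns the matrix product of m1 (n x k) and m2 (k x p).
func matMul(m1, m2 [][]float64) [][]float64 {
	n, k, p := len(m1), len(m2), len(m2[0])
	out := make([][]float64, n)
	for i := 0; i < n; i++ {
		out[i] = make([]float64, p)
		for j := 0; j < p; j++ {
			for t := 0; t < k; t++ {
				out[i][j] += m1[i][t] * m2[t][j]
			}
		}
	}
	return out
}

func main() {
	a := []float64{1, 2, 3, 4}
	b := []float64{5, 6, 7, 8}
	fmt.Println(addVec(a, b)) // [6 8 10 12]
	fmt.Println(mulVec(a, b)) // [5 12 21 32]

	m1 := [][]float64{{1, 2}, {3, 4}}
	m2 := [][]float64{{5, 6}, {7, 8}}
	fmt.Println(matMul(m1, m2)) // [[19 22] [43 50]]
}
```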

FFT/NTT

import "github.com/luxfi/crypto/gpu"

// Fast Fourier Transform
data := gpu.NewArray(signal)
spectrum := gpu.FFT(data)
recovered := gpu.IFFT(spectrum)

// Number Theoretic Transform (for lattice crypto)
poly := gpu.NewArray(coefficients)
ntt := gpu.NTT(poly, modulus)
intt := gpu.INTT(ntt, modulus)
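
To make the NTT concrete, here is a minimal CPU sketch of the transform and its use for polynomial multiplication. It is deliberately quadratic-time and uses toy parameters (q = 17, n = 4, root of unity 4). The function names and the explicit `root` argument are assumptions for illustration; `gpu.NTT` above takes only a modulus, and its internals are not shown in this document. The product computed here is cyclic (mod x^n - 1); lattice schemes typically use a negacyclic variant.

```go
package main

import "fmt"

// powMod computes base^exp mod q by square-and-multiply.
func powMod(base, exp, q uint64) uint64 {
	result := uint64(1)
	base %= q
	for exp > 0 {
		if exp&1 == 1 {
			result = result * base % q
		}
		base = base * base % q
		exp >>= 1
	}
	return result
}

// transform computes out[k] = sum_j x[j] * root^(j*k) mod q.
// Quadratic time, for clarity only.
func transform(x []uint64, root, q uint64) []uint64 {
	n := uint64(len(x))
	out := make([]uint64, n)
	for k := uint64(0); k < n; k++ {
		var acc uint64
		for j := uint64(0); j < n; j++ {
			acc = (acc + x[j]*powMod(root, j*k, q)) % q
		}
		out[k] = acc
	}
	return out
}

// ntt is the forward transform; root must be a primitive n-th root of unity mod q.
func ntt(x []uint64, root, q uint64) []uint64 {
	return transform(x, root, q)
}

// intt inverts ntt: transform with root^-1, then scale by n^-1.
// Inverses use Fermat's little theorem, so q must be prime.
func intt(x []uint64, root, q uint64) []uint64 {
	invRoot := powMod(root, q-2, q)
	invN := powMod(uint64(len(x)), q-2, q)
	out := transform(x, invRoot, q)
	for i := range out {
		out[i] = out[i] * invN % q
	}
	return out
}

func main() {
	const q, root = 17, 4 // 4 has multiplicative order 4 mod 17
	a := []uint64{1, 2, 3, 4}
	b := []uint64{5, 6, 7, 8}

	fa, fb := ntt(a, root, q), ntt(b, root, q)
	fmt.Println(intt(fa, root, q)) // round trip: [1 2 3 4]

	// Pointwise product in the NTT domain equals cyclic convolution.
	prod := make([]uint64, len(a))
	for i := range prod {
		prod[i] = fa[i] * fb[i] % q
	}
	fmt.Println(intt(prod, root, q)) // [15 0 15 9]
}
```

The pointwise multiplication step is what makes the transform attractive on a GPU: every coefficient is independent, so it parallelizes trivially.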

BLS Acceleration

import "github.com/luxfi/crypto/bls"

// GPU-accelerated signing
sig := bls.Sign(privateKey, message)  // Uses GPU if available

// Batch verification (highly parallel)
results := bls.VerifyBatch(publicKeys, messages, signatures)

// Aggregate verification
agg := bls.AggregateSignatures(signatures)
valid := bls.VerifyAggregate(publicKeys, message, agg)

Performance Benchmarks

Apple M1 Max

Operation	CPU	Metal GPU	Speedup
BLS Sign	1.2 ms	0.15 ms	8x
BLS Verify	2.5 ms	0.3 ms	8x
BLS Batch (100)	250 ms	15 ms	17x
NTT (n=4096)	50 μs	5 μs	10x
NTT (n=65536)	1 ms	50 μs	20x
FFT (n=1M)	100 ms	5 ms	20x
MatMul (1024x1024)	500 ms	10 ms	50x

NVIDIA RTX 4090

Operation	CPU	CUDA GPU	Speedup
BLS Sign	1.2 ms	0.1 ms	12x
BLS Verify	2.5 ms	0.2 ms	12x
BLS Batch (100)	250 ms	8 ms	31x
NTT (n=4096)	50 μs	3 μs	17x
NTT (n=65536)	1 ms	25 μs	40x
FFT (n=1M)	100 ms	2 ms	50x

FHE Acceleration

Fully Homomorphic Encryption (FHE) workloads see 10x to 20x speedups from GPU acceleration in the benchmarks below:

import "github.com/luxfi/crypto/fhe"

// Create FHE context with GPU
ctx := fhe.NewContext(fhe.Config{
    Backend: fhe.BackendGPU,
    Scheme:  fhe.CKKS,
    Params:  fhe.PN14QP438,
})

// Encrypt vectors
ct1 := ctx.Encrypt([]float64{1.0, 2.0, 3.0})
ct2 := ctx.Encrypt([]float64{4.0, 5.0, 6.0})

// Homomorphic operations (run on GPU)
sum := ctx.Add(ct1, ct2)      // ~10 μs
prod := ctx.Mul(ct1, ct2)     // ~30 μs
rotated := ctx.Rotate(ct1, 1) // ~20 μs
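
To see what rotation is for, it can help to model a CKKS ciphertext as a plain vector of slots. The sketch below is a plaintext model only, with no encryption involved. It assumes that rotating by k moves slot i+k into slot i (a left rotation) and that unused slots are zero; both are assumptions about the scheme's convention, not statements about the `fhe` package. All function names are hypothetical. It shows the common rotate-and-add pattern for summing every slot.

```go
package main

import "fmt"

// pad copies v into a vector with the given number of slots, zero-filling the rest.
func pad(v []float64, slots int) []float64 {
	out := make([]float64, slots)
	copy(out, v)
	return out
}

// add models homomorphic addition as a slot-wise sum.
func add(a, b []float64) []float64 {
	out := make([]float64, len(a))
	for i := range a {
		out[i] = a[i] + b[i]
	}
	return out
}

// rotate models a left rotation by k: result slot i holds input slot (i+k) mod n.
func rotate(a []float64, k int) []float64 {
	n := len(a)
	out := make([]float64, n)
	for i := range a {
		out[i] = a[(i+k%n+n)%n]
	}
	return out
}

// sumSlots leaves the total of all slots in every slot, using
// log2(n) rotate-and-add steps. The slot count must be a power of two.
func sumSlots(a []float64) []float64 {
	acc := a
	for step := 1; step < len(a); step *= 2 {
		acc = add(acc, rotate(acc, step))
	}
	return acc
}

func main() {
	ct1 := pad([]float64{1, 2, 3}, 4) // toy slot count; real parameter sets have thousands
	fmt.Println(rotate(ct1, 1))       // [2 3 0 1]
	fmt.Println(sumSlots(ct1))        // [6 6 6 6]
}
```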

FHE Performance

Operation	CPU	GPU	Speedup
CKKS Encrypt	500 μs	50 μs	10x
CKKS Add	100 μs	10 μs	10x
CKKS Multiply	500 μs	30 μs	17x
CKKS Rotate	200 μs	20 μs	10x
TFHE Bootstrap	20 ms	1 ms	20x

Memory Model

Unified Memory (Metal)

On Apple Silicon, the CPU and GPU share one unified memory pool, so arrays need no explicit transfers:

// Data automatically available on both CPU and GPU
arr := gpu.NewArray(data)

// No explicit transfers needed
result := gpu.Add(arr, arr)

// Access result on CPU
values := result.ToSlice()

Discrete Memory (CUDA)

Discrete NVIDIA GPUs have their own device memory, so data must be copied between host and device:

// Explicit transfers for CUDA
arr := gpu.NewArray(data)           // Copies to GPU
arr.ToDevice()                       // Explicit GPU transfer
result := gpu.Add(arr, arr)          // Runs on GPU
values := result.ToHost().ToSlice()  // Copy back to CPU

Building with GPU Support

macOS (Metal)

# Metal support is automatic on Apple Silicon
go build -tags=metal ./...

# Test GPU availability
go test -v -run TestGPUAvailable ./gpu

Linux (CUDA)

# Install CUDA toolkit first
# https://developer.nvidia.com/cuda-downloads

# Build with CUDA support
CGO_ENABLED=1 go build -tags=cuda ./...

# Test CUDA availability
go test -v -run TestCUDAAvailable ./gpu

Fallback Behavior

When no GPU is available, operations automatically fall back to the CPU backend:

ctx := gpu.NewContext(gpu.BackendAuto)

if ctx.Backend() == gpu.BackendCPU {
    log.Println("Running on CPU (GPU not available)")
}

// Operations work the same regardless of backend
result := gpu.Add(a, b)
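
The fallback described here can be pictured as a walk down a preference list. The following is a minimal sketch of that idea, not the library's actual selection logic: the `Backend` type, the `pickBackend` function, and the Metal-before-CUDA ordering are all assumptions made for illustration.

```go
package main

import "fmt"

// Backend names a compute backend. Illustrative only; not the gpu package's type.
type Backend string

const (
	Metal Backend = "metal"
	CUDA  Backend = "cuda"
	CPU   Backend = "cpu"
)

// pickBackend returns the first GPU backend in preference order that is
// reported available, and falls back to CPU when none is.
func pickBackend(available []Backend) Backend {
	avail := make(map[Backend]bool, len(available))
	for _, b := range available {
		avail[b] = true
	}
	for _, b := range []Backend{Metal, CUDA} {
		if avail[b] {
			return b
		}
	}
	return CPU
}

func main() {
	fmt.Println(pickBackend([]Backend{CPU, CUDA})) // cuda
	fmt.Println(pickBackend([]Backend{CPU}))       // cpu
	fmt.Println(pickBackend(nil))                  // cpu
}
```

Returning CPU unconditionally as the last resort means callers never have to handle a "no backend" error, which matches the behavior described above where operations work the same regardless of backend.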

C++ Libraries

For direct C++ usage, see the C++ Libraries documentation.

The Go packages wrap these C++ libraries:

Go Package	C++ Library
`github.com/luxfi/crypto/gpu`	luxcpp/gpu
`github.com/luxfi/crypto/bls`	luxcpp/crypto

Next Steps

C++ Libraries - Native C++ implementation
BLS Signatures - Signature aggregation
Post-Quantum Crypto - Lattice-based algorithms
