
Benchmark Results

For JuliaCon 2025 we benchmarked cuNumeric.jl on 8 A100 GPUs (single node) and compared it to the Python library cuPyNumeric and, depending on the problem, to other relevant benchmarks. All results shown are weak scaling. We hope to have multi-node benchmarks soon!
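As a reminder of what weak scaling measures, here is a minimal sketch assuming the common definition of weak-scaling efficiency: the work per GPU is held fixed, so the ideal runtime stays flat as GPUs are added, and efficiency is the single-GPU time divided by the N-GPU time. The function name `weak_scaling_efficiency` and the timings are hypothetical placeholders, not measured results, and the efficiency plots on this page may use a different normalization.

```julia
# Weak-scaling efficiency under the assumed definition t(1 GPU) / t(N GPUs),
# where the problem size grows in proportion to the number of GPUs.
weak_scaling_efficiency(t1, tn) = t1 / tn

# Hypothetical timings in seconds (NOT measured results), keyed by GPU count.
timings = Dict(1 => 1.00, 2 => 1.05, 4 => 1.10, 8 => 1.25)

for n in sort(collect(keys(timings)))
    eff = weak_scaling_efficiency(timings[1], timings[n])
    println(n, " GPU(s): efficiency = ", round(eff; digits=2))
end
```

An efficiency of 1.0 means the runtime did not grow at all as the problem and the GPU count were scaled up together.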

SGEMM

Code Outline:

julia
mul!(C, A, B)
[Figures: GEMM Efficiency, GEMM GFLOPS]
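For readers who want to reproduce a GFLOPS number for the outline above, the following is a rough CPU-only sketch using plain Julia `Array`s and `LinearAlgebra.mul!`, not cuNumeric.jl arrays. It assumes the conventional count of 2·m·n·k floating-point operations for a matrix product; the helper names `gemm_flops` and `gemm_gflops` are hypothetical and are not part of any library.

```julia
using LinearAlgebra

# Conventional FLOP count for C = A*B with A of size m×k and B of size k×n:
# each of the m*n outputs is counted as k multiplies and k adds.
gemm_flops(m, n, k) = 2.0 * m * n * k

# Time a square single-precision matrix multiply and convert to GFLOPS.
function gemm_gflops(n::Integer; reps::Integer=3)
    A = rand(Float32, n, n)
    B = rand(Float32, n, n)
    C = zeros(Float32, n, n)
    mul!(C, A, B)                      # warm-up so compilation is not timed
    best = Inf
    for _ in 1:reps
        t = @elapsed mul!(C, A, B)
        best = min(best, t)            # keep the fastest repetition
    end
    return gemm_flops(n, n, n) / best / 1e9
end

println("approx. GFLOPS: ", gemm_gflops(512))
```

Taking the best of several repetitions is one simple way to reduce timing noise; a dedicated benchmarking package would give more robust numbers.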

Monte-Carlo Integration

Monte-Carlo integration is embarrassingly parallel and should scale perfectly. We do not know the exact number of floating-point operations in exp, so the reported GFLOPS are off by a constant factor.

Code Outline:

julia
integrand = (x) -> exp(-square(x))
val = (V/N) * sum(integrand(x))
[Figures: MC Efficiency, MC GFLOPS]
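The outline above can be fleshed out as a small, self-contained CPU sketch in plain Julia. It is an illustration under assumptions rather than the benchmark code: the 1D domain [-5, 5], the sample count, the fixed RNG seed, and the helper name `mc_integrate` are all hypothetical choices, and ordinary `Array`s are used instead of cuNumeric.jl arrays.

```julia
using Random

# Same integrand as in the outline, written with x^2 for a scalar argument.
integrand(x) = exp(-x^2)

# Estimate the integral of f over [a, b] from n uniform random samples:
# (V / n) * sum of f at the sample points, where V is the domain volume.
function mc_integrate(f, a, b, n; rng=Random.MersenneTwister(1234))
    V = b - a                         # volume (here: length) of the domain
    x = a .+ V .* rand(rng, n)        # n uniform samples in [a, b)
    return (V / n) * sum(f.(x))
end

val = mc_integrate(integrand, -5.0, 5.0, 1_000_000)
println("estimate = ", val, ", sqrt(pi) = ", sqrt(pi))
```

On this domain the exact integral is very close to sqrt(pi), which gives a convenient sanity check. Every sample is evaluated independently and only the final sum needs a reduction, which is why the method parallelizes so easily.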

Gray-Scott (2D)

Solving a PDE requires halo exchanges and a lot of data movement. In this benchmark we fall an order of magnitude short of the ImplicitGlobalGrid.jl library, which specifically targets multi-node, multi-GPU halo exchanges. We attribute this to the lack of kernel fusion in cuNumeric.jl.
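To make the data-access pattern concrete, here is a minimal CPU sketch of one explicit Gray-Scott time step in plain Julia. It is not the benchmark code and does not use the cuNumeric.jl API: the function names, the parameter values (`Du`, `Dv`, `F`, `k`, `dt`), the unit grid spacing, and the fixed (never updated) boundary cells are all assumptions made for illustration.

```julia
# Five-point Laplacian on the interior of a 2D array (unit grid spacing assumed).
# Every interior cell reads its four neighbours; on a distributed grid the
# neighbours at a partition edge live on another GPU, hence the halo exchange.
function laplacian(a)
    return a[1:end-2, 2:end-1] .+ a[3:end, 2:end-1] .+
           a[2:end-1, 1:end-2] .+ a[2:end-1, 3:end] .-
           4 .* a[2:end-1, 2:end-1]
end

# One forward-Euler step of the Gray-Scott equations on the interior cells.
# Parameter defaults are hypothetical, chosen only so the step is stable.
function gray_scott_step!(u, v; Du=0.16f0, Dv=0.08f0, F=0.035f0, k=0.065f0, dt=1.0f0)
    ui = u[2:end-1, 2:end-1]          # slicing copies the interior values
    vi = v[2:end-1, 2:end-1]
    uvv = ui .* vi .* vi              # reaction term u*v^2
    du = Du .* laplacian(u) .- uvv .+ F .* (1 .- ui)
    dv = Dv .* laplacian(v) .+ uvv .- (F + k) .* vi
    u[2:end-1, 2:end-1] .= ui .+ dt .* du
    v[2:end-1, 2:end-1] .= vi .+ dt .* dv
    return u, v
end

# Usage: uniform background with a small perturbed square in the middle.
u = ones(Float32, 64, 64)
v = zeros(Float32, 64, 64)
u[28:36, 28:36] .= 0.5f0
v[28:36, 28:36] .= 0.25f0
for _ in 1:10
    gray_scott_step!(u, v)
end
println("mean(u) = ", sum(u) / length(u), ", mean(v) = ", sum(v) / length(v))
```

In plain Julia, each dotted right-hand side is fused into a single loop. An array library without kernel fusion would typically execute every slice and arithmetic operation as its own kernel with its own temporary array, which multiplies the memory traffic for a stencil update like this one.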