Tuning VkFFT
This is a quick demonstration of how to tune low-level VkFFT parameters to achieve the best possible performance, illustrated here on an Apple M1 Pro GPU.
Remember: this is only useful for intensive applications, e.g. when using FFTs during a long iterative process. Otherwise, tuning is usually overkill!
Imports & test data
Let’s try batched 2D transforms of a (250, 250, 250) array:
[1]:
import timeit
import numpy as np
import pyopencl as cl
import pyopencl.array as cla
from pyvkfft.fft import fftn, ifftn
from pyvkfft.opencl import VkFFTApp
from pyvkfft.benchmark import bench_pyvkfft_opencl
[2]:
ctx = cl.create_some_context()
gpu_name = ctx.devices[0].name
print("GPU:", gpu_name)
cq = cl.CommandQueue(ctx)
n = 250
GPU: Apple M1 Pro
Using the benchmark function
This function executes each test in a separate process, which avoids exhausting GPU resources. The drawback is that it is relatively slow, since the GPU context needs to be re-initialised for every test.
[3]:
res = bench_pyvkfft_opencl((n,n,n),ndim=2,gpu_name=gpu_name)
print(f"Speed with default parameters: {res[1]:6.1f} Gbytes/s")
Speed with default parameters: 107.4 Gbytes/s
Now let’s try changing the coalescedMemory parameter (the default is 32 for NVIDIA/AMD, 64 for other GPUs), testing 4 values:
[4]:
args = {'tune_config': {'backend': 'pyopencl',
                        'coalescedMemory': [16, 32, 64, 128]}}
res = bench_pyvkfft_opencl((n, n, n), ndim=2, gpu_name=gpu_name, args=args)
print(f"Speed: {res[1]:6.1f} Gbytes/s")
Speed: 109.3 Gbytes/s
This did not help on the M1 Pro: no real improvement.
Let’s instead try tuning the aimThreads parameter (the default is 128).
[5]:
args = {'tune_config': {'backend': 'pyopencl',
                        'aimThreads': [32, 64, 128]}}
res = bench_pyvkfft_opencl((n, n, n), ndim=2, gpu_name=gpu_name, args=args)
print(f"Speed: {res[1]:6.1f} Gbytes/s")
Speed: 156.9 Gbytes/s
Much better: almost 50% faster!
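Several parameters can also be listed together in the same tune_config. The following is only a sketch, assuming the tuner then evaluates the combinations of the listed values (check the pyvkfft documentation for the keys actually supported):
[ ]:
# Assumption: listing several parameters makes the tuner test their combinations
args = {'tune_config': {'backend': 'pyopencl',
                        'coalescedMemory': [32, 64],
                        'aimThreads': [32, 64, 128]}}
res = bench_pyvkfft_opencl((n, n, n), ndim=2, gpu_name=gpu_name, args=args)
print(f"Speed: {res[1]:6.1f} Gbytes/s")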
Using the simple FFT interface
Some default tuning options can be used simply by passing tune=True to the simple FFT API functions.
This will automatically test a few parameters (depending on the GPU) and choose the one yielding the best speed. This has been tested on a few types of GPUs.
Let’s try first without tuning:
[6]:
a = cla.empty(cq, (n, n, n), dtype=np.complex64)
cq.finish()
t0 = timeit.default_timer()
for i in range(100):
    a = fftn(a, a, ndim=2)
cq.finish()
dt = timeit.default_timer() - t0
print(f"Without tuning: dt={dt:8.5f}s")
Without tuning: dt= 0.42657s
Now with tuning (we run it twice: the first call tunes and caches the result):
[7]:
a = fftn(a, a, ndim=2, tune=True)
cq.finish()
t0 = timeit.default_timer()
for i in range(100):
    a = fftn(a, a, ndim=2, tune=True)
cq.finish()
dt = timeit.default_timer() - t0
print(f"With tuning: dt={dt:8.5f}s")
With tuning: dt= 0.27237s
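For reference, the speed-up from tuning can be computed directly from the two timings printed above:
[ ]:
# Ratio of the un-tuned to the tuned elapsed time (values printed above)
print(f"speed-up: {0.42657 / 0.27237:.2f}x")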
Using the VkFFTApp API
This allows either to:
- choose a set of parameters to tune (similarly to tune=True in the simple FFT API), or
- directly pass some known parameters.
Let’s try first without tuning:
[8]:
a = cla.zeros(cq, (n, n, n), dtype=np.complex64)
app = VkFFTApp(a.shape, a.dtype, cq, ndim=2, inplace=True)
cq.finish()
t0 = timeit.default_timer()
for i in range(100):
    a = app.fft(a, a)
cq.finish()
dt = timeit.default_timer() - t0
print(f"Without tuning: dt={dt:8.5f}s")
Without tuning: dt= 0.40874s
Now with automatic tuning. The tuning is performed immediately when the VkFFTApp is created, using temporary arrays.
[9]:
a = cla.zeros(cq, (n, n, n), dtype=np.complex64)
app = VkFFTApp(a.shape, a.dtype, cq, ndim=2, inplace=True,
               tune_config={'backend': 'pyopencl',
                            'aimThreads': [32, 64, 128]})
cq.finish()
t0 = timeit.default_timer()
for i in range(100):
    a = app.fft(a, a)
cq.finish()
dt = timeit.default_timer() - t0
print(f"With auto-tuning: dt={dt:8.5f}s")
With auto-tuning: dt= 0.27309s
The other approach consists in directly passing the known optimal parameter:
[10]:
a = cla.empty(cq, (n, n, n), dtype=np.complex64)
app = VkFFTApp(a.shape, a.dtype, cq, ndim=2, inplace=True, aimThreads=64)
cq.finish()
t0 = timeit.default_timer()
for i in range(100):
    a = app.fft(a, a)
cq.finish()
dt = timeit.default_timer() - t0
print(f"With tuned parameter: dt={dt:8.5f}s")
With tuned parameter: dt= 0.27962s
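The timings can be converted into an approximate memory throughput, to compare with the benchmark figures above. This is only a rough estimate, assuming each batched 2D transform reads and writes the full array once per transformed axis; the benchmark function may use a different convention.
[ ]:
# Rough effective throughput: 100 transforms, 2 transformed axes (ndim=2),
# one read + one write of the whole array per axis (assumption)
nbytes = a.nbytes  # 250**3 complex64 values, i.e. 125 MB
gbps = 100 * 2 * 2 * nbytes / dt / 1e9
print(f"Approximate throughput: {gbps:.0f} GB/s")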