pyvkfft-benchmark
Run pyvkfft benchmark tests. This is pretty slow, as each test runs in a separate process (including the GPU initialisation); this is done to avoid any context or memory issues when performing a large number of tests. This can also be used to compare results with cufft (scikit-cuda or cupy) and gpyfft.
usage: pyvkfft-benchmark [-h] [--backend {cuda,opencl,gpyfft,skcuda,cupy}]
[--precision {single,double}] [--gpu GPU]
[--opencl_platform OPENCL_PLATFORM] [--serial]
[--save] [--compare COMPARE] [--systematic]
[--dry-run] [--plot PLOT [PLOT ...]]
[--radix [{2,3,5,7,11,13} ...]] [--bluestein]
[--ndim {1,2,3} [{1,2,3} ...]] [--range RANGE RANGE]
[--range-mb RANGE_MB RANGE_MB]
[--minsize-mb MINSIZE_MB] [--nbatch NBATCH] [--r2c]
[--dct {1,2,3,4}] [--dst {1,2,3,4}] [--inplace]
[--disableReorderFourStep {-1,0,1}]
[--coalescedMemory {-1,16,32,64,128} [{-1,16,32,64,128} ...]]
[--numSharedBanks {-1,16,20,24,28,32,36,40,44,48,52,56,60,64}]
[--aimThreads {-1,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76,80,84,88,92,96,100,104,108,112,116,120,124,128,132,136,140,144,148,152,156,160,164,168,172,176,180,184,188,192,196,200,204,208,212,216,220,224,228,232,236,240,244,248,252,256} [{-1,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76,80,84,88,92,96,100,104,108,112,116,120,124,128,132,136,140,144,148,152,156,160,164,168,172,176,180,184,188,192,196,200,204,208,212,216,220,224,228,232,236,240,244,248,252,256} ...]]
[--performBandwidthBoost {-1,0,1,2,4}]
[--registerBoost {-1,1,2,4}]
[--registerBoostNonPow2 {-1,0,1}]
[--registerBoost4Step {-1,1,2,4}]
[--warpSize {-1,1,2,4,8,16,32,64,128,256} [{-1,1,2,4,8,16,32,64,128,256} ...]]
[--batchedGroup BATCHEDGROUP BATCHEDGROUP BATCHEDGROUP]
[--useLUT {-1,0,1}]
[--forceCallbackVersionRealTransforms {-1,0,1}]
Named Arguments
- --backend
Possible choices: cuda, opencl, gpyfft, skcuda, cupy
FFT backend to use, 'cuda' and 'opencl' will use pyvkfft with the corresponding language.
Default:
'opencl'
- --precision
Possible choices: single, double
Precision for the benchmark
Default:
'single'
- --gpu
GPU name (or sub-string)
- --opencl_platform
Name (or sub-string) of the OpenCL platform to use (case-insensitive). Note that by default the PoCL platform is skipped, unless it is specifically requested or is the only one available (PoCL has some issues with VkFFT for some transforms)
- --serial
Use this to perform all tests in a single process. This is mostly useful for testing, and can lead to GPU memory issues, especially with skcuda.
Default:
False
- --save
Save results to an SQL file
Default:
False
- --compare
Name of database file to compare to.
- --systematic
Perform a systematic benchmark over a range of array sizes. Without this argument only a small number of array sizes is tested.
Default:
False
- --dry-run
Perform a dry-run, printing the number of array shapes to test
Default:
False
- --plot
Plot results stored in *.sql files. Separate plots are given for different dimensions. Multiple *.sql files can be given for comparison. This parameter supersedes all others (no tests are run if --plot is given)
systematic
Options for --systematic:
- --radix
Possible choices: 2, 3, 5, 7, 11, 13
Perform only radix transforms. Without --radix, all integer sizes are tested. With '--radix', all radix transforms allowed by the backend are used. Alternatively a list can be given: '--radix 2' (only 2**n array sizes), '--radix 2 3 5' (only 2**N1 * 3**N2 * 5**N3)
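As an illustration, the set of radix-allowed sizes can be enumerated with a small helper (a sketch only; pyvkfft's actual size selection may differ):

```python
def radix_sizes(radices, nmin, nmax):
    """Return sizes in [nmin, nmax] whose prime factors all belong to `radices`."""
    sizes = []
    for n in range(nmin, nmax + 1):
        m = n
        for r in radices:
            while m % r == 0:
                m //= r
        if m == 1:  # n is fully factored by the allowed radices
            sizes.append(n)
    return sizes
```

With '--radix 2 3 5' and '--range 2 256', the tested lengths would correspond to radix_sizes((2, 3, 5), 2, 256).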
- --bluestein, --rader
Test only non-radix sizes, using the Bluestein or Rader transforms. Not compatible with --radix
Default:
False
- --ndim
Possible choices: 1, 2, 3
Number of dimensions for the transform. The arrays will be stacked so that each batch transform is at least 1GB.
Default:
[2]
- --range
Range of array lengths [min, max] along each transform dimension, '--range 2 128'. This is combined with --range-mb to determine the actual range, so you can put large values here and let the maximum total size limit the actual memory used.
Default:
[2, 256]
- --range-mb
Range of array sizes in MBytes. This is combined with --range to find the actual range to use.
Default:
[0, 128]
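The interplay of --range and --range-mb amounts to a double filter on the candidate lengths; schematically (a hypothetical helper, assuming a square ndim-D complex64 array; pyvkfft's exact size accounting may differ, e.g. for R2C or batched arrays):

```python
def allowed_lengths(nmin, nmax, mb_min, mb_max, ndim=2, itemsize=8):
    """Lengths n with nmin <= n <= nmax whose ndim-D array of n**ndim
    elements (itemsize bytes each) falls within [mb_min, mb_max] MB."""
    out = []
    for n in range(nmin, nmax + 1):
        mb = n ** ndim * itemsize / 1024 ** 2
        if mb_min <= mb <= mb_max:
            out.append(n)
    return out
```

For example, with '--range 2 8192 --range-mb 0 128' the 2D lengths stop at 4096, since a 4096x4096 complex64 array is exactly 128 MB.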
- --minsize-mb
Minimal size (in MB) of the transformed array to ensure a precise enough timing, as the FT is tested on a stacked array using a batch transform. Larger values take more time. Ignored if --nbatch is not -1 (the default)
Default:
100
- --nbatch
Specify the batch size for the array transforms. By default (-1), this number is automatically adjusted for each length so that the total size is equal to 'minsize-mb' (100MB by default), e.g. for 2D R2C test of 512x512, the batch number is 100. Use 1 to disable batch, or any other number to use a fixed batch size.
Default:
-1
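The automatic batch size can be sketched as follows (assuming a float32 R2C input array, 4 bytes per element; the exact rounding used by pyvkfft may differ):

```python
def auto_nbatch(shape, minsize_mb=100, itemsize=4):
    """Number of stacked transforms so the total array reaches ~minsize_mb MB."""
    nbytes = itemsize
    for n in shape:
        nbytes *= n  # bytes for a single array of this shape
    return max(1, round(minsize_mb * 1024 ** 2 / nbytes))
```

For the 512x512 R2C example above (a 1 MB float32 array), this gives a batch of 100.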
- --r2c
Test real-to-complex transform (default is c2c)
Default:
False
- --dct
Possible choices: 1, 2, 3, 4
Test the discrete cosine transform (DCT) of the given type (default is c2c)
Default:
False
- --dst
Possible choices: 1, 2, 3, 4
Test the discrete sine transform (DST) of the given type (default is c2c)
Default:
False
- --inplace
Test in-place transforms
Default:
False
advanced
Advanced options for VkFFT. Do NOT use unless you really know what these mean. -1 will always defer the choice to VkFFT. For some parameters (coalescedMemory, aimThreads and warpSize), supplying multiple values triggers automatic tuning of the transform: each possible combination of parameters is tested, and the optimal one is then used for the actual transform.
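The tuning strategy amounts to an exhaustive search over the supplied parameter values; schematically (a sketch with a hypothetical `bench` callable that runs one timed transform and returns its duration in seconds):

```python
import itertools

def tune(bench, param_grid):
    """Return the fastest configuration from a grid of VkFFT parameters.
    `bench(**cfg)` is assumed to run one timed transform and return its duration."""
    best_cfg, best_dt = None, float("inf")
    for values in itertools.product(*param_grid.values()):
        cfg = dict(zip(param_grid.keys(), values))
        dt = bench(**cfg)
        if dt < best_dt:  # keep the fastest combination seen so far
            best_cfg, best_dt = cfg, dt
    return best_cfg
```

For instance, '--aimThreads 16 32 64 --coalescedMemory 32 64' would time each of the 6 combinations before the actual benchmark.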
- --disableReorderFourStep
Possible choices: -1, 0, 1
Disable the unshuffling of the four-step algorithm. Requires a temporary buffer allocation
Default:
-1
- --coalescedMemory
Possible choices: -1, 16, 32, 64, 128
Number of bytes to coalesce per transaction: defaults to 32 for Nvidia and AMD, 64 for others. Should be a power of two
Default:
[-1]
- --numSharedBanks
Possible choices: -1, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64
Number of shared banks on the target GPU. Default is 32.
Default:
-1
- --aimThreads
Possible choices: -1, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64, 68, 72, 76, 80, 84, 88, 92, 96, 100, 104, 108, 112, 116, 120, 124, 128, 132, 136, 140, 144, 148, 152, 156, 160, 164, 168, 172, 176, 180, 184, 188, 192, 196, 200, 204, 208, 212, 216, 220, 224, 228, 232, 236, 240, 244, 248, 252, 256
Try to aim all kernels at this number of threads.
Default:
[-1]
- --performBandwidthBoost
Possible choices: -1, 0, 1, 2, 4
Try to reduce the coalesced number by a factor of X to fit a bigger sequence in one upload for strided axes.
Default:
-1
- --registerBoost
Possible choices: -1, 1, 2, 4
Specify if the register file size is bigger than shared memory and can be used to extend it X times (on Nvidia 256KB register file can be used instead of 32KB of shared memory, set this constant to 4 to emulate 128KB of shared memory).
Default:
-1
- --registerBoostNonPow2
Possible choices: -1, 0, 1
Specify if register over-utilization should be used on non-power-of-2 sequences
Default:
-1
- --registerBoost4Step
Possible choices: -1, 1, 2, 4
Specify if register file over-utilization should be used for big sequences (>2^14); same definition as registerBoost
Default:
-1
- --warpSize
Possible choices: -1, 1, 2, 4, 8, 16, 32, 64, 128, 256
Number of threads per warp/wavefront. Normally automatically derived from the driver. Must be a power of two
Default:
[-1]
- --batchedGroup
How many FFTs are done per single kernel by a dedicated thread block, for each dimension.
Default:
[-1, -1, -1]
- --useLUT
Possible choices: -1, 0, 1
Use a look-up table to bypass the native sincos functions.
Default:
-1
- --forceCallbackVersionRealTransforms
Possible choices: -1, 0, 1
Force the callback version of the R2C and R2R (DCT/DST) algorithms for all use cases. This is normally activated automatically by VkFFT for odd sizes.
Default:
-1
Examples:
- Simple benchmark for radix transforms:
pyvkfft-benchmark --backend cuda --gpu titan
- Systematic benchmark for 1D radix transforms over a given range:
pyvkfft-benchmark --backend cuda --gpu titan --systematic --ndim 1 --range 2 256
- Same but only for powers of 2 and 3 sizes, in 2D, and save the results to an SQL file for later plotting:
pyvkfft-benchmark --backend cuda --gpu titan --systematic --radix 2 3 --ndim 2 --range 2 256 --save
- Plot the result of a benchmark:
pyvkfft-benchmark --plot pyvkfft-version-gpu-date-etc.sql
- Plot and compare the results of multiple benchmarks (grouped by 1D/2D/3D transforms):
pyvkfft-benchmark --plot *.sql
- Systematic test in OpenCL for an M1 GPU, tuning the VkFFT algorithm with the best possible 'aimthreads' low-level parameter to maximise throughput:
pyvkfft-benchmark --backend opencl --gpu m1 --systematic --radix --ndim 2 --range 2 256 --inplace --aimThreads 16 32 64 --r2c
When testing VkFFT, each line also ends with the type of algorithm used ((r)adix, (R)ader or (B)luestein), the size of the temporary buffer (if any), and the number of uploads (reads and writes) for each axis.
- Note 1: the indicated throughput is always computed assuming a single read and write for each axis (by convention), even if the actual number of uploads is larger.
- Note 2: for DCT1 and DST1 the throughput will be worse, as these are computed as complex transforms of size 2N-2, i.e. with 4x the original size.
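The throughput convention of Note 1 can be written out explicitly (a sketch only; `shape` is the transformed array's shape, `dt` the time for one batched transform, and single-precision complex elements of 8 bytes are assumed):

```python
def ideal_throughput_gbs(shape, nbatch, dt, itemsize=8):
    """Idealised GB/s: one read + one write per transformed axis,
    whatever the actual number of uploads."""
    nbytes = itemsize * nbatch
    for n in shape:
        nbytes *= n  # total bytes for the stacked (batched) array
    return 2 * len(shape) * nbytes / dt / 1e9
```

For example, a batch of 100 2D 256x256 complex64 transforms completed in 1 ms would be reported at about 210 GB/s.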