SIMD Autovectorization in Numba#
This notebook has been adapted with minor changes from numba/numba-examples. The original is licensed under a BSD-2 license and copyrighted by Numba. See numba/numba-examples for details.
Most modern CPUs have support for instructions that apply the same operation to multiple data elements simultaneously. These are called “Single Instruction, Multiple Data” (SIMD) operations, and the LLVM backend used by Numba can generate them in some cases to execute loops more quickly. (This process is called “autovectorization.”)
For example, Intel processors have support for SIMD instruction sets like:
SSE (128-bit inputs)
AVX (256-bit inputs)
AVX-512 (512-bit inputs, Skylake-X and later or Xeon Phi)
These wide instructions typically operate on as many values as will fit into an input register. For AVX instructions, this means that either 8 float32 values or 4 float64 values can be processed as a single input. As a result, the NumPy dtype that you use can potentially impact performance to a greater degree than when SIMD is not in use.
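Since an AVX register is 256 bits wide, a quick back-of-the-envelope check shows how many values of each dtype fit into one register (the 256-bit width is an assumption here; AVX-512 hardware would use 512):
import numpy as np

REGISTER_BITS = 256  # assumed AVX register width; AVX-512 would be 512
for dtype in (np.float32, np.float64):
    lanes = REGISTER_BITS // (np.dtype(dtype).itemsize * 8)
    print(np.dtype(dtype).name, 'values per register:', lanes)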
import numpy as np
from numba import jit
It can be somewhat tricky to determine when LLVM has successfully autovectorized a loop. The Numba team is working on exporting diagnostic information to show where the autovectorizer has generated SIMD code. For now, we can use a fairly crude approach of searching the assembly language generated by LLVM for SIMD instructions.
It is also interesting to note what kind of SIMD is used on your system. On x86_64, the name of the registers used indicates which level of SIMD is in use:
SSE: xmmX
AVX/AVX2: ymmX
AVX-512: zmmX
where X is an integer.
Note: The method we use below to find SIMD instructions will only work on Intel/AMD CPUs. Other platforms have entirely different assembly language syntax for SIMD instructions.
def find_instr(func, keyword, sig=0, limit=5):
    count = 0
    for l in func.inspect_asm(func.signatures[sig]).split('\n'):
        if keyword in l:
            count += 1
            print(l)
            if count >= limit:
                break
    if count == 0:
        print('No instructions found')
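In the same spirit, here is a small sketch of a helper (hypothetical, not something Numba provides) that tallies which SIMD register classes appear in a compiled function's assembly, using the same inspect_asm API:
def simd_register_usage(func, sig=0):
    # Count occurrences of each x86_64 SIMD register prefix in the assembly
    # generated for the requested signature (assumes an x86_64 target).
    asm = func.inspect_asm(func.signatures[sig])
    return {reg: asm.count(reg) for reg in ('xmm', 'ymm', 'zmm')}
On an AVX/AVX2 machine you would expect the ymm count to dominate once a loop has been vectorized.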
Basic SIMD#
Let’s start with a simple function that returns the square difference between two arrays, as you might write for a least-squares optimization:
@jit(nopython=True)
def sqdiff(x, y):
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        out[i] = (x[i] - y[i])**2
    return out
x32 = np.linspace(1, 2, 10000, dtype=np.float32)
y32 = np.linspace(2, 3, 10000, dtype=np.float32)
sqdiff(x32, y32)
array([1. , 0.99999976, 1. , ..., 1. , 1.0000002 ,
1. ], dtype=float32)
x64 = x32.astype(np.float64)
y64 = y32.astype(np.float64)
sqdiff(x64, y64)
array([1. , 0.99999976, 1. , ..., 1. , 1.00000024,
1. ])
Numba has created two different implementations of the function, one for float32 1-D arrays, and one for float64 1-D arrays:
sqdiff.signatures
[(array(float32, 1d, C), array(float32, 1d, C)),
(array(float64, 1d, C), array(float64, 1d, C))]
This allows Numba (and LLVM) to specialize the use of the SIMD instructions for each situation. In particular, using lower precision floating point allows twice as many values to fit into a SIMD register. We will see that for the same number of elements, the float32 calculation goes twice as fast:
%timeit sqdiff(x32, y32)
%timeit sqdiff(x64, y64)
1.95 µs ± 101 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.81 µs ± 73.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
We can check for SIMD instructions in both cases. (Due to the order of compilation above, signature 0 is the float32 implementation and signature 1 is the float64 implementation.)
print('float32:')
find_instr(sqdiff, keyword='subp', sig=0)
print('---\nfloat64:')
find_instr(sqdiff, keyword='subp', sig=1)
float32:
vsubps (%rax,%rsi,4), %ymm0, %ymm0
vsubps 32(%rax,%rsi,4), %ymm1, %ymm1
vsubps 64(%rax,%rsi,4), %ymm2, %ymm2
vsubps 96(%rax,%rsi,4), %ymm3, %ymm3
vsubps 128(%rax,%rsi,4), %ymm0, %ymm0
---
float64:
vsubpd (%rax,%rsi,8), %ymm0, %ymm0
vsubpd 32(%rax,%rsi,8), %ymm1, %ymm1
vsubpd 64(%rax,%rsi,8), %ymm2, %ymm2
vsubpd 96(%rax,%rsi,8), %ymm3, %ymm3
vsubpd 128(%rax,%rsi,8), %ymm0, %ymm0
In x86_64 assembly, SSE uses subps for “subtraction packed single precision” (AVX uses vsubps), representing vector float32 operations. The subpd instruction (AVX = vsubpd) stands for “subtraction packed double precision”, representing float64 operations.
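find_instr only prints the first few matches. If you want a rough picture of how much of the loop was vectorized, you can count every packed subtract in the generated assembly yourself; this is just an illustration on top of inspect_asm, and the exact counts depend on how LLVM unrolled the loop on your machine:
asm32 = sqdiff.inspect_asm(sqdiff.signatures[0])
asm64 = sqdiff.inspect_asm(sqdiff.signatures[1])
print('float32 packed subtracts:', sum('vsubps' in line for line in asm32.split('\n')))
print('float64 packed subtracts:', sum('vsubpd' in line for line in asm64.split('\n')))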
SIMD and Division#
In general, the autovectorizer cannot deal with branches inside loops, although this is an area where LLVM is likely to improve in the future. Your best bet for SIMD acceleration is to only have pure math operations in the loop.
As a result, you would naturally assume a function like this would be OK:
@jit(nopython=True)
def frac_diff1(x, y):
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        out[i] = 2 * (x[i] - y[i]) / (x[i] + y[i])
    return out
frac_diff1(x32, y32)
array([-0.6666667 , -0.66662216, -0.66657776, ..., -0.400032 ,
-0.40001604, -0.4 ], dtype=float32)
find_instr(frac_diff1, keyword='subp', sig=0)
No instructions found
?!
The problem is that division by zero can behave in two different ways:
In Python, division by zero raises an exception.
In NumPy, division by zero results in a NaN or inf, like in C.
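You can see the two conventions side by side with plain Python floats and NumPy scalars (a small illustration, independent of Numba):
try:
    1.0 / 0.0
except ZeroDivisionError as exc:
    print('Python:', exc)

with np.errstate(divide='ignore'):
    print('NumPy: ', np.float64(1.0) / np.float64(0.0))  # prints inf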
By default, Numba @jit follows the Python convention, and @vectorize/@guvectorize follow the NumPy convention. When following the Python convention, a simple division operation r = x / y expands out into something like:
if y == 0:
    raise ZeroDivisionError()
else:
    r = x / y
This branching code causes the autovectorizer to give up, so no SIMD code is generated for our example above.
Fortunately, Numba allows you to override the “error model” of the function if you don’t want a ZeroDivisionError to be raised:
@jit(nopython=True, error_model='numpy')
def frac_diff2(x, y):
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        out[i] = 2 * (x[i] - y[i]) / (x[i] + y[i])
    return out
frac_diff2(x32, y32)
array([-0.6666667 , -0.66662216, -0.66657776, ..., -0.400032 ,
-0.40001604, -0.4 ], dtype=float32)
find_instr(frac_diff2, keyword='subp', sig=0)
vsubps %ymm1, %ymm0, %ymm2
vsubps %ymm1, %ymm0, %ymm2
vsubps %ymm1, %ymm0, %ymm2
vsubps %ymm1, %ymm0, %ymm2
vsubps %ymm1, %ymm0, %ymm2
We have SIMD instructions again, but when we check the speed:
%timeit frac_diff2(x32, y32)
%timeit frac_diff2(x64, y64)
5.84 µs ± 45.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
5.83 µs ± 61.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
This is faster than the no-SIMD case, but there doesn’t seem to be a speed benefit with float32 inputs. What’s going on?
The remaining issue is very subtle. We can see it if we look at a type-annotated version of the function:
frac_diff2.inspect_types(pretty=True)
/home/betcke/miniconda3/envs/dev/lib/python3.8/site-packages/numba/core/annotations/pretty_annotate.py:7: FutureWarning: The pretty_annotate functionality is experimental and might change API
warn("The pretty_annotate functionality is experimental and might change API",
(The pretty-printed annotation interleaves the Python source with the typed Numba IR for both compiled signatures, float32 and float64. The full listing is long and has been abbreviated here; the relevant excerpt, the IR for line 5 of the float32 version, is reproduced below.)
If you look at line 5 in the float32 version of the function, you will see the following bit of Numba IR:
$const30.2 = const(int, 2) :: Literal[int](2)
$36binary_subscr.5 = getitem(value=x, index=i) :: float32
$42binary_subscr.8 = getitem(value=y, index=i) :: float32
$44binary_subtract.9 = $36binary_subscr.5 - $42binary_subscr.8 :: float32
del $42binary_subscr.8
del $36binary_subscr.5
$46binary_multiply.10 = $const30.2 * $44binary_subtract.9 :: float64
Notice that the constant 2 has been typed as an int value. Later, this causes the multiplication 2 * (x[i] - y[i]) to promote up to float64, and then the rest of the calculation becomes float64. This is a situation where Numba is being overly conservative (and should be fixed at some point), but we can tweak this behavior by casting the constant to the type we want:
@jit(nopython=True, error_model='numpy')
def frac_diff3(x, y):
    out = np.empty_like(x)
    dt = x.dtype  # Cast the constant using the dtype of the input
    for i in range(x.shape[0]):
        # Could also use np.float32(2) to always use same type, regardless of input
        out[i] = dt.type(2) * (x[i] - y[i]) / (x[i] + y[i])
    return out
frac_diff3(x32, y32)
array([-0.6666667 , -0.66662216, -0.66657776, ..., -0.400032 ,
-0.40001604, -0.4 ], dtype=float32)
%timeit frac_diff3(x32, y32)
%timeit frac_diff3(x64, y64)
2.27 µs ± 17.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
5.84 µs ± 99.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Now our float32 version is nice and speedy, running more than twice as fast as frac_diff2 for the same float32 inputs.
SIMD and Reductions#
The autovectorizer can also optimize reduction loops, but only with permission. Normally, compilers are very careful not to reorder floating point instructions because floating point arithmetic is approximate, so mathematically allowed transformations do not always give the same result. For example, it is not generally true for floating point numbers that:
(a + (b + c)) == ((a + b) + c)
For many situations, the round-off error that causes the difference between the left and the right is not important, so changing the order of additions is acceptable for a performance increase.
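For instance, with ordinary Python floats:
a, b, c = 1e16, -1e16, 1.0
print(a + (b + c))   # 0.0 -- the 1.0 is absorbed when added to -1e16 first
print((a + b) + c)   # 1.0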
To allow reordering of operations, we need to tell Numba to enable fastmath optimizations:
@jit(nopython=True)
def do_sum(A):
    acc = 0.
    # without fastmath, this loop must accumulate in strict order
    for x in A:
        acc += x**2
    return acc

@jit(nopython=True, fastmath=True)
def do_sum_fast(A):
    acc = 0.
    # with fastmath, the reduction can be vectorized as floating point
    # reassociation is permitted.
    for x in A:
        acc += x**2
    return acc
do_sum(x32)
find_instr(do_sum, keyword='mulp') # look for vector multiplication
No instructions found
do_sum_fast(x32)
find_instr(do_sum_fast, keyword='mulp')
vmulps %xmm1, %xmm1, %xmm1
vmulps %xmm7, %xmm7, %xmm4
vmulps %xmm6, %xmm6, %xmm1
vmulps %xmm5, %xmm5, %xmm1
vmulps %xmm1, %xmm1, %xmm1
The fast version is 2x faster:
%timeit do_sum(x32)
%timeit do_sum_fast(x32)
9.96 µs ± 135 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
5.11 µs ± 68 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)