Implementing the LU decomposition in Python#

In the following we demonstrate a simple LU decomposition in Python. We also demonstrate do timing comparisions against the Python implementation from Scipy to show that one should never use a self-implementation of the LU decomposition but always use existing Numpy/Scipy routines.

import numpy as np
from scipy.linalg import lu as scipy_lu
def lu(A):
    """Our cool LU implementation."""

    if not A.shape[0] == A.shape[1]:
        raise ValueError("Input matrix must be square.")

    n = A.shape[0]

    L = np.zeros((n, n), dtype=np.float64)
    U = np.zeros((n, n), dtype=np.float64)

    U[:] = A

    np.fill_diagonal(L, 1)

    for col in range(n-1):
        for row in range(col + 1, n):
            L[row, col] = U[row, col] / U[col, col]
            U[row, col:] = U[row, col:] - L[row, col] * U[col, col:]
            U[row, col] = 0

    return (L, U)
n = 200
A = np.random.randn(n, n)
L, U = lu(A)

residual = np.linalg.norm(A - L @ U) / np.linalg.norm(A)
print(f"The residual is: {residual}")

runtime_my_lu = %timeit -o lu(A)
runtime_scipy_lu = %timeit -o scipy_lu(A)

print(f"Runtime of our LU (s): {runtime_my_lu.best}")
print(f"Runtime of Scipy LU (s): {runtime_scipy_lu.best}")

print(f"The speed ratio between the two implementations is: {runtime_my_lu.best / runtime_scipy_lu.best}")
The residual is: 2.285256427880218e-13
50.6 ms ± 474 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.26 ms ± 461 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Runtime of our LU (s): 0.05018079660003423
Runtime of Scipy LU (s): 0.0018813237100403057
The speed ratio between the two implementations is: 26.673132503581297