Array¶
Dask Array implements a subset of the NumPy ndarray interface using blocked algorithms, cutting up the large array into many small arrays. This lets us compute on arrays larger than memory using all of our cores. We coordinate these blocked algorithms using Dask graphs.
Examples¶
Visit https://examples.dask.org/array.html to see and run examples using Dask Array.
Design¶
Dask arrays coordinate many NumPy arrays arranged into a grid. These NumPy arrays may live on disk or on other machines.
Common Uses¶
Dask Array is used in fields like atmospheric and oceanographic science, large scale imaging, genomics, numerical algorithms for optimization or statistics, and more.
Scope¶
Dask arrays support most of the NumPy interface like the following:
Arithmetic and scalar mathematics:
+, *, exp, log, ...Reductions along axes:
sum(), mean(), std(), sum(axis=0), ...Tensor contractions / dot products / matrix multiply:
tensordotAxis reordering / transpose:
transposeSlicing:
x[:100, 500:100:-2]Fancy indexing along single axes with lists or NumPy arrays:
x[:, [10, 1, 5]]Array protocols like
__array__and__array_ufunc__Some linear algebra:
svd, qr, solve, solve_triangular, lstsq…
However, Dask Array does not implement the entire NumPy interface. Users expecting this will be disappointed. Notably, Dask Array lacks the following features:
Much of
np.linalghas not been implemented. This has been done by a number of excellent BLAS/LAPACK implementations, and is the focus of numerous ongoing academic research projectsArrays with unknown shapes do not support all operations
Operations like
sortwhich are notoriously difficult to do in parallel, and are of somewhat diminished value on very large data (you rarely actually need a full sort). Often we include parallel-friendly alternatives liketopkDask Array doesn’t implement operations like
tolistthat would be very inefficient for larger datasets. Likewise, it is very inefficient to iterate over a Dask array with for loopsDask development is driven by immediate need, hence many lesser used functions have not been implemented. Community contributions are encouraged
See the dask.array API for a more extensive list of functionality.
Execution¶
By default, Dask Array uses the threaded scheduler in order to avoid data transfer costs, and because NumPy releases the GIL well. It is also quite effective on a cluster using the dask.distributed scheduler.