Configuration Reference
Note
It is possible to configure Dask inline with dot notation, with YAML, or via environment variables. See the conversion utility for converting the following dot notation to other forms.
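As a quick illustration of those three forms, the sketch below sets `distributed.scheduler.work-stealing` (documented later on this page) inline from Python; the equivalent YAML and environment-variable spellings appear in the comments. The file location and environment-variable naming follow Dask's usual conventions, but verify them against your installation.

```python
import dask

# Inline, using dot notation:
dask.config.set({"distributed.scheduler.work-stealing": False})

# Equivalent YAML, e.g. in ~/.config/dask/distributed.yaml:
#
#   distributed:
#     scheduler:
#       work-stealing: false
#
# Equivalent environment variable (nested keys join with "__",
# hyphens become underscores), set before the process starts:
#
#   export DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING="False"

print(dask.config.get("distributed.scheduler.work-stealing"))  # -> False
```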
Dask
- dask.temporary-directory (default: None): Temporary directory for local disk storage, e.g. /tmp, /scratch, or /local. This directory is used during Dask spill-to-disk operations. When the value is "null" (the default), Dask creates a directory under the location from which it was launched: `cwd/dask-worker-space` (see the example at the end of this section).
- dask.dataframe.shuffle-compression (default: None): Compression algorithm used for on-disk shuffling. Partd, the library used for compression, supports ZLib, BZ2, Snappy, and Blosc.
- dask.array.svg.size (default: 120): The size, in pixels, used when displaying a Dask array as an SVG image. This is used, for example, for nice rendering in a Jupyter notebook.
- dask.optimization.fuse.active (default: True): Turn task fusion on or off.
- dask.optimization.fuse.ave-width (default: 1): Upper limit for width, where width = num_nodes / height, a good measure of parallelizability.
- dask.optimization.fuse.max-width (default: None): Don't fuse if total width is greater than this. Set to null to dynamically adjust to 1.5 + ave_width * log(ave_width + 1).
- dask.optimization.fuse.max-height (default: inf): Don't fuse more than this many levels.
- dask.optimization.fuse.max-depth-new-edges (default: None): Don't fuse if new dependencies are added after this many levels. Set to null to dynamically adjust to ave_width * 1.5.
- dask.optimization.fuse.subgraphs (default: None): Set to True to fuse multiple tasks into SubgraphCallable objects. Set to None to let the default optimizer of individual Dask collections decide. If no collection-specific default exists, None defaults to False.
- dask.optimization.fuse.rename-keys (default: True): Set to True to rename the fused keys with `default_fused_keys_renamer`. Renaming fused keys can make the graph easier to understand, but it comes at the cost of additional processing. If False, the top-most key will be used. For advanced usage, a function to create the new name is also accepted.
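A minimal sketch of setting the options in this section from Python. Note that in current Dask releases these keys live at the top level of the configuration, without the leading ``dask.`` used in this listing; the path and values below are only illustrative.

```python
import dask

dask.config.set({
    # Spill-to-disk files go here instead of cwd/dask-worker-space
    # (the /scratch path is an example, not a recommendation).
    "temporary-directory": "/scratch/dask-tmp",
    # Allow wider task fusion than the default ave-width of 1
    # and keep readable names for the fused keys.
    "optimization.fuse.ave-width": 4,
    "optimization.fuse.rename-keys": True,
})
```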
Distributed
Client
- distributed.client.heartbeat (default: 5s): The time between heartbeats. The client sends a periodic heartbeat message to the scheduler; if the scheduler misses enough of these, it assumes that the client has gone (see the example below).
- distributed.client.scheduler-info-interval (default: 2s): Interval between scheduler-info updates.
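For example, both intervals above can be overridden for a single client by wrapping its construction in a `dask.config.set` context manager. A sketch, assuming a local cluster is acceptable and that the values are read when the `Client` is created:

```python
import dask
from distributed import Client

with dask.config.set({
    "distributed.client.heartbeat": "10s",
    "distributed.client.scheduler-info-interval": "5s",
}):
    client = Client()  # starts a local cluster; picks up the values at construction
```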
Comm
- distributed.comm.retry.count (default: 0): The number of times to retry a connection.
- distributed.comm.retry.delay.min (default: 1s): The first non-zero delay between retry attempts.
- distributed.comm.retry.delay.max (default: 20s): The maximum delay between retries.
- distributed.comm.compression (default: auto): The compression algorithm to use. This could be one of lz4, snappy, zstd, or blosc.
- distributed.comm.offload (default: 10MiB): The message size above which we choose to offload serialization to another thread. In some cases, you may also choose to disable this altogether with the value false. This is useful if you want to include serialization in profiling data, or if you have data types that are particularly sensitive to deserialization.
- distributed.comm.default-scheme (default: tcp): The default protocol to use, like tcp or tls.
- distributed.comm.socket-backlog (default: 2048): When shuffling data between workers, there can be O(cluster size) connection requests on a single worker socket; make sure the backlog is large enough not to lose any.
- distributed.comm.recent-messages-log-length (default: 0): Number of messages to keep for debugging.
- distributed.comm.zstd.level (default: 3): Compression level, between 1 and 22.
- distributed.comm.zstd.threads (default: 0): Number of threads to use. 0 for single-threaded, -1 to infer from the CPU count.
- distributed.comm.timeouts.connect (default: 10s): No Comment.
- distributed.comm.timeouts.tcp (default: 30s): No Comment.
- distributed.comm.require-encryption (default: None): Whether to require encryption on non-local comms.
- distributed.comm.tls.ciphers (default: None): No Comment.
- distributed.comm.tls.ca-file (default: None): Path to a CA file, in PEM format (see the example at the end of this section).
- distributed.comm.tls.scheduler.cert (default: None): Path to certificate file.
- distributed.comm.tls.scheduler.key (default: None): Path to key file. Alternatively, the key can be appended to the cert file above, and this field left blank.
- distributed.comm.tls.worker.key (default: None): Path to key file. Alternatively, the key can be appended to the cert file above, and this field left blank.
- distributed.comm.tls.worker.cert (default: None): Path to certificate file.
- distributed.comm.tls.client.key (default: None): Path to key file. Alternatively, the key can be appended to the cert file above, and this field left blank.
- distributed.comm.tls.client.cert (default: None): Path to certificate file.
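The sketch below bundles several of the comm options above into one call: a more forgiving connect timeout, a few retries, and TLS certificates for the scheduler, workers, and clients. All file paths are placeholders.

```python
import dask

dask.config.set({
    # Tolerate slower or flakier networks.
    "distributed.comm.timeouts.connect": "30s",
    "distributed.comm.retry.count": 3,
    # Require TLS on non-local traffic; every path below is a placeholder.
    "distributed.comm.require-encryption": True,
    "distributed.comm.tls.ca-file": "/etc/dask/ca.pem",
    "distributed.comm.tls.scheduler.cert": "/etc/dask/scheduler.pem",
    "distributed.comm.tls.scheduler.key": "/etc/dask/scheduler-key.pem",
    "distributed.comm.tls.worker.cert": "/etc/dask/worker.pem",
    "distributed.comm.tls.worker.key": "/etc/dask/worker-key.pem",
    "distributed.comm.tls.client.cert": "/etc/dask/client.pem",
    "distributed.comm.tls.client.key": "/etc/dask/client-key.pem",
})
```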
Dashboard
- distributed.dashboard.link (default: {scheme}://{host}:{port}/status): The template for dashboard links. This is used wherever we print out the link for the dashboard. It is filled in with relevant information like the scheme, host, and port number (see the example below).
- distributed.dashboard.export-tool (default: False): No Comment.
- distributed.dashboard.graph-max-items (default: 5000): Maximum number of tasks to try to plot in the "graph" view.
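A common customization of `distributed.dashboard.link` is to route printed links through a proxy rather than pointing at the scheduler host directly. The path below assumes a jupyter-server-proxy style setup and is only an example template:

```python
import dask

# Any template field such as {scheme}, {host}, or {port} may be used;
# this form drops the host entirely and relies on a proxy at /proxy/<port>/.
dask.config.set({"distributed.dashboard.link": "/proxy/{port}/status"})
```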
Deploy
- distributed.deploy.lost-worker-timeout (default: 15s): Interval after which to hard-close a lost worker job. Before that, we wait for a while to see if the worker will reappear.
- distributed.deploy.cluster-repr-interval (default: 500ms): Interval between calls to update cluster-repr for the widget.
Scheduler
- distributed.scheduler.allowed-failures (default: 3): The number of retries before a task is considered bad. When a worker dies while a task is running, that task is rerun elsewhere. If many workers die while running this same task then we call the task bad and raise a KilledWorker exception. This is the number of workers that are allowed to die before the task is marked as bad.
- distributed.scheduler.bandwidth (default: 100000000): The expected bandwidth between any pair of workers. This is used when making scheduling decisions. The scheduler will use this value as a baseline, but also learn it over time.
- distributed.scheduler.blocked-handlers (default: []): A list of handlers to exclude. The scheduler operates by receiving messages from various workers and clients and then performing operations based on those messages. Each message has an operation like "close-worker" or "task-finished". In some high-security situations administrators may choose to block certain handlers from running. Those handlers can be listed here. For a list of handlers see the `dask.distributed.Scheduler.handlers` attribute (see the example at the end of this section).
- distributed.scheduler.default-data-size (default: 1kiB): The default size of a piece of data if we don't know anything about it. This is used by the scheduler in some scheduling decisions.
- distributed.scheduler.events-cleanup-delay (default: 1h): The amount of time to wait until workers or clients are removed from the event log after they have been removed from the scheduler.
- distributed.scheduler.idle-timeout (default: None): Shut down the scheduler after this duration if no activity has occurred. This can be helpful to reduce costs and stop zombie processes from roaming the earth.
- distributed.scheduler.transition-log-length (default: 100000): How many entries to keep in the transition log. Every time a task transitions states (like "waiting", "processing", "memory", "released") we record that transition in a log. To make sure that we don't run out of memory we clear out old entries after a certain length. This is that length.
- distributed.scheduler.work-stealing (default: True): Whether or not to balance work between workers dynamically. Sometimes one worker has more work than we expected. By default the scheduler will move these tasks around as necessary. Set this to false to disable this behavior.
- distributed.scheduler.work-stealing-interval (default: 100ms): How frequently to balance worker loads.
- distributed.scheduler.worker-ttl (default: None): Time to live for workers. If we don't receive a heartbeat faster than this then we assume that the worker has died.
- distributed.scheduler.pickle (default: True): Is the scheduler allowed to deserialize arbitrary bytestrings? The scheduler almost never deserializes user data. However, there are some cases where the user can submit functions to run directly on the scheduler. This can be convenient for debugging, but also introduces some security risk. By setting this to false we ensure that the user is unable to run arbitrary code on the scheduler.
- distributed.scheduler.preload (default: []): Run custom modules during the lifetime of the scheduler. You can run custom modules when the scheduler starts up and closes down. See https://docs.dask.org/en/latest/setup/custom-startup.html for more information, and the example at the end of this section.
- distributed.scheduler.preload-argv (default: []): Arguments to pass into the preload scripts described above. See https://docs.dask.org/en/latest/setup/custom-startup.html for more information.
- distributed.scheduler.unknown-task-duration (default: 500ms): Default duration for all tasks with unknown durations. Over time the scheduler learns a duration for tasks. However, when it sees a new type of task for the first time it has to make a guess as to how long it will take. This value is that guess.
- distributed.scheduler.default-task-durations.rechunk-split (default: 1us): No Comment.
- distributed.scheduler.default-task-durations.shuffle-split (default: 1us): No Comment.
- distributed.scheduler.validate (default: False): Whether or not to run consistency checks during execution. This is typically only used for debugging.
- distributed.scheduler.dashboard.status.task-stream-length (default: 1000): The maximum number of tasks to include in the task stream plot.
- distributed.scheduler.dashboard.tasks.task-stream-length (default: 100000): The maximum number of tasks to include in the task stream plot.
- distributed.scheduler.dashboard.tls.ca-file (default: None): No Comment.
- distributed.scheduler.dashboard.tls.key (default: None): No Comment.
- distributed.scheduler.dashboard.tls.cert (default: None): No Comment.
- distributed.scheduler.dashboard.bokeh-application.allow_websocket_origin (default: ['*']): No Comment.
- distributed.scheduler.dashboard.bokeh-application.keep_alive_milliseconds (default: 500): No Comment.
- distributed.scheduler.dashboard.bokeh-application.check_unused_sessions_milliseconds (default: 500): No Comment.
- distributed.scheduler.locks.lease-validation-interval (default: 10s): The time to wait until an acquired semaphore is released if the Client goes out of scope.
- distributed.scheduler.locks.lease-timeout (default: 30s): The timeout after which a lease will be released if not refreshed.
- distributed.scheduler.http.routes (default: ['distributed.http.scheduler.prometheus', 'distributed.http.scheduler.info', 'distributed.http.scheduler.json', 'distributed.http.health', 'distributed.http.proxy', 'distributed.http.statics']): A list of modules like "prometheus" and "health" that can be included or excluded as desired. These modules will have a ``routes`` keyword that gets added to the main HTTP server. This list can be extended with user-defined modules.
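As referenced in the preload and blocked-handlers entries above, here is a minimal sketch. The module `myproject.scheduler_preload` is hypothetical, and the blocked handler name is illustrative; consult the `dask.distributed.Scheduler.handlers` attribute for the real names.

```python
# myproject/scheduler_preload.py: a hypothetical preload module
def dask_setup(scheduler):
    # Runs once, when the scheduler starts up.
    print("scheduler started at", scheduler.address)

def dask_teardown(scheduler):
    # Runs once, when the scheduler shuts down.
    print("scheduler closing")
```

Registering the module and blocking a handler could then look like:

```python
import dask

dask.config.set({
    "distributed.scheduler.preload": ["myproject.scheduler_preload"],
    # "run_function" is only an example of a handler name to block.
    "distributed.scheduler.blocked-handlers": ["run_function"],
})
```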
Worker
- distributed.worker.blocked-handlers (default: []): A list of handlers to exclude. The worker operates by receiving messages from the scheduler, other workers, and clients and then performing operations based on those messages. Each message has an operation like "close-worker" or "task-finished". In some high-security situations administrators may choose to block certain handlers from running. Those handlers can be listed here. For a list of handlers see the `dask.distributed.Worker.handlers` attribute.
- distributed.worker.multiprocessing-method (default: spawn): How we create new workers, one of "spawn", "forkserver", or "fork". This is passed to the ``multiprocessing.get_context`` function.
- distributed.worker.use-file-locking (default: True): Whether or not to use lock files when creating workers. Workers create a local directory in which to place temporary files. When many workers are created on the same machine at once, they can conflict with each other by trying to create this directory at the same time. To avoid this, Dask usually uses a file-based lock. However, on some systems file-based locks don't work. This is particularly common on HPC NFS systems, where users may want to set this to false.
- distributed.worker.connections.outgoing (default: 50): No Comment.
- distributed.worker.connections.incoming (default: 10): No Comment.
- distributed.worker.preload (default: []): Run custom modules during the lifetime of the worker. You can run custom modules when the worker starts up and closes down. See https://docs.dask.org/en/latest/setup/custom-startup.html for more information.
- distributed.worker.preload-argv (default: []): Arguments to pass into the preload scripts described above. See https://docs.dask.org/en/latest/setup/custom-startup.html for more information.
- distributed.worker.daemon (default: True): Whether or not to run our process as a daemon process.
- distributed.worker.validate (default: False): Whether or not to run consistency checks during execution. This is typically only used for debugging.
- distributed.worker.lifetime.duration (default: None): The time after creation to close the worker, like "1 hour".
- distributed.worker.lifetime.stagger (default: 0 seconds): Random amount by which to stagger lifetimes. If you create many workers at the same time, you may want to avoid having them kill themselves all at the same time. To avoid this you might want to set a stagger time, so that they close themselves with some random variation, like "5 minutes". That way some workers can die, new ones can be brought up, and data can be transferred over smoothly (see the example at the end of this section).
- distributed.worker.lifetime.restart (default: False): Do we try to resurrect the worker after the lifetime deadline?
- distributed.worker.profile.interval (default: 10ms): The time between polling the worker threads, typically short, like 10ms.
- distributed.worker.profile.cycle (default: 1000ms): The time between bundling together this data and sending it to the scheduler. This controls the granularity at which people can query the profile information on the time axis.
- distributed.worker.profile.low-level (default: False): Whether or not to use the libunwind and stacktrace libraries to gather profiling information at a lower level (beneath Python). To get this to work you will need to install the experimental stacktrace library with ``conda install -c numba stacktrace``. See https://github.com/numba/stacktrace.
- distributed.worker.memory.target (default: 0.6): Target fraction below which to try to keep memory (see the example at the end of this section).
- distributed.worker.memory.spill (default: 0.7): When the process memory (as observed by the operating system) gets above this amount we spill data to disk.
- distributed.worker.memory.pause (default: 0.8): When the process memory (as observed by the operating system) gets above this amount we no longer start new tasks on this worker.
- distributed.worker.memory.terminate (default: 0.95): When the process memory reaches this level the nanny process will kill the worker (if a nanny is present).
- distributed.worker.http.routes (default: ['distributed.http.worker.prometheus', 'distributed.http.health', 'distributed.http.statics']): A list of modules like "prometheus" and "health" that can be included or excluded as desired. These modules will have a ``routes`` keyword that gets added to the main HTTP server. This list can be extended with user-defined modules.
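Tying together the lifetime and memory entries above, the sketch below configures workers that retire themselves after roughly an hour and spill, pause, and terminate at the listed fractions of their memory limit. The arithmetic in the comments assumes a 16 GiB memory limit purely for illustration.

```python
import dask

dask.config.set({
    # Close each worker after ~1 hour, staggered by up to 5 minutes,
    # and (when a nanny is present) start a replacement afterwards.
    "distributed.worker.lifetime.duration": "1 hour",
    "distributed.worker.lifetime.stagger": "5 minutes",
    "distributed.worker.lifetime.restart": True,
    # Memory thresholds are fractions of the worker memory limit.
    # With a 16 GiB limit: spill above ~11.2 GiB (0.7), stop starting
    # new tasks above ~12.8 GiB (0.8), terminate above ~15.2 GiB (0.95).
    "distributed.worker.memory.target": 0.6,
    "distributed.worker.memory.spill": 0.7,
    "distributed.worker.memory.pause": 0.8,
    "distributed.worker.memory.terminate": 0.95,
})
```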
Admin
- distributed.admin.tick.interval (default: 20ms): The time between ticks.
- distributed.admin.tick.limit (default: 3s): The time allowed before triggering a warning.
- distributed.admin.max-error-length (default: 10000): Maximum length of a traceback as text. Some Python tracebacks can be very long (particularly in stack overflow errors). If the traceback is larger than this size (in bytes) then we truncate it.
- distributed.admin.log-length (default: 10000): Default length of logs to keep in memory. The scheduler and workers keep the last 10000 or so log entries in memory.
- distributed.admin.log-format (default: %(name)s - %(levelname)s - %(message)s): The log format to emit (see the environment-variable example below). For available fields, see https://docs.python.org/3/library/logging.html#logrecord-attributes.
- distributed.admin.pdb-on-err (default: False): Enter the Python debugger on a scheduling error.
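Like every option on this page, the admin settings can also be supplied through environment variables before the scheduler or workers start: nested keys are joined with double underscores and hyphens map to underscores. A sketch, using Python only to demonstrate the naming; normally you would export these in your shell or job script.

```python
import os
import dask

# Must be set before the Dask process reads its configuration.
os.environ["DASK_DISTRIBUTED__ADMIN__TICK__LIMIT"] = "5s"
os.environ["DASK_DISTRIBUTED__ADMIN__LOG_FORMAT"] = "%(asctime)s %(levelname)s %(message)s"

dask.config.refresh()  # re-collect configuration, including environment variables
print(dask.config.get("distributed.admin.tick.limit"))  # -> "5s"
```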