Setup Prometheus monitoring
Prometheus is a popular tool for monitoring a wide variety of systems and alerting on their state.
A distributed cluster offers a number of Prometheus metrics if the prometheus_client package is installed.
The metrics are exposed in Prometheus’ text-based format at the /metrics
endpoint on both schedulers and workers.
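As a rough illustration of what the text-based format looks like to a consumer, the sketch below parses sample lines using only the standard library. The helper names `parse_metrics` and `fetch_metrics` and the metric names in the sample payload are made up for this example; they are not part of distributed or of prometheus_client, and the parser covers only the simple `name{labels} value` line shape.

```python
import re
from urllib.request import urlopen

# A sample line has the shape: name{label="value",...} 1.0
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)')


def parse_metrics(text):
    """Map 'name{labels}' to a float for each sample line, skipping comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and '# HELP' / '# TYPE' comments
        match = _SAMPLE_RE.match(line)
        if match:
            name, labels, value = match.groups()
            samples[name + (labels or "")] = float(value)
    return samples


def fetch_metrics(url):
    """Download the metrics page at the given URL and parse it."""
    with urlopen(url, timeout=5) as response:
        return parse_metrics(response.read().decode("utf-8"))


# Made-up payload in the text format; the metric names are placeholders.
SAMPLE = """\
# HELP example_clients Number of clients connected
# TYPE example_clients gauge
example_clients 2.0
example_tasks{state="memory"} 10.0
"""
parsed = parse_metrics(SAMPLE)
```

Against a running cluster, a call such as `fetch_metrics("http://localhost:8787/metrics")` could be used instead of the sample payload, assuming the scheduler's HTTP server listens on its default port 8787 on the local machine.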
Available metrics
In addition to the metrics that prometheus_client exposes by default, schedulers and workers expose a number of Dask-specific metrics.
Scheduler metrics
The scheduler exposes the following metrics about itself:
| Metric name | Description |
|---|---|
| | Number of clients connected |
| | Number of workers scheduler needs for task graph |
| | Number of workers known by scheduler |
| | Number of tasks known by scheduler |
| | Total number of times a task has been marked suspicious |
| | Total number of processed tasks no longer in memory and already removed from the scheduler job queue. Note: Task groups on the scheduler which have all tasks in the forgotten state are not included. |
| | Accumulated count of task prefix in each state |
| | Maximum tick duration observed since Prometheus last scraped metrics |
| | Total number of ticks observed since the server started |
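The difference between the last two rows can be sketched in a few lines: the maximum tick duration is reset every time it is read, while the tick count only ever grows. The class `MaxSinceLastScrape` below is a hypothetical illustration of those semantics, not the code distributed uses.

```python
class MaxSinceLastScrape:
    """Tracks the largest value observed between two reads."""

    def __init__(self):
        self._maximum = 0.0
        self.count = 0  # total observations; never reset

    def observe(self, duration):
        """Record one tick duration in seconds."""
        self._maximum = max(self._maximum, duration)
        self.count += 1

    def collect(self):
        """Return the maximum seen since the previous call, then reset it."""
        maximum, self._maximum = self._maximum, 0.0
        return maximum


tracker = MaxSinceLastScrape()
for tick in (0.02, 0.35, 0.05):
    tracker.observe(tick)
first_scrape = tracker.collect()   # largest of the three ticks above
tracker.observe(0.01)
second_scrape = tracker.collect()  # only reflects ticks after the first read
```

One consequence of this design is that a long scrape interval does not hide a slow tick: the slowest tick in the interval is still reported at the next scrape.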
Semaphore metrics
The following metrics about semaphores are available on the scheduler:
| Metric name | Description |
|---|---|
| | Maximum leases allowed per semaphore. Note: This will be constant for each semaphore during its lifetime. |
| | Amount of currently active leases per semaphore |
| | Amount of currently pending leases per semaphore |
| | Total number of leases acquired per semaphore |
| | Total number of leases released per semaphore. Note: If a semaphore is closed while there are still leases active, this count will not equal |
| | Exponential moving average of the time it took to acquire a lease per semaphore. Note: This only includes time spent on scheduler side, it does not include time spent on communication. Note: This average is calculated based on order of leases instead of time of lease acquisition. |
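To make the last note concrete, here is a minimal sketch of an exponential moving average that is updated once per lease, in the order the leases arrive, without looking at timestamps. The function name `update_average` and the smoothing factor `alpha` are assumptions for this example; the value the scheduler actually uses is not stated here.

```python
def update_average(previous, sample, alpha=0.2):
    """Fold one new sample into an exponential moving average.

    alpha is a hypothetical smoothing factor: higher values weight
    recent samples more heavily.
    """
    if previous is None:  # the first sample initializes the average
        return sample
    return alpha * sample + (1 - alpha) * previous


# Lease acquisition times in seconds, in the order the leases were granted.
average = None
for wait in (1.0, 2.0, 3.0):
    average = update_average(average, wait, alpha=0.5)
# Steps: 1.0, then 0.5*2.0 + 0.5*1.0 = 1.5, then 0.5*3.0 + 0.5*1.5 = 2.25
```

Because each lease contributes one update regardless of when it happened, a burst of leases moves the average as much as the same number of leases spread over a long period.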
Work-stealing metrics
If work-stealing is enabled, the scheduler exposes these metrics:
| Metric name | Description |
|---|---|
| | Total number of stealing requests |
| | Total cost of stealing requests |
Worker metrics
The worker exposes these metrics about itself:
| Metric name | Description |
|---|---|
| | Number of tasks at worker |
| | Number of worker threads |
| | Latency of worker connection |
| | Memory breakdown |
| | Total size of open data transfers from other workers |
| | Number of open data transfers from other workers |
| | Total number of data transfers from other workers since the worker was started |
| | Total size of open data transfers to other workers |
| | Number of open data transfers to other workers |
| | Total number of data transfers to other workers since the worker was started |
| | Deprecated: This metric has been renamed to Number of open fetch requests to other workers |
| | Maximum tick duration observed since Prometheus last scraped metrics |
| | Total number of ticks observed since the server started |
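Metrics described as totals since the worker was started only ever increase, so they are usually read as a rate between two scrapes rather than as an absolute value. The helper below is a hypothetical sketch of that calculation; it treats a decrease as a restart of the worker and counts from zero, which is an approximation in the spirit of how Prometheus handles counter resets.

```python
def counter_rate(previous, current, interval_seconds):
    """Average increase per second of a counter between two scrapes."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = current - previous
    if delta < 0:
        # The counter went backwards, so the process was likely restarted:
        # everything counted so far happened after the restart.
        delta = current
    return delta / interval_seconds


# 60 new transfers observed over a 30 second scrape interval.
steady_rate = counter_rate(100, 160, 30)
# The counter dropped from 500 to 30, which is treated as a restart.
rate_after_restart = counter_rate(500, 30, 15)
```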
If the crick package is installed, the worker additionally exposes:
| Metric name | Description |
|---|---|
| | Median tick duration at worker |
| | Median task runtime at worker |
| | Bandwidth for transfer at worker |