Continuous Queuing of Workflows

After a workflow event is triggered, it is written to the Kafka message queue as a message and is then consumed and processed by the workflow consumption service.

When workflows stay queued on the platform, first determine which situation applies: the queue is large but is still being consumed gradually, or the queue is stalled with no consumption at all.

Consider the following two scenarios:

Workflow queued and consumed

Check the workflow monitoring page for workflows with a very large queue, for example tens of thousands or even millions of queued workflows. This is usually caused by trigger logic misconfigured by business personnel, or by loops that trigger workflows in high volume.

Consumption capacity may be affected by various factors. For example, while the system can normally process ten thousand workflows per minute, there may be instances where hundreds of thousands of workflows are triggered, resulting in a large number of workflows being queued up without being quickly consumed.
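As a rough illustration of the arithmetic above, the sketch below estimates how long a backlog takes to drain. The function name drain_minutes and the figure of 300,000 queued workflows are hypothetical, and the estimate assumes a constant consumption rate with no new triggers arriving.

```shell
# Hypothetical helper: estimate the minutes needed to drain a backlog,
# assuming a constant consumption rate and no new workflow triggers.
drain_minutes() {
  queued=$1      # number of queued workflows
  per_minute=$2  # workflows consumed per minute
  # Round up so that a partial minute counts as a full one.
  echo $(( (queued + per_minute - 1) / per_minute ))
}

drain_minutes 300000 10000   # prints 30
```

In practice the real drain time is longer, because other workflows keep being triggered while the backlog is consumed.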

For example, a few workflows with queues of tens of thousands or more can cause severe queuing for all workflows. If you confirm that the queued items of these workflows do not need to be processed, close the workflows directly, without pausing them first.

  • When a workflow is closed directly, its queued items are consumed quickly (without executing the node logic in the workflow), so the queue clears as soon as possible.

  • If the queued items of these workflows do need to be processed, pause the workflows first and restart them during off-peak business hours.


If no workflow is being triggered unintentionally in high volume, log in to the server and check resource usage. The top command shows the real-time CPU and memory usage of the server and its processes.

  • If the process consuming a significant amount of CPU is the mongod process, it is usually caused by slow queries. To resolve it, refer to the slow query optimization documentation.

  • If the process consuming a significant amount of CPU is a dotnet or java process, it is typically due to complex logic in workflow nodes, which can also increase resource usage for related services.

    • If the server resources are fully utilized, you may choose to scale out or postpone running workflows with complex logic to periods of low business activity.
    • If it is a Kubernetes-based cluster deployment, and there is some resource redundancy, you can dynamically scale up the service instances with high resource usage.
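The triage above can be sketched as a small filter over process listings. This is a minimal sketch, not part of the platform: the function name hint_for_hot_processes and the 80% threshold are assumptions chosen for illustration, and the input is expected as "COMMAND %CPU" lines.

```shell
# Read "COMMAND %CPU" lines on stdin and print a hint for each process whose
# CPU usage is at or above the threshold (default 80, an arbitrary choice).
hint_for_hot_processes() {
  awk -v limit="${1:-80}" '
    $2 + 0 >= limit && $1 == "mongod" {
      print $1 ": check for slow queries"; next
    }
    $2 + 0 >= limit && ($1 == "dotnet" || $1 == "java") {
      print $1 ": check workflow node logic"
    }
  '
}

# Usage on a Linux server (the header row is ignored because its
# second field is not a number):
#   ps -eo comm,%cpu | hint_for_hot_processes 80
#
# For a one-shot, non-interactive snapshot from top:
#   top -b -n 1 | head -n 20
```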

Workflow queued but not consumed

When the workflow is continuously queued but not consumed, it is usually due to a full server disk or an issue with the Kafka service.

  • Use the df -Th command on the server to check the usage of the system disk and data disk.

  • Check if the Kafka service is running properly:

Check the health check logs of the storage component container.

docker logs $(docker ps | grep mingdaoyun-sc | awk '{print $1}')
  • If the logs are normal, every line is at INFO level. If the logs show the Kafka service restarting repeatedly, the Kafka service is abnormal. Try restarting the service as a whole first. If Kafka still cannot start, clear the Kafka error data.
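To spot abnormal entries quickly, the health check logs can be filtered for lines that are not at INFO level. The sketch below assumes each log line contains its level as plain text such as INFO, WARN, or ERROR; the function name non_info_lines is hypothetical.

```shell
# Print non-empty log lines that do not mention INFO.
# Empty output suggests that the health check log is normal.
non_info_lines() {
  awk 'NF && !/INFO/'
}

# Usage (2>&1 is needed because docker logs may write to stderr):
#   docker logs $(docker ps | grep mingdaoyun-sc | awk '{print $1}') 2>&1 | tail -n 200 | non_info_lines
```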

If triggered workflows could not be written to the Kafka queue, for example because of a full disk or a Kafka service failure, the workflows already shown as "queued" in the history will never be consumed.
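The disk check can be automated with a small filter. This sketch reads df -P output rather than df -Th, because the POSIX format has a fixed column layout that is easier to parse. The function name full_filesystems and the 90% threshold are assumptions.

```shell
# Read `df -P` output on stdin and print the mount point and usage of every
# filesystem at or above the given usage percentage (default 90).
full_filesystems() {
  awk -v limit="${1:-90}" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)          # "95%" -> "95"
      if (use + 0 >= limit) print $6 " " $5
    }
  '
}

# Usage:
#   df -P | full_filesystems 90
```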


In extremely rare cases, the Kafka consumer group is in a rebalance state. This is mainly caused by slow message processing that leads to timeouts, or by consumer service instances restarting because of resource issues, which in turn triggers a consumer group rebalance.

You can use the following commands to check the actual message accumulation in each topic partition and whether the consumer group is currently rebalancing:

  1. Enter the container of the storage component

    docker exec -it $(docker ps | grep mingdaoyun-sc | awk '{print $1}') bash
  2. Execute the command to view the consumption of the md-workflow-consumer consumer group

    /usr/local/kafka/bin/kafka-consumer-groups.sh --bootstrap-server ${ENV_KAFKA_ENDPOINTS:=127.0.0.1:9092} --describe --group md-workflow-consumer
    • If prompted with Error: Executing consumer group command failed due to null, download the Kafka installation package, upload it to the deployment server, and copy it into the mingdaoyun-sc container. Then extract it and run the above command with the bin/kafka-consumer-groups.sh from the new installation package.

The normal output lists one row per topic partition. If the output includes Warning: Consumer group 'md-workflow-consumer' is rebalancing., the consumer group is currently rebalancing.

  • The LAG column represents the number of messages currently accumulated in the Topic partition.

  • Commonly used Topic names in workflows:

    • WorkFlow: Main workflow execution

    • WorkFlow-Process: Sub-workflow execution

    • WorkFlow-Router: Slow queue for workflow execution

    • WorkFlow-Batch: Bulk workflow execution

    • WorkFlow-Button: Button-triggered workflow execution

    • WorkSheet: Row record validation for triggering workflows

    • WorkSheet-Router: Slow queue row record validation for triggering workflows
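To see at a glance which topic is accumulating messages, the per-partition LAG values can be summed per topic. The sketch below is a hypothetical helper and assumes awk is available where it runs. It finds the TOPIC and LAG columns from the header row, since the exact column set differs between Kafka versions, and skips rows whose LAG is not a number.

```shell
# Read `kafka-consumer-groups.sh --describe` output on stdin and print
# the total LAG per topic, one "TOPIC TOTAL" pair per line.
lag_by_topic() {
  awk '
    topic_col == 0 {
      # Still looking for the header row: record the column positions.
      for (i = 1; i <= NF; i++) {
        if ($i == "TOPIC") topic_col = i
        if ($i == "LAG") lag_col = i
      }
      next
    }
    NF >= lag_col && $lag_col ~ /^[0-9]+$/ { lag[$topic_col] += $lag_col }
    END { for (t in lag) print t, lag[t] }
  ' | sort
}

# Succeeds if the text on stdin contains the rebalancing warning.
is_rebalancing() {
  grep -q 'is rebalancing'
}

# Usage inside the mingdaoyun-sc container:
#   /usr/local/kafka/bin/kafka-consumer-groups.sh --bootstrap-server ${ENV_KAFKA_ENDPOINTS:=127.0.0.1:9092} --describe --group md-workflow-consumer 2>&1 | lag_by_topic
```

A topic whose total stays high or keeps growing across repeated runs is the one that is not being consumed fast enough.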