# Log Data Cleanup
## Overview
In the HAP private deployment system, some "log" type data is retained in MongoDB for a long time. In certain usage scenarios this data can grow to a large volume and occupy significant database storage space.
You can use the `show dbs` command in MongoDB to check the size of each database, and then check collection sizes to find the tables that occupy significant storage space.
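For reference, a minimal MongoDB shell sketch for locating large databases and collections (the database and collection names below are only examples):

```
// List all databases and their on-disk sizes
show dbs

// Switch to a suspect database and inspect it
use mdworkflow
db.stats()               // overall size of the current database
db.wf_instance.stats()   // per-collection stats, including storageSize and count
```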
We provide a log data cleanup solution that allows for physical deletion of data from relevant tables according to specified rules.
## Important Notice Before Operation
The cleanup operation is a physical deletion, meaning the corresponding data is permanently lost. Once completed, the log data for the corresponding time period cannot be recovered or viewed on the system interface.
Impact Scope:
- Tables in the `mdworkflow` database that can be cleaned mainly affect:
  - Workflow execution history
  - Approval flow history
  - Interruptions in running processes: If the data being cleaned includes approval flows or workflows that are not yet completed, these processes will be interrupted due to data loss and cannot continue execution.
- Tables in the `mdworksheetlog` database mainly affect:
  - Worksheet row record logs
- Tables in the `mdintegration` database mainly affect:
  - Integration center history request logs
- Tables in the `mdservicedata` database mainly affect:
  - Application behavior logs
  - Usage analysis logs
## Data Cleanup Whitelist
| Database | Table Name | Table Usage Description |
| --- | --- | --- |
| mdworkflow | wf_instance | Main workflow execution history associated data |
| mdworkflow | wf_subInstanceActivity | Subprocess execution history associated data |
| mdworkflow | wf_subInstanceCallback | Subprocess execution history associated data |
| mdworkflow | wf_instanceExtends | Workflow execution history associated data |
| mdworkflow | code_catch | Stores temporary data generated at runtime by code block nodes |
| mdworkflow | hooks_catch | Stores received Webhook data |
| mdworkflow | webhooks_catch | Stores data obtained by the workflow "Send API request" node |
| mdworkflow | app_multiple_catch | Stores data obtained by selecting "direct access" in multi-data nodes |
| mdworkflow | custom_apipackageapi_catch | Stores response data returned by API integration calls |
| mdworksheetlog | wslog* | Stores worksheet row record logs for the corresponding month. Table naming format is wslog + date (e.g., wslog202409) |
| mdintegration | wf_instance | Integration center - request logs |
| mdintegration | wf_instance_relation | Integration center - request log associated data |
| mdintegration | webhooks_catch | Integration center - log data corresponding to "view details" in request logs |
| mdintegration | code_catch | Integration center - log data corresponding to "view details" in request logs |
| mdintegration | json_catch | Integration center - log data corresponding to "view details" in request logs |
| mdintegration | custom_parameter_catch | Integration center - log data corresponding to "view details" in request logs |
| mdservicedata | al_actionlog* | Stores application behavior logs for the corresponding month. Table naming format is al_actionlog + date (e.g., al_actionlog202409) |
| mdservicedata | al_uselog | Stores log data for the "usage analysis" feature |
## Data Cleanup Suggestions
Below are specific cleanup methods and considerations for different types of data.
### Log Tables Archived Monthly
This method is suitable for log tables that are automatically created monthly.
- Applicable Tables:
  - All tables in the `mdworksheetlog` database starting with `wslog`
  - All tables in the `mdservicedata` database starting with `al_actionlog`
- Operation Method: The most direct and efficient way to clean these monthly archived tables is to use the `drop` command to delete the entire table. This operation is extremely fast and immediately frees all disk space occupied by the table.
- Operation Example: The following command deletes the worksheet log table for January 2024 in the `mdworksheetlog` database.

  ```
  use mdworksheetlog;
  db.wslog202401.drop();
  ```
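If many months need to be cleaned at once, the shell sketch below drops every `wslog` table older than a chosen month. This is an illustrative sketch rather than part of the product tooling; review the tables it prints before relying on it. The same pattern applies to the `al_actionlog*` tables in `mdservicedata`.

```
use mdworksheetlog

// Keep this month and newer; drop everything older (adjust the cutoff as needed)
const cutoff = "wslog202401";

db.getCollectionNames()
  .filter(name => /^wslog\d{6}$/.test(name) && name < cutoff)
  .forEach(name => {
    print("Dropping " + name);
    db.getCollection(name).drop();
  });
```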
### Workflow Execution History Data
When cleaning workflow (and associated approval flow) execution history, the following four core data tables are involved:

- `wf_instance`
- `wf_subInstanceActivity`
- `wf_subInstanceCallback`
- `wf_instanceExtends`
These four tables have closely linked data, collectively forming a complete process record. Therefore, during cleanup operations, they must be treated as an indivisible whole.
The same deletion rules (e.g., deleting by the same time range) must be applied to these four tables, and operations should be performed synchronously. Any inconsistency in cleanup will compromise data integrity, potentially causing historical process query anomalies or triggering unknown system issues.
### Cache Data Tables (Ending with `_catch`)
These tables are distributed across the `mdworkflow` and `mdintegration` databases and mainly store temporary cache data generated during workflow and integration task execution. You can clean these tables independently and selectively based on actual needs.
Potential Risks and Impact:
Before cleaning these tables, be sure to understand the following potential impacts:
- Impact on Running Processes: Cleaning cache tables under the `mdworkflow` database may cause running workflows to fail due to missing dependent data. For instance, the `webhooks_catch` table is used to temporarily store received Webhook events; if event data in the table is cleared before the associated workflow completes, the ongoing process will be interrupted or fail.
- Impact on Log Detail Viewing: Cleaning cache tables under the `mdintegration` database will cause the "view details" feature of the integration center's "request logs" to show empty content, as the detailed data it relies on will have been deleted.
Operation Suggestions:
Depending on your tolerance for the above risks, choose one of the following cleanup methods:
- Conditional Cleanup: If business continuity is essential, it's recommended to use a data cleanup tool to delete old cache data based on conditions such as a time range (a manual shell sketch is shown after this list; the cleanup tool described below is the recommended approach).
- Direct Table Deletion: If business processes can tolerate the above risks (e.g., you have confirmed there are no running processes and no need to view old log details), the entire cache table can be dropped directly, which is the fastest way to reclaim disk space.
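As an illustration of conditional cleanup, the sketch below deletes documents older than a cutoff date from one cache table by using the creation timestamp embedded in every MongoDB `_id` (ObjectId). The table name and cutoff are example values, and this is a generic MongoDB approach rather than the product's official tooling.

```
use mdworkflow

// Delete cache documents created before 2024-01-01 (UTC)
const cutoff = new Date("2024-01-01T00:00:00Z");

// Build an ObjectId whose embedded timestamp equals the cutoff time
const cutoffId = ObjectId(
  Math.floor(cutoff.getTime() / 1000).toString(16).padStart(8, "0") + "0000000000000000"
);

db.webhooks_catch.deleteMany({ _id: { $lt: cutoffId } });
```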
### Usage Analysis Logs
The `al_uselog` table in the `mdservicedata` database stores data for the product's "usage analysis" feature.
Since the frontend page only supports querying data from the past year, it's advised to periodically clean data older than one year to reduce unnecessary storage occupation.
## Configuring Data Cleanup Tasks
- Download the image (offline package download is also available):

  ```bash
  docker pull registry.cn-hangzhou.aliyuncs.com/mdpublic/mingdaoyun-archivetools:1.0.4
  ```
- Create a `config.json` configuration file, taking workflow execution history deletion as an example:

  ```json
  [
    {
      "id": "1",
      "text": "Description",
      "start": "2023-05-31 16:00:00",
      "end": "2023-06-30 16:00:00",
      "src": "mongodb://root:password@192.168.1.20:27017/mdworkflow?authSource=admin",
      "archive": "",
      "table": "wf_instance",
      "delete": true,
      "batchSize": 500,
      "retentionDays": 0,
      "concurrencyLimit": 100
    },
    {
      "id": "2",
      "text": "Description",
      "start": "2023-05-31 16:00:00",
      "end": "2023-06-30 16:00:00",
      "src": "mongodb://root:password@192.168.1.30:27017/mdworkflow?authSource=admin",
      "archive": "",
      "table": "wf_subInstanceActivity",
      "delete": true,
      "batchSize": 500,
      "retentionDays": 0,
      "concurrencyLimit": 100
    },
    {
      "id": "3",
      "text": "Description",
      "start": "2023-05-31 16:00:00",
      "end": "2023-06-30 16:00:00",
      "src": "mongodb://root:password@192.168.1.30:27017/mdworkflow?authSource=admin",
      "archive": "",
      "table": "wf_subInstanceCallback",
      "delete": true,
      "batchSize": 500,
      "retentionDays": 0,
      "concurrencyLimit": 100
    },
    {
      "id": "4",
      "text": "Description",
      "start": "2023-05-31 16:00:00",
      "end": "2023-06-30 16:00:00",
      "src": "mongodb://root:password@192.168.1.30:27017/mdworkflow?authSource=admin",
      "archive": "",
      "table": "wf_instanceExtends",
      "delete": true,
      "batchSize": 500,
      "retentionDays": 0,
      "concurrencyLimit": 100
    }
  ]
  ```

  - Adjust or add configuration content according to the above format to clean up the data tables you require.
  - Note that the times specified in the configuration file are in the UTC timezone:
    - UTC 2023-05-31 16:00:00 corresponds to 2023-06-01 00:00:00 in UTC+8
    - UTC 2023-06-30 16:00:00 corresponds to 2023-07-01 00:00:00 in UTC+8
  Parameter Description:

  - `id`: Task identifier ID.
  - `text`: Custom description.
  - `start`: Start time for data deletion, in UTC; data with a timestamp greater than or equal to this time is deleted (ignored when `retentionDays` is greater than 0).
  - `end`: End time for data deletion, in UTC; data with a timestamp less than this time is deleted (ignored when `retentionDays` is greater than 0).
  - `src`: Source database connection address.
  - `archive`: Target database connection address. If this value is empty, no archiving occurs and data is simply deleted according to the specified rules.
  - `table`: Data table to process.
  - `delete`: Defaults to true: after the task completes and record counts are verified, the archived data is deleted from the source database. Set to false if deletion is not required.
  - `batchSize`: Batch size for each insert and delete operation.
  - `retentionDays`: Defaults to 0. When greater than 0, data older than this number of days is deleted, the task runs as a scheduled deletion task, and the `start` and `end` dates are ignored. The default execution interval is every 24 hours.
  - `concurrencyLimit`: Concurrency limit for operations; the default of 100 is usually fine.
- Start the archiving service; execute the following in the directory where `config.json` is located:

  ```bash
  docker run -d -it -v $(pwd)/config.json:/usr/local/MDArchiveTools/config.json -v /usr/share/zoneinfo/Etc/GMT-8:/etc/localtime registry.cn-hangzhou.aliyuncs.com/mdpublic/mingdaoyun-archivetools:1.0.4
  ```
Other considerations:
- If the cleanup program and HAP single-server mode are running on the same server, add the `--network script_default` parameter to the `docker run` command so that the cleanup program can access MongoDB through Docker's internal network. In this case, the source database connection address (`src`) in the `config.json` configuration file should be written as `mongodb://sc:27017/dbname`.
- Resource Utilization: While running, the program puts a certain amount of load on the source database, the target database, and the machine it runs on. It's advised to execute during business idle periods.
- Log Viewing:
  - Background run (default): Use `docker ps -a` to find the container ID, then execute `docker logs containerID` to view the logs.
  - Foreground run: Omit the `-d` parameter; logs are output to the terminal in real time, which makes it easy to monitor progress.
- Scheduled Tasks:
  - Define the execution interval via the custom `ENV_ARCHIVE_INTERVAL` environment variable, in milliseconds; the default value is 86400000 (24 hours). See the example after this list.
- Reclaim Disk Space: After deleting data with the cleanup tool, the disk space occupied by the deleted data is not released immediately; it is typically reused by the same table.
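For example, a `docker run` invocation that sets a 12-hour interval might look like the following. It is the same start command as above with only an added `-e` environment variable; 43200000 ms equals 12 hours.

```bash
docker run -d -it \
  -e ENV_ARCHIVE_INTERVAL=43200000 \
  -v $(pwd)/config.json:/usr/local/MDArchiveTools/config.json \
  -v /usr/share/zoneinfo/Etc/GMT-8:/etc/localtime \
  registry.cn-hangzhou.aliyuncs.com/mdpublic/mingdaoyun-archivetools:1.0.4
```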