# Log Data Cleanup
## Overview
In the HAP private deployment system, some "log" type data is retained in MongoDB for a long time. In certain usage scenarios this data can grow to a large volume and occupy significant database storage space.
You can use the `show dbs` command in MongoDB to check the size of each database, and then check collection sizes to find the tables that occupy significant storage space.
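For reference, a minimal MongoDB shell sketch for locating large databases and collections (the database and collection names below are only examples):

```
// List all databases and their on-disk sizes
show dbs

// Switch to a suspect database and inspect it
use mdworkflow
db.stats()               // overall size of the current database
db.wf_instance.stats()   // per-collection stats, including storageSize and count
```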
We provide a log data cleanup solution that allows for physical deletion of data from relevant tables according to specified rules.
## Important Notice Before Operation
The cleanup operation is a physical deletion, meaning the corresponding data is permanently lost. Once completed, the log data for the corresponding time period cannot be recovered or viewed on the system interface.
Impact Scope:
- Tables in the `mdworkflow` database that can be cleaned mainly affect:
  - Workflow execution history
  - Approval flow history
  - Interruptions in running processes: If the data being cleaned includes approval flows or workflows that are not yet completed, these processes will be interrupted due to data loss and cannot continue execution.
- Tables in the `mdworksheetlog` database mainly affect:
  - Worksheet row record logs
- Tables in the `mdintegration` database mainly affect:
  - Integration center history request logs
- Tables in the `mdservicedata` database mainly affect:
  - Application behavior logs
  - Usage analysis logs
## Data Cleanup Whitelist
| Database | Table Name | Table Usage Description |
| --- | --- | --- |
| mdworkflow | wf_instance | Main workflow execution history associated data |
| mdworkflow | wf_subInstanceActivity | Subprocess execution history associated data |
| mdworkflow | wf_subInstanceCallback | Subprocess execution history associated data |
| mdworkflow | wf_instanceExtends | Workflow execution history associated data |
| mdworkflow | code_catch | Stores temporary data generated at runtime by code block nodes |
| mdworkflow | hooks_catch | Stores received Webhook data |
| mdworkflow | webhooks_catch | Stores data obtained by the workflow "Send API request" node |
| mdworkflow | app_multiple_catch | Stores data obtained by selecting "direct access" in multi-data nodes |
| mdworkflow | custom_apipackageapi_catch | Stores response data returned by API integration calls |
| mdworksheetlog | wslog* | Stores worksheet row record logs for the corresponding month. Table naming format is wslog + date (e.g., wslog202409) |
| mdintegration | wf_instance | Integration center - request logs |
| mdintegration | wf_instance_relation | Integration center - request log associated data |
| mdintegration | webhooks_catch | Integration center - log data corresponding to "view details" in request logs |
| mdintegration | code_catch | Integration center - log data corresponding to "view details" in request logs |
| mdintegration | json_catch | Integration center - log data corresponding to "view details" in request logs |
| mdintegration | custom_parameter_catch | Integration center - log data corresponding to "view details" in request logs |
| mdservicedata | al_actionlog* | Stores application behavior logs for the corresponding month. Table naming format is al_actionlog + date (e.g., al_actionlog202409) |
| mdservicedata | al_uselog | Stores log data for the "usage analysis" feature |
## Data Cleanup Suggestions
Below are specific cleanup methods and considerations for different types of data.
### Log Tables Archived Monthly
This method is suitable for log tables that are automatically created monthly.
- Applicable Tables:
  - All tables in the `mdworksheetlog` database starting with `wslog`
  - All tables in the `mdservicedata` database starting with `al_actionlog`
- Operation Method: The most direct and efficient way to clean these monthly archived tables is to use the `drop` command to delete the entire table. This operation is extremely fast and immediately frees all disk space occupied by the table.
- Operation Example: The following command deletes the worksheet log table for January 2024 in the `mdworksheetlog` database.

  ```
  use mdworksheetlog;
  db.wslog202401.drop();
  ```
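If many months need to be cleaned at once, the shell sketch below drops every `wslog` table older than a chosen month. This is an illustrative sketch rather than part of the product tooling; review the tables it prints before relying on it. The same pattern applies to the `al_actionlog*` tables in `mdservicedata`.

```
use mdworksheetlog

// Keep this month and newer; drop everything older (adjust the cutoff as needed)
const cutoff = "wslog202401";

db.getCollectionNames()
  .filter(name => /^wslog\d{6}$/.test(name) && name < cutoff)
  .forEach(name => {
    print("Dropping " + name);
    db.getCollection(name).drop();
  });
```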
### Workflow Execution History Data
When cleaning workflow (and associated approval flow) execution history, the following four core data tables are involved:

- `wf_instance`
- `wf_subInstanceActivity`
- `wf_subInstanceCallback`
- `wf_instanceExtends`
These four tables have closely linked data, collectively forming a complete process record. Therefore, during cleanup operations, they must be treated as an indivisible whole.
The same deletion rules (e.g., deleting by the same time range) must be applied to these four tables, and operations should be performed synchronously. Any inconsistency in cleanup will compromise data integrity, potentially causing historical process query anomalies or triggering unknown system issues.
### Cache Data Tables (Ending with `_catch`)
These tables are distributed across the `mdworkflow` and `mdintegration` databases and mainly store temporary cache data generated during workflow and integration task execution. You can clean these tables independently and selectively based on actual needs.
Potential Risks and Impact:
Before cleaning these tables, be sure to understand the following potential impacts:
- Impact on Running Processes: Cleaning cache tables under the `mdworkflow` database may cause running workflows to fail due to missing dependent data. For instance, the `webhooks_catch` table is used to temporarily store received Webhook events; if event data in the table is cleared before the associated workflow completes, the ongoing process will be interrupted or fail.
- Impact on Log Detail Viewing: Cleaning cache tables under the `mdintegration` database will cause the "view details" feature of the integration center's "request logs" to show empty content, as the detailed data it relies on will have been deleted.
Operation Suggestions:
Depending on your tolerance for the above risks, choose one of the following cleanup methods:
- Conditional Cleanup: If business continuity is essential, it's recommended to use a data cleanup tool to delete old cache data based on conditions such as a time range (a manual shell sketch is shown after this list; the cleanup tool described below is the recommended approach).
- Direct Table Deletion: If business processes can tolerate the above risks (e.g., you have confirmed there are no running processes and no need to view old log details), the entire cache table can be dropped directly, which is the fastest way to reclaim disk space.
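As an illustration of conditional cleanup, the sketch below deletes documents older than a cutoff date from one cache table by using the creation timestamp embedded in every MongoDB `_id` (ObjectId). The table name and cutoff are example values, and this is a generic MongoDB approach rather than the product's official tooling.

```
use mdworkflow

// Delete cache documents created before 2024-01-01 (UTC)
const cutoff = new Date("2024-01-01T00:00:00Z");

// Build an ObjectId whose embedded timestamp equals the cutoff time
const cutoffId = ObjectId(
  Math.floor(cutoff.getTime() / 1000).toString(16).padStart(8, "0") + "0000000000000000"
);

db.webhooks_catch.deleteMany({ _id: { $lt: cutoffId } });
```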
### Usage Analysis Logs
The `al_uselog` table in the `mdservicedata` database stores data for the product's "usage analysis" feature.
Since the frontend page only supports querying data from the past year, it's advised to periodically clean data older than one year to reduce unnecessary storage occupation.
## Configuring Data Cleanup Tasks
- Download the image (offline package download is also available):

  ```bash
  docker pull registry.cn-hangzhou.aliyuncs.com/mdpublic/mingdaoyun-archivetools:1.0.4
  ```
- Create a `config.json` configuration file, taking workflow execution history deletion as an example:

  ```json
  [
    {
      "id": "1",
      "text": "Description",
      "start": "2023-05-31 16:00:00",
      "end": "2023-06-30 16:00:00",
      "src": "mongodb://root:password@192.168.1.20:27017/mdworkflow?authSource=admin",
      "archive": "",
      "table": "wf_instance",
      "delete": true,
      "batchSize": 500,
      "retentionDays": 0,
      "concurrencyLimit": 100
    },
    {
      "id": "2",
      "text": "Description",
      "start": "2023-05-31 16:00:00",
      "end": "2023-06-30 16:00:00",
      "src": "mongodb://root:password@192.168.1.30:27017/mdworkflow?authSource=admin",
      "archive": "",
      "table": "wf_subInstanceActivity",
      "delete": true,
      "batchSize": 500,
      "retentionDays": 0,
      "concurrencyLimit": 100
    },
    {
      "id": "3",
      "text": "Description",
      "start": "2023-05-31 16:00:00",
      "end": "2023-06-30 16:00:00",
      "src": "mongodb://root:password@192.168.1.30:27017/mdworkflow?authSource=admin",
      "archive": "",
      "table": "wf_subInstanceCallback",
      "delete": true,
      "batchSize": 500,
      "retentionDays": 0,
      "concurrencyLimit": 100
    },
    {
      "id": "4",
      "text": "Description",
      "start": "2023-05-31 16:00:00",
      "end": "2023-06-30 16:00:00",
      "src": "mongodb://root:password@192.168.1.30:27017/mdworkflow?authSource=admin",
      "archive": "",
      "table": "wf_instanceExtends",
      "delete": true,
      "batchSize": 500,
      "retentionDays": 0,
      "concurrencyLimit": 100
    }
  ]
  ```

  - Adjust or add configuration content according to the above format to clean up the data tables you require.
  - Note that the times specified in the configuration file are in the UTC timezone:
    - UTC 2023-05-31 16:00:00 corresponds to 2023-06-01 00:00:00 in UTC+8
    - UTC 2023-06-30 16:00:00 corresponds to 2023-07-01 00:00:00 in UTC+8
  Parameter Description:

  - `id`: Task identifier ID.
  - `text`: Custom description.
  - `start`: Start time for data deletion, in UTC; data with a timestamp greater than or equal to this time is deleted (ignored when `retentionDays` is greater than 0).
  - `end`: End time for data deletion, in UTC; data with a timestamp less than this time is deleted (ignored when `retentionDays` is greater than 0).
  - `src`: Source database connection address.
  - `archive`: Target database connection address. If this value is empty, no archiving occurs and data is simply deleted according to the specified rules.
  - `table`: Data table to process.
  - `delete`: Defaults to true: after the task completes and record counts are verified, the archived data is deleted from the source database. Set to false if deletion is not required.
  - `batchSize`: Batch size for each insert and delete operation.
  - `retentionDays`: Defaults to 0. When greater than 0, data older than this number of days is deleted, the task runs as a scheduled deletion task, and the `start` and `end` dates are ignored. The default execution interval is every 24 hours.
  - `concurrencyLimit`: Concurrency limit for operations; the default of 100 is usually fine.
- Start the archiving service; execute the following in the directory where `config.json` is located:

  ```bash
  docker run -d -it -v $(pwd)/config.json:/usr/local/MDArchiveTools/config.json -v /usr/share/zoneinfo/Etc/GMT-8:/etc/localtime registry.cn-hangzhou.aliyuncs.com/mdpublic/mingdaoyun-archivetools:1.0.4
  ```
Other considerations:
- If the cleanup program and HAP single-server mode are running on the same server, add the `--network script_default` parameter to the `docker run` command so that the cleanup program can access MongoDB through Docker's internal network. In this case, the source database connection address (`src`) in the `config.json` configuration file should be written as `mongodb://sc:27017/dbname`.
- Resource Utilization: While running, the program puts a certain amount of load on the source database, the target database, and the machine it runs on. It's advised to execute during business idle periods.
- Log Viewing:
  - Background run (default): Use `docker ps -a` to find the container ID, then execute `docker logs containerID` to view the logs.
  - Foreground run: Omit the `-d` parameter; logs are output to the terminal in real time, which makes it easy to monitor progress.
- Scheduled Tasks:
  - Define the execution interval via the custom `ENV_ARCHIVE_INTERVAL` environment variable, in milliseconds; the default value is 86400000 (24 hours). See the example after this list.
- Reclaim Disk Space: After deleting data with the cleanup tool, the disk space occupied by the deleted data is not released immediately; it is typically reused by the same table.
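For example, a `docker run` invocation that sets a 12-hour interval might look like the following. It is the same start command as above with only an added `-e` environment variable; 43200000 ms equals 12 hours.

```bash
docker run -d -it \
  -e ENV_ARCHIVE_INTERVAL=43200000 \
  -v $(pwd)/config.json:/usr/local/MDArchiveTools/config.json \
  -v /usr/share/zoneinfo/Etc/GMT-8:/etc/localtime \
  registry.cn-hangzhou.aliyuncs.com/mdpublic/mingdaoyun-archivetools:1.0.4
```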