I have an Aurora MySQL global cluster with writer and reader instances, and both have the following configuration:
Instance size: db.r7g.xlarge
general_log: enabled
slow_query_log: enabled
log_output: FILE
The issue I am facing is that CPU usage spikes to 100%, along with a steep spike in the rest of the metrics, and my services are no longer able to connect to the database.
Upon investigation, I found several “No space left on device” error messages in the database logs at the same time the issue happened.
This issue happened twice with the same pattern. To resolve it, I did a manual failover of the writer/reader instances (this is only a temporary fix until the issue happens again).
The two storage metrics I looked at are “AuroraVolumeBytesLeftTotal”, which had approximately 140TB left, and “FreeLocalStorage”, which had approximately 80GB left.
From the definitions of these metrics, it appears that AuroraVolume is the shared storage where table data is stored and that it autoscales based on usage, while FreeLocalStorage is the storage available to each Aurora instance (writer/reader).
The “No space left on device” error message does not match with any metrics available to us.
Also, since Aurora has log rotation policies, the no-space issue does not make sense.
Audit, general, and slow query logs are rotated either after 24 hours or when 15% of storage has been consumed. When FILE logging is enabled, general log and slow query log files are examined every hour, and log files more than 24 hours old are deleted.
Ref: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/USER_LogAccess.MySQL.LogFileSize.html#USER_LogAccess.AMS.LogFileSize.retention
Is there any way to get more insight into the local storage used by Aurora instances, or can someone help me better understand the root cause if I am looking in the wrong direction?
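One check that might narrow this down is to pull FreeLocalStorage at the finest granularity around the incident, using the Minimum statistic rather than an average, in case a short-lived drop is being smoothed out in the console graphs. The following is a minimal sketch, not a verified diagnosis. It assumes the metric is available at a 60-second period in the AWS/RDS namespace under the DBInstanceIdentifier dimension; the helper names and the instance identifier in the usage note are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def build_request(instance_id, end, hours=3):
    """Build get_metric_statistics arguments for per-minute FreeLocalStorage minimums."""
    return {
        "Namespace": "AWS/RDS",
        "MetricName": "FreeLocalStorage",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 60,  # assumption: 1-minute datapoints exist for this window
        "Statistics": ["Minimum"],
    }


def lowest_points(datapoints, n=5):
    """Return the n datapoints with the least free local storage, lowest first."""
    return sorted(datapoints, key=lambda d: d["Minimum"])[:n]


def fetch(instance_id, end):
    """Query CloudWatch and return the lowest FreeLocalStorage datapoints."""
    import boto3  # imported lazily so the helpers above work without AWS access

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(**build_request(instance_id, end))
    return lowest_points(response["Datapoints"])
```

For example, `fetch("example-writer-instance", datetime(2024, 5, 13, 13, 30, tzinfo=timezone.utc))` would cover the three hours leading up to and including the errors logged at 12:36 and 12:51 UTC. If the minimum never approaches zero, that would suggest the full device is not the one this metric reports on.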
Appreciate any help on this.
Thank you.
Note: The following are some of the error messages I am seeing in the database logs:
2024-05-13T11:40:34.710624Z 0 [Note] [MY-000000] [Repl] [Dump thread metrics] Secondary_id: 1012653105, Secondary_uuid: 549b2e36-46a4-491d-b8e1-c95eef81e275, Binlog_file: mysql-bin-changelog.002012, Binlog_position: 88516906, Bytes_behind_primary: 1875, Bytes_behind_primary (1875) is smaller than aurora_binlog_io_cache_size (134217728) by 134215853 (rpl_binlog_sender.cc:1945)
<> <> [Note] [<>] [Server] Aborted connection <> to db: <> user: <> host: <> (Got an error reading communication packets). (sql_connect.cc:<>)
<> <> [Note] [<>] [Repl] [Dump thread metrics] Secondary_id: <>, Secondary_uuid: <>, Binlog_file: <>, Binlog_position: <>, Bytes_behind_primary: <>, Bytes_behind_primary (<>) is smaller than aurora_binlog_io_cache_size (<>) by <> (rpl_binlog_sender.cc:<>)
2024-05-13T12:51:59.641538Z 21044534 [ERROR] [MY-010907] [Server] Error writing file ‘/rdsdbdata/log/general/mysql-general.log’ (errno: 28 – No space left on device)
Uh oh, gzputs operation on GzipFile /rdsdbdata/log/aurora-engine-logs/grover.mysqld.2024-05-12-22-03-58.260.10.log.gz failed: No space left on device
Uh oh, gzflush operation on GzipFile /rdsdbdata/log/aurora-engine-logs/grover.mysqld.2024-05-12-22-03-58.260.10.log.gz failed: No space left on device
2024-05-13T12:36:12.281249Z 21042731 [ERROR] [MY-010907] [Server] Error writing file ‘/rdsdbdata/log/general/mysql-general.log’ (errno: 28 – No space left on device)
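
All of the “No space left on device” errors above refer to files under /rdsdbdata/log. To confirm that across a full downloaded log file, a small script can group these errors by the directory being written to. This is a minimal sketch: the function name is hypothetical, and it assumes each error line contains the failing path in one of the two formats shown above.

```python
import re
from collections import Counter

# Matches a path under /rdsdbdata, stopping at whitespace or a quote character.
PATH_RE = re.compile(r"/rdsdbdata/[^\s'\"‘’]+")
ENOSPC = "No space left on device"


def enospc_directories(lines):
    """Count 'No space left on device' errors per directory of the failing file."""
    counts = Counter()
    for line in lines:
        if ENOSPC not in line:
            continue
        match = PATH_RE.search(line)
        if match:
            # Drop the file name and keep only the parent directory.
            directory = match.group(0).rsplit("/", 1)[0]
            counts[directory] += 1
    return counts


sample = [
    "2024-05-13T12:51:59.641538Z 21044534 [ERROR] [MY-010907] [Server] Error writing file "
    "‘/rdsdbdata/log/general/mysql-general.log’ (errno: 28 – No space left on device)",
    "Uh oh, gzputs operation on GzipFile /rdsdbdata/log/aurora-engine-logs/"
    "grover.mysqld.2024-05-12-22-03-58.260.10.log.gz failed: No space left on device",
    "[Note] [Server] Aborted connection (Got an error reading communication packets).",
]

for directory, count in enospc_directories(sample).items():
    print(directory, count)
```

If every hit lands under /rdsdbdata/log, that would point at the volume holding the log files rather than the Aurora cluster volume.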