We have set up logical replication on AWS RDS, with a subscription to a publication that covers most of the database's tables. After the initial copy of the tables completed, CDC started.
After some time we noticed the following log entries on the subscriber:
2024-09-24 05:08:26 UTC::@:[22213]:ERROR: could not receive data from WAL stream: SSL SYSCALL error: EOF detected
2024-09-24 05:08:26 UTC::@:[705]:LOG: background worker "logical replication worker" (PID 22213) exited with exit code 1
2024-09-24 05:08:26 UTC::@:[22296]:LOG: logical replication apply worker for subscription "......" has started
The problem is that the lag of this replication slot has been increasing continuously since then (it is currently 150 GB). We tried disabling and re-enabling the subscription and refreshing the publication, but the issue was not resolved. We then increased wal_sender_timeout and wal_receiver_timeout; the error disappeared from the logs, but the lag keeps increasing.
Sometimes the slot becomes active and restart_lsn moves, but the lag still keeps increasing.
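For context, slot lag of this kind is usually measured as the byte distance between the publisher's current WAL position (pg_current_wal_lsn()) and the slot's confirmed_flush_lsn from pg_replication_slots. Below is a minimal Python sketch of that arithmetic. The LSN values are made up for illustration and are not taken from our system, and lsn_to_int and lag_bytes are hypothetical helper names rather than part of any library.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits and the low 32 bits of a 64-bit WAL position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(current_wal_lsn: str, confirmed_flush_lsn: str) -> int:
    """Bytes of WAL the subscriber has not yet confirmed as flushed."""
    return lsn_to_int(current_wal_lsn) - lsn_to_int(confirmed_flush_lsn)


# Illustrative values only (assumed, not real output from our publisher).
lag = lag_bytes("3A/00000000", "14/80000000")
print(lag / 2**30)  # lag in GiB; prints 150.0 for these made-up values
```

Comparing this number over time shows whether confirmed_flush_lsn is advancing at all or only falling behind more slowly.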
Should we change streaming to parallel on the subscription? Are there any caveats for this setting? Are there any other solutions?
Config settings on publisher:
┌───────────────────────┬─────────┐
│         name          │ setting │
├───────────────────────┼─────────┤
│ max_replication_slots │ 20      │
│ max_wal_senders       │ 35      │
│ wal_level             │ logical │
│ wal_sender_timeout    │ 300000  │
└───────────────────────┴─────────┘
Config settings on subscriber:
┌─────────────────────────────────┬─────────┐
│              name               │ setting │
├─────────────────────────────────┼─────────┤
│ max_logical_replication_workers │ 8       │
│ max_replication_slots           │ 20      │
│ max_worker_processes            │ 20      │
│ wal_receiver_timeout            │ 300000  │
└─────────────────────────────────┴─────────┘