incorrect wal removal due to max_slot_wal_keep_size

Started by Jeff Janes over 1 year ago, 2 messages
#1Jeff Janes
jeff.janes@gmail.com

I was testing logical replication over my (remarkably bad) wifi network to
see what kind of throughput and lag I would get. I was using pgbench
default transaction as the workload generator with all 4 tables being
replicated. I had synchronous replication configured by
synchronous_standby_names, except at the time it was not actually in use
due to synchronous_commit being set to 'local' on the benchmarking
connections.

The master was shut down cleanly with a 'smart shutdown request' (in a state
where substantial lag had accumulated--I don't know exactly how much but at
least 20,000 transactions had replayed after replication restarted before it
stalled) when I got distracted by other things and decided to reboot the
Ubuntu machine it was running on.

When I restarted the master PostgreSQL server, the replica started to catch
up, but then eventually stalled.

On the master, I had this log, which occurred right after the first
checkpoint (since the server restart) began:

4790 00000 2024-10-09 12:03:12.819 EDT LOG: invalidating obsolete
replication slot "sub"
4790 00000 2024-10-09 12:03:12.819 EDT DETAIL: The slot's restart_lsn
1/84C5B510 exceeds the limit by 37374704 bytes.
4790 00000 2024-10-09 12:03:12.819 EDT HINT: You might need to increase
"max_slot_wal_keep_size".

But max_slot_wal_keep_size was set to -1 and had never been set to anything
other than that!
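
As a rough aid for reading the DETAIL line above, the numbers in it can be turned back into the WAL position the checkpoint was trying to keep. This is only a sketch: it assumes the reported byte count is the distance from the slot's restart_lsn to the oldest position the server intended to retain, and it assumes the default 16 MB WAL segment size. The helper names parse_lsn and format_lsn are made up for this illustration.

```python
def parse_lsn(text: str) -> int:
    """Convert an LSN written as 'hi/lo' (two hex numbers) to a byte position."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(pos: int) -> str:
    """Convert a byte position back to the 'hi/lo' hex notation."""
    return f"{pos >> 32:X}/{pos & 0xFFFFFFFF:X}"


restart_lsn = parse_lsn("1/84C5B510")  # taken from the DETAIL line
excess_bytes = 37374704                # taken from the DETAIL line

# Assumption: the limit lies excess_bytes ahead of restart_lsn.
limit_lsn = restart_lsn + excess_bytes
print(format_lsn(limit_lsn))  # 1/87000000

# Assumption: default wal_segment_size of 16 MB.
segment_size = 16 * 1024 * 1024
print(limit_lsn % segment_size == 0)  # True: the limit sits on a segment boundary
print(round(excess_bytes / segment_size, 2))  # 2.23 segments behind the limit
```

Under those assumptions the limit comes out at 1/87000000, exactly on a segment boundary, which suggests the slot was a little over two segments behind whatever cutoff the checkpoint computed.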

The master was running 18devel-d94cf5ca7f. Not for any particular reason,
but just because that is what I happened to have on when I started mucking
around with this. I don't recall running this particular test in this
manner before, and have no reason to think it is only broken in 18dev.

I'm going to try to reproduce this on 17.0, but in the meantime any other
suggestions for investigating this?

I have noticed some previous similar complaints about
max_slot_wal_keep_size being incorrectly invoked, but it didn't look like
they were ever resolved.

Cheers,

Jeff

#2Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Jeff Janes (#1)
RE: incorrect wal removal due to max_slot_wal_keep_size

Dear Jeff,

Thanks for reporting the issue. I've tried to reproduce it (by adding a delay
on the worker side and doing an immediate shutdown), but have not succeeded yet.
If possible, could you please share a script to reproduce it? That would help
with the analysis.

> I'm going to try to reproduce this on 17.0, but in the meantime any other suggestions for investigating this?

It would be very helpful to check the content of the pg_stat_replication_slots view
and the pg_wal directory of the server once you manage to reproduce it. Also, please
set log_min_messages = DEBUG2 to capture the logs from RemoveOldXlogFiles() and
RemoveXlogFile(). I would like to see those logs when you can reproduce the issue.
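
When inspecting pg_wal, it may help to know which segment file holds the slot's restart_lsn, so you can tell whether that file is still present. The sketch below derives the file name from an LSN using the usual WAL naming layout (timeline, then the segment number split into two 8-digit hex fields). The function name wal_segment_name is made up for this illustration, and timeline 1 and the 16 MB segment size are assumptions; on a live server the SQL function pg_walfile_name() gives the authoritative answer.

```python
def wal_segment_name(lsn: str, timeline: int = 1,
                     segment_size: int = 16 * 1024 * 1024) -> str:
    """Return the WAL segment file name expected to contain the given LSN.

    timeline and segment_size are assumptions (defaults of a fresh cluster).
    """
    hi, lo = lsn.split("/")
    pos = (int(hi, 16) << 32) | int(lo, 16)
    segno = pos // segment_size                   # global segment number
    segs_per_id = 0x100000000 // segment_size     # segments per 4 GB "xlog id"
    return f"{timeline:08X}{segno // segs_per_id:08X}{segno % segs_per_id:08X}"


# restart_lsn reported in the invalidation message
print(wal_segment_name("1/84C5B510"))  # 000000010000000100000084
```

If that file is missing from pg_wal while the slot still reports this restart_lsn, the segment was removed even though the slot needed it.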

These suggestions are inspired by [1]. I am not sure whether that thread and yours are the same issue.

[1]: /messages/by-id/Yz2hivgyjS1RfMKs@depesz.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED