Possible causes of high_replay lag, given replication settings?

Started by Jon Zeppieri | 9 months ago | 5 messages | general

zeppieri@gmail.com

9 months ago

Hi,

I just had a situation where physical replication fell far behind
(hours). The write and flush lag times were 0, but replay_lag was
high. The replica has hot_standby_feedback on, and both
max_standby_streaming_delay and max_standby_archive_delay are set to
30s.

What could cause a situation like this? If the network were a problem,
I'd expect the other _lag times to be high. So it appears that the
replica was getting the WAL but was unable to apply it. Are there
situations where the replica cannot apply WAL other than the kinds of
conflicts that would be addressed by the _delay settings?

I checked pg_stat_database_conflicts, but there was nothing in it -- all zeros.

- Jon
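The reasoning above (write and flush lag near zero, replay lag high, so the replica is receiving WAL but not applying it) can be sketched as a small check. This is only an illustration: the query uses the standard `pg_stat_replication` lag columns on the primary, but the function name, the labels it returns, and the 60-second threshold are assumptions, not anything from the thread.

```python
from datetime import timedelta

# Run on the primary. The three *_lag columns are intervals (NULL when idle);
# most Python drivers return them as datetime.timedelta or None.
LAG_QUERY = """
SELECT application_name, write_lag, flush_lag, replay_lag
FROM pg_stat_replication
"""

def classify_lag(write_lag, flush_lag, replay_lag,
                 threshold=timedelta(seconds=60)):
    """Return a rough, hypothetical label for where the lag accumulates."""
    zero = timedelta(0)
    w = write_lag or zero
    f = flush_lag or zero
    r = replay_lag or zero
    if w > threshold or f > threshold:
        # WAL is slow to reach or to be flushed on the replica.
        return "transport-or-disk"
    if r > threshold:
        # WAL arrives and is flushed promptly, but is not being applied.
        return "replay"
    return "ok"
```

With the values described in this message (zero write and flush lag, hours of replay lag) the sketch returns "replay".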

nick@cleaton.net

9 months ago

In reply to: Jon Zeppieri (#1)

Re: Possible causes of high_replay lag, given replication settings?

On Fri, 18 Jul 2025 at 21:29, Jon Zeppieri <zeppieri@gmail.com> wrote:

> I just had a situation where physical replication fell far behind
> (hours). The write and flush lag times were 0, but replay_lag was
> high. The replica has hot_standby_feedback on, and both
> max_standby_streaming_delay and max_standby_archive_delay are set to
> 30s.
>
> What could cause a situation like this? If the network were a problem,
> I'd expect the other _lag times to be high. So it appears that the
> replica was getting the WAL but was unable to apply it. Are there
> situations where the replica cannot apply WAL other than the kinds of
> conflicts that would be addressed by the _delay settings?
>
> I checked pg_stat_database_conflicts, but there was nothing in it -- all zeros.

This can happen when there are several busy writing processes on the
primary. The single replay process on the replica can't keep up with
the writes.
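One way to see whether replay is falling behind what the replica has already received is to compare the received and replayed WAL positions on the replica itself. The sketch below is a hedged illustration: the SQL functions are standard PostgreSQL, while the Python helper names are hypothetical. On the server, `pg_wal_lsn_diff` performs the same subtraction.

```python
# Run on the replica; all three functions are standard in PostgreSQL 10+.
REPLAY_PROGRESS_QUERY = """
SELECT pg_last_wal_receive_lsn(),
       pg_last_wal_replay_lsn(),
       pg_last_xact_replay_timestamp()
"""

def lsn_to_int(lsn: str) -> int:
    """Convert an LSN such as '16/B374D848' to a 64-bit byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def replay_gap_bytes(receive_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL received by the replica but not yet replayed."""
    return lsn_to_int(receive_lsn) - lsn_to_int(replay_lsn)
```

A gap that keeps growing while the receive position advances would be consistent with replay not keeping up with incoming WAL.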

zeppieri@gmail.com

9 months ago

In reply to: Nick Cleaton (#2)

Re: Possible causes of high_replay lag, given replication settings?

On Wed, Jul 23, 2025 at 4:27 PM Nick Cleaton <nick@cleaton.net> wrote:

> On Fri, 18 Jul 2025 at 21:29, Jon Zeppieri <zeppieri@gmail.com> wrote:
>
>> I just had a situation where physical replication fell far behind
>> (hours). The write and flush lag times were 0, but replay_lag was
>> high. The replica has hot_standby_feedback on, and both
>> max_standby_streaming_delay and max_standby_archive_delay are set to
>> 30s.
>>
>> What could cause a situation like this? If the network were a problem,
>> I'd expect the other _lag times to be high. So it appears that the
>> replica was getting the WAL but was unable to apply it. Are there
>> situations where the replica cannot apply WAL other than the kinds of
>> conflicts that would be addressed by the _delay settings?
>>
>> I checked pg_stat_database_conflicts, but there was nothing in it -- all zeros.
>
> This can happen when there are several busy writing processes on the
> primary. The single replay process on the replica can't keep up with
> the writes.

Thanks for the response, Nick. I'm curious why the situation you
describe wouldn't also lead to write_lag and flush_lag being
high. If the problem is simply keeping up with the primary, wouldn't
you expect all three lag times to be elevated?

- Jon

Greg Sabino Mullane

greg@turnstep.com

9 months ago

In reply to: Jon Zeppieri (#3)

Re: Possible causes of high_replay lag, given replication settings?

On Fri, Jul 25, 2025 at 9:57 AM Jon Zeppieri <zeppieri@gmail.com> wrote:

> Thanks for the response, Nick. I'm curious why the situation you describe
> wouldn't also lead to write_lag and flush_lag being
> high. If the problem is simply keeping up with the primary, wouldn't you
> expect all three lag times to be elevated?

No. Write and flush are quick and simple: they just put the WAL onto the
local disk. Replay involves a lot more work, as we have to parse the WAL
and apply the changes, which means doing a lot of I/O across many files.
Still, *hours* to me indicates more than just a lot of extra traffic. Check
that recovery_min_apply_delay is still 0, then log onto the replica and see
what's going on with regard to open transactions and locks.

Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support
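The checks suggested above could be gathered in one place, roughly as follows. This is a sketch under assumptions: it presumes PostgreSQL 12 or later (where `recovery_min_apply_delay` appears in `pg_settings`), the dictionary keys and the helper name are hypothetical, and running the statements is left to whatever client is in use.

```python
from collections import Counter

# Statements to run on the replica. The views and columns are standard
# PostgreSQL; the dictionary keys are hypothetical labels.
REPLICA_CHECKS = {
    "apply_delay": (
        "SELECT setting, unit FROM pg_settings "
        "WHERE name = 'recovery_min_apply_delay'"
    ),
    "open_transactions": (
        "SELECT pid, state, xact_start, query FROM pg_stat_activity "
        "WHERE xact_start IS NOT NULL ORDER BY xact_start"
    ),
    "locks": "SELECT pid, locktype, mode, granted FROM pg_locks",
}

def summarize_locks(lock_rows):
    """Count locks per (mode, granted) from (pid, locktype, mode, granted) rows."""
    return Counter((mode, granted) for _pid, _locktype, mode, granted in lock_rows)
```

The summary only groups rows that were already fetched from `pg_locks`; entries with `granted` set to false are the ones waiting on a lock.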

zeppieri@gmail.com

9 months ago

In reply to: Greg Sabino Mullane (#4)

Re: Possible causes of high_replay lag, given replication settings?

On Fri, Jul 25, 2025 at 7:13 PM Greg Sabino Mullane <htamfids@gmail.com> wrote:

> On Fri, Jul 25, 2025 at 9:57 AM Jon Zeppieri <zeppieri@gmail.com> wrote:
>
>> Thanks for the response, Nick. I'm curious why the situation you describe
>> wouldn't also lead to write_lag and flush_lag being
>> high. If the problem is simply keeping up with the primary, wouldn't you
>> expect all three lag times to be elevated?
>
> No. Write and flush are quick and simple: they just put the WAL onto the
> local disk. Replay involves a lot more work, as we have to parse the WAL
> and apply the changes, which means doing a lot of I/O across many files.
> Still, *hours* to me indicates more than just a lot of extra traffic. Check
> that recovery_min_apply_delay is still 0, then log onto the replica and see
> what's going on with regard to open transactions and locks.

Thanks Greg. `recovery_min_apply_delay` is 0, just checked. Also, I
didn't mention in my initial post that it seemed the cause of the
delay was long-running queries on the replica, rather than the
primary. It's possible, of course, that I'm wrong, but I was able to
get the replica moving again when I killed off old queries on the
replica. If those were the problem, though, then I don't understand
why the max_standby_streaming_delay didn't prevent that situation.

- Jon
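For the step described above (finding and killing old queries on the replica), a minimal sketch might look like this. The `pg_stat_activity` columns and `pg_terminate_backend` are standard PostgreSQL; the helper name and the age cutoff are hypothetical, and nothing here explains why max_standby_streaming_delay did not cancel the queries.

```python
from datetime import datetime, timedelta, timezone

# Run on the replica. backend_type is available in PostgreSQL 10+.
LONG_QUERY_SQL = """
SELECT pid, query_start, state, query
FROM pg_stat_activity
WHERE backend_type = 'client backend' AND state <> 'idle'
"""

# Hypothetical follow-up, one call per selected pid:
TERMINATE_SQL = "SELECT pg_terminate_backend(%s)"

def pids_older_than(rows, now, max_age):
    """Return pids whose current query started more than max_age ago."""
    return [
        pid
        for pid, query_start, _state, _query in rows
        if query_start is not None and now - query_start > max_age
    ]
```

Recording the returned pids together with replay_lag before and after terminating them would show whether those sessions were what held replay back.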