Syncrep and improving latency due to WAL throttling
Hi,
attached is a proposal/idea by Tomas (in CC) for protecting and
prioritizing OLTP latency on syncrep over other WAL-heavy
sessions. This is the result of internal testing and research related
to syncrep behavior by Tomas, Alvaro and me. The main objective
of this work-in-progress/crude patch idea (with the GUC disabled by
default) is to prevent build-up of lag against the synchronous
standby. It allows a DBA to maintain stable latency for many backends
while some WAL-heavy activity happens in parallel (full table
rewrite, VACUUM, MV build, archiving, etc.). In other words, it allows
slowing down any backend's activity. Any feedback on such a feature is
welcome, including better GUC name proposals ;) and conditions under
which such a feature should be disabled even if enabled
globally (right now only anti-wraparound VACUUM comes to mind; it's
not in the patch).
Demo. Given:
- two DBs in syncrep configuration and artificial RTT 10ms latency
between them (introduced via tc qdisc netem)
- insertBIG.sql = "insert into bandwidthhog select repeat('c', 1000)
from generate_series(1, 500000);" (50MB of WAL data)
- pgbench (8c) and 1x INSERT session
There are clearly visible pgbench (OLTP) throughput drops and latency
spikes when the WAL socket is saturated:
with 16devel/master and synchronous_commit_flush_wal_after=0
(disabled, default/baseline):
postgres@host1:~$ pgbench -n -R 50 -c 8 -T 15 -P 1
pgbench (16devel)
progress: 1.0 s, 59.0 tps, lat 18.840 ms stddev 11.251, 0 failed, lag 0.059 ms
progress: 2.0 s, 48.0 tps, lat 14.332 ms stddev 4.272, 0 failed, lag 0.063 ms
progress: 3.0 s, 56.0 tps, lat 15.383 ms stddev 6.270, 0 failed, lag 0.061 ms
progress: 4.0 s, 51.0 tps, lat 15.104 ms stddev 5.850, 0 failed, lag 0.061 ms
progress: 5.0 s, 47.0 tps, lat 15.184 ms stddev 5.472, 0 failed, lag 0.063 ms
progress: 6.0 s, 23.0 tps, lat 88.495 ms stddev 141.845, 0 failed, lag 0.064 ms
progress: 7.0 s, 1.0 tps, lat 999.053 ms stddev 0.000, 0 failed, lag 0.077 ms
progress: 8.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed, lag 0.000 ms
progress: 9.0 s, 1.0 tps, lat 2748.142 ms stddev NaN, 0 failed, lag 0.072 ms
progress: 10.0 s, 68.1 tps, lat 3368.267 ms stddev 282.842, 0 failed,
lag 2911.857 ms
progress: 11.0 s, 97.0 tps, lat 2560.750 ms stddev 216.844, 0 failed,
lag 2478.261 ms
progress: 12.0 s, 96.0 tps, lat 1463.754 ms stddev 376.276, 0 failed,
lag 1383.873 ms
progress: 13.0 s, 94.0 tps, lat 616.243 ms stddev 230.673, 0 failed,
lag 527.241 ms
progress: 14.0 s, 59.0 tps, lat 48.265 ms stddev 72.533, 0 failed, lag 15.181 ms
progress: 15.0 s, 39.0 tps, lat 14.237 ms stddev 6.073, 0 failed, lag 0.063 ms
transaction type: <builtin: TPC-B (sort of)>
[..]
latency average = 931.383 ms
latency stddev = 1188.530 ms
rate limit schedule lag: avg 840.170 (max 3605.569) ms
session2 output:
postgres=# \i insertBIG.sql
Timing is on.
INSERT 0 500000
Time: 4119.485 ms (00:04.119)
This new GUC makes it possible for the OLTP traffic to be less
affected (latency-wise) when the heavy bulk traffic hits. With
synchronous_commit_flush_wal_after=1024 (kB) it's way better, but
latency still rises to ~45ms:
postgres@host1:~$ pgbench -n -R 50 -c 8 -T 15 -P 1
pgbench (16devel)
progress: 1.0 s, 52.0 tps, lat 17.300 ms stddev 10.178, 0 failed, lag 0.061 ms
progress: 2.0 s, 51.0 tps, lat 19.490 ms stddev 12.626, 0 failed, lag 0.061 ms
progress: 3.0 s, 48.0 tps, lat 14.839 ms stddev 5.429, 0 failed, lag 0.061 ms
progress: 4.0 s, 53.0 tps, lat 24.635 ms stddev 13.449, 0 failed, lag 0.062 ms
progress: 5.0 s, 48.0 tps, lat 17.999 ms stddev 9.291, 0 failed, lag 0.062 ms
progress: 6.0 s, 57.0 tps, lat 21.513 ms stddev 17.011, 0 failed, lag 0.058 ms
progress: 7.0 s, 50.0 tps, lat 28.071 ms stddev 21.622, 0 failed, lag 0.061 ms
progress: 8.0 s, 45.0 tps, lat 27.244 ms stddev 11.975, 0 failed, lag 0.064 ms
progress: 9.0 s, 57.0 tps, lat 35.988 ms stddev 25.752, 0 failed, lag 0.057 ms
progress: 10.0 s, 56.0 tps, lat 45.478 ms stddev 39.831, 0 failed, lag 0.534 ms
progress: 11.0 s, 62.0 tps, lat 45.146 ms stddev 32.881, 0 failed, lag 0.058 ms
progress: 12.0 s, 51.0 tps, lat 24.250 ms stddev 12.405, 0 failed, lag 0.063 ms
progress: 13.0 s, 57.0 tps, lat 18.554 ms stddev 8.677, 0 failed, lag 0.060 ms
progress: 14.0 s, 44.0 tps, lat 15.923 ms stddev 6.958, 0 failed, lag 0.065 ms
progress: 15.0 s, 54.0 tps, lat 19.773 ms stddev 10.024, 0 failed, lag 0.063 ms
transaction type: <builtin: TPC-B (sort of)>
[..]
latency average = 25.575 ms
latency stddev = 21.540 ms
session2 output:
postgres=# set synchronous_commit_flush_wal_after = 1024;
SET
postgres=# \i insertBIG.sql
INSERT 0 500000
Time: 8889.318 ms (00:08.889)
With 16devel/master and synchronous_commit_flush_wal_after=256 (kB)
all is smooth:
postgres@host1:~$ pgbench -n -R 50 -c 8 -T 15 -P 1
pgbench (16devel)
progress: 1.0 s, 49.0 tps, lat 14.345 ms stddev 4.700, 0 failed, lag 0.062 ms
progress: 2.0 s, 45.0 tps, lat 14.812 ms stddev 5.816, 0 failed, lag 0.064 ms
progress: 3.0 s, 49.0 tps, lat 13.145 ms stddev 4.320, 0 failed, lag 0.063 ms
progress: 4.0 s, 44.0 tps, lat 14.429 ms stddev 4.715, 0 failed, lag 0.063 ms
progress: 5.0 s, 49.0 tps, lat 18.111 ms stddev 8.536, 0 failed, lag 0.062 ms
progress: 6.0 s, 58.0 tps, lat 17.929 ms stddev 8.198, 0 failed, lag 0.060 ms
progress: 7.0 s, 65.0 tps, lat 20.186 ms stddev 12.973, 0 failed, lag 0.059 ms
progress: 8.0 s, 47.0 tps, lat 16.174 ms stddev 6.508, 0 failed, lag 0.061 ms
progress: 9.0 s, 45.0 tps, lat 14.485 ms stddev 4.736, 0 failed, lag 0.061 ms
progress: 10.0 s, 53.0 tps, lat 16.879 ms stddev 8.783, 0 failed, lag 0.061 ms
progress: 11.0 s, 42.0 tps, lat 13.711 ms stddev 4.464, 0 failed, lag 0.062 ms
progress: 12.0 s, 49.0 tps, lat 13.252 ms stddev 4.082, 0 failed, lag 0.062 ms
progress: 13.0 s, 48.0 tps, lat 14.179 ms stddev 6.238, 0 failed, lag 0.058 ms
progress: 14.0 s, 43.0 tps, lat 12.210 ms stddev 2.993, 0 failed, lag 0.060 ms
progress: 15.0 s, 34.0 tps, lat 14.811 ms stddev 6.544, 0 failed, lag 0.062 ms
transaction type: <builtin: TPC-B (sort of)>
[..]
latency average = 15.454 ms
latency stddev = 7.354 ms
session2 output (notice the INSERT took much longer but did NOT affect
the pgbench's stddev at all):
postgres=# set synchronous_commit_flush_wal_after = 256;
SET
postgres=# \i insertBIG.sql
Timing is on.
[..]
INSERT 0 500000
Time: 22737.729 ms (00:22.738)
Without this feature (or with synchronous_commit_flush_wal_after=0)
the TCP SendQ on the walsender-->walreceiver socket grows, so any
subsequent sendto() by OLTP backends/walwriter gets queued behind too
much data, causing stalls of activity.
-Jakub Wartak.
Attachments:
0001-GUC-synchronous_commit_flush_wal_after.patch (application/octet-stream)
Hi,
On 2023-01-25 14:32:51 +0100, Jakub Wartak wrote:
In other words it allows slow down of any backend activity. Any feedback on
such a feature is welcome, including better GUC name proposals ;) and
conditions in which such feature should be disabled even if it would be
enabled globally (right now only anti- wraparound VACUUM comes to mind, it's
not in the patch).
Such a feature could be useful - but I don't think the current place of
throttling has any hope of working reliably:
@@ -1021,6 +1025,21 @@ XLogInsertRecord(XLogRecData *rdata,
 	pgWalUsage.wal_bytes += rechdr->xl_tot_len;
 	pgWalUsage.wal_records++;
 	pgWalUsage.wal_fpi += num_fpi;
+
+	backendWalInserted += rechdr->xl_tot_len;
+
+	if ((synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_APPLY ||
+		 synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_WRITE) &&
+		synchronous_commit_flush_wal_after > 0 &&
+		backendWalInserted > synchronous_commit_flush_wal_after * 1024L)
+	{
+		elog(DEBUG3, "throttling WAL down on this session (backendWalInserted=%d)", backendWalInserted);
+		XLogFlush(EndPos);
+		/* XXX: refactor SyncRepWaitForLSN() to have different waitevent than default WAIT_EVENT_SYNC_REP */
+		/* maybe new WAIT_EVENT_SYNC_REP_BIG or something like that */
+		SyncRepWaitForLSN(EndPos, false);
+		elog(DEBUG3, "throttling WAL down on this session - end");
+		backendWalInserted = 0;
+	}
 }
You're blocking in the middle of an XLOG insertion. We will commonly hold
important buffer lwlocks, it'll often be in a critical section (no cancelling
/ terminating the session!). This'd entail loads of undetectable deadlocks
(i.e. hard hangs). And even leaving that aside, doing an unnecessary
XLogFlush() with important locks held will seriously increase contention.
My best idea for how to implement this in a somewhat safe way would be for
XLogInsertRecord() to set a flag indicating that we should throttle, and set
InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed to
proceed (i.e. we'll not be in a critical / interrupts off section) can
actually perform the delay. That should fix the hard deadlock danger and
remove most of the increase in lock contention.
I don't think doing an XLogFlush() of a record that we just wrote is a good
idea - that'll sometimes significantly increase the number of fdatasyncs/sec
we're doing. To make matters worse, this will often cause partially filled WAL
pages to be flushed out - rewriting a WAL page multiple times can
significantly increase overhead on the storage level. At the very least this'd
have to flush only up to the last fully filled page.
Just counting the number of bytes inserted by a backend will make the overhead
even worse, as the flush will be triggered even for OLTP sessions doing tiny
transactions, even though they don't contribute to the problem you're trying
to address. How about counting how many bytes of WAL a backend has inserted
since the last time that backend did an XLogFlush()?
A bulk writer won't do a lot of XLogFlush()es, so the time/bytes since the
last XLogFlush() will increase quickly, triggering a flush at the next
opportunity. But short OLTP transactions will do XLogFlush()es at least at
every commit.
I also suspect the overhead will be more manageable if you were to force a
flush only up to a point further back than the last fully filled page. We
don't want to end up flushing WAL for every page, but if you just have a
backend-local accounting mechanism, I think that's inevitably what you'd end
up with when you have a large number of sessions. But if you'd limit the
flushing to be done to synchronous_commit_flush_wal_after / 2 boundaries, and
only ever to a prior boundary, the amount of unnecessary WAL flushes would be
proportional to synchronous_commit_flush_wal_after.
Greetings,
Andres Freund
On Thu, Jan 26, 2023 at 12:35 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2023-01-25 14:32:51 +0100, Jakub Wartak wrote:
In other words it allows slow down of any backend activity. Any feedback on
such a feature is welcome, including better GUC name proposals ;) and
conditions in which such feature should be disabled even if it would be
enabled globally (right now only anti- wraparound VACUUM comes to mind, it's
not in the patch).
Such a feature could be useful - but I don't think the current place of
throttling has any hope of working reliably:
+1 for the feature as it keeps replication lag in check and provides a
way for defining RPO (recovery point objective).
You're blocking in the middle of an XLOG insertion. We will commonly hold
important buffer lwlocks, it'll often be in a critical section (no cancelling
/ terminating the session!). This'd entail loads of undetectable deadlocks
(i.e. hard hangs). And even leaving that aside, doing an unnecessary
XLogFlush() with important locks held will seriously increase contention.
My best idea for how to implement this in a somewhat safe way would be for
XLogInsertRecord() to set a flag indicating that we should throttle, and set
InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed to
proceed (i.e. we'll not be in a critical / interrupts off section) can
actually perform the delay. That should fix the hard deadlock danger and
remove most of the increase in lock contention.
We've discussed this feature quite extensively in a recent thread -
/messages/by-id/CAHg+QDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw@mail.gmail.com.
Folks were agreeing to Andres' suggestion there -
/messages/by-id/20220105174643.lozdd3radxv4tlmx@alap3.anarazel.de.
However, that thread got lost from my radar. It's good that someone
else started working on it and I'm happy to help with this feature.
I don't think doing an XLogFlush() of a record that we just wrote is a good
idea - that'll sometimes significantly increase the number of fdatasyncs/sec
we're doing. To make matters worse, this will often cause partially filled WAL
pages to be flushed out - rewriting a WAL page multiple times can
significantly increase overhead on the storage level. At the very least this'd
have to flush only up to the last fully filled page.
Just counting the number of bytes inserted by a backend will make the overhead
even worse, as the flush will be triggered even for OLTP sessions doing tiny
transactions, even though they don't contribute to the problem you're trying
to address.
Right.
How about counting how many bytes of WAL a backend has inserted
since the last time that backend did an XLogFlush()?
Seems reasonable.
A bulk writer won't do a lot of XLogFlush()es, so the time/bytes since the
last XLogFlush() will increase quickly, triggering a flush at the next
opportunity. But short OLTP transactions will do XLogFlush()es at least at
every commit.
Right.
I also suspect the overhead will be more manageable if you were to force a
flush only up to a point further back than the last fully filled page. We
don't want to end up flushing WAL for every page, but if you just have a
backend-local accounting mechanism, I think that's inevitably what you'd end
up with when you have a large number of sessions. But if you'd limit the
flushing to be done to synchronous_commit_flush_wal_after / 2 boundaries, and
only ever to a prior boundary, the amount of unnecessary WAL flushes would be
proportional to synchronous_commit_flush_wal_after.
When the WAL records are large, the backend inserting them
would throttle frequently, generating more flushes, more waits
for the sync replication ack, and more contention on WALWriteLock. Not
sure if this is good unless the impact is measured.
Few more thoughts:
1. This feature keeps replication lag in check at the cost of
throttling write traffic. I'd be curious to know improvement in
replication lag vs drop in transaction throughput, say pgbench with a
custom insert script and one or more async/sync standbys.
2. So, heavy WAL throttling can turn into a lot of writes and fsyncs.
Eventually, each backend gets a chance to throttle WAL if it ever
generates WAL irrespective of whether there's a replication lag or
not. How about we let backends throttle themselves not just based on
wal_distance_from_last_flush but also depending on the replication lag
at the moment, say, if replication lag crosses
wal_throttle_replication_lag_threshold bytes?
3. Should the backends wait indefinitely for the sync rep ack once they
have crossed wal_throttle_threshold? Well, I don't think so; a backend
must be given a chance to do its own work instead of favouring other
backends by throttling itself.
4. I'd prefer adding a TAP test for this feature to check if the WAL
throttle is picked up by a backend.
5. Can we also extend this WAL throttling feature to logical
replication to keep replication lag in check there as well?
6. Backends can ignore throttling for WAL records marked as unimportant, no?
7. I think we should not let backends throttle too frequently even
though they have crossed wal_throttle_threshold bytes. The best way is
to rely on replication lag; after all, the goal of this feature is to
keep replication lag in check - say, throttle only when
wal_distance > wal_throttle_threshold && replication_lag >
wal_throttle_replication_lag_threshold.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 1/25/23 20:05, Andres Freund wrote:
Hi,
On 2023-01-25 14:32:51 +0100, Jakub Wartak wrote:
In other words it allows slow down of any backend activity. Any feedback on
such a feature is welcome, including better GUC name proposals ;) and
conditions in which such feature should be disabled even if it would be
enabled globally (right now only anti- wraparound VACUUM comes to mind, it's
not in the patch).
Such a feature could be useful - but I don't think the current place of
throttling has any hope of working reliably:
[..]
You're blocking in the middle of an XLOG insertion. We will commonly hold
important buffer lwlocks, it'll often be in a critical section (no cancelling
/ terminating the session!). This'd entail loads of undetectable deadlocks
(i.e. hard hangs). And even leaving that aside, doing an unnecessary
XLogFlush() with important locks held will seriously increase contention.
Yeah, I agree the sleep would have to happen elsewhere.
It's not clear to me how could it cause deadlocks, as we're not waiting
for a lock/resource locked by someone else, but it's certainly an issue
for uninterruptible hangs.
My best idea for how to implement this in a somewhat safe way would be for
XLogInsertRecord() to set a flag indicating that we should throttle, and set
InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed to
proceed (i.e. we'll not be in a critical / interrupts off section) can
actually perform the delay. That should fix the hard deadlock danger and
remove most of the increase in lock contention.
The solution I've imagined is something like autovacuum throttling - do
some accounting of how much "WAL bandwidth" each process consumed, and
then do the delay/sleep in a suitable place.
I don't think doing an XLogFlush() of a record that we just wrote is a good
idea - that'll sometimes significantly increase the number of fdatasyncs/sec
we're doing. To make matters worse, this will often cause partially filled WAL
pages to be flushed out - rewriting a WAL page multiple times can
significantly increase overhead on the storage level. At the very least this'd
have to flush only up to the last fully filled page.
I don't see why that would significantly increase the number of flushes.
The goal is to do this only every ~1MB of WAL or so, and chances are
there were many (perhaps hundreds or more) commits in between. At least
in a workload with a fair number of OLTP transactions (which is kinda
the point of all this).
And the "small" OLTP transactions don't really do any extra fsyncs,
because the accounting resets at commit (or places that flush WAL).
Same for the flushes of partially flushed pages - if there are enough
small OLTP transactions, we already have this issue. I don't see
why this would make it measurably worse. But if needed, we can simply
backoff to the last page boundary, so that we only flush the complete
page. That would work too - the goal is not to flush everything, but to
reduce how much of the lag is due to the current process (i.e. to wait
for sync replica to catch up).
Just counting the number of bytes inserted by a backend will make the overhead
even worse, as the flush will be triggered even for OLTP sessions doing tiny
transactions, even though they don't contribute to the problem you're trying
to address. How about counting how many bytes of WAL a backend has inserted
since the last time that backend did an XLogFlush()?
No, we should reset the counter at commit, so small OLTP transactions
should not really trigger this. That's kinda the point - to only
delay "large" transactions producing a lot of WAL.
A bulk writer won't do a lot of XLogFlush()es, so the time/bytes since the
last XLogFlush() will increase quickly, triggering a flush at the next
opportunity. But short OLTP transactions will do XLogFlush()es at least at
every commit.
Right, that's kinda the approach the patch is trying to do.
I also suspect the overhead will be more manageable if you were to force a
flush only up to a point further back than the last fully filled page. We
don't want to end up flushing WAL for every page, but if you just have a
backend-local accounting mechanism, I think that's inevitably what you'd end
up with when you have a large number of sessions. But if you'd limit the
flushing to be done to synchronous_commit_flush_wal_after / 2 boundaries, and
only ever to a prior boundary, the amount of unnecessary WAL flushes would be
proportional to synchronous_commit_flush_wal_after.
True, that's kinda what I suggested above as a solution for partially
filled WAL pages.
I agree this is crude and we'd probably need some sort of "balancing"
logic that distributes WAL bandwidth between backends, and throttles
backends producing a lot of WAL.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 1/25/23 20:05, Andres Freund wrote:
Hi,
Such a feature could be useful - but I don't think the current place of
throttling has any hope of working reliably:
[..]
You're blocking in the middle of an XLOG insertion.
[..]
Yeah, I agree the sleep would have to happen elsewhere.
Fixed.
My best idea for how to implement this in a somewhat safe way would be for
XLogInsertRecord() to set a flag indicating that we should throttle, and set
InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed to
proceed (i.e. we'll not be in a critical / interrupts off section) can
actually perform the delay. That should fix the hard deadlock danger and
remove most of the increase in lock contention.
The solution I've imagined is something like autovacuum throttling - do
some accounting of how much "WAL bandwidth" each process consumed, and
then do the delay/sleep in a suitable place.
By the time you replied, I had already tried what Andres recommended.
[..]
At the very least this'd
have to flush only up to the last fully filled page.
Same for the flushes of partially flushed pages - if there's enough of
small OLTP transactions, we're already having this issue. I don't see
why would this make it measurably worse. But if needed, we can simply
backoff to the last page boundary, so that we only flush the complete
page. That would work too - the goal is not to flush everything, but to
reduce how much of the lag is due to the current process (i.e. to wait
for sync replica to catch up).
I've introduced the rounding to the last written page (hopefully).
Just counting the number of bytes inserted by a backend will make the overhead
even worse, as the flush will be triggered even for OLTP sessions doing tiny
transactions, even though they don't contribute to the problem you're trying
to address. How about counting how many bytes of WAL a backend has inserted
since the last time that backend did an XLogFlush()?
No, we should reset the counter at commit, so small OLTP transactions
should not reach really trigger this. That's kinda the point, to only
delay "large" transactions producing a lot of WAL.
Fixed.
I also suspect the overhead will be more manageable if you were to force a
flush only up to a point further back than the last fully filled page. We
don't want to end up flushing WAL for every page, but if you just have a
backend-local accounting mechanism, I think that's inevitably what you'd end
up with when you have a large number of sessions. But if you'd limit the
flushing to be done to synchronous_commit_flush_wal_after / 2 boundaries, and
only ever to a prior boundary, the amount of unnecessary WAL flushes would be
proportional to synchronous_commit_flush_wal_after.
True, that's kinda what I suggested above as a solution for partially
filled WAL pages.
I agree this is crude and we'd probably need some sort of "balancing"
logic that distributes WAL bandwidth between backends, and throttles
backends producing a lot of WAL.
OK - that's not included (yet?), as it would make this much more complex.
In summary: Attached is a slightly reworked version of this patch.
1. Moved logic outside XLogInsertRecord() under ProcessInterrupts()
2. Flushes up to the last page boundary, still uses SyncRepWaitForLSN()
3. Removed GUC for now (always on->256kB); potentially to be restored
4. Resets backendWal counter on commits
It's still crude, but first tests indicate that it behaves similarly
(to the initial one with GUC = 256kB).
Also, from Bharath's email I found another patch proposal by
Simon [1], and I would like to avoid opening that Pandora's box again,
but to stress this: this feature is specifically aimed at solving OLTP
latency on *sync* rep. (Some of the code could be generalized later on
and the feature extended to the async case, but that then opens the
question of managing a WAL throughput budget / back-throttling - this
was also raised in the first thread [2] by Konstantin.)
IMHO it could implement various strategies under user-settable GUC
"wal_throttle_larger_transactions=(sync,256kB)" or
"wal_throttle_larger_transactions=off" , and someday later someone
could teach the code the async case (let's say under
"wal_throttle_larger_transactions=(asyncMaxRPO, maxAllowedLag=8MB,
256kB)"). Thoughts?
[1]: /messages/by-id/CA+U5nMLfxBgHQ1VLSeBHYEMjHXz_OHSkuFdU6_1quiGM0RNKEg@mail.gmail.com
[2]: /messages/by-id/71f3e6fb-2fca-a798-856a-f23c8ede2333@garret.ru
Attachments:
0001-WIP-Syncrep-and-improving-latency-due-to-WAL-throttl.patch (application/octet-stream)
Hi,
On 2023-01-26 12:08:16 +0100, Tomas Vondra wrote:
It's not clear to me how could it cause deadlocks, as we're not waiting
for a lock/resource locked by someone else, but it's certainly an issue
for uninterruptible hangs.
Maybe not. But I wouldn't want to bet on it. It's a violation of all kinds of
lock-ordering rules.
My best idea for how to implement this in a somewhat safe way would be for
XLogInsertRecord() to set a flag indicating that we should throttle, and set
InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed to
proceed (i.e. we'll not be in a critical / interrupts off section) can
actually perform the delay. That should fix the hard deadlock danger and
remove most of the increase in lock contention.
The solution I've imagined is something like autovacuum throttling - do
some accounting of how much "WAL bandwidth" each process consumed, and
then do the delay/sleep in a suitable place.
Where would such a point be, except for ProcessInterrupts()? Iteratively
adding a separate set of "wal_delay()" points all over the executor,
commands/, ... seems like a bad plan.
I don't think doing an XLogFlush() of a record that we just wrote is a good
idea - that'll sometimes significantly increase the number of fdatasyncs/sec
we're doing. To make matters worse, this will often cause partially filled WAL
pages to be flushed out - rewriting a WAL page multiple times can
significantly increase overhead on the storage level. At the very least this'd
have to flush only up to the last fully filled page.
I don't see why would that significantly increase the number of flushes.
The goal is to do this only every ~1MB of a WAL or so, and chances are
there were many (perhaps hundreds or more) commits in between. At least
in a workload with a fair number of OLTP transactions (which is kinda
the point of all this).
Because the accounting is done separately in each process. Even if you just
add a few additional flushes for each connection, in aggregate that'll be a
lot.
And the "small" OLTP transactions don't really do any extra fsyncs,
because the accounting resets at commit (or places that flush WAL).
[...]
Just counting the number of bytes inserted by a backend will make the overhead
even worse, as the flush will be triggered even for OLTP sessions doing tiny
transactions, even though they don't contribute to the problem you're trying
to address. How about counting how many bytes of WAL a backend has inserted
since the last time that backend did an XLogFlush()?
No, we should reset the counter at commit, so small OLTP transactions
should not reach really trigger this. That's kinda the point, to only
delay "large" transactions producing a lot of WAL.
I might have missed something, but I don't think the first version of patch
resets the accounting at commit?
Same for the flushes of partially flushed pages - if there's enough of
small OLTP transactions, we're already having this issue. I don't see
why would this make it measurably worse.
Yes, we already have that problem, and it hurts. *Badly*. I don't see how v1
could *not* make it worse. Suddenly you get a bunch of additional XLogFlush()
calls to partial boundaries by every autovacuum, by every process doing a bit
more bulky work. Because you're flushing the LSN after a record you just
inserted, you're commonly not going to be "joining" the work of an already
in-process XLogFlush().
But if needed, we can simply backoff to the last page boundary, so that we
only flush the complete page. That would work too - the goal is not to flush
everything, but to reduce how much of the lag is due to the current process
(i.e. to wait for sync replica to catch up).
Yes, as I proposed...
I also suspect the overhead will be more manageable if you were to force a
flush only up to a point further back than the last fully filled page. We
don't want to end up flushing WAL for every page, but if you just have a
backend-local accounting mechanism, I think that's inevitably what you'd end
up with when you have a large number of sessions. But if you'd limit the
flushing to be done to synchronous_commit_flush_wal_after / 2 boundaries, and
only ever to a prior boundary, the amount of unnecessary WAL flushes would be
proportional to synchronous_commit_flush_wal_after.
True, that's kinda what I suggested above as a solution for partially
filled WAL pages.
I think flushing only up to a point further in the past than the last page
boundary is somewhat important to prevent unnecessary flushes.
Greetings,
Andres Freund
Hi,
On 2023-01-26 14:40:56 +0100, Jakub Wartak wrote:
In summary: Attached is a slightly reworked version of this patch.
1. Moved logic outside XLogInsertRecord() under ProcessInterrupts()
2. Flushes up to the last page boundary, still uses SyncRepWaitForLSN()
3. Removed GUC for now (always on->256kB); potentially to be restored
Huh? Why did you remove the GUC? Clearly this isn't something we can default
to on.
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d85e313908..05d56d65f9 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2395,6 +2395,7 @@ CommitTransaction(void)
 	XactTopFullTransactionId = InvalidFullTransactionId;
 	nParallelCurrentXids = 0;
+	backendWalInserted = 0;

 	/*
 	 * done with commit processing, set current transaction state back to
I don't like the resets around like this. Why not just move it into
XLogFlush()?
+/*
+ * Called from ProcessMessageInterrupts() to avoid waiting while being in critical section
+ */
+void HandleXLogDelayPending()
+{
+	/* flush only up to the last fully filled page */
+	XLogRecPtr LastFullyWrittenXLogPage = XactLastRecEnd - (XactLastRecEnd % XLOG_BLCKSZ);
+	XLogDelayPending = false;
Hm - wonder if it'd be better to determine the LSN to wait for in
XLogInsertRecord(). There'll be plenty cases where we'll insert multiple (but
bounded number of) WAL records before processing interrupts. No need to flush
more aggressively than necessary...
+	//HOLD_INTERRUPTS();
+
+	/* XXX Debug for now */
+	elog(WARNING, "throttling WAL down on this session (backendWalInserted=%d, LSN=%X/%X flushingTo=%X/%X)",
+		backendWalInserted,
+		LSN_FORMAT_ARGS(XactLastRecEnd),
+		LSN_FORMAT_ARGS(LastFullyWrittenXLogPage));
+
+	/* XXX: refactor SyncRepWaitForLSN() to have different waitevent than default WAIT_EVENT_SYNC_REP */
+	/* maybe new WAIT_EVENT_SYNC_REP_BIG or something like that */
+	XLogFlush(LastFullyWrittenXLogPage);
+	SyncRepWaitForLSN(LastFullyWrittenXLogPage, false);
SyncRepWaitForLSN() has this comment:
* 'lsn' represents the LSN to wait for. 'commit' indicates whether this LSN
* represents a commit record. If it doesn't, then we wait only for the WAL
* to be flushed if synchronous_commit is set to the higher level of
* remote_apply, because only commit records provide apply feedback.
+	elog(WARNING, "throttling WAL down on this session - end");
+	backendWalInserted = 0;
+
+	//RESUME_INTERRUPTS();
+}
I think we'd want a distinct wait event for this.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 1b1d814254..8ed66b2eae 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -37,6 +37,7 @@ volatile sig_atomic_t IdleSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
 volatile sig_atomic_t LogMemoryContextPending = false;
 volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
+volatile sig_atomic_t XLogDelayPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
I don't think the new variable needs to be volatile, or even
sig_atomic_t. We'll just manipulate it from outside signal handlers.
Greetings,
Andres Freund
Hi,
On 2023-01-26 13:33:27 +0530, Bharath Rupireddy wrote:
6. Backends can ignore throttling for WAL records marked as unimportant, no?
Why would that be a good idea? Not that it matters today, but those records
still need to be flushed in case of a commit by another transaction.
7. I think we need to not let backends throttle too frequently even
though they have crossed wal_throttle_threshold bytes. The best way is
to rely on replication lag, after all the goal of this feature is to
keep replication lag under check - say, throttle only when
wal_distance > wal_throttle_threshold && replication_lag >
wal_throttle_replication_lag_threshold.
I think my idea of only forcing to flush/wait an LSN some distance in the past
would automatically achieve that?
Greetings,
Andres Freund
On 1/26/23 16:40, Andres Freund wrote:
Hi,
On 2023-01-26 12:08:16 +0100, Tomas Vondra wrote:
It's not clear to me how could it cause deadlocks, as we're not waiting
for a lock/resource locked by someone else, but it's certainly an issue
for uninterruptible hangs.

Maybe not. But I wouldn't want to bet on it. It's a violation of all kinds of
lock-ordering rules.
Not sure what lock ordering issues you have in mind, but I agree it's
not the right place for the sleep, no argument here.
My best idea for how to implement this in a somewhat safe way would be for
XLogInsertRecord() to set a flag indicating that we should throttle, and set
InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed to
proceed (i.e. we'll not be in a critical / interrupts off section) can
actually perform the delay. That should fix the hard deadlock danger and
remove most of the increase in lock contention.

The solution I've imagined is something like autovacuum throttling - do
some accounting of how much "WAL bandwidth" each process consumed, and
then do the delay/sleep in a suitable place.

Where would such a point be, except for ProcessInterrupts()? Iteratively
adding a separate set of "wal_delay()" points all over the executor,
commands/, ... seems like a bad plan.
I haven't thought about where to do the work, TBH. ProcessInterrupts()
may very well be a good place.
I should have been clearer, but the main benefit of autovacuum-like
throttling is IMHO that it involves all processes and a global limit,
while the current approach is per-backend.
I don't think doing an XLogFlush() of a record that we just wrote is a good
idea - that'll sometimes significantly increase the number of fdatasyncs/sec
we're doing. To make matters worse, this will often cause partially filled WAL
pages to be flushed out - rewriting a WAL page multiple times can
significantly increase overhead on the storage level. At the very least this'd
have to flush only up to the last fully filled page.

I don't see why would that significantly increase the number of flushes.
The goal is to do this only every ~1MB of a WAL or so, and chances are
there were many (perhaps hundreds or more) commits in between. At least
in a workload with a fair number of OLTP transactions (which is kinda
the point of all this).

Because the accounting is done separately in each process. Even if you just
add a few additional flushes for each connection, in aggregate that'll be a
lot.
How come? Imagine the backend does flush only after generating e.g. 1MB
of WAL. Small transactions won't do any additional flushes at all
(because commit resets the accounting). Large transactions will do an
extra flush every 1MB, so 16x per WAL segment. But in between there will
be many commits from the small transactions. If we backoff to the last
complete page, that should eliminate even most of these flushes.
So where would the additional flushes come from?
And the "small" OLTP transactions don't really do any extra fsyncs,
because the accounting resets at commit (or places that flush WAL).
[...]

Just counting the number of bytes inserted by a backend will make the overhead
even worse, as the flush will be triggered even for OLTP sessions doing tiny
transactions, even though they don't contribute to the problem you're trying
to address. How about counting how many bytes of WAL a backend has inserted
since the last time that backend did an XLogFlush()?

No, we should reset the counter at commit, so small OLTP transactions
should not really trigger this. That's kinda the point, to only
delay "large" transactions producing a lot of WAL.

I might have missed something, but I don't think the first version of patch
resets the accounting at commit?
Yeah, it doesn't. It's mostly a minimal PoC patch, sufficient to test
the behavior / demonstrate the effect.
Same for the flushes of partially flushed pages - if there's enough of
small OLTP transactions, we're already having this issue. I don't see
why would this make it measurably worse.

Yes, we already have that problem, and it hurts. *Badly*. I don't see how v1
could *not* make it worse. Suddenly you get a bunch of additional XLogFlush()
calls to partial boundaries by every autovacuum, by every process doing a bit
more bulky work. Because you're flushing the LSN after a record you just
inserted, you're commonly not going to be "joining" the work of an already
in-process XLogFlush().
Right. We do ~16 additional flushes per 16MB segment (or something like
that, depending on the GUC value). Considering we do thousand of commits
per segment, each of which does a flush, I don't see how would this make
it measurably worse.
But if needed, we can simply backoff to the last page boundary, so that we
only flush the complete page. That would work too - the goal is not to flush
everything, but to reduce how much of the lag is due to the current process
(i.e. to wait for sync replica to catch up).

Yes, as I proposed...
Right.
I also suspect the overhead will be more manageable if you were to force a
flush only up to a point further back than the last fully filled page. We
don't want to end up flushing WAL for every page, but if you just have a
backend-local accounting mechanism, I think that's inevitably what you'd end
up with when you have a large number of sessions. But if you'd limit the
flushing to be done to synchronous_commit_flush_wal_after / 2 boundaries, and
only ever to a prior boundary, the amount of unnecessary WAL flushes would be
proportional to synchronous_commit_flush_wal_after.

True, that's kinda what I suggested above as a solution for partially
filled WAL pages.

I think flushing only up to a point further in the past than the last page
boundary is somewhat important to prevent unnecessary flushes.
Not sure I agree with that. Yes, we should not be doing flushes unless
we need to, but OTOH we should not delay sending WAL to standby too much
- because that's what affects syncrep latency for small transactions.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jan 26, 2023 at 9:21 PM Andres Freund <andres@anarazel.de> wrote:
7. I think we need to not let backends throttle too frequently even
though they have crossed wal_throttle_threshold bytes. The best way is
to rely on replication lag, after all the goal of this feature is to
keep replication lag under check - say, throttle only when
wal_distance > wal_throttle_threshold && replication_lag >
wal_throttle_replication_lag_threshold.

I think my idea of only forcing to flush/wait an LSN some distance in the past
would automatically achieve that?
I'm sorry, I couldn't get your point, can you please explain it a bit more?
Looking at the patch, the feature, in its current shape, focuses on
improving replication lag (by throttling WAL on the primary) only when
synchronous replication is enabled. Why is that? Why can't we design
it for replication in general (async, sync, and logical replication)?
Keeping replication lag under check enables one to provide a better
RPO guarantee as discussed in the other thread
/messages/by-id/CAHg+QDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw@mail.gmail.com.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 2023-Jan-27, Bharath Rupireddy wrote:
Looking at the patch, the feature, in its current shape, focuses on
improving replication lag (by throttling WAL on the primary) only when
synchronous replication is enabled. Why is that? Why can't we design
it for replication in general (async, sync, and logical replication)?

Keeping replication lag under check enables one to provide a better
RPO guarantee
Hmm, but you can already do that by tuning walwriter, no?
--
Álvaro Herrera
On Fri, Jan 27, 2023 at 2:03 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2023-Jan-27, Bharath Rupireddy wrote:
Looking at the patch, the feature, in its current shape, focuses on
improving replication lag (by throttling WAL on the primary) only when
synchronous replication is enabled. Why is that? Why can't we design
it for replication in general (async, sync, and logical replication)?

Keeping replication lag under check enables one to provide a better
RPO guarantee

Hmm, but you can already do that by tuning walwriter, no?
IIUC, one can increase wal_writer_delay or wal_writer_flush_after to
control the amount of WAL walwriter writes and flushes so that the
walsenders will get slower a bit thus improving replication lag. If my
understanding is correct, what if other backends doing WAL writes and
flushes? How do we control that?
I think tuning walwriter alone may not help to keep replication lag
under check. Even if it does, it requires manual intervention.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
v2 is attached.
On Thu, Jan 26, 2023 at 4:49 PM Andres Freund <andres@anarazel.de> wrote:
Huh? Why did you remove the GUC?
After reading previous threads, my optimism level of getting it ever
in shape of being widely accepted degraded significantly (mainly due
to the discussion of wider category of 'WAL I/O throttling' especially
in async case, RPO targets in async case and potentially calculating
global bandwidth). I've assumed that it is a working sketch, and as
such not having GUC name right now (just for sync case) would still
allow covering various other async cases in future without breaking
compatibility with potential name GUC changes (see my previous
"wal_throttle_larger_transactions=<strategies>" proposal). Although
I've restored the synchronous_commit_flush_wal_after GUC into v2
patch, sticking to such a name removes the way of using the code to
achieve async WAL throttling in future.
Clearly this isn't something we can default to on.
Yes, I agree. It's never meant to be on by default.
I don't like the resets around like this. Why not just move it into
XLogFlush()?
Fixed.
Hm - wonder if it'd be better to determine the LSN to wait for in
XLogInsertRecord(). There'll be plenty cases where we'll insert multiple (but
bounded number of) WAL records before processing interrupts. No need to flush
more aggressively than necessary...
Fixed.
SyncRepWaitForLSN() has this comment:
* 'lsn' represents the LSN to wait for. 'commit' indicates whether this LSN
* represents a commit record. If it doesn't, then we wait only for the WAL
* to be flushed if synchronous_commit is set to the higher level of
* remote_apply, because only commit records provide apply feedback.
Hm, not sure if I understand: are you saying that we should (in the
throttled scenario) have some special feedback msgs or not --
irrespective of the setting? To be honest the throttling shouldn't
wait for the standby full setting, it's just about slowdown fact (so
IMHO it would be fine even in remote_write/remote_apply scenario if
the remote walreceiver just received the data, not necessarily write
it into file or wait for applying it). Just this waiting for a
round-trip ack about LSN progress would be enough to slow down the
writer (?). I've added some timing log into the draft and it shows
more or less constantly solid RTT even as it stands:
psql:insertBIG.sql:2: WARNING: throttling WAL down on this session
(backendWalInserted=262632, LSN=2/A6CB70B8 flushingTo=2/A6CB6000)
psql:insertBIG.sql:2: WARNING: throttling WAL down on this session -
end (10.500052 ms)
psql:insertBIG.sql:2: WARNING: throttling WAL down on this session
(backendWalInserted=262632, LSN=2/A6CF7C08 flushingTo=2/A6CF6000)
psql:insertBIG.sql:2: WARNING: throttling WAL down on this session -
end (10.655370 ms)
psql:insertBIG.sql:2: WARNING: throttling WAL down on this session
(backendWalInserted=262632, LSN=2/A6D385E0 flushingTo=2/A6D38000)
psql:insertBIG.sql:2: WARNING: throttling WAL down on this session -
end (10.627334 ms)
psql:insertBIG.sql:2: WARNING: throttling WAL down on this session
(backendWalInserted=262632, LSN=2/A6D78FA0 flushingTo=2/A6D78000)
[..]
I think we'd want a distinct wait event for this.
Added and tested that it shows up.
+volatile sig_atomic_t XLogDelayPending = false;
I don't think the new variable needs to be volatile, or even
sig_atomic_t. We'll just manipulate it from outside signal handlers.
Changed to bool, previously I wanted it to "fit" the above ones.
-J.
Attachments:
v2-0001-WIP-Syncrep-and-improving-latency-due-to-WAL-thro.patch
Hi Bharath,
On Fri, Jan 27, 2023 at 12:04 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
On Fri, Jan 27, 2023 at 2:03 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2023-Jan-27, Bharath Rupireddy wrote:
Looking at the patch, the feature, in its current shape, focuses on
improving replication lag (by throttling WAL on the primary) only when
synchronous replication is enabled. Why is that? Why can't we design
it for replication in general (async, sync, and logical replication)?

Keeping replication lag under check enables one to provide a better
RPO guarantee
Sorry for not answering earlier; although the title of the thread goes
for the SyncRep-only I think everyone would be open to also cover the
more advanced async scenarios that Satyanarayana proposed in those
earlier threads (as just did Simon much earlier). I was proposing
wal_throttle_larger_transactions=<..> (for the lack of better name),
however v2 of the patch from today right now contains again reference
to syncrep (it could be changed of course). It's just the code that is
missing that could be also added on top of v2, so we could combine our
efforts. It's just the competency and time that I lack on how to
implement such async-scenario code-paths (maybe Tomas V. has something
in mind with his own words [1]) so also any feedback from senior
hackers is more than welcome ...
-J.
[1]: "The solution I've imagined is something like autovacuum
throttling - do some accounting of how much "WAL bandwidth" each
process consumed, and then do the delay/sleep in a suitable place."
On 1/27/23 08:18, Bharath Rupireddy wrote:
On Thu, Jan 26, 2023 at 9:21 PM Andres Freund <andres@anarazel.de> wrote:
7. I think we need to not let backends throttle too frequently even
though they have crossed wal_throttle_threshold bytes. The best way is
to rely on replication lag, after all the goal of this feature is to
keep replication lag under check - say, throttle only when
wal_distance > wal_throttle_threshold && replication_lag >
wal_throttle_replication_lag_threshold.

I think my idea of only forcing to flush/wait an LSN some distance in the past
would automatically achieve that?

I'm sorry, I couldn't get your point, can you please explain it a bit more?
The idea is that we would not flush the exact current LSN, because
that's likely somewhere in the page, and we always write the whole page
which leads to write amplification.
But if we backed off a bit, and wrote e.g. to the last page boundary,
that wouldn't have this issue (either the page was already flushed -
noop, or we'd have to flush it anyway).
We could even back off a bit more, to increase the probability it was
actually flushed / sent to standby. That would still work, because the
whole point is not to allow one process to generate too much unflushed
WAL, forcing the other (small) xacts to wait at commit.
Imagine we have the limit set to 8MB, i.e. the backend flushes WAL after
generating 8MB of WAL. If we flush to the exact current LSN, the other
backends will wait for ~4MB on average. If we back off to 1MB, the wait
average increases to ~4.5MB. (This is simplified, as it ignores WAL from
the small xacts. But those flush regularly, which limit the amount. It
also ignores there might be multiple large xacts.)
Looking at the patch, the feature, in its current shape, focuses on
improving replication lag (by throttling WAL on the primary) only when
synchronous replication is enabled. Why is that? Why can't we design
it for replication in general (async, sync, and logical replication)?
This focuses on sync rep, because that's where the commit latency comes
from. Async doesn't have that issue, because it doesn't wait for the
standby.
In particular, the trick is in penalizing the backends generating a lot
of WAL, while leaving the small xacts alone.
Keeping replication lag under check enables one to provide a better
RPO guarantee as discussed in the other thread
/messages/by-id/CAHg+QDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw@mail.gmail.com.
Isn't that a bit over-complicated? RPO generally only cares about xacts
that committed (because that's what you want to not lose), so why not to
simply introduce a "sync mode" that simply uses a bit older LSN when
waiting for the replica? Seems much simpler and similar to what we
already do.
Yeah, if someone generates a lot of WAL in uncommitted transaction, all
of that may be lost. But who cares (from the RPO point of view)?
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2023-01-27 12:06:49 +0100, Jakub Wartak wrote:
On Thu, Jan 26, 2023 at 4:49 PM Andres Freund <andres@anarazel.de> wrote:
Huh? Why did you remove the GUC?
After reading previous threads, my optimism level of getting it ever
in shape of being widely accepted degraded significantly (mainly due
to the discussion of wider category of 'WAL I/O throttling' especially
in async case, RPO targets in async case and potentially calculating
global bandwidth).
I think it's quite reasonable to limit this to a smaller scope. Particularly
because those other goals are pretty vague but ambitious goals. IMO the
problem with a lot of the threads is precisely that they aimed at a level
of generality that isn't achievable in one step.
I've assumed that it is a working sketch, and as such not having GUC name
right now (just for sync case) would still allow covering various other
async cases in future without breaking compatibility with potential name GUC
changes (see my previous "wal_throttle_larger_transactions=<strategies>"
proposal ).
It's harder to benchmark something like this without a GUC, so I think it's
worth having, even if it's not the final name.
SyncRepWaitForLSN() has this comment:
* 'lsn' represents the LSN to wait for. 'commit' indicates whether this LSN
* represents a commit record. If it doesn't, then we wait only for the WAL
* to be flushed if synchronous_commit is set to the higher level of
* remote_apply, because only commit records provide apply feedback.

Hm, not sure if I understand: are you saying that we should (in the
throttled scenario) have some special feedback msgs or not --
irrespective of the setting? To be honest the throttling shouldn't
wait for the standby full setting, it's just about slowdown fact (so
IMHO it would be fine even in remote_write/remote_apply scenario if
the remote walreceiver just received the data, not necessarily write
it into file or wait for applying it). Just this waiting for a
round-trip ack about LSN progress would be enough to slow down the
writer (?). I've added some timing log into the draft and it shows
more or less constantly solid RTT even as it stands:
My problem was that the function header for SyncRepWaitForLSN() seems to say
that we don't wait at all if commit=false and synchronous_commit <
remote_apply. But I think that might just be bad formulation.
[...] 'commit' indicates whether this LSN
* represents a commit record. If it doesn't, then we wait only for the WAL
* to be flushed if synchronous_commit is set to the higher level of
* remote_apply, because only commit records provide apply feedback.
because the code does something entirely different afaics:
/* Cap the level for anything other than commit to remote flush only. */
if (commit)
mode = SyncRepWaitMode;
else
mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);
Greetings,
Andres Freund
Hi,
On 2023-01-27 12:48:43 +0530, Bharath Rupireddy wrote:
Looking at the patch, the feature, in its current shape, focuses on
improving replication lag (by throttling WAL on the primary) only when
synchronous replication is enabled. Why is that? Why can't we design
it for replication in general (async, sync, and logical replication)?

Keeping replication lag under check enables one to provide a better
RPO guarantee as discussed in the other thread
/messages/by-id/CAHg+QDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw@mail.gmail.com.
I think something narrower and easier to achieve is a good thing. We've
already had loads of discussion for the more general problem, without a lot of
actual progress.
Greetings,
Andres Freund
Hi,
On 2023-01-27 21:45:16 +0100, Tomas Vondra wrote:
On 1/27/23 08:18, Bharath Rupireddy wrote:
I think my idea of only forcing to flush/wait an LSN some distance in the past
would automatically achieve that?

I'm sorry, I couldn't get your point, can you please explain it a bit more?
The idea is that we would not flush the exact current LSN, because
that's likely somewhere in the page, and we always write the whole page
which leads to write amplification.

But if we backed off a bit, and wrote e.g. to the last page boundary,
that wouldn't have this issue (either the page was already flushed -
noop, or we'd have to flush it anyway).
Yep.
We could even back off a bit more, to increase the probability it was
actually flushed / sent to standby.
That's not the sole goal, from my end: I'd like to avoid writing out +
flushing the WAL in too small chunks. Imagine a few concurrent vacuums or
COPYs or such - if we're unlucky they'd each end up exceeding their "private"
limit close to each other, leading to a number of small writes of the
WAL. Which could end up increasing local commit latency / iops.
If we instead decide to only ever flush up to something like
last_page_boundary - 1/8 * throttle_pages * XLOG_BLCKSZ
we'd make sure that the throttling mechanism won't cause a lot of small
writes.
Keeping replication lag under check enables one to provide a better
RPO guarantee as discussed in the other thread
/messages/by-id/CAHg+QDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw@mail.gmail.com.

Isn't that a bit over-complicated? RPO generally only cares about xacts
that committed (because that's what you want to not lose), so why not to
simply introduce a "sync mode" that simply uses a bit older LSN when
waiting for the replica? Seems much simpler and similar to what we
already do.
I don't think that really helps you that much. If there's e.g. a huge VACUUM /
COPY emitting loads of WAL you'll suddenly see commit latency of a
concurrently committing transactions spike into oblivion. Whereas a general
WAL throttling mechanism would throttle the VACUUM, without impacting the
commit latency of normal transactions.
Greetings,
Andres Freund
On 1/27/23 22:33, Andres Freund wrote:
Hi,
On 2023-01-27 21:45:16 +0100, Tomas Vondra wrote:
On 1/27/23 08:18, Bharath Rupireddy wrote:
I think my idea of only forcing to flush/wait an LSN some distance in the past
would automatically achieve that?

I'm sorry, I couldn't get your point, can you please explain it a bit more?
The idea is that we would not flush the exact current LSN, because
that's likely somewhere in the page, and we always write the whole page
which leads to write amplification.

But if we backed off a bit, and wrote e.g. to the last page boundary,
that wouldn't have this issue (either the page was already flushed -
noop, or we'd have to flush it anyway).

Yep.
We could even back off a bit more, to increase the probability it was
actually flushed / sent to standby.

That's not the sole goal, from my end: I'd like to avoid writing out +
flushing the WAL in too small chunks. Imagine a few concurrent vacuums or
COPYs or such - if we're unlucky they'd each end up exceeding their "private"
limit close to each other, leading to a number of small writes of the
WAL. Which could end up increasing local commit latency / iops.

If we instead decide to only ever flush up to something like

    last_page_boundary - 1/8 * throttle_pages * XLOG_BLCKSZ

we'd make sure that the throttling mechanism won't cause a lot of small
writes.
I'm not saying we shouldn't do this, but I still don't see how this
could make a measurable difference. At least assuming a sensible value
of the throttling limit (say, more than 256kB per backend), and OLTP
workload running concurrently. That means ~64 extra flushes/writes per
16MB segment (at most). Yeah, a particular page might get unlucky and be
flushed by multiple backends, but the average still holds. Meanwhile,
the OLTP transactions will generate (at least) an order of magnitude
more flushes.
Keeping replication lag under check enables one to provide a better
RPO guarantee as discussed in the other thread
/messages/by-id/CAHg+QDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw@mail.gmail.com.

Isn't that a bit over-complicated? RPO generally only cares about xacts
that committed (because that's what you want to not lose), so why not to
simply introduce a "sync mode" that simply uses a bit older LSN when
waiting for the replica? Seems much simpler and similar to what we
already do.

I don't think that really helps you that much. If there's e.g. a huge VACUUM /
COPY emitting loads of WAL you'll suddenly see commit latency of a
concurrently committing transactions spike into oblivion. Whereas a general
WAL throttling mechanism would throttle the VACUUM, without impacting the
commit latency of normal transactions.
True, but it solves the RPO goal which is what the other thread was about.
IMHO it's useful to look at this as a resource scheduling problem:
limited WAL bandwidth consumed by backends, with the bandwidth
distributed using some sort of policy.
The patch discussed in this thread uses fundamentally unfair policy,
with throttling applied only on backends that produce a lot of WAL. And
trying to leave the OLTP as unaffected as possible.
The RPO thread seems to be aiming for a "fair" policy, providing the
same fraction of bandwidth to all processes. This will affect all xacts
the same way (sleeps proportional to amount of WAL generated by the xact).
Perhaps we want such alternative scheduling policies, but it'll probably
require something like the autovacuum throttling.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 1/27/23 22:19, Andres Freund wrote:
Hi,
On 2023-01-27 12:06:49 +0100, Jakub Wartak wrote:
On Thu, Jan 26, 2023 at 4:49 PM Andres Freund <andres@anarazel.de> wrote:
Huh? Why did you remove the GUC?
After reading previous threads, my optimism level of getting it ever
in shape of being widely accepted degraded significantly (mainly due
to the discussion of wider category of 'WAL I/O throttling' especially
in async case, RPO targets in async case and potentially calculating
global bandwidth).

I think it's quite reasonable to limit this to a smaller scope. Particularly
because those other goals are pretty vague but ambitious goals. IMO the
problem with a lot of the threads is precisely that they aimed at a level
of generality that isn't achievable in one step.
+1 to that
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company