sequences vs. synchronous replication

Started by Tomas Vondra over 4 years ago. 38 messages (pgsql-hackers).
#1Tomas Vondra
tomas.vondra@2ndquadrant.com

Hi,

while working on logical decoding of sequences, I ran into an issue with
nextval() in a transaction that rolls back, described in [1]. But after
thinking about it a bit more (and chatting with Petr Jelinek), I think
this issue affects physical sync replication too.

Imagine you have a primary <-> sync_replica cluster, and you do this:

CREATE SEQUENCE s;

-- shutdown the sync replica

BEGIN;
SELECT nextval('s') FROM generate_series(1,50);
ROLLBACK;

BEGIN;
SELECT nextval('s');
COMMIT;

The natural expectation would be the COMMIT gets stuck, waiting for the
sync replica (which is not running), right? But it does not.

The problem is exactly the same as in [1] - the aborted transaction
generated WAL, but RecordTransactionAbort() ignores that and does not
update LogwrtResult.Write, with the reasoning that aborted transactions
do not matter. But sequences violate that, because we only write WAL
once every 32 increments, so the following nextval() gets "committed"
without waiting for the replica (because it did not produce WAL).

I'm not sure this is a clear data corruption bug, but it surely walks
and quacks like one. My proposal is to fix this by tracking the LSN of
the last WAL record for a sequence increment, and then checking that LSN
in RecordTransactionCommit() before calling XLogFlush().

regards

[1]: /messages/by-id/ae3cab67-c31e-b527-dd73-08f196999ad4@enterprisedb.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#1)
Re: sequences vs. synchronous replication

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

The problem is exactly the same as in [1] - the aborted transaction
generated WAL, but RecordTransactionAbort() ignores that and does not
update LogwrtResult.Write, with the reasoning that aborted transactions
do not matter. But sequences violate that, because we only write WAL
once every 32 increments, so the following nextval() gets "committed"
without waiting for the replica (because it did not produce WAL).

Ugh.

I'm not sure this is a clear data corruption bug, but it surely walks
and quacks like one. My proposal is to fix this by tracking the LSN of
the last WAL record for a sequence increment, and then checking that LSN
in RecordTransactionCommit() before calling XLogFlush().

(1) Does that work if the aborted increment was in a different
session? I think it is okay but I'm tired enough to not be sure.

(2) I'm starting to wonder if we should rethink the sequence logging
mechanism altogether. It was cool when designed, but it seems
really problematic when you start thinking about replication
behaviors. Perhaps if wal_level > minimal, we don't do things
the same way?

regards, tom lane

#3Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#2)
Re: sequences vs. synchronous replication

On 12/18/21 05:52, Tom Lane wrote:

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

The problem is exactly the same as in [1] - the aborted transaction
generated WAL, but RecordTransactionAbort() ignores that and does not
update LogwrtResult.Write, with the reasoning that aborted transactions
do not matter. But sequences violate that, because we only write WAL
once every 32 increments, so the following nextval() gets "committed"
without waiting for the replica (because it did not produce WAL).

Ugh.

I'm not sure this is a clear data corruption bug, but it surely walks
and quacks like one. My proposal is to fix this by tracking the LSN of
the last WAL record for a sequence increment, and then checking that LSN
in RecordTransactionCommit() before calling XLogFlush().

(1) Does that work if the aborted increment was in a different
session? I think it is okay but I'm tired enough to not be sure.

Good point - it doesn't :-( At least not by simply storing LSN in a
global variable or something like that.

The second backend needs to know the LSN of the last WAL-logged sequence
increment, but only the first backend knows that. So we'd need to share
that between backends somehow. I doubt we want to track LSN for every
individual sequence (because for clusters with many dbs / sequences that
may be a lot).

Perhaps we could track just a fixed number of LSN values in shared memory
(say, 1024), and update/read just the element determined by hash(oid).
That is, the backend WAL-logging sequence with given oid would set the
current LSN to array[hash(oid) % 1024], and backend doing nextval()
would simply remember the LSN in that slot. Yes, if there are conflicts
that'll flush more than needed.

Alternatively we could simply use the current insert LSN, but that's
going to flush more stuff than needed all the time.

(2) I'm starting to wonder if we should rethink the sequence logging
mechanism altogether. It was cool when designed, but it seems
really problematic when you start thinking about replication
behaviors. Perhaps if wal_level > minimal, we don't do things
the same way?

Maybe, but I have no idea how the reworked WAL logging should work. Any
batching seems to have this issue, and logging individual increments is
likely going to be slower.

Of course, reworking how sequences are WAL-logged may invalidate the
"sequence decoding" patch I've been working on :-(

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#3)
Re: sequences vs. synchronous replication

On 12/18/21 07:00, Tomas Vondra wrote:

On 12/18/21 05:52, Tom Lane wrote:

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

The problem is exactly the same as in [1] - the aborted transaction
generated WAL, but RecordTransactionAbort() ignores that and does not
update LogwrtResult.Write, with the reasoning that aborted transactions
do not matter. But sequences violate that, because we only write WAL
once every 32 increments, so the following nextval() gets "committed"
without waiting for the replica (because it did not produce WAL).

Ugh.

I'm not sure this is a clear data corruption bug, but it surely walks
and quacks like one. My proposal is to fix this by tracking the LSN of
the last WAL record for a sequence increment, and then checking that LSN
in RecordTransactionCommit() before calling XLogFlush().

(1) Does that work if the aborted increment was in a different
session?  I think it is okay but I'm tired enough to not be sure.

Good point - it doesn't :-( At least not by simply storing LSN in a
global variable or something like that.

The second backend needs to know the LSN of the last WAL-logged sequence
increment, but only the first backend knows that. So we'd need to share
that between backends somehow. I doubt we want to track LSN for every
individual sequence (because for clusters with many dbs / sequences that
may be a lot).

Perhaps we could track just a fixed number of LSN values in shared memory
(say, 1024), and update/read just the element determined by hash(oid).
That is, the backend WAL-logging sequence with given oid would set the
current LSN to array[hash(oid) % 1024], and backend doing nextval()
would simply remember the LSN in that slot. Yes, if there are conflicts
that'll flush more than needed.

Here's a PoC demonstrating this idea. I'm not convinced it's the right
way to deal with this - it surely seems more like a duct tape fix than a
clean solution. But it does the trick.

I wonder if storing this in shmem is good enough - we lose the LSN info
on restart, but the checkpoint should trigger FPI which makes it OK.

A bigger question is whether sequences are the only thing affected by
this. If you look at RecordTransactionCommit() then we skip flush/wait
in two cases:

1) !wrote_xlog - if the xact did not produce WAL

2) !markXidCommitted - if the xact does not have a valid XID

Both apply to sequences, and the PoC patch tweaks them. But maybe there
are other places where we don't generate WAL and/or assign XID in some
cases, to save time?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

sequences-track-lsn-poc.patch (text/x-patch, +162/-2)
#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#4)
Re: sequences vs. synchronous replication

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

Here's a PoC demonstrating this idea. I'm not convinced it's the right
way to deal with this - it surely seems more like a duct tape fix than a
clean solution. But it does the trick.

I was imagining something a whole lot simpler, like "don't try to
cache unused sequence numbers when wal_level > minimal". We've
accepted worse performance hits in that operating mode, and it'd
fix a number of user complaints we've seen about weird sequence
behavior on standbys.

regards, tom lane

#6Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#5)
Re: sequences vs. synchronous replication

On 12/18/21 22:27, Tom Lane wrote:

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

Here's a PoC demonstrating this idea. I'm not convinced it's the right
way to deal with this - it surely seems more like a duct tape fix than a
clean solution. But it does the trick.

I was imagining something a whole lot simpler, like "don't try to
cache unused sequence numbers when wal_level > minimal". We've
accepted worse performance hits in that operating mode, and it'd
fix a number of user complaints we've seen about weird sequence
behavior on standbys.

What do you mean by "not caching unused sequence numbers"? Reducing
SEQ_LOG_VALS to 1, i.e. WAL-logging every sequence increment?

That'd work, but I wonder how significant the impact will be. I'd bet
it hurts the patch adding logical decoding of sequences quite a bit.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#6)
Re: sequences vs. synchronous replication

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

What do you mean by "not caching unused sequence numbers"? Reducing
SEQ_LOG_VALS to 1, i.e. WAL-logging every sequence increment?

Right.

That'd work, but I wonder how significant the impact will be.

As I said, we've accepted worse in order to have stable replication
behavior.

regards, tom lane

#8Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#1)
Re: sequences vs. synchronous replication

On Sat, Dec 18, 2021 at 7:24 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

while working on logical decoding of sequences, I ran into an issue with
nextval() in a transaction that rolls back, described in [1]. But after
thinking about it a bit more (and chatting with Petr Jelinek), I think
this issue affects physical sync replication too.

Imagine you have a primary <-> sync_replica cluster, and you do this:

CREATE SEQUENCE s;

-- shutdown the sync replica

BEGIN;
SELECT nextval('s') FROM generate_series(1,50);
ROLLBACK;

BEGIN;
SELECT nextval('s');
COMMIT;

The natural expectation would be the COMMIT gets stuck, waiting for the
sync replica (which is not running), right? But it does not.

How about if we always WAL log the first sequence change in a transaction?

--
With Regards,
Amit Kapila.

#9Peter Eisentraut
peter_e@gmx.net
In reply to: Tomas Vondra (#6)
Re: sequences vs. synchronous replication

On 18.12.21 22:48, Tomas Vondra wrote:

What do you mean by "not caching unused sequence numbers"? Reducing
SEQ_LOG_VALS to 1, i.e. WAL-logging every sequence increment?

That'd work, but I wonder how significant the impact will be. I'd bet
it hurts the patch adding logical decoding of sequences quite a bit.

It might be worth testing. This behavior is ancient and has never
really been scrutinized since it was added.

#10Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Peter Eisentraut (#9)
Re: sequences vs. synchronous replication

On 12/20/21 15:31, Peter Eisentraut wrote:

On 18.12.21 22:48, Tomas Vondra wrote:

What do you mean by "not caching unused sequence numbers"? Reducing
SEQ_LOG_VALS to 1, i.e. WAL-logging every sequence increment?

That'd work, but I wonder how significant the impact will be. I'd bet
it hurts the patch adding logical decoding of sequences quite a bit.

It might be worth testing.  This behavior is ancient and has never
really been scrutinized since it was added.

OK, I'll do some testing to measure the overhead, and I'll see how much
it affects the sequence decoding patch.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#10)
Re: sequences vs. synchronous replication

On 12/20/21 17:40, Tomas Vondra wrote:

On 12/20/21 15:31, Peter Eisentraut wrote:

On 18.12.21 22:48, Tomas Vondra wrote:

What do you mean by "not caching unused sequence numbers"? Reducing
SEQ_LOG_VALS to 1, i.e. WAL-logging every sequence increment?

That'd work, but I wonder how significant the impact will be. I'd
bet it hurts the patch adding logical decoding of sequences quite a bit.

It might be worth testing.  This behavior is ancient and has never
really been scrutinized since it was added.

OK, I'll do some testing to measure the overhead, and I'll see how much
it affects the sequence decoding patch.

OK, I did a quick test with two very simple benchmarks - simple select
from a sequence, and 'pgbench -N' on scale 1. Benchmark was on current
master, patched means SEQ_LOG_VALS was set to 1.

Average of 10 runs, each 30 seconds long, look like this:

1) select nextval('s');

     clients          1         4
    ------------------------------
     master       39497    123137
     patched       6813     18326
    ------------------------------
     diff          -83%      -86%

2) pgbench -N

     clients          1         4
    ------------------------------
     master        2935      9156
     patched       2937      9100
    ------------------------------
     diff            0%        0%

Clearly the extreme case (1) is hit pretty badly, while the much more
likely workload (2) is almost unaffected.

I'm not sure what conclusion to draw from this, but assuming almost no
one does just nextval() calls, it should be acceptable.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#11)
Re: sequences vs. synchronous replication

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

OK, I did a quick test with two very simple benchmarks - simple select
from a sequence, and 'pgbench -N' on scale 1. Benchmark was on current
master, patched means SEQ_LOG_VALS was set to 1.

But ... pgbench -N doesn't use sequences at all, does it?

Probably inserts into a table with a serial column would constitute a
plausible real-world case.

regards, tom lane

#13Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#12)
Re: sequences vs. synchronous replication

On 12/21/21 02:01, Tom Lane wrote:

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

OK, I did a quick test with two very simple benchmarks - simple select
from a sequence, and 'pgbench -N' on scale 1. Benchmark was on current
master, patched means SEQ_LOG_VALS was set to 1.

But ... pgbench -N doesn't use sequences at all, does it?

Probably inserts into a table with a serial column would constitute a
plausible real-world case.

D'oh! For some reason I thought pgbench has a sequence on the history
table, but clearly I was mistaken. There's another thinko, because after
inspecting pg_waldump output I realized "SEQ_LOG_VALS 1" actually logs
only every 2nd increment. So it should be "SEQ_LOG_VALS 0".

So I repeated the test fixing SEQ_LOG_VALS, and doing the pgbench with a
table like this:

create table test (a serial, b int);

and a script doing

insert into test (b) values (1);

The results look like this:

1) select nextval('s');

     clients          1         4
    ------------------------------
     master       39533    124998
     patched       3748      9114
    ------------------------------
     diff          -91%      -93%

2) insert into test (b) values (1);

     clients          1         4
    ------------------------------
     master        3718      9188
     patched       3698      9209
    ------------------------------
     diff            0%        0%

So the nextval() results are a bit worse, due to no longer caching half
of the nextval() calls. The -90% is roughly expected, due to generating
about 32x more WAL (and having to wait for commit).

But results for the more realistic insert workload are about the same as
before (i.e. no measurable difference). Also kinda expected, because
those transactions have to wait for WAL anyway.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Amit Kapila (#8)
Re: sequences vs. synchronous replication

On 12/19/21 04:03, Amit Kapila wrote:

On Sat, Dec 18, 2021 at 7:24 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

while working on logical decoding of sequences, I ran into an issue with
nextval() in a transaction that rolls back, described in [1]. But after
thinking about it a bit more (and chatting with Petr Jelinek), I think
this issue affects physical sync replication too.

Imagine you have a primary <-> sync_replica cluster, and you do this:

CREATE SEQUENCE s;

-- shutdown the sync replica

BEGIN;
SELECT nextval('s') FROM generate_series(1,50);
ROLLBACK;

BEGIN;
SELECT nextval('s');
COMMIT;

The natural expectation would be the COMMIT gets stuck, waiting for the
sync replica (which is not running), right? But it does not.

How about if we always WAL log the first sequence change in a transaction?

I've been thinking about doing something like this, but I think it would
not have any significant advantages compared to using "SEQ_LOG_VALS 0".
It would still have the same performance hit for plain nextval() calls,
and there's no measurable impact on simple workloads that already write
WAL in transactions even with SEQ_LOG_VALS 0.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15Fujii Masao
masao.fujii@gmail.com
In reply to: Tomas Vondra (#14)
Re: sequences vs. synchronous replication

On 2021/12/22 10:57, Tomas Vondra wrote:

On 12/19/21 04:03, Amit Kapila wrote:

On Sat, Dec 18, 2021 at 7:24 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

while working on logical decoding of sequences, I ran into an issue with
nextval() in a transaction that rolls back, described in [1]. But after
thinking about it a bit more (and chatting with Petr Jelinek), I think
this issue affects physical sync replication too.

Imagine you have a primary <-> sync_replica cluster, and you do this:

    CREATE SEQUENCE s;

    -- shutdown the sync replica

    BEGIN;
    SELECT nextval('s') FROM generate_series(1,50);
    ROLLBACK;

    BEGIN;
    SELECT nextval('s');
    COMMIT;

The natural expectation would be the COMMIT gets stuck, waiting for the
sync replica (which is not running), right? But it does not.

How about if we always WAL log the first sequence change in a transaction?

I've been thinking about doing something like this, but I think it would not have any significant advantages compared to using "SEQ_LOG_VALS 0". It would still have the same performance hit for plain nextval() calls, and there's no measurable impact on simple workloads that already write WAL in transactions even with SEQ_LOG_VALS 0.

Just an idea: if wal_level > minimal, how about making nextval_internal() (1) check whether WAL has been replicated to sync standbys up to the page LSN of the sequence, and (2) forcibly emit a WAL record if it has not been replicated yet? A similar check is performed at the beginning of SyncRepWaitForLSN(), so we can probably reuse that code.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#16Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Fujii Masao (#15)
Re: sequences vs. synchronous replication

On 12/22/21 05:56, Fujii Masao wrote:

On 2021/12/22 10:57, Tomas Vondra wrote:

On 12/19/21 04:03, Amit Kapila wrote:

On Sat, Dec 18, 2021 at 7:24 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

while working on logical decoding of sequences, I ran into an issue
with
nextval() in a transaction that rolls back, described in [1]. But after
thinking about it a bit more (and chatting with Petr Jelinek), I think
this issue affects physical sync replication too.

Imagine you have a primary <-> sync_replica cluster, and you do this:

    CREATE SEQUENCE s;

    -- shutdown the sync replica

    BEGIN;
    SELECT nextval('s') FROM generate_series(1,50);
    ROLLBACK;

    BEGIN;
    SELECT nextval('s');
    COMMIT;

The natural expectation would be the COMMIT gets stuck, waiting for the
sync replica (which is not running), right? But it does not.

How about if we always WAL log the first sequence change in a
transaction?

I've been thinking about doing something like this, but I think it
would not have any significant advantages compared to using
"SEQ_LOG_VALS 0". It would still have the same performance hit for
plain nextval() calls, and there's no measurable impact on simple
workloads that already write WAL in transactions even with
SEQ_LOG_VALS 0.

Just an idea: if wal_level > minimal, how about making nextval_internal()
(1) check whether WAL has been replicated to sync standbys up to the page
LSN of the sequence, and (2) forcibly emit a WAL record if it has not
been replicated yet? A similar check is performed at the beginning of
SyncRepWaitForLSN(), so we can probably reuse that code.

Interesting idea, but I think it has a couple of issues :-(

1) We'd need to know the LSN of the last WAL record for any given
sequence, and we'd need to communicate that between backends somehow.
Which seems rather tricky to do without affecting performance.

2) SyncRepWaitForLSN() is used only in commit-like situations, and it's
a simple wait, not a decision to write more WAL. Environments without
sync replicas are affected by this too - yes, the data loss issue is not
there, but the amount of WAL is still increased.

IIRC sync_standby_names can change while a transaction is running, even
just right before commit, at which point we can't just go back in time
and generate WAL for sequences accessed earlier. But we still need to
ensure the sequence is properly replicated.

3) I don't think it'd actually reduce the amount of WAL records in
environments with many sessions (incrementing the same sequence). In
those cases the WAL (generated by in-progress xact from another session)
is likely to not be flushed, so we'd generate the extra WAL record. (And
the other backends would then need the flush LSN of this new WAL record,
which would make it more likely they have to generate WAL too.)

So I don't think this would actually help much.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#17Fujii Masao
masao.fujii@gmail.com
In reply to: Tomas Vondra (#16)
Re: sequences vs. synchronous replication

On 2021/12/22 21:11, Tomas Vondra wrote:

Interesting idea, but I think it has a couple of issues :-(

Thanks for the review!

1) We'd need to know the LSN of the last WAL record for any given sequence, and we'd need to communicate that between backends somehow. Which seems rather tricky to do without affecting performance.

How about using the page lsn for the sequence? nextval_internal() already uses that to check whether it's less than or equal to checkpoint redo location.

2) SyncRepWaitForLSN() is used only in commit-like situations, and it's a simple wait, not a decision to write more WAL. Environments without sync replicas are affected by this too - yes, the data loss issue is not there, but the amount of WAL is still increased.

How about reusing only a part of code in SyncRepWaitForLSN()? Attached is the PoC patch that implemented what I'm thinking.

IIRC sync_standby_names can change while a transaction is running, even just right before commit, at which point we can't just go back in time and generate WAL for sequences accessed earlier. But we still need to ensure the sequence is properly replicated.

Yes. In the PoC patch, SyncRepNeedsWait() still checks sync_standbys_defined and uses SyncRepWaitMode. But they should not be checked nor used because their values can be changed on the fly, as you pointed out. Probably SyncRepNeedsWait() will need to be changed so that it doesn't use them.

3) I don't think it'd actually reduce the amount of WAL records in environments with many sessions (incrementing the same sequence). In those cases the WAL (generated by an in-progress xact from another session) is likely to not be flushed, so we'd generate the extra WAL record. (And the other backends would then need the flush LSN of this new WAL record, which would make it more likely they have to generate WAL too.)

With the PoC patch, an additional WAL record is generated by nextval() only when the previous transaction that executed nextval() and caused a WAL record was aborted. So this approach may produce less WAL than the other approaches?

In the PoC patch, to reduce WAL volume further, it might be better to make nextval_internal() update XactLastRecEnd and assign an XID rather than emitting a new WAL record, when SyncRepNeedsWait() returns true.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachments:

sequence_log.patch (text/plain, +32/-1)
#18Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#13)
Re: sequences vs. synchronous replication

On 12/21/21 03:49, Tomas Vondra wrote:

On 12/21/21 02:01, Tom Lane wrote:

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

OK, I did a quick test with two very simple benchmarks - simple select
from a sequence, and 'pgbench -N' on scale 1. Benchmark was on current
master, patched means SEQ_LOG_VALS was set to 1.

But ... pgbench -N doesn't use sequences at all, does it?

Probably inserts into a table with a serial column would constitute a
plausible real-world case.

D'oh! For some reason I thought pgbench has a sequence on the history
table, but clearly I was mistaken. There's another thinko, because after
inspecting pg_waldump output I realized "SEQ_LOG_VALS 1" actually logs
only every 2nd increment. So it should be "SEQ_LOG_VALS 0".

So I repeated the test fixing SEQ_LOG_VALS, and doing the pgbench with a
table like this:

  create table test (a serial, b int);

and a script doing

  insert into test (b) values (1);

The results look like this:

1) select nextval('s');

     clients          1         4
    ------------------------------
     master       39533    124998
     patched       3748      9114
    ------------------------------
     diff          -91%      -93%

2) insert into test (b) values (1);

     clients          1         4
    ------------------------------
     master        3718      9188
     patched       3698      9209
    ------------------------------
     diff            0%        0%

So the nextval() results are a bit worse, due to no longer caching half
of the nextval() calls. The -90% is roughly expected, due to generating
about 32x more WAL (and having to wait for commit).

But results for the more realistic insert workload are about the same as
before (i.e. no measurable difference). Also kinda expected, because
those transactions have to wait for WAL anyway.

Attached is a patch tweaking WAL logging - in wal_level=minimal we do
the same thing as now, in higher levels we log every sequence fetch.

After thinking about this a bit more, I think even the nextval workload
is not such a big issue, because we can set cache for the sequences.
Until now this had fairly limited impact, but it can significantly
reduce the performance drop caused by WAL-logging every sequence fetch.

I've repeated the nextval test on a different machine (the one I used
before is busy with something else), and the results look like this:

1) 1 client

     cache            1        32       128
    ---------------------------------------
     master       13975     14425     19886
     patched        886      7900     18397
    ---------------------------------------
     diff          -94%      -45%       -7%

2) 4 clients

     cache            1        32       128
    ---------------------------------------
     master        8338     12849     18248
     patched        331      8124     18983
    ---------------------------------------
     diff          -96%      -37%       +4%

So I think this makes it acceptable / manageable. Of course, this means
the values are much less monotonic (across backends), but I don't think
we ever really promised that. And I doubt anyone is really using
sequences like this (just nextval) in performance-critical use cases.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

0001-WAL-log-individual-sequence-fetches-20211222.patch (text/x-patch, +16/-3)
#19Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Fujii Masao (#17)
Re: sequences vs. synchronous replication

On 12/22/21 18:50, Fujii Masao wrote:

On 2021/12/22 21:11, Tomas Vondra wrote:

Interesting idea, but I think it has a couple of issues :-(

Thanks for the review!

1) We'd need to know the LSN of the last WAL record for any given
sequence, and we'd need to communicate that between backends somehow.
Which seems rather tricky to do without affecting performance.

How about using the page lsn for the sequence? nextval_internal()
already uses that to check whether it's less than or equal to checkpoint
redo location.

Hmm, maybe.

2) SyncRepWaitForLSN() is used only in commit-like situations, and
it's a simple wait, not a decision to write more WAL. Environments
without sync replicas are affected by this too - yes, the data loss
issue is not there, but the amount of WAL is still increased.

How about reusing only a part of code in SyncRepWaitForLSN()? Attached
is the PoC patch that implemented what I'm thinking.

IIRC sync_standby_names can change while a transaction is running,
even just right before commit, at which point we can't just go back in
time and generate WAL for sequences accessed earlier. But we still
need to ensure the sequence is properly replicated.

Yes. In the PoC patch, SyncRepNeedsWait() still checks
sync_standbys_defined and uses SyncRepWaitMode. But they should not be
checked nor used because their values can be changed on the fly, as you
pointed out. Probably SyncRepNeedsWait() will need to be changed so that
it doesn't use them.

Right. I think the data loss with sync standby is merely a symptom, not
the root cause. We'd need to deduce the LSN for which to wait at commit.

3) I don't think it'd actually reduce the amount of WAL records in
environments with many sessions (incrementing the same sequence). In
those cases the WAL (generated by in-progress xact from another
session) is likely to not be flushed, so we'd generate the extra WAL
record. (And the other backends would then need the flush LSN of this
new WAL record, which would make it more likely they have to generate
WAL too.)

With the PoC patch, an additional WAL record is generated by nextval()
only when the previous transaction that executed nextval() and caused a
WAL record was aborted. So this approach may produce less WAL than the
other approaches?

In the PoC patch, to reduce WAL volume further, it might be better to
make nextval_internal() update XactLastRecEnd and assign an XID rather
than emitting a new WAL record, when SyncRepNeedsWait() returns true.

Yes, but I think there are other cases. For example, the WAL might have
been generated by another backend, in a transaction that might still be
running. In that case I don't see how updating XactLastRecEnd in
nextval_internal() would fix this, right?

I did some experiments with increasing CACHE for the sequence, and that
mostly eliminates the overhead. See the message I sent a couple minutes
ago. IMHO that's a reasonable solution for the tiny number of people
using nextval() in a way that'd be affected by this (i.e. without
writing anything else in the xact).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20Fujii Masao
masao.fujii@gmail.com
In reply to: Tomas Vondra (#18)
Re: sequences vs. synchronous replication

On 2021/12/23 3:49, Tomas Vondra wrote:

Attached is a patch tweaking WAL logging - in wal_level=minimal we do the same thing as now, in higher levels we log every sequence fetch.

Thanks for the patch!

With the patch, I found that the regression test for sequences failed.

+ fetch = log = fetch;

This should be "log = fetch"?

On second thought, originally a sequence doesn't guarantee that the value already returned by nextval() will never be returned by subsequent nextval() after the server crash recovery. That is, nextval() may return the same value across crash recovery. Is this understanding right? For example, this case can happen if the server crashes after nextval() returned the value but before WAL for the sequence was flushed to the permanent storage. So it's not a bug that sync standby may return the same value as already returned in the primary because the corresponding WAL has not been replicated yet, isn't it?

BTW, if the returned value is stored in database, the same value is guaranteed not to be returned again after the server crash or by sync standby. Because in that case the WAL of the transaction storing that value is flushed and replicated.

So I think this makes it acceptable / manageable. Of course, this means the values are much less monotonous (across backends), but I don't think we really promised that. And I doubt anyone is really using sequences like this (just nextval) in performance critical use cases.

I think that this approach is not acceptable to some users. So, if we actually adopt WAL-logging every sequence fetch, how about also exposing SEQ_LOG_VALS as a reloption for a sequence? Those who want to log every sequence fetch could set this reloption to 0. OTOH, those who prefer the current behavior in spite of the risk we're discussing in this thread could set the reloption to 32, as it is now, for example.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#21Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Fujii Masao (#20)
#22Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Tomas Vondra (#21)
#23Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Kyotaro Horiguchi (#22)
#24Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Tomas Vondra (#23)
#25Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Kyotaro Horiguchi (#24)
#26Peter Eisentraut
peter_e@gmx.net
In reply to: Kyotaro Horiguchi (#24)
#27Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Peter Eisentraut (#26)
#28Fujii Masao
masao.fujii@gmail.com
In reply to: Tomas Vondra (#25)
#29Sascha Kuhl
yogidabanli@gmail.com
In reply to: Fujii Masao (#28)
#30Pavel Stehule
pavel.stehule@gmail.com
In reply to: Sascha Kuhl (#29)
#31Sascha Kuhl
yogidabanli@gmail.com
In reply to: Pavel Stehule (#30)
#32Pavel Stehule
pavel.stehule@gmail.com
In reply to: Sascha Kuhl (#31)
#33Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Fujii Masao (#17)
#34Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#25)
#35Fujii Masao
masao.fujii@gmail.com
In reply to: Tomas Vondra (#33)
#36Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Fujii Masao (#35)
#37Peter Eisentraut
peter_e@gmx.net
In reply to: Tomas Vondra (#36)
#38Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Peter Eisentraut (#37)