logical decoding and replication of sequences, take 2

Started by Tomas Vondra over 3 years ago · 240 messages
#1 Tomas Vondra
tomas.vondra@2ndquadrant.com

Hi,

Here's a rebased version of the patch adding logical decoding of
sequences. The previous attempt [1] ended up getting reverted, due to
running into issues with the non-transactional nature of sequences when
decoding the existing WAL records. See [2] for details.

This patch uses a different approach, proposed by Hannu Krosing [3],
based on tracking sequences actually modified in each transaction, and
then WAL-logging the state at the end.

This does work, but I'm not very happy about WAL-logging all sequences
at the end. The "problem" is we have to re-read the current state of the
sequence from disk, because it might be concurrently updated by another
transaction.

Imagine two transactions, T1 and T2:

T1: BEGIN

T1: SELECT nextval('s') FROM generate_series(1,1000)

T2: BEGIN

T2: SELECT nextval('s') FROM generate_series(1,1000)

T2: COMMIT

T1: COMMIT

The expected outcome is that the sequence value ends up at ~2000. We
must not blindly overwrite the state logged by T2 with the increments
made in T1. So the patch simply reads the "current" state of the
sequence at commit time. Which is annoying, because it involves I/O,
increases the commit duration, etc.
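To illustrate in toy form (plain Python, nothing from the patch itself), the commit-time read logs the absolute current state rather than replaying per-transaction increments, which stays correct regardless of commit order:

```python
# Toy model (not PostgreSQL code): why the patch re-reads the sequence
# state at COMMIT instead of replaying per-transaction increments.

class Sequence:
    def __init__(self):
        self.last_value = 0  # the shared, on-disk state

    def nextval(self):
        self.last_value += 1
        return self.last_value

s = Sequence()

# T1 and T2 each draw 1000 values; the draws interleave on shared state.
t1_values = [s.nextval() for _ in range(1000)]
t2_values = [s.nextval() for _ in range(1000)]

# T2 commits first, then T1. At COMMIT each transaction logs the
# *current* on-disk state - not the last value it generated itself.
logged_by_t2 = s.last_value
logged_by_t1 = s.last_value

# Whichever commit record is applied last, the replica ends up at ~2000.
# Replaying T1's own last value (1000) instead would move the sequence
# backwards and risk generating duplicates.
assert logged_by_t1 == logged_by_t2 == 2000
```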

On the other hand, this is likely cheaper than the other approach based
on WAL-logging every sequence increment (that would have to be careful
about obsoleted increments too, when applying them transactionally).

I wonder if we might deal with this by simply WAL-logging LSN of the
last change for each sequence (in the given xact), which would allow
discarding the "obsolete" changes quite easily I think. nextval() would
simply look at LSN in the page header.

And maybe we could then use the LSN to read the increment from the WAL
during decoding, instead of having to read it and WAL-log it during
commit. Essentially, we'd run a local XLogReader. Of course, we'd have
to be careful about checkpoints, not sure what to do about that.

Another idea that just occurred to me is that if we end up having to
read the sequence state during commit, maybe we could at least optimize
it somehow. For example we might track LSN of the last logged state for
each sequence (in shared memory or something), and the other sessions
could just skip the WAL-log if their "local" LSN is <= this LSN.
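As a sketch of this idea (the name maybe_log_state is invented for illustration, and real code would track this in shared memory, not a Python dict):

```python
# Sketch of the proposed optimization: keep the LSN of the last
# WAL-logged state per sequence, and skip the extra WAL record when a
# newer state has already been logged by another session.

last_logged_lsn = {}   # sequence name -> LSN of the last logged state
wal = []               # stand-in for the WAL stream

def maybe_log_state(seq, local_lsn, state):
    """WAL-log the sequence state unless a newer state is already logged."""
    if local_lsn <= last_logged_lsn.get(seq, -1):
        return False   # a state covering our changes is already in WAL
    last_logged_lsn[seq] = local_lsn
    wal.append((seq, local_lsn, state))
    return True

maybe_log_state('s', 100, 1000)   # logged
maybe_log_state('s', 90, 800)     # skipped: local LSN 90 <= 100
maybe_log_state('s', 120, 1500)   # logged (newer change)
```

With this, only two of the three attempts actually write a WAL record; the session whose last change was already covered by a newer logged state does nothing.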

regards

[1]: /messages/by-id/d045f3c2-6cfb-06d3-5540-e63c320df8bc@enterprisedb.com

[2]: /messages/by-id/00708727-d856-1886-48e3-811296c7ba8c@enterprisedb.com

[3]: /messages/by-id/CAMT0RQQeDR51xs8zTa25YpfKB1B34nS-Q4hhsRPznVsjMB_P1w@mail.gmail.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

decoding-sequences-tracking-20220818.patch (text/x-patch, +4488 -554)
#2 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#1)
Re: logical decoding and replication of sequences, take 2

I've been thinking about the two optimizations mentioned at the end a
bit more, so let me share my thoughts before I forget that:

On 8/18/22 23:10, Tomas Vondra wrote:

> ...
>
> And maybe we could then use the LSN to read the increment from the WAL
> during decoding, instead of having to read it and WAL-log it during
> commit. Essentially, we'd run a local XLogReader. Of course, we'd have
> to be careful about checkpoints, not sure what to do about that.

I think logging just the LSN is workable.

I was worried about dealing with checkpoints, because imagine you do
nextval() on sequence that was last WAL-logged a couple checkpoints
back. Then you wouldn't be able to read the LSN (when decoding), because
the WAL might have been recycled. But that can't happen, because we
always force WAL-logging the first time nextval() is called after a
checkpoint. So we know the LSN is guaranteed to be available.

Of course, this would not reduce the amount of WAL messages, because
we'd still log all sequences touched by the transaction. We wouldn't
need to read the state from disk, though, and we could ignore "old"
stuff in decoding (with LSN lower than the last LSN we decoded).

For frequently used sequences that seems like a win.

> Another idea that just occurred to me is that if we end up having to
> read the sequence state during commit, maybe we could at least optimize
> it somehow. For example we might track LSN of the last logged state for
> each sequence (in shared memory or something), and the other sessions
> could just skip the WAL-log if their "local" LSN is <= than this LSN.

Tracking the last LSN for each sequence (in an SLRU or something) should
work too, I guess. In principle this just moves the skipping of "old"
increments from decoding to writing, so that we don't even have to write
those into WAL.

We don't even need persistence, nor to keep all the records, I think. If
you don't find a record for a given sequence, assume it wasn't logged
yet and just log it. Of course, it requires a bit of shared memory for
each sequence, say ~32B. Not sure about the overhead, but I'd bet if you
have many (~thousands) frequently used sequences, there'll be a lot of
other overhead making this irrelevant.

Of course, if we're doing the skipping when writing the WAL, maybe we
should just read the sequence state - we'd do the I/O, but only in
fraction of the transactions, and we wouldn't need to read old WAL in
logical decoding.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#3 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#2)
Re: logical decoding and replication of sequences, take 2

Hi,

I noticed on cfbot the patch no longer applies, so here's a rebased
version. Most of the breakage was due to the column filtering reworks,
grammar changes etc. A lot of bitrot, but mostly mechanical stuff.

I haven't looked into the optimizations / improvements I discussed in my
previous post (logging only LSN of the last WAL-logged increment),
because while fixing "make check-world" I ran into a more serious issue
that I think needs to be discussed first. And I suspect it might also
affect the feasibility of the LSN optimization.

So, what's the issue - the current solution is based on WAL-logging
state of all sequences incremented by the transaction at COMMIT. To do
that, we read the state from disk, and write that into WAL. However,
these WAL messages are not necessarily correlated to COMMIT records, so
stuff like this might happen:

1. transaction T1 increments sequence S
2. transaction T2 increments sequence S
3. both T1 and T2 start to COMMIT
4. T1 reads state of S from disk, writes it into WAL
5. transaction T3 increments sequence S
6. T2 reads state of S from disk, writes it into WAL
7. T2 writes COMMIT into WAL
8. T1 writes COMMIT into WAL

Because the apply order is determined by ordering of COMMIT records,
this means we'd apply the increments logged by T2, and then by T1. But
that undoes the increment by T3, and the sequence would go backwards.
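A toy model of this hazard (plain Python, illustrative only) shows the sequence moving backwards when the two state snapshots are applied in COMMIT order:

```python
# Toy model of the reordering hazard: sequence-state records are applied
# in COMMIT order, so T1's older snapshot of S clobbers newer state.

applied = None  # sequence state on the subscriber

def apply_state(snapshot):
    global applied
    applied = snapshot  # naive apply: just overwrite

# Publisher-side values of S at the moment each snapshot was taken:
t1_snapshot = 2  # step 4: T1 reads S after the T1 and T2 increments
#                  step 5: T3 increments S to 3
t2_snapshot = 3  # step 6: T2 reads S, including T3's increment

# Apply order follows COMMIT order: T2 (step 7), then T1 (step 8).
apply_state(t2_snapshot)
apply_state(t1_snapshot)

# The sequence went backwards - T3's increment was undone.
assert applied == 2
```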

The previous patch version addressed that by acquiring lock on the
sequence, holding it until transaction end. This effectively ensures the
order of sequence messages and COMMIT matches. But that's problematic
for a number of reasons:

1) throughput reduction, because the COMMIT records need to serialize

2) deadlock risk, if we happen to lock sequences in different order
(in different transactions)

3) problem for prepared transactions - the sequences are locked and
logged in PrepareTransaction, because we may not have seqhashtab
beyond that point. This is a much worse variant of (1).

Note: I also wonder what happens if someone does DISCARD SEQUENCES. I
guess we'll forget the sequences, which is bad - so we'd have to invent
a separate cache that does not have this issue.

I realized (3) because one of the test_decoding TAP tests got stuck
exactly because of a sequence locked by a prepared transaction.

This patch simply releases the lock after writing the WAL message, but
that just makes it vulnerable to the reordering. And this would have
been true even with the LSN optimization.

However, I was thinking that maybe we could use the LSN of the WAL
message (XLOG_LOGICAL_SEQUENCE) to deal with the ordering issue, because
*this* is the sensible sequence increment ordering.

In the example above, we'd first apply the WAL message from T2 (because
that commits first). And then we'd get to apply T1, but the WAL message
has an older LSN, so we'd skip it.
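A sketch of that skip logic (invented names, not the patch's actual structures): each sequence-state message carries the LSN of its XLOG_LOGICAL_SEQUENCE record, and the apply side ignores any message older than the last one it applied for that sequence.

```python
# Sketch of the LSN-based skip on the apply side.

applied_lsn = {}  # sequence -> LSN of the last applied state message
state = {}        # sequence -> current applied value

def apply_sequence_message(seq, lsn, value):
    """Apply the state only if it is newer than what was already applied."""
    if lsn <= applied_lsn.get(seq, -1):
        return False  # obsolete message (like T1's in the example above)
    applied_lsn[seq] = lsn
    state[seq] = value
    return True

# T2's message (newer LSN) arrives first because T2 commits first; T1's
# older message is then skipped instead of moving the sequence backwards.
assert apply_sequence_message('s', 200, 3) is True
assert apply_sequence_message('s', 100, 2) is False
assert state['s'] == 3
```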

But this requires us to remember the LSN of the already-applied WAL
sequence messages, which could be tricky - we'd need to persist it in
some way because of restarts, etc. And we can't do this while decoding,
only on the apply side, I think, because of streaming, aborts and so on.

The other option might be to make these messages non-transactional, in
which case we'd separate the ordering from COMMIT ordering, evading the
reordering problem.

That'd mean we'd ignore rollbacks (which seems fine), we could probably
optimize this by checking if the state actually changed, etc. But we'd
also need to deal with sequences created in the (still uncommitted)
transaction. And I'm also worried it might lead to the same issues with
non-transactional behavior that forced the revert in v15.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

decoding-sequences-tracking-20221112.patch (text/x-patch, +4527 -578)
#4 Ian Lawrence Barwick
barwick@gmail.com
In reply to: Tomas Vondra (#3)
Re: logical decoding and replication of sequences, take 2

On Sat, Nov 12, 2022 at 7:49 Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

> Hi,
>
> I noticed on cfbot the patch no longer applies, so here's a rebased
> version. Most of the breakage was due to the column filtering reworks,
> grammar changes etc. A lot of bitrot, but mostly mechanical stuff.
>
> (...)

Hi

Thanks for the update patch.

While reviewing the patch backlog, we have determined that this patch adds
one or more TAP tests but has not added the test to the "meson.build" file.

To do this, locate the relevant "meson.build" file for each test and add it
in the 'tests' dictionary, which will look something like this:

'tap': {
  'tests': [
    't/001_basic.pl',
  ],
},

For some additional details please see this Wiki article:

https://wiki.postgresql.org/wiki/Meson_for_patch_authors

For more information on the meson build system for PostgreSQL see:

https://wiki.postgresql.org/wiki/Meson

Regards

Ian Barwick

#5 Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#3)
Re: logical decoding and replication of sequences, take 2

On Fri, Nov 11, 2022 at 5:49 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

> The other option might be to make these messages non-transactional, in
> which case we'd separate the ordering from COMMIT ordering, evading the
> reordering problem.
>
> That'd mean we'd ignore rollbacks (which seems fine), we could probably
> optimize this by checking if the state actually changed, etc. But we'd
> also need to deal with transactions created in the (still uncommitted)
> transaction. But I'm also worried it might lead to the same issue with
> non-transactional behaviors that forced revert in v15.

I think it might be a good idea to step back slightly from
implementation details and try to agree on a theoretical model of
what's happening here. Let's start by banishing the words
transactional and non-transactional from the conversation and talk
about what logical replication is trying to do.

We can imagine that the replicated objects on the primary pass through
a series of states S1, S2, ..., Sn, where n keeps going up as new
state changes occur. The state, for our purposes here, is the contents
of the database as they could be observed by a user running SELECT
queries at some moment in time chosen by the user. For instance, if
the initial state of the database is S1, and then the user executes
BEGIN, 2 single-row INSERT statements, and a COMMIT, then S2 is the
state that differs from S1 in that both of those rows are now part of
the database contents. There is no state where one of those rows is
visible and the other is not. That was never observable by the user,
except from within the transaction as it was executing, which we can
and should discount. I believe that the goal of logical replication is
to bring about a state of affairs where the set of states observable
on the standby is a subset of the states observable on the primary.
That is, if the primary goes from S1 to S2 to S3, the standby can do
the same thing, or it can go straight from S1 to S3 without ever
making it possible for the user to observe S2. Either is correct
behavior. But the standby cannot invent any new states that didn't
occur on the primary. It can't decide to go from S1 to S1.5 to S2.5 to
S3, or something like that. It can only consolidate changes that
occurred separately on the primary, never split them up. Neither can
it reorder them.
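One compact way to state this criterion (a sketch, not code from any patch): the sequence of states observable on the standby must be a subsequence of the states observed on the primary.

```python
# Sketch of the correctness criterion: standby states must form a
# subsequence of the primary's history - states may be skipped
# (consolidated), but never invented or reordered.

def is_valid_replay(primary_states, standby_states):
    it = iter(primary_states)
    # each standby state must appear in the primary history, in order
    return all(s in it for s in standby_states)

primary = ['S1', 'S2', 'S3', 'S4']

assert is_valid_replay(primary, ['S1', 'S3', 'S4'])      # skipping S2: fine
assert not is_valid_replay(primary, ['S1', 'S3', 'S2'])  # reordered: wrong
assert not is_valid_replay(primary, ['S1', 'S2.5'])      # invented: wrong
```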

Now, if you accept this as a reasonable definition of correctness,
then the next question is what consequences it has for transactional
and non-transactional behavior. If all behavior is transactional, then
we've basically got to replay each primary transaction in a single
standby transaction, and commit those transactions in the same order
that the corresponding primary transactions committed. We could
legally choose to merge a group of transactions that committed one
after the other on the primary into a single transaction on the
standby, and it might even be a good idea if they're all very tiny,
but it's not required. But if there are non-transactional things
happening, then there are changes that become visible at some time
other than at a transaction commit. For example, consider this
sequence of events, in which each "thing" that happens is
transactional except where the contrary is noted:

T1: BEGIN;
T2: BEGIN;
T1: Do thing 1;
T2: Do thing 2;
T1: Do a non-transactional thing;
T1: Do thing 3;
T2: Do thing 4;
T2: COMMIT;
T1: COMMIT;

From the point of view of the user, there are 4 observable states:

S1: Initial state.
S2: State after the non-transactional thing happens.
S3: State after T2 commits (reflects the non-transactional thing plus
things 2 and 4).
S4: State after T1 commits.

Basically, the non-transactional thing behaves a whole lot like a
separate transaction. That non-transactional operation ought to be
replicated before T2, which ought to be replicated before T1. Maybe
logical replication ought to treat it in exactly that way: as a
separate operation that needs to be replicated after any earlier
transactions that completed prior to the history shown here, but
before T2 or T1. Alternatively, you can merge the non-transactional
change into T2, i.e. the first transaction that committed after it
happened. But you can't merge it into T1, even though it happened in
T1. If you do that, then you're creating states on the standby that
never existed on the primary, which is wrong. You could argue that
this is just nitpicking: who cares if the change in the sequence value
doesn't get replicated at exactly the right moment? But I don't think
it's a technicality at all: I think if we don't make the operation
appear to happen at the same point in the sequence as it became
visible on the master, then there will be endless artifacts and corner
cases to the bottom of which we will never get. Just like if we
replicated the actual transactions out of order, chaos would ensue,
because there can be logical dependencies between them, so too can
there be logical dependencies between non-transactional operations, or
between a non-transactional operation and a transactional operation.

To make it more concrete, consider two sessions concurrently running this SQL:

insert into t1 select nextval('s1') from generate_series(1,1000000) g;

There are, in effect, 2000002 transaction-like things here. The
sequence gets incremented 2 million times, and then there are 2
commits that each insert a million rows. Perhaps the actual order of
events looks something like this:

1. nextval the sequence N times, where N >= 1 million
2. commit the first transaction, adding a million rows to t1
3. nextval the sequence 2 million - N times
4. commit the second transaction, adding another million rows to t1

Unless we replicate all of the nextval operations that occur in step 1
at the same time or prior to replicating the first transaction in step
2, we might end up making visible a state where the next value of the
sequence is less than the highest value present in the table, which
would be bad.

With that perhaps overly-long set of preliminaries, I'm going to move
on to talking about the implementation ideas which you mention. You
write that "the current solution is based on WAL-logging state of all
sequences incremented by the transaction at COMMIT" and then, it seems
to me, go on to demonstrate that it's simply incorrect. In my opinion,
the fundamental problem is that it doesn't look at the order that
things happened on the primary and do them in the same order on the
standby. Instead, it accepts that the non-transactional operations are
going to be replicated at the wrong time, and then tries to patch
around the issue by attempting to scrounge up the correct values at
some convenient point and use that data to compensate for our failure
to do the right thing at an earlier point. That doesn't seem like a
satisfying solution, and I think it will be hard to make it fully
correct.

Your alternative proposal says "The other option might be to make
these messages non-transactional, in which case we'd separate the
ordering from COMMIT ordering, evading the reordering problem." But I
don't think that avoids the reordering problem at all. Nor do I think
it's correct. I don't think you *can* separate the ordering of these
operations from the COMMIT ordering. They are, as I argue here,
essentially mini-commits that only bump the sequence value, and they
need to be replicated after the transactions that commit prior to the
sequence value bump and before those that commit afterward. If they
aren't handled that way, I don't think you're going to get fully
correct behavior.

I'm going to confess that I have no really specific idea how to
implement that. I'm just not sufficiently familiar with this code.
However, I suspect that the solution lies in changing things on the
decoding side rather than in the WAL format. I feel like the
information that we need in order to do the right thing must already
be present in the WAL. If it weren't, then how could crash recovery
work correctly, or physical replication? At any given moment, you can
choose to promote a physical standby, and at that point the state you
observe on the new primary had better be some state that existed on
the primary at some point in its history. At any moment, you can
unplug the primary, restart it, and run crash recovery, and if you do,
you had better end up with some state that existed on the primary at
some point shortly before the crash. I think that there are actually a
few subtle inaccuracies in the last two sentences, because actually
the order in which transactions become visible on a physical standby
can differ from the order in which it happens on the primary, but I
don't think that actually changes the picture much. The point is that
the WAL is the definitive source of information about what happened
and in what order it happened, and we use it in that way already in
the context of physical replication, and of standbys. If logical
decoding has a problem with some case that those systems handle
correctly, the problem is with logical decoding, not the WAL format.

In particular, I think it's likely that the "non-transactional
messages" that you mention earlier don't get applied at the point in
the commit sequence where they were found in the WAL. Not sure why
exactly, but perhaps the point at which we're reading WAL runs ahead
of the decoding per se, or something like that, and thus those
non-transactional messages arrive too early relative to the commit
ordering. Possibly that could be changed, and they could be buffered
until earlier commits are replicated. Or else, when we see a WAL
record for a non-transactional sequence operation, we could arrange to
bundle that operation into an "adjacent" replicated transaction i.e.
the transaction whose commit record occurs most nearly prior to, or
most nearly after, the WAL record for the operation itself. Or else,
we could create "virtual" transactions for such operations and make
sure those get replayed at the right point in the commit sequence. Or
else, I don't know, maybe something else. But I think the overall
picture is that we need to approach the problem by replicating changes
in WAL order, as a physical standby would do. Saying that a change is
"nontransactional" doesn't mean that it's exempt from ordering
requirements; rather, it means that that change has its own place in
that ordering, distinct from the transaction in which it occurred.
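To make the "mini-commit" idea concrete, here is a toy reorder buffer (plain Python with invented names; the real reorderbuffer machinery is far more involved) that buffers transactional changes per transaction and emits them at COMMIT, but emits a non-transactional sequence record immediately, at its own place in WAL order - yielding the sequence-then-T2-then-T1 ordering from the earlier example:

```python
# Toy reorder buffer: transactional changes are buffered per-xid and
# emitted at COMMIT; a non-transactional sequence record is emitted
# immediately, as its own "mini-commit" at its own WAL position.

from collections import defaultdict

def decode(records):
    buffered = defaultdict(list)   # xid -> buffered changes
    output = []
    for rec in records:            # records arrive in WAL (LSN) order
        kind = rec[0]
        if kind == 'change':
            _, xid, change = rec
            buffered[xid].append(change)
        elif kind == 'seq':        # non-transactional: emit right away
            _, change = rec
            output.append([change])
        elif kind == 'commit':
            _, xid = rec
            output.append(buffered.pop(xid))
    return output

wal = [
    ('change', 'T1', 'thing 1'),
    ('change', 'T2', 'thing 2'),
    ('seq', 's1 -> 2'),            # the non-transactional thing in T1
    ('change', 'T1', 'thing 3'),
    ('change', 'T2', 'thing 4'),
    ('commit', 'T2'),
    ('commit', 'T1'),
]

# The sequence bump replicates before T2, which replicates before T1.
assert decode(wal) == [['s1 -> 2'],
                       ['thing 2', 'thing 4'],
                       ['thing 1', 'thing 3']]
```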

--
Robert Haas
EDB: http://www.enterprisedb.com

#6 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#5)
Re: logical decoding and replication of sequences, take 2

On 11/16/22 22:05, Robert Haas wrote:

> On Fri, Nov 11, 2022 at 5:49 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>
>> The other option might be to make these messages non-transactional, in
>> which case we'd separate the ordering from COMMIT ordering, evading the
>> reordering problem.
>>
>> That'd mean we'd ignore rollbacks (which seems fine), we could probably
>> optimize this by checking if the state actually changed, etc. But we'd
>> also need to deal with transactions created in the (still uncommitted)
>> transaction. But I'm also worried it might lead to the same issue with
>> non-transactional behaviors that forced revert in v15.
>
> I think it might be a good idea to step back slightly from
> implementation details and try to agree on a theoretical model of
> what's happening here. Let's start by banishing the words
> transactional and non-transactional from the conversation and talk
> about what logical replication is trying to do.

OK, let's try.

> We can imagine that the replicated objects on the primary pass through
> a series of states S1, S2, ..., Sn, where n keeps going up as new
> state changes occur. The state, for our purposes here, is the contents
> of the database as they could be observed by a user running SELECT
> queries at some moment in time chosen by the user. For instance, if
> the initial state of the database is S1, and then the user executes
> BEGIN, 2 single-row INSERT statements, and a COMMIT, then S2 is the
> state that differs from S1 in that both of those rows are now part of
> the database contents. There is no state where one of those rows is
> visible and the other is not. That was never observable by the user,
> except from within the transaction as it was executing, which we can
> and should discount. I believe that the goal of logical replication is
> to bring about a state of affairs where the set of states observable
> on the standby is a subset of the states observable on the primary.
> That is, if the primary goes from S1 to S2 to S3, the standby can do
> the same thing, or it can go straight from S1 to S3 without ever
> making it possible for the user to observe S2. Either is correct
> behavior. But the standby cannot invent any new states that didn't
> occur on the primary. It can't decide to go from S1 to S1.5 to S2.5 to
> S3, or something like that. It can only consolidate changes that
> occurred separately on the primary, never split them up. Neither can
> it reorder them.

I mostly agree, and in a way the last patch aims to do roughly this,
i.e. make sure that the state after each transaction matches the state a
user might observe on the primary (modulo implementation challenges).

There's a couple of caveats, though:

1) Maybe we should focus more on the "actually observed" state instead
of the "observable" one. Who cares if the sequence moved forward in a
transaction that was ultimately rolled back? No committed transaction
should have observed those values - in a way, the last "valid" state of
the sequence is the last value generated in a transaction that
ultimately committed.

2) I think what matters more is that we never generate a duplicate value.
That is, if you generate a value from a sequence, commit a transaction
and replicate it, then the logical standby should not generate the same
value from the sequence. This guarantee seems necessary for "failover"
to logical standby.

> Now, if you accept this as a reasonable definition of correctness,
> then the next question is what consequences it has for transactional
> and non-transactional behavior. If all behavior is transactional, then
> we've basically got to replay each primary transaction in a single
> standby transaction, and commit those transactions in the same order
> that the corresponding primary transactions committed. We could
> legally choose to merge a group of transactions that committed one
> after the other on the primary into a single transaction on the
> standby, and it might even be a good idea if they're all very tiny,
> but it's not required. But if there are non-transactional things
> happening, then there are changes that become visible at some time
> other than at a transaction commit. For example, consider this
> sequence of events, in which each "thing" that happens is
> transactional except where the contrary is noted:
>
> T1: BEGIN;
> T2: BEGIN;
> T1: Do thing 1;
> T2: Do thing 2;
> T1: Do a non-transactional thing;
> T1: Do thing 3;
> T2: Do thing 4;
> T2: COMMIT;
> T1: COMMIT;
>
> From the point of the user here, there are 4 observable states here:
>
> S1: Initiate state.
> S2: State after the non-transactional thing happens.
> S3: State after T2 commits (reflects the non-transactional thing plus
> things 2 and 4).
> S4: State after T1 commits.
>
> Basically, the non-transactional thing behaves a whole lot like a
> separate transaction. That non-transactional operation ought to be
> replicated before T2, which ought to be replicated before T1. Maybe
> logical replication ought to treat it in exactly that way: as a
> separate operation that needs to be replicated after any earlier
> transactions that completed prior to the history shown here, but
> before T2 or T1. Alternatively, you can merge the non-transactional
> change into T2, i.e. the first transaction that committed after it
> happened. But you can't merge it into T1, even though it happened in
> T1. If you do that, then you're creating states on the standby that
> never existed on the primary, which is wrong. You could argue that
> this is just nitpicking: who cares if the change in the sequence value
> doesn't get replicated at exactly the right moment? But I don't think
> it's a technicality at all: I think if we don't make the operation
> appear to happen at the same point in the sequence as it became
> visible on the master, then there will be endless artifacts and corner
> cases to the bottom of which we will never get. Just like if we
> replicated the actual transactions out of order, chaos would ensue,
> because there can be logical dependencies between them, so too can
> there be logical dependencies between non-transactional operations, or
> between a non-transactional operation and a transactional operation.

Well, yeah - we can either try to perform the stuff independently of the
transactions that triggered it, or we can try making it part of some of
the transactions. Each of those options has problems, though :-(

The first version of the patch tried the first approach, i.e. decode the
increments and apply that independently. But:

(a) What would you do with increments of sequences created/reset in a
transaction? Can't apply those outside the transaction, because it
might be rolled back (and that state is not visible on primary).

(b) What about increments created before we have a proper snapshot?
There may be transactions dependent on the increment. This is what
ultimately led to revert of the patch.

This version of the patch tries to do the opposite thing - make sure
that the state after each commit matches what the transaction might have
seen (for sequences it accessed). It's imperfect, because it might log a
state generated "after" the sequence got accessed - it focuses on the
guarantee not to generate duplicate values.

> To make it more concrete, consider two sessions concurrently running this SQL:
>
> insert into t1 select nextval('s1') from generate_series(1,1000000) g;
>
> There are, in effect, 2000002 transaction-like things here. The
> sequence gets incremented 2 million times, and then there are 2
> commits that each insert a million rows. Perhaps the actual order of
> events looks something like this:
>
> 1. nextval the sequence N times, where N >= 1 million
> 2. commit the first transaction, adding a million rows to t1
> 3. nextval the sequence 2 million - N times
> 4. commit the second transaction, adding another million rows to t1
>
> Unless we replicate all of the nextval operations that occur in step 1
> at the same time or prior to replicating the first transaction in step
> 2, we might end up making visible a state where the next value of the
> sequence is less than the highest value present in the table, which
> would be bad.

Right, that's the "guarantee" I've mentioned above, more or less.

> With that perhaps overly-long set of preliminaries, I'm going to move
> on to talking about the implementation ideas which you mention. You
> write that "the current solution is based on WAL-logging state of all
> sequences incremented by the transaction at COMMIT" and then, it seems
> to me, go on to demonstrate that it's simply incorrect. In my opinion,
> the fundamental problem is that it doesn't look at the order that
> things happened on the primary and do them in the same order on the
> standby. Instead, it accepts that the non-transactional operations are
> going to be replicated at the wrong time, and then tries to patch
> around the issue by attempting to scrounge up the correct values at
> some convenient point and use that data to compensate for our failure
> to do the right thing at an earlier point. That doesn't seem like a
> satisfying solution, and I think it will be hard to make it fully
> correct.

I understand what you're saying, but I'm not sure I agree with you.

Yes, this would mean we accept we may end up with something like this:

1: T1 logs sequence state S1
2: someone increments sequence
3: T2 logs sequence stats S2
4: T2 commits
5: T1 commits

which "inverts" the apply order of S1 vs. S2, because we first apply S2
and then the "old" S1. But as long as we're smart enough to "discard"
applying S1, I think that's acceptable - because it guarantees we'll not
generate duplicates of values in the committed transaction.

I'd also argue it does not actually generate invalid state, because once
we commit either transaction, S2 is what's visible.

Yes, if you do "SELECT * FROM sequence" you'll see some intermediate
state, but that's not how sequences are accessed. And you can't do
currval('s') from a transaction that never accessed the sequence.

And if it did, we'd write S2 (or whatever it saw) as part of its commit.

So I think the main issue of this approach is how to decide which
sequence states are obsolete and should be skipped.

> Your alternative proposal says "The other option might be to make
> these messages non-transactional, in which case we'd separate the
> ordering from COMMIT ordering, evading the reordering problem." But I
> don't think that avoids the reordering problem at all.

I don't understand why. Why would it not address the reordering issue?

Nor do I think it's correct.

Nor do I understand this. I mean, isn't it essentially the option you
mentioned earlier - treating the non-transactional actions as
independent transactions? Yes, we'd be batching them so that we'd not
see "intermediate" states, but those are not observed by anyone.

I don't think you *can* separate the ordering of these
operations from the COMMIT ordering. They are, as I argue here,
essentially mini-commits that only bump the sequence value, and they
need to be replicated after the transactions that commit prior to the
sequence value bump and before those that commit afterward. If they
aren't handled that way, I don't think you're going to get fully
correct behavior.

I'm confused. Isn't that pretty much exactly what I'm proposing? Imagine
you have something like this:

1: T1 does something and also increments a sequence
2: T1 logs state of the sequence (right before commit)
3: T1 writes COMMIT

Now when we decode/apply this, we end up doing this:

1: decode all T1 changes, stash them
2: decode the sequence state and apply it separately
3: decode COMMIT, apply all T1 changes

There might be other transactions interleaving with this, but I think
it'd behave correctly. What example would not work?

I'm going to confess that I have no really specific idea how to
implement that. I'm just not sufficiently familiar with this code.
However, I suspect that the solution lies in changing things on the
decoding side rather than in the WAL format. I feel like the
information that we need in order to do the right thing must already
be present in the WAL. If it weren't, then how could crash recovery
work correctly, or physical replication? At any given moment, you can
choose to promote a physical standby, and at that point the state you
observe on the new primary had better be some state that existed on
the primary at some point in its history. At any moment, you can
unplug the primary, restart it, and run crash recovery, and if you do,
you had better end up with some state that existed on the primary at
some point shortly before the crash. I think that there are actually a
few subtle inaccuracies in the last two sentences, because actually
the order in which transactions become visible on a physical standby
can differ from the order in which it happens on the primary, but I
don't think that actually changes the picture much. The point is that
the WAL is the definitive source of information about what happened
and in what order it happened, and we use it in that way already in
the context of physical replication, and of standbys. If logical
decoding has a problem with some case that those systems handle
correctly, the problem is with logical decoding, not the WAL format.

The problem lies in how we log sequences. If we wrote each individual
increment to WAL, it might work the way you propose (except for cases
with sequences created in a transaction, etc.). But that's not what we
do - we log sequence increments in batches of 32 values, and then only
modify the sequence relfilenode.

This works for physical replication, because the WAL describes the
"next" state of the sequence (so if you do "SELECT * FROM sequence"
you'll not see the same state, and the sequence value may "jump ahead"
after a failover).
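A rough model of that batching (Python, just to make the mechanism concrete - the real code is in sequence.c and tracks this via log_cnt): a WAL record is written only every SEQ_LOG_VALS (32) values, and the record describes a *future* state, so replay after a crash or failover jumps the sequence ahead.

```python
# Toy model of batched sequence WAL logging. Simplified; not the
# actual PostgreSQL implementation.

SEQ_LOG_VALS = 32

class Seq:
    def __init__(self):
        self.value = 0
        self.log_cnt = 0   # values still covered by the last WAL record
        self.wal = []      # pretend WAL stream of logged "future" states

    def nextval(self):
        if self.log_cnt == 0:
            # log a state 32 increments in the future, covering them all
            self.wal.append(self.value + SEQ_LOG_VALS + 1)
            self.log_cnt = SEQ_LOG_VALS
        self.log_cnt -= 1
        self.value += 1
        return self.value

s = Seq()
for _ in range(5):
    s.nextval()
assert len(s.wal) == 1     # one WAL record covered all five increments
assert s.wal[-1] > s.value # replaying it puts the sequence *ahead* of 5
```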

But for logical replication this does not work, because the transaction
might depend on a state created (WAL-logged) by some other transaction.
And perhaps that transaction actually happened *before* we even built
the first snapshot for decoding :-/

There's also the issue with what snapshot to use when decoding these
transactional changes in logical decoding (see

In particular, I think it's likely that the "non-transactional
messages" that you mention earlier don't get applied at the point in
the commit sequence where they were found in the WAL. Not sure why
exactly, but perhaps the point at which we're reading WAL runs ahead
of the decoding per se, or something like that, and thus those
non-transactional messages arrive too early relative to the commit
ordering. Possibly that could be changed, and they could be buffered

I'm not sure which case of "non-transactional messages" this refers to,
so I can't quite respond to these comments. Perhaps you mean the
problems that killed the previous patch [1]?

[1]: /messages/by-id/00708727-d856-1886-48e3-811296c7ba8c@enterprisedb.com

until earlier commits are replicated. Or else, when we see a WAL
record for a non-transactional sequence operation, we could arrange to
bundle that operation into an "adjacent" replicated transaction i.e.

IIRC moving stuff between transactions during decoding is problematic,
because of snapshots.

the transaction whose commit record occurs most nearly prior to, or
most nearly after, the WAL record for the operation itself. Or else,
we could create "virtual" transactions for such operations and make
sure those get replayed at the right point in the commit sequence. Or
else, I don't know, maybe something else. But I think the overall
picture is that we need to approach the problem by replicating changes
in WAL order, as a physical standby would do. Saying that a change is
"nontransactional" doesn't mean that it's exempt from ordering
requirements; rather, it means that that change has its own place in
that ordering, distinct from the transaction in which it occurred.

But doesn't the approach with WAL-logging sequence state before COMMIT,
and then applying it independently in WAL-order, do pretty much this?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#6)
Re: logical decoding and replication of sequences, take 2

Hi,

On 2022-11-17 02:41:14 +0100, Tomas Vondra wrote:

Well, yeah - we can either try to perform the stuff independently of the
transactions that triggered it, or we can try making it part of some of
the transactions. Each of those options has problems, though :-(

The first version of the patch tried the first approach, i.e. decode the
increments and apply that independently. But:

(a) What would you do with increments of sequences created/reset in a
transaction? Can't apply those outside the transaction, because it
might be rolled back (and that state is not visible on primary).

I think a reasonable approach could be to actually perform different WAL
logging for that case. It'll require a bit of machinery, but could actually
result in *less* WAL logging overall, because we don't need to emit a WAL
record for each SEQ_LOG_VALS sequence values.

(b) What about increments created before we have a proper snapshot?
There may be transactions dependent on the increment. This is what
ultimately led to revert of the patch.

I don't understand this - why would we ever need to process those increments
from before we have a snapshot? Wouldn't they, by definition, be before the
slot was active?

To me this is the rough equivalent of logical decoding not giving the initial
state of all tables. You need some process outside of logical decoding to get
that (obviously we have some support for that via the exported data snapshot
during slot creation).

I assume that part of the initial sync would have to be a new sequence
synchronization step that reads all the sequence states on the publisher and
ensures that the subscriber sequences are at the same point. There's a bit of
trickiness there, but it seems entirely doable. The logical replication replay
support for sequences will have to be a bit careful about not decreasing the
subscriber's sequence values - the standby initially will be ahead of the
increments we'll see in the WAL. But that seems inevitable given the
non-transactional nature of sequences.

This version of the patch tries to do the opposite thing - make sure
that the state after each commit matches what the transaction might have
seen (for sequences it accessed). It's imperfect, because it might log a
state generated "after" the sequence got accessed - it focuses on the
guarantee not to generate duplicate values.

That approach seems quite wrong to me.

I'm going to confess that I have no really specific idea how to
implement that. I'm just not sufficiently familiar with this code.
However, I suspect that the solution lies in changing things on the
decoding side rather than in the WAL format. I feel like the
information that we need in order to do the right thing must already
be present in the WAL. If it weren't, then how could crash recovery
work correctly, or physical replication? At any given moment, you can
choose to promote a physical standby, and at that point the state you
observe on the new primary had better be some state that existed on
the primary at some point in its history. At any moment, you can
unplug the primary, restart it, and run crash recovery, and if you do,
you had better end up with some state that existed on the primary at
some point shortly before the crash.

One minor exception here is that there's no real time bound to see the last
few sequence increments if nothing after the XLOG_SEQ_LOG records forces a WAL
flush.

I think that there are actually a
few subtle inaccuracies in the last two sentences, because actually
the order in which transactions become visible on a physical standby
can differ from the order in which it happens on the primary, but I
don't think that actually changes the picture much. The point is that
the WAL is the definitive source of information about what happened
and in what order it happened, and we use it in that way already in
the context of physical replication, and of standbys. If logical
decoding has a problem with some case that those systems handle
correctly, the problem is with logical decoding, not the WAL format.

The problem lies in how we log sequences. If we wrote each individual
increment to WAL, it might work the way you propose (except for cases
with sequences created in a transaction, etc.). But that's not what we
do - we log sequence increments in batches of 32 values, and then only
modify the sequence relfilenode.

This works for physical replication, because the WAL describes the
"next" state of the sequence (so if you do "SELECT * FROM sequence"
you'll not see the same state, and the sequence value may "jump ahead"
after a failover).

But for logical replication this does not work, because the transaction
might depend on a state created (WAL-logged) by some other transaction.
And perhaps that transaction actually happened *before* we even built
the first snapshot for decoding :-/

I really can't follow the "depend on state ... by some other transaction"
aspect.

Even the case of a sequence that is renamed inside a transaction that did
*not* create / reset the sequence and then also triggers increment of the
sequence seems to be dealt with reasonably by processing sequence increments
outside a transaction - the old name will be used for the increments, replay
of the renaming transaction would then implement the rename in a hypothetical
DDL-replay future.

There's also the issue with what snapshot to use when decoding these
transactional changes in logical decoding (see

Incomplete parenthetical? Or were you referencing the next paragraph?

What are the transactional changes you're referring to here?

I did some skimming of the referenced thread about the reversal of the last
approach, but I couldn't really understand what the fundamental issues were
with the reverted implementation - it's a very long thread and references
other threads.

Greetings,

Andres Freund

#8Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#7)
Re: logical decoding and replication of sequences, take 2

On 11/17/22 03:43, Andres Freund wrote:

Hi,

On 2022-11-17 02:41:14 +0100, Tomas Vondra wrote:

Well, yeah - we can either try to perform the stuff independently of the
transactions that triggered it, or we can try making it part of some of
the transactions. Each of those options has problems, though :-(

The first version of the patch tried the first approach, i.e. decode the
increments and apply that independently. But:

(a) What would you do with increments of sequences created/reset in a
transaction? Can't apply those outside the transaction, because it
might be rolled back (and that state is not visible on primary).

I think a reasonable approach could be to actually perform different WAL
logging for that case. It'll require a bit of machinery, but could actually
result in *less* WAL logging overall, because we don't need to emit a WAL
record for each SEQ_LOG_VALS sequence values.

Could you elaborate? Hard to comment without knowing more ...

My point was that stuff like this (creating a new sequence or at least a
new relfilenode) means we can't apply that independently of the
transaction (unlike regular increments). I'm not sure how a change to
WAL logging would make that go away.

(b) What about increments created before we have a proper snapshot?
There may be transactions dependent on the increment. This is what
ultimately led to revert of the patch.

I don't understand this - why would we ever need to process those increments
from before we have a snapshot? Wouldn't they, by definition, be before the
slot was active?

To me this is the rough equivalent of logical decoding not giving the initial
state of all tables. You need some process outside of logical decoding to get
that (obviously we have some support for that via the exported data snapshot
during slot creation).

Which is what already happens during tablesync, no? We more or less copy
sequences as if they were tables.

I assume that part of the initial sync would have to be a new sequence
synchronization step that reads all the sequence states on the publisher and
ensures that the subscriber sequences are at the same point. There's a bit of
trickiness there, but it seems entirely doable. The logical replication replay
support for sequences will have to be a bit careful about not decreasing the
subscriber's sequence values - the standby initially will be ahead of the
increments we'll see in the WAL. But that seems inevitable given the
non-transactional nature of sequences.

See fetch_sequence_data / copy_sequence in the patch. The bit about
ensuring the sequence does not go backwards (say, using page LSN and/or LSN
of the increment) is not there, however isn't that pretty much what I
proposed doing for "reconciling" the sequence state logged at COMMIT?

This version of the patch tries to do the opposite thing - make sure
that the state after each commit matches what the transaction might have
seen (for sequences it accessed). It's imperfect, because it might log a
state generated "after" the sequence got accessed - it focuses on the
guarantee not to generate duplicate values.

That approach seems quite wrong to me.

Why? Because it might log a state for sequence as of COMMIT, when the
transaction accessed the sequence much earlier? That is, this may happen:

T1: nextval('s') -> 1
T2: call nextval('s') 1000000x
T1: commit

and T1 will log sequence state ~1000001, give or take. I don't think
there's way around that, given the non-transactional nature of
sequences. And I'm not convinced this is an issue, as it ensures
uniqueness of values generated on the subscriber. And I think it's
reasonable to replicate the sequence state as of the commit (because
that's what you'd see on the primary).
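To make that concrete, a toy model of the "log state at COMMIT" behavior (Python, illustration only - names invented): each committing transaction that touched the sequence logs the *current* on-disk state, not the values it consumed, so T1 ends up logging ~1000001 even though it only consumed value 1.

```python
# Sketch of commit-time sequence state logging. Not actual
# PostgreSQL code; hypothetical names throughout.

seq_value = 0
touched = {}   # xact -> True once it used the sequence

def nextval(xact):
    global seq_value
    seq_value += 1
    touched[xact] = True
    return seq_value

def commit(xact):
    # re-read the current sequence state at commit time and log that
    return seq_value if touched.get(xact) else None

assert nextval('T1') == 1
for _ in range(1_000_000):
    nextval('T2')
assert commit('T1') == 1_000_001   # T1 logs state as of commit, not as of use
```

The logged value overshoots what T1 itself saw, but it can never be *behind* any value a committed transaction observed, which is the uniqueness guarantee this approach aims for.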

I'm going to confess that I have no really specific idea how to
implement that. I'm just not sufficiently familiar with this code.
However, I suspect that the solution lies in changing things on the
decoding side rather than in the WAL format. I feel like the
information that we need in order to do the right thing must already
be present in the WAL. If it weren't, then how could crash recovery
work correctly, or physical replication? At any given moment, you can
choose to promote a physical standby, and at that point the state you
observe on the new primary had better be some state that existed on
the primary at some point in its history. At any moment, you can
unplug the primary, restart it, and run crash recovery, and if you do,
you had better end up with some state that existed on the primary at
some point shortly before the crash.

One minor exception here is that there's no real time bound to see the last
few sequence increments if nothing after the XLOG_SEQ_LOG records forces a WAL
flush.

Right. Another issue is we ignore stuff that happened in aborted
transactions, so then nextval('s') in another transaction may not wait
for syncrep to confirm receiving that WAL. Which is a data loss case,
see [1]:

[1]: /messages/by-id/712cad46-a9c8-1389-aef8-faf0203c9be9@enterprisedb.com

I think that there are actually a
few subtle inaccuracies in the last two sentences, because actually
the order in which transactions become visible on a physical standby
can differ from the order in which it happens on the primary, but I
don't think that actually changes the picture much. The point is that
the WAL is the definitive source of information about what happened
and in what order it happened, and we use it in that way already in
the context of physical replication, and of standbys. If logical
decoding has a problem with some case that those systems handle
correctly, the problem is with logical decoding, not the WAL format.

The problem lies in how we log sequences. If we wrote each individual
increment to WAL, it might work the way you propose (except for cases
with sequences created in a transaction, etc.). But that's not what we
do - we log sequence increments in batches of 32 values, and then only
modify the sequence relfilenode.

This works for physical replication, because the WAL describes the
"next" state of the sequence (so if you do "SELECT * FROM sequence"
you'll not see the same state, and the sequence value may "jump ahead"
after a failover).

But for logical replication this does not work, because the transaction
might depend on a state created (WAL-logged) by some other transaction.
And perhaps that transaction actually happened *before* we even built
the first snapshot for decoding :-/

I really can't follow the "depend on state ... by some other transaction"
aspect.

T1: nextval('s') -> writes WAL, covering the next 32 increments
T2: nextval('s') -> no WAL generated, covered by T1 WAL

This is what I mean by "dependency" on state logged by another
transaction. It already causes problems with streaming replication (see
the reference to syncrep above), logical replication has the same issue.

Even the case of a sequence that is renamed inside a transaction that did
*not* create / reset the sequence and then also triggers increment of the
sequence seems to be dealt with reasonably by processing sequence increments
outside a transaction - the old name will be used for the increments, replay
of the renaming transaction would then implement the rename in a hypothetical
DDL-replay future.

There's also the issue with what snapshot to use when decoding these
transactional changes in logical decoding (see

Incomplete parenthetical? Or were you referencing the next paragraph?

What are the transactional changes you're referring to here?

Sorry, IIRC I merely wanted to mention/reference the snapshot issue in
the thread [2] that I ended up referencing in the next paragraph.

[2]: /messages/by-id/00708727-d856-1886-48e3-811296c7ba8c@enterprisedb.com

I did some skimming of the referenced thread about the reversal of the last
approach, but I couldn't really understand what the fundamental issues were
with the reverted implementation - it's a very long thread and references
other threads.

Yes, it's long/complex, but I intentionally linked to a specific message
which describes the issue ...

It's entirely possible there is a simple fix for the issue, and I just
got confused / unable to see the solution. The whole issue was due to
having a mix of transactional and non-transactional cases, similarly to
logical messages - and logicalmsg_decode() has the same issue, so maybe
let's talk about that for a moment.

See [3] and imagine you're dealing with a transactional message, but
you're still building a consistent snapshot. So the first branch applies:

if (transactional &&
    !SnapBuildProcessChange(builder, xid, buf->origptr))
    return;

but because we don't have a snapshot, SnapBuildProcessChange does this:

if (builder->state < SNAPBUILD_FULL_SNAPSHOT)
    return false;

which however means logicalmsg_decode() does

snapshot = SnapBuildGetOrBuildSnapshot(builder);

which crashes, because it hits this assert:

Assert(builder->state == SNAPBUILD_CONSISTENT);

The sequence decoding did almost the same thing, with the same issue.
Maybe the correct thing to do is to just ignore the change in this case?
Presumably it'd be replicated by tablesync. But we've been unable to
convince ourselves that's correct, or what snapshot to pass to
ReorderBufferQueueMessage/ReorderBufferQueueSequence.

[3]: https://github.com/postgres/postgres/blob/master/src/backend/replication/logical/decode.c#L585

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#6)
Re: logical decoding and replication of sequences, take 2

On Wed, Nov 16, 2022 at 8:41 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

There's a couple of caveats, though:

1) Maybe we should focus more on "actually observed" state instead of
"observable". Who cares if the sequence moved forward in a transaction
that was ultimately rolled back? No committed transaction should have
observer those values - in a way, the last "valid" state of the sequence
is the last value generated in a transaction that ultimately committed.

When I say "observable" I mean from a separate transaction, not one
that is making changes to things.

I said "observable" rather than "actually observed" because we neither
know nor care whether someone actually ran a SELECT statement at any
given moment in time, just what they would have seen if they did.

2) I think what matters more is that we never generate duplicate value.
That is, if you generate a value from a sequence, commit a transaction
and replicate it, then the logical standby should not generate the same
value from the sequence. This guarantee seems necessary for "failover"
to logical standby.

I think that matters, but I don't think it's sufficient. We need to
preserve the order in which things appear to happen, and which changes
are and are not atomic, not just the final result.

Well, yeah - we can either try to perform the stuff independently of the
transactions that triggered it, or we can try making it part of some of
the transactions. Each of those options has problems, though :-(

The first version of the patch tried the first approach, i.e. decode the
increments and apply that independently. But:

(a) What would you do with increments of sequences created/reset in a
transaction? Can't apply those outside the transaction, because it
might be rolled back (and that state is not visible on primary).

If the state isn't going to be visible until the transaction commits,
it has to be replicated as part of the transaction. If I create a
sequence and then nextval it a bunch of times, I can't replicate that
by first creating the sequence, and then later, as a separate
operation, replicating the nextvals. If I do that, then there's an
intermediate state visible on the replica that was never visible on
the origin server. That's broken.

(b) What about increments created before we have a proper snapshot?
There may be transactions dependent on the increment. This is what
ultimately led to revert of the patch.

Whatever problem exists here is with the implementation, not the
concept. If you copy the initial state as it exists at some moment in
time to a replica, and then replicate all the changes that happen
afterward to that replica without messing up the order, the replica
WILL be in sync with the origin server. The things that happen before
you copy the initial state do not and cannot matter.

But what you're describing sounds like the changes aren't really
replicated in visibility order, and then it is easy to see how a
problem like this can happen. Because now, an operation that actually
became visible just before or just after the initial copy was taken
might be thought to belong on the other side of that boundary, and
then everything will break. And it sounds like that is what you are
describing.

This version of the patch tries to do the opposite thing - make sure
that the state after each commit matches what the transaction might have
seen (for sequences it accessed). It's imperfect, because it might log a
state generated "after" the sequence got accessed - it focuses on the
guarantee not to generate duplicate values.

Like Andres, I just can't imagine this being correct. It feels like
it's trying to paper over the failure to do the replication properly
during the transaction by overwriting state at the end.

Yes, this would mean we accept we may end up with something like this:

1: T1 logs sequence state S1
2: someone increments sequence
3: T2 logs sequence state S2
4: T2 commits
5: T1 commits

which "inverts" the apply order of S1 vs. S2, because we first apply S2
and then the "old" S1. But as long as we're smart enough to "discard"
applying S1, I think that's acceptable - because it guarantees we'll not
generate duplicate values (with values in the committed transaction).

I'd also argue it does not actually generate invalid state, because once
we commit either transaction, S2 is what's visible.

I agree that it's OK if the sequence increment gets merged into the
commit that immediately follows. However, I disagree with the idea of
discarding the second update on the grounds that it would make the
sequence go backward and we know that can't be right. That algorithm
works in the really specific case where the only operations are
increments. As soon as anyone does anything else to the sequence, such
an algorithm can no longer work. Nor can it work for objects that are
not sequences. The alternative strategy of replicating each change
exactly once and in the correct order works for all current and future
object types in all cases.

Your alternative proposal says "The other option might be to make
these messages non-transactional, in which case we'd separate the
ordering from COMMIT ordering, evading the reordering problem." But I
don't think that avoids the reordering problem at all.

I don't understand why. Why would it not address the reordering issue?

Nor do I think it's correct.

Nor do I understand this. I mean, isn't it essentially the option you
mentioned earlier - treating the non-transactional actions as
independent transactions? Yes, we'd be batching them so that we'd not
see "intermediate" states, but those are not observed by anyone.

I don't think that batching them is a bad idea, in fact I think it's
necessary. But those batches still have to be applied at the right
time relative to the sequence of commits.

I'm confused. Isn't that pretty much exactly what I'm proposing? Imagine
you have something like this:

1: T1 does something and also increments a sequence
2: T1 logs state of the sequence (right before commit)
3: T1 writes COMMIT

Now when we decode/apply this, we end up doing this:

1: decode all T1 changes, stash them
2: decode the sequence state and apply it separately
3: decode COMMIT, apply all T1 changes

There might be other transactions interleaving with this, but I think
it'd behave correctly. What example would not work?

What if one of the other transactions renames the sequence, or changes
the current value, or does basically anything to it other than
nextval?

The problem lies in how we log sequences. If we wrote each individual
increment to WAL, it might work the way you propose (except for cases
with sequences created in a transaction, etc.). But that's not what we
do - we log sequence increments in batches of 32 values, and then only
modify the sequence relfilenode.

This works for physical replication, because the WAL describes the
"next" state of the sequence (so if you do "SELECT * FROM sequence"
you'll not see the same state, and the sequence value may "jump ahead"
after a failover).

But for logical replication this does not work, because the transaction
might depend on a state created (WAL-logged) by some other transaction.
And perhaps that transaction actually happened *before* we even built
the first snapshot for decoding :-/

I agree that there's a problem here but I don't think that it's a huge
problem. I think that it's not QUITE right to think about what state
is visible on the primary. It's better to think about what state would
be visible on the primary if it crashed and restarted after writing
any given amount of WAL, or what would be visible on a physical
standby after replaying any given amount of WAL. If logical
replication mimics that, I think it's as correct as it needs to be. If
not, those other systems are broken, too.

So I think what should happen is that when we write a WAL record
saying that the sequence has been incremented by 32, that should be
logically replicated after all commits whose commit record precedes
that WAL record and before commits whose commit record follows that
WAL record. It is OK to merge the replication of that record into one
of either the immediately preceding or the immediately following
commit, but you can't do it as part of any other commit because then
you're changing the order of operations.

For instance, consider:

T1: BEGIN; INSERT; COMMIT;
T2: BEGIN; nextval('a_seq') causing a logged advancement to the sequence;
T3: BEGIN; nextval('b_seq') causing a logged advancement to the sequence;
T4: BEGIN; INSERT; COMMIT;
T2: COMMIT;
T3: COMMIT;

The sequence increments can be replicated as part of T1 or part of T4
or in between applying T1 and T4. They cannot be applied as part of T2
or T3. Otherwise, suppose T4 read the current value of one of those
sequences and included that value in the inserted row, and the target
table happened to be the sequence_value_at_end_of_period table. Then
imagine that after receiving the data for T4 and replicating it, the
primary server is hit by a meteor and the replica is promoted. Well,
it's now possible for some new transaction to get a lower value from
that sequence than what has already been written to the
sequence_value_at_end_of_period table, which will presumably break the
application.

In particular, I think it's likely that the "non-transactional
messages" that you mention earlier don't get applied at the point in
the commit sequence where they were found in the WAL. Not sure why
exactly, but perhaps the point at which we're reading WAL runs ahead
of the decoding per se, or something like that, and thus those
non-transactional messages arrive too early relative to the commit
ordering. Possibly that could be changed, and they could be buffered

I'm not sure which case of "non-transactional messages" this refers to,
so I can't quite respond to these comments. Perhaps you mean the
problems that killed the previous patch [1]?

In /messages/by-id/8bf1c518-b886-fe1b-5c42-09f9c663146d@enterprisedb.com
you said "The other option might be to make these messages
non-transactional". I was referring to that.

the transaction whose commit record occurs most nearly prior to, or
most nearly after, the WAL record for the operation itself. Or else,
we could create "virtual" transactions for such operations and make
sure those get replayed at the right point in the commit sequence. Or
else, I don't know, maybe something else. But I think the overall
picture is that we need to approach the problem by replicating changes
in WAL order, as a physical standby would do. Saying that a change is
"nontransactional" doesn't mean that it's exempt from ordering
requirements; rather, it means that that change has its own place in
that ordering, distinct from the transaction in which it occurred.

But doesn't the approach with WAL-logging sequence state before COMMIT,
and then applying it independently in WAL-order, do pretty much this?

I'm sort of repeating myself here, but: only if the only operations
that ever get performed on sequences are increments. Which is just not
true.

--
Robert Haas
EDB: http://www.enterprisedb.com

#10 Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#8)
Re: logical decoding and replication of sequences, take 2

Hi,

On 2022-11-17 12:39:49 +0100, Tomas Vondra wrote:

On 11/17/22 03:43, Andres Freund wrote:

On 2022-11-17 02:41:14 +0100, Tomas Vondra wrote:

Well, yeah - we can either try to perform the stuff independently of the
transactions that triggered it, or we can try making it part of some of
the transactions. Each of those options has problems, though :-(

The first version of the patch tried the first approach, i.e. decode the
increments and apply that independently. But:

(a) What would you do with increments of sequences created/reset in a
transaction? Can't apply those outside the transaction, because it
might be rolled back (and that state is not visible on primary).

I think a reasonable approach could be to actually perform different WAL
logging for that case. It'll require a bit of machinery, but could actually
result in *less* WAL logging overall, because we don't need to emit a WAL
record for each SEQ_LOG_VALS sequence values.

Could you elaborate? Hard to comment without knowing more ...

My point was that stuff like this (creating a new sequence or at least a
new relfilenode) means we can't apply that independently of the
transaction (unlike regular increments). I'm not sure how a change to
WAL logging would make that go away.

Different WAL logging would make it easy to handle that on the logical
decoding level. We don't need to emit WAL records each time a
created-in-this-toplevel-xact sequence gets incremented, as it won't
persist anyway if the surrounding xact aborts. We already need to remember
the filenode so it can be dropped at the end of the transaction, so we could
emit a single record for each sequence at that point.
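
A sketch of that bookkeeping (an illustrative model only; the real machinery would live in the sequence WAL-logging code, and the names here are invented): increments of sequences created by the current top-level transaction emit no per-increment WAL, and one record per such sequence is emitted at commit.

```python
class ToyXact:
    """Toy model of per-transaction sequence WAL-logging bookkeeping."""
    def __init__(self):
        self.created_here = set()   # relfilenodes created by this xact
        self.seq_state = {}         # sequence -> last_value

    def create_sequence(self, seq):
        self.created_here.add(seq)
        self.seq_state[seq] = 0

    def nextval(self, seq, wal):
        self.seq_state[seq] = self.seq_state.get(seq, 0) + 1
        if seq not in self.created_here:
            # normal path: per-increment (well, per-batch) logging
            wal.append((seq, self.seq_state[seq]))
        # else: no WAL needed; the state dies with the xact on abort

    def commit(self, wal):
        for seq in self.created_here:   # one record per created sequence
            wal.append((seq, self.seq_state[seq]))

wal = []
x = ToyXact()
x.create_sequence("s")
for _ in range(1000):
    x.nextval("s", wal)
x.commit(wal)
assert wal == [("s", 1000)]   # a single record instead of 1000
```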

(b) What about increments created before we have a proper snapshot?
There may be transactions dependent on the increment. This is what
ultimately led to revert of the patch.

I don't understand this - why would we ever need to process those increments
from before we have a snapshot? Wouldn't they, by definition, be before the
slot was active?

To me this is the rough equivalent of logical decoding not giving the initial
state of all tables. You need some process outside of logical decoding to get
that (obviously we have some support for that via the exported data snapshot
during slot creation).

Which is what already happens during tablesync, no? We more or less copy
sequences as if they were tables.

I think you might have to copy sequences after tables, but I'm not sure. But
otherwise, yea.

I assume that part of the initial sync would have to be a new sequence
synchronization step that reads all the sequence states on the publisher and
ensures that the subscriber sequences are at the same point. There's a bit of
trickiness there, but it seems entirely doable. The logical replication replay
support for sequences will have to be a bit careful about not decreasing the
subscriber's sequence values - the standby initially will be ahead of the
increments we'll see in the WAL. But that seems inevitable given the
non-transactional nature of sequences.
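
The "never move the subscriber's sequence backwards" rule is simple to state as a model (illustrative only; the real replay code would compare page LSNs and/or values on the subscriber):

```python
def apply_sequence_change(subscriber_value, decoded_value):
    """Apply a decoded sequence state, never decreasing the subscriber's value."""
    return max(subscriber_value, decoded_value)

# After the initial copy the subscriber starts ahead of early decoded
# increments, which must therefore be no-ops:
assert apply_sequence_change(100, 40) == 100   # stale increment, ignored
assert apply_sequence_change(100, 150) == 150  # newer state moves it forward
```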

See fetch_sequence_data / copy_sequence in the patch. The bit about
ensuring the sequence does not go backwards (say, using page LSN and/or LSN
of the increment) is not there, however isn't that pretty much what I
proposed doing for "reconciling" the sequence state logged at COMMIT?

Well, I think the approach of logging all sequence increments at commit is the
wrong idea...

Creating a new relfilenode whenever a sequence is incremented seems like a
complete no-go to me. That increases sequence overhead by several orders of
magnitude and will lead to *awful* catalog bloat on the subscriber.

This version of the patch tries to do the opposite thing - make sure
that the state after each commit matches what the transaction might have
seen (for sequences it accessed). It's imperfect, because it might log a
state generated "after" the sequence got accessed - it focuses on the
guarantee not to generate duplicate values.

That approach seems quite wrong to me.

Why? Because it might log a state for sequence as of COMMIT, when the
transaction accessed the sequence much earlier?

Mainly because sequences aren't transactional and trying to make them will
require awful contortions.

While there are cases where we don't flush the WAL / wait for syncrep for
sequences, we do replicate their state correctly on physical replication. If
an LSN has been acknowledged as having been replicated, we won't just lose a
prior sequence increment after promotion, even if the transaction didn't [yet]
commit.

It's completely valid for an application to call nextval() in one transaction,
potentially even abort it, and then only use that sequence value in another
transaction.

I did some skimming of the referenced thread about the reversal of the last
approach, but I couldn't really understand what the fundamental issues were
with the reverted implementation - it's a very long thread and references
other threads.

Yes, it's long/complex, but I intentionally linked to a specific message
which describes the issue ...

It's entirely possible there is a simple fix for the issue, and I just
got confused / unable to see the solution. The whole issue was due to
having a mix of transactional and non-transactional cases, similarly to
logical messages - and logicalmsg_decode() has the same issue, so maybe
let's talk about that for a moment.

See [3] and imagine you're dealing with a transactional message, but
you're still building a consistent snapshot. So the first branch applies:

if (transactional &&
!SnapBuildProcessChange(builder, xid, buf->origptr))
return;

but because we don't have a snapshot, SnapBuildProcessChange does this:

if (builder->state < SNAPBUILD_FULL_SNAPSHOT)
return false;

In this case we'd just return without further work in logicalmsg_decode(). The
problematic case presumably is is when we have a full snapshot but aren't yet
consistent, but xid is >= next_phase_at. Then SnapBuildProcessChange() returns
true, and logicalmsg_decode() then goes on to do

snapshot = SnapBuildGetOrBuildSnapshot(builder);

which crashes, because it hits this assert:

Assert(builder->state == SNAPBUILD_CONSISTENT);

I think the problem here is just that we shouldn't even try to get a snapshot
in the transactional case - note that it's not even used in
ReorderBufferQueueMessage() for transactional message. The transactional case
needs to behave like a "normal" change - we might never decode the message if
the transaction ends up committing before we've reached a consistent point.

The sequence decoding did almost the same thing, with the same issue.
Maybe the correct thing to do is to just ignore the change in this case?

No, I don't think that'd be correct; the message / sequence change needs to be queued
for the transaction. If the transaction ends up committing after we've reached
consistency, we'll get the correct snapshot from the base snapshot set in
SnapBuildProcessChange().
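
The resulting control flow can be summarized as follows (a simplified model of the branching discussed above, not the actual decode.c code; the state names are abbreviated):

```python
FULL_SNAPSHOT, CONSISTENT = 1, 2

def decode_message(builder_state, transactional):
    """Toy model: what logicalmsg_decode() should do with a message."""
    if transactional:
        # Behaves like a normal change: skip until we have at least a
        # full snapshot, otherwise queue it. The snapshot used at apply
        # time comes from the transaction's base snapshot, not from here.
        if builder_state < FULL_SNAPSHOT:
            return "skip"
        return "queue_in_transaction"
    # Non-transactional: the callback runs immediately, so we need a
    # snapshot now - and must not try to build one before consistency.
    if builder_state < CONSISTENT:
        return "skip"
    return "build_snapshot_and_apply"

assert decode_message(FULL_SNAPSHOT, transactional=True) == "queue_in_transaction"
assert decode_message(FULL_SNAPSHOT, transactional=False) == "skip"
assert decode_message(CONSISTENT, transactional=False) == "build_snapshot_and_apply"
```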

Greetings,

Andres Freund

#11 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#10)
Re: logical decoding and replication of sequences, take 2

On 11/17/22 18:07, Andres Freund wrote:

Hi,

On 2022-11-17 12:39:49 +0100, Tomas Vondra wrote:

On 11/17/22 03:43, Andres Freund wrote:

On 2022-11-17 02:41:14 +0100, Tomas Vondra wrote:

Well, yeah - we can either try to perform the stuff independently of the
transactions that triggered it, or we can try making it part of some of
the transactions. Each of those options has problems, though :-(

The first version of the patch tried the first approach, i.e. decode the
increments and apply that independently. But:

(a) What would you do with increments of sequences created/reset in a
transaction? Can't apply those outside the transaction, because it
might be rolled back (and that state is not visible on primary).

I think a reasonable approach could be to actually perform different WAL
logging for that case. It'll require a bit of machinery, but could actually
result in *less* WAL logging overall, because we don't need to emit a WAL
record for each SEQ_LOG_VALS sequence values.

Could you elaborate? Hard to comment without knowing more ...

My point was that stuff like this (creating a new sequence or at least a
new relfilenode) means we can't apply that independently of the
transaction (unlike regular increments). I'm not sure how a change to
WAL logging would make that go away.

Different WAL logging would make it easy to handle that on the logical
decoding level. We don't need to emit WAL records each time a
created-in-this-toplevel-xact sequence gets incremented, as it won't
persist anyway if the surrounding xact aborts. We already need to remember
the filenode so it can be dropped at the end of the transaction, so we could
emit a single record for each sequence at that point.

(b) What about increments created before we have a proper snapshot?
There may be transactions dependent on the increment. This is what
ultimately led to revert of the patch.

I don't understand this - why would we ever need to process those increments
from before we have a snapshot? Wouldn't they, by definition, be before the
slot was active?

To me this is the rough equivalent of logical decoding not giving the initial
state of all tables. You need some process outside of logical decoding to get
that (obviously we have some support for that via the exported data snapshot
during slot creation).

Which is what already happens during tablesync, no? We more or less copy
sequences as if they were tables.

I think you might have to copy sequences after tables, but I'm not sure. But
otherwise, yea.

I assume that part of the initial sync would have to be a new sequence
synchronization step that reads all the sequence states on the publisher and
ensures that the subscriber sequences are at the same point. There's a bit of
trickiness there, but it seems entirely doable. The logical replication replay
support for sequences will have to be a bit careful about not decreasing the
subscriber's sequence values - the standby initially will be ahead of the
increments we'll see in the WAL. But that seems inevitable given the
non-transactional nature of sequences.

See fetch_sequence_data / copy_sequence in the patch. The bit about
ensuring the sequence does not go backwards (say, using page LSN and/or LSN
of the increment) is not there, however isn't that pretty much what I
proposed doing for "reconciling" the sequence state logged at COMMIT?

Well, I think the approach of logging all sequence increments at commit is the
wrong idea...

But we're not logging all sequence increments, no?

We're logging the state for each sequence touched by the transaction,
but only once - if the transaction incremented the sequence 1,000,000
times, we'll still log it just once (at least for this particular purpose).

Yes, if each transaction touches a sequence just once, then this amounts
to logging the individual increments.

The only more efficient solution would be to decode the existing WAL
(every ~32 increments), and perhaps also tracking which sequences were
accessed by a transaction. And then simply stashing the increments in a
global reorderbuffer hash table, and then applying only the last one at
commit time. This would require the transactional / non-transactional
behavior (I think), but perhaps we can make that work.

Or are you thinking about some other scheme?
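
A sketch of that scheme (an illustrative model; in reality the stash would live in the reorderbuffer and the LSNs come from the WAL records): keep, per sequence, only the increment with the highest LSN, and apply it at commit, still without moving the subscriber backwards.

```python
def stash_increment(stash, seq, lsn, value):
    """Remember only the newest (highest-LSN) decoded state per sequence."""
    if seq not in stash or lsn > stash[seq][0]:
        stash[seq] = (lsn, value)

def apply_at_commit(stash, subscriber):
    for seq, (_lsn, value) in stash.items():
        subscriber[seq] = max(subscriber.get(seq, 0), value)

stash, subscriber = {}, {}
stash_increment(stash, "s", lsn=10, value=32)   # logged every ~32 increments
stash_increment(stash, "s", lsn=20, value=64)
stash_increment(stash, "s", lsn=15, value=48)   # obsolete: older LSN, discarded
apply_at_commit(stash, subscriber)
assert subscriber == {"s": 64}
```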

Creating a new relfilenode whenever a sequence is incremented seems like a
complete no-go to me. That increases sequence overhead by several orders of
magnitude and will lead to *awful* catalog bloat on the subscriber.

You mean on the the apply side? Yes, I agree this needs a better
approach, I've focused on the decoding side so far.

This version of the patch tries to do the opposite thing - make sure
that the state after each commit matches what the transaction might have
seen (for sequences it accessed). It's imperfect, because it might log a
state generated "after" the sequence got accessed - it focuses on the
guarantee not to generate duplicate values.

That approach seems quite wrong to me.

Why? Because it might log a state for sequence as of COMMIT, when the
transaction accessed the sequence much earlier?

Mainly because sequences aren't transactional and trying to make them will
require awful contortions.

While there are cases where we don't flush the WAL / wait for syncrep for
sequences, we do replicate their state correctly on physical replication. If
an LSN has been acknowledged as having been replicated, we won't just lose a
prior sequence increment after promotion, even if the transaction didn't [yet]
commit.

True, I agree we should aim to achieve that.

It's completely valid for an application to call nextval() in one transaction,
potentially even abort it, and then only use that sequence value in another
transaction.

I don't quite agree with that - we make no promises about what happens
to sequence changes in aborted transactions. I don't think I've ever
seen an application using such a pattern either.

And I'd argue we already fail to uphold such guarantee, because we don't
wait for syncrep if the sequence WAL happened in aborted transaction. So
if you use the value elsewhere (outside PG), you may lose it.

Anyway, I think the scheme I outlined above (with stashing decoded
increments, logged once every 32 values) and applying the latest
increment for all sequences at commit, would work.

I did some skimming of the referenced thread about the reversal of the last
approach, but I couldn't really understand what the fundamental issues were
with the reverted implementation - it's a very long thread and references
other threads.

Yes, it's long/complex, but I intentionally linked to a specific message
which describes the issue ...

It's entirely possible there is a simple fix for the issue, and I just
got confused / unable to see the solution. The whole issue was due to
having a mix of transactional and non-transactional cases, similarly to
logical messages - and logicalmsg_decode() has the same issue, so maybe
let's talk about that for a moment.

See [3] and imagine you're dealing with a transactional message, but
you're still building a consistent snapshot. So the first branch applies:

if (transactional &&
!SnapBuildProcessChange(builder, xid, buf->origptr))
return;

but because we don't have a snapshot, SnapBuildProcessChange does this:

if (builder->state < SNAPBUILD_FULL_SNAPSHOT)
return false;

In this case we'd just return without further work in logicalmsg_decode(). The
problematic case presumably is when we have a full snapshot but aren't yet
consistent, but xid is >= next_phase_at. Then SnapBuildProcessChange() returns
true, and logicalmsg_decode() then goes on to do

snapshot = SnapBuildGetOrBuildSnapshot(builder);

which crashes, because it hits this assert:

Assert(builder->state == SNAPBUILD_CONSISTENT);

I think the problem here is just that we shouldn't even try to get a snapshot
in the transactional case - note that it's not even used in
ReorderBufferQueueMessage() for transactional message. The transactional case
needs to behave like a "normal" change - we might never decode the message if
the transaction ends up committing before we've reached a consistent point.

The sequence decoding did almost the same thing, with the same issue.
Maybe the correct thing to do is to just ignore the change in this case?

No, I don't think that'd be correct; the message / sequence change needs to be queued
for the transaction. If the transaction ends up committing after we've reached
consistency, we'll get the correct snapshot from the base snapshot set in
SnapBuildProcessChange().

Yeah, I think you're right. I looked at this again, with fresh mind, and
I came to the same conclusion. Roughly what the attached patch does.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

logical-msg-fix.patch (text/x-patch, +8 -6 lines)
#12 Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#11)
Re: logical decoding and replication of sequences, take 2

Hi,

On 2022-11-17 22:13:23 +0100, Tomas Vondra wrote:

On 11/17/22 18:07, Andres Freund wrote:

On 2022-11-17 12:39:49 +0100, Tomas Vondra wrote:

On 11/17/22 03:43, Andres Freund wrote:

I assume that part of the initial sync would have to be a new sequence
synchronization step that reads all the sequence states on the publisher and
ensures that the subscriber sequences are at the same point. There's a bit of
trickiness there, but it seems entirely doable. The logical replication replay
support for sequences will have to be a bit careful about not decreasing the
subscriber's sequence values - the standby initially will be ahead of the
increments we'll see in the WAL. But that seems inevitable given the
non-transactional nature of sequences.

See fetch_sequence_data / copy_sequence in the patch. The bit about
ensuring the sequence does not go away (say, using page LSN and/or LSN
of the increment) is not there, however isn't that pretty much what I
proposed doing for "reconciling" the sequence state logged at COMMIT?

Well, I think the approach of logging all sequence increments at commit is the
wrong idea...

But we're not logging all sequence increments, no?

I was imprecise - I meant streaming them out at commit.

Yeah, I think you're right. I looked at this again, with fresh mind, and
I came to the same conclusion. Roughly what the attached patch does.

To me it seems a bit nicer to keep the SnapBuildGetOrBuildSnapshot() call in
decode.c instead of moving it to reorderbuffer.c. Perhaps we should add a
snapbuild.c helper similar to SnapBuildProcessChange() for non-transactional
changes that also gets a snapshot?

Could look something like

Snapshot snapshot = NULL;

if (message->transactional &&
!SnapBuildProcessChange(builder, xid, buf->origptr))
return;
else if (!SnapBuildProcessStateNonTx(builder, &snapshot))
return;

...

Or perhaps we should just bite the bullet and add an argument to
SnapBuildProcessChange to deal with that?

Greetings,

Andres Freund

#13 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#12)
Re: logical decoding and replication of sequences, take 2

Hi,

Here's a rebased version of the sequence decoding patch.

0001 is a fix for the pre-existing issue in logicalmsg_decode,
attempting to build a snapshot before getting into a consistent state.
AFAICS this only affects assert-enabled builds and is otherwise
harmless, because we are not actually using the snapshot (apply gets a
valid snapshot from the transaction).

This is mostly the fix I shared in November, except that I kept the call
in decode.c (per comment from Andres). I haven't added any argument to
SnapBuildProcessChange because we may need to backpatch this (and it
didn't seem much simpler, IMHO).

0002 is a rebased version of the original approach, committed as
0da92dc530 (and then reverted in 2c7ea57e56). This includes the same fix
as 0001 (for the sequence messages), the primary reason for the revert.

The rebase was not quite straightforward, due to extensive changes in
how publications deal with tables/schemas, and so on. So this adopts
them, but other than that it behaves just like the original patch.

So this abandons the approach with COMMIT-time logging for sequences
accessed/modified by the transaction, proposed in response to the
revert. It seemed like a good (and simpler) alternative, but there were
far too many issues - higher overhead, ordering of records for
concurrent transactions, making it reliable, etc.

I think the main remaining question is what's the goal of this patch, or
rather what "guarantees" we expect from it - what we expect to see on
the replica after incrementing a sequence on the primary.

Robert described [1] a model and argued the standby should not "invent"
new states. It's a long / detailed explanation, I'm not going to try to
shorten in here because that'd inevitably omit various details. So
better read it whole ...

Anyway, I don't think this approach (essentially treating most sequence
increments as non-transactional) breaks any consistency guarantees or
introduces any "new" states that would not be observable on the primary.
In a way, this treats non-transactional sequence increments as separate
transactions, and applies them directly. If you read the sequence in
between two commits, you might see any "intermediate" state of the
sequence - that's the nature of non-transactional changes.

We could "postpone" applying the decoded changes until the first next
commit, which might improve performance if a transaction is long enough
to cover many sequence increments. But that's more a performance
optimization than a matter of correctness, IMHO.

One caveat is that because of how WAL works for sequences, we're
actually decoding changes "ahead" so if you read the sequence on the
subscriber it'll actually seem to be slightly ahead (up to ~32 values).
This could be eliminated by setting SEQ_LOG_VALS to 0, which however
increases the sequence costs, of course.
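
The "up to ~32 ahead" effect follows from how nextval() pre-logs values in batches (a simplified model of the batching; the real code in sequence.c also handles caching, wraparound, etc.):

```python
SEQ_LOG_VALS = 32

def nextval(last_value, log_cnt, wal_value):
    """Toy model: WAL-log a value SEQ_LOG_VALS ahead, once per batch."""
    last_value += 1
    if log_cnt == 0:
        wal_value = last_value + SEQ_LOG_VALS   # pre-log a batch of values
        log_cnt = SEQ_LOG_VALS
    log_cnt -= 1
    return last_value, log_cnt, wal_value

state = (0, 0, 0)
for _ in range(100):
    state = nextval(*state)
    last, _cnt, wal = state
    # What decoding sees (wal) never trails, and leads by at most ~32:
    assert 0 < wal - last <= SEQ_LOG_VALS
```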

This however brings me to the original question what's the purpose of
this patch - and that's essentially keeping sequences up to date to make
them usable after a failover. We can't generate values from the sequence
on the subscriber, because it'd just get overwritten. And from this
point of view, it's also fine that the sequence is slightly ahead,
because that's what happens after crash recovery anyway. And we're not
guaranteeing the sequences to be gap-less.

regards

[1] /messages/by-id/CA+TgmoaYG7672OgdwpGm5cOwy8_ftbs=3u-YMvR9fiJwQUzgrQ@mail.gmail.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

0001-Fix-snapshot-handling-in-logicalmsg_decode-20230110.patch (text/x-patch, +18 -3 lines)
0002-Logical-decoding-of-sequences-20230110.patch (text/x-patch, +4715 -560 lines)
#14 Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#13)
Re: logical decoding and replication of sequences, take 2

On Tue, Jan 10, 2023 at 1:32 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

0001 is a fix for the pre-existing issue in logicalmsg_decode,
attempting to build a snapshot before getting into a consistent state.
AFAICS this only affects assert-enabled builds and is otherwise
harmless, because we are not actually using the snapshot (apply gets a
valid snapshot from the transaction).

This is mostly the fix I shared in November, except that I kept the call
in decode.c (per comment from Andres). I haven't added any argument to
SnapBuildProcessChange because we may need to backpatch this (and it
didn't seem much simpler, IMHO).

I tend to associate transactional behavior with snapshots, so it looks
odd to see code that builds a snapshot only when the message is
non-transactional. I think that a more detailed comment spelling out
the reasoning would be useful here.

This however brings me to the original question what's the purpose of
this patch - and that's essentially keeping sequences up to date to make
them usable after a failover. We can't generate values from the sequence
on the subscriber, because it'd just get overwritten. And from this
point of view, it's also fine that the sequence is slightly ahead,
because that's what happens after crash recovery anyway. And we're not
guaranteeing the sequences to be gap-less.

I agree that it's fine for the sequence to be slightly ahead, but I
think that it can't be too far ahead without causing problems. Suppose
for example that transaction #1 creates a sequence. Transaction #2
does nextval on the sequence a bunch of times and inserts rows into a
table using the sequence values as the PK. It's fine if the nextval
operations are replicated ahead of the commit of transaction #2 -- in
fact I'd say it's necessary for correctness -- but they can't precede
the commit of transaction #1, since then the sequence won't exist yet.
Likewise, if there's an ALTER SEQUENCE that creates a new relfilenode,
I think that needs to act as a barrier: non-transactional changes that
happened before that transaction must also be replicated before that
transaction is replicated, and those that happened after that
transaction is replicated must be replayed after that transaction is
replicated. Otherwise, at the very least, there will be states visible
on the standby that were never visible on the origin server, and maybe
we'll just straight up get the wrong answer. For instance:

1. nextval, setting last_value to 3
2. ALTER SEQUENCE, getting a new relfilenode, and also set last_value to 19
3. nextval, setting last_value to 20

If 3 happens before 2, the sequence ends up in the wrong state.

Maybe you've already got this and similar cases totally correctly
handled, I'm not sure, just throwing it out there.
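
That ordering constraint is easy to demonstrate with a toy replay model (illustrative only; "tx" marks the transactional ALTER that must be replayed with its commit, "nontx" the independent increments):

```python
def replay(ops):
    """Toy model: apply sequence state changes in the given order."""
    last_value = None
    for _kind, value in ops:
        last_value = value          # each operation just sets the state here
    return last_value

nextval_3  = ("nontx", 3)
alter_19   = ("tx", 19)             # new relfilenode, ships with its commit
nextval_20 = ("nontx", 20)

assert replay([nextval_3, alter_19, nextval_20]) == 20   # WAL order: correct
assert replay([nextval_3, nextval_20, alter_19]) == 19   # 3 before 2: wrong state
```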

--
Robert Haas
EDB: http://www.enterprisedb.com

#15 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#14)
Re: logical decoding and replication of sequences, take 2

On 1/10/23 20:52, Robert Haas wrote:

On Tue, Jan 10, 2023 at 1:32 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

0001 is a fix for the pre-existing issue in logicalmsg_decode,
attempting to build a snapshot before getting into a consistent state.
AFAICS this only affects assert-enabled builds and is otherwise
harmless, because we are not actually using the snapshot (apply gets a
valid snapshot from the transaction).

This is mostly the fix I shared in November, except that I kept the call
in decode.c (per comment from Andres). I haven't added any argument to
SnapBuildProcessChange because we may need to backpatch this (and it
didn't seem much simpler, IMHO).

I tend to associate transactional behavior with snapshots, so it looks
odd to see code that builds a snapshot only when the message is
non-transactional. I think that a more detailed comment spelling out
the reasoning would be useful here.

I'll try adding a comment explaining this, but the reasoning is fairly
simple AFAICS:

1) We don't actually need to build the snapshot for transactional
changes, because if we end up applying the change, we'll use snapshot
provided/maintained by reorderbuffer.

2) But we don't know if we end up applying the change - it may happen
this is one of the transactions we're waiting to finish / skipped, in
which case the snapshot is kinda bogus anyway. What "saved" us is that
we'll not actually use the snapshot in the end. It's just the assert
that causes issues.

3) For non-transactional changes, we need a snapshot because we're going
to execute the callback right away. But in this case the code actually
protects against building inconsistent snapshots.

This however brings me to the original question what's the purpose of
this patch - and that's essentially keeping sequences up to date to make
them usable after a failover. We can't generate values from the sequence
on the subscriber, because it'd just get overwritten. And from this
point of view, it's also fine that the sequence is slightly ahead,
because that's what happens after crash recovery anyway. And we're not
guaranteeing the sequences to be gap-less.

I agree that it's fine for the sequence to be slightly ahead, but I
think that it can't be too far ahead without causing problems. Suppose
for example that transaction #1 creates a sequence. Transaction #2
does nextval on the sequence a bunch of times and inserts rows into a
table using the sequence values as the PK. It's fine if the nextval
operations are replicated ahead of the commit of transaction #2 -- in
fact I'd say it's necessary for correctness -- but they can't precede
the commit of transaction #1, since then the sequence won't exist yet.

It's not clear to me how could that even happen. If transaction #1
creates a sequence, it's invisible for transaction #2. So how could it
do nextval() on it? #2 has to wait for #1 to commit before it can do
anything on the sequence, which enforces the correct ordering, no?

Likewise, if there's an ALTER SEQUENCE that creates a new relfilenode,
I think that needs to act as a barrier: non-transactional changes that
happened before that transaction must also be replicated before that
transaction is replicated, and those that happened after that
transaction is replicated must be replayed after that transaction is
replicated. Otherwise, at the very least, there will be states visible
on the standby that were never visible on the origin server, and maybe
we'll just straight up get the wrong answer. For instance:

1. nextval, setting last_value to 3
2. ALTER SEQUENCE, getting a new relfilenode, and also set last_value to 19
3. nextval, setting last_value to 20

If 3 happens before 2, the sequence ends up in the wrong state.

Maybe you've already got this and similar cases totally correctly
handled, I'm not sure, just throwing it out there.

I believe this should behave correctly too, thanks to locking.

If a transaction does ALTER SEQUENCE, that locks the sequence, so only
that transaction can do stuff with that sequence (and changes from that
point are treated as transactional). And everyone else is waiting for #1
to commit.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#16 Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#13)
Re: logical decoding and replication of sequences, take 2

Hi,

Heikki, CCed you due to the point about 2c03216d8311 below.

On 2023-01-10 19:32:12 +0100, Tomas Vondra wrote:

0001 is a fix for the pre-existing issue in logicalmsg_decode,
attempting to build a snapshot before getting into a consistent state.
AFAICS this only affects assert-enabled builds and is otherwise
harmless, because we are not actually using the snapshot (apply gets a
valid snapshot from the transaction).

LGTM.

0002 is a rebased version of the original approach, committed as
0da92dc530 (and then reverted in 2c7ea57e56). This includes the same fix
as 0001 (for the sequence messages), the primary reason for the revert.

The rebase was not quite straightforward, due to extensive changes in
how publications deal with tables/schemas, and so on. So this adopts
them, but other than that it behaves just like the original patch.

This is a huge diff:

72 files changed, 4715 insertions(+), 612 deletions(-)

It'd be nice to split it to make review easier. Perhaps the sequence decoding
support could be split from the whole publication rigamarole?

This does not include any changes to test_decoding and/or the built-in
replication - those will be committed in separate patches.

Looks like that's not the case anymore?

+/*
+ * Update the sequence state by modifying the existing sequence data row.
+ *
+ * This keeps the same relfilenode, so the behavior is non-transactional.
+ */
+static void
+SetSequence_non_transactional(Oid seqrelid, int64 last_value, int64 log_cnt, bool is_called)
+{
+	SeqTable	elm;
+	Relation	seqrel;
+	Buffer		buf;
+	HeapTupleData seqdatatuple;
+	Form_pg_sequence_data seq;
+
+	/* open and lock sequence */
+	init_sequence(seqrelid, &elm, &seqrel);
+
+	/* lock page's buffer and read tuple */
+	seq = read_seq_tuple(seqrel, &buf, &seqdatatuple);
+
+	/* check the comment above nextval_internal()'s equivalent call. */
+	if (RelationNeedsWAL(seqrel))
+	{
+		GetTopTransactionId();
+
+		if (XLogLogicalInfoActive())
+			GetCurrentTransactionId();
+	}
+
+	/* ready to change the on-disk (or really, in-buffer) tuple */
+	START_CRIT_SECTION();
+
+	seq->last_value = last_value;
+	seq->is_called = is_called;
+	seq->log_cnt = log_cnt;
+
+	MarkBufferDirty(buf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(seqrel))
+	{
+		xl_seq_rec	xlrec;
+		XLogRecPtr	recptr;
+		Page		page = BufferGetPage(buf);
+
+		XLogBeginInsert();
+		XLogRegisterBuffer(0, buf, REGBUF_WILL_INIT);
+
+		xlrec.locator = seqrel->rd_locator;
+		xlrec.created = false;
+
+		XLogRegisterData((char *) &xlrec, sizeof(xl_seq_rec));
+		XLogRegisterData((char *) seqdatatuple.t_data, seqdatatuple.t_len);
+
+		recptr = XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG);
+
+		PageSetLSN(page, recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	UnlockReleaseBuffer(buf);
+
+	/* Clear local cache so that we don't think we have cached numbers */
+	/* Note that we do not change the currval() state */
+	elm->cached = elm->last;
+
+	relation_close(seqrel, NoLock);
+}
+
+/*
+ * Update the sequence state by creating a new relfilenode.
+ *
+ * This creates a new relfilenode, to allow transactional behavior.
+ */
+static void
+SetSequence_transactional(Oid seq_relid, int64 last_value, int64 log_cnt, bool is_called)
+{
+	SeqTable	elm;
+	Relation	seqrel;
+	Buffer		buf;
+	HeapTupleData seqdatatuple;
+	Form_pg_sequence_data seq;
+	HeapTuple	tuple;
+
+	/* open and lock sequence */
+	init_sequence(seq_relid, &elm, &seqrel);
+
+	/* lock page's buffer and read tuple */
+	seq = read_seq_tuple(seqrel, &buf, &seqdatatuple);
+
+	/* Copy the existing sequence tuple. */
+	tuple = heap_copytuple(&seqdatatuple);
+
+	/* Now we're done with the old page */
+	UnlockReleaseBuffer(buf);
+
+	/*
+	 * Modify the copied tuple to update the sequence state (similar to what
+	 * ResetSequence does).
+	 */
+	seq = (Form_pg_sequence_data) GETSTRUCT(tuple);
+	seq->last_value = last_value;
+	seq->is_called = is_called;
+	seq->log_cnt = log_cnt;
+
+	/*
+	 * Create a new storage file for the sequence - this is needed for the
+	 * transactional behavior.
+	 */
+	RelationSetNewRelfilenumber(seqrel, seqrel->rd_rel->relpersistence);
+
+	/*
+	 * Ensure sequence's relfrozenxid is at 0, since it won't contain any
+	 * unfrozen XIDs.  Same with relminmxid, since a sequence will never
+	 * contain multixacts.
+	 */
+	Assert(seqrel->rd_rel->relfrozenxid == InvalidTransactionId);
+	Assert(seqrel->rd_rel->relminmxid == InvalidMultiXactId);
+
+	/*
+	 * Insert the modified tuple into the new storage file. This does all the
+	 * necessary WAL-logging etc.
+	 */
+	fill_seq_with_data(seqrel, tuple);
+
+	/* Clear local cache so that we don't think we have cached numbers */
+	/* Note that we do not change the currval() state */
+	elm->cached = elm->last;
+
+	relation_close(seqrel, NoLock);
+}
+
+/*
+ * Set a sequence to a specified internal state.
+ *
+ * The change is made transactionally, so that on failure of the current
+ * transaction, the sequence will be restored to its previous state.
+ * We do that by creating a whole new relfilenode for the sequence; so this
+ * works much like the rewriting forms of ALTER TABLE.
+ *
+ * Caller is assumed to have acquired AccessExclusiveLock on the sequence,
+ * which must not be released until end of transaction.  Caller is also
+ * responsible for permissions checking.
+ */
+void
+SetSequence(Oid seq_relid, bool transactional, int64 last_value, int64 log_cnt, bool is_called)
+{
+	if (transactional)
+		SetSequence_transactional(seq_relid, last_value, log_cnt, is_called);
+	else
+		SetSequence_non_transactional(seq_relid, last_value, log_cnt, is_called);
+}

That's a lot of duplication with existing code. There's no explanation why
SetSequence() as well as do_setval() exists.

/*
* Initialize a sequence's relation with the specified tuple as content
*
@@ -406,8 +560,13 @@ fill_seq_fork_with_data(Relation rel, HeapTuple tuple, ForkNumber forkNum)

/* check the comment above nextval_internal()'s equivalent call. */
if (RelationNeedsWAL(rel))
+ {
GetTopTransactionId();

+		if (XLogLogicalInfoActive())
+			GetCurrentTransactionId();
+	}

Is it actually possible to reach this without an xid already having been
assigned for the current xact?

@@ -806,10 +966,28 @@ nextval_internal(Oid relid, bool check_permissions)
* It's sufficient to ensure the toplevel transaction has an xid, no need
* to assign xids subxacts, that'll already trigger an appropriate wait.
* (Have to do that here, so we're outside the critical section)
+	 *
+	 * We have to ensure we have a proper XID, which will be included in
+	 * the XLOG record by XLogRecordAssemble. Otherwise the first nextval()
+	 * in a subxact (without any preceding changes) would get XID 0, and it
+	 * would then be impossible to decide which top xact it belongs to.
+	 * It'd also trigger assert in DecodeSequence. We only do that with
+	 * wal_level=logical, though.
+	 *
+	 * XXX This might seem unnecessary, because if there's no XID the xact
+	 * couldn't have done anything important yet, e.g. it could not have
+	 * created a sequence. But that's incorrect, because of subxacts. The
+	 * current subtransaction might not have done anything yet (thus no XID),
+	 * but an earlier one might have created the sequence.
*/

What about restricting this to the case you're mentioning,
i.e. subtransactions?

@@ -845,6 +1023,7 @@ nextval_internal(Oid relid, bool check_permissions)
seq->log_cnt = 0;

xlrec.locator = seqrel->rd_locator;

I realize this isn't from this patch, but:

Why do we include the locator in the record? We already have it via
XLogRegisterBuffer(), no? And afaict we don't even use it, as we read the page
via XLogInitBufferForRedo() during recovery.

Kinda looks like an oversight in 2c03216d8311

+/*
+ * Handle sequence decode
+ *
+ * Decoding sequences is a bit tricky, because while most sequence actions
+ * are non-transactional (not subject to rollback), some need to be handled
+ * as transactional.
+ *
+ * By default, a sequence increment is non-transactional - we must not queue
+ * it in a transaction as other changes, because the transaction might get
+ * rolled back and we'd discard the increment. The downstream would not be
+ * notified about the increment, which is wrong.
+ *
+ * On the other hand, the sequence may be created in a transaction. In this
+ * case we *should* queue the change as other changes in the transaction,
+ * because we don't want to send the increments for unknown sequence to the
+ * plugin - it might get confused about which sequence it's related to etc.
+ */
+void
+sequence_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
+{
+	/* extract the WAL record, with "created" flag */
+	xlrec = (xl_seq_rec *) XLogRecGetData(r);
+
+	/* XXX how could we have sequence change without data? */
+	if (!datalen || !tupledata)
+		return;

Yea, I think we should error out here instead, something has gone quite wrong
if this happens.

+	tuplebuf = ReorderBufferGetTupleBuf(ctx->reorder, tuplelen);
+	DecodeSeqTuple(tupledata, datalen, tuplebuf);
+
+	/*
+	 * Should we handle the sequence increment as transactional or not?
+	 *
+	 * If the sequence was created in a still-running transaction, treat
+	 * it as transactional and queue the increments. Otherwise it needs
+	 * to be treated as non-transactional, in which case we send it to
+	 * the plugin right away.
+	 */
+	transactional = ReorderBufferSequenceIsTransactional(ctx->reorder,
+														 target_locator,
+														 xlrec->created);

Why re-create this information during decoding, when we basically already have
it available on the primary? I think we already pay the price for that
tracking, which we e.g. use for doing a non-transactional truncate:

/*
* Normally, we need a transaction-safe truncation here. However, if
* the table was either created in the current (sub)transaction or has
* a new relfilenumber in the current (sub)transaction, then we can
* just truncate it in-place, because a rollback would cause the whole
* table or the current physical file to be thrown away anyway.
*/
if (rel->rd_createSubid == mySubid ||
rel->rd_newRelfilelocatorSubid == mySubid)
{
/* Immediate, non-rollbackable truncation is OK */
heap_truncate_one_rel(rel);
}

Afaict we could do something similar for sequences, except that I think we
would just check if the sequence was created in the current transaction
(i.e. any of the fields are set).

+/*
+ * A transactional sequence increment is queued to be processed upon commit
+ * and a non-transactional increment gets processed immediately.
+ *
+ * A sequence update may be both transactional and non-transactional. When
+ * created in a running transaction, treat it as transactional and queue
+ * the change in it. Otherwise treat it as non-transactional, so that we
+ * don't forget the increment in case of a rollback.
+ */
+void
+ReorderBufferQueueSequence(ReorderBuffer *rb, TransactionId xid,
+						   Snapshot snapshot, XLogRecPtr lsn, RepOriginId origin_id,
+						   RelFileLocator rlocator, bool transactional, bool created,
+						   ReorderBufferTupleBuf *tuplebuf)
+		/*
+		 * Decoding needs access to syscaches et al., which in turn use
+		 * heavyweight locks and such. Thus we need to have enough state around to
+		 * keep track of those.  The easiest way is to simply use a transaction
+		 * internally.  That also allows us to easily enforce that nothing writes
+		 * to the database by checking for xid assignments.
+		 *
+		 * When we're called via the SQL SRF there's already a transaction
+		 * started, so start an explicit subtransaction there.
+		 */
+		using_subtxn = IsTransactionOrTransactionBlock();

This duplicates a lot of the code from ReorderBufferProcessTXN(). But only
does so partially. It's hard to tell whether some of the differences are
intentional. Could we de-duplicate that code with ReorderBufferProcessTXN()?

Maybe something like

void
ReorderBufferSetupXactEnv(ReorderBufferXactEnv *, bool process_invals);

void
ReorderBufferTeardownXactEnv(ReorderBufferXactEnv *, bool is_error);

Greetings,

Andres Freund

#17Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#15)
Re: logical decoding and replication of sequences, take 2

On Wed, Jan 11, 2023 at 1:29 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

I agree that it's fine for the sequence to be slightly ahead, but I
think that it can't be too far ahead without causing problems. Suppose
for example that transaction #1 creates a sequence. Transaction #2
does nextval on the sequence a bunch of times and inserts rows into a
table using the sequence values as the PK. It's fine if the nextval
operations are replicated ahead of the commit of transaction #2 -- in
fact I'd say it's necessary for correctness -- but they can't precede
the commit of transaction #1, since then the sequence won't exist yet.

It's not clear to me how could that even happen. If transaction #1
creates a sequence, it's invisible for transaction #2. So how could it
do nextval() on it? #2 has to wait for #1 to commit before it can do
anything on the sequence, which enforces the correct ordering, no?

Yeah, I meant if #1 had committed and then #2 started to do its thing.
I was worried that decoding might reach the nextval operations in
transaction #2 before it replayed #1.

This worry may be entirely based on me not understanding how this
actually works. Do we always apply a transaction as soon as we see the
commit record for it, before decoding any further?

--
Robert Haas
EDB: http://www.enterprisedb.com

#18Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#17)
Re: logical decoding and replication of sequences, take 2

Hi,

On 2023-01-11 15:23:18 -0500, Robert Haas wrote:

Yeah, I meant if #1 had committed and then #2 started to do its thing.
I was worried that decoding might reach the nextval operations in
transaction #2 before it replayed #1.

This worry may be entirely based on me not understanding how this
actually works. Do we always apply a transaction as soon as we see the
commit record for it, before decoding any further?

Yes.

Otherwise we'd have a really hard time figuring out the correct historical
snapshot to use for subsequent transactions - they'd have been able to see the
catalog modifications made by the committing transaction.

Greetings,

Andres Freund

#19Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#18)
Re: logical decoding and replication of sequences, take 2

On Wed, Jan 11, 2023 at 3:28 PM Andres Freund <andres@anarazel.de> wrote:

On 2023-01-11 15:23:18 -0500, Robert Haas wrote:

Yeah, I meant if #1 had committed and then #2 started to do its thing.
I was worried that decoding might reach the nextval operations in
transaction #2 before it replayed #1.

This worry may be entirely based on me not understanding how this
actually works. Do we always apply a transaction as soon as we see the
commit record for it, before decoding any further?

Yes.

Otherwise we'd have a really hard time figuring out the correct historical
snapshot to use for subsequent transactions - they'd have been able to see the
catalog modifications made by the committing transaction.

I wonder, then, what happens if somebody wants to do parallel apply.
That would seem to require some relaxation of this rule, but then
doesn't that break what this patch wants to do?

--
Robert Haas
EDB: http://www.enterprisedb.com

#20Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#19)
Re: logical decoding and replication of sequences, take 2

Hi,

On 2023-01-11 15:41:45 -0500, Robert Haas wrote:

I wonder, then, what happens if somebody wants to do parallel apply. That
would seem to require some relaxation of this rule, but then doesn't that
break what this patch wants to do?

I don't think it'd pose a direct problem - presumably you'd only parallelize
applying changes, not committing the transactions containing them. You'd get a
lot of inconsistencies otherwise.

If you're thinking of decoding changes in parallel (rather than streaming out
large changes before commit when possible), you'd only be able to do that in
cases where transactions haven't performed catalog changes, I think. In which
case there'd also be no issue wrt transactional sequence changes.

Greetings,

Andres Freund
