ERROR: subtransaction logged without previous top-level txn record
Hello,
I saw the previous thread, but it wasn't in my inbox at the time, so I'm creating a new one -- sorry about that.
/messages/by-id/20190516170434.masck6ehwg2kvbi2@alap3.anarazel.de
I’ve managed to reproduce the issue pretty consistently on REL9_6_STABLE on commit 959792087a10baf7f1b58408d28411109bcedb7a
OS version:
[ec2-user@ ... ~]$ uname -a
... 4.14.77-80.57.amzn2.x86_64 #1 SMP Tue Nov 6 21:18:57 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Postgres version:
postgres=# SELECT version();
version
----------------------------------------------------------------------------------------------------------
PostgreSQL 9.6.13 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5), 64-bit
(1 row)
I’m on an EC2 m5.4xlarge if that matters.
Repro steps:
1. Create the following script
[ec2-user@ip-172-31-18-48 ~]$ cat ~/subbench.txt
\set aid random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
\set subcnt random(1, 800)
select * from pgbench(:aid, :bid, :tid, :delta, :subcnt);
2. Create the following function:
CREATE OR REPLACE FUNCTION pgbench(pAid int, pBid int, pTid int, delta int, subcnt int) RETURNS int AS $$
DECLARE
    abal int;
BEGIN
    FOR i IN 1 .. subcnt LOOP
        BEGIN
            UPDATE pgbench_accounts SET abalance = abalance + delta WHERE aid = pAid;
            --subcnt := subcnt;
        EXCEPTION
            WHEN division_by_zero THEN
                subcnt := subcnt;
        END;
    END LOOP;
    abal := abalance FROM pgbench_accounts WHERE aid = pAid;
    RETURN abal;
END; $$ LANGUAGE plpgsql;
3. Create a few logical slots in the database
select pg_create_logical_replication_slot('test_slot_1', 'test_decoding');
select pg_create_logical_replication_slot('test_slot_2', 'test_decoding');
select pg_create_logical_replication_slot('test_slot_3', 'test_decoding');
...
4. Initialize pgbench
pgbench -i -d postgres
5. Load the data
pgbench -f subbench.txt -c 64 -j 64 -T 600 -P 1 -d postgres
6. Run pg_recvlogical with a timeout; it usually takes a few iterations (~7-9) before the error occurs
var=0
while true; do
timeout 30 pg_recvlogical -d postgres --start --slot test_slot_1 -f /dev/null;
var=$((var+1))
echo "Sleeping 5s Time: $var";
sleep 5;
done
pg_recvlogical -d postgres --start --slot test_slot_1 -f -
pg_recvlogical: unexpected termination of replication stream: ERROR: subtransaction logged without previous top-level txn record
pg_recvlogical -d postgres --start --slot test_slot_2 -f -
pg_recvlogical: unexpected termination of replication stream: ERROR: subtransaction logged without previous top-level txn record
pg_recvlogical -d postgres --start --slot test_slot_3 -f -
pg_recvlogical: unexpected termination of replication stream: ERROR: subtransaction logged without previous top-level txn record
pg_recvlogical: disconnected; waiting 5 seconds to try again
What's interesting is that the confirmed_flush_lsn values are all different from test_slot_1 through test_slot_3:
postgres=# select * from pg_replication_slots;
slot_name | plugin | slot_type | datoid | database | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
--------------+---------------+-----------+--------+----------+--------+------------+------+--------------+-------------+---------------------
test_slot_1 | test_decoding | logical | 13382 | postgres | f | | | 1848 | 0/1C5BC5A0 | 0/5488E468
test_slot_2 | test_decoding | logical | 13382 | postgres | f | | | 1848 | 0/1C5BC5A0 | 0/40E45EA0
test_slot_3 | test_decoding | logical | 13382 | postgres | f | | | 1848 | 0/3F4B6AF8 | 0/6BB3A990
Let me know if you require more info to repro.
Thanks!
John H
Hi,
Our customer also encountered this issue and I've looked into it. The problem
reproduces reliably using the instructions in the previous message.
The check leading to this ERROR is too strict; it forbids legitimate behaviour.
Say we have in WAL
[ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_2> <commit> <confirmed_flush_lsn> ]
- The first xl_xact_assignment record is beyond reading, i.e. earlier than
restart_lsn, where the ready snapshot will be taken from disk.
- After restart_lsn there is some change of a subxact.
- After that there is a second xl_xact_assignment (for another subxact)
revealing the relationship between the top xact and the first subxact; this is
where the ERROR fires.
Such a transaction won't be streamed because we haven't seen it in full. It
must finish before streaming starts, i.e. before confirmed_flush_lsn.
Of course, the easiest fix is to just throw the check out. However, supposing
that someone would probably want to relax it instead, I considered ways to
accomplish this. Something like: 'if we are still in SNAPBUILD_FULL_SNAPSHOT
and the xid is before SnapBuildNextPhaseAt, just ignore the xl_xact_assignment
record -- we haven't seen such an xact in full and definitely won't stream
it.' That led to the discovery of another bug, in a place I had found
suspicious long before.
The snapbuilder enters SNAPBUILD_CONSISTENT immediately after deserializing
the snapshot. Generally this is incorrect, because SNAPBUILD_CONSISTENT means
not just a complete snapshot (the snapshot by itself in the FULL state is just
as good as in CONSISTENT) but also a reorderbuffer filled with all currently
running xacts. This is painless for decoding sessions with existing slots,
because they won't stream anything before confirmed_flush_lsn is reached
anyway, at which point all transactions which hadn't made it into the
reorderbuffer have definitely finished. However, new slots might be created
too early, thus losing (not decoding) parts of transactions committed after
the freshly created confirmed_flush_lsn. This can happen under the following
extremely unlucky circumstances:
- New slot creation reserves the point in WAL from which it will read
(GetXLogInsertRecPtr);
- It logs xl_running_xacts to start assembling a snapshot;
- A running decoding session with another slot quickly reads this
xl_running_xacts and serializes its snapshot;
- The new slot reads xl_running_xacts and picks this snapshot up, declaring
itself ready to stream henceforth, though its reorderbuffer is empty.
Exact reproducing steps:
-- session 1
create table t (i int);
select pg_create_logical_replication_slot('slot_1', 'test_decoding');
-- session 2
begin;
insert into t values (1);
-- session 3, start slot creation
select pg_create_logical_replication_slot('slot_2', 'test_decoding');
-- stop (with gdb or something) it at DecodingContextFindStartpoint(ctx);
-- session 1
-- xl_running_xacts is dumped by ReplicationSlotReserveWal in previous command, no
-- need to sleep; our snap will be immediately serialized there
SELECT data FROM pg_logical_slot_get_changes('slot_1', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '0');
-- continue slot_2 creation
-- session 2: insert some more and commit
insert into t values (1);
commit;
-- now this would find second insert, but not the first one
SELECT data FROM pg_logical_slot_get_changes('slot_2', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '0');
What can we do here? Initially I thought: OK, then let's enter FULL_SNAPSHOT
upon deserializing the snap and wait for all xacts to finish as usual.
However, to my surprise I found this impossible. That is, the snapbuilder has
no way to enforce that we go into CONSISTENT only once we have seen all
running xacts completely, without the risk of skipping legitimate
transactions. Specifically, after deserializing a FULL snapshot, the
snapbuilder must iterate over WAL until all running xacts finish, as we must
see, with correct snapshots, all changes of every transaction we are going to
stream. However, the snapbuilder can't *immediately* notice this point,
because:
- The snapbuilder updates xmin (first running xact) by taking it from
xl_running_xacts (c.f. SnapBuildProcessRunningXacts). Even if we
guarantee that, for each possible WAL reading starting position, there
is always an xl_running_xacts record logged right before the earliest
possible streaming point -- IOW, after all xacts which we can't stream
have finished (which is currently true, btw, as slot advancement is
considered only at xl_running_xacts) -- that would not be enough due
to races around xl_running_xacts, i.e. with WAL like
[ <T1> <restart_lsn> <T1 commit> <confirmed_flush_lsn, xrx> <T2 commit> ]
T2 might be skipped if T1 is shown as running in xl_running_xacts.
- Tracking xmin manually by recording commits is not only inefficient,
it is simply not feasible, because the serialized snapshot is not
full: it contains only committed catalog-modifying xacts. Thus we
can't distinguish a non-catalog-modifying xact committed before the
serialized snapshot from one not yet committed.
Which means only code external to the snapbuilder knows the earliest point
suitable for streaming; the slot advancement machinery ensures that the
<restart_lsn, confirmed_flush_lsn> pair is always good. So a possible fix is
the following: if the snapbuilder's user knows the exact LSN since which
streaming is safe (an existing slot, essentially), trust it and switch into
the CONSISTENT state after deserializing the snapshot, as before. OTOH, if it
doesn't know it (new slot creation), go via the usual FULL -> CONSISTENT
procedure; we might transition into CONSISTENT a bit later than it became
possible, but there is nothing bad about that.
The first attached patch implements this. I don't particularly like it, but
the only alternative I see is to rework the slot advancement logic so that the
<restart_lsn, confirmed_flush_lsn> pair always has an xl_running_xacts record
before confirmed_flush_lsn confirming that all xacts running as of restart_lsn
have finished. That unnecessary complexity looks much worse.
As for the check in the subject, I nonetheless propose to remove it
completely, as in the second attached patch. Saying for sure whether the xact
of some record encountered after the snapshot was deserialized can be streamed
or not requires knowing nextXid (the first not-yet-running xid) as of this
snapshot's LSN -- all xids < nextXid might not have been seen in full and are
not subject to decoding. However, generally we don't know nextXid, which is
taken from xl_running_xacts; in particular, a snapshot can be
serialized/deserialized at XLOG_END_OF_RECOVERY. Changing that for the sake of
the check in question is not worthwhile, so just throw it out instead of
trying to relax it.
In fact, I don't see what is so important about seeing the top xact first at
all. During COMMIT decoding we'll know all subxids anyway, so why care?
P.S. While digging into this, I noticed that the return values of
SnapBuildFindSnapshot seem pretty arbitrary to me. Basically, returning 'true'
immediately performs four things:
- update xmin;
- purge old xip entries;
- advance xmin of the slot;
- if CONSISTENT, advance lsn (earliest serialized snap).
The latter two make sense only after the slot is created or
confirmed_flush_lsn is reached. The first two make sense even immediately
after deserializing the snapshot (because it is serialized *before* updating
xmin and xip); generally, whenever committed xids are tracked. Then why is
cleanup done when the xmin horizon is too low? Why is it not performed after
restoring a serialized snapshot? On the whole, I find this not very important,
as all these operations are pretty cheap and harmless if executed too early --
it would be simpler to just do them always.
--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi,
On 2019-10-24 12:59:30 +0300, Arseny Sher wrote:
Our customer also encountered this issue and I've looked into it. The problem is
reproduced well enough using the instructions in the previous message.
Is this with
commit bac2fae05c7737530a6fe8276cd27d210d25c6ac
Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: 2019-09-13 16:36:28 -0300
logical decoding: process ASSIGNMENT during snapshot build
Most WAL records are ignored in early SnapBuild snapshot build phases.
But it's critical to process some of them, so that later messages have
the correct transaction state after the snapshot is completely built; in
particular, XLOG_XACT_ASSIGNMENT messages are critical in order for
sub-transactions to be correctly assigned to their parent transactions,
or at least one assert misbehaves, as reported by Ildar Musin.
Diagnosed-by: Masahiko Sawada
Author: Masahiko Sawada
Discussion: /messages/by-id/CAONYFtOv+Er1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg@mail.gmail.com
applied?
The check leading to this ERROR is too strict, it forbids legit behaviours. Say
we have in WAL
[ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_2> <commit> <confirmed_flush_lsn> ]
- First xl_xact_assignment record is beyond reading, i.e. earlier
restart_lsn, where ready snapshot will be taken from disk.
- After restart_lsn there is some change of a subxact.
- After that, there is second xl_xact_assignment (for another subxact)
revealing relationship between top and first subxact, where this ERROR fires.
Such transaction won't be streamed because we hadn't seen it in full. It must be
finished before streaming will start, i.e. before confirmed_flush_lsn.
Of course, the easiest fix is to just throw the check out.
I don't think that'd actually be a fix, and just hiding a bug.
Snapbuilder enters into SNAPBUILD_CONSISTENT immediately after deserializing the
snapshot. Generally this is incorrect because SNAPBUILD_CONSISTENT means not
just complete snapshot (snapshot by itself in FULL state is just good as in
CONSISTENT), but also reorderbuffer filled with all currently running
xacts. This is painless for decoding sessions with existing slots because they
won't stream anything before confirmed_flush_lsn is reached anyway, at which
point all transactions which hadn't got into reorderbuffer would definitely
finish. However, new slots might be created too early, thus losing (not
decoding) parts of transactions committed after freshly created
confirmed_flush_lsn. This can happen under the following extremely unlucky
circumstances:
- New slot creation reserves point in WAL since which it would read it
(GetXLogInsertRecPtr);
- It logs xl_running_xacts to start assembling a snapshot;
- Running decoding session with another slot quickly reads this
xl_running_xacts and serializes its snapshot;
- New slot reads xl_running_xacts and picks this snapshot up, saying that it
is ready to stream henceforth, though its reorderbuffer is empty.
Yea, that's a problem :(
Exact reproducing steps:
-- session 1
create table t (i int);
select pg_create_logical_replication_slot('slot_1', 'test_decoding');
-- session 2
begin;
insert into t values (1);
-- session 3, start slot creation
select pg_create_logical_replication_slot('slot_2', 'test_decoding');
-- stop (with gdb or something) it at DecodingContextFindStartpoint(ctx);
-- session 1
-- xl_running_xacts is dumped by ReplicationSlotReserveWal in previous command, no
-- need to sleep; our snap will be immediately serialized there
SELECT data FROM pg_logical_slot_get_changes('slot_1', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '0');
-- continue slot_2 creation
-- session 2: insert some more and commit
insert into t values (1);
commit;
-- now this would find second insert, but not the first one
SELECT data FROM pg_logical_slot_get_changes('slot_2', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '0');
It'd be good to make this into an isolation test.
What can we do here?
I think the easiest fix might actually be to just ignore serialized
snapshots when building the initial snapshot.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
Hi,
On 2019-10-24 12:59:30 +0300, Arseny Sher wrote:
Our customer also encountered this issue and I've looked into it. The problem is
reproduced well enough using the instructions in the previous message.
Is this with
commit bac2fae05c7737530a6fe8276cd27d210d25c6ac
Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: 2019-09-13 16:36:28 -0300
logical decoding: process ASSIGNMENT during snapshot build
Most WAL records are ignored in early SnapBuild snapshot build phases.
But it's critical to process some of them, so that later messages have
the correct transaction state after the snapshot is completely built; in
particular, XLOG_XACT_ASSIGNMENT messages are critical in order for
sub-transactions to be correctly assigned to their parent transactions,
or at least one assert misbehaves, as reported by Ildar Musin.
Diagnosed-by: Masahiko Sawada
Author: Masahiko Sawada
Discussion: /messages/by-id/CAONYFtOv+Er1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg@mail.gmail.com
applied?
Yeah, I tried fresh master. See below: the skipped xl_xact_assignment lies
entirely before restart_lsn (and thus is not read), so this doesn't matter.
The check leading to this ERROR is too strict, it forbids legit behaviours. Say
we have in WAL
[ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_2> <commit> <confirmed_flush_lsn> ]
- First xl_xact_assignment record is beyond reading, i.e. earlier
restart_lsn, where ready snapshot will be taken from disk.
- After restart_lsn there is some change of a subxact.
- After that, there is second xl_xact_assignment (for another subxact)
revealing relationship between top and first subxact, where this ERROR fires.
Such transaction won't be streamed because we hadn't seen it in full. It must be
finished before streaming will start, i.e. before confirmed_flush_lsn.
Of course, the easiest fix is to just throw the check out.
I don't think that'd actually be a fix, and just hiding a bug.
I don't see a bug here. At least in the reproduced scenario I see a false
alarm, as explained above: the transaction with the skipped
xl_xact_assignment won't be streamed, as it finishes before
confirmed_flush_lsn. And I am pretty sure people have encountered the same
issue in the field.
At the end of my mail I offered a way to relax this check instead of removing
it, to avoid false triggers: serialize/deserialize a snapshot only at
xl_running_xacts to know nextXid, add a function to the snapbuilder to check
whether an xact can be streamed or not by looking at its xid, etc., and
somehow deal with existing serialized snaps which may be logged at
END_OF_RECOVERY without nextXid. I don't believe this check is worth these
complexities. If you think it is, I can do that, though.
Snapbuilder enters into SNAPBUILD_CONSISTENT immediately after deserializing the
snapshot. Generally this is incorrect because SNAPBUILD_CONSISTENT means not
just complete snapshot (snapshot by itself in FULL state is just good as in
CONSISTENT), but also reorderbuffer filled with all currently running
xacts. This is painless for decoding sessions with existing slots because they
won't stream anything before confirmed_flush_lsn is reached anyway, at which
point all transactions which hadn't got into reorderbuffer would definitely
finish. However, new slots might be created too early, thus losing (not
decoding) parts of transactions committed after freshly created
confirmed_flush_lsn. This can happen under the following extremely unlucky
circumstances:
- New slot creation reserves point in WAL since which it would read it
(GetXLogInsertRecPtr);
- It logs xl_running_xacts to start assembling a snapshot;
- Running decoding session with another slot quickly reads this
xl_running_xacts and serializes its snapshot;
- New slot reads xl_running_xacts and picks this snapshot up, saying that it
is ready to stream henceforth, though its reorderbuffer is empty.
Yea, that's a problem :(
Exact reproducing steps:
-- session 1
create table t (i int);
select pg_create_logical_replication_slot('slot_1', 'test_decoding');
-- session 2
begin;
insert into t values (1);
-- session 3, start slot creation
select pg_create_logical_replication_slot('slot_2', 'test_decoding');
-- stop (with gdb or something) it at DecodingContextFindStartpoint(ctx);
-- session 1
-- xl_running_xacts is dumped by ReplicationSlotReserveWal in previous command, no
-- need to sleep; our snap will be immediately serialized there
SELECT data FROM pg_logical_slot_get_changes('slot_1', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '0');
-- continue slot_2 creation
-- session 2: insert some more and commit
insert into t values (1);
commit;
-- now this would find second insert, but not the first one
SELECT data FROM pg_logical_slot_get_changes('slot_2', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '0');
It'd be good to make this into an isolation test.
Yeah, but getting a real chance of firing this requires some kind of
sleep/breakpoint in the middle of pg_create_logical_replication_slot
execution, so I have no idea how to do that =(
What can we do here?
I think the easiest fix might actually be to just ignore serialized
snapshots when building the initial snapshot.
That's an option. However, that anyway requires distinguishing between new
slot creation and streaming from an existing slot at the snapbuilder level,
which currently doesn't exist and which constitutes most of my first patch. If
we add that, switching between using and not using serialized snapshots during
new slot creation is easy (my patch uses them), so I think this is not the
principal question.
--
Arseny Sher
Andres, Álvaro, could you please have a look at this?
--
Arseny Sher
Is the resolution of the issue in this thread being tracked elsewhere,
either in a commit fest or other stream of work?
On Tue, Dec 17, 2019 at 9:47 AM Arseny Sher <a.sher@postgrespro.ru> wrote:
Andres, Álvaro, could you please have a look at this?
--
Arseny Sher
Dan Katz <dkatz@joor.com> writes:
Is the resolution of the issue in this thread being tracked elsewhere,
either in a commit fest or other stream of work?
Ok, I've created a cf entry:
https://commitfest.postgresql.org/26/2383/
I believe it is most important to commit at least
0002-Stop-demanding-that-top-xact-must-be-seen-before-sub.patch
from my mail above -- as we see, this issue creates problems in the
field. Moreover, the patch is trivial and hopefully I've shown that
ERROR triggers were spurious and there is no easy way to relax the
check.
--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Is there any chance this fix will get into the next minor version of PostgreSQL scheduled for February?
Arseny,
I was hoping you could give me some insight into how this bug might
appear with multiple replication slots. For example, if I have two
replication slots, would you expect both slots to see the same error, even
if they were started, consumed, or confirmed-flushed at different times?
Dan
On Tue, Dec 17, 2019 at 10:15 AM Arseny Sher <a.sher@postgrespro.ru> wrote:
Dan Katz <dkatz@joor.com> writes:
Is the resolution of the issue in this thread being tracked elsewhere,
either in a commit fest or other stream of work?
Ok, I've created a cf entry:
https://commitfest.postgresql.org/26/2383/
I believe it is the most important to commit at least
0002-Stop-demanding-that-top-xact-must-be-seen-before-sub.patch
from my mail above -- as we see, this issue creates problems in the
field. Moreover, the patch is trivial and hopefully I've shown that
ERROR triggers were spurious and there is no easy way to relax the
check.
--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi,
Dan Katz <dkatz@joor.com> writes:
Arseny,
I was hoping you could give me some insights about how this bug might
appear with multiple replications slots. For example if I have two
replication slots would you expect both slots to see the same error, even
if they were started, consumed or the LSN was confirmed-flushed at
different times?
Well, to encounter this you must happen to interrupt a decoding session
(e.g. shut down the server) while restart_lsn (the LSN from which WAL will be
read next time) is at an unfortunate position, as described in
/messages/by-id/87ftjifoql.fsf@ars-thinkpad
Generally each slot has its own restart_lsn, so if one decoding session gets
stuck on this issue, another one won't necessarily fail at the same time.
However, restart_lsn can be advanced only to certain points, mainly
xl_running_xacts records, which are logged every 15 seconds. So if all
consumers acknowledge changes fast enough, it is quite likely that during
shutdown restart_lsn will be the same for all slots -- which means either all
of them will get stuck on further decoding or none of them will. If not,
different slots might have different restart_lsn values and probably won't
fail at the same time; but encountering this issue even once suggests that
your workload makes such a problematic restart_lsn noticeably likely
(i.e. many subtransactions), and each restart_lsn probably has approximately
the same chance of being 'bad' (provided the workload is even).
We need a committer familiar with this code to look here...
--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Fri, Oct 25, 2019 at 12:26 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
Andres Freund <andres@anarazel.de> writes:
Hi,
On 2019-10-24 12:59:30 +0300, Arseny Sher wrote:
Our customer also encountered this issue and I've looked into it. The problem is
reproduced well enough using the instructions in the previous message.
Is this with
commit bac2fae05c7737530a6fe8276cd27d210d25c6ac
Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: 2019-09-13 16:36:28 -0300
logical decoding: process ASSIGNMENT during snapshot build
Most WAL records are ignored in early SnapBuild snapshot build phases.
But it's critical to process some of them, so that later messages have
the correct transaction state after the snapshot is completely built; in
particular, XLOG_XACT_ASSIGNMENT messages are critical in order for
sub-transactions to be correctly assigned to their parent transactions,
or at least one assert misbehaves, as reported by Ildar Musin.
Diagnosed-by: Masahiko Sawada
Author: Masahiko Sawada
Discussion: /messages/by-id/CAONYFtOv+Er1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg@mail.gmail.com
applied?
Yeah, I tried fresh master. See below: the skipped xl_xact_assignment lies
entirely before restart_lsn (and thus is not read), so this doesn't matter.
The check leading to this ERROR is too strict, it forbids legit behaviours. Say
we have in WAL
[ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_2> <commit> <confirmed_flush_lsn> ]
- First xl_xact_assignment record is beyond reading, i.e. earlier
restart_lsn, where ready snapshot will be taken from disk.
- After restart_lsn there is some change of a subxact.
- After that, there is second xl_xact_assignment (for another subxact)
revealing relationship between top and first subxact, where this ERROR fires.
Such transaction won't be streamed because we hadn't seen it in full. It must be
finished before streaming will start, i.e. before confirmed_flush_lsn.
Of course, the easiest fix is to just throw the check out.
I don't think that'd actually be a fix, and just hiding a bug.
I don't see a bug here. At least in reproduced scenario I see false
alert, as explained above: transaction with skipped xl_xact_assignment
won't be streamed as it finishes before confirmed_flush_lsn.
Does this guarantee come from the fact that we need to wait for such a
transaction before reaching a consistent snapshot state? If not, can
you explain a bit more what makes you say so?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> writes:
I don't see a bug here. At least in reproduced scenario I see false
alert, as explained above: transaction with skipped xl_xact_assignment
won't be streamed as it finishes before confirmed_flush_lsn.
Does this guarantee come from the fact that we need to wait for such a
transaction before reaching a consistent snapshot state? If not, can
you explain a bit more what makes you say so?
Right, see the FULL_SNAPSHOT -> SNAPBUILD_CONSISTENT transition -- it exists
exactly for this purpose: once we have a good snapshot, we need to wait for
all running xacts to finish so that we see in full all xacts we are promising
to stream. This ensures the <restart_lsn, confirmed_flush_lsn> pair is good
initially (reading WAL from the former is enough to see in full all xacts
committing after the latter), and slot advancement arrangements ensure it
stays good forever (see LogicalIncreaseRestartDecodingForSlot).
Well, almost. This is true as long as the initial snapshot construction
process goes the long way of building the snapshot by itself. If it happens to
pick up from disk a ready snapshot pickled there by another decoding session,
it fast-paths to SNAPBUILD_CONSISTENT, which is technically a bug, as
described in
/messages/by-id/87ftjifoql.fsf@ars-thinkpad
In theory, this bug could indeed lead to the 'subtransaction logged without
previous top-level txn record' error. In practice, I think its probability is
vanishingly small -- the process of slot creation must be interleaved, within
a very short window, with another decoding session serializing its snapshot
(see the exact sequence of steps in the mail above). What is much more
probable (it doesn't involve new slot creation and is relatively easily
reproducible without sleeps) is a false alarm triggered by an unlucky position
of restart_lsn.
Surely we still must fix it. I just mean:
- People definitely encountered the false alarm, not this bug
(at least because nobody said it happened immediately after slot
creation).
- I have no bright ideas on how to relax the check properly without
additional complications, and I'm pretty sure this is impossible
(again, see above for details), so I'd remove it.
--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Mon, Feb 3, 2020 at 2:50 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
Amit Kapila <amit.kapila16@gmail.com> writes:
I don't see a bug here. At least in reproduced scenario I see false
alert, as explained above: transaction with skipped xl_xact_assignment
won't be streamed as it finishes before confirmed_flush_lsn.
Does this guarantee come from the fact that we need to wait for such a
transaction before reaching a consistent snapshot state? If not, can
you explain a bit more what makes you say so?
Right, see FULL_SNAPSHOT -> SNAPBUILD_CONSISTENT transition -- it exists
exactly for this purpose: once we have good snapshot, we need to wait
for all running xacts to finish to see all xacts we are promising to
stream in full.
So, doesn't this mean that it started occurring after the fix done in
commit 96b5033e11 [1]? Because before that fix we wouldn't have
allowed processing XLOG_XACT_ASSIGNMENT records unless we are in
SNAPBUILD_FULL_SNAPSHOT state. I am not telling the fix in that
commit is wrong, but just trying to understand the situation here.
Well, almost. This is true as long initial snapshot construction process
goes the long way of building the snapshot by itself. If it happens to
pick up from disk ready snapshot pickled there by another decoding
session, it fast path'es to SNAPBUILD_CONSISTENT, which is technically a
bug as described in
/messages/by-id/87ftjifoql.fsf@ars-thinkpad
Can't we deal with this separately? If so, I think let's not mix the
discussions for both as the root cause of both seems different.
[1]:
commit bac2fae05c7737530a6fe8276cd27d210d25c6ac
Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: 2019-09-13 16:36:28 -0300
logical decoding: process ASSIGNMENT during snapshot build
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> writes:
So, doesn't this mean that it started occurring after the fix done in
commit 96b5033e11 [1]? Because before that fix we wouldn't have
allowed processing XLOG_XACT_ASSIGNMENT records unless we are in
SNAPBUILD_FULL_SNAPSHOT state. I am not telling the fix in that
commit is wrong, but just trying to understand the situation here.
Nope. Consider again the example of WAL above that triggers the error:
[ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_2> <commit> <confirmed_flush_lsn> ]
The decoder starts reading WAL at <restart_lsn>, where it immediately reads
the snapshot serialized to disk earlier, which makes it jump to
SNAPBUILD_CONSISTENT right away. It doesn't read xl_xact_assignment_1,
but it does read xl_xact_assignment_2, already in SNAPBUILD_CONSISTENT state,
so it hits the error regardless of this commit.
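To make the failure mode concrete, here is a toy replay of the records after <restart_lsn> (ToyReorderBuffer and its helpers are invented for the sketch; only the new_top/new_sub condition mirrors the check in ReorderBufferAssignChild):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy reorder buffer: remembers which xids already have a TXN entry. */
#define MAX_XIDS 64
typedef struct
{
	int xids[MAX_XIDS];
	int n;
} ToyReorderBuffer;

/* Find the entry for xid, creating it if missing; report whether it is new. */
static void txn_by_xid(ToyReorderBuffer *rb, int xid, bool *is_new)
{
	for (int i = 0; i < rb->n; i++)
	{
		if (rb->xids[i] == xid)
		{
			*is_new = false;
			return;
		}
	}
	rb->xids[rb->n++] = xid;
	*is_new = true;
}

/* Mirrors only the failing condition in ReorderBufferAssignChild: the error
 * fires when the top-level entry is brand new but the subxact entry isn't. */
static bool assign_child_errors(ToyReorderBuffer *rb, int top_xid, int sub_xid)
{
	bool new_top, new_sub;

	txn_by_xid(rb, top_xid, &new_top);
	txn_by_xid(rb, sub_xid, &new_sub);
	return new_top && !new_sub;
}
```

The <subxact_change> record creates the subxact's TXN first; when the later assignment then creates the top-level TXN, new_top && !new_sub holds and the elog(ERROR) fires, even though nothing is actually wrong.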
Well, almost. This is true as long as the initial snapshot construction
process goes the long way of building the snapshot by itself. If it happens
to pick up a ready snapshot serialized to disk by another decoding session,
it fast-paths to SNAPBUILD_CONSISTENT, which is technically a bug, as
described in
/messages/by-id/87ftjifoql.fsf@ars-thinkpad
Can't we deal with this separately? If so, I think let's not mix the two
discussions, as their root causes seem different.
These issues are related: before removing the check it would be nice to
ensure there are no bugs it might protect us from (and it turns out there
actually is one, though it won't always protect us, and though this bug has
a very small probability). Moreover, they are about more or less the same
subject -- avoiding partially decoded xacts -- and once you have dived deep
enough to deal with one, it is reasonable to deal with the other instead of
doing that twice. But as a practical matter, removing the check is a simple
one-liner, and its presence causes people trouble -- so I'd suggest doing
that first and then dealing with the rest. I don't think starting a new
thread is worthwhile here, but if you think it is, I can create one.
--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Mon, Feb 3, 2020 at 7:16 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
Amit Kapila <amit.kapila16@gmail.com> writes:
So, doesn't this mean that it started occurring after the fix done in
commit 96b5033e11 [1]? Because before that fix we wouldn't have
allowed processing XLOG_XACT_ASSIGNMENT records unless we are in
SNAPBUILD_FULL_SNAPSHOT state. I am not saying the fix in that
commit is wrong; I am just trying to understand the situation here.
Nope. Consider again the example of WAL above that triggers the error:
[ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_2> <commit> <confirmed_flush_lsn> ]
The decoder starts reading WAL at <restart_lsn>, where it immediately reads
the snapshot serialized to disk earlier, which makes it jump to
SNAPBUILD_CONSISTENT right away.
Sure, I understand that if we get a serialized snapshot from disk, this
problem can occur, and we could probably fix it by ignoring serialized
snapshots during slot creation or something along those lines. However,
what I am trying to understand is whether this can occur from another
path as well. I think it might occur without using serialized
snapshots too, because we allow decoding the xl_xact_assignment record
even when the snapshot state is not full. Say, in your above example,
even if the snapshot state is not SNAPBUILD_CONSISTENT because we haven't
used the serialized snapshot, it can still lead to the above
problem due to decoding of xl_xact_assignment. I have not tried to
generate a test case for this, so I could easily be wrong here.
What I am trying to get at is this: if the problem can only occur when using
serialized snapshots, and we fix it by somehow not using them during initial
slot creation, then ideally we don't need to remove the error in
ReorderBufferAssignChild; but if that is not the case, then we need to
discuss the other cases as well.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> writes:
On Mon, Feb 3, 2020 at 7:16 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
Amit Kapila <amit.kapila16@gmail.com> writes:
So, doesn't this mean that it started occurring after the fix done in
commit 96b5033e11 [1]? Because before that fix we wouldn't have
allowed processing XLOG_XACT_ASSIGNMENT records unless we are in
SNAPBUILD_FULL_SNAPSHOT state. I am not saying the fix in that
commit is wrong; I am just trying to understand the situation here.
Nope. Consider again the example of WAL above that triggers the error:
[ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_2> <commit> <confirmed_flush_lsn> ]
The decoder starts reading WAL at <restart_lsn>, where it immediately reads
the snapshot serialized to disk earlier, which makes it jump to
SNAPBUILD_CONSISTENT right away.
Sure, I understand that if we get a serialized snapshot from disk, this
problem can occur, and we could probably fix it by ignoring serialized
snapshots during slot creation or something along those lines.
There is some confusion. I'll try to reword what we have here:
1) Decoding from an existing slot (*not* initial snapshot construction)
starts up, immediately picks up the snapshot at restart_lsn (getting into
SNAPBUILD_CONSISTENT), and from some xl_xact_assignment learns about a
transaction it hadn't seen in full (no top-level records) and which it is
not even going to stream -- but still dies with "subtransaction logged
without...". That's my example above, and that's what people are
complaining about. Here, the usage of the serialized snapshot and the jump
to SNAPBUILD_CONSISTENT is not just legit, it is essential: in order to be
able to stream data since confirmed_flush_lsn, we must pick it up, as we
might not be able to assemble it from scratch in time. I mean, what is
wrong here is not the serialized snapshot usage but the check.
(The lengthy comment on AllocateSnapshotBuilder in my
0001-Fix-serialized-snapshot-usage-for-new-logical-slots.patch explains
why the snapbuilder is not able to do the FULL -> CONSISTENT transition on
its own early enough when decoding from an existing slot, so on snapshot
pickup we jump to CONSISTENT directly.)
This happens with or without bac2fae05c.
2) The *new* slot creation process picks up a serialized snapshot and jumps
to CONSISTENT without waiting for all running xacts to finish. This is
wrong and is a bug (of very low probability), as we risk promising to
decode xacts which we might not have seen in full. Sometimes it would be
caught by the "subtransaction logged without..." check, but not necessarily
-- e.g. there could be no subxacts at all.
However,
what I am trying to understand is whether this can occur from another
path as well. I think it might occur without using serialized
snapshots too, because we allow decoding the xl_xact_assignment record
even when the snapshot state is not full. Say, in your above example,
even if the snapshot state is not SNAPBUILD_CONSISTENT because we haven't
used the serialized snapshot, it can still lead to the above
problem due to decoding of xl_xact_assignment. I have not tried to
generate a test case for this, so I could easily be wrong here.
What you are suggesting here is 3), which is, well, a sort of form of 1),
meaning the "subxact logged..." error is also pointlessly triggered, but
during new slot creation. With bac2fae0, the decoder might miss the first
xl_xact_assignment because it is before the start of reading WAL, but
encounter the second xl_xact_assignment and die on it due to this check
before even reaching FULL.
But now that I'm thinking about it, I suspect that something similar could
happen even before bac2fae0. Imagine
<start_of_reading_wal> <xl_xact_assignment_1> <SNAPBUILD_FULL> <subxact_change> <xl_xact_assignment_2> <commit> ... <SNAPBUILD_CONSISTENT>
Before bac2fae0, xl_xact_assignment_1 was ignored, so
xl_xact_assignment_1 would trigger the error.
What I am trying to get at is this: if the problem can only occur when using
serialized snapshots, and we fix it by somehow not using them during initial
slot creation, then ideally we don't need to remove the error in
ReorderBufferAssignChild; but if that is not the case, then we need to
discuss the other cases as well.
So, 1) and 3) mean this is not the case.
--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Wed, Feb 5, 2020 at 2:34 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
Amit Kapila <amit.kapila16@gmail.com> writes:
On Mon, Feb 3, 2020 at 7:16 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
Amit Kapila <amit.kapila16@gmail.com> writes:
So, doesn't this mean that it started occurring after the fix done in
commit 96b5033e11 [1]? Because before that fix we wouldn't have
allowed processing XLOG_XACT_ASSIGNMENT records unless we are in
SNAPBUILD_FULL_SNAPSHOT state. I am not saying the fix in that
commit is wrong; I am just trying to understand the situation here.
Nope. Consider again the example of WAL above that triggers the error:
[ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_2> <commit> <confirmed_flush_lsn> ]
The decoder starts reading WAL at <restart_lsn>, where it immediately reads
the snapshot serialized to disk earlier, which makes it jump to
SNAPBUILD_CONSISTENT right away.
Sure, I understand that if we get a serialized snapshot from disk, this
problem can occur, and we could probably fix it by ignoring serialized
snapshots during slot creation or something along those lines.
There is some confusion. I'll try to reword what we have here:
1) Decoding from an existing slot (*not* initial snapshot construction)
starts up, immediately picks up the snapshot at restart_lsn (getting into
SNAPBUILD_CONSISTENT), and from some xl_xact_assignment learns about a
transaction it hadn't seen in full (no top-level records) and which it is
not even going to stream -- but still dies with "subtransaction logged
without...". That's my example above, and that's what people are
complaining about. Here, the usage of the serialized snapshot and the jump
to SNAPBUILD_CONSISTENT is not just legit, it is essential: in order to be
able to stream data since confirmed_flush_lsn, we must pick it up, as we
might not be able to assemble it from scratch in time. I mean, what is
wrong here is not the serialized snapshot usage but the check.
I was thinking that if we had some way to skip processing of
xl_xact_assignment for such cases, then it might be better. Say,
along with restart_lsn, we had some way to find the corresponding nextXid
(below which we don't need to process records). For example, if, during
SnapBuildProcessRunningXacts, we recorded in the slot the xid of the txn we
got from ReorderBufferGetOldestTXN, couldn't we use it to skip such
records?
(The lengthy comment on AllocateSnapshotBuilder in my
0001-Fix-serialized-snapshot-usage-for-new-logical-slots.patch explains
why the snapbuilder is not able to do the FULL -> CONSISTENT transition on
its own early enough when decoding from an existing slot, so on snapshot
pickup we jump to CONSISTENT directly.)
This happens with or without bac2fae05c.
2) The *new* slot creation process picks up a serialized snapshot and jumps
to CONSISTENT without waiting for all running xacts to finish. This is
wrong and is a bug (of very low probability), as we risk promising to
decode xacts which we might not have seen in full. Sometimes it would be
caught by the "subtransaction logged without..." check, but not necessarily
-- e.g. there could be no subxacts at all.
However,
what I am trying to understand is whether this can occur from another
path as well. I think it might occur without using serialized
snapshots too, because we allow decoding the xl_xact_assignment record
even when the snapshot state is not full. Say, in your above example,
even if the snapshot state is not SNAPBUILD_CONSISTENT because we haven't
used the serialized snapshot, it can still lead to the above
problem due to decoding of xl_xact_assignment. I have not tried to
generate a test case for this, so I could easily be wrong here.
What you are suggesting here is 3), which is, well, a sort of form of 1),
meaning the "subxact logged..." error is also pointlessly triggered, but
during new slot creation. With bac2fae0, the decoder might miss the first
xl_xact_assignment because it is before the start of reading WAL, but
encounter the second xl_xact_assignment and die on it due to this check
before even reaching FULL.
But now that I'm thinking about it, I suspect that something similar could
happen even before bac2fae0. Imagine
<start_of_reading_wal> <xl_xact_assignment_1> <SNAPBUILD_FULL> <subxact_change> <xl_xact_assignment_2> <commit> ... <SNAPBUILD_CONSISTENT>
Before bac2fae0, xl_xact_assignment_1 was ignored, so
xl_xact_assignment_1 would trigger the error.
'xl_xact_assignment_1 would trigger the error' -- I think in this part
of the sentence you mean to say xl_xact_assignment_2, because we won't try
to decode xl_xact_assignment_1 before bac2fae0. If so, won't we wait
for such a transaction to finish while changing the snapshot state
from SNAPBUILD_BUILDING_SNAPSHOT to SNAPBUILD_FULL_SNAPSHOT? And if
the transaction is finished, ideally we should not try to decode its
WAL or create its ReorderBufferTXN.
What I am trying to get at is this: if the problem can only occur when using
serialized snapshots, and we fix it by somehow not using them during initial
slot creation, then ideally we don't need to remove the error in
ReorderBufferAssignChild; but if that is not the case, then we need to
discuss the other cases as well.
So, 1) and 3) mean this is not the case.
Right, I am thinking that if we can find some way to skip the xact
assignment for (1) and (3), then that might be a better fix.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Feb 7, 2020 at 4:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Feb 5, 2020 at 2:34 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
What I am trying to get at is this: if the problem can only occur when using
serialized snapshots, and we fix it by somehow not using them during initial
slot creation, then ideally we don't need to remove the error in
ReorderBufferAssignChild; but if that is not the case, then we need to
discuss the other cases as well.
So, 1) and 3) mean this is not the case.
Right, I am thinking that if we can find some way to skip the xact
assignment for (1) and (3), then that might be a better fix.
Just to be clear, I am just brainstorming ideas to see if we can
find some better solutions to problems (1) and (3) described by
Arseny in the email above [1]. At this stage, it is not clear to me
that we have a fix simple enough to backpatch apart from what Arseny
posted in his first email [2] (which is to stop demanding that the top xact
must be seen before a subxact in decoding).
[1]: /messages/by-id/87zhdx76d5.fsf@ars-thinkpad
[2]: /messages/by-id/87ftjifoql.fsf@ars-thinkpad
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> writes:
1) Decoding from an existing slot (*not* initial snapshot construction)
starts up, immediately picks up the snapshot at restart_lsn (getting into
SNAPBUILD_CONSISTENT), and from some xl_xact_assignment learns about a
transaction it hadn't seen in full (no top-level records) and which it is
not even going to stream -- but still dies with "subtransaction logged
without...". That's my example above, and that's what people are
complaining about. Here, the usage of the serialized snapshot and the jump
to SNAPBUILD_CONSISTENT is not just legit, it is essential: in order to be
able to stream data since confirmed_flush_lsn, we must pick it up, as we
might not be able to assemble it from scratch in time. I mean, what is
wrong here is not the serialized snapshot usage but the check.
I was thinking that if we had some way to skip processing of
xl_xact_assignment for such cases, then it might be better. Say,
along with restart_lsn, we had some way to find the corresponding nextXid
(below which we don't need to process records).
I don't believe you can do that without persisting additional
data. Basically, what we need is the list of transactions that were running
at the point of snapshot serialization *and* had already written something
before it -- those we haven't seen in full and can't decode. We have no such
data currently. The closest thing we have is xl_running_xacts->nextXid,
but
1) an issued xid doesn't necessarily mean the xact actually wrote something,
so we can't just skip xl_xact_assignment for xid < nextXid; it might
still be decoded
2) the snapshot might be serialized somewhere other than at xl_running_xacts
anyway
Surely this thing doesn't deserve changing the persisted data format.
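Point 1) can be shown with a toy counterexample (types and fields invented for the sketch): an xact whose xid was issued before nextXid may still write all of its WAL after restart_lsn, in which case it is decodable in full and must not be skipped, yet the xid cutoff would drop it.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int TransactionId;
typedef unsigned long XLogRecPtr;

typedef struct
{
	TransactionId xid;
	XLogRecPtr first_write_lsn;	/* where the xact first wrote WAL */
} ToyXact;

/* An xact can be decoded in full iff all of its WAL lies at or after the
 * point where we start reading. */
static bool decodable_in_full(ToyXact x, XLogRecPtr restart_lsn)
{
	return x.first_write_lsn >= restart_lsn;
}

/* The proposed filter: drop everything below the nextXid taken at
 * serialization time. */
static bool xid_filter_skips(ToyXact x, TransactionId next_xid)
{
	return x.xid < next_xid;
}
```

The filter and the actual decodability criterion disagree exactly for xacts that obtained an xid early but wrote late.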
Somehow I hadn't realized this earlier, so my comments/commit messages
in the patches above were not accurate here; I've edited them. Also, in the
first patch serialized snapshots are no longer used for new slot
creation at all, as Andres suggested above. This is not fundamental, as I
said, but arguably makes things a bit simpler.
I've also found a couple of issues with the slot copying feature; I will
post about them in a separate thread.
On Sun, Feb 9, 2020 at 9:37 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
Amit Kapila <amit.kapila16@gmail.com> writes:
1) Decoding from an existing slot (*not* initial snapshot construction)
starts up, immediately picks up the snapshot at restart_lsn (getting into
SNAPBUILD_CONSISTENT), and from some xl_xact_assignment learns about a
transaction it hadn't seen in full (no top-level records) and which it is
not even going to stream -- but still dies with "subtransaction logged
without...". That's my example above, and that's what people are
complaining about. Here, the usage of the serialized snapshot and the jump
to SNAPBUILD_CONSISTENT is not just legit, it is essential: in order to be
able to stream data since confirmed_flush_lsn, we must pick it up, as we
might not be able to assemble it from scratch in time. I mean, what is
wrong here is not the serialized snapshot usage but the check.
I was thinking that if we had some way to skip processing of
xl_xact_assignment for such cases, then it might be better. Say,
along with restart_lsn, we had some way to find the corresponding nextXid
(below which we don't need to process records).
I don't believe you can do that without persisting additional
data. Basically, what we need is the list of transactions that were running
at the point of snapshot serialization *and* had already written something
before it -- those we haven't seen in full and can't decode. We have no such
data currently. The closest thing we have is xl_running_xacts->nextXid,
but
1) an issued xid doesn't necessarily mean the xact actually wrote something,
so we can't just skip xl_xact_assignment for xid < nextXid; it might
still be decoded
2) the snapshot might be serialized somewhere other than at xl_running_xacts
anyway
Surely this thing doesn't deserve changing the persisted data format.
I agree that it won't be a good idea to change the persisted data
format, especially in the back-branches. I don't see any fix that can
avoid this without making major changes to the code. Apart from this,
we have to come up with a solution for point (3) discussed in the
email above [1], which again could be a change in design. I think we can
first try to proceed with the patch
0002-Stop-demanding-that-top-xact-must-be-seen-before--v2 and then we
can discuss the other patch. I can't see a way to write a test case
for this; can you think of any way?
Andres, anyone else, if you have a better idea other than changing the
code (removing the expected error) as in
0002-Stop-demanding-that-top-xact-must-be-seen-before--v2, then
please let us know. You can read points (1) and (3) in the email
above [1], where the error check below will be hit for valid cases. We
have discussed this in detail, but couldn't come up with anything
better than removing this check.
@@ -778,9 +778,6 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, &new_top, lsn, true);
 	subtxn = ReorderBufferTXNByXid(rb, subxid, true, &new_sub, lsn, false);
 
-	if (new_top && !new_sub)
-		elog(ERROR, "subtransaction logged without previous top-level txn record");
-
[1]: /messages/by-id/87zhdx76d5.fsf@ars-thinkpad
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com