Assertion failure when streaming logical changes

Started by Michael Paquieralmost 11 years ago9 messages

michael.paquier@gmail.com

almost 11 years ago

Hi all,

Using test_decoding on HEAD (cc761b1) I am seeing the following assertion
failure:
TRAP: FailedAssertion("!(!((&RegisteredSnapshots)->ph_root ==
((void*)0)))", File: "snapmgr.c", Line: 677)

(lldb) bt
* thread #1: tid = 0x0000, 0x00007fff8b246d46 libsystem_kernel.dylib`__kill
+ 10, stop reason = signal SIGSTOP
* frame #0: 0x00007fff8b246d46 libsystem_kernel.dylib`__kill + 10
frame #1: 0x00007fff8861bf83 libsystem_c.dylib`abort + 177
frame #2: 0x000000010b556dd9
postgres`ExceptionalCondition(conditionName=0x000000010b66ffcb,
errorType=0x000000010b5c105d, fileName=0x000000010b66fe68, lineNumber=677)
+ 137 at assert.c:54
frame #3: 0x000000010b5af952
postgres`UnregisterSnapshotFromOwner(snapshot=0x00007faea38c22d8,
owner=0x00007faea38a6838) + 146 at snapmgr.c:677
frame #4: 0x000000010b5af8b2
postgres`UnregisterSnapshot(snapshot=0x00007faea38c22d8) + 50 at
snapmgr.c:663
frame #5: 0x000000010b0166ce
postgres`systable_endscan(sysscan=0x00007faea38cf090) + 110 at genam.c:504
frame #6: 0x000000010b5474b8
postgres`RelationBuildTupleDesc(relation=0x0000000114932e68) + 952 at
relcache.c:568
frame #7: 0x000000010b53e45c
postgres`RelationBuildDesc(targetRelId=3455, insertIt='\x01') + 604 at
relcache.c:1035
frame #8: 0x000000010b53d564
postgres`RelationIdGetRelation(relationId=3455) + 324 at relcache.c:1777
frame #9: 0x000000010aff093c postgres`relation_open(relationId=3455,
lockmode=1) + 108 at heapam.c:1047
frame #10: 0x000000010b016ac9 postgres`index_open(relationId=3455,
lockmode=1) + 25 at indexam.c:167
frame #11: 0x000000010b01603d
postgres`systable_beginscan(heapRelation=0x00000001149214d0, indexId=3455,
indexOK='\x01', snapshot=0x0000000000000000, nkeys=2,
key=0x00007fff54c6a990) + 93 at genam.c:334
frame #12: 0x000000010b54a976
postgres`RelidByRelfilenode(reltablespace=0, relfilenode=12709) + 742 at
relfilenodemap.c:204
frame #13: 0x000000010b3353d9
postgres`ReorderBufferCommit(rb=0x00007faea38c1038, xid=1001,
commit_lsn=23817712, end_lsn=23818128, commit_time=476842122105685) + 665
at reorderbuffer.c:1338
frame #14: 0x000000010b330ccb
postgres`DecodeCommit(ctx=0x00007faea38b0838, buf=0x00007fff54c6ad58,
xid=1001, dboid=16384, commit_time=476842122105685, nsubxacts=0,
sub_xids=0x00007faea3885530, ninval_msgs=22, msgs=0x00007faea3885530) + 443
at decode.c:548
frame #15: 0x000000010b32f663
postgres`DecodeXactOp(ctx=0x00007faea38b0838, buf=0x00007fff54c6ad58) + 547
at decode.c:210
frame #16: 0x000000010b32f10e
postgres`LogicalDecodingProcessRecord(ctx=0x00007faea38b0838,
record=0x00007faea38b0af8) + 142 at decode.c:103
frame #17: 0x000000010b3433f5 postgres`XLogSendLogical + 165 at
walsender.c:2425
frame #18: 0x000000010b3431ad
postgres`WalSndLoop(send_data=0x000000010b343350) + 269 at walsender.c:1834
frame #19: 0x000000010b341938
postgres`StartLogicalReplication(cmd=0x00007faea38846a8) + 568 at
walsender.c:997
frame #20: 0x000000010b34021c
postgres`exec_replication_command(cmd_string=0x00007faea3833a38) + 524 at
walsender.c:1326
frame #21: 0x000000010b3abaab postgres`PostgresMain(argc=1,
argv=0x00007faea3803fc8, dbname=0x00007faea3803ec0,
username=0x00007faea3803ea0) + 2475 at postgres.c:4022
frame #22: 0x000000010b312a2e
postgres`BackendRun(port=0x00007faea3405d60) + 686 at postmaster.c:4141
frame #23: 0x000000010b311ff0
postgres`BackendStartup(port=0x00007faea3405d60) + 384 at postmaster.c:3826
frame #24: 0x000000010b30e7f7 postgres`ServerLoop + 663 at
postmaster.c:1594
frame #25: 0x000000010b30c017 postgres`PostmasterMain(argc=3,
argv=0x00007faea34044d0) + 5623 at postmaster.c:1241
frame #26: 0x000000010b24e11d postgres`main(argc=3,
argv=0x00007faea34044d0) + 749 at main.c:221

The problem can be easily reproduced using pg_recvlogical after creating a
logical slot plugged with test_decoding.
Regards,
--
Michael

Andres Freund

andres@2ndquadrant.com

almost 11 years ago

In reply to: Michael Paquier (#1)

Re: Assertion failure when streaming logical changes

Hi,

On 2015-02-10 21:20:37 +0900, Michael Paquier wrote:

Using test_decoding on HEAD (cc761b1) I am seeing the following assertion
failure:
TRAP: FailedAssertion("!(!((&RegisteredSnapshots)->ph_root ==
((void*)0)))", File: "snapmgr.c", Line: 677)

Ick. I guess a revert of 94028691609f8e148bd4ce72c46163f018832a5b fixes
it?

(lldb) bt
* thread #1: tid = 0x0000, 0x00007fff8b246d46 libsystem_kernel.dylib`__kill
+ 10, stop reason = signal SIGSTOP
* frame #0: 0x00007fff8b246d46 libsystem_kernel.dylib`__kill + 10
frame #1: 0x00007fff8861bf83 libsystem_c.dylib`abort + 177
frame #2: 0x000000010b556dd9
postgres`ExceptionalCondition(conditionName=0x000000010b66ffcb,
errorType=0x000000010b5c105d, fileName=0x000000010b66fe68, lineNumber=677)
+ 137 at assert.c:54
frame #3: 0x000000010b5af952
postgres`UnregisterSnapshotFromOwner(snapshot=0x00007faea38c22d8,
owner=0x00007faea38a6838) + 146 at snapmgr.c:677
frame #4: 0x000000010b5af8b2
postgres`UnregisterSnapshot(snapshot=0x00007faea38c22d8) + 50 at
snapmgr.c:663
frame #5: 0x000000010b0166ce
postgres`systable_endscan(sysscan=0x00007faea38cf090) + 110 at genam.c:504
frame #6: 0x000000010b5474b8
postgres`RelationBuildTupleDesc(relation=0x0000000114932e68) + 952 at
relcache.c:568
frame #7: 0x000000010b53e45c
postgres`RelationBuildDesc(targetRelId=3455, insertIt='\x01') + 604 at
relcache.c:1035
frame #8: 0x000000010b53d564
postgres`RelationIdGetRelation(relationId=3455) + 324 at relcache.c:1777
frame #9: 0x000000010aff093c postgres`relation_open(relationId=3455,
lockmode=1) + 108 at heapam.c:1047
frame #10: 0x000000010b016ac9 postgres`index_open(relationId=3455,
lockmode=1) + 25 at indexam.c:167
frame #11: 0x000000010b01603d
postgres`systable_beginscan(heapRelation=0x00000001149214d0, indexId=3455,
indexOK='\x01', snapshot=0x0000000000000000, nkeys=2,
key=0x00007fff54c6a990) + 93 at genam.c:334
frame #12: 0x000000010b54a976

Yea, it really looks like the above commit is to "blame". The new xmin
tracking infrastructure doesn't know about the historical snapshot...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Michael Paquier

michael.paquier@gmail.com

almost 11 years ago

In reply to: Andres Freund (#2)

Re: Assertion failure when streaming logical changes

On Tue, Feb 10, 2015 at 9:46 PM, Andres Freund <andres@2ndquadrant.com>
wrote:

Yea, it really looks like the above commit is to "blame". The new xmin
tracking infrastructure doesn't know about the historical snapshot...

I think that we need a better regression coverage here... For example, we
could add some tap tests in test_decoding to test streaming of logical
changes. This would help in the future to detect such problems via the
buildfarm.
--
Michael

Andres Freund

andres@2ndquadrant.com

almost 11 years ago

In reply to: Michael Paquier (#3)

Re: Assertion failure when streaming logical changes

On 2015-02-10 22:06:34 +0900, Michael Paquier wrote:

On Tue, Feb 10, 2015 at 9:46 PM, Andres Freund <andres@2ndquadrant.com>
wrote:

Yea, it really looks like the above commit is to "blame". The new xmin
tracking infrastructure doesn't know about the historical snapshot...

I think that we need a better regression coverage here... For example, we
could add some tap tests in test_decoding to test streaming of logical
changes. This would help in the future to detect such problems via the
buildfarm.

I agree. It's more or less a accident that the assert - which just
should be moved in the regd_count == 0 branch - didn't trigger for the
SQL interface. The snapshot acquired by the SELECT statement prevents it
there.

It's not entirely trivial to add tests for receivelogical though. You
need to stop it programatically after a while.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Heikki Linnakangas

hlinnakangas@vmware.com

almost 11 years ago

In reply to: Andres Freund (#2)

1 attachment(s)

Re: Assertion failure when streaming logical changes

On 02/10/2015 02:46 PM, Andres Freund wrote:

On 2015-02-10 21:20:37 +0900, Michael Paquier wrote:

Using test_decoding on HEAD (cc761b1) I am seeing the following assertion
failure:
TRAP: FailedAssertion("!(!((&RegisteredSnapshots)->ph_root ==
((void*)0)))", File: "snapmgr.c", Line: 677)

Ick. I guess a revert of 94028691609f8e148bd4ce72c46163f018832a5b fixes
it?
...

Yea, it really looks like the above commit is to "blame". The new xmin
tracking infrastructure doesn't know about the historical snapshot...

The new xmin tracking code assumes that if a snapshots's regd_count > 0,
it has already been pushed to the RegisteredSnapshots heap. That
assumption doesn't hold because the logical decoding stuff creates its
own snapshots and sets regd_count=1 to prevent snapmgr.c from freeing
them, even though they haven't been registered with RegisterSnapshot.

The second paragraph in this comment in snapmgr.c needs fixing:

* Likewise, any snapshots that have been exported by pg_export_snapshot
* have regd_count = 1 and are counted in RegisteredSnapshots, but are not
* tracked by any resource owner.
*
* The same is true for historic snapshots used during logical decoding,
* their lifetime is managed separately (as they life longer as one xact.c
* transaction).

Besides the spelling, that's incorrect: historic snapshots are *not*
counted in RegisteredSnapshots. That was harmless before commit
94028691, but it was always wrong.

I think setting regd_count=1 outside snapmgr.c is a pretty ugly hack.
snapbuild.c also abuses active_count as a reference counter, which is
similarly ugly. I think regd_count and active_count should both be
treated as private to snapmgr.c, and initialized to zero elsewhere.

As a minimal fix, we could change the logical decoding code to not use
regd_count to prevent snapmgr.c from freeing its snapshots, but use
active_count for that too. Setting regd_count to 1 in
SnapBuildBuildSnapshot() seems unnecessary in the first place, as the
callers also call SnapBuildSnapIncRefcount(). Patch attached.

But could we get rid of those active_count manipulations too? Could you
replace the SnapBuildSnap[Inc|Dec]Refcount calls with
[Un]RegisterSnapshot()?

- Heikki

Attachments:

minimal-fix-for-historic-snapshots-1.patchapplication/x-patch; name=minimal-fix-for-historic-snapshots-1.patchDownload

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index bcd5896..1f76731 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1188,8 +1188,8 @@ ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 	memcpy(snap, orig_snap, sizeof(SnapshotData));
 
 	snap->copied = true;
-	snap->active_count = 0;
-	snap->regd_count = 1;
+	snap->active_count = 1;
+	snap->regd_count = 0;
 	snap->xip = (TransactionId *) (snap + 1);
 
 	memcpy(snap->xip, orig_snap->xip, sizeof(TransactionId) * snap->xcnt);
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index e911453..80f5b44 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -348,7 +348,7 @@ SnapBuildFreeSnapshot(Snapshot snap)
 	Assert(snap->curcid == FirstCommandId);
 	Assert(!snap->suboverflowed);
 	Assert(!snap->takenDuringRecovery);
-	Assert(snap->regd_count == 1);
+	Assert(snap->regd_count == 0);
 
 	/* slightly more likely, so it's checked even without c-asserts */
 	if (snap->copied)
@@ -407,16 +407,16 @@ SnapBuildSnapDecRefcount(Snapshot snap)
 	Assert(!snap->suboverflowed);
 	Assert(!snap->takenDuringRecovery);
 
-	Assert(snap->regd_count == 1);
+	Assert(snap->regd_count == 0);
 
-	Assert(snap->active_count);
+	Assert(snap->active_count > 0);
 
 	/* slightly more likely, so it's checked even without casserts */
 	if (snap->copied)
 		elog(ERROR, "cannot free a copied snapshot");
 
 	snap->active_count--;
-	if (!snap->active_count)
+	if (snap->active_count == 0)
 		SnapBuildFreeSnapshot(snap);
 }
 
@@ -495,7 +495,7 @@ SnapBuildBuildSnapshot(SnapBuild *builder, TransactionId xid)
 	snapshot->copied = false;
 	snapshot->curcid = FirstCommandId;
 	snapshot->active_count = 0;
-	snapshot->regd_count = 1;	/* mark as registered so nobody frees it */
+	snapshot->regd_count = 0;
 
 	return snapshot;
 }

Craig Ringer

craig@2ndquadrant.com

almost 11 years ago

In reply to: Andres Freund (#4)

Re: Assertion failure when streaming logical changes

On 10 February 2015 at 21:43, Andres Freund <andres@2ndquadrant.com> wrote:

On 2015-02-10 22:06:34 +0900, Michael Paquier wrote:

On Tue, Feb 10, 2015 at 9:46 PM, Andres Freund <andres@2ndquadrant.com>
wrote:

Yea, it really looks like the above commit is to "blame". The new xmin
tracking infrastructure doesn't know about the historical snapshot...

I think that we need a better regression coverage here... For example, we
could add some tap tests in test_decoding to test streaming of logical
changes. This would help in the future to detect such problems via the
buildfarm.

I agree. It's more or less a accident that the assert - which just
should be moved in the regd_count == 0 branch - didn't trigger for the
SQL interface. The snapshot acquired by the SELECT statement prevents it
there.

It's not entirely trivial to add tests for receivelogical though. You
need to stop it programatically after a while.

I reckon we should improve that then.

What about an idle timeout for pg_recvlogical, so we can get it to time out
after (say) 5 seconds of no data? Time-based facilities aren't great in
tests though.

The SQL interface can stop after a given number of changes received or a
target LSN. Is there a practical reason pg_recvlogical doesn't? Or just
that there wasn't a reason to implement it? I figured I'd ask before
jumping in and trying to add it in case I'd be wasting my time trying.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Craig Ringer

craig@2ndquadrant.com

almost 11 years ago

In reply to: Heikki Linnakangas (#5)

Re: Assertion failure when streaming logical changes

On 11 February 2015 at 08:51, Heikki Linnakangas <hlinnakangas@vmware.com>
wrote:

The new xmin tracking code assumes that if a snapshots's regd_count > 0,
it has already been pushed to the RegisteredSnapshots heap. That assumption
doesn't hold because the logical decoding stuff creates its own snapshots
and sets regd_count=1 to prevent snapmgr.c from freeing them, even though
they haven't been registered with RegisterSnapshot.

The second paragraph in this comment in snapmgr.c needs fixing:

* Likewise, any snapshots that have been exported by pg_export_snapshot

* have regd_count = 1 and are counted in RegisteredSnapshots, but are not
* tracked by any resource owner.
*
* The same is true for historic snapshots used during logical decoding,
* their lifetime is managed separately (as they life longer as one xact.c
* transaction).

Besides the spelling, that's incorrect: historic snapshots are *not*
counted in RegisteredSnapshots. That was harmless before commit 94028691,
but it was always wrong.

I think setting regd_count=1 outside snapmgr.c is a pretty ugly hack.
snapbuild.c also abuses active_count as a reference counter, which is
similarly ugly. I think regd_count and active_count should both be treated
as private to snapmgr.c, and initialized to zero elsewhere.

As a minimal fix, we could change the logical decoding code to not use
regd_count to prevent snapmgr.c from freeing its snapshots, but use
active_count for that too. Setting regd_count to 1 in
SnapBuildBuildSnapshot() seems unnecessary in the first place, as the
callers also call SnapBuildSnapIncRefcount(). Patch attached.

It might be a good idea to apply this if nothing better is forthcoming.
Logical decoding in WALsenders is broken at the moment.

Otherwise it needs to go on the 9.5 blockers list.

But could we get rid of those active_count manipulations too? Could you

replace the SnapBuildSnap[Inc|Dec]Refcount calls with
[Un]RegisterSnapshot()?

It would be interesting to know why it was done that way in the first
place, rather than using the snapshot management infrastructure. I presume
it needed to do something not directly offered by the snapshot manager but
I haven't managed to grasp what, exactly.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Andres Freund

andres@anarazel.de

almost 11 years ago

In reply to: Craig Ringer (#7)

Re: Assertion failure when streaming logical changes

On 2015-04-07 17:22:12 +0800, Craig Ringer wrote:

It might be a good idea to apply this if nothing better is forthcoming.
Logical decoding in WALsenders is broken at the moment.

Yes.

Otherwise it needs to go on the 9.5 blockers list.

But could we get rid of those active_count manipulations too? Could you

replace the SnapBuildSnap[Inc|Dec]Refcount calls with
[Un]RegisterSnapshot()?

It would be interesting to know why it was done that way in the first
place, rather than using the snapshot management infrastructure.

Because the snapshot tracking infrastructure isn't particularly
suitable. These snapshots are much longer lived; they span transactions
and such. Nothing snapmgr.c is made for. It seemed more
complicated/bug-prone to change snapmgr.c to support a foreign use case.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Heikki Linnakangas

hlinnaka@iki.fi

over 10 years ago

In reply to: Michael Paquier (#1)

Re: Assertion failure when streaming logical changes

On 04/07/2015 03:54 PM, Andres Freund wrote:

On 2015-04-07 17:22:12 +0800, Craig Ringer wrote:

It might be a good idea to apply this if nothing better is forthcoming.
Logical decoding in WALsenders is broken at the moment.

Yes.

Committed.
- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Import Notes

Reply to msg id not found: 11544_1428411302_5523D3A5_11544_32_1_20150407125420.GA12291@awork2.anarazel.de