nbtree: assertion failure in _bt_killitems() for posting tuple
During tests, we caught an assertion failure in _bt_killitems() for a
posting list tuple in a unique index:
/* kitem must have matching offnum when heap TIDs match */
Assert(kitem->indexOffset == offnum);
https://github.com/postgres/postgres/blob/master/src/backend/access/nbtree/nbtutils.c#L1809
I struggle to understand the meaning of this assertion.
Don't we allow the chance that the posting tuple moved right on the page,
as the comment says?
 * We match items by heap TID before assuming they are the right ones to
 * delete. We cope with cases where items have moved right due to
 * insertions.
It seems that this is exactly the case for this failure.
We expected to find the tuple at offset 121, but instead it is at offset
125 (see dump details below).
Unfortunately I cannot attach test and core dump, since they rely on the
enterprise multimaster extension code.
Here are some details from the core dump that I find essential:
Stack is
_bt_killitems
_bt_release_current_position
_bt_release_scan_state
btrescan
index_rescan
RelationFindReplTupleByIndex
(gdb) p offnum
$3 = 125
(gdb) p *item
$4 = {ip_blkid = {bi_hi = 0, bi_lo = 2}, ip_posid = 200}
(gdb) p *kitem
$5 = {heapTid = {ip_blkid = {bi_hi = 0, bi_lo = 2}, ip_posid = 200},
indexOffset = 121, tupleOffset = 32639}
Unless I'm missing something, this assertion must be removed.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Thu, Mar 19, 2020 at 9:34 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
Unfortunately I cannot attach test and core dump, since they rely on the
enterprise multimaster extension code.
Here are some details from the core dump that I find essential:
Stack is
_bt_killitems
_bt_release_current_position
_bt_release_scan_state
btrescan
index_rescan
RelationFindReplTupleByIndex
(gdb) p offnum
$3 = 125
(gdb) p *item
$4 = {ip_blkid = {bi_hi = 0, bi_lo = 2}, ip_posid = 200}
(gdb) p *kitem
$5 = {heapTid = {ip_blkid = {bi_hi = 0, bi_lo = 2}, ip_posid = 200},
indexOffset = 121, tupleOffset = 32639}
Unless I'm missing something, this assertion must be removed.
Is this index an unlogged index, under the hood?
--
Peter Geoghegan
On Thu, Mar 19, 2020 at 9:34 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
During tests, we caught an assertion failure in _bt_killitems() for a
posting list tuple in a unique index:
/* kitem must have matching offnum when heap TIDs match */
Assert(kitem->indexOffset == offnum);
https://github.com/postgres/postgres/blob/master/src/backend/access/nbtree/nbtutils.c#L1809
I struggle to understand the meaning of this assertion.
Don't we allow the chance that the posting tuple moved right on the page,
as the comment says?
I think you're right. However, it still seems like we should check
that "kitem->indexOffset" is consistent among all of the BTScanPosItem
entries that we have for each TID that we believe to be from the same
posting list tuple.
(Thinks some more...)
Even if the offnum changes when the buffer lock is released, due to
somebody inserting on to the same page, I guess that we still expect
to observe all of the heap TIDs together in the posting list. Though
maybe not. Maybe it's possible for a deduplication pass to occur when
the buffer lock is dropped, in which case we should arguably behave in
the same way when we see the same heap TIDs (i.e. delete the entire
posting list without regard for whether or not the TIDs happened to
appear in a posting list initially). I'm not sure, though.
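The expectation sketched above can be modeled in a standalone way. The following is a toy illustration only, not the actual nbtree code: the type `FakeTid` and both function names are invented for this sketch. It checks whether every heap TID that the scan remembered from one posting list tuple can still be found together on the page, which is the property Peter is speculating about.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint16_t blk; uint16_t pos; } FakeTid; /* simplified heap TID */

static bool
tid_equal(FakeTid a, FakeTid b)
{
    return a.blk == b.blk && a.pos == b.pos;
}

/*
 * Toy model of the expectation discussed above: every heap TID that the
 * scan remembered from one posting list tuple should still be found
 * somewhere in the posting lists on the page.  Returns true only if all
 * remembered TIDs are present.
 */
static bool
tids_all_present(const FakeTid *posting, int nposting,
                 const FakeTid *remembered, int nremembered)
{
    for (int i = 0; i < nremembered; i++)
    {
        bool found = false;

        for (int j = 0; j < nposting; j++)
        {
            if (tid_equal(remembered[i], posting[j]))
            {
                found = true;
                break;
            }
        }
        if (!found)
            return false;       /* a remembered TID has disappeared */
    }
    return true;
}
```

As the paragraph above notes, a concurrent deduplication pass complicates this: the TIDs may survive but be regrouped into a different posting list, so presence of the TIDs is the only invariant that clearly holds.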
It will make no difference most of the time, since the
kill_prior_tuple stuff is generally not applied when the page is
changed at all -- the LSN is checked by the logic added by commit
2ed5b87f. That's why I asked about unlogged indexes (we don't do the
LSN thing there). But I still think that we need to take a firm
position on it.
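The decision described here can be sketched as a small standalone function. This is a simplified model under stated assumptions, not the real _bt_killitems() logic: `FakeLSN` and `can_apply_kill_hints` are invented names, and the real code interleaves this with buffer pin and lock management.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t FakeLSN;       /* stand-in for PostgreSQL's XLogRecPtr */

/*
 * Simplified model of the guard from commit 2ed5b87f: if the scan still
 * holds its pin, concurrent VACUUM could not have recycled the items.
 * If the pin was dropped, hints are only safe when the page is provably
 * unchanged, which is inferred by comparing the page LSN now against the
 * LSN saved when the page was read.  An unlogged index has no trustworthy
 * page LSN, so that comparison cannot be used there.
 */
static bool
can_apply_kill_hints(bool pin_held, bool index_is_logged,
                     FakeLSN lsn_at_read, FakeLSN lsn_now)
{
    if (pin_held)
        return true;            /* pin kept concurrent VACUUM away */
    if (!index_is_logged)
        return false;           /* no reliable LSN to compare */
    return lsn_at_read == lsn_now;  /* page unchanged since we read it */
}
```

Note that even when hints can be applied, holding the pin only keeps VACUUM away; inserts can still move items around on the page while the lock is released, which is why the offnum can differ from the remembered one.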
--
Peter Geoghegan
On 20.03.2020 03:34, Peter Geoghegan wrote:
On Thu, Mar 19, 2020 at 9:34 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
During tests, we caught an assertion failure in _bt_killitems() for a
posting list tuple in a unique index:
/* kitem must have matching offnum when heap TIDs match */
Assert(kitem->indexOffset == offnum);
https://github.com/postgres/postgres/blob/master/src/backend/access/nbtree/nbtutils.c#L1809
I struggle to understand the meaning of this assertion.
Don't we allow the chance that the posting tuple moved right on the page,
as the comment says?
I think you're right. However, it still seems like we should check
that "kitem->indexOffset" is consistent among all of the BTScanPosItem
entries that we have for each TID that we believe to be from the same
posting list tuple.
What kind of consistency do you mean here?
We can probably change this assertion to
Assert(kitem->indexOffset <= offnum);
Anything else?
(Thinks some more...)
Even if the offnum changes when the buffer lock is released, due to
somebody inserting on to the same page, I guess that we still expect
to observe all of the heap TIDs together in the posting list. Though
maybe not. Maybe it's possible for a deduplication pass to occur when
the buffer lock is dropped, in which case we should arguably behave in
the same way when we see the same heap TIDs (i.e. delete the entire
posting list without regard for whether or not the TIDs happened to
appear in a posting list initially). I'm not sure, though.
It will make no difference most of the time, since the
kill_prior_tuple stuff is generally not applied when the page is
changed at all -- the LSN is checked by the logic added by commit
2ed5b87f. That's why I asked about unlogged indexes (we don't do the
LSN thing there). But I still think that we need to take a firm
position on it.
It was a logged index, though the failed test setup includes logical
replication. Does it handle LSNs differently?
Finally, Alexander Lakhin managed to reproduce this on master. Test is
attached as a patch.
Speaking of unlogged indexes: now the situation where items have moved left
on the page is legal even if the LSN hasn't changed.
Anyway, the loop starts from the offset that we saved on the first pass:
OffsetNumber offnum = kitem->indexOffset;
while (offnum <= maxoff)
...
It still works correctly, but microvacuum probably becomes less efficient
if items were concurrently deduplicated.
I wonder if this case is worth optimizing?
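The loop quoted above can be modeled in a self-contained way. The sketch below is illustrative only (invented names, 0-based offsets rather than nbtree's 1-based OffsetNumber, and a bare struct instead of ItemPointerData): it starts at the remembered offset and walks right until the heap TID matches, which handles items that moved right but, as noted above, misses items that moved left.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint16_t blk; uint16_t pos; } FakeTid; /* simplified heap TID */

/*
 * Simplified model of the _bt_killitems() search: begin at the offset
 * remembered when the item was first read and scan right until the heap
 * TID matches or the page is exhausted.  Returns the matching offset,
 * or -1 if the item is not found at or after the remembered offset.
 */
static int
find_moved_item(const FakeTid *page_items, int maxoff,
                int remembered_off, FakeTid target)
{
    for (int off = remembered_off; off <= maxoff; off++)
    {
        if (page_items[off].blk == target.blk &&
            page_items[off].pos == target.pos)
            return off;         /* found, possibly right of where we left it */
    }
    return -1;                  /* gone, or moved left of remembered_off */
}
```

The second case is exactly the inefficiency in question: an item that ended up left of the remembered offset (e.g. after deduplication) is simply never revisited, so it survives microvacuum.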
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
000_rep_btree_test.patch (text/x-patch)
diff --git a/src/test/subscription/t/000_rep_btree+.pl b/src/test/subscription/t/000_rep_btree+.pl
new file mode 100644
index 0000000000..5a7149334c
--- /dev/null
+++ b/src/test/subscription/t/000_rep_btree+.pl
@@ -0,0 +1,83 @@
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 1;
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', q{
+autovacuum = 'off'
+max_connections = 100
+});
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', q{
+autovacuum = 'off'
+max_connections = 100
+});
+$node_subscriber->start;
+
+$node_publisher->safe_psql('postgres', "CREATE TABLE t (k int primary key, v int, c int);");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE t (k int primary key, v int, c int);");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres', "ALTER PUBLICATION tap_pub ADD TABLE t");
+
+$node_subscriber->safe_psql('postgres',
+ "CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
+);
+
+$node_publisher->wait_for_catchup('tap_sub');
+
+# Also wait for initial table sync to finish
+my $synced_query = "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+ or die "Timed out while waiting for subscriber to synchronize data";
+
+$node_publisher->safe_psql('postgres', "INSERT INTO t (select generate_series(0, 399), 0, 0);");
+
+open (PGB_SCRIPT, '>/tmp/writer.pgb');
+print PGB_SCRIPT << 'END';
+\set src random(0, 399)
+\set dst random(0, 399)
+\set amount random(1, 10)
+begin;
+update t set v = v - :amount, c = c + 1 where k=:src;
+update t set v = v + :amount, c = c + 1 where k=:dst;
+commit;
+END
+close(PGB_SCRIPT);
+
+my $time = localtime();
+diag("$time: pgbench starting...");
+my $pgbench1 = IPC::Run::start(
+ [ 'pgbench', '-n', '-c', 10, '-T', 60, '-f', '/tmp/writer.pgb', '-d', $node_publisher->connstr('postgres') ],
+ '<', \undef, '>&', '/tmp/pgbench1.log');
+
+my $pgbench2 = IPC::Run::start(
+ [ 'pgbench', '-n', '-c', 10, '-T', 60, '-f', '/tmp/writer.pgb', '-d', $node_subscriber->connstr('postgres') ],
+ '<', \undef, '>&', '/tmp/pgbench2.log');
+
+for my $i (1..60 + 2) {
+ my $result1 = $node_publisher->safe_psql('postgres', "SELECT SUM(c) FROM t");
+ my $result2 = $node_subscriber->safe_psql('postgres', "SELECT SUM(c) FROM t");
+ diag("changes: $result1 / $result2");
+ sleep(1);
+}
+$pgbench1->finish;
+$pgbench2->finish;
+
+$time = localtime();
+diag("$time: pgbench ended.");
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
+ok(1);
+
On Tue, Mar 24, 2020 at 1:00 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
I think you're right. However, it still seems like we should check
that "kitem->indexOffset" is consistent among all of the BTScanPosItem
entries that we have for each TID that we believe to be from the same
posting list tuple.
The assertion failure happens in the logical replication worker
because it uses a dirty snapshot, which cannot release the pin per
commit 2ed5b87f. This means that the leaf page can change between the
time that we observe an item is dead, and the time we reach
_bt_killitems(), even though _bt_killitems() does get to kill items.
I am thinking about pushing a fix along the lines of the attached
patch. This preserves the assertion, while avoiding the check in cases
where it doesn't apply, such as when a dirty snapshot is in use.
--
Peter Geoghegan
Attachments:
v1-0001-Fix-assertion-failure-in-_bt_killitems.patch (application/octet-stream)
From d21e9e3fbed134020e0c69e7b17b6c95eb784920 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 5 Apr 2020 17:08:05 -0700
Subject: [PATCH v1] Fix assertion failure in _bt_killitems().
---
src/backend/access/nbtree/nbtutils.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index aaa0c89c7d..e8d66b73c6 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1725,6 +1725,7 @@ _bt_killitems(IndexScanDesc scan)
int i;
int numKilled = so->numKilled;
bool killedsomething = false;
+ bool pagecouldchange PG_USED_FOR_ASSERTS_ONLY;
Assert(BTScanPosIsValid(so->currPos));
@@ -1745,6 +1746,7 @@ _bt_killitems(IndexScanDesc scan)
LockBuffer(so->currPos.buf, BT_READ);
page = BufferGetPage(so->currPos.buf);
+ pagecouldchange = true;
}
else
{
@@ -1766,6 +1768,7 @@ _bt_killitems(IndexScanDesc scan)
_bt_relbuf(scan->indexRelation, buf);
return;
}
+ pagecouldchange = false;
}
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1806,8 +1809,11 @@ _bt_killitems(IndexScanDesc scan)
if (!ItemPointerEquals(item, &kitem->heapTid))
break; /* out of posting list loop */
- /* kitem must have matching offnum when heap TIDs match */
- Assert(kitem->indexOffset == offnum);
+ /*
+ * kitem must have matching offnum when heap TIDs match
+ * and page unchanged since pin was dropped
+ */
+ Assert(pagecouldchange || kitem->indexOffset == offnum);
/*
* Read-ahead to later kitems here.
--
2.25.1
On Sun, Apr 5, 2020 at 5:15 PM Peter Geoghegan <pg@bowt.ie> wrote:
I am thinking about pushing a fix along the lines of the attached
patch. This preserves the assertion, while avoiding the check in cases
where it doesn't apply, such as when a dirty snapshot is in use.
Pushed. Thanks.
--
Peter Geoghegan