BUG #17949: Adding an index introduces serialisation anomalies.

Started by PG Bug reporting form · almost 3 years ago · 27 messages · bugs
#1PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 17949
Logged by: Artem Anisimov
Email address: artem.anisimov.255@gmail.com
PostgreSQL version: 15.3
Operating system: fedora 38
Description:

Hello dear pg authors,

I have come across a behaviour in pg that contradicts
https://www.postgresql.org/docs/15/transaction-iso.html#XACT-SERIALIZABLE.

I've minimised my testcase to a scenario that essentially duplicates
one you describe in the following paragraph:

In particular, it is possible to see unique constraint violations
caused by conflicts with overlapping Serializable transactions even
after explicitly checking that the key isn't present before attempting
to insert it. This can be avoided by making sure that all Serializable
transactions that insert potentially conflicting keys explicitly check
if they can do so first.

At the end of the report there is a C reproducer that highlights the
problem. Let me give a high-level overview of the scenario first.

I have N threads each trying to "acquire an exclusive lock" this way:
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT * FROM locks WHERE path = $1;
quit if there already is a lock
INSERT INTO locks(path, ...) VALUES($1, ...);
COMMIT;

Once all threads have attempted to acquire a lock, I count how many of
them succeeded with their INSERT. Normally only one thread succeeds,
which is what I expect. However, if I add a HASH or a BTREE index on
"path", it becomes possible for multiple threads to INSERT
successfully, which violates serialisability.

The problem can be reproduced with the default postgresql.conf, but it
takes some time. If I increase "shared_buffers" to 1024MB, the issue
appears almost immediately (perhaps because there are fewer chances to
promote predicate locks to locks on the whole table?).

I've seen this behaviour with pg 13, 15 and 16 (503b055).

Now let us see the reproducer. It has two primary components:
1. test_once() spawns 32 threads that try to acquire a lock, waits for them,
and counts the number of "acquired exclusive locks",
2. try_acquire_lock() executes a transaction described above.

To build the reproducer:
$ gcc -std=c99 -o test test.c -pthread `pkg-config libpq --cflags --libs`

To run the reproducer:
$ ./test
or
$ DB_CONNSTR="dbname=abc host=def user=ghi" ./test

The reproducer:

#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <assert.h>

#include <libpq-fe.h>

// Comment this to make serialisation anomalies go away.
#define WITH_INDEX

// I have seen the problem with as few as 3 threads. 32 threads make
// the issue appear much sooner.
#define NR_THREADS (32)
#define NR_RUNS (1024 * 1024)

static PGconn *conns[NR_THREADS];

static void* try_acquire_lock(void *arg)
{
    PGconn *c = arg;
    PGresult *res;
    int ntuples;

    res = PQexec(c, "BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE");
    assert(PQresultStatus(res) == PGRES_COMMAND_OK);
    PQclear(res);

    res = PQexec(c, "SELECT * FROM locks WHERE path = 'xyz'");
    assert(PQresultStatus(res) == PGRES_TUPLES_OK);
    ntuples = PQntuples(res);
    PQclear(res);

    if (ntuples > 0) {
        // someone else already has a lock
        res = PQexec(c, "COMMIT");
        assert(PQresultStatus(res) == PGRES_COMMAND_OK);
        PQclear(res);
        return NULL;
    }

    // The INSERT and COMMIT may legitimately fail with a serialisation
    // error when another thread wins the race, so their status is
    // deliberately not asserted.
    res = PQexec(c, "INSERT INTO locks(path) VALUES('xyz')");
    PQclear(res);

    res = PQexec(c, "COMMIT");
    PQclear(res);

    return NULL;
}

static void test_once(void)
{
    PGconn *c = conns[0];
    PGresult *res;
    int ntuples;

    pthread_t thrs[NR_THREADS];
    for (int i = 0; i < NR_THREADS; ++i)
        pthread_create(&thrs[i], NULL, &try_acquire_lock, conns[i]);
    for (int i = 0; i < NR_THREADS; ++i)
        pthread_join(thrs[i], NULL);

    res = PQexec(c, "SELECT * FROM locks WHERE path = 'xyz'");
    assert(PQresultStatus(res) == PGRES_TUPLES_OK);
    ntuples = PQntuples(res);
    PQclear(res);

    if (ntuples != 1)
        printf("ntuples = %d\n", ntuples);
    assert(ntuples == 1);

    res = PQexec(c, "TRUNCATE TABLE locks");
    assert(PQresultStatus(res) == PGRES_COMMAND_OK);
    PQclear(res);
}

static void prepare_db(void)
{
    PGconn *c = conns[0];
    PGresult *res;

    // The DROP may fail if the table doesn't exist yet, so its status
    // is not checked.
    res = PQexec(c, "DROP TABLE locks");
    PQclear(res);

    res = PQexec(c, "CREATE TABLE locks (path TEXT NOT NULL)");
    assert(PQresultStatus(res) == PGRES_COMMAND_OK);
    PQclear(res);

#ifdef WITH_INDEX
    res = PQexec(c, "CREATE INDEX ON locks USING HASH(path)");
    assert(PQresultStatus(res) == PGRES_COMMAND_OK);
    PQclear(res);
#endif
}

int main(void)
{
    const char *connstr = getenv("DB_CONNSTR");
    if (connstr == NULL)
        connstr = "dbname=postgres";

    for (int i = 0; i < NR_THREADS; ++i) {
        conns[i] = PQconnectdb(connstr);
        assert(PQstatus(conns[i]) == CONNECTION_OK);
    }
    prepare_db();

    for (int i = 0; i < NR_RUNS; ++i)
        test_once();

    for (int i = 0; i < NR_THREADS; ++i)
        PQfinish(conns[i]);

    return 0;
}

#2Thomas Munro
thomas.munro@gmail.com
In reply to: PG Bug reporting form (#1)
Re: BUG #17949: Adding an index introduces serialisation anomalies.

Hi,

Reproduced here. Thanks for the reproducer. I agree that something is
wrong here, but I haven't had time to figure out what yet; let me
share what I've noticed so far. I modified your test to add a pid
column to the locks table and to insert pg_backend_pid() into it, and
got:

postgres=# select xmin, * from locks;

┌───────┬──────┬───────┐
│ xmin │ path │ pid │
├───────┼──────┼───────┤
│ 17634 │ xyz │ 32932 │
│ 17639 │ xyz │ 32957 │
└───────┴──────┴───────┘

Then I filtered the logs (having turned the logging up to capture all
queries) so I could see just those PIDs and saw this sequence:

2023-05-29 00:15:43.933 EDT [32932] LOG: duration: 0.182 ms
statement: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE
2023-05-29 00:15:43.934 EDT [32957] LOG: duration: 0.276 ms
statement: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE
2023-05-29 00:15:43.935 EDT [32932] LOG: duration: 1.563 ms
statement: SELECT * FROM locks WHERE path = 'xyz'
2023-05-29 00:15:43.936 EDT [32932] LOG: duration: 0.126 ms
statement: INSERT INTO locks(path, pid) VALUES('xyz',
pg_backend_pid())
2023-05-29 00:15:43.937 EDT [32957] LOG: duration: 2.191 ms
statement: SELECT * FROM locks WHERE path = 'xyz'
2023-05-29 00:15:43.937 EDT [32957] LOG: duration: 0.261 ms
statement: INSERT INTO locks(path, pid) VALUES('xyz',
pg_backend_pid())
2023-05-29 00:15:43.937 EDT [32932] LOG: duration: 0.222 ms statement: COMMIT
2023-05-29 00:15:43.939 EDT [32957] LOG: duration: 1.775 ms statement: COMMIT

That sequence if run (without overlap) in the logged order is normally
rejected. The query plan being used (at least when I run the query
myself) looks like this:

Query Text: SELECT * FROM locks WHERE path = 'xyz'
Bitmap Heap Scan on locks (cost=4.20..13.67 rows=6 width=36)
Recheck Cond: (path = 'xyz'::text)
-> Bitmap Index Scan on locks_path_idx (cost=0.00..4.20 rows=6 width=0)
Index Cond: (path = 'xyz'::text)

#3Artem Anisimov
artem.anisimov.255@gmail.com
In reply to: Thomas Munro (#2)
Re: BUG #17949: Adding an index introduces serialisation anomalies.

Hi Thomas,

thank you for the confirmation and analysis. Did you have a chance to
take a more detailed look at the problem?

Best regards,

Artem.


#4Thomas Munro
thomas.munro@gmail.com
In reply to: Artem Anisimov (#3)
Re: BUG #17949: Adding an index introduces serialisation anomalies.

On Tue, Jun 13, 2023 at 8:00 PM Artem Anisimov
<artem.anisimov.255@gmail.com> wrote:

thank you for the confirmation and analysis. Did you have a chance to
take a more detailed look at the problem?

Looking into it...

#5Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Thomas Munro (#2)
Re: BUG #17949: Adding an index introduces serialisation anomalies.

On Mon, May 29, 2023 at 09:29:09PM -0400, Thomas Munro wrote:

Hi,

Reproduced here. Thanks for the reproducer. I agree that something
is wrong here, but I haven't had time to figure out what, yet, but let
me share what I noticed so far... I modified your test to add a pid
column to the locks table and to insert insert pg_backend_pid() into
it, and got:

postgres=# select xmin, * from locks;

┌───────┬──────┬───────┐
│ xmin │ path │ pid │
├───────┼──────┼───────┤
│ 17634 │ xyz │ 32932 │
│ 17639 │ xyz │ 32957 │
└───────┴──────┴───────┘

Then I filtered the logs (having turned the logging up to capture all
queries) so I could see just those PIDs and saw this sequence:

2023-05-29 00:15:43.933 EDT [32932] LOG: duration: 0.182 ms
statement: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE
2023-05-29 00:15:43.934 EDT [32957] LOG: duration: 0.276 ms
statement: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE
2023-05-29 00:15:43.935 EDT [32932] LOG: duration: 1.563 ms
statement: SELECT * FROM locks WHERE path = 'xyz'
2023-05-29 00:15:43.936 EDT [32932] LOG: duration: 0.126 ms
statement: INSERT INTO locks(path, pid) VALUES('xyz',
pg_backend_pid())
2023-05-29 00:15:43.937 EDT [32957] LOG: duration: 2.191 ms
statement: SELECT * FROM locks WHERE path = 'xyz'
2023-05-29 00:15:43.937 EDT [32957] LOG: duration: 0.261 ms
statement: INSERT INTO locks(path, pid) VALUES('xyz',
pg_backend_pid())
2023-05-29 00:15:43.937 EDT [32932] LOG: duration: 0.222 ms statement: COMMIT
2023-05-29 00:15:43.939 EDT [32957] LOG: duration: 1.775 ms statement: COMMIT

I've tried to reproduce it as well, adding more logging around the
serialization code. If it helps, what I observe is that the second
overlapping transaction, the one that started a bit later, does not
error out, because in OnConflict_CheckForSerializationFailure (when
checking for "writer has become a pivot") there are no more conflicts
returned by SHMQueueNext. All the rest of the reported serialization
conflicts come from this check, so I assume the incorrect transaction
should fail there too. Not sure yet why that is so.

#6Thomas Munro
thomas.munro@gmail.com
In reply to: Dmitry Dolgov (#5)
Re: BUG #17949: Adding an index introduces serialisation anomalies.

On Thu, Jun 15, 2023 at 7:29 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

I've tried to reproduce it as well, adding more logging around the
serialization code. If it helps, what I observe is the second
overlapping transaction, that has started a bit later, do not error out
because in OnConflict_CheckForSerializationFailure (when checking for
"writer has become a pivot") there are no more conflicts received from
SHMQueueNext. All the rest of the reported serialization conflicts are
coming from this check, so I assume the incorrect transaction should
fail there too. Not sure yet why is that so.

Some more observations: it happens on 11 and master, happens with
btrees, and happens with bitmapscan disabled (eg with a plain index
scan), but so far in my testing it doesn't happen if the table already
contains one other tuple (ie if you change the reproducer to insert
another row ('foo') after the TRUNCATE). There is a special case for
predicate locking empty indexes, which uses a relation-level lock
(since there are no pages to lock yet), but that doesn't seem to be
wrong, and if you hack it to lock pages 1 and 2 instead, it still
reproduces. Pondering the empty-index case made me wonder if the case
"If we found one of our own SIREAD locks to remove, remove it now" was
implicated (that's something that would not happen for a
relation-level lock), but it still reproduces if you comment out that
optimisation. So far I have not been able to reproduce it below 8
threads. Hmm, I wonder if there might be a missing check/lock in some
racy code path around the initial creation of the root page...

#7Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#6)
Re: BUG #17949: Adding an index introduces serialisation anomalies.

How is this schedule supposed to work?

S1: _bt_search(&buf)
S1: if (!BufferIsValid(buf)) // because index is empty
S1: {
S1: PredicateLockRelation(...);
S1: ...
S1: return false; // no tuples for you

S2: _bt_search(&buf)

S1: INSERT ...

S2: if (!BufferIsValid(buf)) // because index *was* empty
S2: {
S2: PredicateLockRelation(...);
S2: ...
S2: return false; // no tuples for you

S2: ...

My point is that S2 won't ever scan S1's tuples, so it won't pass S1's
xid to CheckForSerializableConflictOut(). Am I missing something? I
can repro this with NR_THREADS set to only 2, after inserting
pg_usleep(1) after the _bt_search() in _bt_first() (sched_yield()
wasn't quite enough).

#8Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#7)
Re: BUG #17949: Adding an index introduces serialisation anomalies.

On Mon, Jun 19, 2023 at 2:29 PM Thomas Munro <thomas.munro@gmail.com> wrote:

S2: _bt_search(&buf)

S1: INSERT ...

S2: PredicateLockRelation(...);

My point is that S2 won't ever scan S1's tuples, so it won't pass S1's
xid to CheckForSerializableConflictOut()

[completing that sentence a little more] ... so this was our only
chance to detect that S2 read an object that S1 wrote. But the
_bt_search() and PredicateLockXXX() calls are not atomic and not
rechecked, so a write between them is invisible to the algorithm. I'm
not sure about this, but it's the idea I'm exploring...

#9Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#8)
Re: BUG #17949: Adding an index introduces serialisation anomalies.

The attached shows one approach to the problem I described already,
namely scans of empty btrees that concurrently gain an initial root
page. This seems to fix the posted repro case if using a btree with
enable_seqscan=off, enable_bitmapscan=off. That's the variant I had
been working with, because it was simpler.

But that turns out to be only one problem hit by the repro.
Apparently there's a second bug, if you let it use bitmap heapscans.
Or perhaps I am misdiagnosing the above. It could be something like:
btgetbitmap not following the predicate lock protocol correctly (but
it looks OK at a glance), or bitmap heapscan not checking for
conflicts out comprehensively (xids on invisible tuples we scan), eg
not fetching heap tuples for some short cut reason somewhere. But
that's all I have time for today.

Attachments:

0001-Fix-rare-race-in-SSI-interaction-with-empty-btrees.patch (application/octet-stream, +24 -15)
#10Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#9)
Re: BUG #17949: Adding an index introduces serialisation anomalies.

On Mon, Jun 19, 2023 at 6:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:

... or bitmap heapscan not checking for
conflicts out comprehensively (xids on invisible tuples we scan), eg
not fetching heap tuples for some short cut reason somewhere. ...

Ahh, here's such a place. I can't reproduce it with the patch already
posted + this check commented out.

--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2127,16 +2127,18 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,

hscan->rs_cindex = 0;
hscan->rs_ntuples = 0;

+#if 0
/*
* Ignore any claimed entries past what we think is the end of the
* relation. It may have been extended after the start of our scan (we
* only hold an AccessShareLock, and it could be inserts from this
* backend).
*/
if (block >= hscan->rs_nblocks)
return false;
+#endif

#11Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Thomas Munro (#10)
Re: BUG #17949: Adding an index introduces serialisation anomalies.

On Mon, Jun 19, 2023 at 09:30:12PM +1200, Thomas Munro wrote:
On Mon, Jun 19, 2023 at 6:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:

... or bitmap heapscan not checking for
conflicts out comprehensively (xids on invisible tuples we scan), eg
not fetching heap tuples for some short cut reason somewhere. ...

Ahh, here's such a place. I can't reproduce it with the patch already
posted + this check commented out.

--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2127,16 +2127,18 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,

hscan->rs_cindex = 0;
hscan->rs_ntuples = 0;

+#if 0
/*
* Ignore any claimed entries past what we think is the end of the
* relation. It may have been extended after the start of our scan (we
* only hold an AccessShareLock, and it could be inserts from this
* backend).
*/
if (block >= hscan->rs_nblocks)
return false;
+#endif

Great, thanks! I can confirm that after applying both the posted patch
and the change above, the issue is no longer reproducible.

One thing I've noticed is that one can observe a similar issue using a
gin index and int[] for the "path" column, even with the changes from
this thread applied. The gin implementation does something similar to
btree in startScanEntry: it lands in the "No entry found" branch, but
instead of locking the relation it locks "the leaf page, to lock the
place where the entry would've been, had there been one". A similar
fix that retries ginFindLeafPage didn't solve the problem, even when
locking the whole relation instead, but maybe I'm missing something.

#12Thomas Munro
thomas.munro@gmail.com
In reply to: Dmitry Dolgov (#11)
Re: BUG #17949: Adding an index introduces serialisation anomalies.

On Tue, Jun 20, 2023 at 12:18 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

On Mon, Jun 19, 2023 at 09:30:12PM +1200, Thomas Munro wrote:
+#if 0
/*
* Ignore any claimed entries past what we think is the end of the
* relation. It may have been extended after the start of our scan (we
* only hold an AccessShareLock, and it could be inserts from this
* backend).
*/
if (block >= hscan->rs_nblocks)
return false;
+#endif

Great, thanks! Can confirm, after applying both the posted patch and the
change above the issue is not reproducible anymore.

Here's a cleaned-up version of the first two changes. What do you
think about the assertions I make in the commit message for 0002?

One thing I've noticed is that one can observe a similar issue using a
gin index and int[] for the "path" column, even applying changes from
the thread. The gin implementation does something similar to btree in
startScanEntry -- it lands in "No entry found" branch, but instead of
locking the relation it locks "the leaf page, to lock the place where
the entry would've been, had there been one". The similar fix retrying
ginFindLeafPage didn't solve the problem, even if locking the whole
relation instead, but maybe I'm missing something.

Ouch. I would have to go and study gin's interlocking model, but one
superficial bug I spotted is that ginget.c's collectMatchBitmap()
calls PredicateLockPage(stack->buffer), where a block number is
expected. I wish we had strong typedefs, to reject stuff like that at
compile time. But fixing that alone isn't enough.

In case someone who knows more about gin is interested in helping, I
attach Artem's repro, modified to use gin.

Attachments:

v2-0001-Fix-race-in-SSI-interaction-with-empty-btrees.patch (text/x-patch, +25 -15)
v2-0002-Fix-race-in-SSI-interaction-with-bitmap-heap-scan.patch (text/x-patch, +0 -10)
test-ssi-gin-repro.c (text/x-csrc)
#13Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#12)
Re: BUG #17949: Adding an index introduces serialisation anomalies.

On Tue, Jun 20, 2023 at 1:22 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Tue, Jun 20, 2023 at 12:18 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

One thing I've noticed is that one can observe a similar issue using a
gin index and int[] for the "path" column, even applying changes from
the thread. The gin implementation does something similar to btree in
startScanEntry -- it lands in "No entry found" branch, but instead of
locking the relation it locks "the leaf page, to lock the place where
the entry would've been, had there been one". The similar fix retrying
ginFindLeafPage didn't solve the problem, even if locking the whole
relation instead, but maybe I'm missing something.

Ouch. I would have to go and study gin's interlocking model, but one
superficial bug I spotted is that ginget.c's collectMatchBitmap()
calls PredicateLockPage(stack->buffer), where a block number is
expected. I wish we had strong typedefs, to reject stuff like that at
compile time. But fixing that alone isn't enough.

In case someone who knows more about gin is interested in helping, I
attach Artem's repro, modified to use gin.

This is probably going to go faster if I CC the authors of commit
0bef1c06. Any ideas about how we're missing rw-conflicts under high
concurrency?

#14Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#13)
Re: BUG #17949: Adding an index introduces serialisation anomalies.

On Wed, Jun 21, 2023 at 10:54 AM Thomas Munro <thomas.munro@gmail.com> wrote:

This is probably going to go faster if I CC the authors of commit
0bef1c06. Any ideas about how we're missing rw-conflicts under high
concurrency?

I guess one (probably stupid) question I have: where is the
CheckForSerializableConflictIn(rootPostingTree) that I was expecting
to see when writing to an existing posting tree? IIUC that should
pair with PredicateLockPage(btree->index, rootPostingTree, snapshot)
when reading.

#15Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Thomas Munro (#12)
Re: BUG #17949: Adding an index introduces serialisation anomalies.

On Tue, Jun 20, 2023 at 01:22:19PM +1200, Thomas Munro wrote:
On Tue, Jun 20, 2023 at 12:18 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

On Mon, Jun 19, 2023 at 09:30:12PM +1200, Thomas Munro wrote:
+#if 0
/*
* Ignore any claimed entries past what we think is the end of the
* relation. It may have been extended after the start of our scan (we
* only hold an AccessShareLock, and it could be inserts from this
* backend).
*/
if (block >= hscan->rs_nblocks)
return false;
+#endif

Great, thanks! Can confirm, after applying both the posted patch and the
change above the issue is not reproducible anymore.

Here's a cleaned-up version of the first two changes. What do you
think about the assertions I make in the commit message for 0002?

Yep, it sounds correct to me. After a quick look I couldn't find where
exactly the similar code lives in the pre-tableam versions, so I can't
say anything about back-patching there.

#16Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Thomas Munro (#14)
Re: BUG #17949: Adding an index introduces serialisation anomalies.

On Wed, Jun 21, 2023 at 11:20:28AM +1200, Thomas Munro wrote:
On Wed, Jun 21, 2023 at 10:54 AM Thomas Munro <thomas.munro@gmail.com> wrote:

This is probably going to go faster if I CC the authors of commit
0bef1c06. Any ideas about how we're missing rw-conflicts under high
concurrency?

I guess one (probably stupid) question I have: where is the
CheckForSerializableConflictIn(rootPostingTree) that I was expecting
to see when writing to an existing posting tree? IIUC that should
pair with PredicateLockPage(btree->index, rootPostingTree, snapshot)
when reading.

I can't find it either, but based on my superficial investigation this
particular reproducer doesn't seem to hit the posting-tree
functionality at all. What I observe is that the inserting transaction
uses ginHeapTupleFastCollect + ginHeapTupleFastInsert, and the
corresponding comment says that serialization in this case depends on
the metabuffer:

/*
* An insertion to the pending list could logically belong anywhere in the
* tree, so it conflicts with all serializable scans. All scans acquire a
* predicate lock on the metabuffer to represent that.
*/

Now, the reading transaction actually does PredicateLockPage on the
metabuffer inside scanPendingInsert, but strangely it doesn't lock
anything, because the SerializationNeededForRead condition is false.
I'm trying to verify whether that is somehow part of the issue, or
whether something is broken on my side.

#17Thomas Munro
thomas.munro@gmail.com
In reply to: Dmitry Dolgov (#16)
Re: BUG #17949: Adding an index introduces serialisation anomalies.

On Wed, Jun 21, 2023 at 9:04 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

Now the reading transaction actually does PredicateLockPage on the
metabuffer inside scanPendingInsert, but strangely enough it doesn't
lock anything because the SerializationNeededForRead condition is false.
I'm trying to verify if it's somehow a part of the issue, or something
is broken on my side.

Maybe you were confused by the presence of non-SSI transactions in the
repro (eg the transaction that sets up the index)?

To answer my own earlier question, the conflict-in check for posting
trees is hidden in getFindLeafPage(..., true, ...).

I spent some more time trying to grok this today. FTR it reproduces
faster without the extra tuple that the repro I posted inserts after
TRUNCATE (the point of that was to find out whether it was an
empty-to-non-empty transition). I still don't know what's wrong, but I
am beginning to suspect the "fast" code. It seems as though, under
high concurrency, we sometimes fail to scan a recently inserted tuple
(invisible to our snapshot, but needed for SSI checks), but I don't
yet know why.

#18Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Thomas Munro (#17)
Re: BUG #17949: Adding an index introduces serialisation anomalies.

On Thu, Jun 22, 2023 at 10:02:19PM +1200, Thomas Munro wrote:
On Wed, Jun 21, 2023 at 9:04 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

Now the reading transaction actually does PredicateLockPage on the
metabuffer inside scanPendingInsert, but strangely enough it doesn't
lock anything because the SerializationNeededForRead condition is false.
I'm trying to verify if it's somehow a part of the issue, or something
is broken on my side.

Maybe you were confused by the presence of non-SSI transactions in the
repro (eg the transaction that sets up the index)?

Yeah, sort of. I need to optimize the way I consume the logs.

To answer my own earlier question, the conflict-in check for posting
trees is hidden in getFindLeafPage(..., true, ...).

I spent some more time trying to grok this today. FTR it reproduces
faster without the extra tuple that repro I posted inserts after
TRUNCATE (the point of that was to find out whether it was an
empty-to-non-empty transition). I still don't know what's wrong but I
am beginning to suspect the "fast" code. It seems as though, under
high concurrency, we sometimes don't scan a recently inserted
(invisible to our snapshot, but needed for SSI checks) tuple, but I
don't yet know why.

Yep, it's definitely something in the "fast" path. Testing the same
scenario with an index created with (fastupdate=off) works just fine
for me.

#19Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Dmitry Dolgov (#18)
Re: BUG #17949: Adding an index introduces serialisation anomalies.

On Fri, Jun 23, 2023 at 04:05:42PM +0200, Dmitry Dolgov wrote:

I spent some more time trying to grok this today. FTR it reproduces
faster without the extra tuple that repro I posted inserts after
TRUNCATE (the point of that was to find out whether it was an
empty-to-non-empty transition). I still don't know what's wrong but I
am beginning to suspect the "fast" code. It seems as though, under
high concurrency, we sometimes don't scan a recently inserted
(invisible to our snapshot, but needed for SSI checks) tuple, but I
don't yet know why.

Yep, it's definitely something in the "fast" path. Testing the same, but
with an index having (fastupdate=off) works just fine for me.

I've managed to resolve it, or at least reduce the chances of the
issue appearing, by semi-randomly adding more
CheckForSerializableConflictIn / PredicateLock calls around the new
sublist that has to be created in ginHeapTupleFastInsert. I haven't
seen the reproducer fail with this changeset after running it multiple
times for a couple of minutes, whereas on the main branch, with the
two fixes from Thomas included, it was failing within a couple of
seconds.

	diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
	@@ -198,6 +199,7 @@ makeSublist(Relation index, IndexTuple *tuples, int32 ntuples,
			/*
			 * Write last page
			 */
	+       CheckForSerializableConflictIn(index, NULL, BufferGetBlockNumber(curBuffer));
			res->tail = BufferGetBlockNumber(curBuffer);
			res->tailFreeSize = writeListPage(index, curBuffer,
	@@ -273,6 +275,11 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
							separateList = true;
							LockBuffer(metabuffer, GIN_UNLOCK);
					}
	+               else
	+               {
	+                       CheckForSerializableConflictIn(index, NULL, metadata->head);
	+               }
			}
	diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
	@@ -140,7 +140,7 @@ collectMatchBitmap(GinBtreeData *btree, GinBtreeStack *stack,
			 * Predicate lock entry leaf page, following pages will be locked by
			 * moveRightIfItNeeded()
			 */
	-       PredicateLockPage(btree->index, stack->buffer, snapshot);
	+       PredicateLockPage(btree->index, BufferGetBlockNumber(stack->buffer), snapshot);
			for (;;)
			{
	@@ -1925,6 +1926,8 @@ gingetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
			/*
			 * Set up the scan keys, and check for unsatisfiable query.
			 */
	+       PredicateLockRelation(scan->indexRelation, scan->xs_snapshot);
			ginFreeScanKeys(so);            /* there should be no keys yet, but just to be

Now, the last PredicateLockRelation does look rather weird. But
without it, or with this locking happening in scanPendingInsert
(instead of locking the meta page), or without the other changes in
ginfast.c, the reproducer still fails. This of course makes it not a
proper solution by any means, but hopefully it will help in
understanding the problem a bit better.

#20Thomas Munro
thomas.munro@gmail.com
In reply to: Dmitry Dolgov (#19)
Re: BUG #17949: Adding an index introduces serialisation anomalies.

On Sun, Jun 25, 2023 at 1:59 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

I've managed to resolve it, or at least reduce the chances for the issue to
appear, via semi-randomly adding more CheckForSerializableConflictIn /
PredicateLock around the new sublist that has to be created in
ginHeapTupleFastInsert. I haven't seen the reproducer failing with this
changeset after running it multiple times for a couple of minutes, where on the
main branch, with the two fixes from Thomas included, it was failing within a
couple of seconds.

Ahh, right, thanks. I don't think we need to lock all those pages as
you showed, since this whole "fast" path is supposed to be covered by
the meta page (in other words, GIN is expected to have the highest
possible serialisation failure rate under SSI unless you turn fast
updates off). But there is an ordering bug with the existing
predicate lock in that code, which allows this to happen:

S1: CheckForSerializableConflictIn(meta)

S2: PredicateLockPage(meta)
S2: scan, find no tuples

S1: BufferLock(EXCLUSIVE)
S1: modify stuff...

CheckForSerializableConflictIn() was written with the assumption that
you're inserting a tuple (ie you have the page containing the tuple
locked), so you'll either conflict with a reader who already has a
predicate lock at that point, OR you'll insert first and then the
reader will see your (invisible-to-snapshot) tuples. But here we're
doing some fancy footwork with a meta page, and we screwed up the
ordering, leaving a window in which neither of those things happens.
Perhaps it was coded that way because of the drop-then-reacquire
dance, but it's easy enough to move the check in both branches. Does
that make sense?

Attachments:

v3-0001-Fix-race-in-SSI-interaction-with-empty-btrees.patch (text/x-patch, +25 -15)
v3-0002-Fix-race-in-SSI-interaction-with-bitmap-heap-scan.patch (text/x-patch, +0 -10)
v3-0003-Fix-race-in-SSI-interaction-with-gin-fast-path.patch (text/x-patch, +11 -4)
#21Peter Geoghegan
In reply to: Thomas Munro (#20)
#22Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Thomas Munro (#20)
#23Thomas Munro
thomas.munro@gmail.com
In reply to: Peter Geoghegan (#21)
#24Thomas Munro
thomas.munro@gmail.com
In reply to: Heikki Linnakangas (#22)
#25Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#24)
#26Artem Anisimov
artem.anisimov.255@gmail.com
In reply to: Thomas Munro (#25)
#27Thomas Munro
thomas.munro@gmail.com
In reply to: Artem Anisimov (#26)