Too coarse predicate locks granularity for B+ tree indexes
Hi,
TLDR: this email describes a serialization failure that happens (as I
understand it) due to too coarse predicate locks granularity for primary
key index.
I have a concurrent testsuite that runs 14 test cases. Each test case
operates on a disjoint set of records, doesn't retry transactions and is
run under 'serializable' isolation level. The test data is small and likely
fits within a single tuple page.
When I finished the test suite I was surprised that PostgreSQL 14.5 returns
serialization failure on every test suite run. I was even more surprised
when I tested the suite against the current CockroachDB and didn't get
serialization failures. Actually I was able to reproduce RETRY_SERIALIZABLE
errors a couple of times on CockroachDB but it required me to run the test
suite in a loop for more than a half hour.
I started to investigate the test behavior with PostgreSQL with more
simplified and shrinked code and found a serialization failure of two
concurrent `update_user` operations.
The test defines the following `Users` table:
CREATE TABLE Users (
id UUID,
title VARCHAR(255),
first_name VARCHAR(40),
last_name VARCHAR(80) NOT NULL,
email VARCHAR(255) NOT NULL,
lower_email VARCHAR(255) GENERATED ALWAYS AS (lower(email)) STORED,
marketing_optin BOOLEAN,
mobile_phone VARCHAR(50),
phone VARCHAR(50),
phone_ext VARCHAR(40),
is_contact BOOLEAN DEFAULT false NOT NULL,
unlinked_link_ids UUID[],
CONSTRAINT unique_user_email UNIQUE(lower_email),
PRIMARY KEY (id)
);
Concurrent `update_user` operation run the UPDATE query to change user
email to a unique value
UPDATE Users
SET
title = CASE WHEN false= true THEN 'foo' ELSE title END,
first_name = CASE WHEN false= true THEN 'foo' ELSE first_name END,
last_name = CASE WHEN false= true THEN 'foo' ELSE last_name END,
email = CASE WHEN true = true THEN 'email2' ELSE email END,
marketing_optin = CASE WHEN false = true THEN true ELSE
marketing_optin END,
mobile_phone = CASE WHEN false = true THEN 'foo' ELSE mobile_phone END,
phone = CASE WHEN false = true THEN 'foo' ELSE phone END,
phone_ext = CASE WHEN false = true THEN 'foo' ELSE phone_ext END
WHERE id = '018629fd-7b28-743c-8647-b6321c166d46';
I use the following helper view to monitor locks:
CREATE VIEW locks_v AS
SELECT pid,
virtualtransaction,
locktype,
CASE locktype
WHEN 'relation' THEN relation::regclass::text
WHEN 'virtualxid' THEN virtualxid::text
WHEN 'transactionid' THEN transactionid::text
WHEN 'tuple' THEN
relation::regclass::text||':'||page::text||':'||tuple::text
WHEN 'page' THEN relation::regclass::text||':'||page::text
END AS lockid,
mode,
granted
FROM pg_locks;
When the test Users table has only a few records the query uses a
sequential scan the serialization failure is reproducible without inserting
sleeps before `update_user` transaction commit.
This is caused by relation level predicate locks on Users table:
select * from locks_v;
pid | virtualtransaction | locktype | lockid |
mode | granted------+--------------------+---------------+-------------------+------------------+---------
3676 | 5/2444 | relation | unique_user_email |
RowExclusiveLock | t
3676 | 5/2444 | relation | users_pkey |
RowExclusiveLock | t
3676 | 5/2444 | relation | users |
RowExclusiveLock | t
3676 | 5/2444 | virtualxid | 5/2444 |
ExclusiveLock | t
3737 | 4/13470 | relation | pg_locks |
AccessShareLock | t
3737 | 4/13470 | relation | locks_v |
AccessShareLock | t
3737 | 4/13470 | virtualxid | 4/13470 |
ExclusiveLock | t
3669 | 3/17334 | relation | unique_user_email |
RowExclusiveLock | t
3669 | 3/17334 | relation | users_pkey |
RowExclusiveLock | t
3669 | 3/17334 | relation | users |
RowExclusiveLock | t
3669 | 3/17334 | virtualxid | 3/17334 |
ExclusiveLock | t
3676 | 5/2444 | transactionid | 6571 |
ExclusiveLock | t
3669 | 3/17334 | transactionid | 6570 |
ExclusiveLock | t
3676 | 5/2444 | relation | users |
SIReadLock | t
3669 | 3/17334 | relation | users |
SIReadLock | t
(15 rows)
If I add ballast data to Users table (1000 records) the cost optimizer
switches to index scan and it's hard to reproduce the issue for two
concurrent `update_user` operations without sleeps. After adding long
sleeps after UPDATE query and before commit I could see page-level
predicates locks for the primary key index users_pkey:
select * from locks_v;
pid | virtualtransaction | locktype | lockid | mode
| granted-----+--------------------+---------------+-------------------+------------------+---------
371 | 6/523 | relation | unique_user_email |
RowExclusiveLock | t
371 | 6/523 | relation | users_pkey |
RowExclusiveLock | t
371 | 6/523 | relation | users |
RowExclusiveLock | t
371 | 6/523 | virtualxid | 6/523 |
ExclusiveLock | t
381 | 14/215 | relation | unique_user_email |
RowExclusiveLock | t
381 | 14/215 | relation | users_pkey |
RowExclusiveLock | t
381 | 14/215 | relation | users |
RowExclusiveLock | t
381 | 14/215 | virtualxid | 14/215 |
ExclusiveLock | t
350 | 4/885 | relation | pg_locks |
AccessShareLock | t
350 | 4/885 | relation | locks_v |
AccessShareLock | t
350 | 4/885 | virtualxid | 4/885 |
ExclusiveLock | t
371 | 6/523 | transactionid | 1439 |
ExclusiveLock | t
381 | 14/215 | transactionid | 1431 |
ExclusiveLock | t
381 | 14/215 | page | users_pkey:5 | SIReadLock
| t
371 | 6/523 | page | users_pkey:5 | SIReadLock
| t
(15 rows)
With sleeps the serialization failure is reproduced on each run.
I started to read more about SSI implementation in PostgreSQL. The article
https://arxiv.org/pdf/1208.4179.pdf mentions that
Currently, locks on B+-tree indexes are acquired at page granularity; we
intend to refine this to next-key locking [16] in a future release.
[16]: C. Mohan. ARIES/KVL: A key-value locking method for concurrency
control of multiaction transactions operating on B-tree indexes. In VLDB,
pages 392–405, 1990.
My question follows:
Does the current PostgreSQL release support B+ tree index predicate locks
more granular then page-level locks?
With kindest regards, Rinat Shigapov
On Tue, 2023-02-07 at 16:23 +0600, Rinat Shigapov wrote:
I have a concurrent testsuite that runs 14 test cases. Each test case operates
on a disjoint set of records, doesn't retry transactions and is run under
'serializable' isolation level. The test data is small and likely fits within
a single tuple page.When I finished the test suite I was surprised that PostgreSQL 14.5 returns
serialization failure on every test suite run.
This is no question for the hackers list; redirecting to general.
That behavior sounds perfectly normal to me: if everything is in a single
page, PostgreSQL probably won't use an index scan. With a sequential scan,
the predicate lock will be on the whole table. So you should expect
serialization failures. This is well documented.
Perhaps you should use a more realistic test case with a reasonable
amount of data.
Yours,
Laurenz Albe
Thank you for your prompt reply!
I've mentioned that I've generated ballast data to make the cost optimizer
to switch to page-level locks.
But my question is about more finer grained (less then page) predicate
locks for indices. With page-level locks I could still get serialization
failures if I add more queries (or emulate it with sleeps) to the
transaction with the UPDATE Users query.
Below I describe the problem again for psql-general:
I have a concurrent testsuite that runs 14 test cases. Each test case
operates on a disjoint set of records, doesn't retry transactions and is
run under 'serializable' isolation level. The test data is small and likely
fits within a single tuple page.
When I finished the test suite I was surprised that PostgreSQL 14.5 returns
serialization failure on every test suite run. I was even more surprised
when I tested the suite against the current CockroachDB and didn't get
serialization failures. Actually I was able to reproduce RETRY_SERIALIZABLE
errors a couple of times on CockroachDB but it required me to run the test
suite in a loop for more than a half hour.
I started to investigate the test behavior with PostgreSQL with more
simplified and shrinked code and found a serialization failure of two
concurrent `update_user` operations.
The test defines the following `Users` table:
CREATE TABLE Users (
id UUID,
title VARCHAR(255),
first_name VARCHAR(40),
last_name VARCHAR(80) NOT NULL,
email VARCHAR(255) NOT NULL,
lower_email VARCHAR(255) GENERATED ALWAYS AS (lower(email)) STORED,
marketing_optin BOOLEAN,
mobile_phone VARCHAR(50),
phone VARCHAR(50),
phone_ext VARCHAR(40),
is_contact BOOLEAN DEFAULT false NOT NULL,
unlinked_link_ids UUID[],
CONSTRAINT unique_user_email UNIQUE(lower_email),
PRIMARY KEY (id)
);
Concurrent `update_user` operation run the UPDATE query to change user
email to a unique value
UPDATE Users
SET
title = CASE WHEN false= true THEN 'foo' ELSE title END,
first_name = CASE WHEN false= true THEN 'foo' ELSE first_name END,
last_name = CASE WHEN false= true THEN 'foo' ELSE last_name END,
email = CASE WHEN true = true THEN 'email2' ELSE email END,
marketing_optin = CASE WHEN false = true THEN true ELSE
marketing_optin END,
mobile_phone = CASE WHEN false = true THEN 'foo' ELSE mobile_phone END,
phone = CASE WHEN false = true THEN 'foo' ELSE phone END,
phone_ext = CASE WHEN false = true THEN 'foo' ELSE phone_ext END
WHERE id = '018629fd-7b28-743c-8647-b6321c166d46';
I use the following helper view to monitor locks:
CREATE VIEW locks_v AS
SELECT pid,
virtualtransaction,
locktype,
CASE locktype
WHEN 'relation' THEN relation::regclass::text
WHEN 'virtualxid' THEN virtualxid::text
WHEN 'transactionid' THEN transactionid::text
WHEN 'tuple' THEN
relation::regclass::text||':'||page::text||':'||tuple::text
WHEN 'page' THEN relation::regclass::text||':'||page::text
END AS lockid,
mode,
granted
FROM pg_locks;
When the test Users table has only a few records the query uses a
sequential scan the serialization failure is reproducible without inserting
sleeps before `update_user` transaction commit.
This is caused by relation level predicate locks on Users table:
select * from locks_v;
pid | virtualtransaction | locktype | lockid |
mode | granted------+--------------------+---------------+-------------------+------------------+---------
3676 | 5/2444 | relation | unique_user_email |
RowExclusiveLock | t
3676 | 5/2444 | relation | users_pkey |
RowExclusiveLock | t
3676 | 5/2444 | relation | users |
RowExclusiveLock | t
3676 | 5/2444 | virtualxid | 5/2444 |
ExclusiveLock | t
3737 | 4/13470 | relation | pg_locks |
AccessShareLock | t
3737 | 4/13470 | relation | locks_v |
AccessShareLock | t
3737 | 4/13470 | virtualxid | 4/13470 |
ExclusiveLock | t
3669 | 3/17334 | relation | unique_user_email |
RowExclusiveLock | t
3669 | 3/17334 | relation | users_pkey |
RowExclusiveLock | t
3669 | 3/17334 | relation | users |
RowExclusiveLock | t
3669 | 3/17334 | virtualxid | 3/17334 |
ExclusiveLock | t
3676 | 5/2444 | transactionid | 6571 |
ExclusiveLock | t
3669 | 3/17334 | transactionid | 6570 |
ExclusiveLock | t
3676 | 5/2444 | relation | users |
SIReadLock | t
3669 | 3/17334 | relation | users |
SIReadLock | t
(15 rows)
If I add ballast data to Users table (1000 records) the cost optimizer
switches to index scan and it's hard to reproduce the issue for two
concurrent `update_user` operations without sleeps. After adding long
sleeps after UPDATE query and before commit I could see page-level
predicates locks for the primary key index users_pkey:
select * from locks_v;
pid | virtualtransaction | locktype | lockid | mode
| granted-----+--------------------+---------------+-------------------+------------------+---------
371 | 6/523 | relation | unique_user_email |
RowExclusiveLock | t
371 | 6/523 | relation | users_pkey |
RowExclusiveLock | t
371 | 6/523 | relation | users |
RowExclusiveLock | t
371 | 6/523 | virtualxid | 6/523 |
ExclusiveLock | t
381 | 14/215 | relation | unique_user_email |
RowExclusiveLock | t
381 | 14/215 | relation | users_pkey |
RowExclusiveLock | t
381 | 14/215 | relation | users |
RowExclusiveLock | t
381 | 14/215 | virtualxid | 14/215 |
ExclusiveLock | t
350 | 4/885 | relation | pg_locks |
AccessShareLock | t
350 | 4/885 | relation | locks_v |
AccessShareLock | t
350 | 4/885 | virtualxid | 4/885 |
ExclusiveLock | t
371 | 6/523 | transactionid | 1439 |
ExclusiveLock | t
381 | 14/215 | transactionid | 1431 |
ExclusiveLock | t
381 | 14/215 | page | users_pkey:5 | SIReadLock
| t
371 | 6/523 | page | users_pkey:5 | SIReadLock
| t
(15 rows)
With sleeps the serialization failure is reproduced on each run.
I started to read more about SSI implementation in PostgreSQL. The article
https://arxiv.org/pdf/1208.4179.pdf mentions that
Currently, locks on B+-tree indexes are acquired at page granularity; we
intend to refine this to next-key locking [16] in a future release.
[16]: C. Mohan. ARIES/KVL: A key-value locking method for concurrency
control of multiaction transactions operating on B-tree indexes. In VLDB,
pages 392–405, 1990.
My question follows:
Does the current PostgreSQL release support B+ tree index predicate locks
more granular then page-level locks?
With kindest regards, Rinat Shigapov
вт, 7 февр. 2023 г. в 16:29, Laurenz Albe <laurenz.albe@cybertec.at>:
Show quoted text
On Tue, 2023-02-07 at 16:23 +0600, Rinat Shigapov wrote:
I have a concurrent testsuite that runs 14 test cases. Each test case
operates
on a disjoint set of records, doesn't retry transactions and is run under
'serializable' isolation level. The test data is small and likely fitswithin
a single tuple page.
When I finished the test suite I was surprised that PostgreSQL 14.5
returns
serialization failure on every test suite run.
This is no question for the hackers list; redirecting to general.
That behavior sounds perfectly normal to me: if everything is in a single
page, PostgreSQL probably won't use an index scan. With a sequential scan,
the predicate lock will be on the whole table. So you should expect
serialization failures. This is well documented.Perhaps you should use a more realistic test case with a reasonable
amount of data.Yours,
Laurenz Albe
On Tue, Feb 7, 2023 at 11:24 PM Rinat Shigapov <rinatshigapov@gmail.com> wrote:
Does the current PostgreSQL release support B+ tree index predicate locks more granular then page-level locks?
No. I tried to follow some breadcrumbs left by Kevin and Dan that
should allow unique index scans that find a match to skip the btree
page lock, though, and p-lock just the heap tuple. If you like
half-baked experimental code, see the v4-0002 patch in this thread,
where I took some shortcuts (jamming stuff that should be in the
planner down into the executor) for a proof-of-concept:
/messages/by-id/CAEepm=2GK3FVdnt5V3d+h9njWipCv_fNL=wjxyUhzsF=0PcbNg@mail.gmail.com
With that approach, if it *doesn't* find a match, then you're back to
having to p-lock the whole index page to represent the "gap", so that
you can conflict with anyone who tries to insert a matching value
later. I believe the next-key approach would allow for finer grained
gap-locks (haven't studied that myself), but that's a secondary
problem; the primary problem (it seems to me) is getting rid of index
locks completely in the (common?) case that you have a qualifying
match.
Thomas, thank you for the details!
Have you kept the branch that you used to generate the patch? Which commit
should the patch apply to?
With kindest regards, Rinat Shigapov
вт, 7 февр. 2023 г. в 17:11, Thomas Munro <thomas.munro@gmail.com>:
Show quoted text
On Tue, Feb 7, 2023 at 11:24 PM Rinat Shigapov <rinatshigapov@gmail.com>
wrote:Does the current PostgreSQL release support B+ tree index predicate
locks more granular then page-level locks?
No. I tried to follow some breadcrumbs left by Kevin and Dan that
should allow unique index scans that find a match to skip the btree
page lock, though, and p-lock just the heap tuple. If you like
half-baked experimental code, see the v4-0002 patch in this thread,
where I took some shortcuts (jamming stuff that should be in the
planner down into the executor) for a proof-of-concept:/messages/by-id/CAEepm=2GK3FVdnt5V3d+h9njWipCv_fNL=wjxyUhzsF=0PcbNg@mail.gmail.com
With that approach, if it *doesn't* find a match, then you're back to
having to p-lock the whole index page to represent the "gap", so that
you can conflict with anyone who tries to insert a matching value
later. I believe the next-key approach would allow for finer grained
gap-locks (haven't studied that myself), but that's a secondary
problem; the primary problem (it seems to me) is getting rid of index
locks completely in the (common?) case that you have a qualifying
match.
On Tue, Feb 7, 2023 at 4:01 AM Rinat Shigapov <rinatshigapov@gmail.com> wrote:
Thomas, thank you for the details!
Have you kept the branch that you used to generate the patch? Which commit should the patch apply to?
You can try something like
git checkout 'master@{2018-05-13 13:37:00}'
to get a commit by date from rev-parse.
Best regards, Andrey Borodin.
On Wed, Feb 8, 2023 at 5:25 AM Andrey Borodin <amborodin86@gmail.com> wrote:
On Tue, Feb 7, 2023 at 4:01 AM Rinat Shigapov <rinatshigapov@gmail.com> wrote:
Thomas, thank you for the details!
Have you kept the branch that you used to generate the patch? Which commit should the patch apply to?
You can try something like
git checkout 'master@{2018-05-13 13:37:00}'
to get a commit by date from rev-parse.
I don't have time to work on this currently but if Rinat or others
want to look into it... maybe I should rebase that experiment on top
of current master. Here's the branch:
https://github.com/macdice/postgres/tree/ssi-index-locking-refinements
On Wed, Feb 8, 2023 at 10:44 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Wed, Feb 8, 2023 at 5:25 AM Andrey Borodin <amborodin86@gmail.com> wrote:
On Tue, Feb 7, 2023 at 4:01 AM Rinat Shigapov <rinatshigapov@gmail.com> wrote:
Thomas, thank you for the details!
Have you kept the branch that you used to generate the patch? Which commit should the patch apply to?
You can try something like
git checkout 'master@{2018-05-13 13:37:00}'
to get a commit by date from rev-parse.I don't have time to work on this currently but if Rinat or others
want to look into it... maybe I should rebase that experiment on top
of current master. Here's the branch:https://github.com/macdice/postgres/tree/ssi-index-locking-refinements
Erm, I guess I should also post the rebased patches here, for the
mailing list archives.