Issues with ON CONFLICT UPDATE and REINDEX CONCURRENTLY

Started by Michail Nikolaevalmost 2 years ago99 messageshackers

michail.nikolaev@gmail.com

almost 2 years ago

Hello, hackers.

While testing my work on (1) I was struggling with addressing a strange
issue with ON CONFLICT UPDATE and REINDEX CONCURRENTLY.

After some time, I have realized the same issue persists on the master
branch as well :)

I have prepared two TAP tests to reproduce the issues (2), also in
attachment.

First one, does the next thing:

CREATE UNLOGGED TABLE tbl(i int primary key, updated_at timestamp);
CREATE INDEX idx ON tbl(i, updated_at); -- it is not required to
reproduce but make it to happen faster

Then it runs next scripts with pgbench concurrently:

1) INSERT INTO tbl VALUES(13,now()) on conflict(i) do update set
updated_at = now();
2) INSERT INTO tbl VALUES(42,now()) on conflict(i) do update set
updated_at = now();
3) INSERT INTO tbl VALUES(69,now()) on conflict(i) do update set
updated_at = now();

Also, during pgbench the next command is run in the loop:

REINDEX INDEX CONCURRENTLY tbl_pkey;

For some time, everything looks more-less fine (except live locks, but this
is the issue for the next test).
But after some time, about a minute or so (on ~3000th REINDEX) it just
fails like this:

make -C src/test/modules/test_misc/ check
PROVE_TESTS='t/006_*'

# waiting for an about 3000, now is 2174, seconds passed :
84
# waiting for an about 3000, now is 2175, seconds passed :
84
# waiting for an about 3000, now is 2176, seconds passed :
84
# waiting for an about 3000, now is 2177, seconds passed :
84
# waiting for an about 3000, now is 2178, seconds passed :
84
# waiting for an about 3000, now is 2179, seconds passed :
84
# waiting for an about 3000, now is 2180, seconds passed :
84
# waiting for an about 3000, now is 2181, seconds passed :
84
# waiting for an about 3000, now is 2182, seconds passed :
84
# waiting for an about 3000, now is 2183, seconds passed :
84
# waiting for an about 3000, now is 2184, seconds passed :
84

# Failed test 'concurrent INSERTs, UPDATES and RC status
(got 2 vs expected 0)'
# at t/006_concurrently_unique_fail.pl line 69.

# Failed test 'concurrent INSERTs, UPDATES and RC stderr
/(?^:^$)/'
# at t/006_concurrently_unique_fail.pl line 69.
# 'pgbench: error: pgbench: error: client
4 script 0 aborted in command 1 query 0: ERROR: duplicate key value
violates unique constraint "tbl_pkey_ccnew"
# DETAIL: Key (i)=(13) already exists.
# client 15 script 0 aborted in command 1 query 0: ERROR:
duplicate key value violates unique constraint "tbl_pkey_ccnew"
# DETAIL: Key (i)=(13) already exists.
# pgbench: error: client 9 script 0 aborted in command 1
query 0: ERROR: duplicate key value violates unique constraint
"tbl_pkey_ccnew"
# DETAIL: Key (i)=(13) already exists.
# pgbench: error: client 11 script 0 aborted in command 1
query 0: ERROR: duplicate key value violates unique constraint
"tbl_pkey_ccnew"
# DETAIL: Key (i)=(13) already exists.
# pgbench: error: client 8 script 0 aborted in command 1
query 0: ERROR: duplicate key value violates unique constraint
"tbl_pkey_ccnew"
# DETAIL: Key (i)=(13) already exists.
# pgbench: error: client 3 script 2 aborted in command 1
query 0: ERROR: duplicate key value violates unique constraint
"tbl_pkey_ccnew"
# DETAIL: Key (i)=(69) already exists.
# pgbench: error: client 2 script 2 aborted in command 1
query 0: ERROR: duplicate key value violates unique constraint
"tbl_pkey_ccnew"
# DETAIL: Key (i)=(69) already exists.
# pgbench: error: client 12 script 0 aborted in command 1
query 0: ERROR: duplicate key value violates unique constraint
"tbl_pkey_ccold"
# DETAIL: Key (i)=(13) already exists.
# pgbench: error: client 10 script 0 aborted in command 1
query 0: ERROR: duplicate key value violates unique constraint
"tbl_pkey_ccold"
# DETAIL: Key (i)=(13) already exists.
# pgbench: error: client 18 script 2 aborted in command 1
query 0: ERROR: duplicate key value violates unique constraint
"tbl_pkey_ccnew"
# DETAIL: Key (i)=(69) already exists.
# pgbench: error: pgbench:client 14 script 0 aborted in
command 1 query 0: ERROR: duplicate key value violates unique constraint
"tbl_pkey"
# DETAIL: Key (i)=(13) already exists.
# error: client 1 script 0 aborted in command 1 query 0:
ERROR: duplicate key value violates unique constraint "tbl_pkey"
# DETAIL: Key (i)=(13) already exists.
# pgbench: error: client 0 script 2 aborted in command 1
query 0: ERROR: duplicate key value violates unique constraint "tbl_pkey"
# DETAIL: Key (i)=(69) already exists.
# pgbench: error: client 13 script 1 aborted in command 1
query 0: ERROR: duplicate key value violates unique constraint
"tbl_pkey_ccnew"
# DETAIL: Key (i)=(42) already exists.
# pgbench: error: client 16 script 1 aborted in command 1
query 0: ERROR: duplicate key value violates unique constraint
"tbl_pkey_ccnew"
# DETAIL: Key (i)=(42) already exists.
# pgbench: error: client 5 script 1 aborted in command 1
query 0: ERROR: duplicate key value violates unique constraint
"tbl_pkey_ccnew"
# DETAIL: Key (i)=(42) already exists.
# pgbench: error: Run was aborted; the above results are
incomplete.
# '

Probably something wrong with arbiter index selection for different
backends. I am afraid it could be a symptom of a more serious issue.

-------------------------------------

The second test shows an interesting live lock state in the similar
situation.

CREATE UNLOGGED TABLE tbl(i int primary key, n int);
CREATE INDEX idx ON tbl(i, n);
INSERT INTO tbl VALUES(13,1);

pgbench concurrently runs single command

INSERT INTO tbl VALUES(13,1) on conflict(i) do update set n = tbl.n +
EXCLUDED.n;

And also reindexing in the loop

REINDEX INDEX CONCURRENTLY tbl_pkey;

After the start, a little bit strange issue happens

make -C src/test/modules/test_misc/ check PROVE_TESTS='t/007_*'

# going to start reindex, num tuples in table is 1
# reindex 0 done in 0.00704598426818848 seconds, num inserted
during reindex tuples is 0 speed is 0 per second
# going to start reindex, num tuples in table is 7
# reindex 1 done in 0.453176021575928 seconds, num inserted during
reindex tuples is 632 speed is 1394.60158947115 per second
# going to start reindex, num tuples in table is 647
# current n is 808, 808 per one second
# current n is 808, 0 per one second
# current n is 808, 0 per one second
# current n is 808, 0 per one second
# current n is 808, 0 per one second
# current n is 811, 3 per one second
# current n is 917, 106 per one second
# current n is 1024, 107 per one second
# reindex 2 done in 8.4104950428009 seconds, num inserted during
reindex tuples is 467 speed is 55.5258635340064 per second
# going to start reindex, num tuples in table is 1136
# current n is 1257, 233 per one second
# current n is 1257, 0 per one second
# current n is 1257, 0 per one second
# current n is 1257, 0 per one second
# current n is 1257, 0 per one second
# current n is 1490, 233 per one second
# reindex 3 done in 5.21368479728699 seconds, num inserted during
reindex tuples is 411 speed is 78.8310026363446 per second
# going to start reindex, num tuples in table is 1566

In some moments, all CPUs all hot but 30 connections are unable to do any
upsert. I think it may be somehow caused by two arbiter indexes (old and
new reindexed one).

Best regards,
Mikhail.

[1]: /messages/by-id/CANtu0ogBOtd9ravu1CUbuZWgq6qvn1rny38PGKDPk9zzQPH8_A@mail.gmail.com
/messages/by-id/CANtu0ogBOtd9ravu1CUbuZWgq6qvn1rny38PGKDPk9zzQPH8_A@mail.gmail.com
[2]: https://github.com/michail-nikolaev/postgres/commit/9446f944b415306d9e5d5ab98f69938d8f5ee87f
https://github.com/michail-nikolaev/postgres/commit/9446f944b415306d9e5d5ab98f69938d8f5ee87f

Michael Paquier

michael@paquier.xyz

almost 2 years ago

In reply to: Michail Nikolaev (#1)

Re: Issues with ON CONFLICT UPDATE and REINDEX CONCURRENTLY

On Tue, Jun 11, 2024 at 01:00:00PM +0200, Michail Nikolaev wrote:

Probably something wrong with arbiter index selection for different
backends. I am afraid it could be a symptom of a more serious issue.

ON CONFLICT selects an index that may be rebuilt in parallel of the
REINDEX happening, and its contents may be incomplete. Isn't the
issue that we may select as arbiter indexes stuff that's !indisvalid?
Using the ccnew or ccold indexes would not be correct for the conflict
resolutions.
--
Michael

Michail Nikolaev

michail.nikolaev@gmail.com

almost 2 years ago

In reply to: Michael Paquier (#2)

Re: Issues with ON CONFLICT UPDATE and REINDEX CONCURRENTLY

Hello, Michael.

Isn't the issue that we may select as arbiter indexes stuff that's

!indisvalid?
As far as I can see (1) !indisvalid indexes are filtered out.

But... It looks like this choice is not locked in any way (2), so
index_concurrently_swap or index_concurrently_set_dead can change this
index after the decision is made, even despite WaitForLockersMultiple (3).
In some cases, it may cause a data loss...
But I was unable to reproduce that using some random usleep(), however -
maybe it is a wrong assumption.

[1]: https://github.com/postgres/postgres/blob/915de706d28c433283e9dc63701e8f978488a2b9/src/backend/optimizer/util/plancat.c#L804
https://github.com/postgres/postgres/blob/915de706d28c433283e9dc63701e8f978488a2b9/src/backend/optimizer/util/plancat.c#L804

[2]: https://github.com/postgres/postgres/blob/915de706d28c433283e9dc63701e8f978488a2b9/src/backend/optimizer/util/plancat.c#L924-L928
https://github.com/postgres/postgres/blob/915de706d28c433283e9dc63701e8f978488a2b9/src/backend/optimizer/util/plancat.c#L924-L928
[3]: https://github.com/postgres/postgres/blob/8aee330af55d8a759b2b73f5a771d9d34a7b887f/src/backend/commands/indexcmds.c#L4153
https://github.com/postgres/postgres/blob/8aee330af55d8a759b2b73f5a771d9d34a7b887f/src/backend/commands/indexcmds.c#L4153

Michail Nikolaev

michail.nikolaev@gmail.com

almost 2 years ago

In reply to: Michail Nikolaev (#3)

Re: Issues with ON CONFLICT UPDATE and REINDEX CONCURRENTLY

Hello.

But I was unable to reproduce that using some random usleep(), however -

maybe it is a wrong assumption.
It seems like the assumption is correct - we may use an invalid index as
arbiter due to race condition.

The attached patch adds a check for that case, and now the test fails like
this:

# pgbench: error: client 16 script 1 aborted in command 1 query 0:
ERROR: duplicate key value violates unique constraint "tbl_pkey_ccold"
# DETAIL: Key (i)=(42) already exists.
# pgbench: error: client 9 script 1 aborted in command 1 query 0:
ERROR: ON CONFLICT does not support invalid indexes as arbiters
# pgbench: error: client 0 script 2 aborted in command 1 query 0:
ERROR: duplicate key value violates unique constraint "tbl_pkey"
# DETAIL: Key (i)=(69) already exists.
# pgbench: error: client 7 script 0 aborted in command 1 query 0:
ERROR: ON CONFLICT does not support invalid indexes as arbiters
# pgbench: error: client 10 script 0 aborted in command 1 query 0:
ERROR: ON CONFLICT does not support invalid indexes as arbiters
# pgbench: error: client 11 script 0 aborted in command 1 query 0:
ERROR: ON CONFLICT does not support invalid indexes as arbiters

I think It is even possible to see !alive index in the same situation (it
is worse case), but I was unable to reproduce it so far.

Best regards,
Mikhail.

Michail Nikolaev

michail.nikolaev@gmail.com

almost 2 years ago

In reply to: Michail Nikolaev (#4)

Re: Issues with ON CONFLICT UPDATE and REINDEX CONCURRENTLY

Hello, everyone.

I think It is even possible to see !alive index in the same situation (it

is worse case), but I was unable to reproduce it so far.
Fortunately, it is not possible.

So, seems like I have found the source of the problem:

1) infer_arbiter_indexes calls RelationGetIndexList to get the list of
candidates.
It does no lock selected indexes in any additional way which
prevents index_concurrently_swap changing them (set and clear validity).

RelationGetIndexList relcache.c:4857
infer_arbiter_indexes plancat.c:780
make_modifytable createplan.c:7097 ----------
node->arbiterIndexes = infer_arbiter_indexes(root);
create_modifytable_plan createplan.c:2826
create_plan_recurse createplan.c:532
create_plan createplan.c:349
standard_planner planner.c:421
planner planner.c:282
pg_plan_query postgres.c:904
pg_plan_queries postgres.c:996
exec_simple_query postgres.c:1193

2) other backend marks some index as invalid and commits

index_concurrently_swap index.c:1600
ReindexRelationConcurrently indexcmds.c:4115
ReindexIndex indexcmds.c:2814
ExecReindex indexcmds.c:2743
ProcessUtilitySlow utility.c:1567
standard_ProcessUtility utility.c:1067
ProcessUtility utility.c:523
PortalRunUtility pquery.c:1158
PortalRunMulti pquery.c:1315
PortalRun pquery.c:791
exec_simple_query postgres.c:1274

3) first backend invalidates catalog snapshot because transactional snapshot

InvalidateCatalogSnapshot snapmgr.c:426
GetTransactionSnapshot snapmgr.c:278
PortalRunMulti pquery.c:1244
PortalRun pquery.c:791
exec_simple_query postgres.c:1274

4) first backend copies indexes selected using previous catalog snapshot

ExecInitModifyTable nodeModifyTable.c:4499 --------
resultRelInfo->ri_onConflictArbiterIndexes = node->arbiterIndexes;
ExecInitNode execProcnode.c:177
InitPlan execMain.c:966
standard_ExecutorStart execMain.c:261
ExecutorStart execMain.c:137
ProcessQuery pquery.c:155
PortalRunMulti pquery.c:1277
PortalRun pquery.c:791
exec_simple_query postgres.c:1274

5) then reads indexes using new fresh snapshot

RelationGetIndexList relcache.c:4816
ExecOpenIndices execIndexing.c:175
ExecInsert nodeModifyTable.c:792 -------------
ExecOpenIndices(resultRelInfo, onconflict != ONCONFLICT_NONE);
ExecModifyTable nodeModifyTable.c:4059
ExecProcNodeFirst execProcnode.c:464
ExecProcNode executor.h:274
ExecutePlan execMain.c:1646
standard_ExecutorRun execMain.c:363
ExecutorRun execMain.c:304
ProcessQuery pquery.c:160
PortalRunMulti pquery.c:1277
PortalRun pquery.c:791
exec_simple_query postgres.c:1274

5) and uses arbiter selected with stale snapshot with new index view
(marked as invalid)

ExecInsert nodeModifyTable.c:1016 -------------- arbiterIndexes
= resultRelInfo->ri_onConflictArbiterIndexes;
............

ExecInsert nodeModifyTable.c:1048 ---------------if
(!ExecCheckIndexConstraints(resultRelInfo, slot, estate, conflictTid,
arbiterIndexes))
ExecModifyTable nodeModifyTable.c:4059
ExecProcNodeFirst execProcnode.c:464
ExecProcNode executor.h:274
ExecutePlan execMain.c:1646
standard_ExecutorRun execMain.c:363
ExecutorRun execMain.c:304
ProcessQuery pquery.c:160
PortalRunMulti pquery.c:1277
PortalRun pquery.c:791
exec_simple_query postgres.c:1274

I have attached an updated test for the issue (it fails on assert quickly
and uses only 2 backends).
The same issue may happen in case of CREATE/DROP INDEX CONCURRENTLY as well.

The simplest possible fix is to use ShareLock
instead ShareUpdateExclusiveLock in the index_concurrently_swap

oldClassRel = relation_open(oldIndexId, ShareLock);
newClassRel = relation_open(newIndexId, ShareLock);

But this is not a "concurrent" way. But such update should be fast enough
as far as I understand.

Best regards,
Mikhail.

Michael Paquier

michael@paquier.xyz

almost 2 years ago

In reply to: Michail Nikolaev (#5)

Re: Issues with ON CONFLICT UPDATE and REINDEX CONCURRENTLY

On Mon, Jun 17, 2024 at 07:00:51PM +0200, Michail Nikolaev wrote:

The simplest possible fix is to use ShareLock
instead ShareUpdateExclusiveLock in the index_concurrently_swap

oldClassRel = relation_open(oldIndexId, ShareLock);
newClassRel = relation_open(newIndexId, ShareLock);

But this is not a "concurrent" way. But such update should be fast enough
as far as I understand.

Nope, that won't fly far. We should not use a ShareLock in this step
or we are going to conflict with row exclusive locks, impacting all
workloads when doing a REINDEX CONCURRENTLY.

That may be a long shot, but the issue is that we do the swap of all
the indexes in a single transaction, but do not wait for them to
complete when committing the swap's transaction in phase 4.  Your
report is telling us that we really have a good reason to wait for all
the transactions that may use these indexes to finish.  One thing
coming on top of my mind to keep things concurrent-safe while allowing
a clean use of the arbiter indexes would be to stick a
WaitForLockersMultiple() on AccessExclusiveLock just *before* the
transaction commit of phase 4, say, lacking the progress report part:
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -4131,6 +4131,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		CommandCounterIncrement();
 	}

+	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
+
 	/* Commit this transaction and make index swaps visible */
 	CommitTransactionCommand();
 	StartTransactionCommand();

This is a non-fresh Friday-afternoon idea, but it would make sure that
we don't have any transactions using the indexes switched to _ccold
with indisvalid that are waiting for a drop in phase 5. Your tests
seem to pass with that, and that keeps the operation intact
concurrent-wise (I'm really wishing for isolation tests with injection
points just now, because I could use them here).

+		Assert(indexRelation->rd_index->indislive);
+		Assert(indexRelation->rd_index->indisvalid);
+
if (!indexRelation->rd_index->indimmediate)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),

This kind of validation check may be a good idea in the long term.
That seems incredibly useful to me if we were to add more code paths
that do concurrent index rebuilds, to make sure that we don't rely on
an index we should not use at all. That's a HEAD-only thing IMO,
though.
--
Michael

Michail Nikolaev

michail.nikolaev@gmail.com

almost 2 years ago

In reply to: Michael Paquier (#6)

Re: Issues with ON CONFLICT UPDATE and REINDEX CONCURRENTLY

Hello, Michael!

This is a non-fresh Friday-afternoon idea, but it would make sure that
we don't have any transactions using the indexes switched to _ccold
with indisvalid that are waiting for a drop in phase 5. Your tests
seem to pass with that, and that keeps the operation intact
concurrent-wise (I'm really wishing for isolation tests with injection
points just now, because I could use them here).

Yes, I also have tried that approach, but it doesn't work, unfortunately.
You may fail test increasing number of connections:

'--no-vacuum --client=10 -j 2 --transactions=1000',

The source of the issue is not the swap of the indexes (and not related to
REINDEX CONCURRENTLY only), but the fact that indexes are fetched once
during planning (to find the arbiter), but then later reread with a new
catalog snapshot for the the actual execution.

So, other possible fixes I see:
* fallback to replanning in case we see something changed during the
execution
* select arbiter indexes during actual execution

That's a HEAD-only thing IMO,
though.

Do you mean that it needs to be moved to a separate patch?

Best regards,
Mikhail.

Michael Paquier

michael@paquier.xyz

almost 2 years ago

In reply to: Michail Nikolaev (#5)

Re: Issues with ON CONFLICT UPDATE and REINDEX CONCURRENTLY

On Mon, Jun 17, 2024 at 07:00:51PM +0200, Michail Nikolaev wrote:

The same issue may happen in case of CREATE/DROP INDEX CONCURRENTLY as well.

While looking at all that, I've been also curious about this specific
point, and it is indeed possible to finish in a state where a
duplicate key would be found in one of indexes selected by the
executor during an INSERT ON CONFLICT while a concurrent set of CICs
and DICs are run, so you don't really need a REINDEX. See for example
the attached test.
--
Michael

Michael Paquier

michael@paquier.xyz

almost 2 years ago

In reply to: Michail Nikolaev (#7)

Re: Issues with ON CONFLICT UPDATE and REINDEX CONCURRENTLY

On Fri, Jun 21, 2024 at 11:31:21AM +0200, Michail Nikolaev wrote:

Yes, I also have tried that approach, but it doesn't work, unfortunately.
You may fail test increasing number of connections:

'--no-vacuum --client=10 -j 2 --transactions=1000',

The source of the issue is not the swap of the indexes (and not related to
REINDEX CONCURRENTLY only), but the fact that indexes are fetched once
during planning (to find the arbiter), but then later reread with a new
catalog snapshot for the the actual execution.

When I first saw this report, my main worry was that I have somewhat
managed to break the state of the indexes leading to data corruption
because of an incorrect step in the concurrent operations. However,
as far as I can see this is not the case, as an effect of two
properties we rely on for concurrent index operations, that hold in
the executor and the planner. Simply put:
- The planner ignores indexes with !indisvalid.
- The executor ignores indexes with !indislive.

The critical point is that we wait in DROP INDEX CONC and REINDEX CONC
for any transactions still using an index that's waiting to be marked
as !indislive, because such indexes *must* not be used in the
executor.

So, other possible fixes I see:
* fallback to replanning in case we see something changed during the
execution
* select arbiter indexes during actual execution

These two properties make ON CONFLICT react the way it should
depending on the state of the indexes selected by the planner based on
the query clauses, with changes reflecting when executing, with two
patterns involved:
- An index may be created in a concurrent operation after the planner
has selected the arbiter indexes (the index may be defined, still not
valid yet, or just created after), then the query execution would need
to handle the extra index created available at execution, with a
failure on a ccnew index.
- An index may be selected at planning phase, then a different index
could be used by a constraint once both indexes swap, with a failure
on a ccold index.

As far as I can see, it depends on what kind of query semantics and
the amount of transparency you are looking for here in your
application. An error in the query itself can also be defined as
useful so as your application is aware of what happens as an effect of
the concurrent index build (reindex or CIC/DIC), and it is not really
clear to me why silently falling back to a re-selection of the arbiter
indexes would be always better. Replanning could be actually
dangerous if a workload is heavily concurrently REINDEX'd, as we could
fall into a trap where a query can never decide which index to use.
I'm not saying that it cannot be improved, but it's not completely
clear to me what query semantics are the best for all users because
the behavior of HEAD and your suggestions have merits and demerits.
Anything we could come up with would be an improvement anyway, AFAIU.

That's a HEAD-only thing IMO,
though.

Do you mean that it needs to be moved to a separate patch?

It should, but I'm wondering if that's necessary for two reasons.

First, a check on indisvalid would be incorrect, because indexes
marked as !indisvalid && indislive mean that there is a concurrent
operation happening, and that this concurrent operation is waiting for
all transactions working with a lock on this index to finish before
flipping the live flag and make this index invalid for decisions taken
in the executor, like HOT updates, etc.

A check on indislive may be an idea, still I'm slightly biased
regarding its additional value because any indexes opened for a
relation are fetched from the relcache with RelationGetIndexList()
explaining why indislive indexes cannot be fetched, and we rely on
that in the executor for the indexes opened by a relation.
--
Michael

#10

Michail Nikolaev

michail.nikolaev@gmail.com

almost 2 years ago

In reply to: Michael Paquier (#9)

Re: Issues with ON CONFLICT UPDATE and REINDEX CONCURRENTLY

Hello, Michael!

As far as I can see, it depends on what kind of query semantics and
the amount of transparency you are looking for here in your
application. An error in the query itself can also be defined as
useful so as your application is aware of what happens as an effect of
the concurrent index build (reindex or CIC/DIC), and it is not really
clear to me why silently falling back to a re-selection of the arbiter
indexes would be always better.

From my point of view, INSERT ON CONFLICT UPDATE should never fail with
"ERROR: duplicate key value violates unique constraint" because the main
idea of upsert is to avoid such situations.
So, it is expected by majority and, probably, is even documented.

On the other side, REINDEX CONCURRENTLY should not cause any queries to
fail accidentally without any clear reason.

Also, as you can see from the topic starter letter, we could see errors
like this:

* ERROR: duplicate key value violates unique constraint "tbl_pkey"
* ERROR: duplicate key value violates unique constraint "tbl_pkey_ccnew"
* ERROR: duplicate key value violates unique constraint "tbl_pkey_ccold"

So, the first error message does not provide any clue for the developer to
understand what happened.

- The planner ignores indexes with !indisvalid.
- The executor ignores indexes with !indislive.

Yes, and it feels like we need one more flag here to distinguish
!indisvalid indexes which are going to become valid and which are going to
become !indislive.

For example, let name it as indiscorrect (it means it contains all the
data). In such case, we may use the following logic:

1) !indisvalid && !indiscorrect - index in validation phase probably, do
not use it as arbiter because it does not contain all the data yet
2) !indisvalid && indiscorrect - index will be dropped most likely. Do not
plan new queries with it, but it still may be used by other queries
(including upserts). So, we still need to include it to the arbiters.

And, during the reindex concurrently:

1) begin; mark new index as indisvalid and indiscorrect; mark old one as
!indisvalid but still indiscorrect. invalidate relcache; commit;

Currently, some queries are still using the old one as arbiter, some
queries use both.

2) WaitForLockersMultiple

Now all queries use both indexes as arbiter.

3) begin; mark old index as !indiscorrect, additionally to !indisvalid;
invalidate cache; commit;

Now, some queries use only the new index, both some still use both.

4) WaitForLockersMultiple;

Now, all queries use only the new index - we are safe to mark the old
one it as !indislive.

It should, but I'm wondering if that's necessary for two reasons.

In that case, it becomes:

Assert(indexRelation->rd_index->indiscorrect);
Assert(indexRelation->rd_index->indislive);

and it is always the valid check.

Best regards,
Mikhail.

#11

Michail Nikolaev

michail.nikolaev@gmail.com

almost 2 years ago

In reply to: Michail Nikolaev (#10)

Re: Issues with ON CONFLICT UPDATE and REINDEX CONCURRENTLY

Hello, Noah!

Answering
/messages/by-id/20240612194857.1c.nmisch@google.com

On your other thread, it would be useful to see stack traces from the

high-CPU

processes once the live lock has ended all query completion.

I was wrong, it is not a livelock, it is a deadlock, actually. I missed it
because pgbench retries deadlocks automatically.

It looks like this:

2024-06-25 17:16:17.447 CEST [711743] 007_concurrently_unique_stuck.pl
ERROR: deadlock detected
2024-06-25 17:16:17.447 CEST [711743] 007_concurrently_unique_stuck.pl
DETAIL: Process 711743 waits for ShareLock on transaction 3633; blocked by
process 711749.
Process 711749 waits for ShareLock on speculative token 2 of transaction
3622; blocked by process 711743.
Process 711743: INSERT INTO tbl VALUES(13,89318) on conflict(i) do update
set n = tbl.n + 1 RETURNING n
Process 711749: INSERT INTO tbl VALUES(13,41011) on conflict(i) do update
set n = tbl.n + 1 RETURNING n
2024-06-25 17:16:17.447 CEST [711743] 007_concurrently_unique_stuck.pl
HINT: See server log for query details.
2024-06-25 17:16:17.447 CEST [711743] 007_concurrently_unique_stuck.pl
CONTEXT: while inserting index tuple (15,145) in relation "tbl_pkey_ccnew"
2024-06-25 17:16:17.447 CEST [711743] 007_concurrently_unique_stuck.pl
STATEMENT: INSERT INTO tbl VALUES(13,89318) on conflict(i) do update set n
= tbl.n + 1 RETURNING n

Stacktraces:

-------------------------

INSERT INTO tbl VALUES(13,41011) on conflict(i) do update set n = tbl.n + 1
RETURNING n

#0 in epoll_wait (epfd=5, events=0x1203328, maxevents=1, timeout=-1) at
../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 in WaitEventSetWaitBlock (set=0x12032c0, cur_timeout=-1,
occurred_events=0x7ffcc4e38e30, nevents=1) at latch.c:1570
#2 in WaitEventSetWait (set=0x12032c0, timeout=-1,
occurred_events=0x7ffcc4e38e30, nevents=1, wait_event_info=50331655) at
latch.c:1516
#3 in WaitLatch (latch=0x7acb2a2f5f14, wakeEvents=33, timeout=0,
wait_event_info=50331655) at latch.c:538
#4 in ProcSleep (locallock=0x122f778, lockMethodTable=0x1037340
<default_lockmethod>, dontWait=false) at proc.c:1355
#5 in WaitOnLock (locallock=0x122f778, owner=0x1247408, dontWait=false) at
lock.c:1833
#6 in LockAcquireExtended (locktag=0x7ffcc4e39220, lockmode=5,
sessionLock=false, dontWait=false, reportMemoryError=true, locallockp=0x0)
at lock.c:1046
#7 in LockAcquire (locktag=0x7ffcc4e39220, lockmode=5, sessionLock=false,
dontWait=false) at lock.c:739
#8 in SpeculativeInsertionWait (xid=3622, token=2) at lmgr.c:833
#9 in _bt_doinsert (rel=0x7acb2dbb12e8, itup=0x12f1308,
checkUnique=UNIQUE_CHECK_YES, indexUnchanged=true, heapRel=0x7acb2dbb0f08)
at nbtinsert.c:225
#10 in btinsert (rel=0x7acb2dbb12e8, values=0x7ffcc4e39440,
isnull=0x7ffcc4e39420, ht_ctid=0x12ebe20, heapRel=0x7acb2dbb0f08,
checkUnique=UNIQUE_CHECK_YES, indexUnchanged=true, indexInfo=0x12f08a8) at
nbtree.c:195
#11 in index_insert (indexRelation=0x7acb2dbb12e8, values=0x7ffcc4e39440,
isnull=0x7ffcc4e39420, heap_t_ctid=0x12ebe20, heapRelation=0x7acb2dbb0f08,
checkUnique=UNIQUE_CHECK_YES, indexUnchanged=true, indexInfo=0x12f08a8) at
indexam.c:230
#12 in ExecInsertIndexTuples (resultRelInfo=0x12eaa00, slot=0x12ebdf0,
estate=0x12ea560, update=true, noDupErr=false, specConflict=0x0,
arbiterIndexes=0x0, onlySummarizing=false) at execIndexing.c:438
#13 in ExecUpdateEpilogue (context=0x7ffcc4e39870,
updateCxt=0x7ffcc4e3962c, resultRelInfo=0x12eaa00, tupleid=0x7ffcc4e39732,
oldtuple=0x0, slot=0x12ebdf0) at nodeModifyTable.c:2130
#14 in ExecUpdate (context=0x7ffcc4e39870, resultRelInfo=0x12eaa00,
tupleid=0x7ffcc4e39732, oldtuple=0x0, slot=0x12ebdf0, canSetTag=true) at
nodeModifyTable.c:2478
#15 in ExecOnConflictUpdate (context=0x7ffcc4e39870,
resultRelInfo=0x12eaa00, conflictTid=0x7ffcc4e39732,
excludedSlot=0x12f05b8, canSetTag=true, returning=0x7ffcc4e39738) at
nodeModifyTable.c:2694
#16 in ExecInsert (context=0x7ffcc4e39870, resultRelInfo=0x12eaa00,
slot=0x12f05b8, canSetTag=true, inserted_tuple=0x0, insert_destrel=0x0) at
nodeModifyTable.c:1048
#17 in ExecModifyTable (pstate=0x12ea7f0) at nodeModifyTable.c:4059
#18 in ExecProcNodeFirst (node=0x12ea7f0) at execProcnode.c:464
#19 in ExecProcNode (node=0x12ea7f0) at
../../../src/include/executor/executor.h:274
#20 in ExecutePlan (estate=0x12ea560, planstate=0x12ea7f0,
use_parallel_mode=false, operation=CMD_INSERT, sendTuples=true,
numberTuples=0, direction=ForwardScanDirection, dest=0x12daac8,
execute_once=true) at execMain.c:1646
#21 in standard_ExecutorRun (queryDesc=0x12dab58,
direction=ForwardScanDirection, count=0, execute_once=true) at
execMain.c:363
#22 in ExecutorRun (queryDesc=0x12dab58, direction=ForwardScanDirection,
count=0, execute_once=true) at execMain.c:304
#23 in ProcessQuery (plan=0x12e1360, sourceText=0x12083b0 "INSERT INTO tbl
VALUES(13,41011) on conflict(i) do update set n = tbl.n + 1 RETURNING n ",
params=0x0, queryEnv=0x0, dest=0x12daac8, qc=0x7ffcc4e39ae0) at pquery.c:160
#24 in PortalRunMulti (portal=0x1289c90, isTopLevel=true,
setHoldSnapshot=true, dest=0x12daac8, altdest=0x10382a0 <donothingDR>,
qc=0x7ffcc4e39ae0) at pquery.c:1277
#25 in FillPortalStore (portal=0x1289c90, isTopLevel=true) at pquery.c:1026
#26 in PortalRun (portal=0x1289c90, count=9223372036854775807,
isTopLevel=true, run_once=true, dest=0x12e14c0, altdest=0x12e14c0,
qc=0x7ffcc4e39d30) at pquery.c:763
#27 in exec_simple_query (query_string=0x12083b0 "INSERT INTO tbl
VALUES(13,41011) on conflict(i) do update set n = tbl.n + 1 RETURNING n ")
at postgres.c:1274

-------------------------

INSERT INTO tbl VALUES(13,89318) on conflict(i) do update set n = tbl.n + 1
RETURNING n

#0 in epoll_wait (epfd=5, events=0x1203328, maxevents=1, timeout=-1) at
../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 in WaitEventSetWaitBlock (set=0x12032c0, cur_timeout=-1,
occurred_events=0x7ffcc4e38f60, nevents=1) at latch.c:1570
#2 in WaitEventSetWait (set=0x12032c0, timeout=-1,
occurred_events=0x7ffcc4e38f60, nevents=1, wait_event_info=50331653) at
latch.c:1516
#3 in WaitLatch (latch=0x7acb2a2f4dbc, wakeEvents=33, timeout=0,
wait_event_info=50331653) at latch.c:538
#4 in ProcSleep (locallock=0x122f670, lockMethodTable=0x1037340
<default_lockmethod>, dontWait=false) at proc.c:1355
#5 in WaitOnLock (locallock=0x122f670, owner=0x1247408, dontWait=false) at
lock.c:1833
#6 in LockAcquireExtended (locktag=0x7ffcc4e39370, lockmode=5,
sessionLock=false, dontWait=false, reportMemoryError=true, locallockp=0x0)
at lock.c:1046
#7 in LockAcquire (locktag=0x7ffcc4e39370, lockmode=5, sessionLock=false,
dontWait=false) at lock.c:739
#8 in XactLockTableWait (xid=3633, rel=0x7acb2dba66d8, ctid=0x1240a68,
oper=XLTW_InsertIndex) at lmgr.c:701
#9 in _bt_doinsert (rel=0x7acb2dba66d8, itup=0x1240a68,
checkUnique=UNIQUE_CHECK_YES, indexUnchanged=false, heapRel=0x7acb2dbb0f08)
at nbtinsert.c:227
#10 in btinsert (rel=0x7acb2dba66d8, values=0x7ffcc4e395c0,
isnull=0x7ffcc4e395a0, ht_ctid=0x12400e8, heapRel=0x7acb2dbb0f08,
checkUnique=UNIQUE_CHECK_YES, indexUnchanged=false, indexInfo=0x1240500) at
nbtree.c:195
#11 in index_insert (indexRelation=0x7acb2dba66d8, values=0x7ffcc4e395c0,
isnull=0x7ffcc4e395a0, heap_t_ctid=0x12400e8, heapRelation=0x7acb2dbb0f08,
checkUnique=UNIQUE_CHECK_YES, indexUnchanged=false, indexInfo=0x1240500) at
indexam.c:230
#12 in ExecInsertIndexTuples (resultRelInfo=0x12eaa00, slot=0x12400b8,
estate=0x12ea560, update=false, noDupErr=true, specConflict=0x7ffcc4e39722,
arbiterIndexes=0x12e0998, onlySummarizing=false) at execIndexing.c:438
#13 in ExecInsert (context=0x7ffcc4e39870, resultRelInfo=0x12eaa00,
slot=0x12400b8, canSetTag=true, inserted_tuple=0x0, insert_destrel=0x0) at
nodeModifyTable.c:1095
#14 in ExecModifyTable (pstate=0x12ea7f0) at nodeModifyTable.c:4059
#15 in ExecProcNodeFirst (node=0x12ea7f0) at execProcnode.c:464
#16 in ExecProcNode (node=0x12ea7f0) at
../../../src/include/executor/executor.h:274
#17 in ExecutePlan (estate=0x12ea560, planstate=0x12ea7f0,
use_parallel_mode=false, operation=CMD_INSERT, sendTuples=true,
numberTuples=0, direction=ForwardScanDirection, dest=0x12daac8,
execute_once=true) at execMain.c:1646
#18 in standard_ExecutorRun (queryDesc=0x12dab58,
direction=ForwardScanDirection, count=0, execute_once=true) at
execMain.c:363
#19 in ExecutorRun (queryDesc=0x12dab58, direction=ForwardScanDirection,
count=0, execute_once=true) at execMain.c:304
#20 in ProcessQuery (plan=0x12e1360, sourceText=0x12083b0 "INSERT INTO tbl
VALUES(13,89318) on conflict(i) do update set n = tbl.n + 1 RETURNING n ",
params=0x0, queryEnv=0x0, dest=0x12daac8, qc=0x7ffcc4e39ae0) at pquery.c:160
#21 in PortalRunMulti (portal=0x1289c90, isTopLevel=true,
setHoldSnapshot=true, dest=0x12daac8, altdest=0x10382a0 <donothingDR>,
qc=0x7ffcc4e39ae0) at pquery.c:1277
#22 in FillPortalStore (portal=0x1289c90, isTopLevel=true) at pquery.c:1026
#23 in PortalRun (portal=0x1289c90, count=9223372036854775807,
isTopLevel=true, run_once=true, dest=0x12e14c0, altdest=0x12e14c0,
qc=0x7ffcc4e39d30) at pquery.c:763
#24 in exec_simple_query (query_string=0x12083b0 "INSERT INTO tbl
VALUES(13,89318) on conflict(i) do update set n = tbl.n + 1 RETURNING n ")
at postgres.c:1274

-------------------------

Also, at that time (but not reported in deadlock) reindex is happening.
Without reindex I am unable to reproduce deadlock.

It looks like a deadlock caused by different set of indexes being used as
arbiter indexes (or by the different order).

Best regards,
Mikhail.

#12

Michail Nikolaev

michail.nikolaev@gmail.com

over 1 year ago

In reply to: Michail Nikolaev (#11)

Re: Issues with ON CONFLICT UPDATE and REINDEX CONCURRENTLY

Hell, everyone!

Using the brand-new injection points support in specs, I created a spec to
reproduce the issue.

It fails like this currently:

make -C src/test/modules/injection_points/ check

@@ -64,6 +64,7 @@

step s3_s1: <... completed>
step s2_s1: <... completed>
+ERROR: duplicate key value violates unique constraint "tbl_pkey_ccold"

starting permutation: s3_s1 s2_s1 s4_s1 s1_s1 s4_s2 s4_s3
injection_points_attach
@@ -129,3 +130,4 @@

step s3_s1: <... completed>
step s2_s1: <... completed>
+ERROR: duplicate key value violates unique constraint "tbl_pkey"

Best regards,
Mikhail.

#13

Michail Nikolaev

michail.nikolaev@gmail.com

over 1 year ago

In reply to: Michail Nikolaev (#12)

Re: Issues with ON CONFLICT UPDATE and REINDEX CONCURRENTLY

Hello, everyone.

I have updated the spec to reproduce the issue, now it includes cases with
both CREATE INDEX and REINDEX.

To run:
make -C src/test/modules/injection_points/ check

Issue reproduced on empty index, but it may happen on index of any with the
same probability.
It is not critical, of course, but in production system indexes are
regularly rebuilt using REINDEX CONCURRENTLY as recommended in
documentation [1]https://www.postgresql.org/docs/current/routine-reindex.html.
In most of the cases it is done using pg_repack as far as I know.

So, in these production systems, there is no guarantee what INSERT ON
CONFLICT DO NOTHING/UPDATE will not fail with a "duplicate key value
violates unique constraint" error.

Best regards,
Mikhail.

[1]: https://www.postgresql.org/docs/current/routine-reindex.html

Show quoted text

#14

Michail Nikolaev

michail.nikolaev@gmail.com

over 1 year ago

In reply to: Michail Nikolaev (#13)

Re: Issues with ON CONFLICT UPDATE and REINDEX CONCURRENTLY

Hello, everyone!

This patch set addresses the issues discussed in this thread.

The main idea behind this fix is that it is safe to consider indisready
indexes alongside indisvalid indexes as arbiter indexes. However, it's
crucial that at least one fully valid index is present.

Why is it necessary to consider indisready during the planning phase?

The reason is that these indexes are required for correct processing during
the execution phase.
If "ready" indexes are skipped as arbiters by one transaction, they may
already have become "valid" for another concurrent transaction during its
planning phase.
As a result, both transactions could concurrently process the UPSERT
command with different sets of arbiters (while using the same set of
indexes for tuple insertion later).
This can lead to unexpected "duplicate key value violates unique
constraint" errors and deadlocks.

Is it safe to use a "ready" but not yet "valid" index as an arbiter?
Yes, as long as at least one "valid" index is also used as an arbiter.
The valid index ensures the correctness of the UPSERT logic, while the
"ready" index contains an equal or lesser number of tuples, making it safe
for speculative insertion.
In any case, the insert to that index will be processed during
ExecInsertIndexTuples one way or another (with applyNoDupErr or without).

Fix is divided into a few patches, each following this logic:

1) The first patch provides specs (and injection points) for the various
scenarios related to the issue.
2) The second patch introduces a straightforward change—adding indisready
indexes to arbiters alongside indisvalid. However, at least one indisvalid
is still required. This resolves simple cases involving REINDEX
CONCURRENTLY and CREATE INDEX CONCURRENTLY.
3) The third patch deals with named constraints. Instead of relying solely
on the index with the specified name, we attempt to find other indexes that
are equivalent in terms of being used as an arbiter.
4) This patch fixes a scenario involving partitioned tables. Special checks
are required for partitioned indexes, which may be processed by REINDEX
CONCURRENTLY.

Additionally, a patch with three extra TAP specifications for stress
testing is attached. This patch is not intended for commitment, so I
renamed the extension to prevent accidental application in some CI/DI jobs.

Also, it is possible to look at the patches on GitHub:
https://github.com/postgres/postgres/compare/master...michail-nikolaev:postgres:reindex_concurrently_with_upsert

Best regards,
Mikhail.

#15

Michail Nikolaev

michail.nikolaev@gmail.com

over 1 year ago

In reply to: Michail Nikolaev (#14)

Re: Issues with ON CONFLICT UPDATE and REINDEX CONCURRENTLY

Hello, everyone.

Rebased on master.

Show quoted text

#16

Michail Nikolaev

michail.nikolaev@gmail.com

over 1 year ago

In reply to: Michail Nikolaev (#15)

Re: Issues with ON CONFLICT UPDATE and REINDEX CONCURRENTLY

Hello, everyone!

I've improved the test stability. The revised version should provide
consistent results in all test runs.

Best regards,
Mikhail.

Show quoted text

#17

Michail Nikolaev

michail.nikolaev@gmail.com

over 1 year ago

In reply to: Michail Nikolaev (#16)

Re: Issues with ON CONFLICT UPDATE and REINDEX CONCURRENTLY

Hello, everyone!

I have noticed tests are still flapping a little bit on FreeBSD.
Now I have added some markers to isolation specs to avoid possible race
conditions.

Best regards,
Mikhail.

Show quoted text

#18

Michail Nikolaev

michail.nikolaev@gmail.com

over 1 year ago

In reply to: Michail Nikolaev (#17)

Re: Issues with ON CONFLICT UPDATE and REINDEX CONCURRENTLY

Hello, everyone!

This is just small test refactoring after the last stabilization.

Best regards,
Mikhail.

Show quoted text

Issues with ON CONFLICT UPDATE and REINDEX CONCURRENTLY

Attachments:

Attachments:

Attachments:

Attachments:

Attachments:

Attachments:

Attachments:

Attachments:

Attachments:

Attachments:

Attachments:

Attachments:

Attachments: