Unexpected result count from update statement on partitioned table
Hello,
Our application uses a queue-like table to assign tasks to users and this
has worked well for us for a few years. Now we are in the process of
adding some restrictions to which tasks a user can work on and that is
based on an attribute of each task that does not change for the task's
lifespan. Users may have access to work on one or more or types of tasks.
To improve query time when finding the set of tasks that we assign, we are
introducing partitioning into our task queue table. When assigning tasks,
we issue an update statement to mark the tasks as reserved using a subquery
that orders the tasks by age. With the introduction of partitioning, we
are seeing that the update statement affects more rows than expected. An
example query is:
---
update task_parent
set reserved = true
from (
select id
from task_parent
where reserved = false
and task_type = 1 or task_type = 2
order by task_timestamp
limit 50
for update skip locked) as sub
where sub.id = task_parent.id
returning task_parent.id
---
In the statement above, we have a subquery to limit the number of tasks to
50 yet the update statement sometimes returns more than 50 records. I have
narrowed this down to a small, reproducible example shown below. The first
time I run the update statement I get ~65 records, then typically ~53 the
next few runs, and then it starts consistently giving me 50 records after
that. Then if I bump the limit to 100 I will get more than 100 initially
and after several executions it starts to settle into always giving the
expected 100.
Below is the full setup that can be used to reproduce what I'm seeing. It
was initially observed on PostgreSQL 11.8 but I can also reproduce it on
13.0.
---
create table task_parent (
id bigint not null,
task_type smallint not null,
reserved boolean not null,
task_timestamp timestamp not null
) partition by list (task_type);
create table task_child_1
partition of task_parent for values in (1);
create table task_child_2
partition of task_parent for values in (2);
insert into task_parent
select
generate_series(1, 500000),
case when random() < 0.5 then 1 else 2 end,
false,
now() - (random() * '1 day'::interval);
create index task_parent_task_time_idx
on task_parent (task_timestamp);
update task_parent
set reserved = true
from (
select id
from task_parent
where reserved = false
and task_type = 1 or task_type = 2
order by task_timestamp
limit 50
for update skip locked) as sub
where sub.id = task_parent.id
returning task_parent.id;
---
A couple of interesting observations:
1) If I remove the order by clause I always get the expected number of
results
2) If I rewrite the query to use a CTE for the task IDs instead of a
subquery then I always get the expected number of results
At its surface, this seems like it could be a bug but maybe there is
something about this usage pattern that is known/expected to cause this
behavior. So that's the question - is this a bug that should be reported
to pgsql-bugs, or is this expected and if so, why?
Craig
On Thu, 2020-12-17 at 12:21 -0500, Craig McIlwee wrote:
Our application uses a queue-like table to assign tasks to users and this has worked well for us for a few years. Now we are in the process of adding some restrictions to which tasks a user can
work on and that is based on an attribute of each task that does not change for the task's lifespan. Users may have access to work on one or more or types of tasks. To improve query time when
finding the set of tasks that we assign, we are introducing partitioning into our task queue table. When assigning tasks, we issue an update statement to mark the tasks as reserved using a subquery
that orders the tasks by age. With the introduction of partitioning, we are seeing that the update statement affects more rows than expected. An example query is:--- update task_parent set reserved = true from ( select id from task_parent where reserved = false and task_type = 1 or task_type = 2 order by task_timestamp limit 50 for update skip locked) as sub where sub.id = task_parent.id returning task_parent.id ---In the statement above, we have a subquery to limit the number of tasks to 50 yet the update statement sometimes returns more than 50 records. I have narrowed this down to a small, reproducible
example shown below. The first time I run the update statement I get ~65 records, then typically ~53 the next few runs, and then it starts consistently giving me 50 records after that. Then if I
bump the limit to 100 I will get more than 100 initially and after several executions it starts to settle into always giving the expected 100.Below is the full setup that can be used to reproduce what I'm seeing. It was initially observed on PostgreSQL 11.8 but I can also reproduce it on 13.0.
---
create table task_parent (
id bigint not null,
task_type smallint not null,
reserved boolean not null,
task_timestamp timestamp not null
) partition by list (task_type);create table task_child_1
partition of task_parent for values in (1);create table task_child_2
partition of task_parent for values in (2);insert into task_parent
select
generate_series(1, 500000),
case when random() < 0.5 then 1 else 2 end,
false,
now() - (random() * '1 day'::interval);create index task_parent_task_time_idx
on task_parent (task_timestamp);update task_parent
set reserved = true
from (
select id
from task_parent
where reserved = false
and task_type = 1 or task_type = 2
order by task_timestamp
limit 50
for update skip locked) as sub
where sub.id = task_parent.id
returning task_parent.id;
---A couple of interesting observations:
1) If I remove the order by clause I always get the expected number of results
2) If I rewrite the query to use a CTE for the task IDs instead of a subquery then I always get the expected number of resultsAt its surface, this seems like it could be a bug but maybe there is something about this usage pattern that is known/expected to cause this behavior. So that's the question - is this a bug that
should be reported to pgsql-bugs, or is this expected and if so, why?
Yes, this must be a bug:
EXPLAIN (COSTS OFF) update task_parent
set reserved = true
from (
select id
from task_parent
where reserved = false
and task_type = 1 or task_type = 2
order by task_timestamp
limit 50
for update skip locked) as sub
where sub.id = task_parent.id
returning task_parent.id;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Update on task_parent
Update on task_child_1 task_parent_1
Update on task_child_2 task_parent_2
-> Hash Join
Hash Cond: (task_parent_1.id = sub.id)
-> Seq Scan on task_child_1 task_parent_1
-> Hash
-> Subquery Scan on sub
-> Limit
-> LockRows
-> Merge Append
Sort Key: task_parent_3.task_timestamp
-> Index Scan using task_child_1_task_timestamp_idx on task_child_1 task_parent_4
Filter: (((NOT reserved) AND (task_type = 1)) OR (task_type = 2))
-> Index Scan using task_child_2_task_timestamp_idx on task_child_2 task_parent_5
Filter: (((NOT reserved) AND (task_type = 1)) OR (task_type = 2))
-> Hash Join
Hash Cond: (task_parent_2.id = sub_1.id)
-> Seq Scan on task_child_2 task_parent_2
-> Hash
-> Subquery Scan on sub_1
-> Limit
-> LockRows
-> Merge Append
Sort Key: task_parent_6.task_timestamp
-> Index Scan using task_child_1_task_timestamp_idx on task_child_1 task_parent_7
Filter: (((NOT reserved) AND (task_type = 1)) OR (task_type = 2))
-> Index Scan using task_child_2_task_timestamp_idx on task_child_2 task_parent_8
Filter: (((NOT reserved) AND (task_type = 1)) OR (task_type = 2))
(29 rows)
The subquery is executed twice, and the two executions obviously don't
return the same results. I am at a loss for an explanation ...
Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
Despite looking at this query on and off for a couple of days, it wasn't
until seeing it in Lauenz's reply that I noticed a logical issue with the
query that changes things a bit. There should be parenthesis around the
task_type predicates, otherwise you end up getting reserved rows in the
result set. After looking at it more, I see that when the original query
settles in to consistently returning the expected number of results they
happen to be the exact same results each time. Correcting the logical
error in the query solves that and now consistently gives too many results,
regardless of the number of executions.
Corrected query (but still not working properly):
update task_parent
set reserved = true
from (
select id
from task_parent
where reserved = false
and (task_type = 1 or task_type = 2)
order by task_timestamp
limit 50
for update skip locked) as sub
where sub.id = task_parent.id
returning *;
Yes, this must be a bug:
EXPLAIN (COSTS OFF) update task_parent
set reserved = true
from (
select id
from task_parent
where reserved = false
and task_type = 1 or task_type = 2
order by task_timestamp
limit 50
for update skip locked) as sub
where sub.id = task_parent.id
returning task_parent.id;QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Update on task_parent
Update on task_child_1 task_parent_1
Update on task_child_2 task_parent_2
-> Hash Join
Hash Cond: (task_parent_1.id = sub.id)
-> Seq Scan on task_child_1 task_parent_1
-> Hash
-> Subquery Scan on sub
-> Limit
-> LockRows
-> Merge Append
Sort Key:
task_parent_3.task_timestamp
-> Index Scan using
task_child_1_task_timestamp_idx on task_child_1 task_parent_4
Filter: (((NOT reserved) AND
(task_type = 1)) OR (task_type = 2))
-> Index Scan using
task_child_2_task_timestamp_idx on task_child_2 task_parent_5
Filter: (((NOT reserved) AND
(task_type = 1)) OR (task_type = 2))
-> Hash Join
Hash Cond: (task_parent_2.id = sub_1.id)
-> Seq Scan on task_child_2 task_parent_2
-> Hash
-> Subquery Scan on sub_1
-> Limit
-> LockRows
-> Merge Append
Sort Key:
task_parent_6.task_timestamp
-> Index Scan using
task_child_1_task_timestamp_idx on task_child_1 task_parent_7
Filter: (((NOT reserved) AND
(task_type = 1)) OR (task_type = 2))
-> Index Scan using
task_child_2_task_timestamp_idx on task_child_2 task_parent_8
Filter: (((NOT reserved) AND
(task_type = 1)) OR (task_type = 2))
(29 rows)The subquery is executed twice, and the two executions obviously don't
return the same results. I am at a loss for an explanation ...
I did notice that the subquery is executed twice and this is the difference
between the subquery and CTE since the CTE is only executed once and then
the CTE result is used in the joins against the child tables. Beyond that,
I was unable to rationalize what was going on. One thing I don't know is
if the subquery should be executed more than once. If yes, then like you
said, I would think that they should give the same results.
Another interesting thing I noticed is that multiple executions give tasks
with overlapping time ranges. Given 2 executions, the task_time ordering
should give the oldest 50 first and then the next oldest 50 after that. In
reality, I see that the second execution has tasks that are older than the
newest timestamp of the first execution. Wrapping the update statement in
a CTE and then picking out the min/max shows this:
with updated as (
update task_parent
set reserved = true
from (
select id, task_type
from task_parent
where reserved = false
and (task_type = 1 or task_type = 2)
order by task_timestamp
limit 50
for update skip locked) as sub
where sub.id = task_parent.id
and sub.task_type = task_parent.task_type
returning *
)
select min(task_timestamp), max(task_timestamp)
from updated;
Execution 1:
min | max
----------------------------+----------------------------
2020-12-17 11:44:51.192119 | 2020-12-17 11:45:03.881409
Execution 2:
min | max
----------------------------+----------------------------
2020-12-17 11:44:59.943108 | 2020-12-17 11:45:14.273185
min of execution 2 is older than max of execution 1 which should not happen.
Craig
On Fri, Dec 18, 2020 at 4:06 AM Laurenz Albe <laurenz.albe@cybertec.at>
wrote:
Show quoted text
On Thu, 2020-12-17 at 12:21 -0500, Craig McIlwee wrote:
Our application uses a queue-like table to assign tasks to users and
this has worked well for us for a few years. Now we are in the process of
adding some restrictions to which tasks a user canwork on and that is based on an attribute of each task that does not
change for the task's lifespan. Users may have access to work on one or
more or types of tasks. To improve query time whenfinding the set of tasks that we assign, we are introducing partitioning
into our task queue table. When assigning tasks, we issue an update
statement to mark the tasks as reserved using a subquerythat orders the tasks by age. With the introduction of partitioning, we
are seeing that the update statement affects more rows than expected. An
example query is:---
update task_parent
set reserved = true
from (
select id
from task_parent
where reserved = false
and task_type = 1 or task_type = 2
order by task_timestamp
limit 50
for update skip locked) as sub
where sub.id = task_parent.id
returning task_parent.id
---In the statement above, we have a subquery to limit the number of tasks
to 50 yet the update statement sometimes returns more than 50 records. I
have narrowed this down to a small, reproducibleexample shown below. The first time I run the update statement I get
~65 records, then typically ~53 the next few runs, and then it starts
consistently giving me 50 records after that. Then if Ibump the limit to 100 I will get more than 100 initially and after
several executions it starts to settle into always giving the expected 100.
Below is the full setup that can be used to reproduce what I'm seeing.
It was initially observed on PostgreSQL 11.8 but I can also reproduce it on
13.0.---
create table task_parent (
id bigint not null,
task_type smallint not null,
reserved boolean not null,
task_timestamp timestamp not null
) partition by list (task_type);create table task_child_1
partition of task_parent for values in (1);create table task_child_2
partition of task_parent for values in (2);insert into task_parent
select
generate_series(1, 500000),
case when random() < 0.5 then 1 else 2 end,
false,
now() - (random() * '1 day'::interval);create index task_parent_task_time_idx
on task_parent (task_timestamp);update task_parent
set reserved = true
from (
select id
from task_parent
where reserved = false
and task_type = 1 or task_type = 2
order by task_timestamp
limit 50
for update skip locked) as sub
where sub.id = task_parent.id
returning task_parent.id;
---A couple of interesting observations:
1) If I remove the order by clause I always get the expected number ofresults
2) If I rewrite the query to use a CTE for the task IDs instead of a
subquery then I always get the expected number of results
At its surface, this seems like it could be a bug but maybe there is
something about this usage pattern that is known/expected to cause this
behavior. So that's the question - is this a bug thatshould be reported to pgsql-bugs, or is this expected and if so, why?
Yes, this must be a bug:
EXPLAIN (COSTS OFF) update task_parent
set reserved = true
from (
select id
from task_parent
where reserved = false
and task_type = 1 or task_type = 2
order by task_timestamp
limit 50
for update skip locked) as sub
where sub.id = task_parent.id
returning task_parent.id;QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Update on task_parent
Update on task_child_1 task_parent_1
Update on task_child_2 task_parent_2
-> Hash Join
Hash Cond: (task_parent_1.id = sub.id)
-> Seq Scan on task_child_1 task_parent_1
-> Hash
-> Subquery Scan on sub
-> Limit
-> LockRows
-> Merge Append
Sort Key:
task_parent_3.task_timestamp
-> Index Scan using
task_child_1_task_timestamp_idx on task_child_1 task_parent_4
Filter: (((NOT reserved) AND
(task_type = 1)) OR (task_type = 2))
-> Index Scan using
task_child_2_task_timestamp_idx on task_child_2 task_parent_5
Filter: (((NOT reserved) AND
(task_type = 1)) OR (task_type = 2))
-> Hash Join
Hash Cond: (task_parent_2.id = sub_1.id)
-> Seq Scan on task_child_2 task_parent_2
-> Hash
-> Subquery Scan on sub_1
-> Limit
-> LockRows
-> Merge Append
Sort Key:
task_parent_6.task_timestamp
-> Index Scan using
task_child_1_task_timestamp_idx on task_child_1 task_parent_7
Filter: (((NOT reserved) AND
(task_type = 1)) OR (task_type = 2))
-> Index Scan using
task_child_2_task_timestamp_idx on task_child_2 task_parent_8
Filter: (((NOT reserved) AND
(task_type = 1)) OR (task_type = 2))
(29 rows)The subquery is executed twice, and the two executions obviously don't
return the same results. I am at a loss for an explanation ...Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
Would (task_type in (1,2)) make any logical difference?
On 12/18/20 6:11 AM, Craig McIlwee wrote:
Despite looking at this query on and off for a couple of days, it wasn't
until seeing it in Lauenz's reply that I noticed a logical issue with the
query that changes things a bit. There should be parenthesis around the
task_type predicates, otherwise you end up getting reserved rows in the
result set. After looking at it more, I see that when the original query
settles in to consistently returning the expected number of results they
happen to be the exact same results each time. Correcting the logical
error in the query solves that and now consistently gives too many
results, regardless of the number of executions.Corrected query (but still not working properly):
update task_parent
set reserved = true
from (
select id
from task_parent
where reserved = false
and (task_type = 1 or task_type = 2)
order by task_timestamp
limit 50
for update skip locked) as sub
where sub.id <http://sub.id> = task_parent.id <http://task_parent.id>
returning *;Yes, this must be a bug:
EXPLAIN (COSTS OFF) update task_parent
set reserved = true
from (
select id
from task_parent
where reserved = false
and task_type = 1 or task_type = 2
order by task_timestamp
limit 50
for update skip locked) as sub
where sub.id <http://sub.id/> = task_parent.id <http://task_parent.id/>
returning task_parent.id <http://task_parent.id/>;QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Update on task_parent
Update on task_child_1 task_parent_1
Update on task_child_2 task_parent_2
-> Hash Join
Hash Cond: (task_parent_1.id <http://task_parent_1.id/> =
sub.id <http://sub.id/>)
-> Seq Scan on task_child_1 task_parent_1
-> Hash
-> Subquery Scan on sub
-> Limit
-> LockRows
-> Merge Append
Sort Key:
task_parent_3.task_timestamp
-> Index Scan using
task_child_1_task_timestamp_idx on task_child_1 task_parent_4
Filter: (((NOT reserved)
AND (task_type = 1)) OR (task_type = 2))
-> Index Scan using
task_child_2_task_timestamp_idx on task_child_2 task_parent_5
Filter: (((NOT reserved)
AND (task_type = 1)) OR (task_type = 2))
-> Hash Join
Hash Cond: (task_parent_2.id <http://task_parent_2.id/> =
sub_1.id <http://sub_1.id/>)
-> Seq Scan on task_child_2 task_parent_2
-> Hash
-> Subquery Scan on sub_1
-> Limit
-> LockRows
-> Merge Append
Sort Key:
task_parent_6.task_timestamp
-> Index Scan using
task_child_1_task_timestamp_idx on task_child_1 task_parent_7
Filter: (((NOT reserved)
AND (task_type = 1)) OR (task_type = 2))
-> Index Scan using
task_child_2_task_timestamp_idx on task_child_2 task_parent_8
Filter: (((NOT reserved)
AND (task_type = 1)) OR (task_type = 2))
(29 rows)The subquery is executed twice, and the two executions obviously don't
return the same results. I am at a loss for an explanation ...I did notice that the subquery is executed twice and this is the
difference between the subquery and CTE since the CTE is only executed
once and then the CTE result is used in the joins against the child
tables. Beyond that, I was unable to rationalize what was going on. One
thing I don't know is if the subquery should be executed more than once.
If yes, then like you said, I would think that they should give the same
results.Another interesting thing I noticed is that multiple executions give tasks
with overlapping time ranges. Given 2 executions, the task_time ordering
should give the oldest 50 first and then the next oldest 50 after that.
In reality, I see that the second execution has tasks that are older than
the newest timestamp of the first execution. Wrapping the update
statement in a CTE and then picking out the min/max shows this:with updated as (
update task_parent
set reserved = true
from (
select id, task_type
from task_parent
where reserved = false
and (task_type = 1 or task_type = 2)
order by task_timestamp
limit 50
for update skip locked) as sub
where sub.id <http://sub.id> = task_parent.id <http://task_parent.id>
and sub.task_type = task_parent.task_type
returning *
)
select min(task_timestamp), max(task_timestamp)
from updated;Execution 1:
min | max
----------------------------+----------------------------
2020-12-17 11:44:51.192119 | 2020-12-17 11:45:03.881409Execution 2:
min | max
----------------------------+----------------------------
2020-12-17 11:44:59.943108 | 2020-12-17 11:45:14.273185min of execution 2 is older than max of execution 1 which should not happen.
Craig
On Fri, Dec 18, 2020 at 4:06 AM Laurenz Albe <laurenz.albe@cybertec.at
<mailto:laurenz.albe@cybertec.at>> wrote:On Thu, 2020-12-17 at 12:21 -0500, Craig McIlwee wrote:
Our application uses a queue-like table to assign tasks to users and
this has worked well for us for a few years. Now we are in the
process of adding some restrictions to which tasks a user canwork on and that is based on an attribute of each task that does not
change for the task's lifespan. Users may have access to work on one
or more or types of tasks. To improve query time whenfinding the set of tasks that we assign, we are introducing
partitioning into our task queue table. When assigning tasks, we
issue an update statement to mark the tasks as reserved using a subquerythat orders the tasks by age. With the introduction of
partitioning, we are seeing that the update statement affects more
rows than expected. An example query is:---
update task_parent
set reserved = true
from (
select id
from task_parent
where reserved = false
and task_type = 1 or task_type = 2
order by task_timestamp
limit 50
for update skip locked) as sub
where sub.id <http://sub.id> = task_parent.id <http://task_parent.id>
returning task_parent.id <http://task_parent.id>
---In the statement above, we have a subquery to limit the number of
tasks to 50 yet the update statement sometimes returns more than 50
records. I have narrowed this down to a small, reproducibleexample shown below. The first time I run the update statement I
get ~65 records, then typically ~53 the next few runs, and then it
starts consistently giving me 50 records after that. Then if Ibump the limit to 100 I will get more than 100 initially and after
several executions it starts to settle into always giving the expected
100.Below is the full setup that can be used to reproduce what I'm
seeing. It was initially observed on PostgreSQL 11.8 but I can also
reproduce it on 13.0.---
create table task_parent (
id bigint not null,
task_type smallint not null,
reserved boolean not null,
task_timestamp timestamp not null
) partition by list (task_type);create table task_child_1
partition of task_parent for values in (1);create table task_child_2
partition of task_parent for values in (2);insert into task_parent
select
generate_series(1, 500000),
case when random() < 0.5 then 1 else 2 end,
false,
now() - (random() * '1 day'::interval);create index task_parent_task_time_idx
on task_parent (task_timestamp);update task_parent
set reserved = true
from (
select id
from task_parent
where reserved = false
and task_type = 1 or task_type = 2
order by task_timestamp
limit 50
for update skip locked) as sub
where sub.id <http://sub.id> = task_parent.id <http://task_parent.id>
returning task_parent.id <http://task_parent.id>;
---A couple of interesting observations:
1) If I remove the order by clause I always get the expected numberof results
2) If I rewrite the query to use a CTE for the task IDs instead of a
subquery then I always get the expected number of results
At its surface, this seems like it could be a bug but maybe there is
something about this usage pattern that is known/expected to cause
this behavior. So that's the question - is this a bug thatshould be reported to pgsql-bugs, or is this expected and if so, why?
Yes, this must be a bug:
EXPLAIN (COSTS OFF) update task_parent
set reserved = true
from (
select id
from task_parent
where reserved = false
and task_type = 1 or task_type = 2
order by task_timestamp
limit 50
for update skip locked) as sub
where sub.id <http://sub.id> = task_parent.id <http://task_parent.id>
returning task_parent.id <http://task_parent.id>;QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Update on task_parent
Update on task_child_1 task_parent_1
Update on task_child_2 task_parent_2
-> Hash Join
Hash Cond: (task_parent_1.id <http://task_parent_1.id> =
sub.id <http://sub.id>)
-> Seq Scan on task_child_1 task_parent_1
-> Hash
-> Subquery Scan on sub
-> Limit
-> LockRows
-> Merge Append
Sort Key:
task_parent_3.task_timestamp
-> Index Scan using
task_child_1_task_timestamp_idx on task_child_1 task_parent_4
Filter: (((NOT reserved)
AND (task_type = 1)) OR (task_type = 2))
-> Index Scan using
task_child_2_task_timestamp_idx on task_child_2 task_parent_5
Filter: (((NOT reserved)
AND (task_type = 1)) OR (task_type = 2))
-> Hash Join
Hash Cond: (task_parent_2.id <http://task_parent_2.id> =
sub_1.id <http://sub_1.id>)
-> Seq Scan on task_child_2 task_parent_2
-> Hash
-> Subquery Scan on sub_1
-> Limit
-> LockRows
-> Merge Append
Sort Key:
task_parent_6.task_timestamp
-> Index Scan using
task_child_1_task_timestamp_idx on task_child_1 task_parent_7
Filter: (((NOT reserved)
AND (task_type = 1)) OR (task_type = 2))
-> Index Scan using
task_child_2_task_timestamp_idx on task_child_2 task_parent_8
Filter: (((NOT reserved)
AND (task_type = 1)) OR (task_type = 2))
(29 rows)The subquery is executed twice, and the two executions obviously don't
return the same results. I am at a loss for an explanation ...Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
--
Angular momentum makes the world go 'round.
Laurenz Albe <laurenz.albe@cybertec.at> writes:
The subquery is executed twice, and the two executions obviously don't
return the same results. I am at a loss for an explanation ...
Yeah, this is a fairly fundamental shortcoming in inheritance_planner():
it supposes that it can duplicate the whole query for each target table.
If you have a sub-SELECT that generates unstable results, then the
duplicated copies don't necessarily generate the same results.
And multiple executions of a sub-SELECT with "for update skip locked"
are guaranteed to not give the same results, because the second one
will skip the row(s) already locked by the first one.
It seems to work as desired if you stick the unstable result into a CTE:
=# explain
with sub as (select id
from task_parent
where reserved = false
and task_type = 1 or task_type = 2
order by task_timestamp
limit 50
for update skip locked)
update task_parent
set reserved = true
from sub
where sub.id = task_parent.id
returning task_parent.id;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
Update on task_parent (cost=6.30..10069.93 rows=100 width=57)
Update on task_child_1 task_parent_1
Update on task_child_2 task_parent_2
CTE sub
-> Limit (cost=0.85..4.68 rows=50 width=26)
-> LockRows (cost=0.85..38252.82 rows=500000 width=26)
-> Merge Append (cost=0.85..33252.82 rows=500000 width=26)
Sort Key: task_parent_3.task_timestamp
-> Index Scan using task_child_1_task_timestamp_idx on task_child_1 task_parent_4 (cost=0.42..14123.60 rows=249960 width=26)
Filter: (((NOT reserved) AND (task_type = 1)) OR (task_type = 2))
-> Index Scan using task_child_2_task_timestamp_idx on task_child_2 task_parent_5 (cost=0.42..14129.20 rows=250040 width=26)
Filter: (((NOT reserved) AND (task_type = 1)) OR (task_type = 2))
-> Hash Join (cost=1.62..5032.07 rows=50 width=57)
Hash Cond: (task_parent_1.id = sub.id)
-> Seq Scan on task_child_1 task_parent_1 (cost=0.00..4092.60 rows=249960 width=24)
-> Hash (cost=1.00..1.00 rows=50 width=40)
-> CTE Scan on sub (cost=0.00..1.00 rows=50 width=40)
-> Hash Join (cost=1.62..5033.18 rows=50 width=57)
Hash Cond: (task_parent_2.id = sub.id)
-> Seq Scan on task_child_2 task_parent_2 (cost=0.00..4093.40 rows=250040 width=24)
-> Hash (cost=1.00..1.00 rows=50 width=40)
-> CTE Scan on sub (cost=0.00..1.00 rows=50 width=40)
(22 rows)
It's been obvious for some time that inheritance_planner() needs to
be nuked from orbit, because aside from this fundamental semantic
issue it's got horrible performance problems with large inheritance
trees (ie many partitions). We might finally get that done for v14
--- at least, there's a patch in the queue about it. In existing
releases, I recommend the CTE solution.
regards, tom lane
On Fri, Dec 18, 2020 at 12:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Laurenz Albe <laurenz.albe@cybertec.at> writes:
The subquery is executed twice, and the two executions obviously don't
return the same results. I am at a loss for an explanation ...Yeah, this is a fairly fundamental shortcoming in inheritance_planner():
it supposes that it can duplicate the whole query for each target table.
If you have a sub-SELECT that generates unstable results, then the
duplicated copies don't necessarily generate the same results.
And multiple executions of a sub-SELECT with "for update skip locked"
are guaranteed to not give the same results, because the second one
will skip the row(s) already locked by the first one.
Are there other examples of gotchas with this? Would it be any volatile
function (or behavior like skip locked) in a sub-query? It isn't apparent
to me why the subquery is executed twice for this example either and since
that is a pre-req for hitting this unexpected situation... what is the
factor that means the sub-query would be executed multiple times?
With the behavior change for CTEs to no longer be materialized by default
in PG12... why does the CTE still mean it is executed only once? Is it
because it is NOT side effect free (locking) so it cannot be in-lined? If
it were a volatile function instead, might we have gotten more than 50 rows
updated?
Michael Lewis <mlewis@entrata.com> writes:
On Fri, Dec 18, 2020 at 12:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Yeah, this is a fairly fundamental shortcoming in inheritance_planner():
it supposes that it can duplicate the whole query for each target table.
Are there other examples of gotchas with this? Would it be any volatile
function (or behavior like skip locked) in a sub-query?
Right, anything that causes multiple executions to not deliver identical
results. The different executions will use the same snapshot, so there's
not a hazard from external changes to the DB, but internal sources of
nonrepeatability are a problem.
... what is the
factor that means the sub-query would be executed multiple times?
If it's in the FROM clause of an UPDATE or DELETE on a table with
inheritance children (either traditional inheritance or partitioning).
Actually, after further thought, I'm not entirely sure that the issue is
confined to inherited UPDATE/DELETE. If you had such a sub-SELECT in
an ordinary join, and the planner chose to put it on the inside of a
nestloop, you'd have a problem. I do not think there's any check to
avoid doing that just because the subquery's results are potentially
volatile. Probably evaluation-cost considerations would discourage
such a plan in most cases, but there's no direct defense AFAIR.
With the behavior change for CTEs to no longer be materialized by default
in PG12... why does the CTE still mean it is executed only once? Is it
because it is NOT side effect free (locking) so it cannot be in-lined?
Exactly.
regards, tom lane