problem with RETURNING and update row movement
Hi,
While working on [1], I came across a bug.
Reproduction steps:
create table foo (a int, b int) partition by list (a);
create table foo1 (c int, b int, a int);
alter table foo1 drop c;
alter table foo attach partition foo1 for values in (1);
create table foo2 partition of foo for values in (2);
create table foo3 partition of foo for values in (3);
create or replace function trigfunc () returns trigger language
plpgsql as $$ begin new.b := 2; return new; end; $$;
create trigger trig before insert on foo2 for each row execute
function trigfunc();
insert into foo values (1, 1), (2, 2), (3, 3);
update foo set a = 2 from (values (1), (2), (3)) s(x) where a = s.x returning *;
ERROR: attribute 5 of type record has wrong type
DETAIL: Table has type record, but query expects integer.
The error occurs when projecting the RETURNING list. The problem is
that the projection being used when the error occurs belongs to result
relation foo2 which is the destination partition of a row movement
operation, but it's trying to access a column in the tuple produced by
the plan belonging to foo1, the source partition of the row movement.
foo2's RETURNING projection can only work correctly when it's being
fed tuples from the plan belonging to foo2.
Note that the targetlists of the plans belonging to different result
relations can be different depending on the respective relation's
tuple descriptors, so are not interchangeable. Also, each result
relation's RETURNING list is made to refer to its own plan's output.
Without row movement, there is only one result relation to consider,
so there's no confusion regarding which RETURNING list to compute.
With row movement however, while there is only one plan tuple, there
are two result relations to consider each with its own RETURNING list.
I think we should be computing the *source* relation's RETURNING list,
because only that one of the two can consume the plan tuple correctly.
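To illustrate the targetlist point (a sketch not in the original mail; exact plan shapes vary by version), EXPLAIN VERBOSE on the reproduction tables should show one subplan per result relation, each with its own output list:

```sql
-- Sketch: inspect the per-partition subplans' output lists.
-- In PG 11-13, the UPDATE is expanded into one subplan per partition,
-- so foo1's targetlist (shaped by its dropped column c) is laid out
-- differently from foo2's and foo3's.
EXPLAIN (VERBOSE, COSTS OFF)
UPDATE foo SET a = 2 FROM (VALUES (1), (2), (3)) s(x) WHERE a = s.x;
```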
Attached is a patch that fixes things to be that way.
By the way, the problem exists since PG 11 when UPDATE row movement
feature went in.
--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
[1]: /messages/by-id/CA+HiwqH-2sq-3Zq-CtuWjfRSyrGPXJBf1nCKKvTHuGVyfQ1OYA@mail.gmail.com
Attachments:
v1-0001-Fix-a-bug-with-RETURNING-when-UPDATE-moves-tuple.patch (application/octet-stream)
Hi Amit-san,
On Thu, Jun 11, 2020 at 6:10 PM Amit Langote <amitlangote09@gmail.com> wrote:
Reproduction steps:
create table foo (a int, b int) partition by list (a);
create table foo1 (c int, b int, a int);
alter table foo1 drop c;
alter table foo attach partition foo1 for values in (1);
create table foo2 partition of foo for values in (2);
create table foo3 partition of foo for values in (3);
create or replace function trigfunc () returns trigger language
plpgsql as $$ begin new.b := 2; return new; end; $$;
create trigger trig before insert on foo2 for each row execute
function trigfunc();
insert into foo values (1, 1), (2, 2), (3, 3);
update foo set a = 2 from (values (1), (2), (3)) s(x) where a = s.x returning *;
ERROR: attribute 5 of type record has wrong type
DETAIL: Table has type record, but query expects integer.
Reproduced. Could you add the patch to the next commitfest so that it
doesn't get lost?
Best regards,
Etsuro Fujita
On Sun, Jun 14, 2020 at 4:23 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
Hi Amit-san,
On Thu, Jun 11, 2020 at 6:10 PM Amit Langote <amitlangote09@gmail.com> wrote:
Reproduction steps:
create table foo (a int, b int) partition by list (a);
create table foo1 (c int, b int, a int);
alter table foo1 drop c;
alter table foo attach partition foo1 for values in (1);
create table foo2 partition of foo for values in (2);
create table foo3 partition of foo for values in (3);
create or replace function trigfunc () returns trigger language
plpgsql as $$ begin new.b := 2; return new; end; $$;
create trigger trig before insert on foo2 for each row execute
function trigfunc();
insert into foo values (1, 1), (2, 2), (3, 3);
update foo set a = 2 from (values (1), (2), (3)) s(x) where a = s.x returning *;
ERROR: attribute 5 of type record has wrong type
DETAIL: Table has type record, but query expects integer.
Reproduced. Could you add the patch to the next commitfest so that it
doesn't get lost?
Done. Thank you for taking a look.
https://commitfest.postgresql.org/28/2597/
--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
Amit-san,
Hello. I've tested your patch.
The patch applies cleanly, and make check-world produced no failures.
I didn't run performance tests, because in my opinion this patch has
no effect on performance.
I reproduced the bug you described before, which is prevented by your
patch. After applying your patch, I got correct output without error
using the test case in the first mail of this thread.
Just a small comment about your patch:
I felt the test you added in update.sql could be made simpler or shorter.
Excuse me if I say something silly, but I suppose the bug can be
checked without defining the function and trigger for this case.
Neither of them is essentially connected with the row movement between
the source partition and the destination partition, so couldn't they
be replaced by a simpler expression?
Best,
Takamichi Osumi
Hi Takamichi-san,
On Tue, Jul 14, 2020 at 8:26 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
Amit san
Hello. I've tested your patch.
Thanks for that.
Just small comment about your patch.
I felt the test you added in update.sql could be made simpler or shorter.
Excuse me if I say something silly.
It's because I suppose the bug can be checked without defining the function and trigger for this case. Neither of them is essentially connected with the row movement between the source partition and the destination partition, so couldn't they be replaced by a simpler expression?
Well, it's true that the function and the trigger have nothing to do
with the main bug, but it's often good to be sure that the bug-fix
isn't breaking cases where they are present and have visible effect.
--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jun 11, 2020 at 2:40 PM Amit Langote <amitlangote09@gmail.com> wrote:
Hi,
While working on [1], I came across a bug.
Reproduction steps:
create table foo (a int, b int) partition by list (a);
create table foo1 (c int, b int, a int);
alter table foo1 drop c;
alter table foo attach partition foo1 for values in (1);
create table foo2 partition of foo for values in (2);
create table foo3 partition of foo for values in (3);
create or replace function trigfunc () returns trigger language
plpgsql as $$ begin new.b := 2; return new; end; $$;
create trigger trig before insert on foo2 for each row execute
function trigfunc();
insert into foo values (1, 1), (2, 2), (3, 3);
update foo set a = 2 from (values (1), (2), (3)) s(x) where a = s.x returning *;
ERROR: attribute 5 of type record has wrong type
DETAIL: Table has type record, but query expects integer.
The error occurs when projecting the RETURNING list. The problem is
that the projection being used when the error occurs belongs to result
relation foo2 which is the destination partition of a row movement
operation, but it's trying to access a column in the tuple produced by
the plan belonging to foo1, the source partition of the row movement.
foo2's RETURNING projection can only work correctly when it's being
fed tuples from the plan belonging to foo2.
IIUC, here the problem is related to below part of code:
ExecInsert(..)
{
/* Process RETURNING if present */
if (resultRelInfo->ri_projectReturning)
result = ExecProcessReturning(resultRelInfo, slot, planSlot);
..
}
The reason is that planSlot is for foo1 and slot is for foo2 and when
it tries to access tuple during ExecProcessReturning(), it results in
an error. Is my understanding correct? If so, then can't we somehow
ensure that planSlot also belongs to foo2, instead of skipping
RETURNING processing in Insert and then doing more work later in Update?
Like Takamichi-san, I also think here we don't need trigger/function
in the test case. If one reads the comment you have added in the test
case, it is not evident why a trigger or function is required. If you
really think it is important to cover the trigger case, then either
add a separate test or at least some comments on how the trigger
helps here or what you want to test.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hi Amit,
Thanks for taking a look at this.
On Mon, Jul 20, 2020 at 8:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
IIUC, here the problem is related to below part of code:
ExecInsert(..)
{
/* Process RETURNING if present */
if (resultRelInfo->ri_projectReturning)
result = ExecProcessReturning(resultRelInfo, slot, planSlot);
..
}
The reason is that planSlot is for foo1 and slot is for foo2 and when
it tries to access tuple during ExecProcessReturning(), it results in
an error. Is my understanding correct?
Yes. Specifically, the problem exists if there are any non-target
relation attributes in RETURNING which are computed by referring to
planSlot, the plan's output tuple, which may be shaped differently
among result relations due to their tuple descriptors being different.
If so, then can't we ensure
someway that planSlot also belongs to foo2 instead of skipping return
processing in Insert and then later do more work to perform in Update.
I did consider that option but failed to see a way to make it work.
I am not sure if there is a way to make a copy of the plan's output
tuple (planSlot) that is compatible with the destination partition.
Simple conversion using execute_attr_map_slot() is okay when we know
the source and the target slots contain relation tuples, but plan's
output tuples will also have other junk attributes. Also, not all
destination partitions have an associated plan and hence a slot to
hold plan tuples.
Like Takamichi-san, I also think here we don't need trigger/function
in the test case. If one reads the comment you have added in the test
case, it is not evident why is trigger or function required. If you
really think it is important to cover the trigger case then either
have a separate test or at least add some comments on how trigger
helps here or what you want to test it.
That's fair. I have updated the test comment.
To expand on that here, because now we'll be computing RETURNING using
the source partition's projection and the tuple in the source
partition's format, I wanted to make sure that any changes made by the
destination partition's triggers are reflected in the output.
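Concretely (a sketch reusing the tables and trigger from the reproduction steps, not the exact regression test), the point is that RETURNING should reflect the trigger's new.b := 2 for rows moved into foo2:

```sql
-- With the fix, rows moved into foo2 fire its BEFORE INSERT trigger,
-- and RETURNING should show the trigger's change (b = 2) for them.
UPDATE foo SET a = 2 FROM (VALUES (1), (2), (3)) s(x)
WHERE a = s.x RETURNING tableoid::regclass, *;
```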
PFA v2.
--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v2-0001-Fix-a-bug-with-RETURNING-when-UPDATE-moves-tuple.patch (application/octet-stream)
On Wed, Jul 22, 2020 at 3:16 PM Amit Langote <amitlangote09@gmail.com> wrote:
On Mon, Jul 20, 2020 at 8:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
IIUC, here the problem is related to below part of code:
ExecInsert(..)
{
/* Process RETURNING if present */
if (resultRelInfo->ri_projectReturning)
result = ExecProcessReturning(resultRelInfo, slot, planSlot);
..
}
The reason is that planSlot is for foo1 and slot is for foo2 and when
it tries to access tuple during ExecProcessReturning(), it results in
an error. Is my understanding correct?
Yes. Specifically, the problem exists if there are any non-target
relation attributes in RETURNING which are computed by referring to
planSlot, the plan's output tuple, which may be shaped differently
among result relations due to their tuple descriptors being different.
If so, then can't we ensure
someway that planSlot also belongs to foo2 instead of skipping return
processing in Insert and then later do more work to perform in Update.
I did consider that option but failed to see a way to make it work.
I am not sure if there is a way to make a copy of the plan's output
tuple (planSlot) that is compatible with the destination partition.
Simple conversion using execute_attr_map_slot() is okay when we know
the source and the target slots contain relation tuples, but plan's
output tuples will also have other junk attributes. Also, not all
destination partitions have an associated plan and hence a slot to
hold plan tuples.
Yeah, I think it might be possible to create planSlot to pass to
ExecInsert() so that we can process RETURNING within that function,
but even if so, that would be cumbersome not only because partitions
can have different rowtypes but because they can have different junk
columns as well, because e.g., subplan foreign partitions may have
different row ID columns as junk columns. The proposed patch is
simple, so I would vote for it. (Note: in case of a foreign
partition, we call ExecForeignInsert() with the source partition’s
planSlot in ExecInsert(), which is not correct, but that would be OK
because it seems unlikely that the FDW would look at the planSlot for
INSERT.)
One thing I noticed is that the patch changes the existing behavior.
Here is an example:
create table range_parted (a text, b bigint) partition by range (a, b);
create table part_a_1_a_10 partition of range_parted for values from
('a', 1) to ('a', 10);
create table part_b_1_b_10 partition of range_parted for values from
('b', 1) to ('b', 10);
create function trigfunc() returns trigger as $$ begin return null;
end; $$ language plpgsql;
create trigger trig before insert on part_b_1_b_10 for each row
execute function trigfunc();
insert into range_parted values ('a', 1);
In HEAD:
postgres=# update range_parted r set a = 'b' from (values ('a', 1))
s(x, y) where s.x = r.a and s.y = r.b returning tableoid::regclass,
r.*;
tableoid | a | b
----------+---+---
(0 rows)
UPDATE 0
But with the patch:
postgres=# update range_parted r set a = 'b' from (values ('a', 1))
s(x, y) where s.x = r.a and s.y = r.b returning tableoid::regclass,
r.*;
tableoid | a | b
---------------+---+---
part_a_1_a_10 | b | 1
(1 row)
UPDATE 1
This produces a RETURNING row even though the INSERT into the
destination partition was skipped by the trigger.
Another thing is that the patch assumes that the tuple slot passed to
ExecInsert() will store the inserted tuple when that function is done,
but that's not always true, because in case of a foreign partition,
the FDW may return a slot other than the passed-in slot when called
from ExecForeignInsert(), in which case the passed-in slot would not
store the inserted tuple anymore.
To fix these, I modified the patch so that we 1) add to ExecInsert()
an output parameter slot to store the inserted tuple, and 2) compute
RETURNING based on the parameter. I also added a regression test
case. Attached is an updated version of the patch.
Sorry for the long delay.
Best regards,
Etsuro Fujita
Attachments:
v3-0001-Fix-a-bug-with-RETURNING-when-UPDATE-moves-tuple.patch (application/octet-stream)
Yet another thing I noticed is that the patch incorrectly produces
values for the tableoid columns specified in the RETURNING list, like
this:
+UPDATE range_parted r set a = 'c' FROM (VALUES ('a', 1), ('a', 10),
('b', 12)) s(x, y) WHERE s.x = r.a AND s.y = r.b RETURNING
tableoid::regclass, *;
+ tableoid | a | b | c | d | e | x | y
+----------------+---+----+-----+---+---------------+---+----
+ part_a_1_a_10 | c | 1 | 1 | 1 | in trigfunc() | a | 1
+ part_a_10_a_20 | c | 10 | 200 | 1 | in trigfunc() | a | 10
+ part_c_1_100 | c | 12 | 96 | 1 | in trigfunc() | b | 12
+(3 rows)
The source partitions are shown as tableoid, but the destination
partition (i.e., part_c_1_c_20) should be shown. To fix this, I
modified the patch further so that 1) we override tts_tableOid of the
original slot with the OID of the destination partition before calling
ExecProcessReturning() if needed, and 2) in ExecProcessReturning(), we
only initialize ecxt_scantuple's tts_tableOid when needed, which would
save cycles a bit for non-foreign-table-direct-modification cases.
Attached is a new version of the patch.
Best regards,
Etsuro Fujita
Attachments:
v4-0001-Fix-a-bug-with-RETURNING-when-UPDATE-moves-tuple.patch (application/octet-stream)
Fujita-san,
Thanks for your time on this.
On Sun, Aug 2, 2020 at 5:57 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
Yet another thing I noticed is that the patch incorrectly produces
values for the tableoid columns specified in the RETURNING list, like
this:
Yeah, I noticed that too.
+UPDATE range_parted r set a = 'c' FROM (VALUES ('a', 1), ('a', 10),
('b', 12)) s(x, y) WHERE s.x = r.a AND s.y = r.b RETURNING
tableoid::regclass, *;
+ tableoid | a | b | c | d | e | x | y
+----------------+---+----+-----+---+---------------+---+----
+ part_a_1_a_10 | c | 1 | 1 | 1 | in trigfunc() | a | 1
+ part_a_10_a_20 | c | 10 | 200 | 1 | in trigfunc() | a | 10
+ part_c_1_100 | c | 12 | 96 | 1 | in trigfunc() | b | 12
+(3 rows)
The source partitions are shown as tableoid, but the destination
partition (ie, part_c_1_c_20) should be shown. To fix this, I
modified the patch further so that 1) we override tts_tableOid of the
original slot with the OID of the destination partition before calling
ExecProcessReturning() if needed, and 2) in ExecProcessReturning(), we
only initialize ecxt_scantuple's tts_tableOid when needed, which would
save cycles a bit for non-foreign-table-direct-modification cases.
Attached is a new version of the patch.
Thanks for the updated patch. I reviewed your changes in v3 too and
they looked fine to me.
However, I noticed that having system columns like ctid, xmin, etc. in
the RETURNING list is now broken, maybe irreparably, due to the
approach we are taking in the patch. Let me show an example:
drop table foo;
create table foo (a int, b int) partition by list (a);
create table foo1 (c int, b int, a int);
alter table foo1 drop c;
alter table foo attach partition foo1 for values in (1);
create table foo2 partition of foo for values in (2);
create table foo3 partition of foo for values in (3);
create or replace function trigfunc () returns trigger language
plpgsql as $$ begin new.b := 2; return new; end; $$;
create trigger trig before insert on foo2 for each row execute
function trigfunc();
insert into foo values (1, 1), (2, 2), (3, 3);
update foo set a = 2 from (values (1), (2), (3)) s(x) where a = s.x
returning tableoid::regclass, ctid, xmin, xmax, *;
tableoid | ctid | xmin | xmax | a | b | x
----------+----------------+------+------------+---+---+---
foo2 | (4294967295,0) | 128 | 4294967295 | 2 | 2 | 1
foo2 | (0,3) | 782 | 0 | 2 | 2 | 2
foo2 | (0,4) | 782 | 0 | 2 | 2 | 3
(3 rows)
During foo1's update, it appears that we lose the system information
in the physical tuple initialized during ExecInsert on foo2 when it is
converted back to foo1's reltype using the new code.
I haven't been able to figure out how to preserve the system
information in HeapTuple contained in the destination slot across the
conversion. Note we want to use the latter to project the RETURNING
list.
By the way, you'll need two adjustments to even get this example
working, one of which is a reported problem that system columns in
RETURNING list during an operation on partitioned table stopped
working in PG 12 [1], for which I've proposed a workaround (attached).
Another is that we forgot in our patch to "materialize" the virtual
tuple after conversion, which means slot_getsysattr() can't find it to
access system columns like xmin, etc.
[1]: /messages/by-id/141051591267657@mail.yandex.ru
--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
Attachments:
use-heap-slot-for-partitioned-table.patch (application/octet-stream)
On Mon, Aug 3, 2020 at 2:54 PM Amit Langote <amitlangote09@gmail.com> wrote:
However, I noticed that having system columns like ctid, xmin, etc. in
the RETURNING list is now broken and maybe irreparably due to the
approach we are taking in the patch. Let me show an example:
drop table foo;
create table foo (a int, b int) partition by list (a);
create table foo1 (c int, b int, a int);
alter table foo1 drop c;
alter table foo attach partition foo1 for values in (1);
create table foo2 partition of foo for values in (2);
create table foo3 partition of foo for values in (3);
create or replace function trigfunc () returns trigger language
plpgsql as $$ begin new.b := 2; return new; end; $$;
create trigger trig before insert on foo2 for each row execute
function trigfunc();
insert into foo values (1, 1), (2, 2), (3, 3);
update foo set a = 2 from (values (1), (2), (3)) s(x) where a = s.x
returning tableoid::regclass, ctid, xmin, xmax, *;
tableoid | ctid | xmin | xmax | a | b | x
----------+----------------+------+------------+---+---+---
foo2 | (4294967295,0) | 128 | 4294967295 | 2 | 2 | 1
foo2 | (0,3) | 782 | 0 | 2 | 2 | 2
foo2 | (0,4) | 782 | 0 | 2 | 2 | 3
+(3 rows)
During foo1's update, it appears that we are losing the system
information in the physical tuple initialized during ExecInsert on
foo2 during its conversion back to foo1's reltype using the new code.
I haven't been able to figure out how to preserve the system
information in HeapTuple contained in the destination slot across the
conversion. Note we want to use the latter to project the RETURNING
list.
By the way, you'll need two adjustments to even get this example
working, one of which is a reported problem that system columns in
RETURNING list during an operation on partitioned table stopped
working in PG 12 [1] for which I've proposed a workaround (attached).
Another is that we forgot in our patch to "materialize" the virtual
tuple after conversion, which means slot_getsysattr() can't find it to
access system columns like xmin, etc.
The only solution I could think of for this so far is this:
+ if (map)
+ {
+ orig_slot = execute_attr_map_slot(map,
+ res_slot,
+ orig_slot);
+
+ /*
+ * A HACK to install system information into the just
+ * converted tuple so that RETURNING computes any
+ * system columns correctly. This would be the same
+ * information that would be present in the HeapTuple
+ * version of the tuple in res_slot.
+ */
+ tuple = ExecFetchSlotHeapTuple(orig_slot, true,
+ &should_free);
+ tuple->t_data->t_infomask &= ~(HEAP_XACT_MASK);
+ tuple->t_data->t_infomask2 &= ~(HEAP2_XACT_MASK);
+ tuple->t_data->t_infomask |= HEAP_XMAX_INVALID;
+ HeapTupleHeaderSetXmin(tuple->t_data,
+ GetCurrentTransactionId());
+ HeapTupleHeaderSetCmin(tuple->t_data,
+ estate->es_output_cid);
+ HeapTupleHeaderSetXmax(tuple->t_data, 0); /* for cleanliness */
+ }
+ /*
+ * Override tts_tableOid with the OID of the destination
+ * partition.
+ */
+ orig_slot->tts_tableOid = RelationGetRelid(destRel->ri_RelationDesc);
+ /* Also the TID. */
+ orig_slot->tts_tid = res_slot->tts_tid;
..but it might be too ugly :(.
--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
Amit-san,
On Mon, Aug 3, 2020 at 2:55 PM Amit Langote <amitlangote09@gmail.com> wrote:
However, I noticed that having system columns like ctid, xmin, etc. in
the RETURNING list is now broken and maybe irreparably due to the
approach we are taking in the patch. Let me show an example:
drop table foo;
create table foo (a int, b int) partition by list (a);
create table foo1 (c int, b int, a int);
alter table foo1 drop c;
alter table foo attach partition foo1 for values in (1);
create table foo2 partition of foo for values in (2);
create table foo3 partition of foo for values in (3);
create or replace function trigfunc () returns trigger language
plpgsql as $$ begin new.b := 2; return new; end; $$;
create trigger trig before insert on foo2 for each row execute
function trigfunc();
insert into foo values (1, 1), (2, 2), (3, 3);
update foo set a = 2 from (values (1), (2), (3)) s(x) where a = s.x
returning tableoid::regclass, ctid, xmin, xmax, *;
tableoid | ctid | xmin | xmax | a | b | x
----------+----------------+------+------------+---+---+---
foo2 | (4294967295,0) | 128 | 4294967295 | 2 | 2 | 1
foo2 | (0,3) | 782 | 0 | 2 | 2 | 2
foo2 | (0,4) | 782 | 0 | 2 | 2 | 3
+(3 rows)
During foo1's update, it appears that we are losing the system
information in the physical tuple initialized during ExecInsert on
foo2 during its conversion back to foo1's reltype using the new code.
I haven't been able to figure out how to preserve the system
information in HeapTuple contained in the destination slot across the
conversion. Note we want to use the latter to project the RETURNING
list.
Thanks for pointing that out!
By the way, you'll need two adjustments to even get this example
working, one of which is a reported problem that system columns in
RETURNING list during an operation on partitioned table stopped
working in PG 12 [1] for which I've proposed a workaround (attached).
Another is that we forgot in our patch to "materialize" the virtual
tuple after conversion, which means slot_getsysattr() can't find it to
access system columns like xmin, etc.
Ok, I’ll look at those closely.
Best regards,
Etsuro Fujita
On Mon, Aug 3, 2020 at 4:39 PM Amit Langote <amitlangote09@gmail.com> wrote:
On Mon, Aug 3, 2020 at 2:54 PM Amit Langote <amitlangote09@gmail.com> wrote:
By the way, you'll need two adjustments to even get this example
working, one of which is a reported problem that system columns in
RETURNING list during an operation on partitioned table stopped
working in PG 12 [1] for which I've proposed a workaround (attached).
Another is that we forgot in our patch to "materialize" the virtual
tuple after conversion, which means slot_getsysattr() can't find it to
access system columns like xmin, etc.
The only solution I could think of for this so far is this:
+ if (map)
+ {
+ orig_slot = execute_attr_map_slot(map,
+ res_slot,
+ orig_slot);
+
+ /*
+ * A HACK to install system information into the just
+ * converted tuple so that RETURNING computes any
+ * system columns correctly. This would be the same
+ * information that would be present in the HeapTuple
+ * version of the tuple in res_slot.
+ */
+ tuple = ExecFetchSlotHeapTuple(orig_slot, true,
+ &should_free);
+ tuple->t_data->t_infomask &= ~(HEAP_XACT_MASK);
+ tuple->t_data->t_infomask2 &= ~(HEAP2_XACT_MASK);
+ tuple->t_data->t_infomask |= HEAP_XMAX_INVALID;
+ HeapTupleHeaderSetXmin(tuple->t_data,
+ GetCurrentTransactionId());
+ HeapTupleHeaderSetCmin(tuple->t_data,
+ estate->es_output_cid);
+ HeapTupleHeaderSetXmax(tuple->t_data, 0); /* for cleanliness */
+ }
+ /*
+ * Override tts_tableOid with the OID of the destination
+ * partition.
+ */
+ orig_slot->tts_tableOid = RelationGetRelid(destRel->ri_RelationDesc);
+ /* Also the TID. */
+ orig_slot->tts_tid = res_slot->tts_tid;
..but it might be too ugly :(.
Yeah, I think that would be a bit ugly, and actually, is not correct
in case of postgres_fdw foreign table, in which case Xmin and Cmin are
also set to 0 [2]. I think we should probably first address the
tableam issue that you pointed out, but I don't think I'm the right
person to do so.
[2]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=da7d44b627ba839de32c9409aca659f60324de76
Best regards,
Etsuro Fujita
On Fri, Aug 7, 2020 at 10:45 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Mon, Aug 3, 2020 at 4:39 PM Amit Langote <amitlangote09@gmail.com> wrote:
On Mon, Aug 3, 2020 at 2:54 PM Amit Langote <amitlangote09@gmail.com> wrote:
By the way, you'll need two adjustments to even get this example
working, one of which is a reported problem that system columns in
RETURNING list during an operation on partitioned table stopped
working in PG 12 [1] for which I've proposed a workaround (attached).
Another is that we forgot in our patch to "materialize" the virtual
tuple after conversion, which means slot_getsysattr() can't find it to
access system columns like xmin, etc.
The only solution I could think of for this so far is this:
+ if (map)
+ {
+ orig_slot = execute_attr_map_slot(map,
+ res_slot,
+ orig_slot);
+
+ /*
+ * A HACK to install system information into the just
+ * converted tuple so that RETURNING computes any
+ * system columns correctly. This would be the same
+ * information that would be present in the HeapTuple
+ * version of the tuple in res_slot.
+ */
+ tuple = ExecFetchSlotHeapTuple(orig_slot, true,
+ &should_free);
+ tuple->t_data->t_infomask &= ~(HEAP_XACT_MASK);
+ tuple->t_data->t_infomask2 &= ~(HEAP2_XACT_MASK);
+ tuple->t_data->t_infomask |= HEAP_XMAX_INVALID;
+ HeapTupleHeaderSetXmin(tuple->t_data,
+ GetCurrentTransactionId());
+ HeapTupleHeaderSetCmin(tuple->t_data,
+ estate->es_output_cid);
+ HeapTupleHeaderSetXmax(tuple->t_data, 0); /* for cleanliness */
+ }
+ /*
+ * Override tts_tableOid with the OID of the destination
+ * partition.
+ */
+ orig_slot->tts_tableOid = RelationGetRelid(destRel->ri_RelationDesc);
+ /* Also the TID. */
+ orig_slot->tts_tid = res_slot->tts_tid;
..but it might be too ugly :(.
Yeah, I think that would be a bit ugly, and actually, is not correct
in case of postgres_fdw foreign table, in which case Xmin and Cmin are
also set to 0 [2].
Yeah, need to think a bit harder.
I think we should probably first address the
tableam issue that you pointed out, but I don't think I'm the right
person to do so.
Okay, I will try to revive that thread.
--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
I noticed that this bugfix has stalled, probably because the other
bugfix has also stalled.
It seems that cleanly returning system columns from table AM is not
going to be a simple task -- whatever fix we get for that is likely not
going to make it all the way to PG 12. So I suggest that we should fix
this bug in 11-13 without depending on that.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Alvaro,
On Sat, Sep 12, 2020 at 5:42 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
I noticed that this bugfix has stalled, probably because the other
bugfix has also stalled.
It seems that cleanly returning system columns from table AM is not
going to be a simple task -- whatever fix we get for that is likely not
going to make it all the way to PG 12. So I suggest that we should fix
this bug in 11-13 without depending on that.
Although I would be reversing course on what I said upthread, I tend
to agree with that, because the core idea behind the fix for this
issue does not seem likely to be invalidated by any conclusion
regarding the other issue. That is, as far as the issue here is
concerned, we can't avoid falling back to using the source partition's
RETURNING projection whose scan tuple is provided using the source
partition's tuple slot.
However, given that we have to pass the *new* tuple to the projection,
not the old one, we will need a "clean" way to transfer its (the new
tuple's) system columns into the source partition's tuple slot. The
sketch I gave upthread of what that could look like was not liked by
Fujita-san much.
--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
On Mon, Sep 14, 2020 at 3:53 PM Amit Langote <amitlangote09@gmail.com> wrote:
On Sat, Sep 12, 2020 at 5:42 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
I noticed that this bugfix has stalled, probably because the other
bugfix has also stalled.
It seems that cleanly returning system columns from table AM is not
going to be a simple task -- whatever fix we get for that is likely not
going to make it all the way to PG 12. So I suggest that we should fix
this bug in 11-13 without depending on that.
Although I would be reversing course on what I said upthread, I tend
to agree with that, because the core idea behind the fix for this
issue does not seem likely to be invalidated by any conclusion
regarding the other issue. That is, as far as the issue here is
concerned, we can't avoid falling back to using the source partition's
RETURNING projection whose scan tuple is provided using the source
partition's tuple slot.
I agree on that point.
However, given that we have to pass the *new* tuple to the projection,
not the old one, we will need a "clean" way to transfer its (the new
tuple's) system columns into the source partition's tuple slot. The
sketch I gave upthread of what that could look like was not liked by
Fujita-san much.
IIUC, I think two issues are discussed in the thread [1]: (a) there is
currently no way to define the set of meaningful system columns for a
partitioned table that contains pluggable storages other than standard
heap, and (b) even in the case where the set is defined as before,
like partitioned tables that don’t contain any pluggable storages,
system column values are not obtained from a tuple inserted into a
partitioned table in cases as reported in [1], because virtual slots
are assigned for partitioned tables [2][3]. (I think the latter is
the original issue in the thread, though.)
I think we could fix this update-tuple-routing-vs-RETURNING issue
without the fix for (a). But to transfer system column values in a
cleaner way, I think we need to fix (b) first so that we can always
obtain/copy them from the new tuple moved to another partition by
INSERT following DELETE.
(I think we could address this issue for v11 independently of [1], of course.)
Best regards,
Etsuro Fujita
[1]: /messages/by-id/141051591267657@mail.yandex.ru
[2]: /messages/by-id/CA+HiwqHrsNa4e0MfpSzv7xOM94TvX=R0MskYxYWfy0kjL0DAdQ@mail.gmail.com
[3]: /messages/by-id/20200811180231.fcssvhelqpnydnx7@alap3.anarazel.de
On Mon, Sep 14, 2020 at 5:56 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Mon, Sep 14, 2020 at 3:53 PM Amit Langote <amitlangote09@gmail.com> wrote:
On Sat, Sep 12, 2020 at 5:42 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
I noticed that this bugfix has stalled, probably because the other
bugfix has also stalled.
It seems that cleanly returning system columns from table AM is not
going to be a simple task -- whatever fix we get for that is likely not
going to make it all the way to PG 12. So I suggest that we should fix
this bug in 11-13 without depending on that.
Although I would be reversing course on what I said upthread, I tend
to agree with that, because the core idea behind the fix for this
issue does not seem likely to be invalidated by any conclusion
regarding the other issue. That is, as far as the issue here is
concerned, we can't avoid falling back to using the source partition's
RETURNING projection whose scan tuple is provided using the source
partition's tuple slot.
I agree on that point.
However, given that we have to pass the *new* tuple to the projection,
not the old one, we will need a "clean" way to transfer its (the new
tuple's) system columns into the source partition's tuple slot. The
sketch I gave upthread of what that could look like was not liked by
Fujita-san much.
IIUC, I think two issues are discussed in the thread [1]: (a) there is
currently no way to define the set of meaningful system columns for a
partitioned table that contains pluggable storages other than standard
heap, and (b) even in the case where the set is defined as before,
like partitioned tables that don’t contain any pluggable storages,
system column values are not obtained from a tuple inserted into a
partitioned table in cases as reported in [1], because virtual slots
are assigned for partitioned tables [2][3]. (I think the latter is
the original issue in the thread, though.)
Right, (b) can be solved by using a leaf partition's tuple slot as
proposed, although (a) needs a long-term fix.
I think we could fix this update-tuple-routing-vs-RETURNING issue
without the fix for (a). But to transfer system column values in a
cleaner way, I think we need to fix (b) first so that we can always
obtain/copy them from the new tuple moved to another partition by
INSERT following DELETE.
Yes, I think you are right. Although I am still not sure how to
"copy" system columns from one partition's slot to another's, that is,
without assuming what they are.
(I think we could address this issue for v11 independently of [1], of course.)
Yeah.
--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
On Mon, Sep 14, 2020 at 10:45 PM Amit Langote <amitlangote09@gmail.com> wrote:
On Mon, Sep 14, 2020 at 5:56 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
IIUC, I think two issues are discussed in the thread [1]: (a) there is
currently no way to define the set of meaningful system columns for a
partitioned table that contains pluggable storages other than standard
heap, and (b) even in the case where the set is defined as before,
like partitioned tables that don’t contain any pluggable storages,
system column values are not obtained from a tuple inserted into a
partitioned table in cases as reported in [1], because virtual slots
are assigned for partitioned tables [2][3]. (I think the latter is
the original issue in the thread, though.)
Right, (b) can be solved by using a leaf partition's tuple slot as
proposed.
You mean what is proposed in [3]?
I think we could fix this update-tuple-routing-vs-RETURNING issue
without the fix for (a). But to transfer system column values in a
cleaner way, I think we need to fix (b) first so that we can always
obtain/copy them from the new tuple moved to another partition by
INSERT following DELETE.
Yes, I think you are right. Although I am still not sure how to
"copy" system columns from one partition's slot to another's, that is,
without assuming what they are.
I just thought we assume that partitions support all the existing
system attributes until we have the fix for (a), i.e., the slot
assigned for a partition must have the getsysattr callback routine
from which we can fetch each existing system attribute of an underlying
tuple in the slot, regardless of whether that system attribute is used
for the partition or not.
(I think we could address this issue for v11 independently of [1], of course.)
Yeah.
I noticed that I modified your patch incorrectly. Sorry for that.
Attached is an updated patch fixing that. I also added a bit of code
to copy CTID from the new tuple moved to another partition, not just
table OID, and did some code/comment adjustments, mainly to match
other places. I also created a patch for PG11, which I am attaching
as well.
Best regards,
Etsuro Fujita
Attachments:
v5-0001-Fix-a-bug-with-RETURNING-when-UPDATE-moves-tuple-HEAD.patch (+144 -17)
v5-0001-Fix-a-bug-with-RETURNING-when-UPDATE-moves-tuple-PG11.patch (+151 -10)
Fujita-san,
On Sun, Sep 20, 2020 at 11:41 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Mon, Sep 14, 2020 at 10:45 PM Amit Langote <amitlangote09@gmail.com> wrote:
On Mon, Sep 14, 2020 at 5:56 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
IIUC, I think two issues are discussed in the thread [1]: (a) there is
currently no way to define the set of meaningful system columns for a
partitioned table that contains pluggable storages other than standard
heap, and (b) even in the case where the set is defined as before,
like partitioned tables that don’t contain any pluggable storages,
system column values are not obtained from a tuple inserted into a
partitioned table in cases as reported in [1], because virtual slots
are assigned for partitioned tables [2][3]. (I think the latter is
the original issue in the thread, though.)
Right, (b) can be solved by using a leaf partition's tuple slot as
proposed.
You mean what is proposed in [3]?
Yes. Although, I am for assigning a dedicated slot to partitions
*unconditionally*, whereas the PoC patch Andres shared makes it
conditional on either needing tuple conversion between the root and
the partition or having a RETURNING projection present in the query.
I think we could fix this update-tuple-routing-vs-RETURNING issue
without the fix for (a). But to transfer system column values in a
cleaner way, I think we need to fix (b) first so that we can always
obtain/copy them from the new tuple moved to another partition by
INSERT following DELETE.
Yes, I think you are right. Although I am still not sure how to
"copy" system columns from one partition's slot to another's, that is,
without assuming what they are.
I just thought we assume that partitions support all the existing
system attributes until we have the fix for (a), i.e., the slot
assigned for a partition must have the getsysattr callback routine
from which we can fetch each existing system attribute of an underlying
tuple in the slot, regardless of whether that system attribute is used
for the partition or not.
Hmm, to copy one slot's system attributes into another, we will also
need a way to "set" the system attributes in the destination slot.
But maybe I didn't fully understand what you said.
(I think we could address this issue for v11 independently of [1], off course.)
Yeah.
I noticed that I modified your patch incorrectly. Sorry for that.
Attached is an updated patch fixing that.
Ah, no problem. Thanks for updating.
I also added a bit of code
to copy CTID from the new tuple moved to another partition, not just
table OID, and did some code/comment adjustments, mainly to match
other places. I also created a patch for PG11, which I am attaching
as well.
In the patch for PG 11:
+ new_tuple->t_self = new_tuple->t_data->t_ctid =
+ old_tuple->t_self;
...
Should we add a one-line comment above this block of code to transfer
system attributes? Maybe: /* Also transfer the system attributes. */?
BTW, do you think we should alter the test in PG 11 branch to test
that system attributes are returned correctly? Once we settle the
other issue, we can adjust the HEAD's test to do likewise?
--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com