non-bulk inserts and tuple routing

Started by Amit Langote, over 8 years ago · 43 messages · pgsql-hackers

#1 Amit Langote
Langote_Amit_f8@lab.ntt.co.jp

Hi.

I have a patch that rearranges the code around partition tuple routing,
such that allocation of per-partition objects (ResultRelInfo,
TupleConversionMap, etc.) is delayed until a given partition is actually
inserted into (i.e., a tuple is routed to it). I can see a good win for
non-bulk inserts with the patch, and the patch is implemented such that it
doesn't affect the bulk-insert case much.

Performance numbers:

* Uses following hash-partitioned table:

create table t1 (a int, b int) partition by hash (a);
create table t1_x partition of t1 for values with (modulus M, remainder R)
...

* Non-bulk insert uses the following code (insert 100,000 rows one-by-one):

do $$
begin
for i in 1..100000 loop
insert into t1 values (i, i+1);
end loop;
end; $$;
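The benchmark setup elides the per-partition CREATE TABLE statements with "..."; the exact partition names aren't given in the post. As a sketch of what that DDL looks like for an N-way hash-partitioned table (the `t1_0`, `t1_1`, ... names and the `hash_partition_ddl` helper are my own illustration):

```python
# Sketch: generate the DDL for an N-way hash-partitioned table like the
# one used in the benchmark above.  Partition names are illustrative.

def hash_partition_ddl(nparts):
    stmts = ["create table t1 (a int, b int) partition by hash (a);"]
    for r in range(nparts):
        stmts.append(
            f"create table t1_{r} partition of t1"
            f" for values with (modulus {nparts}, remainder {r});"
        )
    return stmts

ddl = hash_partition_ddl(8)  # one parent + 8 leaf partitions
```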

* Times in milliseconds:

#parts HEAD Patched

8 6216.300 4977.670
16 9061.388 6360.093
32 14081.656 8752.405
64 24887.110 13919.384
128 45926.251 24582.411
256 88088.084 45490.894

As you can see, performance can be as much as 2x better with the patch,
although the time taken still increases as the number of partitions grows,
because we still lock *all* partitions at the beginning.

* Bulk-inserting 100,000 rows using COPY:

copy t1 from '/tmp/t1.csv' csv;

* Times in milliseconds:

#parts HEAD Patched

8 458.301 450.875
16 409.271 510.723
32 500.960 612.003
64 430.687 795.046
128 449.314 565.786
256 493.171 490.187

Not much harm here, although the numbers are a bit noisy.

The patch is divided into 4 parts, the first 3 of which are refactoring
patches.

I know this patch will conflict severely with [1] and [2], so it's fine if
we consider applying these later. Will add this to the next CF.

Thanks,
Amit

[1]: https://commitfest.postgresql.org/16/1023/

[2]: https://commitfest.postgresql.org/16/1184/

Attachments:

0001-Teach-CopyFrom-to-use-ModifyTableState-for-tuple-rou.patch (text/plain; +42, -38)
0002-ExecFindPartition-refactoring.patch (text/plain; +9, -21)
0003-ExecSetupPartitionTupleRouting-refactoring.patch (text/plain; +33, -93)
0004-During-tuple-routing-initialize-per-partition-object.patch (text/plain; +180, -170)
#2 Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Amit Langote (#1)
Re: non-bulk inserts and tuple routing

On Tue, Dec 19, 2017 at 3:36 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

* Bulk-inserting 100,000 rows using COPY:

copy t1 from '/tmp/t1.csv' csv;

* Times in milliseconds:

#parts HEAD Patched

8 458.301 450.875
16 409.271 510.723
32 500.960 612.003
64 430.687 795.046
128 449.314 565.786
256 493.171 490.187

While the earlier numbers were monotonically increasing with the number of
partitions, these numbers aren't. For example, the number on HEAD with 8
partitions is higher than that with 128 partitions. That's
kind of weird; maybe something is wrong with the measurement. Do we see
similar instability when bulk-inserting into an unpartitioned table?
Also, the numbers for 64 partitions are really bad. That's almost
2x slower.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#3 Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Ashutosh Bapat (#2)
Re: non-bulk inserts and tuple routing

Hi Ashutosh.

On 2017/12/19 19:12, Ashutosh Bapat wrote:

On Tue, Dec 19, 2017 at 3:36 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

* Bulk-inserting 100,000 rows using COPY:

copy t1 from '/tmp/t1.csv' csv;

* Times in milliseconds:

#parts HEAD Patched

8 458.301 450.875
16 409.271 510.723
32 500.960 612.003
64 430.687 795.046
128 449.314 565.786
256 493.171 490.187

While the earlier numbers were monotonically increasing with the number of
partitions, these numbers aren't. For example, the number on HEAD with 8
partitions is higher than that with 128 partitions. That's
kind of weird; maybe something is wrong with the measurement.

In the bulk-insert case, we initialize partitions only once, because the
COPY that loads those 100,000 rows is executed just once.

Whereas in the non-bulk insert case, we initialize partitions (lock,
allocate various objects) 100,000 times, because that's how many times the
INSERT is executed, once for each of 100,000 rows to be inserted.

Without the patch, the object initialization occurs N times, where N is the
number of partitions. With the patch it occurs just once -- only for the
partition to which the row was routed. The time required, although smaller
with the patch, still increases monotonically, because the
patch didn't do anything about locking all partitions.

Does that make sense?
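The asymmetry Amit describes can be put into rough numbers with a toy cost model (my own illustration, not code from the patch): without the patch every single-row INSERT statement initializes objects for all N partitions, while with the patch each statement initializes only the one partition the row routes to.

```python
# Toy cost model for per-partition object setup (illustrative only).
# 'eager' mimics HEAD: every INSERT statement initializes all N partitions.
# 'lazy'  mimics the patch: each INSERT initializes only the target partition.
# Note: both still lock all N partitions per statement, which is why the
# total time keeps growing with N even with the patch.

def setup_counts(nrows, nparts):
    eager = nrows * nparts   # HEAD: N initializations per statement
    lazy = nrows * 1         # patch: one initialization per statement
    return eager, lazy

eager, lazy = setup_counts(100_000, 64)
```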

Do we see
similar instability when bulk-inserting into an unpartitioned table?
Also, the numbers for 64 partitions are really bad. That's almost
2x slower.

Sorry, as I said, the numbers I initially posted were a bit noisy. I just
re-ran that COPY against the patched version and got the following numbers:

#parts Patched

8 441.852
16 417.510
32 435.276
64 486.497
128 436.473
256 446.312

Thanks,
Amit

#4 Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Langote (#1)
Re: non-bulk inserts and tuple routing

On 2017/12/19 19:06, Amit Langote wrote:

Hi.

I have a patch that rearranges the code around partition tuple routing,
such that allocation of per-partition objects (ResultRelInfo,
TupleConversionMap, etc.) is delayed until a given partition is actually
inserted into (i.e., a tuple is routed to it). I can see a good win for
non-bulk inserts with the patch, and the patch is implemented such that it
doesn't affect the bulk-insert case much.

Performance numbers:

* Uses following hash-partitioned table:

create table t1 (a int, b int) partition by hash (a);
create table t1_x partition of t1 for values with (modulus M, remainder R)
...

* Non-bulk insert uses the following code (insert 100,000 rows one-by-one):

do $$
begin
for i in 1..100000 loop
insert into t1 values (i, i+1);
end loop;
end; $$;

* Times in milliseconds:

#parts HEAD Patched

8 6216.300 4977.670
16 9061.388 6360.093
32 14081.656 8752.405
64 24887.110 13919.384
128 45926.251 24582.411
256 88088.084 45490.894

As you can see, performance can be as much as 2x better with the patch,
although the time taken still increases as the number of partitions grows,
because we still lock *all* partitions at the beginning.

* Bulk-inserting 100,000 rows using COPY:

copy t1 from '/tmp/t1.csv' csv;

* Times in milliseconds:

#parts HEAD Patched

8 458.301 450.875
16 409.271 510.723
32 500.960 612.003
64 430.687 795.046
128 449.314 565.786
256 493.171 490.187

Not much harm here, although numbers are a bit noisy.

Patch is divided into 4, first 3 of which are refactoring patches.

I know this patch will conflict severely with [1] and [2], so it's fine if
we consider applying these later. Will add this to next CF.

I rebased the patches, since they started conflicting with a recently
committed patch [1].

Thanks,
Amit

[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=cc6337d2fed5

Attachments:

v2-0001-Teach-CopyFrom-to-use-ModifyTableState-for-tuple-.patch (text/plain; +18, -16)
v2-0002-ExecFindPartition-refactoring.patch (text/plain; +10, -21)
v2-0003-ExecSetupPartitionTupleRouting-refactoring.patch (text/plain; +14, -10)
v2-0004-During-tuple-routing-initialize-per-partition-obj.patch (text/plain; +188, -180)
#5 Robert Haas
robertmhaas@gmail.com
In reply to: Amit Langote (#4)
Re: non-bulk inserts and tuple routing

On Fri, Jan 19, 2018 at 3:56 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

I rebased the patches, since they started conflicting with a recently
committed patch [1].

I think that my latest commit has managed to break this pretty thoroughly.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6 Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Robert Haas (#5)
Re: non-bulk inserts and tuple routing

On 2018/01/20 7:07, Robert Haas wrote:

On Fri, Jan 19, 2018 at 3:56 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

I rebased the patches, since they started conflicting with a recently
committed patch [1].

I think that my latest commit has managed to break this pretty thoroughly.

I rebased it. Here are the performance numbers again.

* Uses following hash-partitioned table:

create table t1 (a int, b int) partition by hash (a);
create table t1_x partition of t1 for values with (modulus M, remainder R)
...

* Non-bulk insert uses the following code (insert 100,000 rows one-by-one):

do $$
begin
for i in 1..100000 loop
insert into t1 values (i, i+1);
end loop;
end; $$;

Times in milliseconds:

#parts HEAD Patched
8 6148.313 4938.775
16 8882.420 6203.911
32 14251.072 8595.068
64 24465.691 13718.161
128 45099.435 23898.026
256 87307.332 44428.126

* Bulk-inserting 100,000 rows using COPY:

copy t1 from '/tmp/t1.csv' csv;

Times in milliseconds:

#parts HEAD Patched

8 466.170 446.865
16 445.341 444.990
32 443.544 487.713
64 460.579 435.412
128 469.953 422.403
256 463.592 431.118

Thanks,
Amit

Attachments:

v3-0001-Teach-CopyFrom-to-use-ModifyTableState-for-tuple-.patch (text/plain; +15, -12)
v3-0002-ExecFindPartition-refactoring.patch (text/plain; +12, -21)
v3-0003-During-tuple-routing-initialize-per-partition-obj.patch (text/plain; +375, -298)
#7 Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Langote (#6)
Re: non-bulk inserts and tuple routing

On 2018/01/24 17:25, Amit Langote wrote:

On 2018/01/20 7:07, Robert Haas wrote:

On Fri, Jan 19, 2018 at 3:56 AM, Amit Langote wrote:

I rebased the patches, since they started conflicting with a recently
committed patch [1].

I think that my latest commit has managed to break this pretty thoroughly.

I rebased it. Here are the performance numbers again.

* Uses following hash-partitioned table:

create table t1 (a int, b int) partition by hash (a);
create table t1_x partition of t1 for values with (modulus M, remainder R)
...

* Non-bulk insert uses the following code (insert 100,000 rows one-by-one):

do $$
begin
for i in 1..100000 loop
insert into t1 values (i, i+1);
end loop;
end; $$;

Times in milliseconds:

#parts HEAD Patched
8 6148.313 4938.775
16 8882.420 6203.911
32 14251.072 8595.068
64 24465.691 13718.161
128 45099.435 23898.026
256 87307.332 44428.126

* Bulk-inserting 100,000 rows using COPY:

copy t1 from '/tmp/t1.csv' csv;

Times in milliseconds:

#parts HEAD Patched

8 466.170 446.865
16 445.341 444.990
32 443.544 487.713
64 460.579 435.412
128 469.953 422.403
256 463.592 431.118

Rebased again.

Thanks,
Amit

Attachments:

v4-0001-Teach-CopyFrom-to-use-ModifyTableState-for-tuple-.patch (text/plain; +15, -12)
v4-0002-ExecFindPartition-refactoring.patch (text/plain; +12, -21)
v4-0003-During-tuple-routing-initialize-per-partition-obj.patch (text/plain; +377, -299)
#8 Etsuro Fujita
fujita.etsuro@lab.ntt.co.jp
In reply to: Amit Langote (#7)
Re: non-bulk inserts and tuple routing

(2018/01/25 11:11), Amit Langote wrote:

Rebased again.

Thanks for the rebased patch!

The patches apply cleanly and compile successfully, but make check fails
in an assert-enabled build.

Best regards,
Etsuro Fujita

#9 Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Etsuro Fujita (#8)
Re: non-bulk inserts and tuple routing

Fujita-san,

Thanks for the review.

On 2018/01/25 18:30, Etsuro Fujita wrote:

(2018/01/25 11:11), Amit Langote wrote:

Rebased again.

Thanks for the rebased patch!

The patches apply cleanly and compile successfully, but make check fails
in an assert-enabled build.

Hmm, I can't seem to reproduce the failure with v4 patches I posted
earlier today.

=======================
All 186 tests passed.
=======================

Can you please post the errors you're seeing?

Thanks,
Amit

#10 Etsuro Fujita
fujita.etsuro@lab.ntt.co.jp
In reply to: Amit Langote (#9)
Re: non-bulk inserts and tuple routing

(2018/01/25 18:52), Amit Langote wrote:

On 2018/01/25 18:30, Etsuro Fujita wrote:

The patches apply cleanly and compile successfully, but make check fails
in an assert-enabled build.

Hmm, I can't seem to reproduce the failure with v4 patches I posted
earlier today.

=======================
All 186 tests passed.
=======================

Can you please post the errors you're seeing?

OK, will do.

Best regards,
Etsuro Fujita

#11 Etsuro Fujita
fujita.etsuro@lab.ntt.co.jp
In reply to: Amit Langote (#9)
Re: non-bulk inserts and tuple routing

(2018/01/25 18:52), Amit Langote wrote:

On 2018/01/25 18:30, Etsuro Fujita wrote:

The patches apply cleanly and compile successfully, but make check fails
in an assert-enabled build.

Hmm, I can't seem to reproduce the failure with v4 patches I posted
earlier today.

Can you please post the errors you're seeing?

Some quick debugging showed that the failure was due to a segmentation fault
caused by this change to ExecSetupPartitionTupleRouting (in patch
v4-0003-During-tuple-routing-initialize-per-partition-obj):

- bool is_update = false;

+ bool is_update;

I modified that patch to initialize is_update to false as before.
With the modified version, make check passed successfully.

I'll review the patch in more detail!

Best regards,
Etsuro Fujita

#12 Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Etsuro Fujita (#11)
Re: non-bulk inserts and tuple routing

Fujita-san,

On 2018/01/30 18:22, Etsuro Fujita wrote:

(2018/01/25 18:52), Amit Langote wrote:

On 2018/01/25 18:30, Etsuro Fujita wrote:

The patches apply cleanly and compile successfully, but make check fails
in an assert-enabled build.

Hmm, I can't seem to reproduce the failure with v4 patches I posted
earlier today.

Can you please post the errors you're seeing?

Some quick debugging showed that the failure was due to a segmentation fault
caused by this change to ExecSetupPartitionTupleRouting (in patch
v4-0003-During-tuple-routing-initialize-per-partition-obj):

-    bool        is_update = false;

+    bool        is_update;

I modified that patch to initialize is_update to false as before. With
the modified version, make check passed successfully.

Oops, my bad.

I'll review the patch in more detail!

Thank you. Will wait for your comments before sending a new version then.

Regards,
Amit

#13 Etsuro Fujita
fujita.etsuro@lab.ntt.co.jp
In reply to: Amit Langote (#12)
Re: non-bulk inserts and tuple routing

(2018/01/30 18:39), Amit Langote wrote:

Will wait for your comments before sending a new version then.

Ok, I'll post my comments as soon as possible.

Best regards,
Etsuro Fujita

#14 Etsuro Fujita
fujita.etsuro@lab.ntt.co.jp
In reply to: Etsuro Fujita (#13)
Re: non-bulk inserts and tuple routing

(2018/01/30 18:52), Etsuro Fujita wrote:

(2018/01/30 18:39), Amit Langote wrote:

Will wait for your comments before sending a new version then.

Ok, I'll post my comments as soon as possible.

* ExecInitPartitionResultRelInfo is called from ExecFindPartition, but
we could call it another way: in ExecInsert/CopyFrom, call it after
ExecFindPartition if the partition chosen by ExecFindPartition has
not been initialized yet. Maybe either would be OK, but I prefer the
latter because I think it would not only better divide the labor but
also fit better into the existing code in ExecInsert/CopyFrom.

* In ExecInitPartitionResultRelInfo:
+       /*
+        * Note that the entries in this list appear in no predetermined
+        * order as result of initializing partition result rels as and when
+        * they're needed.
+        */
+       estate->es_leaf_result_relations =
+               lappend(estate->es_leaf_result_relations,
+                       leaf_part_rri);

Is it OK to put this within the "if (leaf_part_rri == NULL)" block?

* In the same function:
+   /*
+    * Verify result relation is a valid target for INSERT.
+    */
+   CheckValidResultRel(leaf_part_rri, CMD_INSERT);

I think it would be better to leave the previous comments as-is here:

/*
* Verify result relation is a valid target for an INSERT. An
UPDATE
* of a partition-key becomes a DELETE+INSERT operation, so
this check
* is still required when the operation is CMD_UPDATE.
*/

* ExecInitPartitionResultRelInfo does the work other than the
initialization of ResultRelInfo for the chosen partition (eg, create a
tuple conversion map to convert a tuple routed to the partition from the
parent's type to the partition's). So I'd propose to rename that
function to eg, ExecInitPartition.

* CopyFrom is modified so that it calls ExecSetupPartitionTupleRouting
and ExecFindPartition with a mostly-dummy ModifyTableState node. I'm
not sure that is a good idea. My concern is that it might become a
headache in future development.

* Patches 0001 and 0002 are pretty small but can't be reviewed without
patch 0003. The total size of the three patches isn't that large,
so I think it would be better to put them together into a
single patch.
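To make the review comments above easier to follow, here is a toy model of the lazy-initialization shape under discussion (a Python sketch of my own, not the executor code): a partition's state is created the first time a tuple routes to it, and the rel is appended to the tracking list only inside that first-time branch.

```python
# Toy model of lazy per-partition initialization (illustrative only).
# partitions[i] stays None until a tuple is first routed to partition i.

class Router:
    def __init__(self, nparts):
        self.partitions = [None] * nparts
        self.leaf_result_relations = []   # stand-in for es_leaf_result_relations
        self.init_count = 0

    def route(self, key):
        i = key % len(self.partitions)
        if self.partitions[i] is None:          # first tuple for this partition
            self.partitions[i] = {"index": i}   # stand-in for per-partition setup
            self.leaf_result_relations.append(self.partitions[i])
            self.init_count += 1
        return self.partitions[i]

r = Router(8)
for key in [3, 3, 11, 3]:   # keys 3 and 11 both route to partition 3
    r.route(key)
```

However many rows are routed, each touched partition is initialized exactly once, which is the point of the patch.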

That's all for now. I'll continue to review the patches!

Best regards,
Etsuro Fujita

#15 Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Etsuro Fujita (#14)
Re: non-bulk inserts and tuple routing

Fujita-san,

Thank you for the review.

On 2018/02/02 19:56, Etsuro Fujita wrote:

(2018/01/30 18:52), Etsuro Fujita wrote:

(2018/01/30 18:39), Amit Langote wrote:

Will wait for your comments before sending a new version then.

Ok, I'll post my comments as soon as possible.

* ExecInitPartitionResultRelInfo is called from ExecFindPartition, but we
could call it another way: in ExecInsert/CopyFrom, call it after
ExecFindPartition if the partition chosen by ExecFindPartition has not
been initialized yet.  Maybe either would be OK, but I prefer the latter
because I think it would not only better divide the labor but also fit
better into the existing code in ExecInsert/CopyFrom.

I see no problem with that, so done that way.

* In ExecInitPartitionResultRelInfo:
+       /*
+        * Note that the entries in this list appear in no predetermined
+        * order as result of initializing partition result rels as and when
+        * they're needed.
+        */
+       estate->es_leaf_result_relations =
+               lappend(estate->es_leaf_result_relations,
+                       leaf_part_rri);

Is it OK to put this within the "if (leaf_part_rri == NULL)" block?

Good catch. I moved it outside the block. I was under the impression
that leaf result relations that were reused from the
mtstate->resultRelInfo array would have already been added to the list,
but it seems they are not.

* In the same function:
+   /*
+    * Verify result relation is a valid target for INSERT.
+    */
+   CheckValidResultRel(leaf_part_rri, CMD_INSERT);

I think it would be better to leave the previous comments as-is here:

        /*
         * Verify result relation is a valid target for an INSERT.  An UPDATE
         * of a partition-key becomes a DELETE+INSERT operation, so this check
         * is still required when the operation is CMD_UPDATE.
         */

Oops, my bad. Didn't notice that I had ended up removing the part about
UPDATE.

* ExecInitPartitionResultRelInfo does the work other than the
initialization of ResultRelInfo for the chosen partition (eg, create a
tuple conversion map to convert a tuple routed to the partition from the
parent's type to the partition's).  So I'd propose to rename that function
to eg, ExecInitPartition.

I went with ExecInitPartitionInfo.

* CopyFrom is modified so that it calls ExecSetupPartitionTupleRouting and
ExecFindPartition with a mostly-dummy ModifyTableState node.  I'm not sure
that is a good idea.  My concern is that it might become a headache in
future development.

OK, I removed those changes.

* Patches 0001 and 0002 are pretty small but can't be reviewed without
patch 0003.  The total size of the three patches isn't that large, so
I think it would be better to put them together into a single patch.

As I said above, I got rid of 0001. Then, I merged the
ExecFindPartition() refactoring patch 0002 into 0003.

The code in tupconv_map_for_subplan() currently assumes that it can rely
on all leaf partitions having been initialized. Since we're breaking that
assumption with this proposal, that needed to be changed. So the patch
contained some refactoring to make it not rely on that assumption. I
carved that out into a separate patch which can be applied and tested
before the main patch.

That's all for now.  I'll continue to review the patches!

Here is the updated version that contains two patches as described above.

Thanks,
Amit

Attachments:

v24-0001-Refactor-partition-tuple-conversion-maps-handlin.patch (text/plain; +73, -40)
v24-0002-During-tuple-routing-initialize-per-partition-ob.patch (text/plain; +302, -251)
#16 Etsuro Fujita
fujita.etsuro@lab.ntt.co.jp
In reply to: Amit Langote (#15)
Re: non-bulk inserts and tuple routing

(2018/02/05 14:34), Amit Langote wrote:

On 2018/02/02 19:56, Etsuro Fujita wrote:

* ExecInitPartitionResultRelInfo is called from ExecFindPartition, but we
could call it another way: in ExecInsert/CopyFrom, call it after
ExecFindPartition if the partition chosen by ExecFindPartition has not
been initialized yet. Maybe either would be OK, but I prefer the latter
because I think it would not only better divide the labor but also fit
better into the existing code in ExecInsert/CopyFrom.

I see no problem with that, so done that way.

Thanks.

* In ExecInitPartitionResultRelInfo:
+       /*
+        * Note that the entries in this list appear in no predetermined
+        * order as result of initializing partition result rels as and when
+        * they're needed.
+        */
+       estate->es_leaf_result_relations =
+               lappend(estate->es_leaf_result_relations,
+                       leaf_part_rri);

Is it OK to put this within the "if (leaf_part_rri == NULL)" block?

Good catch. I moved it outside the block. I was under the impression
that leaf result relations that were reused from the
mtstate->resultRelInfo array would have already been added to the list,
but it seems they are not.

I commented on this because the update-tuple-routing patch has added to the
list ResultRelInfos reused from the mtstate->resultRelInfo array, but
on reflection I noticed this would cause an oddity in reporting execution
stats for partitions' triggers in EXPLAIN ANALYZE. Here is an example
using HEAD:

postgres=# create table trigger_test (a int, b text) partition by list (a);
CREATE TABLE
postgres=# create table trigger_test1 partition of trigger_test for
values in (1);
CREATE TABLE
postgres=# create table trigger_test2 partition of trigger_test for
values in (2);
CREATE TABLE
postgres=# create trigger before_upd_row_trig before update on
trigger_test1 for
each row execute procedure trigger_data (23, 'skidoo');
CREATE TRIGGER
postgres=# create trigger before_del_row_trig before delete on
trigger_test1 for
each row execute procedure trigger_data (23, 'skidoo');
CREATE TRIGGER
postgres=# insert into trigger_test values (1, 'foo');
INSERT 0 1
postgres=# explain analyze update trigger_test set a = 2 where a = 1;
NOTICE: before_upd_row_trig(23, skidoo) BEFORE ROW UPDATE ON trigger_test1
NOTICE: OLD: (1,foo),NEW: (2,foo)
NOTICE: before_del_row_trig(23, skidoo) BEFORE ROW DELETE ON trigger_test1
NOTICE: OLD: (1,foo)
                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Update on trigger_test  (cost=0.00..25.88 rows=6 width=42) (actual time=2.337..2.337 rows=0 loops=1)
   Update on trigger_test1
   ->  Seq Scan on trigger_test1  (cost=0.00..25.88 rows=6 width=42) (actual time=0.009..0.011 rows=1 loops=1)
         Filter: (a = 1)
 Planning time: 0.186 ms
 Trigger before_del_row_trig on trigger_test1: time=0.495 calls=1
 Trigger before_upd_row_trig on trigger_test1: time=0.870 calls=1
 Trigger before_del_row_trig on trigger_test1: time=0.495 calls=1
 Trigger before_upd_row_trig on trigger_test1: time=0.870 calls=1
 Execution time: 2.396 ms
(10 rows)

Both trigger stats, for the on-update and on-delete triggers, are shown
twice in the above output. The reason would be that
ExplainPrintTriggers called report_triggers twice for trigger_test1's
ResultRelInfo: once for it from queryDesc->estate->es_result_relations
and once for it from queryDesc->estate->es_leaf_result_relations. I
don't think this is intended behavior, so I think we should fix it. I
think we could probably address this by modifying ExecInitPartitionInfo
in your patch so that it doesn't add to the es_leaf_result_relations
list ResultRelInfos reused from the mtstate->resultRelInfo array, as in
your previous version of the patch. (ExecGetTriggerResultRel looks at
the list too, but it would probably work well with this change.) It
might be better to address this in another patch, though.

* In the same function:
+   /*
+    * Verify result relation is a valid target for INSERT.
+    */
+   CheckValidResultRel(leaf_part_rri, CMD_INSERT);

I think it would be better to leave the previous comments as-is here:

/*
* Verify result relation is a valid target for an INSERT. An UPDATE
* of a partition-key becomes a DELETE+INSERT operation, so this
check
* is still required when the operation is CMD_UPDATE.
*/

Oops, my bad. Didn't notice that I had ended up removing the part about
UPDATE.

OK.

* ExecInitPartitionResultRelInfo does the work other than the
initialization of ResultRelInfo for the chosen partition (eg, create a
tuple conversion map to convert a tuple routed to the partition from the
parent's type to the partition's). So I'd propose to rename that function
to eg, ExecInitPartition.

I went with ExecInitPartitionInfo.

Fine with me.

* CopyFrom is modified so that it calls ExecSetupPartitionTupleRouting and
ExecFindPartition with a mostly-dummy ModifyTableState node. I'm not sure
that is a good idea. My concern is that it might become a headache in
future development.

OK, I removed those changes.

Thanks.

* Patches 0001 and 0002 are pretty small but can't be reviewed without
patch 0003. The total size of the three patches isn't that large, so
I think it would be better to put them together into a single patch.

As I said above, I got rid of 0001. Then, I merged the
ExecFindPartition() refactoring patch 0002 into 0003.

The code in tupconv_map_for_subplan() currently assumes that it can rely
on all leaf partitions having been initialized. Since we're breaking that
assumption with this proposal, that needed to be changed. So the patch
contained some refactoring to make it not rely on that assumption. I
carved that out into a separate patch which can be applied and tested
before the main patch.

OK, will review that patch separately.

Here is the updated version that contains two patches as described above.

Thanks for updating the patches! I'll post my next comments in a few days.

Best regards,
Etsuro Fujita

#17 Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Etsuro Fujita (#16)
Re: non-bulk inserts and tuple routing

On 2018/02/05 19:43, Etsuro Fujita wrote:

(2018/02/05 14:34), Amit Langote wrote:

On 2018/02/02 19:56, Etsuro Fujita wrote:

* In ExecInitPartitionResultRelInfo:
+       /*
+        * Note that the entries in this list appear in no predetermined
+        * order as result of initializing partition result rels as and when
+        * they're needed.
+        */
+       estate->es_leaf_result_relations =
+               lappend(estate->es_leaf_result_relations,
+                       leaf_part_rri);

Is it OK to put this within the "if (leaf_part_rri == NULL)" block?

Good catch.  I moved it outside the block.  I was under the impression
that leaf result relations that were reused from the
mtstate->resultRelInfo array would have already been added to the list,
but it seems they are not.

I commented on this because the update-tuple-routing patch has added to the
list ResultRelInfos reused from the mtstate->resultRelInfo array, but on
reflection I noticed this would cause an oddity in reporting execution stats
for partitions' triggers in EXPLAIN ANALYZE.  Here is an example using
HEAD:

postgres=# create table trigger_test (a int, b text) partition by list (a);
CREATE TABLE
postgres=# create table trigger_test1 partition of trigger_test for values
in (1);
CREATE TABLE
postgres=# create table trigger_test2 partition of trigger_test for values
in (2);
CREATE TABLE
postgres=# create trigger before_upd_row_trig before update on
trigger_test1 for
 each row execute procedure trigger_data (23, 'skidoo');
CREATE TRIGGER
postgres=# create trigger before_del_row_trig before delete on
trigger_test1 for
 each row execute procedure trigger_data (23, 'skidoo');
CREATE TRIGGER
postgres=# insert into trigger_test values (1, 'foo');
INSERT 0 1
postgres=# explain analyze update trigger_test set a = 2 where a = 1;
NOTICE:  before_upd_row_trig(23, skidoo) BEFORE ROW UPDATE ON trigger_test1
NOTICE:  OLD: (1,foo),NEW: (2,foo)
NOTICE:  before_del_row_trig(23, skidoo) BEFORE ROW DELETE ON trigger_test1
NOTICE:  OLD: (1,foo)
                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Update on trigger_test  (cost=0.00..25.88 rows=6 width=42) (actual time=2.337..2.337 rows=0 loops=1)
   Update on trigger_test1
   ->  Seq Scan on trigger_test1  (cost=0.00..25.88 rows=6 width=42) (actual time=0.009..0.011 rows=1 loops=1)
         Filter: (a = 1)
 Planning time: 0.186 ms
 Trigger before_del_row_trig on trigger_test1: time=0.495 calls=1
 Trigger before_upd_row_trig on trigger_test1: time=0.870 calls=1
 Trigger before_del_row_trig on trigger_test1: time=0.495 calls=1
 Trigger before_upd_row_trig on trigger_test1: time=0.870 calls=1
 Execution time: 2.396 ms
(10 rows)

Both trigger stats, for the on-update and on-delete triggers, are shown
twice in the above output.  The reason would be that ExplainPrintTriggers
called report_triggers twice for trigger_test1's ResultRelInfo: once for
it from queryDesc->estate->es_result_relations and once for it from
queryDesc->estate->es_leaf_result_relations.  I don't think this is
intended behavior, so I think we should fix it.  I think we could
probably address this by modifying ExecInitPartitionInfo in your patch so
that it doesn't add to the es_leaf_result_relations list ResultRelInfos
reused from the mtstate->resultRelInfo array, as in your previous version
of the patch.  (ExecGetTriggerResultRel looks at the list too, but it
would probably work well with this change.)  It might be better to
address this in another patch, though.

I see. Thanks for the analysis and the explanation.

Seeing as this bug exists in HEAD, as you also seem to be saying, we'd
need to fix it independently of the patches on this thread. I've posted a
patch in another thread titled "update tuple routing and triggers".

Thanks,
Amit

#18Etsuro Fujita
fujita.etsuro@lab.ntt.co.jp
In reply to: Etsuro Fujita (#16)
Re: non-bulk inserts and tuple routing

(2018/02/05 19:43), Etsuro Fujita wrote:

(2018/02/05 14:34), Amit Langote wrote:

The code in tupconv_map_for_subplan() currently assumes that it can rely
on all leaf partitions having been initialized.

On reflection I noticed this analysis is not 100% correct; I think what
that function actually assumes is that all subplans' partitions have
already been initialized, not all leaf partitions.

Since we're breaking that
assumption with this proposal, that needed to be changed. So the patch
contained some refactoring to make it not rely on that assumption.

I don't think we really need this refactoring: since, as in another
patch you posted, we could initialize all subplans' partitions in
ExecSetupPartitionTupleRouting, tupconv_map_for_subplan could be called
without any changes to that function, because of what I said above.

Here is the updated version that contains two patches as described above.

Thanks for updating the patches! I'll post my next comments in a few days.

Here are comments for the other patch (patch
v24-0002-During-tuple-routing-initialize-per-partition-ob.patch):

o On changes to ExecSetupPartitionTupleRouting:

* The comment below wouldn't be correct; no ExecInitResultRelInfo in the
patch.

-   proute->partitions = (ResultRelInfo **) palloc(proute->num_partitions *
-                                                  sizeof(ResultRelInfo *));
+   /*
+    * Actual ResultRelInfo's and TupleConversionMap's are allocated in
+    * ExecInitResultRelInfo().
+    */
+   proute->partitions = (ResultRelInfo **) palloc0(proute->num_partitions *
+                                                   sizeof(ResultRelInfo *));

* The patch removes this from the initialization step for a subplan's
partition, but I think it would be better to keep this here because I
think it's a good thing to put the initialization stuff together into
one place.

- /*
- * This is required in order to we convert the partition's
- * tuple to be compatible with the root partitioned table's
- * tuple descriptor. When generating the per-subplan result
- * rels, this was not set.
- */
- leaf_part_rri->ri_PartitionRoot = rel;

* I think it would be better to keep this comment here.

- /* Remember the subplan offset for this ResultRelInfo */

* Why is this removed from that initialization?

- proute->partitions[i] = leaf_part_rri;

o On changes to ExecInitPartitionInfo:

* I don't understand the step starting from this, but I'm wondering if
that step can be removed by keeping the above setup of
proute->partitions for the subplan's partition in
ExecSetupPartitionTupleRouting.

+   /*
+    * If we are doing tuple routing for update, try to reuse the
+    * per-subplan resultrel for this partition that ExecInitModifyTable()
+    * might already have created.
+    */
+   if (mtstate && mtstate->operation == CMD_UPDATE)

That's all I have for now.

Best regards,
Etsuro Fujita

#19Etsuro Fujita
fujita.etsuro@lab.ntt.co.jp
In reply to: Etsuro Fujita (#18)
Re: non-bulk inserts and tuple routing

(2018/02/07 19:36), Etsuro Fujita wrote:

(2018/02/05 19:43), Etsuro Fujita wrote:

(2018/02/05 14:34), Amit Langote wrote:

Here is the updated version that contains two patches as described
above.

Here are some minor comments:

o On changes to ExecInsert

* This might be just my taste, but I think it would be better to (1)
change ExecInitPartitionInfo so that it creates and returns a
newly-initialized ResultRelInfo for an initialized partition, and (2)
rewrite this bit:

+       /* Initialize partition info, if not done already. */
+       ExecInitPartitionInfo(mtstate, resultRelInfo, proute, estate,
+                             leaf_part_index);
+
         /*
          * Save the old ResultRelInfo and switch to the one corresponding to
          * the selected partition.
          */
         saved_resultRelInfo = resultRelInfo;
         resultRelInfo = proute->partitions[leaf_part_index];
+       Assert(resultRelInfo != NULL);

to something like this (I would say the same thing about the copy.c changes):

        /*
         * Save the old ResultRelInfo and switch to the one corresponding to
         * the selected partition.
         */
        saved_resultRelInfo = resultRelInfo;
        resultRelInfo = proute->partitions[leaf_part_index];
        if (resultRelInfo == NULL);
        {
            /* Initialize partition info. */
            resultRelInfo = ExecInitPartitionInfo(mtstate,
                                                  saved_resultRelInfo,
                                                  proute,
                                                  estate,
                                                  leaf_part_index);
        }

This would make ExecInitPartitionInfo simpler because it could assume
that the given partition has not been initialized yet.

o On changes to execPartition.h

* Please add a brief description of partition_oids to the comments
for this struct.

@@ -91,6 +91,7 @@ typedef struct PartitionTupleRouting
 {
     PartitionDispatch *partition_dispatch_info;
     int num_dispatch;
+    Oid *partition_oids;

That's it.

Best regards,
Etsuro Fujita

#20Robert Haas
robertmhaas@gmail.com
In reply to: Etsuro Fujita (#19)
Re: non-bulk inserts and tuple routing

On Thu, Feb 8, 2018 at 5:16 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

if (resultRelInfo == NULL);
{
    /* Initialize partition info. */
    resultRelInfo = ExecInitPartitionInfo(mtstate,
                                          saved_resultRelInfo,
                                          proute,
                                          estate,
                                          leaf_part_index);
}

I'm pretty sure that code has one semicolon too many.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#21Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Etsuro Fujita (#19)
#22Etsuro Fujita
fujita.etsuro@lab.ntt.co.jp
In reply to: Robert Haas (#20)
#23Etsuro Fujita
fujita.etsuro@lab.ntt.co.jp
In reply to: Amit Langote (#21)
#24Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Etsuro Fujita (#23)
#25Etsuro Fujita
fujita.etsuro@lab.ntt.co.jp
In reply to: Amit Langote (#24)
#26Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Etsuro Fujita (#25)
#27Etsuro Fujita
fujita.etsuro@lab.ntt.co.jp
In reply to: Amit Langote (#26)
#28Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Etsuro Fujita (#27)
#29Etsuro Fujita
fujita.etsuro@lab.ntt.co.jp
In reply to: Amit Langote (#28)
#30Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Etsuro Fujita (#29)
#31Etsuro Fujita
fujita.etsuro@lab.ntt.co.jp
In reply to: Amit Langote (#30)
#32Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Etsuro Fujita (#31)
#33Etsuro Fujita
fujita.etsuro@lab.ntt.co.jp
In reply to: Amit Langote (#32)
#34Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Etsuro Fujita (#33)
#35Robert Haas
robertmhaas@gmail.com
In reply to: Amit Langote (#34)
#36Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#35)
#37Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#36)
#38Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Alvaro Herrera (#36)
#39Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Robert Haas (#35)
#40Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#35)
#41Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#35)
#42Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Andres Freund (#40)
#43Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Andres Freund (#41)