Multi Inserts in CREATE TABLE AS - revived patch
Hi,
I would like to propose an updated patch for multi/bulk inserts in CTAS
that tries to address the review comments that came up in the earlier
thread [1]. One of the main review comments was to calculate/estimate the
tuple size to decide when to flush. I tried to solve this with a new
function GetTupleSize() (see the patch for the implementation).
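For context, the buffering in the dest receiver looks roughly like below.
This is a simplified sketch, not the patch verbatim: mi_slots,
mi_slots_num and intorel_flush_multi_insert() follow the patch, while
MAX_MULTI_INSERT_TUPLES and mi_slots_size are illustrative names.

static bool
intorel_receive(TupleTableSlot *slot, DestReceiver *self)
{
    DR_intorel *myState = (DR_intorel *) self;
    TupleTableSlot *batchslot;
    Size sz;

    /* Estimate the incoming tuple's size to decide when to flush. */
    sz = GetTupleSize(slot, MAX_MULTI_INSERT_BUFFERED_BYTES);

    /* Materialize the tuple into a slot owned by the target table's AM. */
    if (myState->mi_slots[myState->mi_slots_num] == NULL)
        myState->mi_slots[myState->mi_slots_num] =
            table_slot_create(myState->rel, NULL);
    batchslot = myState->mi_slots[myState->mi_slots_num];
    ExecCopySlot(batchslot, slot);

    myState->mi_slots_num++;
    myState->mi_slots_size += sz;

    /* Flush once enough tuples or bytes have been buffered. */
    if (myState->mi_slots_num >= MAX_MULTI_INSERT_TUPLES ||
        myState->mi_slots_size >= MAX_MULTI_INSERT_BUFFERED_BYTES)
        intorel_flush_multi_insert(myState);

    return true;
}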
I did some testing with a custom configuration [2].
Use case 1- 100mn tuples, 2 integer columns, exec time in sec:
HEAD: *131.507* when the select part is not parallel, 128.832 when the
select part is parallel
Patch: 98.925 when the select part is not parallel, *52.901* when the
select part is parallel
Use case 2- 10mn tuples, 4 integer and 6 text columns, exec time in sec:
HEAD: *76.801* when the select part is not parallel, 66.074 when the select
part is parallel
Patch: 74.083 when the select part is not parallel, *57.739* when the
select part is parallel
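For reference, use case 1 can be reproduced with something along these
lines (table and column names are illustrative, not from the actual test
script):

create table t1 (a int, b int);
insert into t1 select i, i from generate_series(1, 100000000) i;
\timing on
create table t2 as select * from t1;
\timing off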
Thoughts?
If the approach followed in the patch looks okay, I can work on a separate
patch for multi inserts in refresh materialized view cases.
I thank Simon Riggs for the offlist discussion.
PS: I chose to start a new thread as the previous thread [1] was closed in
the CF. I hope that's not a problem.
[1]: /messages/by-id/CAEET0ZHRWxbRUgwzUK_tOFDWx7VE2-P=xMBT6-N+gAa9WQ=xxA@mail.gmail.com
[2]: The postgresql.conf used:
shared_buffers = 40GB
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v1-0001-Multi-Inserts-in-Create-Table-As.patch (+187 -35)
On Tue, Nov 3, 2020 at 4:54 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Use case 1- 100mn tuples, 2 integer columns, exec time in sec:
HEAD: 131.507 when the select part is not parallel, 128.832 when the
select part is parallel
Patch: 98.925 when the select part is not parallel, 52.901 when the
select part is parallel
Use case 2- 10mn tuples, 4 integer and 6 text columns, exec time in sec:
HEAD: 76.801 when the select part is not parallel, 66.074 when the select
part is parallel
Patch: 74.083 when the select part is not parallel, 57.739 when the
select part is parallel
I did some more testing with the v1 patch: execution time is in seconds,
each test is run 2 times, with a custom configuration [1].
Use case 3: 1 int and 1 text column, each row size 129 bytes, size of 1
text column 101 bytes, number of rows 100 million, size of heap file 12.9GB.
HEAD: 253.227, 259.575
Patch: 177.921, 174.196
We get ~1.4X better performance with the patch.
Use case 4: 1 int and 30 text columns. each row size 28108 bytes, size of 1
text column 932 bytes, number of rows 10000, size of heap file 281.08MB.
HEAD: 222.812, 218.837
Patch: 222.492, 222.295
We don't see much difference with and without the patch. Each time only 2
tuples (2*28108 = 56216 bytes < MAX_MULTI_INSERT_BUFFERED_BYTES (65535
bytes)) are buffered and flushed.
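For illustration, rows for these text-column use cases can be generated
with repeat(); a two-column sketch (the actual tests used 30 and 75 such
columns, and the exact names and values here are illustrative):

create table t1 (a int, c1 text, c2 text);
insert into t1 select i, repeat('x', 932), repeat('x', 932)
from generate_series(1, 10000) i;
create table t2 as select * from t1;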
Use case 5: 1 int and 75 text columns. each row size 70228 bytes, size of 1
text column 932 bytes, number of rows 10000, size of heap file 702.28MB.
HEAD: 554.709, 562.745
Patch: 553.378, 560.370
We don't see much difference with and without the patch. Since each row
size (70228 bytes) is bigger than MAX_MULTI_INSERT_BUFFERED_BYTES (65535
bytes), the multi insert code path is not taken; each row is inserted with
table_tuple_insert() itself.
Use case 6: 1 int and 1 text column. each row size 9205 bytes, size of 1
text column 9173 bytes, number of rows 10000, size of heap file 92.05MB.
HEAD: 70.583, 70.251
Patch: 72.633, 73.521
We see 2-3 seconds more with the patch. When I intentionally set the
computed tuple size to 0 (sz = 0) after GetTupleSize(), which means single
inserts happen, the results are 70.364, 70.406. Looks like the 2-3 seconds
of extra time is due to the multi insert code and happens with this use
case only. I think this should not be a problem as the difference is not
huge.
+ sz = GetTupleSize(slot, MAX_MULTI_INSERT_BUFFERED_BYTES);
+
+ sz = 0;
+
+ /* In case the computed tuple size is 0, we go for single inserts. */
+ if (sz != 0)
+ {
[1]: The postgresql.conf used:
shared_buffers = 40GB
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Nov 3, 2020 at 4:54 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
If the approach followed in the patch looks okay, I can work on a separate patch for multi inserts in refresh materialized view cases.
Hi, I'm attaching a v2 patch that has multi inserts for CTAS as well
as REFRESH MATERIALIZED VIEW.
I did some testing: exec time in seconds.
Use case 1: 1 int and 1 text column, each row size 129 bytes, size of
1 text column 101 bytes, number of rows 100 million, size of heap file
12.9GB.
HEAD: 220.733, 220.428
Patch: 151.923, 152.484
Thoughts?
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v2-0001-Multi-Inserts-in-CTAS-Refresh-Materialized-View.patch (+278 -40)
On Nov 9, 2020, at 6:41 PM, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
Hi, I'm attaching a v2 patch that has multi inserts for CTAS as well
as REFRESH MATERIALIZED VIEW. I did some testing: exec time in seconds.
HEAD: 220.733, 220.428
Patch: 151.923, 152.484
Thoughts?
<v2-0001-Multi-Inserts-in-CTAS-Refresh-Materialized-View.patch>
Thanks for doing this. There might be another solution - use raw insert
interfaces (i.e. raw_heap_insert()). Attached is a test (not formal) patch
that verifies this idea. raw_heap_insert() writes the page into the table
files directly and also writes an FPI xlog record when the tuples fill up
a whole page. This seems to be more efficient.
In addition, those raw write interfaces call smgrimmedsync() when raw
inserting finishes. This is because the writes bypass shared buffers, so a
CHECKPOINT plus crash could cause data corruption, since some FPI xlog
records cannot be replayed and those table files are not fsync-ed during
the crash. It seems that a sync request could be forwarded to the
checkpointer for each table segment file, and then we would not need to
call smgrimmedsync(). If the theory is correct, this should be a separate
patch. Anyway, I tested this idea as well by simply commenting out the
smgrimmedsync() call in heap_raw_insert_end() (a new function in the
attached patch), since forwarding an fsync request is light-weight.
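For readers following along, the control flow is roughly as below. This is
a sketch only: heap_raw_insert_begin(), heap_raw_insert() and
RawHeapInsertState are hypothetical names; only heap_raw_insert_end()
comes from the attached test patch.

static void
ctas_raw_insert_tuples(Relation rel, HeapTuple *tuples, int ntuples)
{
    /* Hypothetical: sets up a private page buffer for the relation. */
    RawHeapInsertState *ris = heap_raw_insert_begin(rel);

    for (int i = 0; i < ntuples; i++)
    {
        /*
         * Hypothetical: places the tuple on the private page; a full
         * page is written straight to the relation file with one FPI
         * xlog record, bypassing shared buffers.
         */
        heap_raw_insert(ris, tuples[i]);
    }

    /*
     * From the test patch: writes the final partial page and calls
     * smgrimmedsync(); forwarding fsync requests to the checkpointer
     * could replace that call, as described above.
     */
    heap_raw_insert_end(ris);
}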
I did some quick and simple testing. The test environment is a CentOS 6 VM
with 7G memory on my Mac laptop; -O3 gcc compiler option; shared_buffers
set to 2GB. I did not check whether parallel scanning is triggered by the
test query, and the data volume is not large, so the test time is short.
Here is the test script:
create table t1 (a int, b int, c int, d int);
insert into t1 select i,i,i,i from generate_series(1,10000000) i;
show shared_buffers;
\timing on
create table t2 as select * from t1;
\timing off
Here are the results:
HEAD (37d2ff380312):
Time: 5143.041 ms (00:05.143)
Multi insert patch:
Time: 4456.461 ms (00:04.456)
Raw insert (attached):
Time: 2317.453 ms (00:02.317)
Raw insert + no smgrimmedsync():
Time: 2103.610 ms (00:02.104)
From the above data, raw insert is better; forwarding sync should be able
to improve things further. (Note my laptop has an SSD, so on machines with
SATA/SAS I believe forwarding sync should be able to help more.)
I also tested removing smgrimmedsync() in the "vacuum full" code, which
uses raw insert as well. FYI.
HEAD:
Time: 3567.036 ms (00:03.567)
no smgrimmedsync:
Time: 3023.487 ms (00:03.023)
Raw insert could be used for CTAS & CREATE MATERIALIZED VIEW. For REFRESH
MATERIALIZED VIEW the code is a bit different; I did not spend more time
on this, so I am not sure raw insert could be used for that.
But I think the previous multi insert work could still be used in at least
"INSERT tbl SELECT …" (if the INSERT is a simple one, e.g. no trigger, no
index, etc).
Regards,
Paul
Attachments:
v1-0001-ctas-using-raw-insert.patch (+126 -2)
On Tue, Nov 10, 2020 at 10:17:15AM +0000, Paul Guo wrote:
Raw insert could be used for CTAS & CREATE MATERIALIZED VIEW. For REFRESH
MATERIALIZED VIEW the code is a bit different; I did not spend more time
on this, so I am not sure raw insert could be used for that.
But I think the previous multi insert work could still be used in at least
"INSERT tbl SELECT …" (if the INSERT is a simple one, e.g. no trigger, no
index, etc).
Note that I've started that idea on another thread:
https://commitfest.postgresql.org/30/2553/
- should INSERT SELECT use a BulkInsertState? (and multi_insert)
There's also this one:
https://commitfest.postgresql.org/30/2818/
- split copy.c, Heikki
--
Justin
On Tue, Nov 10, 2020 at 3:47 PM Paul Guo <guopa@vmware.com> wrote:
Thanks for doing this. There might be another solution - use raw insert interfaces (i.e. raw_heap_insert()).
Attached is the test (not formal) patch that verifies this idea. raw_heap_insert() writes the page into the
table files directly and also write the FPI xlog when the tuples filled up the whole page. This seems be
more efficient.
Thanks. Will the new raw_heap_insert() APIs scale well with parallelism
(i.e. extend the table in parallel)? The existing table_multi_insert()
API scales well; see, for instance, the benefit with parallel copy [1]
and parallel multi inserts in CTAS [2].
[1]: /messages/by-id/CALj2ACWeQVd-xoQZHGT01_33St4xPoZQibWz46o7jW1PE3XOqQ@mail.gmail.com
[2]: /messages/by-id/CALj2ACWFq6Z4_jd9RPByURB8-Y8wccQWzLf+0-Jg+KYT7ZO-Ug@mail.gmail.com
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
On Nov 13, 2020, at 7:21 PM, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
Thanks. Will the new raw_heap_insert() APIs scale well with parallelism
(i.e. extend the table in parallel)? The existing table_multi_insert()
API scales well; see, for instance, the benefit with parallel copy [1]
and parallel multi inserts in CTAS [2].
Yes, definitely some work needs to be done to make the raw heap insert
interfaces fit the parallel work, but it seems that there are no hard
blocking issues for this?
[1]: /messages/by-id/CALj2ACWeQVd-xoQZHGT01_33St4xPoZQibWz46o7jW1PE3XOqQ@mail.gmail.com
[2]: /messages/by-id/CALj2ACWFq6Z4_jd9RPByURB8-Y8wccQWzLf+0-Jg+KYT7ZO-Ug@mail.gmail.com
On Mon, Nov 16, 2020 at 8:02 PM Paul Guo <guopa@vmware.com> wrote:
Yes, definitely some work needs to be done to make the raw heap insert
interfaces fit the parallel work, but it seems that there are no hard
blocking issues for this?
I may be wrong here. If we were to allow the raw heap insert APIs to
handle parallelism, shouldn't we need some sort of shared memory to
allow coordination among workers? If we do so, at the end, aren't
these raw insert APIs equivalent to the current table_multi_insert()
API, which uses a separate shared ring buffer (bulk insert state) for
insertions?
And can we think of these raw insert APIs as behaving similarly to the
table_multi_insert() API for unlogged tables?
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attaching v2 patch, rebased on the latest master 17958972.
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v2-0001-Multi-Inserts-in-Create-Table-As.patch (+185 -25)
On 23/11/2020 11:15, Bharath Rupireddy wrote:
Attaching v2 patch, rebased on the latest master 17958972.
I just broke this again with commit c532d15ddd to split up copy.c.
Here's another rebased version.
- Heikki
Attachments:
v3-0001-Multi-Inserts-in-Create-Table-As.patch (+185 -30)
On Mon, Nov 23, 2020 at 3:26 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 23/11/2020 11:15, Bharath Rupireddy wrote:
Attaching v2 patch, rebased on the latest master 17958972.
I just broke this again with commit c532d15ddd to split up copy.c.
Here's another rebased version.
Thanks! I noticed that and am about to post a new patch. Anyways,
thanks for the rebased v3 patch. Attaching here v3 again for
visibility.
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v3-0001-Multi-Inserts-in-Create-Table-As.patch (+185 -30)
On 23-11-2020 11:23, Bharath Rupireddy wrote:
Thanks! I noticed that and am about to post a new patch. Anyways,
thanks for the rebased v3 patch. Attaching here v3 again for
visibility.
Hi,
Thanks for reviving the patch! I did unfortunately have to shift my
priorities somewhat and did not find much time to work on open source
things the last week(s).
I'm wondering about the use of the GetTupleSize function. As far as I
understand the idea is to limit the amount of buffered data, presumably
to not write too much data at once for intorel_flush_multi_insert.
If I understood correctly how it all works, the table slot can however
be of a different type than the source slot, which means that the call to
CopySlot() potentially stores a different amount of data than computed
by GetTupleSize(). Not sure if this is a big problem as an estimation
might be good enough?
Some other solutions/implementations would be:
- compute the size after doing CopySlot. Maybe the relation never wants
a virtual tuple and then you can also simplify GetTupleSize?
- after CopySlot ask for the memory consumed in the slot using
MemoryContextMemAllocated.
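A sketch of that second alternative (with the caveat, shown further down
in the thread, that the context's accounting covers far more than the
tuple itself):

Size sz;

ExecCopySlot(batchslot, slot);
/* Reports the context's total allocation, which over-counts the
 * tuple size considerably (see the gdb output below). */
sz = MemoryContextMemAllocated(batchslot->tts_mcxt, false);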
Some small things to maybe change are:
===========
+ if (myState->mi_slots[myState->mi_slots_num] == NULL)
+ {
+ batchslot = table_slot_create(myState->rel, NULL);
+ myState->mi_slots[myState->mi_slots_num] = batchslot;
+ }
+ else
+ batchslot = myState->mi_slots[myState->mi_slots_num];
Alternative:
+ if (myState->mi_slots[myState->mi_slots_num] == NULL)
+ myState->mi_slots[myState->mi_slots_num] =
table_slot_create(myState->rel, NULL);
+ batchslot = myState->mi_slots[myState->mi_slots_num];
==============
+ sz = att_align_nominal(sz, att->attalign);
This could be moved out of the if statement?
==============
Regards,
Luc
Swarm64
On Wed, Nov 25, 2020 at 2:11 PM Luc Vlaming <luc@swarm64.com> wrote:
Thanks for reviving the patch! I did unfortunately have to shift my
priorities somewhat and did not find much time to work on open source
things the last week(s).
Thanks for the comments.
I'm wondering about the use of the GetTupleSize function. As far as I
understand the idea is to limit the amount of buffered data, presumably
to not write too much data at once for intorel_flush_multi_insert.
If I understood correctly how it all works, the table slot can however
be of different type than the source slot, which makes that the call to
CopySlot() potentially stores a different amount of data than computed
by GetTupleSize(). Not sure if this is a big problem as an estimation
might be good enough?
Yeah. The tuple size may change after ExecCopySlot(). For instance, create
table t2 as select a1 from t1; where t1 has two integer columns a1, b1. I'm
creating t2 with a single column a1 from t1, which makes the source slot
virtual.
The source slot is virtual and the size calculated with GetTupleSize() is 8
bytes:
(gdb) p *slot
$18 = {type = T_TupleTableSlot, tts_flags = 16, tts_nvalid = 1,
tts_ops = 0x562c592652c0 <TTSOpsVirtual>,
tts_tupleDescriptor = 0x562c5a0409f0, tts_values = 0x562c5a040b50,
tts_isnull = 0x562c5a040b58, tts_mcxt = 0x562c5a040320, tts_tid = {
ip_blkid = {bi_hi = 65535, bi_lo = 65535}, ip_posid = 0}, tts_tableOid
= 0}
(gdb) call GetTupleSize(slot, 65535)
$24 = 8
After ExecCopySlot(batchslot, slot), the destination slot is of type
TTSOpsBufferHeapTuple and GetTupleSize() now gives 28 bytes.
(gdb) p *batchslot
$19 = {type = T_TupleTableSlot, tts_flags = 20, tts_nvalid = 0,
tts_ops = 0x562c592653e0 <TTSOpsBufferHeapTuple>,
tts_tupleDescriptor = 0x7f063fbeecd0, tts_values = 0x562c5a05daa8,
tts_isnull = 0x562c5a05dab0, tts_mcxt = 0x562c5a040320, tts_tid = {
ip_blkid = {bi_hi = 65535, bi_lo = 65535}, ip_posid = 0}, tts_tableOid
= 0}
(gdb) call GetTupleSize(batchslot, 65535)
$25 = 28
I think your suggestion to call GetTupleSize() on the destination slot
after ExecCopySlot() is right. I changed it in the v4 patch.
Some other solutions/implementations would be:
- compute the size after doing CopySlot. Maybe the relation never wants
a virtual tuple and then you can also simplify GetTupleSize?
I think we need to have TTSOpsVirtual code in GetTupleSize() because
table_slot_create() which gets called before ExecCopySlot() may create
virtual slots for cases such as views and partitioned tables. Though we can
not insert into views or partitioned tables using CTAS, I want
GetTupleSize() to be a generic function. Right now, I can not find other
use cases where GetTupleSize() can be used.
- after CopySlot ask for the memory consumed in the slot using
MemoryContextMemAllocated.
MemoryContextMemAllocated of the slot's tts_mcxt will always include extra
bytes, and those extra bytes are way more than the actual tuple bytes.
Also, most of the time ExecCopySlot() will just leave the dest slot
pointing at the same tts_mcxt as the src slot. For instance, for a single
row with a single integer column of 8 bytes, the mem_allocated is 49232
bytes. This is the reason we can not rely on mem_allocated.
(gdb) p slot->tts_mcxt -----> source slot
$22 = (MemoryContext) 0x562c5a040320
(gdb) p *slot->tts_mcxt
$20 = {type = T_AllocSetContext, isReset = false, allowInCritSection =
false,
*mem_allocated = 49232*, methods = 0x562c5926d560 <AllocSetMethods>,
parent = 0x562c59f97820, firstchild = 0x562c5a042330, prevchild = 0x0,
nextchild = 0x0, name = 0x562c590d3554 "ExecutorState", ident = 0x0,
reset_cbs = 0x0}
(gdb) p batchslot->tts_mcxt -----> destination slot after
ExecCopySlot().
$23 = (MemoryContext) 0x562c5a040320
(gdb) p *batchslot->tts_mcxt
$21 = {type = T_AllocSetContext, isReset = false, allowInCritSection =
false,
*mem_allocated = 49232*, methods = 0x562c5926d560 <AllocSetMethods>,
parent = 0x562c59f97820, firstchild = 0x562c5a042330, prevchild = 0x0,
nextchild = 0x0, name = 0x562c590d3554 "ExecutorState", ident = 0x0,
reset_cbs = 0x0}
Some small things to maybe change are:
===========
+ if (myState->mi_slots[myState->mi_slots_num] == NULL)
+ {
+ batchslot = table_slot_create(myState->rel, NULL);
+ myState->mi_slots[myState->mi_slots_num] = batchslot;
+ }
+ else
+ batchslot = myState->mi_slots[myState->mi_slots_num];
Alternative:
+ if (myState->mi_slots[myState->mi_slots_num] == NULL)
+ myState->mi_slots[myState->mi_slots_num] =
table_slot_create(myState->rel, NULL);
+ batchslot = myState->mi_slots[myState->mi_slots_num];
Changed.
==============
+ sz = att_align_nominal(sz, att->attalign);
This could be moved out of the if statement?
==============
I don't think we can change it. If we were to move it, then sz =
att_addlength_datum(sz, att->attlen, val), which takes the aligned sz, may
have problems like below:
Say att_align_nominal sets sz to 4 bytes, then att_addlength_datum takes
this 4 bytes and adds attlen to it. If we move att_align_nominal(sz,
att->attalign) out, then att_addlength_datum(sz, att->attlen, val) will not
consider the aligned bytes. We might have to add up the aligned bytes
separately for the else case. And also note that this code is derived from
tts_virtual_materialize(), where we have att_align_nominal inside both the
if and else blocks. I may be wrong here.
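For reference, the size computation in tts_virtual_materialize() that
GetTupleSize() mirrors looks like this (paraphrased from execTuples.c;
note that the alignment call sits inside both branches):

Size sz = 0;

for (int natt = 0; natt < desc->natts; natt++)
{
    Form_pg_attribute att = TupleDescAttr(desc, natt);
    Datum val;

    if (att->attbyval || slot->tts_isnull[natt])
        continue;

    val = slot->tts_values[natt];

    if (att->attlen == -1 &&
        VARATT_IS_EXTERNAL_EXPANDED(DatumGetPointer(val)))
    {
        /* Expanded datums are counted at their flattened size. */
        sz = att_align_nominal(sz, att->attalign);
        sz += EOH_get_flat_size(DatumGetEOHP(val));
    }
    else
    {
        /* Alignment must be applied before adding the datum length. */
        sz = att_align_nominal(sz, att->attalign);
        sz = att_addlength_datum(sz, att->attlen, val);
    }
}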
Attaching v4 patch. Consider it for further review.
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v4-0001-Multi-Inserts-in-Create-Table-As.patch (+187 -30)
On Thu, Nov 26, 2020 at 07:24:01AM +0530, Bharath Rupireddy wrote:
Yeah. The tuple size may change after ExecCopySlot(). For instance, create
table t2 as select a1 from t1; where t1 has two integer columns a1, b1. I'm
creating t2 with single column a1 from t1 which makes the source slot
virtual.
+inline Size
+GetTupleSize(TupleTableSlot *slot, Size maxsize)
+{
+ Size sz = 0;
+ HeapTuple tuple = NULL;
+
+ if (TTS_IS_HEAPTUPLE(slot))
+ tuple = ((HeapTupleTableSlot *) slot)->tuple;
+ else if(TTS_IS_BUFFERTUPLE(slot))
+ tuple = ((BufferHeapTupleTableSlot *) slot)->base.tuple;
+ else if(TTS_IS_MINIMALTUPLE(slot))
+ tuple = ((MinimalTupleTableSlot *) slot)->tuple;
There have been various talks about the methods we could use to
evaluate the threshold in bytes when evaluating that a flush can
happen, including the use of memory contexts, or even estimate the
size of the number of tuples. This one looks promising because it
seems exact, however for virtual slots I don't like much the fact that
you basically just extracted the parts of tts_virtual_materialize()
and stuck them in this routine. That's a recipe for future bugs if
the materialization logic changes. In short, I am surprised that this
calculation is not directly part of TupleTableSlotOps. What we'd want
is to get this information depending on the slot type dealt with, and
with your patch you would fail to handle any new slot type
introduced.
--
Michael
On Thu, Nov 26, 2020 at 9:55 AM Michael Paquier <michael@paquier.xyz> wrote:
+inline Size
+GetTupleSize(TupleTableSlot *slot, Size maxsize)
+{
+ Size sz = 0;
+ HeapTuple tuple = NULL;
+
+ if (TTS_IS_HEAPTUPLE(slot))
+ tuple = ((HeapTupleTableSlot *) slot)->tuple;
+ else if(TTS_IS_BUFFERTUPLE(slot))
+ tuple = ((BufferHeapTupleTableSlot *) slot)->base.tuple;
+ else if(TTS_IS_MINIMALTUPLE(slot))
+ tuple = ((MinimalTupleTableSlot *) slot)->tuple;
There have been various talks about the methods we could use to
evaluate the threshold in bytes when evaluating that a flush can
happen, including the use of memory contexts, or even estimate the
size of the number of tuples. This one looks promising because it
seems exact, however for virtual slots I don't like much the fact that
you basically just extracted the parts of tts_virtual_materialize()
and stuck them in this routine. That's a recipe for future bugs if
the materialization logic changes. In short, I am surprised that this
calculation is not directly part of TupleTableSlotOps. What we'd want
is to get this information depending on the slot type dealt with, and
with your patch you would fail to handle any new slot type
introduced.
Yes for virtual slots, I reused the code from
tts_virtual_materialize() in GetTupleSize(). I can think of below
options:
1) Make the size calculation code for virtual slots, a macro or a
static inline function and use that in tts_virtual_materialize() and
GetTupleSize().
2) Add comments in both the places, such as "if any code is changed
here, consider changing it in tts_virtual_materialize() /
GetTupleSize()"
3) Add a size variable to TupleTableSlotOps structure.
4) Add a new API to TupleTableSlotOps structure say get_slot_size().
5) For new slot types, maybe we can have comments in tuptable.h to
consider having equivalent change in GetTupleSize().
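To make option 4 concrete, it could look something like below;
get_slot_size is a hypothetical member name, not an existing API:

typedef struct TupleTableSlotOps
{
    /* ... existing callbacks (init, materialize, copyslot, ...) ... */

    /* Return the size in bytes of the slot's current tuple. */
    Size        (*get_slot_size) (TupleTableSlot *slot);
} TupleTableSlotOps;

/* GetTupleSize() would then reduce to a dispatch: */
static inline Size
GetTupleSize(TupleTableSlot *slot)
{
    return slot->tts_ops->get_slot_size(slot);
}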
If we go with 3 and 4, will it be acceptable to add the extra code in
generic structure which gets used in most of the code base and use
that new code only in limited places (for multi inserts in CTAS and
Refresh Mat View)? I think we can go ahead with 2 and 5. Thoughts?
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
On 26-11-2020 07:31, Bharath Rupireddy wrote:
1) Make the size calculation code for virtual slots, a macro or a
static inline function and use that in tts_virtual_materialize() and
GetTupleSize().
2) Add comments in both the places, such as "if any code is changed
here, consider changing it in tts_virtual_materialize() /
GetTupleSize()"
3) Add a size variable to TupleTableSlotOps structure.
4) Add a new API to TupleTableSlotOps structure say get_slot_size().
5) For new slot types, maybe we can have comments in tuptable.h to
consider having equivalent change in GetTupleSize().
I think we can go ahead with 2 and 5. Thoughts?
What I'm wondering about is the reason for wanting a cap on data volume.
When doing some local (highly concurrent) ingest speed tests a few weeks
ago it seemed to mostly matter how many pages were being written and the
resulting pressure on locks, etc. and not necessarily so much the actual
memory usage. I didn't collect proof on that though (yet). There was
however a very clearly observable contention point where with bigger
buffers the performance would not only stagnate but actually drop.
So what I'm kinda wondering is if we should worry more about the amount
of pages that are going to be written and maybe not so much about the
memory usage?
If this were to be the case then maybe we can consider improving the
current design, potentially in a follow-up patch? The problem I see is
that generically each tableam will have different choices to make on how
to buffer and flush multiple rows, given that a storage engine might
have more or less write amplification, a different way of extending a
relation, fsm use, etc.
Assuming we indeed want a per-tableam implementation, we could either:
- make multi_insert buffer the tuples itself and add a flush_multi_insert.
- add a new function called create_multi_insert which returns something
like a MultiInsertState, which, like a destreceiver, has a set of
callbacks to start, shutdown and insert.
With both solutions one part that to me seems appealing is that we
buffer the data in something that likely resembles the disk format very
much. Thoughts?
Regards,
Luc
Swarm64
On Thu, Nov 26, 2020 at 12:25 PM Luc Vlaming <luc@swarm64.com> wrote:
Assuming we indeed want a per-tableam implementation, we could either:
- make multi_insert buffer the tuples itself and add a flush_multi_insert.
- add a new function called create_multi_insert which returns something
like a MultiInsertState, which, like a destreceiver, has a set of
callbacks to start, shutdown and insert.
IMHO, I would like to go with your option 1, i.e. add a few APIs to the
TableAmRoutine structure. The advantage is that we could use these APIs
in at least 3 places, without much code duplication: 1) COPY, 2) CTAS,
and 3) REFRESH MATERIALIZED VIEW. I could roughly sketch the APIs in the
below way:
typedef struct MultiInsertStateData
{
MemoryContext micontext; /* A temporary memory context for
multi insert. */
BulkInsertStateData *bistate; /* Bulk insert state. */
TupleTableSlot **mislots; /* Array of buffered slots. */
uint32 nslots; /* Total number of buffered slots. */
uint64 nbytes; /* Flush buffers if the total tuple
size >= nbytes. */
int32 nused; /* Number of current buffered slots for
a multi insert batch. */
int64 nsize; /* Total tuple size for a multi insert
batch. */
} MultiInsertStateData;
/* Creates a temporary memory context, allocates the
MultiInsertStateData, BulkInsertStateData and initializes other
members. */
void (*begin_multi_insert) (Relation rel,
MultiInsertStateData **mistate, uint32 nslots, uint64 nbytes);
/* Buffers the input slot into mistate slots, computes the size of the
tuple, and adds it to the total tuple size of the buffered tuples, if
this size crosses mistate->nbytes, flush the buffered tuples into
table. For heapam, existing heap_multi_insert can be used. Once the
buffer is flushed, then micontext can be reset and buffered slots can
be cleared. */
void (*do_multi_insert) (Relation rel, MultiInsertStateData
*mistate, struct TupleTableSlot *slot, CommandId cid, int options);
/* Flush the buffered tuples if any. For heapam, existing
heap_multi_insert can be used. Deletes temporary memory context and
deallocates mistate. */
void (*end_multi_insert) (Relation rel,
MultiInsertStateData *mistate, CommandId cid, int options);
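A hypothetical usage sketch (not from a patch) of how the CTAS dest
receiver might drive these callbacks; MAX_BUFFERED_TUPLES,
MAX_BUFFERED_BYTES, cid and options stand in for the caller's actual
values:

MultiInsertStateData *mistate;

/* Startup: create the memory context, buffered slots and bulk insert
 * state. */
rel->rd_tableam->begin_multi_insert(rel, &mistate,
                                    MAX_BUFFERED_TUPLES,
                                    MAX_BUFFERED_BYTES);

/* Per tuple: buffers the slot; flushes internally whenever nused
 * reaches nslots or nsize crosses nbytes. */
rel->rd_tableam->do_multi_insert(rel, mistate, slot, cid, options);

/* Shutdown: flush any remaining buffered tuples and free mistate. */
rel->rd_tableam->end_multi_insert(rel, mistate, cid, options);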
Thoughts?
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Few things:
IIUC Andres mentioned similar kinds of APIs earlier in [1].
[1]: /messages/by-id/20200924024128.kyk3r5g7dnu3fxxx@alap3.anarazel.de
I would like to add some more info to one of the APIs:
typedef struct MultiInsertStateData
{
MemoryContext micontext; /* A temporary memory context for
multi insert. */
BulkInsertStateData *bistate; /* Bulk insert state. */
TupleTableSlot **mislots; /* Array of buffered slots. */
uint32 nslots; /* Total number of buffered slots. */
int64 nbytes; /* Flush buffers if the total tuple size >=
nbytes. */
int32 nused; /* Number of current buffered slots for a
multi insert batch. */
int64 nsize; /* Total tuple size for a multi insert batch.
*/
} MultiInsertStateData;
/* Creates a temporary memory context, allocates the MultiInsertStateData,
BulkInsertStateData and initializes other members. */
void (*begin_multi_insert) (Relation rel, MultiInsertStateData
**mistate, uint32 nslots, uint64 nbytes);
/* Buffers the input slot into mistate slots, computes the size of the
tuple, and adds it to the total buffer tuple size; if this size crosses
mistate->nbytes, flush the buffered tuples into the table. For heapam,
existing heap_multi_insert can be used. Once the buffer is flushed, then
the micontext can be reset and buffered slots can be cleared. *If nbytes
i.e. total tuple size of the batch is not given, tuple size is not
calculated, tuples are buffered until all the nslots are filled and then
flushed.* */
void (*do_multi_insert) (Relation rel, MultiInsertStateData
*mistate, struct TupleTableSlot *slot, CommandId cid, int options);
/* Flush the buffered tuples if any. For heapam, existing heap_multi_insert
can be used. Deletes temporary memory context and deallocates mistate. */
void (*end_multi_insert) (Relation rel, MultiInsertStateData
*mistate, CommandId cid, int options);
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
On 26-11-2020 12:36, Bharath Rupireddy wrote:
/* Buffers the input slot into mistate slots, computes the size of the
tuple, and adds it to the total buffer tuple size; if this size crosses
mistate->nbytes, flush the buffered tuples into the table. For heapam,
existing heap_multi_insert can be used. Once the buffer is flushed, then
the micontext can be reset and buffered slots can be cleared. *If nbytes
i.e. total tuple size of the batch is not given, tuple size is not
calculated, tuples are buffered until all the nslots are filled and then
flushed.* */
void (*do_multi_insert) (Relation rel, MultiInsertStateData
*mistate, struct TupleTableSlot *slot, CommandId cid, int options);
Looks all good to me, except for the nbytes part.
Could you explain to me what use case that supports? IMHO the tableam
can best decide itself that it's time to flush, based on its
implementation that e.g. considers how many pages to flush at a time and
such, etc? This means also that most of the fields of
MultiInsertStateData can be private as each tableam would return a
derivative of that struct (like with the destreceivers).
One thing I'm wondering is in which memory context the slots end up
being allocated. I'd assume we would want to keep the slots around
between flushes. If they are in the temporary context this might prove
problematic however?
Regards,
Luc
On Thu, Nov 26, 2020 at 5:34 PM Luc Vlaming <luc@swarm64.com> wrote:
Looks all good to me, except for the nbytes part.
Could you explain to me what use case that supports? IMHO the tableam
can best decide itself that it's time to flush, based on its
implementation that e.g. considers how many pages to flush at a time and
such, etc? This means also that most of the fields of
MultiInsertStateData can be private as each tableam would return a
derivative of that struct (like with the destreceivers).
nbytes is basically to support the following case: say the number of
tuples to buffer is 1000, and all the tuples are toasted with sizes of a
few hundred MB or even GB; do we want to wait until 1000 tuples are
buffered, in which case one query occupies 1000 * toasted tuple size of
memory? So, having a memory limit gives flexibility. Whether to use it or
not is up to the table AM implementation. And the existing copy code
(since it can know the tuple size after parsing the input data) uses this
mechanism to decide when to flush.
If nbytes is not used in a table AM, then the multi insert can wait until
the total number of tuples are buffered, however much memory they occupy.
IMO, we can retain nbytes for now to decide when to flush. Thoughts?
I wonder how the do_multi_insert() API can decide when to flush based on
the number of pages. Do we need to pass the maximum number of pages the
buffered tuples can occupy, and track the pages the currently buffered
tuples occupy, to decide when to flush? Or is it something that the
existing table AM infrastructure already supports? If we use the number
of pages to decide when to flush, how well does it work with parallel
inserts?
One thing I'm wondering is in which memory context the slots end up
being allocated. I'd assume we would want to keep the slots around
between flushes. If they are in the temporary context this might prove
problematic however?
I should not have used the word temporary; it actually is not
temporary. This memory context will be created in begin_multi_insert(),
all the buffered tuples are copied using this context, and it will be
reset at the end of each flush and reused. It can get destroyed at the
end in end_multi_insert(). I think we should do this even with the new
APIs implementation.
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com