Inserting heap tuples in bulk in COPY
COPY is slow. Let's make it faster. One obvious optimization is to
insert heap tuples in bigger chunks, instead of calling heap_insert()
separately for every tuple. That saves the overhead of pinning and
locking the buffer for every tuple, and you only need to write one WAL
record for all the tuples written to the same page, instead of one for
each tuple.
Attached is a WIP patch to do that. It adds a new function,
heap_multi_insert, which does the same thing as heap_insert, but works
in bulk. It takes an array of tuples as argument, and tries to cram as
many of them into the chosen target page as it can, and only writes a
single WAL record of the operation.
This gives a significant speedup to COPY, particularly for narrow
tables, with small tuples. Grouping multiple tuples into one WAL record
reduces the WAL volume significantly, and the time spent in writing that
WAL. The reduced overhead of repeatedly locking the buffer is also most
noticeable on narrow tables. On wider tables, the effects are smaller.
See copytest-results.txt, containing test results with three tables of
different widths. The scripts used to get those numbers are also attached.
Triggers complicate this. I believe it is only safe to group tuples
together like this if the table has no triggers. A BEFORE ROW trigger
might run a SELECT on the table being copied to, and check if some of
the tuples we're about to insert exist. If we run BEFORE ROW triggers
for a bunch of tuples first, and only then insert them, none of the
trigger invocations will see the other rows as inserted yet. Similarly,
if we run AFTER ROW triggers after inserting a bunch of tuples, the
trigger for each of the insertions would see all the inserted rows. So
at least for now, the patch simply falls back to inserting one row at a
time if there are any triggers on the table.
The patch is WIP, mainly because I didn't write the WAL replay routines
yet, but please let me know if you see any issues.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Fri, Aug 12, 2011 at 3:16 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:
COPY is slow.
No kidding!
So at least for now, the patch simply falls back to inserting one row at a
time if there are any triggers on the table.
Maybe we want to change that to "fall back to old ways if there are any FOR
EACH ROW triggers", since FOR EACH STATEMENT triggers won't be bothered by
this optimization.
--
Gurjeet Singh
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
On Aug12, 2011, at 21:16 , Heikki Linnakangas wrote:
Triggers complicate this. I believe it is only safe to group tuples together like this if the table has no triggers. A BEFORE ROW trigger might run a SELECT on the table being copied to, and check if some of the tuples we're about to insert exist. If we run BEFORE ROW triggers for a bunch of tuples first, and only then insert them, none of the trigger invocations will see the other rows as inserted yet. Similarly, if we run AFTER ROW triggers after inserting a bunch of tuples, the trigger for each of the insertions would see all the inserted rows.
Don't we run AFTER ROW triggers after inserting *all* the tuples anyway? At least this is what we do in the case of INSERT/UPDATE/DELETE if I'm not mistaken.
best regards,
Florian Pflug
On 12.08.2011 22:57, Florian Pflug wrote:
On Aug12, 2011, at 21:16 , Heikki Linnakangas wrote:
Triggers complicate this. I believe it is only safe to group tuples together like this if the table has no triggers. A BEFORE ROW trigger might run a SELECT on the table being copied to, and check if some of the tuples we're about to insert exist. If we run BEFORE ROW triggers for a bunch of tuples first, and only then insert them, none of the trigger invocations will see the other rows as inserted yet. Similarly, if we run AFTER ROW triggers after inserting a bunch of tuples, the trigger for each of the insertions would see all the inserted rows.
Don't we run AFTER ROW triggers after inserting *all* the tuples anyway? At least this is what we do in the case of INSERT/UPDATE/DELETE if I'm not mistaken.
Um, yes, you're right. Now I feel silly. The above still applies to
BEFORE ROW triggers, though.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Fri, Aug 12, 2011 at 3:16 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
COPY is slow. Let's make it faster. One obvious optimization is to insert
heap tuples in bigger chunks, instead of calling heap_insert() separately
for every tuple. That saves the overhead of pinning and locking the buffer
for every tuple, and you only need to write one WAL record for all the
tuples written to the same page, instead of one for each tuple.

Attached is a WIP patch to do that. It adds a new function,
heap_multi_insert, which does the same thing as heap_insert, but works in
bulk. It takes an array of tuples as argument, and tries to cram as many of
them into the chosen target page as it can, and only writes a single WAL
record of the operation.

This gives a significant speedup to COPY, particularly for narrow tables,
with small tuples. Grouping multiple tuples into one WAL record reduces the
WAL volume significantly, and the time spent in writing that WAL. The
reduced overhead of repeatedly locking the buffer is also most noticeable on
narrow tables. On wider tables, the effects are smaller. See
copytest-results.txt, containing test results with three tables of different
widths. The scripts used to get those numbers are also attached.

Triggers complicate this. I believe it is only safe to group tuples together
like this if the table has no triggers. A BEFORE ROW trigger might run a
SELECT on the table being copied to, and check if some of the tuples we're
about to insert exist. If we run BEFORE ROW triggers for a bunch of tuples
first, and only then insert them, none of the trigger invocations will see
the other rows as inserted yet. Similarly, if we run AFTER ROW triggers
after inserting a bunch of tuples, the trigger for each of the insertions
would see all the inserted rows. So at least for now, the patch simply falls
back to inserting one row at a time if there are any triggers on the table.

The patch is WIP, mainly because I didn't write the WAL replay routines yet,
but please let me know if you see any issues.
I thought about trying to do this at one point in the past, but I
couldn't figure out exactly how to make it work. I think the approach
you've taken here is good.
Aside from the point already raised about needing to worry only about
BEFORE ROW triggers, I don't see any showstoppers.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 08/12/2011 04:57 PM, Robert Haas wrote:
I thought about trying to do this at one point in the past, but I
couldn't figure out exactly how to make it work. I think the approach
you've taken here is good.

Aside from the point already raised about needing to worry only about
BEFORE ROW triggers, I don't see any showstoppers.
Yeah, this looks very promising indeed. Well done. In fact, I'm asking
myself how hard it will be to backport for one particular customer of
ours, for whom the WAL load is a major hotspot.
cheers
andrew
On Fri, Aug 12, 2011 at 8:16 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
COPY is slow. Let's make it faster. One obvious optimization is to insert
heap tuples in bigger chunks, instead of calling heap_insert() separately
for every tuple. That saves the overhead of pinning and locking the buffer
for every tuple, and you only need to write one WAL record for all the
tuples written to the same page, instead of one for each tuple.
We don't pin the buffer for every tuple; that optimisation is already done...
When we discussed this before you said that it wasn't worth trying to
do this additional work - it was certainly a smaller gain than the one
we achieved by removing the pinning overhead.
Also, we discussed that you would work on buffering the index inserts,
which is where the main problem lies. The main heap is only a small
part of the overhead if we have multiple indexes already built on a
table - which is the use case that causes the most problem.
So I'm a little surprised to see you working on this and I'm guessing
that the COPY improvement with indexes is barely noticeable. This
would be a nice improvement, but not until the bulk index inserts are
done.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Aug 12, 2011 at 2:16 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
COPY is slow. Let's make it faster. One obvious optimization is to insert
heap tuples in bigger chunks, instead of calling heap_insert() separately
for every tuple. That saves the overhead of pinning and locking the buffer
for every tuple, and you only need to write one WAL record for all the
tuples written to the same page, instead of one for each tuple.

Attached is a WIP patch to do that. It adds a new function,
heap_multi_insert, which does the same thing as heap_insert, but works in
bulk. It takes an array of tuples as argument, and tries to cram as many of
them into the chosen target page as it can, and only writes a single WAL
record of the operation.

This gives a significant speedup to COPY, particularly for narrow tables,
with small tuples. Grouping multiple tuples into one WAL record reduces the
WAL volume significantly, and the time spent in writing that WAL. The
reduced overhead of repeatedly locking the buffer is also most noticeable on
narrow tables. On wider tables, the effects are smaller. See
copytest-results.txt, containing test results with three tables of different
widths. The scripts used to get those numbers are also attached.

Triggers complicate this. I believe it is only safe to group tuples together
like this if the table has no triggers. A BEFORE ROW trigger might run a
SELECT on the table being copied to, and check if some of the tuples we're
about to insert exist. If we run BEFORE ROW triggers for a bunch of tuples
first, and only then insert them, none of the trigger invocations will see
the other rows as inserted yet. Similarly, if we run AFTER ROW triggers
after inserting a bunch of tuples, the trigger for each of the insertions
would see all the inserted rows. So at least for now, the patch simply falls
back to inserting one row at a time if there are any triggers on the table.
But generic RI triggers would be ok, right?
merlin
On 13.08.2011 00:17, Simon Riggs wrote:
Also, we discussed that you would work on buffering the index inserts,
which is where the main problem lies. The main heap is only a small
part of the overhead if we have multiple indexes already built on a
table - which is the use case that causes the most problem.
Sure, if you have indexes on the table already, then the overhead of
updating them is more significant. I am actually working on that too, I
will make a separate post about that.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 13.08.2011 00:26, Merlin Moncure wrote:
On Fri, Aug 12, 2011 at 2:16 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Triggers complicate this. I believe it is only safe to group tuples together
like this if the table has no triggers. A BEFORE ROW trigger might run a
SELECT on the table being copied to, and check if some of the tuples we're
about to insert exist. If we run BEFORE ROW triggers for a bunch of tuples
first, and only then insert them, none of the trigger invocations will see
the other rows as inserted yet. Similarly, if we run AFTER ROW triggers
after inserting a bunch of tuples, the trigger for each of the insertions
would see all the inserted rows. So at least for now, the patch simply falls
back to inserting one row at a time if there are any triggers on the table.

But generic RI triggers would be ok, right?
RI triggers are AFTER ROW triggers, which we concluded to be OK after
all, so they would be ok.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 12 August 2011 23:19, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Triggers complicate this. I believe it is only safe to group tuples
together like this if the table has no triggers. A BEFORE ROW trigger
might run a SELECT on the table being copied to, and check if some of
the tuples we're about to insert exist. If we run BEFORE ROW triggers
for a bunch of tuples first, and only then insert them, none of the
trigger invocations will see the other rows as inserted yet. Similarly,
if we run AFTER ROW triggers after inserting a bunch of tuples, the
trigger for each of the insertions would see all the inserted rows. So
at least for now, the patch simply falls back to inserting one row at a
time if there are any triggers on the table.
I guess DEFAULT values could also suffer from a similar problem to
BEFORE ROW triggers though (in theory a DEFAULT could be based on a
function that SELECTs from the table being copied to).
Regards,
Dean
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
The patch is WIP, mainly because I didn't write the WAL replay routines
yet, but please let me know if you see any issues.
Why do you need new WAL replay routines? Can't you just use the existing
XLOG_HEAP_NEWPAGE support?
By and large, I think we should be avoiding special-purpose WAL entries
as much as possible.
Also, I strongly object to turning regular heap_insert into a wrapper
around some other more complicated operation. That will likely have
bad consequences for the performance of every other operation.
regards, tom lane
On 13.08.2011 17:33, Tom Lane wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
The patch is WIP, mainly because I didn't write the WAL replay routines
yet, but please let me know if you see any issues.

Why do you need new WAL replay routines? Can't you just use the existing
XLOG_HEAP_NEWPAGE support?

By and large, I think we should be avoiding special-purpose WAL entries
as much as possible.
I tried that, but most of the reduction in WAL size melts away. And if
the page you're copying to is not empty, logging the whole page is even
more expensive. You'd need to fall back to retail inserts in that case,
which complicates the logic.
Also, I strongly object to turning regular heap_insert into a wrapper
around some other more complicated operation. That will likely have
bad consequences for the performance of every other operation.
Ok. I doubt it makes any noticeable difference in performance, and now
that I've done it, duplicating the code is more readable, too.
Attached is a new version of the patch. It is now complete, including
WAL replay code.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Attachments:
copy-heap-multi-insert-2.patch (text/x-diff), +590 -16
Hi Heikki,
I checked your patch; I have one comment and two questions.
The heap_prepare_insert() function seems to duplicate code from the
earlier half of the existing heap_insert(). I think it would be a good
idea to consolidate these portions of the code.
I'm not clear on why the argument to CheckForSerializableConflictIn()
was changed from the one used in heap_insert(). Its source code comment
says:
:
* For a heap insert, we only need to check for table-level SSI locks.
* Our new tuple can't possibly conflict with existing tuple locks, and
* heap page locks are only consolidated versions of tuple locks; they do
* not lock "gaps" as index page locks do. So we don't need to identify
* a buffer before making the call.
*/
Is it possible for newly inserted tuples to conflict with existing
tuple locks when we expand insert to support multiple tuples at once?
It is a bit confusing to me.
This patch disallows multi-insertion not only when a before-row trigger
is configured, but also when volatile functions are used to compute a
default value. Is that a reasonable condition for avoiding
multi-insertion?
All the datums to be passed to heap_form_tuple() are computed in
NextCopyFrom(), and default values are also constructed there,
sequentially. So it seems to me there is no concern about volatile
functions not being invoked for each tuple. Or am I missing something?
Thanks,
2011/9/14 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:
On 13.08.2011 17:33, Tom Lane wrote:
Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:
The patch is WIP, mainly because I didn't write the WAL replay routines
yet, but please let me know if you see any issues.

Why do you need new WAL replay routines? Can't you just use the existing
XLOG_HEAP_NEWPAGE support?

By and large, I think we should be avoiding special-purpose WAL entries
as much as possible.

I tried that, but most of the reduction in WAL size melts away with
that. And if the page you're copying to is not empty, logging the whole
page is even more expensive. You'd need to fall back to retail inserts
in that case, which complicates the logic.

Also, I strongly object to turning regular heap_insert into a wrapper
around some other more complicated operation. That will likely have
bad consequences for the performance of every other operation.

Ok. I doubt it makes any noticeable difference for performance, but
having done that, it's more readable, too, to duplicate the code.

Attached is a new version of the patch. It is now complete, including
WAL replay code.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
--
KaiGai Kohei <kaigai@kaigai.gr.jp>
On 25 September 2011 09:43, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:
Hi Heikki,
I checked your patch, then I have a comment and two questions here.
2011/9/14 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:
Attached is a new version of the patch. It is now complete, including WAL
replay code.
Hi,
I had a quick look at the patch as well and spotted an oversight: the
multi-insert code branch fails to invoke AFTER ROW triggers.
Regards,
Dean
On Wed, Sep 14, 2011 at 6:52 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Why do you need new WAL replay routines? Can't you just use the existing
XLOG_HEAP_NEWPAGE support?

By and large, I think we should be avoiding special-purpose WAL entries
as much as possible.

I tried that, but most of the reduction in WAL size melts away with
that. And if the page you're copying to is not empty, logging the whole
page is even more expensive. You'd need to fall back to retail inserts
in that case, which complicates the logic.
Where does it go? I understand why it'd be a problem for partially
filled pages, but it seems like it ought to be efficient for pages
that are initially empty.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Kohei KaiGai wrote:
I'm not clear the reason why the argument of
CheckForSerializableConflictIn() was changed from the one in
heap_insert().
The code was probably just based on heap_insert() before this recent
commit:
Is it feasible that newly inserted tuples conflict with existing
tuple locks when we expand insert to support multiple tuples at
once?
No.
-Kevin
On 25.09.2011 19:01, Robert Haas wrote:
On Wed, Sep 14, 2011 at 6:52 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Why do you need new WAL replay routines? Can't you just use the existing
XLOG_HEAP_NEWPAGE support?

By and large, I think we should be avoiding special-purpose WAL entries
as much as possible.

I tried that, but most of the reduction in WAL size melts away with
that. And if the page you're copying to is not empty, logging the whole
page is even more expensive. You'd need to fall back to retail inserts
in that case, which complicates the logic.

Where does it go? I understand why it'd be a problem for partially
filled pages, but it seems like it ought to be efficient for pages
that are initially empty.
A regular heap_insert record leaves out a lot of information that can be
deduced at replay time: it omits the tuple headers, storing just the
null bitmap + data. In addition to that, there's just the location
of the tuple (RelFileNode+ItemPointer). At replay, xmin is taken from
the WAL record header.
For a multi-insert record, you don't even need to store the RelFileNode
and the block number for every tuple, just the offsets.
In comparison, a full-page image will include the full tuple header, and
also the line pointers. If I'm doing my math right, a full-page image
takes 25 bytes more data per tuple than the special-purpose
multi-insert record.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thu, Oct 6, 2011 at 7:33 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
A regular heap_insert record leaves out a lot of information that can be
deduced at replay time. It can leave out all the headers, including just the
null bitmap + data. In addition to that, there's just the location of the
tuple (RelFileNode+ItemPointer). At replay, xmin is taken from the WAL
record header.

For a multi-insert record, you don't even need to store the RelFileNode and
the block number for every tuple, just the offsets.

In comparison, a full-page image will include the full tuple header, and
also the line pointers. If I'm doing my math right, a full-page image takes
25 bytes more data per tuple than the special-purpose multi-insert record.
Interesting. It's always seemed to me fairly inefficient in general
to store the whole RelFileNode. For many people, the database and
tablespace OID will be constants, and even if they aren't, there
certainly aren't going to be 96 bits of entropy in the relfilenode. I
thought about whether we could create some sort of mapping layer,
where say once per checkpoint we'd allocate a 4-byte integer to denote
a relfilenode, and WAL-log that mapping. Then after that everyone
could just refer to the 4-byte integer instead of the whole
relfilenode. But it seems like a lot of work for 8 bytes per record.
Then again, if you're getting that much benefit from shaving off 25
bytes per tuple, maybe it is worth it, although I feel like FPW is the
elephant in the room.
But I digress...
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 06.10.2011 15:11, Robert Haas wrote:
On Thu, Oct 6, 2011 at 7:33 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

A regular heap_insert record leaves out a lot of information that can be
deduced at replay time. It can leave out all the headers, including just the
null bitmap + data. In addition to that, there's just the location of the
tuple (RelFileNode+ItemPointer). At replay, xmin is taken from the WAL
record header.

For a multi-insert record, you don't even need to store the RelFileNode and
the block number for every tuple, just the offsets.

In comparison, a full-page image will include the full tuple header, and
also the line pointers. If I'm doing my math right, a full-page image takes
25 bytes more data per tuple than the special-purpose multi-insert record.

Interesting. It's always seemed to me fairly inefficient in general
to store the whole RelFileNode. For many people, the database and
tablespace OID will be constants, and even if they aren't, there
certainly aren't going to be 96 bits of entropy in the relfilenode. I
thought about whether we could create some sort of mapping layer,
where say once per checkpoint we'd allocate a 4-byte integer to denote
a relfilenode, and WAL-log that mapping. Then after that everyone
could just refer to the 4-byte integer instead of the whole
relfilenode. But it seems like a lot of work for 8 bytes per record.
Then again, if you're getting that much benefit from shaving off 25
bytes per tuple, maybe it is, although I feel like FPW is the elephant
in the room.
A very simple optimization would be to leave out tablespace OID
altogether if it's DEFAULTTABLESPACE_OID, and just set a flag somewhere.
Then again, we could also just compress the WAL wholesale.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com