Optimizing COPY

Started by Heikki Linnakangas · over 17 years ago · 7 messages · pgsql-hackers
#1Heikki Linnakangas
heikki.linnakangas@enterprisedb.com

Back in March, I played around with various hacks to improve COPY FROM
performance:
http://archives.postgresql.org/pgsql-patches/2008-03/msg00145.php

I got busy with other stuff, but I've now gotten around to trying what I
planned back then. I don't know if I'll have the time to finish this for
8.4, but I might as well post what I've got.

The basic idea is to replace the custom loop in CopyReadLineText with
memchr(), because memchr() is very fast. To make that possible, perform
the client-server encoding conversion on each raw block that we read in,
before splitting it into lines. That way CopyReadLineText only needs to
deal with server encodings, which is required for the memchr() to be safe.

Attached is a quick patch for that. Think of it as a prototype; I
haven't tested it much, and I feel that it needs some further cleanup.
Quick testing suggests that it gives a 0-20% speedup, depending on the
data. Very narrow tables don't benefit much, but the wider the table,
the bigger the gain. I haven't found a case where it performs worse.

I'm only using memchr() with the non-CSV format at the moment. It could be
used for CSV as well, but that's more complicated, because in CSV mode we
need to keep track of the escapes.

Some collateral damage:
\. no longer works. If we only accepted \. at the start of a line, as we
already do in CSV mode, it shouldn't be hard or expensive to make it work
again. The manual already suggests that we only accept it on a line of
its own: "End of data can be represented by a single line containing just
backslash-period (\.)."

Escaping a linefeed or carriage return by prepending it with a backslash
no longer works; you have to use \n and \r instead. The manual already
warns against doing that, so I think we could easily drop support for it.
If we wanted, it wouldn't be very hard to make it work, though: after
hitting a newline, scan back and count how many backslashes there are
before the newline. An odd number means it's an escaped newline; an even
number (including 0) means it's a real newline.

For the sake of simplifying the code, would anyone care if we removed
support for COPY with protocol version 2?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

copy-memchr-1.patch (text/x-diff, +378 -401)
#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#1)
Re: Optimizing COPY

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

The basic idea is to replace the custom loop in CopyReadLineText with
memchr(), because memchr() is very fast. To make that possible, perform
the client-server encoding conversion on each raw block that we read in,
before splitting it into lines. That way CopyReadLineText only needs to
deal with server encodings, which is required for the memchr() to be safe.

Okay, so of course the trick with that is the block boundary handling.
The protocol says the client can break the data apart however it likes.
I see you've tried to deal with that, but this part seems wrong:

!     if (convertable_bytes == 0)
!     {
!         /*
!          * EOF, and there was some unconvertable chars at the end.
!          * Call pg_client_to_server on the remaining bytes, to
!          * let it throw an error.
!          */
!         cvt = pg_client_to_server(raw, inbytes);
!         Assert(false);    /* pg_client_to_server should've errored */
!     }

You're not (AFAICS) definitely at EOF here; you might just have gotten
a pathologically short message.

regards, tom lane

#3Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#1)
Re: Optimizing COPY

Heikki,

I was assigned as a round-robin reviewer for this patch, but it looks
to me like it is still WIP, so I'm not sure how much effort it's worth
putting in at this point. Do you plan to finish this for 8.4, and if
so, should I wait for the next version before reviewing further?

Thanks,

...Robert

On Thu, Oct 30, 2008 at 8:14 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:


#4Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#3)
Re: Optimizing COPY

Robert Haas wrote:

Heikki,

I was assigned as a round-robin reviewer for this patch, but it looks
to me like it is still WIP, so I'm not sure how much effort it's worth
putting in at this point. Do you plan to finish this for 8.4, and if
so, should I wait for the next version before reviewing further?

I'd really like to work on this to get it into 8.4, but being honest with
myself, I don't think I have the time to finish and benchmark it. I'm
swamped with reviewing others' patches, as well as with non-PG-related
work. I have some work to do on my visibility map patch too, which seems
more important.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#5Chuck McDevitt
cmcdevitt@greenplum.com
In reply to: Heikki Linnakangas (#1)
Re: Optimizing COPY

What if the block of text is split in the middle of a multibyte character?
I don't think it is safe to assume raw blocks always end on a character boundary.

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#4)
Re: Optimizing COPY

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

I'd really like to work on this to get it into 8.4, but being honest with
myself, I don't think I have the time to finish and benchmark it. I'm
swamped with reviewing others' patches, as well as with non-PG-related
work. I have some work to do on my visibility map patch too, which seems
more important.

Agreed, you should get the visibility map done first. The copy hack is
just a marginal thing; partial vacuums would be game-changing.

regards, tom lane

#7Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Chuck McDevitt (#5)
Re: Optimizing COPY

Chuck McDevitt wrote:

What if the block of text is split in the middle of a multibyte character?
I don't think it is safe to assume raw blocks always end on a character boundary.

Yeah, it's not safe. I realized that myself after submitting. The generic
approach is to loop with pg_mblen() to find the maximum safe length. For
UTF-8, and probably many other multi-byte encodings as well, we can detect
whether a byte is the first byte of a multi-byte character just by
looking at the high bits of the byte.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com