bytea vs. pg_dump
From time to time we have had complaints about slow dumps of large tables with
bytea columns, with people complaining about a) the size and b) the duration of
the dump.
The latter happened to me recently: a customer wanted to dump large tables
(approx. 12G in size) with pg_dump, but was annoyed by the performance. Using
COPY BINARY reduced the time (unsurprisingly) to a fraction (from 12 minutes
to 3 minutes).
As discussed in the past[1], we didn't implement pg_dump to support BINARY
to preserve the portability and version independence of dumps made with
pg_dump. I would like to bring that topic up again, since an option like
--binary-copy seems interesting for use cases where portability and version
issues don't matter and someone wants a fast COPY of his documents. It would
make this task much easier, especially in the described case, where the
customer has to dump referenced tables as well.
Another approach would be to just dump bytea columns in binary format only
(not sure how doable that is, though).
Opinions, again?
[1] http://archives.postgresql.org//pgsql-hackers/2007-12/msg00139.php
--
Thanks
Bernd
Bernd Helmle <mailings@oopsware.de> writes:
From time to time we have had complaints about slow dumps of large tables with
bytea columns, with people complaining about a) the size and b) the duration of
the dump.
The latter happened to me recently: a customer wanted to dump large tables
(approx. 12G in size) with pg_dump, but was annoyed by the performance. Using
COPY BINARY reduced the time (unsurprisingly) to a fraction (from 12 minutes
to 3 minutes).
Seems like the right response might be some micro-optimization effort on
byteaout.
regards, tom lane
Bernd Helmle <mailings@oopsware.de> wrote:
Another approach would be to just dump bytea columns in binary
format only (not sure how doable that is, though).
If that's not doable, perhaps a base64 option for bytea COPY?
-Kevin
Tom Lane <tgl@sss.pgh.pa.us> writes:
Bernd Helmle <mailings@oopsware.de> writes:
The latter happened to me recently: a customer wanted to dump large tables
(approx. 12G in size) with pg_dump, but was annoyed by the performance. Using
COPY BINARY reduced the time (unsurprisingly) to a fraction (from 12 minutes
to 3 minutes).

Seems like the right response might be some micro-optimization effort on
byteaout.
Still, apart from lack of interest from developers and/or resources, is
there some reason we don't have a pg_dump --binary option?
The DBA would have to make sure his exports are usable, but when the routine
pg_dump backup is mainly there to allow restoring on the same machine after an
unwanted event (a DELETE bug, a malicious TRUNCATE, you name it), a faster
dump/restore, even if local-only, would be of interest.
Regards,
--
dim
Dimitri Fontaine <dfontaine@hi-media.com> writes:
Still, apart from lack of interest from developers and/or resources, is
there some reason we don't have a pg_dump --binary option?
It seems rather antithetical to one of the main goals of pg_dump,
which is to provide a dump that can reliably be loaded onto other
machines or newer versions of Postgres. I don't think that we
should provide such a foot-gun in hopes of getting relatively
minor performance improvements; especially when we have not
exhausted the alternatives.
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
It seems rather antithetical to one of the main goals of pg_dump,
which is to provide a dump that can reliably be loaded onto other
machines or newer versions of Postgres.
Either you're calling for a pg_export/pg_import tool suite, or I need to
learn to read again :)
I don't think that we should provide such a foot-gun in hopes of
getting relatively minor performance improvements; especially when we
have not exhausted the alternatives.
If you think improvements will be minor while alternatives are
promising, of course, I'm gonna take your word for it.
Regards,
--
dim
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
Bernd Helmle <mailings@oopsware.de> wrote:
Another approach would be to just dump bytea columns in binary
format only (not sure how doable that is, though).
If that's not doable, perhaps a base64 option for bytea COPY?
I'm thinking plain old pairs-of-hex-digits might be the best
tradeoff if conversion speed is the criterion. The main problem
in any case would be to decide how to control the format option.
regards, tom lane
Tom Lane wrote:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
Bernd Helmle <mailings@oopsware.de> wrote:
Another approach would be to just dump bytea columns in binary
format only (not sure how doable that is, though).

If that's not doable, perhaps a base64 option for bytea COPY?
I'm thinking plain old pairs-of-hex-digits might be the best
tradeoff if conversion speed is the criterion. The main problem
in any case would be to decide how to control the format option.
It would be great if COPY FROM could read some fields as binary while
the rest is text. That would allow us to do something like
--bytea-column-format=binary
--bytea-column-format=hexpair
--bytea-column-format=text
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
--On Tuesday, May 05, 2009 10:00:37 -0400 Tom Lane <tgl@sss.pgh.pa.us>
wrote:
Seems like the right response might be some micro-optimization effort on
byteaout.
Hmm, looking at the profiler statistics seems to second your suspicion:
Normal COPY shows:
  %   cumulative    self               self     total
 time    seconds  seconds       calls  s/call   s/call  name
31.29      81.38    81.38      134487    0.00     0.00  CopyOneRowTo
22.88     140.89    59.51      134487    0.00     0.00  byteaout
13.44     175.84    34.95  3052797224    0.00     0.00  appendBinaryStringInfo
12.10     207.32    31.48  3052990837    0.00     0.00  CopySendChar
 8.45     229.31    21.99  3052797226    0.00     0.00  enlargeStringInfo
 3.90     239.45    10.14       55500    0.00     0.00  pglz_decompress
 3.28     247.97     8.52           3    2.84     2.84  appendStringInfoChar
 1.82     252.71     4.74      134489    0.00     0.00  resetStringInfo
 1.72     257.18     4.47                               copy_dest_destroy
 0.27     257.89     0.71     5544679    0.00     0.00  hash_search_with_hash_value
 0.09     258.13     0.24    13205044    0.00     0.00  LWLockAcquire
 0.08     258.35     0.22    13205044    0.00     0.00  LWLockRelease
COPY BINARY generates:
  %   cumulative    self               self     total
 time    seconds  seconds       calls  s/call   s/call  name
73.70       9.05     9.05       55500    0.00     0.00  pglz_decompress
 6.03       9.79     0.74     5544679    0.00     0.00  hash_search_with_hash_value
 2.93      10.15     0.36    13205362    0.00     0.00  LWLockAcquire
 1.87      10.38     0.23    13205362    0.00     0.00  LWLockRelease
This is PostgreSQL 8.3.7 btw.
--
Thanks
Bernd
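[Editorial note: the heavy CopySendChar/appendBinaryStringInfo traffic in the
first profile comes from emitting escaped text one character at a time. A
rough Python model of the old "escape" bytea output format — a sketch of the
logic only, not the actual byteaout C code — shows the per-byte branching
involved:]

```python
# Rough model of "escape"-style bytea text output: each input byte
# is inspected and becomes 1, 2, or 4 output characters.  In the
# server, each of those output characters is another CopySendChar
# call, which is what the profile above shows.
def bytea_escape(data: bytes) -> str:
    out = []
    for b in data:
        if b == 0x5C:                    # backslash -> doubled
            out.append('\\\\')
        elif b < 0x20 or b > 0x7E:       # non-printable -> \ooo octal
            out.append('\\%03o' % b)
        else:                            # printable ASCII -> copied as-is
            out.append(chr(b))
    return ''.join(out)
```

For binary-looking data most bytes take the 4-character octal path, which is
why the text-format COPY spends so much more time producing so much more output.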
Tom Lane wrote:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
Bernd Helmle <mailings@oopsware.de> wrote:
Another approach would be to just dump bytea columns in binary
format only (not sure how doable that is, though).

If that's not doable, perhaps a base64 option for bytea COPY?
I'm thinking plain old pairs-of-hex-digits might be the best
tradeoff if conversion speed is the criterion. The main problem
in any case would be to decide how to control the format option.
Yeah. Any ideas on how to do that? I can't think of anything very clean
offhand.
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
Tom Lane wrote:
I'm thinking plain old pairs-of-hex-digits might be the best
tradeoff if conversion speed is the criterion. The main problem
in any case would be to decide how to control the format option.
Yeah. Any ideas on how to do that? I can't think of anything very clean
offhand.
Well, there's nothing much wrong with a GUC setting to control output
--- we have lots of precedent, such as DateStyle. The problem is with
figuring out what ambiguous input is meant to be. There seems to be
an uncomfortably high risk of misinterpreting the input.
For sake of argument, suppose we define the hex format as "0x followed
by pairs of hex digits". We could then modify byteaout so that if it
were told to print in old-style a value that happened to start with
"0x", it could output "0\x" instead, which means the same but would be
unambiguous. This would fix the problem going forward, but old-style
dumps and un-updated clients would still be at risk. The risk might
not be too high though, since the odds of successfully parsing old-style
data as hex would be relatively low, particularly if we were draconian
about case (ie the "x" MUST be lower case and the hex digits MUST be
upper).
regards, tom lane
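[Editorial note: to make the risk concrete, here is a hypothetical sketch of
the draconian rule described above — "0x" as the proposed prefix is Tom's
for-sake-of-argument spelling — classifying input as new-style hex only in
the one exact spelling:]

```python
import re

# Accept the new-style form only as "0x" (the x MUST be lower case)
# followed by pairs of UPPER-CASE hex digits; anything else falls
# back to old-style escape parsing.
NEW_STYLE = re.compile(r'0x(?:[0-9A-F]{2})*\Z')

def looks_like_new_style(text: str) -> bool:
    return NEW_STYLE.match(text) is not None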
Sorry for the top-posting -- stupid iPhone mail client.
We could eliminate the problem with old dumps by doing something like
\x to indicate a new-style hex dump.
That doesn't make us 100% safe against arbitrary user input but should
be pretty low risk.
--
Greg
Tom Lane wrote:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
Bernd Helmle <mailings@oopsware.de> wrote:
Another approach would be to just dump bytea columns in binary
format only (not sure how doable that is, though).

If that's not doable, perhaps a base64 option for bytea COPY?
I'm thinking plain old pairs-of-hex-digits might be the best
tradeoff if conversion speed is the criterion.
That's a lot less space-efficient than base64, though.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Tom Lane wrote:
I'm thinking plain old pairs-of-hex-digits might be the best
tradeoff if conversion speed is the criterion.
That's a lot less space-efficient than base64, though.
Well, base64 could give a 33% savings, but it's significantly harder
to encode/decode. Also, since it has a much larger set of valid
data characters, it would be *much* more likely to allow old-style
formatting to be mistaken for new-style. Unless we can think of
a more bulletproof format selection mechanism, that could be
an overriding consideration.
regards, tom lane
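[Editorial note: the two expansion factors are easy to check with Python's
standard encoders; the sample data here is arbitrary:]

```python
import base64
import binascii
import os

payload = os.urandom(3000)   # any binary sample; 3000 is divisible by 3

hex_out = binascii.hexlify(payload)   # 2 output bytes per input byte
b64_out = base64.b64encode(payload)   # 4 output bytes per 3 input bytes

assert len(hex_out) == 2 * len(payload)
assert len(b64_out) == 4 * len(payload) // 3   # exact when len % 3 == 0
# base64 output is one third smaller than hex output:
assert len(b64_out) * 3 == len(hex_out) * 2
```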
Tom Lane <tgl@sss.pgh.pa.us> wrote:
Unless we can think of a more bulletproof format selection mechanism
Would it make any sense to have an option on the COPY command to tell
it to use base64 for bytea columns?
-Kevin
Tom Lane wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Tom Lane wrote:
I'm thinking plain old pairs-of-hex-digits might be the best
tradeoff if conversion speed is the criterion.

That's a lot less space-efficient than base64, though.
Well, base64 could give a 33% savings, but it's significantly harder
to encode/decode. Also, since it has a much larger set of valid
data characters, it would be *much* more likely to allow old-style
formatting to be mistaken for new-style. Unless we can think of
a more bulletproof format selection mechanism, that could be
an overriding consideration.
Hex will already provide some space savings over our current encoding
method for most byteas anyway. It's not like we'd be making things less
efficient space-wise. And in compressed archives the space difference is
likely to dissolve to not very much, I suspect.
cheers
andrew
--On Tuesday, May 05, 2009 16:57:50 -0400 Andrew Dunstan
<andrew@dunslane.net> wrote:
Hex will already provide some space savings over our current encoding
method for most byteas anyway. It's not like we'd be making things less
efficient space-wise. And in compressed archives the space difference is
likely to dissolve to not very much, I suspect.
I'm dumb: I don't understand why a hex conversion would be significantly
faster than what we have now?
--
Thanks
Bernd
Bernd Helmle wrote:
--On Tuesday, May 05, 2009 16:57:50 -0400 Andrew Dunstan
<andrew@dunslane.net> wrote:

Hex will already provide some space savings over our current encoding
method for most byteas anyway. It's not like we'd be making things less
efficient space-wise. And in compressed archives the space difference is
likely to dissolve to not very much, I suspect.

I'm dumb: I don't understand why a hex conversion would be significantly
faster than what we have now?
Quite apart from anything else, you would not need the current loop over
the bytea input to calculate the result length -- in hex it would just be
the input length * 2.
cheers
andrew
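[Editorial note: Andrew's point, sketched in Python as a model of the sizing
pass rather than the actual server code — escape-format output length must be
computed byte by byte before the result buffer can be allocated, while hex
needs no scan at all:]

```python
# Old-style "escape" output takes 1, 2, or 4 characters per input
# byte, so sizing the result means walking the whole value first.
def escape_out_len(data: bytes) -> int:
    n = 0
    for b in data:
        if b == 0x5C:                  # backslash -> "\\"
            n += 2
        elif b < 0x20 or b > 0x7E:     # non-printable -> "\ooo"
            n += 4
        else:                          # printable -> copied as-is
            n += 1
    return n

# Hex output is a constant 2 characters per byte: no scan needed.
def hex_out_len(data: bytes) -> int:
    return 2 * len(data)
```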
On Tue, May 5, 2009 at 4:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Tom Lane wrote:
I'm thinking plain old pairs-of-hex-digits might be the best
tradeoff if conversion speed is the criterion.

That's a lot less space-efficient than base64, though.
Well, base64 could give a 33% savings, but it's significantly harder
to encode/decode. Also, since it has a much larger set of valid
data characters, it would be *much* more likely to allow old-style
formatting to be mistaken for new-style. Unless we can think of
a more bulletproof format selection mechanism, that could be
an overriding consideration.
Another nit with base64 is that properly encoded data requires
newlines according to the standard.
merlin
Merlin Moncure wrote:
On Tue, May 5, 2009 at 4:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Tom Lane wrote:
I'm thinking plain old pairs-of-hex-digits might be the best
tradeoff if conversion speed is the criterion.

That's a lot less space-efficient than base64, though.
Well, base64 could give a 33% savings, but it's significantly harder
to encode/decode. Also, since it has a much larger set of valid
data characters, it would be *much* more likely to allow old-style
formatting to be mistaken for new-style. Unless we can think of
a more bulletproof format selection mechanism, that could be
an overriding consideration.another nit with base64 is that properly encoded data requires
newlines according to the standard.
Er, no, not as I read RFC 3548 s 2.1.
cheers
andrew
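[Editorial note: both readings can be reproduced with Python's standard
library — the plain RFC 3548/4648 encoding emits no line breaks, while the
MIME variant (RFC 2045), which is probably the "requires newlines" rule being
remembered, wraps its output:]

```python
import base64

data = bytes(range(100))

plain = base64.b64encode(data)     # RFC 3548/4648: one unbroken string
assert b"\n" not in plain

mime = base64.encodebytes(data)    # MIME (RFC 2045): wrapped at 76 chars
assert b"\n" in mime
```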