bytea vs. pg_dump
From time to time we had complains about slow dump of large tables with
bytea columns, people often complaining about a) size and b) duration of
the dump.
That latter occurred recently to me, a customer would like to dump large
tables (approx. 12G in size) with pg_dump, but he was annoyed about the
performance. Using COPY BINARY reduced the time (unsurprisingly) to a
fraction (from 12 minutes to 3 minutes).
As discussed in the past[1]<http://archives.postgresql.org//pgsql-hackers/2007-12/msg00139.php> -- Thanks, we didn't implement pg_dump to support BINARY
to preserve portability and version independence of dumps using pg_dump. I
would like to bring that topic up again, since implementing an option like
--binary-copy seems interesting in use cases, where portability and version
issues doesn't matter and someone wants to have a fast COPY of his
documents . This would make this task much easier, especially in the
described case, where the customer has to dump referenced tables as well.
Another approach would be to just dump bytea columns in binary format only
(not sure how doable that is, though).
Opinions, again?
[1]: <http://archives.postgresql.org//pgsql-hackers/2007-12/msg00139.php> -- Thanks
--
Thanks
Bernd
Bernd Helmle <mailings@oopsware.de> writes:
From time to time we had complains about slow dump of large tables with
bytea columns, people often complaining about a) size and b) duration of
the dump.
That latter occurred recently to me, a customer would like to dump large
tables (approx. 12G in size) with pg_dump, but he was annoyed about the
performance. Using COPY BINARY reduced the time (unsurprisingly) to a
fraction (from 12 minutes to 3 minutes).
Seems like the right response might be some micro-optimization effort on
byteaout.
regards, tom lane
Bernd Helmle <mailings@oopsware.de> wrote:
Another approach would be to just dump bytea columns in binary
format only (not sure how doable that is, though).
If that's not doable, perhaps a base64 option for bytea COPY?
-Kevin
Tom Lane <tgl@sss.pgh.pa.us> writes:
Bernd Helmle <mailings@oopsware.de> writes:
That latter occurred recently to me, a customer would like to dump large
tables (approx. 12G in size) with pg_dump, but he was annoyed about the
performance. Using COPY BINARY reduced the time (unsurprisingly) to a
fraction (from 12 minutes to 3 minutes).Seems like the right response might be some micro-optimization effort on
byteaout.
Still, apart from lack of interest from developpers and/or resources, is
there some reason we don't have a pg_dump --binary option?
DBA would have to make sure his exports are usable, but when the routine
pg_dump backup is mainly there to be able to restore on the same machine
in case of unwanted event (DELETE bug, malicious TRUNCATE, you name it),
having a faster dump/restore even if local only would be of interest.
Regards,
--
dim
Dimitri Fontaine <dfontaine@hi-media.com> writes:
Still, apart from lack of interest from developpers and/or resources, is
there some reason we don't have a pg_dump --binary option?
It seems rather antithetical to one of the main goals of pg_dump,
which is to provide a dump that can reliably be loaded onto other
machines or newer versions of Postgres. I don't think that we
should provide such a foot-gun in hopes of getting relatively
minor performance improvements; especially when we have not
exhausted the alternatives.
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
It seems rather antithetical to one of the main goals of pg_dump,
which is to provide a dump that can reliably be loaded onto other
machines or newer versions of Postgres.
You're calling for a pg_export/pg_import tool suite, or I have to learn
to read again :)
I don't think that we should provide such a foot-gun in hopes of
getting relatively minor performance improvements; especially when we
have not exhausted the alternatives.
If you think improvements will be minor while alternatives are
promising, of course, I'm gonna take your word for it.
Regards,
--
dim
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
Bernd Helmle <mailings@oopsware.de> wrote:
Another approach would be to just dump bytea columns in binary
format only (not sure how doable that is, though).
If that's not doable, perhaps a base64 option for bytea COPY?
I'm thinking plain old pairs-of-hex-digits might be the best
tradeoff if conversion speed is the criterion. The main problem
in any case would be to decide how to control the format option.
regards, tom lane
Tom Lane wrote:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
Bernd Helmle <mailings@oopsware.de> wrote:
Another approach would be to just dump bytea columns in binary
format only (not sure how doable that is, though).If that's not doable, perhaps a base64 option for bytea COPY?
I'm thinking plain old pairs-of-hex-digits might be the best
tradeoff if conversion speed is the criterion. The main problem
in any case would be to decide how to control the format option.
It would be great if COPY FROM could read some fields as binary while
the rest is text. That would allow us to do something like
--bytea-column-format=binary
--bytea-column-format=hexpair
--bytea-column-format=text
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
--On Dienstag, Mai 05, 2009 10:00:37 -0400 Tom Lane <tgl@sss.pgh.pa.us>
wrote:
Seems like the right response might be some micro-optimization effort on
byteaout.
Hmm looking into profiler statistics seems to second your suspicion:
Normal COPY shows:
% cumulative self self total
time seconds seconds calls s/call s/call name
31.29 81.38 81.38 134487 0.00 0.00 CopyOneRowTo
22.88 140.89 59.51 134487 0.00 0.00 byteaout
13.44 175.84 34.95 3052797224 0.00 0.00
appendBinaryStringInfo
12.10 207.32 31.48 3052990837 0.00 0.00 CopySendChar
8.45 229.31 21.99 3052797226 0.00 0.00 enlargeStringInfo
3.90 239.45 10.14 55500 0.00 0.00 pglz_decompress
3.28 247.97 8.52 3 2.84 2.84 appendStringInfoChar
1.82 252.71 4.74 134489 0.00 0.00 resetStringInfo
1.72 257.18 4.47 copy_dest_destroy
0.27 257.89 0.71 5544679 0.00 0.00
hash_search_with_hash_value
0.09 258.13 0.24 13205044 0.00 0.00 LWLockAcquire
0.08 258.35 0.22 13205044 0.00 0.00 LWLockRelease
COPY BINARY generates:
time seconds seconds calls s/call s/call name
73.70 9.05 9.05 55500 0.00 0.00 pglz_decompress
6.03 9.79 0.74 5544679 0.00 0.00
hash_search_with_hash_value
2.93 10.15 0.36 13205362 0.00 0.00 LWLockAcquire
1.87 10.38 0.23 13205362 0.00 0.00 LWLockRelease
This is PostgreSQL 8.3.7 btw.
--
Thanks
Bernd
Tom Lane wrote:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
Bernd Helmle <mailings@oopsware.de> wrote:
Another approach would be to just dump bytea columns in binary
format only (not sure how doable that is, though).If that's not doable, perhaps a base64 option for bytea COPY?
I'm thinking plain old pairs-of-hex-digits might be the best
tradeoff if conversion speed is the criterion. The main problem
in any case would be to decide how to control the format option.
Yeah. Any ideas on how to do that? I can't think of anything very clean
offhand.
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
Tom Lane wrote:
I'm thinking plain old pairs-of-hex-digits might be the best
tradeoff if conversion speed is the criterion. The main problem
in any case would be to decide how to control the format option.
Yeah. Any ideas on how to do that? I can't think of anything very clean
offhand.
Well, there's nothing much wrong with a GUC setting to control output
--- we have lots of precedent, such as DateStyle. The problem is with
figuring out what ambiguous input is meant to be. There seems to be
an uncomfortably high risk of misinterpreting the input.
For sake of argument, suppose we define the hex format as "0x followed
by pairs of hex digits". We could then modify byteaout so that if it
were told to print in old-style a value that happened to start with
"0x", it could output "0\x" instead, which means the same but would be
unambiguous. This would fix the problem going forward, but old-style
dumps and un-updated clients would still be at risk. The risk might
not be too high though, since the odds of successfully parsing old-style
data as hex would be relatively low, particularly if we were draconian
about case (ie the "x" MUST be lower case and the hex digits MUST be
upper).
regards, tom lane
Sorry got top-posting -- stupid iphone mail client.
We could eliminate the problem with old dumps by doing something like
\x to indicate a new-style hex dump.
That doesn't make us 100% safe against arbitrary user input but should
be pretty low risk.
--
Greg
On 5 May 2009, at 18:51, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Show quoted text
Andrew Dunstan <andrew@dunslane.net> writes:
Tom Lane wrote:
I'm thinking plain old pairs-of-hex-digits might be the best
tradeoff if conversion speed is the criterion. The main problem
in any case would be to decide how to control the format option.Yeah. Any ideas on how to do that? I can't think of anything very
clean
offhand.Well, there's nothing much wrong with a GUC setting to control output --- we have lots of precedent, such as DateStyle. The problem is with figuring out what ambiguous input is meant to be. There seems to be an uncomfortably high risk of misinterpreting the input.For sake of argument, suppose we define the hex format as "0x followed
by pairs of hex digits". We could then modify byteaout so that if it
were told to print in old-style a value that happened to start with
"0x", it could output "0\x" instead, which means the same but would be
unambiguous. This would fix the problem going forward, but old-style
dumps and un-updated clients would still be at risk. The risk might
not be too high though, since the odds of successfully parsing old-
style
data as hex would be relatively low, particularly if we were draconian
about case (ie the "x" MUST be lower case and the hex digits MUST be
upper).regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Tom Lane wrote:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
Bernd Helmle <mailings@oopsware.de> wrote:
Another approach would be to just dump bytea columns in binary
format only (not sure how doable that is, though).If that's not doable, perhaps a base64 option for bytea COPY?
I'm thinking plain old pairs-of-hex-digits might be the best
tradeoff if conversion speed is the criterion.
That's a lot less space-efficient than base64, though.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Tom Lane wrote:
I'm thinking plain old pairs-of-hex-digits might be the best
tradeoff if conversion speed is the criterion.
That's a lot less space-efficient than base64, though.
Well, base64 could give a 33% savings, but it's significantly harder
to encode/decode. Also, since it has a much larger set of valid
data characters, it would be *much* more likely to allow old-style
formatting to be mistaken for new-style. Unless we can think of
a more bulletproof format selection mechanism, that could be
an overriding consideration.
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
Unless we can think of a more bulletproof format selection mechanism
Would it make any sense to have an option on the COPY command to tell
it to use base64 for bytea columns?
-Kevin
Tom Lane wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Tom Lane wrote:
I'm thinking plain old pairs-of-hex-digits might be the best
tradeoff if conversion speed is the criterion.That's a lot less space-efficient than base64, though.
Well, base64 could give a 33% savings, but it's significantly harder
to encode/decode. Also, since it has a much larger set of valid
data characters, it would be *much* more likely to allow old-style
formatting to be mistaken for new-style. Unless we can think of
a more bulletproof format selection mechanism, that could be
an overriding consideration.
Hex will already provide some space savings over our current encoding
method for most byteas anyway. It's not like we'd be making things less
efficient space-wise. And in compressed archives the space difference is
likely to dissolve to not very much, I suspect.
cheers
andrew
--On Dienstag, Mai 05, 2009 16:57:50 -0400 Andrew Dunstan
<andrew@dunslane.net> wrote:
Hex will already provide some space savings over our current encoding
method for most byteas anyway. It's not like we'd be making things less
efficient space-wise. And in compressed archives the space difference is
likely to dissolve to not very much, I suspect.
I'm dumb: I don't understand why a hex conversion would be significantly
faster than what we have now?
--
Thanks
Bernd
Bernd Helmle wrote:
--On Dienstag, Mai 05, 2009 16:57:50 -0400 Andrew Dunstan
<andrew@dunslane.net> wrote:Hex will already provide some space savings over our current encoding
method for most byteas anyway. It's not like we'd be making things less
efficient space-wise. And in compressed archives the space difference is
likely to dissolve to not very much, I suspect.I'm dumb: I don't understand why a hex conversion would be
significantly faster than what we have now?
Quite apart from anything else you would not need the current loop over
the bytea input to calculate the result length - in hex it would just be
the input length * 2.
cheers
andrew
On Tue, May 5, 2009 at 4:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Tom Lane wrote:
I'm thinking plain old pairs-of-hex-digits might be the best
tradeoff if conversion speed is the criterion.That's a lot less space-efficient than base64, though.
Well, base64 could give a 33% savings, but it's significantly harder
to encode/decode. Also, since it has a much larger set of valid
data characters, it would be *much* more likely to allow old-style
formatting to be mistaken for new-style. Unless we can think of
a more bulletproof format selection mechanism, that could be
an overriding consideration.
another nit with base64 is that properly encoded data requires
newlines according to the standard.
merlin
Merlin Moncure wrote:
On Tue, May 5, 2009 at 4:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Tom Lane wrote:
I'm thinking plain old pairs-of-hex-digits might be the best
tradeoff if conversion speed is the criterion.That's a lot less space-efficient than base64, though.
Well, base64 could give a 33% savings, but it's significantly harder
to encode/decode. Also, since it has a much larger set of valid
data characters, it would be *much* more likely to allow old-style
formatting to be mistaken for new-style. Unless we can think of
a more bulletproof format selection mechanism, that could be
an overriding consideration.another nit with base64 is that properly encoded data requires
newlines according to the standard.
er, no, not as I read rfc 3548 s 2.1.
cheers
andrew
Andrew Dunstan wrote:
another nit with base64 is that properly encoded data requires
newlines according to the standard.er, no, not as I read rfc 3548 s 2.1.
cheers
andrew
Why does encode('my text', 'base64') include newlines in its output? I
think MIME requires text to be broken into 76 char lines but why does
encode do this?
--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/
Andrew Dunstan <andrew@dunslane.net> writes:
Bernd Helmle wrote:
I'm dumb: I don't understand why a hex conversion would be
significantly faster than what we have now?
Quite apart from anything else you would not need the current loop over
the bytea input to calculate the result length - in hex it would just be
the input length * 2.
Another point is that the current format results in a very large number
of backslashes in the output data, which translates to extra time and
space at the level of the COPY protocol itself (since that has to double
all those backslashes).
Of course, base64 would also have these two advantages.
regards, tom lane
On Wed, May 6, 2009 at 8:02 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
Merlin Moncure wrote:
On Tue, May 5, 2009 at 4:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Tom Lane wrote:
I'm thinking plain old pairs-of-hex-digits might be the best
tradeoff if conversion speed is the criterion.That's a lot less space-efficient than base64, though.
Well, base64 could give a 33% savings, but it's significantly harder
to encode/decode. Also, since it has a much larger set of valid
data characters, it would be *much* more likely to allow old-style
formatting to be mistaken for new-style. Unless we can think of
a more bulletproof format selection mechanism, that could be
an overriding consideration.another nit with base64 is that properly encoded data requires
newlines according to the standard.er, no, not as I read rfc 3548 s 2.1.
PostgreSQL (sort of) follows RFC 2045, not RFC 3548. I don't think it
would be a good idea to introduce a second method of encoding base64.
merlin
On Tuesday 05 May 2009 17:38:33 Tom Lane wrote:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
Bernd Helmle <mailings@oopsware.de> wrote:
Another approach would be to just dump bytea columns in binary
format only (not sure how doable that is, though).If that's not doable, perhaps a base64 option for bytea COPY?
I'm thinking plain old pairs-of-hex-digits might be the best
tradeoff if conversion speed is the criterion. The main problem
in any case would be to decide how to control the format option.
The output format can be controlled by a GUC parameter. And while we are at
it, we can also make bytea understand the new output format on input, so we
can offer an end-to-end alternative to the amazingly confusing current bytea
format and also make byteain() equally faster at the same time.
For distinguishing various input formats, we could use the backslash to escape
the format specification without breaking backward compatibilty, e.g.,
'\hexd41d8cd98f00b204e9800998ecf8427e'
With a bit of extra work we can wrap this up to be a more or less SQL-
conforming blob type, which would also make a lot of people very happy.
Peter Eisentraut <peter_e@gmx.net> writes:
For distinguishing various input formats, we could use the backslash
to escape the format specification without breaking backward
compatibilty, e.g.,
Oh, you're right! I had been thinking that byteain treats \x as
just meaning x if x isn't an octal digit, but actually it throws
an error for anything except octal digits and backslashes:
regression=# select E'\\x'::bytea;
ERROR: invalid input syntax for type bytea
LINE 1: select E'\\x'::bytea;
^
and a quick check verifies it has always done that.
So the ambiguous-input problem is solved if we define the new format(s)
to be started by backslash and something that the old code would reject.
I'd keep it short, like "\x", but there's still room for multiple
formats if anyone really wants to go to the trouble.
regards, tom lane
Bernd Helmle <mailings@oopsware.de> writes:
--On Dienstag, Mai 05, 2009 10:00:37 -0400 Tom Lane <tgl@sss.pgh.pa.us>
wrote:Seems like the right response might be some micro-optimization effort on
byteaout.
Hmm looking into profiler statistics seems to second your suspicion:
Normal COPY shows:
% cumulative self self total
time seconds seconds calls s/call s/call name
31.29 81.38 81.38 134487 0.00 0.00 CopyOneRowTo
22.88 140.89 59.51 134487 0.00 0.00 byteaout
13.44 175.84 34.95 3052797224 0.00 0.00
appendBinaryStringInfo
12.10 207.32 31.48 3052990837 0.00 0.00 CopySendChar
8.45 229.31 21.99 3052797226 0.00 0.00 enlargeStringInfo
3.90 239.45 10.14 55500 0.00 0.00 pglz_decompress
I hadn't looked closely at these numbers before, but now that I do,
what I think they are telling us is that the high proportion of
backslashes in standard bytea output is a real killer for COPY
performance. With no backslashes, CopySendChar wouldn't be in the
picture at all here, and appendBinaryStringInfo/enlargeStringInfo
would be called many fewer times (roughly 134487 not 3052797224)
with proportionately more characters processed per call. The inner
loop of CopyOneRowTo (I assume CopyAttributeOutText has been inlined
into that function) is relatively cheap for ordinary characters and
much less so for backslashes, so I bet that number would go down too.
And as already noted, byteaout itself works pretty hard to produce
the current representation.
So I'm now persuaded that a better textual representation for bytea
should indeed make things noticeably better here. It would be
useful though to cross-check this thought by profiling a case that
dumps a comparable volume of text data that contains no backslashes...
regards, tom lane
--On Mittwoch, Mai 06, 2009 19:04:21 -0400 Tom Lane <tgl@sss.pgh.pa.us>
wrote:
So I'm now persuaded that a better textual representation for bytea
should indeed make things noticeably better here. It would be
useful though to cross-check this thought by profiling a case that
dumps a comparable volume of text data that contains no backslashes...
I'm going to try to create a profile with a converted text representation
of the data.
--
Thanks
Bernd
--On Mittwoch, Mai 06, 2009 19:04:21 -0400 Tom Lane <tgl@sss.pgh.pa.us>
wrote:
So I'm now persuaded that a better textual representation for bytea
should indeed make things noticeably better here. It would be
useful though to cross-check this thought by profiling a case that
dumps a comparable volume of text data that contains no backslashes...
This is a profiling result of the same data converted into a printable text
format without any backslashes. The data amount is quite the same and as
you already guessed, calls to appendBinaryStringInfo() and friends gives
the expected numbers:
time seconds seconds calls s/call s/call name
35.13 24.67 24.67 134488 0.00 0.00 byteaout
32.61 47.57 22.90 134488 0.00 0.00 CopyOneRowTo
28.92 67.88 20.31 85967 0.00 0.00 pglz_decompress
0.67 68.35 0.47 4955300 0.00 0.00
hash_search_with_hash_value
0.28 68.55 0.20 11643046 0.00 0.00 LWLockRelease
0.28 68.75 0.20 4828896 0.00 0.00 index_getnext
0.24 68.92 0.17 1208577 0.00 0.00 StrategyGetBuffer
0.23 69.08 0.16 11643046 0.00 0.00 LWLockAcquire
...
0.00 70.23 0.00 134498 0.00 0.00 enlargeStringInfo
0.00 70.23 0.00 134497 0.00 0.00 appendBinaryStringInfo
0.00 70.23 0.00 134490 0.00 0.00 AllocSetReset
0.00 70.23 0.00 134490 0.00 0.00 resetStringInfo
0.00 70.23 0.00 134488 0.00 0.00 CopySendChar
0.00 70.23 0.00 134488 0.00 0.00 CopySendEndOfRow
--
Thanks
Bernd
Bernd Helmle wrote:
--On Mittwoch, Mai 06, 2009 19:04:21 -0400 Tom Lane <tgl@sss.pgh.pa.us>
wrote:So I'm now persuaded that a better textual representation for bytea
should indeed make things noticeably better here. It would be
useful though to cross-check this thought by profiling a case that
dumps a comparable volume of text data that contains no backslashes...This is a profiling result of the same data converted into a printable
text format without any backslashes. The data amount is quite the same
and as you already guessed, calls to appendBinaryStringInfo() and
friends gives the expected numbers:time seconds seconds calls s/call s/call name
35.13 24.67 24.67 134488 0.00 0.00 byteaout
32.61 47.57 22.90 134488 0.00 0.00 CopyOneRowTo
28.92 67.88 20.31 85967 0.00 0.00 pglz_decompress
0.67 68.35 0.47 4955300 0.00 0.00
hash_search_with_hash_value
0.28 68.55 0.20 11643046 0.00 0.00 LWLockRelease
0.28 68.75 0.20 4828896 0.00 0.00 index_getnext
0.24 68.92 0.17 1208577 0.00 0.00 StrategyGetBuffer
0.23 69.08 0.16 11643046 0.00 0.00 LWLockAcquire
...
0.00 70.23 0.00 134498 0.00 0.00 enlargeStringInfo
0.00 70.23 0.00 134497 0.00 0.00 appendBinaryStringInfo
0.00 70.23 0.00 134490 0.00 0.00 AllocSetReset
0.00 70.23 0.00 134490 0.00 0.00 resetStringInfo
0.00 70.23 0.00 134488 0.00 0.00 CopySendChar
0.00 70.23 0.00 134488 0.00 0.00 CopySendEndOfRow
while doing some pg_migrator testing I noticed that dumping a database
seems to be much slower than IO-system is capable off. ie i get 100% CPU
usage with no IO-wait at all with between 15-30MB/s read rate if i say
do a pg_dumpall > /dev/null.
The profile for that looks like:
samples % image name symbol name
1333764 29.3986 postgres CopyOneRowTo
463205 10.2099 postgres enlargeStringInfo
237117 5.2265 postgres AllocSetAlloc
231017 5.0920 postgres appendBinaryStringInfo
224792 4.9548 postgres heap_deform_tuple
172154 3.7946 postgres AllocSetReset
162434 3.5803 postgres DoCopyTo
149948 3.3051 postgres internal_putbytes
137548 3.0318 postgres OutputFunctionCall
129480 2.8540 postgres heapgettup_pagemode
101017 2.2266 postgres FunctionCall1
93584 2.0628 postgres pq_putmessage
86553 1.9078 postgres timesub
81400 1.7942 postgres CopySendChar
81230 1.7905 postgres int4out
78374 1.7275 postgres localsub
52003 1.1462 postgres MemoryContextAlloc
51265 1.1300 postgres CopySendEndOfRow
49849 1.0988 postgres SPI_push_conditional
48157 1.0615 postgres pg_server_to_client
47670 1.0507 postgres timestamptz_out
42762 0.9426 postgres timestamp2tm
Stefan
On Sat, May 16, 2009 at 11:23 AM, Stefan Kaltenbrunner
<stefan@kaltenbrunner.cc> wrote:
Bernd Helmle wrote:
--On Mittwoch, Mai 06, 2009 19:04:21 -0400 Tom Lane <tgl@sss.pgh.pa.us>
wrote:So I'm now persuaded that a better textual representation for bytea
should indeed make things noticeably better here. It would be
useful though to cross-check this thought by profiling a case that
dumps a comparable volume of text data that contains no backslashes...This is a profiling result of the same data converted into a printable
text format without any backslashes. The data amount is quite the same and
as you already guessed, calls to appendBinaryStringInfo() and friends gives
the expected numbers:time seconds seconds calls s/call s/call name
35.13 24.67 24.67 134488 0.00 0.00 byteaout
32.61 47.57 22.90 134488 0.00 0.00 CopyOneRowTo
28.92 67.88 20.31 85967 0.00 0.00 pglz_decompress
0.67 68.35 0.47 4955300 0.00 0.00
hash_search_with_hash_value
0.28 68.55 0.20 11643046 0.00 0.00 LWLockRelease
0.28 68.75 0.20 4828896 0.00 0.00 index_getnext
0.24 68.92 0.17 1208577 0.00 0.00 StrategyGetBuffer
0.23 69.08 0.16 11643046 0.00 0.00 LWLockAcquire
...
0.00 70.23 0.00 134498 0.00 0.00 enlargeStringInfo
0.00 70.23 0.00 134497 0.00 0.00
appendBinaryStringInfo
0.00 70.23 0.00 134490 0.00 0.00 AllocSetReset
0.00 70.23 0.00 134490 0.00 0.00 resetStringInfo
0.00 70.23 0.00 134488 0.00 0.00 CopySendChar
0.00 70.23 0.00 134488 0.00 0.00 CopySendEndOfRowwhile doing some pg_migrator testing I noticed that dumping a database seems
to be much slower than IO-system is capable off. ie i get 100% CPU usage
with no IO-wait at all with between 15-30MB/s read rate if i say do a
pg_dumpall > /dev/null.
Part of the problem is the decompression. Can't do much about that
except to not compress your data.
I don't have any hard statistics on hand at the moment, but a while
back we compared 'COPY' vs a hand written SPI routine that got the
tuple data in binary and streamed it out field by field raw to a file.
The speed difference was enormous..I don't recall the exact
difference but copy was at least 2x slower. This seems to suggest
there are many potential improvements to copy (my test was mainly
bytea as well).
merlin
Added to TODO:
|Improve bytea COPY format
* http://archives.postgresql.org/pgsql-hackers/2009-05/msg00192.php
---------------------------------------------------------------------------
Merlin Moncure wrote:
On Sat, May 16, 2009 at 11:23 AM, Stefan Kaltenbrunner
<stefan@kaltenbrunner.cc> wrote:Bernd Helmle wrote:
--On Mittwoch, Mai 06, 2009 19:04:21 -0400 Tom Lane <tgl@sss.pgh.pa.us>
wrote:So I'm now persuaded that a better textual representation for bytea
should indeed make things noticeably better here. ?It would be
useful though to cross-check this thought by profiling a case that
dumps a comparable volume of text data that contains no backslashes...This is a profiling result of the same data converted into a printable
text format without any backslashes. The data amount is quite the same and
as you already guessed, calls to appendBinaryStringInfo() and friends gives
the expected numbers:time ? seconds ? seconds ? ?calls ? s/call ? s/call ?name
35.13 ? ? 24.67 ? ?24.67 ? 134488 ? ? 0.00 ? ? 0.00 ?byteaout
32.61 ? ? 47.57 ? ?22.90 ? 134488 ? ? 0.00 ? ? 0.00 ?CopyOneRowTo
28.92 ? ? 67.88 ? ?20.31 ? ?85967 ? ? 0.00 ? ? 0.00 ?pglz_decompress
?0.67 ? ? 68.35 ? ? 0.47 ?4955300 ? ? 0.00 ? ? 0.00
hash_search_with_hash_value
?0.28 ? ? 68.55 ? ? 0.20 11643046 ? ? 0.00 ? ? 0.00 ?LWLockRelease
?0.28 ? ? 68.75 ? ? 0.20 ?4828896 ? ? 0.00 ? ? 0.00 ?index_getnext
?0.24 ? ? 68.92 ? ? 0.17 ?1208577 ? ? 0.00 ? ? 0.00 ?StrategyGetBuffer
?0.23 ? ? 69.08 ? ? 0.16 11643046 ? ? 0.00 ? ? 0.00 ?LWLockAcquire
...
?0.00 ? ? 70.23 ? ? 0.00 ? 134498 ? ? 0.00 ? ? 0.00 ?enlargeStringInfo
?0.00 ? ? 70.23 ? ? 0.00 ? 134497 ? ? 0.00 ? ? 0.00
?appendBinaryStringInfo
?0.00 ? ? 70.23 ? ? 0.00 ? 134490 ? ? 0.00 ? ? 0.00 ?AllocSetReset
?0.00 ? ? 70.23 ? ? 0.00 ? 134490 ? ? 0.00 ? ? 0.00 ?resetStringInfo
?0.00 ? ? 70.23 ? ? 0.00 ? 134488 ? ? 0.00 ? ? 0.00 ?CopySendChar
?0.00 ? ? 70.23 ? ? 0.00 ? 134488 ? ? 0.00 ? ? 0.00 ?CopySendEndOfRowwhile doing some pg_migrator testing I noticed that dumping a database seems
to be much slower than IO-system is capable off. ie i get 100% CPU usage
with no IO-wait at all with between 15-30MB/s read rate if i say do a
pg_dumpall > /dev/null.Part of the problem is the decompression. Can't do much about that
except to not compress your data.I don't have any hard statistics on hand at the moment, but a while
back we compared 'COPY' vs a hand written SPI routine that got the
tuple data in binary and streamed it out field by field raw to a file.
The speed difference was enormous..I don't recall the exact
difference but copy was at least 2x slower. This seems to suggest
there are many potential improvements to copy (my test was mainly
bytea as well).merlin
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
On Friday 29 May 2009 04:26:35 Bruce Momjian wrote:
Added to TODO:
|Improve bytea COPY format* http://archives.postgresql.org/pgsql-hackers/2009-05/msg00192.php
Btw., I have started to write some code for that.
On Fri, 2009-05-29 at 11:06 +0300, Peter Eisentraut wrote:
On Friday 29 May 2009 04:26:35 Bruce Momjian wrote:
Added to TODO:
|Improve bytea COPY format* http://archives.postgresql.org/pgsql-hackers/2009-05/msg00192.php
Btw., I have started to write some code for that.
why not copy bytea always in base64 encoded or similar format - this
will both save at least 2x the space on average random bytea data _and_
is probably faster, as it can be more easily done by table lookups in
bigger chunks
an alternative is to just escape minimal amount of characters, probably
just \0 , \n and \\
--
Hannu Krosing http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
Services, Consulting and Training
On Wed, 2009-05-06 at 18:33 +0300, Peter Eisentraut wrote:
On Tuesday 05 May 2009 17:38:33 Tom Lane wrote:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
Bernd Helmle <mailings@oopsware.de> wrote:
Another approach would be to just dump bytea columns in binary
format only (not sure how doable that is, though).If that's not doable, perhaps a base64 option for bytea COPY?
I'm thinking plain old pairs-of-hex-digits might be the best
tradeoff if conversion speed is the criterion. The main problem
in any case would be to decide how to control the format option.The output format can be controlled by a GUC parameter. And while we are at
it, we can also make bytea understand the new output format on input, so we
can offer an end-to-end alternative to the amazingly confusing current bytea
format and also make byteain() equally faster at the same time.For distinguishing various input formats, we could use the backslash to escape
the format specification without breaking backward compatibilty, e.g.,'\hexd41d8cd98f00b204e9800998ecf8427e'
With a bit of extra work we can wrap this up to be a more or less SQL-
conforming blob type, which would also make a lot of people very happy.
And we can also escape the need to uncompress TOAST'ed fields - just
markup the compression as another \c at the beginning of data.
--
Hannu Krosing http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
Services, Consulting and Training
--On Samstag, Mai 30, 2009 00:47:16 +0300 Hannu Krosing
<hannu@2ndQuadrant.com> wrote:
And we can also escape the need to uncompress TOAST'ed fields - just
markup the compression as another \c at the beginning of data.
Hmm i thought about that, but that seems only to make sense if there is an
easy way to "bypass" compressing the data on restore. Also, it seems to me
that compression/decompression isn't a "real" bottleneck, but that needs to
be confirmed.
--
Thanks
Bernd
--On Freitag, Mai 29, 2009 11:06:28 +0300 Peter Eisentraut
<peter_e@gmx.net> wrote:
Btw., I have started to write some code for that.
Cool. Let me know if i can help out somewhere.
--
Thanks
Bernd
On Wednesday 06 May 2009 18:47:57 Tom Lane wrote:
So the ambiguous-input problem is solved if we define the new format(s)
to be started by backslash and something that the old code would reject.
I'd keep it short, like "\x", but there's still room for multiple
formats if anyone really wants to go to the trouble.
Here is a first cut at a new hex bytea input and output format. Example:
SET bytea_output_hex = true;
SELECT E'\\xDeAdBeEf'::bytea;
bytea
------------
\xdeadbeef
(1 row)
Bernd did some performance testing for me, and it looked pretty good.
Questions:
Should this be the default format?
Should the configuration parameter be a boolean or an enum, opening
possibilities for other formats?
Attachments:
bytea-format.patchtext/x-patch; charset=UTF-8; name=bytea-format.patchDownload
diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index c944d8f..a6ac9c8 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL$ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/datatype.sgml,v 1.237 2009/04/27 16:27:35 momjian Exp $ -->
<chapter id="datatype">
<title id="datatype-title">Data Types</title>
@@ -1189,6 +1189,63 @@ SELECT b, char_length(b) FROM test2;
</para>
<para>
+ The <type>bytea</type> type supports two external formats for
+ input and output: the traditional bytea format that is particular
+ to PostgreSQL, and the hex format. Both of these are always
+ accepted on input. The output format depends on the configuration
+ parameter bytea_output_format; the default is hex. (Note that the
+ hex format was introduced in PostgreSQL 8.5; so earlier version
+ and some tools don't understand it.)
+ </para>
+
+ <para>
+ The <acronym>SQL</acronym> standard defines a different binary
+ string type, called <type>BLOB</type> or <type>BINARY LARGE
+ OBJECT</type>. The input format is different from
+ <type>bytea</type>, but the provided functions and operators are
+ mostly the same.
+ </para>
+
+ <sect2>
+ <title>Hex Format</title>
+
+ <para>
+ The hex format encodes the binary data as 2 hexadecimal digits per
+ byte, highest significant nibble first. The entire string ist
+ preceded by the sequence <literal>\x</literal> (to distinguish it
+ from the bytea format). In SQL literals, the backslash may need
+ to be escaped, but it is one logical backslash as far as the
+ <type>bytea</type> type is concerned. The hex format is compatible with a wide
+ range of external applications and protocols, and it tends to be
+ faster than the traditional bytea format, so its use is
+ somewhat preferrable.
+ </para>
+
+ <para>
+ Example:
+<programlisting>
+SELECT E'\\xDEADBEEF';
+</programlisting>
+ </para>
+ </sect2>
+
+ <sect2>
+ <title>Traditional Bytea Format</title>
+
+ <para>
+ The traditional bytea format takes the approach of representing a
+ binary string as a sequence of ASCII characters and escaping those
+ bytes that cannot be represented as an ASCII character by a
+ special escape sequence. If, from the point of view of the
+ application, representing bytes as characters makes sense, then
+ this representation can be convenient, but in practice it is
+ usually confusing becauses it fuzzes up the distinction between
+ binary strings and characters strings, and the particular escape
+ mechanism that was chosen is also somewhat unwieldy. So this
+ format should probably not be used for most new applications.
+ </para>
+
+ <para>
When entering <type>bytea</type> values, octets of certain
values <emphasis>must</emphasis> be escaped (but all octet
values <emphasis>can</emphasis> be escaped) when used as part
@@ -1341,14 +1398,7 @@ SELECT b, char_length(b) FROM test2;
have to escape line feeds and carriage returns if your interface
automatically translates these.
</para>
-
- <para>
- The <acronym>SQL</acronym> standard defines a different binary
- string type, called <type>BLOB</type> or <type>BINARY LARGE
- OBJECT</type>. The input format is different from
- <type>bytea</type>, but the provided functions and operators are
- mostly the same.
- </para>
+ </sect2>
</sect1>
diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index eed799a..b8a3cef 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -122,8 +122,8 @@ static const int8 hexlookup[128] = {
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
};
-static unsigned
-hex_encode(const char *src, unsigned len, char *dst)
+size_t
+hex_encode(const char *src, size_t len, char *dst)
{
const char *end = src + len;
@@ -152,8 +152,8 @@ get_hex(char c)
return (char) res;
}
-static unsigned
-hex_decode(const char *src, unsigned len, char *dst)
+size_t
+hex_decode(const char *src, size_t len, char *dst)
{
const char *s,
*srcend;
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 4cf3966..3c24686 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -62,6 +62,8 @@ static text *text_substring(Datum str,
bool length_not_specified);
static void appendStringInfoText(StringInfo str, const text *t);
+bool bytea_output_hex = true;
+
/*****************************************************************************
* CONVERSION ROUTINES EXPORTED FOR USE BY C CODE *
@@ -189,6 +191,18 @@ byteain(PG_FUNCTION_ARGS)
int byte;
bytea *result;
+ if (inputText[0] == '\\' && inputText[1] == 'x')
+ {
+ size_t len = strlen(inputText);
+
+ byte = (len - 2)/2 + VARHDRSZ;
+ result = palloc(byte);
+ SET_VARSIZE(result, byte);
+ hex_decode(inputText + 2, len - 2, VARDATA(result));
+
+ PG_RETURN_BYTEA_P(result);
+ }
+
for (byte = 0, tp = inputText; *tp != '\0'; byte++)
{
if (tp[0] != '\\')
@@ -269,8 +283,17 @@ byteaout(PG_FUNCTION_ARGS)
char *result;
char *vp;
char *rp;
- int val; /* holds unprintable chars */
int i;
+
+ if (bytea_output_hex)
+ {
+ rp = result = palloc(VARSIZE_ANY_EXHDR(vlena) * 2 + 2 + 1);
+ *rp++ = '\\';
+ *rp++ = 'x';
+ rp += hex_encode(VARDATA_ANY(vlena), VARSIZE_ANY_EXHDR(vlena), rp);
+ }
+ else
+ {
int len;
len = 1; /* empty string has 1 char */
@@ -295,6 +318,8 @@ byteaout(PG_FUNCTION_ARGS)
}
else if ((unsigned char) *vp < 0x20 || (unsigned char) *vp > 0x7e)
{
+ int val; /* holds unprintable chars */
+
val = *vp;
rp[0] = '\\';
rp[3] = DIG(val & 07);
@@ -307,6 +332,7 @@ byteaout(PG_FUNCTION_ARGS)
else
*rp++ = *vp;
}
+ }
*rp = '\0';
PG_RETURN_CSTRING(result);
}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 4f06725..19d07ff 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -584,6 +584,14 @@ const char *const config_type_names[] =
static struct config_bool ConfigureNamesBool[] =
{
{
+ {"bytea_output_hex", PGC_USERSET, CLIENT_CONN_OTHER,
+ gettext_noop("Sets the bytea output format to the hex format."),
+ NULL
+ },
+ &bytea_output_hex,
+ true, NULL, NULL
+ },
+ {
{"enable_seqscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of sequential-scan plans."),
NULL
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index c1b9393..d91a54c 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -704,6 +704,8 @@ extern Datum unknownout(PG_FUNCTION_ARGS);
extern Datum unknownrecv(PG_FUNCTION_ARGS);
extern Datum unknownsend(PG_FUNCTION_ARGS);
+extern bool bytea_output_hex;
+
extern Datum byteain(PG_FUNCTION_ARGS);
extern Datum byteaout(PG_FUNCTION_ARGS);
extern Datum bytearecv(PG_FUNCTION_ARGS);
@@ -728,6 +730,9 @@ extern Datum bytea_substr(PG_FUNCTION_ARGS);
extern Datum bytea_substr_no_len(PG_FUNCTION_ARGS);
extern Datum pg_column_size(PG_FUNCTION_ARGS);
+extern size_t hex_encode(const char *src, size_t len, char *dst);
+extern size_t hex_decode(const char *src, size_t len, char *dst);
+
/* version.c */
extern Datum pgsql_version(PG_FUNCTION_ARGS);
diff --git a/src/test/regress/expected/conversion.out b/src/test/regress/expected/conversion.out
index e80e1a4..3742dc0 100644
--- a/src/test/regress/expected/conversion.out
+++ b/src/test/regress/expected/conversion.out
@@ -1,3 +1,4 @@
+SET bytea_output_hex TO false;
--
-- create user defined conversion
--
diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out
index be8eb91..0c1fc41 100644
--- a/src/test/regress/expected/strings.out
+++ b/src/test/regress/expected/strings.out
@@ -97,6 +97,87 @@ LINE 1: SELECT U&'wrong: +0061' UESCAPE '+';
^
DETAIL: String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
RESET standard_conforming_strings;
+-- bytea
+SET bytea_output_hex = true;
+SELECT E'\\xDeAdBeEf'::bytea;
+ bytea
+------------
+ \xdeadbeef
+(1 row)
+
+SELECT E'\\xDeAdBeE'::bytea;
+ERROR: invalid hexadecimal data: odd number of digits
+LINE 1: SELECT E'\\xDeAdBeE'::bytea;
+ ^
+SELECT E'\\xDeAdBeEx'::bytea;
+ERROR: invalid hexadecimal digit: "x"
+LINE 1: SELECT E'\\xDeAdBeEx'::bytea;
+ ^
+SELECT E'\\xDe00BeEf'::bytea;
+ bytea
+------------
+ \xde00beef
+(1 row)
+
+SELECT E'DeAdBeEf'::bytea;
+ bytea
+--------------------
+ \x4465416442654566
+(1 row)
+
+SELECT E'De\\000dBeEf'::bytea;
+ bytea
+--------------------
+ \x4465006442654566
+(1 row)
+
+SELECT E'De\123dBeEf'::bytea;
+ bytea
+--------------------
+ \x4465536442654566
+(1 row)
+
+SELECT E'De\\123dBeEf'::bytea;
+ bytea
+--------------------
+ \x4465536442654566
+(1 row)
+
+SELECT E'De\\678dBeEf'::bytea;
+ERROR: invalid input syntax for type bytea
+LINE 1: SELECT E'De\\678dBeEf'::bytea;
+ ^
+SET bytea_output_hex = false;
+SELECT E'\\xDeAdBeEf'::bytea;
+ bytea
+------------------
+ \336\255\276\357
+(1 row)
+
+SELECT E'\\xDe00BeEf'::bytea;
+ bytea
+------------------
+ \336\000\276\357
+(1 row)
+
+SELECT E'DeAdBeEf'::bytea;
+ bytea
+----------
+ DeAdBeEf
+(1 row)
+
+SELECT E'De\\000dBeEf'::bytea;
+ bytea
+-------------
+ De\000dBeEf
+(1 row)
+
+SELECT E'De\\123dBeEf'::bytea;
+ bytea
+----------
+ DeSdBeEf
+(1 row)
+
--
-- test conversions between various string types
-- E021-10 implicit casting among the character data types
diff --git a/src/test/regress/input/largeobject.source b/src/test/regress/input/largeobject.source
index 46ba926..43f6855 100644
--- a/src/test/regress/input/largeobject.source
+++ b/src/test/regress/input/largeobject.source
@@ -2,6 +2,8 @@
-- Test large object support
--
+SET bytea_output_hex TO false;
+
-- Load a file
CREATE TABLE lotest_stash_values (loid oid, fd integer);
-- lo_creat(mode integer) returns oid
diff --git a/src/test/regress/output/largeobject.source b/src/test/regress/output/largeobject.source
index 9d69f6c..7012586 100644
--- a/src/test/regress/output/largeobject.source
+++ b/src/test/regress/output/largeobject.source
@@ -1,6 +1,7 @@
--
-- Test large object support
--
+SET bytea_output_hex TO false;
-- Load a file
CREATE TABLE lotest_stash_values (loid oid, fd integer);
-- lo_creat(mode integer) returns oid
diff --git a/src/test/regress/output/largeobject_1.source b/src/test/regress/output/largeobject_1.source
index 1fbc29c..0ece752 100644
--- a/src/test/regress/output/largeobject_1.source
+++ b/src/test/regress/output/largeobject_1.source
@@ -1,6 +1,7 @@
--
-- Test large object support
--
+SET bytea_output_hex TO false;
-- Load a file
CREATE TABLE lotest_stash_values (loid oid, fd integer);
-- lo_creat(mode integer) returns oid
diff --git a/src/test/regress/sql/conversion.sql b/src/test/regress/sql/conversion.sql
index 99a9178..001de4e 100644
--- a/src/test/regress/sql/conversion.sql
+++ b/src/test/regress/sql/conversion.sql
@@ -1,3 +1,5 @@
+SET bytea_output_hex TO false;
+
--
-- create user defined conversion
--
diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql
index a28c75a..8345534 100644
--- a/src/test/regress/sql/strings.sql
+++ b/src/test/regress/sql/strings.sql
@@ -43,6 +43,25 @@ SELECT U&'wrong: +0061' UESCAPE '+';
RESET standard_conforming_strings;
+-- bytea
+SET bytea_output_hex = true;
+SELECT E'\\xDeAdBeEf'::bytea;
+SELECT E'\\xDeAdBeE'::bytea;
+SELECT E'\\xDeAdBeEx'::bytea;
+SELECT E'\\xDe00BeEf'::bytea;
+SELECT E'DeAdBeEf'::bytea;
+SELECT E'De\\000dBeEf'::bytea;
+SELECT E'De\123dBeEf'::bytea;
+SELECT E'De\\123dBeEf'::bytea;
+SELECT E'De\\678dBeEf'::bytea;
+
+SET bytea_output_hex = false;
+SELECT E'\\xDeAdBeEf'::bytea;
+SELECT E'\\xDe00BeEf'::bytea;
+SELECT E'DeAdBeEf'::bytea;
+SELECT E'De\\000dBeEf'::bytea;
+SELECT E'De\\123dBeEf'::bytea;
+
--
-- test conversions between various string types
-- E021-10 implicit casting among the character data types
Peter Eisentraut <peter_e@gmx.net> writes:
Here is a first cut at a new hex bytea input and output format. Example:
...
SET bytea_output_hex = true;
Should the configuration parameter be a boolean or an enum, opening
possibilities for other formats?
Enum. If we do this then it seems entirely fair that someone might
want other settings someday. Also, it seems silly to pick a format
partly on the grounds that it's expansible, and then not make the
control GUC expansible. Perhaps
SET bytea_output = [ hex | traditional ]
regards, tom lane
--On Dienstag, Juli 07, 2009 18:07:08 -0400 Tom Lane <tgl@sss.pgh.pa.us>
wrote:
Enum. If we do this then it seems entirely fair that someone might
want other settings someday. Also, it seems silly to pick a format
partly on the grounds that it's expansible, and then not make the
control GUC expansible. PerhapsSET bytea_output = [ hex | traditional ]
I like the enum much better, too, but
SET bytea_output = [ hex | escape ]
looks better to me (encode/decode are using something like this already).
--
Thanks
Bernd
2009/7/8 Bernd Helmle <mailings@oopsware.de>:
--On Dienstag, Juli 07, 2009 18:07:08 -0400 Tom Lane <tgl@sss.pgh.pa.us>
wrote:Enum. If we do this then it seems entirely fair that someone might
want other settings someday. Also, it seems silly to pick a format
partly on the grounds that it's expansible, and then not make the
control GUC expansible. PerhapsSET bytea_output = [ hex | traditional ]
I like the enum much better, too, but
SET bytea_output = [ hex | escape ]
+ 1
Pavel
Show quoted text
looks better to me (encode/decode are using something like this already).
--
ThanksBernd
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hello, Bernd.
You wrote:
BH> --On Dienstag, Juli 07, 2009 18:07:08 -0400 Tom Lane <tgl@sss.pgh.pa.us>
BH> wrote:
Enum. If we do this then it seems entirely fair that someone might
want other settings someday. Also, it seems silly to pick a format
partly on the grounds that it's expansible, and then not make the
control GUC expansible. PerhapsSET bytea_output = [ hex | traditional ]
BH> I like the enum much better, too, but
BH> SET bytea_output = [ hex | escape ]
BH> looks better to me (encode/decode are using something like this already).
BH> --
BH> Thanks
BH> Bernd
Yeah, this looks nice for me too
--
With best wishes,
Pavel mailto:pavel@gf.microolap.com
On Wednesday 08 July 2009 01:07:08 Tom Lane wrote:
Peter Eisentraut <peter_e@gmx.net> writes:
Here is a first cut at a new hex bytea input and output format. Example:
...
SET bytea_output_hex = true;Should the configuration parameter be a boolean or an enum, opening
possibilities for other formats?Enum. If we do this then it seems entirely fair that someone might
want other settings someday. Also, it seems silly to pick a format
partly on the grounds that it's expansible, and then not make the
control GUC expansible. PerhapsSET bytea_output = [ hex | traditional ]
OK, here is an updated patch. It has the setting as enum, completed
documentation, and libpq support. I'll add it to the commit fest in the hope
that someone else can look it over in detail.
I'm attaching two versions of the patch. One it made with the -w option,
which leads to less differences.
Attachments:
bytea-format.patchtext/x-patch; charset=UTF-8; name=bytea-format.patchDownload
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2572d78..fece041 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3703,6 +3703,23 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
<title>Statement Behavior</title>
<variablelist>
+ <varlistentry id="guc-bytea-output" xreflabel="bytea_output">
+ <term><varname>bytea_output</varname> (<type>enum</type>)</term>
+ <indexterm>
+ <primary><varname>bytea_output</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ Sets the output format for values of type <type>bytea</type>.
+ Valid values are <literal>hex</literal> (the default)
+ and <literal>escape</literal> (the traditional PostgreSQL
+ format). The <xref linkend="datatype-binary"> for more
+ information. Note that the <type>bytea</type> type always
+ accepts both formats on input.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-search-path" xreflabel="search_path">
<term><varname>search_path</varname> (<type>string</type>)</term>
<indexterm>
diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index c944d8f..bdead3e 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL$ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/datatype.sgml,v 1.237 2009/04/27 16:27:35 momjian Exp $ -->
<chapter id="datatype">
<title id="datatype-title">Data Types</title>
@@ -1189,6 +1189,66 @@ SELECT b, char_length(b) FROM test2;
</para>
<para>
+ The <type>bytea</type> type supports two external formats for
+ input and output: the <quote>escape</quote> format that is
+ particular to PostgreSQL, and the <quote>hex</quote> format. Both
+ of these are always accepted on input. The output format depends
+ on the configuration parameter <xref linkend="guc-bytea-output">;
+ the default is hex. (Note that the hex format was introduced in
+ PostgreSQL 8.5; so earlier version and some tools don't understand
+ it.)
+ </para>
+
+ <para>
+ The <acronym>SQL</acronym> standard defines a different binary
+ string type, called <type>BLOB</type> or <type>BINARY LARGE
+ OBJECT</type>. The input format is different from
+ <type>bytea</type>, but the provided functions and operators are
+ mostly the same.
+ </para>
+
+ <sect2>
+ <title>The Hex Format</title>
+
+ <para>
+ The hex format encodes the binary data as 2 hexadecimal digits per
+ byte, highest significant nibble first. The entire string ist
+ preceded by the sequence <literal>\x</literal> (to distinguish it
+ from the bytea format). In SQL literals, the backslash may need
+ to be escaped, but it is one logical backslash as far as the
+ <type>bytea</type> type is concerned. The hex format is compatible with a wide
+ range of external applications and protocols, and it tends to be
+ faster than the traditional bytea format, so its use is
+ somewhat preferrable.
+ </para>
+
+ <para>
+ Example:
+<programlisting>
+SELECT E'\\xDEADBEEF';
+</programlisting>
+ </para>
+ </sect2>
+
+ <sect2>
+ <title>The Escape Format</title>
+
+ <para>
+ The <quote>escape</quote> format is the traditional
+ PostgreSQL-specific format for the <type>bytea</type> type. It
+ takes the approach of representing a binary string as a sequence
+ of ASCII characters and escaping those bytes that cannot be
+ represented as an ASCII character by a special escape sequence.
+ If, from the point of view of the application, representing bytes
+ as characters makes sense, then this representation can be
+ convenient, but in practice it is usually confusing becauses it
+ fuzzes up the distinction between binary strings and characters
+ strings, and the particular escape mechanism that was chosen is
+ also somewhat unwieldy. So this format should probably not be
+ used for most new applications.
+ </para>
+
+ <para>
When entering <type>bytea</type> values, octets of certain
values <emphasis>must</emphasis> be escaped (but all octet
values <emphasis>can</emphasis> be escaped) when used as part
@@ -1341,14 +1401,7 @@ SELECT b, char_length(b) FROM test2;
have to escape line feeds and carriage returns if your interface
automatically translates these.
</para>
-
- <para>
- The <acronym>SQL</acronym> standard defines a different binary
- string type, called <type>BLOB</type> or <type>BINARY LARGE
- OBJECT</type>. The input format is different from
- <type>bytea</type>, but the provided functions and operators are
- mostly the same.
- </para>
+ </sect2>
</sect1>
diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index eed799a..b8a3cef 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -122,8 +122,8 @@ static const int8 hexlookup[128] = {
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
};
-static unsigned
-hex_encode(const char *src, unsigned len, char *dst)
+size_t
+hex_encode(const char *src, size_t len, char *dst)
{
const char *end = src + len;
@@ -152,8 +152,8 @@ get_hex(char c)
return (char) res;
}
-static unsigned
-hex_decode(const char *src, unsigned len, char *dst)
+size_t
+hex_decode(const char *src, size_t len, char *dst)
{
const char *s,
*srcend;
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 4cf3966..7b80bc9 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -62,6 +62,8 @@ static text *text_substring(Datum str,
bool length_not_specified);
static void appendStringInfoText(StringInfo str, const text *t);
+int bytea_output;
+
/*****************************************************************************
* CONVERSION ROUTINES EXPORTED FOR USE BY C CODE *
@@ -189,6 +191,18 @@ byteain(PG_FUNCTION_ARGS)
int byte;
bytea *result;
+ if (inputText[0] == '\\' && inputText[1] == 'x')
+ {
+ size_t len = strlen(inputText);
+
+ byte = (len - 2)/2 + VARHDRSZ;
+ result = palloc(byte);
+ SET_VARSIZE(result, byte);
+ hex_decode(inputText + 2, len - 2, VARDATA(result));
+
+ PG_RETURN_BYTEA_P(result);
+ }
+
for (byte = 0, tp = inputText; *tp != '\0'; byte++)
{
if (tp[0] != '\\')
@@ -266,47 +280,61 @@ Datum
byteaout(PG_FUNCTION_ARGS)
{
bytea *vlena = PG_GETARG_BYTEA_PP(0);
- char *result;
+ char *result = NULL;
char *vp;
- char *rp;
- int val; /* holds unprintable chars */
+ char *rp = NULL;
int i;
- int len;
- len = 1; /* empty string has 1 char */
- vp = VARDATA_ANY(vlena);
- for (i = VARSIZE_ANY_EXHDR(vlena); i != 0; i--, vp++)
+ if (bytea_output == BYTEA_OUTPUT_HEX)
{
- if (*vp == '\\')
- len += 2;
- else if ((unsigned char) *vp < 0x20 || (unsigned char) *vp > 0x7e)
- len += 4;
- else
- len++;
+ rp = result = palloc(VARSIZE_ANY_EXHDR(vlena) * 2 + 2 + 1);
+ *rp++ = '\\';
+ *rp++ = 'x';
+ rp += hex_encode(VARDATA_ANY(vlena), VARSIZE_ANY_EXHDR(vlena), rp);
}
- rp = result = (char *) palloc(len);
- vp = VARDATA_ANY(vlena);
- for (i = VARSIZE_ANY_EXHDR(vlena); i != 0; i--, vp++)
+ else if (bytea_output == BYTEA_OUTPUT_ESCAPE)
{
- if (*vp == '\\')
+ int len;
+
+ len = 1; /* empty string has 1 char */
+ vp = VARDATA_ANY(vlena);
+ for (i = VARSIZE_ANY_EXHDR(vlena); i != 0; i--, vp++)
{
- *rp++ = '\\';
- *rp++ = '\\';
+ if (*vp == '\\')
+ len += 2;
+ else if ((unsigned char) *vp < 0x20 || (unsigned char) *vp > 0x7e)
+ len += 4;
+ else
+ len++;
}
- else if ((unsigned char) *vp < 0x20 || (unsigned char) *vp > 0x7e)
+ rp = result = (char *) palloc(len);
+ vp = VARDATA_ANY(vlena);
+ for (i = VARSIZE_ANY_EXHDR(vlena); i != 0; i--, vp++)
{
- val = *vp;
- rp[0] = '\\';
- rp[3] = DIG(val & 07);
- val >>= 3;
- rp[2] = DIG(val & 07);
- val >>= 3;
- rp[1] = DIG(val & 03);
- rp += 4;
+ if (*vp == '\\')
+ {
+ *rp++ = '\\';
+ *rp++ = '\\';
+ }
+ else if ((unsigned char) *vp < 0x20 || (unsigned char) *vp > 0x7e)
+ {
+ int val; /* holds unprintable chars */
+
+ val = *vp;
+ rp[0] = '\\';
+ rp[3] = DIG(val & 07);
+ val >>= 3;
+ rp[2] = DIG(val & 07);
+ val >>= 3;
+ rp[1] = DIG(val & 03);
+ rp += 4;
+ }
+ else
+ *rp++ = *vp;
}
- else
- *rp++ = *vp;
}
+ else
+ elog(ERROR, "unrecognized bytea_output setting: %d", bytea_output);
*rp = '\0';
PG_RETURN_CSTRING(result);
}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 4f06725..19b1f46 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -180,6 +180,12 @@ static char *config_enum_get_options(struct config_enum *record,
* NOTE! Option values may not contain double quotes!
*/
+static const struct config_enum_entry bytea_output_options[] = {
+ {"escape", BYTEA_OUTPUT_ESCAPE, false},
+ {"hex", BYTEA_OUTPUT_HEX, false},
+ {NULL, 0, false}
+};
+
/*
* We have different sets for client and server message level options because
* they sort slightly different (see "log" level)
@@ -2533,6 +2539,15 @@ static struct config_enum ConfigureNamesEnum[] =
},
{
+ {"bytea_output", PGC_USERSET, CLIENT_CONN_OTHER,
+ gettext_noop("Sets the bytea output format to the hex format."),
+ NULL
+ },
+ &bytea_output,
+ BYTEA_OUTPUT_HEX, bytea_output_options, NULL, NULL
+ },
+
+ {
{"client_min_messages", PGC_USERSET, LOGGING_WHEN,
gettext_noop("Sets the message levels that are sent to the client."),
gettext_noop("Each level includes all the levels that follow it. The later"
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index c1b9393..0c93a7e 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -704,6 +704,13 @@ extern Datum unknownout(PG_FUNCTION_ARGS);
extern Datum unknownrecv(PG_FUNCTION_ARGS);
extern Datum unknownsend(PG_FUNCTION_ARGS);
+typedef enum {
+ BYTEA_OUTPUT_ESCAPE,
+ BYTEA_OUTPUT_HEX
+} ByteaOutputType;
+
+extern int bytea_output; /* ByteaOutputType, but int for guc enum */
+
extern Datum byteain(PG_FUNCTION_ARGS);
extern Datum byteaout(PG_FUNCTION_ARGS);
extern Datum bytearecv(PG_FUNCTION_ARGS);
@@ -728,6 +735,9 @@ extern Datum bytea_substr(PG_FUNCTION_ARGS);
extern Datum bytea_substr_no_len(PG_FUNCTION_ARGS);
extern Datum pg_column_size(PG_FUNCTION_ARGS);
+extern size_t hex_encode(const char *src, size_t len, char *dst);
+extern size_t hex_decode(const char *src, size_t len, char *dst);
+
/* version.c */
extern Datum pgsql_version(PG_FUNCTION_ARGS);
diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c
index c1934ad..66d1114 100644
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -3166,6 +3166,29 @@ PQescapeBytea(const unsigned char *from, size_t from_length, size_t *to_length)
}
+static const int8 hexlookup[128] = {
+ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, -1, -1, -1, -1, -1, -1,
+ -1, 10, 11, 12, 13, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+ -1, 10, 11, 12, 13, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+};
+
+static char
+get_hex(char c)
+{
+ int res = -1;
+
+ if (c > 0 && c < 127)
+ res = hexlookup[(unsigned char) c];
+
+ return (char) res;
+}
+
+
#define ISFIRSTOCTDIGIT(CH) ((CH) >= '0' && (CH) <= '3')
#define ISOCTDIGIT(CH) ((CH) >= '0' && (CH) <= '7')
#define OCTVAL(CH) ((CH) - '0')
@@ -3197,6 +3220,37 @@ PQunescapeBytea(const unsigned char *strtext, size_t *retbuflen)
strtextlen = strlen((const char *) strtext);
+ if (strtext[0] == '\\' && strtext[1] == 'x')
+ {
+ const unsigned char *s;
+ unsigned char *p;
+
+ *retbuflen = (strtextlen - 2)/2;
+ /* Avoid unportable malloc(0) */
+ buffer = malloc(*retbuflen > 0 ? *retbuflen : 1);
+ if (buffer == NULL)
+ return NULL;
+
+ s = strtext + 2;
+ p = buffer;
+ while (*s)
+ {
+ char v1, v2;
+
+ v1 = get_hex(*s++) << 4;
+ if (!*s)
+ {
+ *retbuflen = -1;
+ return NULL;
+ }
+
+ v2 = get_hex(*s++);
+ *p++ = v1 | v2;
+ }
+
+ return buffer;
+ }
+ else {
/*
* Length of input is max length of output, but add one to avoid
* unportable malloc(0) if input is zero-length.
@@ -3257,4 +3311,5 @@ PQunescapeBytea(const unsigned char *strtext, size_t *retbuflen)
*retbuflen = buflen;
return tmpbuf;
+ }
}
diff --git a/src/test/regress/expected/conversion.out b/src/test/regress/expected/conversion.out
index e80e1a4..8f08706 100644
--- a/src/test/regress/expected/conversion.out
+++ b/src/test/regress/expected/conversion.out
@@ -1,3 +1,4 @@
+SET bytea_output TO escape;
--
-- create user defined conversion
--
diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out
index be8eb91..0315341 100644
--- a/src/test/regress/expected/strings.out
+++ b/src/test/regress/expected/strings.out
@@ -97,6 +97,87 @@ LINE 1: SELECT U&'wrong: +0061' UESCAPE '+';
^
DETAIL: String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
RESET standard_conforming_strings;
+-- bytea
+SET bytea_output TO hex;
+SELECT E'\\xDeAdBeEf'::bytea;
+ bytea
+------------
+ \xdeadbeef
+(1 row)
+
+SELECT E'\\xDeAdBeE'::bytea;
+ERROR: invalid hexadecimal data: odd number of digits
+LINE 1: SELECT E'\\xDeAdBeE'::bytea;
+ ^
+SELECT E'\\xDeAdBeEx'::bytea;
+ERROR: invalid hexadecimal digit: "x"
+LINE 1: SELECT E'\\xDeAdBeEx'::bytea;
+ ^
+SELECT E'\\xDe00BeEf'::bytea;
+ bytea
+------------
+ \xde00beef
+(1 row)
+
+SELECT E'DeAdBeEf'::bytea;
+ bytea
+--------------------
+ \x4465416442654566
+(1 row)
+
+SELECT E'De\\000dBeEf'::bytea;
+ bytea
+--------------------
+ \x4465006442654566
+(1 row)
+
+SELECT E'De\123dBeEf'::bytea;
+ bytea
+--------------------
+ \x4465536442654566
+(1 row)
+
+SELECT E'De\\123dBeEf'::bytea;
+ bytea
+--------------------
+ \x4465536442654566
+(1 row)
+
+SELECT E'De\\678dBeEf'::bytea;
+ERROR: invalid input syntax for type bytea
+LINE 1: SELECT E'De\\678dBeEf'::bytea;
+ ^
+SET bytea_output TO escape;
+SELECT E'\\xDeAdBeEf'::bytea;
+ bytea
+------------------
+ \336\255\276\357
+(1 row)
+
+SELECT E'\\xDe00BeEf'::bytea;
+ bytea
+------------------
+ \336\000\276\357
+(1 row)
+
+SELECT E'DeAdBeEf'::bytea;
+ bytea
+----------
+ DeAdBeEf
+(1 row)
+
+SELECT E'De\\000dBeEf'::bytea;
+ bytea
+-------------
+ De\000dBeEf
+(1 row)
+
+SELECT E'De\\123dBeEf'::bytea;
+ bytea
+----------
+ DeSdBeEf
+(1 row)
+
--
-- test conversions between various string types
-- E021-10 implicit casting among the character data types
diff --git a/src/test/regress/input/largeobject.source b/src/test/regress/input/largeobject.source
index 46ba926..5bfba18 100644
--- a/src/test/regress/input/largeobject.source
+++ b/src/test/regress/input/largeobject.source
@@ -2,6 +2,8 @@
-- Test large object support
--
+SET bytea_output TO escape;
+
-- Load a file
CREATE TABLE lotest_stash_values (loid oid, fd integer);
-- lo_creat(mode integer) returns oid
diff --git a/src/test/regress/output/largeobject.source b/src/test/regress/output/largeobject.source
index 9d69f6c..5ff2e3b 100644
--- a/src/test/regress/output/largeobject.source
+++ b/src/test/regress/output/largeobject.source
@@ -1,6 +1,7 @@
--
-- Test large object support
--
+SET bytea_output TO escape;
-- Load a file
CREATE TABLE lotest_stash_values (loid oid, fd integer);
-- lo_creat(mode integer) returns oid
diff --git a/src/test/regress/output/largeobject_1.source b/src/test/regress/output/largeobject_1.source
index 1fbc29c..0ece752 100644
--- a/src/test/regress/output/largeobject_1.source
+++ b/src/test/regress/output/largeobject_1.source
@@ -1,6 +1,7 @@
--
-- Test large object support
--
+SET bytea_output_hex TO false;
-- Load a file
CREATE TABLE lotest_stash_values (loid oid, fd integer);
-- lo_creat(mode integer) returns oid
diff --git a/src/test/regress/sql/conversion.sql b/src/test/regress/sql/conversion.sql
index 99a9178..c4bb583 100644
--- a/src/test/regress/sql/conversion.sql
+++ b/src/test/regress/sql/conversion.sql
@@ -1,3 +1,5 @@
+SET bytea_output TO escape;
+
--
-- create user defined conversion
--
diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql
index a28c75a..3e9cfa3 100644
--- a/src/test/regress/sql/strings.sql
+++ b/src/test/regress/sql/strings.sql
@@ -43,6 +43,25 @@ SELECT U&'wrong: +0061' UESCAPE '+';
RESET standard_conforming_strings;
+-- bytea
+SET bytea_output TO hex;
+SELECT E'\\xDeAdBeEf'::bytea;
+SELECT E'\\xDeAdBeE'::bytea;
+SELECT E'\\xDeAdBeEx'::bytea;
+SELECT E'\\xDe00BeEf'::bytea;
+SELECT E'DeAdBeEf'::bytea;
+SELECT E'De\\000dBeEf'::bytea;
+SELECT E'De\123dBeEf'::bytea;
+SELECT E'De\\123dBeEf'::bytea;
+SELECT E'De\\678dBeEf'::bytea;
+
+SET bytea_output TO escape;
+SELECT E'\\xDeAdBeEf'::bytea;
+SELECT E'\\xDe00BeEf'::bytea;
+SELECT E'DeAdBeEf'::bytea;
+SELECT E'De\\000dBeEf'::bytea;
+SELECT E'De\\123dBeEf'::bytea;
+
--
-- test conversions between various string types
-- E021-10 implicit casting among the character data types
bytea-format-w.patchtext/x-patch; charset=UTF-8; name=bytea-format-w.patchDownload
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2572d78..fece041 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3703,6 +3703,23 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
<title>Statement Behavior</title>
<variablelist>
+ <varlistentry id="guc-bytea-output" xreflabel="bytea_output">
+ <term><varname>bytea_output</varname> (<type>enum</type>)</term>
+ <indexterm>
+ <primary><varname>bytea_output</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ Sets the output format for values of type <type>bytea</type>.
+ Valid values are <literal>hex</literal> (the default)
+ and <literal>escape</literal> (the traditional PostgreSQL
+ format). The <xref linkend="datatype-binary"> for more
+ information. Note that the <type>bytea</type> type always
+ accepts both formats on input.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-search-path" xreflabel="search_path">
<term><varname>search_path</varname> (<type>string</type>)</term>
<indexterm>
diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index c944d8f..bdead3e 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL$ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/datatype.sgml,v 1.237 2009/04/27 16:27:35 momjian Exp $ -->
<chapter id="datatype">
<title id="datatype-title">Data Types</title>
@@ -1189,6 +1189,66 @@ SELECT b, char_length(b) FROM test2;
</para>
<para>
+ The <type>bytea</type> type supports two external formats for
+ input and output: the <quote>escape</quote> format that is
+ particular to PostgreSQL, and the <quote>hex</quote> format. Both
+ of these are always accepted on input. The output format depends
+ on the configuration parameter <xref linkend="guc-bytea-output">;
+ the default is hex. (Note that the hex format was introduced in
+ PostgreSQL 8.5; so earlier version and some tools don't understand
+ it.)
+ </para>
+
+ <para>
+ The <acronym>SQL</acronym> standard defines a different binary
+ string type, called <type>BLOB</type> or <type>BINARY LARGE
+ OBJECT</type>. The input format is different from
+ <type>bytea</type>, but the provided functions and operators are
+ mostly the same.
+ </para>
+
+ <sect2>
+ <title>The Hex Format</title>
+
+ <para>
+ The hex format encodes the binary data as 2 hexadecimal digits per
+ byte, highest significant nibble first. The entire string ist
+ preceded by the sequence <literal>\x</literal> (to distinguish it
+ from the bytea format). In SQL literals, the backslash may need
+ to be escaped, but it is one logical backslash as far as the
+ <type>bytea</type> type is concerned. The hex format is compatible with a wide
+ range of external applications and protocols, and it tends to be
+ faster than the traditional bytea format, so its use is
+ somewhat preferrable.
+ </para>
+
+ <para>
+ Example:
+<programlisting>
+SELECT E'\\xDEADBEEF';
+</programlisting>
+ </para>
+ </sect2>
+
+ <sect2>
+ <title>The Escape Format</title>
+
+ <para>
+ The <quote>escape</quote> format is the traditional
+ PostgreSQL-specific format for the <type>bytea</type> type. It
+ takes the approach of representing a binary string as a sequence
+ of ASCII characters and escaping those bytes that cannot be
+ represented as an ASCII character by a special escape sequence.
+ If, from the point of view of the application, representing bytes
+ as characters makes sense, then this representation can be
+ convenient, but in practice it is usually confusing becauses it
+ fuzzes up the distinction between binary strings and characters
+ strings, and the particular escape mechanism that was chosen is
+ also somewhat unwieldy. So this format should probably not be
+ used for most new applications.
+ </para>
+
+ <para>
When entering <type>bytea</type> values, octets of certain
values <emphasis>must</emphasis> be escaped (but all octet
values <emphasis>can</emphasis> be escaped) when used as part
@@ -1341,14 +1401,7 @@ SELECT b, char_length(b) FROM test2;
have to escape line feeds and carriage returns if your interface
automatically translates these.
</para>
-
- <para>
- The <acronym>SQL</acronym> standard defines a different binary
- string type, called <type>BLOB</type> or <type>BINARY LARGE
- OBJECT</type>. The input format is different from
- <type>bytea</type>, but the provided functions and operators are
- mostly the same.
- </para>
+ </sect2>
</sect1>
diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index eed799a..b8a3cef 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -122,8 +122,8 @@ static const int8 hexlookup[128] = {
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
};
-static unsigned
-hex_encode(const char *src, unsigned len, char *dst)
+size_t
+hex_encode(const char *src, size_t len, char *dst)
{
const char *end = src + len;
@@ -152,8 +152,8 @@ get_hex(char c)
return (char) res;
}
-static unsigned
-hex_decode(const char *src, unsigned len, char *dst)
+size_t
+hex_decode(const char *src, size_t len, char *dst)
{
const char *s,
*srcend;
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 4cf3966..7b80bc9 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -62,6 +62,8 @@ static text *text_substring(Datum str,
bool length_not_specified);
static void appendStringInfoText(StringInfo str, const text *t);
+int bytea_output;
+
/*****************************************************************************
* CONVERSION ROUTINES EXPORTED FOR USE BY C CODE *
@@ -189,6 +191,18 @@ byteain(PG_FUNCTION_ARGS)
int byte;
bytea *result;
+ if (inputText[0] == '\\' && inputText[1] == 'x')
+ {
+ size_t len = strlen(inputText);
+
+ byte = (len - 2)/2 + VARHDRSZ;
+ result = palloc(byte);
+ SET_VARSIZE(result, byte);
+ hex_decode(inputText + 2, len - 2, VARDATA(result));
+
+ PG_RETURN_BYTEA_P(result);
+ }
+
for (byte = 0, tp = inputText; *tp != '\0'; byte++)
{
if (tp[0] != '\\')
@@ -266,11 +280,20 @@ Datum
byteaout(PG_FUNCTION_ARGS)
{
bytea *vlena = PG_GETARG_BYTEA_PP(0);
- char *result;
+ char *result = NULL;
char *vp;
- char *rp;
- int val; /* holds unprintable chars */
+ char *rp = NULL;
int i;
+
+ if (bytea_output == BYTEA_OUTPUT_HEX)
+ {
+ rp = result = palloc(VARSIZE_ANY_EXHDR(vlena) * 2 + 2 + 1);
+ *rp++ = '\\';
+ *rp++ = 'x';
+ rp += hex_encode(VARDATA_ANY(vlena), VARSIZE_ANY_EXHDR(vlena), rp);
+ }
+ else if (bytea_output == BYTEA_OUTPUT_ESCAPE)
+ {
int len;
len = 1; /* empty string has 1 char */
@@ -295,6 +318,8 @@ byteaout(PG_FUNCTION_ARGS)
}
else if ((unsigned char) *vp < 0x20 || (unsigned char) *vp > 0x7e)
{
+ int val; /* holds unprintable chars */
+
val = *vp;
rp[0] = '\\';
rp[3] = DIG(val & 07);
@@ -307,6 +332,9 @@ byteaout(PG_FUNCTION_ARGS)
else
*rp++ = *vp;
}
+ }
+ else
+ elog(ERROR, "unrecognized bytea_output setting: %d", bytea_output);
*rp = '\0';
PG_RETURN_CSTRING(result);
}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 4f06725..19b1f46 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -180,6 +180,12 @@ static char *config_enum_get_options(struct config_enum *record,
* NOTE! Option values may not contain double quotes!
*/
+static const struct config_enum_entry bytea_output_options[] = {
+ {"escape", BYTEA_OUTPUT_ESCAPE, false},
+ {"hex", BYTEA_OUTPUT_HEX, false},
+ {NULL, 0, false}
+};
+
/*
* We have different sets for client and server message level options because
* they sort slightly different (see "log" level)
@@ -2533,6 +2539,15 @@ static struct config_enum ConfigureNamesEnum[] =
},
{
+ {"bytea_output", PGC_USERSET, CLIENT_CONN_OTHER,
+ gettext_noop("Sets the bytea output format to the hex format."),
+ NULL
+ },
+ &bytea_output,
+ BYTEA_OUTPUT_HEX, bytea_output_options, NULL, NULL
+ },
+
+ {
{"client_min_messages", PGC_USERSET, LOGGING_WHEN,
gettext_noop("Sets the message levels that are sent to the client."),
gettext_noop("Each level includes all the levels that follow it. The later"
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index c1b9393..0c93a7e 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -704,6 +704,13 @@ extern Datum unknownout(PG_FUNCTION_ARGS);
extern Datum unknownrecv(PG_FUNCTION_ARGS);
extern Datum unknownsend(PG_FUNCTION_ARGS);
+typedef enum {
+ BYTEA_OUTPUT_ESCAPE,
+ BYTEA_OUTPUT_HEX
+} ByteaOutputType;
+
+extern int bytea_output; /* ByteaOutputType, but int for guc enum */
+
extern Datum byteain(PG_FUNCTION_ARGS);
extern Datum byteaout(PG_FUNCTION_ARGS);
extern Datum bytearecv(PG_FUNCTION_ARGS);
@@ -728,6 +735,9 @@ extern Datum bytea_substr(PG_FUNCTION_ARGS);
extern Datum bytea_substr_no_len(PG_FUNCTION_ARGS);
extern Datum pg_column_size(PG_FUNCTION_ARGS);
+extern size_t hex_encode(const char *src, size_t len, char *dst);
+extern size_t hex_decode(const char *src, size_t len, char *dst);
+
/* version.c */
extern Datum pgsql_version(PG_FUNCTION_ARGS);
diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c
index c1934ad..66d1114 100644
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -3166,6 +3166,29 @@ PQescapeBytea(const unsigned char *from, size_t from_length, size_t *to_length)
}
+static const int8 hexlookup[128] = {
+ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, -1, -1, -1, -1, -1, -1,
+ -1, 10, 11, 12, 13, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+ -1, 10, 11, 12, 13, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+};
+
+static char
+get_hex(char c)
+{
+ int res = -1;
+
+ if (c > 0 && c < 127)
+ res = hexlookup[(unsigned char) c];
+
+ return (char) res;
+}
+
+
#define ISFIRSTOCTDIGIT(CH) ((CH) >= '0' && (CH) <= '3')
#define ISOCTDIGIT(CH) ((CH) >= '0' && (CH) <= '7')
#define OCTVAL(CH) ((CH) - '0')
@@ -3197,6 +3220,37 @@ PQunescapeBytea(const unsigned char *strtext, size_t *retbuflen)
strtextlen = strlen((const char *) strtext);
+ if (strtext[0] == '\\' && strtext[1] == 'x')
+ {
+ const unsigned char *s;
+ unsigned char *p;
+
+ *retbuflen = (strtextlen - 2)/2;
+ /* Avoid unportable malloc(0) */
+ buffer = malloc(*retbuflen > 0 ? *retbuflen : 1);
+ if (buffer == NULL)
+ return NULL;
+
+ s = strtext + 2;
+ p = buffer;
+ while (*s)
+ {
+ char v1, v2;
+
+ v1 = get_hex(*s++) << 4;
+ if (!*s)
+ {
+ *retbuflen = -1;
+ return NULL;
+ }
+
+ v2 = get_hex(*s++);
+ *p++ = v1 | v2;
+ }
+
+ return buffer;
+ }
+ else {
/*
* Length of input is max length of output, but add one to avoid
* unportable malloc(0) if input is zero-length.
@@ -3258,3 +3312,4 @@ PQunescapeBytea(const unsigned char *strtext, size_t *retbuflen)
*retbuflen = buflen;
return tmpbuf;
}
+}
diff --git a/src/test/regress/expected/conversion.out b/src/test/regress/expected/conversion.out
index e80e1a4..8f08706 100644
--- a/src/test/regress/expected/conversion.out
+++ b/src/test/regress/expected/conversion.out
@@ -1,3 +1,4 @@
+SET bytea_output TO escape;
--
-- create user defined conversion
--
diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out
index be8eb91..0315341 100644
--- a/src/test/regress/expected/strings.out
+++ b/src/test/regress/expected/strings.out
@@ -97,6 +97,87 @@ LINE 1: SELECT U&'wrong: +0061' UESCAPE '+';
^
DETAIL: String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
RESET standard_conforming_strings;
+-- bytea
+SET bytea_output TO hex;
+SELECT E'\\xDeAdBeEf'::bytea;
+ bytea
+------------
+ \xdeadbeef
+(1 row)
+
+SELECT E'\\xDeAdBeE'::bytea;
+ERROR: invalid hexadecimal data: odd number of digits
+LINE 1: SELECT E'\\xDeAdBeE'::bytea;
+ ^
+SELECT E'\\xDeAdBeEx'::bytea;
+ERROR: invalid hexadecimal digit: "x"
+LINE 1: SELECT E'\\xDeAdBeEx'::bytea;
+ ^
+SELECT E'\\xDe00BeEf'::bytea;
+ bytea
+------------
+ \xde00beef
+(1 row)
+
+SELECT E'DeAdBeEf'::bytea;
+ bytea
+--------------------
+ \x4465416442654566
+(1 row)
+
+SELECT E'De\\000dBeEf'::bytea;
+ bytea
+--------------------
+ \x4465006442654566
+(1 row)
+
+SELECT E'De\123dBeEf'::bytea;
+ bytea
+--------------------
+ \x4465536442654566
+(1 row)
+
+SELECT E'De\\123dBeEf'::bytea;
+ bytea
+--------------------
+ \x4465536442654566
+(1 row)
+
+SELECT E'De\\678dBeEf'::bytea;
+ERROR: invalid input syntax for type bytea
+LINE 1: SELECT E'De\\678dBeEf'::bytea;
+ ^
+SET bytea_output TO escape;
+SELECT E'\\xDeAdBeEf'::bytea;
+ bytea
+------------------
+ \336\255\276\357
+(1 row)
+
+SELECT E'\\xDe00BeEf'::bytea;
+ bytea
+------------------
+ \336\000\276\357
+(1 row)
+
+SELECT E'DeAdBeEf'::bytea;
+ bytea
+----------
+ DeAdBeEf
+(1 row)
+
+SELECT E'De\\000dBeEf'::bytea;
+ bytea
+-------------
+ De\000dBeEf
+(1 row)
+
+SELECT E'De\\123dBeEf'::bytea;
+ bytea
+----------
+ DeSdBeEf
+(1 row)
+
--
-- test conversions between various string types
-- E021-10 implicit casting among the character data types
diff --git a/src/test/regress/input/largeobject.source b/src/test/regress/input/largeobject.source
index 46ba926..5bfba18 100644
--- a/src/test/regress/input/largeobject.source
+++ b/src/test/regress/input/largeobject.source
@@ -2,6 +2,8 @@
-- Test large object support
--
+SET bytea_output TO escape;
+
-- Load a file
CREATE TABLE lotest_stash_values (loid oid, fd integer);
-- lo_creat(mode integer) returns oid
diff --git a/src/test/regress/output/largeobject.source b/src/test/regress/output/largeobject.source
index 9d69f6c..5ff2e3b 100644
--- a/src/test/regress/output/largeobject.source
+++ b/src/test/regress/output/largeobject.source
@@ -1,6 +1,7 @@
--
-- Test large object support
--
+SET bytea_output TO escape;
-- Load a file
CREATE TABLE lotest_stash_values (loid oid, fd integer);
-- lo_creat(mode integer) returns oid
diff --git a/src/test/regress/output/largeobject_1.source b/src/test/regress/output/largeobject_1.source
index 1fbc29c..0ece752 100644
--- a/src/test/regress/output/largeobject_1.source
+++ b/src/test/regress/output/largeobject_1.source
@@ -1,6 +1,7 @@
--
-- Test large object support
--
+SET bytea_output_hex TO false;
-- Load a file
CREATE TABLE lotest_stash_values (loid oid, fd integer);
-- lo_creat(mode integer) returns oid
diff --git a/src/test/regress/sql/conversion.sql b/src/test/regress/sql/conversion.sql
index 99a9178..c4bb583 100644
--- a/src/test/regress/sql/conversion.sql
+++ b/src/test/regress/sql/conversion.sql
@@ -1,3 +1,5 @@
+SET bytea_output TO escape;
+
--
-- create user defined conversion
--
diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql
index a28c75a..3e9cfa3 100644
--- a/src/test/regress/sql/strings.sql
+++ b/src/test/regress/sql/strings.sql
@@ -43,6 +43,25 @@ SELECT U&'wrong: +0061' UESCAPE '+';
RESET standard_conforming_strings;
+-- bytea
+SET bytea_output TO hex;
+SELECT E'\\xDeAdBeEf'::bytea;
+SELECT E'\\xDeAdBeE'::bytea;
+SELECT E'\\xDeAdBeEx'::bytea;
+SELECT E'\\xDe00BeEf'::bytea;
+SELECT E'DeAdBeEf'::bytea;
+SELECT E'De\\000dBeEf'::bytea;
+SELECT E'De\123dBeEf'::bytea;
+SELECT E'De\\123dBeEf'::bytea;
+SELECT E'De\\678dBeEf'::bytea;
+
+SET bytea_output TO escape;
+SELECT E'\\xDeAdBeEf'::bytea;
+SELECT E'\\xDe00BeEf'::bytea;
+SELECT E'DeAdBeEf'::bytea;
+SELECT E'De\\000dBeEf'::bytea;
+SELECT E'De\\123dBeEf'::bytea;
+
--
-- test conversions between various string types
-- E021-10 implicit casting among the character data types
--On Samstag, Juli 11, 2009 13:40:44 +0300 Peter Eisentraut
<peter_e@gmx.net> wrote:
OK, here is an updated patch. It has the setting as enum, completed
documentation, and libpq support. I'll add it to the commit fest in the
hope that someone else can look it over in detail.
I've started looking at this and did some profiling with large bytea data
again. For those interested, here are the numbers:
Dumping with bytea_output=hex (COPY to file):
real 20m38.699s
user 0m11.265s
sys 1m0.560s
Dumping with bytea_output=escape (COPY to file):
real 39m52.399s
user 0m22.085s
sys 1m50.131s
So the time needed dropped about 50%. The dump file dropped from around 48
GB to 28 GB with the new format. I have some profiler data for this, but
the restore seems much more interesting: the time to restore for both
formats is quite the same:
Restore bytea_output=hex
real 32m11.028s
user 0m0.000s
sys 0m0.008s
Restore bytea_output=escape
real 31m35.378s
user 0m0.000s
sys 0m0.000s
The profile for restoring the hex format looks like this:
% cumulative self self total
time seconds seconds calls s/call s/call name
34.54 156.79 156.79 97836 0.00 0.00 pglz_compress
18.64 241.38 84.59 141374 0.00 0.00 CopyReadLine
12.83 299.62 58.24 3604740854 0.00 0.00 get_hex
8.44 337.95 38.33 14257432 0.00 0.00 XLogInsert
7.39 371.48 33.53 141373 0.00 0.00 hex_decode
7.23 404.31 32.83 1 32.83 436.67 DoCopy
3.48 420.12 15.81 esc_enc_len
0.61 422.89 2.77 134943749 0.00 0.00 _bt_compare
0.54 425.36 2.47 33682172 0.00 0.00 ReadBuffer_common
0.54 427.83 2.47 52166324 0.00 0.00
hash_search_with_hash_value
0.45 429.89 2.06 104798203 0.00 0.00 LWLockAcquire
0.36 431.53 1.64 105234314 0.00 0.00 LWLockRelease
I've attached a slightly edited patch which fixes a compiler warning in
encode.c, too.
--
Thanks
Bernd
Attachments:
bytea-format_review1.patchtext/x-diff; charset=utf-8; name=bytea-format_review1.patchDownload
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 99d25d7..9a543bc 100644
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*************** COPY postgres_log FROM '/full/path/to/lo
*** 3735,3740 ****
--- 3735,3757 ----
<title>Statement Behavior</title>
<variablelist>
+ <varlistentry id="guc-bytea-output" xreflabel="bytea_output">
+ <term><varname>bytea_output</varname> (<type>enum</type>)</term>
+ <indexterm>
+ <primary><varname>bytea_output</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ Sets the output format for values of type <type>bytea</type>.
+ Valid values are <literal>hex</literal> (the default)
+ and <literal>escape</literal> (the traditional PostgreSQL
+ format). The <xref linkend="datatype-binary"> for more
+ information. Note that the <type>bytea</type> type always
+ accepts both formats on input.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-search-path" xreflabel="search_path">
<term><varname>search_path</varname> (<type>string</type>)</term>
<indexterm>
diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index f7ee8e9..8576419 100644
*** a/doc/src/sgml/datatype.sgml
--- b/doc/src/sgml/datatype.sgml
***************
*** 1,4 ****
! <!-- $PostgreSQL$ -->
<chapter id="datatype">
<title id="datatype-title">Data Types</title>
--- 1,4 ----
! <!-- $PostgreSQL: pgsql/doc/src/sgml/datatype.sgml,v 1.237 2009/04/27 16:27:35 momjian Exp $ -->
<chapter id="datatype">
<title id="datatype-title">Data Types</title>
*************** SELECT b, char_length(b) FROM test2;
*** 1191,1196 ****
--- 1191,1256 ----
</para>
<para>
+ The <type>bytea</type> type supports two external formats for
+ input and output: the <quote>escape</quote> format that is
+ particular to PostgreSQL, and the <quote>hex</quote> format. Both
+ of these are always accepted on input. The output format depends
+ on the configuration parameter <xref linkend="guc-bytea-output">;
+ the default is hex. (Note that the hex format was introduced in
+ PostgreSQL 8.5; so earlier version and some tools don't understand
+ it.)
+ </para>
+
+ <para>
+ The <acronym>SQL</acronym> standard defines a different binary
+ string type, called <type>BLOB</type> or <type>BINARY LARGE
+ OBJECT</type>. The input format is different from
+ <type>bytea</type>, but the provided functions and operators are
+ mostly the same.
+ </para>
+
+ <sect2>
+ <title>The Hex Format</title>
+
+ <para>
+ The hex format encodes the binary data as 2 hexadecimal digits per
+ byte, highest significant nibble first. The entire string ist
+ preceded by the sequence <literal>\x</literal> (to distinguish it
+ from the bytea format). In SQL literals, the backslash may need
+ to be escaped, but it is one logical backslash as far as the
+ <type>bytea</type> type is concerned. The hex format is compatible with a wide
+ range of external applications and protocols, and it tends to be
+ faster than the traditional bytea format, so its use is
+ somewhat preferrable.
+ </para>
+
+ <para>
+ Example:
+ <programlisting>
+ SELECT E'\\xDEADBEEF';
+ </programlisting>
+ </para>
+ </sect2>
+
+ <sect2>
+ <title>The Escape Format</title>
+
+ <para>
+ The <quote>escape</quote> format is the traditional
+ PostgreSQL-specific format for the <type>bytea</type> type. It
+ takes the approach of representing a binary string as a sequence
+ of ASCII characters and escaping those bytes that cannot be
+ represented as an ASCII character by a special escape sequence.
+ If, from the point of view of the application, representing bytes
+ as characters makes sense, then this representation can be
+ convenient, but in practice it is usually confusing becauses it
+ fuzzes up the distinction between binary strings and characters
+ strings, and the particular escape mechanism that was chosen is
+ also somewhat unwieldy. So this format should probably not be
+ used for most new applications.
+ </para>
+
+ <para>
When entering <type>bytea</type> values, octets of certain
values <emphasis>must</emphasis> be escaped (but all octet
values <emphasis>can</emphasis> be escaped) when used as part
*************** SELECT b, char_length(b) FROM test2;
*** 1343,1356 ****
have to escape line feeds and carriage returns if your interface
automatically translates these.
</para>
!
! <para>
! The <acronym>SQL</acronym> standard defines a different binary
! string type, called <type>BLOB</type> or <type>BINARY LARGE
! OBJECT</type>. The input format is different from
! <type>bytea</type>, but the provided functions and operators are
! mostly the same.
! </para>
</sect1>
--- 1403,1409 ----
have to escape line feeds and carriage returns if your interface
automatically translates these.
</para>
! </sect2>
</sect1>
diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index eed799a..4f47c8c 100644
*** a/src/backend/utils/adt/encode.c
--- b/src/backend/utils/adt/encode.c
*************** static const int8 hexlookup[128] = {
*** 122,128 ****
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
};
! static unsigned
hex_encode(const char *src, unsigned len, char *dst)
{
const char *end = src + len;
--- 122,128 ----
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
};
! unsigned
hex_encode(const char *src, unsigned len, char *dst)
{
const char *end = src + len;
*************** get_hex(char c)
*** 152,158 ****
return (char) res;
}
! static unsigned
hex_decode(const char *src, unsigned len, char *dst)
{
const char *s,
--- 152,158 ----
return (char) res;
}
! unsigned
hex_decode(const char *src, unsigned len, char *dst)
{
const char *s,
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 4434c97..9c8ea9d 100644
*** a/src/backend/utils/adt/varlena.c
--- b/src/backend/utils/adt/varlena.c
*************** static text *text_substring(Datum str,
*** 62,67 ****
--- 62,69 ----
bool length_not_specified);
static void appendStringInfoText(StringInfo str, const text *t);
+ int bytea_output;
+
/*****************************************************************************
* CONVERSION ROUTINES EXPORTED FOR USE BY C CODE *
*************** byteain(PG_FUNCTION_ARGS)
*** 189,194 ****
--- 191,207 ----
int byte;
bytea *result;
+ if (inputText[0] == '\\' && inputText[1] == 'x')
+ {
+ size_t len = strlen(inputText);
+ byte = (len - 2)/2 + VARHDRSZ;
+ result = palloc(byte);
+ SET_VARSIZE(result, byte);
+ hex_decode(inputText + 2, len - 2, VARDATA(result));
+
+ PG_RETURN_BYTEA_P(result);
+ }
+
for (byte = 0, tp = inputText; *tp != '\0'; byte ++)
{
if (tp[0] != '\\')
*************** Datum
*** 268,278 ****
byteaout(PG_FUNCTION_ARGS)
{
bytea *vlena = PG_GETARG_BYTEA_PP(0);
! char *result;
char *vp;
! char *rp;
! int val; /* holds unprintable chars */
int i;
int len;
len = 1; /* empty string has 1 char */
--- 281,300 ----
byteaout(PG_FUNCTION_ARGS)
{
bytea *vlena = PG_GETARG_BYTEA_PP(0);
! char *result = NULL;
char *vp;
! char *rp = NULL;
int i;
+
+ if (bytea_output == BYTEA_OUTPUT_HEX)
+ {
+ rp = result = palloc(VARSIZE_ANY_EXHDR(vlena) * 2 + 2 + 1);
+ *rp++ = '\\';
+ *rp++ = 'x';
+ rp += hex_encode(VARDATA_ANY(vlena), VARSIZE_ANY_EXHDR(vlena), rp);
+ }
+ else if (bytea_output == BYTEA_OUTPUT_ESCAPE)
+ {
int len;
len = 1; /* empty string has 1 char */
*************** byteaout(PG_FUNCTION_ARGS)
*** 297,302 ****
--- 319,326 ----
}
else if ((unsigned char) *vp < 0x20 || (unsigned char) *vp > 0x7e)
{
+ int val; /* holds unprintable chars */
+
val = *vp;
rp[0] = '\\';
rp[3] = DIG(val & 07);
*************** byteaout(PG_FUNCTION_ARGS)
*** 309,314 ****
--- 333,341 ----
else
*rp++ = *vp;
}
+ }
+ else
+ elog(ERROR, "unrecognized bytea_output setting: %d", bytea_output);
*rp = '\0';
PG_RETURN_CSTRING(result);
}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 11ebac8..0f6ef6c 100644
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** static char *config_enum_get_options(str
*** 180,185 ****
--- 180,191 ----
* NOTE! Option values may not contain double quotes!
*/
+ static const struct config_enum_entry bytea_output_options[] = {
+ {"escape", BYTEA_OUTPUT_ESCAPE, false},
+ {"hex", BYTEA_OUTPUT_HEX, false},
+ {NULL, 0, false}
+ };
+
/*
* We have different sets for client and server message level options because
* they sort slightly different (see "log" level)
*************** static struct config_enum ConfigureNames
*** 2533,2538 ****
--- 2539,2553 ----
},
{
+ {"bytea_output", PGC_USERSET, CLIENT_CONN_OTHER,
+ gettext_noop("Sets the bytea output format to the hex format."),
+ NULL
+ },
+ &bytea_output,
+ BYTEA_OUTPUT_HEX, bytea_output_options, NULL, NULL
+ },
+
+ {
{"client_min_messages", PGC_USERSET, LOGGING_WHEN,
gettext_noop("Sets the message levels that are sent to the client."),
gettext_noop("Each level includes all the levels that follow it. The later"
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 13fd41a..8635416 100644
*** a/src/include/utils/builtins.h
--- b/src/include/utils/builtins.h
*************** extern Datum unknownout(PG_FUNCTION_ARGS
*** 704,709 ****
--- 704,716 ----
extern Datum unknownrecv(PG_FUNCTION_ARGS);
extern Datum unknownsend(PG_FUNCTION_ARGS);
+ typedef enum {
+ BYTEA_OUTPUT_ESCAPE,
+ BYTEA_OUTPUT_HEX
+ } ByteaOutputType;
+
+ extern int bytea_output; /* ByteaOutputType, but int for guc enum */
+
extern Datum byteain(PG_FUNCTION_ARGS);
extern Datum byteaout(PG_FUNCTION_ARGS);
extern Datum bytearecv(PG_FUNCTION_ARGS);
*************** extern Datum bytea_substr(PG_FUNCTION_AR
*** 728,733 ****
--- 735,743 ----
extern Datum bytea_substr_no_len(PG_FUNCTION_ARGS);
extern Datum pg_column_size(PG_FUNCTION_ARGS);
+ extern unsigned hex_encode(const char *src, unsigned len, char *dst);
+ extern unsigned hex_decode(const char *src, unsigned len, char *dst);
+
/* version.c */
extern Datum pgsql_version(PG_FUNCTION_ARGS);
diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c
index bb36121..504ad2f 100644
*** a/src/interfaces/libpq/fe-exec.c
--- b/src/interfaces/libpq/fe-exec.c
*************** PQescapeBytea(const unsigned char *from,
*** 3167,3172 ****
--- 3167,3195 ----
}
+ static const int8 hexlookup[128] = {
+ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, -1, -1, -1, -1, -1, -1,
+ -1, 10, 11, 12, 13, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+ -1, 10, 11, 12, 13, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+ };
+
+ static char
+ get_hex(char c)
+ {
+ int res = -1;
+
+ if (c > 0 && c < 127)
+ res = hexlookup[(unsigned char) c];
+
+ return (char) res;
+ }
+
+
#define ISFIRSTOCTDIGIT(CH) ((CH) >= '0' && (CH) <= '3')
#define ISOCTDIGIT(CH) ((CH) >= '0' && (CH) <= '7')
#define OCTVAL(CH) ((CH) - '0')
*************** PQunescapeBytea(const unsigned char *str
*** 3198,3203 ****
--- 3221,3257 ----
strtextlen = strlen((const char *) strtext);
+ if (strtext[0] == '\\' && strtext[1] == 'x')
+ {
+ const unsigned char *s;
+ unsigned char *p;
+
+ *retbuflen = (strtextlen - 2)/2;
+ /* Avoid unportable malloc(0) */
+ buffer = malloc(*retbuflen > 0 ? *retbuflen : 1);
+ if (buffer == NULL)
+ return NULL;
+
+ s = strtext + 2;
+ p = buffer;
+ while (*s)
+ {
+ char v1, v2;
+
+ v1 = get_hex(*s++) << 4;
+ if (!*s)
+ {
+ *retbuflen = -1;
+ return NULL;
+ }
+
+ v2 = get_hex(*s++);
+ *p++ = v1 | v2;
+ }
+
+ return buffer;
+ }
+ else {
/*
* Length of input is max length of output, but add one to avoid
* unportable malloc(0) if input is zero-length.
*************** PQunescapeBytea(const unsigned char *str
*** 3259,3261 ****
--- 3313,3316 ----
*retbuflen = buflen;
return tmpbuf;
}
+ }
diff --git a/src/test/regress/expected/conversion.out b/src/test/regress/expected/conversion.out
index e80e1a4..8f08706 100644
*** a/src/test/regress/expected/conversion.out
--- b/src/test/regress/expected/conversion.out
***************
*** 1,3 ****
--- 1,4 ----
+ SET bytea_output TO escape;
--
-- create user defined conversion
--
diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out
index 1241a2a..feaa574 100644
*** a/src/test/regress/expected/strings.out
--- b/src/test/regress/expected/strings.out
*************** LINE 1: SELECT U&'wrong: +0061' UESCAPE
*** 97,102 ****
--- 97,183 ----
^
DETAIL: String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
RESET standard_conforming_strings;
+ -- bytea
+ SET bytea_output TO hex;
+ SELECT E'\\xDeAdBeEf'::bytea;
+ bytea
+ ------------
+ \xdeadbeef
+ (1 row)
+
+ SELECT E'\\xDeAdBeE'::bytea;
+ ERROR: invalid hexadecimal data: odd number of digits
+ LINE 1: SELECT E'\\xDeAdBeE'::bytea;
+ ^
+ SELECT E'\\xDeAdBeEx'::bytea;
+ ERROR: invalid hexadecimal digit: "x"
+ LINE 1: SELECT E'\\xDeAdBeEx'::bytea;
+ ^
+ SELECT E'\\xDe00BeEf'::bytea;
+ bytea
+ ------------
+ \xde00beef
+ (1 row)
+
+ SELECT E'DeAdBeEf'::bytea;
+ bytea
+ --------------------
+ \x4465416442654566
+ (1 row)
+
+ SELECT E'De\\000dBeEf'::bytea;
+ bytea
+ --------------------
+ \x4465006442654566
+ (1 row)
+
+ SELECT E'De\123dBeEf'::bytea;
+ bytea
+ --------------------
+ \x4465536442654566
+ (1 row)
+
+ SELECT E'De\\123dBeEf'::bytea;
+ bytea
+ --------------------
+ \x4465536442654566
+ (1 row)
+
+ SELECT E'De\\678dBeEf'::bytea;
+ ERROR: invalid input syntax for type bytea
+ LINE 1: SELECT E'De\\678dBeEf'::bytea;
+ ^
+ SET bytea_output TO escape;
+ SELECT E'\\xDeAdBeEf'::bytea;
+ bytea
+ ------------------
+ \336\255\276\357
+ (1 row)
+
+ SELECT E'\\xDe00BeEf'::bytea;
+ bytea
+ ------------------
+ \336\000\276\357
+ (1 row)
+
+ SELECT E'DeAdBeEf'::bytea;
+ bytea
+ ----------
+ DeAdBeEf
+ (1 row)
+
+ SELECT E'De\\000dBeEf'::bytea;
+ bytea
+ -------------
+ De\000dBeEf
+ (1 row)
+
+ SELECT E'De\\123dBeEf'::bytea;
+ bytea
+ ----------
+ DeSdBeEf
+ (1 row)
+
--
-- test conversions between various string types
-- E021-10 implicit casting among the character data types
diff --git a/src/test/regress/input/largeobject.source b/src/test/regress/input/largeobject.source
index 46ba926..5bfba18 100644
*** a/src/test/regress/input/largeobject.source
--- b/src/test/regress/input/largeobject.source
***************
*** 2,7 ****
--- 2,9 ----
-- Test large object support
--
+ SET bytea_output TO escape;
+
-- Load a file
CREATE TABLE lotest_stash_values (loid oid, fd integer);
-- lo_creat(mode integer) returns oid
diff --git a/src/test/regress/output/largeobject.source b/src/test/regress/output/largeobject.source
index 9d69f6c..5ff2e3b 100644
*** a/src/test/regress/output/largeobject.source
--- b/src/test/regress/output/largeobject.source
***************
*** 1,6 ****
--- 1,7 ----
--
-- Test large object support
--
+ SET bytea_output TO escape;
-- Load a file
CREATE TABLE lotest_stash_values (loid oid, fd integer);
-- lo_creat(mode integer) returns oid
diff --git a/src/test/regress/output/largeobject_1.source b/src/test/regress/output/largeobject_1.source
index 1fbc29c..0ece752 100644
*** a/src/test/regress/output/largeobject_1.source
--- b/src/test/regress/output/largeobject_1.source
***************
*** 1,6 ****
--- 1,7 ----
--
-- Test large object support
--
+ SET bytea_output_hex TO false;
-- Load a file
CREATE TABLE lotest_stash_values (loid oid, fd integer);
-- lo_creat(mode integer) returns oid
diff --git a/src/test/regress/sql/conversion.sql b/src/test/regress/sql/conversion.sql
index 99a9178..c4bb583 100644
*** a/src/test/regress/sql/conversion.sql
--- b/src/test/regress/sql/conversion.sql
***************
*** 1,3 ****
--- 1,5 ----
+ SET bytea_output TO escape;
+
--
-- create user defined conversion
--
diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql
index 681a0e1..de067df 100644
*** a/src/test/regress/sql/strings.sql
--- b/src/test/regress/sql/strings.sql
*************** SELECT U&'wrong: +0061' UESCAPE '+';
*** 43,48 ****
--- 43,67 ----
RESET standard_conforming_strings;
+ -- bytea
+ SET bytea_output TO hex;
+ SELECT E'\\xDeAdBeEf'::bytea;
+ SELECT E'\\xDeAdBeE'::bytea;
+ SELECT E'\\xDeAdBeEx'::bytea;
+ SELECT E'\\xDe00BeEf'::bytea;
+ SELECT E'DeAdBeEf'::bytea;
+ SELECT E'De\\000dBeEf'::bytea;
+ SELECT E'De\123dBeEf'::bytea;
+ SELECT E'De\\123dBeEf'::bytea;
+ SELECT E'De\\678dBeEf'::bytea;
+
+ SET bytea_output TO escape;
+ SELECT E'\\xDeAdBeEf'::bytea;
+ SELECT E'\\xDe00BeEf'::bytea;
+ SELECT E'DeAdBeEf'::bytea;
+ SELECT E'De\\000dBeEf'::bytea;
+ SELECT E'De\\123dBeEf'::bytea;
+
--
-- test conversions between various string types
-- E021-10 implicit casting among the character data types
Bernd Helmle wrote:
--On Samstag, Juli 11, 2009 13:40:44 +0300 Peter Eisentraut
<peter_e@gmx.net> wrote:OK, here is an updated patch. It has the setting as enum, completed
documentation, and libpq support. I'll add it to the commit fest in the
hope that someone else can look it over in detail.I've started looking at this and did some profiling with large bytea
data again. For those interested, here are the numbers:Dumping with bytea_output=hex (COPY to file):
real 20m38.699s
user 0m11.265s
sys 1m0.560sDumping with bytea_output=escape (COPY to file):
real 39m52.399s
user 0m22.085s
sys 1m50.131sSo the time needed dropped about 50%. The dump file dropped from
around 48 GB to 28 GB with the new format.
You just tested COPY, not pg_dump, right? Some pg_dump numbers would be
interesting, both for text and custom formats.
cheers
andrew
--On Dienstag, Juli 21, 2009 16:49:45 -0400 Andrew Dunstan
<andrew@dunslane.net> wrote:
You just tested COPY, not pg_dump, right? Some pg_dump numbers would be
interesting, both for text and custom formats.
Plain COPY, yes. I planned testing pg_dump for this round of my review but
ran out of time unfortunately.
The restore might be limited by xlog (didn't realize that the profile shows
XLogInsert in the top four). I'll try to get some additional numbers soon,
but this won't happen before thursday.
--
Thanks
Bernd
Bernd Helmle wrote:
--On Dienstag, Juli 21, 2009 16:49:45 -0400 Andrew Dunstan
<andrew@dunslane.net> wrote:You just tested COPY, not pg_dump, right? Some pg_dump numbers would be
interesting, both for text and custom formats.Plain COPY, yes. I planned testing pg_dump for this round of my review
but ran out of time unfortunately.The restore might be limited by xlog (didn't realize that the profile
shows XLogInsert in the top four). I'll try to get some additional
numbers soon, but this won't happen before thursday.
If the table is created by the restore job, either use parallel
pg_restore (-j nn) or use the --single-transaction flag - both will
ensure that the WAL log is avoided.
For plain COPY, get the same effect using:
begin;
truncat foo;
copy foo ... ;
commit;
All this assumes that archive_mode is off.
cheers
andrew
Peter Eisentraut <peter_e@gmx.net> writes:
OK, here is an updated patch. It has the setting as enum, completed
documentation, and libpq support. I'll add it to the commit fest in the hope
that someone else can look it over in detail.
I found that there is another issue that should be addressed, maybe not
by this patch but by a follow-on. While looking at Itagaki-san's patch
for making pg_dump --clean drop large objects, I noticed that pg_dump
is still relying on the deprecated function PQescapeBytea to dump the
contents of large objects when it is creating text output. This manages
not to fail for common cases, but there is at least one case we
overlooked: if you pg_dump with standard_conforming_strings turned on
into a custom (or probably tar) archive, and then use pg_restore to
generate a SQL script from that, the strings will be improperly escaped.
It strikes me that the best solution for this is to emit hex-coded
bytea instead of escaped bytea. While we could just hardcode that
into pg_dump, it would probably be better if libpq provided a function
along the lines of PQescapeByteaHex.
In some far future, maybe PQescapeBytea could be rescued from the depths
of deprecation by having it emit hex-coded output; but of course that
would fail against pre-8.5 servers, so it's a long way off.
In the nearer future, it would be possible and perhaps wise for
PQescapeByteaConn to adopt hex coding when it sees the connection is
to a server >= 8.5. It didn't look to me like the patch addressed
this either.
regards, tom lane
Bernd Helmle <mailings@oopsware.de> writes:
I've attached a slightly edited patch which fixes a compiler warning in
encode.c, too.
Bernd, are you done reviewing this or did you intend to do more?
It's still marked as "needs review" on the commitfest page.
regards, tom lane
--On Freitag, Juli 24, 2009 11:38:06 -0400 Tom Lane <tgl@sss.pgh.pa.us>
wrote:
Bernd, are you done reviewing this or did you intend to do more?
It's still marked as "needs review" on the commitfest page.
I hoped to get more profiling data like Andrew suggested, but haven't
enough time to do it :( The customer machine i can test on is not available
all the time, too. I haven't looked very detailed into the source, if you
plan to start a review of your own, feel free. I don't believe i can do
very much this weekend...
--
Thanks
Bernd
--On Freitag, Juli 24, 2009 20:50:16 +0200 Bernd Helmle
<mailings@oopsware.de> wrote:
I don't believe i can do very much this weekend...
I have to delay that until sunday, but will get my hands on some
performance and function tests again, since i have access on the customer
machine then.
--
Thanks
Bernd
Bernd Helmle <mailings@oopsware.de> writes:
--On Samstag, Juli 11, 2009 13:40:44 +0300 Peter Eisentraut
<peter_e@gmx.net> wrote:OK, here is an updated patch. It has the setting as enum, completed
documentation, and libpq support. I'll add it to the commit fest in the
hope that someone else can look it over in detail.
I've attached a slightly edited patch which fixes a compiler warning in
encode.c, too.
I'm starting to look at this patch. I observe that it's setting the
default output format to HEX. If changing the default behavior was
agreed to, or even discussed, I do not remember where. Shouldn't the
default stay the same?
regards, tom lane
--On Montag, August 03, 2009 15:11:08 -0400 Tom Lane <tgl@sss.pgh.pa.us>
wrote:
I'm starting to look at this patch. I observe that it's setting the
default output format to HEX. If changing the default behavior was
agreed to, or even discussed, I do not remember where. Shouldn't the
default stay the same?
I would prefer it to be the default at least for pg_dump, if we can get
some significant performance improvement for both, dump and restore from
it. However, here are some current performance numbers (taken from today,
since yesterday i had some trouble to get on the machine):
I did some restore testing based on the following flow:
BEGIN;
TRUNCATE ... ;
COPY testtable FROM ... ;
ROLLBACK;
with bytea_output = 'escape' i get
Time: 1478801,770 ms
where bytea_output = 'hex' gives:
Time: 1448871,566 ms
So 'hex' is slightly faster on this machine, but not in the numbers i would
have expected. The hex-based restore gives the following profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
37.81 157.22 157.22 97847 0.00 0.00 pglz_compress
20.25 241.43 84.21 141398 0.00 0.00 CopyReadLine
14.44 301.48 60.05 3605691992 0.00 0.00 get_hex
8.29 335.96 34.48 141397 0.00 0.00 hex_decode
7.99 369.20 33.24 1 33.24 398.14 DoCopy
3.95 385.63 16.43 esc_enc_len
0.71 388.58 2.95 137268286 0.00 0.00 _bt_compare
0.54 390.81 2.23 7209863 0.00 0.00 XLogInsert
0.48 392.81 2.00 49329221 0.00 0.00
hash_search_with_hash_value
0.43 394.59 1.78 91132579 0.00 0.00 LWLockAcquire
0.42 396.34 1.75 92250421 0.00 0.00 LWLockRelease
0.42 398.08 1.75 30477526 0.00 0.00 ReadBuffer_common
0.20 398.93 0.85 28686690 0.00 0.00 PinBuffer
0.18 399.67 0.74 21541372 0.00 0.00 _bt_binsrch
0.16 400.34 0.67 39278753 0.00 0.00 AllocSetAlloc
--
Thanks
Bernd
Bernd Helmle <mailings@oopsware.de> writes:
--On Montag, August 03, 2009 15:11:08 -0400 Tom Lane <tgl@sss.pgh.pa.us>
wrote:I'm starting to look at this patch. I observe that it's setting the
default output format to HEX. If changing the default behavior was
agreed to, or even discussed, I do not remember where. Shouldn't the
default stay the same?
I would prefer it to be the default at least for pg_dump,
Well, we could have pg_dump force the output format to hex regardless
of what the default is.
A disadvantage of doing that is there wouldn't be any convenient way
to get pg_dump to *not* set the output format (unless we add a switch,
which seems way overkill). Which would mean there would be no good way
to get pg_dump to produce backwards-compatible output. But considering
how many other backwards-incompatible changes we have put into pg_dump
without blinking, I'm not sure this argument outweighs the probability
of breaking a lot of applications.
regards, tom lane
One other stylistic gripe: I don't much like inserting a GUC variable
definition into builtins.h --- that file has traditionally only
contained function extern declarations. The best alternative I can
think of is to move the bytea-related stuff into a new include file
include/utils/bytea.h. Has anyone got an objection or a better idea?
regards, tom lane
On Tue, Aug 4, 2009 at 12:18 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
One other stylistic gripe: I don't much like inserting a GUC variable
definition into builtins.h --- that file has traditionally only
contained function extern declarations. The best alternative I can
think of is to move the bytea-related stuff into a new include file
include/utils/bytea.h. Has anyone got an objection or a better idea?
The other guc that controls default i/o formats for a data type is
DateStyle. I can't say I expected to find that in miscadmin.h though.
Perhaps move both of them into a utils/adt.h or something like that?
Greg Stark <gsstark@mit.edu> writes:
On Tue, Aug 4, 2009 at 12:18 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
One other stylistic gripe: I don't much like inserting a GUC variable
definition into builtins.h --- that file has traditionally only
contained function extern declarations. �The best alternative I can
think of is to move the bytea-related stuff into a new include file
include/utils/bytea.h. �Has anyone got an objection or a better idea?
The other guc that controls default i/o formats for a data type is
DateStyle. I can't say I expected to find that in miscadmin.h though.
Perhaps move both of them into a utils/adt.h or something like that?
Hmm, actually now that you mention it there's a bunch of GUC variables
in miscadmin.h. Surprise factor aside, I'm inclined to just shove
bytea_output in there along with DateStyle/IntervalStyle/etc.
I did try the new-include-file approach, and unsurprisingly found three
or four files that had to be modified to include it, because they'd been
expecting to find byteain and byteaout declared in builtins.h. I still
think that way is a bit cleaner, but I'm not sure it's enough cleaner to
risk breaking third-party code for.
regards, tom lane
Tom Lane wrote:
Greg Stark <gsstark@mit.edu> writes:
On Tue, Aug 4, 2009 at 12:18 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
One other stylistic gripe: I don't much like inserting a GUC variable
definition into builtins.h --- that file has traditionally only
contained function extern declarations. �The best alternative I can
think of is to move the bytea-related stuff into a new include file
include/utils/bytea.h. �Has anyone got an objection or a better idea?The other guc that controls default i/o formats for a data type is
DateStyle. I can't say I expected to find that in miscadmin.h though.
Perhaps move both of them into a utils/adt.h or something like that?Hmm, actually now that you mention it there's a bunch of GUC variables
in miscadmin.h. Surprise factor aside, I'm inclined to just shove
bytea_output in there along with DateStyle/IntervalStyle/etc.
I vote for a new bytea.h file that does not slurp in byteain/byteaout,
to avoid breaking 3rd party code. miscadmin.h seems the worst solution,
since it's already included in 210 other files.
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Attachments:
miscadmin.patchtext/x-diff; charset=us-asciiDownload
Index: src/pl/plpgsql/src/pl_exec.c
===================================================================
RCS file: /home/alvherre/Code/cvs/pgsql/src/pl/plpgsql/src/pl_exec.c,v
retrieving revision 1.246
diff -c -p -r1.246 pl_exec.c
*** src/pl/plpgsql/src/pl_exec.c 22 Jul 2009 02:31:38 -0000 1.246
--- src/pl/plpgsql/src/pl_exec.c 4 Aug 2009 01:00:06 -0000
***************
*** 23,28 ****
--- 23,29 ----
#include "executor/spi_priv.h"
#include "funcapi.h"
#include "lib/stringinfo.h"
+ #include "miscadmin.h"
#include "nodes/nodeFuncs.h"
#include "parser/scansup.h"
#include "storage/proc.h"
Index: src/pl/plpgsql/src/pl_handler.c
===================================================================
RCS file: /home/alvherre/Code/cvs/pgsql/src/pl/plpgsql/src/pl_handler.c,v
retrieving revision 1.44
diff -c -p -r1.44 pl_handler.c
*** src/pl/plpgsql/src/pl_handler.c 18 Feb 2009 11:33:04 -0000 1.44
--- src/pl/plpgsql/src/pl_handler.c 4 Aug 2009 00:59:54 -0000
***************
*** 18,23 ****
--- 18,24 ----
#include "catalog/pg_proc.h"
#include "catalog/pg_type.h"
#include "funcapi.h"
+ #include "miscadmin.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/lsyscache.h"
Index: src/pl/plpgsql/src/plpgsql.h
===================================================================
RCS file: /home/alvherre/Code/cvs/pgsql/src/pl/plpgsql/src/plpgsql.h,v
retrieving revision 1.114
diff -c -p -r1.114 plpgsql.h
*** src/pl/plpgsql/src/plpgsql.h 22 Jul 2009 02:31:38 -0000 1.114
--- src/pl/plpgsql/src/plpgsql.h 4 Aug 2009 00:59:06 -0000
***************
*** 20,26 ****
#include "access/xact.h"
#include "fmgr.h"
- #include "miscadmin.h"
#include "commands/trigger.h"
#include "executor/spi.h"
#include "utils/tuplestore.h"
--- 20,25 ----
Alvaro Herrera <alvherre@commandprompt.com> writes:
I vote for a new bytea.h file that does not slurp in byteain/byteaout,
to avoid breaking 3rd party code. miscadmin.h seems the worst solution,
since it's already included in 210 other files.
Well, unless you want to leave *all* the bytea functions in builtins.h
there will still be some risk there. I'd actually sooner break calls
of byteaout than other things, because in reality every caller of
byteaout is going to need to be inspected to see if it's expecting
the old-style output format.
regards, tom lane
Tom Lane wrote:
Alvaro Herrera <alvherre@commandprompt.com> writes:
I vote for a new bytea.h file that does not slurp in byteain/byteaout,
to avoid breaking 3rd party code. miscadmin.h seems the worst solution,
since it's already included in 210 other files.Well, unless you want to leave *all* the bytea functions in builtins.h
there will still be some risk there. I'd actually sooner break calls
of byteaout than other things, because in reality every caller of
byteaout is going to need to be inspected to see if it's expecting
the old-style output format.
Hmm, good point ... why avoid the breakage then?
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes:
Tom Lane wrote:
Well, unless you want to leave *all* the bytea functions in builtins.h
there will still be some risk there. I'd actually sooner break calls
of byteaout than other things, because in reality every caller of
byteaout is going to need to be inspected to see if it's expecting
the old-style output format.
Hmm, good point ... why avoid the breakage then?
Maybe we shouldn't. Okay, back to plan A (separate bytea.h file).
(BTW, so far as I can tell there isn't anything in the backend that
will be broken in that way. pg_dump, however, is a different story...
it knows way too much about pg_trigger.tgargs.)
regards, tom lane
On Monday 03 August 2009 22:11:08 Tom Lane wrote:
I'm starting to look at this patch. I observe that it's setting the
default output format to HEX. If changing the default behavior was
agreed to, or even discussed, I do not remember where. Shouldn't the
default stay the same?
I did pose that question in my patch submission email.
Unless there is overwhelming support in favor of changing, we probably
shouldn't change it, at least not yet.
Peter Eisentraut <peter_e@gmx.net> writes:
On Monday 03 August 2009 22:11:08 Tom Lane wrote:
I'm starting to look at this patch. I observe that it's setting the
default output format to HEX. If changing the default behavior was
agreed to, or even discussed, I do not remember where. Shouldn't the
default stay the same?
I did pose that question in my patch submission email.
Unless there is overwhelming support in favor of changing, we probably
shouldn't change it, at least not yet.
While I've been poking at the pg_dump issues, it's occurred to me that
changing the default would be a great forcing function for finding out
any lurking problems. What I'm inclined to do now is to commit it
*with* the change of default, and let it be that way at least for a
few alpha-test releases. We can vote on whether to switch the default
back before 8.5 final.
If this seems reasonable, I can make a note of the point in the commit
message, so that we won't forget when the time comes.
regards, tom lane
On Tue, Aug 4, 2009 at 10:28 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
Peter Eisentraut <peter_e@gmx.net> writes:
On Monday 03 August 2009 22:11:08 Tom Lane wrote:
I'm starting to look at this patch. I observe that it's setting the
default output format to HEX. If changing the default behavior was
agreed to, or even discussed, I do not remember where. Shouldn't the
default stay the same?I did pose that question in my patch submission email.
Unless there is overwhelming support in favor of changing, we probably
shouldn't change it, at least not yet.While I've been poking at the pg_dump issues, it's occurred to me that
changing the default would be a great forcing function for finding out
any lurking problems. What I'm inclined to do now is to commit it
*with* the change of default, and let it be that way at least for a
few alpha-test releases. We can vote on whether to switch the default
back before 8.5 final.If this seems reasonable, I can make a note of the point in the commit
message, so that we won't forget when the time comes.
Or, what we could do is start an open items for 8.5 list similar to
the one we made for 8.4. That worked pretty well, I think.
...Robert
On Tue, Aug 4, 2009 at 16:31, Robert Haas<robertmhaas@gmail.com> wrote:
On Tue, Aug 4, 2009 at 10:28 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
Peter Eisentraut <peter_e@gmx.net> writes:
On Monday 03 August 2009 22:11:08 Tom Lane wrote:
I'm starting to look at this patch. I observe that it's setting the
default output format to HEX. If changing the default behavior was
agreed to, or even discussed, I do not remember where. Shouldn't the
default stay the same?I did pose that question in my patch submission email.
Unless there is overwhelming support in favor of changing, we probably
shouldn't change it, at least not yet.While I've been poking at the pg_dump issues, it's occurred to me that
changing the default would be a great forcing function for finding out
any lurking problems. What I'm inclined to do now is to commit it
*with* the change of default, and let it be that way at least for a
few alpha-test releases. We can vote on whether to switch the default
back before 8.5 final.If this seems reasonable, I can make a note of the point in the commit
message, so that we won't forget when the time comes.Or, what we could do is start an open items for 8.5 list similar to
the one we made for 8.4. That worked pretty well, I think.
+1 for that solution, it seems much better than having to go back
through commit messages. We might as well start it early!
--
Magnus Hagander
Self: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Robert Haas <robertmhaas@gmail.com> writes:
On Tue, Aug 4, 2009 at 10:28 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
If this seems reasonable, I can make a note of the point in the commit
message, so that we won't forget when the time comes.
Or, what we could do is start an open items for 8.5 list similar to
the one we made for 8.4. That worked pretty well, I think.
OK. Historically we haven't made such a list until beta starts, but
there's no reason we couldn't start it early.
regards, tom lane
--On Dienstag, August 04, 2009 10:28:48 -0400 Tom Lane <tgl@sss.pgh.pa.us>
wrote:
While I've been poking at the pg_dump issues, it's occurred to me that
changing the default would be a great forcing function for finding out
any lurking problems.
+1
--
Thanks
Bernd
Bernd Helmle <mailings@oopsware.de> writes:
--On Samstag, Juli 11, 2009 13:40:44 +0300 Peter Eisentraut
<peter_e@gmx.net> wrote:OK, here is an updated patch. It has the setting as enum, completed
documentation, and libpq support. I'll add it to the commit fest in the
hope that someone else can look it over in detail.
I've attached a slightly edited patch which fixes a compiler warning in
encode.c, too.
Committed with assorted corrections. I have not done anything about
the issues mentioned in
http://archives.postgresql.org/message-id/21837.1248215656@sss.pgh.pa.us
mainly that pg_dump's treatment of large-object contents is not safe
against changes of standard_conforming_strings. I think that ought to
get dealt with before moving on.
regards, tom lane
I wrote:
While I've been poking at the pg_dump issues, it's occurred to me that
changing the default would be a great forcing function for finding out
any lurking problems. What I'm inclined to do now is to commit it
*with* the change of default, and let it be that way at least for a
few alpha-test releases. We can vote on whether to switch the default
back before 8.5 final.
For the archives, attached is a patch to switch the default and then
make pg_dump force use of hex mode. This is just so we won't forget
what needs changing if we decide to switch the default back ...
regards, tom lane