Toast,bytea, Text -blob all confusing

Started by Nonameover 24 years ago14 messages
#1Noname
jason.ory@ndchealth.com

I'm trying my best to convert from MySQL to PgSQL but I cant get a good
answer about
certian questions. It was an easy task in mysql but all this talk about
, text , toast and bytea is just confusing me.
I cant get a clear picture of any of this,from the book from Bruce, the
e-mail archives. Ive looked all i can,
with every keyword i can think of from years past. Here is my situation.

WHAT I WAS DOING IN MYSQL
Via the web my clients are uploading basic text/data files, sometimes >
than 30MB. In the past ,via CGI I have been parsing the file
into one STL string, using mysql_escape_string to escape it and then using
an INSERT to place the
,\'+"stlstring+"\' , into a BLOB column.
I dont want to use a temp. file anywhere. The data will always be passed via
the database and buffers for certian reasons.

THIS IS WHAT I CANT SEEM TO FIGURE OUT IN POSTGRESQL
1. I cant get a clear answer on what kind of data type to use for my large
text string? TEXT, ???, ??? or something about TOAST
I have seen in the e-mail archive but cant find any documentaion?

2. I've written my own escape method ,("cant find one for Pgsql") , BUT i
don't know what
to escape and not to escape. So it keeps failing. I cand find any docs. on
what to escape either?

SUMMARY
What datatype do I use? How Im a supposed to escape it and get it though the
parser correctly so i can
retrieve it correctly?

Thnks for your time.

PS: Using RedHat.

Jason H. Ory
Medprint+
Software Developer
(205) 989-4617
jason.ory@ndchealth.com

#2Frank Joerdens
frank@joerdens.de
In reply to: Noname (#1)
Re: Toast,bytea, Text -blob all confusing

On Thu, Aug 23, 2001 at 01:09:14PM -0400, jason.ory@ndchealth.com wrote:

I'm trying my best to convert from MySQL to PgSQL but I cant get a good
answer about
certian questions. It was an easy task in mysql but all this talk about
, text , toast and bytea is just confusing me.
I cant get a clear picture of any of this,from the book from Bruce, the
e-mail archives. Ive looked all i can,
with every keyword i can think of from years past. Here is my situation.

WHAT I WAS DOING IN MYSQL
Via the web my clients are uploading basic text/data files, sometimes >
than 30MB. In the past ,via CGI I have been parsing the file
into one STL string, using mysql_escape_string to escape it and then using
an INSERT to place the
,\'+"stlstring+"\' , into a BLOB column.
I dont want to use a temp. file anywhere. The data will always be passed via
the database and buffers for certian reasons.

THIS IS WHAT I CANT SEEM TO FIGURE OUT IN POSTGRESQL
1. I cant get a clear answer on what kind of data type to use for my large
text string? TEXT, ???, ??? or something about TOAST
I have seen in the e-mail archive but cant find any documentaion?

TOAST is just a name for the mechanism/feature that is used in Postgres

= 7.1 to overcome the (32 KB if I recall correctly) limit on the row

size in previous versions. It's completely transparent to the
programmer, i.e. if you use TEXT, for instance, you can have a row up to
1 GB in size (which is probably not practical) theoretically. The
advantage over using BLOBs is that you can search this field (with 30 MB
fieldis, if you have a few of them, this is probably not practical
either so you'd probably want to consider some full text indexing
mechanism). I'd use TEXT for this reason.

2. I've written my own escape method ,("cant find one for Pgsql") , BUT i
don't know what
to escape and not to escape. So it keeps failing. I cand find any docs. on
what to escape either?

Hm. I don't understand why the database (using MySQL or Postgres) would
make any difference there.

Hope it helps,

Frank

#3Alex Pilosov
alex@pilosoft.com
In reply to: Noname (#1)
Re: Toast,bytea, Text -blob all confusing

On Thu, 23 Aug 2001 jason.ory@ndchealth.com wrote:

THIS IS WHAT I CANT SEEM TO FIGURE OUT IN POSTGRESQL
1. I cant get a clear answer on what kind of data type to use for my large
text string? TEXT, ???, ??? or something about TOAST
I have seen in the e-mail archive but cant find any documentaion?

I would suggest bytea or blob. Blobs are well-documented in normal
documentation and in documentation of your favorite interface, so I'll
just talk about bytea.

2. I've written my own escape method ,("cant find one for Pgsql") , BUT i
don't know what
to escape and not to escape. So it keeps failing. I cand find any docs. on
what to escape either?

For bytea, follow this rule: to escape a null character, use this:
'\\0'. To escape a backslash, use this: '\\\\'.

Same idea to unescape data.

#4Lincoln Yeoh
lyeoh@pop.jaring.my
In reply to: Alex Pilosov (#3)
Re: Toast,bytea, Text -blob all confusing

At 03:05 PM 27-08-2001 -0400, Alex Pilosov wrote:

On Thu, 23 Aug 2001 jason.ory@ndchealth.com wrote:

THIS IS WHAT I CANT SEEM TO FIGURE OUT IN POSTGRESQL
1. I cant get a clear answer on what kind of data type to use for my large
text string? TEXT, ???, ??? or something about TOAST
I have seen in the e-mail archive but cant find any documentaion?

I would suggest bytea or blob. Blobs are well-documented in normal
documentation and in documentation of your favorite interface, so I'll
just talk about bytea.

2. I've written my own escape method ,("cant find one for Pgsql") , BUT i
don't know what
to escape and not to escape. So it keeps failing. I cand find any docs. on
what to escape either?

For bytea, follow this rule: to escape a null character, use this:
'\\0'. To escape a backslash, use this: '\\\\'.

Same idea to unescape data.

Are there other characters that need to be escaped? I suspect there are
more characters that need to be escaped - ctrl chars? single quotes?. Why
four backslashes for one? Is there a definitive documentation anywhere for
what bytea is _supposed_ (not what it might actually be) to be and how it
is to be handled?

Also why wouldn't escaping stuff like this work with TEXT then? If a null
is going to be backslash backslash zero, and come out the same way, it sure
looks like TEXT to me :). OK so there's this thing about storage. So maybe
I could save a byte by just converting nulls to backslash zero and real
backslashes to backslash backslash. Tada.

OK it's probably not the same, but having to put four backslashes when two
should be enough to quote one makes me rather puzzled and uneasy.

Cheerio,
Link.

#5Alex Pilosov
alex@acecape.com
In reply to: Lincoln Yeoh (#4)
Re: Toast,bytea, Text -blob all confusing

On Tue, 28 Aug 2001, Lincoln Yeoh wrote:

For bytea, follow this rule: to escape a null character, use this:
'\\0'. To escape a backslash, use this: '\\\\'.

Same idea to unescape data.

Are there other characters that need to be escaped? I suspect there are
more characters that need to be escaped - ctrl chars? single quotes?. Why
four backslashes for one? Is there a definitive documentation anywhere for
what bytea is _supposed_ (not what it might actually be) to be and how it
is to be handled?

Yes, sorry for being unclear on this one. Here's a more detailed
explanation: Bytea is just a stream of data. On input, it must follow C
escaping conventions, on output, it will be escaped using C escaping
conventions.

However, there's a trap: before things get to bytea input handler, they
are first processed by postgresql general parser. Hence, the string \\0
given from psql will become \0 when it gets to bytea input handler. String
\\\\ will become \\. All non-printable characters must be escaped like
this: \\(octal of character), for ex, chr(255) must be presented as \\377.
(If you want, you can also use this as an alternative and more generic way
to escape a backslash, \\134). Single quote must be escaped either as \\47
or as \'. Note the single backslash. Why only one? Because bytea parser
doesn't care about single quotes and you only need to escape it for the
postgresql parser...

So, just keep in mind the double-parsing of input and you'll be safe.

Also why wouldn't escaping stuff like this work with TEXT then? If a null
is going to be backslash backslash zero, and come out the same way, it sure
looks like TEXT to me :). OK so there's this thing about storage. So maybe

Because text is null-terminated, can't have a null inside.

I could save a byte by just converting nulls to backslash zero and real
backslashes to backslash backslash. Tada.

If you do that, you'll break ordering/comparison. Bytea in memory is
stored EXACTLY the way input string was, without any escaping, hence, all
comparisons will be correct ( '\\0'::bytea is less than '\\1'::bytea).

With your representation, comparisons will fail, because in memory, data
is escaped using some escaping convention that you made up.

OK it's probably not the same, but having to put four backslashes when two
should be enough to quote one makes me rather puzzled and uneasy.

Double parsing, hence double escaping.

--
Alex Pilosov | http://www.acedsl.com/home.html
CTO - Acecape, Inc. | AceDSL:The best ADSL in the world
325 W 38 St. Suite 1005 | (Stealth Marketing Works! :)
New York, NY 10018 |

#6Joe Conway
joseph.conway@home.com
In reply to: Alex Pilosov (#5)
Re: Re: Toast,bytea, Text -blob all confusing

For bytea, follow this rule: to escape a null character, use this:
'\\0'. To escape a backslash, use this: '\\\\'.

Same idea to unescape data.

Are there other characters that need to be escaped? I suspect there are
more characters that need to be escaped - ctrl chars? single quotes?.

Why

four backslashes for one? Is there a definitive documentation anywhere

for

what bytea is _supposed_ (not what it might actually be) to be and how

it

is to be handled?

Yes, sorry for being unclear on this one. Here's a more detailed
explanation: Bytea is just a stream of data. On input, it must follow C
escaping conventions, on output, it will be escaped using C escaping
conventions.

However, there's a trap: before things get to bytea input handler, they
are first processed by postgresql general parser. Hence, the string \\0
given from psql will become \0 when it gets to bytea input handler. String
\\\\ will become \\. All non-printable characters must be escaped like
this: \\(octal of character), for ex, chr(255) must be presented as \\377.
(If you want, you can also use this as an alternative and more generic way
to escape a backslash, \\134). Single quote must be escaped either as \\47
or as \'. Note the single backslash. Why only one? Because bytea parser
doesn't care about single quotes and you only need to escape it for the
postgresql parser...

So, just keep in mind the double-parsing of input and you'll be safe.

Also why wouldn't escaping stuff like this work with TEXT then? If a

null

is going to be backslash backslash zero, and come out the same way, it

sure

looks like TEXT to me :). OK so there's this thing about storage. So

maybe

Because text is null-terminated, can't have a null inside.

I could save a byte by just converting nulls to backslash zero and real
backslashes to backslash backslash. Tada.

If you do that, you'll break ordering/comparison. Bytea in memory is
stored EXACTLY the way input string was, without any escaping, hence, all
comparisons will be correct ( '\\0'::bytea is less than '\\1'::bytea).

With your representation, comparisons will fail, because in memory, data
is escaped using some escaping convention that you made up.

OK it's probably not the same, but having to put four backslashes when

two

should be enough to quote one makes me rather puzzled and uneasy.

Double parsing, hence double escaping.

Great explanation Alex --thanks! I'll add a bit:

I've done about 400,000 inserts and subsequent queries to verify that, from
PHP at least, only four charaters need to be escaped. The inserts were 20
byte strings gotten by concatenating some known text with a counter in a
loop, and then producing a SHA-1 hash. This produces very uniformly
distributed binary data. Then I ran the same loop, except I queried for the
inserted strings. I'm quite confident from this experiment that binary can
reliably be inserted via standard SQL when these four characters are
escaped. Here's the run down:

\\000 First slash is consumed by the general parser, leaving \000 for the
byteain function. If you only use one '\', the general parser converts the
character into a true '\0' byte, and the byteain function sees this byte as
the string terminator and stops. This causes the input string to be
truncated (which seems to confuse many people).

\\012 In my early tests 0x0a (LF) was getting converted to 0x20 (space).
I think this was happening during PHP's parsing, but I'm still not sure.
I'll dig into this some more later.

\\047 As Alex mentioned, the byteain function doesn't treat this as a
special character, but of course the general parser does as this is a single
quote. It also works fine to escape it as \', I just prefer to use all
octals.

\\134 Both the general parser and the byteain function see this (a single
\) as the special escape character. Therefore the general parser turns \\\\
into \\, and the byteain function turns \\ into \ for actual storage. Again,
I prefer to use the octal representation instead.

I hope this helps reduce the concerns and confusion over bytea. If anyone
can help explain why my linefeeds were getting converted to spaces, all the
mysteries would be explained ;-)

-- Joe

#7Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Joe Conway (#6)
RE: Re: Toast,bytea, Text -blob all confusing

For bytea, follow this rule: to escape a null character, use

this:

'\\0'. To escape a backslash, use this: '\\\\'.

Can anybody explain in technical terms why this is implemented
so inconveniently ?

Since bytea is probably not very common among users yet
we could imho still change it to not do double escapes.

Imho we need to decide where to do the escaping,
eighter in the parser or in the input functions.

I think actually the backend parser has no business changing
constants, he is imho only allowed to parse it, so he knows
where a constant begins, and where it ends.

Andreas

#8Hannu Krosing
hannu@tm.ee
In reply to: Zeugswetter Andreas SB SD (#7)
Re: Re: Toast,bytea, Text -blob all confusing

Zeugswetter Andreas SB SD wrote:

For bytea, follow this rule: to escape a null character, use

this:

'\\0'. To escape a backslash, use this: '\\\\'.

Can anybody explain in technical terms why this is implemented
so inconveniently ?

I think that this has to to with making textin and textout behave
symmetrically, and the requirement that textout must produce a
valid C-string for ASCII transfer format.

Since bytea is probably not very common among users yet
we could imho still change it to not do double escapes.

But how ?

Imho we need to decide where to do the escaping,
eighter in the parser or in the input functions.

It would be probably hard to make the parser to _not_ unescape some
types, as it does not yet know it

I think actually the backend parser has no business changing
constants, he is imho only allowed to parse it, so he knows
where a constant begins, and where it ends.

If it is any consolation then you have to write the inset of
a single \ from shell command so:

psql -c "insert into t values('\\\\\\\\')"

;)

------------------
Hannu

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas SB SD (#7)
Re: Re: Toast,bytea, Text -blob all confusing

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:

I think actually the backend parser has no business changing
constants, he is imho only allowed to parse it, so he knows
where a constant begins, and where it ends.

How do you propose to handle embedded quote marks in literals,
if there is no parser-level escape convention?

Don't suggest a type-specific escape convention; at the time the
parser runs, it's impossible to know what type the literal will
turn out to be.

regards, tom lane

#10Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Joe Conway (#6)
1 attachment(s)
Re: Re: Toast,bytea, Text -blob all confusing

Thanks you your description, I have added a bytea section to the docs.

Patch attached.

For bytea, follow this rule: to escape a null character, use this:
'\\0'. To escape a backslash, use this: '\\\\'.

Same idea to unescape data.

Are there other characters that need to be escaped? I suspect there are
more characters that need to be escaped - ctrl chars? single quotes?.

Why

four backslashes for one? Is there a definitive documentation anywhere

for

what bytea is _supposed_ (not what it might actually be) to be and how

it

is to be handled?

Yes, sorry for being unclear on this one. Here's a more detailed
explanation: Bytea is just a stream of data. On input, it must follow C
escaping conventions, on output, it will be escaped using C escaping
conventions.

However, there's a trap: before things get to bytea input handler, they
are first processed by postgresql general parser. Hence, the string \\0
given from psql will become \0 when it gets to bytea input handler. String
\\\\ will become \\. All non-printable characters must be escaped like
this: \\(octal of character), for ex, chr(255) must be presented as \\377.
(If you want, you can also use this as an alternative and more generic way
to escape a backslash, \\134). Single quote must be escaped either as \\47
or as \'. Note the single backslash. Why only one? Because bytea parser
doesn't care about single quotes and you only need to escape it for the
postgresql parser...

So, just keep in mind the double-parsing of input and you'll be safe.

Also why wouldn't escaping stuff like this work with TEXT then? If a

null

is going to be backslash backslash zero, and come out the same way, it

sure

looks like TEXT to me :). OK so there's this thing about storage. So

maybe

Because text is null-terminated, can't have a null inside.

I could save a byte by just converting nulls to backslash zero and real
backslashes to backslash backslash. Tada.

If you do that, you'll break ordering/comparison. Bytea in memory is
stored EXACTLY the way input string was, without any escaping, hence, all
comparisons will be correct ( '\\0'::bytea is less than '\\1'::bytea).

With your representation, comparisons will fail, because in memory, data
is escaped using some escaping convention that you made up.

OK it's probably not the same, but having to put four backslashes when

two

should be enough to quote one makes me rather puzzled and uneasy.

Double parsing, hence double escaping.

Great explanation Alex --thanks! I'll add a bit:

I've done about 400,000 inserts and subsequent queries to verify that, from
PHP at least, only four charaters need to be escaped. The inserts were 20
byte strings gotten by concatenating some known text with a counter in a
loop, and then producing a SHA-1 hash. This produces very uniformly
distributed binary data. Then I ran the same loop, except I queried for the
inserted strings. I'm quite confident from this experiment that binary can
reliably be inserted via standard SQL when these four characters are
escaped. Here's the run down:

\\000 First slash is consumed by the general parser, leaving \000 for the
byteain function. If you only use one '\', the general parser converts the
character into a true '\0' byte, and the byteain function sees this byte as
the string terminator and stops. This causes the input string to be
truncated (which seems to confuse many people).

\\012 In my early tests 0x0a (LF) was getting converted to 0x20 (space).
I think this was happening during PHP's parsing, but I'm still not sure.
I'll dig into this some more later.

\\047 As Alex mentioned, the byteain function doesn't treat this as a
special character, but of course the general parser does as this is a single
quote. It also works fine to escape it as \', I just prefer to use all
octals.

\\134 Both the general parser and the byteain function see this (a single
\) as the special escape character. Therefore the general parser turns \\\\
into \\, and the byteain function turns \\ into \ for actual storage. Again,
I prefer to use the octal representation instead.

I hope this helps reduce the concerns and confusion over bytea. If anyone
can help explain why my linefeeds were getting converted to spaces, all the
mysteries would be explained ;-)

-- Joe

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Attachments:

/bjm/difftext/plainDownload
Index: doc/src/sgml/datatype.sgml
===================================================================
RCS file: /home/projects/pgsql/cvsroot/pgsql/doc/src/sgml/datatype.sgml,v
retrieving revision 1.60
diff -c -r1.60 datatype.sgml
*** doc/src/sgml/datatype.sgml	2001/08/31 01:55:25	1.60
--- doc/src/sgml/datatype.sgml	2001/09/04 03:15:49
***************
*** 84,89 ****
--- 84,95 ----
        </row>
  
        <row>
+        <entry><type>bytea</type></entry>
+        <entry></entry>
+        <entry>binary data</entry>
+       </row>
+ 
+       <row>
         <entry><type>character(<replaceable>n</replaceable>)</type></entry>
         <entry><type>char(<replaceable>n</replaceable>)</type></entry>
         <entry>fixed-length character string</entry>
***************
*** 782,788 ****
  	<entry>text</entry>
  	<entry>Variable unlimited length</entry>
         </row>
!       </tbody>
       </tgroup>
      </table>
  
--- 788,798 ----
  	<entry>text</entry>
  	<entry>Variable unlimited length</entry>
         </row>
!         <row>
! 	<entry>bytea</entry>
! 	<entry>binary data</entry>
!        </row>
!      </tbody>
       </tgroup>
      </table>
  
***************
*** 827,832 ****
--- 837,855 ----
      does not require an explicit declared upper limit on the size of
      the string.  Although the type <type>text</type> is not in the SQL
      standard, many other RDBMS packages have it as well.
+    </para>
+ 
+    <para>
+     The <type>bytea</type> data type allows storage of binary data,
+     specifically allowing storage of NULLs which are entered as
+     <literal>'\\000'</>. The first backslash is interpreted by the
+     single quotes, and the second is recognized by <type>bytea</> and
+     preceeds a three digit octal value. For a similar reason, a
+     backslash must be entered into a field as <literal>'\\\\'</> or
+     <literal>'\\134'</>. You may also have to escape line feeds and
+     carriage return if your interface automatically translates these. It
+     can store values of any length. <type>Bytea</> is a non-standard
+     data type.
     </para>
  
     <para>
#11Joe Conway
joseph.conway@home.com
In reply to: Bruce Momjian (#10)
Re: Re: Toast,bytea, Text -blob all confusing

\\012 In my early tests 0x0a (LF) was getting converted to 0x20

(space).

I think this was happening during PHP's parsing, but I'm still not sure.
I'll dig into this some more later.

<redfaced>
The script I was using in PHP *explicitly* converted all linefeeds to
spaces. Mystery solved.
</redfaced>

I think Bruce's text still works though.

-- Joe

#12Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Joe Conway (#11)
Re: Re: Toast,bytea, Text -blob all confusing

\\012 In my early tests 0x0a (LF) was getting converted to 0x20

(space).

I think this was happening during PHP's parsing, but I'm still not sure.
I'll dig into this some more later.

<redfaced>
The script I was using in PHP *explicitly* converted all linefeeds to
spaces. Mystery solved.
</redfaced>

I think Bruce's text still works though.

I can see other interfaces doing fancy things with newlines and carriage
returns so I added it to the docs.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#10)
Re: Re: Toast,bytea, Text -blob all confusing

Bruce Momjian <pgman@candle.pha.pa.us> writes:

However, there's a trap: before things get to bytea input handler, they
are first processed by postgresql general parser.

This description fails to make clear that the two levels of parsing only
apply for datums that are written as string literals in SQL commands.
An example where this doesn't apply is COPY input data.

regards, tom lane

#14Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#13)
Re: Re: Toast,bytea, Text -blob all confusing

Bruce Momjian <pgman@candle.pha.pa.us> writes:

However, there's a trap: before things get to bytea input handler, they
are first processed by postgresql general parser.

This description fails to make clear that the two levels of parsing only
apply for datums that are written as string literals in SQL commands.
An example where this doesn't apply is COPY input data.

Are you talking about my SGML changes? I clearly mention quote-handling
and bytea handling, which pretty clearly not apply in COPY.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026