OCTET_LENGTH is wrong
I noticed OCTET_LENGTH will return the size of the data after TOAST may
have compressed it. While this could be useful information, this
behaviour has no basis in the SQL standard and it's not what is
documented. Moreover, it eliminates the standard useful behaviour of
OCTET_LENGTH, which is to show the length in bytes of a multibyte string.
--
Peter Eisentraut peter_e@gmx.net
Peter Eisentraut <peter_e@gmx.net> writes:
I noticed OCTET_LENGTH will return the size of the data after TOAST may
have compressed it. While this could be useful information, this
behaviour has no basis in the SQL standard and it's not what is
documented. Moreover, it eliminates the standard useful behaviour of
OCTET_LENGTH, which is to show the length in bytes of a multibyte string.
I wondered about that too, the first time I noticed it. On the other
hand, knowing the compressed length is kinda useful too, at least for
hacking and DBA purposes. (One might also like to know whether a value
has been moved out of line, which is not currently determinable.)
I don't want to force an initdb at this stage, at least not without
compelling reason, so adding more functions right now is not feasible.
Maybe a TODO item for next time.
That leaves us with the question whether to change OCTET_LENGTH now
or leave it for later. Anyone?
BTW, I noticed that textlength() is absolutely unreasonably slow when
MULTIBYTE is enabled --- yesterday I was trying to profile TOAST
overhead, and soon discovered that what I was looking at was nothing
but pg_mblen() calls. It really needs a short-circuit path for
single-byte encodings.
regards, tom lane
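The short-circuit Tom asks for can be sketched roughly as follows. This is a toy illustration, not PostgreSQL's actual code: `utf8_mblen()` and `enc_max_len` are made-up stand-ins for the real pg_mblen()/encoding machinery.

```c
#include <assert.h>

/* Byte length of one UTF-8 character, determined from its first byte. */
static int utf8_mblen(const unsigned char *s)
{
    if ((*s & 0x80) == 0x00) return 1;
    if ((*s & 0xE0) == 0xC0) return 2;
    if ((*s & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Character count of an nbytes-long string.  enc_max_len is the
 * encoding's maximum bytes per character (1 for single-byte
 * encodings), so the per-character loop is skipped entirely when
 * one byte always equals one character. */
static int char_length(const char *s, int nbytes, int enc_max_len)
{
    if (enc_max_len == 1)
        return nbytes;              /* fast path: 1 char == 1 byte */

    int chars = 0;
    for (int i = 0; i < nbytes; chars++)
        i += utf8_mblen((const unsigned char *) s + i);
    return chars;
}
```

The fast path avoids any per-character work, which is exactly where the pg_mblen() profiling cost came from.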
Peter Eisentraut <peter_e@gmx.net> writes:
I noticed OCTET_LENGTH will return the size of the data after TOAST may
have compressed it. While this could be useful information, this
behaviour has no basis in the SQL standard and it's not what is
documented. Moreover, it eliminates the standard useful behaviour of
OCTET_LENGTH, which is to show the length in bytes of a multibyte string.
I wondered about that too, the first time I noticed it. On the other
hand, knowing the compressed length is kinda useful too, at least for
hacking and DBA purposes. (One might also like to know whether a value
has been moved out of line, which is not currently determinable.)
I don't want to force an initdb at this stage, at least not without
compelling reason, so adding more functions right now is not feasible.
Maybe a TODO item for next time.
That leaves us with the question whether to change OCTET_LENGTH now
or leave it for later. Anyone?
I am unconcerned about showing people the actual toasted length. Seems
we should get octet_length() computed on the un-TOASTED length, if we
can.
BTW, I noticed that textlength() is absolutely unreasonably slow when
MULTIBYTE is enabled --- yesterday I was trying to profile TOAST
overhead, and soon discovered that what I was looking at was nothing
but pg_mblen() calls. It really needs a short-circuit path for
single-byte encodings.
Added to TODO:
* Optimize textlength(), etc. for single-byte encodings
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Peter Eisentraut <peter_e@gmx.net> writes:
I noticed OCTET_LENGTH will return the size of the data after TOAST may
have compressed it. While this could be useful information, this
behaviour has no basis in the SQL standard and it's not what is
documented. Moreover, it eliminates the standard useful behaviour of
OCTET_LENGTH, which is to show the length in bytes of a multibyte string.
I wondered about that too, the first time I noticed it. On the other
hand, knowing the compressed length is kinda useful too, at least for
hacking and DBA purposes. (One might also like to know whether a value
has been moved out of line, which is not currently determinable.)
It seems the behavior of OCTET_LENGTH varies according to the
corresponding data type:
TEXT: returns the size of data AFTER TOAST
VARCHAR and CHAR: returns the size of data BEFORE TOAST
I think we should fix at least these inconsistencies but am not sure
if it's totally wrong that OCTET_LENGTH returns the length AFTER
TOAST. The SQL standard does not have any idea about TOAST of course.
Also, I tend to agree with Tom's point about hackers and DBAs.
I don't want to force an initdb at this stage, at least not without
compelling reason, so adding more functions right now is not feasible.
Maybe a TODO item for next time.
That leaves us with the question whether to change OCTET_LENGTH now
or leave it for later. Anyone?
My opinion is to leave it for 7.3, along with the idea of adding new
functions.
BTW, I noticed that textlength() is absolutely unreasonably slow when
MULTIBYTE is enabled --- yesterday I was trying to profile TOAST
overhead, and soon discovered that what I was looking at was nothing
but pg_mblen() calls. It really needs a short-circuit path for
single-byte encodings.
It's easy to optimize that. However I cannot access CVS anymore after
the IP address change. Will post patches later...
--
Tatsuo Ishii
Tom Lane wrote:
Peter Eisentraut <peter_e@gmx.net> writes:
... Moreover, it eliminates the standard useful behaviour of
OCTET_LENGTH, which is to show the length in bytes of a multibyte string.
While I don't necessarily dispute this, I do kinda wonder where you
derive the statement. AFAICS, SQL92 defines OCTET_LENGTH in terms
of BIT_LENGTH:
6.6 General Rule 5:
a) Let S be the <string value expression>. If the value of S is
not the null value, then the result is the smallest integer
not less than the quotient of the division (BIT_LENGTH(S)/8).
b) Otherwise, the result is the null value.
and BIT_LENGTH is defined in the next GR:
a) Let S be the <string value expression>. If the value of S is
not the null value, then the result is the number of bits in
the value of S.
b) Otherwise, the result is the null value.
While SQL92 is pretty clear about <bit string>, I'm damned if I can see
anywhere that they define how many bits are in a character string value.
So who's to say what representation is to be used to count the bits?
If, say, UTF-16 and UTF-8 are equally reasonable choices, then why
shouldn't a compressed representation be reasonable too?
One objection I have to this is the fact that nobody uses the compressed
representation in client libraries, whereas they do use both UTF-16 and
UTF-8.
At least UTF-8 is available as client encoding.
And probably it is possible that the length of the "possibly compressed"
representation can change without the underlying data changing (for
example when you set a bit somewhere that disables compression and
UPDATE some other field in the tuple), making the result of OCTET_LENGTH
dependent on other things than the argument string.
I also like the property of _uncompressed_ OCTET_LENGTH that
OCTET_LENGTH(s||s) == 2 * OCTET_LENGTH(s)
which is almost never true for compressed length.
----------------
Hannu
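Hannu's identity can be illustrated with a toy example. The "compressor" here is a made-up run-length scheme, not TOAST's actual algorithm; the point is only that any compressed length breaks the doubling property that the plain octet length always satisfies.

```c
#include <assert.h>
#include <string.h>

/* Length of a toy run-length encoding of s: one (count, byte) pair
 * per run of identical bytes.  Stand-in for any compressor. */
static int rle_len(const char *s)
{
    int len = 0;
    for (int i = 0; s[i]; ) {
        int j = i;
        while (s[j] == s[i])
            j++;                    /* find the end of this run */
        len += 2;                   /* one (count, byte) pair per run */
        i = j;
    }
    return len;
}
```

For s = "aaaa", s||s = "aaaaaaaa" compresses to the same two-byte pair, so the compressed length of s||s is not twice that of s, while strlen (the uncompressed octet length) doubles exactly.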
Peter Eisentraut <peter_e@gmx.net> writes:
... Moreover, it eliminates the standard useful behaviour of
OCTET_LENGTH, which is to show the length in bytes of a multibyte string.
While I don't necessarily dispute this, I do kinda wonder where you
derive the statement. AFAICS, SQL92 defines OCTET_LENGTH in terms
of BIT_LENGTH:
6.6 General Rule 5:
a) Let S be the <string value expression>. If the value of S is
not the null value, then the result is the smallest integer
not less than the quotient of the division (BIT_LENGTH(S)/8).
b) Otherwise, the result is the null value.
and BIT_LENGTH is defined in the next GR:
a) Let S be the <string value expression>. If the value of S is
not the null value, then the result is the number of bits in
the value of S.
b) Otherwise, the result is the null value.
While SQL92 is pretty clear about <bit string>, I'm damned if I can see
anywhere that they define how many bits are in a character string value.
So who's to say what representation is to be used to count the bits?
If, say, UTF-16 and UTF-8 are equally reasonable choices, then why
shouldn't a compressed representation be reasonable too?
regards, tom lane
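Whatever representation is chosen for counting the bits, the arithmetic in General Rule 5 itself is just a ceiling division; a minimal sketch:

```c
#include <assert.h>

/* SQL92 6.6 GR 5, as quoted above: OCTET_LENGTH(S) is the smallest
 * integer not less than BIT_LENGTH(S)/8, i.e. ceil(bit_length / 8). */
static int octet_length_from_bits(int bit_length)
{
    return (bit_length + 7) / 8;    /* integer ceiling division */
}
```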
BTW, I noticed that textlength() is absolutely unreasonably slow when
MULTIBYTE is enabled --- yesterday I was trying to profile TOAST
overhead, and soon discovered that what I was looking at was nothing
but pg_mblen() calls. It really needs a short-circuit path for
single-byte encodings.
It's easy to optimize that. However I cannot access CVS anymore after
the IP address change. Will post patches later...
Seems I got the cvs access again (I was asked my pass phrase again)
and I have committed changes for this.
Modified functions are:
bpcharlen
textlen
varcharlen
--
Tatsuo Ishii
BTW, I noticed that textlength() is absolutely unreasonably slow when
MULTIBYTE is enabled --- yesterday I was trying to profile TOAST
overhead, and soon discovered that what I was looking at was nothing
but pg_mblen() calls. It really needs a short-circuit path for
single-byte encodings.
It's easy to optimize that. However I cannot access CVS anymore after
the IP address change. Will post patches later...
Seems I got the cvs access again (I was asked my pass phrase again)
and I have committed changes for this.
Modified functions are:
bpcharlen
textlen
varcharlen
Did you go with the pre or post-TOAST length for these types? I vote
for pre-TOAST because it seems much more useful to ordinary users.
--
Bruce Momjian
BTW, I noticed that textlength() is absolutely unreasonably slow when
MULTIBYTE is enabled --- yesterday I was trying to profile TOAST
overhead, and soon discovered that what I was looking at was nothing
but pg_mblen() calls. It really needs a short-circuit path for
single-byte encodings.
It's easy to optimize that. However I cannot access CVS anymore after
the IP address change. Will post patches later...
Seems I got the cvs access again (I was asked my pass phrase again)
and I have committed changes for this.
Modified functions are:
bpcharlen
textlen
varcharlen
OK, sorry, I see you did the optimization rather than changing the length
function.
--
Bruce Momjian
Tom Lane writes:
a) Let S be the <string value expression>. If the value of S is
not the null value, then the result is the number of bits in
the value of S.
b) Otherwise, the result is the null value.
While SQL92 is pretty clear about <bit string>, I'm damned if I can see
anywhere that they define how many bits are in a character string value.
So who's to say what representation is to be used to count the bits?
If, say, UTF-16 and UTF-8 are equally reasonable choices, then why
shouldn't a compressed representation be reasonable too?
I think "the value of S" implies "the user-accessible representation of
the value of S", in the sense, "How much memory do I need to allocate to
store this value".
Furthermore, the size of the TOAST representation that is returned now is
just one of several intermediate representations. For
instance, it does not include the VARHDRSZ and it does not include the
size of the tuple headers when it's stored externally. Thus, this size is
heavily skewed toward low numbers and doesn't tell you much about either
the disk end or the user's end.
--
Peter Eisentraut peter_e@gmx.net
Peter Eisentraut <peter_e@gmx.net> writes:
I think "the value of S" implies "the user-accessible representation of
the value of S", in the sense, "How much memory do I need to allocate to
store this value".
If I take that argument seriously, I have to conclude that OCTET_LENGTH
should return the string length measured in the current client encoding
(which may have little to do with its size in the server, if the
server's encoding is different). If the client actually retrieves the
string then that's how much memory he'll need.
I presume that where you want to come out is OCTET_LENGTH = uncompressed
length in the server's encoding ... but so far no one has really made
a convincing argument why that answer is better or more spec-compliant
than any other answer. In particular, it's not obvious to me why
"number of bytes we're actually using on disk" is wrong.
regards, tom lane
I think "the value of S" implies "the user-accessible representation of
the value of S", in the sense, "How much memory do I need to allocate to
store this value".
Furthermore, the size of the TOAST representation that is returned now is
just one particular of several intermediate representations. For
instance, it does not include the VARHDRSZ and it does not include the
size of the tuple headers when it's stored externally. Thus, this size is
heavily skewed toward low numbers and doesn't tell you much about either
the disk end or the user's end.
Yes, good arguments. If we want to implement storage_length at some
later time, I think the compressed length may be appropriate, but for
general use, I think we need to return the uncompressed length,
especially considering that multibyte makes the ordinary length() return
the number of characters, so users need a way to get byte length.
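The characters-versus-bytes distinction Bruce describes can be sketched for UTF-8 (an illustrative helper, not PostgreSQL code): length() corresponds to counting characters, octet_length() to counting bytes, and only in a multibyte encoding do the two diverge.

```c
#include <assert.h>
#include <string.h>

/* Count UTF-8 characters by skipping 10xxxxxx continuation bytes;
 * strlen() on the same string gives the octet count. */
static int utf8_char_count(const char *s)
{
    int chars = 0;
    for (const unsigned char *p = (const unsigned char *) s; *p; p++)
        if ((*p & 0xC0) != 0x80)    /* not a continuation byte */
            chars++;
    return chars;
}
```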
Attached is a patch that makes text return the same value type as char()
and varchar() already do. As Tatsuo pointed out, they were
inconsistent. All the other octet_length() functions look fine so it
was only text that had this problem.
--
Bruce Momjian
Attachments:
/pgpatches/lengthtext (text/plain, +7 -7)
On Sun, 18 Nov 2001, Tom Lane wrote:
Peter Eisentraut <peter_e@gmx.net> writes:
I think "the value of S" implies "the user-accessible representation of
the value of S", in the sense, "How much memory do I need to allocate to
store this value".
If I take that argument seriously, I have to conclude that OCTET_LENGTH
should return the string length measured in the current client encoding
(which may have little to do with its size in the server, if the
server's encoding is different). If the client actually retrieves the
string then that's how much memory he'll need.
I presume that where you want to come out is OCTET_LENGTH = uncompressed
length in the server's encoding ... but so far no one has really made
a convincing argument why that answer is better or more spec-compliant
than any other answer. In particular, it's not obvious to me why
"number of bytes we're actually using on disk" is wrong.
I'm not sure, but if we say that the on disk representation is the
value of the character value expression whose size is being checked,
wouldn't that be inconsistent with the other uses of the character value
expression in places like substr where we don't use the on disk
representation? Unless you're saying that the string value expression
that is that character value expression is the compressed one and
the character value expression is the uncompressed one.
Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:
On Sun, 18 Nov 2001, Tom Lane wrote:
I presume that where you want to come out is OCTET_LENGTH = uncompressed
length in the server's encoding ... but so far no one has really made
a convincing argument why that answer is better or more spec-compliant
than any other answer. In particular, it's not obvious to me why
"number of bytes we're actually using on disk" is wrong.
I'm not sure, but if we say that the on disk representation is the
value of the character value expression whose size is being checked,
wouldn't that be inconsistent with the other uses of the character value
Yeah, it would be and is. In fact, the present code has some
interesting behaviors: if foo.x is a text value long enough to be
toasted, then you get different results from
SELECT OCTET_LENGTH(x) FROM foo;
SELECT OCTET_LENGTH(x || '') FROM foo;
since the result of the concatenation expression won't be compressed.
I'm not actually here to defend the existing code; in fact I believe the
XXX comment on textoctetlen questioning its correctness is mine. What
I am trying to point out is that the spec is so vague that it's not
clear what the correct answer is.
regards, tom lane
Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:
On Sun, 18 Nov 2001, Tom Lane wrote:
I presume that where you want to come out is OCTET_LENGTH = uncompressed
length in the server's encoding ... but so far no one has really made
a convincing argument why that answer is better or more spec-compliant
than any other answer. In particular, it's not obvious to me why
"number of bytes we're actually using on disk" is wrong.
I'm not sure, but if we say that the on disk representation is the
value of the character value expression whose size is being checked,
wouldn't that be inconsistent with the other uses of the character value
Yeah, it would be and is. In fact, the present code has some
interesting behaviors: if foo.x is a text value long enough to be
toasted, then you get different results from
SELECT OCTET_LENGTH(x) FROM foo;
SELECT OCTET_LENGTH(x || '') FROM foo;
since the result of the concatenation expression won't be compressed.
I'm not actually here to defend the existing code; in fact I believe the
XXX comment on textoctetlen questioning its correctness is mine. What
I am trying to point out is that the spec is so vague that it's not
clear what the correct answer is.
Well, if the standard is unclear, we should return the most
reasonable answer, which has to be the non-compressed length.
In multibyte encodings, when we started returning length() in
_characters_ instead of bytes, I assumed the major use for octet_length
was to return the number of bytes needed to hold the value on the client
side.
In single byte encodings, octet_length is the same as length() so
returning a compressed length may make sense, but I don't think we want
different meanings for the function for single and multi-byte encodings.
I guess the issue is that for single-byte encodings, octet_length is
pretty useless because it is the same as length, but for multi-byte
encodings, octet_length is invaluable and almost has to return
non-compress bytes because uncompressed is that the client sees.
--
Bruce Momjian
Bruce Momjian <pgman@candle.pha.pa.us> writes:
octet_length is invaluable and almost has to return
non-compress bytes because uncompressed is that the client sees.
What about encoding?
regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
octet_length is invaluable and almost has to return
non-compress bytes because uncompressed is that the client sees.
^^^^
what
What about encoding?
Single-byte encodings have the same character and byte lengths. Only
multi-byte encodings are different, right?
In thinking about it, I think the function is called octet_length()
to emphasize it returns the length in octets (bytes) rather than the
length in characters.
--
Bruce Momjian
What about encoding?
Single-byte encodings have the same character and byte lengths. Only
multi-byte encodings are different, right?
In thinking about it, I think the function is called octet_length()
to emphasize it returns the length in octets (bytes) rather than the
length in characters.
I think Tom's point is whether octet_length() should regard input text
being encoded in the client side encoding or not.
My vote is that octet_length() assumes the database encoding.
If you need client side encoded text length, you could do something
like:
select octet_length(convert('foo',pg_client_encoding()));
Note that there was a nasty bug in convert() which prevented the above
from working. I have committed fixes.
--
Tatsuo Ishii
Tom Lane writes:
What
I am trying to point out is that the spec is so vague that it's not
clear what the correct answer is.
I guess the authors of SQL92 never imagined someone would question what
"value of S" means. In SQL99 they included it:
SQL 99 Part 1, 4.4.3.2.
A value of character type is a string (sequence) of characters
drawn from some character repertoire.
Just to be sure...
SQL 99 Part 1, 3.1 q)
q) sequence: An ordered collection of objects that are not
necessarily distinct.
I don't have a set theory text available, but I think this should give a
fair indication that the number of bits in the value of S is the sum of
the bits in each individual character (which is in turn vaguely defined
elsewhere in SQL99) -- at least in Euclidean memory architectures.
--
Peter Eisentraut peter_e@gmx.net
Peter Eisentraut <peter_e@gmx.net> writes:
I don't have a set theory text available, but I think this should give a
fair indication that the number of bits in the value of S is the sum of
the bits in each individual character (which is in turn vaguely defined
elsewhere in SQL99) -- at least in Euclidean memory architectures.
But "how many bits in a character?" is exactly the question at this
point. To be fair, I don't think our notion of on-the-fly encoding
translation is envisioned anywhere in the SQL spec, so perhaps we
shouldn't expect it to tell us which encoding to count the bits in.
regards, tom lane