Variable length varlena headers redux
I've been looking at this again and had a few conversations about it. This may
be easier than I had originally thought but there's one major issue that's
bugging me. Do you see any way to avoid having every user function everywhere
use a new macro api instead of VARDATA/VARATT_DATA and VARSIZE/VARATT_SIZEP?
The two approaches I see are either
a) Have two sets of macros. One set, VARATT_DATA and VARATT_SIZEP, is
for constructing new tuples and behaves exactly as it does now, so you always
construct a four-byte-header datum. Then in heap_form*tuple we check whether you
can use a shorter header and convert. VARDATA/VARSIZE would be for looking at
existing datums and would interpret the header bits.
This seems very fragile, since one stray call site using VARATT_DATA to find
the data in an existing datum would cause random bugs that occur only rarely,
in certain circumstances. It would even work as long as the size is filled in
with VARATT_SIZEP first, which it usually is, but fail if someone changed the
order of the statements.
or
b) Throw away VARATT_DATA and VARATT_SIZEP and make all user functions
everywhere change over to a new macro API. That seems like a pretty big
burden. It's safer, but it means every contrib module would have to be updated,
and so on.
I'm hoping I'm missing something and there's a way to do this without breaking
the api for every user function.
Uh, I thought the approach was to create type-specific in/out functions,
and add casting so every time they were referenced, they would expand
to a varlena structure in memory.
---------------------------------------------------------------------------
Gregory Stark wrote:
-- Start of included mail
From: Tom Lane <tgl@sss.pgh.pa.us>
To: Gregory Stark <stark@enterprisedb.com>
cc: Gregory Stark <gsstark@mit.edu>, Bruce Momjian <bruce@momjian.us>,
Peter Eisentraut <peter_e@gmx.net>, pgsql-hackers@postgresql.org,
Martijn van Oosterhout <kleptog@svana.org>
Subject: Re: [HACKERS] Fixed length data types issue
Date: Mon, 11 Sep 2006 13:15:43 -0400
Lines: 64
Xref: stark.xeocode.com work.enterprisedb:683
Gregory Stark <stark@enterprisedb.com> writes:
In any case it seems a bit backwards to me. Wouldn't it be better to
preserve bits in the case of short length words where they're precious
rather than long ones? If we make 0xxxxxxx the 1-byte case it means ...
Well, I don't find that real persuasive: you're saying that it's
important to have a 1-byte not 2-byte header for datums between 64 and
127 bytes long. Which is by definition less than a 2% savings for those
values. I think it's more important to pick bitpatterns that reduce
the number of cases heap_deform_tuple has to think about while decoding
the length of a field --- every "if" in that inner loop is expensive.
I realized this morning that if we are going to preserve the rule that
4-byte-header and compressed-header cases can be distinguished from the
data alone, there is no reason to be very worried about whether the
2-byte cases can represent the maximal length of an in-line datum.
If you want to do 16K inline (and your page is big enough for that)
you can just fall back to the 4-byte-header case. So there's no real
disadvantage if the 2-byte headers can only go up to 4K or so. This
gives us some more flexibility in the bitpattern choices.
Another thought that occurred to me is that if we preserve the
convention that a length word's value includes itself, then for a
1-byte header the bit pattern 10000000 is meaningless --- the count
has to be at least 1. So one trick we could play is to take over
this value as the signal for "toast pointer follows", with the
assumption that the tuple-decoder code knows a-priori how big a
toast pointer is. I am not real enamored of this, because it certainly
adds one case to the inner heap_deform_tuple loop and it'll give us
problems if we ever want more than one kind of toast pointer. But
it's a possibility.
Anyway, a couple of encodings that I'm thinking about now involve
limiting uncompressed data to 1G (same as now), so that we can play
with the first 2 bits instead of just 1:
00xxxxxx 4-byte length word, aligned, uncompressed data (up to 1G)
01xxxxxx 4-byte length word, aligned, compressed data (up to 1G)
100xxxxx 1-byte length word, unaligned, TOAST pointer
1010xxxx 2-byte length word, unaligned, uncompressed data (up to 4K)
1011xxxx 2-byte length word, unaligned, compressed data (up to 4K)
11xxxxxx 1-byte length word, unaligned, uncompressed data (up to 63b)
or
00xxxxxx 4-byte length word, aligned, uncompressed data (up to 1G)
010xxxxx 2-byte length word, unaligned, uncompressed data (up to 8K)
011xxxxx 2-byte length word, unaligned, compressed data (up to 8K)
10000000 1-byte length word, unaligned, TOAST pointer
1xxxxxxx 1-byte length word, unaligned, uncompressed data (up to 127b)
(xxxxxxx not all zero)
This second choice allows longer datums in both the 1-byte and 2-byte
header formats, but it hardwires the length of a TOAST pointer and
requires four cases to be distinguished in the inner loop; the first
choice only requires three cases, because TOAST pointer and 1-byte
header can be handled by the same rule "length is low 6 bits of byte".
The second choice also loses the ability to store in-line compressed
data above 8K, but that's probably an insignificant loss.
There's more than one way to do it ...
regards, tom lane
-- End of included mail.
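For concreteness, the three-case decode rule of the first encoding above might look like this (the function name and the big-endian layout of the multi-byte length words are assumptions for illustration, not actual PostgreSQL code):

```c
#include <stdint.h>

/* Sketch: decode the total size of a datum under the first proposed
 * encoding, with only three cases in the inner loop. */
static uint32_t varsize_any(const uint8_t *p)
{
    uint8_t b = p[0];

    if ((b & 0x80) == 0)
    {
        /* 0xxxxxxx: 4-byte length word; bit 0x40 flags compressed data,
         * the remaining 30 bits are the length (up to 1G). */
        return ((uint32_t) (b & 0x3F) << 24) | ((uint32_t) p[1] << 16) |
               ((uint32_t) p[2] << 8) | (uint32_t) p[3];
    }
    if ((b & 0xE0) == 0xA0)
    {
        /* 1010xxxx / 1011xxxx: 2-byte length word (up to 4K). */
        return ((uint32_t) (b & 0x0F) << 8) | (uint32_t) p[1];
    }
    /* 100xxxxx (TOAST pointer) and 11xxxxxx (1-byte header) share the
     * rule "length is the low 6 bits of the byte". */
    return b & 0x3F;
}
```

This is the sense in which the first encoding needs only three "if" tests: the TOAST-pointer and 1-byte cases collapse into the final branch.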
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote:
Uh, I thought the approach was to create type-specific in/out functions,
and add casting so every time they were referenced, they would expand
to a varlena structure in memory.
Oh, one more thing. You are going to need to teach the code that walks
through a tuple's attributes about the short header types. I think you
should set pg_type.typlen = -3 (vs. -1 for varlena) and put your macro
code there too. (As an example, see the macro att_addlength().)
I know it is kind of odd to have a data type that is only used on disk,
and not in memory, but I see this as a baby varlena type, used only to
store and get varlena values using less disk space.
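A sketch of what the extra typlen = -3 case in an att_addlength()-style walker could look like (the short-header format, the helper name, and the simplified 4-byte case are illustrative assumptions, not the real backend code):

```c
#include <stdint.h>
#include <string.h>

/* Sketch: advance an offset past one attribute, the way att_addlength()
 * does, with an added case for a hypothetical typlen = -3 "short
 * varlena" whose 1-byte header stores the total size in its low 6 bits. */
static uint32_t att_addlength_sketch(uint32_t off, int16_t typlen,
                                     const void *attptr)
{
    if (typlen > 0)
        return off + (uint32_t) typlen;            /* fixed-width type */
    else if (typlen == -1)
    {
        uint32_t len;                              /* classic varlena: 4-byte header */
        memcpy(&len, attptr, sizeof(len));
        return off + len;
    }
    else if (typlen == -3)                         /* hypothetical short varlena */
        return off + (*(const uint8_t *) attptr & 0x3F);
    else                                           /* typlen == -2: cstring */
        return off + (uint32_t) strlen((const char *) attptr) + 1;
}
```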
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes:
Bruce Momjian wrote:
Uh, I thought the approach was to create type-specific in/out functions,
and add casting so every time they were referenced, they would expand
to a varlena structure in memory.
Are you talking about actual casts? Because that would lead to all kinds of
strange places with indexes and function lookups and so on. Or are you just
talking about code in the macro API for datums?
Oh, one more thing. You are going to need to teach the code that walks
through a tuple's attributes about the short header types. I think you
should set pg_type.typlen = -3 (vs. -1 for varlena) and put your macro
code there too. (As an example, see the macro att_addlength().)
I thought of doing this. It would let us, for example, treat text/varchar,
bpchar, and numeric specially but leave other data types unchanged.
That does help somewhat but unfortunately text is the problem case. There's
tons of code that generates text without using textin. All of pgcrypto for
example.
I know it is kind of odd to have a data type that is only used on disk,
and not in memory, but I see this as a baby varlena type, used only to
store and get varlena values using less disk space.
I was leaning toward generating the short varlena headers primarily in
heap_form*tuple and just having the datatype specific code generate 4-byte
headers much as you describe.
However that doesn't get us away from having VARDATA/VARSIZE be aware of the new
headers, since heap_deform*tuple and the other entry points that extract
individual attributes return pointers to the datum in the tuple. They can't
expand the header to a 4-byte header on the fly.
I thought of doing it in DETOAST_DATUM on the theory that everyone's going to
be calling it on their arguments. However there are other cases than just
arguments. Other functions might call, say, text_concat() and then call
VARDATA() on the result.
Even if we only ever generate short headers on heap_form*tuple and always
expand them on DETOAST we could have code that passes around tuples that it
"knows" are entirely in memory and therefore not toasted. I'm thinking of
plpgsql here primarily. Perhaps it would be enough to outlaw this behaviour
but it still seems sort of fragile to me.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Greg Stark wrote:
Bruce Momjian <bruce@momjian.us> writes:
Bruce Momjian wrote:
Uh, I thought the approach was to create type-specific in/out functions,
and add casting so every time they were referenced, they would expand
to a varlena structure in memory.
Are you talking about actual casts? Because that would lead to all kinds of
strange places with indexes and function lookups and so on. Or are you just
talking about code in the macro API for datums?
I am thinking of auto-casts, sort of like how varchar, char, and text
are all internally treated as interchangeable.
Oh, one more thing. You are going to need to teach the code that walks
through a tuple's attributes about the short header types. I think you
should set pg_type.typlen = -3 (vs. -1 for varlena) and put your macro
code there too. (As an example, see the macro att_addlength().)
I thought of doing this. It would let us, for example, treat text/varchar,
bpchar, and numeric specially but leave other data types unchanged.
That does help somewhat but unfortunately text is the problem case. There's
tons of code that generates text without using textin. All of pgcrypto for
example.
Well, TEXT can't use short headers.
I know it is kind of odd to have a data type that is only used on disk,
and not in memory, but I see this as a baby varlena type, used only to
store and get varlena values using less disk space.
I was leaning toward generating the short varlena headers primarily in
heap_form*tuple and just having the datatype specific code generate 4-byte
headers much as you describe.
Yep.
However that doesn't get us away from having VARDATA/VARSIZE be aware of the new
headers, since heap_deform*tuple and the other entry points that extract
individual attributes return pointers to the datum in the tuple. They can't
expand the header to a 4-byte header on the fly.
Yep, you are going to have to special-case those call points to test
typlen and use your short macros there if == -3.
I thought of doing it in DETOAST_DATUM on the theory that everyone's going to
be calling it on their arguments. However there are other cases than just
arguments. Other functions might call, say, text_concat() and then call
VARDATA() on the result.
Right, I think all the in-memory stuff has to be varlena.
Even if we only ever generate short headers on heap_form*tuple and always
expand them on DETOAST we could have code that passes around tuples that it
"knows" are entirely in memory and therefore not toasted. I'm thinking of
plpgsql here primarily. Perhaps it would be enough to outlaw this behaviour
but it still seems sort of fragile to me.
Yea, we might need some cleanup, but the cleanup is just to do things
properly. I am unclear on the case you are describing.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Greg Stark <gsstark@mit.edu> writes:
Bruce Momjian <bruce@momjian.us> writes:
I know it is kind of odd to have a data type that is only used on disk,
and not in memory, but I see this as a baby varlena type, used only to
store and get varlena values using less disk space.
I was leaning toward generating the short varlena headers primarily in
heap_form*tuple and just having the datatype specific code generate 4-byte
headers much as you describe.
I thought we had a solution for all this, namely to make the short-form
headers be essentially a TOAST-compressed representation. The format
with 4-byte headers is still legal but just not compressed. Anyone who
fails to detoast an input argument is already broken, so there's no code
compatibility hit taken.
regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
Greg Stark <gsstark@mit.edu> writes:
Bruce Momjian <bruce@momjian.us> writes:
I know it is kind of odd to have a data type that is only used on disk,
and not in memory, but I see this as a baby varlena type, used only to
store and get varlena values using less disk space.
I was leaning toward generating the short varlena headers primarily in
heap_form*tuple and just having the datatype specific code generate 4-byte
headers much as you describe.
I thought we had a solution for all this, namely to make the short-form
headers be essentially a TOAST-compressed representation. The format
with 4-byte headers is still legal but just not compressed. Anyone who
fails to detoast an input argument is already broken, so there's no code
compatibility hit taken.
It's not just input arguments though. A function could call
DirectFunctionCall* and rightfully expect the return value not to need
detoasting.
I suppose this leads me to *only* generate short headers at heap_form*tuple
time. Then DirectFunctionCall isn't relevant and most of the user code is
perfectly safe.
There could still be cases where a heap tuple is passed around in pl_exec.c or
somewhere, but if it's subsequently deformed, whoever looks at it hopefully
wouldn't be too surprised that going through pg_detoast_datum is mandatory.
That will happen anyway as long as they use the DatumGetFoo macros.
It does mean that anyone going through a heap_form*tuple/heap_deform*tuple
cycle may generate more copies and memory allocations than they expected.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Tom Lane <tgl@sss.pgh.pa.us> writes:
Greg Stark <gsstark@mit.edu> writes:
Bruce Momjian <bruce@momjian.us> writes:
I know it is kind of odd to have a data type that is only used on disk,
and not in memory, but I see this as a baby varlena type, used only to
store and get varlena values using less disk space.
I was leaning toward generating the short varlena headers primarily in
heap_form*tuple and just having the datatype specific code generate 4-byte
headers much as you describe.
I thought we had a solution for all this, namely to make the short-form
headers be essentially a TOAST-compressed representation. The format
with 4-byte headers is still legal but just not compressed. Anyone who
fails to detoast an input argument is already broken, so there's no code
compatibility hit taken.
Uh. So I don't see how to make this work on a little-endian machine. If the
leading bits are 0 we don't know if they're toast flags or bits on the least
significant byte of a longer length.
If we store all lengths in network byte order that problem goes away but then
user code that does "VARATT_SIZEP(datum) = len" is incorrect.
If we declare in-memory format to be host byte order and on-disk format to be
network byte order then every single varlena datum needs to be copied when
heap_deform*tuple runs.
If we only do this for a new kind of varlena then only text/varchar/
char/numeric datums would need to be copied but that's still a lot.
--
greg
Greg Stark wrote:
Tom Lane <tgl@sss.pgh.pa.us> writes:
Greg Stark <gsstark@mit.edu> writes:
Bruce Momjian <bruce@momjian.us> writes:
I know it is kind of odd to have a data type that is only used on disk,
and not in memory, but I see this as a baby varlena type, used only to
store and get varlena values using less disk space.
I was leaning toward generating the short varlena headers primarily in
heap_form*tuple and just having the datatype specific code generate 4-byte
headers much as you describe.
I thought we had a solution for all this, namely to make the short-form
headers be essentially a TOAST-compressed representation. The format
with 4-byte headers is still legal but just not compressed. Anyone who
fails to detoast an input argument is already broken, so there's no code
compatibility hit taken.
Uh. So I don't see how to make this work on a little-endian machine. If the
leading bits are 0 we don't know if they're toast flags or bits on the least
significant byte of a longer length.
If we store all lengths in network byte order that problem goes away but then
user code that does "VARATT_SIZEP(datum) = len" is incorrect.
If we declare in-memory format to be host byte order and on-disk format to be
network byte order then every single varlena datum needs to be copied when
heap_deform*tuple runs.
If we only do this for a new kind of varlena then only text/varchar/
char/numeric datums would need to be copied but that's still a lot.
I wonder if we need to reorder the TOAST structure to have the bits we
need at the start of the structure so we can be sure they are first.
For example, what if we split varattrib.va_header, which is int32 now,
into four 'char' fields, and just reassemble it in the toast code. That
would be pretty localized.
I had forgotten about hooking into the TOAST system, but since we are
going to be "expanding" the headers of these types when they get into
memory, it does make sense.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
"Bruce Momjian" <bruce@momjian.us> writes:
Uh. So I don't see how to make this work on a little-endian machine. If the
leading bits are 0 we don't know if they're toast flags or bits on the least
significant byte of a longer length.
...I had forgotten about hooking into the TOAST system, but since we are
going to be "expanding" the headers of these types when they get into
memory, it does make sense.
Ok, I guess this can work if we guarantee that in-memory datums always have
4-byte headers. That means that heap_deform*tuple always copies the datum if
it's this type of datum.
That means we never have pointers to shortvarlena datums inside tuples. I'm
not sure if there are parts of the system that assume that the datums they get
out of heap_deform*tuple are pointers into the tuple or not. I haven't come
across any in my travels thus far.
That seems like an awful lot of copying and pallocs that aren't there
currently though. And it'll make us reluctant to change over frequently used
data types like text -- which are precisely the ones that would gain us the
most.
It seems to me that it might be better to change to storing varlena lengths in
network byte order instead. That way we can dedicate the leading bits to toast
flags and read more bytes as necessary.
I think the way to do this would be to throw out the VARATT_SIZEP macro and
replace it with VARATT_SET_SIZE(datum,size). VARSIZE would just call ntohl (or
ntohs if the leading bits on the first byte indicated...)
That does mean touching every piece of data type code. And invalidating every
piece of user code. :( At least it's fairly mechanical. And it has the
advantage of not being at all fragile -- unfixed code won't even compile.
While we're at it I would suggest taking out the VARHDRSZ offset and just
storing the size of the data payload. The constant VARHDRSZ offset no longer
makes sense since it won't actually be the size of the varlena header anyway.
And predicting the actual size of the varlena header will be annoying and
bug-prone since it depends on the resulting value you calculate.
(Incidentally, this would actually make EnterpriseDB somewhat sad since we
want pg_migrator to work for 8.3. But it wouldn't be out of the realm of
possibility to go through the database and switch varlena headers to network
byte order. There's no need to compress them, just leave the 4-byte format in
place with the bytes swapped around.)
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes:
That seems like an awful lot of copying and pallocs that aren't there
currently though. And it'll make us reluctant to change over frequently used
data types like text -- which are precisely the ones that would gain us the
most.
It seems to me that it might be better to change to storing varlena lengths in
network byte order instead. That way we can dedicate the leading bits to toast
flags and read more bytes as necessary.
This'll add its own overhead ... but probably less than pallocs and
data-copying would. And I agree we can find (pretty much) all the
places that need changing by the expedient of deliberately renaming
the macros and struct fields.
One problem I foresee is that I think you are about to propose that
VARDATA depend on the length already having been inserted, which it
does not now; and simple renamings won't detect ordering errors for
that. Also I believe there are places that over-allocate memory,
fill in the data, and only then set the length; something you will
not easily be able to change. It might work if we assume that
*creation* of a varlena value always produces the 4-byte-header form
and only reading of a value that might be on disk needs to cope with
the short-header forms. However this seems to require two different
forms of VARDATA depending on whether one is preparing or reading a
value. Ugh.
regards, tom lane
Tom Lane wrote:
Gregory Stark <stark@enterprisedb.com> writes:
That seems like an awful lot of copying and pallocs that aren't there
currently though. And it'll make us reluctant to change over frequently used
data types like text -- which are precisely the ones that would gain us the
most.
It seems to me that it might be better to change to storing varlena lengths in
network byte order instead. That way we can dedicate the leading bits to toast
flags and read more bytes as necessary.
This'll add its own overhead ... but probably less than pallocs and
data-copying would. And I agree we can find (pretty much) all the
places that need changing by the expedient of deliberately renaming
the macros and struct fields.
I think we should go with the pallocs and see how it performs. That is
certainly going to be easier to do, and we can test it pretty easily.
One palloc optimization idea would be to split out the representation so
the length is stored separately from the data in memory; we could use
an int32 for the length, and point to the shared buffer for the data.
However I don't think our macros can handle that, so it might be a
non-starter.
However, I think we should find out if the palloc is a problem before
avoiding it.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote:
Tom Lane wrote:
Gregory Stark <stark@enterprisedb.com> writes:
That seems like an awful lot of copying and pallocs that aren't there
currently though. And it'll make us reluctant to change over frequently used
data types like text -- which are precisely the ones that would gain us the
most.
It seems to me that it might be better to change to storing varlena lengths in
network byte order instead. That way we can dedicate the leading bits to toast
flags and read more bytes as necessary.
This'll add its own overhead ... but probably less than pallocs and
data-copying would. And I agree we can find (pretty much) all the
places that need changing by the expedient of deliberately renaming
the macros and struct fields.
I think we should go with the pallocs and see how it performs. That is
certainly going to be easier to do, and we can test it pretty easily.
One palloc optimization idea would be to split out the representation so
the length is stored separately from the data in memory, and we could use
an int32 for the length, and point to the shared buffer for the data.
However I don't think our macros can handle that so it might be a
non-starter.
However, I think we should find out if the palloc is a problem before
avoiding it.
Another idea about reducing palloc is that we know every short column is
at most 128 + 4 = 132 bytes, so we could allocate a 132-byte buffer for
every short column in the scan, and just re-use the buffer for every
row.
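A rough sketch of that reuse scheme, assuming a hypothetical 1-byte header whose low 7 bits give the payload length (the names and formats here are invented for illustration, not backend code):

```c
#include <stdint.h>
#include <string.h>

/* A short-header datum expands to at most 128 + 4 = 132 bytes, so a
 * scan can keep one fixed scratch buffer per short column and rewrite
 * each value into it as a classic 4-byte-header varlena, instead of
 * palloc'ing a fresh copy for every row. */
enum { SHORT_MAX_EXPANDED = 132 };

static uint32_t expand_short(uint8_t scratch[SHORT_MAX_EXPANDED],
                             const uint8_t *shortd)
{
    uint32_t payload = shortd[0] & 0x7F;    /* payload length from 1-byte header */
    uint32_t total = payload + 4;           /* plus the 4-byte length word */

    memcpy(scratch, &total, sizeof(total)); /* write host-order length word */
    memcpy(scratch + 4, shortd + 1, payload);  /* copy payload after header */
    return total;
}
```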
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Gregory Stark <gsstark@mit.edu> writes:
Do you see any way to avoid having every user function everywhere
use a new macro api instead of VARDATA/VARATT_DATA and VARSIZE/VARATT_SIZEP?
Bruce and I had a long phone conversation about that today. We
concluded that there doesn't seem to be any strong reason to change the
macros for *reading* a varlena value's length. The problem is with the
code that *writes* a length word. The mother example is textin():
result = (text *) palloc(len + VARHDRSZ);
VARATT_SIZEP(result) = len + VARHDRSZ;
memcpy(VARDATA(result), inputText, len);
There is not a lot we can do with that second line: it's going to assign
the value of "len + VARHDRSZ" to something that has to be a native C
variable, struct field, or at best bit field.
If we replaced that line with something like
SET_VARLENA_LEN(result, len + VARHDRSZ);
then we'd have a *whole* lot more control. For example it'd be easy to
implement the previously-discussed design involving storing uncompressed
length words in network byte order: SET_VARLENA_LEN does htonl() and
VARSIZE does ntohl() and nothing else in the per-datatype functions
needs to change. Another idea that we were kicking around is to make
an explicit distinction between little-endian and big-endian hardware:
on big-endian hardware, store the two TOAST flag bits in the MSBs as
now, but on little-endian, store them in the LSBs, shifting the length
value up two bits. This would probably be marginally faster than
htonl/ntohl depending on hardware and compiler intelligence, but either
way you get to guarantee that the flag bits are in the physically first
byte, which is the critical thing needed to be able to tell the
difference between compressed and uncompressed length values.
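To make that concrete, here's a rough sketch of where the flag bits would land
(not PostgreSQL's actual macros; WORDS_BIGENDIAN stands in for whatever the
build system defines, and only the uncompressed case is shown):

```c
#include <stdint.h>

/*
 * Rough sketch of the endian-dependent scheme described above -- not
 * the real macros, and only the uncompressed (flags = 00) case.  The
 * goal is that the two flag bits always land in the physically first
 * byte of the length word:
 *   - big-endian: flags in the MSBs, as today;
 *   - little-endian: flags in the LSBs, length shifted up two bits.
 */
#ifdef WORDS_BIGENDIAN
#define SET_VARLENA_LEN(word, len)  ((word) = (uint32_t) (len))
#define GET_VARLENA_LEN(word)       ((word) & 0x3FFFFFFF)
#else
#define SET_VARLENA_LEN(word, len)  ((word) = (uint32_t) (len) << 2)
#define GET_VARLENA_LEN(word)       ((word) >> 2)
#endif
```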
By my count there are only 170 uses of VARATT_SIZEP in the entire
backend (including contrib) so this is not an especially daunting
change. It would break existing user-written functions that return
varlena values, but the fix wouldn't be painful for them either.
We could guarantee that every problem spot is found by removing the
current definition of the macro (and renaming the underlying struct
fields to get the attention of the truly unreconstructed...). The
important point here is that VARSIZE() still works, so only code that
creates a new varlena value need be affected, not code that examines
one.
Too bad we didn't get this right in the original TOAST rewrite; but
hindsight is always 20/20. Right now I think we gotta absorb some
pain if we want to improve the header-overhead situation, and this
seems about the minimum possible amount of pain.
regards, tom lane
On Mon, Feb 12, 2007 at 11:19:14PM -0500, Tom Lane wrote:
By my count there are only 170 uses of VARATT_SIZEP in the entire
backend (including contrib) so this is not an especially daunting
change. It would break existing user-written functions that return
varlena values, but the fix wouldn't be painful for them either.
Could the "new style" macros be back-ported to previous releases in case
we do this? That way module maintainers wouldn't need to maintain
two different sets of code for it - they could use the new style and
just compile it against an older version of pg?
//Magnus
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
If we replaced that line with something like
SET_VARLENA_LEN(result, len + VARHDRSZ);
then we'd have a *whole* lot more control.
I think that part was already clear. The problem was in VARDATA.
I don't really see a way around it though. Places that fill in VARDATA before
the size (formatting.c seems to be the worst case) will just have to be
changed and it'll be a fairly fragile point.
For example it'd be easy to implement the previously-discussed design
involving storing uncompressed length words in network byte order:
SET_VARLENA_LEN does htonl() and VARSIZE does ntohl() and nothing else in
the per-datatype functions needs to change. Another idea that we were
kicking around is to make an explicit distinction between little-endian and
big-endian hardware: on big-endian hardware, store the two TOAST flag bits
in the MSBs as now, but on little-endian, store them in the LSBs, shifting
the length value up two bits. This would probably be marginally faster than
htonl/ntohl depending on hardware and compiler intelligence, but either way
you get to guarantee that the flag bits are in the physically first byte,
which is the critical thing needed to be able to tell the difference between
compressed and uncompressed length values.
Actually I think neither htonl nor bitshifting the entire 4-byte word is going
to really work here. Both will require 4-byte alignment. Instead I think we
have to access the length byte by byte as a (char*) and do arithmetic. Since
it's the pointer being passed to VARSIZE, that isn't too hard, but it might
perform poorly.
The important point here is that VARSIZE() still works, so only code that
creates a new varlena value need be affected, not code that examines one.
So what would VARSIZE() return, the size of the payload plus VARHDRSZ
regardless of what actual size the header was? That seems like it would break
the least existing code, though removing all the VARHDRSZ offsets seems like
it would be cleaner.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Magnus Hagander wrote:
On Mon, Feb 12, 2007 at 11:19:14PM -0500, Tom Lane wrote:
By my count there are only 170 uses of VARATT_SIZEP in the entire
backend (including contrib) so this is not an especially daunting
change. It would break existing user-written functions that return
varlena values, but the fix wouldn't be painful for them either.

Could the "new style" macros be back-ported to previous releases in case
we do this? That way module maintainers wouldn't need to maintain
two different sets of code for it - they could use the new style and
just compile it against an older version of pg?
Yes, Tom and I talked about this. It could appear in the next minor
release of all branches.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Gregory Stark wrote:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
For example it'd be easy to implement the previously-discussed design
involving storing uncompressed length words in network byte order:
SET_VARLENA_LEN does htonl() and VARSIZE does ntohl() and nothing else in
the per-datatype functions needs to change. Another idea that we were
kicking around is to make an explicit distinction between little-endian and
big-endian hardware: on big-endian hardware, store the two TOAST flag bits
in the MSBs as now, but on little-endian, store them in the LSBs, shifting
the length value up two bits. This would probably be marginally faster than
htonl/ntohl depending on hardware and compiler intelligence, but either way
you get to guarantee that the flag bits are in the physically first byte,
which is the critical thing needed to be able to tell the difference between
compressed and uncompressed length values.

Actually I think neither htonl nor bitshifting the entire 4-byte word is going
to really work here. Both will require 4-byte alignment. Instead I think we
have to access the length byte by byte as a (char*) and do arithmetic. Since
it's the pointer being passed to VARSIZE that isn't too hard, but it might
perform poorly.
We would still require all datums with a 4-byte header to be 4-byte
aligned, right? When reading, you would first check if it's a compressed
or uncompressed header. If compressed, read the 1 byte header, if
uncompressed, read the 4-byte header and do htonl or bitshifting. No
need to do htonl or bitshifting on unaligned datums.
The important point here is that VARSIZE() still works, so only code that
creates a new varlena value need be affected, not code that examines one.

So what would VARSIZE() return, the size of the payload plus VARHDRSZ
regardless of what actual size the header was? That seems like it would break
the least existing code, though removing all the VARHDRSZ offsets seems like
it would be cleaner.
My vote would be to change every caller. Though there's a lot of
callers, it's a very simple change.
To make it posible to compile an external module against 8.2 and 8.3,
you could have a simple ifdef block to map the new macro to old
behavior. Or we could backport the macro definitions as Magnus suggested.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
Actually I think neither htonl nor bitshifting the entire 4-byte word is going
to really work here. Both will require 4-byte alignment. Instead I think we
have to access the length byte by byte as a (char*) and do arithmetic. Since
it's the pointer being passed to VARSIZE, that isn't too hard, but it might
perform poorly.

We would still require all datums with a 4-byte header to be 4-byte aligned,
right? When reading, you would first check if it's a compressed or uncompressed
header. If compressed, read the 1 byte header, if uncompressed, read the 4-byte
header and do htonl or bitshifting. No need to do htonl or bitshifting on
unaligned datums.
It's not easy to require datums with 4-byte headers to be 4-byte aligned. How
do you know where to look for the bits to show it's an uncompressed header if
you don't know where it's aligned yet?
It could be done if you rule that if you're on an unaligned byte and see a \0
then scan forward until the aligned byte. But that seems just as cpu expensive
as just doing the arithmetic. And wastes space to boot.
I'm thinking VARSIZE would look something like:
#define VARSIZE(datum) \
    ((((uint8 *) (datum))[0] & 0x80) ? \
     (((uint8 *) (datum))[0] & 0x7F) : \
     (((uint8 *) (datum))[0] << 24 | ((uint8 *) (datum))[1] << 16 | \
      ((uint8 *) (datum))[2] << 8 | ((uint8 *) (datum))[3]))
Which is effectively the same as doing ntohl except that the result isn't an
lvalue -- luckily VARSIZE is only ever read, never assigned to. It also works
for unaligned accesses. It's going to be fairly slow, but no slower than doing
an unaligned access looking at nul padding bytes.
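For anyone who wants to experiment with it, a stand-alone compilable version
(MY_VARSIZE is an illustrative name, and uint8_t stands in for the backend's
uint8) would be:

```c
#include <stdint.h>

/*
 * Stand-alone version of the macro above, for experimentation only
 * (MY_VARSIZE is an illustrative name).  The first byte is examined
 * directly, so no 4-byte alignment is needed: high bit set means a
 * 1-byte header holding the size, otherwise the four bytes are a
 * network-byte-order length word.
 */
#define MY_VARSIZE(datum) \
    ((((const uint8_t *) (datum))[0] & 0x80) ? \
     (uint32_t) (((const uint8_t *) (datum))[0] & 0x7F) : \
     ((uint32_t) ((const uint8_t *) (datum))[0] << 24 | \
      (uint32_t) ((const uint8_t *) (datum))[1] << 16 | \
      (uint32_t) ((const uint8_t *) (datum))[2] << 8 | \
      (uint32_t) ((const uint8_t *) (datum))[3]))
```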
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Bruce Momjian <bruce@momjian.us> writes:
Magnus Hagander wrote:
Could the "new style" macros be back-ported to previous releases in case
we do this?
Yes, Tom and I talked about this. It could appear in the next minor
release of all branches.
I don't really see the point of that. Third-party authors who want
their code to be backwards-compatible would do something like
#ifndef SET_VARLENA_LEN
#define SET_VARLENA_LEN(var,len) (VARATT_SIZEP(var) = (len))
#endif
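Spelled out as a complete example (the varlena stand-ins are simplified from
the postgres.h definitions purely for illustration, and malloc stands in for
palloc):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-ins for the postgres.h definitions, so this sketch
 * compiles on its own; the names mirror the real ones but the layout
 * is illustrative only. */
typedef struct
{
    int32_t vl_len;
    char    vl_dat[1];
} varlena;

#define VARHDRSZ        ((int32_t) sizeof(int32_t))
#define VARATT_SIZEP(v) (((varlena *) (v))->vl_len)
#define VARDATA(v)      (((varlena *) (v))->vl_dat)

/* The backwards-compatibility shim from above: if the headers don't
 * provide SET_VARLENA_LEN, fall back to the old assignable macro. */
#ifndef SET_VARLENA_LEN
#define SET_VARLENA_LEN(var, len) (VARATT_SIZEP(var) = (len))
#endif

/* A textin-style constructor written in the new style; it compiles
 * unchanged whether or not the headers define SET_VARLENA_LEN. */
static varlena *
make_text(const char *s, size_t len)
{
    varlena *result = malloc(len + VARHDRSZ);   /* palloc in the backend */

    SET_VARLENA_LEN(result, len + VARHDRSZ);
    memcpy(VARDATA(result), s, len);
    return result;
}
```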
While we could provide this same macro in later updates of the current
release branches, those authors are still going to want to include the
above in their code so as to be able to compile against existing
releases. Therefore there's not really much point in us doing it too.
regards, tom lane