Reducing data type space usage

Started by Bruce Momjian, over 19 years ago · 34 messages · hackers
#1Bruce Momjian
bruce@momjian.us

Following up on the recent discussion on-list about wasted space in data
representations, I want to summarise what we found and make some proposals:

As I see it there are two cases:

Case 1) Data types that are variable length but often quite small. This includes
types like NUMERIC, which in common use will rarely be larger than 12-20
bytes, and often text as well.

In cases like these we really only need 1 or sometimes 2 bytes of varlena
header overhead, not the 4 we currently use. In fact we *never* need more
than 2 bytes of varlena header on disk anyway with the standard
configuration.

Case 2) Data types whose size varies with the typmod but is always the same for a
given typmod, and can therefore be determined statically. In the case of an
ASCII-encoded database CHAR(n) fits this category, and in any case we'll
eventually have per-column encoding. NUMERIC(a,b) could also be made to fit
this as well.

In cases like these we don't need *any* varlena header, provided we can
arrange for the functions to have enough information to know how large the
data must be.

Solutions proposed:

Case 1) We've discussed the variable sized varlena headers and I think it's clear
that that's the most realistic way to approach it.

I don't think any other approaches were even suggested. Tom said he wanted
a second varlena format for numeric that would have 2-byte alignment. But I
think we could just say that we always use the 2-byte varlena header on
data types with 2-byte alignment and the 4-byte header on data types that
need 4-byte alignment. Or heap_form_tuple could be even cleverer about it,
but I'm not sure it's worth it.

This limits the wasted space to 1-2% for most variable-sized data 50 bytes
long or more. But for very small data, such as the quite common cases where
values are only 1-4 bytes, it still means a 25-100% space overhead.
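The overhead figures above follow from simple arithmetic; a quick sketch (plain Python; `overhead_pct` is an illustrative name, not anything in PostgreSQL):

```python
# Wasted header space as a percentage of the payload size.
def overhead_pct(data_bytes: int, header_bytes: int) -> float:
    return 100.0 * header_bytes / data_bytes

# Current 4-byte header on a 1-byte value: 400% overhead.
assert overhead_pct(1, 4) == 400.0

# A 1-byte header on 1-4 byte values: the 25-100% range cited above.
assert overhead_pct(1, 1) == 100.0
assert overhead_pct(4, 1) == 25.0

# A 1-byte header on data 50 bytes or longer: the 1-2% figure.
assert overhead_pct(50, 1) == 2.0
```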

Case 2) Solving this is quite difficult without introducing major performance
problems or security holes. The one approach we have that's practical right
now is introducing special data types such as the oft-mentioned "char" data
type. "char" doesn't have quite the right semantics to use as a transparent
substitute for CHAR but we could define a CHAR(1) with exactly the right
semantics and substitute it transparently in parser/analyze.c (btw having
two files named analyze.c is pretty annoying). We could do the same with
NUMERIC(a,b) for sufficiently small values of a and b with something like
D'Arcy's CASH data type (which uses an integer internally).

The problem with defining lots of data types is that the number of casts
and cross-data-type comparisons grows quadratically as the number of data
types grows. In theory we would save space by defining a CHAR(n) for
whatever size n the user needs but I can't really see anything other than
CHAR(1) being worthwhile. Similarly a 4-byte NUMERIC substitute like CASH
(with full NUMERIC semantics though) and maybe a 2-byte and 8-byte
substitute might be reasonable but anything else would be pointless.

I see these two solutions as complementary. The variable varlena headers take
care of the larger data and the special-purpose data types take care of the
extremely small data. And it's pretty important to cover both cases: data that
fits in 1-4 bytes is quite common. You often see databases with dozens of
CHAR(1) flag columns or NUMERIC(10,2) currency columns.

With a CHAR(1) and a CASH-style numeric substitute we won't lose 25-100% on
the things that would fit in 1-4 bytes. And with the variable-sized varlena
header we'll limit the space wasted on larger data to 25% at worst, and
usually 1-2%.

Doing better would require a complete solution to data types that can
understand how large they are based on their typmod. That would imply more
dramatic solutions like I mused about involving passing around structures that
contain the Datum as well as the attlen or atttypmod. The more I think about
these ideas the more I think they may have merit but they would be awfully
invasive and require more thought.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

#2Martijn van Oosterhout
kleptog@svana.org
In reply to: Bruce Momjian (#1)
Re: Reducing data type space usage

On Fri, Sep 15, 2006 at 06:50:37PM +0100, Gregory Stark wrote:

With a CHAR(1) and a CASH-style numeric substitute we won't lose 25-100% on
the things that would fit in 1-4 bytes. And with the variable-sized varlena
header we'll limit the space wasted on larger data to 25% at worst, and
usually 1-2%.

I wonder how much of the benefit will be eaten by alignment. I think
it'd be great if we rearranged the fields in a tuple to minimize
alignment padding, but that logical field order patch has been and gone
and the issues haven't changed.

There's also slack at the end of pages.
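The alignment concern is easy to see with a toy layout calculation (simplified rules, not PostgreSQL's actual tuple layout; names are illustrative):

```python
# Round an offset up to the next multiple of the required alignment.
def align_up(offset: int, alignment: int) -> int:
    return (offset + alignment - 1) // alignment * alignment

# Total size of a row of (size, alignment) fields, with padding.
def row_size(fields) -> int:
    off = 0
    for size, alignment in fields:
        off = align_up(off, alignment) + size
    return off

# A 1-byte CHAR(1)-style value followed by an int4 (4-byte alignment):
# 1 data byte + 3 pad bytes + 4 bytes = 8 bytes, so the 3 bytes saved
# by shrinking the header are given right back as padding.
assert row_size([(1, 1), (4, 4)]) == 8

# Reordering so same-alignment fields are adjacent avoids the padding.
assert row_size([(4, 4), (1, 1)]) == 5
```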

Doing better would require a complete solution to data types that can
understand how large they are based on their typmod. That would imply more
dramatic solutions like I mused about involving passing around structures that
contain the Datum as well as the attlen or atttypmod. The more I think about
these ideas the more I think they may have merit but they would be awfully
invasive and require more thought.

Whatever the solution is here, the same logic will have to apply to
extracting Datums out of tuples. If you want the 7th column in a tuple,
you have to find the lengths of all the previous datums first.
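The extraction cost can be sketched with a toy tuple format in which every column carries a 1-byte length prefix (a simplified model, not the heap tuple layout):

```python
# To reach column `index` you must decode the length of every earlier
# column first -- there is no way to jump straight to a later field.
def extract_column(raw: bytes, index: int) -> bytes:
    off = 0
    for i in range(index + 1):
        length = raw[off]              # 1-byte length prefix (toy format)
        start, off = off + 1, off + 1 + length
        if i == index:
            return raw[start:off]
    raise IndexError(index)

# Three columns: b"abc", b"x", b"hi".
tup = b"\x03abc\x01x\x02hi"
assert extract_column(tup, 2) == b"hi"
```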

Good summary though, probably worth putting on the wiki so next time we
don't have to search the archives.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/


From each according to his ability. To each according to his ability to litigate.

#3Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Bruce Momjian (#1)
Re: Reducing data type space usage

Gregory Stark wrote:

<snip>

Case 2) Solving this is quite difficult without introducing major performance
problems or security holes. The one approach we have that's practical right
now is introducing special data types such as the oft-mentioned "char" data
type. "char" doesn't have quite the right semantics to use as a transparent
substitute for CHAR but we could define a CHAR(1) with exactly the right
semantics and substitute it transparently in parser/analyze.c (btw having
two files named analyze.c is pretty annoying). We could do the same with
NUMERIC(a,b) for sufficiently small values of a and b with something like
D'Arcy's CASH data type (which uses an integer internally).

Didn't we discuss a problem with using CHAR(n), specifically that the
number of bytes required to store n characters is variable? I had
suggested making an ascii1 type, ascii2 type, etc. Someone else seemed
to be saying those should be called bytea1, bytea2, or perhaps with
parentheses bytea(1), bytea(2). The point being that it is a fixed
number of bytes.

The problem with defining lots of data types is that the number of casts
and cross-data-type comparisons grows quadratically as the number of data
types grows. In theory we would save space by defining a CHAR(n) for
whatever size n the user needs but I can't really see anything other than
CHAR(1) being worthwhile. Similarly a 4-byte NUMERIC substitute like CASH
(with full NUMERIC semantics though) and maybe a 2-byte and 8-byte
substitute might be reasonable but anything else would be pointless.

Wouldn't a 4-byte numeric be a "float4" and an 8-byte numeric a
"float8"? I'm not sure I see the difference. As for a 2-byte floating
point number, I like the idea and will look for an IEEE specification
for how the bits are arranged, if any such IEEE spec exists.

mark

#4Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#1)
Re: Reducing data type space usage

Gregory Stark wrote:

Case 2) Data types whose size varies with the typmod but is always the same for a
given typmod, and can therefore be determined statically. In the case of an
ASCII-encoded database CHAR(n) fits this category, and in any case we'll
eventually have per-column encoding. NUMERIC(a,b) could also be made to fit
this as well.

In cases like these we don't need *any* varlena header, provided we can
arrange for the functions to have enough information to know how large the
data must be.

I thought about the CHAR(1) case some more. Rather than restrict
single-byte storage to ASCII-encoded databases, I think there is a more
general solution.

First, I don't think any solution that assumes typmod will be around to
help determine the meaning of the column is going to work.

I think what will work is to store a 1-character, 7-bit ASCII value in
one byte, by setting the high bit. This will work for any database
encoding. This is the zero-length header case.

If the single character itself has its high bit set, it will require a
one-byte length header and then the high-bit byte, and if it is multi-byte,
perhaps more bytes.

A zero-length header will even work for a VARCHAR(8) field that stores one
7-bit ASCII character, because it isn't relying on the typmod.

FYI, we also need to figure out how to store a zero-length string. That
will probably be high-bit, and then all zero bits. We don't store a
zero-byte in strings, so that should be unique for "".
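As a toy model (illustrative Python, not backend code; the exact empty-string convention is still being settled below), the proposal reads roughly as:

```python
# Bruce's sketch: a single 7-bit ASCII character is stored as one byte
# with the high bit set (no header at all); anything else gets a 1-byte
# length header. Limits and names here are illustrative only.
def encode(s: bytes) -> bytes:
    if len(s) == 1 and s[0] < 0x80:
        return bytes([0x80 | s[0]])    # zero-length-header case
    assert len(s) < 0x80               # toy limit: length fits in 7 bits
    return bytes([len(s)]) + s

def decode(v: bytes) -> bytes:
    if v[0] & 0x80:                    # high bit set: bare ASCII char
        return bytes([v[0] & 0x7F])
    return v[1 : 1 + v[0]]             # length header + payload

# A CHAR(1) value costs exactly one byte in this model.
assert len(encode(b"Y")) == 1
assert decode(encode(b"Y")) == b"Y"
assert decode(encode(b"hello")) == b"hello"
```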

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#4)
Re: Reducing data type space usage

Bruce Momjian <bruce@momjian.us> writes:

FYI, we also need to figure out how to store a zero-length string. That
will probably be high-bit, and then all zero bits. We don't store a
zero-byte in strings, so that should be unique for "".

No, it'll be a 1-byte header with length indicating that no bytes
follow, which likely will be 10000001 rather than 10000000 ... but
in either case there is no ambiguity involved.

regards, tom lane

#6Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#5)
Re: Reducing data type space usage

Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

FYI, we also need to figure out how to store a zero-length string. That
will probably be high-bit, and then all zero bits. We don't store a
zero-byte in strings, so that should be unique for "".

No, it'll be a 1-byte header with length indicating that no bytes
follow, which likely will be 10000001 rather than 10000000 ... but
in either case there is no ambiguity involved.

Well, in my idea, 10000001 would be 0x01. I was going to use the
remaining 7 bits for the 7-bit ascii value.

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#6)
Re: Reducing data type space usage

Bruce Momjian <bruce@momjian.us> writes:

Tom Lane wrote:

No, it'll be a 1-byte header with length indicating that no bytes
follow,

Well, in my idea, 10000001 would be 0x01. I was going to use the
remaining 7 bits for the 7-bit ascii value.

Huh? I thought you said 00000001 would be 0x01, that is, high bit
clear means a single byte containing an ASCII character. You could
reverse that but it just seems to make things harder --- the byte
isn't a correct data byte by itself, as it would be with the other
convention.

regards, tom lane

#8Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#7)
Re: Reducing data type space usage

Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

Tom Lane wrote:

No, it'll be a 1-byte header with length indicating that no bytes
follow,

Well, in my idea, 10000001 would be 0x01. I was going to use the
remaining 7 bits for the 7-bit ascii value.

Huh? I thought you said 00000001 would be 0x01, that is, high bit
clear means a single byte containing an ASCII character. You could
reverse that but it just seems to make things harder --- the byte
isn't a correct data byte by itself, as it would be with the other
convention.

Oh, OK, I had the high bit meaning no header, but clear is better, so
00000001 is 0x01, and 00000000 is "". But I see now that bytea does
store nulls, so yeah, we would be better off using 10000001, and it is the
same size as 00000000.

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#9Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#8)
Re: Reducing data type space usage

Bruce Momjian <bruce@momjian.us> writes:

Oh, OK, I had the high bit meaning no header

Just how annoying would it be if I pointed out I suggested precisely this a
few days ago?

Tom said he didn't think there was enough code space and my own
experimentation was slowly leading me to agree, sadly. It would be neat to get
it to work though.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#8)
Re: Reducing data type space usage

Bruce Momjian <bruce@momjian.us> writes:

Oh, OK, I had the high bit meaning no header, but clear is better, so
00000001 is 0x01, and 00000000 is "". But I see now that bytea does
store nulls, so yeah, we would be better off using 10000001, and it is the
same size as 00000000.

I'm liking this idea more the more I think about it, because it'd
actually be far less painful to put into the system structure than the
other idea of fooling with varlena headers. To review: Bruce is
proposing a var-length type structure with the properties

first byte 0xxxxxxx ---- field length 1 byte, exactly that value
first byte 1xxxxxxx ---- xxxxxxx data bytes follow

This can support *any* stored value from zero to 127 bytes long.
We can imagine creating new datatypes "short varchar" and "short char",
and then having the parser silently substitute these types for varchar(N)
or char(N) whenever N <= 127 / max_encoding_length. Add some
appropriate implicit casts to convert these to the normal varlena types
for computation, and away you go. No breakage of any existing
datatype-specific code, just a few additions in places like
heap_form_tuple.
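The two rules above can be sketched as a toy model (illustrative Python, not the actual implementation; here the 7 length bits count the bytes that follow, which is one of the two conventions discussed upthread):

```python
# first byte 0xxxxxxx -> the field is exactly this single byte
# first byte 1xxxxxxx -> the low 7 bits give how many data bytes follow
def short_encode(s: bytes) -> bytes:
    if len(s) == 1 and s[0] < 0x80:
        return s                        # single ASCII byte, zero overhead
    assert len(s) <= 127                # the 127-byte ceiling
    return bytes([0x80 | len(s)]) + s

def short_decode(v: bytes) -> bytes:
    if v[0] < 0x80:
        return v[:1]                    # the byte is the whole value
    return v[1 : 1 + (v[0] & 0x7F)]

# CHAR(1) costs exactly one byte; the empty string costs one header byte.
assert short_encode(b"A") == b"A"
assert short_encode(b"") == b"\x80"
assert short_decode(short_encode(b"hello")) == b"hello"
```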

regards, tom lane

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#9)
Re: Reducing data type space usage

Gregory Stark <stark@enterprisedb.com> writes:

Tom said he didn't think there was enough code space and my own
experimentation was slowly leading me to agree, sadly.

There isn't if you want the type to also handle long strings.
But what if we restrict it to short strings? See my message
just now.

regards, tom lane

#12Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#9)
Re: Reducing data type space usage

Gregory Stark wrote:

Bruce Momjian <bruce@momjian.us> writes:

Oh, OK, I had high byte meaning no header

Just how annoying would it be if I pointed out I suggested precisely this a
few days ago?

Tom said he didn't think there was enough code space and my own
experimentation was slowly leading me to agree, sadly. It would be neat to get
it to work though.

I didn't see the 7-bit thing. Guess I missed it. I was worried we
would not have enough bits to track a 1-gig field.

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#13Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#11)
Re: Reducing data type space usage

Tom Lane <tgl@sss.pgh.pa.us> writes:

Gregory Stark <stark@enterprisedb.com> writes:

Tom said he didn't think there was enough code space and my own
experimentation was slowly leading me to agree, sadly.

There isn't if you want the type to also handle long strings.
But what if we restrict it to short strings? See my message
just now.

Then it seems like it imposes a pretty hefty burden on the user.

text columns, for example, can never take advantage of it. And there are
plenty of instances where 127 bytes would be just short enough to be annoying
even though 99% of the data would in fact be shorter. Things like "address"
and "product name" for example.

The user would have to decide that he'll never need a value over 127 bytes
long ever in order to get the benefit.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

#14Bort, Paul
pbort@tmwsystems.com
In reply to: Bruce Momjian (#13)
Re: Reducing data type space usage

Gregory Stark writes:

Tom Lane <tgl@sss.pgh.pa.us> writes:

There isn't if you want the type to also handle long strings.
But what if we restrict it to short strings? See my message just now.

Then it seems like it imposes a pretty hefty burden on the user.

But there are a lot of places where it wins:
- single byte for a multi-state flag
- hex representation of a hash (like SHA-1)
- part numbers
- lots of fields imported from legacy systems
- ZIP/Postal codes

And for all of those you can decisively say at design time that 127
characters is an OK limit.

+1 for Bruce/Tom's idea.

Regards,
Paul Bort

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#13)
Re: Reducing data type space usage

Gregory Stark <stark@enterprisedb.com> writes:

The user would have to decide that he'll never need a value over 127 bytes
long ever in order to get the benefit.

Weren't you the one that's been going on at great length about how
wastefully we store CHAR(1) ? Sure, this has a somewhat restricted
use case, but it's about as efficient as we could possibly get within
that use case.

regards, tom lane

#16Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#15)
Re: Reducing data type space usage

Tom Lane <tgl@sss.pgh.pa.us> writes:

Gregory Stark <stark@enterprisedb.com> writes:

The user would have to decide that he'll never need a value over 127 bytes
long ever in order to get the benefit.

Weren't you the one that's been going on at great length about how
wastefully we store CHAR(1) ? Sure, this has a somewhat restricted
use case, but it's about as efficient as we could possibly get within
that use case.

Sure, but are you saying you would have this in addition to variable-sized
varlena headers?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#16)
Re: Reducing data type space usage

Gregory Stark <stark@enterprisedb.com> writes:

Tom Lane <tgl@sss.pgh.pa.us> writes:

Weren't you the one that's been going on at great length about how
wastefully we store CHAR(1) ? Sure, this has a somewhat restricted
use case, but it's about as efficient as we could possibly get within
that use case.

Sure, but are you saying you would have this in addition to variable-sized
varlena headers?

We could ... but right at the moment I'm thinking this would solve 80%
of the problem with about 1% of the work needed for the varlena change.
So my thought is to do this for 8.3 and then wait a couple releases to
see if the more extensive hack is really needed. Even if we did
eventually install variable-sized varlena headers, this layout would
still be useful for types like inet/cidr.

regards, tom lane

#18Hannu Krosing
hannu@tm.ee
In reply to: Tom Lane (#7)
Re: Reducing data type space usage

On Fri, 2006-09-15 at 19:18, Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

Tom Lane wrote:

No, it'll be a 1-byte header with length indicating that no bytes
follow,

Well, in my idea, 10000001 would be 0x01. I was going to use the
remaining 7 bits for the 7-bit ascii value.

Huh? I thought you said 00000001 would be 0x01, that is, high bit
clear means a single byte containing an ASCII character.

why not go all the way, and do a UTF-7 encoded header if the high bit is set?

or just always have a UTF-8 encoded header.

You could
reverse that but it just seems to make things harder --- the byte
isn't a correct data byte by itself, as it would be with the other
convention.

regards, tom lane


--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com

#19Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#10)
Re: Reducing data type space usage

Tom Lane wrote:

To review: Bruce is
proposing a var-length type structure with the properties

first byte 0xxxxxxx ---- field length 1 byte, exactly that value
first byte 1xxxxxxx ---- xxxxxxx data bytes follow

This can support *any* stored value from zero to 127 bytes long.
We can imagine creating new datatypes "short varchar" and "short char",
and then having the parser silently substitute these types for varchar(N)
or char(N) whenever N <= 127 / max_encoding_length. Add some
appropriate implicit casts to convert these to the normal varlena types
for computation, and away you go. No breakage of any existing
datatype-specific code, just a few additions in places like
heap_form_tuple.

I like this scheme a lot - maximum bang for buck.

Is there any chance we can do it transparently, without exposing new
types? It is in effect an implementation detail ISTM, and ideally the
user would not need to have any knowledge of it.

cheers

andrew

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#19)
Re: Reducing data type space usage

Andrew Dunstan <andrew@dunslane.net> writes:

I like this scheme a lot - maximum bang for buck.

Is there any chance we can do it transparently, without exposing new
types? It is in effect an implementation detail ISTM, and ideally the
user would not need to have any knowledge of it.

Well, they'd have to be separate types, but the parser handling of them
would be reasonably transparent I think. It would work pretty much
exactly like the way that CHAR(N) maps to "bpchar" now --- is that
sufficiently well hidden for your taste?

regards, tom lane

#21Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#20)
#22Bruce Momjian
bruce@momjian.us
In reply to: Hannu Krosing (#18)
#23Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#15)
#24Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#23)
#25Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Mark Dilger (#3)
#26Mark Mielke
mark@mark.mielke.cc
In reply to: Mark Dilger (#25)
#27Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#24)
#28Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#15)
#29Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#27)
#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hannu Krosing (#18)
#31Martijn van Oosterhout
kleptog@svana.org
In reply to: Bruce Momjian (#24)
#32Bruce Momjian
bruce@momjian.us
In reply to: Martijn van Oosterhout (#31)
#33Hannu Krosing
hannu@tm.ee
In reply to: Tom Lane (#10)
#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hannu Krosing (#33)