Fixed length data types issue
So I'm thinking again about the problems with fixed length data types not
having typmod available when they would need it. But I'm having trouble
finding enough old posts to get a handle on exactly what the problem is.
This would make a nice test of the new wiki. I would be indebted to whoever
could summarize the root of the problem and explain exactly what circumstances
the typmod is unavailable. I would summarize the responses and put them up on
the wiki.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes:
So I'm thinking again about the problems with fixed length data types not
having typmod available when they would need it. But I'm having trouble
finding enough old posts to get a handle on exactly what the problem is.
The problem is it isn't available ;-)
AFAIR the only context where datatype-specific functions *do* get passed
typmod is in the invocation of a datatype input function or length
coercion function. And in those contexts the semantics are really
"convert the input to match this typmod", not "this typmod describes
what you've been passed".
The basic rule here is that you have to be able to find out everything
you need to know about a given instance of a datatype just by looking at
the Datum. If you try to rely on external data then you have the same
security problems that we had to redesign output functions to get rid
of: there's not sufficient guarantee that the external data actually
matches the datum.
regards, tom lane
On Tue, Sep 05, 2006 at 02:48:45PM +0100, Gregory Stark wrote:
So I'm thinking again about the problems with fixed length data types not
having typmod available when they would need it. But I'm having trouble
finding enough old posts to get a handle on exactly what the problem is.
Like Tom said, the problem is you don't have it. In the specific case
of type input functions, what typmod is the output? For type output
functions relying on a passed typmod is a security risk.
So you end up storing the typmod in the Datum itself, which brings you
right back to varlena.
This would make a nice test of the new wiki. I would be indebted to whoever
could summarize the root of the problem and explain exactly what circumstances
the typmod is unavailable. I would summarize the responses and put them up on
the wiki.
Well, the root of the problem depends on your perspective. If the
purpose behind all of this is to save disk space, perhaps the root of
the problem is that disk representation and memory representation are
intimately tied?
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.
Martijn van Oosterhout <kleptog@svana.org> writes:
So you end up storing the typmod in the Datum itself, which brings you
right back to varlena.
Not really since the Datum doesn't actually end up on disk in the case of
pass-by-reference.
which leads us to:
Well, the root of the problem depends on your perspective. If the
purpose behind all of this is to save disk space, perhaps the root of
the problem is that disk representation and memory representation are
intimately tied?
Indeed.
Consider this real table definition I found in a few moments searching for
schemas on google:
PRVDR_CTRL_TYPE_CD: CHAR(2)
PRVDR_NUM: CHAR(6)
NPI: NUMBER
RPT_STUS_CD: CHAR(1)
FY_BGN_DT: DATE
FY_END_DT: DATE
PROC_DT: DATE
INITL_RPT_SW: CHAR(1)
LAST_RPT_SW: CHAR(1)
TRNSMTL_NUM: CHAR(2)
FI_NUM: CHAR(5)
ADR_VNDR_CD: CHAR(1)
FI_CREAT_DT: DATE
UTIL_CD: CHAR(1)
NPR_DT: DATE
SPEC_IND: CHAR(1)
FI_RCPT_DT: DATE
By my count postgres would use 154 bytes for this record. Whereas in fact
there's no need for it to take more than 87 bytes. Almost 100% overhead for
varlena headers and the padding they necessitate.
This is not a pathological example. This is a very common style of database
schema definition. Many many database tables in the real world are a 1-1
translations of existing flat file databases which have lots of short fixed
length ascii codes. Any database interacting with any old school inventory
management systems, financial databases, marketing database, etc is likely to
be of this form.
So it seems what has to happen here is we need a way of defining a data type
that has a different on-disk representation from its in-memory definition.
That means a lot more cpu overhead since I imagine it will mean pallocing the
in-memory representation before you can actually do anything with the data.
The disk reader and writer functions could probably use the typmod but it
seems what they really want to have access to is the attlen because what they
really want to know is the length of the object that their pointer refers to.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes:
Martijn van Oosterhout <kleptog@svana.org> writes:
So you end up storing the typmod in the Datum itself, which brings you
right back to varlena.
Not really since the Datum doesn't actually end up on disk in the case of
pass-by-reference.
Just brain storming here. But what happens if we make Datum 2*sizeof(pointer)
and stored the typmod and/or attlen in it?
Obviously it means the memory use goes up dramatically. But the disk i/o could
potentially be reduced dramatically as well.
Does it let us do anything else we've been dreaming of but not thought doable?
Does it cause any fundamental problems?
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
On Thu, Sep 07, 2006 at 11:57:26AM +0100, Gregory Stark wrote:
Just brain storming here. But what happens if we make Datum 2*sizeof(pointer)
and stored the typmod and/or attlen in it?
The fundamental property of a Datum is that you can pass it by value to
a C function. This generally means it has to fit in a register. On the
whole, the CPU register size is the same as the pointer size, so
2*sizeof(pointer) is unlikely to fit...
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.
On 2006-09-07, Gregory Stark <stark@enterprisedb.com> wrote:
Consider this real table definition I found in a few moments searching for
schemas on google:
[snip table with lots of fixed-length char fields]
By my count postgres would use 154 bytes for this record. Whereas in fact
there's no need for it to take more than 87 bytes.
Are you sure? Perhaps you are assuming that a char(1) field can be made
to be fixed-length; this is not the case (consider utf-8 for example).
--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
Martijn van Oosterhout <kleptog@svana.org> writes:
The fundamental property of a Datum is that you can pass it by value to
a C function. This generally means it has to fit in a register. On the
whole, the CPU register size is the same as the pointer size, so
2*sizeof(pointer) is unlikely to fit...
Not having it fit in a register might impact performance but it certainly
isn't a requirement. You can pass whole structs by value in modern C. (And by
modern here I don't mean C99, this has been supported since before ANSI and is
required by *C89*).
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Andrew - Supernews <andrew+nonews@supernews.com> writes:
Are you sure? Perhaps you are assuming that a char(1) field can be made
to be fixed-length; this is not the case (consider utf-8 for example).
Well that could still be fixed length, it would just be a longer fixed length.
(In theory it would have to be 6 bytes long which I suppose would open up the
argument that if you're usually storing 7-bit ascii then a varlena would
usually be shorter.)
In any case I think the intersection of columns for which you care about i18n
and columns that you're storing according to an old-fashioned fixed column
layout is pretty much nil. And not just because it hasn't been updated to
modern standards either. If you look again at the columns in my example you'll
see none of them are appropriate targets for i18n anyways. They're all codes
and even numbers.
In other words if you're actually storing localized text then you almost
certainly will be using a text or varchar and probably won't even have a
maximum size. The use case for CHAR(n) is when you have fixed length
statically defined strings that are always the same length. It doesn't make
sense to store these in UTF8.
Currently Postgres has a limitation that you can only have one encoding per
database and one locale per cluster. Personally I'm of the opinion that the
only correct choice for that is "C" and all localization should be handled in
the client and with pg_strxfrm. Putting the whole database into non-C locales
guarantees that the columns that should not be localized will have broken
semantics and there's no way to work around things in the other direction.
Perhaps given the current situation what we should have is a cvarchar and
cchar data types that are like varchar and char but guaranteed to always be
interpreted in the c locale with ascii encoding.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
On Thu, Sep 07, 2006 at 01:11:49PM +0100, Gregory Stark wrote:
Martijn van Oosterhout <kleptog@svana.org> writes:
The fundamental property of a Datum is that you can pass it by value to
a C function. This generally means it has to fit in a register. On the
whole, the CPU register size is the same as the pointer size, so
2*sizeof(pointer) is unlikely to fit...
Not having it fit in a register might impact performance but it certainly
isn't a requirement. You can pass whole structs by value in modern C. (And by
modern here I don't mean C99, this has been supported since before ANSI and is
required by *C89*).
Sure, the C compiler pushes it on the stack and passes a pointer to the
function. Pass-by-value in this context means "pass a reference to a
copy". It works, but it's not very efficient.
The C compiler also allows you to create struct variables and assign them
as if they were plain variables. The assembly code to make this work
isn't pretty. You're proposing doing it for everywhere in the backend,
which seems like a huge cost for very little gain.
A better approach would be to revive the proposal for a variable-length
varlena header. It's four-bytes fixed because that's easy, but given most
values are under 4K you could come up with a coding scheme that cut the
header for such Datums to only 2 bytes, or less...
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.
On Thu, Sep 07, 2006 at 01:27:01PM +0100, Gregory Stark wrote:
... If you look again at the columns in my example you'll
see none of them are appropriate targets for i18n anyways. They're all codes
and even numbers.
Which begs the question of why they don't store the numbers in numeric
columns? That'll take far less space than any string.
In other words if you're actually storing localized text then you almost
certainly will be using a text or varchar and probably won't even have a
maximum size. The use case for CHAR(n) is when you have fixed length
statically defined strings that are always the same length. It doesn't make
sense to store these in UTF8.
It makes sense to store them as numbers, or perhaps an enum.
Currently Postgres has a limitation that you can only have one encoding per
database and one locale per cluster. Personally I'm of the opinion that the
only correct choice for that is "C" and all localization should be handled in
the client and with pg_strxfrm. Putting the whole database into non-C locales
guarantees that the columns that should not be localized will have broken
semantics and there's no way to work around things in the other direction.
Quite. So if someone would code up SQL COLLATE support and integrate
ICU, everyone would be happy and we could all go home.
BTW, requiring localisation to happen in the client is silly. SQL
provides the ORDER BY clause for strings and it'd be silly to have the
client resort them just because they're not using C locale. The point
of a database was to make your life easier, right?
Perhaps given the current situation what we should have is a cvarchar and
cchar data types that are like varchar and char but guaranteed to always be
interpreted in the c locale with ascii encoding.
I think bytea gives you that, pretty much.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.
Martijn van Oosterhout <kleptog@svana.org> writes:
On Thu, Sep 07, 2006 at 11:57:26AM +0100, Gregory Stark wrote:
Just brain storming here. But what happens if we make Datum 2*sizeof(pointer)
and stored the typmod and/or attlen in it?
The fundamental property of a Datum is that you can pass it by value to
a C function. This generally means it has to fit in a register. On the
whole, the CPU register size is the same as the pointer size, so
2*sizeof(pointer) is unlikely to fit...
Not to mention the problem that such a change would break every single
datatype-manipulation function in both the core backend and every
user-written datatype. I don't think we're going there.
regards, tom lane
Martijn van Oosterhout <kleptog@svana.org> writes:
On Thu, Sep 07, 2006 at 01:27:01PM +0100, Gregory Stark wrote:
... If you look again at the columns in my example you'll
see none of them are appropriate targets for i18n anyways. They're all codes
and even numbers.
Which begs the question of why they don't store the numbers in numeric
columns? That'll take far less space than any string.
Sure, unless by PRVDR_NUM they mean things like '000001' or even 'C00001'.
...
It makes sense to store them as numbers, or perhaps an enum.
Sure. If you're designing the schema from scratch and don't have to
interoperate with any other systems. But if you're receiving a flat ascii text
file and it has a 5 character opaque identifier called "FI_NUM" which do you
think is the safer approach for storing these opaque identifiers?
You can suggest that there are other ways of designing a schema that will work
better with Postgres but I think you're just working around Postgres
deficiencies.
These may be deficiencies that are pretty low down your priority list but they
may be higher up my list. I just don't think you can argue they're not
deficiencies just because you know better than to get tripped up by them. I'm
discussing these things with an eye to getting some kind of consensus on what
should be done about them so I can go do it, not because I'm trying to get you
to work on it :)
Currently Postgres has a limitation that you can only have one encoding per
database and one locale per cluster. Personally I'm of the opinion that the
only correct choice for that is "C" and all localization should be handled in
the client and with pg_strxfrm. Putting the whole database into non-C locales
guarantees that the columns that should not be localized will have broken
semantics and there's no way to work around things in the other direction.
Quite. So if someone would code up SQL COLLATE support and integrate
ICU, everyone would be happy and we could all go home.
Well I for one would be pretty unhappy if ICU were integrated. It seems like a
whole pile of code and complexity for no particular gain. The standard i18n
support with a few extensions (namely strcoll_l) seems to be adequate for us
and not introduce huge new dependencies and code burdens.
BTW, requireing localisation to happen in the client is silly. SQL
provides the ORDER BY clause for strings and it'd be silly to have the
client resort them just because they're not using C locale. The point
of a database was to make your life easier, right?
That's why I mentioned pg_strxfrm. It doesn't solve all your problems if
you're doing lots of string manipulations in queries but it can handle
collation so you can at least execute ORDER BY clauses which of course you
can't efficiently do in the client. For anything more complex you're probably
happier doing your string manipulations in the client just because SQL's
string primitives are so, well, primitive.
Perhaps given the current situation what we should have is a cvarchar and
cchar data types that are like varchar and char but guaranteed to always be
interpreted in the c locale with ascii encoding.
I think bytea gives you that, pretty much.
Hm, that's an interesting idea. We could define all the string functions for
bytea as well.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Tom Lane <tgl@sss.pgh.pa.us> writes:
Martijn van Oosterhout <kleptog@svana.org> writes:
On Thu, Sep 07, 2006 at 11:57:26AM +0100, Gregory Stark wrote:
Just brain storming here. But what happens if we make Datum 2*sizeof(pointer)
and stored the typmod and/or attlen in it?
The fundamental property of a Datum is that you can pass it by value to
a C function. This generally means it has to fit in a register. On the
whole, the CPU register size is the same as the pointer size, so
2*sizeof(pointer) is unlikely to fit...
Not to mention the problem that such a change would break every single
datatype-manipulation function in both the core backend and every
user-written datatype. I don't think we're going there.
Sure, I'm just brainstorming. Sometimes thinking about outlandish ideas can
result in quite reasonable ideas appearing down the line.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
On Thu, Sep 07, 2006 at 03:38:10PM +0100, Gregory Stark wrote:
Well I for one would be pretty unhappy if ICU were integrated. It seems like a
whole pile of code and complexity for no particular gain. The standard i18n
support with a few extensions (namely strcoll_l) seems to be adequate for us
and not introduce huge new dependencies and code burdens.
Let's be serious here. The patch is 18k (729 lines), hardly a "whole pile
of code and complexity". The patch has been in the FreeBSD ports
collection for a rather long time, so it's not like it's not tested.
http://www.freebsd.org/cgi/cvsweb.cgi/ports/databases/postgresql81-server/Makefile?rev=1.156&content-type=text/x-cvsweb-markup
and search for "ICU". The actual patches are here:
http://people.freebsd.org/~girgen/postgresql-icu/
The point is that strcoll_l doesn't exist on most platforms, so unless
someone is going to write another locale library, why not just use one
that's available?
That's why I mentioned pg_strxfrm. It doesn't solve all your problems if
you're doing lots of string manipulations in queries but it can handle
collation so you can at least execute ORDER BY clauses which of course you
can't efficiently do in the client. For anything more complex you're probably
happier doing your string manipulations in the client just because SQL's
string primitives are so, well, primitive.
I think you're making the assumption that client locale support is
going to be better than the server's.
Besides, pg_strxfrm doesn't help you if you want to do
accent-insensitive matching. Sometimes you don't just want to change
the order, you also want to change what is equal.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.
Gregory Stark wrote:
By my count postgres would use 154 bytes for this record. Whereas in
fact there's no need for it to take more than 87 bytes. Almost 100%
overhead for varlena headers and the padding they necessitate.
The thing is, 100% extra space is cheap, but the processing power for
making the need for that extra space go away is not.
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
Peter Eisentraut wrote:
Gregory Stark wrote:
By my count postgres would use 154 bytes for this record. Whereas in
fact there's no need for it to take more than 87 bytes. Almost 100%
overhead for varlena headers and the padding they necessitate.
The thing is, 100% extra space is cheap, but the processing power for
making the need for that extra space go away is not.
I think it would be good to see if we can extend the varlena data types
to support a shorter header for storing short byte values. Looking at
the header now we have:
#define VARATT_FLAG_EXTERNAL 0x80000000
#define VARATT_FLAG_COMPRESSED 0x40000000
#define VARATT_MASK_FLAGS 0xc0000000
#define VARATT_MASK_SIZE 0x3fffffff
#define VARATT_SIZEP(_PTR) (((varattrib *)(_PTR))->va_header)
so there is precedent for overloading that header, but currently all the
headers are four bytes. The big question is can a bit be allocated to
indicate a short byte header is being used? Can we do this with minimal
performance impact for non-short values?
One test would be to adjust the masks above to assign one bit to be the
"I am a short value" header, and I think that leaves you with 5 bits ==
32, which is probably enough for a test.
--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Martijn van Oosterhout <kleptog@svana.org> writes:
On Thu, Sep 07, 2006 at 03:38:10PM +0100, Gregory Stark wrote:
Well I for one would be pretty unhappy if ICU were integrated. It seems like a
whole pile of code and complexity for no particular gain. The standard i18n
support with a few extensions (namely strcoll_l) seems to be adequate for us
and not introduce huge new dependencies and code burdens.
Let's be serious here. The patch is 18k (729 lines), hardly a "whole pile
of code and complexity". The patch has been in the FreeBSD ports
collection for a rather long time, so it's not like it's not tested.
Uhm, an ICU source tree is over 40 *megabytes*. That's almost as much as the
rest of Postgres itself and that doesn't even include documentation. Even if
you exclude the data and regression tests you're still talking about depending
on the portability and correctness of over 10 megabytes of new code.
The point is that strcoll_l doesn't exist on most platforms, so unless
someone is going to write another locale library, why not just use one
that's available?
Neither is ICU available on most platforms. In any case we only need strcoll_l
as a performance optimization, the regular interface works, it's just slow.
I think you're making the assumption that client locale support is
going to be better than the server's.
Well we know it is because Postgres's support is basically nonexistent.
Besides, pg_strxfrm doesn't help you if you want to do
accent-insensitive matching. Sometimes you don't just want to change
the order, you also want to change what is equal.
Well equal is part of collation at least in the sense you mean. What it
doesn't help with is things like tolower or regexp matching. These are the
things that I would suggest you usually want to be doing on the client because
SQL's string manipulation facilities are so poor compared to most client
languages.
--
greg
Peter Eisentraut <peter_e@gmx.net> writes:
Gregory Stark wrote:
By my count postgres would use 154 bytes for this record. Whereas in
fact there's no need for it to take more than 87 bytes. Almost 100%
overhead for varlena headers and the padding they necessitate.
The thing is, 100% extra space is cheap, but the processing power for
making the need for that extra space go away is not.
That's simply untrue for most applications. Unless you can fit much of your
database into RAM that 100% extra space translates directly into 100% slower.
This is most obviously the case for data warehouses that are doing lots of
sequential scans of tables that don't fit in cache.
But it's largely true for OLTP applications too. The more compact the data the
more tuples fit on a page and the greater the chance you have the page you
need in cache.
--
greg
Bruce Momjian <bruce@momjian.us> writes:
I think it would be good to see if we can extend the varlena data types
to support a shorter header for storing short byte values. Looking at
the header now we have:
This isn't the first time we've been down that route. There were some
extensive discussions a while back. I think there were even patches.
I don't remember why it was eventually rejected. I suspect it simply got too
complex.
But I think this is a dead-end route. What you're looking at is the number "1"
repeated for *every* record in the table. And what you're proposing amounts to
noticing that the number "4" fits in a byte and doesn't need a whole word to
store it. Well sure, but you don't even need a byte if it's going to be the
same for every record in the table.
If someone popped up on the list asking about whether Postgres compressed
their data efficiently if they stored a column that was identical throughout
the whole table you would tell them to normalize their data.
--
greg