Converting MySQL tinyint to PostgreSQL
I have a MySQL database that I'm converting to PostgreSQL which has 10 columns
with TINYINT type, i.e., a one-byte integer. Only one of them qualifies as a
true BOOLEAN. Two are entity identifiers (for limited range "classes" or
"categories") and three others are "type"/code values. The last four are month
numbers or day of month numbers (not implemented as date types because the year
is separate and is part of the primary key).
I'm wondering what would be the best conversion choice for these columns:
smallint, numeric(1), char(1), something else? AFAICT, the application does not
directly do any arithmetic on these columns, but the identifier and code types
are used as indices into PHP arrays. The month/day values are formatted into
dates for display (and the month also serves as an index into a month-name array).
BTW, is TINYINT part of any SQL Standard?
Joe
On Jul 12, 2005, at 1:16 AM, Joe wrote:
I have a MySQL database that I'm converting to PostgreSQL which has
10 columns with TINYINT type, i.e., a one-byte integer. Only one
of them qualifies as a true BOOLEAN. Two are entity identifiers
(for limited range "classes" or "categories") and three others are
"type"/code values. The last four are month numbers or day of
month numbers (not implemented as date types because the year is
separate and is part of the primary key).
I'm wondering what would be the best conversion choice for these
columns: smallint, numeric(1), char(1), something else? AFAICT,
the application does not directly do any arithmetic on these
columns, but the identifier and code types are used as indices into
PHP arrays. The month/day values are formatted into dates for
display (and the month also serves as an index into a month-name
array).
I would use smallint. While PHP might work fine with something else,
it seems like the closest match.
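Just as a rough sketch of how those columns might end up (table and column names here are invented, not from your schema):
CREATE TABLE sales_summary (
    sale_year    smallint NOT NULL,  -- the separate year that is part of the primary key
    sale_month   smallint NOT NULL,  -- was TINYINT: month number 1-12
    category_id  smallint NOT NULL,  -- was TINYINT: limited-range identifier
    status_code  smallint NOT NULL,  -- was TINYINT: "type"/code value
    is_active    boolean  NOT NULL,  -- the one true BOOLEAN
    PRIMARY KEY (sale_year, sale_month)
);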
BTW, is TINYINT part of any SQL Standard?
From the pg docs:
SQL only specifies the integer types integer (or int) and smallint.
The type bigint, and the type names int2, int4, and int8 are
extensions, which are shared with various other SQL database systems.
John DeSoi, Ph.D.
http://pgedit.com/
Power Tools for PostgreSQL
Joe <svn@freedomcircle.net> writes:
I have a MySQL database that I'm converting to PostgreSQL which has 10 columns
with TINYINT type, i.e., a one-byte integer.
I'm wondering what would be the best conversion choice for these columns:
smallint, numeric(1), char(1), something else?
smallint, for sure; the others have more overhead.
BTW, is TINYINT part of any SQL Standard?
No. If it were, we'd feel more urgency about implementing it ;-)
regards, tom lane
On 7/12/05, Joe <svn@freedomcircle.net> wrote:
I have a MySQL database that I'm converting to PostgreSQL which has 10 columns
with TINYINT type, i.e., a one-byte integer. Only one of them qualifies as a
true BOOLEAN. Two are entity identifiers (for limited range "classes" or
"categories") and three others are "type"/code values. The last four are month
numbers or day of month numbers (not implemented as date types because the year
is separate and is part of the primary key).
I'm wondering what would be the best conversion choice for these columns:
smallint, numeric(1), char(1), something else? AFAICT, the application does not
smallint takes two bytes. Numeric(1) will take around 10 bytes and char(1) will
take 5 bytes (4 bytes for length of data). The closest match is smallint which
is reasonably small and will do the trick. Remember that PostgreSQL doesn't
have unsigned types.
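If you want to keep the original unsigned TINYINT range, a CHECK constraint can enforce it on the smallint column (a sketch; the table and column names are made up):
ALTER TABLE mytable
    ADD CONSTRAINT category_id_tinyint_range
    CHECK (category_id BETWEEN 0 AND 255);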
Using two bytes instead of one may seem like "a terrible waste of disk space" at
first glance, but keep in mind that PostgreSQL's power lies in an intelligent
planner. You can safely normalize data and create views that "emulate" a single
table, which both simplifies development and can be more efficient in terms of
storage and raw speed.
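For example, a code column can become a small lookup table that a view joins back in (made-up names, just a sketch):
CREATE TABLE category (
    category_id  smallint PRIMARY KEY,
    name         text NOT NULL
);

CREATE VIEW mytable_full AS
    SELECT t.*, c.name AS category_name
    FROM mytable t
    JOIN category c USING (category_id);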
directly do any arithmetic on these columns, but the identifier and code types
are used as indices into PHP arrays. The month/day values are formatted into
dates for display (and the month also serves as an index into a month-name array).
You should consider using a date / time / timestamp column for dates and using
functions like extract(day from mydate). Remember you can use functional
indexes, so if you need data to be indexed by day, you can:
CREATE INDEX dayindex ON sometable ((extract(day from datecolumn)));
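A query like the following (using the same example table and column names) should then be able to use that index:
SELECT * FROM sometable WHERE extract(day from datecolumn) = 15;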
Regards,
Dawid
On Tue, Jul 12, 2005 at 01:16:07AM -0400, Joe wrote:
I have a MySQL database that I'm converting to PostgreSQL which has 10
columns with TINYINT type, i.e., a one-byte integer. Only one of them
qualifies as a true BOOLEAN. Two are entity identifiers (for limited range
"classes" or "categories") and three others are "type"/code values. The
last four are month numbers or day of month numbers (not implemented as
date types because the year is separate and is part of the primary key).
Just to broaden your choices, there is also a "char" datatype (quotes
included! If you omit them, it's a different thing, more overhead),
which you can use to store a single byte. Could be used as a poor man's
ENUM. Add appropriate CHECK constraints and it could be good enough for
some purposes. Not useful for months, but maybe for your categories or
classes.
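Something like this, say (made-up names, only to illustrate):
CREATE TABLE entity (
    entity_id  integer PRIMARY KEY,
    class      "char" NOT NULL CHECK (class IN ('a', 'b', 'c'))
);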
There are some usages of this in the system catalogs (e.g.
pg_class.relkind)
--
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
www.google.com: a command-line interface for the web.
Dawid Kuroczko wrote:
smallint takes two bytes. Numeric(1) will take around 10 bytes and char(1) will
take 5 bytes (4 bytes for length of data).
I never would've imagined *that* amount of overhead for CHAR(1)! I would've
imagined that it would take up one byte (or two with a NULL indicator). After
all, we're not talking about VARCHAR(1) [which is sort of useless]. Don't the
catalogs know the declared length and if so, why the length overhead? I'm also
surprised --albeit less-- about the NUMERIC(1) overhead. Is any of this
discussed in the Internals chapters? I didn't see this discussed elsewhere.
As a perhaps-related aside, I've noticed several examples which declare string
types as 'text', rather than VARCHAR or CHAR, the former being non-SQL standard.
Is there some performance benefit to using 'text' (other than it being shorter
to type :-) or is it just the "usual Postgres way"?
Joe
Joe <svn@freedomcircle.net> writes:
I never would've imagined *that* amount of overhead for CHAR(1)! I
would've imagined that it would take up one byte (or two with a NULL
indicator). After all, we're not talking about VARCHAR(1) [which is
sort of useless]. Don't the catalogs know the declared length and if
so, why the length overhead?
Because the length specification is in *characters*, which is not by any
means the same as *bytes*.
We could possibly put enough intelligence into the low-level tuple
manipulation routines to count characters in whatever encoding we happen
to be using, but it's a lot faster and more robust to insist on a count
word for every variable-width field.
regards, tom lane
Tom Lane wrote:
Because the length specification is in *characters*, which is not by any
means the same as *bytes*.
We could possibly put enough intelligence into the low-level tuple
manipulation routines to count characters in whatever encoding we happen
to be using, but it's a lot faster and more robust to insist on a count
word for every variable-width field.
I guess what you're saying is that PostgreSQL stores characters in
varying-length encodings. If it stored character data in Unicode (UCS-16) it
would always take up two-bytes per character. Have you considered supporting
NCHAR/NVARCHAR, aka NATIONAL character data? Wouldn't UCS-16 be needed to
support multi-locale clusters (as someone was inquiring about recently)?
Joe
On Tue, Jul 12, 2005 at 05:37:32PM -0400, Joe wrote:
Tom Lane wrote:
Because the length specification is in *characters*, which is not by any
means the same as *bytes*.
We could possibly put enough intelligence into the low-level tuple
manipulation routines to count characters in whatever encoding we happen
to be using, but it's a lot faster and more robust to insist on a count
word for every variable-width field.
I guess what you're saying is that PostgreSQL stores characters in
varying-length encodings.
It _may_ store characters in variable length encodings. It can use
fixed-length encodings too, such as latin1 or plain ASCII (actually,
unchecked 8 bits, which means about anything) -- you define that at
initdb time or database creation time, I forget. It would be painful
for the code to distinguish fixed-length from variable-length at
runtime, an optimization that would allow getting rid of the otherwise
required length word. So far, nobody has cared enough about it to do
the job.
If it stored character data in Unicode (UCS-16) it would always take
up two-bytes per character.
Really? We don't support UCS-16, for good reasons (we'd have to rewrite
several parts of the code in order to support '0' bytes embedded in
strings ... we use regular C strings extensively).
However we do support Unicode as UTF-8, but it's been said a couple of
times that characters can be wider than 2 or 3 bytes in some cases. So,
I don't see how UCS-16 could always use only 2 bytes.
Have you considered supporting NCHAR/NVARCHAR, aka NATIONAL character
data?
There have been noises, but so far nobody has stepped up to the plate to do
the work.
--
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
"Those who use electric razors are infidels destined to burn in hell while
we drink from rivers of beer, download free vids and mingle with naked
well shaved babes." (http://slashdot.org/comments.pl?sid=44793&cid=4647152)
Alvaro Herrera wrote:
On Tue, Jul 12, 2005 at 05:37:32PM -0400, Joe wrote:
If it stored character data in Unicode (UCS-16) it would always take
up two-bytes per character.
Really? We don't support UCS-16, for good reasons (we'd have to rewrite
several parts of the code in order to support '0' bytes embedded in
strings ... we use regular C strings extensively).
I'm sorry. I meant to say UCS-2, which by definition uses 16 bits = 2 octets =
2 bytes. To support an even larger code space, you could use UCS-4 which always
uses 4 bytes.
Joe
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
I don't see how UCS-16 could always use only 2 bytes.
Simple: it fails to handle Unicode code points above 0x10000. (We only
recently fixed a similar limitation in our UTF8 support, by the by, but
it *is* fixed and I doubt we want to backpedal.)
The problem with embedded null bytes is quite serious though, and I
doubt that we'll ever see the backend natively handling encodings that
require that. It's just not worth the effort. Certainly the idea of
not having to store a length word for CHAR(1) fields is not going to
inspire anyone to invest the effort involved ;-)
Keep in mind also that any such change would involve putting slower and
more complicated logic into some routines that are hotspots already;
so even if you did all the work involved, you might find the patch
rejected on the grounds that it's a net performance loss. Most of the
developers have plenty of tasks to do with a larger and more certain
reward than this.
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
Certainly the idea of not having to store a length word for CHAR(1) fields
is not going to inspire anyone to invest the effort involved ;-)
That's a pretty big motivation though. Storage space efficiency is a huge
factor in raw sequential scan speed.
Personally I would settle for a fuller set of small fixed size datatypes. The
"char" datatype is pretty much exactly what's needed except that it provides
such a quirky interface.
I'm not sure exactly how to clean it up but if it were more user-friendly (and
had less of an "undocumented internal feature" character to it) I have a
feeling a lot of people would be using it for things like flags, state codes,
etc. And it would reduce storage space significantly over having lots of text
or even integer fields.
--
greg
Greg Stark <gsstark@mit.edu> writes:
Personally I would settle for a fuller set of small fixed size datatypes. The
"char" datatype is pretty much exactly what's needed except that it provides
such a quirky interface.
I'm not actually against inventing an int1/tinyint type. I used to be
worried that it would screw up the numeric datatype promotion hierarchy
even more than it already was screwed up :-( ... but I think we have
dealt with most of those issues now. It'd be worth trying anyway ---
much more so than trying to optimize char(1), IMHO.
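(As an aside, a domain over smallint can at least approximate a tinyint today -- a sketch, assuming the unsigned 0-255 range is the one wanted:)
CREATE DOMAIN tinyint AS smallint CHECK (VALUE BETWEEN 0 AND 255);
-- storage is still 2 bytes, but the range is enforced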
regards, tom lane
On 7/13/05, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Greg Stark <gsstark@mit.edu> writes:
Personally I would settle for a fuller set of small fixed size datatypes. The
"char" datatype is pretty much exactly what's needed except that it provides
such a quirky interface.
I'm not actually against inventing an int1/tinyint type. I used to be
worried that it would screw up the numeric datatype promotion hierarchy
even more than it already was screwed up :-( ... but I think we have
dealt with most of those issues now. It'd be worth trying anyway ---
much more so than trying to optimize char(1), IMHO.
The problem with int1 type is that the smaller the value, the more
push for unsigned types... I think it may be worth doing, but is not
exactly the problem -- smallint is fine for most situations. The
only place where I was unhappy with signed integers was... int4
(I wanted to store full 32-bit unsigned values, so I had to use bigint;
with a couple of million rows it's a bit of a waste ;)).
As for the char/varchar type -- I was wondering. Worst case
scenario for UTF-8 (correct me on this) is when 1 character
takes 4 bytes. And biggest problem with char/varchar is that
length indicator takes 4 bytes... How much overhead would
it be to make a length variable, for example:
(var)char(1)-char(63) -- 1 byte length + string
char(64)-char(16383) -- 2 byte length + string
char(16384)-text -- 4 byte length + string, like now
This would reduce length of char(5) string from 9 bytes to
6 bytes, char(2) from 6 bytes to 3 bytes (for multibyte chars
it would be a win also).
I don't know the internals too well (read: at all), but I guess there
would be a problem of choosing which length of length to use --
would it be possible to make some sort of on-the-fly mapping
when creating tables -- varchar(224) is text_2bytelength,
text is text_4bytelength, char(1) is text_1bytelength...
Regards,
Dawid
On 7/12/05, Joe <svn@freedomcircle.net> wrote:
Dawid Kuroczko wrote:
smallint takes two bytes. Numeric(1) will take around 10 bytes and char(1) will
take 5 bytes (4 bytes for length of data).
I never would've imagined *that* amount of overhead for CHAR(1)! I would've
imagined that it would take up one byte (or two with a NULL indicator). After
all, we're not talking about VARCHAR(1) [which is sort of useless]. Don't the
catalogs know the declared length and if so, why the length overhead? I'm also
surprised --albeit less-- about the NUMERIC(1) overhead. Is any of this
discussed in the Internals chapters? I didn't see this discussed elsewhere.
It is all described in
http://www.postgresql.org/docs/8.0/interactive/datatype.html
along with the space requirements given above. No one hides it; it's all there
in black and white. :-)
As a perhaps-related aside, I've noticed several examples which declare string
types as 'text', rather than VARCHAR or CHAR, the former being non-SQL standard.
Is there some performance benefit to using 'text' (other than it being shorter
to type :-) or is it just the "usual Postgres way"?
Some time ago people used text as a way of making an "easily changeable varchar",
say:
col text CHECK (length(VAL) < 100)
...when a person saw that 100 was too small, it was just a matter of changing the
CHECK constraint.
...but that was a long time ago, and since then PostgreSQL has been able to change
the type of a column with no problems, and efficiently.
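For example (made-up table/column names):
ALTER TABLE mytable ALTER COLUMN description TYPE varchar(200);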
Regards,
Dawid
On Wed, Jul 13, 2005 at 10:48:56AM +0200, Dawid Kuroczko wrote:
As for the char/varchar type -- I was wondering. Worst case
scenario for UTF-8 (correct me on this) is when 1 character
takes 4 bytes. And biggest problem with char/varchar is that
length indicator takes 4 bytes... How much overhead would
it be to make a length variable, for example:
(var)char(1)-char(63) -- 1 byte length + string
char(64)-char(16383) -- 2 byte length + string
char(16384)-text -- 4 byte length + string, like now
Well, you get another issue, alignment. If you squeeze your string
down, the next field, if it is an int or string, will get padded to a
multiple of 4 negating most of the gains. Like in C structures, there
is padding to optimise access.
This would reduce length of char(5) string from 9 bytes to
6 bytes, char(2) from 6 bytes to 3 bytes (for multibyte chars
it would be a win also).
The only types that won't require padding if they are the next field are
bool, "char" and cstring. So char(1-4) will actually go from 8 to 4
bytes in most cases. char(5-8) will go from 12 to 8 bytes.
I don't know the internals too well (read: at all), but I guess there
would be a problem of choosing which length of length to use --
would it be possible to make some sort of on-the-fly mapping
when creating tables -- varchar(224) is text_2bytelength,
text is text_4bytelength, char(1) is text_1bytelength...
At the moment there are two basic types: variable length and fixed
length, as defined by the type id. Fixed-length values are stored as is.
Variable length is a 4 byte length plus the string. The two highest
bits are flags. AFAIK it's the typlen value that decides the decoding,
the type itself is irrelevant.
I suppose somebody could create a new encoding but I don't know how
hard that would be...
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.
Martijn van Oosterhout wrote:
Well, you get another issue, alignment. If you squeeze your string
down, the next field, if it is an int or string, will get padded to a
multiple of 4 negating most of the gains. Like in C structures, there
is padding to optimise access.
Anecdotally I hear at least as many people say that their database
is more I/O bound than CPU bound; and it seems that adding bytes
for alignment is a way of reducing CPU for more disk I/O.
I guess unaligned access is so expensive that it makes up for the extra I/O?
Ron Mayer wrote:
Martijn van Oosterhout wrote:
Well, you get another issue, alignment. If you squeeze your string
down, the next field, if it is an int or string, will get padded to a
multiple of 4 negating most of the gains. Like in C structures, there
is padding to optimise access.
Anecdotally I hear at least as many people say that their database
is more I/O bound than CPU bound; and it seems that adding bytes
for alignment is a way of reducing CPU for more disk I/O.
I guess unaligned access is so expensive that it makes up for the extra I/O?
This is a good point. We have always stored data on disk that exactly
matches its layout in memory. We could change that, but no one has
shown it would be a win.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Wed, Jul 13, 2005 at 07:52:04PM -0400, Bruce Momjian wrote:
Ron Mayer wrote:
Martijn van Oosterhout wrote:
Well, you get another issue, alignment. If you squeeze your string
down, the next field, if it is an int or string, will get padded to a
multiple of 4 negating most of the gains. Like in C structures, there
is padding to optimise access.
Anecdotally I hear at least as many people say that their database
is more I/O bound than CPU bound; and it seems that adding bytes
for alignment is a way of reducing CPU for more disk I/O.
I guess unaligned access is so expensive that it makes up for the extra I/O?
This is a good point. We have always stored data on disk that exactly
matches its layout in memory. We could change that, but no one has
shown it would be a win.
Out of curiosity, what would be involved in hacking the backend enough
to be able to test this theory out? I'm guessing you'd want to convert
between on-disk and in-memory formats as you read pages in, so either
on-disk pages would become variable size (and smaller than memory pages)
or in-memory pages would become variable size (and larger than on-disk
pages).
Or maybe as an alternative, would it be possible to determine how much
space in a given relation was being wasted due to padding? That could be
used to figure out how much IO could be saved on different transactions.
While there would be a slight CPU penalty every time you read or write a
page, I find it hard to believe it could come close to equaling IO cost.
On a side note, I think it might be useful to have a separate TODO
category for ideas that need to be tested to see if they're worth
implementing for real. This is a case where it's probably substantially
easier to estimate (or maybe even measure) how much there is to gain
from this than to do the actual work and then see if it helps. It's also
likely that a less experienced hacker could test the theory out. Some
likely items for this list:
Reduce WAL traffic so only modified values are written rather than
entire rows?
Find a way to reduce rotational delay when repeatedly writing last WAL
page
Precompile SQL functions to avoid overhead
Do async I/O for faster random read-ahead of data
Not on todo:
Estimate gains from not using the in-memory format of data for on-disk
storage
Estimate gains from reducing the amount of space used by visibility
information in each tuple
BTW, what ever happened to the idea of having a list of projects for
beginners? (Or maybe it'd be better to assign a numeric difficulty to
each TODO item?)
--
Jim C. Nasby, Database Consultant decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
On Thu, Jul 14, 2005 at 11:30:36AM -0500, Jim C. Nasby wrote:
On Wed, Jul 13, 2005 at 07:52:04PM -0400, Bruce Momjian wrote:
This is a good point. We have always stored data on disk that exactly
matches its layout in memory. We could change that, but no one has
shown it would be a win.
Out of curiosity, what would be involved in hacking the backend enough
to be able to test this theory out? I'm guessing you'd want to convert
between on-disk and in-memory formats as you read pages in, so either
on-disk pages would become variable size (and smaller than memory pages)
or in-memory pages would become variable size (and larger than on-disk
pages).
It's a pain because on some architectures you can't do unaligned
accesses. I imagine you'd have to have the on-disk pages in memory and
copy them to a temporary space when you actually want to use the data,
converting on the fly.
IMHO a much much better approach would be the two phase:
- Decouple order of columns on disk from logical column order
Then people can rearrange columns, people do ask that occasionally.
- Change CREATE TABLE to rearrange columns on disk (not the logical
order) to minimize padding.
This gives you real benefits without having to overhaul the code...
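In the meantime you can do it by hand: listing the columns with the strictest alignment requirements first tends to waste less space on padding (a sketch with made-up names):
-- 8-byte-aligned types first, then 4-byte, then 2-byte/1-byte, then variable-width
CREATE TABLE sample (
    big_id     bigint,
    counter    integer,
    small_code smallint,
    is_active  boolean,
    label      text
);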
On a side note, I think it might be useful to have a separate TODO
category for ideas that need to be tested to see if they're worth
implementing for real. This is a case where it's probably substantially
easier to estimate (or maybe even measure) how much there is to gain
from this than to do the actual work and then see if it helps. It's also
likely that a less experienced hacker could test the theory out. Some
likely items for this list:
What usually happens is someone tries it and it either works or it
doesn't...
Can't comment on the other ideas.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.