RFD: hexstring(n) data type

Started by Dawid Kuroczkoalmost 18 years ago4 messages
#1Dawid Kuroczko
qnex42@gmail.com

Following the discussion on making UUID data type to be much more liberal
( http://archives.postgresql.org/pgsql-hackers/2008-02/msg01214.php )
I have decided to try to approach it from more general perspective.

The current state of code is available at:
http://qnex.net/hexstring-0.1-2008-03-03.tgz

And now for more details:

The idea is to have a data type HEXSTRING(n) which can have an optional
typemod specifying the size of data (in bytes).

Internally the datatype is binary compatible with bytea, but I/O is done as
hex-encoded strings. The format is liberal, isspace() and ispunct() characters
are skipped while the digits are read.

I have played with two versions of hexstringin() function, one which uses
strtoul() function and the other which uses "home brew" code. The latter
appears to be faster, so I stayed with that. But I would appreciate
comments on this from more experienced.

So, what are the use cases?

CREATE DOMAIN liberal_uuid AS hexstring(16);
CREATE DOMAIN liberal_macaddr AS hexstring(6);

...it allows for creating other standard hex-types, as for example:
CREATE DOMAIN wwn AS hexstring(8); --
http://en.wikipedia.org/wiki/World_Wide_Name

Also it can be a convenient alternative to bytea format (I know, the
encode()/decode() pair),
especially when you have to format output data as some fancy hex-string.

The code is currently just a bunch of input/output/typemod functions
which appear
to work. I will add casts, operators, etc -- they most likely will be
nicked from bytea.

What I would like to also add is ubiquitous to_char(hex, format) function.
For an UUID-compatilbe format it would be called as:
SELECT to_char(hex, 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx') or
SELECT to_char(hex, '8x-4x-4x-4x-12x') where x is expanded as
[0-9a-f] digit and X is expanded as [0-9A-F].
I am not sure what to do about variable length hexstrings, I am
considering something
like to_char(hex, '8X-') which would produce something like
'00000000-11111111-22222222'
for a 12-byte hexstring (what to do about dangling '-' ?).

...but the original case against liberal UUID was that it would make
the I/O slower.
My simple test:

postgres=# CREATE TABLE uuids AS SELECT uuid_generate_v4()::text AS u
FROM generate_series(1,10000000);
CREATE

postgres=# CREATE TEMP TABLE text ON COMMIT DROP AS SELECT
u::hexstring(16) FROM uuids;
SELECT
Time: 13058.486 ms
postgres=# CREATE TEMP TABLE test ON COMMIT DROP AS SELECT u::uuid FROM uuids;
SELECT
Time: 13536.816 ms

...now hexstring is varlena type but does not use strtoul. Perhaps
uuid might be more liberal too.

What do you think about it?

Regards,
Dawid

#2Bruce Momjian
bruce@momjian.us
In reply to: Dawid Kuroczko (#1)
Re: RFD: hexstring(n) data type

I am confused how a hex type is any better than using the 'hex' decode()
format we already support:

test=> select decode('5476', 'hex');
decode
--------
Tv
(1 row)

---------------------------------------------------------------------------

Dawid Kuroczko wrote:

Following the discussion on making UUID data type to be much more liberal
( http://archives.postgresql.org/pgsql-hackers/2008-02/msg01214.php )
I have decided to try to approach it from more general perspective.

The current state of code is available at:
http://qnex.net/hexstring-0.1-2008-03-03.tgz

And now for more details:

The idea is to have a data type HEXSTRING(n) which can have an optional
typemod specifying the size of data (in bytes).

Internally the datatype is binary compatible with bytea, but I/O is done as
hex-encoded strings. The format is liberal, isspace() and ispunct() characters
are skipped while the digits are read.

I have played with two versions of hexstringin() function, one which uses
strtoul() function and the other which uses "home brew" code. The latter
appears to be faster, so I stayed with that. But I would appreciate
comments on this from more experienced.

So, what are the use cases?

CREATE DOMAIN liberal_uuid AS hexstring(16);
CREATE DOMAIN liberal_macaddr AS hexstring(6);

...it allows for creating other standard hex-types, as for example:
CREATE DOMAIN wwn AS hexstring(8); --
http://en.wikipedia.org/wiki/World_Wide_Name

Also it can be a convenient alternative to bytea format (I know, the
encode()/decode() pair),
especially when you have to format output data as some fancy hex-string.

The code is currently just a bunch of input/output/typemod functions
which appear
to work. I will add casts, operators, etc -- they most likely will be
nicked from bytea.

What I would like to also add is ubiquitous to_char(hex, format) function.
For an UUID-compatilbe format it would be called as:
SELECT to_char(hex, 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx') or
SELECT to_char(hex, '8x-4x-4x-4x-12x') where x is expanded as
[0-9a-f] digit and X is expanded as [0-9A-F].
I am not sure what to do about variable length hexstrings, I am
considering something
like to_char(hex, '8X-') which would produce something like
'00000000-11111111-22222222'
for a 12-byte hexstring (what to do about dangling '-' ?).

...but the original case against liberal UUID was that it would make
the I/O slower.
My simple test:

postgres=# CREATE TABLE uuids AS SELECT uuid_generate_v4()::text AS u
FROM generate_series(1,10000000);
CREATE

postgres=# CREATE TEMP TABLE text ON COMMIT DROP AS SELECT
u::hexstring(16) FROM uuids;
SELECT
Time: 13058.486 ms
postgres=# CREATE TEMP TABLE test ON COMMIT DROP AS SELECT u::uuid FROM uuids;
SELECT
Time: 13536.816 ms

...now hexstring is varlena type but does not use strtoul. Perhaps
uuid might be more liberal too.

What do you think about it?

Regards,
Dawid

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your Subscription:
http://mail.postgresql.org/mj/mj_wwwusr?domain=postgresql.org&extra=pgsql-hackers

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#3Gregory Stark
stark@enterprisedb.com
In reply to: Bruce Momjian (#2)
Re: RFD: hexstring(n) data type

I don't think we should define types based on how they print. We should define
types based on what they contain. bytea is a perfectly good datatype for
storing binary data, though perhaps we could use a bytea(n) for fixed size
binary data.

However bytea has an inconvenient string representation. Really I would prefer
if bytea were *always* read and written in hex format. We could have a guc to
enable this but it would make it hard to write code which would function
consistently as part of a larger application.

I think this is more akin to the MONEY data type. Really it would be better if
we could declare columns as NUMERIC but attach a "default format" to them for
use when string representation is desired. Similarly with bytea we could
choose a default string representation different from the default in/out
functions.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gregory Stark (#3)
Re: RFD: hexstring(n) data type

Gregory Stark <stark@enterprisedb.com> writes:

However bytea has an inconvenient string representation. Really I would prefer
if bytea were *always* read and written in hex format.

Yeah, the escaping rule is a mess: hard to work with and even less
storage-efficient than hex would be.

We could have a guc to
enable this but it would make it hard to write code which would function
consistently as part of a larger application.

Well, the datetime types have had DateStyle for years and people have
managed to deal with it, so maybe you're overestimating the problem.

regards, tom lane