UUID data format 4x-4x-4x-4x-4x-4x-4x-4x

Started by Dawid Kuroczko, almost 18 years ago, 18 messages
#1Dawid Kuroczko
qnex42@gmail.com

Hello.

I am currently playing with the UUID data type and trying to use it to store
data provided by a third-party (Hewlett-Packard) application. The problem is
that they format UUIDs as 0000-0000-0000-0000-0000-0000-0000-0000, so I have
to use replace(text,'-','')::uuid for this kind of data.
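As a rough illustration of that workaround outside the database, here is a minimal Python sketch. It assumes the input is exactly eight hyphen-separated groups of four hex digits; the helper name `hp_to_uuid` is made up for this example and is not part of PostgreSQL or of the HP application.

```python
import re
import uuid

# Eight groups of four hex digits, as in the HP output described above.
HP_UUID_RE = re.compile(r"[0-9A-Fa-f]{4}(?:-[0-9A-Fa-f]{4}){7}")


def hp_to_uuid(text: str) -> uuid.UUID:
    """Hypothetical helper: convert a 4x-4x-...-4x string to a UUID."""
    if not HP_UUID_RE.fullmatch(text):
        raise ValueError(f"not an 8x4 hex UUID: {text!r}")
    # Same idea as replace(text,'-','')::uuid: drop the hyphens, then parse.
    return uuid.UUID(hex=text.replace("-", ""))


if __name__ == "__main__":
    # str() of a uuid.UUID is the standard 8-4-4-4-12 form.
    print(hp_to_uuid("0123-4567-89ab-cdef-0123-4567-89ab-cdef"))
```

The result prints in the standard 8-4-4-4-12 layout, which is what the cast to uuid would store.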

Now, the case is quite simple, and there may well be other applications
that format UUIDs too liberally.

I am working on a patch to support this format (yes, it is a simple
modification).

In the meantime, I would like to ask what you think about it.

Cons: Such format is not standard.

Pros: This will help UUID data type adoption. [1] While good applications
format their data well, there are others which don't follow standards. Also,
I think it is easier for a human being to enter a UUID as 8 times 4 digits.

Your thoughts? Should I submit a patch?

Regards,
Dawid

[1]: My first thought when I received the error message was "hey! this is
not a UUID, it is too long/too short!"; only later did I check that they
just don't format it too well.

#2Josh Berkus
josh@agliodbs.com
In reply to: Dawid Kuroczko (#1)
Re: UUID data format 4x-4x-4x-4x-4x-4x-4x-4x

Dawid,

I am working on a patch to support this format (yes, it is a simple
modification).

I'd suggest writing a formatting function for UUIDs instead. Not sure what
it should be called, though. "to_char" is pretty overloaded right now.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

#3Gevik Babakhani
pgdev@xs4all.nl
In reply to: Josh Berkus (#2)
Re: UUID data format 4x-4x-4x-4x-4x-4x-4x-4x

I am working on a patch to support this format (yes, it is a simple
modification).

There was a proposal and a discussion about what this datatype should look
like before I started developing it. We decided to go with the format
proposed in the RFC. Unless there is a strong case, I doubt any non-standard
formatting will be accepted into core. IIRC we were also opposed to
supporting Java-like formatted UUIDs back then. This is no different.

Regards,
Gevik.

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#2)
Re: UUID data format 4x-4x-4x-4x-4x-4x-4x-4x

Josh Berkus <josh@agliodbs.com> writes:

I am working on a patch to support this format (yes, it is a simple
modification).

I'd suggest writing a formatting function for UUIDs instead.

That seems like overkill, if not outright encouragement of people to
come up with yet other nonstandard formats for UUIDs.

I think the question we have to answer is whether we want to be
complicit in the spreading of a nonstandard UUID format. Even if
we answer "yes" for this HP case, it doesn't follow that we should
create a mechanism for anybody to do anything with 'em. That way
lies the madness people already have to cope with for datetime
data :-(

regards, tom lane

#5Jochem van Dieten
jochemd@gmail.com
In reply to: Tom Lane (#4)
Re: UUID data format 4x-4x-4x-4x-4x-4x-4x-4x

On Thu, Feb 28, 2008 at 1:19 AM, Tom Lane wrote:

I think the question we have to answer is whether we want to be
complicit in the spreading of a nonstandard UUID format.

I don't.

I have patched the UUID input and output functions to be compatible
with Adobe ColdFusion (http://adobe.com/products/coldfusion/ uses
8x-4x-4x-16x), and while I have released them I have deliberately made
the changes incompatible with other formats and will not submit them
to PostgreSQL because I want Adobe to fix ColdFusion to use the
standard format.

Jochem

#6Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#4)
Re: UUID data format 4x-4x-4x-4x-4x-4x-4x-4x

Tom,

I think the question we have to answer is whether we want to be
complicit in the spreading of a nonstandard UUID format. Even if
we answer "yes" for this HP case, it doesn't follow that we should
create a mechanism for anybody to do anything with 'em. That way
lies the madness people already have to cope with for datetime
data :-(

Well, I guess the question is: if we don't offer some builtin way to render
non-standard formats built into company products, will those companies fix
their format or just not use PostgreSQL?

--
Josh Berkus
PostgreSQL @ Sun
San Francisco

#7Andrew Sullivan
ajs@crankycanuck.ca
In reply to: Josh Berkus (#6)
Re: UUID data format 4x-4x-4x-4x-4x-4x-4x-4x

On Thu, Feb 28, 2008 at 08:58:01AM -0800, Josh Berkus wrote:

Well, I guess the question is: if we don't offer some builtin way to render
non-standard formats built into company products, will those companies fix
their format or just not use PostgreSQL?

Well, there is an advantage that Postgres has that some others don't: you
can extend Postgres pretty easily. That suggests to me a reason to be
conservative in what we "build in". This is consistent with the principle,
"Be conservative in what you send, and liberal in what you accept."

A

#8Zeugswetter Andreas ADI SD
Andreas.Zeugswetter@s-itsolutions.at
In reply to: Andrew Sullivan (#7)
Re: UUID data format 4x-4x-4x-4x-4x-4x-4x-4x

Well, I guess the question is: if we don't offer some builtin way to render
non-standard formats built into company products, will those companies fix
their format or just not use PostgreSQL?

Well, there is an advantage that Postgres has that some others don't: you
can extend Postgres pretty easily. That suggests to me a reason to be
conservative in what we "build in". This is consistent with the principle,
"Be conservative in what you send, and liberal in what you accept."

Well, then the uuid input function should most likely disregard all -,
and accept the 4x-, 8x- formats and the like on input.
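A minimal Python sketch of that suggestion: drop every hyphen, then require exactly 32 hex digits. The function name is made up for this example, and this is not the actual uuid_in code.

```python
import string

HEX_DIGITS = set(string.hexdigits)  # 0-9, a-f, A-F


def parse_ignoring_hyphens(text: str) -> bytes:
    """Hypothetical liberal parser: hyphens are ignored wherever they occur."""
    digits = [ch for ch in text if ch != "-"]
    if len(digits) != 32 or not all(ch in HEX_DIGITS for ch in digits):
        raise ValueError(f"invalid UUID: {text!r}")
    # 32 validated hex digits always decode to 16 bytes.
    return bytes.fromhex("".join(digits))


if __name__ == "__main__":
    standard = "01234567-89ab-cdef-0123-456789abcdef"   # 8x-4x-4x-4x-12x
    hp = "0123-4567-89ab-cdef-0123-4567-89ab-cdef"       # 4x eight times
    coldfusion = "01234567-89ab-cdef-0123456789abcdef"   # 8x-4x-4x-16x
    print(parse_ignoring_hyphens(standard)
          == parse_ignoring_hyphens(hp)
          == parse_ignoring_hyphens(coldfusion))
```

Under this rule all three layouts map to the same 16 bytes, but so does a string with hyphens in arbitrary positions.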

Andreas

#9Kenneth Marshall
ktm@rice.edu
In reply to: Zeugswetter Andreas ADI SD (#8)
Re: UUID data format 4x-4x-4x-4x-4x-4x-4x-4x

On Thu, Feb 28, 2008 at 08:06:46PM +0100, Zeugswetter Andreas ADI SD wrote:

Well, I guess the question is: if we don't offer some builtin way to render
non-standard formats built into company products, will those companies fix
their format or just not use PostgreSQL?

Well, there is an advantage that Postgres has that some others don't: you
can extend Postgres pretty easily. That suggests to me a reason to be
conservative in what we "build in". This is consistent with the principle,
"Be conservative in what you send, and liberal in what you accept."

Well, then the uuid input function should most likely disregard all -,
and accept the 4x-, 8x- formats and the like on input.

Andreas

We need to support the standard definition. People not using the standard
need to know that and explicitly acknowledge that by implementing the
conversion process themselves. Accepting random input puts a performance
hit on everybody following the standard. It is the non-standard users who
should pay that cost.

Cheers,
Ken

#10James Mansion
james@mansionfamily.plus.com
In reply to: Kenneth Marshall (#9)
Re: UUID data format 4x-4x-4x-4x-4x-4x-4x-4x

Kenneth Marshall wrote:

conversion process themselves. Accepting random input puts a performance
hit on everybody following the standard.

Why is that necessarily the case?

Why not have a liberal parser and a configurable switch that determines
whether non-standard forms are liberally accepted, accepted with a logged
warning, or rejected?
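One way such a switch could look, as a hedged Python sketch. The `NonStandardPolicy` enum and `parse_uuid` function are invented for this example, and Python's `warnings` module stands in for a logged warning; no such setting exists in PostgreSQL as far as this thread shows.

```python
import re
import uuid
import warnings
from enum import Enum


class NonStandardPolicy(Enum):
    """Hypothetical switch: what to do with non-standard UUID layouts."""
    ACCEPT = "accept"
    WARN = "warn"
    REJECT = "reject"


HEX = r"[0-9a-fA-F]"
CANONICAL_RE = re.compile(rf"{HEX}{{8}}-{HEX}{{4}}-{HEX}{{4}}-{HEX}{{4}}-{HEX}{{12}}")
HEX32_RE = re.compile(rf"{HEX}{{32}}")


def parse_uuid(text: str, policy: NonStandardPolicy = NonStandardPolicy.REJECT) -> uuid.UUID:
    # The standard 8-4-4-4-12 layout is always accepted.
    if CANONICAL_RE.fullmatch(text):
        return uuid.UUID(text)
    if policy is NonStandardPolicy.REJECT:
        raise ValueError(f"non-standard UUID format: {text!r}")
    # Liberal path: ignore hyphens, then require exactly 32 hex digits.
    digits = text.replace("-", "")
    if not HEX32_RE.fullmatch(digits):
        raise ValueError(f"invalid UUID: {text!r}")
    if policy is NonStandardPolicy.WARN:
        warnings.warn(f"non-standard UUID format accepted: {text!r}")
    return uuid.UUID(hex=digits)
```

Standard input takes the first branch in every mode, so in this sketch the policy only changes what happens to input that a strict parser would reject anyway.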

James

#11Mark Mielke
mark@mark.mielke.cc
In reply to: James Mansion (#10)
Re: UUID data format 4x-4x-4x-4x-4x-4x-4x-4x

James Mansion wrote:

Kenneth Marshall wrote:

conversion process themselves. Accepting random input puts a performance
hit on everybody following the standard.

Why is that necessarily the case?

Why not have a liberal parser and a configurable switch that determines
whether non-standard forms are liberally accepted, accepted with a logged
warning, or rejected?

I recall there being a measurable performance difference between the
most liberal parser, and the most optimized parser, back when I wrote
one for PostgreSQL. I don't know how good the one in use for PostgreSQL
8.3 is. As to whether the cost is noticeable to people or not - that
depends on what they are doing. The problem is that a UUID is pretty
big, and parsing it liberally means a loop.

My personal opinion is that this is entirely a philosophical issue, and
that both sides have merits. There is no reason for PostgreSQL to
support all formats, no matter how non-standard, for every single type.
So, why would UUID be special? Because it's easy to do is not
necessarily a good reason. But then, it's not a bad reason either.

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>

#12James Mansion
james@mansionfamily.plus.com
In reply to: Mark Mielke (#11)
Re: UUID data format 4x-4x-4x-4x-4x-4x-4x-4x

Mark Mielke wrote:

I recall there being a measurable performance difference between the
most liberal parser, and the most optimized parser, back when I wrote
one for PostgreSQL. I don't know how good the one in use for
PostgreSQL 8.3 is. As to whether the cost is noticeable to people or
not - that depends on what they are doing. The problem is that a UUID
is pretty big, and parsing it liberally means a loop.

It just seems odd - I would have thought one would use re2c or ragel to
generate something and the performance would essentially be O(n) in the
input length in characters - using either a collection of allowed forms
or an engine that normalises case and discards the '-' characters
between any hex pairs. So yes, these would have a control loop. Is that
so bad?

Either way, it's hard to imagine how parsing a string of this length could
create a measurable performance issue compared to what will happen with
the value after parsing.

James

#13Sam Mason
sam@samason.me.uk
In reply to: Mark Mielke (#11)
Re: UUID data format 4x-4x-4x-4x-4x-4x-4x-4x

On Thu, Feb 28, 2008 at 06:45:18PM -0500, Mark Mielke wrote:

My personal opinion is that this is entirely a philosophical issue, and
that both sides have merits.

I think it depends on what you're optimising for: initial development
time, maintenance time or run time.

There is no reason for PostgreSQL to
support all formats, no matter how non-standard, for every single type.
So, why would UUID be special? Because it's easy to do is not
necessarily a good reason. But then, it's not a bad reason either.

I never really buy the "performance" argument. I much prefer the
correctness argument: if the code is doing something strange, I'd prefer
to know about it as soon as possible. This generally means that I'm
optimising for maintenance.

It's a similar argument to why lots of automatic casts were removed from
8.3: it generally doesn't hurt, but the few times it does it's going to
be bad, and if you're doing something strange to start with it's better
to be explicit about it.

Sam

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Sullivan (#7)
Re: UUID data format 4x-4x-4x-4x-4x-4x-4x-4x

Andrew Sullivan <ajs@crankycanuck.ca> writes:

"Be conservative in what you send, and liberal in what you accept."

Yeah, I was about to quote that same maxim myself. I don't have a big
problem with allowing uuid_in to accept known format variants. (I'm
not sure about allowing a hyphen *anywhere*, because that could lead to
accepting things that weren't meant to be a UUID at all, but this HP
format seems regular enough that that's not a serious objection to it.)
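To make the distinction concrete, here is a hedged Python sketch of one possible "regular" rule: a hyphen is allowed only directly after a complete group of four hex digits. The rule and the name `parse_grouped` are assumptions chosen for illustration; nothing in this message fixes the exact rule.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def parse_grouped(text: str) -> bytes:
    """Hypothetical parser: hyphens only on four-digit group boundaries."""
    digits = []
    prev_was_hyphen = False
    for ch in text:
        if ch == "-":
            # Reject leading hyphens, doubled hyphens, and hyphens that
            # split a group of four hex digits.
            if prev_was_hyphen or not digits or len(digits) % 4 != 0:
                raise ValueError(f"misplaced hyphen in {text!r}")
            prev_was_hyphen = True
        elif ch in HEX_DIGITS:
            digits.append(ch)
            prev_was_hyphen = False
        else:
            raise ValueError(f"invalid character {ch!r} in {text!r}")
    if prev_was_hyphen or len(digits) != 32:
        raise ValueError(f"wrong length or trailing hyphen in {text!r}")
    return bytes.fromhex("".join(digits))
```

With this rule the standard 8x-4x-4x-4x-12x layout, the HP layout, and a hyphen-free string all parse to the same value, while something like "0-1234..." is rejected.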

What I was really complaining about was Josh's suggestion that we invent
a function to let users *output* UUIDs in random-format-of-the-week.
I can't imagine much good coming of that. I think we should keep
uuid_out emitting only the RFC-standardized format.

regards, tom lane

#15Mark Mielke
mark@mark.mielke.cc
In reply to: James Mansion (#12)
Re: UUID data format 4x-4x-4x-4x-4x-4x-4x-4x

James Mansion wrote:

Mark Mielke wrote:

I recall there being a measurable performance difference between the
most liberal parser, and the most optimized parser, back when I wrote
one for PostgreSQL. I don't know how good the one in use for
PostgreSQL 8.3 is. As to whether the cost is noticeable to people or
not - that depends on what they are doing. The problem is that a UUID
is pretty big, and parsing it liberally means a loop.

It just seems odd - I would have thought one would use re2c or ragel
to generate something and the performance would essentially be O(n) in
the input length in characters - using either a collection of allowed
forms or an engine that normalises case and discards the '-'
characters between any hex pairs.

Instruction level parallelism allows for multiple hex values to be
processed in parallel, whereas a loop relies on branch prediction and
speculative load and store? :-)

The liberal version is difficult to unroll. The strict version is easy
to unroll.
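The structural difference can be sketched in Python, with the caveat that Python shows only the shape of the two approaches and says nothing about instruction-level behaviour in C. Both function names are made up for this example.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def parse_strict(text: str) -> bytes:
    """Fixed layout: every hyphen and hex field sits at a known offset."""
    if (len(text) != 36 or text[8] != "-" or text[13] != "-"
            or text[18] != "-" or text[23] != "-"):
        raise ValueError(f"not in 8-4-4-4-12 form: {text!r}")
    # The five fields are sliced at constant positions; no per-character
    # decision about "hyphen or digit" is needed.
    raw = bytes.fromhex(text[0:8] + text[9:13] + text[14:18]
                        + text[19:23] + text[24:36])
    if len(raw) != 16:  # bytes.fromhex skips whitespace, so re-check length
        raise ValueError(f"invalid UUID: {text!r}")
    return raw


def parse_liberal(text: str) -> bytes:
    """Data-dependent loop: each character decides what happens next."""
    digits = []
    for ch in text:
        if ch == "-":
            continue  # hyphen position is not known in advance
        if ch not in HEX_DIGITS:
            raise ValueError(f"invalid character {ch!r} in {text!r}")
        digits.append(ch)
    if len(digits) != 32:
        raise ValueError(f"invalid UUID: {text!r}")
    return bytes.fromhex("".join(digits))
```

In the strict version the offsets are constants, which is what makes it straightforward to unroll; in the liberal version the position of the next digit depends on the characters already seen.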

So yes these would have a control loop. Is that so bad?

Either way, it's hard to imagine how parsing a string of this length
could create a measurable performance issue compared to what will
happen with the value after parsing.

I think so too.

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>

#16Tom Dunstan
pgsql@tomd.cc
In reply to: Tom Lane (#14)
Re: UUID data format 4x-4x-4x-4x-4x-4x-4x-4x

On Fri, Feb 29, 2008 at 9:26 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andrew Sullivan <ajs@crankycanuck.ca> writes:

"Be conservative in what you send, and liberal in what you accept."

Yeah, I was about to quote that same maxim myself. I don't have a big
problem with allowing uuid_in to accept known format variants. (I'm
not sure about allowing a hyphen *anywhere*, because that could lead to
accepting things that weren't meant to be a UUID at all, but this HP
format seems regular enough that that's not a serious objection to it.)

This seems like a good enough opportunity to mention an idea that I
had while/after doing the enum patch. The patch was fairly intrusive
for something that was just adding a type because PostgreSQL isn't
really set up for parameterized types other than core types. The idea
would be to extend the enum mechanism to allow UDTs etc to be
parameterized, and enums would just become one use of the mechanism.
Other obvious examples that I had in mind were allowing variable
lengths for that binary data type with hex IO for e.g. differently
sized checksums that people want, and allowing different formats for
uuids.

So the idea as applied to this case would be to do the enum-style
typesafe thing, ie:

create type coldfusion_uuid as generic_uuid('xxxx-xxxx-xxxx-xxxx');

...then just use that. I had some thoughts about whether it would be
worth allowing inline declarations of such types inside table creation
statements as well, and there are various related issues and thoughts
on implementation which I won't go into in this email. Do people think
the idea has legs, though?
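As a rough sketch of what a format-parameterized UUID might do, the Python below builds a parser and a renderer from a pattern of 'x' and '-' characters. The factory `make_uuid_format` is invented for this example, and it assumes the pattern carries exactly 32 'x' positions; it is not the proposed SQL mechanism.

```python
import uuid

HEX_DIGITS = "0123456789abcdefABCDEF"


def make_uuid_format(pattern: str):
    """Hypothetical factory: return (parse, render) for one fixed layout."""
    if set(pattern) - {"x", "-"} or pattern.count("x") != 32:
        raise ValueError("pattern must contain exactly 32 'x' plus hyphens")

    def parse(text: str) -> uuid.UUID:
        if len(text) != len(pattern):
            raise ValueError(f"{text!r} does not match {pattern!r}")
        digits = []
        for slot, ch in zip(pattern, text):
            if slot == "-":
                if ch != "-":
                    raise ValueError(f"{text!r} does not match {pattern!r}")
            elif ch in HEX_DIGITS:
                digits.append(ch)
            else:
                raise ValueError(f"invalid character {ch!r} in {text!r}")
        return uuid.UUID(hex="".join(digits))

    def render(value: uuid.UUID) -> str:
        # value.hex is always 32 lowercase hex digits; fill the 'x' slots in order.
        hex_iter = iter(value.hex)
        return "".join("-" if slot == "-" else next(hex_iter) for slot in pattern)

    return parse, render


if __name__ == "__main__":
    parse_hp, render_hp = make_uuid_format("-".join(["xxxx"] * 8))
    value = parse_hp("0123-4567-89ab-cdef-0123-4567-89ab-cdef")
    print(value)             # standard form
    print(render_hp(value))  # same layout the application sent
```

Because the layout belongs to the type, a value read in the application's format is written back in that same format.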

What I was really complaining about was Josh's suggestion that we invent
a function to let users *output* UUIDs in random-format-of-the-week.
I can't imagine much good coming of that. I think we should keep
uuid_out emitting only the RFC-standardized format.

Well, if the application is handing them to us in that format, it
might be a bit surprised if it gets back a "fixed" one. The custom
type approach wouldn't have that side effect.

Cheers

Tom

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Dunstan (#16)
Re: UUID data format 4x-4x-4x-4x-4x-4x-4x-4x

"Tom Dunstan" <pgsql@tomd.cc> writes:

This seems like a good enough opportunity to mention an idea that I
had while/after doing the enum patch. The patch was fairly intrusive
for something that was just adding a type because PostgreSQL isn't
really set up for parameterized types other than core types. The idea
would be to extend the enum mechanism to allow UDTs etc to be
parameterized, and enums would just become one use of the mechanism.

Isn't this reasonably well covered by Teodor's work to support
typmods for user-defined types? We've discussed how the typmod could
be effectively a key into a system catalog someplace, thus allowing it
to represent more than just an int32 worth of stuff. I'm not seeing
where your proposal accomplishes more than that can.

regards, tom lane

#18Bruce Momjian
bruce@momjian.us
In reply to: Dawid Kuroczko (#1)
Re: UUID data format 4x-4x-4x-4x-4x-4x-4x-4x

Added to TODO:

* Allow the UUID type to accept non-standard formats

http://archives.postgresql.org/pgsql-hackers/2008-02/msg01214.php

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +