extensible enum types
A recent discussion with a developer convinced me that enums would be
much more useful if new values could be added to the label list easily
without changing the stored values. Given the current representation of
enums as a globally unique oid, I think this is just about impossible if
the new label is to go somewhere other than on the end of the list
(assuming that's where the next oid would get allocated).
So I have been thinking about a new type family which for the sake of
argument I will call varenum (please hold the bikeshedding till later).
Staying with 4 bytes as the stored value, the idea is that we partition
it into two 16 bit pieces. The high-order piece is the varenum type
identifier. The low-order piece would uniquely identify the label within
the set for that varenum. Initial values would be allocated like this:
calculate the space between values p as 2**16 / (16 + number_of_labels).
Then set the first value at 8 * p, then next at 9* p and so on. This is
designed to allow more space to add labels at the beginning and end of
the list, where this is more likely. Adding a label would be a matter of
finding the labels adjacent to the position where we want to add the new
label, and placing it half way between them, possibly with special rules
for the list ends. If we want to add the label between two labels having
values n and n+1 the addition would fail.
All this would mean a) we can't have more than 65536 varenum types in a
system, and no varenum type could have more than 65536 values (and
possibly less, depending on how they are added). In practice I think
these are quite reasonable restrictions, and 99.99% of users would never
come close to bumping up against them.
Given all this, we could then allow things like:
ALTER TYPE varenumtype ADD 'newlabel' [ BEFORE | AFTER 'existinglabel' ]
I haven't looked at how we'd set this up in the catalog, but I assume
that by analogy with pg_enum it should be fairly straightforward.
We could actually allow something like the above for existing enum types
without the optional clause, which would just add the label wherever the
next oid happened to fall, which would commonly be at the end of the
list. That would handle the common case where the application doesn't
care about the label ordering. That should be a much simpler change and
is probably worth doing first.
There was some discussion of this here:
<http://archives.postgresql.org/message-id/20080425182718.GD5888@alvh.no-ip.org>
But I'm fairly reluctant to do things which will require table rewrites.
I'm also less excited about the idea of removing values - that is
something I don't ever recall being asked about, but I have often been
asked about adding labels. For people prepared to rewrite tables, there
is currently a workaround: create a new enum type, alter the table to
use the new enum type, drop the old enum type, rename the new enum type
to the old enum type.
Thoughts?
cheers
andrew
On Fri, Jun 18, 2010 at 11:50 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
Then set the
first value at 8 * p, then next at 9* p and so on. This is designed to
allow more space to add labels at the beginning and end of the list, where
this is more likely. Adding a label would be a matter of finding the labels
adjacent to the position where we want to add the new label, and placing it
half way between them, possibly with special rules for the list ends. If we
want to add the label between two labels having values n and n+1 the
addition would fail.
I like the idea of being able to modify enums on the fly, but I'm
skeptical of an implementation that won't always work. Maybe it's
still better than what we have now, but it seems grotty.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Robert Haas wrote:
On Fri, Jun 18, 2010 at 11:50 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
Then set the
first value at 8 * p, then next at 9* p and so on. This is designed to
allow more space to add labels at the beginning and end of the list, where
this is more likely. Adding a label would be a matter of finding the labels
adjacent to the position where we want to add the new label, and placing it
half way between them, possibly with special rules for the list ends. If we
want to add the label between two labels having values n and n+1 the
addition would fail.I like the idea of being able to modify enums on the fly, but I'm
skeptical of an implementation that won't always work. Maybe it's
still better than what we have now, but it seems grotty.
I'd be perfectly happy to hear a reasonable alternative. Assuming we use
some integer representation, given two labels represented by n and n+1,
we can't add a label between them without rewriting the tables that use
the type, whether it's my representation scheme or some other. Maybe we
could have a FORCE option which would rewrite if necessary.
cheers
andrew
On Jun 18, 2010, at 9:07 AM, Robert Haas wrote:
Then set the
first value at 8 * p, then next at 9* p and so on. This is designed to
allow more space to add labels at the beginning and end of the list, where
this is more likely. Adding a label would be a matter of finding the labels
adjacent to the position where we want to add the new label, and placing it
half way between them, possibly with special rules for the list ends. If we
want to add the label between two labels having values n and n+1 the
addition would fail.I like the idea of being able to modify enums on the fly, but I'm
skeptical of an implementation that won't always work. Maybe it's
still better than what we have now, but it seems grotty.
Yes, other than that I fully endorse the idea. What's the likelihood of a failure? And would the position of the new label (represented by its internal number) be predictive? IOW, would updating the same varenumtype in two databases (or on two servers) yield the same underlying positional value?
Best,
David
On Jun 18, 2010, at 9:34 AM, Andrew Dunstan wrote:
I'd be perfectly happy to hear a reasonable alternative. Assuming we use some integer representation, given two labels represented by n and n+1, we can't add a label between them without rewriting the tables that use the type, whether it's my representation scheme or some other. Maybe we could have a FORCE option which would rewrite if necessary.
People would likely always use it.
Why not use a decimal number?
Best,
David
On Fri, 2010-06-18 at 12:34 -0400, Andrew Dunstan wrote:
Robert Haas wrote:
On Fri, Jun 18, 2010 at 11:50 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
Then set the
first value at 8 * p, then next at 9* p and so on. This is designed to
allow more space to add labels at the beginning and end of the list, where
this is more likely. Adding a label would be a matter of finding the labels
adjacent to the position where we want to add the new label, and placing it
half way between them, possibly with special rules for the list ends. If we
want to add the label between two labels having values n and n+1 the
addition would fail.I like the idea of being able to modify enums on the fly, but I'm
skeptical of an implementation that won't always work. Maybe it's
still better than what we have now, but it seems grotty.I'd be perfectly happy to hear a reasonable alternative. Assuming we use
some integer representation, given two labels represented by n and n+1,
we can't add a label between them without rewriting the tables that use
the type, whether it's my representation scheme or some other. Maybe we
could have a FORCE option which would rewrite if necessary.
I apologize as this thread has already moved past the initial proposal.
I had mail problems so I didn't see any part of the thread in my client
until now.
My standard response to enums is, don't use them. Specifically because
you can't modify them. When I talk to developers and they see we have
enums they get excited and then I tell them you can't modify them and
they get very downtrodden. I tell them just to use a look up table.
Anyway, Andrew if you can find a reasonable way to make them modifiable,
I am all for it :)
On that note and yes this would be weird but have we considered some
kind of wrapper around just a lookup table? I mean what if there was a
system table that listed :
nspname, relation, column, value
the "type" enum would know to look in that table for valid values
Then we just add various syntactical sugar to manage the values in the
table via that specific enum:
ALTER TABLE foo ALTER COLUMN bar VALUES (bar,baz,bing,foo)
ALTER TABLE would know its an enum and would perform proper operations
on the system table.
Sincerely,
Joshua D. Drake
cheers
andrew
--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
Andrew Dunstan <andrew@dunslane.net> writes:
Robert Haas wrote:
I like the idea of being able to modify enums on the fly, but I'm
skeptical of an implementation that won't always work. Maybe it's
still better than what we have now, but it seems grotty.
I'd be perfectly happy to hear a reasonable alternative.
Insert a sort order column into pg_enum, and rearrange the values in
that whenever the user wants to add a new value in a particular place.
You give up cheap comparisons in exchange for flexibility. I think lots
of people would accept that tradeoff, especially if they could make it
per-datatype.
One point here is that you'd have to restrict the rearrangements so that
the relative sort order of existing values never changes, else you break
(for example) indexes on columns of that type.
regards, tom lane
David E. Wheeler wrote:
What's the likelihood of a failure?
Constructing a failure case would be simple. In practice, I suspect it
would be very low.
And would the position of the new label (represented by its internal number) be predictive? IOW, would updating the same varenumtype in two databases (or on two servers) yield the same underlying positional value?
The algorithm I outlined is deterministic, so the same sequence of
operations on the type would yield the same set of values on the
low-order 16 bits. But that doesn't mean they would have the same high
order 16 bits. That would depend on the history of the system. But more
importantly, why do you care? the stored value is an implementation
artefact that should be totally invisible to users. There would be no
guarantee of the same underlying values on dump/reload either, just as
there is not now for enums.
cheers
andrew
David E. Wheeler wrote:
On Jun 18, 2010, at 9:34 AM, Andrew Dunstan wrote:
I'd be perfectly happy to hear a reasonable alternative. Assuming we use some integer representation, given two labels represented by n and n+1, we can't add a label between them without rewriting the tables that use the type, whether it's my representation scheme or some other. Maybe we could have a FORCE option which would rewrite if necessary.
People would likely always use it.
Why not use a decimal number?
You are just bumping up the storage cost. Part of the attraction of
enums is their efficiency.
cheers
andrew
Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
Robert Haas wrote:
I like the idea of being able to modify enums on the fly, but I'm
skeptical of an implementation that won't always work. Maybe it's
still better than what we have now, but it seems grotty.I'd be perfectly happy to hear a reasonable alternative.
Insert a sort order column into pg_enum, and rearrange the values in
that whenever the user wants to add a new value in a particular place.
You give up cheap comparisons in exchange for flexibility. I think lots
of people would accept that tradeoff, especially if they could make it
per-datatype.One point here is that you'd have to restrict the rearrangements so that
the relative sort order of existing values never changes, else you break
(for example) indexes on columns of that type.
Hmm. Yes, that could work. The assumption in my proposal was that
existing values would not be reordered anyway.
But I'm not happy about giving up cheap comparison. And how would it be
per data-type? That part isn't clear to me. Would we mark a given enum
type as having its oids in order? It would also be sensible to quantify
how much more expensive comparisons would become. If the sort order data
were kept in the syscache the extra cost might get very small.
What I actually like most about this suggestion is that we would be able
to apply it cleanly to existing enum types without inventing anything
much new.
cheers
cheers
On Fri, Jun 18, 2010 at 12:59 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
You are just bumping up the storage cost. Part of the attraction of enums is
their efficiency.
What's efficient about them? Aren't we using 4 bytes to store a value
that will nearly always fit in 2, if not 1?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Robert Haas wrote:
On Fri, Jun 18, 2010 at 12:59 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
You are just bumping up the storage cost. Part of the attraction of enums is
their efficiency.What's efficient about them? Aren't we using 4 bytes to store a value
that will nearly always fit in 2, if not 1?
This was debated when we implemented enums. As between 1,2 and 4 there
is often not much to choose, as alignment padding makes it pretty much
the same. But any of them are more efficient than storing a numeric
value or the label itself.
Anyway, it might well be moot.
cheers
andrew
On Fri, Jun 18, 2010 at 1:59 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
This was debated when we implemented enums. As between 1,2 and 4 there is
often not much to choose, as alignment padding makes it pretty much the
same. But any of them are more efficient than storing a numeric value or the
label itself.
I was assuming the alternative was an integer, rather than a
numeric... but yeah, a numeric or the label itself would definitely
be larger.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On Fri, Jun 18, 2010 at 6:17 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
Tom Lane wrote:
Insert a sort order column into pg_enum, and rearrange the values in
that whenever the user wants to add a new value in a particular place.
+1 I was going to say exactly the same thing.
You give up cheap comparisons in exchange for flexibility. I think lots
of people would accept that tradeoff, especially if they could make it
per-datatype.Hmm. Yes, that could work. The assumption in my proposal was that existing
values would not be reordered anyway.But I'm not happy about giving up cheap comparison. And how would it be per
data-type? That part isn't clear to me. Would we mark a given enum type as
having its oids in order? It would also be sensible to quantify how much
more expensive comparisons would become. If the sort order data were kept in
the syscache the extra cost might get very small.
I think you would need a syscache or something like it. My first
instinct was to load the whole enum value->sort order mapping into a
hash table the first time you're asked to compare two values in a
given type. Then your comparison operator amounts to "look
up/initialize hash table for this enum type, look up both sort orders
in hash table, return comparison". You might need something like a
syscache for the hash tables so that you don't keep the hash tables
around forever.
Using a syscache for the individual sort values would be slower to
initially load if you're sorting a list since you would be doing a lot
of retail lookups of individual values. But then perhaps it's a cheap
way to protect against very large enums which using a hash table per
enum type would be fragile against.
--
greg
Andrew Dunstan <andrew@dunslane.net> writes:
Tom Lane wrote:
Insert a sort order column into pg_enum, and rearrange the values in
that whenever the user wants to add a new value in a particular place.
You give up cheap comparisons in exchange for flexibility. I think lots
of people would accept that tradeoff, especially if they could make it
per-datatype.
But I'm not happy about giving up cheap comparison.
I don't think it would be all that bad. We could teach typcache.c to
cache the ordering data for any type that's in active use. It'd
certainly be a lot more expensive than OID comparison, but perhaps not
worse than NUMERIC comparisons.
And how would it be per data-type?
Well, there'd be two kinds of enums, just as you were saying before.
I'm not sure how we'd expose that to users exactly, or whether there
could be provisions for switching a type's behavior after creation.
regards, tom lane
On Fri, Jun 18, 2010 at 1:59 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
Robert Haas wrote:
On Fri, Jun 18, 2010 at 12:59 PM, Andrew Dunstan <andrew@dunslane.net>
wrote:You are just bumping up the storage cost. Part of the attraction of enums
is
their efficiency.What's efficient about them? Aren't we using 4 bytes to store a value
that will nearly always fit in 2, if not 1?This was debated when we implemented enums. As between 1,2 and 4 there is
often not much to choose, as alignment padding makes it pretty much the
same. But any of them are more efficient than storing a numeric value or the
label itself.Anyway, it might well be moot.
cheers
andrew
Something I don't understand in all this is: why can't the type of an
enum be determined statically rather than stored in every single
value? For instance, if we have:
CREATE TYPE number AS ENUM ('one', 'two', 'three');
CREATE TYPE color AS ENUM ('red', 'green', 'blue');
PostgreSQL won't allow a comparison between two different enum types, e.g.:
SELECT 'one'::number = 'red'::color;
ERROR: operator does not exist: number = color
However, when we say:
SELECT 'one'::number = 'one'::number
Couldn't enum_eq just use get_fn_expr_argtype to determine the type of
enum input rather than rely on it being stored in the value (either
implicitly via OID or explicitly as a word half)?
Also, I can't seem to find the original debates from when enums were
implemented. Does anyone have a link to that thread in the archives?
Thanks.
Joey Adams
Joseph Adams wrote:
Also, I can't seem to find the original debates from when enums were
implemented. Does anyone have a link to that thread in the archives?
Thanks.
Start here
<http://archives.postgresql.org/pgsql-hackers/2006-08/msg00979.php>
cheers
andrew
Excerpts from Joseph Adams's message of vie jun 18 18:17:50 -0400 2010:
Couldn't enum_eq just use get_fn_expr_argtype to determine the type of
enum input rather than rely on it being stored in the value (either
implicitly via OID or explicitly as a word half)?Also, I can't seem to find the original debates from when enums were
implemented. Does anyone have a link to that thread in the archives?
Probably around here
http://archives.postgresql.org/pgsql-patches/2006-12/msg00099.php
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Joseph Adams <joeyadams3.14159@gmail.com> writes:
Couldn't enum_eq just use get_fn_expr_argtype to determine the type of
enum input rather than rely on it being stored in the value
No. Support functions have to work in many contexts where there is no
"side channel" such as get_fn_expr_argtype. What's more, it's very
difficult to provide a side channel without creating security holes.
We used to think it was OK for output functions to rely on a type OID
passed separately from the actual value, but that's insecure:
http://archives.postgresql.org/pgsql-hackers/2005-04/msg00998.php
regards, tom lane
Tom Lane wrote:
And how would it be per data-type?
Well, there'd be two kinds of enums, just as you were saying before.
I'm not sure how we'd expose that to users exactly, or whether there
could be provisions for switching a type's behavior after creation.
I'd be a whole lot happier if we didn't have to do that. I've been
trying to wrack my brain for some clever way to avoid it (so far without
success).
cheers
andrew