VARIANT / ANYTYPE datatype
Hello,
A customer came to us with this request: a way to store "any" data in a
column. We've gone back and forth trying to determine reasonable
implementation restrictions, safety and useful semantics for them.
I note that this has been requested in the past:
http://archives.postgresql.org/pgsql-general/2004-02/msg01266.php
and both Oracle and MS-SQL have it and apparently people find them
useful. I didn't find any indication that SQL contains anything
resembling this.
The main idea is to be able to store column values in an audit table
like this:
change_time timestamptz
table_name name
column_name name
old_value variant
new_value variant
So per-column changes, which is much more convenient than the regular
idea of storing the whole NEW and/or OLD record(s).
Currently, they use text for old_value and new_value, but this is, of
course, not very satisfactory.
My thought was that a variant type would store the datum as
<typid><data> so that it would be possible to identify the datatype
stored in each column/row and interpret adequately, calling the
appropriate output function etc. On input it would be limited to come
only from inside the system, not from the outside world, as that would
have obvious security implications; so it'd be similar to pg_node_tree
in that regard.
Now this has obvious limitations: first, any query that tries to extract
data would need to include a cast of the variant value to the
appropriate type, so that the type can be resolved early. Thus,
trying to extract rows of different types would be forbidden.
Also, there would be a security problem with a datum storing something
whose datatype later changes (consider a user-defined record type or
things like that). My first reaction was to do something like
CREATE TYPE foo VARIANT OF (int, text, timestamptz);
and then you could declare old_value with type foo, which would only
allow values of the declared types. This makes it easy to catalogue
used types in any variant, and thus easy to restrict modifying or
dropping types that are used in some variant. However, this idea was
rejected by the customer due to the unusability: you would have to
remember to edit the variant to add the new type anytime you added a new
column to a table, which would be cumbersome.
What the customer suggested was to have a new fork, which stores
type OIDs of datatypes used in the variant. Then when a type is to be
altered, all forks would be scanned to determine if the type is used,
and raise an error if so. I rejected that idea as unworkable.
However, as a middle ground we agreed that we could allow a declared
variant to store any pinned type without restrictions; those can't be
changed or dropped so there's no safety concern. Storing other types
(user-defined types, records and so on) would require some safety net.
Before spending too much time exploring a detailed design, I'd like to
hear some opinions on the basic idea.
--
Álvaro Herrera <alvherre@alvh.no-ip.org>
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
A customer came to us with this request: a way to store "any" data
in a column.
+1
More than once (usually in replication, interface, or audit
situations) I've had to create a table with one column each of a
number of different data types, only one of which will be used in
each row. It would make life much simpler for programming such
things if an "unknown" type could be stored. I could either cast
the types in the rows based on the related identifier column, but it
would be even nicer if they could be read "unknown" and pg_typeof or
something similar could be used to control flow based on the type.
As one example, only *last night* I created the following table as
part of the development of our shop's next generation of software.
(Yes I know about Slony, Londiste, and others -- we need to roll
something out which integrates with existing systems, without
production disruption, over the next few years. This does require a
custom solution.)
Table "public.DbTranOpValue"
Column | Type | Modifiers
----------------+-------------------+-----------
countyNo | "CountyNoT" | not null
backendPid | integer | not null
tranStart | "TimestampT" | not null
logRecordSeqNo | "LogRecordSeqNoT" | not null
columnName | "ColumnNameT" | not null
isAfter | "BooleanT" | not null
textValue | text |
intValue | bigint |
numericValue | numeric |
binaryValue | bytea |
booleanValue | "BooleanT" |
dateValue | "DateT" |
timeValue | "TimeT" |
timestampValue | "TimestampT" |
Indexes:
"DbTranOpValue_pkey" PRIMARY KEY, btree
("backendPid", "tranStart", "logRecordSeqNo", "columnName")
I would much rather have had a "value" column of unknown type.
-Kevin
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
A customer came to us with this request: a way to store "any" data in a
column. We've gone back and forth trying to determine reasonable
implementation restrictions, safety and useful semantics for them.
Yes, it seems rather messy.
The main idea is to be able to store column values in an audit table
like this:
old_value variant
new_value variant
Currently, they use text for old_value and new_value, but this is, of
course, not very satisfactory.
Just out of curiosity, what actual functionality gain would ensue over
just using text? It seems like doing anything useful with the audit
table contents would still require casting the column to text, or the
moral equivalent of that.
If we did have a concept of ANYTYPE, or really we'd need ANYTYPE[],
it could perhaps be used in pg_statistic, in place of the rather ugly
anyarray hack that's there now. But I note that nothing above the level
of C code can do anything very useful with the contents of pg_statistic,
and I'm not sure that having an official type would change that.
However, as a middle ground we agreed that we could allow a declared
variant to store any pinned type without restrictions; those can't be
changed or dropped so there's no safety concern.
If you're going to accept that there are restrictions, I don't see that
there is a good argument against your thought of a declared "union"
type. At least then it's clear what the restrictions are. I am firmly
against exposing the notion of "pinned" types in any user-visible SQL
semantics, and even more firmly against the idea of creating fundamental
functionality that only works for built-in types and can't be extended
to add-on types.
regards, tom lane
Alvaro Herrera wrote:
A customer came to us with this request: a way to store "any" data in a
column. We've gone back and forth trying to determine reasonable
implementation restrictions, safety and useful semantics for them.
I note that this has been requested in the past:
http://archives.postgresql.org/pgsql-general/2004-02/msg01266.php
and both Oracle and MS-SQL have it and apparently people find them
useful. I didn't find any indication that SQL contains anything
resembling this.
I see VARIANT/ANYTYPE as the most general case of supporting union types, which,
say, could have more specific examples of "allow any number or date here but
nothing else". If VARIANT is supported, unions in general ought to be also.
The most effective way of supporting VARIANT or union types in general is having
an implementation where in the general case each value in the database knows its
own data type rather than the data type being provided by a context such as what
table column it is in. For example, if rather than storing a data value
directly we store a 2-attribute struct naming a data type and pointing to or
holding the data value.
See how SQLite works as an example of how VARIANTs or unions could work,
although that on its own would need to be made more comprehensive for Pg.
I claim ignorance as to how Pg currently implements these matters.
Where VARIANT/union types are supported by default, declaring more specific
types is just a type constraint and an optimization.
Of course, when we know the type of a column/etc isn't going to be VARIANT or
some other union type, then a simple optimization allows us to just store the
value and have its type provided by context rather than the struct.
-- Darren Duncan
On Wed, May 4, 2011 at 12:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
A customer came to us with this request: a way to store "any" data in a
column. We've gone back and forth trying to determine reasonable
implementation restrictions, safety and useful semantics for them.Yes, it seems rather messy.
The main idea is to be able to store column values in an audit table
like this:
old_value variant
new_value variant
Currently, they use text for old_value and new_value, but this is, of
course, not very satisfactory.Just out of curiosity, what actual functionality gain would ensue over
just using text? It seems like doing anything useful with the audit
table contents would still require casting the column to text, or the
moral equivalent of that.
The problem with text is that for composite/table types you lose the
ability to cast back when the structure changes. Most of the
auditing implementations I've done cast new/old to text and drop it
directly into a single column on the audit record. It works ok, but
you have to messily update the text later when the table changes. For
non-variant composites you can add columns down the line and it works
ok in dependent records without too much fuss.
I think though that getting this to work such that type dependency is
via row/column instead of just column is going to be tough. Outside
of composites, I don't see much advantages vs the text route,
performance maybe?
merlin
On 05/04/2011 01:36 PM, Tom Lane wrote:
The main idea is to be able to store column values in an audit table
like this:
old_value variant
new_value variant
Currently, they use text for old_value and new_value, but this is, of
course, not very satisfactory.Just out of curiosity, what actual functionality gain would ensue over
just using text? It seems like doing anything useful with the audit
table contents would still require casting the column to text, or the
moral equivalent of that.
Yeah, I've been down this road once or twice, and I think that's the $64
question.
I wrote a custom audit app two or three years ago. After several
iterations the customer and I found that using an hstore for the old/new
(or old record / changeset, which is what we actually use) was the most
suitable for our use.
I think if we did this we'd need to add some sort of is_type() and
typeof() functions for variant objects.
cheers
andrew
On Wed, May 4, 2011 at 2:55 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
On 05/04/2011 01:36 PM, Tom Lane wrote:
The main idea is to be able to store column values in an audit table
like this:
old_value variant
new_value variant
Currently, they use text for old_value and new_value, but this is, of
course, not very satisfactory.Just out of curiosity, what actual functionality gain would ensue over
just using text? It seems like doing anything useful with the audit
table contents would still require casting the column to text, or the
moral equivalent of that.Yeah, I've been down this road once or twice, and I think that's the $64
question.I wrote a custom audit app two or three years ago. After several iterations
the customer and I found that using an hstore for the old/new (or old record
/ changeset, which is what we actually use) was the most suitable for our
use.
yeah -- +1 on that method. I think it's really the right way to go
with the recent hstore enhancements.
merlin
Excerpts from Tom Lane's message of mié may 04 14:36:44 -0300 2011:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
The main idea is to be able to store column values in an audit table
like this:
old_value variant
new_value variant
Currently, they use text for old_value and new_value, but this is, of
course, not very satisfactory.Just out of curiosity, what actual functionality gain would ensue over
just using text? It seems like doing anything useful with the audit
table contents would still require casting the column to text, or the
moral equivalent of that.
Storage efficiency. These people have really huge databases; small
changes in how tight things are packed makes a large difference for
them. (For example, we developed a type to store SHA-2 digests in a
more compact way than bytea mainly because of this reason. Also, at
some time they also wanted to apply compression to hstore keys and
values.)
As someone commented downthread, they also want to have things such as a
"typeof" operator. It could be used in (say) a plpgsql function to
choose different branches of code.
Things go wild when you think about using VARIANT as a function
parameter type. Given how complex are our current function resolution
rules I'm not really sold on making this work at all, so initially I'm
aiming at just raising an error in that case.
If we did have a concept of ANYTYPE, or really we'd need ANYTYPE[],
it could perhaps be used in pg_statistic, in place of the rather ugly
anyarray hack that's there now. But I note that nothing above the level
of C code can do anything very useful with the contents of pg_statistic,
and I'm not sure that having an official type would change that.
Well, we could get rid of ugly hacks that are in various places in the
backend code to cope with this. Nor sure how useful it'd be for doing
things at the SQL level.
However, as a middle ground we agreed that we could allow a declared
variant to store any pinned type without restrictions; those can't be
changed or dropped so there's no safety concern.If you're going to accept that there are restrictions, I don't see that
there is a good argument against your thought of a declared "union"
type. At least then it's clear what the restrictions are. I am firmly
against exposing the notion of "pinned" types in any user-visible SQL
semantics, and even more firmly against the idea of creating fundamental
functionality that only works for built-in types and can't be extended
to add-on types.
The argument was "it's too cumbersome to use" (which makes sense: users
are certain to forget to add the new type to the declared union when
they add a new column to the table, possibly causing a run-time error if
the audit trigger is in place and attempts to load a new datum into the
log table.)
I understand the reluctancy to use pinned-ness in a user-visible way.
Back to the drawing board.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
Excerpts from Tom Lane's message of mié may 04 14:36:44 -0300 2011:
Just out of curiosity, what actual functionality gain would ensue over
just using text? It seems like doing anything useful with the audit
table contents would still require casting the column to text, or the
moral equivalent of that.
Storage efficiency. These people have really huge databases; small
changes in how tight things are packed makes a large difference for
them. (For example, we developed a type to store SHA-2 digests in a
more compact way than bytea mainly because of this reason. Also, at
some time they also wanted to apply compression to hstore keys and
values.)
Hmm. The prototypical case for this would probably be a 4-byte int,
which if you add an OID to it so you can resolve the type is going to
take 8 bytes, plus you are going to need a length word because there is
really no alternative to the "VARIANT" type being varlena overall, which
makes it 9 bytes if you're lucky on alignment and up to 16 if you're
not. That is not shorter than the average length of the text
representation of an int. The numbers don't seem a lot better for
8-byte quantities like int8, float8, or timestamp. It might be
marginally worthwhile for timestamp, but surely this is a huge amount of
effort to substitute for thinking of a more compact text representation
for timestamps.
Pardon me for being unconvinced.
regards, tom lane
On May 4, 2011, at 3:04 PM, Alvaro Herrera wrote:
As someone commented downthread, they also want to have things such as a
"typeof" operator. It could be used in (say) a plpgsql function to
choose different branches of code.
FWIW, pg_typeof(any) has been in core since 9.0.
Best,
David
On 05/04/2011 07:05 PM, Tom Lane wrote:
Alvaro Herrera<alvherre@commandprompt.com> writes:
Excerpts from Tom Lane's message of mié may 04 14:36:44 -0300 2011:
Just out of curiosity, what actual functionality gain would ensue over
just using text? It seems like doing anything useful with the audit
table contents would still require casting the column to text, or the
moral equivalent of that.Storage efficiency. These people have really huge databases; small
changes in how tight things are packed makes a large difference for
them. (For example, we developed a type to store SHA-2 digests in a
more compact way than bytea mainly because of this reason. Also, at
some time they also wanted to apply compression to hstore keys and
values.)Hmm. The prototypical case for this would probably be a 4-byte int,
which if you add an OID to it so you can resolve the type is going to
take 8 bytes, plus you are going to need a length word because there is
really no alternative to the "VARIANT" type being varlena overall, which
makes it 9 bytes if you're lucky on alignment and up to 16 if you're
not. That is not shorter than the average length of the text
representation of an int. The numbers don't seem a lot better for
8-byte quantities like int8, float8, or timestamp. It might be
marginally worthwhile for timestamp, but surely this is a huge amount of
effort to substitute for thinking of a more compact text representation
for timestamps.Pardon me for being unconvinced.
I'm far from convinced that storing deltas per column rather than per
record is a win anyway. I don't have hard numbers to hand, but my vague
recollection is that my tests showed it to be a design that used more space.
cheers
andrew
"David E. Wheeler" <david@kineticode.com> writes:
On May 4, 2011, at 3:04 PM, Alvaro Herrera wrote:
As someone commented downthread, they also want to have things such as a
"typeof" operator. It could be used in (say) a plpgsql function to
choose different branches of code.
FWIW, pg_typeof(any) has been in core since 9.0.
But note that pg_typeof() would surely say "VARIANT" if applied to such
a datatype. You'd need some other, VARIANT-specific function that knew
enough to dig into the value at runtime and get the OID therein.
regards, tom lane
Excerpts from Tom Lane's message of mié may 04 20:05:54 -0300 2011:
Alvaro Herrera <alvherre@commandprompt.com> writes:
Excerpts from Tom Lane's message of mié may 04 14:36:44 -0300 2011:
Just out of curiosity, what actual functionality gain would ensue over
just using text? It seems like doing anything useful with the audit
table contents would still require casting the column to text, or the
moral equivalent of that.Storage efficiency. These people have really huge databases; small
changes in how tight things are packed makes a large difference for
them. (For example, we developed a type to store SHA-2 digests in a
more compact way than bytea mainly because of this reason. Also, at
some time they also wanted to apply compression to hstore keys and
values.)Hmm. The prototypical case for this would probably be a 4-byte int,
which if you add an OID to it so you can resolve the type is going to
take 8 bytes, plus you are going to need a length word because there is
really no alternative to the "VARIANT" type being varlena overall, which
makes it 9 bytes if you're lucky on alignment and up to 16 if you're
not.
Hmm, I was hoping that we could skip making it varlena at least in some
cases ... perhaps that's hopeless, in which case we'll have to reassess
the whole idea. Still there's the new functionality argument, though.
As a followup idea there exists the desire to store records as records
and not text representation of same (given differing record types, of
course), for which it'd be more worthwhile.
It might be
marginally worthwhile for timestamp, but surely this is a huge amount of
effort to substitute for thinking of a more compact text representation
for timestamps.
Surely if it's just for timestamps it'd be a waste of time.
Pardon me for being unconvinced.
:-)
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
As a followup idea there exists the desire to store records as records
and not text representation of same (given differing record types, of
course), for which it'd be more worthwhile.
Maybe. The conventional wisdom is that text representation of data is
more compact than PG's internal representation by a significant factor
--- our FAQ says up to 5x, in fact. I know that that's including row
overhead and indexes and so on, but I still don't find it to be a given
that you're going to win on space with this sort of trick.
Another point here is that once the values in question get to be any
significant number of bytes apiece, both text and the proposed VARIANT
representation could be subject to compression, which would greatly
reduce any nominal advantage of one over the other. Any
representational inefficiency in either would just be grist for the
compressor.
regards, tom lane
On Wed, May 4, 2011 at 8:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alvaro Herrera <alvherre@commandprompt.com> writes:
As a followup idea there exists the desire to store records as records
and not text representation of same (given differing record types, of
course), for which it'd be more worthwhile.Maybe. The conventional wisdom is that text representation of data is more compact than PG's internal representation by a significant factor --- our FAQ says up to 5x, in fact. I know that that's including row overhead and indexes and so on, but I still don't find it to be a given that you're going to win on space with this sort of trick.
I've done a lot of testing of the text vs binary format on the wire
format...not exactly the same set of issues, but pretty close since
you have to send all the oids, lengths, etc. Conventional wisdom is
correct although overstated for this topic. Even in truly
pathological cases for text, for example in sending multiple levels of
redundant escaping in complex structures, the text format will almost
always be smaller. For 'typical' data it can be significantly
smaller. Two exceptions most people will run into are bytea obviously
and the timestamp family of types where binary style manipulation is a
huge win both in terms of space and performance.
For complex data (say 3+ levels of composites stacked in arrays),
binary type formats are much *faster*, albeit larger, via binary as
long as you are not bandwidth constrained, and presumably they would
be as well for variants. Perhaps even more so, because some of the
manipulations made converting tuple storage to binary wire formats
don't have to happen. That said, while there are use cases for
sending highly structured data over the wire, I can't think of any for
direct storage on a table in variant type scenarios, at least not yet
:-).
merlin
On May 4, 2011, at 6:24 PM, Andrew Dunstan wrote:
I'm far from convinced that storing deltas per column rather than per record is a win anyway. I don't have hard numbers to hand, but my vague recollection is that my tests showed it to be a design that used more space.
It depends on how many fields you're changing in one go and how wide the table is. It's also a PITA to identify what fields actually changed if you're storing everything. In the case of logging, I'd say that what's really needed is a way to store a table record that has an indicator of what fields actually changed (and possibly not storing anything for fields that didn't change). That table record would need to also deal with changes to the underlying table structure.
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
On 05/05/2011 01:00 PM, Jim Nasby wrote:
On May 4, 2011, at 6:24 PM, Andrew Dunstan wrote:
I'm far from convinced that storing deltas per column rather than per record is a win anyway. I don't have hard numbers to hand, but my vague recollection is that my tests showed it to be a design that used more space.
It depends on how many fields you're changing in one go and how wide the table is. It's also a PITA to identify what fields actually changed if you're storing everything.
No it's not. Instead of storing OLD/NEW, store a base record and a delta
record (an hstore with just the changed fields) for an update. This
saves space and means you only have to calculate what changed once.
cheers
andrew
Excerpts from Darren Duncan's message of mié may 04 15:33:33 -0300 2011:
I see VARIANT/ANYTYPE as the most general case of supporting union types, which,
say, could have more specific examples of "allow any number or date here but
nothing else". If VARIANT is supported, unions in general ought to be also.
Okay, so aside from the performance (storage reduction) gained, there's
this argument for having variant/union types. It seems to me that this
is indeed possible to build. Completely general VARIANT, though, is
rather complex. A declared union, where you specify exactly which types
can be part of the union, can be catalogued, so that the system knows
exactly where to look when a type needs to be modified. A general
VARIANT however looks complex to me to solve.
The problem is this: if an user attempts to drop a type, and this type
is used in a variant somewhere, we would lose the stored data. So the
drop needs to be aborted. Similarly, if we alter a type (easy example:
a composite type) used in a variant, we need to cascade to modify all
rows using that composite.
If the unions that use a certain type are catalogued, we at least know
what tables to scan to cascade.
In a general variant, the system catalogs do not have the information of
what type each variant masquerades as. We would need to examine the
variant's masqueraded types on each insert; if the current type is not
found, add it. This seems a bit expensive.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On 05/06/2011 04:08 PM, Alvaro Herrera wrote:
Excerpts from Darren Duncan's message of mié may 04 15:33:33 -0300 2011:
I see VARIANT/ANYTYPE as the most general case of supporting union types, which,
say, could have more specific examples of "allow any number or date here but
nothing else". If VARIANT is supported, unions in general ought to be also.Okay, so aside from the performance (storage reduction) gained, there's
this argument for having variant/union types. It seems to me that this
is indeed possible to build. Completely general VARIANT, though, is
rather complex. A declared union, where you specify exactly which types
can be part of the union, can be catalogued, so that the system knows
exactly where to look when a type needs to be modified. A general
VARIANT however looks complex to me to solve.The problem is this: if an user attempts to drop a type, and this type
is used in a variant somewhere, we would lose the stored data. So the
drop needs to be aborted. Similarly, if we alter a type (easy example:
a composite type) used in a variant, we need to cascade to modify all
rows using that composite.If the unions that use a certain type are catalogued, we at least know
what tables to scan to cascade.In a general variant, the system catalogs do not have the information of
what type each variant masquerades as. We would need to examine the
variant's masqueraded types on each insert; if the current type is not
found, add it. This seems a bit expensive.
So how is a declared union going to look and operate? Something like this?
CREATE TYPE foo AS UNION (ival int, tval text, tsval timestamptz):
CREATE TABLE bar (myunion foo);
INSERT INTO bar (myunion) VALUES (ival=>1), (tval=>'some text');
UPDATE bar SET myunion.tsval = CURRENT_TIMESTAMP;
Something like that could actually be quite nice for a number of purposes.
cheers
andrew
Alvaro Herrera <alvherre@commandprompt.com> writes:
Excerpts from Darren Duncan's message of mié may 04 15:33:33 -0300 2011:
I see VARIANT/ANYTYPE as the most general case of supporting union types, which,
say, could have more specific examples of "allow any number or date here but
nothing else". If VARIANT is supported, unions in general ought to be also.
Okay, so aside from the performance (storage reduction) gained, there's
this argument for having variant/union types. It seems to me that this
is indeed possible to build. Completely general VARIANT, though, is
rather complex.
Yeah. I have no particular objection to a UNION over a specified set of
types, but am not very happy with the idea of an unconstrained union.
Also, a UNION declaration would allow attaching a field name to each
containable type, as I see Andrew just posted about. That seems like
potentially a good idea.
In a general variant, the system catalogs do not have the information of
what type each variant masquerades as. We would need to examine the
variant's masqueraded types on each insert; if the current type is not
found, add it. This seems a bit expensive.
Not to mention race-condition-prone. How do you ensure someone is not
inserting another instance of the variant, with some previously not used
content type, while this is going on?
regards, tom lane