JSON for PG 9.2
Where are we with adding JSON for Postgres 9.2? We got bogged down in
the data representation last time we discussed this.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Mon, Dec 5, 2011 at 3:12 PM, Bruce Momjian <bruce@momjian.us> wrote:
Where are we with adding JSON for Postgres 9.2? We got bogged down in
the data representation last time we discussed this.
We're waiting for you to send a patch that resolves all
previously-raised issues. :-)
In all seriousness, I think the right long-term answer here is to have
two data types - one that simply validates JSON and stores it as text,
and the other of which uses some binary encoding. The first would be
similar to our existing xml datatype and would be suitable for cases
when all or nearly all of your storage and retrieval operations will
be full-column operations, and the json type is basically just
providing validation. The second would be optimized for pulling out
(or, perhaps, replacing) pieces of arrays or hashes, but would have
additional serialization/deserialization overhead when working with
the entire value. As far as I can see, these could be implemented
independently of each other and in either order, but no one seems to
have yet found the round tuits.
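For illustration only, a sketch of how the two proposed types might differ in use; the jsonb name and the -> accessor below are hypothetical placeholders, not anything that exists:

```sql
-- Type 1: validates on input, stores plain text; whole-value in/out is cheap.
CREATE TABLE events_text (payload json);

-- Type 2 (hypothetical binary encoding): parsed once at input, so pulling
-- out a piece would not reparse the whole document:
CREATE TABLE events_bin (payload jsonb);
-- SELECT payload -> 'user_id' FROM events_bin;
```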
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Dec 12, 2011 at 7:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Dec 5, 2011 at 3:12 PM, Bruce Momjian <bruce@momjian.us> wrote:
Where are we with adding JSON for Postgres 9.2? We got bogged down in
the data representation last time we discussed this.

We're waiting for you to send a patch that resolves all
previously-raised issues. :-)

In all seriousness, I think the right long-term answer here is to have
two data types - one that simply validates JSON and stores it as text,
and the other of which uses some binary encoding. The first would be
similar to our existing xml datatype and would be suitable for cases
when all or nearly all of your storage and retrieval operations will
be full-column operations, and the json type is basically just
providing validation. The second would be optimized for pulling out
(or, perhaps, replacing) pieces of arrays or hashes, but would have
additional serialization/deserialization overhead when working with
the entire value. As far as I can see, these could be implemented
independently of each other and in either order, but no one seems to
have yet found the round tuits.
Rather than fuss with specific data formats, why not implement
something a little more useful?
At present we can have typmods passed as a cstring, so it should be
possible to add typmods onto the TEXT data type.
e.g. TEXT('JSON'), TEXT('JSONB')
We then invent a new catalog table called pg_text_format which has
oid PRIMARY KEY
textformatname UNIQUE
textformatvalidationproc
textformatstorageproc
The typmod must reduce to a single integer, so we just store the
integer. If no typmod, we store 0, so we have a fastpath for normal
TEXT datatypes.
This would then allow people to have variations of the TEXT type that
supports conversions, casts, indexing etc without additional fuss and
without everything else outside the database breaking because it
doesn't know that datatype name.
We could then support JSON (both kinds), YAML, etc
as well as providing a way to add validation into the datatype itself.
We can replace citext with TEXT('CASE_INSENSITIVE')
Think of this as using the object-relational capabilities of Postgres
to extend the TEXT data type.
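A sketch of what the proposed catalog and typmod syntax might look like, purely to make the idea concrete; nothing below exists, and the column types are guesses:

```sql
-- Hypothetical catalog from the proposal above:
CREATE TABLE pg_text_format (
    oid                       oid  PRIMARY KEY,
    textformatname            name UNIQUE,
    textformatvalidationproc  regproc,
    textformatstorageproc     regproc
);

-- Hypothetical usage: the typmod selects a registered format; 0 = plain TEXT.
CREATE TABLE docs  (body text('JSON'));
CREATE TABLE names (n    text('CASE_INSENSITIVE'));
```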
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Dec 12, 2011 at 3:38 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
Rather than fuss with specific data formats, why not implement
something a little more useful?

At present we can have typmods passed as a cstring, so it should be
possible to add typmods onto the TEXT data type.

e.g. TEXT('JSON'), TEXT('JSONB')
We then invent a new catalog table called pg_text_format which has
oid PRIMARY KEY
textformatname UNIQUE
textformatvalidationproc
textformatstorageproc

The typmod must reduce to a single integer, so we just store the
integer. If no typmod, we store 0, so we have a fastpath for normal
TEXT datatypes.

This would then allow people to have variations of the TEXT type that
supports conversions, casts, indexing etc without additional fuss and
without everything else outside the database breaking because it
doesn't know that datatype name.

We could then support JSON (both kinds), YAML, etc
as well as providing a way to add validation into the datatype itself.
We can replace citext with TEXT('CASE_INSENSITIVE')

Think of this as using the object-relational capabilities of Postgres
to extend the TEXT data type.
Well, it's arguable that text-format JSON and YAML and our existing
XML datatype ought to share some structure with text, but
binary-format JSON is a different animal altogether; you might as well
propose having text('int8'). In any case, I doubt that trying to make
the typmod provide subclassing behavior is going to work out very
well. There are way too many places that assume that the typmod can
just be discarded. I don't think that's going to fly, because
=(text,text) probably has different semantics from =(json,json).
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Dec 12, 2011 at 8:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:
There are way too many places that assume that the typmod can
just be discarded.
If true, that probably ought to be documented because it sounds fairly important.
Where and when is it true?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Dec 12, 2011 at 4:08 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Mon, Dec 12, 2011 at 8:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:
There are way too many places that assume that the typmod can
just be discarded.

If true, that probably ought to be documented because it sounds fairly important.
Where and when is it true?
I'm not going to go compile an exhaustive list, since that would take
a week and I don't have any particular desire to invest that much time
in it, but just to take a couple of simple examples:
rhaas=# create or replace function wuzzle(numeric(5,2)) returns int as
$$select 1$$ language sql;
CREATE FUNCTION
rhaas=# \df wuzzle
List of functions
Schema | Name | Result data type | Argument data types | Type
--------+--------+------------------+---------------------+--------
public | wuzzle | numeric | | normal
public | wuzzle | integer | numeric | normal
(2 rows)
rhaas=# select pg_typeof(1.23::numeric(5,2));
pg_typeof
-----------
numeric
(1 row)
There are a very large number of others. Possibly grepping for places
where we do getBaseType() rather than getBaseTypeAndTypmod() would be
a way to find some of them.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, 2011-12-12 at 21:08 +0000, Simon Riggs wrote:
On Mon, Dec 12, 2011 at 8:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:
There are way too many places that assume that the typmod can
just be discarded.

If true, that probably ought to be documented because it sounds fairly important.
Where and when is it true?
Function arguments and return values, for example.
On Dec 12, 2011, at 12:54 PM, Robert Haas wrote:
I don't think that's going to fly, because
=(text,text) probably has different semantics from =(json,json).
No question:
david=# select '{"foo": 1, "bar": 2}'::json = '{"bar": 2, "foo": 1}'::json;
?column?
----------
t
(1 row)
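The same distinction can be shown outside the database in plain JavaScript: text equality compares the raw strings, while a json equality has to compare the parsed structures. The deepEqual helper below is our own naive illustration, not how any real json type implements comparison:

```javascript
const a = '{"foo": 1, "bar": 2}';
const b = '{"bar": 2, "foo": 1}';

// Text equality: different key order means different strings.
console.log(a === b); // false

// Structural equality: compare the parsed values instead.
// (Naive sketch; a real implementation needs care with numbers, null, etc.)
function deepEqual(x, y) {
  if (x === y) return true;
  if (typeof x !== 'object' || typeof y !== 'object' || x === null || y === null) return false;
  const kx = Object.keys(x).sort(), ky = Object.keys(y).sort();
  if (kx.length !== ky.length) return false;
  return kx.every((k, i) => k === ky[i] && deepEqual(x[k], y[k]));
}
console.log(deepEqual(JSON.parse(a), JSON.parse(b))); // true
```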
Best,
David
On 12/12/2011 03:54 PM, Robert Haas wrote:
On Mon, Dec 12, 2011 at 3:38 PM, Simon Riggs<simon@2ndquadrant.com> wrote:
Rather than fuss with specific data formats, why not implement
something a little more useful?

At present we can have typmods passed as a cstring, so it should be
possible to add typmods onto the TEXT data type.

e.g. TEXT('JSON'), TEXT('JSONB')
[...]
There are way too many places that assume that the typmod can
just be discarded. I don't think that's going to fly, because
=(text,text) probably has different semantics from =(json,json).
And certain places where they are not allowed at all, I think (unless I
am misremembering the early debates about enum types and output functions).
cheers
andrew
Bruce,
I thought that Joseph Adams was still working on this, sponsored by
Heroku. Joseph?
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
We reached out to Joseph to see if we could help sponsor the project,
but never really heard back from him.
Because we haven't heard from him in a while we've been using PL/V8 to
validate a JSON datatype simulated by a DOMAIN with a simple
acceptance function. (See below.) This is not ideally performant but
thanks to V8's JIT the JSON parser is actually reasonably good.
I think releasing something simple and non-performant with reasonable
semantics would be the best next step. If it were up to me, I'd
probably even try to just land PL/V8 as PL/JavaScript for 9.2 if the
crash bugs and deal breakers can be sifted out.
PL/V8 is fast, it's sandboxed, and while it doesn't provide GIN or
GIST operators out of the box, maybe those could be motivated by its
inclusion.
Andrew, you've been down in the guts here, what do you think?
-pvh
(Code sample.)
create or replace function valid_json(json text)
returns bool as $$
try { JSON.parse(json); return true }
catch(e) { return false}
$$ LANGUAGE plv8 IMMUTABLE STRICT;
select valid_json('{"key": "value"}'), valid_json('lol');
valid_json | t
valid_json | f
Time: 0.283 ms
create domain json
as text check(valid_json(VALUE));
create table jsononly(data json);
insert into jsononly values ('lol');
ERROR:  value for domain json violates check constraint "json_check"
insert into jsononly
values ('{"ok": true}');
INSERT 0 1
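Since the function body above is ordinary JavaScript, the same check can be exercised outside the database; a standalone equivalent (the validJson name is ours):

```javascript
// Same validation logic as the plv8 valid_json function above.
function validJson(s) {
  try { JSON.parse(s); return true; }
  catch (e) { return false; }
}

console.log(validJson('{"key": "value"}')); // true
console.log(validJson('lol'));              // false
```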
-p
On Mon, Dec 12, 2011 at 1:34 PM, Josh Berkus <josh@agliodbs.com> wrote:
Bruce,
I thought that Joseph Adams was still working on this, sponsored by
Heroku. Joseph?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
--
Peter van Hardenberg
San Francisco, California
"Everything was beautiful, and nothing hurt." -- Kurt Vonnegut
On Dec 12, 2011, at 4:51 PM, Peter van Hardenberg wrote:
Because we haven't heard from him in a while we've been using PL/V8 to
validate a JSON datatype simulated by a DOMAIN with a simple
acceptance function. (See below.) This is not ideally performant but
thanks to V8's JIT the JSON parser is actually reasonably good.

I think releasing something simple and non-performant with reasonable
semantics would be the best next step. If it were up to me, I'd
probably even try to just land PL/V8 as PL/JavaScript for 9.2 if the
crash bugs and deal breakers can be sifted out.
Note that Claes Jakobsson has been working on a JSON data type using the Jansson JSON library.
I’ve submitted a pull request renaming it to jansson-json (though the data type is still "json"):
https://github.com/theory/pg-json/tree/pgxn
Anyway, it seems like a decent start to an extensible type implemented entirely as an extension. Claes tells me he plans to add index support soonish, so it could get to be pretty robust before long.
Just another stab at the problem to alert folks to.
Best,
David
On 12/12/2011 07:51 PM, Peter van Hardenberg wrote:
We reached out to Joseph to see if we could help sponsor the project,
but never really heard back from him.

Because we haven't heard from him in a while we've been using PL/V8 to
validate a JSON datatype simulated by a DOMAIN with a simple
acceptance function. (See below.) This is not ideally performant but
thanks to V8's JIT the JSON parser is actually reasonably good.

I think releasing something simple and non-performant with reasonable
semantics would be the best next step. If it were up to me, I'd
probably even try to just land PL/V8 as PL/JavaScript for 9.2 if the
crash bugs and deal breakers can be sifted out.

PL/V8 is fast, it's sandboxed, and while it doesn't provide GIN or
GIST operators out of the box, maybe those could be motivated by its
inclusion.

Andrew, you've been down in the guts here, what do you think?
The trouble with using JSON.parse() as a validator is that it's probably
doing way too much work. PLV8 is cool, and I keep trying to get enough
time to work on it more, but I don't think it's a substitute for a JSON
type with a purpose built validator and some native operations. I think
these efforts can continue in parallel.
cheers
andrew
On Mon, Dec 12, 2011 at 4:51 PM, Peter van Hardenberg <pvh@pvh.ca> wrote:
PL/V8 is fast, it's sandboxed, and while it doesn't provide GIN or
GIST operators out of the box, maybe those could be motivated by its
inclusion.
I also feel that a big problem with JSON as a data type is that there
is not a powerful, common navigation method. JSON path is basically
pretty obscure by comparison to XPath. As a result, the common
approach to navigation in a JSON structure is basically "write
procedures". And that is only perfectly supported by a full-blown
interpreter.
So that's why I'm personally more inclined to lend my attention to
embedding JavaScript entirely. Not to say there aren't areas ripe for
improvement:
* It'd be nice to pass intermediate in-memory representations rather
than calling JSON.parse all the time, similar to OPAQUE except sound
(so bogus pointers cannot be passed). Basically, an "ephemeral type".
It could save a lot of time when composing operators. I've needed this
for other projects, but for much the same reason.
* It'd be nice to be able to safely define indexes in a trusted
language somehow, writing the penalty and split functions, et al.
Right now it's my recollection that defining GiST operators in a naive
port to Javascript would give you the power to return garbage that is
not merely wrong, but could also crash Postgres if it uses a bogus
index. Ready and willing to be corrected*
* Some kind of partial toasting of large datums (I think Simon Riggs
briefly glossed over such an idea when we were talking about this
general use case)
But nothing I can quickly identify in the Postgres as-is is opposed to
any of these improvements at a design level, so they can be chipped
off into incremental work in the future.
--
fdr
* Madness, you say? http://bellard.org/jslinux/, if your browser is
new enough. The relevant spec:
https://www.khronos.org/registry/typedarray/specs/latest/
On Mon, Dec 12, 2011 at 5:36 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
The trouble with using JSON.parse() as a validator is that it's probably
doing way too much work. PLV8 is cool, and I keep trying to get enough time
to work on it more, but I don't think it's a substitute for a JSON type with
a purpose built validator and some native operations. I think these efforts
can continue in parallel.
Hmm. Maybe? While I'm sure things could be faster, we've had results
that are fast enough to be usable even with constant reparsing. Here
are some microbenchmarks I did some time ago where I tried to find the
overhead of calling JSON.parse and doing some really simple stuff in
V8 that I thought would maximize the amount of constant-time overhead:
https://gist.github.com/1150804
On my workstation, one core was able to do 130,000 JSON.parses + other
stuff necessary to create an index per second. One could maybe try to
improve the speed and memory footprint on large documents by having
validators that don't actually build the V8 representation and
possibly defining a space of operators that are known to build, by
induction, valid JSON without rechecks.
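For reference, a minimal sketch of this kind of microbenchmark; it is not the linked gist, and the numbers it prints will vary by machine:

```javascript
// Rough microbenchmark: JSON.parse plus a trivial key extraction,
// approximating the per-row work of building an index over a JSON field.
const doc = '{"user_id": 42, "tags": ["a", "b"], "ok": true}';
const iterations = 100000;

const start = Date.now();
let checksum = 0;
for (let i = 0; i < iterations; i++) {
  const parsed = JSON.parse(doc);  // reparse every time, as a pure validator would
  checksum += parsed.user_id;      // stand-in for an index expression
}
const elapsedMs = Date.now() - start;
console.log(iterations + ' parses in ' + elapsedMs + ' ms');
```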
But in the end, I think there's already a class of problem where the
performance plv8 provides is already quite sufficient, and provides a
much more complete and familiar approach to the problem of how people
choose to navigate, project, and manipulate JSON documents.
I also haven't tried this for larger documents, as I was trying to get
a sense of how much time was spent in a few primitive operations, and
not testing performance with regard to document length.
--
fdr
On 12/12/2011 08:46 PM, Daniel Farina wrote:
On Mon, Dec 12, 2011 at 5:36 PM, Andrew Dunstan<andrew@dunslane.net> wrote:
The trouble with using JSON.parse() as a validator is that it's probably
doing way too much work. PLV8 is cool, and I keep trying to get enough time
to work on it more, but I don't think it's a substitute for a JSON type with
a purpose built validator and some native operations. I think these efforts
can continue in parallel.

Hmm. Maybe? While I'm sure things could be faster, we've had results
that are fast enough to be usable even with constant reparsing. Here
are some microbenchmarks I did some time ago where I tried to find the
overhead of calling JSON.parse and doing some really simple stuff in
V8 that I thought would maximize the amount of constant-time overhead:

https://gist.github.com/1150804
On my workstation, one core was able to do 130,000 JSON.parses + other
stuff necessary to create an index per second. One could maybe try to
improve the speed and memory footprint on large documents by having
validators that don't actually build the V8 representation and
possibly defining a space of operators that are known to build, by
induction, valid JSON without rechecks.

But in the end, I think there's already a class of problem where the
performance plv8 provides is already quite sufficient, and provides a
much more complete and familiar approach to the problem of how people
choose to navigate, project, and manipulate JSON documents.

I also haven't tried this for larger documents, as I was trying to get
a sense of how much time was spent in a few primitive operations, and
not testing performance with regard to document length.
Yes, I didn't mean to say it's not fast. For many cases I agree it is
probably fast enough.
But in the end PLV8 is likely to remain an addon - nice as it can be I
doubt the core team will want to add another PL to the core code,
especially one written in C++. If we want a JSON type built in, which
many people seem to want, we'll need to do it without the support of
PLV8, I think.
cheers
andrew
Excerpts from Daniel Farina's message of Mon Dec 12 22:37:13 -0300 2011:
* It'd be nice to pass intermediate in-memory representations rather
than calling JSON.parse all the time, similar to OPAQUE except sound
(so bogus pointers cannot be passed). Basically, an "ephemeral type".
It could save a lot of when composing operators. I've needed this
for other projects, but for much the same reason.
I remember there was the idea of doing something like this for regexes
-- have a specialized data type that saves the trouble of parsing it.
I imagine this is pretty much the same.
Nobody got around to doing anything about it though.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Andrew Dunstan <andrew@dunslane.net> writes:
On 12/12/2011 03:54 PM, Robert Haas wrote:
There are way too many places that assume that the typmod can
just be discarded. I don't think that's going to fly, because
=(text,text) probably has different semantics from =(json,json).
And certain places where they are not allowed at all, I think (unless I
am misremembering the early debates about enum types and output functions).
Yeah. The current system design assumes that typmod specifies a
constraint of some sort. It is not possible to use it to change the
semantics of the datatype. The most obvious way in which this is true
is that selection of which operators and functions to apply to values
does not consider typmod of the values. This is not something we should
lightly revisit. We don't even have a handle on how to make domains
behave differently from their underlying datatypes, and those *do* have
their own type OIDs. Injecting typmod into the algorithm seems like a
disaster from here.
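Tom's point about domains can be seen with ordinary SQL: a domain has its own type OID, yet operator and function lookup still falls through to the base type:

```sql
CREATE DOMAIN posint AS int CHECK (VALUE > 0);

SELECT (-1)::posint;              -- fails the check constraint on input
SELECT 2::posint + 3;             -- resolved as int + int, returns 5
SELECT pg_typeof(2::posint + 3);  -- integer, not posint
```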
regards, tom lane
On Mon, 2011-12-12 at 16:51 -0800, Peter van Hardenberg wrote:
Because we haven't heard from him in a while we've been using PL/V8 to
validate a JSON datatype simulated by a DOMAIN with a simple
acceptance function. (See below.) This is not ideally performant but
thanks to V8's JIT the JSON parser is actually reasonably good.

I think releasing something simple and non-performant with reasonable
semantics would be the best next step. If it were up to me, I'd
probably even try to just land PL/V8 as PL/JavaScript for 9.2 if the
crash bugs and deal breakers can be sifted out.
You don't need a new PL to do that. The existing PLs can also parse
JSON. So that's not nearly enough of a reason to consider adding this
new PL.
On Mon, Dec 12, 2011 at 7:37 PM, Daniel Farina <daniel@heroku.com> wrote:
On Mon, Dec 12, 2011 at 4:51 PM, Peter van Hardenberg <pvh@pvh.ca> wrote:
PL/V8 is fast, it's sandboxed, and while it doesn't provide GIN or
GIST operators out of the box, maybe those could be motivated by its
inclusion.

I also feel that a big problem with JSON as a data type is that there
is not a powerful, common navigation method. JSON path is basically
pretty obscure by comparison to XPath. As a result, the common
approach to navigation in a JSON structure is basically "write
procedures". And that is only perfectly supported by a full-blown
interpreter.
This. For me, the postgres xml extensions are 'a whole bunch of extra
stuff that comes with the xpath function'. How you get data into and
out of json is much more interesting than how the type is set up
internally or how it's parsed.
There must be some way to avoid iterative set up and tear down of json
objects (maybe as a cast?) -- postgres arrays of composites can set up
data in a way that feels very much like json in its construction.
One big reason why people might go to server side json is to try and
avoid tedious marshaling of data between client and server. The xpath
function has had its warts, but it offers very tight coupling between
your database and your documents. In the case of json, I think you
can go even further.
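A sketch of the kind of construction meant here, using only machinery that existed at the time; the type and table names are ours:

```sql
CREATE TYPE kv AS (key text, value int);

-- An array of composites already carries json-like nested structure:
SELECT ARRAY[ROW('a', 1)::kv, ROW('b', 2)::kv];
-- {"(a,1)","(b,2)"}

-- A cast from this representation to json (and back) could avoid
-- hand-marshaling documents between client and server.
```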
merlin