Proposal: Add JSON support

Started by Joseph Adams · almost 16 years ago · 52 messages
#1 Joseph Adams
joeyadams3.14159@gmail.com

I introduced myself in the thread "Proposal: access control jails (and
introduction as aspiring GSoC student)", and we discussed jails and
session-local variables. But, as Robert Haas suggested, implementing
variable support in the backend would probably be way too ambitious a
project for a newbie like me. I decided instead to pursue the task of
adding JSON support to PostgreSQL, hence the new thread.

I plan to reference datatype-xml.html and functions-xml.html in some
design decisions, but there are some things that apply to XML that
don't apply to JSON and vice versa. For instance, jsoncomment
wouldn't make sense because (standard) JSON doesn't have comments.
For access, we might have something like json_get('foo[1].bar') and
json_set('foo[1].bar', 'hello'). jsonforest and jsonagg would be
beautiful. For mapping, jsonforest/jsonagg could be used to build a
JSON string from a result set (SELECT jsonagg(jsonforest(col1, col2,
...)) FROM tbl), but I'm not sure on the best way to go the other way
around (generate a result set from JSON). CSS-style selectors would
be cool, but "selecting" is what SQL is all about, and I'm not sure
having a json_select("dom-element[key=value]") function is a good,
orthogonal approach.

I'm wondering whether the internal representation of JSON should be
plain JSON text, or some binary code that's easier to traverse and
whatnot. For the sake of code size, just keeping it in text is
probably best.

Now my thoughts and opinions on the JSON parsing/unparsing itself:

It should be built-in, rather than relying on an external library
(like XML does). Priorities of the JSON implementation, in descending
order, are:

* Small
* Correct
* Fast

Moreover, JSON operations shall not crash due to stack overflows.

I'm thinking Bison/Flex is overkill for parsing JSON (I haven't seen
any JSON implementations out there that use it anyway). I would
probably end up writing the JSON parser/serializer manually. It
should not take more than a week.

As far as character encodings, I'd rather keep that out of the JSON
parsing/serializing code itself and assume UTF-8. Wherever I'm wrong,
I'll just throw encode/decode/validate operations at it.

Thoughts? Thanks.

#2 Robert Haas
robertmhaas@gmail.com
In reply to: Joseph Adams (#1)
Re: Proposal: Add JSON support

On Sun, Mar 28, 2010 at 4:48 PM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:

I'm wondering whether the internal representation of JSON should be
plain JSON text, or some binary code that's easier to traverse and
whatnot.  For the sake of code size, just keeping it in text is
probably best.

+1 for text.

Now my thoughts and opinions on the JSON parsing/unparsing itself:

It should be built-in, rather than relying on an external library
(like XML does).

Why? I'm not saying you aren't right, but you need to make an
argument rather than an assertion. This is a community, so no one is
entitled to decide anything unilaterally, and people want to be
convinced - including me.

As far as character encodings, I'd rather keep that out of the JSON
parsing/serializing code itself and assume UTF-8.  Wherever I'm wrong,
I'll just throw encode/decode/validate operations at it.

I think you need to assume that the encoding will be the server
encoding, not UTF-8. Although others on this list are better
qualified to speak to that than I am.

...Robert

#3 Andrew Dunstan
andrew@dunslane.net
In reply to: Robert Haas (#2)
Re: Proposal: Add JSON support

Robert Haas wrote:

On Sun, Mar 28, 2010 at 4:48 PM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:

I'm wondering whether the internal representation of JSON should be
plain JSON text, or some binary code that's easier to traverse and
whatnot. For the sake of code size, just keeping it in text is
probably best.

+1 for text.

Agreed.

Now my thoughts and opinions on the JSON parsing/unparsing itself:

It should be built-in, rather than relying on an external library
(like XML does).

Why? I'm not saying you aren't right, but you need to make an
argument rather than an assertion. This is a community, so no one is
entitled to decide anything unilaterally, and people want to be
convinced - including me.

Yeah, why? We should not be in the business of reinventing the wheel
(and then maintaining the reinvented wheel), unless the code in question
is *really* small.

As far as character encodings, I'd rather keep that out of the JSON
parsing/serializing code itself and assume UTF-8. Wherever I'm wrong,
I'll just throw encode/decode/validate operations at it.

I think you need to assume that the encoding will be the server
encoding, not UTF-8. Although others on this list are better
qualified to speak to that than I am.

The trouble is that JSON is defined to be specifically Unicode, and in
practice for us that means UTF8 on the server side. It could get a bit
hairy, and it's definitely not something I think you can wave away with
a simple "I'll just throw some encoding/decoding function calls at it."

cheers

andrew

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#3)
Re: Proposal: Add JSON support

Andrew Dunstan <andrew@dunslane.net> writes:

Robert Haas wrote:

I think you need to assume that the encoding will be the server
encoding, not UTF-8. Although others on this list are better
qualified to speak to that than I am.

The trouble is that JSON is defined to be specifically Unicode, and in
practice for us that means UTF8 on the server side. It could get a bit
hairy, and it's definitely not something I think you can wave away with
a simple "I'll just throw some encoding/decoding function calls at it."

It's just text, no? Are there any operations where this actually makes
a difference?

Like Robert, I'm *very* wary of trying to introduce any text storage
into the backend that is in an encoding different from server_encoding.
Even the best-case scenarios for that will involve multiple new places for
encoding conversion failures to happen.

regards, tom lane

#5 Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#4)
Re: Proposal: Add JSON support

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

Robert Haas wrote:

I think you need to assume that the encoding will be the server
encoding, not UTF-8. Although others on this list are better
qualified to speak to that than I am.

The trouble is that JSON is defined to be specifically Unicode, and in
practice for us that means UTF8 on the server side. It could get a bit
hairy, and it's definitely not something I think you can wave away with
a simple "I'll just throw some encoding/decoding function calls at it."

It's just text, no? Are there any operations where this actually makes
a difference?

If we're going to provide operations on it that might involve some. I
don't know.

Like Robert, I'm *very* wary of trying to introduce any text storage
into the backend that is in an encoding different from server_encoding.
Even the best-case scenarios for that will involve multiple new places for
encoding conversion failures to happen.

I agree entirely. All I'm suggesting is that there could be many
wrinkles here.

Here's another thought. Given that JSON is actually specified to consist
of a string of Unicode characters, what will we deliver to the client
where the client encoding is, say Latin1? Will it actually be a legal
JSON byte stream?

cheers

andrew

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#5)
Re: Proposal: Add JSON support

Andrew Dunstan <andrew@dunslane.net> writes:

Here's another thought. Given that JSON is actually specified to consist
of a string of Unicode characters, what will we deliver to the client
where the client encoding is, say Latin1? Will it actually be a legal
JSON byte stream?

No, it won't. We will *not* be sending anything but latin1 in such a
situation, and I really couldn't care less what the JSON spec says about
it. Delivering wrongly-encoded data to a client is a good recipe for
all sorts of problems, since the client-side code is very unlikely to be
expecting that. A datatype doesn't get to make up its own mind whether
to obey those rules. Likewise, data on input had better match
client_encoding, because it's otherwise going to fail the encoding
checks long before a json datatype could have any say in the matter.

While I've not read the spec, I wonder exactly what "consist of a string
of Unicode characters" should actually be taken to mean. Perhaps it
only means that all the characters must be members of the Unicode set,
not that the string can never be represented in any other encoding.
There's more than one Unicode encoding anyway...

regards, tom lane

#7 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#6)
Re: Proposal: Add JSON support

On Sun, Mar 28, 2010 at 7:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

Here's another thought. Given that JSON is actually specified to consist
of a string of Unicode characters, what will we deliver to the client
where the client encoding is, say Latin1? Will it actually be a legal
JSON byte stream?

No, it won't.  We will *not* be sending anything but latin1 in such a
situation, and I really couldn't care less what the JSON spec says about
it.  Delivering wrongly-encoded data to a client is a good recipe for
all sorts of problems, since the client-side code is very unlikely to be
expecting that.  A datatype doesn't get to make up its own mind whether
to obey those rules.  Likewise, data on input had better match
client_encoding, because it's otherwise going to fail the encoding
checks long before a json datatype could have any say in the matter.

While I've not read the spec, I wonder exactly what "consist of a string
of Unicode characters" should actually be taken to mean.  Perhaps it
only means that all the characters must be members of the Unicode set,
not that the string can never be represented in any other encoding.
There's more than one Unicode encoding anyway...

See sections 2.5 and 3 of:

http://www.ietf.org/rfc/rfc4627.txt?number=4627

...Robert

#8 Mike Rylander
mrylander@gmail.com
In reply to: Tom Lane (#6)
Re: Proposal: Add JSON support

On Sun, Mar 28, 2010 at 7:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

Here's another thought. Given that JSON is actually specified to consist
of a string of Unicode characters, what will we deliver to the client
where the client encoding is, say Latin1? Will it actually be a legal
JSON byte stream?

No, it won't.  We will *not* be sending anything but latin1 in such a
situation, and I really couldn't care less what the JSON spec says about
it.  Delivering wrongly-encoded data to a client is a good recipe for
all sorts of problems, since the client-side code is very unlikely to be
expecting that.  A datatype doesn't get to make up its own mind whether
to obey those rules.  Likewise, data on input had better match
client_encoding, because it's otherwise going to fail the encoding
checks long before a json datatype could have any say in the matter.

While I've not read the spec, I wonder exactly what "consist of a string
of Unicode characters" should actually be taken to mean.  Perhaps it
only means that all the characters must be members of the Unicode set,
not that the string can never be represented in any other encoding.
There's more than one Unicode encoding anyway...

In practice, every parser/serializer I've used (including the one I
helped write) allows (and, often, forces) any non-ASCII character to
be encoded as \u followed by a string of four hex digits.

Whether it would be easy inside the backend, when generating JSON from
user data stored in tables that are not in a UTF-8 encoded cluster, to
convert to UTF-8, that's something else entirely. If it /is/ easy and
safe, then it's just a matter of scanning for multi-byte sequences and
replacing those with their \uXXXX equivalents. I have some simple and
fast code I could share, if it's needed, though I suspect it's not.
:)
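
For illustration, a minimal sketch of that scan-and-replace (not Mike's
actual code), assuming already-validated UTF-8 input with no code points
above U+FFFF (those would need the surrogate pairs discussed later in
this thread):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Escape non-ASCII UTF-8 as \uXXXX.  Assumes the input has already
 * been validated and contains no code points above U+FFFF.  Caller
 * frees the result. */
char *
utf8_to_ascii_json(const char *in)
{
    const unsigned char *p = (const unsigned char *) in;
    char *out = malloc(strlen(in) * 6 + 1);   /* worst case: \uXXXX */
    char *o = out;

    if (out == NULL)
        return NULL;
    while (*p)
    {
        if (*p < 0x80)
            *o++ = (char) *p++;               /* ASCII passes through */
        else if ((*p & 0xE0) == 0xC0)         /* two-byte sequence */
        {
            o += sprintf(o, "\\u%04X",
                         ((p[0] & 0x1F) << 6) | (p[1] & 0x3F));
            p += 2;
        }
        else                                  /* three-byte sequence */
        {
            o += sprintf(o, "\\u%04X",
                         ((p[0] & 0x0F) << 12) |
                         ((p[1] & 0x3F) << 6) | (p[2] & 0x3F));
            p += 3;
        }
    }
    *o = '\0';
    return out;
}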

UPDATE: Thanks, Robert, for pointing to the RFC.

--
Mike Rylander
| VP, Research and Design
| Equinox Software, Inc. / The Evergreen Experts
| phone: 1-877-OPEN-ILS (673-6457)
| email: miker@esilibrary.com
| web: http://www.esilibrary.com

#9 Robert Haas
robertmhaas@gmail.com
In reply to: Mike Rylander (#8)
Re: Proposal: Add JSON support

On Sun, Mar 28, 2010 at 8:23 PM, Mike Rylander <mrylander@gmail.com> wrote:

In practice, every parser/serializer I've used (including the one I
helped write) allows (and, often, forces) any non-ASCII character to
be encoded as \u followed by a string of four hex digits.

Is it correct to say that the only feasible place where non-ASCII
characters can be used is within string constants? If so, it might be
reasonable to disallow characters with the high-bit set unless the
server encoding is one of the flavors of Unicode of which the spec
approves. I'm tempted to think that when the server encoding is
Unicode we really ought to allow Unicode characters natively, because
turning a long string of two-byte wide chars into a long string of
six-byte wide chars sounds pretty evil from a performance point of
view.
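
For concreteness, a minimal sketch of that proposed input check, with
hypothetical names (a real implementation would consult the backend's
GetDatabaseEncoding() rather than take a flag):

#include <stdbool.h>

/* Hypothetical check: in a non-Unicode server encoding, accept only
 * 7-bit ASCII JSON; under a Unicode server encoding, pass everything
 * through natively. */
bool
json_bytes_acceptable(const unsigned char *s, bool server_is_utf8)
{
    if (server_is_utf8)
        return true;            /* native Unicode text is fine as-is */
    for (; *s; s++)
        if (*s & 0x80)          /* high bit set: not plain ASCII */
            return false;
    return true;
}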

...Robert

#10 Andrew Dunstan
andrew@dunslane.net
In reply to: Robert Haas (#9)
Re: Proposal: Add JSON support

Robert Haas wrote:

On Sun, Mar 28, 2010 at 8:23 PM, Mike Rylander <mrylander@gmail.com> wrote:

In practice, every parser/serializer I've used (including the one I
helped write) allows (and, often, forces) any non-ASCII character to
be encoded as \u followed by a string of four hex digits.

Is it correct to say that the only feasible place where non-ASCII
characters can be used is within string constants? If so, it might be
reasonable to disallow characters with the high-bit set unless the
server encoding is one of the flavors of Unicode of which the spec
approves. I'm tempted to think that when the server encoding is
Unicode we really ought to allow Unicode characters natively, because
turning a long string of two-byte wide chars into a long string of
six-byte wide chars sounds pretty evil from a performance point of
view.

We support exactly one unicode encoding on the server side: utf8.

And the maximum possible size of a validly encoded unicode char in utf8
is 4 (and that's pretty rare, IIRC).

cheers

andrew

#11 Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#10)
Re: Proposal: Add JSON support

Andrew Dunstan wrote:

Robert Haas wrote:

On Sun, Mar 28, 2010 at 8:23 PM, Mike Rylander <mrylander@gmail.com>
wrote:

In practice, every parser/serializer I've used (including the one I
helped write) allows (and, often, forces) any non-ASCII character to
be encoded as \u followed by a string of four hex digits.

Is it correct to say that the only feasible place where non-ASCII
characters can be used is within string constants? If so, it might be
reasonable to disallow characters with the high-bit set unless the
server encoding is one of the flavors of Unicode of which the spec
approves. I'm tempted to think that when the server encoding is
Unicode we really ought to allow Unicode characters natively, because
turning a long string of two-byte wide chars into a long string of
six-byte wide chars sounds pretty evil from a performance point of
view.

We support exactly one unicode encoding on the server side: utf8.

And the maximum possible size of a validly encoded unicode char in
utf8 is 4 (and that's pretty rare, IIRC).

Sorry. Disregard this. I see what you mean.

Yeah, I think *requiring* non-ASCII characters to be escaped would be evil.

cheers

andrew

#12 Mike Rylander
mrylander@gmail.com
In reply to: Robert Haas (#9)
Re: Proposal: Add JSON support

On Sun, Mar 28, 2010 at 8:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Mar 28, 2010 at 8:23 PM, Mike Rylander <mrylander@gmail.com> wrote:

In practice, every parser/serializer I've used (including the one I
helped write) allows (and, often, forces) any non-ASCII character to
be encoded as \u followed by a string of four hex digits.

Is it correct to say that the only feasible place where non-ASCII
characters can be used is within string constants?

Yes. That includes object property strings -- they are quoted string literals.

If so, it might be
reasonable to disallow characters with the high-bit set unless the
server encoding is one of the flavors of Unicode of which the spec
approves.  I'm tempted to think that when the server encoding is
Unicode we really ought to allow Unicode characters natively, because
turning a long string of two-byte wide chars into a long string of
six-byte wide chars sounds pretty evil from a performance point of
view.

+1

As an aside, \u-encoded (escaped) characters and native multi-byte
sequences (of any RFC-allowable Unicode encoding) are exactly
equivalent in JSON -- it's a storage and transmission format, and
doesn't prescribe the application-internal representation of the data.

If it's faster (which it almost certainly is) to not mangle the data
when it's all staying server side, that seems like a useful
optimization. For output to the client, however, it would be useful
to provide a \u-escaping function, which (AIUI) should always be safe
regardless of client encoding.

--
Mike Rylander
| VP, Research and Design
| Equinox Software, Inc. / The Evergreen Experts
| phone: 1-877-OPEN-ILS (673-6457)
| email: miker@esilibrary.com
| web: http://www.esilibrary.com

#13 Joseph Adams
joeyadams3.14159@gmail.com
In reply to: Robert Haas (#2)
Re: Proposal: Add JSON support

On Sun, Mar 28, 2010 at 5:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Mar 28, 2010 at 4:48 PM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:

Now my thoughts and opinions on the JSON parsing/unparsing itself:

It should be built-in, rather than relying on an external library
(like XML does).

Why?  I'm not saying you aren't right, but you need to make an
argument rather than an assertion.  This is a community, so no one is
entitled to decide anything unilaterally, and people want to be
convinced - including me.

I apologize; I was just starting the conversation with some of my
ideas to receive feedback. I didn't want people to have to wade
through too many "I think"s. I'll be sure to use <opinion> tags in
the future :-)

My reasoning for "It should be built-in" is:
* It would be nice to have a built-in serialization format that's
available by default.
* It might be a little faster because it doesn't have to link to an
external library.
* The code to interface between JSON logic and PostgreSQL will
probably be much larger than the actual JSON encoding/decoding itself.
* The externally-maintained and packaged libjson implementations I
saw brought in lots of dependencies (e.g. glib).
* "Everyone else" (e.g. PHP) uses a statically-linked JSON implementation.

Is the code in question "*really*" small? Well, not really, but it's
not enormous either. By the way, I found a bug in PHP's JSON_parser
(json_decode("true "); /* with a space */ returns null instead of
true). I'll have to get around to reporting that.

Now, assuming JSON support is built-in to PostgreSQL and is enabled by
default, it is my opinion that encoding issues should not be dealt
with in the JSON code itself, but that the JSON code itself should
assume UTF-8. I think conversions should be done to/from UTF-8 before
passing it through the JSON code because this would likely be the
smallest way to implement it (not necessarily the fastest, though).

Mike Rylander pointed out something wonderful, and that is that JSON
code can be stored in plain old ASCII using \u... . If a target
encoding supports all of Unicode, the JSON serializer could be told
not to generate \u escapes. Otherwise, the \u escapes would be
necessary.

Thus, here's an example of how (in my opinion) character sets and such
should be handled in the JSON code:

Suppose the client's encoding is UTF-16, and the server's encoding is
Latin-1. When JSON is stored to the database:
1. The client is responsible and sends a valid UTF-16 JSON string.
2. PostgreSQL checks to make sure it is valid UTF-16, then converts
it to UTF-8.
3. The JSON code parses it (to ensure it's valid).
4. The JSON code unparses it (to get a representation without
needless whitespace). It is given a flag indicating it should only
output ASCII text.
5. The ASCII is stored in the server, since it is valid Latin-1.

When JSON is retrieved from the database:
1. ASCII is retrieved from the server
2. If user needs to extract one or more fields, the JSON is parsed,
and the fields are extracted.
3. Otherwise, the JSON text is converted to UTF-16 and sent to the client.

Note that I am being biased toward optimizing code size rather than speed.

Here's a question about semantics: should converting JSON to text
guarantee that Unicode will be \u escaped, or should it render actual
Unicode whenever possible (when the client uses a Unicode-complete
charset) ?

As for reinventing the wheel, I'm in the process of writing yet
another JSON implementation simply because I didn't find the other
ones I looked at palatable. I am aiming for simple code, not fast
code. I am using malloc for structures and realloc for strings/arrays
rather than resorting to clever buffering tricks. Of course, I'll
switch it over to palloc/repalloc before migrating it to PostgreSQL.

#14 Hitoshi Harada
umi.tanuki@gmail.com
In reply to: Andrew Dunstan (#3)
Re: Proposal: Add JSON support

2010/3/29 Andrew Dunstan <andrew@dunslane.net>:

Robert Haas wrote:

On Sun, Mar 28, 2010 at 4:48 PM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:

I'm wondering whether the internal representation of JSON should be
plain JSON text, or some binary code that's easier to traverse and
whatnot.  For the sake of code size, just keeping it in text is
probably best.

+1 for text.

Agreed.

There's another choice, called BSON.

http://www.mongodb.org/display/DOCS/BSON

I've not researched it deeply yet, but it seems reasonable for storing
in databases, as it was invented for MongoDB.

Now my thoughts and opinions on the JSON parsing/unparsing itself:

It should be built-in, rather than relying on an external library
(like XML does).

Why?  I'm not saying you aren't right, but you need to make an
argument rather than an assertion.  This is a community, so no one is
entitled to decide anything unilaterally, and people want to be
convinced - including me.

Yeah, why? We should not be in the business of reinventing the wheel (and
then maintaining the reinvented wheel), unless the code in question is
*really* small.

The many implementations of JSON in many languages show that parsing
JSON is not so difficult to code, and that the needs vary. Hence, I
wonder if we could have our very own.

Don't take this wrongly: I'm not against the text format, nor against
using an external library.

Regards,

--
Hitoshi Harada

#15 Robert Haas
robertmhaas@gmail.com
In reply to: Joseph Adams (#13)
Re: Proposal: Add JSON support

On Sun, Mar 28, 2010 at 11:24 PM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:

I apologize; I was just starting the conversation with some of my
ideas to receive feedback.  I didn't want people to have to wade
through too many "I think"s .  I'll be sure to use <opinion> tags in
the future :-)

FWIW, I don't care at all whether you say "I think" or "I know"; the
point is that you have to provide backup for any position you choose
to take.

My reasoning for "It should be built-in" is:
 * It would be nice to have a built-in serialization format that's
available by default.
 * It might be a little faster because it doesn't have to link to an
external library.

I don't think either of these reasons is valid.

 * The code to interface between JSON logic and PostgreSQL will
probably be much larger than the actual JSON encoding/decoding itself.

If true, this is a good argument.

 * The externally-maintained and packaged libjson implementations I
saw brought in lots of dependencies (e.g. glib).

As is this.

 * "Everyone else" (e.g. PHP) uses a statically-linked JSON implementation.

But this isn't.

Is the code in question "*really*" small?  Well, not really, but it's
not enormous either.  By the way, I found a bug in PHP's JSON_parser
(json_decode("true "); /* with a space */ returns null instead of
true).  I'll have to get around to reporting that.

Now, assuming JSON support is built-in to PostgreSQL and is enabled by
default, it is my opinion that encoding issues should not be dealt
with in the JSON code itself, but that the JSON code itself should
assume UTF-8.  I think conversions should be done to/from UTF-8 before
passing it through the JSON code because this would likely be the
smallest way to implement it (not necessarily the fastest, though).

Mike Rylander pointed out something wonderful, and that is that JSON
code can be stored in plain old ASCII using \u... .  If a target
encoding supports all of Unicode, the JSON serializer could be told
not to generate \u escapes.  Otherwise, the \u escapes would be
necessary.

Thus, here's an example of how (in my opinion) character sets and such
should be handled in the JSON code:

Suppose the client's encoding is UTF-16, and the server's encoding is
Latin-1.  When JSON is stored to the database:
 1. The client is responsible and sends a valid UTF-16 JSON string.
 2. PostgreSQL checks to make sure it is valid UTF-16, then converts
it to UTF-8.
 3. The JSON code parses it (to ensure it's valid).
 4. The JSON code unparses it (to get a representation without
needless whitespace).  It is given a flag indicating it should only
output ASCII text.
 5. The ASCII is stored in the server, since it is valid Latin-1.

When JSON is retrieved from the database:
 1. ASCII is retrieved from the server
 2. If user needs to extract one or more fields, the JSON is parsed,
and the fields are extracted.
 3. Otherwise, the JSON text is converted to UTF-16 and sent to the client.

Note that I am being biased toward optimizing code size rather than speed.

Can you comment on my proposal elsewhere on this thread and compare
your proposal to mine? In what ways are they different, and which is
better, and why?

Here's a question about semantics: should converting JSON to text
guarantee that Unicode will be \u escaped, or should it render actual
Unicode whenever possible (when the client uses a Unicode-complete
charset) ?

I feel pretty strongly that the data should be stored in the database
in the format in which it will be returned to the user - any
conversion which is necessary should happen on the way in. I am not
100% sure to what extent we should attempt to canonicalize the input
and to what extent we should simply store it in whichever way the user
chooses to provide it.

As for reinventing the wheel, I'm in the process of writing yet
another JSON implementation simply because I didn't find the other
ones I looked at palatable.  I am aiming for simple code, not fast
code.  I am using malloc for structures and realloc for strings/arrays
rather than resorting to clever buffering tricks.  Of course, I'll
switch it over to palloc/repalloc before migrating it to PostgreSQL.

I'm not sure that optimizing for simplicity over speed is a good idea.
I think we can reject implementations as unpalatable because they are
slow or feature-poor or have licensing issues or are not actively
maintained, but rejecting them because they use complex code in order
to be fast doesn't seem like the right trade-off to me.

...Robert

#16 Dimitri Fontaine
dfontaine@hi-media.com
In reply to: Joseph Adams (#13)
Re: Proposal: Add JSON support

Hi,

Joseph Adams <joeyadams3.14159@gmail.com> writes:

As for reinventing the wheel, I'm in the process of writing yet
another JSON implementation simply because I didn't find the other
ones I looked at palatable.

Even this one (ANSI C, MIT licensed)?

cJSON -- An ultra-lightweight, portable, single-file, simple-as-can-be
ANSI-C compliant JSON parser, under MIT license.

http://sourceforge.net/projects/cjson/
http://cjson.svn.sourceforge.net/viewvc/cjson/README?revision=7&view=markup
http://cjson.svn.sourceforge.net/viewvc/cjson/cJSON.c?revision=33&view=markup

And from the cJSON.h we read that it could be somewhat easy to integrate
into PostgreSQL's memory management:
typedef struct cJSON_Hooks {
    void *(*malloc_fn)(size_t sz);
    void (*free_fn)(void *ptr);
} cJSON_Hooks;
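
A minimal sketch of wiring those hooks to the backend allocator,
assuming cJSON's cJSON_InitHooks() entry point (the wrapper names here
are hypothetical):

#include "postgres.h"           /* palloc, pfree */
#include "cJSON.h"

static void *
json_palloc(size_t sz)
{
    return palloc(sz);          /* allocates in CurrentMemoryContext */
}

static void
json_pfree(void *ptr)
{
    pfree(ptr);
}

void
json_init_allocator(void)
{
    cJSON_Hooks hooks;

    hooks.malloc_fn = json_palloc;
    hooks.free_fn = json_pfree;
    cJSON_InitHooks(&hooks);
}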

Just adding some data points, hoping that's not adding only confusion.

Regards,
--
dim

#17 Peter Eisentraut
peter_e@gmx.net
In reply to: Joseph Adams (#13)
Re: Proposal: Add JSON support

On Sun, 2010-03-28 at 23:24 -0400, Joseph Adams wrote:

Thus, here's an example of how (in my opinion) character sets and such
should be handled in the JSON code:

Suppose the client's encoding is UTF-16, and the server's encoding is
Latin-1. When JSON is stored to the database:
1. The client is responsible and sends a valid UTF-16 JSON string.
2. PostgreSQL checks to make sure it is valid UTF-16, then converts
it to UTF-8.
3. The JSON code parses it (to ensure it's valid).
4. The JSON code unparses it (to get a representation without
needless whitespace). It is given a flag indicating it should only
output ASCII text.
5. The ASCII is stored in the server, since it is valid Latin-1.

When JSON is retrieved from the database:
1. ASCII is retrieved from the server
2. If user needs to extract one or more fields, the JSON is parsed,
and the fields are extracted.
3. Otherwise, the JSON text is converted to UTF-16 and sent to the client.

The problem I see here is that a data type output function is normally
not aware of the client encoding. The alternatives that I see are that
you always escape everything to plain ASCII, so it's valid in every
server encoding, but that would result in pretty sad behavior for users
of languages that don't use a lot of ASCII characters; or you decree a
nonstandard JSON variant that simply uses whatever encoding you decide.

#18 Andrew Dunstan
andrew@dunslane.net
In reply to: Robert Haas (#15)
Re: Proposal: Add JSON support

Robert Haas wrote:

I feel pretty strongly that the data should be stored in the database
in the format in which it will be returned to the user - any
conversion which is necessary should happen on the way in. I am not
100% sure to what extent we should attempt to canonicalize the input
and to what extent we should simply store it in whichever way the user
chooses to provide it.

ISTM that implies that, with a possible exception when the server
encoding is utf8, you would have to \u escape the data on the way in
fairly pessimistically.

I'd be inclined to say we should store and validate it exactly as the
client gives it to us (converted to the server encoding, as it would be,
of course). In practice that would mean that for non-utf8 databases the
client would need to \u escape it. I suspect most uses of this would be
in utf8-encoded databases anyway.

I also think we should provide a function to do the escaping, so users
could do something like:

insert into foo (myjson) values (json_escape('some json text here'));

I also thought about a switch to turn on \u escaping on output - that
might be useful for pg_dump for instance.

cheers

andrew

#19 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#15)
Re: Proposal: Add JSON support

Robert Haas <robertmhaas@gmail.com> writes:

On Sun, Mar 28, 2010 at 11:24 PM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:

My reasoning for "It should be built-in" is:
* It would be nice to have a built-in serialization format that's
available by default.
* It might be a little faster because it doesn't have to link to an
external library.

I don't think either of these reasons is valid.

FWIW, our track record with relying on external libraries has been less
than great --- "upstream will maintain it" sounds good but has fallen
over with respect to both the regex engine and the snowball stemmers,
to take two examples. And libxml2 has been nothing but a source of pain.

If this is going to end up being one fairly small C file implementing
a spec that is not a moving target, I'd vote against depending on an
external library instead, no matter how spiffy and license-compatible
the external library might be.

regards, tom lane

#20 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#19)
Re: Proposal: Add JSON support

On Mon, Mar 29, 2010 at 12:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Sun, Mar 28, 2010 at 11:24 PM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:

My reasoning for "It should be built-in" is:
 * It would be nice to have a built-in serialization format that's
available by default.
 * It might be a little faster because it doesn't have to link to an
external library.

I don't think either of these reasons is valid.

FWIW, our track record with relying on external libraries has been less
than great --- "upstream will maintain it" sounds good but has fallen
over with respect to both the regex engine and the snowball stemmers,
to take two examples.  And libxml2 has been nothing but a source of pain.

If this is going to end up being one fairly small C file implementing
a spec that is not a moving target, I'd vote against depending on an
external library instead, no matter how spiffy and license-compatible
the external library might be.

Fair enough. Note that I did go on to say which reasons I did think
were potentially valid. ;-)

...Robert

#21 Josh Berkus
josh@agliodbs.com
In reply to: Hitoshi Harada (#14)
Re: Proposal: Add JSON support

On 3/28/10 8:52 PM, Hitoshi Harada wrote:

There's another choice, called BSON.

http://www.mongodb.org/display/DOCS/BSON

I've not researched it yet deeply, it seems reasonable to be stored in
databases as it is invented for MongoDB.

I wouldn't take that for granted. The MongoDB project involves a lot of
"re-inventing the wheel" and I'd scrutinize any of their innovations
pretty thoroughly.

Besides, I thought the point of a JSON type was to be compatible with
the *majority* of JSON users?

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#22 David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#19)
Re: Proposal: Add JSON support

On Mar 29, 2010, at 9:02 AM, Tom Lane wrote:

If this is going to end up being one fairly small C file implementing
a spec that is not a moving target, I'd vote against depending on an
external library instead, no matter how spiffy and license-compatible
the external library might be.

Perhaps you could fork one, in that case.

Best,

David

#23 Joseph Adams
joeyadams3.14159@gmail.com
In reply to: Joseph Adams (#1)
Re: Proposal: Add JSON support

On Mon, Mar 29, 2010 at 2:23 PM, David E. Wheeler <david@kineticode.com> wrote:

On Mar 29, 2010, at 9:02 AM, Tom Lane wrote:

If this is going to end up being one fairly small C file implementing
a spec that is not a moving target, I'd vote against depending on an
external library instead, no matter how spiffy and license-compatible
the external library might be.

Perhaps you could fork one, in that case.

Best,

David

I'm considering using and adapting cJSON instead of continuing with my
redundant implementation. I could run `indent -kr -i4` on it (will
that match PostgreSQL's coding style?), add support for UTF-16
surrogate pairs (pair of \u... escapes for each character above U+FFFF
as required by the JSON spec), and add a switch to turn on/off pure
ASCII output.
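
For concreteness, here is the arithmetic RFC 4627 prescribes for that
case; emit_escaped is a hypothetical helper name, and U+1D11E (the
musical G clef, the RFC's own example) is used below:

#include <stdio.h>

/* Emit one code point as \u escapes; code points above U+FFFF become
 * a UTF-16 surrogate pair, e.g. U+1D11E -> \uD834\uDD1E. */
static void
emit_escaped(FILE *f, unsigned int cp)
{
    if (cp > 0xFFFF)
    {
        cp -= 0x10000;
        fprintf(f, "\\u%04X\\u%04X",
                0xD800 + (cp >> 10),        /* high (lead) surrogate */
                0xDC00 + (cp & 0x3FF));     /* low (trail) surrogate */
    }
    else
        fprintf(f, "\\u%04X", cp);
}

Plugging in U+1D11E: 0x1D11E - 0x10000 = 0xD11E; 0xD11E >> 10 = 0x34,
giving \uD834; 0xD11E & 0x3FF = 0x11E, giving \uDD1E.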

P.S.: Sorry for the repeat, David. I forgot to CC the mailing list.

#24 Dimitri Fontaine
dfontaine@hi-media.com
In reply to: Tom Lane (#19)
Re: Proposal: Add JSON support

Tom Lane <tgl@sss.pgh.pa.us> writes:

If this is going to end up being one fairly small C file implementing
a spec that is not a moving target, I'd vote against depending on an
external library instead, no matter how spiffy and license-compatible
the external library might be.

My understanding is that it's possible to include (fork) MIT or BSD
source code into our source tree, right? (Some other licenses would
certainly qualify too.)

Regards,
--
dim

#25 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dimitri Fontaine (#24)
Re: Proposal: Add JSON support

Dimitri Fontaine <dfontaine@hi-media.com> writes:

Tom Lane <tgl@sss.pgh.pa.us> writes:

If this is going to end up being one fairly small C file implementing
a spec that is not a moving target, I'd vote against depending on an
external library instead, no matter how spiffy and license-compatible
the external library might be.

My understanding is that it's possible to include (fork) MIT or BSD
source code into our source tree, right? (Some other licenses would
certainly qualify too.)

MIT or 2-clause BSD would be ok for such a thing, other licenses
probably not.

regards, tom lane

#26 Chris Browne
cbbrowne@acm.org
In reply to: Joseph Adams (#1)
Re: Proposal: Add JSON support

joeyadams3.14159@gmail.com (Joseph Adams) writes:

I introduced myself in the thread "Proposal: access control jails (and
introduction as aspiring GSoC student)", and we discussed jails and
session-local variables. But, as Robert Haas suggested, implementing
variable support in the backend would probably be way too ambitious a
project for a newbie like me. I decided instead to pursue the task of
adding JSON support to PostgreSQL, hence the new thread.

Interesting...

I had a discussion about pretty much this sort of thing with a local LUG
associate; he was interested in this from a "doing CouchDB-ish things
using PostgreSQL" perspective.

There were a couple perspectives there, which may be somewhat orthogonal
to what you're trying to do. I'll mention them as they may suggest
useful operations.

1. Buddy Myles pointed out a NYTimes project which does something
pretty analogous...
http://code.nytimes.com/projects/dbslayer

This is a proxy that allows clients to submit requests via HTTP,
returning responses in JSON form. Note that the HTTP request has the
SQL query embedded into it.

2. CouchDB's interface is much the same, where clients submit HTTP
requests and receive JSON responses back, but with the difference that
the query is a stylized sorta-JSON form.

I'd think that you could get quite a long ways on this, at least doing
something like dbslayer without *necessarily* needing to do terribly
much work inside the DB engine.

Mapping a tuple, or a list of tuples, into a forest of JSON documents
should be pretty straightforward; whether or not it's really desirable
to operate a JSON-flavoured query inside PostgreSQL may be the
difference between *this year's* GSOC and *next year's* :-).
--
"...the Jedi learned early on what language the universe was
programmed in. Then they took advantage of an accident of language to
obscure this fact from the unwashed. They all affected an inverted
lisp. so, a Jedi to be, you the Forth must use."

#27 Josh Berkus
josh@agliodbs.com
In reply to: Chris Browne (#26)
Re: Proposal: Add JSON support

I'd think that you could get quite a long ways on this, at least doing
something like dbslayer without *necessarily* needing to do terribly
much work inside the DB engine.

There's actually an HTTP framework tool for Postgres which already does
something of the sort. It was introduced at pgCon 2 years ago ... will
look for.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#28 Robert Haas
robertmhaas@gmail.com
In reply to: Josh Berkus (#27)
Re: Proposal: Add JSON support

On Tue, Mar 30, 2010 at 8:58 PM, Josh Berkus <josh@agliodbs.com> wrote:

I'd think that you could get quite a long ways on this, at least doing
something like dbslayer without *necessarily* needing to do terribly
much work inside the DB engine.

There's actually an HTTP framework tool for Postgres which already does
something of the sort.  It was introduced at pgCon 2 years ago ... will
look for.

While it might be interesting to have/find/write a tool that puts an
HTTP/JSON layer around the DB connection, it's pretty much entirely
unrelated to the proposed project of creating a json type with
PostgreSQL analogous to the xml type we already have, which is what
the OP is proposing to do.

Personally, I suspect that a JSON type is both a more interesting
project to work on and a more useful result for this community.

...Robert

#29 Andrew Dunstan
andrew@dunslane.net
In reply to: Robert Haas (#28)
Re: Proposal: Add JSON support

Robert Haas wrote:

While it might be interesting to have/find/write a tool that puts an
HTTP/JSON layer around the DB connection, it's pretty much entirely
unrelated to the proposed project of creating a json type with
PostgreSQL analogous to the xml type we already have, which is what
the OP is proposing to do.

Personally, I suspect that a JSON type is both a more interesting
project to work on and a more useful result for this community.

I agree.

cheers

andrew

#30 Chris Browne
cbbrowne@acm.org
In reply to: Joseph Adams (#1)
Re: Proposal: Add JSON support

robertmhaas@gmail.com (Robert Haas) writes:

On Tue, Mar 30, 2010 at 8:58 PM, Josh Berkus <josh@agliodbs.com> wrote:

I'd think that you could get quite a long ways on this, at least doing
something like dbslayer without *necessarily* needing to do terribly
much work inside the DB engine.

There's actually an HTTP framework tool for Postgres which already does
something of the sort.  It was introduced at pgCon 2 years ago ... will
look for.

While it might be interesting to have/find/write a tool that puts an
HTTP/JSON layer around the DB connection, it's pretty much entirely
unrelated to the proposed project of creating a json type with
PostgreSQL analogous to the xml type we already have, which is what
the OP is proposing to do.

Personally, I suspect that a JSON type is both a more interesting
project to work on and a more useful result for this community.

No disagreement here; I'd expect that a JSON type would significantly
ease building such a framework. Indeed, that could be a demonstration
of success...

"We then implemented an HTTP/JSON proxy in 27 lines of Python code..."
:-)
--
"Unless you used NetInfo. _Then_ changing network settings could
often require torching of the existing system, salting of the ground
it had rested on, and termination of anyone who used it."
-- JFW <jwiede@biff.com> on comp.sys.next.advocacy

#31 Joseph Adams
joeyadams3.14159@gmail.com
In reply to: Chris Browne (#30)
Re: Proposal: Add JSON support

I ended up reinventing the wheel and writing another JSON library:

http://constellationmedia.com/~funsite/static/json-0.0.1.tar.bz2

This is a first release, and it doesn't really have a name besides
"json". It's very similar to cJSON, except it is (sans unknown bugs)
more reliable, more correct, and cleaner (unless you hate gotos ;-) ).
It has a simple test suite. It is not prone to stack overflows, as
it doesn't recurse. It is strict, requires input to be UTF-8 (it
validates it first) and only outputs UTF-8. Other than treating
numbers liberally, my implementation only accepts valid JSON code (it
doesn't try to correct anything, even Unicode problems). It is under
the MIT license.
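
As an illustration of the no-recursion point (a sketch, not the
library's actual code): nesting can be tracked with an explicit
heap-allocated stack, so pathological input exhausts the heap
gracefully instead of overflowing the C stack. A real parser would
also have to skip brackets inside string literals:

#include <stdbool.h>
#include <stdlib.h>

/* Check that {} and [] nest properly without recursing. */
bool
nesting_balanced(const char *json)
{
    size_t cap = 16, depth = 0;
    char *stack = malloc(cap);
    bool ok = (stack != NULL);

    while (ok && *json)
    {
        char c = *json++;

        if (c == '{' || c == '[')
        {
            if (depth == cap)
            {
                char *tmp = realloc(stack, cap *= 2);

                if (tmp == NULL)
                    ok = false;
                else
                    stack = tmp;
            }
            if (ok)
                stack[depth++] = c;
        }
        else if (c == '}')
            ok = depth > 0 && stack[--depth] == '{';
        else if (c == ']')
            ok = depth > 0 && stack[--depth] == '[';
    }
    free(stack);
    return ok && depth == 0;
}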

#32 Petr Jelinek
pjmodos@pjmodos.net
In reply to: Joseph Adams (#31)
Re: Proposal: Add JSON support

On 1.4.2010 5:39, Joseph Adams wrote:

I ended up reinventing the wheel and writing another JSON library:

http://constellationmedia.com/~funsite/static/json-0.0.1.tar.bz2

This is a first release, and it doesn't really have a name besides
"json". It's very similar to cJSON, except it is (sans unknown bugs)
more reliable, more correct, and cleaner (unless you hate gotos ;-) ).
It has a simple test suite. It is not prone to stack overflows, as
it doesn't recurse. It is strict, requires input to be UTF-8 (it
validates it first) and only outputs UTF-8. Other than treating
numbers liberally, my implementation only accepts valid JSON code (it
doesn't try to correct anything, even Unicode problems). It is under
the MIT license.

I did some testing on my own; it passed everything I have thrown at it
so far.
I also did tests using MSVC for both 32-bit and 64-bit targets and it
worked fine too (except for the missing stdbool.h in MSVC, which is no
big deal).

The coding style, compared to cJSON (or other libs I've seen), seems
closer to the style of PostgreSQL; it would, however, still require a
pgindent run and maybe some minor adjustments.

--
Regards
Petr Jelinek (PJMODOS)

#33 David E. Wheeler
david@kineticode.com
In reply to: Petr Jelinek (#32)
Re: Proposal: Add JSON support

On Apr 1, 2010, at 9:34 PM, Petr Jelinek wrote:

I ended up reinventing the wheel and writing another JSON library:

http://constellationmedia.com/~funsite/static/json-0.0.1.tar.bz2

This is a first release, and it doesn't really have a name besides
"json". It's very similar to cJSON, except it is (sans unknown bugs)
more reliable, more correct, and cleaner (unless you hate gotos ;-) ).
It has a simple test suite. It is not prone to stack overflows, as
it doesn't recurse. It is strict, requires input to be UTF-8 (it
validates it first) and only outputs UTF-8. Other than treating
numbers liberally, my implementation only accepts valid JSON code (it
doesn't try to correct anything, even Unicode problems). It is under
the MIT license.

I did some testing on my own; it passed everything I have thrown at it so far.
I also did tests using MSVC for both 32-bit and 64-bit targets and it worked fine too (except for the missing stdbool.h in MSVC, which is no big deal).

The coding style, compared to cJSON (or other libs I've seen), seems closer to the style of PostgreSQL; it would, however, still require a pgindent run and maybe some minor adjustments.

Someone approve this project for the GSoC quick, before Joseph finishes it!

Best,

David

#34 Joseph Adams
joeyadams3.14159@gmail.com
In reply to: David E. Wheeler (#33)
Re: Proposal: Add JSON support

I've been wondering whether the JSON datatype should be strict or liberal.

For one, there's strict JSON (following the exact specification).
Then there are more liberal JSON variants. Some JSON parsers
support comments, some support invalid number formats (e.g. '3.' or
'+5'), etc.

The consensus seems to be that JSON content should be stored verbatim
(it should store the exact string the client sent to it), as is done
with XML. However, this notion is somewhat incompatible with "Be
conservative in what you do; be liberal in what you accept from
others" because we can't accept loose JSON, then spit out conservative
JSON without messing with the content.

Here's my idea: the datatype should only allow strict JSON, but there
should be a function that accepts a liberal format, cleans it up to
make it strict JSON, and converts it to JSON. I think making strict
JSON the default makes the most sense because:
* Inputs to the database will most likely be coming from programs, not humans.
* Output is expected to be valid JSON and work anywhere JSON should work.
* Strict JSON is what more people would expect, I'd think.

#35 Mike Rylander
mrylander@gmail.com
In reply to: Joseph Adams (#34)
Re: Proposal: Add JSON support

On Sat, Apr 3, 2010 at 8:59 AM, Joseph Adams <joeyadams3.14159@gmail.com> wrote:

I've been wondering whether the JSON datatype should be strict or liberal.

For one, there's strict JSON (following the exact specification).
Then there are more liberal JSON variants.  Some JSON parsers
support comments, some support invalid number formats (e.g. '3.' or
'+5'), etc.

The consensus seems to be that JSON content should be stored verbatim
(it should store the exact string the client sent to it), as is done
with XML.  However, this notion is somewhat incompatible with "Be
conservative in what you do; be liberal in what you accept from
others" because we can't accept loose JSON, then spit out conservative
JSON without messing with the content.

Here's my idea: the datatype should only allow strict JSON, but there
should be a function that accepts a liberal format, cleans it up to
make it strict JSON, and converts it to JSON.  I think making strict
JSON the default makes the most sense because:
 * Inputs to the database will most likely be coming from programs, not humans.
 * Output is expected to be valid JSON and work anywhere JSON should work.
 * Strict JSON is what more people would expect, I'd think.

+1

--
Mike Rylander
| VP, Research and Design
| Equinox Software, Inc. / The Evergreen Experts
| phone: 1-877-OPEN-ILS (673-6457)
| email: miker@esilibrary.com
| web: http://www.esilibrary.com

#36 Andrew Dunstan
andrew@dunslane.net
In reply to: Mike Rylander (#35)
Re: Proposal: Add JSON support

Mike Rylander wrote:

Here's my idea: the datatype should only allow strict JSON, but there
should be a function that accepts a liberal format, cleans it up to
make it strict JSON, and converts it to JSON. I think making strict
JSON the default makes the most sense because:
* Inputs to the database will most likely be coming from programs, not humans.
* Output is expected to be valid JSON and work anywhere JSON should work.
* Strict JSON is what more people would expect, I'd think.

+1

Yeah. That's the only thing that makes sense to me. We don't allow badly
formed XML, for example, although we do allow document fragments (as
required by the standard, IIRC). But we could sensibly have some
function like 'cleanup_json(almost_json text) returns json'.

cheers

andrew

#37 Joseph Adams
joeyadams3.14159@gmail.com
In reply to: Andrew Dunstan (#36)
Re: Proposal: Add JSON support

Another JSON strictness issue: the JSON standard (
http://www.ietf.org/rfc/rfc4627.txt ) states that JSON text can only
be an array or object. However, my implementation currently accepts
any valid value. Thus, '3', '"hello"', 'true', 'false', and 'null'
are all accepted by my implementation, but are not strictly JSON text.
The question is: should the JSON datatype accept atomic values (those
that aren't arrays or objects) as valid JSON?

I tried a few other JSON implementations to see where they stand
regarding atomic types as input:

JSON_checker (C) does not accept them.
JSON.parse() (JavaScript) accepts them.
json_decode() (PHP) accepts them. However, support is currently buggy
(e.g. '1' is accepted, but '1 ' is not).
cJSON (C) accepts them.
JSON.pm (Perl) accepts them if you specify the allow_nonref option.
Otherwise, it accepts 'true' and 'false', but not 'null', a number, or
a string by itself.

In my opinion, we should accept an atomic value as valid JSON content.
I suppose we could get away with calling it a "content" fragment as
is done with XML without a doctype.

Accepting atomic values as valid JSON would be more orthogonal, as it
would be possible to have a function like this:

json_values(object_or_array JSON) RETURNS SETOF JSON
-- extracts values from an object or members from an array, returning
them as JSON fragments.

Also, should we go even further and accept key:value pairs by themselves? :

'"key":"value"'::JSON

I don't think we should because doing so would be rather zany. It
would mean JSON content could be invalid in value context, as in:

// JavaScript
var content = "key" : "value";

I improved my JSON library. It now only accepts strict, UTF-8 encoded
JSON values (that is, objects, arrays, strings, numbers, true, false,
and null). It also has a json_decode_liberal() function that accepts
a string, cleans it up, and passes it through the stricter
json_decode(). json_decode_liberal() filters out comments, allows
single quoted strings, and accepts a lax number format compared to
strict JSON. I may add Unicode repair to it later on, but
implementing that well really depends on what type of Unicode errors
appear in real life, I think.

http://constellationmedia.com/~funsite/static/json-0.0.2.tar.bz2

My json.c is now 1161 lines long, so I can't quite call it "small" anymore.
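
To illustrate the kind of rewriting json_decode_liberal() has to do
(a simplified sketch, not the library's code), here is a pass that
strips // comments outside of string literals; the real function also
handles block comments, single-quoted strings, and lax numbers:

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Copy the input, dropping // comments that appear outside of
 * double-quoted strings.  Caller frees the result. */
char *
strip_line_comments(const char *in)
{
    char *out = malloc(strlen(in) + 1);
    char *o = out;
    bool in_string = false;

    if (out == NULL)
        return NULL;
    while (*in)
    {
        if (in_string && in[0] == '\\' && in[1] != '\0')
        {
            *o++ = *in++;           /* keep escape pairs intact */
            *o++ = *in++;
        }
        else if (!in_string && in[0] == '/' && in[1] == '/')
        {
            while (*in && *in != '\n')
                in++;               /* skip to end of line */
        }
        else
        {
            if (*in == '"')
                in_string = !in_string;
            *o++ = *in++;
        }
    }
    *o = '\0';
    return out;
}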

#38 Robert Haas
robertmhaas@gmail.com
In reply to: Joseph Adams (#37)
Re: Proposal: Add JSON support

On Mon, Apr 5, 2010 at 11:50 PM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:

In my opinion, we should accept an atomic value as valid JSON content.

That seems right to me.

Also, should we go even further and accept key:value pairs by themselves? :

'"key":"value"'::JSON

Definitely not.

...Robert

#39 Petr Jelinek
pjmodos@pjmodos.net
In reply to: Joseph Adams (#37)
Re: Proposal: Add JSON support

On 6.4.2010 5:50, Joseph Adams wrote:

Another JSON strictness issue: the JSON standard (
http://www.ietf.org/rfc/rfc4627.txt ) states that JSON text can only
be an array or object. However, my implementation currently accepts
any valid value. Thus, '3', '"hello"', 'true', 'false', and 'null'
are all accepted by my implementation, but are not strictly JSON text.
The question is: should the JSON datatype accept atomic values (those
that aren't arrays or objects) as valid JSON?

Not really sure about this myself, but keep in mind that NULL has
special meaning in SQL.

Also, should we go even further and accept key:value pairs by themselves? :

'"key":"value"'::JSON

No, especially considering that '{"key":"value"}' is a valid JSON value.

I improved my JSON library. It now only accepts strict, UTF-8 encoded
JSON values (that is, objects, arrays, strings, numbers, true, false,
and null).

Just a note: PostgreSQL has some UTF-8 validation code you might want
to look at, at least once you start the actual integration into core,
so that you are not reinventing too many wheels. I can see how your
own code is a good thing for the general-purpose library this can (and
I am sure will) be used as, but for the datatype itself it might be a
better idea to use what's already there, unless it's somehow
incompatible, of course.

--
Regards
Petr Jelinek (PJMODOS)

#40 Petr Jelinek
pjmodos@pjmodos.net
In reply to: Joseph Adams (#1)
Re: Proposal: Add JSON support

On 6.4.2010 7:57, Joseph Adams wrote:

On Tue, Apr 6, 2010 at 1:00 AM, Petr Jelinek<pjmodos@pjmodos.net> wrote:

Not really sure about this myself, but keep in mind that NULL has special
meaning in SQL.

To me, the most logical approach is to do the obvious thing: make
JSON's 'null' be SQL's NULL. For instance, SELECTing on a table with
NULLs in it and converting the result set to JSON would yield a
structure with 'null's in it. 'null'::JSON would yield NULL. I'm not
sure what startling results would come of this approach, but I'm
guessing this would be most intuitive and useful.

+1

Just a note: PostgreSQL has some UTF-8 validation code you might want to
look at, at least once you start the actual integration into core, so that
you are not reinventing too many wheels. I can see how your own code is a
good thing for the general-purpose library this can (and I am sure will)
be used as, but for the datatype itself it might be a better idea to use
what's already there, unless it's somehow incompatible, of course.

Indeed. My plan is to first get a strong standalone JSON library
written and tested so it can be used as a general-purpose library. As
the JSON code is merged into PostgreSQL, it can be adapted. Part of
this adaptation would most likely be removing the UTF-8 validation
function I wrote and using PostgreSQL's Unicode support code instead.
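
A minimal sketch of that adaptation, assuming the verifier the backend
exports from mb/pg_wchar.h (pg_verifymbstr; the exact signature should
be checked against the tree):

#include "postgres.h"
#include "mb/pg_wchar.h"

/* Let the backend check the bytes; with noError = false an invalid
 * sequence raises an encoding error instead of returning. */
static void
check_json_bytes(const char *str, int len)
{
    pg_verifymbstr(str, len, false);
}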

There are probably other bits that could be PostgreSQLified as well.
I wonder if I should consider leveraging PostgreSQL's regex support or
if it would be a bad fit/waste of time/slower/not worth it.

Regex? What for? You certainly don't need it for parsing; you have a
good parser IMHO, and regex would probably be all of the above.

--
Regards
Petr Jelinek (PJMODOS)

#41 Mike Rylander
mrylander@gmail.com
In reply to: Joseph Adams (#37)
Re: Proposal: Add JSON support

On Mon, Apr 5, 2010 at 11:50 PM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:

Another JSON strictness issue:  the JSON standard (
http://www.ietf.org/rfc/rfc4627.txt ) states that JSON text can only
be an array or object.  However, my implementation currently accepts
any valid value.  Thus, '3', '"hello"', 'true', 'false', and 'null'
are all accepted by my implementation, but are not strictly JSON text.
 The question is: should the JSON datatype accept atomic values (those
that aren't arrays or objects) as valid JSON?

I tried a few other JSON implementations to see where they stand
regarding atomic types as input:

JSON_checker (C) does not accept them.
JSON.parse() (JavaScript) accepts them.
json_decode() (PHP) accepts them.  However, support is currently buggy
(e.g. '1' is accepted, but '1 ' is not).
cJSON (C) accepts them.
JSON.pm (Perl) accepts them if you specify the allow_nonref option.
Otherwise, it accepts 'true' and 'false', but not 'null', a number, or
a string by itself.

In my opinion, we should accept an atomic value as valid JSON content.
 I suppose we could get away with calling it a "content" fragment as
is done with XML without a doctype.

Accepting atomic values as valid JSON would be more orthogonal, as it
would be possible to have a function like this:

json_values(object_or_array JSON) RETURNS SETOF JSON
-- extracts values from an object or members from an array,
-- returning them as JSON fragments.

For these reasons, and the fact that my project uses atomic values ;),
I think yes, we should support them.
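
For illustration, a sketch of how that might be used (json_values is
only proposed here, and the literal values are invented):

SELECT json_values('[1, "two", null]'::JSON);
-- would return three rows of JSON fragments: '1', '"two"', 'null'
SELECT json_values('{"a": 1, "b": 2}'::JSON);
-- would return the member values '1' and '2'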

IIUC, the reason for requiring an array or object is that the O part
of JSON means "some sort of a collection of atomic values". But, in
ECMAScript (JavaScript), instances of strings, numbers, bools and null
are, indeed, objects. IOW, I think JSON is using a faulty definition
of "object" in the spec. It's the one part of the spec that doesn't
make sense to me at all.

Also, should we go even further and accept key:value pairs by themselves? :

'"key":"value"'::JSON

This, though, is probably a step too far. It violates the JS part of JSON ...

I don't think we should because doing so would be rather zany.  It
would mean JSON content could be invalid in value context, as in:

// JavaScript: a syntax error -- a bare key:value pair is not an expression
var content = "key" : "value";

Right.

Thanks, Joseph. I think this will be a great addition!

--
Mike Rylander
| VP, Research and Design
| Equinox Software, Inc. / The Evergreen Experts
| phone: 1-877-OPEN-ILS (673-6457)
| email: miker@esilibrary.com
| web: http://www.esilibrary.com

#42Alvaro Herrera
alvherre@commandprompt.com
In reply to: Joseph Adams (#37)
Re: Proposal: Add JSON support

Joseph Adams wrote:

http://constellationmedia.com/~funsite/static/json-0.0.2.tar.bz2

My json.c is now 1161 lines long, so I can't quite call it "small" anymore.

Just noticed you don't check the return value of malloc and friends.
How do you intend to handle that? There are various places that would
simply dump core with the 0.0.2 code. Within Postgres it's easy -- a
failed palloc aborts the transaction and doesn't continue running your
code. But in a standalone library that's probably not acceptable.

If we were to import this there are some lines that could be ripped out,
like 60 lines in the string buffer stuff and 130 lines for Unicode.
That brings your code just under 1000 lines.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#43Tom Lane
tgl@sss.pgh.pa.us
In reply to: Petr Jelinek (#40)
Re: Proposal: Add JSON support

Petr Jelinek <pjmodos@pjmodos.net> writes:

On 6.4.2010 7:57, Joseph Adams wrote:

To me, the most logical approach is to do the obvious thing: make
JSON's 'null' be SQL's NULL. For instance, SELECTing on a table with
NULLs in it and converting the result set to JSON would yield a
structure with 'null's in it. 'null'::JSON would yield NULL. I'm not
sure what startling results would come of this approach, but I'm
guessing this would be most intuitive and useful.

+1

I think it's a pretty bad idea for 'null'::JSON to yield NULL. AFAIR
there is no other standard datatype for which the input converter can
yield NULL from a non-null input string, and I'm not even sure that the
InputFunctionCall protocol allows it. (In fact a quick look indicates
that it doesn't...)

To me, what this throws into question is not so much whether JSON null
should equate to SQL NULL (it should), but whether it's sane to accept
atomic values. If I understood the beginning of this discussion, that's
not strictly legal. I think it would be better for strict input mode
to reject this, and permissive mode to convert it to a non-atomic value.
Thus jsonify('null') wouldn't yield NULL but a structure containing a
null.

regards, tom lane

#44Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#42)
Re: Proposal: Add JSON support

On Tue, Apr 6, 2010 at 11:05 AM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:

Joseph Adams wrote:

http://constellationmedia.com/~funsite/static/json-0.0.2.tar.bz2

My json.c is now 1161 lines long, so I can't quite call it "small" anymore.

Just noticed you don't check the return value of malloc and friends.
How do you intend to handle that?  There are various places that would
simply dump core with the 0.0.2 code.  Within Postgres it's easy -- a
failed palloc aborts the transaction and doesn't continue running your
code.  But in a standalone library that's probably not acceptable.

If we were to import this there are some lines that could be ripped out,
like 60 lines in the string buffer stuff and 130 lines for Unicode.
That brings your code just under 1000 lines.

Let me be the first to suggest putting this code under the PostgreSQL license.

...Robert

#45Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#43)
Re: Proposal: Add JSON support

On Tue, Apr 6, 2010 at 12:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Petr Jelinek <pjmodos@pjmodos.net> writes:

On 6.4.2010 7:57, Joseph Adams wrote:

To me, the most logical approach is to do the obvious thing: make
JSON's 'null' be SQL's NULL.  For instance, SELECTing on a table with
NULLs in it and converting the result set to JSON would yield a
structure with 'null's in it.  'null'::JSON would yield NULL.  I'm not
sure what startling results would come of this approach, but I'm
guessing this would be most intuitive and useful.

+1

I think it's a pretty bad idea for 'null'::JSON to yield NULL.  AFAIR
there is no other standard datatype for which the input converter can
yield NULL from a non-null input string, and I'm not even sure that the
InputFunctionCall protocol allows it.  (In fact a quick look indicates
that it doesn't...)

Oh. I missed this aspect of the proposal. I agree - that's a bad idea.

To me, what this throws into question is not so much whether JSON null
should equate to SQL NULL (it should), but whether it's sane to accept
atomic values.

With this, I disagree. I see no reason to suppose that a JSON NULL
and an SQL NULL are the same thing.

 If I understood the beginning of this discussion, that's
not strictly legal.  I think it would be better for strict input mode
to reject this, and permissive mode to convert it to a non-atomic value.
Thus jsonify('null') wouldn't yield NULL but a structure containing a
null.

There's no obvious "structure" to convert this into.

...Robert

#46Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#45)
Re: Proposal: Add JSON support

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Apr 6, 2010 at 12:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

To me, what this throws into question is not so much whether JSON null
should equate to SQL NULL (it should), but whether it's sane to accept
atomic values.

With this, I disagree. I see no reason to suppose that a JSON NULL
and an SQL NULL are the same thing.

Oh. If they're not the same, then the problem is easily dodged, but
then what *is* a JSON null?

regards, tom lane

#47Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#46)
Re: Proposal: Add JSON support

On Tue, Apr 6, 2010 at 1:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Apr 6, 2010 at 12:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

To me, what this throws into question is not so much whether JSON null
should equate to SQL NULL (it should), but whether it's sane to accept
atomic values.

With this, I disagree.  I see no reason to suppose that a JSON NULL
and an SQL NULL are the same thing.

Oh.  If they're not the same, then the problem is easily dodged, but
then what *is* a JSON null?

I assume we're going to treat JSON much like XML: basically text, but
with some validation (and perhaps canonicalization) under the hood.
So a JSON null will be "null", just as a JSON boolean true value will
be "true". It would be pretty weird if storing "true" or "false" or
"4" or "[3,1,4,1,5,9]" into a json column and then reading it back
returned the input string, but at the same time storing "null" into
the column returned a SQL NULL.
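
A sketch of the behavior I have in mind, with an invented table
(nothing here is implemented yet):

CREATE TABLE t (j json);  -- assumes the proposed json type
INSERT INTO t VALUES ('true'), ('4'), ('[3,1,4,1,5,9]'), ('null');
SELECT j FROM t;
-- each row would come back as the stored text, including the
-- literal 'null' in the last row, not an SQL NULL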

...Robert

#48Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#47)
Re: Proposal: Add JSON support

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Apr 6, 2010 at 1:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Oh.  If they're not the same, then the problem is easily dodged, but
then what *is* a JSON null?

I assume we're going to treat JSON much like XML: basically text, but
with some validation (and perhaps canonicalization) under the hood.
So a JSON null will be "null", just as a JSON boolean true value will
be "true". It would be pretty weird if storing "true" or "false" or
"4" or "[3,1,4,1,5,9]" into a json column and then reading it back
returned the input string, but at the same time storing "null" into
the column returned a SQL NULL.

Hmm. So the idea is that all JSON atomic values are considered to be
text strings, even when they look like something else (like bools or
numbers)? That would simplify matters I guess, but I'm not sure about
the usability. In particular I'd want to have something that dequotes
the value so that I can get foo not "foo" when converting to SQL text.
(I'm assuming that quotes would be there normally, so as not to lose
the distinction between 3 and "3" in the JSON representation.)
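
Something along these lines, say (the function name is invented for
the sake of the example):

SELECT '"foo"'::JSON::TEXT;          -- '"foo"', quotes preserved
SELECT json_dequote('"foo"'::JSON);  -- 'foo'
SELECT '3'::JSON::TEXT;              -- '3', still distinct from '"3"'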

regards, tom lane

#49Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#48)
Re: Proposal: Add JSON support

On Tue, Apr 6, 2010 at 2:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Apr 6, 2010 at 1:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Oh.  If they're not the same, then the problem is easily dodged, but
then what *is* a JSON null?

I assume we're going to treat JSON much like XML: basically text, but
with some validation (and perhaps canonicalization) under the hood.
So a JSON null will be "null", just as a JSON boolean true value will
be "true". It would be pretty weird if storing "true" or "false" or
"4" or "[3,1,4,1,5,9]" into a json column and then reading it back
returned the input string, but at the same time storing "null" into
the column returned a SQL NULL.

Hmm.  So the idea is that all JSON atomic values are considered to be
text strings, even when they look like something else (like bools or
numbers)?  That would simplify matters I guess, but I'm not sure about
the usability.

I'm not sure what the other option is. If you do SELECT col FROM
table, I'm not aware that you can return differently-typed values for
different rows...

In particular I'd want to have something that dequotes
the value so that I can get foo not "foo" when converting to SQL text.
(I'm assuming that quotes would be there normally, so as not to lose
the distinction between 3 and "3" in the JSON representation.)

Yes, that seems like a useful support function.

...Robert

#50Yeb Havinga
yebhavinga@gmail.com
In reply to: Tom Lane (#46)
Re: Proposal: Add JSON support

Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

With this, I disagree. I see no reason to suppose that a JSON NULL
and an SQL NULL are the same thing.

Oh. If they're not the same, then the problem is easily dodged, but
then what *is* a JSON null?

Probably the same as the JavaScript null.

regards,
Yeb Havinga

#51Joseph Adams
joeyadams3.14159@gmail.com
In reply to: Tom Lane (#43)
Re: Proposal: Add JSON support

On Tue, Apr 6, 2010 at 12:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Petr Jelinek <pjmodos@pjmodos.net> writes:

On 6.4.2010 7:57, Joseph Adams wrote:

To me, the most logical approach is to do the obvious thing: make
JSON's 'null' be SQL's NULL.  For instance, SELECTing on a table with
NULLs in it and converting the result set to JSON would yield a
structure with 'null's in it.  'null'::JSON would yield NULL.  I'm not
sure what startling results would come of this approach, but I'm
guessing this would be most intuitive and useful.

+1

I think it's a pretty bad idea for 'null'::JSON to yield NULL.  AFAIR
there is no other standard datatype for which the input converter can
yield NULL from a non-null input string, and I'm not even sure that the
InputFunctionCall protocol allows it.  (In fact a quick look indicates
that it doesn't...)

To me, what this throws into question is not so much whether JSON null
should equate to SQL NULL (it should), but whether it's sane to accept
atomic values.  If I understood the beginning of this discussion, that's
not strictly legal.  I think it would be better for strict input mode
to reject this, and permissive mode to convert it to a non-atomic value.
Thus jsonify('null') wouldn't yield NULL but a structure containing a
null.

                       regards, tom lane

Actually, I kind of made a zany mistake here. If 'null'::JSON yielded
NULL, that would mean some type of automatic conversion was going on.
Likewise, '3.14159'::JSON shouldn't magically turn into a FLOAT.

I think the JSON datatype should behave more like TEXT. 'null'::JSON
would yield a JSON fragment containing 'null'. 'null'::JSON::TEXT
would yield the literal text 'null'. However, '3.14159'::JSON::FLOAT
should probably not be allowed as a precaution, as
'"hello"'::JSON::TEXT would yield '"hello"', not 'hello'. In other
words, casting to the target type directly isn't the same as parsing
JSON and extracting a value.
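
That is (sketching the behavior I mean; neither the JSON type nor
these casts exist yet):

SELECT 'null'::JSON;            -- a JSON fragment containing null
SELECT 'null'::JSON::TEXT;      -- the literal text 'null'
SELECT '"hello"'::JSON::TEXT;   -- '"hello"', quotes intact
SELECT '3.14159'::JSON::FLOAT;  -- would raise an error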

Perhaps there could be conversion functions. E.g.:

json_to_string('"hello"') yields 'hello'
json_to_number('3.14159') yields '3.14159' as text
(it is up to the user to cast it to the number type s/he wants)
json_to_bool('true') yields TRUE
json_to_null('null') yields NULL, json_to_null('nonsense') fails

string_to_json('hello') yields '"hello"' as JSON
number_to_json(3.14159) yields '3.14159' as JSON
bool_to_json(TRUE) yields 'true' as JSON
null_to_json(NULL) yields 'null' as JSON (kinda useless)

I wonder if these could all be reduced to two generic functions, like
json_to_value and value_to_json.

#52Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joseph Adams (#51)
Re: Proposal: Add JSON support

Joseph Adams <joeyadams3.14159@gmail.com> writes:

Perhaps there could be conversion functions. E.g.:

Yeah, that's what I was thinking about.

json_to_string('"hello"') yields 'hello'
json_to_number('3.14159') yields '3.14159' as text
(it is up to the user to cast it to the number type s/he wants)
json_to_bool('true') yields TRUE
json_to_null('null') yields NULL, json_to_null('nonsense') fails

string_to_json('hello') yields '"hello"' as JSON
number_to_json(3.14159) yields '3.14159' as JSON
bool_to_json(TRUE) yields 'true' as JSON
null_to_json(NULL) yields 'null' as JSON (kinda useless)

The null cases seem a bit useless. What might be helpful is to
translate JSON 'null' to and from SQL NULL in each of the other
conversions, in addition to their primary capability.

I'd go with using NUMERIC as the source/result type for the numeric
conversions. Forcing people to insert explicit coercions from text
isn't going to be particularly convenient to use.
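
For instance (hypothetical signatures, just to sketch the idea):

SELECT json_to_string('null');   -- would yield SQL NULL
SELECT json_to_number('3.14');   -- would yield 3.14 as NUMERIC
SELECT number_to_json(NULL);     -- would yield 'null'::JSON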

I wonder if these could all be reduced to two generic functions, like
json_to_value and value_to_json.

value_to_json(any) might work, since there is a single well-defined
result type; but the other way could not, because a generic
json_to_value would have no one return type to declare. And it seems
better to keep some symmetry between the directions.

regards, tom lane