Proposal: Add JSON support

Started by Joseph Adams about 16 years ago; 52 messages on pgsql-hackers
#1 Joseph Adams
joeyadams3.14159@gmail.com

I introduced myself in the thread "Proposal: access control jails (and
introduction as aspiring GSoC student)", and we discussed jails and
session-local variables. But, as Robert Haas suggested, implementing
variable support in the backend would probably be way too ambitious a
project for a newbie like me. I decided instead to pursue the task of
adding JSON support to PostgreSQL, hence the new thread.

I plan to reference datatype-xml.html and functions-xml.html in some
design decisions, but there are some things that apply to XML that
don't apply to JSON and vice versa. For instance, jsoncomment
wouldn't make sense because (standard) JSON doesn't have comments.
For access, we might have something like json_get('foo[1].bar') and
json_set('foo[1].bar', 'hello'). jsonforest and jsonagg would be
beautiful. For mapping, jsonforest/jsonagg could be used to build a
JSON string from a result set (SELECT jsonagg(jsonforest(col1, col2,
...)) FROM tbl), but I'm not sure on the best way to go the other way
around (generate a result set from JSON). CSS-style selectors would
be cool, but "selecting" is what SQL is all about, and I'm not sure
having a json_select("dom-element[key=value]") function is a good,
orthogonal approach.
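
None of these functions exist yet; as a sketch of the intended semantics (in Python, purely for illustration, since the names jsonforest and jsonagg are only proposals here), jsonforest would turn a row's columns into a JSON object and jsonagg would collect those objects into a JSON array:

```python
import json

# Hypothetical semantics of the proposed functions, sketched in Python:
# jsonforest(col1, col2, ...) builds one JSON object per row, and
# jsonagg(...) aggregates those objects into a JSON array.
def jsonforest(row, colnames):
    """Build a JSON object from one row's named columns."""
    return {name: value for name, value in zip(colnames, row)}

def jsonagg(objects):
    """Aggregate per-row JSON objects into a single JSON array."""
    return json.dumps(list(objects), separators=(',', ':'))

# Roughly: SELECT jsonagg(jsonforest(col1, col2)) FROM tbl
rows = [(1, 'a'), (2, 'b')]
result = jsonagg(jsonforest(r, ['col1', 'col2']) for r in rows)
# result == '[{"col1":1,"col2":"a"},{"col1":2,"col2":"b"}]'
```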

I'm wondering whether the internal representation of JSON should be
plain JSON text, or some binary code that's easier to traverse and
whatnot. For the sake of code size, just keeping it in text is
probably best.

Now my thoughts and opinions on the JSON parsing/unparsing itself:

It should be built-in, rather than relying on an external library
(like XML does). Priorities of the JSON implementation, in descending
order, are:

* Small
* Correct
* Fast

Moreover, JSON operations shall not crash due to stack overflows.

I'm thinking Bison/Flex is overkill for parsing JSON (I haven't seen
any JSON implementations out there that use it anyway). I would
probably end up writing the JSON parser/serializer manually. It
should not take more than a week.
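
A recursive-descent parser for JSON really is small. Here is a rough sketch of the approach (in Python for brevity; the real thing would be C in the backend), with an explicit depth limit as one way to honor the no-stack-overflow requirement above. The limit of 64 is an arbitrary placeholder, not a value from this thread:

```python
import re

MAX_DEPTH = 64  # arbitrary illustration; a real limit would be tuned
_NUMBER = re.compile(r'-?(0|[1-9]\d*)(\.\d+)?([eE][+-]?\d+)?')
_STRING = re.compile(r'"(\\.|[^"\\])*"')

def _skip_ws(s, i):
    while i < len(s) and s[i] in ' \t\r\n':
        i += 1
    return i

def _value(s, i, depth):
    """Validate one JSON value starting at i; return the index just past it."""
    if depth > MAX_DEPTH:
        raise ValueError('nesting too deep')
    i = _skip_ws(s, i)
    if i >= len(s):
        raise ValueError('unexpected end of input')
    c = s[i]
    if c == '{':
        return _members(s, i + 1, depth + 1, '}', pairs=True)
    if c == '[':
        return _members(s, i + 1, depth + 1, ']', pairs=False)
    for lit in ('true', 'false', 'null'):
        if s.startswith(lit, i):
            return i + len(lit)
    m = _STRING.match(s, i) or _NUMBER.match(s, i)
    if m:
        return m.end()
    raise ValueError('bad token at %d' % i)

def _members(s, i, depth, close, pairs):
    """Validate object members or array elements up to the closing bracket."""
    i = _skip_ws(s, i)
    if i < len(s) and s[i] == close:
        return i + 1
    while True:
        if pairs:  # object members are "key": value
            i = _skip_ws(s, i)
            m = _STRING.match(s, i)
            if not m:
                raise ValueError('expected key at %d' % i)
            i = _skip_ws(s, m.end())
            if i >= len(s) or s[i] != ':':
                raise ValueError('expected ":"')
            i += 1
        i = _skip_ws(s, _value(s, i, depth))
        if i < len(s) and s[i] == ',':
            i += 1
            continue
        if i < len(s) and s[i] == close:
            return i + 1
        raise ValueError('expected "," or "%s"' % close)

def validate(s):
    i = _skip_ws(s, _value(s, 0, 0))
    if i != len(s):
        raise ValueError('trailing garbage')
    return True
```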

As far as character encodings, I'd rather keep that out of the JSON
parsing/serializing code itself and assume UTF-8. Wherever I'm wrong,
I'll just throw encode/decode/validate operations at it.

Thoughts? Thanks.

#2 Robert Haas
robertmhaas@gmail.com
In reply to: Joseph Adams (#1)
Re: Proposal: Add JSON support

On Sun, Mar 28, 2010 at 4:48 PM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:

I'm wondering whether the internal representation of JSON should be
plain JSON text, or some binary code that's easier to traverse and
whatnot.  For the sake of code size, just keeping it in text is
probably best.

+1 for text.

Now my thoughts and opinions on the JSON parsing/unparsing itself:

It should be built-in, rather than relying on an external library
(like XML does).

Why? I'm not saying you aren't right, but you need to make an
argument rather than an assertion. This is a community, so no one is
entitled to decide anything unilaterally, and people want to be
convinced - including me.

As far as character encodings, I'd rather keep that out of the JSON
parsing/serializing code itself and assume UTF-8.  Wherever I'm wrong,
I'll just throw encode/decode/validate operations at it.

I think you need to assume that the encoding will be the server
encoding, not UTF-8. Although others on this list are better
qualified to speak to that than I am.

...Robert

#3 Andrew Dunstan
andrew@dunslane.net
In reply to: Robert Haas (#2)
Re: Proposal: Add JSON support

Robert Haas wrote:

On Sun, Mar 28, 2010 at 4:48 PM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:

I'm wondering whether the internal representation of JSON should be
plain JSON text, or some binary code that's easier to traverse and
whatnot. For the sake of code size, just keeping it in text is
probably best.

+1 for text.

Agreed.

Now my thoughts and opinions on the JSON parsing/unparsing itself:

It should be built-in, rather than relying on an external library
(like XML does).

Why? I'm not saying you aren't right, but you need to make an
argument rather than an assertion. This is a community, so no one is
entitled to decide anything unilaterally, and people want to be
convinced - including me.

Yeah, why? We should not be in the business of reinventing the wheel
(and then maintaining the reinvented wheel), unless the code in question
is *really* small.

As far as character encodings, I'd rather keep that out of the JSON
parsing/serializing code itself and assume UTF-8. Wherever I'm wrong,
I'll just throw encode/decode/validate operations at it.

I think you need to assume that the encoding will be the server
encoding, not UTF-8. Although others on this list are better
qualified to speak to that than I am.

The trouble is that JSON is defined to be specifically Unicode, and in
practice for us that means UTF8 on the server side. It could get a bit
hairy, and it's definitely not something I think you can wave away with
a simple "I'll just throw some encoding/decoding function calls at it."

cheers

andrew

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#3)
Re: Proposal: Add JSON support

Andrew Dunstan <andrew@dunslane.net> writes:

Robert Haas wrote:

I think you need to assume that the encoding will be the server
encoding, not UTF-8. Although others on this list are better
qualified to speak to that than I am.

The trouble is that JSON is defined to be specifically Unicode, and in
practice for us that means UTF8 on the server side. It could get a bit
hairy, and it's definitely not something I think you can wave away with
a simple "I'll just throw some encoding/decoding function calls at it."

It's just text, no? Are there any operations where this actually makes
a difference?

Like Robert, I'm *very* wary of trying to introduce any text storage
into the backend that is in an encoding different from server_encoding.
Even the best-case scenarios for that will involve multiple new places for
encoding conversion failures to happen.

regards, tom lane

#5 Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#4)
Re: Proposal: Add JSON support

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

Robert Haas wrote:

I think you need to assume that the encoding will be the server
encoding, not UTF-8. Although others on this list are better
qualified to speak to that than I am.

The trouble is that JSON is defined to be specifically Unicode, and in
practice for us that means UTF8 on the server side. It could get a bit
hairy, and it's definitely not something I think you can wave away with
a simple "I'll just throw some encoding/decoding function calls at it."

It's just text, no? Are there any operations where this actually makes
a difference?

If we're going to provide operations on it, that might involve some. I
don't know.

Like Robert, I'm *very* wary of trying to introduce any text storage
into the backend that is in an encoding different from server_encoding.
Even the best-case scenarios for that will involve multiple new places for
encoding conversion failures to happen.

I agree entirely. All I'm suggesting is that there could be many
wrinkles here.

Here's another thought. Given that JSON is actually specified to consist
of a string of Unicode characters, what will we deliver to the client
where the client encoding is, say Latin1? Will it actually be a legal
JSON byte stream?

cheers

andrew

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#5)
Re: Proposal: Add JSON support

Andrew Dunstan <andrew@dunslane.net> writes:

Here's another thought. Given that JSON is actually specified to consist
of a string of Unicode characters, what will we deliver to the client
where the client encoding is, say Latin1? Will it actually be a legal
JSON byte stream?

No, it won't. We will *not* be sending anything but latin1 in such a
situation, and I really couldn't care less what the JSON spec says about
it. Delivering wrongly-encoded data to a client is a good recipe for
all sorts of problems, since the client-side code is very unlikely to be
expecting that. A datatype doesn't get to make up its own mind whether
to obey those rules. Likewise, data on input had better match
client_encoding, because it's otherwise going to fail the encoding
checks long before a json datatype could have any say in the matter.

While I've not read the spec, I wonder exactly what "consist of a string
of Unicode characters" should actually be taken to mean. Perhaps it
only means that all the characters must be members of the Unicode set,
not that the string can never be represented in any other encoding.
There's more than one Unicode encoding anyway...

regards, tom lane

#7 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#6)
Re: Proposal: Add JSON support

On Sun, Mar 28, 2010 at 7:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

Here's another thought. Given that JSON is actually specified to consist
of a string of Unicode characters, what will we deliver to the client
where the client encoding is, say Latin1? Will it actually be a legal
JSON byte stream?

No, it won't.  We will *not* be sending anything but latin1 in such a
situation, and I really couldn't care less what the JSON spec says about
it.  Delivering wrongly-encoded data to a client is a good recipe for
all sorts of problems, since the client-side code is very unlikely to be
expecting that.  A datatype doesn't get to make up its own mind whether
to obey those rules.  Likewise, data on input had better match
client_encoding, because it's otherwise going to fail the encoding
checks long before a json datatype could have any say in the matter.

While I've not read the spec, I wonder exactly what "consist of a string
of Unicode characters" should actually be taken to mean.  Perhaps it
only means that all the characters must be members of the Unicode set,
not that the string can never be represented in any other encoding.
There's more than one Unicode encoding anyway...

See sections 2.5 and 3 of:

http://www.ietf.org/rfc/rfc4627.txt?number=4627

...Robert

#8 Mike Rylander
mrylander@gmail.com
In reply to: Tom Lane (#6)
Re: Proposal: Add JSON support

On Sun, Mar 28, 2010 at 7:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

Here's another thought. Given that JSON is actually specified to consist
of a string of Unicode characters, what will we deliver to the client
where the client encoding is, say Latin1? Will it actually be a legal
JSON byte stream?

No, it won't.  We will *not* be sending anything but latin1 in such a
situation, and I really couldn't care less what the JSON spec says about
it.  Delivering wrongly-encoded data to a client is a good recipe for
all sorts of problems, since the client-side code is very unlikely to be
expecting that.  A datatype doesn't get to make up its own mind whether
to obey those rules.  Likewise, data on input had better match
client_encoding, because it's otherwise going to fail the encoding
checks long before a json datatype could have any say in the matter.

While I've not read the spec, I wonder exactly what "consist of a string
of Unicode characters" should actually be taken to mean.  Perhaps it
only means that all the characters must be members of the Unicode set,
not that the string can never be represented in any other encoding.
There's more than one Unicode encoding anyway...

In practice, every parser/serializer I've used (including the one I
helped write) allows (and, often, forces) any non-ASCII character to
be encoded as \u followed by a string of four hex digits.

Whether it would be easy inside the backend, when generating JSON from
user data stored in tables that are not in a UTF-8 encoded cluster, to
convert to UTF-8, that's something else entirely. If it /is/ easy and
safe, then it's just a matter of scanning for multi-byte sequences and
replacing those with their \uXXXX equivalents. I have some simple and
fast code I could share, if it's needed, though I suspect it's not.
:)
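
The scan-and-replace described above can be sketched briefly (in Python, for illustration only; it is not the code Mike offers). Code points above the Basic Multilingual Plane become UTF-16 surrogate pairs, as the JSON \u escape requires. A full serializer would also escape quotes, backslashes, and control characters; this handles only the non-ASCII part:

```python
def escape_non_ascii(s):
    """Replace every non-ASCII character with its \\uXXXX escape,
    yielding a pure 7-bit ASCII string.  Non-BMP characters become
    UTF-16 surrogate pairs."""
    out = []
    for ch in s:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append('\\u%04x' % cp)
        else:
            # split into a high/low surrogate pair
            cp -= 0x10000
            out.append('\\u%04x\\u%04x'
                       % (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)))
    return ''.join(out)

escape_non_ascii('héllo')  # yields the 7-bit string h\u00e9llo
```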

UPDATE: Thanks, Robert, for pointing to the RFC.

--
Mike Rylander
| VP, Research and Design
| Equinox Software, Inc. / The Evergreen Experts
| phone: 1-877-OPEN-ILS (673-6457)
| email: miker@esilibrary.com
| web: http://www.esilibrary.com

#9 Robert Haas
robertmhaas@gmail.com
In reply to: Mike Rylander (#8)
Re: Proposal: Add JSON support

On Sun, Mar 28, 2010 at 8:23 PM, Mike Rylander <mrylander@gmail.com> wrote:

In practice, every parser/serializer I've used (including the one I
helped write) allows (and, often, forces) any non-ASCII character to
be encoded as \u followed by a string of four hex digits.

Is it correct to say that the only feasible place where non-ASCII
characters can be used is within string constants? If so, it might be
reasonable to disallow characters with the high-bit set unless the
server encoding is one of the flavors of Unicode of which the spec
approves. I'm tempted to think that when the server encoding is
Unicode we really ought to allow Unicode characters natively, because
turning a long string of two-byte wide chars into a long string of
six-byte wide chars sounds pretty evil from a performance point of
view.

...Robert

#10 Andrew Dunstan
andrew@dunslane.net
In reply to: Robert Haas (#9)
Re: Proposal: Add JSON support

Robert Haas wrote:

On Sun, Mar 28, 2010 at 8:23 PM, Mike Rylander <mrylander@gmail.com> wrote:

In practice, every parser/serializer I've used (including the one I
helped write) allows (and, often, forces) any non-ASCII character to
be encoded as \u followed by a string of four hex digits.

Is it correct to say that the only feasible place where non-ASCII
characters can be used is within string constants? If so, it might be
reasonable to disallow characters with the high-bit set unless the
server encoding is one of the flavors of Unicode of which the spec
approves. I'm tempted to think that when the server encoding is
Unicode we really ought to allow Unicode characters natively, because
turning a long string of two-byte wide chars into a long string of
six-byte wide chars sounds pretty evil from a performance point of
view.

We support exactly one unicode encoding on the server side: utf8.

And the maximum possible size of a validly encoded unicode char in utf8
is 4 (and that's pretty rare, IIRC).

cheers

andrew

#11 Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#10)
Re: Proposal: Add JSON support

Andrew Dunstan wrote:

Robert Haas wrote:

On Sun, Mar 28, 2010 at 8:23 PM, Mike Rylander <mrylander@gmail.com>
wrote:

In practice, every parser/serializer I've used (including the one I
helped write) allows (and, often, forces) any non-ASCII character to
be encoded as \u followed by a string of four hex digits.

Is it correct to say that the only feasible place where non-ASCII
characters can be used is within string constants? If so, it might be
reasonable to disallow characters with the high-bit set unless the
server encoding is one of the flavors of Unicode of which the spec
approves. I'm tempted to think that when the server encoding is
Unicode we really ought to allow Unicode characters natively, because
turning a long string of two-byte wide chars into a long string of
six-byte wide chars sounds pretty evil from a performance point of
view.

We support exactly one unicode encoding on the server side: utf8.

And the maximum possible size of a validly encoded unicode char in
utf8 is 4 (and that's pretty rare, IIRC).

Sorry. Disregard this. I see what you mean.

Yeah, I think *requiring* non-ASCII characters to be escaped would be evil.

cheers

andrew

#12 Mike Rylander
mrylander@gmail.com
In reply to: Robert Haas (#9)
Re: Proposal: Add JSON support

On Sun, Mar 28, 2010 at 8:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Mar 28, 2010 at 8:23 PM, Mike Rylander <mrylander@gmail.com> wrote:

In practice, every parser/serializer I've used (including the one I
helped write) allows (and, often, forces) any non-ASCII character to
be encoded as \u followed by a string of four hex digits.

Is it correct to say that the only feasible place where non-ASCII
characters can be used is within string constants?

Yes. That includes object property strings -- they are quoted string literals.

If so, it might be
reasonable to disallow characters with the high-bit set unless the
server encoding is one of the flavors of Unicode of which the spec
approves.  I'm tempted to think that when the server encoding is
Unicode we really ought to allow Unicode characters natively, because
turning a long string of two-byte wide chars into a long string of
six-byte wide chars sounds pretty evil from a performance point of
view.

+1

As an aside, \u-encoded (escaped) characters and native multi-byte
sequences (of any RFC-allowable Unicode encoding) are exactly
equivalent in JSON -- it's a storage and transmission format, and
doesn't prescribe the application-internal representation of the data.

If it's faster (which it almost certainly is) to not mangle the data
when it's all staying server side, that seems like a useful
optimization. For output to the client, however, it would be useful
to provide a \u-escaping function, which (AIUI) should always be safe
regardless of client encoding.

--
Mike Rylander
| VP, Research and Design
| Equinox Software, Inc. / The Evergreen Experts
| phone: 1-877-OPEN-ILS (673-6457)
| email: miker@esilibrary.com
| web: http://www.esilibrary.com

#13 Joseph Adams
joeyadams3.14159@gmail.com
In reply to: Robert Haas (#2)
Re: Proposal: Add JSON support

On Sun, Mar 28, 2010 at 5:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Mar 28, 2010 at 4:48 PM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:

Now my thoughts and opinions on the JSON parsing/unparsing itself:

It should be built-in, rather than relying on an external library
(like XML does).

Why?  I'm not saying you aren't right, but you need to make an
argument rather than an assertion.  This is a community, so no one is
entitled to decide anything unilaterally, and people want to be
convinced - including me.

I apologize; I was just starting the conversation with some of my
ideas to receive feedback. I didn't want people to have to wade
through too many "I think"s. I'll be sure to use <opinion> tags in
the future :-)

My reasoning for "It should be built-in" is:
* It would be nice to have a built-in serialization format that's
available by default.
* It might be a little faster because it doesn't have to link to an
external library.
* The code to interface between JSON logic and PostgreSQL will
probably be much larger than the actual JSON encoding/decoding itself.
* The externally-maintained and packaged libjson implementations I
saw brought in lots of dependencies (e.g. glib).
* "Everyone else" (e.g. PHP) uses a statically-linked JSON implementation.

Is the code in question "*really*" small? Well, not really, but it's
not enormous either. By the way, I found a bug in PHP's JSON_parser
(json_decode("true "); /* with a space */ returns null instead of
true). I'll have to get around to reporting that.
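
For comparison, here is the behavior a conforming parser should exhibit, shown with Python's json module: whitespace surrounding a JSON value is insignificant, so a trailing space must not change the parse result.

```python
import json

# Whitespace around a JSON value is insignificant, so a trailing
# space must not change the result.  Python's parser handles the
# case that PHP's json_decode got wrong at the time:
assert json.loads('true ') is True
assert json.loads(' [1, 2]\n') == [1, 2]
```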

Now, assuming JSON support is built-in to PostgreSQL and is enabled by
default, it is my opinion that encoding issues should not be dealt
with in the JSON code itself, but that the JSON code itself should
assume UTF-8. I think conversions should be done to/from UTF-8 before
passing it through the JSON code because this would likely be the
smallest way to implement it (not necessarily the fastest, though).

Mike Rylander pointed out something wonderful, and that is that JSON
code can be stored in plain old ASCII using \u... . If a target
encoding supports all of Unicode, the JSON serializer could be told
not to generate \u escapes. Otherwise, the \u escapes would be
necessary.

Thus, here's an example of how (in my opinion) character sets and such
should be handled in the JSON code:

Suppose the client's encoding is UTF-16, and the server's encoding is
Latin-1. When JSON is stored to the database:
1. The client is responsible and sends a valid UTF-16 JSON string.
2. PostgreSQL checks to make sure it is valid UTF-16, then converts
it to UTF-8.
3. The JSON code parses it (to ensure it's valid).
4. The JSON code unparses it (to get a representation without
needless whitespace). It is given a flag indicating it should only
output ASCII text.
5. The ASCII is stored in the server, since it is valid Latin-1.
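
Steps 3 and 4 above amount to parse-then-reserialize with ASCII-only output. Python's json module can illustrate the idea compactly (the backend version would of course be C; the function name here is made up for the example):

```python
import json

def normalize_for_storage(text):
    """Parse to validate (step 3), then re-serialize without needless
    whitespace and with ASCII-only output (step 4), so the result is
    storable in any server encoding that is a superset of ASCII."""
    value = json.loads(text)                     # step 3: parse/validate
    return json.dumps(value, ensure_ascii=True,  # step 4: \u-escape non-ASCII
                      separators=(',', ':'))     #         drop extra whitespace

normalize_for_storage('{ "name" : "José" }')
# '{"name":"Jos\\u00e9"}'
```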

When JSON is retrieved from the database:
1. ASCII is retrieved from the server
2. If user needs to extract one or more fields, the JSON is parsed,
and the fields are extracted.
3. Otherwise, the JSON text is converted to UTF-16 and sent to the client.

Note that I am being biased toward optimizing code size rather than speed.

Here's a question about semantics: should converting JSON to text
guarantee that Unicode will be \u escaped, or should it render actual
Unicode whenever possible (when the client uses a Unicode-complete
charset) ?

As for reinventing the wheel, I'm in the process of writing yet
another JSON implementation simply because I didn't find the other
ones I looked at palatable. I am aiming for simple code, not fast
code. I am using malloc for structures and realloc for strings/arrays
rather than resorting to clever buffering tricks. Of course, I'll
switch it over to palloc/repalloc before migrating it to PostgreSQL.

#14 Hitoshi Harada
umi.tanuki@gmail.com
In reply to: Andrew Dunstan (#3)
Re: Proposal: Add JSON support

2010/3/29 Andrew Dunstan <andrew@dunslane.net>:

Robert Haas wrote:

On Sun, Mar 28, 2010 at 4:48 PM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:

I'm wondering whether the internal representation of JSON should be
plain JSON text, or some binary code that's easier to traverse and
whatnot.  For the sake of code size, just keeping it in text is
probably best.

+1 for text.

Agreed.

There's another choice, called BSON.

http://www.mongodb.org/display/DOCS/BSON

I've not researched it deeply yet, but it seems reasonable to store in
a database, as it was invented for MongoDB.

Now my thoughts and opinions on the JSON parsing/unparsing itself:

It should be built-in, rather than relying on an external library
(like XML does).

Why?  I'm not saying you aren't right, but you need to make an
argument rather than an assertion.  This is a community, so no one is
entitled to decide anything unilaterally, and people want to be
convinced - including me.

Yeah, why? We should not be in the business of reinventing the wheel (and
then maintaining the reinvented wheel), unless the code in question is
*really* small.

The many implementations of JSON in many languages show that parsing
JSON is not so difficult to code, and the needs vary. Hence, I wonder
if we could have our very own.

Don't take this wrongly: I don't object to the text format, nor to
using an external library.

Regards,

--
Hitoshi Harada

#15 Robert Haas
robertmhaas@gmail.com
In reply to: Joseph Adams (#13)
Re: Proposal: Add JSON support

On Sun, Mar 28, 2010 at 11:24 PM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:

I apologize; I was just starting the conversation with some of my
ideas to receive feedback.  I didn't want people to have to wade
through too many "I think"s. I'll be sure to use <opinion> tags in
the future :-)

FWIW, I don't care at all whether you say "I think" or "I know"; the
point is that you have to provide backup for any position you choose
to take.

My reasoning for "It should be built-in" is:
 * It would be nice to have a built-in serialization format that's
available by default.
 * It might be a little faster because it doesn't have to link to an
external library.

I don't think either of these reasons is valid.

 * The code to interface between JSON logic and PostgreSQL will
probably be much larger than the actual JSON encoding/decoding itself.

If true, this is a good argument.

 * The externally-maintained and packaged libjson implementations I
saw brought in lots of dependencies (e.g. glib).

As is this.

 * "Everyone else" (e.g. PHP) uses a statically-linked JSON implementation.

But this isn't.

Is the code in question "*really*" small?  Well, not really, but it's
not enormous either.  By the way, I found a bug in PHP's JSON_parser
(json_decode("true "); /* with a space */ returns null instead of
true).  I'll have to get around to reporting that.

Now, assuming JSON support is built-in to PostgreSQL and is enabled by
default, it is my opinion that encoding issues should not be dealt
with in the JSON code itself, but that the JSON code itself should
assume UTF-8.  I think conversions should be done to/from UTF-8 before
passing it through the JSON code because this would likely be the
smallest way to implement it (not necessarily the fastest, though).

Mike Rylander pointed out something wonderful, and that is that JSON
code can be stored in plain old ASCII using \u... .  If a target
encoding supports all of Unicode, the JSON serializer could be told
not to generate \u escapes.  Otherwise, the \u escapes would be
necessary.

Thus, here's an example of how (in my opinion) character sets and such
should be handled in the JSON code:

Suppose the client's encoding is UTF-16, and the server's encoding is
Latin-1.  When JSON is stored to the database:
 1. The client is responsible and sends a valid UTF-16 JSON string.
 2. PostgreSQL checks to make sure it is valid UTF-16, then converts
it to UTF-8.
 3. The JSON code parses it (to ensure it's valid).
 4. The JSON code unparses it (to get a representation without
needless whitespace).  It is given a flag indicating it should only
output ASCII text.
 5. The ASCII is stored in the server, since it is valid Latin-1.

When JSON is retrieved from the database:
 1. ASCII is retrieved from the server
 2. If user needs to extract one or more fields, the JSON is parsed,
and the fields are extracted.
 3. Otherwise, the JSON text is converted to UTF-16 and sent to the client.

Note that I am being biased toward optimizing code size rather than speed.

Can you comment on my proposal elsewhere on this thread and compare
your proposal to mine? In what ways are they different, and which is
better, and why?

Here's a question about semantics: should converting JSON to text
guarantee that Unicode will be \u escaped, or should it render actual
Unicode whenever possible (when the client uses a Unicode-complete
charset) ?

I feel pretty strongly that the data should be stored in the database
in the format in which it will be returned to the user - any
conversion which is necessary should happen on the way in. I am not
100% sure to what extent we should attempt to canonicalize the input
and to what extend we should simply store it in whichever way the user
chooses to provide it.

As for reinventing the wheel, I'm in the process of writing yet
another JSON implementation simply because I didn't find the other
ones I looked at palatable.  I am aiming for simple code, not fast
code.  I am using malloc for structures and realloc for strings/arrays
rather than resorting to clever buffering tricks.  Of course, I'll
switch it over to palloc/repalloc before migrating it to PostgreSQL.

I'm not sure that optimizing for simplicity over speed is a good idea.
I think we can reject implementations as unpalatable because they are
slow or feature-poor or have licensing issues or are not actively
maintained, but rejecting them because they use complex code in order
to be fast doesn't seem like the right trade-off to me.

...Robert

#16 Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Joseph Adams (#13)
Re: Proposal: Add JSON support

Hi,

Joseph Adams <joeyadams3.14159@gmail.com> writes:

As for reinventing the wheel, I'm in the process of writing yet
another JSON implementation simply because I didn't find the other
ones I looked at palatable.

Even this one (ANSI C, MIT Licenced)?

cJSON -- An ultra-lightweight, portable, single-file, simple-as-can-be
ANSI-C compliant JSON parser, under MIT license.

http://sourceforge.net/projects/cjson/
http://cjson.svn.sourceforge.net/viewvc/cjson/README?revision=7&view=markup
http://cjson.svn.sourceforge.net/viewvc/cjson/cJSON.c?revision=33&view=markup

And from the cJSON.h we read that it could be somewhat easy to integrate
into PostgreSQL's memory management:
typedef struct cJSON_Hooks {
      void *(*malloc_fn)(size_t sz);
      void (*free_fn)(void *ptr);
} cJSON_Hooks;

Just adding some data points, hoping that's not adding only confusion.

Regards,
--
dim

#17 Peter Eisentraut
peter_e@gmx.net
In reply to: Joseph Adams (#13)
Re: Proposal: Add JSON support

On sön, 2010-03-28 at 23:24 -0400, Joseph Adams wrote:

Thus, here's an example of how (in my opinion) character sets and such
should be handled in the JSON code:

Suppose the client's encoding is UTF-16, and the server's encoding is
Latin-1. When JSON is stored to the database:
1. The client is responsible and sends a valid UTF-16 JSON string.
2. PostgreSQL checks to make sure it is valid UTF-16, then converts
it to UTF-8.
3. The JSON code parses it (to ensure it's valid).
4. The JSON code unparses it (to get a representation without
needless whitespace). It is given a flag indicating it should only
output ASCII text.
5. The ASCII is stored in the server, since it is valid Latin-1.

When JSON is retrieved from the database:
1. ASCII is retrieved from the server
2. If user needs to extract one or more fields, the JSON is parsed,
and the fields are extracted.
3. Otherwise, the JSON text is converted to UTF-16 and sent to the client.

The problem I see here is that a data type output function is normally
not aware of the client encoding. The alternatives that I see are that
you always escape everything to plain ASCII, so that it's valid in
every server encoding, which would result in pretty sad behavior for
users of languages that don't use a lot of ASCII characters; or that
you decree a nonstandard JSON variant that momentarily uses whatever
encoding you decide.

#18 Andrew Dunstan
andrew@dunslane.net
In reply to: Robert Haas (#15)
Re: Proposal: Add JSON support

Robert Haas wrote:

I feel pretty strongly that the data should be stored in the database
in the format in which it will be returned to the user - any
conversion which is necessary should happen on the way in. I am not
100% sure to what extent we should attempt to canonicalize the input
and to what extend we should simply store it in whichever way the user
chooses to provide it.

ISTM that implies that, with a possible exception when the server
encoding is utf8, you would have to \u escape the data on the way in
fairly pessimistically.

I'd be inclined to say we should store and validate it exactly as the
client gives it to us (converted to the server encoding, as it would be,
of course). In practice that would mean that for non-utf8 databases the
client would need to \u escape it. I suspect most uses of this would be
in utf8-encoded databases anyway.

I also think we should provide a function to do the escaping, so users
could do something like:

insert into foo (myjson) values (json_escape('some jason text here'));

I also thought about a switch to turn on \u escaping on output - that
might be useful for pg_dump for instance.

cheers

andrew

#19 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#15)
Re: Proposal: Add JSON support

Robert Haas <robertmhaas@gmail.com> writes:

On Sun, Mar 28, 2010 at 11:24 PM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:

My reasoning for "It should be built-in" is:
 * It would be nice to have a built-in serialization format that's
available by default.
 * It might be a little faster because it doesn't have to link to an
external library.

I don't think either of these reasons is valid.

FWIW, our track record with relying on external libraries has been less
than great --- "upstream will maintain it" sounds good but has fallen
over with respect to both the regex engine and the snowball stemmers,
to take two examples. And libxml2 has been nothing but a source of pain.

If this is going to end up being one fairly small C file implementing
a spec that is not a moving target, I'd vote against depending on an
external library instead, no matter how spiffy and license-compatible
the external library might be.

regards, tom lane

#20 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#19)
Re: Proposal: Add JSON support

On Mon, Mar 29, 2010 at 12:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Sun, Mar 28, 2010 at 11:24 PM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:

My reasoning for "It should be built-in" is:
 * It would be nice to have a built-in serialization format that's
available by default.
 * It might be a little faster because it doesn't have to link to an
external library.

I don't think either of these reasons is valid.

FWIW, our track record with relying on external libraries has been less
than great --- "upstream will maintain it" sounds good but has fallen
over with respect to both the regex engine and the snowball stemmers,
to take two examples.  And libxml2 has been nothing but a source of pain.

If this is going to end up being one fairly small C file implementing
a spec that is not a moving target, I'd vote against depending on an
external library instead, no matter how spiffy and license-compatible
the external library might be.

Fair enough. Note that I did go on to say which reasons I did think
were potentially valid. ;-)

...Robert

#21 Josh Berkus
josh@agliodbs.com
In reply to: Hitoshi Harada (#14)
#22 David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#19)
#23 Joseph Adams
joeyadams3.14159@gmail.com
In reply to: Joseph Adams (#1)
#24 Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Tom Lane (#19)
#25 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dimitri Fontaine (#24)
#26 Chris Browne
cbbrowne@acm.org
In reply to: Joseph Adams (#1)
#27 Josh Berkus
josh@agliodbs.com
In reply to: Chris Browne (#26)
#28 Robert Haas
robertmhaas@gmail.com
In reply to: Josh Berkus (#27)
#29 Andrew Dunstan
andrew@dunslane.net
In reply to: Robert Haas (#28)
#30 Chris Browne
cbbrowne@acm.org
In reply to: Joseph Adams (#1)
#31 Joseph Adams
joeyadams3.14159@gmail.com
In reply to: Chris Browne (#30)
#32 Petr Jelinek
petr@2ndquadrant.com
In reply to: Joseph Adams (#31)
#33 David E. Wheeler
david@kineticode.com
In reply to: Petr Jelinek (#32)
#34 Joseph Adams
joeyadams3.14159@gmail.com
In reply to: David E. Wheeler (#33)
#35 Mike Rylander
mrylander@gmail.com
In reply to: Joseph Adams (#34)
#36 Andrew Dunstan
andrew@dunslane.net
In reply to: Mike Rylander (#35)
#37 Joseph Adams
joeyadams3.14159@gmail.com
In reply to: Andrew Dunstan (#36)
#38 Robert Haas
robertmhaas@gmail.com
In reply to: Joseph Adams (#37)
#39 Petr Jelinek
petr@2ndquadrant.com
In reply to: Joseph Adams (#37)
#40 Petr Jelinek
petr@2ndquadrant.com
In reply to: Joseph Adams (#1)
#41 Mike Rylander
mrylander@gmail.com
In reply to: Joseph Adams (#37)
#42 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Joseph Adams (#37)
#43 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Petr Jelinek (#40)
#44 Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#42)
#45 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#43)
#46 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#45)
#47 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#46)
#48 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#47)
#49 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#48)
#50 Yeb Havinga
yebhavinga@gmail.com
In reply to: Tom Lane (#46)
#51 Joseph Adams
joeyadams3.14159@gmail.com
In reply to: Tom Lane (#43)
#52 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joseph Adams (#51)