Re: patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

Started by Terry Laurenzo · over 15 years ago · 38 messages · pgsql-hackers
#1 Terry Laurenzo
tj@laurenzo.org

Hi all -
I independently started some work on a similar capability as was contributed
back in August by Joey Adams for a json datatype. Before starting, I did a
quick search but for some reason didn't turn this existing thread up.

What I've been working on is out on github for now:
http://github.com/tlaurenzo/pgjson

When I started, I was actually aiming for something else, and got caught up
going down this rabbit hole. I took a different design approach, making the
internal form be an extended BSON stream and implementing event-driven
parsing and serializing to the different formats. There was some discussion
in the original thread around storing plain text vs a custom format. I have
to admit I've been back and forth a couple of times on this and have come to
like a BSON-like format for the data at rest.

Pros:
- It is directly iterable without parsing and/or constructing an AST
- It is its own representation. If iterating and you want to tear off a
value to be returned or used elsewhere, it's a simple buffer copy plus some
bit twiddling.
- It is conceivable that clients already know how to deal with BSON,
allowing them to work with the internal form directly (a la MongoDB)
- It stores a wider range of primitive types than JSON-text. The most
important are Date and binary.

Cons:
- The format appears to have been "grown". Some of the critical
decisions were made late in the game (i.e. why would your integral type
codes be added last?)
- Natively, the format falls victim to the artificial document vs element
distinction, which I never understood. I've worked around this with an
escape mechanism for representing root values, but it's not great.
- The processor is not resilient in the face of unknown element types
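
To make the "directly iterable" pro and the "unknown element types" con concrete, here is a minimal sketch of walking the top-level elements of a BSON document. The layout follows the public BSON spec (int32 total size, then elements, then a trailing 0x00; each element is a type byte, a NUL-terminated key, and a type-specific payload). This is illustrative only, not code lifted from the pgjson repo:

```c
#include <stdint.h>
#include <string.h>

/* BSON is little-endian; read an int32 without assuming host byte order. */
static int32_t read_i32(const uint8_t *p)
{
    return (int32_t)((uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                     ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24));
}

/*
 * Walk the top-level elements of a BSON document without building an AST.
 * Returns the number of elements, or -1 on an unknown type tag.
 */
static int bson_count_elements(const uint8_t *doc)
{
    const uint8_t *p = doc + 4;
    const uint8_t *end = doc + read_i32(doc) - 1;   /* -1: trailing 0x00 */
    int count = 0;

    while (p < end)
    {
        uint8_t type = *p++;
        p += strlen((const char *) p) + 1;          /* skip the key */
        switch (type)
        {
            case 0x01: p += 8; break;               /* double */
            case 0x02: p += 4 + read_i32(p); break; /* string: len + bytes */
            case 0x08: p += 1; break;               /* boolean */
            case 0x0A: break;                       /* null: no payload */
            case 0x10: p += 4; break;               /* int32 */
            case 0x12: p += 8; break;               /* int64 */
            default:   return -1;                   /* unknown: cannot skip */
        }
        count++;
    }
    return count;
}
```

The default case is exactly the resilience problem: a BSON tag alone doesn't tell you how many payload bytes to skip, so one unknown type stops the walk cold.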

I'm leaning towards thinking that the determination comes down to the
following:
- If you just want a "checkbox" item that the database has a json
datatype and some support functions, storing as text may make sense. It can
be much simpler; however, it becomes increasingly hard to do anything real
without adding a parse-to-AST, manipulate, dump-to-text cycle to every
function.
- If you want a json datatype that is highly integrated and manipulable,
you want a binary datastructure and in the absence of any other contender in
this area, BSON is ok (not great, but ok).
- The addition of a JavaScript procedural language probably does not
bring its own format for data at rest. All of the engines I know of (I
haven't looked at what Guile is doing) do not have a static representation
for internal data structures. They are heap objects with liberal use of
internal and external pointers. Most do have a mechanism, however, for
injecting foreign objects into the runtime without resorting to making a
dumb copy. As such, the integration approach would probably be to determine
the best format for JSON data at rest and provide adapters to the chosen
JavaScript runtime to manipulate this at-rest format directly (potentially
using a copy on write approach). If the at-rest format is Text, then you
would need to do a parse-to-AST step for each JavaScript function
invocation.

Here's a few notes on my current implementation:
- Excessive use of lex/yacc: This was quick and easy but the grammars are
simple enough that I'd probably hand-code a parser for any final solution.
- When the choice between following the json.org spec to the letter and
implementing lenient parsing for valid JavaScript constructs arose, I chose
lenient.
- Too much buffer copying: When I started, I was just doodling with
writing C code to manipulate JSON/BSON and not working with postgres in
particular. As such, it all uses straight malloc/free and too many copies
are made to get things in and out of VARDATA structures. This would all be
eliminated in any real version.
- UTF-8 is supported but not fully working. The support
functions that Joey wrote do a better job at this.
- My json path evaluation is crippled. Given the integration with the PG
type system, I thought I just wanted a simple property traversal mechanism,
punting higher level manipulation to native PG functions. Seeing real
JSONPath work, though, I'm not so sure. I like the simplicity of what I've
done but the features of the full bit are nice too.
- This is first-pass prototype code with the goal of seeing it all
working together.

While I had an end in mind, I did a lot of this for the fun of it and to
just scratch an itch, so I'm not really advocating for anything at this
point. I'm curious as to what others think the state of JSON and Postgres
should be. I've worked with JavaScript engines a good deal and would be
happy to help get us there, either using some of the work/approaches here or
going in a different direction.

Terry

#2 Hitoshi Harada
umi.tanuki@gmail.com
In reply to: Terry Laurenzo (#1)

2010/10/17 Terry Laurenzo <tj@laurenzo.org>:

Hi all -
I independently started some work on a similar capability as was contributed
back in August by Joey Adams for a json datatype.  Before starting, I did a
quick search but for some reason didn't turn this existing thread up.
What I've been working on is out on github for
now: http://github.com/tlaurenzo/pgjson
When I started, I was actually aiming for something else, and got caught up
going down this rabbit hole.  I took a different design approach, making the
internal form be an extended BSON stream and implementing event-driven
parsing and serializing to the different formats.  There was some discussion
in the original thread around storing plain text vs a custom format.  I have
to admit I've been back and forth a couple of times on this and have come to
like a BSON-like format for the data at rest.

Reading your proposal, I'm now +1 for the BSON-like style. Especially the JS
engines' capability to map external data to the language
representation is good news. I agree the mapping is the engine's task,
not the data format's task. I'm not sure if your BSON-like format is more
efficient in terms of space and time than plain text, though.

I like as simple a design as we can accept. ISTM the format, I/O interface,
simple get/set, mapping tuples from/to objects, and indexing are the minimum
requirements. Something like JSONPath, aggregates, hstore conversion
and whatsoever sounds like too much.

Regards,

--
Hitoshi Harada

#3 Itagaki Takahiro
itagaki.takahiro@gmail.com
In reply to: Hitoshi Harada (#2)

On Sun, Oct 17, 2010 at 5:18 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:

Reading your proposal, I'm now +1 for BSON-like style. Especially JS
engine's capabilities to map external data to the language
representation are good news.

Hmm, we could store postgres' data types as-is with their type oids.
I'm not sure whether it is efficient or worth doing, though.

I like as simple design as we can accept. ISTM format, I/O interface,
simple get/set, mapping tuple from/to object, and indexing are minimum
requirement.

+1 to small start, but simple get/set are already debatable...
For example, text/json conversion:
A. SELECT '<json>'::json;
B. SELECT '<text>'::text::json;

In the git repo, A calls parse_json_to_bson_as_vardata(), so the input
should be a json format. OTOH, B calls pgjson_json_from_text(), so the
input can be any text. Those behaviors are surprising. I think we have
no other choice but to define text-to-json cast as parsing. The same
can be said for json-to-text -- type-output function vs. extracting
text value from json.

I think casting text to/from json should behave in the same way as type
input/output. The xml type works in the same manner. And if so, we might
not have any casts to/from json for consistency, even though there are
no problems in casts for non-text types.

I'll list issues before we start json types even in the simplest cases:
----
1. where to implement json core: external library vs. inner postgres
2. internal format: text vs. binary (*)
3. encoding: always UTF-8 vs. database encoding (*)
4. meaning of casts text to/from json: parse/stringify vs. get/set
5. parser implementation: flex/bison vs. hand-coded.
----
(*) Note that we will want to compare two json values in the future. So,
we might need to normalize the internal format even in a text representation.

The most interesting parts of json types, including indexing and jsonpath,
would be made on the json core. We need conclusions about those issues.

--
Itagaki Takahiro

#4 Terry Laurenzo
tj@laurenzo.org
In reply to: Itagaki Takahiro (#3)

I like as simple design as we can accept. ISTM format, I/O interface,
simple get/set, mapping tuple from/to object, and indexing are minimum
requirement.

+1 to small start, but simple get/set are already debatable...
For example, text/json conversion:
A. SELECT '<json>'::json;
B. SELECT '<text>'::text::json;

In the git repo, A calls parse_json_to_bson_as_vardata(), so the input
should be a json format. OTOH, B calls pgjson_json_from_text(), so the
input can be any text. Those behaviors are surprising. I think we have
no other choice but to define text-to-json cast as parsing. The same
can be said for json-to-text -- type-output function vs. extracting
text value from json.

I think casting text to/from json should behave in the same way as type
input/output. The xml type works in the same manner. And if so, we might
not have any casts to/from json for consistency, even though there are
no problems in casts for non-text types.

I just reworked some of this last night, so I'm not sure which version you
are referring to (new version has a pgplugin/jsoncast.c source file). I was
basically circling around the same thing as you trying to find something
that felt natural and not confusing. I agree that we don't have much of a
choice to keep in/out functions as parse/serialize and that then introducing
casts that do differently creates confusion. When I was playing with it, I
was getting confused, and I wrote it. :)

An alternative to pg casting to extract postgres values could be to
introduce analogs to JavaScript constructors, which is the JavaScript way to
cast. For example: String(json), Number(json), Date(json). This would feel
natural to a JavaScript programmer and would be explicit and non-surprising:
A. SELECT String('{a: 1, b:2}'::json -> 'a') (Returns postgres text)
B. SELECT Number('1'::json) (Returns postgres decimal)

I think that the most common use case for this type of thing in the DB will
be to extract a JSON scalar as a postgres scalar.

The inverse, while probably less useful, is currently represented by the
json_from_* functions. We could collapse all of these down to one
overloaded function, say ToJson(...):
A. SELECT ToJson(1) (Would return a json type with an int32 "1" value)
B. SELECT ToJson('Some String') (Would return a json type with a string
value)

There might be some syntactic magic we could do by adding an intermediate
jsonscalar type, but based on trying real cases with this stuff, you always
end up having to be explicit about your conversions anyway. Having implicit
type coercion from this polymorphic type tends to make things confusing,
imo.

I'll list issues before we start json types even in the simplest cases:
----
1. where to implement json core: external library vs. inner postgres
2. internal format: text vs. binary (*)
3. encoding: always UTF-8 vs. database encoding (*)
4. meaning of casts text to/from json: parse/stringify vs. get/set
5. parser implementation: flex/bison vs. hand-coded.
----
(*) Note that we would have comparison two json values in the future. So,
we might need to normalize the internal format even in text representation.

The most interesting parts of json types, including indexing and jsonpath,
would be made on the json core. We need conclusions about those issues.

My opinions or ramblings on the above:
1. There's a fair bit of code involved for something that many are going
to gloss over. I can think of pros/cons for external/internal/contrib and
I'm not sure which I would choose.
2. I'm definitely in the binary camp, but part of the reason for building
it out was to try it with some real world cases to get a feel for
performance implications end to end. We make heavy use of MongoDB at the
office and I was thinking it might make sense to strip some of those cases
down and see how they would be implemented in this context. I'll write up
more thoughts on how I think text/binary should perform for various cases
tonight.
3. I think if we go with binary, we should always store UTF-8 in the
binary structure. Otherwise, we just have too much of the guts of the
binary left to the whim of the database encoding. As currently implemented,
all strings generated by the in/out functions should be escaped so that they
are pure ascii (not quite working, but there in theory). JSON is by
definition UTF-8, and in this case, I think it trumps database encoding.
4. My thoughts on the casts are above.
5. There seems to be a lot of runtime and code size overhead inherent in
the flex/bison parsers, especially considering that they will most
frequently be invoked for very small streams. Writing a good hand-coded
parser for comparison is just a matter of which bottle of wine to choose
prior to spending the hours coding it, and I would probably defer the
decision until later.
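
For a sense of scale, a hand-coded recursive-descent JSON validator really is an evening's work. The sketch below is hypothetical (it is not the lex/yacc grammar from the repo), and it is lenient in the same spirit described above rather than enforcing every corner of the json.org spec:

```c
#include <ctype.h>
#include <string.h>
#include <stddef.h>

static const char *parse_value(const char *p);

static const char *skip_ws(const char *p)
{
    while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r')
        p++;
    return p;
}

/* p points at the opening quote; returns position after the closing quote. */
static const char *parse_string(const char *p)
{
    if (*p++ != '"')
        return NULL;
    while (*p && *p != '"')
    {
        if (*p == '\\')
        {
            if (!p[1])
                return NULL;
            p++;                        /* skip the escaped character */
        }
        p++;
    }
    return (*p == '"') ? p + 1 : NULL;
}

static const char *parse_number(const char *p)
{
    int digits = 0;
    if (*p == '-')
        p++;
    while (isdigit((unsigned char) *p)) { p++; digits++; }
    if (*p == '.')
    {
        p++;
        while (isdigit((unsigned char) *p)) { p++; digits++; }
    }
    if (*p == 'e' || *p == 'E')
    {
        p++;
        if (*p == '+' || *p == '-')
            p++;
        while (isdigit((unsigned char) *p)) { p++; digits++; }
    }
    return digits ? p : NULL;
}

static const char *parse_value(const char *p)
{
    p = skip_ws(p);
    switch (*p)
    {
        case '"':
            return parse_string(p);
        case '{':
        case '[':
        {
            char open = *p, close = (open == '{') ? '}' : ']';
            p = skip_ws(p + 1);
            if (*p == close)
                return p + 1;
            for (;;)
            {
                if (open == '{')        /* members need a "key": prefix */
                {
                    p = parse_string(skip_ws(p));
                    if (!p)
                        return NULL;
                    p = skip_ws(p);
                    if (*p++ != ':')
                        return NULL;
                }
                p = parse_value(p);
                if (!p)
                    return NULL;
                p = skip_ws(p);
                if (*p == ',') { p++; continue; }
                return (*p == close) ? p + 1 : NULL;
            }
        }
        case 't': return strncmp(p, "true", 4) == 0 ? p + 4 : NULL;
        case 'f': return strncmp(p, "false", 5) == 0 ? p + 5 : NULL;
        case 'n': return strncmp(p, "null", 4) == 0 ? p + 4 : NULL;
        default:  return parse_number(p);
    }
}

/* Returns 1 if s is a single well-formed JSON value, 0 otherwise. */
int json_valid(const char *s)
{
    const char *p = parse_value(s);
    return p != NULL && *skip_ws(p) == '\0';
}
```

No tables, no generated states, and the whole thing is small enough to keep in your head, which is the appeal over flex/bison for tiny input streams.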

Terry

#5 Robert Haas
robertmhaas@gmail.com
In reply to: Terry Laurenzo (#1)

On Sat, Oct 16, 2010 at 12:59 PM, Terry Laurenzo <tj@laurenzo.org> wrote:

   - It is directly iterable without parsing and/or constructing an AST
   - It is its own representation.  If iterating and you want to tear-off a
value to be returned or used elsewhere, its a simple buffer copy plus some
bit twiddling.
   - It is conceivable that clients already know how to deal with BSON,
allowing them to work with the internal form directly (ala MongoDB)
   - It stores a wider range of primitive types than JSON-text.  The most
important are Date and binary.

When last I looked at that, it appeared to me that what BSON could
represent was a subset of what JSON could represent - in particular,
that it had things like a 32-bit limit on integers, or something along
those lines. Sounds like it may be neither a superset nor a subset,
in which case I think it's a poor choice for an internal
representation of JSON.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6 Andrew Dunstan
andrew@dunslane.net
In reply to: Robert Haas (#5)

On 10/19/2010 10:44 AM, Robert Haas wrote:

On Sat, Oct 16, 2010 at 12:59 PM, Terry Laurenzo<tj@laurenzo.org> wrote:

- It is directly iterable without parsing and/or constructing an AST
- It is its own representation. If iterating and you want to tear-off a
value to be returned or used elsewhere, its a simple buffer copy plus some
bit twiddling.
- It is conceivable that clients already know how to deal with BSON,
allowing them to work with the internal form directly (ala MongoDB)
- It stores a wider range of primitive types than JSON-text. The most
important are Date and binary.

When last I looked at that, it appeared to me that what BSON could
represent was a subset of what JSON could represent - in particular,
that it had things like a 32-bit limit on integers, or something along
those lines. Sounds like it may be neither a superset nor a subset,
in which case I think it's a poor choice for an internal
representation of JSON.

Yeah, if it can't handle arbitrary precision numbers, as has previously
been stated, it's dead in the water for our purposes, I think.

cheers

andrew

#7 Terry Laurenzo
tj@laurenzo.org
In reply to: Andrew Dunstan (#6)

Agreed. BSON was born out of implementations that either lacked arbitrary
precision numbers or had a strong affinity to an int/floating point way of
thinking about numbers. I believe that if BSON had an arbitrary precision
number type, it would be a proper superset of JSON.

As an aside, the max range of an int in BSON is 64 bits. Back to my original
comment that BSON was "grown" instead of designed, it looks like both the
32bit and 64bit integers were added late in the game and that the original
designers perhaps were just going to store all numbers as doubles.

Perhaps we should enumerate the attributes of what would make a good binary
encoding?

Terry

On Tue, Oct 19, 2010 at 8:57 AM, Andrew Dunstan <andrew@dunslane.net> wrote:

On 10/19/2010 10:44 AM, Robert Haas wrote:

On Sat, Oct 16, 2010 at 12:59 PM, Terry Laurenzo<tj@laurenzo.org> wrote:

- It is directly iterable without parsing and/or constructing an AST
- It is its own representation. If iterating and you want to tear-off a
value to be returned or used elsewhere, its a simple buffer copy plus some
bit twiddling.
- It is conceivable that clients already know how to deal with BSON,
allowing them to work with the internal form directly (ala MongoDB)
- It stores a wider range of primitive types than JSON-text. The most
important are Date and binary.

When last I looked at that, it appeared to me that what BSON could
represent was a subset of what JSON could represent - in particular,
that it had things like a 32-bit limit on integers, or something along
those lines. Sounds like it may be neither a superset nor a subset,
in which case I think it's a poor choice for an internal
representation of JSON.

Yeah, if it can't handle arbitrary precision numbers as has previously been
stated it's dead in the water for our purposes, I think.

cheers

andrew

#8 Robert Haas
robertmhaas@gmail.com
In reply to: Terry Laurenzo (#7)

On Tue, Oct 19, 2010 at 11:22 AM, Terry Laurenzo <tj@laurenzo.org> wrote:

Agreed.  BSON was born out of implementations that either lacked arbitrary
precision numbers or had a strong affinity to an int/floating point way of
thinking about numbers.  I believe that if BSON had an arbitrary precision
number type, it would be a proper superset of JSON.

As an aside, the max range of an int in BSON 64bits.  Back to my original
comment that BSON was "grown" instead of designed, it looks like both the
32bit and 64bit integers were added late in the game and that the original
designers perhaps were just going to store all numbers as double.

Perhaps we should enumerate the attributes of what would make a good binary
encoding?

I think we should take a few steps back and ask why we think that
binary encoding is the way to go. We store XML as text, for example,
and I can't remember any complaints about that on -bugs or
-performance, so why do we think JSON will be different? Binary
encoding is a trade-off. A well-designed binary encoding should make
it quicker to extract a small chunk of a large JSON object and return
it; however, it will also make it slower to return the whole object
(because you're adding serialization overhead). I haven't seen any
analysis of which of those use cases is more important and why.

I am also wondering how this proposed binary encoding scheme will
interact with TOAST. If the datum is compressed on disk, you'll have
to decompress it anyway to do anything with it; at that point, is
there still going to be a noticeable speed-up from using the binary
encoding?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9 David E. Wheeler
david@kineticode.com
In reply to: Robert Haas (#8)

On Oct 19, 2010, at 12:17 PM, Robert Haas wrote:

I think we should take a few steps back and ask why we think that
binary encoding is the way to go. We store XML as text, for example,
and I can't remember any complaints about that on -bugs or
-performance, so why do we think JSON will be different? Binary
encoding is a trade-off. A well-designed binary encoding should make
it quicker to extract a small chunk of a large JSON object and return
it; however, it will also make it slower to return the whole object
(because you're adding serialization overhead). I haven't seen any
analysis of which of those use cases is more important and why.

Maybe someone has numbers on that for the XML type?

Best,

David

#10 Joseph Adams
joeyadams3.14159@gmail.com
In reply to: Terry Laurenzo (#7)

On Tue, Oct 19, 2010 at 11:22 AM, Terry Laurenzo <tj@laurenzo.org> wrote:

Perhaps we should enumerate the attributes of what would make a good binary
encoding?

Not sure if we're discussing the internal storage format or the binary
send/recv format, but in my humble opinion, some attributes of a good
internal format are:

1. Lightweight - it'd be really nice for the JSON datatype to be
available in core (even if extra features like JSONPath aren't).
2. Efficiency - Retrieval and storage of JSON datums should be
efficient. The internal format should probably closely resemble the
binary send/recv format so there's a good reason to use it.

A good attribute of the binary send/recv format would be
compatibility. For instance, if MongoDB (which I know very little
about) has binary send/receive, perhaps the JSON data type's binary
send/receive should use it.

Efficient retrieval and update of values in a large JSON tree would be
cool, but would be rather complex, and IMHO, overkill. JSON's main
advantage is that it's sort of a least common denominator of the type
systems of many popular languages, making it easy to transfer
information between them. Having hierarchical key/value store support
would be pretty cool, but I don't think it's what PostgreSQL's JSON
data type should do.

On Tue, Oct 19, 2010 at 3:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I think we should take a few steps back and ask why we think that
binary encoding is the way to go. We store XML as text, for example,
and I can't remember any complaints about that on -bugs or
-performance, so why do we think JSON will be different? Binary
encoding is a trade-off. A well-designed binary encoding should make
it quicker to extract a small chunk of a large JSON object and return
it; however, it will also make it slower to return the whole object
(because you're adding serialization overhead). I haven't seen any
analysis of which of those use cases is more important and why.

Speculation: the overhead involved with retrieving/sending and
receiving/storing JSON (not to mention TOAST
compression/decompression) will be far greater than that of
serializing/unserializing.

#11 Robert Haas
robertmhaas@gmail.com
In reply to: Joseph Adams (#10)

On Tue, Oct 19, 2010 at 3:40 PM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:

On Tue, Oct 19, 2010 at 3:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I think we should take a few steps back and ask why we think that
binary encoding is the way to go.  We store XML as text, for example,
and I can't remember any complaints about that on -bugs or
-performance, so why do we think JSON will be different?  Binary
encoding is a trade-off.  A well-designed binary encoding should make
it quicker to extract a small chunk of a large JSON object and return
it; however, it will also make it slower to return the whole object
(because you're adding serialization overhead).  I haven't seen any
analysis of which of those use cases is more important and why.

Speculation: the overhead involved with retrieving/sending and
receiving/storing JSON (not to mention TOAST
compression/decompression) will be far greater than that of
serializing/unserializing.

I speculate that your speculation is incorrect. AIUI, we, unlike
$COMPETITOR, tend to be CPU-bound rather than IO-bound on COPY. But
perhaps less speculation and more benchmarking is in order.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12 Terry Laurenzo
tj@laurenzo.org
In reply to: Robert Haas (#11)

On Tue, Oct 19, 2010 at 2:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Oct 19, 2010 at 3:40 PM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:

On Tue, Oct 19, 2010 at 3:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I think we should take a few steps back and ask why we think that
binary encoding is the way to go. We store XML as text, for example,
and I can't remember any complaints about that on -bugs or
-performance, so why do we think JSON will be different? Binary
encoding is a trade-off. A well-designed binary encoding should make
it quicker to extract a small chunk of a large JSON object and return
it; however, it will also make it slower to return the whole object
(because you're adding serialization overhead). I haven't seen any
analysis of which of those use cases is more important and why.

Speculation: the overhead involved with retrieving/sending and
receiving/storing JSON (not to mention TOAST
compression/decompression) will be far greater than that of
serializing/unserializing.

I speculate that your speculation is incorrect. AIUI, we, unlike
$COMPETITOR, tend to be CPU-bound rather than IO-bound on COPY. But
perhaps less speculation and more benchmarking is in order.

After spending a week in the morass of this, I have to say that I am less
certain than I was on any front regarding the text/binary distinction. I'll
take some time and benchmark different cases. My hypothesis is that a well
implemented binary structure and conversions will add minimal overhead in
the IO + Validate case which would be the typical in/out flow. It could be
substantially faster for binary send/receive because the validation step
could be eliminated/reduced. Further storing as binary reduces the overhead
of random access to the data by database functions.

I'm envisioning staging this up as follows:
1. Create a "jsontext". jsontext uses text as its internal
representation. in/out functions are essentially a straight copy or a copy
+ validate.
2. Create a "jsonbinary" type. This uses an optimized binary format for
internal rep and send/receive. in/out is a parse/transcode operation to
standard JSON text.
3. Internal data access functions and JSON Path require a jsonbinary.
4. There are implicit casts to/from jsontext and jsonbinary.

I've got a grammar in mind for the binary structure that I'll share later
when I've got some more time. It's inspired by $COMPETITOR's format but a
little more sane, using type tags that implicitly define the size of the
operands, simplifying parsing.
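
One way such tags could work (the tag values below are invented for illustration, not the grammar I have in mind): reserve the low bits of every tag for a size class, so a reader can compute any element's payload size, even for types it has never heard of, without a per-type switch:

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Hypothetical self-describing type tags: the low 3 bits of every tag are
 * a size class, so payload sizes are implicit in the tag itself.
 * Classes 0..4 mean fixed payloads of 0/1/2/4/8 bytes; class 5 means an
 * int32 length prefix follows.
 */
#define SIZE_CLASS(tag) ((tag) & 0x07)

enum {
    TAG_NULL   = 0x00 | 0,
    TAG_BOOL   = 0x08 | 1,
    TAG_INT32  = 0x10 | 3,
    TAG_INT64  = 0x18 | 4,
    TAG_DOUBLE = 0x20 | 4,
    TAG_STRING = 0x28 | 5
};

/* Bytes occupied by the payload at p for a given tag; p is only examined
 * for length-prefixed classes. */
static size_t payload_size(uint8_t tag, const uint8_t *p)
{
    static const size_t fixed[] = {0, 1, 2, 4, 8};
    unsigned cls = SIZE_CLASS(tag);

    if (cls <= 4)
        return fixed[cls];
    /* length-prefixed: little-endian int32 count of payload bytes follows */
    return 4 + (size_t)((uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                        ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24));
}
```

A processor built this way can skip tags it doesn't understand as long as the size class is honest, which addresses the resilience complaint raised about BSON earlier in the thread.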

I'll then define the various use cases and benchmark using the different
types. Some examples include IO No Validate, IO+Validate, Store and
Index, Internal Processing, Internal Composition, etc.

The answer may be to have both a jsontext and jsonbinary type as each will
be optimized for a different case.

Make sense? It may be a week before I get through this.
Terry

#13 Greg Stark
gsstark@mit.edu
In reply to: Robert Haas (#8)

On Tue, Oct 19, 2010 at 12:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I think we should take a few steps back and ask why we think that
binary encoding is the way to go.  We store XML as text, for example,
and I can't remember any complaints about that on -bugs or
-performance, so why do we think JSON will be different?  Binary
encoding is a trade-off.  A well-designed binary encoding should make
it quicker to extract a small chunk of a large JSON object and return
it; however, it will also make it slower to return the whole object
(because you're adding serialization overhead).  I haven't seen any
analysis of which of those use cases is more important and why.

The elephant in the room is if the binary encoded form is smaller then
it occupies less ram and disk bandwidth to copy it around. If your
database is large that alone is the dominant factor. Doubling the size
of all the objects in your database means halving the portion of the
database that fits in RAM and doubling the amount of I/O required to
complete any given operation. If your database fits entirely in RAM
either way then it still means less RAM bandwidth used which is often
the limiting factor but depending on how much cpu effort it takes to
serialize and deserialize the balance could shift either way.

--
greg

#14 Pavel Stehule
pavel.stehule@gmail.com
In reply to: Terry Laurenzo (#12)

After spending a week in the morass of this, I have to say that I am less
certain than I was on any front regarding the text/binary distinction.  I'll
take some time and benchmark different cases.  My hypothesis is that a well
implemented binary structure and conversions will add minimal overhead in
the IO + Validate case which would be the typical in/out flow.  It could be
substantially faster for binary send/receive because the validation step
could be eliminated/reduced.  Further storing as binary reduces the overhead
of random access to the data by database functions.

I'm envisioning staging this up as follows:
   1. Create a "jsontext".  jsontext uses text as its internal
representation.  in/out functions are essentially a straight copy or a copy
+ validate.
   2. Create a "jsonbinary" type.  This uses an optimized binary format for
internal rep and send/receive.  in/out is a parse/transcode operation to
standard JSON text.
   3. Internal data access functions and JSON Path require a jsonbinary.
   4. There are implicit casts to/from jsontext and jsonbinary.

some years ago I solved similar problems with the xml type. I think
you have to account for a few factors:

a) all varlena types are compressed - you cannot get at or update just
an interesting fragment; every time, pg works with the complete
document

b) access to a fragment of a JSON or XML document is not really
important, because fast access to data is solved via indexes.

c) only a few APIs allow binary communication between server and client.
Almost all interfaces use only the text-based API. I see a possibly
interesting direction for a binary protocol when someone uses a
javascript driver, or uses pg in some javascript server-side
environment, but that isn't often the case now.

Regards

Pavel

I've got a grammar in mind for the binary structure that I'll share later
when I've got some more time.  It's inspired by $COMPETITOR's format but a
little more sane, using type tags that implicitly define the size of the
operands, simplifying parsing.

I'll then define the various use cases and benchmark using the different
types.  Some examples include such as IO No Validate, IO+Validate, Store and
Index, Internal Processing, Internal Composition, etc.

The answer may be to have both a jsontext and jsonbinary type as each will
be optimized for a different case.

Make sense?  It may be a week before I get through this.
Terry

#15 Pavel Stehule
pavel.stehule@gmail.com
In reply to: Bruce Momjian (#13)

2010/10/19 Greg Stark <gsstark@mit.edu>:

On Tue, Oct 19, 2010 at 12:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I think we should take a few steps back and ask why we think that
binary encoding is the way to go.  We store XML as text, for example,
and I can't remember any complaints about that on -bugs or
-performance, so why do we think JSON will be different?  Binary
encoding is a trade-off.  A well-designed binary encoding should make
it quicker to extract a small chunk of a large JSON object and return
it; however, it will also make it slower to return the whole object
(because you're adding serialization overhead).  I haven't seen any
analysis of which of those use cases is more important and why.

The elephant in the room is if the binary encoded form is smaller then
it occupies less ram and disk bandwidth to copy it around. If your
database is large that alone is the dominant factor. Doubling the size
of all the objects in your database means halving the portion of the
database that fits in RAM and doubling the amount of I/O required to
complete any given operation. If your database fits entirely in RAM
either way then it still means less RAM bandwidth used which is often
the limiting factor but depending on how much cpu effort it takes to
serialize and deserialize the balance could shift either way.

I am not sure if this argument is important for json. This protocol
does not have much overhead, and json documents are pretty small. More -
from 9.0, TOAST uses relatively aggressive compression. I would like
some standardised format for json inside pg too, but without using
an external library I don't see an advantage to using a format other
than text.

Regards

Pavel


--
greg

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Terry Laurenzo (#12)

Terry Laurenzo <tj@laurenzo.org> writes:

After spending a week in the morass of this, I have to say that I am less
certain than I was on any front regarding the text/binary distinction. I'll
take some time and benchmark different cases. My hypothesis is that a well
implemented binary structure and conversions will add minimal overhead in
the IO + Validate case which would be the typical in/out flow. It could be
substantially faster for binary send/receive because the validation step
could be eliminated/reduced.

I think that arguments proceeding from speed of binary send/receive
aren't worth the electrons they're written on, because there is nothing
anywhere that says what the binary format ought to be. In the case of
XML we're just using the text representation as the binary format too,
and nobody's complained about that. If we were to choose to stick with
straight text internally for a JSON type, we'd do the same thing, and
again nobody would complain.

So, if you want to make a case for using some binary internal format or
other, make it without that consideration.

I'm envisioning staging this up as follows:
1. Create a "jsontext". jsontext uses text as its internal
representation. in/out functions are essentially a straight copy or a copy
+ validate.
2. Create a "jsonbinary" type. This uses an optimized binary format for
internal rep and send/receive. in/out is a parse/transcode operation to
standard JSON text.

Ugh. Please don't. JSON should be JSON, and nothing else. Do you see
any other datatypes in Postgres that expose such internal considerations?

regards, tom lane

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#13)

Greg Stark <gsstark@mit.edu> writes:

The elephant in the room is if the binary encoded form is smaller then
it occupies less ram and disk bandwidth to copy it around.

It seems equally likely that a binary-encoded form could be larger
than the text form (that's often true for our other datatypes).
Again, this is an argument that would require experimental evidence
to back it up.

regards, tom lane

#18Terry Laurenzo
tj@laurenzo.org
In reply to: Tom Lane (#16)

On Tue, Oct 19, 2010 at 4:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Terry Laurenzo <tj@laurenzo.org> writes:

After spending a week in the morass of this, I have to say that I am less
certain than I was on any front regarding the text/binary distinction.  I'll
take some time and benchmark different cases.  My hypothesis is that a well
implemented binary structure and conversions will add minimal overhead in
the IO + Validate case which would be the typical in/out flow.  It could be
substantially faster for binary send/receive because the validation step
could be eliminated/reduced.

I think that arguments proceeding from speed of binary send/receive
aren't worth the electrons they're written on, because there is nothing
anywhere that says what the binary format ought to be.  In the case of
XML we're just using the text representation as the binary format too,
and nobody's complained about that.  If we were to choose to stick with
straight text internally for a JSON type, we'd do the same thing, and
again nobody would complain.

So, if you want to make a case for using some binary internal format or
other, make it without that consideration.

I'm envisioning staging this up as follows:
1. Create a "jsontext".  jsontext uses text as its internal
representation.  in/out functions are essentially a straight copy or a
copy + validate.
2. Create a "jsonbinary" type.  This uses an optimized binary format
for internal rep and send/receive.  in/out is a parse/transcode
operation to standard JSON text.

Ugh.  Please don't.  JSON should be JSON, and nothing else.  Do you see
any other datatypes in Postgres that expose such internal considerations?

regards, tom lane

regards, tom lane

I don't think anyone here was really presenting arguments as yet. We're
several layers deep on speculation and everyone is saying that
experimentation is needed.

I've got my own reasons for going down this path for a solution I have in
mind. I had thought that some part of that might have been applicable to pg
core, but if not, that's no problem. For my own edification, I'm going to
proceed down this path and see where it leads. I'll let the list know what
I find out.

I can understand the sentiment that JSON should be JSON and nothing else
from a traditional database server's point of view, but there is nothing
sacrosanct about it in the broader context.

Terry

#19Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: David E. Wheeler (#9)

Excerpts from David E. Wheeler's message of Tue Oct 19 16:36:20 -0300 2010:

On Oct 19, 2010, at 12:17 PM, Robert Haas wrote:

I think we should take a few steps back and ask why we think that
binary encoding is the way to go. We store XML as text, for example,
and I can't remember any complaints about that on -bugs or
-performance, so why do we think JSON will be different? Binary
encoding is a trade-off. A well-designed binary encoding should make
it quicker to extract a small chunk of a large JSON object and return
it; however, it will also make it slower to return the whole object
(because you're adding serialization overhead). I haven't seen any
analysis of which of those use cases is more important and why.

Maybe someone has numbers on that for the XML type?

Like these?
http://exificient.sourceforge.net/?id=performance

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#20Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#17)

On Tue, Oct 19, 2010 at 6:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Greg Stark <gsstark@mit.edu> writes:

The elephant in the room is if the binary encoded form is smaller then
it occupies less ram and disk bandwidth to copy it around.

It seems equally likely that a binary-encoded form could be larger
than the text form (that's often true for our other datatypes).
Again, this is an argument that would require experimental evidence
to back it up.

That's exactly what I was thinking when I read Greg's email. I
designed something vaguely (very vaguely) like this many years ago and
the binary format that I worked so hard to create was enormous
compared to the text format, mostly because I had a lot of small
integers in the data I was serializing, and as it turns out,
representing {0,1,2} in less than 7 bytes is not very easy. It can
certainly be done if you set out to optimize for precisely those kinds
of cases, but I ended up with something awful like:

<4 byte type = list> <4 byte list length = 3> <4 byte type = integer>
<4 byte integer = 0> <4 byte type = integer> <4 byte integer = 1> <4
byte type = integer> <4 byte integer = 2>

= 32 bytes. Even if you were a little smarter than I was and used 2
byte integers (with some escape hatch allowing larger numbers to be
represented) it's still more than twice the size of the text
representation. Even if you use 1 byte integers it's still bigger.
To get it down to being smaller, you've got to do something like make
the high nibble of each byte a type field and the low nibble the first
4 payload bits. You can certainly do all of this but you could also
just store it as text and let the TOAST compression algorithm worry
about making it smaller.
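Robert's arithmetic is easy to reproduce. The following is a hedged sketch (not real PostgreSQL code, and the tag values are invented) comparing the naive 4-byte-word encoding he describes with the nibble-packed variant, against the 7-byte text form of {0,1,2}:

```python
import struct

def naive_encode(lst):
    # <4-byte type=list> <4-byte list length>, then per element
    # <4-byte type=integer> <4-byte integer value>
    out = struct.pack('<ii', 1, len(lst))
    for n in lst:
        out += struct.pack('<ii', 2, n)
    return out

def nibble_encode(lst):
    # High nibble of each byte is a type tag, low nibble carries the
    # first 4 payload bits: enough for a list of up to 15 integers
    # in the range 0..15.
    out = bytes([(0x1 << 4) | len(lst)])    # list tag + length
    for n in lst:
        out += bytes([(0x2 << 4) | n])      # integer tag + value
    return out

assert len(naive_encode([0, 1, 2])) == 32   # Robert's 32 bytes
assert len(nibble_encode([0, 1, 2])) == 4   # finally under the
assert len('{0,1,2}') == 7                  # 7-byte text form
```

The nibble trick wins here only because the values are tiny; it illustrates why the binary-vs-text size question needs measurement on realistic documents rather than intuition.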

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#21Terry Laurenzo
tj@laurenzo.org
In reply to: Robert Haas (#20)
#22Robert Haas
robertmhaas@gmail.com
In reply to: Terry Laurenzo (#21)
#23Itagaki Takahiro
itagaki.takahiro@gmail.com
In reply to: Terry Laurenzo (#12)
#24Terry Laurenzo
tj@laurenzo.org
In reply to: Itagaki Takahiro (#23)
#25Florian Weimer
fw@deneb.enyo.de
In reply to: Terry Laurenzo (#7)
#26Andrew Dunstan
andrew@dunslane.net
In reply to: Florian Weimer (#25)
#27Terry Laurenzo
tj@laurenzo.org
In reply to: Terry Laurenzo (#24)
#28Robert Haas
robertmhaas@gmail.com
In reply to: Terry Laurenzo (#27)
#29Terry Laurenzo
tj@laurenzo.org
In reply to: Robert Haas (#28)
#30Robert Haas
robertmhaas@gmail.com
In reply to: Terry Laurenzo (#29)
#31Terry Laurenzo
tj@laurenzo.org
In reply to: Robert Haas (#30)
#32Itagaki Takahiro
itagaki.takahiro@gmail.com
In reply to: Terry Laurenzo (#31)
#33Robert Haas
robertmhaas@gmail.com
In reply to: Itagaki Takahiro (#32)
#34Terry Laurenzo
tj@laurenzo.org
In reply to: Robert Haas (#33)
#35Terry Laurenzo
tj@laurenzo.org
In reply to: Terry Laurenzo (#34)
#36Robert Haas
robertmhaas@gmail.com
In reply to: Terry Laurenzo (#35)
#37David E. Wheeler
david@kineticode.com
In reply to: Terry Laurenzo (#35)
#38Andrew Dunstan
andrew@dunslane.net
In reply to: Terry Laurenzo (#35)