Why is writing JSONB faster than just JSON?

Started by Mitar · about 5 years ago · 12 messages · general
#1Mitar
mmitar@gmail.com

Hi!

I have a project where we among other data want to store static JSON
objects which can get pretty large (10-100 KB). I was trying to
evaluate how it would work if we simply store it as an additional
column in a PostgreSQL database. So I made a benchmark [1]. The
results surprised me a bit and I am writing here because I would like
to understand them. Namely, it looks like writing into a jsonb typed
column is 30% faster than writing into a json typed column. Why is
that? Does not jsonb require parsing of JSON and conversion? That
should be slower than just storing a blob as-is?

[1]: https://gitlab.com/mitar/benchmark-pg-json

Mitar

--
http://mitar.tnode.com/
https://twitter.com/mitar_m
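As a rough illustration of the kind of payload being benchmarked, here is a hypothetical generator sketch; the field names and sizes are made up for illustration and are not taken from the linked repository, but they produce compact JSON objects in the tens-of-kilobytes range described above:

```python
import json
import secrets

def make_payload(num_fields: int = 500, value_len: int = 64) -> str:
    """Build a compact JSON object of random hex-string values.

    With 500 fields of 64 hex chars each, the serialized form lands
    in the tens-of-kilobytes range described in the post.
    """
    obj = {f"field_{i}": secrets.token_hex(value_len // 2)
           for i in range(num_fields)}
    # separators=(",", ":") gives the most compact serialization
    return json.dumps(obj, separators=(",", ":"))

payload = make_payload()
print(len(payload))  # roughly 40 KB for the defaults above
```

Such a string would then be inserted as-is into a json, jsonb, or text column to compare write timings.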

#2Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Mitar (#1)
Re: Why is writing JSONB faster than just JSON?

On Tue, Apr 13, 2021 at 11:38:04PM -0700, Mitar wrote:
> Hi!
>
> I have a project where we among other data want to store static JSON
> objects which can get pretty large (10-100 KB). I was trying to
> evaluate how it would work if we simply store it as an additional
> column in a PostgreSQL database. So I made a benchmark [1]. The
> results surprised me a bit and I am writing here because I would like
> to understand them. Namely, it looks like writing into a jsonb typed
> column is 30% faster than writing into a json typed column. Why is
> that? Does not jsonb require parsing of JSON and conversion? That
> should be slower than just storing a blob as-is?
>
> [1] https://gitlab.com/mitar/benchmark-pg-json

Interesting. I've tried to reproduce the schema and queries from the repository
above (with different generated data, though, and without the app itself) and
indeed got somewhat slower (not 30%, but visibly) execution for the json column
than for jsonb. There are a couple of important points:

* Parsing happens in both cases; for json it's done for validation
purposes.

* Compression of the json data actually dominates the runtime for large
json objects, making other factors less visible and reducing the difference in
size (jsonb should also be a bit bigger, which is why I would have expected it
to be slower).

* At the same time, perf shows that json compression spends a bit more time in
pglz_find_match (mostly around accessing and checking history entries), which
is a compression optimization via a history table. So probably, due to
differences in data layout, this optimization works slightly worse for plain
json than for jsonb?
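The "parsing happens in both cases" point can be illustrated outside PostgreSQL. This is only a loose Python analogy for what the two input paths do (it is not the server's actual code): the json path parses purely to validate and keeps the original text, while the jsonb path parses and keeps the converted structure:

```python
import json

def store_as_json(text: str) -> str:
    """json-style input: parse purely to validate, then keep the
    original text unchanged (whitespace and all)."""
    json.loads(text)          # raises ValueError on invalid input
    return text               # stored exactly as received

def store_as_jsonb(text: str) -> object:
    """jsonb-style input: parse and keep the converted structure,
    discarding the original formatting."""
    return json.loads(text)   # stored as a parsed tree

raw = '{ "a" : 1 }'
print(repr(store_as_json(raw)))   # '{ "a" : 1 }' - formatting preserved
print(store_as_jsonb(raw))        # {'a': 1} - formatting gone
```

Either way the full parse cost is paid on input, which is why validation alone does not make json cheaper to write.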

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dmitry Dolgov (#2)
Re: Why is writing JSONB faster than just JSON?

Dmitry Dolgov <9erthalion6@gmail.com> writes:
> On Tue, Apr 13, 2021 at 11:38:04PM -0700, Mitar wrote:
>> ... Namely, it looks like writing into a jsonb typed
>> column is 30% faster than writing into a json typed column. Why is
>> that? Does not jsonb require parsing of JSON and conversion? That
>> should be slower than just storing a blob as-is?

> * Parsing is happening in both cases, for json it's done for validation
> purposes.

Right.

> * Compression of json data is actually dominates the runtime load for large
> json objects, making other factors less visible and reducing difference in
> size (jsonb also should be a bit bigger, that's why I would think it would be
> slower).
> * At the same time perf shows that json compression spends a bit more time in
> pglz_find_match (mostly around accessing and checking history entries), which
> is compression optimization via history table. So probably due to differences
> in data layout this optimization works slightly worse for plain json than for
> jsonb?

Interesting. I recall that we made some last-minute changes in the JSONB
physical representation after finding that the original choices resulted
in sucky compression behavior. But I think we were only looking at the
resultant compressed size, not time-to-compress.

My own guess about this, without having tried to reproduce it, is that
JSONB might end up physically smaller than JSON, resulting in less work
to push the toasted datum out to disk. This'd depend a lot on your
formatting habits for JSON, of course. But in any case, it'd be worth
comparing pg_column_size() results to see what's up with that.

regards, tom lane

#4Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Tom Lane (#3)
Re: Why is writing JSONB faster than just JSON?

On Wed, Apr 14, 2021 at 10:26:23AM -0400, Tom Lane wrote:
> Dmitry Dolgov <9erthalion6@gmail.com> writes:
>> On Tue, Apr 13, 2021 at 11:38:04PM -0700, Mitar wrote:
>>> ... Namely, it looks like writing into a jsonb typed
>>> column is 30% faster than writing into a json typed column. Why is
>>> that? Does not jsonb require parsing of JSON and conversion? That
>>> should be slower than just storing a blob as-is?
>
>> * Parsing is happening in both cases, for json it's done for validation
>> purposes.
>
> Right.
>
>> * Compression of json data is actually dominates the runtime load for large
>> json objects, making other factors less visible and reducing difference in
>> size (jsonb also should be a bit bigger, that's why I would think it would be
>> slower).
>> * At the same time perf shows that json compression spends a bit more time in
>> pglz_find_match (mostly around accessing and checking history entries), which
>> is compression optimization via history table. So probably due to differences
>> in data layout this optimization works slightly worse for plain json than for
>> jsonb?
>
> Interesting. I recall that we made some last-minute changes in the JSONB
> physical representation after finding that the original choices resulted
> in sucky compression behavior. But I think we were only looking at the
> resultant compressed size, not time-to-compress.
>
> My own guess about this, without having tried to reproduce it, is that
> JSONB might end up physically smaller than JSON, resulting in less work
> to push the toasted datum out to disk. This'd depend a lot on your
> formatting habits for JSON, of course. But in any case, it'd be worth
> comparing pg_column_size() results to see what's up with that.

Oh, of course: I missed that the input I was using was indeed
formatted; without formatting, both cases perform equally well and I
can't reproduce the issue. Although, if I understand correctly, the
original code in question doesn't actually do any formatting.
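The formatting effect is easy to see outside the server too. This sketch uses Python's zlib rather than PostgreSQL's pglz, so the absolute numbers differ from what the server would produce, but it shows how much extra input a pretty-printed serialization hands to the compressor for the same logical value:

```python
import json
import zlib

# Same logical value, two textual forms
obj = {f"key_{i}": f"value_{i}" * 8 for i in range(1000)}
compact = json.dumps(obj, separators=(",", ":")).encode()
pretty = json.dumps(obj, indent=4).encode()

# Raw sizes: pretty-printed text is substantially larger
print(len(compact), len(pretty))

# Both compress well, but the compressor has more bytes to chew
# through in the pretty-printed case
print(len(zlib.compress(compact)), len(zlib.compress(pretty)))
```

With pglz on the server side, that extra whitespace-laden input translates into extra compression work per row.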

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dmitry Dolgov (#4)
Re: Why is writing JSONB faster than just JSON?

Dmitry Dolgov <9erthalion6@gmail.com> writes:
> On Wed, Apr 14, 2021 at 10:26:23AM -0400, Tom Lane wrote:
>> My own guess about this, without having tried to reproduce it, is that
>> JSONB might end up physically smaller than JSON, resulting in less work
>> to push the toasted datum out to disk. This'd depend a lot on your
>> formatting habits for JSON, of course. But in any case, it'd be worth
>> comparing pg_column_size() results to see what's up with that.

> Oh, of course I've missed that the input I was using was indeed
> formatted, without formatting both cases perform equally well and I
> can't reproduce the issue. Although if I understand correctly the
> original code in question doesn't actually do any formatting.

My point was that for JSON, after validating that the input is
syntactically correct, we just store it as-received. So in particular
the amount of whitespace in the value would depend on how the client
had chosen to format the JSON. This'd affect the stored size of
course, and I think it would have an effect on compression time too.
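Tom's point is easy to demonstrate client-side: the same JSON value can arrive in very different textual sizes depending on how the client serializes it, and a json column stores whichever form it receives verbatim. A small Python illustration (client-side only, no PostgreSQL involved):

```python
import json

value = {"name": "example", "tags": ["a", "b", "c"], "nested": {"x": 1, "y": 2}}

# Three clients, three serializations of the *same* value
compact = json.dumps(value, separators=(",", ":"))
default = json.dumps(value)          # spaces after ':' and ','
pretty = json.dumps(value, indent=2)

for text in (compact, default, pretty):
    print(len(text))   # sizes differ although the value is identical

# All three parse back to the same value - which is all jsonb keeps
assert json.loads(compact) == json.loads(pretty) == value
```

A json column would store each of the three at its own size; jsonb normalizes them all to one binary form.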

regards, tom lane

#6Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Tom Lane (#5)
Re: Why is writing JSONB faster than just JSON?

On Thu, Apr 15, 2021 at 12:47:25PM -0400, Tom Lane wrote:
> Dmitry Dolgov <9erthalion6@gmail.com> writes:
>> On Wed, Apr 14, 2021 at 10:26:23AM -0400, Tom Lane wrote:
>>> My own guess about this, without having tried to reproduce it, is that
>>> JSONB might end up physically smaller than JSON, resulting in less work
>>> to push the toasted datum out to disk. This'd depend a lot on your
>>> formatting habits for JSON, of course. But in any case, it'd be worth
>>> comparing pg_column_size() results to see what's up with that.
>
>> Oh, of course I've missed that the input I was using was indeed
>> formatted, without formatting both cases perform equally well and I
>> can't reproduce the issue. Although if I understand correctly the
>> original code in question doesn't actually do any formatting.
>
> My point was that for JSON, after validating that the input is
> syntactically correct, we just store it as-received. So in particular
> the amount of whitespace in the value would depend on how the client
> had chosen to format the JSON. This'd affect the stored size of
> course, and I think it would have an effect on compression time too.

Yes, I got it, and just wanted to confirm you were right: this was the
reason for the slowdown I observed while trying to reproduce the report.

#7Mitar
mmitar@gmail.com
In reply to: Dmitry Dolgov (#6)
Re: Why is writing JSONB faster than just JSON?

Hi!

On Thu, Apr 15, 2021 at 12:11 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>> My point was that for JSON, after validating that the input is
>> syntactically correct, we just store it as-received. So in particular
>> the amount of whitespace in the value would depend on how the client
>> had chosen to format the JSON. This'd affect the stored size of
>> course, and I think it would have an effect on compression time too.
>
> Yes, I got it and just wanted to confirm you were right - this was the
> reason I've observed slowdown trying to reproduce the report.

Thank you for trying to reproduce the report. I did a bit more digging
myself and I am still confused.

First, it is important to note that the JSON I am using contains
primarily random strings as values, so not really something which is
easy to compress. See example at [1]. I have realized though that in
the previous benchmark I have been using the same JSON document and
inserting it multiple times, so compression might work across
documents or something. So I ran a version of the benchmark with
different JSONs being inserted (but with the same structure, just
values are random strings). There was no difference.

Second, as you see from [1], the JSON representation I am using is
really compact and has no extra spaces. I also used
pg_total_relation_size to get the size of the table after inserting
10k rows and the numbers are similar, with JSONB being slightly larger
than others. So I think the idea of compression does not hold.

So I do not know what is happening or why you cannot reproduce it.
Could you explain a bit how you are trying to reproduce it? Directly
from the psql console? Are you using the same version as me (13.2)?

Numbers with inserting the same large JSON 10k times:

Type: jsonb
Mean: 200243.1
Stddev: 1679.7741187433503
Size: { pg_total_relation_size: '4611792896' }
Type: json
Mean: 256938.5
Stddev: 2471.9909890612466
Size: { pg_total_relation_size: '4597833728' }
Type: text
Mean: 248175.3
Stddev: 376.677594236769
Size: { pg_total_relation_size: '4597833728' }

Inserting different JSON 10k times:

Type: jsonb
Mean: 202794.5
Stddev: 978.5346442512907
Size: { pg_total_relation_size: '4611792896' }
Type: json
Mean: 259437.9
Stddev: 1785.8411155531167
Size: { pg_total_relation_size: '4597833728' }
Type: text
Mean: 250060.5
Stddev: 912.9207249263213
Size: { pg_total_relation_size: '4597833728' }

[1]: https://gitlab.com/mitar/benchmark-pg-json/-/blob/master/example.json

Mitar

--
http://mitar.tnode.com/
https://twitter.com/mitar_m

#8Francisco Olarte
folarte@peoplecall.com
In reply to: Mitar (#7)
Re: Why is writing JSONB faster than just JSON?

Just a note:

On Fri, Apr 23, 2021 at 10:57 AM Mitar <mmitar@gmail.com> wrote:
> First, it is important to note that the JSON I am using contains
> primarily random strings as values, so not really something which is
> easy to compress. See example at [1].

A fast look at the link: it seems to be a long string of random LOWER
CASE HEX DIGITS. A simple Huffman coder can probably put it in 5 bits
per char, and a more sophisticated algorithm can probably approach 4.
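This estimate can be sanity-checked with a quick calculation: a uniformly random lowercase-hex string draws from only 16 symbols, so its order-0 entropy is log2(16) = 4 bits per character, versus the 8 bits each character occupies stored raw. A short Python check:

```python
import math
import secrets
from collections import Counter

def entropy_bits_per_char(s: str) -> float:
    """Order-0 (symbol-frequency) entropy of a string, in bits/char."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

hex_string = secrets.token_hex(50_000)  # 100,000 random hex chars
print(round(entropy_bits_per_char(hex_string), 3))  # close to 4.0
```

So even "random-looking" hex strings are roughly 2:1 compressible by a symbol-level coder, though pglz (an LZ-family compressor) exploits repeats rather than symbol frequencies and will see them differently.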

FOS.

#9Mitar
mmitar@gmail.com
In reply to: Francisco Olarte (#8)
Re: Why is writing JSONB faster than just JSON?

Hi!

On Fri, Apr 23, 2021 at 10:28 AM Francisco Olarte
<folarte@peoplecall.com> wrote:
> A fast look at the link. It seems to be long string of random LOWER
> CASE HEX DIGITS. A simple huffman coder can probably put it in 5 bits
> per char, and a more sophisticated algorithm can probably approach 4.

But this compressibility would apply to both the JSONB and JSON column
types, no? Moreover, it looks like the JSONB column type ends up larger
on disk.

Mitar

--
http://mitar.tnode.com/
https://twitter.com/mitar_m

#10Francisco Olarte
folarte@peoplecall.com
In reply to: Mitar (#9)
Re: Why is writing JSONB faster than just JSON?

Mitar:

On Fri, Apr 23, 2021 at 7:33 PM Mitar <mmitar@gmail.com> wrote:
> On Fri, Apr 23, 2021 at 10:28 AM Francisco Olarte
> <folarte@peoplecall.com> wrote:
>> A fast look at the link. It seems to be long string of random LOWER
>> CASE HEX DIGITS. A simple huffman coder can probably put it in 5 bits
>> per char, and a more sophisticated algorithm can probably approach 4.
>
> But this compression-ility would apply to both JSONB and JSON column
> types, no? Moreover, it looks like JSONB column type ends up larger on
> disk.

Of course. I did not follow the thread too deeply, just pointed that out
in case you were assuming it was not going to be stored compressed.

Also, I am not surprised JSONB ends up being fatter: when no binary data
is transmitted and no compression is used, compact JSON has very little
redundancy, about 6 chars per k/v pair, and this is hard to beat without
using sophisticated encodings, especially with long data. And if jsonb
starts out bigger, and IIRC it does trade off compactness to be easily
indexable and other things, it will normally end up bigger. But I am not
an expert on this.
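The "about 6 chars per k/v pair" figure checks out for compact JSON with string values: each pair costs two pairs of quotes (4 chars) plus a colon, and each pair after the first costs a separating comma, on top of the key and value text themselves. A quick Python verification:

```python
import json

n = 1000
obj = {f"k{i}": f"v{i}" for i in range(n)}

compact = json.dumps(obj, separators=(",", ":"))
payload_chars = sum(len(k) + len(v) for k, v in obj.items())
overhead = len(compact) - payload_chars

# Per pair: 4 quotes + 1 colon; plus n-1 commas and 2 braces,
# i.e. overhead = 6*n + 1
print(overhead, 6 * n + 1)  # -> 6001 6001
```

So the syntactic overhead of compact JSON is a fixed, small constant per pair, which is hard for a general-purpose compressor to improve on much when the values themselves are high-entropy.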

FOS.

#11Mitar
mmitar@gmail.com
In reply to: Francisco Olarte (#10)
Re: Why is writing JSONB faster than just JSON?

Hi!

On Fri, Apr 23, 2021 at 10:49 AM Francisco Olarte
<folarte@peoplecall.com> wrote:
> Of course, I did not follow the thread to deeply, just pointed that in
> case you were assuming that was not going to be stored compressed.

Thanks for pointing that out. I was just trying to make sure I am
understanding you correctly and that we are all on the same page about
the implications. It seems we are.

> Also, not surprised JSONB ends up being fatter,

Yes, by itself this is not surprising. I mentioned it because in my
original post in this thread I said I was surprised that inserting into
a JSONB column seems observably faster than into a JSON or TEXT column
(for the same data), and I wonder why that is. One theory presented was
that JSONB might compress better, so there is less IO, so insertion is
faster. But JSONB does not look more compressed (I had not measured the
size in my original benchmark), so now I am searching for other
explanations for the results of my benchmark.

Mitar

--
http://mitar.tnode.com/
https://twitter.com/mitar_m

#12Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Mitar (#7)
Re: Why is writing JSONB faster than just JSON?

On Fri, Apr 23, 2021 at 01:56:57AM -0700, Mitar wrote:
> Hi!
>
> On Thu, Apr 15, 2021 at 12:11 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>>> My point was that for JSON, after validating that the input is
>>> syntactically correct, we just store it as-received. So in particular
>>> the amount of whitespace in the value would depend on how the client
>>> had chosen to format the JSON. This'd affect the stored size of
>>> course, and I think it would have an effect on compression time too.
>>
>> Yes, I got it and just wanted to confirm you were right - this was the
>> reason I've observed slowdown trying to reproduce the report.
>
> Thank you for trying to reproduce the report. I did a bit more digging
> myself and I am still confused.
>
> ...
>
> So I do not know what is happening and why you cannot reproduce it.

Could you maybe get a profile with perf for both cases? Since they're
executed within a single backend, you can profile just that one pid.
A reasonable profiling frequency, --call-graph dwarf, and probably
limiting events to user space only with precise tagging (cycles:uppp)
should give an impression of what's going on.