pgsql: Clean up jsonb code.

Started by Heikki Linnakangas, over 11 years ago. 9 messages.
#1 Heikki Linnakangas
heikki.linnakangas@iki.fi

Clean up jsonb code.

The main target of this cleanup is the convertJsonb() function, but I also
touched a lot of other things that I spotted in the process.

The new convertToJsonb() function uses an output buffer that's resized on
demand, so the code to estimate the size of a JsonbValue is removed.

The on-disk format was not changed, even though I refactored the structs
used to handle it. The term "superheader" is replaced with "container".

The jsonb_exists_any and jsonb_exists_all functions no longer sort the input
array. That was a premature optimization, the idea being that if there are
duplicates in the input array, you only need to check them once. Also,
sorting the array saves some effort in the binary search used to find a key
within an object. But there were drawbacks too: the sorting and
deduplicating obviously isn't free, and in the typical case there are no
duplicates to remove, and the gain in the binary search was minimal. Remove
all that, which makes the code simpler too.

This includes a bug fix: the total length of the elements in a jsonb array
or object must not exceed 2^28, and that is now checked.

Branch
------
master

Details
-------
http://git.postgresql.org/pg/commitdiff/364ddc3e5cbd01c93a39896b5260509129a9883e

Modified Files
--------------
src/backend/utils/adt/jsonb.c | 16 +-
src/backend/utils/adt/jsonb_gin.c | 4 +-
src/backend/utils/adt/jsonb_op.c | 106 ++--
src/backend/utils/adt/jsonb_util.c | 929 +++++++++++++-----------------------
src/backend/utils/adt/jsonfuncs.c | 64 +--
src/include/utils/jsonb.h | 213 +++++----
6 files changed, 556 insertions(+), 776 deletions(-)

--
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

#2 Peter Geoghegan
pg@heroku.com
In reply to: Heikki Linnakangas (#1)
Re: pgsql: Clean up jsonb code.

Thanks for cleaning this up.

On Wed, May 7, 2014 at 1:18 PM, Heikki Linnakangas
<heikki.linnakangas@iki.fi> wrote:

> The jsonb_exists_any and jsonb_exists_all functions no longer sort the input
> array. That was a premature optimization, the idea being that if there are
> duplicates in the input array, you only need to check them once. Also,
> sorting the array saves some effort in the binary search used to find a key
> within an object. But there were drawbacks too: the sorting and
> deduplicating obviously isn't free, and in the typical case there are no
> duplicates to remove, and the gain in the binary search was minimal. Remove
> all that, which makes the code simpler too.

This is not the reason why the code did that. De-duplication was not
the point at all. findJsonbValueFromSuperHeader()'s lowbound argument
previously served to establish a low bound for searching when
searching for multiple keys (so the second and subsequent
user-supplied key could skip much of the object). In the case of
jsonb_exists_any(), say, if you only have a reasonable expectation
that about 1 key exists, and that happens to be the last key that the
user passed to the text[] argument (to the existence/? operator), then
n - 1 calls to what is now findJsonbValueFromContainer() (which now
does not accept a lowbound) are wasted. That's elem_count - 1
top-level binary searches of the entire jsonb. Or elem_count such
calls rather than 1 call (plus 1 sort of the supplied array) in the
common case where jsonb_exists_any() will return false.

Granted, that might not be that bad right now, given that it's only
ever (say) elem_count or elem_count - 1 wasted binary searches through
the *top* level, but that might not always be true. And even today,
sorting a presumably much smaller user-passed lookup array once has to
be cheaper than searching through the entire jsonb perhaps elem_count
times per call.

--
Peter Geoghegan


#3 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Peter Geoghegan (#2)
Re: [COMMITTERS] pgsql: Clean up jsonb code.

On 05/08/2014 02:25 AM, Peter Geoghegan wrote:

> findJsonbValueFromSuperHeader()'s lowbound argument
> previously served to establish a low bound for searching when
> searching for multiple keys (so the second and subsequent
> user-supplied key could skip much of the object).

Got that.

> In the case of
> jsonb_exists_any(), say, if you only have a reasonable expectation
> that about 1 key exists, and that happens to be the last key that the
> user passed to the text[] argument (to the existence/? operator), then
> n - 1 calls to what is now findJsonbValueFromContainer() (which now
> does not accept a lowbound) are wasted.

Check.

> That's elem_count - 1
> top-level binary searches of the entire jsonb. Or elem_count such
> calls rather than 1 call (plus 1 sort of the supplied array) in the
> common case where jsonb_exists_any() will return false.

Ok, but I don't see any big difference in that regard. Before this
commit, it still called findJsonbValueFromContainer() elem_count times.
Each call was somewhat cheaper, because the lower bound of the binary
search was initialized to where the previous search ended, but you still
had to perform the search.

> Granted, that might not be that bad right now, given that it's only
> ever (say) elem_count or elem_count - 1 wasted binary searches through
> the *top* level, but that might not always be true.

If we are ever to perform a deep search, I think we'll want to do much
more to optimize that than just keep track of the lower bound. Like,
start an iterator over the tree and check for all of the keys in one go.

> And even today,
> sorting a presumably much smaller user-passed lookup array once has to
> be cheaper than searching through the entire jsonb perhaps elem_count
> times per call.

Well, the quick testing I did suggested otherwise. It's not a big
difference, but sorting has all kinds of overhead, like pallocing the
array to sort, copying the elements around etc. For a small array, the
startup cost of sorting trumps the savings in the binary searches.
Possibly the way the sorting was done was not optimal, and you could
reduce the copying and allocations involved in that, but it's hardly
worth the trouble.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#3)
Re: [COMMITTERS] pgsql: Clean up jsonb code.

On Thu, May 8, 2014 at 11:45 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

> Well, the quick testing I did suggested otherwise. It's not a big
> difference, but sorting has all kinds of overhead, like pallocing the
> array to sort, copying the elements around etc. For a small array, the
> startup cost of sorting trumps the savings in the binary searches.
> Possibly the way the sorting was done was not optimal, and you could
> reduce the copying and allocations involved in that, but it's hardly
> worth the trouble.

With current head I can't load delicious dataset into jsonb format. I got
segfault. It looks like memory corruption.

$ gzip -c -d js.copy.gz | psql postgres -c 'copy js from stdin;'
WARNING: problem in alloc set ExprContext: bogus aset link in block
0x14a846000, chunk 0x14a84d278
CONTEXT: COPY js, line 246766
WARNING: problem in alloc set ExprContext: bad single-chunk 0x14a804b18 in
block 0x14a7e6000
CONTEXT: COPY js, line 1009820
WARNING: problem in alloc set ExprContext: bogus aset link in block
0x14a7e6000, chunk 0x14a804b18
CONTEXT: COPY js, line 1009820
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

You can get dataset here:
http://mira.sai.msu.ru/~megera/tmp/js.copy.gz

------
With best regards,
Alexander Korotkov.

#5 Peter Geoghegan
pg@heroku.com
In reply to: Alexander Korotkov (#4)
Re: [COMMITTERS] pgsql: Clean up jsonb code.

On Fri, May 9, 2014 at 10:27 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

> With current head I can't load delicious dataset into jsonb format. I got
> segfault. It looks like memory corruption.

I'll look at this within the next couple of hours.

--
Peter Geoghegan


#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Korotkov (#4)
Re: [COMMITTERS] pgsql: Clean up jsonb code.

Alexander Korotkov <aekorotkov@gmail.com> writes:

> With current head I can't load delicious dataset into jsonb format. I got
> segfault. It looks like memory corruption.

The proximate cause of this seems to be that reserveFromBuffer() fails
to consider the possibility that it needs to more-than-double the
current buffer size. This change makes the crash go away for me:

diff --git a/src/backend/utils/adt/jsonb_util.c b/src/backend/utils/adt/jsonb_util.c
index 832a08d..0c4af04 100644
*** a/src/backend/utils/adt/jsonb_util.c
--- b/src/backend/utils/adt/jsonb_util.c
*************** reserveFromBuffer(convertState *buffer, 
*** 1186,1192 ****
  	/* Make more room if needed */
  	if (buffer->len + len > buffer->allocatedsz)
  	{
! 		buffer->allocatedsz *= 2;
  		buffer->buffer = repalloc(buffer->buffer, buffer->allocatedsz);
  	}
--- 1186,1195 ----
  	/* Make more room if needed */
  	if (buffer->len + len > buffer->allocatedsz)
  	{
! 		do
! 		{
! 			buffer->allocatedsz *= 2;
! 		} while (buffer->len + len > buffer->allocatedsz);
  		buffer->buffer = repalloc(buffer->buffer, buffer->allocatedsz);
  	}

However, what it looks to me like we've got here is a very bad
reimplementation of StringInfo buffers. There is for example no
integer-overflow checking here. Rather than try to bring this code
up to speed, I think we should rip it out and use StringInfo.

regards, tom lane


#7 Peter Geoghegan
pg@heroku.com
In reply to: Tom Lane (#6)
Re: [COMMITTERS] pgsql: Clean up jsonb code.

On Fri, May 9, 2014 at 2:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> However, what it looks to me like we've got here is a very bad
> reimplementation of StringInfo buffers. There is for example no
> integer-overflow checking here. Rather than try to bring this code
> up to speed, I think we should rip it out and use StringInfo.

Heikki did specifically consider StringInfo buffers and said they were
not best suited to the task at hand. At the time I thought he meant
that he'd do something domain-specific to avoid unnecessary geometric
growth in the size of the buffer (I like to grow buffers to either
twice their previous size, or just big enough to fit the next thing,
whichever is larger), but that doesn't appear to be the case. Still,
it would be good to know what he meant before proceeding. It probably
had something to do with alignment.

Integer overflow checking probably isn't strictly necessary FWIW,
because there are limits, enforced at various points, on the size that
the buffer can grow to.

--
Peter Geoghegan


#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#7)
Re: [COMMITTERS] pgsql: Clean up jsonb code.

Peter Geoghegan <pg@heroku.com> writes:

> On Fri, May 9, 2014 at 2:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> However, what it looks to me like we've got here is a very bad
>> reimplementation of StringInfo buffers. There is for example no
>> integer-overflow checking here. Rather than try to bring this code
>> up to speed, I think we should rip it out and use StringInfo.
>
> Heikki did specifically consider StringInfo buffers and said they were
> not best suited to the task at hand. At the time I thought he meant
> that he'd do something domain-specific to avoid unnecessary geometric
> growth in the size of the buffer (I like to grow buffers to either
> twice their previous size, or just big enough to fit the next thing,
> whichever is larger), but that doesn't appear to be the case. Still,
> it would be good to know what he meant before proceeding. It probably
> had something to do with alignment.

It looks to me like he wanted an API that would let him reserve space
separately from filling it, which is not in stringinfo.c but is surely
easily built on top of it. For the moment, I've just gotten rid of
the buggy code fragment in favor of calling enlargeStringInfo, which
I trust to be right.

We might at some point want to change the heuristics in
enlargeStringInfo, but two days before beta is not the time for that.

regards, tom lane


#9 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Tom Lane (#8)
Re: [COMMITTERS] pgsql: Clean up jsonb code.

On 05/10/2014 01:32 AM, Tom Lane wrote:

> Peter Geoghegan <pg@heroku.com> writes:
>
>> On Fri, May 9, 2014 at 2:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>>> However, what it looks to me like we've got here is a very bad
>>> reimplementation of StringInfo buffers. There is for example no
>>> integer-overflow checking here. Rather than try to bring this code
>>> up to speed, I think we should rip it out and use StringInfo.
>>
>> Heikki did specifically consider StringInfo buffers and said they were
>> not best suited to the task at hand. At the time I thought he meant
>> that he'd do something domain-specific to avoid unnecessary geometric
>> growth in the size of the buffer (I like to grow buffers to either
>> twice their previous size, or just big enough to fit the next thing,
>> whichever is larger), but that doesn't appear to be the case. Still,
>> it would be good to know what he meant before proceeding. It probably
>> had something to do with alignment.
>
> It looks to me like he wanted an API that would let him reserve space
> separately from filling it, which is not in stringinfo.c but is surely
> easily built on top of it.

Right, the API to reserve space separately was what I had in mind.

> For the moment, I've just gotten rid of
> the buggy code fragment in favor of calling enlargeStringInfo, which
> I trust to be right.

Thanks. I admit it didn't even occur to me to keep the localized API in
jsonb_utils as wrappers around appendString* functions. I only
considered two options: using appendString* directly, or doing
repalloc's in jsonb_utils.c. I like what you did there.

- Heikki
