Brain fade in gin_extract_jsonb_path()
I looked into bug #13756,
/messages/by-id/20151105171933.14035.25039@wrigleys.postgresql.org
The cause of the problem is that gin_extract_jsonb_path() computes
a different hash for the "2" in
{"a":[ ["b",{"x":1}], ["b",{"x":2}]]}
than it does for the "2" in
{"a":[[{"x":2}]]}
And the cause of that is that it supposes that after emitting a hash
for a VALUE, it can just leave stack->hash to be reinitialized by the
next KEY or ELEM. But if the next thing is a sub-object, we propagate
the previous value's hash down into the sub-object, causing values
therein to receive different hashes than they'd have gotten without
the preceding outer-level VALUE.
The attached one-liner fixes it, but of course this means that on-disk
jsonb_path_ops indexes are possibly broken and will need to be reindexed.
I see no way around that ... does anybody else?
regards, tom lane
Attachments:
bug-13756-fix.patchtext/x-diff; charset=us-ascii; name=bug-13756-fix.patchDownload+3-3
On Thu, Nov 5, 2015 at 12:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The attached one-liner fixes it, but of course this means that on-disk
jsonb_path_ops indexes are possibly broken and will need to be reindexed.
I see no way around that ... does anybody else?
I think it's impossible to avoid having to recommend a reindex here.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Peter Geoghegan <pg@heroku.com> writes:
On Thu, Nov 5, 2015 at 12:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The attached one-liner fixes it, but of course this means that on-disk
jsonb_path_ops indexes are possibly broken and will need to be reindexed.
I see no way around that ... does anybody else?
I think it's impossible to avoid having to recommend a reindex here.
Good, because that's not the only bug here :-(
Extending the example with
insert into t (j) values ('[[14,2,3]]');
insert into t (j) values ('[1,[14,2,3]]');
we get this with seqscan on:
regression=# select * from t where j @> '[[14]]';
j
-----------------
[[14, 2, 3]]
[1, [14, 2, 3]]
(2 rows)
which is the right answer, and this with seqscan off:
regression=# select * from t where j @> '[[14]]';
j
--------------
[[14, 2, 3]]
(1 row)
which is not so much.
The reason for that is that when processing an outer-level WJB_BEGIN_XXX,
we initialize stack->hash to something that is different from
stack->parent->hash. This is a waste of time in unnested cases because
when we hit the first WJB_KEY or WJB_ELEM, we'll reset stack->hash to
stack->parent->hash. However, if we have a nested array, then processing
of elements inside that array treats stack->parent->hash as the reference
value, so that we compute different hash values than we would have at
outer level. What's worse, it matters whether the nested array is the
first subobject or not, as shown by the above example; that's because once
we reset stack->hash to equal stack->parent->hash, hashes for later
elements will be inconsistent with the special-case initialization.
(Note: that explains why this is broken with my previous patch. It fails
in much the same way with unmodified HEAD, but I think for slightly
different reasons, which I've not bothered to understand in detail.)
I think the easiest way to fix this is to forget about the special
initialization at outer level and just always initialize a new stack level
to have the same hash as its parent. That leads to the first patch below
--- but once you look at that, you realize that we've got unnecessarily
many copies of stack->parent->hash into stack->hash, so what I actually
propose now is the second patch below.
I think that this only changes the hashes calculated for nested-array
cases, but it's still a larger change footprint than the previous patch.
Not that that affects much; a reindex is a reindex, and a bug is a bug,
so we don't have much wiggle room.
regards, tom lane
I wrote:
I think the easiest way to fix this is to forget about the special initialization at outer level and just always initialize a new stack level to have the same hash as its parent. That leads to the first patch below --- but once you look at that, you realize that we've got unnecessarily many copies of stack->parent->hash into stack->hash, so what I actually propose now is the second patch below.
Meh, that wasn't quite right either --- need to reset stack->hash after
processing a sub-object, not only after a scalar VALUE. I think what I
committed is OK though.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers