Unable to make use of "deep" JSONB index

Started by Shaheed Haquealmost 4 years ago14 messagesbugs
Jump to latest
#1Shaheed Haque
shaheedhaque@gmail.com

Hi,

I'm having trouble using an index on the "deep" innards of a JSONB field in
that the matching query it is trying to accelerate never uses it. I did get
some advice on a simplified version of the problem at [1]/messages/by-id/CAHAc2jdiRtw3qus_rvz1QvcxUJ9AgaBKObP-Fvx1q6Vr80n_xw@mail.gmail.com, but the actual
problem remains the same in that "EXPLAIN ANALYZE" never refers to the
index. Here are the details including a test case below...

1. The JSONB can be several MB in size. This works fine for all but one
access pattern.
2. The JSON in the problem use case looks like this:

{
"...stuff...": ...
"employee": {
"999": {"id": 999, "integer attribute": 0, "boolean-may-be-missing":
true, "state": {
"nested-list": [[], [], ...]
}
}
}

3. As per the discussion at [2]/messages/by-id/CAHAc2jf6Yp2uddfKQyGp=byvOuUzYpsHoPjyKXnUhS5+=Acuww@mail.gmail.com, using a SELECT with a WHERE on the 3
attributes of interest ("integer attribute", the "boolean-may-be-missing"
and "nested-list") incurs a significant overheard which suggests that the
JSONB storage is being accessed 3 times. In order to optimise for this
case, I constructed a query using the jsonpath support which seems to
successfully avoid the triple-fetch by keeping the logic inside the
jsonpath query like this:

WHERE (snapshot @? '$.employee."999" ? (@.pay_graph <> 0 ||
@.last_run_of_employment
== true || @.state.employment[last][2]/messages/by-id/CAHAc2jf6Yp2uddfKQyGp=byvOuUzYpsHoPjyKXnUhS5+=Acuww@mail.gmail.com == 0)')

4. Then I created an index "matching" this query.
5. According to EXPLAIN ANALYSE, the index is never used.

=== version and platform ===

Version: PostgreSQL 14.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC)
7.3.1 20180712 (Red Hat 7.3.1-12), 64-bit
Platform: Ubuntu 22.04 (Jammy), using OS-supplied build

=== test case ===
CREATE TABLE payrun (
id serial primary key,
snapshot JSONB
);

INSERT INTO payrun(snapshot)
VALUES
('{"employee": {"999": {"id": 999, "state": {"employment":
[["1920-01-01", null, 3]]}, "pay_graph": 0, "last_run_of_employment":
false}, "111": {"id": 111, "state": {"employment": [["1920-01-01", null,
5]]}, "pay_graph": 3, "last_run_of_employment": true}}}'),
('{"employee": {"999": {"id": 999, "state": {"employment":
[["1970-01-01", null, 3]]}, "pay_graph": 6, "last_run_of_employment":
true}, "222": {"id": 222, "state": {"employment": [["1920-01-01", null,
5]]}, "pay_graph": 5, "last_run_of_employment": true}}}'),
('{"employee": {"998": {"id": 998, "state": {"employment":
[["1980-01-01", null, 3]]}, "pay_graph": 7, "last_run_of_employment":
false}, "333": {"id": 333, "state": {"employment": [["1920-01-01", null,
5]]}, "pay_graph": 3, "last_run_of_employment": true}}}')
;

SELECT id,snapshot #>'{employee,999,state,employment}' FROM "payrun" WHERE
(snapshot @? '$.employee."999" ? (@.pay_graph <> 0 ||
@.last_run_of_employment == true || @.state.employment[last][2]/messages/by-id/CAHAc2jf6Yp2uddfKQyGp=byvOuUzYpsHoPjyKXnUhS5+=Acuww@mail.gmail.com == 0)');

--
-- Create index designed to match the query.
--
create index idx1 on payrun using gin ((snapshot->'$.employee.* ?
(@.pay_graph <> 0 || @.last_run_of_employment == true ||
@.state.employment[last][2]/messages/by-id/CAHAc2jf6Yp2uddfKQyGp=byvOuUzYpsHoPjyKXnUhS5+=Acuww@mail.gmail.com == 0)'));

set enable_seqscan = OFF;

--
-- EXPLAIN ANALYZE ...query above...
--
explain analyse SELECT id,snapshot #>'{employee,999,state,employment}' FROM
"payrun" WHERE (snapshot @? '$.employee."999" ? (@.pay_graph <> 0 ||
@.last_run_of_employment
== true || @.state.employment[last][2]/messages/by-id/CAHAc2jf6Yp2uddfKQyGp=byvOuUzYpsHoPjyKXnUhS5+=Acuww@mail.gmail.com == 0)');
QUERY
PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------------

Seq Scan on payrun (cost=10000000000.00..10000000001.04 rows=1 width=36)
(actual time=0.040..0.042 rows=1 loops=1)
Filter: (snapshot @? '$."employee"."999"?((@."pay_graph" != 0 ||
@."last_run_of_employment" == true) || @."state"."employment"[last][2]/messages/by-id/CAHAc2jf6Yp2uddfKQyGp=byvOuUzYpsHoPjyKXnUhS5+=Acuww@mail.gmail.com ==
0)'::jsonpath)
Rows Removed by Filter: 2
Planning Time: 0.883 ms
Execution Time: 0.078 ms
(5 rows)

=== end test case ===

The expected result is that with enable_seqscan = OFF, the index should be
used, but instead a sequential scan is reported as above. The same happens
without enable_seqscan = OFF on
a moderately large test set of over 2000 rows (with extended run times
circa 10+ seconds on my hardware).

I have tried the same with similar results on PG13.

Have I constructed the index incorrectly, or is there some other way to
convince the query to use it?

Thanks, Shahee

[1]: /messages/by-id/CAHAc2jdiRtw3qus_rvz1QvcxUJ9AgaBKObP-Fvx1q6Vr80n_xw@mail.gmail.com
/messages/by-id/CAHAc2jdiRtw3qus_rvz1QvcxUJ9AgaBKObP-Fvx1q6Vr80n_xw@mail.gmail.com
[2]: /messages/by-id/CAHAc2jf6Yp2uddfKQyGp=byvOuUzYpsHoPjyKXnUhS5+=Acuww@mail.gmail.com
/messages/by-id/CAHAc2jf6Yp2uddfKQyGp=byvOuUzYpsHoPjyKXnUhS5+=Acuww@mail.gmail.com

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Shaheed Haque (#1)
Re: Unable to make use of "deep" JSONB index

Shaheed Haque <shaheedhaque@gmail.com> writes:

-- Create index designed to match the query.
--
create index idx1 on payrun using gin ((snapshot->'$.employee.* ?
(@.pay_graph <> 0 || @.last_run_of_employment == true ||
@.state.employment[last][2] == 0)'));

But that doesn't match the query; it's not even the same topmost
operator:

explain analyse SELECT id,snapshot #>'{employee,999,state,employment}' FROM
"payrun" WHERE (snapshot @? '$.employee."999" ? (@.pay_graph <> 0 ||
@.last_run_of_employment
== true || @.state.employment[last][2] == 0)');

In general you seem to have much too high an opinion of what PG's
index machinery can cope with. The general pattern is that it can
use a query WHERE clause with an index if the clause is of the form
"indexed-column indexable-operator constant". There's a small number
of special cases where it can transform things that don't initially
look like that into the right form, but AFAIR we don't have any
such special cases for any json-related operators.

The one saving grace is that "indexed-column" can be an expression
appearing in an index, so in some cases you can finesse things
that way. But you won't find any deep knowledge of jsonpath
expressions in there.

Having said that, @? is reported as an indexable operator in v14:

regression=# \dAo gin jsonb*
List of operators of operator families
AM | Operator family | Operator | Strategy | Purpose
-----+-----------------+--------------------+----------+---------
gin | jsonb_ops | @>(jsonb,jsonb) | 7 | search
gin | jsonb_ops | @?(jsonb,jsonpath) | 15 | search
gin | jsonb_ops | @@(jsonb,jsonpath) | 16 | search
gin | jsonb_ops | ?(jsonb,text) | 9 | search
gin | jsonb_ops | ?|(jsonb,text[]) | 10 | search
gin | jsonb_ops | ?&(jsonb,text[]) | 11 | search
gin | jsonb_path_ops | @>(jsonb,jsonb) | 7 | search
gin | jsonb_path_ops | @?(jsonb,jsonpath) | 15 | search
gin | jsonb_path_ops | @@(jsonb,jsonpath) | 16 | search
(9 rows)

so it seems like you ought to get some benefit for this query
from just a plain GIN index on "snapshot".

regards, tom lane

#3Shaheed Haque
shaheedhaque@gmail.com
In reply to: Tom Lane (#2)
Re: Unable to make use of "deep" JSONB index

On Thu, 2 Jun 2022 at 15:31, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Shaheed Haque <shaheedhaque@gmail.com> writes:

-- Create index designed to match the query.
--
create index idx1 on payrun using gin ((snapshot->'$.employee.* ?
(@.pay_graph <> 0 || @.last_run_of_employment == true ||
@.state.employment[last][2] == 0)'));

But that doesn't match the query; it's not even the same topmost
operator:

explain analyse SELECT id,snapshot #>'{employee,999,state,employment}' FROM
"payrun" WHERE (snapshot @? '$.employee."999" ? (@.pay_graph <> 0 ||
@.last_run_of_employment
== true || @.state.employment[last][2] == 0)');

I assume you are referring to the difference between "snapshot @?" and
"snapshot ->"? If so, apologies: too much cutting and pasting from too
many experiments. I did in fact also try the "using btree ((snapshot
@?" form but it gave the same results.

In general you seem to have much too high an opinion of what PG's
index machinery can cope with. The general pattern is that it can
use a query WHERE clause with an index if the clause is of the form
"indexed-column indexable-operator constant". There's a small number
of special cases where it can transform things that don't initially
look like that into the right form, but AFAIR we don't have any
such special cases for any json-related operators.

LOL. I'm pretty much a noob here, so that's very possible.

The one saving grace is that "indexed-column" can be an expression
appearing in an index, so in some cases you can finesse things
that way. But you won't find any deep knowledge of jsonpath
expressions in there.

I was basing my efforts on this statement in the docs
https://www.postgresql.org/docs/14/datatype-json.html#JSON-INDEXING:

GIN index extracts statements of following form out of jsonpath:
accessors_chain = const. Accessors chain may consist of .key, [*], and
[index] accessors. jsonb_ops additionally supports .* and .**
accessors.

Did I mis-implement, misunderstand or read too much into this?

Having said that, @? is reported as an indexable operator in v14:

regression=# \dAo gin jsonb*
List of operators of operator families
AM | Operator family | Operator | Strategy | Purpose
-----+-----------------+--------------------+----------+---------
gin | jsonb_ops | @>(jsonb,jsonb) | 7 | search
gin | jsonb_ops | @?(jsonb,jsonpath) | 15 | search
gin | jsonb_ops | @@(jsonb,jsonpath) | 16 | search
gin | jsonb_ops | ?(jsonb,text) | 9 | search
gin | jsonb_ops | ?|(jsonb,text[]) | 10 | search
gin | jsonb_ops | ?&(jsonb,text[]) | 11 | search
gin | jsonb_path_ops | @>(jsonb,jsonb) | 7 | search
gin | jsonb_path_ops | @?(jsonb,jsonpath) | 15 | search
gin | jsonb_path_ops | @@(jsonb,jsonpath) | 16 | search
(9 rows)

so it seems like you ought to get some benefit for this query
from just a plain GIN index on "snapshot".

Interesting. I'm pretty sure I started there a few days ago without
any luck but I'll give it another spin (having learnt quite a bit
since then).

Show quoted text

regards, tom lane

#4Shaheed Haque
shaheedhaque@gmail.com
In reply to: Shaheed Haque (#3)
Re: Unable to make use of "deep" JSONB index

OK, I have corrected and simplified the test case (including switching
to a btree index). The WHERE clause and the inex now look like this:

...WHERE ((snapshot -> 'employee' -> '999' ->>
'pay_graph')::integer != 0);
...USING btree (((snapshot -> 'employee' -> '$.*' ->>
'pay_graph')::integer != 0));

But the index is still not being used (test case below). I have
confirmed that the equality operator is listed for "search" (I assume
inequality is the same as equality, but I tried both):

# \dAo btree jsonb*
List of operators of operator families
AM | Operator family | Operator | Strategy | Purpose
-------+-----------------+-----------------+----------+---------
btree | jsonb_ops | <(jsonb,jsonb) | 1 | search
btree | jsonb_ops | <=(jsonb,jsonb) | 2 | search
btree | jsonb_ops | =(jsonb,jsonb) | 3 | search
btree | jsonb_ops | >=(jsonb,jsonb) | 4 | search
btree | jsonb_ops | >(jsonb,jsonb) | 5 | search
(5 rows)

If this is not a bug, then how should the query or the index be
changed to make this work?

=== begin test case ===

CREATE TABLE payrun (
id serial primary key,
snapshot JSONB
);

INSERT INTO payrun(snapshot)
VALUES
('{"employee": {"999": {"id": 999, "state": {"employment":
[["1920-01-01", null, 3]]}, "pay_graph": 0, "last_run_of_employment":
false}, "111": {"id": 111, "state": {"employment": [["1920-01-01",
null, 5]]}, "pay_graph": 3, "last_run_of_employment": true}}}'),
('{"employee": {"999": {"id": 999, "state": {"employment":
[["1970-01-01", null, 3]]}, "pay_graph": 6, "last_run_of_employment":
true}, "222": {"id": 222, "state": {"employment": [["1920-01-01",
null, 5]]}, "pay_graph": 5, "last_run_of_employment": true}}}'),
('{"employee": {"998": {"id": 998, "state": {"employment":
[["1980-01-01", null, 3]]}, "pay_graph": 7, "last_run_of_employment":
false}, "333": {"id": 333, "state": {"employment": [["1920-01-01",
null, 5]]}, "pay_graph": 3, "last_run_of_employment": true}}}')
;

SELECT id,snapshot #>'{employee,999,state,employment}' FROM "payrun"
WHERE ((snapshot -> 'employee' -> '999' ->> 'pay_graph')::integer !=
0);

--
-- Create index designed to match the query.
--
CREATE INDEX idx1 ON payrun USING btree (((snapshot -> 'employee' ->
'$.*' ->> 'pay_graph')::integer != 0));

set enable_seqscan = OFF;

--
-- EXPLAIN ANALYZE ...query above...
--
explain analyze SELECT id,snapshot #>'{employee,999,state,employment}'
FROM "payrun" WHERE ((snapshot -> 'employee' -> '999' ->>
'pay_graph')::integer != 0);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Seq Scan on payrun (cost=10000000000.00..10000000001.08 rows=2
width=36) (actual time=70.051..70.052 rows=1 loops=1)
Filter: (((((snapshot -> 'employee'::text) -> '999'::text) ->>
'pay_graph'::text))::integer <> 0)
Rows Removed by Filter: 2
Planning Time: 0.147 ms
JIT:
Functions: 4
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 0.354 ms, Inlining 8.305 ms, Optimization 49.237
ms, Emission 12.499 ms, Total 70.395 ms
Execution Time: 70.428 ms
(9 rows)

=== end test case ===

Thanks, Shaheed

Show quoted text

On Thu, 2 Jun 2022 at 16:51, Shaheed Haque <shaheedhaque@gmail.com> wrote:

On Thu, 2 Jun 2022 at 15:31, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Shaheed Haque <shaheedhaque@gmail.com> writes:

-- Create index designed to match the query.
--
create index idx1 on payrun using gin ((snapshot->'$.employee.* ?
(@.pay_graph <> 0 || @.last_run_of_employment == true ||
@.state.employment[last][2] == 0)'));

But that doesn't match the query; it's not even the same topmost
operator:

explain analyse SELECT id,snapshot #>'{employee,999,state,employment}' FROM
"payrun" WHERE (snapshot @? '$.employee."999" ? (@.pay_graph <> 0 ||
@.last_run_of_employment
== true || @.state.employment[last][2] == 0)');

I assume you are referring to the difference between "snapshot @?" and
"snapshot ->"? If so, apologies: too much cutting and pasting from too
many experiments. I did in fact also try the "using btree ((snapshot
@?" form but it gave the same results.

In general you seem to have much too high an opinion of what PG's
index machinery can cope with. The general pattern is that it can
use a query WHERE clause with an index if the clause is of the form
"indexed-column indexable-operator constant". There's a small number
of special cases where it can transform things that don't initially
look like that into the right form, but AFAIR we don't have any
such special cases for any json-related operators.

LOL. I'm pretty much a noob here, so that's very possible.

The one saving grace is that "indexed-column" can be an expression
appearing in an index, so in some cases you can finesse things
that way. But you won't find any deep knowledge of jsonpath
expressions in there.

I was basing my efforts on this statement in the docs
https://www.postgresql.org/docs/14/datatype-json.html#JSON-INDEXING:

GIN index extracts statements of following form out of jsonpath:
accessors_chain = const. Accessors chain may consist of .key, [*], and
[index] accessors. jsonb_ops additionally supports .* and .**
accessors.

Did I mis-implement, misunderstand or read too much into this?

Having said that, @? is reported as an indexable operator in v14:

regression=# \dAo gin jsonb*
List of operators of operator families
AM | Operator family | Operator | Strategy | Purpose
-----+-----------------+--------------------+----------+---------
gin | jsonb_ops | @>(jsonb,jsonb) | 7 | search
gin | jsonb_ops | @?(jsonb,jsonpath) | 15 | search
gin | jsonb_ops | @@(jsonb,jsonpath) | 16 | search
gin | jsonb_ops | ?(jsonb,text) | 9 | search
gin | jsonb_ops | ?|(jsonb,text[]) | 10 | search
gin | jsonb_ops | ?&(jsonb,text[]) | 11 | search
gin | jsonb_path_ops | @>(jsonb,jsonb) | 7 | search
gin | jsonb_path_ops | @?(jsonb,jsonpath) | 15 | search
gin | jsonb_path_ops | @@(jsonb,jsonpath) | 16 | search
(9 rows)

so it seems like you ought to get some benefit for this query
from just a plain GIN index on "snapshot".

Interesting. I'm pretty sure I started there a few days ago without
any luck but I'll give it another spin (having learnt quite a bit
since then).

regards, tom lane

#5Erik Rijkers
er@xs4all.nl
In reply to: Shaheed Haque (#4)
Re: Unable to make use of "deep" JSONB index

Op 12-06-2022 om 11:34 schreef Shaheed Haque:

OK, I have corrected and simplified the test case (including switching
to a btree index). The WHERE clause and the inex now look like this:

...WHERE ((snapshot -> 'employee' -> '999' ->>
'pay_graph')::integer != 0);
...USING btree (((snapshot -> 'employee' -> '$.*' ->>
'pay_graph')::integer != 0));

But the index is still not being used (test case below). I have
confirmed that the equality operator is listed for "search" (I assume
inequality is the same as equality, but I tried both):

# \dAo btree jsonb*
List of operators of operator families
AM | Operator family | Operator | Strategy | Purpose
-------+-----------------+-----------------+----------+---------
btree | jsonb_ops | <(jsonb,jsonb) | 1 | search
btree | jsonb_ops | <=(jsonb,jsonb) | 2 | search
btree | jsonb_ops | =(jsonb,jsonb) | 3 | search
btree | jsonb_ops | >=(jsonb,jsonb) | 4 | search
btree | jsonb_ops | >(jsonb,jsonb) | 5 | search
(5 rows)

If this is not a bug, then how should the query or the index be
changed to make this work?

=== begin test case ===

CREATE TABLE payrun (
id serial primary key,
snapshot JSONB
);

INSERT INTO payrun(snapshot)
VALUES
('{"employee": {"999": {"id": 999, "state": {"employment":
[["1920-01-01", null, 3]]}, "pay_graph": 0, "last_run_of_employment":
false}, "111": {"id": 111, "state": {"employment": [["1920-01-01",
null, 5]]}, "pay_graph": 3, "last_run_of_employment": true}}}'),
('{"employee": {"999": {"id": 999, "state": {"employment":
[["1970-01-01", null, 3]]}, "pay_graph": 6, "last_run_of_employment":
true}, "222": {"id": 222, "state": {"employment": [["1920-01-01",
null, 5]]}, "pay_graph": 5, "last_run_of_employment": true}}}'),
('{"employee": {"998": {"id": 998, "state": {"employment":
[["1980-01-01", null, 3]]}, "pay_graph": 7, "last_run_of_employment":
false}, "333": {"id": 333, "state": {"employment": [["1920-01-01",
null, 5]]}, "pay_graph": 3, "last_run_of_employment": true}}}')
;

SELECT id,snapshot #>'{employee,999,state,employment}' FROM "payrun"
WHERE ((snapshot -> 'employee' -> '999' ->> 'pay_graph')::integer !=
0);

--
-- Create index designed to match the query.
--
CREATE INDEX idx1 ON payrun USING btree (((snapshot -> 'employee' ->
'$.*' ->> 'pay_graph')::integer != 0));

set enable_seqscan = OFF;

How is this?

I took the triple condition from your earlier email.
I did not use your index.
I added one index using gin jsonb_path_ops.

create index payrun_jspathop_idx ON payrun using gin (snapshot
jsonb_path_ops);
set enable_seqscan = OFF;
\timing on
select id, snapshot #>'{employee,999,state,employment}' from payrun
where snapshot @? '$."employee"."999" ?
( @.pay_graph <> 0
|| @.last_run_of_employment == true
|| @.state.employment[last][2] == 0
)';
id | ?column?
----+---------------------------
2 | [["1970-01-01", null, 3]]
(1 row)

Time: 0.897 ms
explain analyze select id, snapshot #>'{employee,999,state,employment}'
from payrun
where snapshot @? '$."employee"."999" ?
( @.pay_graph <> 0
|| @.last_run_of_employment == true
|| @.state.employment[last][2] == 0
)';

QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on payrun (cost=136.00..140.02 rows=1 width=36)
(actual time=0.018..0.019 rows=1 loops=1)
Recheck Cond: (snapshot @? '$."employee"."999"?((@."pay_graph" != 0
|| @."last_run_of_employment" == true) ||
@."state"."employment"[last][2] == 0)'::jsonpath)
Rows Removed by Index Recheck: 2
Heap Blocks: exact=1
-> Bitmap Index Scan on payrun_jspathop_idx (cost=0.00..136.00
rows=1 width=0) (actual time=0.007..0.007 rows=3 loops=1)
Index Cond: (snapshot @? '$."employee"."999"?((@."pay_graph"
!= 0 || @."last_run_of_employment" == true) ||
@."state"."employment"[last][2] == 0)'::jsonpath)
Planning Time: 0.034 ms
Execution Time: 0.033 ms
(8 rows)

Time: 0.284 ms

hope that helps.

Erik Rijkers

#6Shaheed Haque
shaheedhaque@gmail.com
In reply to: Erik Rijkers (#5)
Re: Unable to make use of "deep" JSONB index

Thanks Erik. Is the point that the index has to be on the JSON field
as a whole (i.e. "snapshot") rather than deep inside it (e.g.
"snapshot.something.further[down]")?

Show quoted text

On Sun, 12 Jun 2022 at 11:53, Erik Rijkers <er@xs4all.nl> wrote:

Op 12-06-2022 om 11:34 schreef Shaheed Haque:

OK, I have corrected and simplified the test case (including switching
to a btree index). The WHERE clause and the inex now look like this:

...WHERE ((snapshot -> 'employee' -> '999' ->>
'pay_graph')::integer != 0);
...USING btree (((snapshot -> 'employee' -> '$.*' ->>
'pay_graph')::integer != 0));

But the index is still not being used (test case below). I have
confirmed that the equality operator is listed for "search" (I assume
inequality is the same as equality, but I tried both):

# \dAo btree jsonb*
List of operators of operator families
AM | Operator family | Operator | Strategy | Purpose
-------+-----------------+-----------------+----------+---------
btree | jsonb_ops | <(jsonb,jsonb) | 1 | search
btree | jsonb_ops | <=(jsonb,jsonb) | 2 | search
btree | jsonb_ops | =(jsonb,jsonb) | 3 | search
btree | jsonb_ops | >=(jsonb,jsonb) | 4 | search
btree | jsonb_ops | >(jsonb,jsonb) | 5 | search
(5 rows)

If this is not a bug, then how should the query or the index be
changed to make this work?

=== begin test case ===

CREATE TABLE payrun (
id serial primary key,
snapshot JSONB
);

INSERT INTO payrun(snapshot)
VALUES
('{"employee": {"999": {"id": 999, "state": {"employment":
[["1920-01-01", null, 3]]}, "pay_graph": 0, "last_run_of_employment":
false}, "111": {"id": 111, "state": {"employment": [["1920-01-01",
null, 5]]}, "pay_graph": 3, "last_run_of_employment": true}}}'),
('{"employee": {"999": {"id": 999, "state": {"employment":
[["1970-01-01", null, 3]]}, "pay_graph": 6, "last_run_of_employment":
true}, "222": {"id": 222, "state": {"employment": [["1920-01-01",
null, 5]]}, "pay_graph": 5, "last_run_of_employment": true}}}'),
('{"employee": {"998": {"id": 998, "state": {"employment":
[["1980-01-01", null, 3]]}, "pay_graph": 7, "last_run_of_employment":
false}, "333": {"id": 333, "state": {"employment": [["1920-01-01",
null, 5]]}, "pay_graph": 3, "last_run_of_employment": true}}}')
;

SELECT id,snapshot #>'{employee,999,state,employment}' FROM "payrun"
WHERE ((snapshot -> 'employee' -> '999' ->> 'pay_graph')::integer !=
0);

--
-- Create index designed to match the query.
--
CREATE INDEX idx1 ON payrun USING btree (((snapshot -> 'employee' ->
'$.*' ->> 'pay_graph')::integer != 0));

set enable_seqscan = OFF;

How is this?

I took the triple condition from your earlier email.
I did not use your index.
I added one index using gin jsonb_path_ops.

create index payrun_jspathop_idx ON payrun using gin (snapshot
jsonb_path_ops);
set enable_seqscan = OFF;
\timing on
select id, snapshot #>'{employee,999,state,employment}' from payrun
where snapshot @? '$."employee"."999" ?
( @.pay_graph <> 0
|| @.last_run_of_employment == true
|| @.state.employment[last][2] == 0
)';
id | ?column?
----+---------------------------
2 | [["1970-01-01", null, 3]]
(1 row)

Time: 0.897 ms
explain analyze select id, snapshot #>'{employee,999,state,employment}'
from payrun
where snapshot @? '$."employee"."999" ?
( @.pay_graph <> 0
|| @.last_run_of_employment == true
|| @.state.employment[last][2] == 0
)';

QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on payrun (cost=136.00..140.02 rows=1 width=36)
(actual time=0.018..0.019 rows=1 loops=1)
Recheck Cond: (snapshot @? '$."employee"."999"?((@."pay_graph" != 0
|| @."last_run_of_employment" == true) ||
@."state"."employment"[last][2] == 0)'::jsonpath)
Rows Removed by Index Recheck: 2
Heap Blocks: exact=1
-> Bitmap Index Scan on payrun_jspathop_idx (cost=0.00..136.00
rows=1 width=0) (actual time=0.007..0.007 rows=3 loops=1)
Index Cond: (snapshot @? '$."employee"."999"?((@."pay_graph"
!= 0 || @."last_run_of_employment" == true) ||
@."state"."employment"[last][2] == 0)'::jsonpath)
Planning Time: 0.034 ms
Execution Time: 0.033 ms
(8 rows)

Time: 0.284 ms

hope that helps.

Erik Rijkers

#7Shaheed Haque
shaheedhaque@gmail.com
In reply to: Erik Rijkers (#5)
Re: Unable to make use of "deep" JSONB index

(Resend, wrong version was sent before)

Thanks Erik. Is the point that the index has to be on the JSON field
as a whole (i.e. "snapshot") rather than deep inside it (e.g.
"snapshot.something.further[down]")?

In my case, the snapshot is several MB in size (perhaps 10MB or even
20MB), dominated by the snapshot.employee (cardinality 10k, each sized
as a dict 1-2kB as text). My expectation/guess is that an index of
"snapshot" will itself be of a size of similar order. However the
design as-is works very well except for this one case where to speed
it up, in principle, the index need contain no more than one boolean
per employee. So that's what I'd like to achieve, if possible.

I've seen no hint in the documentation that creating the index on
"snapshot.something.further[down]" should not work, and PG certainly
allows it to be created. Also, Tom has suggested that I should not
look to some magical ability to infer the use of the index from a
differently structured query, and I've taken that on board with the
updated query + index.

AFAIK, there are 3 possibilities:

- I've not done things right, in which case I'd love to know my mistake.
- It is not supposed to work, in which case it would be good to have
that stated, and maybe have PG not allow useless indices to be
created.
- It is a bug.

All input much appreciated,

Thanks, Shaheed

Show quoted text

On Sun, 12 Jun 2022 at 11:53, Erik Rijkers <er@xs4all.nl> wrote:

Op 12-06-2022 om 11:34 schreef Shaheed Haque:

OK, I have corrected and simplified the test case (including switching
to a btree index). The WHERE clause and the inex now look like this:

...WHERE ((snapshot -> 'employee' -> '999' ->>
'pay_graph')::integer != 0);
...USING btree (((snapshot -> 'employee' -> '$.*' ->>
'pay_graph')::integer != 0));

But the index is still not being used (test case below). I have
confirmed that the equality operator is listed for "search" (I assume
inequality is the same as equality, but I tried both):

# \dAo btree jsonb*
List of operators of operator families
AM | Operator family | Operator | Strategy | Purpose
-------+-----------------+-----------------+----------+---------
btree | jsonb_ops | <(jsonb,jsonb) | 1 | search
btree | jsonb_ops | <=(jsonb,jsonb) | 2 | search
btree | jsonb_ops | =(jsonb,jsonb) | 3 | search
btree | jsonb_ops | >=(jsonb,jsonb) | 4 | search
btree | jsonb_ops | >(jsonb,jsonb) | 5 | search
(5 rows)

If this is not a bug, then how should the query or the index be
changed to make this work?

=== begin test case ===

CREATE TABLE payrun (
id serial primary key,
snapshot JSONB
);

INSERT INTO payrun(snapshot)
VALUES
('{"employee": {"999": {"id": 999, "state": {"employment":
[["1920-01-01", null, 3]]}, "pay_graph": 0, "last_run_of_employment":
false}, "111": {"id": 111, "state": {"employment": [["1920-01-01",
null, 5]]}, "pay_graph": 3, "last_run_of_employment": true}}}'),
('{"employee": {"999": {"id": 999, "state": {"employment":
[["1970-01-01", null, 3]]}, "pay_graph": 6, "last_run_of_employment":
true}, "222": {"id": 222, "state": {"employment": [["1920-01-01",
null, 5]]}, "pay_graph": 5, "last_run_of_employment": true}}}'),
('{"employee": {"998": {"id": 998, "state": {"employment":
[["1980-01-01", null, 3]]}, "pay_graph": 7, "last_run_of_employment":
false}, "333": {"id": 333, "state": {"employment": [["1920-01-01",
null, 5]]}, "pay_graph": 3, "last_run_of_employment": true}}}')
;

SELECT id,snapshot #>'{employee,999,state,employment}' FROM "payrun"
WHERE ((snapshot -> 'employee' -> '999' ->> 'pay_graph')::integer !=
0);

--
-- Create index designed to match the query.
--
CREATE INDEX idx1 ON payrun USING btree (((snapshot -> 'employee' ->
'$.*' ->> 'pay_graph')::integer != 0));

set enable_seqscan = OFF;

How is this?

I took the triple condition from your earlier email.
I did not use your index.
I added one index using gin jsonb_path_ops.

create index payrun_jspathop_idx ON payrun using gin (snapshot
jsonb_path_ops);
set enable_seqscan = OFF;
\timing on
select id, snapshot #>'{employee,999,state,employment}' from payrun
where snapshot @? '$."employee"."999" ?
( @.pay_graph <> 0
|| @.last_run_of_employment == true
|| @.state.employment[last][2] == 0
)';
id | ?column?
----+---------------------------
2 | [["1970-01-01", null, 3]]
(1 row)

Time: 0.897 ms
explain analyze select id, snapshot #>'{employee,999,state,employment}'
from payrun
where snapshot @? '$."employee"."999" ?
( @.pay_graph <> 0
|| @.last_run_of_employment == true
|| @.state.employment[last][2] == 0
)';

QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on payrun (cost=136.00..140.02 rows=1 width=36)
(actual time=0.018..0.019 rows=1 loops=1)
Recheck Cond: (snapshot @? '$."employee"."999"?((@."pay_graph" != 0
|| @."last_run_of_employment" == true) ||
@."state"."employment"[last][2] == 0)'::jsonpath)
Rows Removed by Index Recheck: 2
Heap Blocks: exact=1
-> Bitmap Index Scan on payrun_jspathop_idx (cost=0.00..136.00
rows=1 width=0) (actual time=0.007..0.007 rows=3 loops=1)
Index Cond: (snapshot @? '$."employee"."999"?((@."pay_graph"
!= 0 || @."last_run_of_employment" == true) ||
@."state"."employment"[last][2] == 0)'::jsonpath)
Planning Time: 0.034 ms
Execution Time: 0.033 ms
(8 rows)

Time: 0.284 ms

hope that helps.

Erik Rijkers

#8Jeff Janes
jeff.janes@gmail.com
In reply to: Shaheed Haque (#4)
Re: Unable to make use of "deep" JSONB index

On Sun, Jun 12, 2022 at 5:34 AM Shaheed Haque <shaheedhaque@gmail.com>
wrote:

OK, I have corrected and simplified the test case (including switching
to a btree index). The WHERE clause and the inex now look like this:

...WHERE ((snapshot -> 'employee' -> '999' ->>
'pay_graph')::integer != 0);
...USING btree (((snapshot -> 'employee' -> '$.*' ->>
'pay_graph')::integer != 0));

But, this is not a correction. You are still trying to use -> as if it
were @?, and that is still not going to work.

You are indexing the part of snapshot which has the employee number of
'$.*', which is a weird employee number for anyone to have. You might want
to represent a wildcard but that is not what -> does.

Cheer,

Jeff

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Janes (#8)
Re: Unable to make use of "deep" JSONB index

Jeff Janes <jeff.janes@gmail.com> writes:

On Sun, Jun 12, 2022 at 5:34 AM Shaheed Haque <shaheedhaque@gmail.com>
wrote:

OK, I have corrected and simplified the test case (including switching
to a btree index). The WHERE clause and the inex now look like this:

...WHERE ((snapshot -> 'employee' -> '999' ->>
'pay_graph')::integer != 0);
...USING btree (((snapshot -> 'employee' -> '$.*' ->>
'pay_graph')::integer != 0));

But, this is not a correction. You are still trying to use -> as if it
were @?, and that is still not going to work.

In hopes of clarifying some more: all that index does is to record
the boolean result of
(snapshot -> 'employee' -> '$.*' ->> 'pay_graph')::integer != 0
at each row. We could use it for a query that contains *exactly*
that condition as a WHERE clause. We cannot use it for a query that
contains some other condition, even if that other condition looks
related to you.

You are indexing the part of snapshot which has the employee number of
'$.*', which is a weird employee number for anyone to have. You might want
to represent a wildcard but that is not what -> does.

Yeah, there's also the problem that the semantics of this particular
expression aren't really useful. But even if they were, PG's index
machinery is not smart enough to pick apart the contents of an index
expression. If the index expression *exactly* matches some sub-expression
of a WHERE clause, and what's above that sub-expression is an operator
that's indexable according to the index opclass, then we have a chance
of using it. This example is not that.

regards, tom lane

#10Shaheed Haque
shaheedhaque@gmail.com
In reply to: Tom Lane (#9)
Re: Unable to make use of "deep" JSONB index

On Sun, 12 Jun 2022 at 22:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Jeff Janes <jeff.janes@gmail.com> writes:

On Sun, Jun 12, 2022 at 5:34 AM Shaheed Haque <shaheedhaque@gmail.com>
wrote:

OK, I have corrected and simplified the test case (including switching
to a btree index). The WHERE clause and the inex now look like this:

...WHERE ((snapshot -> 'employee' -> '999' ->>
'pay_graph')::integer != 0);
...USING btree (((snapshot -> 'employee' -> '$.*' ->>
'pay_graph')::integer != 0));

But, this is not a correction. You are still trying to use -> as if it
were @?, and that is still not going to work.

In hopes of clarifying some more: all that index does is to record
the boolean result of
(snapshot -> 'employee' -> '$.*' ->> 'pay_graph')::integer != 0
at each row. We could use it for a query that contains *exactly*
that condition as a WHERE clause. We cannot use it for a query that
contains some other condition, even if that other condition looks
related to you.

OK, I see that I got myself all confused and the @? form is needed. So:

...USING btree ((snapshot -> 'employee' @? '$.* ? (@.pay_graph != 0)'));
...WHERE ((snapshot -> 'employee' @? '$."999" ? (@.pay_graph != 0)'))

You are indexing the part of snapshot which has the employee number of
'$.*', which is a weird employee number for anyone to have. You might want
to represent a wildcard but that is not what -> does.

Yeah, there's also the problem that the semantics of this particular
expression aren't really useful.

Does the switch back to @? address this point? If not, please clarify.

But even if they were, PG's index
machinery is not smart enough to pick apart the contents of an index
expression. If the index expression *exactly* matches some sub-expression

To my inexpert eye, given the presence of the wildcard, the above look
like an exact match. What have I missed?

of a WHERE clause, and what's above that sub-expression is an operator
that's indexable according to the index opclass, then we have a chance
of using it. This example is not that.

Because I switched to btree, and btree cannot search on "@?"? So, what
should the index+query look like using gin? (I am trying to address a
niche case, and can easily arrange for them to match if I knew what
was needed).

Thanks, Shaheed

Show quoted text

regards, tom lane

#11David G. Johnston
david.g.johnston@gmail.com
In reply to: Shaheed Haque (#10)
Re: Unable to make use of "deep" JSONB index

On Sun, Jun 12, 2022 at 4:52 PM Shaheed Haque <shaheedhaque@gmail.com>
wrote:

...USING btree ((snapshot -> 'employee' @? '$.* ? (@.pay_graph !=
0)'));
...WHERE ((snapshot -> 'employee' @? '$."999" ? (@.pay_graph !=
0)'))

To my inexpert eye, given the presence of the wildcard, the above look
like an exact match. What have I missed?

The system doesn't see a wildcard. Exact match means identical
characters. The fact that you had to write "given the presence of the
wildcard" is why this is not an exact match.

David J.

#12Jeff Janes
jeff.janes@gmail.com
In reply to: Shaheed Haque (#7)
Re: Unable to make use of "deep" JSONB index

On Sun, Jun 12, 2022 at 3:31 PM Shaheed Haque <shaheedhaque@gmail.com>
wrote:

(Resend, wrong version was sent before)

Thanks Erik. Is the point that the index has to be on the JSON field
as a whole (i.e. "snapshot") rather than deep inside it (e.g.
"snapshot.something.further[down]")?

In my case, the snapshot is several MB in size (perhaps 10MB or even
20MB), dominated by the snapshot.employee (cardinality 10k, each sized
as a dict 1-2kB as text). My expectation/guess is that an index of
"snapshot" will itself be of a size of similar order. However the
design as-is works very well except for this one case where to speed
it up, in principle, the index need contain no more than one boolean
per employee. So that's what I'd like to achieve, if possible.

It sounds like what you really want here is to extract just the list of the
ids which meet one of your three further criteria, and index that list.

You can do that with jsonpath, but you have to apply it with a function,
not one of the boolean-returning operators.

This almost works to do that:

create index on payrun using gin (jsonb_path_query_array(snapshot,
'$.employee.* ? (@.pay_graph <> 0 || @.last_run_of_employment == true ||
@.state.employment[last][2] == 0).id'));

Then query it like:

select * from payrun where jsonb_path_query_array(snapshot, '$.employee.* ?
(@.pay_graph <> 0 || @.last_run_of_employment == true ||
@.state.employment[last][2] == 0).id') ? '999';

The problem is that ? only tests for top-level text values, while in your
example the value for the 'id' keys are ints, not text. So they are not
findable with the ? operator. If I edit your example data to wrap the ids'
values in double quotes, turning them into json strings rather than json
ints, then this does work for me.

Maybe there is a way to modify the jsonpath so that it converts the ints to
text for you. But if there is such a way, I don't know what it is.

If I were doing this for my own system, I would probably create an
immutable user-defined-function which took a jsonb and returned an int[] of
the filtered id values. Then you would have to query it with @> rather
than ?.

- I've not done things right, in which case I'd love to know my mistake.
- It is not supposed to work, in which case it would be good to have
that stated, and maybe have PG not allow useless indices to be
created.
- It is a bug.

I guess it is number one mixed with number two. The index you created is
useless for your intended purpose, but is not useless for every conceivable
purpose. It is not realistic to expect PostgreSQL to reject things just
because it is not obvious (to a computer) what you are getting up to.

Cheers,

Jeff

#13Erik Rijkers
er@xs4all.nl
In reply to: Shaheed Haque (#7)
Re: Unable to make use of "deep" JSONB index

Op 12-06-2022 om 21:31 schreef Shaheed Haque:

Thanks Erik. Is the point that the index has to be on the JSON field
as a whole (i.e. "snapshot") rather than deep inside it (e.g.
"snapshot.something.further[down]")?

In my case, the snapshot is several MB in size (perhaps 10MB or even
20MB), dominated by the snapshot.employee (cardinality 10k, each sized
as a dict 1-2kB as text). My expectation/guess is that an index of
"snapshot" will itself be of a size of similar order. However the
design as-is works very well except for this one case where to speed
it up, in principle, the index need contain no more than one boolean
per employee. So that's what I'd like to achieve, if possible.

I've seen no hint in the documentation that creating the index on
"snapshot.something.further[down]" should not work, and PG certainly
allows it to be created. Also, Tom has suggested that I should not
look to some magical ability to infer the use of the index from a
differently structured query, and I've taken that on board with the
updated query + index.

If you insist on a btree/integer on pay_graph, the below stuff seems to
work, no?

Perhaps you've missed the difference
between
x != 0 on the one hand,
and
x > 0 on the other,

Here are both queries to show that difference.

CREATE INDEX payrun_btree_paygr_idx ON payrun using btree
(((snapshot->'employee'->'999'->>'pay_graph')::integer));

set enable_seqscan = 0;

SELECT id, snapshot #>'{employee,999,state,employment}'
FROM payrun
WHERE (snapshot->'employee'->'999'->>'pay_graph')::integer > 0 ;
id | ?column?
----+---------------------------
2 | [["1970-01-01", null, 3]]
(1 row)

Time: 1.384 ms

explain analyze SELECT id, snapshot #>'{employee,999,state,employment}'
FROM payrun WHERE (snapshot->'employee'->'999'->>'pay_graph')::integer >
0 ;
QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------
Index Scan using payrun_btree_paygr_idx on payrun (cost=0.14..8.98
rows=42 width=36) (actual time=0.018..0.020 rows=1 loops=1)
Index Cond: (((((snapshot -> 'employee'::text) -> '999'::text) ->>
'pay_graph'::text))::integer > 0)
Planning Time: 0.047 ms
Execution Time: 0.048 ms
(4 rows)

Time: 0.386 ms
explain analyze SELECT id, snapshot #>'{employee,999,state,employment}'
FROM payrun WHERE (snapshot->'employee'->'999'->>'pay_graph')::integer
!= 0 ;
QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------
Seq Scan on payrun (cost=10000000000.00..10000000004.49 rows=126
width=36) (actual time=112.182..112.186 rows=1 loops=1)
Filter: (((((snapshot -> 'employee'::text) -> '999'::text) ->>
'pay_graph'::text))::integer <> 0)
Rows Removed by Filter: 3
Planning Time: 0.050 ms
JIT:
Functions: 4
Options: Inlining true, Optimization true, Expressions true,
Deforming true
Timing: Generation 0.636 ms, Inlining 41.761 ms, Optimization 52.033
ms, Emission 18.228 ms, Total 112.658 ms
Execution Time: 153.486 ms
(9 rows)

Time: 153.835 ms

So:
0.386 ms for x > 0 uses payrun_btree_paygr_idx
vs:
153.835 ms for x != 0 uses seq scan

It's hard to see for me where you encounter problems; including the fast
searches in my earier mail, it looks to me like you have many good options.

Erik Rijkers

#14Shaheed Haque
shaheedhaque@gmail.com
In reply to: Jeff Janes (#12)
Re: Unable to make use of "deep" JSONB index

Jeff, David, Erik, Tom,

I now see that I misunderstood several aspects of the problem. With
regard to Jeff's last note, I note (and like!!!!):

- the use of the final ".id" to collapse the employee "dict" into an
array of simple values.
- the use of "?" to test elements in the array as "keys".

As Jeff noted, the "?" requires a text value. Luckily I have a
".username" in my real data that should work...so I think I have a
solution without recourse to a custom function! I have appended an
updated test case showing the indexing working. Finally, I do wonder
if a "jsonb_path_query_array_text" function might be worth considering
as an enhancement for when there is not a usable text value available?

Thanks again for all your kind help.

Shaheed

== test case showing how to index "deep" into a JSONB field ==

DROP TABLE payrun;
DROP INDEX idx1;

CREATE TABLE payrun (
id serial primary key,
snapshot JSONB
);

INSERT INTO payrun(snapshot)
VALUES
('{"employee": {"999": {"id": "user999@foo.com", "state":
{"employment": [["1920-01-01", null, 3]]}, "pay_graph": 0,
"last_run_of_employment": false}, "111": {"id": "user111@foo.com",
"state": {"employment": [["1920-01-01", null, 5]]}, "pay_graph": 3,
"last_run_of_employment": true}}}'),
('{"employee": {"999": {"id": "user999@foo.com", "state":
{"employment": [["1970-01-01", null, 3]]}, "pay_graph": 6,
"last_run_of_employment": true}, "222": {"id": "user222@foo.com",
"state": {"employment": [["1920-01-01", null, 5]]}, "pay_graph": 5,
"last_run_of_employment": true}}}'),
('{"employee": {"998": {"id": "user998@foo.com", "state":
{"employment": [["1980-01-01", null, 3]]}, "pay_graph": 7,
"last_run_of_employment": false}, "333": {"id": "user333@foo.com",
"state": {"employment": [["1920-01-01", null, 5]]}, "pay_graph": 3,
"last_run_of_employment": true}}}')
;

CREATE INDEX idx1 ON payrun USING gin
(jsonb_path_query_array(snapshot, '$.employee.* ? (@.pay_graph <> 0 ||
@.last_run_of_employment == true || @.state.employment[last][2] ==
0).id'));

SELECT id,snapshot #>'{employee,999,state,employment}' FROM "payrun"
WHERE jsonb_path_query_array(snapshot, '$.employee.* ? (@.pay_graph <>
0 || @.last_run_of_employment == true || @.state.employment[last][2]
== 0).id') ? 'user999@foo.com';

--
-- Test
--
set enable_seqscan = OFF;

EXPLAIN ANALYSE
SELECT id,snapshot #>'{employee,999,state,employment}' FROM "payrun"
WHERE jsonb_path_query_array(snapshot, '$.employee.* ? (@.pay_graph <>
0 || @.last_run_of_employment == true || @.state.employment[last][2]
== 0).id') ? 'user999@foo.com';

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------
Bitmap Heap Scan on payrun (cost=8.00..12.02 rows=1 width=36) (actual
time=0.022..0.023 rows=1 loops=1)
Recheck Cond: (jsonb_path_query_array(snapshot,
'$."employee".*?((@."pay_graph" != 0 || @."last_run_of_employment" ==
true) || @."state"."employment"[last][2] == 0)."id"'::jsonpath,
'{}'::jsonb, false) ? 'user999@foo.com
'::text)
Heap Blocks: exact=1
-> Bitmap Index Scan on idx1 (cost=0.00..8.00 rows=1 width=0)
(actual time=0.007..0.007 rows=1 loops=1)
Index Cond: (jsonb_path_query_array(snapshot,
'$."employee".*?((@."pay_graph" != 0 || @."last_run_of_employment" ==
true) || @."state"."employment"[last][2] == 0)."id"'::jsonpath,
'{}'::jsonb, false) ? 'user999@foo
.com'::text)
Planning Time: 0.062 ms
Execution Time: 0.041 ms
(7 rows)

Show quoted text

On Mon, 13 Jun 2022 at 03:23, Jeff Janes <jeff.janes@gmail.com> wrote:

On Sun, Jun 12, 2022 at 3:31 PM Shaheed Haque <shaheedhaque@gmail.com> wrote:

(Resend, wrong version was sent before)

Thanks Erik. Is the point that the index has to be on the JSON field
as a whole (i.e. "snapshot") rather than deep inside it (e.g.
"snapshot.something.further[down]")?

In my case, the snapshot is several MB in size (perhaps 10MB or even
20MB), dominated by the snapshot.employee (cardinality 10k, each sized
as a dict 1-2kB as text). My expectation/guess is that an index of
"snapshot" will itself be of a size of similar order. However the
design as-is works very well except for this one case where to speed
it up, in principle, the index need contain no more than one boolean
per employee. So that's what I'd like to achieve, if possible.

It sounds like what you really want here is to extract just the list of the ids which meet one of your three further criteria, and index that list.

You can do that with jsonpath, but you have to apply it with a function, not one of the boolean-returning operators.

This almost works to do that:

create index on payrun using gin (jsonb_path_query_array(snapshot, '$.employee.* ? (@.pay_graph <> 0 || @.last_run_of_employment == true || @.state.employment[last][2] == 0).id'));

Then query it like:

select * from payrun where jsonb_path_query_array(snapshot, '$.employee.* ? (@.pay_graph <> 0 || @.last_run_of_employment == true || @.state.employment[last][2] == 0).id') ? '999';

The problem is that ? only tests for top-level text values, while in your example the value for the 'id' keys are ints, not text. So they are not findable with the ? operator. If I edit your example data to wrap the ids' values in double quotes, turning them into json strings rather than json ints, then this does work for me.

Maybe there is a way to modify the jsonpath so that it converts the ints to text for you. But if there is such a way, I don't know what it is.

If I were doing this for my own system, I would probably create an immutable user-defined-function which took a jsonb and returned an int[] of the filtered id values. Then you would have to query it with @> rather than ?.

- I've not done things right, in which case I'd love to know my mistake.
- It is not supposed to work, in which case it would be good to have
that stated, and maybe have PG not allow useless indices to be
created.
- It is a bug.

I guess it is number one mixed with number two. The index you created is useless for your intended purpose, but is not useless for every conceivable purpose. It is not realistic to expect PostgreSQL to reject things just because it is not obvious (to a computer) what you are getting up to.

Cheers,

Jeff