"Skipping" BitmapHeapScan for simple counts

Started by Maarten Mortieralmost 5 years ago2 messagesgeneral
Jump to latest
#1Maarten Mortier
maarten.mortier@gmail.com

Hello,

We have a table with raw jsonb data in a field "data".

We have a jsonb_ops GIN index on this field, because we query the following
"freeform" match:

SELECT COUNT(id)
FROM records
WHERE data @@ '$.**.label == "person"';

When analysing this plan, it seems the BitmapIndexScan to do this is fairly
fast (as it uses the GIN index, which is quite powerful), but the
BitmapHeapScan that follows, is slow, when the amount of matching records
span a lot of heap blocks.

I can understand why this happens: the BitmapIndexScan is supposed to be
somewhat optimistic, and rows need to be discarded by properly going
through the data in a BitmapHeapScan. This optimism can (for other types of
indices) be somewhat tamed by "vacuum"ing the indices, which would allow
IndexOnly queries that include availability of the records with the index.

However, in the case of jsonb queries above, there is no need to do the
BitmapHeapScan really: the BitmapIndexScan has the correct row count,
always.

So, I would like for us to just skip it. If we need the data, we can do
LIMIT/OFFSET windows to tame the BitmapHeapScan, but having a full count
would be nice.

Can we skip this in some way? Can I find the right spot in the code where
this could be skipped?

Thanks so much for any help,

--
Maarten

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Maarten Mortier (#1)
Re: "Skipping" BitmapHeapScan for simple counts

Maarten Mortier <maarten.mortier@gmail.com> writes:

We have a jsonb_ops GIN index on this field, because we query the following
"freeform" match:

SELECT COUNT(id)
FROM records
WHERE data @@ '$.**.label == "person"';

However, in the case of jsonb queries above, there is no need to do the
BitmapHeapScan really: the BitmapIndexScan has the correct row count,
always.

This claim is wrong (in general) because of the need to verify visibility
of the heap row. You might be able to look aside at the visibility map
and find out that a particular heap page is all-visible, but unless the
table is mostly static that can easily be a net waste of cycles. The
reason we make a distinction between plain indexscans and index-only
scans is to encode in the plan whether the planner thought such checks
were likely to be a win. To transpose the idea into bitmapscans, you'd
need similar infrastructure, which is not going to be a hack-it-in-one-
place project.

More to the point: I doubt this will help at all for the query above,
because I doubt that the index is lossless for this query. From what
I recall of GIN for jsonb, what it will return is TIDs for tuples that
contain both "label" and "person". Whether they have the relationship
of "person" being the value of a "label" tag can't be told without
looking at the actual JSON value.

regards, tom lane