Is it safe to cache data by GiST consistent function

Started by Michał Kłeczekalmost 2 years ago5 messages

michal@kleczek.org

almost 2 years ago

Hello,

When implementing a GiST consistent function I found the need to cache pre-processed query across invocations.
I am not sure if it is safe to do (or I need to perform some steps to make sure cached info is not leaked between rescans).

The comment in gistrescan says:

/*
* If this isn't the first time through, preserve the fn_extra
* pointers, so that if the consistentFns are using them to cache
* data, that data is not leaked across a rescan.
*/

which seems to me self-contradictory as fn_extra is preserved between rescans (so leaks are indeed possible).

Am I missing something?

Thanks,
Michal

Tom Lane

tgl@sss.pgh.pa.us

almost 2 years ago

In reply to: Michał Kłeczek (#1)

Re: Is it safe to cache data by GiST consistent function

=?utf-8?Q?Micha=C5=82_K=C5=82eczek?= <michal@kleczek.org> writes:

When implementing a GiST consistent function I found the need to cache pre-processed query across invocations.
I am not sure if it is safe to do (or I need to perform some steps to make sure cached info is not leaked between rescans).

AFAIK it works. I don't see any of the in-core ones doing so,
but at least range_gist_consistent and multirange_gist_consistent
are missing a bet by repeating their cache search every time.

The comment in gistrescan says:

/*
* If this isn't the first time through, preserve the fn_extra
* pointers, so that if the consistentFns are using them to cache
* data, that data is not leaked across a rescan.
*/

which seems to me self-contradictory as fn_extra is preserved between rescans (so leaks are indeed possible).

I think you're reading it wrong. If we cleared fn_extra during
rescan, access to the old extra value would be lost so a new one
would have to be created, leaking the old value for the rest of
the query.

regards, tom lane

Michał Kłeczek

michal@kleczek.org

almost 2 years ago

In reply to: Tom Lane (#2)

Re: Is it safe to cache data by GiST consistent function

Thanks for taking your time to answer. Not sure if I understand though.

On 3 Apr 2024, at 16:27, Tom Lane <tgl@sss.pgh.pa.us> wrote:

=?utf-8?Q?Micha=C5=82_K=C5=82eczek?= <michal@kleczek.org> writes:

When implementing a GiST consistent function I found the need to cache pre-processed query across invocations.
I am not sure if it is safe to do (or I need to perform some steps to make sure cached info is not leaked between rescans).

AFAIK it works. I don't see any of the in-core ones doing so,
but at least range_gist_consistent and multirange_gist_consistent
are missing a bet by repeating their cache search every time.

pg_trgm consistent caches tigrams but it has some logic to make sure cached values are recalculated:

cache = (gtrgm_consistent_cache *) fcinfo->flinfo->fn_extra;
if (cache == NULL ||
cache->strategy != strategy ||
VARSIZE(cache->query) != querysize ||
memcmp((char *) cache->query, (char *) query, querysize) != 0)

What I don’t understand is if it is necessary or it is enough to check fn_extra==NULL.

The comment in gistrescan says:

/*
* If this isn't the first time through, preserve the fn_extra
* pointers, so that if the consistentFns are using them to cache
* data, that data is not leaked across a rescan.
*/

which seems to me self-contradictory as fn_extra is preserved between rescans (so leaks are indeed possible).

I think you're reading it wrong. If we cleared fn_extra during
rescan, access to the old extra value would be lost so a new one
would have to be created, leaking the old value for the rest of
the query.

I understand that but not sure what “that data is not leaked across a rescan” means.

—
Michal

Tom Lane

tgl@sss.pgh.pa.us

almost 2 years ago

In reply to: Michał Kłeczek (#3)

Re: Is it safe to cache data by GiST consistent function

=?utf-8?Q?Micha=C5=82_K=C5=82eczek?= <michal@kleczek.org> writes:

On 3 Apr 2024, at 16:27, Tom Lane <tgl@sss.pgh.pa.us> wrote:

AFAIK it works. I don't see any of the in-core ones doing so,
but at least range_gist_consistent and multirange_gist_consistent
are missing a bet by repeating their cache search every time.

pg_trgm consistent caches tigrams but it has some logic to make sure cached values are recalculated:

cache = (gtrgm_consistent_cache *) fcinfo->flinfo->fn_extra;
if (cache == NULL ||
cache->strategy != strategy ||
VARSIZE(cache->query) != querysize ||
memcmp((char *) cache->query, (char *) query, querysize) != 0)

What I don’t understand is if it is necessary or it is enough to check fn_extra==NULL.

Ah, I didn't think to search contrib. Yes, you need to validate the
cache entry. In this example, a rescan could insert a new query
value. In general, an opclass support function could get called using
a pretty long-lived FunctionCallInfo (e.g. one in the index's relcache
entry), so it's unwise to assume that cached data is relevant to the
current call without checking.

regards, tom lane

Michał Kłeczek

michal@kleczek.org

almost 2 years ago

In reply to: Tom Lane (#4)

Re: Is it safe to cache data by GiST consistent function

On 3 Apr 2024, at 19:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:

=?utf-8?Q?Micha=C5=82_K=C5=82eczek?= <michal@kleczek.org> writes:

pg_trgm consistent caches tigrams but it has some logic to make sure cached values are recalculated:

cache = (gtrgm_consistent_cache *) fcinfo->flinfo->fn_extra;
if (cache == NULL ||
cache->strategy != strategy ||
VARSIZE(cache->query) != querysize ||
memcmp((char *) cache->query, (char *) query, querysize) != 0)

What I don’t understand is if it is necessary or it is enough to check fn_extra==NULL.

Ah, I didn't think to search contrib. Yes, you need to validate the
cache entry. In this example, a rescan could insert a new query
value. In general, an opclass support function could get called using
a pretty long-lived FunctionCallInfo (e.g. one in the index's relcache
entry), so it's unwise to assume that cached data is relevant to the
current call without checking.

This actually sounds scary - looks like there is no way to perform cache clean-up after rescan then?

Do you think it might be useful to introduce a way for per-rescan caching (ie. setting up a dedicated memory context in gistrescan and passing it to support functions)?

—
Michal