BUG #15350: Getting invalid cache ID: 11 Errors

Started by PG Bug reporting formover 7 years ago15 messagesbugs
Jump to latest
#1PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 15350
Logged by: Kieran McCusker
Email address: kieran.mccusker@gmail.com
PostgreSQL version: 10.5
Operating system: Fedora 28
Description:

Hi

We're using 10.5 with parallel queries enabled and the config options
#max_worker_processes = 8
#max_parallel_workers_per_gather = 2
#max_parallel_workers = 8

I'm seeing invalid cache ID: 11 errors in the log. It's only happening
occasionally (15 times today on a not very busy system). I've been following
the thread about making windowing functions et al parallel restricted - Is
this another manifestation of the problem?

On the first query setting set session max_parallel_workers to 0 more than
doubles the execution time from 208ms to 580ms so that doesn't seem
desirable, and it's not obvious to me why the second would have failed.

I'm probably being dense and if so apologies for wasting your time. I also
don't see how to create a test case sorry.

Many thanks

Kieran

A couple of example queries are :-
-----------------------------------------------------------------------------------------------
with limits as (
select min(contact_date), max(contact_date)
from q366midl.q366m1
where contact_date is not null
)
select
to_char(min,'YYYY-MM'),
'text' as format,
to_char(min,substr('YYYY-MM',1,7)) as code,
--
from limits
union all
select
to_char(max,'YYYY-MM'),
'text' as format,
to_char(max,substr('YYYY-MM',1,7)) as code,
to_char(max,'Mon YYYY') as text
from limits
order by 1

2018-08-24 13:50:28.325 BST [68794] ERROR: invalid cache ID: 11
2018-08-24 13:50:28.325 BST [68844] LOG: worker process: parallel worker
for PID 67216 (PID 68793) exited with exit code 1
2018-08-24 13:50:28.327 BST [68844] LOG: worker process: parallel worker
for PID 67216 (PID 68794) exited with exit code 1

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2018-08-23 11:04:23.020 BST [80148] ERROR: invalid cache ID: 11
2018-08-23 11:04:23.020 BST [80148] CONTEXT: parallel worker
2018-08-23 11:04:23.020 BST [80148] STATEMENT: SELECT "survey"."calls".*
FROM "survey"."calls" WHERE ((survey_state is null or survey_state not in
(500,501)) and phone_numbers @> array['07854 xxxxxx']) ORDER BY created_at
desc
2018-08-23 11:04:23.021 BST [81154] ERROR: invalid cache ID: 11
2018-08-23 11:04:23.023 BST [68844] LOG: worker process: parallel worker
for PID 80148 (PID 81153) exited with exit code 1
2018-08-23 11:04:23.023 BST [68844] LOG: worker process: parallel worker
for PID 80148 (PID 81154) exited with exit code 1

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: PG Bug reporting form (#1)
Re: BUG #15350: Getting invalid cache ID: 11 Errors

=?utf-8?q?PG_Bug_reporting_form?= <noreply@postgresql.org> writes:

We're using 10.5 with parallel queries enabled and the config options
#max_worker_processes = 8
#max_parallel_workers_per_gather = 2
#max_parallel_workers = 8

I'm seeing invalid cache ID: 11 errors in the log. It's only happening
occasionally (15 times today on a not very busy system).

Interesting. Syscache 11 would be AUTHOID, which seems to be consulted
mostly for privilege checks, though there's at least one reference
during process startup.

I've been following
the thread about making windowing functions et al parallel restricted - Is
this another manifestation of the problem?

No, I don't think it's that. We've seen a couple of reports of similar
failures referencing cache 42, but this is a new symptom.

I also don't see how to create a test case sorry.

Don't suppose you'd be prepared to recompile PG to turn that ERROR into
a PANIC, and then capture a stack trace from the core dump next time
it happens?

regards, tom lane

#3Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#2)
Re: BUG #15350: Getting invalid cache ID: 11 Errors

On Sat, Aug 25, 2018 at 3:16 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

=?utf-8?q?PG_Bug_reporting_form?= <noreply@postgresql.org> writes:

We're using 10.5 with parallel queries enabled and the config options
#max_worker_processes = 8
#max_parallel_workers_per_gather = 2
#max_parallel_workers = 8

I'm seeing invalid cache ID: 11 errors in the log. It's only happening
occasionally (15 times today on a not very busy system).

Interesting. Syscache 11 would be AUTHOID, which seems to be consulted
mostly for privilege checks, though there's at least one reference
during process startup.

Hi Kieran,

Are you using extensions, by any chance? If an extension were to
access the AUTHOID syscache during _PG_init(), it would fail like this
in parallel workers, because they run RestoreLibraryState() before
they run BackgroundWorkerInitializeConnectionByOid() (which runs
InitPostgres() which runs InitCatalogCache()). Oracle_fdw has this
problem (see nearby thread) and there may be others out there. The
extension wouldn't have to be used by the query that exhibited the
symptom... it could have been loaded earlier in the life of the leader
backend but caused no problem until eventually a parallel query was
launched.

We could probably improve that situation by making syscache lookups
(and probably other things too) fail when called from _PG_init() in
regular backends so that extension authors are made aware of this
hazard, or perhaps go the other way and change the order we do things
in parallel workers.

--
Thomas Munro
http://www.enterprisedb.com

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#3)
Re: BUG #15350: Getting invalid cache ID: 11 Errors

Thomas Munro <thomas.munro@enterprisedb.com> writes:

We could probably improve that situation by making syscache lookups
(and probably other things too) fail when called from _PG_init() in
regular backends so that extension authors are made aware of this
hazard, or perhaps go the other way and change the order we do things
in parallel workers.

Hmm. There's an argument to be made for the latter: we don't really
want stuff failing in parallel workers if it works fine normally.

On the other hand, it seems clear to me that we *don't* want extensions to
be doing stuff like syscache lookups in _PG_init(), because that would
prevent them from working as shared_preload_libraries entries.

And on the third hand, intentionally breaking code that used to work
isn't likely to win us many friends either. So I'm not sure that your
first option is really tenable. Perhaps we could get away with doing
it in HEAD and not back-patching ... but that does little for existing
problems.

regards, tom lane

#5Kieran McCusker
kieran.mccusker@gmail.com
In reply to: Tom Lane (#4)
Re: BUG #15350: Getting invalid cache ID: 11 Errors

Yes I'm using extensions - A bunch of them but I'm guessing it's
likely ogr_fdw that will be causing the issue as I've only seen this on
days when someone has imported data using ogr_fdw.

On Tue, 28 Aug 2018, 03:40 Tom Lane, <tgl@sss.pgh.pa.us> wrote:

Show quoted text

Thomas Munro <thomas.munro@enterprisedb.com> writes:

We could probably improve that situation by making syscache lookups
(and probably other things too) fail when called from _PG_init() in
regular backends so that extension authors are made aware of this
hazard, or perhaps go the other way and change the order we do things
in parallel workers.

Hmm. There's an argument to be made for the latter: we don't really
want stuff failing in parallel workers if it works fine normally.

On the other hand, it seems clear to me that we *don't* want extensions to
be doing stuff like syscache lookups in _PG_init(), because that would
prevent them from working as shared_preload_libraries entries.

And on the third hand, intentionally breaking code that used to work
isn't likely to win us many friends either. So I'm not sure that your
first option is really tenable. Perhaps we could get away with doing
it in HEAD and not back-patching ... but that does little for existing
problems.

regards, tom lane

#6Thomas Munro
thomas.munro@gmail.com
In reply to: Kieran McCusker (#5)
Re: BUG #15350: Getting invalid cache ID: 11 Errors

On Wed, Aug 29, 2018 at 2:31 AM Kieran McCusker
<kieran.mccusker@gmail.com> wrote:

Yes I'm using extensions - A bunch of them but I'm guessing it's likely ogr_fdw that will be causing the issue as I've only seen this on days when someone has imported data using ogr_fdw.

Bingo:

https://github.com/pramsey/pgsql-ogr-fdw/blob/v1.0.6/ogr_fdw.c#L195

--
Thomas Munro
http://www.enterprisedb.com

#7Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#4)
Re: BUG #15350: Getting invalid cache ID: 11 Errors

On Tue, Aug 28, 2018 at 2:40 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@enterprisedb.com> writes:

We could probably improve that situation by making syscache lookups
(and probably other things too) fail when called from _PG_init() in
regular backends so that extension authors are made aware of this
hazard, or perhaps go the other way and change the order we do things
in parallel workers.

Hmm. There's an argument to be made for the latter: we don't really
want stuff failing in parallel workers if it works fine normally.

On the other hand, it seems clear to me that we *don't* want extensions to
be doing stuff like syscache lookups in _PG_init(), because that would
prevent them from working as shared_preload_libraries entries.

And on the third hand, intentionally breaking code that used to work
isn't likely to win us many friends either. So I'm not sure that your
first option is really tenable. Perhaps we could get away with doing
it in HEAD and not back-patching ... but that does little for existing
problems.

Ok, so I guess we should just make it work then. Here's throw-away
test patch to make "unaccent" blow up in this way, and a draft patch
to reorder things so that it works.

CREATE EXTENSION unaccent;
CREATE TABLE t AS SELECT 'hello world' AS v FROM generate_series(1, 1000000);
ANALYZE t;
EXPLAIN ANALYZE SELECT COUNT(unaccent(v)) FROM t;

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

0001-Hack-to-make-unaccent-poke-syscache.patchapplication/octet-stream; name=0001-Hack-to-make-unaccent-poke-syscache.patchDownload+9-1
0002-Defer-restoration-of-libraries-in-parallel-workers.patchapplication/octet-stream; name=0002-Defer-restoration-of-libraries-in-parallel-workers.patchDownload+9-10
#8Amit Kapila
amit.kapila16@gmail.com
In reply to: Thomas Munro (#7)
Re: BUG #15350: Getting invalid cache ID: 11 Errors

On Thu, Aug 30, 2018 at 4:16 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Tue, Aug 28, 2018 at 2:40 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@enterprisedb.com> writes:

We could probably improve that situation by making syscache lookups
(and probably other things too) fail when called from _PG_init() in
regular backends so that extension authors are made aware of this
hazard, or perhaps go the other way and change the order we do things
in parallel workers.

Hmm. There's an argument to be made for the latter: we don't really
want stuff failing in parallel workers if it works fine normally.

On the other hand, it seems clear to me that we *don't* want extensions to
be doing stuff like syscache lookups in _PG_init(), because that would
prevent them from working as shared_preload_libraries entries.

And on the third hand, intentionally breaking code that used to work
isn't likely to win us many friends either. So I'm not sure that your
first option is really tenable. Perhaps we could get away with doing
it in HEAD and not back-patching ... but that does little for existing
problems.

Ok, so I guess we should just make it work then. Here's throw-away
test patch to make "unaccent" blow up in this way, and a draft patch
to reorder things so that it works.

+1 for reordering the things in parallel-workers. It will make the
behavior of parallel workers similar to backends. In fact, I have
proposed that myself sometime back [1]/messages/by-id/CAA4eK1KH0wd3+tUDxMuMbqhLh0kKMu_TOfAcSVWuyfYVt9hfzQ@mail.gmail.com for the similar problem.

[1]: /messages/by-id/CAA4eK1KH0wd3+tUDxMuMbqhLh0kKMu_TOfAcSVWuyfYVt9hfzQ@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#9Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#4)
Re: BUG #15350: Getting invalid cache ID: 11 Errors

On 2018-08-27 22:40:17 -0400, Tom Lane wrote:

Thomas Munro <thomas.munro@enterprisedb.com> writes:

We could probably improve that situation by making syscache lookups
(and probably other things too) fail when called from _PG_init() in
regular backends so that extension authors are made aware of this
hazard, or perhaps go the other way and change the order we do things
in parallel workers.

Hmm. There's an argument to be made for the latter: we don't really
want stuff failing in parallel workers if it works fine normally.

Yea, I guess there's an argument to be made for that.

On the other hand, it seems clear to me that we *don't* want extensions to
be doing stuff like syscache lookups in _PG_init(), because that would
prevent them from working as shared_preload_libraries entries.

Yea, that doesn't seem great.

And on the third hand, intentionally breaking code that used to work
isn't likely to win us many friends either. So I'm not sure that your
first option is really tenable. Perhaps we could get away with doing
it in HEAD and not back-patching ... but that does little for existing
problems.

I wonder if we could make it warn in all branches and error out hard in
master? That way hopefully most extensions would be fixed long before
the next release comes out, and people don't load new libraries at that
crazy a rate...

Greetings,

Andres Freund

#10Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#9)
Re: BUG #15350: Getting invalid cache ID: 11 Errors

On Mon, Sep 3, 2018 at 5:05 AM Andres Freund <andres@anarazel.de> wrote:

On 2018-08-27 22:40:17 -0400, Tom Lane wrote:

Thomas Munro <thomas.munro@enterprisedb.com> writes:

We could probably improve that situation by making syscache lookups
(and probably other things too) fail when called from _PG_init() in
regular backends so that extension authors are made aware of this
hazard, or perhaps go the other way and change the order we do things
in parallel workers.

Hmm. There's an argument to be made for the latter: we don't really
want stuff failing in parallel workers if it works fine normally.

Yea, I guess there's an argument to be made for that.

On the other hand, it seems clear to me that we *don't* want extensions to
be doing stuff like syscache lookups in _PG_init(), because that would
prevent them from working as shared_preload_libraries entries.

Yea, that doesn't seem great.

And on the third hand, intentionally breaking code that used to work
isn't likely to win us many friends either. So I'm not sure that your
first option is really tenable. Perhaps we could get away with doing
it in HEAD and not back-patching ... but that does little for existing
problems.

I wonder if we could make it warn in all branches and error out hard in
master? That way hopefully most extensions would be fixed long before
the next release comes out, and people don't load new libraries at that
crazy a rate...

Any other views? It sounds like we do want a patch like the one I
posted earlier. The question is whether we also want a new warning,
and in later releases an error. Do we really want to crack down on
extension libraries that wouldn't work in shared_preload_libraries?
What about hypothetical libraries that are smart enough to avoid
making such calls when they detect that they're running from
shared_preload_libraries -- you'd generate false warnings when loaded
the regular way.

--
Thomas Munro
http://www.enterprisedb.com

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#10)
Re: BUG #15350: Getting invalid cache ID: 11 Errors

Thomas Munro <thomas.munro@enterprisedb.com> writes:

... Do we really want to crack down on
extension libraries that wouldn't work in shared_preload_libraries?

That is, perhaps, a fair question; but I see little benefit to users
in such a library design. Most people who are using a library heavily
would likely wish to have it preloaded.

What about hypothetical libraries that are smart enough to avoid
making such calls when they detect that they're running from
shared_preload_libraries -- you'd generate false warnings when loaded
the regular way.

I find that argument too thin to take seriously. If the library
can manage that, it might as well just postpone the calls all the
time, rather than have two different operating modes that would
require separate testing. What would be the advantage of the
extra complication?

regards, tom lane

#12Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#11)
Re: BUG #15350: Getting invalid cache ID: 11 Errors

On Thu, Sep 13, 2018 at 1:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@enterprisedb.com> writes:

... Do we really want to crack down on
extension libraries that wouldn't work in shared_preload_libraries?

That is, perhaps, a fair question; but I see little benefit to users
in such a library design. Most people who are using a library heavily
would likely wish to have it preloaded.

What about hypothetical libraries that are smart enough to avoid
making such calls when they detect that they're running from
shared_preload_libraries -- you'd generate false warnings when loaded
the regular way.

I find that argument too thin to take seriously. If the library
can manage that, it might as well just postpone the calls all the
time, rather than have two different operating modes that would
require separate testing. What would be the advantage of the
extra complication?

Ok, 0003 (attached) is a straw-man patch to log a warning. I guess it
won't be too popular with anyone who has worked on making syscache as
fast as possible.

postgres=# select unaccent('hello');
WARNING: syscache access from _PG_init() is deprecated and will not
be allowed in a future release of PostgreSQL
unaccent
----------
hello
(1 row)

You see this warning only once per library per backend (including
parallel workers, as can be tested with the same trivial query with
force_parallel_mode = on, assuming 0002 is also applied so it doesn't
crash).

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

0001-hack-to-make-unaccent-crash-in-parallel-workers-v2.patchapplication/octet-stream; name=0001-hack-to-make-unaccent-crash-in-parallel-workers-v2.patchDownload+8-1
0002-Defer-restoration-of-libraries-in-parallel-worker-v2.patchapplication/octet-stream; name=0002-Defer-restoration-of-libraries-in-parallel-worker-v2.patchDownload+9-10
0003-Log-a-WARNING-if-_PG_init-accesses-syscache-v2.patchapplication/octet-stream; name=0003-Log-a-WARNING-if-_PG_init-accesses-syscache-v2.patchDownload+32-2
#13Andrew Gierth
andrew@tao11.riddles.org.uk
In reply to: Tom Lane (#11)
Re: BUG #15350: Getting invalid cache ID: 11 Errors

"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

Tom> That is, perhaps, a fair question; but I see little benefit to
Tom> users in such a library design. Most people who are using a
Tom> library heavily would likely wish to have it preloaded.

possible counterexample: preprepare

--
Andrew (irc:RhodiumToad)

#14Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#12)
Re: BUG #15350: Getting invalid cache ID: 11 Errors

On Thu, Sep 13, 2018 at 4:22 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Thu, Sep 13, 2018 at 1:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@enterprisedb.com> writes:

... Do we really want to crack down on
extension libraries that wouldn't work in shared_preload_libraries?

That is, perhaps, a fair question; but I see little benefit to users
in such a library design. Most people who are using a library heavily
would likely wish to have it preloaded.

What about hypothetical libraries that are smart enough to avoid
making such calls when they detect that they're running from
shared_preload_libraries -- you'd generate false warnings when loaded
the regular way.

I find that argument too thin to take seriously. If the library
can manage that, it might as well just postpone the calls all the
time, rather than have two different operating modes that would
require separate testing. What would be the advantage of the
extra complication?

Ok, 0003 (attached) is a straw-man patch to log a warning. I guess it
won't be too popular with anyone who has worked on making syscache as
fast as possible.

I would like to close this case, so unless anyone objects, tomorrow
I'm going to push the reordering patch (0002 in my earlier email) so
that we stop getting bug reports about extensions that crash in
parallel workers (even though we consider those extensions to be
operating out of contract). If someone wants to do a patch that will
display warnings when extensions break that rule (something like the
0003 patch I sketched), that can be done separately.

--
Thomas Munro
http://www.enterprisedb.com

#15Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#14)
Re: BUG #15350: Getting invalid cache ID: 11 Errors

On Wed, Sep 19, 2018 at 6:10 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

I would like to close this case, so unless anyone objects, tomorrow
I'm going to push the reordering patch (0002 in my earlier email) so
that we stop getting bug reports about extensions that crash in
parallel workers (even though we consider those extensions to be
operating out of contract). If someone wants to do a patch that will
display warnings when extensions break that rule (something like the
0003 patch I sketched), that can be done separately.

Done.

--
Thomas Munro
http://www.enterprisedb.com