Re: Issue with pg_dump due to Schema OID Error

Started by Renzo Daniover 1 year ago8 messagesgeneral
Jump to latest
#1Renzo Dani
arons7@gmail.com

I forgot to attache the script.

On Thu, Dec 19, 2024 at 10:41 AM Renzo Dani <renzo.dani@gmail.com> wrote:

Show quoted text

Hi,

Recently, I encountered a problem during a database export using pg_dump.

Here is the error message:

pg_dump: last built-in OID is 16383

pg_dump: reading extensions

pg_dump: identifying extension members

pg_dump: reading schemas

pg_dump: reading user-defined tables

pg_dump: reading user-defined functions

pg_dump: error: schema with OID 41960442 does not exist

To investigate the issue, I ran the following query:

SELECT * FROM pg_proc WHERE pronamespace = 41960442;

The result:

oid;proname;pronamespace;proowner;prolang;procost;prorows;provariadic;
prosupport;prokind;prosecdef;proleakproof;proisstrict;proretset;
provolatile;proparallel;pronargs;pronargdefaults;prorettype;proargtypes;
proallargtypes;proargmodes;proargnames;proargdefaults;protrftypes;prosrc;
probin;prosqlbody;proconfig;proacl

41966618;remapprotocoltypeids
;41960442;19214494;13547;100;0;0;-;f;f;f;f;f;v;u;1;0;25;25;;;{
pprotocoltypeids};;;

I resolved the issue by removing the problematic record (admin privileges
required):

DELETE FROM pg_proc WHERE oid = 41966618;

This situation seems inconsistent and likely should not occur under normal
conditions.

While I’m unsure exactly when this issue originated in our environment, I
was able to reproduce it by performing concurrent modifications on the
schema.

To demonstrate, I wrote a bash script (test_bug.sh) that starts two
threads running in parallel.

Each thread drops the schema with CASCADE and recreates it using the SQL
script search_bug.sql.

To use the script, you’ll need to adapt two variables at the beginning of
the script: PGPASSWORD and URL.

Using this script, I reproduced the problem on PostgreSQL versions 16.1
and 17.1.

It typically takes less than a minute to trigger the issue.

The script terminates automatically as soon as the problem is detected.

Here are additional references that might be related to this issue:

/messages/by-id/20110209003823.GA93840@mr-paradox.net

/messages/by-id/BB8AF37F-E3D9-4DE0-B398-AF89748704F5@bandwidth.com

Let me know if you need additional information.

Best regards

Renzo

Attachments:

search_bug.sqlapplication/octet-stream; name=search_bug.sqlDownload
test_bug.shapplication/octet-stream; name=test_bug.shDownload
#2Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Renzo Dani (#1)

On 12/19/24 01:43, arons wrote:

I forgot to attache the script.

In MyTestBugSchema01.baseProc() you meant to have:

select MyTestBugSchema.afunction( u.username )

instead of

select MyTestBugSchema2.afunction( u.username )

In other words there is a predefined MyTestBugSchema?

On Thu, Dec 19, 2024 at 10:41 AM Renzo Dani <renzo.dani@gmail.com
<mailto:renzo.dani@gmail.com>> wrote:

Hi,

Recently, I encountered a problem during a database export using
pg_dump.

Here is the error message:

pg_dump: last built-in OID is 16383

pg_dump: reading extensions

pg_dump: identifying extension members

pg_dump: reading schemas

pg_dump: reading user-defined tables

pg_dump: reading user-defined functions

pg_dump: error: schema with OID 41960442 does not exist

To investigate the issue, I ran the following query:

SELECT * FROM pg_proc WHERE pronamespace = 41960442;

The result:

oid;proname;pronamespace;proowner;prolang;procost;prorows;provariadic;prosupport;prokind;prosecdef;proleakproof;proisstrict;proretset;provolatile;proparallel;pronargs;pronargdefaults;prorettype;proargtypes;proallargtypes;proargmodes;proargnames;proargdefaults;protrftypes;prosrc;probin;prosqlbody;proconfig;proacl

41966618;remapprotocoltypeids;41960442;19214494;13547;100;0;0;-;f;f;f;f;f;v;u;1;0;25;25;;;{pprotocoltypeids};;;

I resolved the issue by removing the problematic record (admin
privileges required):

DELETE FROM pg_proc WHERE oid = 41966618;

This situation seems inconsistent and likely should not occur under
normal conditions.

While I’m unsure exactly when this issue originated in our
environment, I was able to reproduce it by performing concurrent
modifications on the schema.

To demonstrate, I wrote a bash script (test_bug.sh) that starts two
threads running in parallel.

Each thread drops the schema with CASCADE and recreates it using the
SQL script search_bug.sql.

To use the script, you’ll need to adapt two variables at the
beginning of the script: PGPASSWORD and URL.

Using this script, I reproduced the problem on PostgreSQL versions
16.1 and 17.1.

It typically takes less than a minute to trigger the issue.

The script terminates automatically as soon as the problem is detected.

Here are additional references that might be related to this issue:

/messages/by-id/20110209003823.GA93840@mr-paradox.net </messages/by-id/20110209003823.GA93840@mr-paradox.net&gt;

/messages/by-id/BB8AF37F-E3D9-4DE0-B398-AF89748704F5@bandwidth.com </messages/by-id/BB8AF37F-E3D9-4DE0-B398-AF89748704F5@bandwidth.com&gt;

Let me know if you need additional information.

Best regards

Renzo

--
Adrian Klaver
adrian.klaver@aklaver.com

#3Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Renzo Dani (#1)

On 12/19/24 08:21, Renzo Dani wrote:

Reply to list also.
Ccing list.

HI Adrian,
you are right, there is a typo, the correct would be to have:
 MyTestBugSchema2.afunction( u.username )

In any case the problem appears in my tests also with that script.

To be clear the test script did not use MyTestBugSchema2.afunction(
u.username ) but instead MyTestBugSchema.afunction( u.username ).

If that is the case where did MyTestBugSchema come from?

I think the execution is not really important, I suppose what create the
problem is the drop cascade of the schema.

It is important if someone wants to replicate the test case.

BR
Renzo

On Thu, Dec 19, 2024 at 5:15 PM Adrian Klaver <adrian.klaver@aklaver.com
<mailto:adrian.klaver@aklaver.com>> wrote:

On 12/19/24 01:43, arons wrote:

I forgot to attache the script.

In MyTestBugSchema01.baseProc() you meant to have:

select MyTestBugSchema.afunction( u.username )

instead of

select MyTestBugSchema2.afunction( u.username )

In other words there is a predefined MyTestBugSchema?

On Thu, Dec 19, 2024 at 10:41 AM Renzo Dani <renzo.dani@gmail.com

<mailto:renzo.dani@gmail.com>

<mailto:renzo.dani@gmail.com <mailto:renzo.dani@gmail.com>>> wrote:

     Hi,

     Recently, I encountered a problem during a database export using
     pg_dump.

     Here is the error message:

     pg_dump: last built-in OID is 16383

     pg_dump: reading extensions

     pg_dump: identifying extension members

     pg_dump: reading schemas

     pg_dump: reading user-defined tables

     pg_dump: reading user-defined functions

     pg_dump: error: schema with OID 41960442 does not exist

     To investigate the issue, I ran the following query:

     SELECT * FROM pg_proc WHERE pronamespace = 41960442;

     The result:

 oid;proname;pronamespace;proowner;prolang;procost;prorows;provariadic;prosupport;prokind;prosecdef;proleakproof;proisstrict;proretset;provolatile;proparallel;pronargs;pronargdefaults;prorettype;proargtypes;proallargtypes;proargmodes;proargnames;proargdefaults;protrftypes;prosrc;probin;prosqlbody;proconfig;proacl

 41966618;remapprotocoltypeids;41960442;19214494;13547;100;0;0;-;f;f;f;f;f;v;u;1;0;25;25;;;{pprotocoltypeids};;;

     I resolved the issue by removing the problematic record (admin
     privileges required):

     DELETE FROM pg_proc WHERE oid = 41966618;

     This situation seems inconsistent and likely should not occur

under

     normal conditions.

     While I’m unsure exactly when this issue originated in our
     environment, I was able to reproduce it by performing concurrent
     modifications on the schema.

     To demonstrate, I wrote a bash script (test_bug.sh) that

starts two

     threads running in parallel.

     Each thread drops the schema with CASCADE and recreates it

using the

     SQL script search_bug.sql.

     To use the script, you’ll need to adapt two variables at the
     beginning of the script: PGPASSWORD and URL.

     Using this script, I reproduced the problem on PostgreSQL

versions

     16.1 and 17.1.

     It typically takes less than a minute to trigger the issue.

     The script terminates automatically as soon as the problem is

detected.

     Here are additional references that might be related to this

issue:

/messages/by-id/20110209003823.GA93840@mr-paradox.net </messages/by-id/20110209003823.GA93840@mr-paradox.net&gt; </messages/by-id/20110209003823.GA93840@mr-paradox.net </messages/by-id/20110209003823.GA93840@mr-paradox.net&gt;&gt;

/messages/by-id/BB8AF37F-E3D9-4DE0-B398-AF89748704F5@bandwidth.com </messages/by-id/BB8AF37F-E3D9-4DE0-B398-AF89748704F5@bandwidth.com&gt; </messages/by-id/BB8AF37F-E3D9-4DE0-B398-AF89748704F5@bandwidth.com </messages/by-id/BB8AF37F-E3D9-4DE0-B398-AF89748704F5@bandwidth.com&gt;&gt;

     Let me know if you need additional information.

     Best regards

     Renzo

--
Adrian Klaver
adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>

--
Adrian Klaver
adrian.klaver@aklaver.com

#4Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Renzo Dani (#1)

On 12/19/24 08:46, Renzo Dani wrote:

Again.

Reply to list also using Reply All.
Ccing list.

Hi Adrian,
here a new version of the script that I just tested produce the same
problem.
The script do not rely now on any additional relation than the ones
defined into the script.

BR
Renzo

On Thu, Dec 19, 2024 at 5:28 PM Adrian Klaver <adrian.klaver@aklaver.com
<mailto:adrian.klaver@aklaver.com>> wrote:

On 12/19/24 08:21, Renzo Dani wrote:

Reply to list also.
Ccing list.

HI Adrian,
you are right, there is a typo, the correct would be to have:
   MyTestBugSchema2.afunction( u.username )

In any case the problem appears in my tests also with that script.

To be clear the test script did not use MyTestBugSchema2.afunction(
u.username ) but instead MyTestBugSchema.afunction( u.username ).

If that is the case where did MyTestBugSchema come from?

I think the execution is not really important, I suppose what

create the

problem is the drop cascade of the schema.

It is important if someone wants to replicate the test case.

BR
Renzo

On Thu, Dec 19, 2024 at 5:15 PM Adrian Klaver

<adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>

<mailto:adrian.klaver@aklaver.com

<mailto:adrian.klaver@aklaver.com>>> wrote:

     On 12/19/24 01:43, arons wrote:
      > I forgot to attache the script.

     In MyTestBugSchema01.baseProc() you meant to have:

     select MyTestBugSchema.afunction( u.username )

     instead of

     select MyTestBugSchema2.afunction( u.username )

     In other words there is a predefined MyTestBugSchema?

      >
      > On Thu, Dec 19, 2024 at 10:41 AM Renzo Dani

<renzo.dani@gmail.com <mailto:renzo.dani@gmail.com>

     <mailto:renzo.dani@gmail.com <mailto:renzo.dani@gmail.com>>
      > <mailto:renzo.dani@gmail.com <mailto:renzo.dani@gmail.com>

<mailto:renzo.dani@gmail.com <mailto:renzo.dani@gmail.com>>>> wrote:

      >
      >     Hi,
      >
      >
      >     Recently, I encountered a problem during a database

export using

      >     pg_dump.
      >
      >
      >     Here is the error message:
      >
      >
      >     pg_dump: last built-in OID is 16383
      >
      >     pg_dump: reading extensions
      >
      >     pg_dump: identifying extension members
      >
      >     pg_dump: reading schemas
      >
      >     pg_dump: reading user-defined tables
      >
      >     pg_dump: reading user-defined functions
      >
      >     pg_dump: error: schema with OID 41960442 does not exist
      >
      >
      >     To investigate the issue, I ran the following query:
      >
      >
      >     SELECT * FROM pg_proc WHERE pronamespace = 41960442;
      >
      >
      >     The result:
      >
      >
      >

 oid;proname;pronamespace;proowner;prolang;procost;prorows;provariadic;prosupport;prokind;prosecdef;proleakproof;proisstrict;proretset;provolatile;proparallel;pronargs;pronargdefaults;prorettype;proargtypes;proallargtypes;proargmodes;proargnames;proargdefaults;protrftypes;prosrc;probin;prosqlbody;proconfig;proacl

      >
      >

 41966618;remapprotocoltypeids;41960442;19214494;13547;100;0;0;-;f;f;f;f;f;v;u;1;0;25;25;;;{pprotocoltypeids};;;

      >
      >
      >     I resolved the issue by removing the problematic

record (admin

      >     privileges required):
      >
      >
      >     DELETE FROM pg_proc WHERE oid = 41966618;
      >
      >
      >     This situation seems inconsistent and likely should

not occur

     under
      >     normal conditions.
      >
      >
      >     While I’m unsure exactly when this issue originated in our
      >     environment, I was able to reproduce it by performing

concurrent

      >     modifications on the schema.
      >
      >
      >     To demonstrate, I wrote a bash script (test_bug.sh) that
     starts two
      >     threads running in parallel.
      >
      >     Each thread drops the schema with CASCADE and recreates it
     using the
      >     SQL script search_bug.sql.
      >
      >
      >     To use the script, you’ll need to adapt two variables

at the

      >     beginning of the script: PGPASSWORD and URL.
      >
      >
      >     Using this script, I reproduced the problem on PostgreSQL
     versions
      >     16.1 and 17.1.
      >
      >     It typically takes less than a minute to trigger the

issue.

      >
      >     The script terminates automatically as soon as the

problem is

     detected.
      >
      >
      >     Here are additional references that might be related

to this

     issue:
      >
      >
      >

/messages/by-id/20110209003823.GA93840@mr-paradox.net </messages/by-id/20110209003823.GA93840@mr-paradox.net&gt; </messages/by-id/20110209003823.GA93840@mr-paradox.net </messages/by-id/20110209003823.GA93840@mr-paradox.net&gt;&gt; </messages/by-id/20110209003823.GA93840@mr-paradox.net </messages/by-id/20110209003823.GA93840@mr-paradox.net&gt; </messages/by-id/20110209003823.GA93840@mr-paradox.net </messages/by-id/20110209003823.GA93840@mr-paradox.net&gt;&gt;&gt;

      >
      >

/messages/by-id/BB8AF37F-E3D9-4DE0-B398-AF89748704F5@bandwidth.com </messages/by-id/BB8AF37F-E3D9-4DE0-B398-AF89748704F5@bandwidth.com&gt; </messages/by-id/BB8AF37F-E3D9-4DE0-B398-AF89748704F5@bandwidth.com </messages/by-id/BB8AF37F-E3D9-4DE0-B398-AF89748704F5@bandwidth.com&gt;&gt; </messages/by-id/BB8AF37F-E3D9-4DE0-B398-AF89748704F5@bandwidth.com </messages/by-id/BB8AF37F-E3D9-4DE0-B398-AF89748704F5@bandwidth.com&gt; </messages/by-id/BB8AF37F-E3D9-4DE0-B398-AF89748704F5@bandwidth.com </messages/by-id/BB8AF37F-E3D9-4DE0-B398-AF89748704F5@bandwidth.com&gt;&gt;&gt;

      >
      >
      >
      >     Let me know if you need additional information.
      >
      >
      >     Best regards
      >
      >     Renzo
      >

     --
     Adrian Klaver
adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>

<mailto:adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>>

--
Adrian Klaver
adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>

--
Adrian Klaver
adrian.klaver@aklaver.com

Attachments:

search_bug.sqlapplication/sql; name=search_bug.sqlDownload
#5Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Adrian Klaver (#4)

On 12/19/24 08:50, Adrian Klaver wrote:

On 12/19/24 08:46, Renzo Dani wrote:

Again.

Reply to list also using Reply All.
Ccing list.

Hi Adrian,
here a new version of the script that I just tested produce the same
problem.
The script do not rely now on any additional relation than the ones
defined into the script.

I could not replicate using Ubuntu 22.04 and Postgres 16.6.

From your original post:

"I reproduced the problem on PostgreSQL versions 16.1 and 17.1."

The latest bug fixes of each are 16.6 and 17.2. I would upgrade to those
and then try again.

BR
Renzo

--
Adrian Klaver
adrian.klaver@aklaver.com

#6Renzo Dani
arons7@gmail.com
In reply to: Adrian Klaver (#5)

Hi Adrian,
Thanks for the test, I installed also version 17.2 and test with a fresh
newly created db.
At the beginning I cannot reproduce the problem as well.
Than I start adding the extension we use, one after one, and testing in
between.

I found out that the problem appear immediately as soon as I add the
extension pg_stat_statements:

create extension pg_stat_statements WITH SCHEMA public;

alter system set shared_preload_libraries = 'pg_stat_statements';

Can you maybe test as well with this extension?
the other that I installed before that are: pgcrypto and pg_prewarm.

here the current installed one:

dbtest=# \dx
List of installed extensions
Name | Version | Schema |
Description
--------------------+---------+------------+------------------------------------------------------------------------
pg_prewarm | 1.2 | public | prewarm relation data
pg_stat_statements | 1.11 | public | track planning and execution
statistics of all SQL statements executed
pgcrypto | 1.3 | public | cryptographic functions
plpgsql | 1.0 | pg_catalog | PL/pgSQL procedural language
(4 rows)

dbtest=# select version();
version

------------------------------------------------------------------------------------------------------------------
PostgreSQL 17.2 reda build on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
13.3.0-6ubuntu2~24.04) 13.3.0, 64-bit
(1 row)

I'll do some additional tests.

BR
Renzo

On Thu, Dec 19, 2024 at 6:21 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

Show quoted text

On 12/19/24 08:50, Adrian Klaver wrote:

On 12/19/24 08:46, Renzo Dani wrote:

Again.

Reply to list also using Reply All.
Ccing list.

Hi Adrian,
here a new version of the script that I just tested produce the same
problem.
The script do not rely now on any additional relation than the ones
defined into the script.

I could not replicate using Ubuntu 22.04 and Postgres 16.6.

From your original post:

"I reproduced the problem on PostgreSQL versions 16.1 and 17.1."

The latest bug fixes of each are 16.6 and 17.2. I would upgrade to those
and then try again.

BR
Renzo

--
Adrian Klaver
adrian.klaver@aklaver.com

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Adrian Klaver (#5)

Adrian Klaver <adrian.klaver@aklaver.com> writes:

The latest bug fixes of each are 16.6 and 17.2. I would upgrade to those
and then try again.

Highly unlikely to make any difference. What's evidently going on
here is that the test script attempts to do DROP SCHEMA concurrently
with another session that's creating an object inside that schema.
(Here, that's a function, but the particular type of object doesn't
really matter.) There are three possible outcomes of that:

1. The object creation commits soon enough that DROP SCHEMA sees it,
and drops the object along with the schema.

2. The object creation begins after DROP SCHEMA commits, and fails
because the schema is not to be found.

3. The object creation goes through, leaving a now-dangling schema
OID reference in the object's catalog entry. The object is useless
because it's unnamable, but it won't really cause any trouble except
for applications that scan the system catalogs (like pg_dump).

Exactly none of these outcomes result in a usable object, so
one wonders why your application is doing this sort of thing
often enough to hit the race condition.

We could prevent case 3 by locking the schema during object creation,
converting it to one of the other cases. We actually do that for
tables, but not for any other object types, reasoning that the greatly
increased cost of locking would outweigh the problems that dangling
objects create. (Note that to eliminate the issue fully, we'd have to
lock every referenced object not only schemas; for example, also the
data types of the function's arguments and result.) Also, adding such
locking might well lead to deadlocks in concurrent add/drop scenarios,
not just performance costs.

tl;dr: it's been like this a long time, and I don't really foresee
us accepting the costs of making it not act like that. I seem to
recall someone submitting a patch recently that would add such
locking, but I doubt it'll get accepted.

regards, tom lane

#8Renzo Dani
arons7@gmail.com
In reply to: Renzo Dani (#6)

HI Adrian,
I did additional tests and I can reproduce the problem also without the
extensions.
I did the test multiple time with a script that I put as attachment
(wholetest.sh).
I added also the file log_different_run.txt that contains the results of
some test and the times it takes to reproduce.

BR
Renzo

On Thu, Dec 19, 2024 at 9:42 PM Renzo Dani <arons7@gmail.com> wrote:

Show quoted text

Hi Adrian,
Thanks for the test, I installed also version 17.2 and test with a fresh
newly created db.
At the beginning I cannot reproduce the problem as well.
Than I start adding the extension we use, one after one, and testing in
between.

I found out that the problem appear immediately as soon as I add the
extension pg_stat_statements:

create extension pg_stat_statements WITH SCHEMA public;

alter system set shared_preload_libraries = 'pg_stat_statements';

Can you maybe test as well with this extension?
the other that I installed before that are: pgcrypto and pg_prewarm.

here the current installed one:

dbtest=# \dx
List of installed extensions
Name | Version | Schema |
Description

--------------------+---------+------------+------------------------------------------------------------------------
pg_prewarm | 1.2 | public | prewarm relation data
pg_stat_statements | 1.11 | public | track planning and execution
statistics of all SQL statements executed
pgcrypto | 1.3 | public | cryptographic functions
plpgsql | 1.0 | pg_catalog | PL/pgSQL procedural language
(4 rows)

dbtest=# select version();
version

------------------------------------------------------------------------------------------------------------------
PostgreSQL 17.2 reda build on x86_64-pc-linux-gnu, compiled by gcc
(Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0, 64-bit
(1 row)

I'll do some additional tests.

BR
Renzo

On Thu, Dec 19, 2024 at 6:21 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

On 12/19/24 08:50, Adrian Klaver wrote:

On 12/19/24 08:46, Renzo Dani wrote:

Again.

Reply to list also using Reply All.
Ccing list.

Hi Adrian,
here a new version of the script that I just tested produce the same
problem.
The script do not rely now on any additional relation than the ones
defined into the script.

I could not replicate using Ubuntu 22.04 and Postgres 16.6.

From your original post:

"I reproduced the problem on PostgreSQL versions 16.1 and 17.1."

The latest bug fixes of each are 16.6 and 17.2. I would upgrade to those
and then try again.

BR
Renzo

--
Adrian Klaver
adrian.klaver@aklaver.com

Attachments:

log_different_run.txttext/plain; charset=US-ASCII; name=log_different_run.txtDownload
wholetest.shapplication/x-shellscript; name=wholetest.shDownload