invalid non-zero objectSubId for object class

Started by Michel Pelletieralmost 6 years ago11 messagesgeneral
Jump to latest
#1Michel Pelletier
pelletier.michel@gmail.com

On a 12.3 AWS RDS instance, I get the following error when trying to drop
either of two tables:

dev=> drop table current_flight;
ERROR: invalid non-zero objectSubId for object class 297108
dev=> drop table flight;
ERROR: invalid non-zero objectSubId for object class 297108

I can create and drop new tables, but these two I cannot drop.

Both tables were at one point replicated using pglogical, but I've removed
them from their replication sets and nothing depends on them that I can
see. The replication of other tables is still proceeding normally.

Anyone have any thoughts on this? The db was also upgraded seemingly
successfully from 12.2 a couple of days ago and otherwise seems to be
working fine.

Thanks for any pointers!

-Michel

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michel Pelletier (#1)
Re: invalid non-zero objectSubId for object class

Michel Pelletier <pelletier.michel@gmail.com> writes:

On a 12.3 AWS RDS instance, I get the following error when trying to drop
either of two tables:

dev=> drop table current_flight;
ERROR: invalid non-zero objectSubId for object class 297108
dev=> drop table flight;
ERROR: invalid non-zero objectSubId for object class 297108

This looks like corrupt data in pg_depend, specifically an entry or
entries with classid or refclassid = 297108, which should not happen
(the classid should always be the OID of one of a short list of system
catalogs). You could try poking around in pg_depend to see if you
can identify any obviously-bogus rows.

No idea how it got that way. Have you had any database crashes or the
like?

regards, tom lane

#3Michel Pelletier
pelletier.michel@gmail.com
In reply to: Tom Lane (#2)
Re: invalid non-zero objectSubId for object class

On Thu, Jul 9, 2020 at 4:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Michel Pelletier <pelletier.michel@gmail.com> writes:

On a 12.3 AWS RDS instance, I get the following error when trying to drop
either of two tables:

dev=> drop table current_flight;
ERROR: invalid non-zero objectSubId for object class 297108
dev=> drop table flight;
ERROR: invalid non-zero objectSubId for object class 297108

This looks like corrupt data in pg_depend, specifically an entry or
entries with classid or refclassid = 297108, which should not happen
(the classid should always be the OID of one of a short list of system
catalogs). You could try poking around in pg_depend to see if you
can identify any obviously-bogus rows.

Hi Tom, thanks for getting back so quick:

I don't seem to have either:

dev=> select * from pg_depend where classid = 297108 or refclassid = 297108;
classid | objid | objsubid | refclassid | refobjid | refobjsubid | deptype
---------+-------+----------+------------+----------+-------------+---------
(0 rows)

I'm not sure what a bogus row would look like.

No idea how it got that way. Have you had any database crashes or the
like?

No crashes, but a restart and one upgrade. On Sunday and Monday, at
exactly UTC midnight we run a cron job to create a new partition for an
unrelated table and attach it to a pglogical replication set. I updated
the procedure on saturday to create two new partitions for two unrelated
tables, and that somehow caused an issue on 12.2 / pglogical 2.3.0 that
caused an error, but not a crash. What's puzzling is that the two
partition creation still worked, and replicated to all downstream
consumers, but from that point on replication ceased and consumers logged
the error in the link below:

https://github.com/2ndQuadrant/pglogical/issues/267

This spooled up changes on the RDS primary until it filled up the storage.
On sunday we resized the instance and restarted, and reinitialized the
pglogical setup which restarted replication. On monday the error happened
again at midnight, and we restarted replication and upgraded to 12.3/2.3.1
on tuesday as recommended in the issue. It has thus run till now without
error and has been replicating nicely so have assumed that issue is fixed.

Neither of these two tables are involved in the midnight job, they're no
longer used and I was hoping to clean them up. I guess my concern should
be, is there additional possible corruption I can check for? And if that's
ok is there some manual intervention I can do to drop the tables?

Thanks,

-Michel

Show quoted text

regards, tom lane

#4Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Michel Pelletier (#3)
Re: invalid non-zero objectSubId for object class

On 2020-Jul-09, Michel Pelletier wrote:

Hi Tom, thanks for getting back so quick:

I don't seem to have either:

dev=> select * from pg_depend where classid = 297108 or refclassid = 297108;
classid | objid | objsubid | refclassid | refobjid | refobjsubid | deptype
---------+-------+----------+------------+----------+-------------+---------
(0 rows)

Maybe try looking in pg_shdepend? Or pg_description?

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#4)
Re: invalid non-zero objectSubId for object class

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

On 2020-Jul-09, Michel Pelletier wrote:

I don't seem to have either:

dev=> select * from pg_depend where classid = 297108 or refclassid = 297108;
classid | objid | objsubid | refclassid | refobjid | refobjsubid | deptype
---------+-------+----------+------------+----------+-------------+---------
(0 rows)

Maybe try looking in pg_shdepend? Or pg_description?

I think it has to be pg_depend, because there'd be no reason to
be calling getObjectClass() on a entry from one of those other
catalogs during a DROP TABLE.

So my thoughts are now turning towards there being corruption
in pg_depend's indexes as well as the heap. Try doing the above
query after turning enable_indexscan and enable_bitmapscan off.

regards, tom lane

#6Michel Pelletier
pelletier.michel@gmail.com
In reply to: Alvaro Herrera (#4)
Re: invalid non-zero objectSubId for object class

On Thu, Jul 9, 2020 at 5:26 PM Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

On 2020-Jul-09, Michel Pelletier wrote:

Hi Tom, thanks for getting back so quick:

I don't seem to have either:

dev=> select * from pg_depend where classid = 297108 or refclassid =

297108;

classid | objid | objsubid | refclassid | refobjid | refobjsubid |

deptype

---------+-------+----------+------------+----------+-------------+---------

(0 rows)

Maybe try looking in pg_shdepend? Or pg_description?

Hi Álvaro,

It doesn't occur in either of those either
dev=> select * from pg_description \g | grep 297108
dev=> select * from pg_shdepend \g | grep 297108

-Michel

Show quoted text

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#7Michel Pelletier
pelletier.michel@gmail.com
In reply to: Tom Lane (#5)
Re: invalid non-zero objectSubId for object class

On Thu, Jul 9, 2020 at 5:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

On 2020-Jul-09, Michel Pelletier wrote:

I don't seem to have either:

dev=> select * from pg_depend where classid = 297108 or refclassid =

297108;

classid | objid | objsubid | refclassid | refobjid | refobjsubid |

deptype

---------+-------+----------+------------+----------+-------------+---------

(0 rows)

Maybe try looking in pg_shdepend? Or pg_description?

I think it has to be pg_depend, because there'd be no reason to
be calling getObjectClass() on a entry from one of those other
catalogs during a DROP TABLE.

So my thoughts are now turning towards there being corruption
in pg_depend's indexes as well as the heap. Try doing the above
query after turning enable_indexscan and enable_bitmapscan off.

Unfortunately still no refs there:

dev=> set enable_indexscan = off;
SET
dev=> set enable_bitmapscan = off;
SET
dev=> select * from pg_depend \g | grep 297108
dev=> select * from pg_depend where classid = 297108::oid or refclassid =
297108::oid;
classid | objid | objsubid | refclassid | refobjid | refobjsubid | deptype
---------+-------+----------+------------+----------+-------------+---------
(0 rows)

dev=> explain select * from pg_depend where classid = 297108::oid or
refclassid = 297108::oid;
QUERY PLAN
-----------------------------------------------------------------------
Seq Scan on pg_depend (cost=0.00..336.87 rows=2 width=25)
Filter: ((classid = '297108'::oid) OR (refclassid = '297108'::oid))
(2 rows)

-Michel

Show quoted text

regards, tom lane

#8Michel Pelletier
pelletier.michel@gmail.com
In reply to: Michel Pelletier (#7)
Re: invalid non-zero objectSubId for object class

I restored a snapshot and I can drop the tables there, so we'll likely
proceed to swap the replicas over tomorrow. I have this corrupted
instance i can continue to debug on if necessary. There seem to be some
other issues now that we're investigating, like a max(timestamp) query on
the old instance is doing a seqscan on all the partitions instead of using
an index that's there, the new instance from the snapshot correctly uses
the index, so I suspect further corruption.

Thanks for your help!

-Michel

#9Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Michel Pelletier (#8)
Re: invalid non-zero objectSubId for object class

On 2020-Jul-09, Michel Pelletier wrote:

I restored a snapshot and I can drop the tables there, so we'll likely
proceed to swap the replicas over tomorrow. I have this corrupted
instance i can continue to debug on if necessary. There seem to be some
other issues now that we're investigating, like a max(timestamp) query on
the old instance is doing a seqscan on all the partitions instead of using
an index that's there, the new instance from the snapshot correctly uses
the index, so I suspect further corruption.

This being RDS makes it very difficult to debug. For example I would
suggest a debugger to find the problem, but you don't have access to
that.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#10Michel Pelletier
pelletier.michel@gmail.com
In reply to: Alvaro Herrera (#9)
Re: invalid non-zero objectSubId for object class

On Fri, Jul 10, 2020 at 5:37 AM Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

On 2020-Jul-09, Michel Pelletier wrote:

I restored a snapshot and I can drop the tables there, so we'll likely
proceed to swap the replicas over tomorrow. I have this corrupted
instance i can continue to debug on if necessary. There seem to be some
other issues now that we're investigating, like a max(timestamp) query on
the old instance is doing a seqscan on all the partitions instead of

using

an index that's there, the new instance from the snapshot correctly uses
the index, so I suspect further corruption.

This being RDS makes it very difficult to debug. For example I would
suggest a debugger to find the problem, but you don't have access to
that.

Yes agreed, we'll probably terminate the instance today if there's no
further lines of investigation. I know this isn't their forum, but I don't
get why AWS pins all their extension versions to the postgres version, If
you use 12.2 on rds, you get pglogical 2.3.0. If you want 2.3.1, you have
to upgrade to pg 12.3. I don't get that logic, shouldn't 2.3.1 be the
current version on both? What's the point of them supporting 12.2 if they
ship an extension version that 2ndQ says you shouldn't use?

Anyway, just musing out loud, don't expect any kind of answer to that here.
:)

Thanks again everyone,

-Michel

Show quoted text

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#11Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Michel Pelletier (#10)
Re: invalid non-zero objectSubId for object class

On 7/10/20 8:06 AM, Michel Pelletier wrote:

On Fri, Jul 10, 2020 at 5:37 AM Alvaro Herrera <alvherre@2ndquadrant.com
<mailto:alvherre@2ndquadrant.com>> wrote:

On 2020-Jul-09, Michel Pelletier wrote:

I restored a snapshot and I can drop the tables there, so we'll

likely

proceed  to swap the replicas over tomorrow.  I have this corrupted
instance i can continue to debug on if necessary.  There seem to

be some

other issues now that we're investigating, like a max(timestamp)

query on

the old instance is doing a seqscan on all the partitions instead

of using

an index that's there, the new instance from the snapshot

correctly uses

the index, so I suspect further corruption.

This being RDS makes it very difficult to debug.  For example I would
suggest a debugger to find the problem, but you don't have access to
that.

Yes agreed, we'll probably terminate the instance today if there's no
further lines of investigation.  I know this isn't their forum, but I
don't get why AWS pins all their extension versions to the postgres
version, If you use 12.2 on rds, you get pglogical 2.3.0.  If you want
2.3.1, you have to upgrade to pg 12.3.  I don't get that logic,
shouldn't 2.3.1 be the current version on both?  What's the point of
them supporting 12.2 if they ship an extension version that 2ndQ says
you shouldn't use?

Anyway, just musing out loud, don't expect any kind of answer to that
here. :)

Already asked and not really answered:

/messages/by-id/A51655C0-BDD0-4F52-8DC5-2672AADA32D0@gmail.com

Thanks again everyone,

-Michel

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Adrian Klaver
adrian.klaver@aklaver.com