"PANIC: could not open critical system index 2662" - twice
Our PostgreSQL 15.2 instance running on Ubuntu 18.04 has crashed with
this error:
2023-04-05 09:24:03.448 UTC [15227] ERROR: index "pg_class_oid_index"
contains unexpected zero page at block 0
2023-04-05 09:24:03.448 UTC [15227] HINT: Please REINDEX it.
...
2023-04-05 13:05:25.018 UTC [15437]
root@test_behavior_638162834106895162 FATAL: index "pg_class_oid_index"
contains unexpected zero page at block 0
2023-04-05 13:05:25.018 UTC [15437]
root@test_behavior_638162834106895162 HINT: Please REINDEX it.
... (same error for a few more DBs)
2023-04-05 13:05:25.144 UTC [16965]
root@test_behavior_638162855458823077 FATAL: index "pg_class_oid_index"
contains unexpected zero page at block 0
2023-04-05 13:05:25.144 UTC [16965]
root@test_behavior_638162855458823077 HINT: Please REINDEX it.
...
2023-04-05 13:05:25.404 UTC [17309]
root@test_behavior_638162881641031612 PANIC: could not open critical
system index 2662
2023-04-05 13:05:25.405 UTC [9372] LOG: server process (PID 17309) was
terminated by signal 6: Aborted
2023-04-05 13:05:25.405 UTC [9372] LOG: terminating any other active
server processes
We had the same thing happen about a month ago on a different database
on the same cluster. For a while PG actually ran OK as long as you
didn't access that specific DB, but when trying to back up that DB with
pg_dump it would crash every time. At that time one of the disks hosting
the ZFS dataset with the PG data directory on it was reporting errors,
so we thought it was likely due to that.
Unfortunately, before we could replace the disks, PG crashed completely
and would not start again at all, so I had to rebuild the cluster from
scratch and restore from pg_dump backups (still onto the old, bad
disks). Once the disks were replaced (all of them) I just copied the
data to them using zfs send | zfs receive and didn't bother restoring
pg_dump backups again - which was perhaps foolish in hindsight.
Well, yesterday it happened again. The server still restarted OK, so I
took fresh pg_dump backups of the databases we care about (which ran
fine), rebuilt the cluster and restored the pg_dump backups again - now
onto the new disks, which are not reporting any problems.
So while everything is up and running now this error has me rather
concerned. Could the error we're seeing now have been caused by some
corruption in the PG data that's been there for a month (so it could
still be attributed to the bad disk), which should now be fixed by
having restored from backups onto good disks? Could this be a PG bug?
What can I do to figure out why this is happening and prevent it from
happening again? Advice appreciated!
On Thu, 2023-04-06 at 16:41 +0000, Evgeny Morozov wrote:
Our PostgreSQL 15.2 instance running on Ubuntu 18.04 has crashed with this error:
2023-04-05 09:24:03.448 UTC [15227] ERROR: index "pg_class_oid_index" contains unexpected zero page at block 0
[...] We had the same thing happen about a month ago on a different database on the same cluster.
[...]
So while everything is up and running now this error has me rather concerned. Could the
error we're seeing now have been caused by some corruption in the PG data that's been there
for a month (so it could still be attributed to the bad disk), which should now be fixed by
having restored from backups onto good disks?
Yes, that is entirely possible.
Could this be a PG bug?
It could be, but data corruption caused by bad hardware is much more likely.
What can I do to figure out why this is happening and prevent it from happening again?
No idea about the former, but bad hardware is a good enough explanation.
As to keeping it from happening: use good hardware.
Yours,
Laurenz Albe
On Fri, Apr 07, 2023 at 01:04:34PM +0200, Laurenz Albe wrote:
On Thu, 2023-04-06 at 16:41 +0000, Evgeny Morozov wrote:
Could this be a PG bug?
It could be, but data corruption caused by bad hardware is much more likely.
There is no way to be completely sure here, unless we could get our
hands on a reproducible test case that breaks the cluster badly enough
to get into this state. I don't recall seeing
this error pattern in recent history, though.
--
Michael
On 2023-04-07 13:04:34 +0200, Laurenz Albe wrote:
On Thu, 2023-04-06 at 16:41 +0000, Evgeny Morozov wrote:
What can I do to figure out why this is happening and prevent it from happening again?
No idea about the former, but bad hardware is a good enough explanation.
As to keeping it from happening: use good hardware.
Also: Use checksums. PostgreSQL offers data checksums[1]. Some
filesystems also offer checksums.
This doesn't prevent corruption but at least it will be detected early
and can't spread.
hp
[1]: For some reason I thought the Debian/Ubuntu packages enabled this
by default. But that doesn't seem to be the case.
--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"
No idea about the former, but bad hardware is a good enough explanation.
As to keeping it from happening: use good hardware.
Alright, thanks, I'll just keep my fingers crossed that it doesn't
happen again then!
Also: Use checksums. PostgreSQL offers data checksums[1]. Some
filesystems also offer checksums.
We have data_checksums=on. (It must be on by default, since I cannot
find that in our config files anywhere.) However, the docs say "Only
data pages are protected by checksums; internal data structures and
temporary files are not.", so I guess pg_class_oid_index might be an
"internal data structure"?
We also have checksum=on for the ZFS dataset on which the data is stored
(also the default - we didn't change it). ZFS did detect problems (zpool
status reported read, write and checksum errors for one of the old
disks), but it also said "errors: No known data errors". I understood
that to mean that it recovered from the errors, i.e. wrote the data to
different disk blocks or read it from another disk in the pool.
On Tue, Apr 11, 2023 at 04:44:54PM +0000, Evgeny Morozov wrote:
We have data_checksums=on. (It must be on by default, since I cannot
find that in our config files anywhere.)
initdb does not enable checksums by default, requiring a
-k/--data-checksums switch, so likely this addition comes from your
environment.
However, the docs say "Only
data pages are protected by checksums; internal data structures and
temporary files are not.", so I guess pg_class_oid_index might be an
"internal data structure"?
pg_class_oid_index is a btree index that relies on 8k on-disk pages
(default size), so it is subject to the same rules as normal relations
regarding checksums for the pages flushed to disk, even if it is on a
catalog.
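For readers poking at the raw files: a minimal sketch of how one might read the checksum stored on such an 8k page, assuming the standard PageHeaderData layout (pd_checksum as a little-endian uint16 at byte offset 8, after the 8-byte pd_lsn). This is illustrative only; pg_checksums and amcheck are the proper tools.

```python
# Sketch: peek at the pd_checksum field of one 8k heap/index page.
# Offset assumption: PageHeaderData puts pd_checksum at byte 8.
import struct

BLCKSZ = 8192

def page_checksum_field(page: bytes) -> int:
    """Return the stored pd_checksum of an 8k page (0 if checksums were
    never enabled, or if the page is all zeros)."""
    assert len(page) == BLCKSZ
    (pd_checksum,) = struct.unpack_from("<H", page, 8)
    return pd_checksum

# Demo on a synthetic all-zero page, like the one in the error message:
zero_page = bytes(BLCKSZ)
print(page_checksum_field(zero_page))  # 0 - a zeroed page carries no checksum
```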
--
Michael
On 12/04/2023 2:35 am, Michael Paquier wrote:
initdb does not enable checksums by default, requiring a
-k/--data-checksums switch, so likely this addition comes from your
environment.
Indeed, turns out we had it in init_db_options.
However, the docs say "Only
data pages are protected by checksums; internal data structures and
temporary files are not.", so I guess pg_class_oid_index might be an
"internal data structure"?

pg_class_oid_index is a btree index that relies on 8k on-disk pages
(default size), so it is subject to the same rules as normal relations
regarding checksums for the pages flushed to disk, even if it is on a
catalog.
OK, so then what does that mean for the error in the subject? At what
point should that problem have been detected by the data checksums?
On Thu, 2023-04-13 at 06:56 +0000, Evgeny Morozov wrote:
On 12/04/2023 2:35 am, Michael Paquier wrote:
initdb does not enable checksums by default, requiring a
-k/--data-checksums switch, so likely this addition comes from your
environment.

OK, so then what does that mean for the error in the subject? At what
point should that problem have been detected by the data checksums?
It means that if the error is caused by a faulty disk changing your data,
you'll notice as soon as you touch the page.
That would perhaps not have made a lot of difference in your case,
except that the error message would have been different and proof
that the disk was the problem.
Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
On 13/04/2023 5:02 pm, Laurenz Albe wrote:
It means that if the error is caused by a faulty disk changing your data,
you'll notice as soon as you touch the page.

That would perhaps not have made a lot of difference in your case,
except that the error message would have been different and proof
that the disk was the problem.
OK, but we had data checksums on the whole time. So that means that the
disk was NOT the problem in our case?
On Thu, 2023-04-13 at 19:07 +0000, Evgeny Morozov wrote:
On 13/04/2023 5:02 pm, Laurenz Albe wrote:
It means that if the error is caused by a faulty disk changing your data,
you'll notice as soon as you touch the page. That would perhaps not have
made a lot of difference in your case, except that the error message
would have been different and proof that the disk was the problem.

OK, but we had data checksums on the whole time. So that means that the
disk was NOT the problem in our case?
Hmm, I am not certain. The block was filled with zeros from your error
message, and I think such blocks don't trigger a checksum warning.
So if your disk replaces a valid block with zeros (filesystem check
after crash?), that could explain what you see.
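A simplified sketch of that behaviour (modelled loosely on PostgreSQL's page-read verification, not the actual implementation): a fully zeroed page is treated as "new" and skips checksum verification entirely.

```python
# Rough sketch of why an all-zero page escapes checksum verification:
# the page-read check treats a completely zeroed page as "new" and
# valid, so no checksum failure is raised for it.

BLCKSZ = 8192

def page_read_check(page: bytes, verify_checksum) -> str:
    assert len(page) == BLCKSZ
    if page == bytes(BLCKSZ):          # all zeros: counts as a new page
        return "ok (new page, checksum not checked)"
    if not verify_checksum(page):
        return "checksum failure"
    return "ok"

# A zeroed-out block therefore passes, which matches the observation
# that data_checksums=on did not flag the corrupted index page:
print(page_read_check(bytes(BLCKSZ), lambda p: False))
# -> ok (new page, checksum not checked)
```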
Yours,
Laurenz Albe
On Fri, 14 Apr 2023, Laurenz Albe wrote:
So if your disk replaces a valid block with zeros (filesystem check
after crash?), that could explain what you see.
Oh, I had that happen on a RAID 1 once. One of the two discs had an
intermittent error (write, I guess) but didn’t fail out of the RAID,
and some of the reads from there got zero-filled blocks in some
positions. It was a CVS repository so I was able to identify all such
blocks in question and restore them from the rsync slave (whose initial
population predated the HDD issue).
Hm, now that I think about it, it could even have been a read error
with subsequent block reassignment. Oh well.
Filesystem issues (ext3, and ext4 without/predating auto_da_alloc,
in particular) could be it just as well of course.
bye,
//mirabilos
--
15:41⎜<Lo-lan-do:#fusionforge> Somebody write a testsuite for helloworld :-)
Hmm, I am not certain. The block was filled with zeros from your error
message, and I think such blocks don't trigger a checksum warning.
OK, so data_checksums=on might not have made any difference in this case?
So if your disk replaces a valid block with zeros (filesystem check
after crash?), that could explain what you see.
If by "crash" here you mean the OS crashing - we didn't have that
happen. The OS is on separate disks, which have not reported any errors.
When we first ran into this problem the PG data was on a ZFS RAIDZ (i.e.
RAID5) volume of 3 disks, and for one of them `zpool status -v` reported
read, write and checksum error count > 0, but it also said "errors: No
known data errors" and the disk status remained "online" (it did not
become "faulted" or "offline"). (Now we have the PG data on a ZFS mirror
volume of 2 new disks, which have not reported any errors.)
I don't know whether ZFS zero-fills blocks on disk errors. As I
understood, ZFS should have been able to recover from disk errors (that
were "unrecoverable" at the hardware level) using the data on the other
two disks (which did not report any errors). Thus, PG should not have
seen any corrupted data (if ZFS was working correctly).
https://unix.stackexchange.com/questions/341614/understanding-the-error-reporting-of-zfs-on-linux
seems to confirm this. Am I misunderstanding something?
On 14 Apr 2023, at 9:38, Evgeny Morozov <postgresql3@realityexists.net> wrote:
(…)
I don't know whether ZFS zero-fills blocks on disk errors. As I
understood, ZFS should have been able to recover from disk errors (that
were "unrecoverable" at the hardware level) using the data on the other
two disks (which did not report any errors). Thus, PG should not have
seen any corrupted data (if ZFS was working correctly).
https://unix.stackexchange.com/questions/341614/understanding-the-error-reporting-of-zfs-on-linux
seems to confirm this. Am I misunderstanding something?
Your problem coincides with a thread at freebsd-current with very similar data corruption after a recent OpenZFS import: blocks of all zeroes, but also missing files. So, perhaps these problems are related?
Apparently, there was a recent fix for a data corruption issue with the block_cloning feature enabled, but people are still seeing corruption even when they never enabled that feature.
I couldn’t really find the start of the thread in the archives, so this one kind of jumps into the middle of the thread at a relevant-looking point:
https://lists.freebsd.org/archives/freebsd-current/2023-April/003446.html
Regards,
Alban Hertroys
--
There is always an exception to always.
On 14/04/2023 10:42 am, Alban Hertroys wrote:
Your problem coincides with a thread at freebsd-current with very
similar data corruption after a recent OpenZFS import: blocks of all
zeroes, but also missing files. So, perhaps these problems are related?
Apparently, there was a recent fix for a data corruption issue with the block_cloning feature enabled, but people are still seeing corruption even when they never enabled that feature.

I couldn’t really find the start of the thread in the archives, so this one kind of jumps into the middle of the thread at a relevant-looking point:
https://lists.freebsd.org/archives/freebsd-current/2023-April/003446.html
That thread was a bit over my head, I'm afraid, so I can't say if it's
related. I haven't detected any missing files, anyway.
Well, the problem happened again! Kind of... This time PG has not
crashed with the PANIC error in the subject, but pg_dumping certain DBs
again fails with
pg_dump: error: connection to server on socket
"/var/run/postgresql/.s.PGSQL.5434" failed: FATAL: index
"pg_class_oid_index" contains unexpected zero page at block 0
PG server log contains:
2023-05-03 04:31:49.903 UTC [14724]
postgres@test_behavior_638186279733138190 FATAL: index
"pg_class_oid_index" contains unexpected zero page at block 0
2023-05-03 04:31:49.903 UTC [14724]
postgres@test_behavior_638186279733138190 HINT: Please REINDEX it.
The server PID does not change on such a pg_dump attempt, so it appears
that only the backend process for the pg_dump connection crashes this
time. I don't see any disk errors and there haven't been any OS crashes.
This currently happens for two DBs. Both of them are very small DBs
created by automated .NET tests using Npgsql as client. The code creates
such a test DB from a template DB and tries to drop it at the end of the
test. This times out sometimes and on timeout our test code tries to
drop the DB again (while the first drop command is likely still pending
on the server). This second attempt to drop the DB also timed out:
[12:40:39] Npgsql.NpgsqlException : Exception while reading from stream
----> System.TimeoutException : Timeout during reading attempt
at
Npgsql.NpgsqlConnector.<ReadMessage>g__ReadMessageLong|194_0(NpgsqlConnector
connector, Boolean async, DataRowLoadingMode dataRowLoadingMode, Boolean
readingNotifications, Boolean isReadingPrependedMessage)
at Npgsql.NpgsqlDataReader.NextResult(Boolean async, Boolean
isConsuming, CancellationToken cancellationToken)
at Npgsql.NpgsqlDataReader.NextResult()
at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior,
Boolean async, CancellationToken cancellationToken)
at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior,
Boolean async, CancellationToken cancellationToken)
at Npgsql.NpgsqlCommand.ExecuteNonQuery(Boolean async,
CancellationToken cancellationToken)
at Npgsql.NpgsqlCommand.ExecuteNonQuery()
...
[12:41:41] (same error again for the same DB)
From looking at old logs it seems like the same thing happened last time
(in April) as well. That's quite an unlikely coincidence if a bad disk
or bad filesystem was to blame, isn't it?
I've tried to reproduce this by re-running those tests over and over,
but without success so far. So what can I do about this? Do I just try
to drop those databases again manually? But what about the next time it
happens? How do I figure out the cause and prevent this problem?
On Thu, 2023-05-04 at 15:49 +0000, Evgeny Morozov wrote:
Well, the problem happened again! Kind of... This time PG has not
crashed with the PANIC error in the subject, but pg_dumping certain DBs
again fails with

pg_dump: error: connection to server on socket
"/var/run/postgresql/.s.PGSQL.5434" failed: FATAL: index
"pg_class_oid_index" contains unexpected zero page at block 0
If you dumped and restored the database after the last time the error
happened, there must be a systemic problem.
Perhaps you have bad hardware, or a problem with a storage driver,
file system or some other low-level software component.
It might of course be a PostgreSQL bug too, but it is hard to say
without a way to reproduce...
Yours,
Laurenz Albe
On 4/05/2023 6:42 pm, Laurenz Albe wrote:
On Thu, 2023-05-04 at 15:49 +0000, Evgeny Morozov wrote:
Well, the problem happened again! Kind of... This time PG has not
crashed with the PANIC error in the subject, but pg_dumping certain DBs
again fails with

pg_dump: error: connection to server on socket
"/var/run/postgresql/.s.PGSQL.5434" failed: FATAL: index
"pg_class_oid_index" contains unexpected zero page at block 0

If you dumped and restored the database after the last time the error
happened, there must be a systemic problem.
I dumped and restored the "real" databases I cared about. The test
databases on which the error now happens are new (created 2 days ago).
Perhaps you have bad hardware, or a problem with a storage driver,
file system or some other low-level software component.
It might of course be a PostgreSQL bug too, but it is hard to say
without a way to reproduce...
I'm now thinking of setting up a dedicated AWS EC2 instance just for
these little DBs that get created by our automated tests. If the problem
happens there as well then that would strongly point towards a bug in
PostgreSQL, wouldn't it? (And if nothing else, at least it won't affect
the more important DBs!)
Meanwhile, what do I do with the existing server, though? Just try to
drop the problematic DBs again manually?
On 5/4/23 13:10, Evgeny Morozov wrote:
[snip]
I'm now thinking of setting up a dedicated AWS EC2 instance just for
these little DBs that get created by our automated tests. If the problem
happens there as well then that would strongly point towards a bug in
PostgreSQL, wouldn't it?
Many other people besides you would have noticed regular corruption of
system catalogs.
Meanwhile, what do I do with the existing server, though? Just try to
drop the problematic DBs again manually?
*"Fix the hardware"* is what you do.
--
Born in Arizona, moved to Babylonia.
On Fri, May 5, 2023 at 6:11 AM Evgeny Morozov
<postgresql3@realityexists.net> wrote:
Meanwhile, what do I do with the existing server, though? Just try to
drop the problematic DBs again manually?
That earlier link to a FreeBSD thread is surely about bleeding edge
new ZFS stuff that was briefly broken then fixed, being discovered by
people running code imported from OpenZFS master branch into FreeBSD
main branch (ie it's not exactly released, not following the details
but I think it might soon be 2.2?), but you're talking about an LTS
Ubuntu release from 2018, which shipped "ZFS on Linux" version 0.7.5,
unless you installed a newer version somehow? So it doesn't sound
like it could be related.
That doesn't mean it couldn't be a different ZFS bug though. While
looking into file system corruption issues that had similar symptoms
on some other file system (which turned out to be a bug in btrfs) I
did bump into a claim that ZFS could produce unexpected zeroes in some
mmap coherency scenario, OpenZFS issue #14548. I don't immediately
see how PostgreSQL could get tangled up with that problem though, as
we aren't doing that...
It seems quite interesting that it's always pg_class_oid_index block 0
(the btree meta-page), which feels more like a PostgreSQL bug, unless
the access pattern of that particular file/block is somehow highly
unusual compared to every other block and tickling bugs elsewhere in
the stack. How does that file look, in terms of size, and how many
pages in it are zero? I think it should be called base/5/2662.
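One hedged way to answer that question: scan the relation file in 8k pages and count the all-zero ones. A small sketch (run it against a copy of the suspect file; the demo below uses a synthetic file instead of a real base/5/2662):

```python
# Sketch: count how many 8k pages of a relation file are entirely zero.
import os
import tempfile

BLCKSZ = 8192

def count_zero_pages(path: str):
    """Return (total_pages, zero_pages) for the file at `path`."""
    zero = bytes(BLCKSZ)
    total = zeroed = 0
    with open(path, "rb") as f:
        while True:
            page = f.read(BLCKSZ)
            if not page:
                break
            total += 1
            if page == zero:
                zeroed += 1
    return total, zeroed

# Demo: a fake relation file with one zeroed page out of three.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x01" * BLCKSZ + bytes(BLCKSZ) + b"\x02" * BLCKSZ)
print(count_zero_pages(f.name))  # (3, 1)
os.unlink(f.name)
```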
Oooh, but this is a relation that goes through
RelationMapOidToFilenumber. What does select
pg_relation_filepath('pg_class_oid_index') show in the corrupted
database, base/5/2662 or something else? Now *that* is a piece of
logic that changed in PostgreSQL 15. It changed from sector-based
atomicity assumptions to a directory entry swizzling trick, in commit
d8cd0c6c95c0120168df93aae095df4e0682a08a. Hmm.
On Fri, May 5, 2023 at 11:15 AM Thomas Munro <thomas.munro@gmail.com> wrote:
Now *that* is a piece of
logic that changed in PostgreSQL 15. It changed from sector-based
atomicity assumptions to a directory entry swizzling trick, in commit
d8cd0c6c95c0120168df93aae095df4e0682a08a. Hmm.
I spoke too soon, that only changed in 16. But still, it means there
are two files that could be corrupted here, pg_filenode.map which
might somehow be pointing to the wrong file, and the relation (index)
main fork file.
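For anyone who wants to inspect pg_filenode.map directly, here is a speculative parser sketch based on my reading of relmapper.c: a magic int32 (0x592717), an int32 mapping count, then (oid, filenode) int32 pairs; the trailing CRC is ignored. Double-check the layout against your server version before trusting the output.

```python
# Speculative sketch: read pg_filenode.map to check whether the
# OID -> filenode mapping for pg_class_oid_index (OID 2662) looks sane.
# Layout assumed from relmapper.c; verify against your PG version.
import os
import struct
import tempfile

def read_filenode_map(path: str) -> dict:
    with open(path, "rb") as f:
        data = f.read()
    magic, num = struct.unpack_from("<ii", data, 0)
    if magic != 0x592717:
        raise ValueError(f"unexpected magic {magic:#x}")
    pairs = struct.unpack_from("<%dI" % (2 * num), data, 8)
    return dict(zip(pairs[0::2], pairs[1::2]))

# Demo with a synthetic map containing just the 2662 entry:
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(struct.pack("<ii2I", 0x592717, 1, 2662, 2662))
print(read_filenode_map(f.name))  # {2662: 2662}
os.unlink(f.name)
```

If the mapping for 2662 pointed somewhere unexpected, that would distinguish a corrupted map file from a corrupted index file.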
On Fri, May 5, 2023 at 11:15 AM Thomas Munro <thomas.munro@gmail.com> wrote:
What does select
pg_relation_filepath('pg_class_oid_index') show in the corrupted
database, base/5/2662 or something else?
Oh, you can't get that far, but perhaps you could share the
pg_filenode.map file? Or alternatively strace -f PostgreSQL while
it's starting up to see which file it's reading, just to be sure. One
way to find clues about whether PostgreSQL did something wrong, once
we definitely have the right relfilenode for the index, aside from
examining its contents, would be to search the WAL for references to
that block with pg_waldump. Maybe you still have enough WAL if it
happened recently?
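A small sketch of the kind of filter one could pipe pg_waldump output through, assuming its usual "blkref #N: rel TS/DB/REL blk B" block-reference format; the sample lines below are invented for illustration.

```python
# Sketch: filter pg_waldump output for records touching one block,
# e.g. block 0 of relation 1663/5/2662 (pg_class_oid_index).
import re

def records_touching(lines, relnode: str, block: int):
    """Return the lines whose blkref matches the given rel and block."""
    pat = re.compile(r"blkref #\d+: rel %s blk %d\b"
                     % (re.escape(relnode), block))
    return [ln for ln in lines if pat.search(ln)]

# Invented sample lines standing in for real pg_waldump output:
sample = [
    "rmgr: Btree desc: INSERT_LEAF, blkref #0: rel 1663/5/2662 blk 1",
    "rmgr: XLOG  desc: FPI, blkref #0: rel 1663/5/2662 blk 0",
]
for rec in records_touching(sample, "1663/5/2662", 0):
    print(rec)
```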