Issue with Postgres process startup after instance restart
Hello,
I recently faced an issue with PG 11 where the VM that the PG process was
running on got restarted because of a hardware issue. After the VM restart,
the Postgres process failed to start on the 1st attempt with the error "*LOG:
could not open directory "pg_tblspc/16388/PG_11_201809051": No such file
or directory*" even though that directory was present. But on the 2nd
attempt it started up without issues. There didn't seem to be any disk
corruption issues and there were no other errors in the syslog either. Has
anyone else faced such an issue or has any ideas on why this could have
occurred?
Shishir Joshi <shishir.joshi@gojek.com> writes:
I recently faced an issue with PG 11 where the VM that the PG process was
running on got restarted because of a hardware issue. After the VM restart,
the Postgres process failed to start on the 1st attempt with the error "*LOG:
could not open directory "pg_tblspc/16388/PG_11_201809051": No such file
or directory*" even though that directory was present. But on the 2nd
attempt it started up without issues. There didn't seem to be any disk
corruption issues and there were no other errors in the syslog either. Has
anyone else faced such an issue or has any ideas on why this could have
occurred?
Maybe whatever the tablespace is pointing at wasn't mounted yet?
Slow remote mounts are the bane of PG DBAs --- I can recall at least
one famous incident in which someone's database became totally
corrupt because the NFS mount it was on came up after server start,
leading to the server having a mishmash of files on the NFS server
and files on the local disk, now hidden underneath the mount point.
If this is what your issue was, you got very lucky to escape without
damage. Suggest adapting your PG server start script to make sure the
mounted file system is present before you allow the server to start.
regards, tom lane
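
The pre-start check suggested above could look something like the following. This is a minimal sketch rather than a tested recipe: the function names are hypothetical, and it assumes the start script can run Python before launching the server. It reports any expected mount point that is not currently mounted, and any symlink under pg_tblspc whose target directory cannot be found.

```python
import os


def missing_mounts(mount_points, ismount=os.path.ismount):
    """Return the entries of mount_points that are not currently mounted."""
    return [p for p in mount_points if not ismount(p)]


def broken_tablespaces(pgdata):
    """Return (link, target) pairs under pg_tblspc whose target is not a directory."""
    tblspc_dir = os.path.join(pgdata, "pg_tblspc")
    broken = []
    for name in sorted(os.listdir(tblspc_dir)):
        link = os.path.join(tblspc_dir, name)
        target = os.path.realpath(link)  # resolve the tablespace symlink
        if not os.path.isdir(target):
            broken.append((link, target))
    return broken


def precheck(pgdata, mount_points, ismount=os.path.ismount):
    """Return a list of human-readable problems; an empty list means OK to start."""
    problems = [f"not mounted: {p}" for p in missing_mounts(mount_points, ismount)]
    problems += [
        f"tablespace link {link} points at missing directory {target}"
        for link, target in broken_tablespaces(pgdata)
    ]
    return problems
```

A start script would call precheck() with its own data directory and mount points, and refuse to launch the server if the returned list is non-empty.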
Hi Tom,
I forgot to mention, but in this case it looks like the mount had completed
before the PG process was started up. But we don't have an explicit check
for making sure the file system is present in the start script. Thanks for
the tip.
On Fri, 27 Mar 2020 at 19:30, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Suggest adapting your PG server start script to make sure the
mounted file system is present before you allow the server to start.
On Mon, 2020-03-30 at 11:02 +0530, Shishir Joshi wrote:
I forgot to mention, but in this case it looks like the mount was completed before
the PG process was started up. But we don't have an explicit check for making
sure the file system is present in the start script. Thanks for the tip.
If that is an NFS mount, make sure it is "fg", not "bg".
Also, check that your startup script simply fails if the file system is not
mounted yet, rather than automatically running "initdb".
Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
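
Both of Laurenz's checks can be sketched in a few lines. The function names here are hypothetical and the code is an illustration under stated assumptions, not a vetted tool: it assumes the NFS mounts are declared in fstab-style text, and that an initialized data directory can be recognized by its PG_VERSION file. The first function lists NFS entries that use the "bg" option; the second exits with an error, and never runs initdb, when the data directory does not look initialized.

```python
import os


def nfs_background_mounts(fstab_text):
    """Return mount points of NFS entries in fstab_text that use the 'bg' option."""
    found = []
    for raw in fstab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        _device, mount_point, fstype, options = fields[:4]
        if fstype.startswith("nfs") and "bg" in options.split(","):
            found.append(mount_point)
    return found


def require_initialized(pgdata):
    """Exit with an error unless pgdata already holds a cluster; never runs initdb."""
    if not os.path.isfile(os.path.join(pgdata, "PG_VERSION")):
        raise SystemExit(
            f"{pgdata}: no PG_VERSION file found; refusing to start or initialize"
        )
```

Failing hard in require_initialized() is deliberate: an empty data directory after a reboot is more likely an unmounted file system than a server that needs a fresh cluster.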