Issue with Postgres process startup after instance restart

Started by Shishir Joshi, about 6 years ago, 4 messages, general

shishir.joshi@gojek.com

about 6 years ago

Hello,
I recently faced an issue with PG 11 where the VM that the PG process was
running on got restarted because of a hardware issue. After the VM restart,
the Postgres process failed to start on the 1st attempt with the error "*LOG:
could not open directory "pg_tblspc/16388/PG_11_201809051": No such file
or directory*" even though that directory was present. But on the 2nd
attempt it started up without issues. There didn't seem to be any disk
corruption issues and there were no other errors in the syslog either. Has
anyone else faced such an issue or has any ideas on why this could have
occurred?

Tom Lane

tgl@sss.pgh.pa.us

about 6 years ago

In reply to: Shishir Joshi (#1)

Re: Issue with Postgres process startup after instance restart

Shishir Joshi <shishir.joshi@gojek.com> writes:

I recently faced an issue with PG 11 where the VM that the PG process was
running on got restarted because of a hardware issue. After the VM restart,
the Postgres process failed to start on the 1st attempt with the error "*LOG:
could not open directory "pg_tblspc/16388/PG_11_201809051": No such file
or directory*" even though that directory was present. But on the 2nd
attempt it started up without issues. There didn't seem to be any disk
corruption issues and there were no other errors in the syslog either. Has
anyone else faced such an issue or has any ideas on why this could have
occurred?

Maybe whatever the tablespace is pointing at wasn't mounted yet?
Slow remote mounts are the bane of PG DBAs --- I can recall at least
one famous incident in which someone's database became totally
corrupt because the NFS mount it was on came up after server start,
leading to the server having a mishmash of files on the NFS server
and files on the local disk, now hidden underneath the mount point.

If this is what your issue was, you got very lucky to escape without
damage. Suggest adapting your PG server start script to make sure the
mounted file system is present before you allow the server to start.
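A minimal sketch of such a pre-start check, written in Python purely for illustration. The function name `storage_problems`, the idea of passing a list of required mount points, and the example paths are assumptions, not anything from the original setup. It relies only on two facts: `os.path.ismount` reports whether a path is an active mount point, and each entry in `pg_tblspc` is a symbolic link to a tablespace location.

```python
import os


def storage_problems(data_dir, required_mounts):
    """Return a list of problems found; an empty list means the checks passed.

    data_dir        -- the PostgreSQL data directory (hypothetical path)
    required_mounts -- mount points the cluster depends on (hypothetical)
    """
    problems = []

    # Every file system the cluster depends on must be an active mount point.
    for mount in required_mounts:
        if not os.path.ismount(mount):
            problems.append(f"{mount} is not mounted")

    # Each entry in pg_tblspc is a symlink to a tablespace location.
    # os.path.isdir follows the link, so a dangling link is reported here.
    tblspc_dir = os.path.join(data_dir, "pg_tblspc")
    if os.path.isdir(tblspc_dir):
        for name in sorted(os.listdir(tblspc_dir)):
            link = os.path.join(tblspc_dir, name)
            if not os.path.isdir(link):
                problems.append(f"tablespace {name} is not reachable via {link}")

    return problems
```

A start script could call `storage_problems("/var/lib/postgresql/11/main", ["/mnt/pgdata"])` (both paths are placeholders) and refuse to launch the server if the returned list is not empty.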

regards, tom lane

Shishir Joshi

shishir.joshi@gojek.com

about 6 years ago

In reply to: Tom Lane (#2)

Re: Issue with Postgres process startup after instance restart

Hi Tom,
I forgot to mention, but in this case it looks like the mount was completed
before the PG process was started up. But we don't have an explicit check
for making sure the file system is present in the start script. Thanks for
the tip.

Laurenz Albe

laurenz.albe@cybertec.at

about 6 years ago

In reply to: Shishir Joshi (#3)

Re: Issue with Postgres process startup after instance restart

On Mon, 2020-03-30 at 11:02 +0530, Shishir Joshi wrote:

I forgot to mention, but in this case it looks like the mount was completed before
the PG process was started up. But we don't have an explicit check for making
sure the file system is present in the start script. Thanks for the tip.

If that is an NFS mount, make sure it is "fg", not "bg".
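One way to audit this is to scan the fstab for NFS entries that request a background mount. The sketch below is only an illustration: the function name `nfs_background_mounts` is hypothetical, and it assumes the standard whitespace-separated fstab layout (device, mount point, file system type, options). With "bg", a failed mount attempt is retried in the background while the mount command itself returns success, so later boot steps may run before the file system is there.

```python
def nfs_background_mounts(fstab_text):
    """Return the mount points of NFS fstab entries that specify 'bg'."""
    flagged = []
    for line in fstab_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        _device, mount_point, fs_type, options = fields[:4]
        # Options are a comma-separated list; look for an exact 'bg' entry.
        if fs_type in ("nfs", "nfs4") and "bg" in options.split(","):
            flagged.append(mount_point)
    return flagged
```

For example, passing the contents of `/etc/fstab` as a string returns a list of mount points that would be worth switching to "fg".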

Also, check that your startup script simply fails if the file system is not
mounted yet, rather than automatically running "initdb".
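A hedged sketch of that fail-fast behaviour, again in Python for illustration only. The function name `check_data_directory` is an assumption. It relies on the fact that an initialized data directory contains a `PG_VERSION` file holding the major version; if the file is missing, the likeliest explanations are an unmounted file system or a wrong path, and neither should be "fixed" by creating a fresh cluster.

```python
import os


def check_data_directory(data_dir):
    """Return the cluster's major version, or raise if data_dir is not initialized.

    The caller is expected to abort startup on the exception instead of
    running initdb against what may be an empty mount point.
    """
    version_file = os.path.join(data_dir, "PG_VERSION")
    if not os.path.isfile(version_file):
        raise RuntimeError(
            f"{data_dir} has no PG_VERSION file; refusing to start or initialize"
        )
    with open(version_file) as f:
        return f.read().strip()
```

A start script would call this before launching the server and exit with a non-zero status if it raises.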

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com