9.0 standby - could not open file global/XXXXX

Started by Filip Rembiałkowskiabout 7 years ago6 messagesgeneral
Jump to latest
#1Filip Rembiałkowski
filip.rembialkowski@gmail.com

Hi.

There is a large (>5T) database on PostgreSQL 9.0.23.

I would like to setup new WAL-shipping standby.
https://www.postgresql.org/docs/9.0/warm-standby.html

On my way I find unexpected issues. Here's the story, in short:

1. WAL archiving to remote archive is setup & verified

2. base backup is transferred directly to new server using
pg_start_backup + rsync + pg_stop_backup.

3. recovery.conf is created

4. Server is started and consumes all the remaining WAL segments
accumulated in the archive - finishing with optimistic message LOG:
consistent recovery state reached at 9FC1/112BEE10.

5. When I go to postgres on the standby and try to connect system
"postgres" database psql: FATAL: could not open file "global/11819":
No such file or directory

I guessed the OID refereds to the pg_authid, but other system tables
might be affected too.

What could be wrong here?

Thanks!

#2Stephen Frost
sfrost@snowman.net
In reply to: Filip Rembiałkowski (#1)
Re: 9.0 standby - could not open file global/XXXXX

Greetings,

* Filip Rembiałkowski (filip.rembialkowski@gmail.com) wrote:

There is a large (>5T) database on PostgreSQL 9.0.23.

First off, I hope you understand that 9.0 has been *long* out of
support and that you *really* need to upgrade to a supported version of
PostgreSQL (9.4 and up these days...).

I would like to setup new WAL-shipping standby.
https://www.postgresql.org/docs/9.0/warm-standby.html

On my way I find unexpected issues. Here's the story, in short:

1. WAL archiving to remote archive is setup & verified

2. base backup is transferred directly to new server using
pg_start_backup + rsync + pg_stop_backup.

3. recovery.conf is created

4. Server is started and consumes all the remaining WAL segments
accumulated in the archive - finishing with optimistic message LOG:
consistent recovery state reached at 9FC1/112BEE10.

5. When I go to postgres on the standby and try to connect system
"postgres" database psql: FATAL: could not open file "global/11819":
No such file or directory

That seems pretty odd- does that file exist on the existing database
system..?

Thanks!

Stephen

#3Andres Freund
andres@anarazel.de
In reply to: Filip Rembiałkowski (#1)
Re: 9.0 standby - could not open file global/XXXXX

Hi,

On 2019-02-25 20:06:42 +0100, Filip Rembiałkowski wrote:

There is a large (>5T) database on PostgreSQL 9.0.23.

As Stephen said, this is long out of support.

2. base backup is transferred directly to new server using
pg_start_backup + rsync + pg_stop_backup.

Do you exclude any files? Remove any? If so which?

Greetings,

Andres Freund

#4Filip Rembiałkowski
filip.rembialkowski@gmail.com
In reply to: Stephen Frost (#2)
Re: 9.0 standby - could not open file global/XXXXX

On Mon, Feb 25, 2019 at 11:45 PM Stephen Frost <sfrost@snowman.net> wrote:

Greetings,

* Filip Rembiałkowski (filip.rembialkowski@gmail.com) wrote:

There is a large (>5T) database on PostgreSQL 9.0.23.

First off, I hope you understand that 9.0 has been *long* out of
support and that you *really* need to upgrade to a supported version of
PostgreSQL (9.4 and up these days...).

Yes I do. Unfortunately it was not possible.

I would like to setup new WAL-shipping standby.

https://www.postgresql.org/docs/9.0/warm-standby.html

On my way I find unexpected issues. Here's the story, in short:

1. WAL archiving to remote archive is setup & verified

2. base backup is transferred directly to new server using
pg_start_backup + rsync + pg_stop_backup.

3. recovery.conf is created

4. Server is started and consumes all the remaining WAL segments
accumulated in the archive - finishing with optimistic message LOG:
consistent recovery state reached at 9FC1/112BEE10.

5. When I go to postgres on the standby and try to connect system
"postgres" database psql: FATAL: could not open file "global/11819":
No such file or directory

That seems pretty odd- does that file exist on the existing database
system..?

No. "global/11819" does not exist on the primary. That's what makes it
a mystery for me.

At first, I was thinking - maybe OIDs of some objects in pg_global
tablespace were "rotated" (VACUUM FULL, etc) during the rsync.
But it seems unlikely.

#5Filip Rembiałkowski
filip.rembialkowski@gmail.com
In reply to: Andres Freund (#3)
Re: 9.0 standby - could not open file global/XXXXX

On Tue, Feb 26, 2019 at 2:39 AM Andres Freund <andres@anarazel.de> wrote:

2. base backup is transferred directly to new server using
pg_start_backup + rsync + pg_stop_backup.

I excluded contents of pg_xlog only. Exact command was:

# start script
psql -Xc "select pg_start_backup('mirror to $standby', true)"
( cd $PGDATA \
&& find . -type d \( -path ./pg_xlog \) -prune -o -type f -print \
| xargs -n $nargs \
| xargs -P $njobs -I {} bash -c "while ! rsync -a -u -z -R --inplace
{} $standby:$PGDATA; do echo retrying; done"
)
psql -Xc "select pg_stop_backup()"
# end script

Actually I am now in the process of repeating this with
single-threaded rsync - and I'll see if same issue happens.
This is what I'm going to use:

rsync -a -z --inplace $PGDATA/ $standby:$PGDATA/

But it's unlikely to change anything. Still no idea what could be the
root cause.

Thanks!

#6Filip Rembiałkowski
filip.rembialkowski@gmail.com
In reply to: Filip Rembiałkowski (#1)
Re: 9.0 standby - could not open file global/XXXXX

OK I have it fixed;; just for anyone who's interested - the error was in
the base backup procedure.
When switched to plain "rsync -az" - it works like a charm.

Most probably, the fault was I assumed that you can use the rsync --update
option when doing base backup.
You cannot, especially when time sync on both servers is not accurate. In
my case, destination server clock was few minutes in future.
So the pg_clog was broken due to this. Which means a completely corrupted
database.

thanks Stephen & Andres for your responses.

On Mon, Feb 25, 2019 at 8:06 PM Filip Rembiałkowski <
filip.rembialkowski@gmail.com> wrote:

Show quoted text

Hi.

There is a large (>5T) database on PostgreSQL 9.0.23.

I would like to setup new WAL-shipping standby.
https://www.postgresql.org/docs/9.0/warm-standby.html

On my way I find unexpected issues. Here's the story, in short:

1. WAL archiving to remote archive is setup & verified

2. base backup is transferred directly to new server using
pg_start_backup + rsync + pg_stop_backup.

3. recovery.conf is created

4. Server is started and consumes all the remaining WAL segments
accumulated in the archive - finishing with optimistic message LOG:
consistent recovery state reached at 9FC1/112BEE10.

5. When I go to postgres on the standby and try to connect system
"postgres" database psql: FATAL: could not open file "global/11819":
No such file or directory

I guessed the OID refereds to the pg_authid, but other system tables
might be affected too.

What could be wrong here?

Thanks!