pg_rewind problem: cannot find WAL

Started by Luca Ferrari | 11 months ago | 8 messages | general
#1 Luca Ferrari
fluca1978@gmail.com

Hi all,
I'm running PostgreSQL 17.4 on Ubuntu 24.04 machines. I have three hosts: pg-1
(primary) and two physical replicas.
I then promote host pg-3 to master (pg_promote()) and want to rewind
pg-1 to follow the new master, so:

ssh pg-3 'sudo -u postgres /usr/lib/postgresql/17/bin/pg_rewind -D
/var/lib/postgresql/17/main --source-server="user=replica_fluca
host=pg-3 dbname=replica_fluca"'
pg_rewind: servers diverged at WAL location 0/B8550F8 on timeline 1
pg_rewind: error: could not open file
"/var/lib/postgresql/17/main/pg_wal/00000001000000000000000A": No such
file or directory
pg_rewind: error: could not find previous WAL record at 0/AFFF4E8

But the file 00000001000000000000000A is not there:

% ssh pg-3 'sudo ls /var/lib/postgresql/17/main/pg_wal'
00000001000000000000000B.partial
00000002.history
00000002000000000000000B
00000002000000000000000C
00000002000000000000000D
00000002000000000000000E
archive_status
summaries

% ssh pg-1 'sudo ls /var/lib/postgresql/17/main/pg_wal'
000000010000000000000005.00000028.backup
00000001000000000000000B
00000001000000000000000C
00000001000000000000000D
00000001000000000000000E
archive_status
summaries

Do I have to make sure the old primary pg-1 does a WAL switch before
promoting the other one and trying to rewind?

Thanks,
Luca
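
A note on the file name in the error: pg_rewind reads the target's WAL backwards from the divergence point, and the record it was looking for at 0/AFFF4E8 lives in the segment before the one containing 0/B8550F8. The sketch below shows how an LSN maps to a segment file name. It assumes the default 16 MB wal_segment_size, and the function name wal_segment_name is made up for illustration. On a running server, the built-in SQL function pg_walfile_name() gives the same answer.

```shell
# Map a timeline ID and an LSN (as printed by pg_rewind) to the WAL
# segment file name that contains it.
# Assumption: default wal_segment_size of 16 MB.
wal_segment_name() {
    tli=$1
    lsn=$2
    seg_size=$((16 * 1024 * 1024))
    hi=$((0x${lsn%%/*}))                    # part before the slash
    lo=$((0x${lsn##*/}))                    # part after the slash
    segno=$(( (hi * 4294967296 + lo) / seg_size ))
    per_id=$((4294967296 / seg_size))       # segments per 4 GB "xlog id"
    printf '%08X%08X%08X\n' "$tli" $((segno / per_id)) $((segno % per_id))
}

wal_segment_name 1 0/AFFF4E8   # record pg_rewind could not find
wal_segment_name 1 0/B8550F8   # divergence point
```

With the values from the error message, the first call yields 00000001000000000000000A (the missing file) and the second 00000001000000000000000B (still present on pg-1).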

#2 Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Luca Ferrari (#1)
Re: pg_rewind problem: cannot find WAL

On Wed, 2025-05-07 at 12:51 +0200, Luca Ferrari wrote:

> Do I have to make sure the old primary pg-1 does a WAL switch before
> promoting the other one and trying to rewind?

I don't think it is connected to a WAL switch.

I'd say that you should set "wal_keep_size" high enough that all the WAL
needed for pg_rewind is still present.

If you have a WAL archive, you could define a restore_command on the server
you want to rewind.

Yours,
Laurenz Albe
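
Both suggestions can be sketched as a small configuration fragment for the server that will later be rewound. This is only an illustration under stated assumptions: the 1GB value is arbitrary, the pgBackRest stanza name "main" is hypothetical, and the helper write_rewind_conf and the file name rewind.conf are made up. The directory argument is meant to be a directory the server includes, for example a conf.d directory pulled in via include_dir.

```shell
# Write a configuration fragment with the two settings discussed above.
# $1: directory included by the server configuration (assumption)
# $2: pgBackRest stanza name (hypothetical)
write_rewind_conf() {
    conf_dir=$1
    stanza=$2
    cat > "$conf_dir/rewind.conf" <<EOF
# Keep enough WAL in pg_wal for a later pg_rewind (value is a guess).
wal_keep_size = '1GB'
# Lets pg_rewind -c fetch missing segments from the archive.
restore_command = 'pgbackrest --stanza=$stanza archive-get %f "%p"'
EOF
}
```

For example, write_rewind_conf /etc/postgresql/17/main/conf.d main would create the fragment on a Debian-style layout, assuming that directory is included by postgresql.conf.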

#3 Luca Ferrari
fluca1978@gmail.com
In reply to: Laurenz Albe (#2)
Re: pg_rewind problem: cannot find WAL

On Wed, May 7, 2025 at 3:55 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:

> I don't think it is connected to a WAL switch.

Thanks.

> I'd say that you should set "wal_keep_size" high enough that all the WAL
> needed for pg_rewind is still present.
>
> If you have a WAL archive, you could define a restore_command on the server
> you want to rewind.

I have pgBackRest making backups, so I have an archive_command. I'm
going to see whether adding a restore_command fixes the problem.

Thanks for the suggestion.

Luca

#4 Luca Ferrari
fluca1978@gmail.com
In reply to: Luca Ferrari (#3)
Re: pg_rewind problem: cannot find WAL

On Thu, May 8, 2025 at 8:54 AM Luca Ferrari <fluca1978@gmail.com> wrote:

> I've pgbackrest making backups, so I have an archive_command. I'm
> going to see if putting a restore_command can fix the problem.

But I'm facing a rather trivial problem: in the Ubuntu installation the
configuration files are kept separate from PGDATA.
Apparently pg_rewind tries to read postgresql.conf to get the
restore_command, and I don't know how to point it at the different
location of postgresql.conf (I cannot specify it with -c as in postgres):

$ /usr/lib/postgresql/17/bin/pg_rewind -D /var/lib/postgresql/17/main
--source-server="user=replica_fluca host=dev-psqlha3
dbname=replica_fluca" -R -P --debug -c
postgres: could not access the server configuration file
"/var/lib/postgresql/17/main/postgresql.conf": No such file or
directory
no data was returned by command "/usr/lib/postgresql/17/bin/postgres
-D /var/lib/postgresql/17/main -C restore_command"
child process exited with exit code 2
pg_rewind: error: could not read restore_command from target cluster

Any idea?
Clearly, postgresql.auto.conf is within PGDATA, and since my
restore_command is there, one trick could be to touch an empty
PGDATA/postgresql.conf, run pg_rewind, and then remove the fake configuration file.
But I'm sure there is a smarter solution.

Thanks,
Luca

#5 Rob Sargent
robjsargent@gmail.com
In reply to: Luca Ferrari (#4)
Re: pg_rewind problem: cannot find WAL

> Any idea?
> Clearly, postgresql.auto.conf is within PGDATA, and since my
> restore_command is there, one trick could be to touch an empty
> PGDATA/postgresql.conf, run pg_rewind, and then remove the fake configuration file.
> But I'm sure there is a smarter solution.
>
> Thanks,
> Luca

A symlink from $PGDATA to where the actual file is?

#6 Luca Ferrari
fluca1978@gmail.com
In reply to: Rob Sargent (#5)
Re: pg_rewind problem: cannot find WAL

On Thu, May 8, 2025 at 4:04 PM Rob Sargent <robjsargent@gmail.com> wrote:

> A symlink from $PGDATA to where the actual file is?

Could be. I need to experiment with pg_basebackup to make sure the symlink
does not conflict with the /etc/ configuration file when creating a clone.

Luca

#7 Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Luca Ferrari (#4)
Re: pg_rewind problem: cannot find WAL

On 5/8/25 04:26, Luca Ferrari wrote:

> Any idea?

/usr/lib/postgresql/17/bin/pg_rewind --help
pg_rewind resynchronizes a PostgreSQL cluster with another copy of the cluster.

Usage:
  pg_rewind [OPTION]...

Options:
  -c, --restore-target-wal       use "restore_command" in target configuration to
                                 retrieve WAL files from archives
  -D, --target-pgdata=DIRECTORY  existing data directory to modify
      --source-pgdata=DIRECTORY  source data directory to synchronize with
      --source-server=CONNSTR    source server to synchronize with
  -n, --dry-run                  stop before modifying anything
  -N, --no-sync                  do not wait for changes to be written
                                 safely to disk
  -P, --progress                 write progress messages
  -R, --write-recovery-conf      write configuration for replication
                                 (requires --source-server)
      --config-file=FILENAME     use specified main server configuration
                                 file when running target cluster
      --debug                    write a lot of debug messages
      --no-ensure-shutdown       do not automatically fix unclean shutdown
      --sync-method=METHOD       set method for syncing files to disk
  -V, --version                  output version information, then exit
  -?, --help                     show this help, then exit

So use --config-file=FILENAME?

--
Adrian Klaver
adrian.klaver@aklaver.com

#8 Luca Ferrari
fluca1978@gmail.com
In reply to: Adrian Klaver (#7)
Re: pg_rewind problem: cannot find WAL

On Thu, May 8, 2025 at 5:11 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:

> /usr/lib/postgresql/17/bin/pg_rewind --help
> pg_rewind resynchronizes a PostgreSQL cluster with another copy of the
> cluster.
> --config-file=FILENAME use specified main server configuration

Shame on me! I was grepping for config_file, as in pg_ctl...

Thanks!

Luca
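
Putting the thread's pieces together, the final invocation might look roughly like the sketch below. It only prints the command so it can be reviewed before being run as the postgres user on the server to be rewound. The configuration path /etc/postgresql/17/main/postgresql.conf is an assumption based on the usual Ubuntu packaging layout, the helper name pg_rewind_cmd is made up, and the connection string is the one from the first message. The --dry-run flag is kept so that nothing is modified on a first attempt.

```shell
# Print a pg_rewind command line that reads restore_command from a
# configuration file outside PGDATA.
# $1 (optional): path to postgresql.conf; the default is an assumed
# Debian/Ubuntu location, not something stated in the thread.
pg_rewind_cmd() {
    pg_bin=/usr/lib/postgresql/17/bin
    pgdata=/var/lib/postgresql/17/main
    conf=${1:-/etc/postgresql/17/main/postgresql.conf}
    source_conn='user=replica_fluca host=pg-3 dbname=replica_fluca'
    printf "%s -D %s --config-file=%s --source-server='%s' -c -R -P --dry-run\n" \
        "$pg_bin/pg_rewind" "$pgdata" "$conf" "$source_conn"
}

pg_rewind_cmd
```

Once the dry run looks sane, the same command without --dry-run performs the actual rewind.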