Logical replication type- WAL recovery fails and changes the size of wal segment in archivedir
Hi team,
With wal_level = 'logical', backup was taken using non-exclusive backup method.
Following procedure here for restore and recovery - PostgreSQL: Documentation: 16: 26.3. Continuous Archiving and Point-in-Time Recovery (PITR)<https://www.postgresql.org/docs/16/continuous-archiving.html#BACKUP-PITR-RECOVERY>
While starting the PostgreSQL server, below issue is seen:
2024-06-05 11:41:32.369 IST [54369] LOG: restored log file "00000005000000010000006A" from archive
2024-06-05 11:41:33.112 IST [54369] LOG: restored log file "00000005000000010000006B" from archive
cp: cannot stat '/home/pgsql/wmaster/00000005000000010000006C': No such file or directory
2024-06-05 11:41:33.167 IST [54369] LOG: redo done at 1/6B000100
2024-06-05 11:41:33.172 IST [54369] FATAL: archive file "00000005000000010000006B" has wrong size: 0 instead of 16777216
2024-06-05 11:41:33.173 IST [54367] LOG: startup process (PID 54369) exited with exit code 1
2024-06-05 11:41:33.173 IST [54367] LOG: terminating any other active server processes
2024-06-05 11:41:33.174 IST [54375] FATAL: archive command was terminated by signal 3: Quit
2024-06-05 11:41:33.174 IST [54375] DETAIL: The failed archive command was: cp pg_wal/00000005000000010000006B /home/pgsql/wmaster/00000005000000010000006B
2024-06-05 11:41:33.175 IST [54367] LOG: archiver process (PID 54375) exited with exit code 1
2024-06-05 11:41:33.177 IST [54367] LOG: database system is shut down
Here '/home/pgsql/wmaster' is my archivedir (the folder where WAL segments are restored from)
Before attempting start, size of 00000005000000010000006B file was 16 MB.
After failing to detect 00000005000000010000006C, there is a FATAL error saying wrong size for 00000005000000010000006B
Now the size of 00000005000000010000006B is observed as 2 MB. Size of all other WAL segments remain 16 MB.
-rw------- 1 postgres postgres 2359296 Jun 5 11:34 00000005000000010000006B
Why is it changing the size of WAL segment in archive log directory?
All necessary WAL segments are present and 00000005000000010000006C was never archived.
bash-4.2$ cat /home/pgsql/dmaster/backup_label.old
START WAL LOCATION: 1/69000028 (file 000000050000000100000069)
CHECKPOINT LOCATION: 1/69000060
BACKUP METHOD: streamed
BACKUP FROM: master
START TIME: 2024-05-31 17:39:43 IST
LABEL: pgida_backup_4321_315606_1717157383
START TIMELINE: 5
bash-4.2$ cat /home/pgsql/wmaster/00000005000000010000006B.00000028.backup
START WAL LOCATION: 1/6B000028 (file 00000005000000010000006B)
STOP WAL LOCATION: 1/6B000100 (file 00000005000000010000006B)
CHECKPOINT LOCATION: 1/6B000060
BACKUP METHOD: streamed
BACKUP FROM: master
START TIME: 2024-05-31 17:40:28 IST
LABEL: pgida_backup_4321_315606_1717157427
START TIMELINE: 5
STOP TIME: 2024-05-31 17:40:28 IST
STOP TIMELINE: 5
bash-4.2$ cat /home/pgsql/wmaster/00000005.history
1 0/3E000000 before 2000-01-01 05:30:00+05:30
2 0/63000000 before 2000-01-01 05:30:00+05:30
3 0/E8000000 no recovery target specified
4 1/68000000 before 2000-01-01 05:30:00+05:30
Despite our efforts to troubleshoot, the problem persists. Please help to find a solution.
Regards,
Meera
On Wed, 2024-06-05 at 06:36 +0000, Meera Nair wrote:
2024-06-05 11:41:32.369 IST [54369] LOG: restored log file "00000005000000010000006A" from archive
2024-06-05 11:41:33.112 IST [54369] LOG: restored log file "00000005000000010000006B" from archive
cp: cannot stat ‘/home/pgsql/wmaster/00000005000000010000006C’: No such file or directory
2024-06-05 11:41:33.167 IST [54369] LOG: redo done at 1/6B000100
2024-06-05 11:41:33.172 IST [54369] FATAL: archive file "00000005000000010000006B" has wrong size: 0 instead of 16777216
2024-06-05 11:41:33.173 IST [54367] LOG: startup process (PID 54369) exited with exit code 1
2024-06-05 11:41:33.173 IST [54367] LOG: terminating any other active server processes
2024-06-05 11:41:33.174 IST [54375] FATAL: archive command was terminated by signal 3: Quit
2024-06-05 11:41:33.174 IST [54375] DETAIL: The failed archive command was: cp pg_wal/00000005000000010000006B /home/pgsql/wmaster/00000005000000010000006B
2024-06-05 11:41:33.175 IST [54367] LOG: archiver process (PID 54375) exited with exit code 1
2024-06-05 11:41:33.177 IST [54367] LOG: database system is shut down
Here ‘/home/pgsql/wmaster’ is my archivedir (the folder where WAL segments are restored from)
Before attempting start, size of
00000005000000010000006B file was 16 MB.
After failing to detect 00000005000000010000006C, there is a FATAL error saying wrong size for 00000005000000010000006B
Now the size of 00000005000000010000006B is observed as 2 MB. Size of all other WAL segments remain 16 MB.
-rw------- 1 postgres postgres 2359296 Jun 5 11:34 00000005000000010000006B
That looks like you have "archive_mode = always", and "archive_command" writes
back to the archive. Don't do that.
Yours,
Laurenz Albe
Hi Laurenz,
Thanks a lot!
Regards,
Meera
-----Original Message-----
From: Laurenz Albe <laurenz.albe@cybertec.at>
Sent: Wednesday, June 5, 2024 3:56 PM
To: Meera Nair <mnair@commvault.com>; pgsql-general@lists.postgresql.org
Cc: Punit Pranesh Koujalgi <pkoujalgi@commvault.com>; B Ganesh Kishan <bkishan@commvault.com>
Subject: Re: Logical replication type- WAL recovery fails and changes the size of wal segment in archivedir
[You don't often get email from laurenz.albe@cybertec.at. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
External email. Inspect before opening.
On Wed, 2024-06-05 at 06:36 +0000, Meera Nair wrote:
2024-06-05 11:41:32.369 IST [54369] LOG: restored log file
"00000005000000010000006A" from archive
2024-06-05 11:41:33.112 IST [54369] LOG: restored log file
"00000005000000010000006B" from archive
cp: cannot stat ‘/home/pgsql/wmaster/00000005000000010000006C’: No
such file or directory
2024-06-05 11:41:33.167 IST [54369] LOG: redo done at 1/6B000100
2024-06-05 11:41:33.172 IST [54369] FATAL: archive file
"00000005000000010000006B" has wrong size: 0 instead of 16777216
2024-06-05 11:41:33.173 IST [54367] LOG: startup process (PID 54369)
exited with exit code 1
2024-06-05 11:41:33.173 IST [54367] LOG: terminating any other active
server processes
2024-06-05 11:41:33.174 IST [54375] FATAL: archive command was
terminated by signal 3: Quit
2024-06-05 11:41:33.174 IST [54375] DETAIL: The failed archive
command was: cp pg_wal/00000005000000010000006B
/home/pgsql/wmaster/00000005000000010000006B
2024-06-05 11:41:33.175 IST [54367] LOG: archiver process (PID 54375)
exited with exit code 1
2024-06-05 11:41:33.177 IST [54367] LOG: database system is shut downHere ‘/home/pgsql/wmaster’ is my archivedir (the folder where WAL
segments are restored from)Before attempting start, size of
00000005000000010000006B file was 16 MB.
After failing to detect 00000005000000010000006C, there is a FATAL
error saying wrong size for 00000005000000010000006B Now the size of 00000005000000010000006B is observed as 2 MB. Size of all other WAL segments remain 16 MB.-rw------- 1 postgres postgres 2359296 Jun 5 11:34
00000005000000010000006B
That looks like you have "archive_mode = always", and "archive_command" writes back to the archive. Don't do that.
Yours,
Laurenz Albe
On Wed, Jun 5, 2024 at 6:26 AM Laurenz Albe <laurenz.albe@cybertec.at>
wrote:
On Wed, 2024-06-05 at 06:36 +0000, Meera Nair wrote:
2024-06-05 11:41:32.369 IST [54369] LOG: restored log file
"00000005000000010000006A" from archive
2024-06-05 11:41:33.112 IST [54369] LOG: restored log file
"00000005000000010000006B" from archive
cp: cannot stat ‘/home/pgsql/wmaster/00000005000000010000006C’: No such
file or directory
2024-06-05 11:41:33.167 IST [54369] LOG: redo done at 1/6B000100
2024-06-05 11:41:33.172 IST [54369] FATAL: archive file"00000005000000010000006B" has wrong size: 0 instead of 16777216
2024-06-05 11:41:33.173 IST [54367] LOG: startup process (PID 54369)
exited with exit code 1
2024-06-05 11:41:33.173 IST [54367] LOG: terminating any other active
server processes
2024-06-05 11:41:33.174 IST [54375] FATAL: archive command was
terminated by signal 3: Quit
2024-06-05 11:41:33.174 IST [54375] DETAIL: The failed archive command
was: cp pg_wal/00000005000000010000006B
/home/pgsql/wmaster/00000005000000010000006B2024-06-05 11:41:33.175 IST [54367] LOG: archiver process (PID 54375)
exited with exit code 1
2024-06-05 11:41:33.177 IST [54367] LOG: database system is shut down
Here ‘/home/pgsql/wmaster’ is my archivedir (the folder where WAL
segments are restored from)
Before attempting start, size of
00000005000000010000006B file was 16 MB.
After failing to detect 00000005000000010000006C, there is a FATAL errorsaying wrong size for 00000005000000010000006B
Now the size of 00000005000000010000006B is observed as 2 MB. Size of
all other WAL segments remain 16 MB.
-rw------- 1 postgres postgres 2359296 Jun 5 11:34
00000005000000010000006B
That looks like you have "archive_mode = always", and "archive_command"
writes
back to the archive. Don't do that.
In fact, don't write your own PITR backup process. Use something like
PgBackRest or BarMan.
Hi Ron,
We do have our own solution. We work for Commvault, a data protection company.
We support backup of PostgreSQL in different ways - streaming, snapshot, block level, backup from standby server.
PostgreSQL (commvault.com)<https://documentation.commvault.com/2024/expert/postgresql.html>
Regards,
Meera
From: Ron Johnson <ronljohnsonjr@gmail.com>
Sent: Thursday, June 6, 2024 9:56 PM
To: Laurenz Albe <laurenz.albe@cybertec.at>
Cc: Meera Nair <mnair@commvault.com>; pgsql-general@lists.postgresql.org; Punit Pranesh Koujalgi <pkoujalgi@commvault.com>; B Ganesh Kishan <bkishan@commvault.com>
Subject: Re: Logical replication type- WAL recovery fails and changes the size of wal segment in archivedir
You don't often get email from ronljohnsonjr@gmail.com<mailto:ronljohnsonjr@gmail.com>. Learn why this is important<https://aka.ms/LearnAboutSenderIdentification>
External email. Inspect before opening.
On Wed, Jun 5, 2024 at 6:26 AM Laurenz Albe <laurenz.albe@cybertec.at<mailto:laurenz.albe@cybertec.at>> wrote:
On Wed, 2024-06-05 at 06:36 +0000, Meera Nair wrote:
2024-06-05 11:41:32.369 IST [54369] LOG: restored log file "00000005000000010000006A" from archive
2024-06-05 11:41:33.112 IST [54369] LOG: restored log file "00000005000000010000006B" from archive
cp: cannot stat ‘/home/pgsql/wmaster/00000005000000010000006C’: No such file or directory
2024-06-05 11:41:33.167 IST [54369] LOG: redo done at 1/6B000100
2024-06-05 11:41:33.172 IST [54369] FATAL: archive file "00000005000000010000006B" has wrong size: 0 instead of 16777216
2024-06-05 11:41:33.173 IST [54367] LOG: startup process (PID 54369) exited with exit code 1
2024-06-05 11:41:33.173 IST [54367] LOG: terminating any other active server processes
2024-06-05 11:41:33.174 IST [54375] FATAL: archive command was terminated by signal 3: Quit
2024-06-05 11:41:33.174 IST [54375] DETAIL: The failed archive command was: cp pg_wal/00000005000000010000006B /home/pgsql/wmaster/00000005000000010000006B
2024-06-05 11:41:33.175 IST [54367] LOG: archiver process (PID 54375) exited with exit code 1
2024-06-05 11:41:33.177 IST [54367] LOG: database system is shut downHere ‘/home/pgsql/wmaster’ is my archivedir (the folder where WAL segments are restored from)
Before attempting start, size of
00000005000000010000006B file was 16 MB.
After failing to detect 00000005000000010000006C, there is a FATAL error saying wrong size for 00000005000000010000006B
Now the size of 00000005000000010000006B is observed as 2 MB. Size of all other WAL segments remain 16 MB.-rw------- 1 postgres postgres 2359296 Jun 5 11:34 00000005000000010000006B
That looks like you have "archive_mode = always", and "archive_command" writes
back to the archive. Don't do that.
In fact, don't write your own PITR backup process. Use something like PgBackRest or BarMan.