Questions about the continuity of WAL archiving

Started by px shi · 8 months ago · 21 messages · general
#1px shi
spxlyy123@gmail.com

Hi,
There is a scenario: the current timeline of the PostgreSQL primary node is
1, and the latest WAL file is 100. The standby node has also received up to
WAL file 100. However, the latest WAL file archived is only file 80. If the
primary node crashes at this point and the standby is promoted to the new
primary, archiving will resume from file 100 on timeline 2. As a result,
WAL files from 81 to 100 on timeline 1 will be missing from the archive.
Is there a good solution to prevent this situation?

Regards,
Pixian Shi

#2Adrian Klaver
adrian.klaver@aklaver.com
In reply to: px shi (#1)
Re: Questions about the continuity of WAL archiving

On 8/7/25 20:20, px shi wrote:

Hi,
There is a scenario: the current timeline of the PostgreSQL primary node
is 1, and the latest WAL file is 100. The standby node has also received
up to WAL file 100. However, the latest WAL file archived is only file
80. If the primary node crashes at this point and the standby is
promoted to the new primary, archiving will resume from file 100 on
timeline 2. As a result, WAL files from 81 to 100 on timeline 1 will be
missing from the archive.

What are you planning to do with the archived files?

Also, isn't it the case that once the primary crashes you are in a
split-brain situation and can't really trust its timeline anymore?

Is there a good solution to prevent this situation?

Regards,
Pixian Shi

--
Adrian Klaver
adrian.klaver@aklaver.com

#3px shi
spxlyy123@gmail.com
In reply to: Adrian Klaver (#2)
Re: Questions about the continuity of WAL archiving

Thank you for your reply.
The archived files can be used for PITR (Point-In-Time Recovery), allowing
recovery to any point between WAL 80 and 100 on timeline 1.
Additionally, if there's a backup taken during timeline 1 and a switchover
to a new primary has occurred without taking a new full backup yet, these
WAL logs can still be used to recover to any point on timeline 2.

Regards,
Pixian Shi
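For context, the kind of PITR described above can be sketched with the standard recovery settings (a minimal sketch; the archive path and target time are placeholders, assuming PostgreSQL 12+ where these live in postgresql.conf alongside an empty recovery.signal file):

```ini
# postgresql.conf on the instance being restored
restore_command = 'cp /mnt/wal_archive/%f %p'    # illustrative archive path
recovery_target_time = '2025-08-08 12:00:00+08'  # any point covered by archived WAL
recovery_target_timeline = 'latest'              # follow the switch from timeline 1 to 2
```

Recovery can only reach a target if every segment between the base backup and that target is in the archive, which is exactly why the 81-100 gap on timeline 1 matters.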


#4Adrian Klaver
adrian.klaver@aklaver.com
In reply to: px shi (#3)
Re: Questions about the continuity of WAL archiving

On 8/7/25 22:50, px shi wrote:

Thank you for your reply.
The archived files can be used for PITR (Point-In-Time Recovery),
allowing recovery to any point between WAL 80 and 100 on timeline 1.
Additionally, if there's a backup taken during timeline 1 and a
switchover to a new primary has occurred without taking a new full
backup yet, these WAL logs can still be used to recover to any point on
timeline 2.

Alright I see.

Two things:

1) What is the current archiving setup on the primary and why is it lagging?

2) Have you looked at archiving off the standby node while it is in
standby per:

https://www.postgresql.org/docs/current/warm-standby.html#CONTINUOUS-ARCHIVING-IN-STANDBY
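The setting the linked page describes is a one-line change; a minimal sketch, assuming the standby already has a working archive_command (the pgBackRest stanza name is illustrative):

```ini
# postgresql.conf on the standby
archive_mode = always   # archive segments even while the node is in recovery
archive_command = 'pgbackrest --stanza=main archive-push %p'
```

With archive_mode = on a standby never archives; always makes it archive every segment it streams or restores, which closes the kind of gap described above.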


--
Adrian Klaver
adrian.klaver@aklaver.com

#5Greg Sabino Mullane
greg@turnstep.com
In reply to: px shi (#1)
Re: Questions about the continuity of WAL archiving

There is a scenario: the current timeline of the PostgreSQL primary node
is 1, and the latest WAL file is 100. The standby node has also received up
to WAL file 100. However, the latest WAL file archived is only file 80. If
the primary node crashes at this point and the standby is promoted to the
new primary, archiving will resume from file 100 on timeline 2. As a
result, WAL files from 81 to 100 on timeline 1 will be missing from the
archive.
Is there a good solution to prevent this situation?

I'm still not clear on what the problem here is, other than your archiving
not keeping up. The best solution to that is:

https://pgbackrest.org/1/configuration.html#section-archive/option-archive-async

Yes, you would lose some ability for easy PITR of 80-100, but it could still
be done by resurrecting your crashed primary, or by carefully grabbing the
files from the replica before they get recycled. You can set
archive_mode=always on the replicas to help with this.
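The linked option looks roughly like this in pgbackrest.conf (a sketch only; stanza layout, spool path, and worker count are illustrative, not a tuned recommendation):

```ini
# /etc/pgbackrest/pgbackrest.conf
[global]
archive-async=y                     # queue WAL locally, push to the repo in parallel
process-max=4                       # parallel archive-push/archive-get workers
spool-path=/var/spool/pgbackrest    # local queue used by async archiving
```

Async archiving decouples archive_command from S3 round-trip latency, which is usually what lets the archive keep up with WAL generation.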

Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support

#6Ron
ronljohnsonjr@gmail.com
In reply to: Greg Sabino Mullane (#5)
Re: Questions about the continuity of WAL archiving


Bog-standard PgBackRest retains all WAL files required for a full backup
set and its associated differential/incremental backups, no? I've
certainly done more than one --type=time --target="${RestoreUntil}" restore
without giving a second thought to timelines or whether the WAL exists.

Maybe I've just ignored the problem, since it (seemingly) does everything
for PITR backups.

--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!

#7px shi
spxlyy123@gmail.com
In reply to: Adrian Klaver (#4)
Re: Questions about the continuity of WAL archiving

1) What is the current archiving setup on the primary and why is it lagging?

The archive command uses pgBackRest to archive to S3. Because the files are
uploaded to S3, archiving is slow, which has caused the lag.

2) Have you looked at archiving off the standby node while it is in standby,
per the docs linked above?

Yes, archiving on the standby node is disabled. Is it recommended to share
the WAL archive between the primary and standby nodes to avoid
interruptions in archiving?


#8px shi
spxlyy123@gmail.com
In reply to: Greg Sabino Mullane (#5)
Re: Questions about the continuity of WAL archiving

I'm still not clear on what the problem here is, other than your archiving
not keeping up.

In my scenario, archive_mode is not set to always on the replicas, so a
failover may leave gaps in the archived WAL.

You can set archive_mode=always on the replicas to help with this.

Yes, that would work. I would also like to know: is this the recommended
configuration for production use?


#9px shi
spxlyy123@gmail.com
In reply to: Ron (#6)
Re: Questions about the continuity of WAL archiving

Bog-standard PgBackRest retains all WAL files required for a full backup
set and its associated differential/incremental backups.

Yes, WAL files are continuous under normal circumstances. However, if the
primary node crashes under high load, the archived WAL logs on S3 may be
discontinuous.


#10Adrian Klaver
adrian.klaver@aklaver.com
In reply to: px shi (#7)
Re: Questions about the continuity of WAL archiving

On 8/12/25 01:24, px shi wrote:

1) What is the current archiving setup on the primary and why is
lagging?

 The archive command uses pgBackRest to archive to S3. Because it is
uploaded to S3, the archiving speed is slow, which has caused lagging.

2) Have you looked at archiving off the standby node while it is in
standby per:

Yes, archiving on the standby node is disabled. Is it recommended to
share the WAL archive between the primary and standby nodes to avoid
interruptions in archiving?

Given that you are using a less than capable storage solution (S3), why do
you think pushing the WAL from the standby to S3 would perform any better
than what is happening with the primary WAL?

The solution is to use a more capable storage platform.


--
Adrian Klaver
adrian.klaver@aklaver.com

#11Bob Jolliffe
bobjolliffe@gmail.com
In reply to: Adrian Klaver (#10)
Re: Questions about the continuity of WAL archiving


That is an interesting point you make, Adrian. S3 seems quite popular for
this type of archiving. What would you suggest as a more capable (and
cost-effective) storage platform?

Regards
Bob


#12Ron
ronljohnsonjr@gmail.com
In reply to: px shi (#9)
Re: Questions about the continuity of WAL archiving

On Tue, Aug 12, 2025 at 4:37 AM px shi <spxlyy123@gmail.com> wrote:

Bog-standard PgBackRest retains all WAL files required for a full backup
set and its associated differential/incremental backups.

Yes, WAL files are continuous under normal circumstances. However, if the
primary node crashes under high load, the archived WAL logs on S3 may be
discontinuous.

1) PG does not purge WAL files that are needed for immediate crash recovery.
2) PgBackRest can archive (compressed and encrypted) WAL files to S3.
https://pgbackrest.org/user-guide-rhel.html#s3-support
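Per that guide, the S3 repository side of such a setup looks roughly like this (a sketch; bucket, endpoint, region, and keys are placeholders):

```ini
# /etc/pgbackrest/pgbackrest.conf
[global]
repo1-type=s3
repo1-path=/pg-repo
repo1-s3-bucket=my-backup-bucket
repo1-s3-endpoint=s3.us-east-1.amazonaws.com
repo1-s3-region=us-east-1
repo1-s3-key=<access-key>
repo1-s3-key-secret=<secret-key>
```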


#13Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Bob Jolliffe (#11)
Re: Questions about the continuity of WAL archiving

On 8/12/25 10:40, Bob Jolliffe wrote:

On Tue, 12 Aug 2025 at 17:14, Adrian Klaver <adrian.klaver@aklaver.com> wrote:

The solution is to use a more capable storage platform.

That is an interesting point you make Adrian.  S3 seems quite popular
for this type of archiving.  What would you suggest as a more capable

Yes, but from here:

https://pgbackrest.org/user-guide-rhel.html#s3-support

File creation time in S3 is relatively slow so backup/restore
performance is improved by enabling file bundling.

Where file bundling is explained here:

https://pgbackrest.org/user-guide-rhel.html#backup/bundle

Though I don't think it would help in this case.

(and cost effective) storage platform?

I would say anything that does not use object storage and instead uses
block storage, so you are not doing the conversion. I have no specific
recommendations, as archiving to the cloud is not something I do.


--
Adrian Klaver
adrian.klaver@aklaver.com

#14px shi
spxlyy123@gmail.com
In reply to: Adrian Klaver (#10)
Re: Questions about the continuity of WAL archiving

Hi, Adrian

Given that you are using a less than capable storage solution (S3), why do
you think pushing the WAL from the standby to S3 would perform any better
than what is happening with the primary WAL?

I mean that archive_mode is set to on on the primary and to always on the
standby. That way, even if the primary crashes, the standby can still
archive the WAL files that the primary did not archive.

The solution is to use a more capable storage platform.

However, I believe that even with a more capable storage platform, it is
still impossible to archive WAL files in real time. As long as real-time
archiving cannot be achieved, there will always be some WAL files that are
not archived if the primary node crashes.
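Worth noting as one possible mitigation: pg_receivewal streams WAL over the replication protocol as it is generated, rather than waiting for each 16 MB segment to complete, so it gets much closer to real-time archiving than archive_command can. A command sketch (host, user, slot, and directory are placeholders):

```shell
pg_receivewal --host=primary.example.com --username=repl_user \
    --slot=archive_slot --directory=/mnt/wal_stream --synchronous
```

With --synchronous, WAL is flushed to the target directory as it arrives, so at most the final, partially written segment is at risk when the primary crashes.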


#15Ron
ronljohnsonjr@gmail.com
In reply to: px shi (#14)
Re: Questions about the continuity of WAL archiving

How often does your primary node crash and then not recover due to WAL
corruption or missing WAL?

If it's _ever_ happened, you should _fix that_ instead of rolling your own
WAL archival process.


#16px shi
spxlyy123@gmail.com
In reply to: Ron (#15)
Re: Questions about the continuity of WAL archiving

How often does your primary node crash and then not recover due to WAL
corruption or missing WAL?

If it's _ever_ happened, you should _fix that_ instead of rolling your own
WAL archival process.

I once encountered a case where the recovery process failed to restore to
the latest LSN due to missing WAL files in the archive. The root cause was
multiple failovers between primary and standby. During one of the
switchovers, the primary crashed before completing the archiving of all WAL
files. When the standby was promoted to primary, it began archiving WAL
files for the new timeline, resulting in a gap between the WAL files of the
two timelines. Moreover, no base backup was taken during this period.


#17Justin
zzzzz.graf@gmail.com
In reply to: px shi (#16)
Re: Questions about the continuity of WAL archiving


I once encountered a case where the recovery process failed to restore to
the latest LSN due to missing WAL files in the archive. The root cause was
multiple failovers between primary and standby. During one of the
switchovers, the primary crashed before completing the archiving of all WAL
files. When the standby was promoted to primary, it began archiving WAL
files for the new timeline, resulting in a gap between the WAL files of the
two timelines. Moreover, no base backup was taken during this period.

I am not sure what the problem is here either, other than something
seriously wrong with the configuration of PostgreSQL and pgBackRest.

The replica should be receiving WAL via streaming replication over a
replication slot, meaning the primary keeps WAL until the replica has
caught up. If the replica becomes disconnected and max_slot_wal_keep_size
is exceeded, the replica's restore_command can take over and fetch from the
WAL archive to catch the replica up. This assumes hot_standby_feedback is
on so that WAL replay won't be delayed by snapshot conflicts on the replica.
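Whether it is archiving or replication that is lagging can be checked with the standard statistics views; a quick sketch, run on the primary:

```sql
-- Last segment successfully archived, and archiver failures
SELECT last_archived_wal, last_failed_wal, failed_count
FROM pg_stat_archiver;

-- Bytes of WAL each standby still has to replay
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
```

If pg_stat_archiver trails while replay_lag_bytes stays near zero, it is the archive_command (not the standby) that cannot keep up.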

If all of the above is true, the replica should never lag behind unless its
disk I/O layer is badly undersized compared to the primary's. S3 is being
talked about, so I wonder about the disk I/O configuration on the primary
vs. the replica; I have seen this cause lag under high load where the
replica's I/O layer is the bottleneck.

If pgBackRest can't keep up with WAL archiving then, as others have stated,
you need to configure asynchronous archiving. The number of workers depends
on the load: I have one server running 8 parallel workers to archive 1 TB
of WAL daily, and another that generates around 10,000 WAL files in about
2 hours during maintenance tasks using 6 pgBackRest workers, all to S3
buckets.

The above statement also makes me wonder if there is some kind of
high-availability monitor running, like pg_auto_failover, that is promoting
a replica and then converting the former primary into a replica of the
recently promoted one.

If that matches what is happening, it is very easy to mess up the
configuration for WAL archiving and backups. Part of the process of
promoting a replica is making sure WAL archiving is working. After being
promoted, the replica immediately kicks off autovacuum to rebuild things
like the FSM, which generates a lot of WAL files.

If you are losing WAL files, the configuration is wrong somewhere.

There is just not enough information about the sequence of events and the
configuration to tell what the root cause is, other than misconfiguration.

Thanks
Justin

#18px shi
spxlyy123@gmail.com
In reply to: Justin (#17)
Re: Questions about the continuity of WAL archiving

Here’s a scenario: The latest WAL file on the primary node is
0000000100000000000000AF, and the standby node has also received up to
0000000100000000000000AF. However, the latest WAL file that has been
successfully archived from the primary is only 0000000100000000000000A1
(WAL files from A2 to AE have not yet been archived). If the primary
crashes at this point, triggering a failover, the new primary will start
generating and archiving WAL on a new timeline (2), beginning with
0000000200000000000000AF. It will not backfill the missing WAL files from
timeline 1 (0000000100000000000000A2 to 0000000100000000000000AE). As a
result, while the new primary does not have any local WAL gaps, the archive
directory will contain a gap in that WAL range.
I’m not sure if I explained it clearly.
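The gap in that scenario can be spotted mechanically: a WAL file name is 24 hex digits (8 for the timeline, 8 for the "log" number, 8 for the segment within it). A small hypothetical checker, assuming default 16 MB segments and a simulated archive listing matching the scenario above:

```python
def wal_segment_names(tli: int, log: int, lo: int, hi: int) -> list[str]:
    """Names of the 16 MB WAL segments lo..hi (inclusive) on one timeline."""
    return [f"{tli:08X}{log:08X}{seg:08X}" for seg in range(lo, hi + 1)]

# Segments the archive should hold on timeline 1, A1 through AE:
expected = wal_segment_names(1, 0, 0xA1, 0xAE)

# Simulated archive listing: A1 made it to the archive, A2..AE did not,
# and the new primary continues archiving on timeline 2 starting at AF.
archived = {"0000000100000000000000A1", "0000000200000000000000AF"}

missing = [name for name in expected if name not in archived]
print(missing[0], "...", missing[-1], f"({len(missing)} segments missing)")
```

Running this prints the A2-AE hole described above, which no amount of timeline-2 archiving will backfill.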


#19Justin
zzzzz.graf@gmail.com
In reply to: px shi (#18)
Re: Questions about the continuity of WAL archiving

Justin <zzzzz.graf@gmail.com> wrote on Wed, Aug 13, 2025 at 10:51:

On Tue, Aug 12, 2025 at 10:24 PM px shi <spxlyy123@gmail.com> wrote:

How often does your primary node crash, and then not recover due to WALs

corruption or WALs not existing?

If it's _ever_ happened, you should _fix that_ instead of rolling your
own WAL archival process.

I once encountered a case where the recovery process failed to restore
to the latest LSN due to missing WAL files in the archive. The root cause
was multiple failovers between primary and standby. During one of the
switchovers, the primary crashed before completing the archiving of all WAL
files. When the standby was promoted to primary, it began archiving WAL
files for the new timeline, resulting in a gap between the WAL files of the
two timelines. Moreover, no base backup was taken during this period.

I am not sure what the problem is here either, other than something
seriously wrong with configuration with PostgreSQL and PgBackrest.

The replica should be receiving the WAL via a replication slot using
streaming, meaning the primary will keep the WAL until the replica is
caught up. If the replica becomes disconnected and
max_slot_wal_keep_size is exceeded, the replica's restore_command can
take over and fetch the missing WAL from the archive to catch the
replica up. This assumes hot_standby_feedback is on so WAL replay won't
become delayed due to snapshot locks on the replica.
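For concreteness, the streaming-plus-archive setup described above might
look roughly like this; this is a minimal sketch with hypothetical
stanza, slot, and host names, not a tested configuration:

```
# postgresql.conf on the primary (values are illustrative)
wal_level = replica
archive_mode = on
archive_command = 'pgbackrest --stanza=main archive-push %p'
max_slot_wal_keep_size = '10GB'  # cap slot retention so a stuck replica cannot fill the disk

# postgresql.conf on the standby
primary_conninfo = 'host=primary-host user=replicator'
primary_slot_name = 'standby1'
restore_command = 'pgbackrest --stanza=main archive-get %f "%p"'
hot_standby_feedback = on
```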

If all the above is true, the replica should never lag behind unless its
disk I/O layer is way undersized compared to the primary's. S3 is being
talked about, so it makes me wonder about the disk I/O configuration on
the primary vs. the replica; I have seen this cause lag under high load
where the replica's I/O layer is the bottleneck.

If pgBackRest can't keep up with WAL archiving, as others have stated
you need to configure asynchronous archiving. The number of workers
depends on the load. I have a server running 8 parallel workers to
archive 1 TB of WAL daily, and another server that during maintenance
tasks generates around 10,000 WAL files in about 2 hours using 6
pgBackRest workers, all to S3 buckets.
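As a sketch, asynchronous parallel archiving is enabled in
pgbackrest.conf roughly like this; the paths and worker counts below are
illustrative assumptions, not a recommendation:

```ini
# pgbackrest.conf -- asynchronous, parallel WAL archiving (sketch)
[global]
archive-async=y
spool-path=/var/spool/pgbackrest   # local queue for async archive-push/get

[global:archive-push]
process-max=8   # parallel workers pushing WAL to the repo (e.g. S3)

[global:archive-get]
process-max=4   # parallel workers prefetching WAL on the standby
```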

The above statement makes me wonder if there is some kind of
high-availability monitor running, such as pg_auto_failover, that is
promoting a replica and then converting the former primary into a
replica of the newly promoted one.

If the above matches what is happening, it is very easy to mess up the
configuration for WAL archiving and backups. Part of the process of
promoting a replica is to make sure WAL archiving is working. The
replica, after being promoted, immediately kicks off autovacuum to
rebuild things like the FSM, which generates a lot of WAL files.

If you are losing WAL files, the configuration is wrong somewhere.

There is just not enough information on the series of events and the
configuration to tell what the root cause is, other than
misconfiguration.

Thanks
Justin

On Wed, Aug 13, 2025 at 1:48 AM px shi <spxlyy123@gmail.com> wrote:

Here’s a scenario: The latest WAL file on the primary node is
0000000100000000000000AF, and the standby node has also received up to
0000000100000000000000AF. However, the latest WAL file that has been
successfully archived from the primary is only 0000000100000000000000A1
(WAL files from A2 to AE have not yet been archived). If the primary
crashes at this point, triggering a failover, the new primary will start
generating and archiving WAL on a new timeline (2), beginning with
0000000200000000000000AF. It will not backfill the missing WAL files from
timeline 1 (0000000100000000000000A2 to 0000000100000000000000AE). As a
result, while the new primary does not have any local WAL gaps, the archive
directory will contain a gap in that WAL range.
I’m not sure if I explained it clearly.

This will happen if the replica is lagging and is promoted before it has
had a chance to catch up; that is working as designed. There are several
tools available to tell whether the replica is in sync before promoting.
In the above case a lagging replica was promoted: it stops looking at
the previous timeline and will NOT look for the missing WAL files from
the previous timeline. The replica does not even know they exist
anymore.

The data in the previous timeline is no longer accessible from the
promoted replica; it is working on a new timeline. The only place the
old timeline's missed WAL files are accessible is on the crashed
primary, which never archived or streamed them to the replica.

Promoting an out of sync/lagging replica will result in loss of data.

Does this answer the question here?
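One way to check how far behind a standby is before promoting it is to
compare the primary's pg_current_wal_lsn() with the standby's
pg_last_wal_receive_lsn(). The helper below is a sketch of just the LSN
arithmetic (LSNs are 'hi/lo' hex, where the high word counts 4 GiB
units); it is not tied to any particular HA tool:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN string such as '1/AF000028' to an
    absolute byte position in the WAL stream."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

def replica_lag_bytes(primary_lsn: str, standby_lsn: str) -> int:
    """Bytes of WAL the standby has not yet received; 0 means it is
    fully caught up on this metric."""
    return lsn_to_bytes(primary_lsn) - lsn_to_bytes(standby_lsn)

# A standby at the primary's LSN has zero lag:
print(replica_lag_bytes("1/AF000028", "1/AF000028"))  # 0
```

A promotion script could refuse to promote unless this lag is zero (or
below some small threshold) for the candidate standby.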

#20Adrian Klaver
adrian.klaver@aklaver.com
In reply to: px shi (#18)
Re: Questions about the continuity of WAL archiving

On 8/12/25 22:48, px shi wrote:

Here’s a scenario: The latest WAL file on the primary node is
0000000100000000000000AF, and the standby node has also received up to
0000000100000000000000AF. However, the latest WAL file that has been
successfully archived from the primary is only 0000000100000000000000A1
(WAL files from A2 to AE have not yet been archived). If the primary
crashes at this point, triggering a failover, the new primary will start
generating and archiving WAL on a new timeline (2), beginning with
0000000200000000000000AF. It will not backfill the missing WAL files
from timeline 1 (0000000100000000000000A2 to 0000000100000000000000AE).
As a result, while the new primary does not have any local WAL gaps, the
archive directory will contain a gap in that WAL range.
I’m not sure if I explained it clearly.

Why does it matter?

1) Your standby is starting off up to date.

2) You can do a pg_basebackup from the new primary as a base for the
restart of the old primary. Assuming you have archiving set up on the
new primary then the restarted primary can catch up.

3) If you don't want to do 2) then you need an archive location that can
deal with the velocity of the WAL archiving.
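Point 2) could look roughly like the following; the host name, user, and
data directory are hypothetical:

```
# Run on the old primary after moving aside its old data directory.
# -X stream : also stream the WAL needed for a consistent copy
# -R        : write recovery settings so the node starts as a standby
pg_basebackup -h new-primary.example.com -U replicator \
  -D /var/lib/postgresql/data -X stream -R
```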

Justin <zzzzz.graf@gmail.com> wrote on Wed, Aug 13, 2025 at 10:51:

--
Adrian Klaver
adrian.klaver@aklaver.com

#21Greg Sabino Mullane
greg@turnstep.com
In reply to: Adrian Klaver (#13)