Intermittent Issue with WAL Segment Removal in Logical Replication

Started by Kaushik Iskaover 2 years ago6 messagesgeneral

kaushik@peerdb.io

over 2 years ago

Dear PostgreSQL Community,

I am seeking guidance regarding a recurring issue we've encountered with
WAL segment removal during logical replication using pgoutput plugin. We
sporadically encounter an error indicating that a requested WAL segment has
already been removed. This issue arises intermittently when executing
START_REPLICATION. An example error message is as follows:

requested WAL segment 000000010000146000000AE has already been removed

Please note that this error is not specific to the segment mentioned above;
it serves as an example of the type of error we are experiencing.

Additional Context:

max_slot_wal_keep_size is -1, logical_decoding_work_mem is 4 GB.
-

The error seems to appear randomly and is not consistent.
-

After a couple of retries, the replication process eventually succeeds.
-

For one of the users it seems to be happening every 16 hours or so.

Our approach involves starting with START_REPLICATION 0, replicating data
in batches, and then restarting at the last LSN of the previous batch. We
are trying to understand the root cause behind the intermittent removal of
WAL segments during logical replication. Specifically, we are looking for
insights into:

The potential reasons for the WAL segments being reported as removed.
-

Why this error occurs intermittently and why replication succeeds after
several retries.
-

Any advice on troubleshooting and resolving this issue, or insights into
whether it might be related to our specific replication setup or a
characteristic of pgoutput, would be highly valuable.

https://issues.redhat.com/browse/DBZ-590
-

Troubleshooting Postgres Sources | Airbyte Documentation
<https://docs.airbyte.com/integrations/sources/postgres/postgres-troubleshooting#under-cdc-incremental-mode-there-are-still-full-refresh-syncs>
-

https://fivetran.com/docs/databases/postgresql/troubleshooting/last-tracked-lsn-error

Thank you very much for your time and assistance.

Thanks,

Kaushik Iska

kaushik@peerdb.io

over 2 years ago

In reply to: Kaushik Iska (#1)

Re: Intermittent Issue with WAL Segment Removal in Logical Replication

Hi all,

I'm including additional details, as I am able to reproduce this issue a
little more reliably.

Postgres Version: POSTGRES_14_9.R20230830.01_07
Vendor: Google Cloud SQL
Logical Replication Protocol version 1

Here are the logs of attempt succeeding right after it fails:

2023-12-27 01:12:40.581 UTC [59790]: [6-1] db=postgres,user=postgres
STATEMENT: START_REPLICATION SLOT peerflow_slot_wal_testing_2 LOGICAL
6/5AE67D79 (proto_version '1', publication_names
'peerflow_pub_wal_testing_2') <- FAILS
2023-12-27 01:12:41.087 UTC [59790]: [7-1] db=postgres,user=postgres ERROR:
requested WAL segment 000000010000000600000059 has already been removed
2023-12-27 01:12:44.581 UTC [59794]: [3-1] db=postgres,user=postgres
STATEMENT: START_REPLICATION SLOT peerflow_slot_wal_testing_2 LOGICAL
6/5AE67D79 (proto_version '1', publication_names
'peerflow_pub_wal_testing_2') <- SUCCEEDS
2023-12-27 01:12:44.582 UTC [59794]: [4-1] db=postgres,user=postgres LOG:
logical decoding found consistent point at 6/5A31F050

Happy to include any additional details of my setup.

Thanks,
Kaushik

On Tue, Dec 26, 2023 at 10:36 AM Kaushik Iska <kaushik@peerdb.io> wrote:

Show quoted text

Dear PostgreSQL Community,

I am seeking guidance regarding a recurring issue we've encountered with
WAL segment removal during logical replication using pgoutput plugin. We
sporadically encounter an error indicating that a requested WAL segment has
already been removed. This issue arises intermittently when executing
START_REPLICATION. An example error message is as follows:

requested WAL segment 000000010000146000000AE has already been removed

Please note that this error is not specific to the segment mentioned
above; it serves as an example of the type of error we are experiencing.

Additional Context:

-

max_slot_wal_keep_size is -1, logical_decoding_work_mem is 4 GB.
-

The error seems to appear randomly and is not consistent.
-

After a couple of retries, the replication process eventually succeeds.
-

For one of the users it seems to be happening every 16 hours or so.

Our approach involves starting with START_REPLICATION 0, replicating data
in batches, and then restarting at the last LSN of the previous batch. We
are trying to understand the root cause behind the intermittent removal of
WAL segments during logical replication. Specifically, we are looking for
insights into:

-

The potential reasons for the WAL segments being reported as removed.
-

Why this error occurs intermittently and why replication succeeds
after several retries.
-

Any advice on troubleshooting and resolving this issue, or insights
into whether it might be related to our specific replication setup or a
characteristic of pgoutput, would be highly valuable.

Related Posts

-

https://issues.redhat.com/browse/DBZ-590
-

Troubleshooting Postgres Sources | Airbyte Documentation
<https://docs.airbyte.com/integrations/sources/postgres/postgres-troubleshooting#under-cdc-incremental-mode-there-are-still-full-refresh-syncs>
-

https://fivetran.com/docs/databases/postgresql/troubleshooting/last-tracked-lsn-error

Thank you very much for your time and assistance.

Thanks,

Kaushik Iska

Ron

ronljohnsonjr@gmail.com

over 2 years ago

In reply to: Kaushik Iska (#2)

Re: Intermittent Issue with WAL Segment Removal in Logical Replication

On Thu, Dec 28, 2023 at 4:54 PM Kaushik Iska <kaushik@peerdb.io> wrote:

Hi all,

I'm including additional details, as I am able to reproduce this issue a
little more reliably.

Postgres Version: POSTGRES_14_9.R20230830.01_07
Vendor: Google Cloud SQL
Logical Replication Protocol version 1

Here are the logs of attempt succeeding right after it fails:

2023-12-27 01:12:40.581 UTC [59790]: [6-1] db=postgres,user=postgres
STATEMENT: START_REPLICATION SLOT peerflow_slot_wal_testing_2 LOGICAL
6/5AE67D79 (proto_version '1', publication_names
'peerflow_pub_wal_testing_2') <- FAILS
2023-12-27 01:12:41.087 UTC [59790]: [7-1] db=postgres,user=postgres
ERROR: requested WAL segment 000000010000000600000059 has already been
removed
2023-12-27 01:12:44.581 UTC [59794]: [3-1] db=postgres,user=postgres
STATEMENT: START_REPLICATION SLOT peerflow_slot_wal_testing_2 LOGICAL
6/5AE67D79 (proto_version '1', publication_names
'peerflow_pub_wal_testing_2') <- SUCCEEDS
2023-12-27 01:12:44.582 UTC [59794]: [4-1] db=postgres,user=postgres LOG:
logical decoding found consistent point at 6/5A31F050

Happy to include any additional details of my setup.

Thanks,
Kaushik

On Tue, Dec 26, 2023 at 10:36 AM Kaushik Iska <kaushik@peerdb.io> wrote:

Dear PostgreSQL Community,

I am seeking guidance regarding a recurring issue we've encountered with
WAL segment removal during logical replication using pgoutput plugin. We
sporadically encounter an error indicating that a requested WAL segment has
already been removed. This issue arises intermittently when executing
START_REPLICATION. An example error message is as follows:

requested WAL segment 000000010000146000000AE has already been removed

Please note that this error is not specific to the segment mentioned
above; it serves as an example of the type of error we are experiencing.

Additional Context:

-

max_slot_wal_keep_size is -1, logical_decoding_work_mem is 4 GB.
-

The error seems to appear randomly and is not consistent.
-

After a couple of retries, the replication process eventually
succeeds.
-

For one of the users it seems to be happening every 16 hours or so.

Our approach involves starting with START_REPLICATION 0, replicating data
in batches, and then restarting at the last LSN of the previous batch. We
are trying to understand the root cause behind the intermittent removal of
WAL segments during logical replication. Specifically, we are looking for
insights into:

-

The potential reasons for the WAL segments being reported as removed.
-

Why this error occurs intermittently and why replication succeeds
after several retries.
-

Any advice on troubleshooting and resolving this issue, or insights
into whether it might be related to our specific replication setup or a
characteristic of pgoutput, would be highly valuable.

Related Posts

-

https://issues.redhat.com/browse/DBZ-590
-

Troubleshooting Postgres Sources | Airbyte Documentation
<https://docs.airbyte.com/integrations/sources/postgres/postgres-troubleshooting#under-cdc-incremental-mode-there-are-still-full-refresh-syncs>
-

https://fivetran.com/docs/databases/postgresql/troubleshooting/last-tracked-lsn-error

Thank you very much for your time and assistance.

Thanks,

Kaushik Iska

It might be interesting to see the contents of pg_replication_slots.

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 2 years ago

In reply to: Kaushik Iska (#2)

Re: Intermittent Issue with WAL Segment Removal in Logical Replication

On 12/27/23 16:31, Kaushik Iska wrote:

Hi all,

I'm including additional details, as I am able to reproduce this issue a
little more reliably.

Postgres Version: POSTGRES_14_9.R20230830.01_07
Vendor: Google Cloud SQL
Logical Replication Protocol version 1

I don't know much about Google Cloud SQL internals. Is it relatively
close to Postgres (as e.g. RDS) or are the internals very different /
modified for cloud environments?

Here are the logs of attempt succeeding right after it fails:

2023-12-27 01:12:40.581 UTC [59790]: [6-1] db=postgres,user=postgres
STATEMENT: START_REPLICATION SLOT peerflow_slot_wal_testing_2 LOGICAL
6/5AE67D79 (proto_version '1', publication_names
'peerflow_pub_wal_testing_2') <- FAILS
2023-12-27 01:12:41.087 UTC [59790]: [7-1] db=postgres,user=postgres
ERROR: requested WAL segment 000000010000000600000059 has already been
removed
2023-12-27 01:12:44.581 UTC [59794]: [3-1] db=postgres,user=postgres
STATEMENT: START_REPLICATION SLOT peerflow_slot_wal_testing_2 LOGICAL
6/5AE67D79 (proto_version '1', publication_names
'peerflow_pub_wal_testing_2') <- SUCCEEDS
2023-12-27 01:12:44.582 UTC [59794]: [4-1] db=postgres,user=postgres
LOG: logical decoding found consistent point at 6/5A31F050

Happy to include any additional details of my setup.

I personally don't see how could this fail and then succeed, unless
Google does something smart with the WAL segments under the hood. Surely
we try to open the same WAL segment (given the LSN is the same), so how
could it not exist and then exist?

As Ron already suggested, it might be useful to see information for the
replication slot peerflow_slot_wal_testing_2 (especially the restart_lsn
value). Also, maybe show the contents of pg_wal (especially for the
segment referenced in the error message).

Can you reproduce this outside Google cloud environment?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Kaushik Iska

kaushik@peerdb.io

over 2 years ago

In reply to: Tomas Vondra (#4)

Re: Intermittent Issue with WAL Segment Removal in Logical Replication

I am unfortunately not really familiar with Google Cloud SQL internals as
well. But we have seen this happen on Amazon RDS as well.

Could it be possible that we are requesting a future WAL segment, say WAL
upto X is written and we are asking for X + 1? It could be that the error
message is misleading.

I do not have the information from pg_replication_slots as I have
terminated the test. I am fairly certain that I can reproduce this again. I
will gather both the restart_lsn and contents of pg_wal for the failed
segment. Is there any other information that would help debug this further?

On Fri, Dec 29, 2023 at 4:16 PM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:

Show quoted text

On 12/27/23 16:31, Kaushik Iska wrote:

Hi all,

I'm including additional details, as I am able to reproduce this issue a
little more reliably.

Postgres Version: POSTGRES_14_9.R20230830.01_07
Vendor: Google Cloud SQL
Logical Replication Protocol version 1

I don't know much about Google Cloud SQL internals. Is it relatively
close to Postgres (as e.g. RDS) or are the internals very different /
modified for cloud environments?

Here are the logs of attempt succeeding right after it fails:

2023-12-27 01:12:40.581 UTC [59790]: [6-1] db=postgres,user=postgres
STATEMENT: START_REPLICATION SLOT peerflow_slot_wal_testing_2 LOGICAL
6/5AE67D79 (proto_version '1', publication_names
'peerflow_pub_wal_testing_2') <- FAILS
2023-12-27 01:12:41.087 UTC [59790]: [7-1] db=postgres,user=postgres
ERROR: requested WAL segment 000000010000000600000059 has already been
removed
2023-12-27 01:12:44.581 UTC [59794]: [3-1] db=postgres,user=postgres
STATEMENT: START_REPLICATION SLOT peerflow_slot_wal_testing_2 LOGICAL
6/5AE67D79 (proto_version '1', publication_names
'peerflow_pub_wal_testing_2') <- SUCCEEDS
2023-12-27 01:12:44.582 UTC [59794]: [4-1] db=postgres,user=postgres
LOG: logical decoding found consistent point at 6/5A31F050

Happy to include any additional details of my setup.

I personally don't see how could this fail and then succeed, unless
Google does something smart with the WAL segments under the hood. Surely
we try to open the same WAL segment (given the LSN is the same), so how
could it not exist and then exist?

As Ron already suggested, it might be useful to see information for the
replication slot peerflow_slot_wal_testing_2 (especially the restart_lsn
value). Also, maybe show the contents of pg_wal (especially for the
segment referenced in the error message).

Can you reproduce this outside Google cloud environment?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 2 years ago

In reply to: Kaushik Iska (#5)

Re: Intermittent Issue with WAL Segment Removal in Logical Replication

On 12/29/23 22:28, Kaushik Iska wrote:

I am unfortunately not really familiar with Google Cloud SQL internals
as well. But we have seen this happen on Amazon RDS as well.

Do you have a reproducer for regular Postgres?

Could it be possible that we are requesting a future WAL segment, say
WAL upto X is written and we are asking for X + 1? It could be that the
error message is misleading.

I don't think that should be possible. The LSN in the START_REPLICATION
comes from the replica, where it's tracked as the last LSN received from
the upstream. So that shouldn't be in the future. And it's doesn't seem
to be suspiciously close to segment boundary either.

In fact, the LSN in the message is 6/5AE67D79, but the "failed" segment
is 000000010000000600000059, which is the *preceding* one. So it can't
be in the future.

I do not have the information from pg_replication_slots as I have
terminated the test. I am fairly certain that I can reproduce this
again. I will gather both the restart_lsn and contents of pg_wal for the
failed segment. Is there any other information that would help debug
this further?

Hard to say. The best thing would be to have a reproducer script, ofc.
If that's not possible, the information already requested seems like a
good start.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company