Logical replication failure modes

Started by Philip Warneralmost 2 years ago3 messages

pjw@rhyme.com.au

almost 2 years ago

I am trying to discover the causes of occasional data loss in logical
replication; it is VERY rare and happens every few week/months.

Our setup is a source DB running in docker on AWS cloud server. The
source database is stored in on local disks on the cloud server.

The replication target is a K8 POD running in an AWS instance with an
attached persistent AWS disk. The disk mounting is managed by K8.
Periodically this POD is deleted and restarted in an orderly way, and
the persistent disk stores the database.

What we are seeing is *very* occasional records not being replicated in
the more active tables.

Sometimes we have a backlog of several GB of data due to missing fields
in the target or network outages etc.

I am also seeing signs that some triggers are not being applied (at the
same time frame): ie. data *is* inserted but triggers that summarize
that data is not summarizing some rows and the dates on those
non-summarized rows corresponds to dates on unrelated missing rows in
other tables.

This all leads me to conclude that there might be missing transactions?
Or non-applied transactions etc. But it is further complicated by the
fact that there is a second target database that *does* have all the
missing records.

Any insights or avenues of exploration would be very welcome!

Philip Warner

pjw@rhyme.com.au

almost 2 years ago

In reply to: Philip Warner (#1)

Re: Logical replication failure modes

I should have added that the source DB is 16.1 and the target is 16.0

vignesh C

vignesh21@gmail.com

almost 2 years ago

In reply to: Philip Warner (#1)

Re: Logical replication failure modes

On Fri, 29 Mar 2024 at 17:14, Philip Warner <pjw@rhyme.com.au> wrote:

I am trying to discover the causes of occasional data loss in logical replication; it is VERY rare and happens every few week/months.

Our setup is a source DB running in docker on AWS cloud server. The source database is stored in on local disks on the cloud server.

The replication target is a K8 POD running in an AWS instance with an attached persistent AWS disk. The disk mounting is managed by K8. Periodically this POD is deleted and restarted in an orderly way, and the persistent disk stores the database.

What we are seeing is *very* occasional records not being replicated in the more active tables.

Sometimes we have a backlog of several GB of data due to missing fields in the target or network outages etc.

I am also seeing signs that some triggers are not being applied (at the same time frame): ie. data *is* inserted but triggers that summarize that data is not summarizing some rows and the dates on those non-summarized rows corresponds to dates on unrelated missing rows in other tables.

This all leads me to conclude that there might be missing transactions? Or non-applied transactions etc. But it is further complicated by the fact that there is a second target database that *does* have all the missing records.

Any insights or avenues of exploration would be very welcome!

Can you check the following a) if there is any error in the log files
(both publisher and subscriber), b) Is the apply worker process
running for that particular subscription in the subscriber c) Check
pg_stat_subscription if last_msg_send_time, last_msg_receipt_time,
latest_end_lsn and latest_end_time are getting advanced in the
subscriber, e) Check pg_stat_replication for the lsn values and
reply_time getting updated in the publisher.

Since you have the second target database up to date,comparing the
pg_stat_subscription and pg_stat_replication for both of them will
give what is happening.

Regards,
Vignesh