Postgresql 11: terminating walsender process due to replication timeout

Started by Abhishek Bholaover 4 years ago6 messagesgeneral

abhishek.bhola@japannext.co.jp

over 4 years ago

I have found some questions about the same error, but didn't find any of
them answering my problem.

The setup is that I have two Postgres11 clusters (A and B) and they are
making use of publication and subscription features to copy data from A to
B.

A (source DB- publication) --------------> B (target DB - subscription)

This works fine, but often (not always) when the data volume being inserted
on a table in node A increases, it gives the following error.

"terminating walsender process due to replication timeout"

The data volume at the moment being entered is about 30K rows per second
continuously for hours through COPY command.

Earlier the wal_sender_timeout was set to 5 sec and I would see this error
much often. I then increased it to 1 min and the frequency of this error
reduced. But I don't want to keep increasing it without understanding what
is causing it. I looked at the code of walsender.c and know the exact lines
where it's coming from.

But I am still not clear which parameter is making the sender assume that
the receiver node is inactive and therefore it should stop the wal_sender.

Can anyone please suggest what changes I should make to remove this error?

sourcedb=# show wal_sender_timeout;
wal_sender_timeout--------------------
1min
(1 row)

targetdb=# show wal_receiver_timeout;
wal_receiver_timeout----------------------
1min
(1 row)

targetdb=# show wal_retrieve_retry_interval ;
wal_retrieve_retry_interval-----------------------------
5s
(1 row)

targetdb=# show wal_receiver_status_interval;
wal_receiver_status_interval------------------------------
2s
(1 row)

Increased the wal_sender_timeout to 5 mins and the error started
appearing more frequently instead. Not only that, it even killed the
active subscription and stopped replicating data. Had to restart it.
So clearly, just increasing the wal_sender_timeout hasn't helped.

--
_This correspondence (including any attachments) is for the intended
recipient(s) only. It may contain confidential or privileged information or
both. No confidentiality or privilege is waived or lost by any
mis-transmission. If you receive this correspondence by mistake, please
contact the sender immediately, delete this correspondence (and all
attachments) and destroy any hard copies. You must not use, disclose, copy,
distribute or rely on any part of this correspondence (including any
attachments) if you are not the intended
recipient(s).本メッセージに記載および添付されている情報（以下、総称して「本情報」といいます。）は、本来の受信者による使用のみを意図しています。誤送信等により本情報を取得された場合でも、本情報に係る秘密、または法律上の秘匿特権が失われるものではありません。本電子メールを受取られた方が、本来の受信者ではない場合には、本情報及びそのコピーすべてを削除・破棄し、本電子メールが誤って届いた旨を発信者宛てにご通知下さいますようお願いします。本情報の閲覧、発信または本情報に基づくいかなる行為も明確に禁止されていることをご了承ください。_

Kyotaro Horiguchi

horikyota.ntt@gmail.com

over 4 years ago

In reply to: Abhishek Bhola (#1)

Re: Postgresql 11: terminating walsender process due to replication timeout

At Thu, 9 Sep 2021 14:52:25 +0900, Abhishek Bhola <abhishek.bhola@japannext.co.jp> wrote in

I have found some questions about the same error, but didn't find any of
them answering my problem.

The setup is that I have two Postgres11 clusters (A and B) and they are
making use of publication and subscription features to copy data from A to
B.

A (source DB- publication) --------------> B (target DB - subscription)

This works fine, but often (not always) when the data volume being inserted
on a table in node A increases, it gives the following error.

"terminating walsender process due to replication timeout"

The data volume at the moment being entered is about 30K rows per second
continuously for hours through COPY command.

Earlier the wal_sender_timeout was set to 5 sec and I would see this error
much often. I then increased it to 1 min and the frequency of this error
reduced. But I don't want to keep increasing it without understanding what
is causing it. I looked at the code of walsender.c and know the exact lines
where it's coming from.

But I am still not clear which parameter is making the sender assume that
the receiver node is inactive and therefore it should stop the wal_sender.

Can anyone please suggest what changes I should make to remove this error?

What minor-version is the Postgres server mentioned? PostgreSQL 11
have gotten the following fix at 11.6, which could be related to the
trouble.

https://www.postgresql.org/docs/11/release-11-6.html

Fix timeout handling in logical replication walreceiver processes
(Julien Rouhaud)

Erroneous logic prevented wal_receiver_timeout from working in
logical replication deployments.

The details of the fix is here.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3f60f690fac1bf375b92cf2f8682e8fe8f69098

Fix timeout handling in logical replication worker

The timestamp tracking the last moment a message is received in a
logical replication worker was initialized in each loop checking if a
message was received or not, causing wal_receiver_timeout to be ignored
in basically any logical replication deployments. This also broke the
ping sent to the server when reaching half of wal_receiver_timeout.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Abhishek Bhola

abhishek.bhola@japannext.co.jp

over 4 years ago

In reply to: Kyotaro Horiguchi (#2)

Re: Postgresql 11: terminating walsender process due to replication timeout

sourcedb:~$ postgres --version
postgres (PostgreSQL) 11.6

Sorry for missing this information.
But looks like this fix is already included in the version I am running.

Regards,
Abhishek Bhola

On Thu, Sep 9, 2021 at 3:56 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:

At Thu, 9 Sep 2021 14:52:25 +0900, Abhishek Bhola <
abhishek.bhola@japannext.co.jp> wrote in

I have found some questions about the same error, but didn't find any of
them answering my problem.

The setup is that I have two Postgres11 clusters (A and B) and they are
making use of publication and subscription features to copy data from A

to

B.

A (source DB- publication) --------------> B (target DB - subscription)

This works fine, but often (not always) when the data volume being

inserted

on a table in node A increases, it gives the following error.

"terminating walsender process due to replication timeout"

The data volume at the moment being entered is about 30K rows per second
continuously for hours through COPY command.

Earlier the wal_sender_timeout was set to 5 sec and I would see this

error

much often. I then increased it to 1 min and the frequency of this error
reduced. But I don't want to keep increasing it without understanding

what

is causing it. I looked at the code of walsender.c and know the exact

lines

where it's coming from.

But I am still not clear which parameter is making the sender assume that
the receiver node is inactive and therefore it should stop the

wal_sender.

Can anyone please suggest what changes I should make to remove this

error?

What minor-version is the Postgres server mentioned? PostgreSQL 11
have gotten the following fix at 11.6, which could be related to the
trouble.

https://www.postgresql.org/docs/11/release-11-6.html

Fix timeout handling in logical replication walreceiver processes
(Julien Rouhaud)

Erroneous logic prevented wal_receiver_timeout from working in
logical replication deployments.

The details of the fix is here.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3f60f690fac1bf375b92cf2f8682e8fe8f69098

Fix timeout handling in logical replication worker

The timestamp tracking the last moment a message is received in a
logical replication worker was initialized in each loop checking if a
message was received or not, causing wal_receiver_timeout to be ignored
in basically any logical replication deployments. This also broke the
ping sent to the server when reaching half of wal_receiver_timeout.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Kyotaro Horiguchi

horikyota.ntt@gmail.com

over 4 years ago

In reply to: Abhishek Bhola (#3)

Re: Postgresql 11: terminating walsender process due to replication timeout

At Thu, 9 Sep 2021 16:06:25 +0900, Abhishek Bhola <abhishek.bhola@japannext.co.jp> wrote in

sourcedb:~$ postgres --version
postgres (PostgreSQL) 11.6

Sorry for missing this information.
But looks like this fix is already included in the version I am running.

Ok. I'm not sure but there may be a case where too-busy (or too poor
relative to the publisher) subscriber cannot send a response for a
long time. Usually keep-alive packets sent from publisher causes
subscriber response even while busy time but it seems that if
subscriber applies changes more than two times slower than the
publisher sends, subscriber doesn't send a response in the timeout
window.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Abhishek Bhola

abhishek.bhola@japannext.co.jp

over 4 years ago

In reply to: Kyotaro Horiguchi (#4)

Re: Postgresql 11: terminating walsender process due to replication timeout

So is there any solution to this issue?
I did try to increase the wal_sender_timeout and it broke the pub/sub.
I increased the wal_receiver_timeout and it wouldn't attempt to restart the
subscription until that time elapsed.
Due to that, the WAL segments got removed by the time it came up again and
it stopped working.
So given that the publisher is publishing at a higher rate than the
subscriber is subscribing, what can be done?

On Fri, Sep 10, 2021 at 9:26 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:

At Thu, 9 Sep 2021 16:06:25 +0900, Abhishek Bhola <
abhishek.bhola@japannext.co.jp> wrote in

sourcedb:~$ postgres --version
postgres (PostgreSQL) 11.6

Sorry for missing this information.
But looks like this fix is already included in the version I am running.

Ok. I'm not sure but there may be a case where too-busy (or too poor
relative to the publisher) subscriber cannot send a response for a
long time. Usually keep-alive packets sent from publisher causes
subscriber response even while busy time but it seems that if
subscriber applies changes more than two times slower than the
publisher sends, subscriber doesn't send a response in the timeout
window.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Kyotaro Horiguchi

horikyota.ntt@gmail.com

over 4 years ago

In reply to: Abhishek Bhola (#5)

Re: Postgresql 11: terminating walsender process due to replication timeout

At Fri, 10 Sep 2021 16:55:48 +0900, Abhishek Bhola <abhishek.bhola@japannext.co.jp> wrote in

So is there any solution to this issue?
I did try to increase the wal_sender_timeout and it broke the pub/sub.
I increased the wal_receiver_timeout and it wouldn't attempt to restart the
subscription until that time elapsed.
Due to that, the WAL segments got removed by the time it came up again and
it stopped working.
So given that the publisher is publishing at a higher rate than the
subscriber is subscribing, what can be done?

Given that my assumption is right, to enable a subscriber to send a
response, the subscriber needs to see a keepalive packet from
publisher (sent with intervals of wal_sender_timeout/2) within every
interval of wal_sender_timeout. Otherwise needs to "rest" by finding a
gap in the data stream from the publisher with intervals shorter than
wal_sender_timeout.

The reason subscriber is kept busy is it receives the next data
before it finishes applying the previous data. So possible workaround
I came up with for now are:

- Increase processing power of the subscriber, so as to increase the
possibility that it can finish applying changes before the next data
block comes from the publisher. Or, to make the subscriber can keep
catching up to the publisher. This is the most appropriate solution,
I think.

- Throttle network bandwidth to obtain the same effect to the first
reason above. (This may give a side effect that the bandwidth become
finally insufficient.)

- Break large transactions on the publisher into smaller pieces.
Publisher sends data of a transaction at once at transaction commit,
so this could average data transfer rate.

- If you are setting *sufficient* logical_decoding_work_mem for such
problematic large transactions, *decreasing* it might mitigate the
issue. Lower logical_decoding_work_mem cause transaction data spill
out to disk and the spilled data on disk could be sent at slower
rate than on-memory data. Of course this is in exchange with total
performance.

- Streaming mode of logical replication introduced in PostgreSQL 14
might be able to mitigate the problem. It starts sending
transaction data before the transaction completes.

I'm not sure this is "fixed" for 13 or earlier, because a straight
forward resolution surely decreases maximum processing rate at
subscriber.

On Fri, Sep 10, 2021 at 9:26 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:

At Thu, 9 Sep 2021 16:06:25 +0900, Abhishek Bhola <
abhishek.bhola@japannext.co.jp> wrote in

sourcedb:~$ postgres --version
postgres (PostgreSQL) 11.6

Sorry for missing this information.
But looks like this fix is already included in the version I am running.

Ok. I'm not sure but there may be a case where too-busy (or too poor
relative to the publisher) subscriber cannot send a response for a
long time. Usually keep-alive packets sent from publisher causes
subscriber response even while busy time but it seems that if
subscriber applies changes more than two times slower than the
publisher sends, subscriber doesn't send a response in the timeout
window.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center