conflict with recovery when delay is gone

Started by Radoslav Nedyalkov, over 5 years ago, 6 messages, general

rnedyalkov@gmail.com

over 5 years ago

Hi Forum, all,
On a very busy master-standby setup that runs typical OLAP processing
(long-running statements with massive writes), we're getting the following
on the standby:

ERROR: canceling statement due to conflict with recovery
FATAL: terminating connection due to conflict with recovery

The weird thing is that the cancellations usually happen after the standby
has experienced a huge delay (2h), still below the allowed maximum (3h).
Even recently started statements get cancelled when the delay is already
back at zero.

Sometimes the situation relaxes after an hour or so.
Restarting the server helps instantly.

It is PostgreSQL 11.8 on CentOS 7 with huge pages, shared_buffers 196G out of 748G.

What phenomenon could we be facing?

Thank you,
Rado

Laurenz Albe

laurenz.albe@cybertec.at

over 5 years ago

In reply to: Radoslav Nedyalkov (#1)

Re: conflict with recovery when delay is gone

Hard to say. Perhaps an unusual kind of replication conflict?

What is in "pg_stat_database_conflicts" on the standby server?

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com

Radoslav Nedyalkov

rnedyalkov@gmail.com

over 5 years ago

In reply to: Laurenz Albe (#2)

Re: conflict with recovery when delay is gone

db01=# select * from pg_stat_database_conflicts;
 datid |  datname  | confl_tablespace | confl_lock | confl_snapshot | confl_bufferpin | confl_deadlock
-------+-----------+------------------+------------+----------------+-----------------+----------------
 13877 | template0 |                0 |          0 |              0 |               0 |              0
 16400 | template1 |                0 |          0 |              0 |               0 |              0
 16402 | postgres  |                0 |          0 |              0 |               0 |              0
 16401 | db01      |                0 |          0 |             51 |               0 |              0
(4 rows)

On a freshly restarted standby we just got similar behaviour after a
2-hour delay and a slow catch-up.
confl_snapshot is 51, and we have exactly the same number of cancelled
statements.

Radoslav Nedyalkov

rnedyalkov@gmail.com

over 5 years ago

In reply to: Radoslav Nedyalkov (#3)

Re: conflict with recovery when delay is gone

No luck so far. Searching for an explanation, I found that we fall into the
unexplained case where snapshot conflicts happen even though
hot_standby_feedback is on.

Thanks,
Rado
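
One way to verify that the feedback actually reaches the master is sketched below. It assumes a single streaming standby without cascading. With hot_standby_feedback on, the standby's oldest snapshot should normally be visible as backend_xmin in pg_stat_replication on the master. If a replication slot is used, the value may appear as xmin in pg_replication_slots instead.

```sql
-- Run on the master. A NULL backend_xmin for the standby's walsender
-- suggests that no feedback is arriving over that connection
-- (or that it is tracked in a replication slot instead, see below).
select application_name, state, backend_xmin
from pg_stat_replication;

-- Run on the master, only relevant when the standby uses a slot.
select slot_name, active, xmin
from pg_replication_slots;

-- Run on the standby to confirm the setting is in effect.
show hot_standby_feedback;
```

If both xmin values stay NULL while queries are running on the standby, the master is free to clean up rows those queries still need, which would be consistent with a rising confl_snapshot count.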

Mohamed Wael Khobalatte

mkhobalatte@grubhub.com

over 5 years ago

In reply to: Radoslav Nedyalkov (#4)

Re: conflict with recovery when delay is gone

Perhaps you have a value set for old_snapshot_threshold? If not, do the
walreceiver connections drop out?

Radoslav Nedyalkov

rnedyalkov@gmail.com

over 5 years ago

In reply to: Mohamed Wael Khobalatte (#5)

Re: conflict with recovery when delay is gone

old_snapshot_threshold is -1 on both master and replica.
walreceiver does not drop.
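
For anyone retracing these checks, the queries below are a rough sketch of how the values mentioned in this thread can be read. The choice of settings to list is an assumption about what is relevant here, and the replay delay expression is only an approximation.

```sql
-- Run on both master and replica to compare the relevant settings.
select name, setting, unit
from pg_settings
where name in ('hot_standby_feedback',
               'max_standby_streaming_delay',
               'max_standby_archive_delay',
               'old_snapshot_threshold');

-- Run on the replica: state of the walreceiver and when it last
-- heard from the master.
select status, last_msg_receipt_time, latest_end_time
from pg_stat_wal_receiver;

-- Run on the replica: approximate replay delay. This overstates the
-- delay when the master is idle, because no new commits are replayed.
select now() - pg_last_xact_replay_timestamp() as replay_delay;
```

An empty result from pg_stat_wal_receiver would mean that no walreceiver is running at that moment.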