Unresolved replication hang and stop problem.

Started by Krzysztof Kois · almost 5 years ago · 21 messages
#1 Krzysztof Kois
krzysztof.kois@unitygroup.com

Hello,
After upgrading the cluster from 10.x to 13.1 we've started getting a problem described on pgsql-general:
/messages/by-id/8bf8785c-f47d-245c-b6af-80dc1eed40db@unitygroup.com
We've noticed a similar issue being described on this list in
https://www.postgresql-archive.org/Logical-replication-CPU-bound-with-TRUNCATE-DROP-CREATE-many-tables-tt6155123.html
with a fix being rolled out in 13.2.

After the 13.2 release, we've upgraded to it and unfortunately this did not solve the issue - the replication still stalls just as described in the original issue.
Please advise, how to debug and solve this issue.

#2 Amit Kapila
amit.kapila16@gmail.com
In reply to: Krzysztof Kois (#1)
Re: Unresolved replication hang and stop problem.

On Tue, Apr 13, 2021 at 1:18 PM Krzysztof Kois
<krzysztof.kois@unitygroup.com> wrote:

Hello,
After upgrading the cluster from 10.x to 13.1 we've started getting a problem described on pgsql-general:
/messages/by-id/8bf8785c-f47d-245c-b6af-80dc1eed40db@unitygroup.com
We've noticed a similar issue being described on this list in
https://www.postgresql-archive.org/Logical-replication-CPU-bound-with-TRUNCATE-DROP-CREATE-many-tables-tt6155123.html
with a fix being rolled out in 13.2.

The fix for the problem discussed in the above threads is committed
only in PG-14, see [1]. I don't know what makes you think it is fixed
in 13.2. Also, it is not easy to back-patch that because this fix
depends on some of the infrastructure introduced in PG-14.

[1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d7eb52d7181d83cf2363570f7a205b8eb1008dbc

--
With Regards,
Amit Kapila.

#3 Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Amit Kapila (#2)
Re: Unresolved replication hang and stop problem.

On 2021-Apr-14, Amit Kapila wrote:

On Tue, Apr 13, 2021 at 1:18 PM Krzysztof Kois
<krzysztof.kois@unitygroup.com> wrote:

Hello,
After upgrading the cluster from 10.x to 13.1 we've started getting a problem described on pgsql-general:
/messages/by-id/8bf8785c-f47d-245c-b6af-80dc1eed40db@unitygroup.com
We've noticed a similar issue being described on this list in
https://www.postgresql-archive.org/Logical-replication-CPU-bound-with-TRUNCATE-DROP-CREATE-many-tables-tt6155123.html
with a fix being rolled out in 13.2.

The fix for the problem discussed in the above threads is committed
only in PG-14, see [1]. I don't know what makes you think it is fixed
in 13.2. Also, it is not easy to back-patch that because this fix
depends on some of the infrastructure introduced in PG-14.

[1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d7eb52d7181d83cf2363570f7a205b8eb1008dbc

Hmm ... On what does it depend (other than plain git conflicts, which
are aplenty)? On a quick look at the commit, it's clear that we need to
be careful in order not to cause an ABI break, but that doesn't seem
impossible to solve; I'm wondering if there is more to it than that.

--
Álvaro Herrera 39°49'30"S 73°17'W

#4 Amit Kapila
amit.kapila16@gmail.com
In reply to: Alvaro Herrera (#3)
Re: Unresolved replication hang and stop problem.

On Wed, Apr 28, 2021 at 6:48 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2021-Apr-14, Amit Kapila wrote:

On Tue, Apr 13, 2021 at 1:18 PM Krzysztof Kois
<krzysztof.kois@unitygroup.com> wrote:

Hello,
After upgrading the cluster from 10.x to 13.1 we've started getting a problem described on pgsql-general:
/messages/by-id/8bf8785c-f47d-245c-b6af-80dc1eed40db@unitygroup.com
We've noticed a similar issue being described on this list in
https://www.postgresql-archive.org/Logical-replication-CPU-bound-with-TRUNCATE-DROP-CREATE-many-tables-tt6155123.html
with a fix being rolled out in 13.2.

The fix for the problem discussed in the above threads is committed
only in PG-14, see [1]. I don't know what makes you think it is fixed
in 13.2. Also, it is not easy to back-patch that because this fix
depends on some of the infrastructure introduced in PG-14.

[1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d7eb52d7181d83cf2363570f7a205b8eb1008dbc

Hmm ... On what does it depend (other than plain git conflicts, which
are aplenty)? On a quick look at the commit, it's clear that we need to
be careful in order not to cause an ABI break, but that doesn't seem
impossible to solve; I'm wondering if there is more to it than that.

As mentioned in the commit message, we need the change from another
commit [1] to make this work.

[1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c55040ccd0

--
With Regards,
Amit Kapila.

#5 Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Amit Kapila (#4)
Re: Unresolved replication hang and stop problem.

On 2021-Apr-28, Amit Kapila wrote:

On Wed, Apr 28, 2021 at 6:48 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Hmm ... On what does it depend (other than plain git conflicts, which
are aplenty)? On a quick look at the commit, it's clear that we need to
be careful in order not to cause an ABI break, but that doesn't seem
impossible to solve; I'm wondering if there is more to it than that.

As mentioned in the commit message, we need the change from another
commit [1] to make this work.

[1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c55040ccd0

Oh, yeah, that looks tougher. (Still not impossible: it adds a new WAL
message type, but we have added such on a minor release before.)

... It's strange that replication worked for them on pg10 though and
broke on 13. What did we change to make it so? (I don't have
any fish to fry on this topic at present, but it seems a bit
concerning.)

--
Álvaro Herrera Valdivia, Chile

#6 Amit Kapila
amit.kapila16@gmail.com
In reply to: Alvaro Herrera (#5)
Re: Unresolved replication hang and stop problem.

On Wed, Apr 28, 2021 at 7:36 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2021-Apr-28, Amit Kapila wrote:

On Wed, Apr 28, 2021 at 6:48 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Hmm ... On what does it depend (other than plain git conflicts, which
are aplenty)? On a quick look at the commit, it's clear that we need to
be careful in order not to cause an ABI break, but that doesn't seem
impossible to solve; I'm wondering if there is more to it than that.

As mentioned in the commit message, we need the change from another
commit [1] to make this work.

[1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c55040ccd0

Oh, yeah, that looks tougher. (Still not impossible: it adds a new WAL
message type, but we have added such on a minor release before.)

Yeah, we can try to make it possible if it is really a pressing issue
but I guess even in that case it is better to do it after we release
PG14 so that it can get some more testing.

... It's strange that replication worked for them on pg10 though and
broke on 13. What did we change to make it so?

No idea but probably if the other person can share the exact test case
which he sees working fine on PG10 but not on PG13 then it might be a
bit easier to investigate.

--
With Regards,
Amit Kapila.

#7 Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Amit Kapila (#6)
Re: Unresolved replication hang and stop problem.

Cc'ing Lukasz Biegaj because of the pgsql-general thread.

On 2021-Apr-29, Amit Kapila wrote:

On Wed, Apr 28, 2021 at 7:36 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

... It's strange that replication worked for them on pg10 though and
broke on 13. What did we change to make it so?

No idea but probably if the other person can share the exact test case
which he sees working fine on PG10 but not on PG13 then it might be a
bit easier to investigate.

Ah, noticed now that Krzysztof posted links to these older threads,
where a problem is described:

/messages/by-id/CANDwggKYveEtXjXjqHA6RL3AKSHMsQyfRY6bK+NqhAWJyw8psQ@mail.gmail.com
/messages/by-id/8bf8785c-f47d-245c-b6af-80dc1eed40db@unitygroup.com

Krzysztof said "after upgrading to pg13 we started having problems",
which implicitly indicates that the same thing worked well in pg10 ---
but if the problem has been correctly identified, then this wouldn't
have worked in pg10 either. So something in the story doesn't quite
match up. Maybe it's not the same problem after all, or maybe they
weren't doing X in pg10 which they are attempting in pg13.

Krzysztof, Lukasz, maybe you can describe more?

--
Álvaro Herrera Valdivia, Chile

#8 Lukasz Biegaj
lukasz.biegaj@unitygroup.com
In reply to: Alvaro Herrera (#7)
Re: Unresolved replication hang and stop problem.

Hey, thanks for reaching out and sorry for the late reply - we had a
few days of national holidays.

On 29.04.2021 15:55, Alvaro Herrera wrote:

/messages/by-id/CANDwggKYveEtXjXjqHA6RL3AKSHMsQyfRY6bK+NqhAWJyw8psQ@mail.gmail.com
/messages/by-id/8bf8785c-f47d-245c-b6af-80dc1eed40db@unitygroup.com

Krzysztof said "after upgrading to pg13 we started having problems",
which implicitly indicates that the same thing worked well in pg10 ---
but if the problem has been correctly identified, then this wouldn't
have worked in pg10 either. So something in the story doesn't quite
match up. Maybe it's not the same problem after all, or maybe they
weren't doing X in pg10 which they are attempting in pg13.

The problem started occurring after the upgrade from pg10 to pg13. No
other changes were made, neither to the database structure nor to the
operations we perform.

The problem is as described in
/messages/by-id/8bf8785c-f47d-245c-b6af-80dc1eed40db@unitygroup.com

It does occur on two separate production clusters and one test cluster -
all belonging to the same customer, although processing slightly
different data (it's an e-commerce store with multiple languages and
separate production databases for each language).

We've tried recreating the database from a dump, and recreating the
replication, but without any positive effect - the problem persists.

We did not roll back the databases to pg10; instead we've stayed on
pg13 and implemented a shell script to kill the walsender process if it
seems stuck in `hash_seq_search`. It's ugly, but it works, and we back
up and monitor the data integrity anyway.
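
The script itself isn't shown in this thread; as a rough, hypothetical
sketch (not their actual script, and leaving out the stuck-in-hash_seq_search
check, which presumably happens outside SQL), the "terminate the
walsender" step could look like:

pub=# SELECT pid, application_name, state, backend_start
        FROM pg_stat_activity
        WHERE backend_type = 'walsender';  -- identify the stuck walsender
pub=# SELECT pg_terminate_backend(12345);  -- placeholder pid; the subscriber's apply worker reconnects on its own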

I'd be happy to help debug the issue if I knew how to do it :-). If
you'd like, we can also try rolling the installation back to pg10 to
make certain that this was not caused by schema changes.

--
Lukasz Biegaj | Unity Group | https://www.unitygroup.com/
System Architect, AWS Certified Solutions Architect

#9 Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Lukasz Biegaj (#8)
Re: Unresolved replication hang and stop problem.

Hi Lukasz, thanks for following up.

On 2021-May-04, Lukasz Biegaj wrote:

The problem is as described in /messages/by-id/8bf8785c-f47d-245c-b6af-80dc1eed40db@unitygroup.com

It does occur on two separate production clusters and one test cluster - all
belonging to the same customer, although processing slightly different data
(it's an e-commerce store with multiple languages and separate production
databases for each language).

I think the best next move would be to make certain that the problem is
what we think it is, so that we can discuss if Amit's commit is an
appropriate fix. I would suggest doing that by running the problematic
workload in the test system under "perf record -g" and then get a report
with "perf report -g" which should hopefully give enough of a clue.
(Sometimes the reports are much better if you use a binary that was
compiled with -fno-omit-frame-pointer, so if you're in a position to try
that, it might be useful -- or apparently you could try "perf record
--call-graph dwarf" or "perf record --call-graph lbr", depending.)

Also I would be much more comfortable about proposing to backpatch such
an invasive change if you could ensure that in pg10 the same workload
does not cause the problem. If it doesn't, then it'd be clear we're
talking about a regression.

--
Álvaro Herrera Valdivia, Chile
"I'm always right, but sometimes I'm more right than other times."
(Linus Torvalds)

#10 Lukasz Biegaj
lukasz.biegaj@unitygroup.com
In reply to: Alvaro Herrera (#9)
Re: Unresolved replication hang and stop problem.

On 04.05.2021 16:35, Alvaro Herrera wrote:

I would suggest doing that by running the problematic
workload in the test system under "perf record -g"
[..]
you could ensure that in pg10 the same workload
does not cause the problem.

We'll go with both suggestions. I expect to come back to you with
results in about a week or two.

--
Lukasz Biegaj | Unity Group | https://www.unitygroup.com/
System Architect, AWS Certified Solutions Architect

#11 Ha Ka
klasahubert@gmail.com
In reply to: Lukasz Biegaj (#10)
Re: Unresolved replication hang and stop problem.

On Wed, 16 Jun 2021 at 12:31, Lukasz Biegaj <lukasz.biegaj@unitygroup.com>
wrote:

On 04.05.2021 16:35, Alvaro Herrera wrote:

I would suggest doing that by running the problematic
workload in the test system under "perf record -g"

We'll go with both suggestions. I expect to come back to you with
results in about a week or two.

Hi Alvaro,
We reproduced the replication issue and recorded the walsender process
using perf.
Below you can find the data for broken and working replication:

https://easyupload.io/kxcovg

password to the zip file: johS5jeewo

Please let me know if you would like us to proceed with the downgrade.

--
Hubert Klasa

#12 Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Ha Ka (#11)
Re: Unresolved replication hang and stop problem.

On 2021-Jun-16, Ha Ka wrote:

On Wed, 16 Jun 2021 at 12:31, Lukasz Biegaj <lukasz.biegaj@unitygroup.com>
wrote:

On 04.05.2021 16:35, Alvaro Herrera wrote:

I would suggest doing that by running the problematic
workload in the test system under "perf record -g"

We'll go with both suggestions. I expect to come back to you with
results in about a week or two.

Hi Alvaro,
We reproduced the replication issue and recorded the walsender process
using perf.

Hello, thanks, I downloaded the files but since you sent the perf.data
files there's not much I can do to usefully interpret them. Can you
please do "perf report -g > perf_report.txt" on each subdir with a
perf.data file and upload those text files? (You don't need to rerun
the test cases.)

Thanks

--
Álvaro Herrera Valdivia, Chile
"People get annoyed when you try to debug them." (Larry Wall)

#13 Ha Ka
klasahubert@gmail.com
In reply to: Alvaro Herrera (#12)
Re: Unresolved replication hang and stop problem.

Hello, thanks, I downloaded the files but since you sent the perf.data
files there's not much I can do to usefully interpret them. Can you
please do "perf report -g > perf_report.txt" on each subdir with a
perf.data file and upload those text files? (You don't need to rerun
the test cases.)
Thanks

Hi,
Here is the upload with generated reports: https://easyupload.io/p38izx
passwd: johS5jeewo

Regards

--
Hubert Klasa

#14 Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Ha Ka (#13)
Re: Unresolved replication hang and stop problem.

On 2021-Jun-16, Ha Ka wrote:

Here is the upload with generated reports: https://easyupload.io/p38izx
passwd: johS5jeewo

OK, so I downloaded that and this is the interesting entry in the
profile for the broken case:

# Samples: 5K of event 'cpu-clock'
# Event count (approx.): 59989898390
#
# Children Self Command Shared Object Symbol
# ........ ........ ........ ............. ..................................
#
100.00% 0.00% postgres postgres [.] exec_replication_command
|
---exec_replication_command
WalSndLoop
XLogSendLogical
LogicalDecodingProcessRecord
|
--99.51%--ReorderBufferQueueChange
|
|--96.06%--hash_seq_search
|
|--1.78%--ReorderBufferSerializeTXN
| |
| --0.52%--errstart
|
--0.76%--deregister_seq_scan

What this tells me is that ReorderBufferQueueChange is spending a lot of
time doing hash_seq_search, which probably is the one in
ReorderBufferTXNByXid.

I have, as yet, no idea what this means.

--
Álvaro Herrera 39°49'30"S 73°17'W

#15 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Alvaro Herrera (#14)
Re: Unresolved replication hang and stop problem.

At Wed, 16 Jun 2021 18:28:28 -0400, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in

On 2021-Jun-16, Ha Ka wrote:
# Children Self Command Shared Object Symbol
# ........ ........ ........ ............. ..................................
#
100.00% 0.00% postgres postgres [.] exec_replication_command
|
---exec_replication_command
WalSndLoop
XLogSendLogical
LogicalDecodingProcessRecord
|
--99.51%--ReorderBufferQueueChange
|
|--96.06%--hash_seq_search
|
|--1.78%--ReorderBufferSerializeTXN
| |
| --0.52%--errstart
|
--0.76%--deregister_seq_scan

What this tells me is that ReorderBufferQueueChange is spending a lot of
time doing hash_seq_search, which probably is the one in
ReorderBufferTXNByXid.

I don't see a call to hash_*seq*_search there. Instead, I see one in
ReorderBufferCheckMemoryLimit().

I added an elog line in hash_seq_search that fires only when it
is called under ReorderBufferQueueChange, then set
logical_decoding_work_mem to 64kB.

Running the following commands calls hash_seq_search (relatively) frequently.

pub=# create table t1 (a int primary key);
pub=# create publication p1 for table t1;
sub=# create table t1 (a int primary key);
sub=# create subscription s1 connection 'host=/tmp port=5432' publication p1;
pub=# insert into t1 (select a from generate_series(0, 9999) a);

The insert above makes 20 calls to ReorderBufferLargestTXN() (via
ReorderBufferCheckMemoryLimit()), which loops over hash_seq_search.

/*
* Find the largest transaction (toplevel or subxact) to evict (spill to disk).
*
* XXX With many subtransactions this might be quite slow, because we'll have
* to walk through all of them. There are some options how we could improve
* that: (a) maintain some secondary structure with transactions sorted by
* amount of changes, (b) not looking for the entirely largest transaction,
* but e.g. for transaction using at least some fraction of the memory limit,
* and (c) evicting multiple transactions at once, e.g. to free a given portion
* of the memory limit (e.g. 50%).
*/
static ReorderBufferTXN *
ReorderBufferLargestTXN(ReorderBuffer *rb)

This looks like a candidate for the culprit. The perf line for
ReorderBufferSerializeTXN supports this hypothesis.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#16 Amit Kapila
amit.kapila16@gmail.com
In reply to: Kyotaro Horiguchi (#15)
Re: Unresolved replication hang and stop problem.

On Thu, Jun 17, 2021 at 7:28 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Wed, 16 Jun 2021 18:28:28 -0400, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in

On 2021-Jun-16, Ha Ka wrote:
# Children Self Command Shared Object Symbol
# ........ ........ ........ ............. ..................................
#
100.00% 0.00% postgres postgres [.] exec_replication_command
|
---exec_replication_command
WalSndLoop
XLogSendLogical
LogicalDecodingProcessRecord
|
--99.51%--ReorderBufferQueueChange
|
|--96.06%--hash_seq_search
|
|--1.78%--ReorderBufferSerializeTXN
| |
| --0.52%--errstart
|
--0.76%--deregister_seq_scan

What this tells me is that ReorderBufferQueueChange is spending a lot of
time doing hash_seq_search, which probably is the one in
ReorderBufferTXNByXid.

I don't see a call to hash_*seq*_search there. Instead, I see one in
ReorderBufferCheckMemoryLimit().

I added an elog line in hash_seq_search that fires only when it
is called under ReorderBufferQueueChange, then set
logical_decoding_work_mem to 64kB.

Running the following commands calls hash_seq_search (relatively) frequently.

pub=# create table t1 (a int primary key);
pub=# create publication p1 for table t1;
sub=# create table t1 (a int primary key);
sub=# create subscription s1 connection 'host=/tmp port=5432' publication p1;
pub=# insert into t1 (select a from generate_series(0, 9999) a);

The insert above makes 20 calls to ReorderBufferLargestTXN() (via
ReorderBufferCheckMemoryLimit()), which loops over hash_seq_search.

If there are large transactions then someone can probably set
logical_decoding_work_mem to a higher value.
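
For illustration, a minimal sketch of what raising it on the publisher
could look like (assuming superuser access; the 256MB figure is only a
placeholder to show the syntax, not a recommendation from this thread):

pub=# SHOW logical_decoding_work_mem;                        -- 64MB is the default
pub=# ALTER SYSTEM SET logical_decoding_work_mem = '256MB';
pub=# SELECT pg_reload_conf();                               -- running walsenders should pick this up on reload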

--
With Regards,
Amit Kapila.

#17 Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Kyotaro Horiguchi (#15)
Re: Unresolved replication hang and stop problem.

On 2021-Jun-17, Kyotaro Horiguchi wrote:

I don't see a call to hash_*seq*_search there. Instead, I see one in
ReorderBufferCheckMemoryLimit().

Doh, of course -- I misread.

ReorderBufferCheckMemoryLimit is new in pg13 (cec2edfa7859) so now at
least we have a reason why this workload regresses in pg13 compared to
earlier releases.

Looking at the code, it does seem that increasing the memory limit as
Amit suggests might solve the issue. Is that a practical workaround?

--
Álvaro Herrera Valdivia, Chile

#18 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Alvaro Herrera (#17)
Re: Unresolved replication hang and stop problem.

At Thu, 17 Jun 2021 12:56:42 -0400, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in

On 2021-Jun-17, Kyotaro Horiguchi wrote:

I don't see a call to hash_*seq*_search there. Instead, I see one in
ReorderBufferCheckMemoryLimit().

Doh, of course -- I misread.

ReorderBufferCheckMemoryLimit is new in pg13 (cec2edfa7859) so now at
least we have a reason why this workload regresses in pg13 compared to
earlier releases.

Looking at the code, it does seem that increasing the memory limit as
Amit suggests might solve the issue. Is that a practical workaround?

I believe so, generally. I'm not sure about the OP's case, though.

Just increasing the working memory to a certain size would solve the
problem. A potential issue is that, as in this case, it might be hard
for users to find out that the GUC addresses their issue (if it
actually does). pg_stat_replication_slots.spill_count / spill_txns
could be a guide for setting logical_decoding_work_mem. Is it worth
having an additional explanation like that for the GUC variable in the
documentation?
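
For what it's worth, a sketch of the kind of check that could serve as
that guide, on PG-14 or later where the view exists:

pub=# SELECT slot_name, spill_txns, spill_count, spill_bytes
        FROM pg_stat_replication_slots;

A spill_count that keeps growing for a slot would suggest that
logical_decoding_work_mem is too small for the workload.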

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#19 Amit Kapila
amit.kapila16@gmail.com
In reply to: Kyotaro Horiguchi (#18)
Re: Unresolved replication hang and stop problem.

On Fri, Jun 18, 2021 at 11:22 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Thu, 17 Jun 2021 12:56:42 -0400, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in

On 2021-Jun-17, Kyotaro Horiguchi wrote:

I don't see a call to hash_*seq*_search there. Instead, I see one in
ReorderBufferCheckMemoryLimit().

Doh, of course -- I misread.

ReorderBufferCheckMemoryLimit is new in pg13 (cec2edfa7859) so now at
least we have a reason why this workload regresses in pg13 compared to
earlier releases.

Looking at the code, it does seem that increasing the memory limit as
Amit suggests might solve the issue. Is that a practical workaround?

I believe so, generally. I'm not sure about the OP's case, though.

Just increasing the working memory to a certain size would solve the
problem. A potential issue is that, as in this case, it might be hard
for users to find out that the GUC addresses their issue (if it
actually does). pg_stat_replication_slots.spill_count / spill_txns
could be a guide for setting logical_decoding_work_mem. Is it worth
having an additional explanation like that for the GUC variable in the
documentation?

I don't see any harm in doing that, but note that we can update it only
for PG-14. The view pg_stat_replication_slots was introduced in PG-14.

--
With Regards,
Amit Kapila.

#20 Ha Ka
klasahubert@gmail.com
In reply to: Amit Kapila (#19)
Re: Unresolved replication hang and stop problem.

On Sat, 19 Jun 2021 at 12:14, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jun 18, 2021 at 11:22 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Thu, 17 Jun 2021 12:56:42 -0400, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in

On 2021-Jun-17, Kyotaro Horiguchi wrote:

I don't see a call to hash_*seq*_search there. Instead, I see one in
ReorderBufferCheckMemoryLimit().

Doh, of course -- I misread.

ReorderBufferCheckMemoryLimit is new in pg13 (cec2edfa7859) so now at
least we have a reason why this workload regresses in pg13 compared to
earlier releases.

Looking at the code, it does seem that increasing the memory limit as
Amit suggests might solve the issue. Is that a practical workaround?

I believe so, generally. I'm not sure about the OP's case, though.

Just increasing the working memory to a certain size would solve the
problem. A potential issue is that, as in this case, it might be hard
for users to find out that the GUC addresses their issue (if it
actually does). pg_stat_replication_slots.spill_count / spill_txns
could be a guide for setting logical_decoding_work_mem. Is it worth
having an additional explanation like that for the GUC variable in the
documentation?

I don't see any harm in doing that, but note that we can update it only
for PG-14. The view pg_stat_replication_slots was introduced in PG-14.

--
With Regards,
Amit Kapila.

We increased logical_decoding_work_mem for our production database
from 64 to 192 MB and it looks like the issue still persists. The
frequency with which replication hangs has remained the same. Do you
need any additional perf reports after our change?

--
Regards,
Hubert Klasa.

#21 Amit Kapila
amit.kapila16@gmail.com
In reply to: Ha Ka (#20)
Re: Unresolved replication hang and stop problem.

On Tue, Aug 10, 2021 at 8:15 PM Ha Ka <klasahubert@gmail.com> wrote:

On Sat, 19 Jun 2021 at 12:14, Amit Kapila <amit.kapila16@gmail.com> wrote:

We increased logical_decoding_work_mem for our production database
from 64 to 192 MB and it looks like the issue still persists. The
frequency with which replication hangs has remained the same.

Sounds strange. I think one thing to identify is whether there are a
very large number of in-progress transactions at the time the slowdown
happens, because the profile shared last time seems to spend more time
in hash_seq_search than in actually serializing the xact. Another
possibility to try out for your case is to just always serialize the
current xact and see what happens; this might not be an actual solution
but can help in diagnosing the problem.
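
As a rough, indicative check only (the decoder works through historical
WAL, so the current contents of pg_stat_activity are at best a proxy),
something like this on the publisher could show how many write
transactions are open when the hang occurs:

pub=# SELECT count(*) AS open_write_xacts,
             min(xact_start) AS oldest_xact_start
        FROM pg_stat_activity
        WHERE backend_xid IS NOT NULL;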

Do you
need any additional perf reports after our change?

It might be good if you could share the walsender portion of the perf
report, as shared in one of the emails above.

--
With Regards,
Amit Kapila.