Reload configuration more frequently in apply worker.

Started by Zhijie Hou (Fujitsu)over 2 years ago6 messages

houzj.fnst@fujitsu.com

over 2 years ago

2 attachment(s)

Hi,

Currently, the main loop of apply worker looks like below[1]for(;;) /* outer loop */ { for(;;) /* inner loop */ { len = walrcv_receive() if (len == 0) break; ... apply change }. Since there are
two loops, the inner loop will keep receiving and applying message from
publisher until no more message left. The worker only reloads the configuration in
the outer loop. This means if the publisher keeps sending messages (it could
keep sending multiple transactions), the apply worker won't get a chance to
update the GUCs.

[1]: for(;;) /* outer loop */ { for(;;) /* inner loop */ { len = walrcv_receive() if (len == 0) break; ... apply change }
for(;;) /* outer loop */
{
for(;;) /* inner loop */
{
len = walrcv_receive()
if (len == 0)
break;
...
apply change
}

...
if (ConfigReloadPending)
{
ConfigReloadPending = false;
ProcessConfigFile(PGC_SIGHUP);
}
...
}

I think it would be better that the apply worker can reflect user's
configuration changes sooner. To achieve this, we can add one more
ProcessConfigFile() call in the inner loop. Attach the patch for the same. What
do you think ?

BTW, I saw one BF failure[2]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2023-05-12%2008%3A05%3A41 (it's very rare and only happened once in 4
months) which I think is due to the low frequent reload in apply worker.

The attached tap test shows how the failure happened.

The test use streaming parallel mode and change logical_replication_mode to
immediate, we expect serialization to happen in the test. To reproduce the failure
easier, we need to add a sleep(1s) in the inner loop of apply worker so
that the apply worker won't be able to consume all messages quickly and will be
busy in the inner loop. Then the attached test will fail because the leader
apply didn't reload the configuration, thus serialization didn't happen.

[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2023-05-12%2008%3A05%3A41

Best Regards,
Hou zj

Attachments:

0001-Reload-configuration-more-frequently-in-apply-worker.patchapplication/octet-stream; name=0001-Reload-configuration-more-frequently-in-apply-worker.patchDownload

From 21de63cccf1040567e3f87ccf373c24eb5894de6 Mon Sep 17 00:00:00 2001
From: Hou Zhijie <houzj.fnst@cn.fujitsu.com>
Date: Tue, 16 May 2023 16:28:19 +0800
Subject: [PATCH] Reload configuration more frequently in apply worker.

---
 src/backend/replication/logical/worker.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 879309b316..2ea10b1a4a 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -3562,6 +3562,12 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 					int			c;
 					StringInfoData s;
 
+					if (ConfigReloadPending)
+					{
+						ConfigReloadPending = false;
+						ProcessConfigFile(PGC_SIGHUP);
+					}
+
 					/* Reset timeout. */
 					last_recv_timestamp = GetCurrentTimestamp();
 					ping_sent = false;
-- 
2.30.0.windows.2

000_failure_reproduce.pl.txttext/plain; name=000_failure_reproduce.pl.txtDownload

Amit Kapila

amit.kapila16@gmail.com

over 2 years ago

In reply to: Zhijie Hou (Fujitsu) (#1)

Re: Reload configuration more frequently in apply worker.

On Wed, May 17, 2023 at 7:18 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

Currently, the main loop of apply worker looks like below[1]. Since there are
two loops, the inner loop will keep receiving and applying message from
publisher until no more message left. The worker only reloads the configuration in
the outer loop. This means if the publisher keeps sending messages (it could
keep sending multiple transactions), the apply worker won't get a chance to
update the GUCs.

Apart from that, I think in rare cases, it seems possible that after
the apply worker has waited for the data and just before it receives
the new replication data/message, the reload happens, then it won't
get a chance to process the reload before processing the new message.
I think such a theory can explain the rare BF failure you pointed out
later in the thread. Does that make sense?

[1]
for(;;) /* outer loop */
{
for(;;) /* inner loop */
{
len = walrcv_receive()
if (len == 0)
break;
...
apply change
}

...
if (ConfigReloadPending)
{
ConfigReloadPending = false;
ProcessConfigFile(PGC_SIGHUP);
}
...
}

I think it would be better that the apply worker can reflect user's
configuration changes sooner. To achieve this, we can add one more
ProcessConfigFile() call in the inner loop. Attach the patch for the same. What
do you think ?

I think it appears to somewhat match what Tom said in the third point
in his email [1]/messages/by-id/2138662.1623460441@sss.pgh.pa.us.

BTW, I saw one BF failure[2] (it's very rare and only happened once in 4
months) which I think is due to the low frequent reload in apply worker.

The attached tap test shows how the failure happened.

I haven't yet tried to reproduce it but will try later sometime.
Thanks for your analysis.

[1]: /messages/by-id/2138662.1623460441@sss.pgh.pa.us

--
With Regards,
Amit Kapila.

Zhijie Hou (Fujitsu)

houzj.fnst@fujitsu.com

over 2 years ago

In reply to: Amit Kapila (#2)

RE: Reload configuration more frequently in apply worker.

On Wednesday, May 17, 2023 11:05 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, May 17, 2023 at 7:18 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
wrote:

Currently, the main loop of apply worker looks like below[1]. Since
there are two loops, the inner loop will keep receiving and applying
message from publisher until no more message left. The worker only
reloads the configuration in the outer loop. This means if the
publisher keeps sending messages (it could keep sending multiple
transactions), the apply worker won't get a chance to update the GUCs.

Apart from that, I think in rare cases, it seems possible that after the apply
worker has waited for the data and just before it receives the new replication
data/message, the reload happens, then it won't get a chance to process the
reload before processing the new message.
I think such a theory can explain the rare BF failure you pointed out later in the
thread. Does that make sense?

Yes, that makes sense. That's another case where we would miss the reload and I think
is the reason for the failure because the apply worker has finished applying changes(which
means it's idle) before the failed case.

Best Regards,
Hou zj

Amit Kapila

amit.kapila16@gmail.com

over 2 years ago

In reply to: Amit Kapila (#2)

Re: Reload configuration more frequently in apply worker.

On Wed, May 17, 2023 at 8:34 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, May 17, 2023 at 7:18 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

The attached tap test shows how the failure happened.

I haven't yet tried to reproduce it but will try later sometime.

I am able to reproduce the problem with the script and steps mentioned
by you. The patch provided by you fixes the problem and looks good to
me. I'll push this by Wednesday unless there are any
comments/suggestions. Though this problem exists in previous branches
as well, but as there are no field reports and the problem also
doesn't appear to be a common problem, so I intend to push only for
v16. What do you or others think?

--
With Regards,
Amit Kapila.

Zhijie Hou (Fujitsu)

houzj.fnst@fujitsu.com

over 2 years ago

In reply to: Amit Kapila (#4)

RE: Reload configuration more frequently in apply worker.

On Monday, June 5, 2023 8:03 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, May 17, 2023 at 8:34 AM Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Wed, May 17, 2023 at 7:18 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

The attached tap test shows how the failure happened.

I haven't yet tried to reproduce it but will try later sometime.

I am able to reproduce the problem with the script and steps mentioned by
you. The patch provided by you fixes the problem and looks good to me. I'll
push this by Wednesday unless there are any comments/suggestions. Though
this problem exists in previous branches as well, but as there are no field
reports and the problem also doesn't appear to be a common problem, so I
intend to push only for v16. What do you or others think?

Thanks for the review. I agree that we can push only for V16.

Best Regards,
Hou zj

Amit Kapila

amit.kapila16@gmail.com

over 2 years ago

In reply to: Zhijie Hou (Fujitsu) (#5)

Re: Reload configuration more frequently in apply worker.

On Tue, Jun 6, 2023 at 7:45 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Monday, June 5, 2023 8:03 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I am able to reproduce the problem with the script and steps mentioned by
you. The patch provided by you fixes the problem and looks good to me. I'll
push this by Wednesday unless there are any comments/suggestions. Though
this problem exists in previous branches as well, but as there are no field
reports and the problem also doesn't appear to be a common problem, so I
intend to push only for v16. What do you or others think?

Thanks for the review. I agree that we can push only for V16.

Pushed.

--
With Regards,
Amit Kapila.