tablesync patch broke the assumption that logical rep depends on?

Started by Fujii Masaoover 8 years ago24 messages
#1Fujii Masao
Fujii Masao
masao.fujii@gmail.com

Hi,

src/backend/replication/logical/launcher.c
* Worker started and attached to our shmem. This check is safe
* because only launcher ever starts the workers, so nobody can steal
* the worker slot.

The tablesync patch enabled even worker to start another worker.
So the above assumption is not valid for now.

This issue seems to cause the corner case where the launcher picks up
the same worker slot that previously-started worker has already picked
up to start another worker.

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Noah Misch
Noah Misch
noah@leadboat.com
In reply to: Fujii Masao (#1)
Re: tablesync patch broke the assumption that logical rep depends on?

On Tue, Apr 11, 2017 at 02:28:44AM +0900, Fujii Masao wrote:

src/backend/replication/logical/launcher.c
* Worker started and attached to our shmem. This check is safe
* because only launcher ever starts the workers, so nobody can steal
* the worker slot.

The tablesync patch enabled even worker to start another worker.
So the above assumption is not valid for now.

This issue seems to cause the corner case where the launcher picks up
the same worker slot that previously-started worker has already picked
up to start another worker.

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Peter,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1]/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.

[1]: /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Peter Eisentraut
Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Fujii Masao (#1)
Re: tablesync patch broke the assumption that logical rep depends on?

On 4/10/17 13:28, Fujii Masao wrote:

src/backend/replication/logical/launcher.c
* Worker started and attached to our shmem. This check is safe
* because only launcher ever starts the workers, so nobody can steal
* the worker slot.

The tablesync patch enabled even worker to start another worker.
So the above assumption is not valid for now.

This issue seems to cause the corner case where the launcher picks up
the same worker slot that previously-started worker has already picked
up to start another worker.

I think what the comment should rather say is that workers are always
started through logicalrep_worker_launch() and worker slots are always
handed out while holding LogicalRepWorkerLock exclusively, so nobody can
steal the worker slot.

Does that make sense?

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Fujii Masao
Fujii Masao
masao.fujii@gmail.com
In reply to: Peter Eisentraut (#3)
Re: tablesync patch broke the assumption that logical rep depends on?

On Fri, Apr 14, 2017 at 1:28 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 4/10/17 13:28, Fujii Masao wrote:

src/backend/replication/logical/launcher.c
* Worker started and attached to our shmem. This check is safe
* because only launcher ever starts the workers, so nobody can steal
* the worker slot.

The tablesync patch enabled even worker to start another worker.
So the above assumption is not valid for now.

This issue seems to cause the corner case where the launcher picks up
the same worker slot that previously-started worker has already picked
up to start another worker.

I think what the comment should rather say is that workers are always
started through logicalrep_worker_launch() and worker slots are always
handed out while holding LogicalRepWorkerLock exclusively, so nobody can
steal the worker slot.

Does that make sense?

No unless I'm missing something.

logicalrep_worker_launch() picks up unused worker slot (slot's proc == NULL)
while holding LogicalRepWorkerLock. But it releases the lock before the slot
is marked as used (i.e., slot is set to non-NULL). Then newly-launched worker
calls logicalrep_worker_attach() and marks the slot as used.

So if another logicalrep_worker_launch() starts after LogicalRepWorkerLock
is released before the slot is marked as used, it can pick up the same slot
because that slot looks unused.

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Petr Jelinek
Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Fujii Masao (#4)
Re: tablesync patch broke the assumption that logical rep depends on?

On 13/04/17 19:31, Fujii Masao wrote:

On Fri, Apr 14, 2017 at 1:28 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 4/10/17 13:28, Fujii Masao wrote:

src/backend/replication/logical/launcher.c
* Worker started and attached to our shmem. This check is safe
* because only launcher ever starts the workers, so nobody can steal
* the worker slot.

The tablesync patch enabled even worker to start another worker.
So the above assumption is not valid for now.

This issue seems to cause the corner case where the launcher picks up
the same worker slot that previously-started worker has already picked
up to start another worker.

I think what the comment should rather say is that workers are always
started through logicalrep_worker_launch() and worker slots are always
handed out while holding LogicalRepWorkerLock exclusively, so nobody can
steal the worker slot.

Does that make sense?

No unless I'm missing something.

logicalrep_worker_launch() picks up unused worker slot (slot's proc == NULL)
while holding LogicalRepWorkerLock. But it releases the lock before the slot
is marked as used (i.e., slot is set to non-NULL). Then newly-launched worker
calls logicalrep_worker_attach() and marks the slot as used.

So if another logicalrep_worker_launch() starts after LogicalRepWorkerLock
is released before the slot is marked as used, it can pick up the same slot
because that slot looks unused.

Yeah I think it's less of a problem of that comment than the fact that
logicalrep_worker_launch isn't concurrency safe. We need in_use marker
for the workers and update it as needed instead of relying on pgproc.
I'll write up something over the weekend.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Steve Singer
Steve Singer
steve@ssinger.info
In reply to: Fujii Masao (#1)
Re: tablesync patch broke the assumption that logical rep depends on?

On 04/10/2017 01:28 PM, Fujii Masao wrote:

Hi,

src/backend/replication/logical/launcher.c
* Worker started and attached to our shmem. This check is safe
* because only launcher ever starts the workers, so nobody can steal
* the worker slot.

The tablesync patch enabled even worker to start another worker.
So the above assumption is not valid for now.

This issue seems to cause the corner case where the launcher picks up
the same worker slot that previously-started worker has already picked
up to start another worker.

Regards,

I'm not sure if what I am seeing is related to this or another issue.

If I create a subscription for a publication that doesn't exist (yet) I
see 'publication mypub does not exist" in the subscribers log

If I then create the publication on the publisher the error doesn't go
away, the worker process keeps spinning with the same error.

If I then drop the subscription and recreate it it works.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Petr Jelinek
Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Steve Singer (#6)
Re: tablesync patch broke the assumption that logical rep depends on?

On 16/04/17 21:27, Steve Singer wrote:

On 04/10/2017 01:28 PM, Fujii Masao wrote:

Hi,

src/backend/replication/logical/launcher.c
* Worker started and attached to our shmem. This check is safe
* because only launcher ever starts the workers, so nobody
can steal
* the worker slot.

The tablesync patch enabled even worker to start another worker.
So the above assumption is not valid for now.

This issue seems to cause the corner case where the launcher picks up
the same worker slot that previously-started worker has already picked
up to start another worker.

Regards,

I'm not sure if what I am seeing is related to this or another issue.

If I create a subscription for a publication that doesn't exist (yet) I
see 'publication mypub does not exist" in the subscribers log

If I then create the publication on the publisher the error doesn't go
away, the worker process keeps spinning with the same error.

If I then drop the subscription and recreate it it works.

No that's a result of how logical decoding/slots work. We see catalogs
as what they looked like at the specific point in time. If you create
slot (by creating subscription) it will not see publication that was
created after until decoding on that slot reaches point in time when it
was actually created.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Noah Misch
Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#2)
Re: tablesync patch broke the assumption that logical rep depends on?

On Thu, Apr 13, 2017 at 04:56:05AM +0000, Noah Misch wrote:

On Tue, Apr 11, 2017 at 02:28:44AM +0900, Fujii Masao wrote:

src/backend/replication/logical/launcher.c
* Worker started and attached to our shmem. This check is safe
* because only launcher ever starts the workers, so nobody can steal
* the worker slot.

The tablesync patch enabled even worker to start another worker.
So the above assumption is not valid for now.

This issue seems to cause the corner case where the launcher picks up
the same worker slot that previously-started worker has already picked
up to start another worker.

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Peter,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.

[1] /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

This PostgreSQL 10 open item is past due for your status update. Kindly send
a status update within 24 hours, and include a date for your subsequent status
update. Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Peter Eisentraut
Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Noah Misch (#8)
Re: tablesync patch broke the assumption that logical rep depends on?

On 4/16/17 22:01, Noah Misch wrote:

This PostgreSQL 10 open item is past due for your status update. Kindly send
a status update within 24 hours, and include a date for your subsequent status
update. Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

I think we're not really sure yet what to do about this. Discussion is
ongoing. I'll report back on Wednesday.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Petr Jelinek
Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Peter Eisentraut (#9)
1 attachment(s)
Re: tablesync patch broke the assumption that logical rep depends on?

On 17/04/17 20:09, Peter Eisentraut wrote:

On 4/16/17 22:01, Noah Misch wrote:

This PostgreSQL 10 open item is past due for your status update. Kindly send
a status update within 24 hours, and include a date for your subsequent status
update. Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

I think we're not really sure yet what to do about this. Discussion is
ongoing. I'll report back on Wednesday.

So my idea was to add some kind of inuse flag. This turned out to be bit
more complicated in terms of how to clean it than I would have hoped.
This is due to the fact that there is no way to reliably tell if worker
has failed to start if the parent worker crashed while waiting.

My solution to that is to use similar logic to autovacuum where we use
timeout for worker to attach to shmem. We do this only if there is no
free slot found when launch of replication worker was requested.

While working on this patch I also noticed other subtle concurrency
issues and fixed them as well - stopping worker that hasn't finished
starting yet wasn't completely concurrency safe and limiting sync
workers per subscription theoretically wasn't either (although I don't
think it could happen in practice).

I do wonder now though if it's such a good idea to have the
BackgroundWorkerHandle private to the bgworker.c given that this is the
3rd time (twice before was outside of postgres core) I had to write
similar generation mechanism that it uses for unique bgworker
authentication inside the process which started it. It would have been
much easier if I could just save the BackgroundWorkerHandle itself to
shmem so it could be used across processes instead of having to reinvent
it every time.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Fix-various-concurrency-issues-in-logical-replicatio.patchtext/plain; charset=UTF-8; name=0001-Fix-various-concurrency-issues-in-logical-replicatio.patch
#11Peter Eisentraut
Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Petr Jelinek (#10)
Re: tablesync patch broke the assumption that logical rep depends on?

On 4/18/17 22:13, Petr Jelinek wrote:

So my idea was to add some kind of inuse flag. This turned out to be bit
more complicated in terms of how to clean it than I would have hoped.
This is due to the fact that there is no way to reliably tell if worker
has failed to start if the parent worker crashed while waiting.

My solution to that is to use similar logic to autovacuum where we use
timeout for worker to attach to shmem. We do this only if there is no
free slot found when launch of replication worker was requested.

It looks like launch_time is never set the current time in your patch.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Petr Jelinek
Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Peter Eisentraut (#11)
1 attachment(s)
Re: tablesync patch broke the assumption that logical rep depends on?

On 20/04/17 18:58, Peter Eisentraut wrote:

On 4/18/17 22:13, Petr Jelinek wrote:

So my idea was to add some kind of inuse flag. This turned out to be bit
more complicated in terms of how to clean it than I would have hoped.
This is due to the fact that there is no way to reliably tell if worker
has failed to start if the parent worker crashed while waiting.

My solution to that is to use similar logic to autovacuum where we use
timeout for worker to attach to shmem. We do this only if there is no
free slot found when launch of replication worker was requested.

It looks like launch_time is never set the current time in your patch.

Oops, fixed.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Fix-various-concurrency-issues-in-logical-replicatiov2.patchtext/plain; charset=UTF-8; name=0001-Fix-various-concurrency-issues-in-logical-replicatiov2.patch
#13Noah Misch
Noah Misch
noah@leadboat.com
In reply to: Peter Eisentraut (#9)
Re: tablesync patch broke the assumption that logical rep depends on?

On Mon, Apr 17, 2017 at 02:09:54PM -0400, Peter Eisentraut wrote:

I think we're not really sure yet what to do about this. Discussion is
ongoing. I'll report back on Wednesday.

This PostgreSQL 10 open item is past due for your status update, and this is
overall the seventh time you have you allowed one of your open items to go out
of compliance. Kindly start complying with the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

At a bare minimum, send a status update within 24 hours, and include a date
for your subsequent status update.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Peter Eisentraut
Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Petr Jelinek (#12)
Re: tablesync patch broke the assumption that logical rep depends on?

On 4/20/17 14:29, Petr Jelinek wrote:

+		/* Find unused worker slot. */
+		if (!w->in_use)
{
-			worker = &LogicalRepCtx->workers[slot];
-			break;
+			worker = w;
+			slot = i;
+		}

Doesn't this still need a break? Otherwise it always picks the last slot.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Peter Eisentraut
Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Noah Misch (#13)
Re: tablesync patch broke the assumption that logical rep depends on?

On 4/20/17 22:24, Noah Misch wrote:

On Mon, Apr 17, 2017 at 02:09:54PM -0400, Peter Eisentraut wrote:

I think we're not really sure yet what to do about this. Discussion is
ongoing. I'll report back on Wednesday.

This PostgreSQL 10 open item is past due for your status update, and this is
overall the seventh time you have you allowed one of your open items to go out
of compliance. Kindly start complying with the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

We are actively working on it. It should be resolved within a few days.
Next check-in Monday.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Petr Jelinek
Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Peter Eisentraut (#14)
Re: tablesync patch broke the assumption that logical rep depends on?

On 21/04/17 16:09, Peter Eisentraut wrote:

On 4/20/17 14:29, Petr Jelinek wrote:

+		/* Find unused worker slot. */
+		if (!w->in_use)
{
-			worker = &LogicalRepCtx->workers[slot];
-			break;
+			worker = w;
+			slot = i;
+		}

Doesn't this still need a break? Otherwise it always picks the last slot.

Yes it will pick the last slot, does that matter though, is the first
one better somehow?

We can't break because we also need to continue the counter (I think the
issue that the counter solves is probably just theoretical, but still).

Hmm actually, maybe the if (!w->in_use) should be if (worker == NULL &&
!w->in_use)?

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Peter Eisentraut
Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Petr Jelinek (#16)
Re: tablesync patch broke the assumption that logical rep depends on?

On 4/21/17 10:11, Petr Jelinek wrote:

On 21/04/17 16:09, Peter Eisentraut wrote:

On 4/20/17 14:29, Petr Jelinek wrote:

+		/* Find unused worker slot. */
+		if (!w->in_use)
{
-			worker = &LogicalRepCtx->workers[slot];
-			break;
+			worker = w;
+			slot = i;
+		}

Doesn't this still need a break? Otherwise it always picks the last slot.

Yes it will pick the last slot, does that matter though, is the first
one better somehow?

We can't break because we also need to continue the counter (I think the
issue that the counter solves is probably just theoretical, but still).

I see. I think the code would be less confusing if we break the loop
like before and call logicalrep_sync_worker_count() separately.

Hmm actually, maybe the if (!w->in_use) should be if (worker == NULL &&
!w->in_use)?

That would also do it. But it's getting a bit fiddly.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18Petr Jelinek
Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Peter Eisentraut (#17)
Re: tablesync patch broke the assumption that logical rep depends on?

On 21/04/17 16:23, Peter Eisentraut wrote:

On 4/21/17 10:11, Petr Jelinek wrote:

On 21/04/17 16:09, Peter Eisentraut wrote:

On 4/20/17 14:29, Petr Jelinek wrote:

+		/* Find unused worker slot. */
+		if (!w->in_use)
{
-			worker = &LogicalRepCtx->workers[slot];
-			break;
+			worker = w;
+			slot = i;
+		}

Doesn't this still need a break? Otherwise it always picks the last slot.

Yes it will pick the last slot, does that matter though, is the first
one better somehow?

We can't break because we also need to continue the counter (I think the
issue that the counter solves is probably just theoretical, but still).

I see. I think the code would be less confusing if we break the loop
like before and call logicalrep_sync_worker_count() separately.

Hmm actually, maybe the if (!w->in_use) should be if (worker == NULL &&
!w->in_use)?

That would also do it. But it's getting a bit fiddly.

I just wanted to avoid looping twice, especially since the garbage
collecting code has to also do the loop. I guess I'll go with my
original coding for this then which was to put retry label above the
loop first, then try finding worker slot, if found call the
logicalrep_sync_worker_count and if not found do the garbage collection
and if we cleaned up something then goto retry.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19Petr Jelinek
Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Petr Jelinek (#18)
1 attachment(s)
Re: tablesync patch broke the assumption that logical rep depends on?

On 21/04/17 16:31, Petr Jelinek wrote:

On 21/04/17 16:23, Peter Eisentraut wrote:

On 4/21/17 10:11, Petr Jelinek wrote:

On 21/04/17 16:09, Peter Eisentraut wrote:

On 4/20/17 14:29, Petr Jelinek wrote:

+		/* Find unused worker slot. */
+		if (!w->in_use)
{
-			worker = &LogicalRepCtx->workers[slot];
-			break;
+			worker = w;
+			slot = i;
+		}

Doesn't this still need a break? Otherwise it always picks the last slot.

Yes it will pick the last slot, does that matter though, is the first
one better somehow?

We can't break because we also need to continue the counter (I think the
issue that the counter solves is probably just theoretical, but still).

I see. I think the code would be less confusing if we break the loop
like before and call logicalrep_sync_worker_count() separately.

Hmm actually, maybe the if (!w->in_use) should be if (worker == NULL &&
!w->in_use)?

That would also do it. But it's getting a bit fiddly.

I just wanted to avoid looping twice, especially since the garbage
collecting code has to also do the loop. I guess I'll go with my
original coding for this then which was to put retry label above the
loop first, then try finding worker slot, if found call the
logicalrep_sync_worker_count and if not found do the garbage collection
and if we cleaned up something then goto retry.

Here is the patch doing just that.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

Fix-various-concurrency-issues-in-logical-replicatiov3.patchtext/plain; charset=UTF-8; name=Fix-various-concurrency-issues-in-logical-replicatiov3.patch
#20Petr Jelinek
Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Petr Jelinek (#19)
1 attachment(s)
Re: tablesync patch broke the assumption that logical rep depends on?

On 22/04/17 22:09, Petr Jelinek wrote:

On 21/04/17 16:31, Petr Jelinek wrote:

On 21/04/17 16:23, Peter Eisentraut wrote:

On 4/21/17 10:11, Petr Jelinek wrote:

On 21/04/17 16:09, Peter Eisentraut wrote:

On 4/20/17 14:29, Petr Jelinek wrote:

+		/* Find unused worker slot. */
+		if (!w->in_use)
{
-			worker = &LogicalRepCtx->workers[slot];
-			break;
+			worker = w;
+			slot = i;
+		}

Doesn't this still need a break? Otherwise it always picks the last slot.

Yes it will pick the last slot, does that matter though, is the first
one better somehow?

We can't break because we also need to continue the counter (I think the
issue that the counter solves is probably just theoretical, but still).

I see. I think the code would be less confusing if we break the loop
like before and call logicalrep_sync_worker_count() separately.

Hmm actually, maybe the if (!w->in_use) should be if (worker == NULL &&
!w->in_use)?

That would also do it. But it's getting a bit fiddly.

I just wanted to avoid looping twice, especially since the garbage
collecting code has to also do the loop. I guess I'll go with my
original coding for this then which was to put retry label above the
loop first, then try finding worker slot, if found call the
logicalrep_sync_worker_count and if not found do the garbage collection
and if we cleaned up something then goto retry.

Here is the patch doing just that.

And one more revision which also checks in_use when attaching shared
memory. This is mainly to improve the user visible behavior in
theoretical corner case when the worker is supposed to be cleaned up but
actually still manages to start (it would still fail even without this
change but the error message would be more obscure).

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

Fix-various-concurrency-issues-in-logical-replicatiov4.patchbinary/octet-stream; name=Fix-various-concurrency-issues-in-logical-replicatiov4.patch
#21Peter Eisentraut
Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Petr Jelinek (#20)
Re: tablesync patch broke the assumption that logical rep depends on?

On 4/25/17 15:42, Petr Jelinek wrote:

Here is the patch doing just that.

And one more revision which also checks in_use when attaching shared
memory. This is mainly to improve the user visible behavior in
theoretical corner case when the worker is supposed to be cleaned up but
actually still manages to start (it would still fail even without this
change but the error message would be more obscure).

Committed that, with some editorializing.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#22Jeff Janes
Jeff Janes
jeff.janes@gmail.com
In reply to: Peter Eisentraut (#21)
Re: tablesync patch broke the assumption that logical rep depends on?

On Wed, Apr 26, 2017 at 8:00 AM, Peter Eisentraut <
peter.eisentraut@2ndquadrant.com> wrote:

On 4/25/17 15:42, Petr Jelinek wrote:

Here is the patch doing just that.

And one more revision which also checks in_use when attaching shared
memory. This is mainly to improve the user visible behavior in
theoretical corner case when the worker is supposed to be cleaned up but
actually still manages to start (it would still fail even without this
change but the error message would be more obscure).

Committed that, with some editorializing.

This gives me compiler warning:

launcher.c: In function 'logicalrep_worker_launch':
launcher.c:257: warning: 'slot' may be used uninitialized in this function

gcc version 4.4.7 20120313 (Red Hat 4.4.7-18) (GCC)

Cheers,

Jeff

#23Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Janes (#22)
Re: tablesync patch broke the assumption that logical rep depends on?

Jeff Janes <jeff.janes@gmail.com> writes:

This gives me compiler warning:
launcher.c: In function 'logicalrep_worker_launch':
launcher.c:257: warning: 'slot' may be used uninitialized in this function

Yeah, me too. Fix pushed.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#24Neha Khatri
Neha Khatri
nehakhatri5@gmail.com
In reply to: Tom Lane (#23)
Re: tablesync patch broke the assumption that logical rep depends on?

On Thu, Apr 27, 2017 at 4:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Jeff Janes <jeff.janes@gmail.com> writes:

This gives me compiler warning:
launcher.c: In function 'logicalrep_worker_launch':
launcher.c:257: warning: 'slot' may be used uninitialized in this

function

Yeah, me too. Fix pushed.

Somewhat off the mark, but related to warning above, would you get similar
warning for "address" in ProcessUtilitySlow() in utility.c.

Regards,
Neha