logical replication and PANIC during shutdown checkpoint in publisher

Started by Fujii Masaoabout 9 years ago67 messageshackers
Jump to latest
#1Fujii Masao
masao.fujii@gmail.com

Hi,

When I shut down the publisher while I repeated creating and dropping
the subscription in the subscriber, the publisher emitted the following
PANIC error during shutdown checkpoint.

PANIC: concurrent transaction log activity while database system is
shutting down

The cause of this problem is that walsender for logical replication can
generate WAL records even during shutdown checkpoint.

Firstly walsender keeps running until shutdown checkpoint finishes
so that all the WAL including shutdown checkpoint record can be
replicated to the standby. This was safe because previously walsender
could not generate WAL records. However this assumption became
invalid because of logical replication. That is, currenty walsender for
logical replication can generate WAL records, for example, by executing
CREATE_REPLICATION_SLOT command. This is an oversight in
logical replication patch, I think.

To fix this issue, we should terminate walsender for logical replication
before shutdown checkpoint starts. Of course walsender for physical
replication still needs to keep running until shutdown checkpoint ends,
though.

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Peter Eisentraut
peter_e@gmx.net
In reply to: Fujii Masao (#1)
Re: logical replication and PANIC during shutdown checkpoint in publisher

On 4/12/17 09:55, Fujii Masao wrote:

To fix this issue, we should terminate walsender for logical replication
before shutdown checkpoint starts. Of course walsender for physical
replication still needs to keep running until shutdown checkpoint ends,
though.

Can we turn it into a kind of read-only or no-new-commands mode instead,
so it can keep streaming but not accept any new actions?

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Fujii Masao
masao.fujii@gmail.com
In reply to: Peter Eisentraut (#2)
Re: logical replication and PANIC during shutdown checkpoint in publisher

On Thu, Apr 13, 2017 at 5:25 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 4/12/17 09:55, Fujii Masao wrote:

To fix this issue, we should terminate walsender for logical replication
before shutdown checkpoint starts. Of course walsender for physical
replication still needs to keep running until shutdown checkpoint ends,
though.

Can we turn it into a kind of read-only or no-new-commands mode instead,
so it can keep streaming but not accept any new actions?

So we make walsenders switch to that mode and wait for all the already-ongoing
their "write" commands to finish, and then we start a shutdown checkpoint?
This is an idea, but seems a bit complicated. ISTM that it's simpler to
terminate only walsenders for logical rep before shutdown checkpoint.

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Michael Paquier
michael@paquier.xyz
In reply to: Fujii Masao (#3)
Re: logical replication and PANIC during shutdown checkpoint in publisher

On Thu, Apr 13, 2017 at 12:28 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Apr 13, 2017 at 5:25 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 4/12/17 09:55, Fujii Masao wrote:

To fix this issue, we should terminate walsender for logical replication
before shutdown checkpoint starts. Of course walsender for physical
replication still needs to keep running until shutdown checkpoint ends,
though.

Can we turn it into a kind of read-only or no-new-commands mode instead,
so it can keep streaming but not accept any new actions?

So we make walsenders switch to that mode and wait for all the already-ongoing
their "write" commands to finish, and then we start a shutdown checkpoint?
This is an idea, but seems a bit complicated. ISTM that it's simpler to
terminate only walsenders for logical rep before shutdown checkpoint.

Perhaps my memory is failing me here... But in clean shutdowns we do
shut down WAL senders after the checkpoint has completed so as we are
sure that they have flushed the LSN corresponding to the checkpoint
ending, right? Why introducing an inconsistency for logical workers?
It seems to me that logical workers should fail under the same rules.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#4)
Re: logical replication and PANIC during shutdown checkpoint in publisher

On Thu, Apr 13, 2017 at 12:36 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Apr 13, 2017 at 12:28 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Apr 13, 2017 at 5:25 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 4/12/17 09:55, Fujii Masao wrote:

To fix this issue, we should terminate walsender for logical replication
before shutdown checkpoint starts. Of course walsender for physical
replication still needs to keep running until shutdown checkpoint ends,
though.

Can we turn it into a kind of read-only or no-new-commands mode instead,
so it can keep streaming but not accept any new actions?

So we make walsenders switch to that mode and wait for all the already-ongoing
their "write" commands to finish, and then we start a shutdown checkpoint?
This is an idea, but seems a bit complicated. ISTM that it's simpler to
terminate only walsenders for logical rep before shutdown checkpoint.

Perhaps my memory is failing me here... But in clean shutdowns we do
shut down WAL senders after the checkpoint has completed so as we are
sure that they have flushed the LSN corresponding to the checkpoint
ending, right?

Yes.

Why introducing an inconsistency for logical workers?
It seems to me that logical workers should fail under the same rules.

Could you tell me why? You think that even walsender for logical rep
needs to stream the shutdown checkpoint WAL record to the subscriber?
I was not thinking that's true.

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Michael Paquier
michael@paquier.xyz
In reply to: Fujii Masao (#5)
Re: logical replication and PANIC during shutdown checkpoint in publisher

On Fri, Apr 14, 2017 at 3:03 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Apr 13, 2017 at 12:36 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Apr 13, 2017 at 12:28 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Apr 13, 2017 at 5:25 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 4/12/17 09:55, Fujii Masao wrote:

To fix this issue, we should terminate walsender for logical replication
before shutdown checkpoint starts. Of course walsender for physical
replication still needs to keep running until shutdown checkpoint ends,
though.

Can we turn it into a kind of read-only or no-new-commands mode instead,
so it can keep streaming but not accept any new actions?

So we make walsenders switch to that mode and wait for all the already-ongoing
their "write" commands to finish, and then we start a shutdown checkpoint?
This is an idea, but seems a bit complicated. ISTM that it's simpler to
terminate only walsenders for logical rep before shutdown checkpoint.

Perhaps my memory is failing me here... But in clean shutdowns we do
shut down WAL senders after the checkpoint has completed so as we are
sure that they have flushed the LSN corresponding to the checkpoint
ending, right?

Yes.

Why introducing an inconsistency for logical workers?
It seems to me that logical workers should fail under the same rules.

Could you tell me why? You think that even walsender for logical rep
needs to stream the shutdown checkpoint WAL record to the subscriber?
I was not thinking that's true.

For physical replication, the property to wait that standbys have
flushed the LSN of the shutdown checkpoint can be important for
switchovers. For example, with a primary and a standby, it is possible
to stop cleanly the master, promote the standby, and then connect back
to the cluster the old primary as a standby to the now-new primary
with the guarantee that both are in a consistent state. It seems to me
that having similar guarantees for logical replication is important.

Now, I have not reviewed the code of logirep in details at the level
of Peter, Petr or yourself...
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Petr Jelinek
petr@2ndquadrant.com
In reply to: Fujii Masao (#1)
Re: logical replication and PANIC during shutdown checkpoint in publisher

On 12/04/17 15:55, Fujii Masao wrote:

Hi,

When I shut down the publisher while I repeated creating and dropping
the subscription in the subscriber, the publisher emitted the following
PANIC error during shutdown checkpoint.

PANIC: concurrent transaction log activity while database system is
shutting down

The cause of this problem is that walsender for logical replication can
generate WAL records even during shutdown checkpoint.

Firstly walsender keeps running until shutdown checkpoint finishes
so that all the WAL including shutdown checkpoint record can be
replicated to the standby. This was safe because previously walsender
could not generate WAL records. However this assumption became
invalid because of logical replication. That is, currenty walsender for
logical replication can generate WAL records, for example, by executing
CREATE_REPLICATION_SLOT command. This is an oversight in
logical replication patch, I think.

Hmm, but CREATE_REPLICATION_SLOT should not generate WAL afaik. I agree
that the issue with walsender still exist (since we now allow normal SQL
to run there) but I think it's important to identify what exactly causes
the WAL activity in your case - if it's one of the replication commands
and not SQL then we'll need to backpatch any solution we come up with to
9.4, if it's not replication command, we only need to fix 10.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Fujii Masao
masao.fujii@gmail.com
In reply to: Petr Jelinek (#7)
Re: logical replication and PANIC during shutdown checkpoint in publisher

On Fri, Apr 14, 2017 at 10:33 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

On 12/04/17 15:55, Fujii Masao wrote:

Hi,

When I shut down the publisher while I repeated creating and dropping
the subscription in the subscriber, the publisher emitted the following
PANIC error during shutdown checkpoint.

PANIC: concurrent transaction log activity while database system is
shutting down

The cause of this problem is that walsender for logical replication can
generate WAL records even during shutdown checkpoint.

Firstly walsender keeps running until shutdown checkpoint finishes
so that all the WAL including shutdown checkpoint record can be
replicated to the standby. This was safe because previously walsender
could not generate WAL records. However this assumption became
invalid because of logical replication. That is, currenty walsender for
logical replication can generate WAL records, for example, by executing
CREATE_REPLICATION_SLOT command. This is an oversight in
logical replication patch, I think.

Hmm, but CREATE_REPLICATION_SLOT should not generate WAL afaik. I agree
that the issue with walsender still exist (since we now allow normal SQL
to run there) but I think it's important to identify what exactly causes
the WAL activity in your case

At least in my case, the following CREATE_REPLICATION_SLOT command
generated WAL record.

BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ;
CREATE_REPLICATION_SLOT testslot TEMPORARY LOGICAL pgoutput USE_SNAPSHOT;

Here is the pg_waldump output of the WAL record that CREATE_REPLICATION_SLOT
generated.

rmgr: Standby len (rec/tot): 24/ 50, tx: 0,
lsn: 0/01601438, prev 0/01601400, desc: RUNNING_XACTS nextXid 692
latestCompletedXid 691 oldestRunningXid 692

So I guess that CREATE_REPLICATION_SLOT code calls LogStandbySnapshot()
and which generates WAL record about snapshot of running transactions.

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Petr Jelinek
petr@2ndquadrant.com
In reply to: Fujii Masao (#8)
Re: logical replication and PANIC during shutdown checkpoint in publisher

On 14/04/17 19:33, Fujii Masao wrote:

On Fri, Apr 14, 2017 at 10:33 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

On 12/04/17 15:55, Fujii Masao wrote:

Hi,

When I shut down the publisher while I repeated creating and dropping
the subscription in the subscriber, the publisher emitted the following
PANIC error during shutdown checkpoint.

PANIC: concurrent transaction log activity while database system is
shutting down

The cause of this problem is that walsender for logical replication can
generate WAL records even during shutdown checkpoint.

Firstly walsender keeps running until shutdown checkpoint finishes
so that all the WAL including shutdown checkpoint record can be
replicated to the standby. This was safe because previously walsender
could not generate WAL records. However this assumption became
invalid because of logical replication. That is, currenty walsender for
logical replication can generate WAL records, for example, by executing
CREATE_REPLICATION_SLOT command. This is an oversight in
logical replication patch, I think.

Hmm, but CREATE_REPLICATION_SLOT should not generate WAL afaik. I agree
that the issue with walsender still exist (since we now allow normal SQL
to run there) but I think it's important to identify what exactly causes
the WAL activity in your case

At least in my case, the following CREATE_REPLICATION_SLOT command
generated WAL record.

BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ;
CREATE_REPLICATION_SLOT testslot TEMPORARY LOGICAL pgoutput USE_SNAPSHOT;

Here is the pg_waldump output of the WAL record that CREATE_REPLICATION_SLOT
generated.

rmgr: Standby len (rec/tot): 24/ 50, tx: 0,
lsn: 0/01601438, prev 0/01601400, desc: RUNNING_XACTS nextXid 692
latestCompletedXid 691 oldestRunningXid 692

So I guess that CREATE_REPLICATION_SLOT code calls LogStandbySnapshot()
and which generates WAL record about snapshot of running transactions.

Ah yes looking at the code, it does exactly that (on master only). Means
that backport will be necessary.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Peter Eisentraut
peter_e@gmx.net
In reply to: Petr Jelinek (#9)
Re: logical replication and PANIC during shutdown checkpoint in publisher

On 4/14/17 14:23, Petr Jelinek wrote:

Ah yes looking at the code, it does exactly that (on master only). Means
that backport will be necessary.

I think these two sentences are contradicting each other.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Petr Jelinek
petr@2ndquadrant.com
In reply to: Peter Eisentraut (#10)
Re: logical replication and PANIC during shutdown checkpoint in publisher

On 14/04/17 21:05, Peter Eisentraut wrote:

On 4/14/17 14:23, Petr Jelinek wrote:

Ah yes looking at the code, it does exactly that (on master only). Means
that backport will be necessary.

I think these two sentences are contradicting each other.

Hehe, didn't realize master will be taken as master branch, I meant
master as in not standby :)

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Noah Misch
noah@leadboat.com
In reply to: Fujii Masao (#1)
Re: logical replication and PANIC during shutdown checkpoint in publisher

On Wed, Apr 12, 2017 at 10:55:08PM +0900, Fujii Masao wrote:

When I shut down the publisher while I repeated creating and dropping
the subscription in the subscriber, the publisher emitted the following
PANIC error during shutdown checkpoint.

PANIC: concurrent transaction log activity while database system is
shutting down

The cause of this problem is that walsender for logical replication can
generate WAL records even during shutdown checkpoint.

Firstly walsender keeps running until shutdown checkpoint finishes
so that all the WAL including shutdown checkpoint record can be
replicated to the standby. This was safe because previously walsender
could not generate WAL records. However this assumption became
invalid because of logical replication. That is, currenty walsender for
logical replication can generate WAL records, for example, by executing
CREATE_REPLICATION_SLOT command. This is an oversight in
logical replication patch, I think.

To fix this issue, we should terminate walsender for logical replication
before shutdown checkpoint starts. Of course walsender for physical
replication still needs to keep running until shutdown checkpoint ends,
though.

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Peter,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1]/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.

[1]: /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Petr Jelinek
petr@2ndquadrant.com
In reply to: Noah Misch (#12)
Re: Re: logical replication and PANIC during shutdown checkpoint in publisher

On 16/04/17 08:12, Noah Misch wrote:

On Wed, Apr 12, 2017 at 10:55:08PM +0900, Fujii Masao wrote:

When I shut down the publisher while I repeated creating and dropping
the subscription in the subscriber, the publisher emitted the following
PANIC error during shutdown checkpoint.

PANIC: concurrent transaction log activity while database system is
shutting down

The cause of this problem is that walsender for logical replication can
generate WAL records even during shutdown checkpoint.

Firstly walsender keeps running until shutdown checkpoint finishes
so that all the WAL including shutdown checkpoint record can be
replicated to the standby. This was safe because previously walsender
could not generate WAL records. However this assumption became
invalid because of logical replication. That is, currenty walsender for
logical replication can generate WAL records, for example, by executing
CREATE_REPLICATION_SLOT command. This is an oversight in
logical replication patch, I think.

To fix this issue, we should terminate walsender for logical replication
before shutdown checkpoint starts. Of course walsender for physical
replication still needs to keep running until shutdown checkpoint ends,
though.

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Peter,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.

Just FYI this is not new in 10, the issue exists since the 9.4
introduction of logical replication slots.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Andres Freund
andres@anarazel.de
In reply to: Fujii Masao (#8)
Re: logical replication and PANIC during shutdown checkpoint in publisher

On 2017-04-15 02:33:59 +0900, Fujii Masao wrote:

On Fri, Apr 14, 2017 at 10:33 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

On 12/04/17 15:55, Fujii Masao wrote:

Hi,

When I shut down the publisher while I repeated creating and dropping
the subscription in the subscriber, the publisher emitted the following
PANIC error during shutdown checkpoint.

PANIC: concurrent transaction log activity while database system is
shutting down

The cause of this problem is that walsender for logical replication can
generate WAL records even during shutdown checkpoint.

Firstly walsender keeps running until shutdown checkpoint finishes
so that all the WAL including shutdown checkpoint record can be
replicated to the standby. This was safe because previously walsender
could not generate WAL records. However this assumption became
invalid because of logical replication. That is, currenty walsender for
logical replication can generate WAL records, for example, by executing
CREATE_REPLICATION_SLOT command. This is an oversight in
logical replication patch, I think.

Hmm, but CREATE_REPLICATION_SLOT should not generate WAL afaik. I agree
that the issue with walsender still exist (since we now allow normal SQL
to run there) but I think it's important to identify what exactly causes
the WAL activity in your case

At least in my case, the following CREATE_REPLICATION_SLOT command
generated WAL record.

BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ;
CREATE_REPLICATION_SLOT testslot TEMPORARY LOGICAL pgoutput USE_SNAPSHOT;

Here is the pg_waldump output of the WAL record that CREATE_REPLICATION_SLOT
generated.

rmgr: Standby len (rec/tot): 24/ 50, tx: 0,
lsn: 0/01601438, prev 0/01601400, desc: RUNNING_XACTS nextXid 692
latestCompletedXid 691 oldestRunningXid 692

So I guess that CREATE_REPLICATION_SLOT code calls LogStandbySnapshot()
and which generates WAL record about snapshot of running transactions.

Erroring out in these cases sounds easy enough. Wonder if there's not a
bigger problem with WAL records generated e.g. by HOT pruning or such,
during decoding. Not super likely, but would probably hit exactly the
same, no?

- Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Petr Jelinek
petr@2ndquadrant.com
In reply to: Andres Freund (#14)
Re: logical replication and PANIC during shutdown checkpoint in publisher

On 17/04/17 18:02, Andres Freund wrote:

On 2017-04-15 02:33:59 +0900, Fujii Masao wrote:

On Fri, Apr 14, 2017 at 10:33 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

On 12/04/17 15:55, Fujii Masao wrote:

Hi,

When I shut down the publisher while I repeated creating and dropping
the subscription in the subscriber, the publisher emitted the following
PANIC error during shutdown checkpoint.

PANIC: concurrent transaction log activity while database system is
shutting down

The cause of this problem is that walsender for logical replication can
generate WAL records even during shutdown checkpoint.

Firstly walsender keeps running until shutdown checkpoint finishes
so that all the WAL including shutdown checkpoint record can be
replicated to the standby. This was safe because previously walsender
could not generate WAL records. However this assumption became
invalid because of logical replication. That is, currenty walsender for
logical replication can generate WAL records, for example, by executing
CREATE_REPLICATION_SLOT command. This is an oversight in
logical replication patch, I think.

Hmm, but CREATE_REPLICATION_SLOT should not generate WAL afaik. I agree
that the issue with walsender still exist (since we now allow normal SQL
to run there) but I think it's important to identify what exactly causes
the WAL activity in your case

At least in my case, the following CREATE_REPLICATION_SLOT command
generated WAL record.

BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ;
CREATE_REPLICATION_SLOT testslot TEMPORARY LOGICAL pgoutput USE_SNAPSHOT;

Here is the pg_waldump output of the WAL record that CREATE_REPLICATION_SLOT
generated.

rmgr: Standby len (rec/tot): 24/ 50, tx: 0,
lsn: 0/01601438, prev 0/01601400, desc: RUNNING_XACTS nextXid 692
latestCompletedXid 691 oldestRunningXid 692

So I guess that CREATE_REPLICATION_SLOT code calls LogStandbySnapshot()
and which generates WAL record about snapshot of running transactions.

Erroring out in these cases sounds easy enough. Wonder if there's not a
bigger problem with WAL records generated e.g. by HOT pruning or such,
during decoding. Not super likely, but would probably hit exactly the
same, no?

Sounds possible, yes. Sounds like that's going to be nontrivial to fix
though.

Another problem is that queries can run on walsender now. But that
should be possible to detect and shutdown just like backend.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Andres Freund
andres@anarazel.de
In reply to: Petr Jelinek (#15)
Re: logical replication and PANIC during shutdown checkpoint in publisher

On 2017-04-17 18:28:16 +0200, Petr Jelinek wrote:

On 17/04/17 18:02, Andres Freund wrote:

On 2017-04-15 02:33:59 +0900, Fujii Masao wrote:

On Fri, Apr 14, 2017 at 10:33 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

On 12/04/17 15:55, Fujii Masao wrote:

Hi,

When I shut down the publisher while I repeated creating and dropping
the subscription in the subscriber, the publisher emitted the following
PANIC error during shutdown checkpoint.

PANIC: concurrent transaction log activity while database system is
shutting down

The cause of this problem is that walsender for logical replication can
generate WAL records even during shutdown checkpoint.

Firstly walsender keeps running until shutdown checkpoint finishes
so that all the WAL including shutdown checkpoint record can be
replicated to the standby. This was safe because previously walsender
could not generate WAL records. However this assumption became
invalid because of logical replication. That is, currenty walsender for
logical replication can generate WAL records, for example, by executing
CREATE_REPLICATION_SLOT command. This is an oversight in
logical replication patch, I think.

Hmm, but CREATE_REPLICATION_SLOT should not generate WAL afaik. I agree
that the issue with walsender still exist (since we now allow normal SQL
to run there) but I think it's important to identify what exactly causes
the WAL activity in your case

At least in my case, the following CREATE_REPLICATION_SLOT command
generated WAL record.

BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ;
CREATE_REPLICATION_SLOT testslot TEMPORARY LOGICAL pgoutput USE_SNAPSHOT;

Here is the pg_waldump output of the WAL record that CREATE_REPLICATION_SLOT
generated.

rmgr: Standby len (rec/tot): 24/ 50, tx: 0,
lsn: 0/01601438, prev 0/01601400, desc: RUNNING_XACTS nextXid 692
latestCompletedXid 691 oldestRunningXid 692

So I guess that CREATE_REPLICATION_SLOT code calls LogStandbySnapshot()
and which generates WAL record about snapshot of running transactions.

Erroring out in these cases sounds easy enough. Wonder if there's not a
bigger problem with WAL records generated e.g. by HOT pruning or such,
during decoding. Not super likely, but would probably hit exactly the
same, no?

Sounds possible, yes. Sounds like that's going to be nontrivial to fix
though.

Another problem is that queries can run on walsender now. But that
should be possible to detect and shutdown just like backend.

This sounds like a case for s/PANIC/ERROR|FATAL/ to me...

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Peter Eisentraut
peter_e@gmx.net
In reply to: Andres Freund (#16)
Re: logical replication and PANIC during shutdown checkpoint in publisher

On 4/17/17 12:30, Andres Freund wrote:

So I guess that CREATE_REPLICATION_SLOT code calls LogStandbySnapshot()
and which generates WAL record about snapshot of running transactions.

Erroring out in these cases sounds easy enough. Wonder if there's not a
bigger problem with WAL records generated e.g. by HOT pruning or such,
during decoding. Not super likely, but would probably hit exactly the
same, no?

Sounds possible, yes. Sounds like that's going to be nontrivial to fix
though.

Another problem is that queries can run on walsender now. But that
should be possible to detect and shutdown just like backend.

This sounds like a case for s/PANIC/ERROR|FATAL/ to me...

I'd imagine the postmaster would tell the walsender that it has started
shutdown, and then the walsender would reject $certain_things. But I
don't see an existing way for the walsender to know that shutdown has
been initiated. SIGINT is still free ...

The alternative of shutting down logical walsenders earlier also doesn't
look straightforward, since the postmaster doesn't know directly what
kind of walsender a certain process is. So you'd also need additional
signal types or something there.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#17)
Re: logical replication and PANIC during shutdown checkpoint in publisher

On Tue, Apr 18, 2017 at 3:27 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

I'd imagine the postmaster would tell the walsender that it has started
shutdown, and then the walsender would reject $certain_things. But I
don't see an existing way for the walsender to know that shutdown has
been initiated. SIGINT is still free ...

The WAL sender receives SIGUSR2 from the postmaster when shutdown is
initiated, so why not just rely on that and issue an ERROR when a
client attempts to create or drop a new slot, setting up
walsender_ready_to_stop unconditionally? It seems to me that the issue
here is the delay between the moment SIGTERM is acknowledged by the
WAL sender and the moment CREATE_SLOT is treater. An idea with the
attached...

The alternative of shutting down logical walsenders earlier also doesn't
look straightforward, since the postmaster doesn't know directly what
kind of walsender a certain process is. So you'd also need additional
signal types or something there.

Yup, but is a switchover between a publisher and a subscriber
something that can happen?
--
Michael

Attachments:

walsender-shutdown-fix.patchapplication/octet-stream; name=walsender-shutdown-fix.patchDownload+18-1
#19Peter Eisentraut
peter_e@gmx.net
In reply to: Michael Paquier (#18)
Re: logical replication and PANIC during shutdown checkpoint in publisher

On 4/19/17 01:45, Michael Paquier wrote:

On Tue, Apr 18, 2017 at 3:27 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

I'd imagine the postmaster would tell the walsender that it has started
shutdown, and then the walsender would reject $certain_things. But I
don't see an existing way for the walsender to know that shutdown has
been initiated. SIGINT is still free ...

The WAL sender receives SIGUSR2 from the postmaster when shutdown is
initiated, so why not just rely on that and issue an ERROR when a
client attempts to create or drop a new slot, setting up
walsender_ready_to_stop unconditionally? It seems to me that the issue
here is the delay between the moment SIGTERM is acknowledged by the
WAL sender and the moment CREATE_SLOT is treater. An idea with the
attached...

I think the problem with a signal-based solution is that there is no
feedback. Ideally, you would wait for all walsenders to acknowledge the
receipt of SIGUSR2 (or similar) and only then proceed with the shutdown
checkpoint.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#12)
Re: logical replication and PANIC during shutdown checkpoint in publisher

On Sun, Apr 16, 2017 at 06:12:58AM +0000, Noah Misch wrote:

On Wed, Apr 12, 2017 at 10:55:08PM +0900, Fujii Masao wrote:

When I shut down the publisher while I repeated creating and dropping
the subscription in the subscriber, the publisher emitted the following
PANIC error during shutdown checkpoint.

PANIC: concurrent transaction log activity while database system is
shutting down

The cause of this problem is that walsender for logical replication can
generate WAL records even during shutdown checkpoint.

Firstly walsender keeps running until shutdown checkpoint finishes
so that all the WAL including shutdown checkpoint record can be
replicated to the standby. This was safe because previously walsender
could not generate WAL records. However this assumption became
invalid because of logical replication. That is, currenty walsender for
logical replication can generate WAL records, for example, by executing
CREATE_REPLICATION_SLOT command. This is an oversight in
logical replication patch, I think.

To fix this issue, we should terminate walsender for logical replication
before shutdown checkpoint starts. Of course walsender for physical
replication still needs to keep running until shutdown checkpoint ends,
though.

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Peter,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.

[1] /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

This PostgreSQL 10 open item is past due for your status update. Kindly send
a status update within 24 hours, and include a date for your subsequent status
update. Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#19)
#22Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#21)
#23Petr Jelinek
petr@2ndquadrant.com
In reply to: Michael Paquier (#22)
#24Peter Eisentraut
peter_e@gmx.net
In reply to: Petr Jelinek (#23)
#25Peter Eisentraut
peter_e@gmx.net
In reply to: Noah Misch (#20)
#26Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#24)
#27Petr Jelinek
petr@2ndquadrant.com
In reply to: Michael Paquier (#26)
#28Michael Paquier
michael@paquier.xyz
In reply to: Petr Jelinek (#27)
#29Peter Eisentraut
peter_e@gmx.net
In reply to: Michael Paquier (#26)
#30Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#25)
#31Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#30)
#32Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#29)
#33Noah Misch
noah@leadboat.com
In reply to: Peter Eisentraut (#30)
#34Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#33)
#35Peter Eisentraut
peter_e@gmx.net
In reply to: Noah Misch (#34)
#36Peter Eisentraut
peter_e@gmx.net
In reply to: Michael Paquier (#32)
#37Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#36)
#38Petr Jelinek
petr@2ndquadrant.com
In reply to: Michael Paquier (#37)
#39Michael Paquier
michael@paquier.xyz
In reply to: Petr Jelinek (#38)
#40Peter Eisentraut
peter_e@gmx.net
In reply to: Michael Paquier (#39)
#41Peter Eisentraut
peter_e@gmx.net
In reply to: Petr Jelinek (#38)
#42Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#40)
#43Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#41)
#44Peter Eisentraut
peter_e@gmx.net
In reply to: Michael Paquier (#43)
#45Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#35)
#46Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#44)
#47Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Michael Paquier (#46)
#48Michael Paquier
michael@paquier.xyz
In reply to: Pavan Deolasee (#47)
#49Peter Eisentraut
peter_e@gmx.net
In reply to: Michael Paquier (#46)
#50Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#49)
#51Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#50)
#52Peter Eisentraut
peter_e@gmx.net
In reply to: Michael Paquier (#51)
#53Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#52)
#54Andres Freund
andres@anarazel.de
In reply to: Peter Eisentraut (#49)
#55Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#54)
#56Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#55)
#57Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#56)
#58Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#57)
#59Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#54)
#60Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#56)
#61Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#60)
#62Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#61)
#63Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#62)
#64Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#61)
#65Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#64)
#66Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#65)
#67Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#64)