Back-branch update releases coming in a couple weeks
Since we've fixed a couple of relatively nasty bugs recently, the core
committee has determined that it'd be a good idea to push out PG update
releases soon. The current plan is to wrap on Monday Feb 4 for public
announcement Thursday Feb 7. If you're aware of any bug fixes you think
ought to get included, now's the time to get them done ...
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
Since we've fixed a couple of relatively nasty bugs recently, the core
committee has determined that it'd be a good idea to push out PG update
releases soon. The current plan is to wrap on Monday Feb 4 for public
announcement Thursday Feb 7. If you're aware of any bug fixes you think
ought to get included, now's the time to get them done ...
Thanks for that.
Should we consider including Simon's fix for an issue which Noah noted
in this thread?:
/messages/by-id/CA+U5nMKBrqFxyohr=JSDpgxZ6y0nfAdmt=K3hK4Zzfqo1MHSJg@mail.gmail.com
Admittedly, it's been this way for 3+ years, but there's concurrence on
that thread that it's a bug, so I figured I'd mention it.
Thanks,
Stephen
Stephen Frost <sfrost@snowman.net> writes:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
Since we've fixed a couple of relatively nasty bugs recently, the core
committee has determined that it'd be a good idea to push out PG update
releases soon. The current plan is to wrap on Monday Feb 4 for public
announcement Thursday Feb 7. If you're aware of any bug fixes you think
ought to get included, now's the time to get them done ...
Should we consider including Simon's fix for an issue which Noah noted
in this thread?:
/messages/by-id/CA+U5nMKBrqFxyohr=JSDpgxZ6y0nfAdmt=K3hK4Zzfqo1MHSJg@mail.gmail.com
It sounds like there's something to fix there, but AFAICS the thread is
still arguing about the best fix. There's time to do it non-hastily.
regards, tom lane
From: "Tom Lane" <tgl@sss.pgh.pa.us>
Since we've fixed a couple of relatively nasty bugs recently, the core
committee has determined that it'd be a good idea to push out PG update
releases soon. The current plan is to wrap on Monday Feb 4 for public
announcement Thursday Feb 7. If you're aware of any bug fixes you think
ought to get included, now's the time to get them done ...
I've just encountered a serious problem, and I really hope it can be fixed
in the upcoming minor release. Could you tell me whether this is already
fixed and whether the fix will be included?
I'm using synchronous streaming replication with PostgreSQL 9.1.6 on Linux.
There are two nodes: one is the primary and the other is a standby. When I
performed failover tests by running "pg_ctl stop -mi" against the primary
while some applications were reading and writing the database, the standby
crashed while performing recovery after receiving the promote request:
...
LOG: archive recovery complete
WARNING: page 506747 of relation base/482272/482304 was uninitialized
PANIC: WAL contains references to invalid pages
LOG: startup process (PID 8938) was terminated by signal 6: Aborted
LOG: terminating any other active server processes
(the log ends here)
The relation mentioned is an index. The contents of the referenced page were
all zeros when I inspected it with "od -t x $PGDATA/base/482272/482304.3"
after the crash. The subsequent pages of the same file had valid-seeming
contents.
I searched the PostgreSQL mailing lists for "WAL contains references to
invalid pages" and found 19 messages. Some people encountered a similar
problem, and there were some discussions regarding those problems (Tom and
Simon Riggs commented), but those discussions did not reach a solution.
I also found a discussion which might relate to this problem. Does this fix
the problem?
[BUG] lag of minRecoveryPont in archive recovery
/messages/by-id/20121206.130458.170549097.horiguchi.kyotaro@lab.ntt.co.jp
Regards
MauMau
On Thu, Jan 24, 2013 at 7:42 AM, MauMau <maumau307@gmail.com> wrote:
From: "Tom Lane" <tgl@sss.pgh.pa.us>
Since we've fixed a couple of relatively nasty bugs recently, the core
committee has determined that it'd be a good idea to push out PG update
releases soon. The current plan is to wrap on Monday Feb 4 for public
announcement Thursday Feb 7. If you're aware of any bug fixes you think
ought to get included, now's the time to get them done ...

I've just encountered a serious problem, and I really hope it can be fixed
in the upcoming minor release. Could you tell me whether this is already
fixed and whether the fix will be included?

I'm using synchronous streaming replication with PostgreSQL 9.1.6 on Linux.
There are two nodes: one is the primary and the other is a standby. When I
performed failover tests by running "pg_ctl stop -mi" against the primary
while some applications were reading and writing the database, the standby
crashed while performing recovery after receiving the promote request:
...
LOG: archive recovery complete
WARNING: page 506747 of relation base/482272/482304 was uninitialized
PANIC: WAL contains references to invalid pages
LOG: startup process (PID 8938) was terminated by signal 6: Aborted
LOG: terminating any other active server processes
(the log ends here)

The relation mentioned is an index. The contents of the referenced page were
all zeros when I inspected it with "od -t x $PGDATA/base/482272/482304.3"
after the crash. The subsequent pages of the same file had valid-seeming
contents.

I searched the PostgreSQL mailing lists for "WAL contains references to
invalid pages" and found 19 messages. Some people encountered a similar
problem, and there were some discussions regarding those problems (Tom and
Simon Riggs commented), but those discussions did not reach a solution.

I also found a discussion which might relate to this problem. Does this fix
the problem?

[BUG] lag of minRecoveryPont in archive recovery
/messages/by-id/20121206.130458.170549097.horiguchi.kyotaro@lab.ntt.co.jp
Yes. Could you check whether you can reproduce the problem on the
latest REL9_2_STABLE?
Regards,
--
Fujii Masao
From: "Fujii Masao" <masao.fujii@gmail.com>
On Thu, Jan 24, 2013 at 7:42 AM, MauMau <maumau307@gmail.com> wrote:
I searched the PostgreSQL mailing lists for "WAL contains references to
invalid pages" and found 19 messages. Some people encountered a similar
problem, and there were some discussions regarding those problems (Tom and
Simon Riggs commented), but those discussions did not reach a solution.

I also found a discussion which might relate to this problem. Does this fix
the problem?

[BUG] lag of minRecoveryPont in archive recovery
/messages/by-id/20121206.130458.170549097.horiguchi.kyotaro@lab.ntt.co.jp

Yes. Could you check whether you can reproduce the problem on the
latest REL9_2_STABLE?
I tried to reproduce the problem by running "pg_ctl stop -mi" against the
primary more than ten times on REL9_2_STABLE, but the problem did not
appear. However, I encountered the crash only once out of dozens of
failovers, possibly more than a hundred times, on PostgreSQL 9.1.6. So, I'm
not sure the problem is fixed in REL9_2_STABLE.
I'm wondering if the fix discussed in the above thread solves my problem. I
found the following differences between Horiguchi-san's case and my case:
(1)
Horiguchi-san says the bug outputs the message:
WARNING: page 0 of relation base/16384/16385 does not exist
On the other hand, I got the message:
WARNING: page 506747 of relation base/482272/482304 was uninitialized
(2)
Horiguchi-san produced the problem when he shut the standby immediately and
restarted it. However, I saw the problem during failover.
(3)
Horiguchi-san did not use any index, but in my case the WARNING message
refers to an index.
But there's a similar point. Horiguchi-san says the problem occurs after
DELETE+VACUUM. In my case, I shut the primary down while the application
was doing INSERT/UPDATE. As the below messages show, some vacuuming was
running just before the immediate shutdown:
...
LOG: automatic vacuum of table "GOLD.scm1.tbl1": index scans: 0
pages: 0 removed, 36743 remain
tuples: 0 removed, 73764 remain
system usage: CPU 0.09s/0.11u sec elapsed 0.66 sec
LOG: automatic analyze of table "GOLD.scm1.tbl1" system usage: CPU
0.00s/0.14u sec elapsed 0.32 sec
LOG: automatic vacuum of table "GOLD.scm1.tbl2": index scans: 0
pages: 0 removed, 12101 remain
tuples: 40657 removed, 44142 remain system usage: CPU 0.06s/0.06u sec
elapsed 0.30 sec
LOG: automatic analyze of table "GOLD.scm1.tbl2" system usage: CPU
0.00s/0.06u sec elapsed 0.14 sec
LOG: received immediate shutdown request
...
Could you tell me the details of the problem discussed and fixed in the
upcoming minor release? I would like to know the phenomenon and its
conditions, and whether it applies to my case.
Regards
MauMau
On Thu, Jan 24, 2013 at 11:53 PM, MauMau <maumau307@gmail.com> wrote:
From: "Fujii Masao" <masao.fujii@gmail.com>
On Thu, Jan 24, 2013 at 7:42 AM, MauMau <maumau307@gmail.com> wrote:
I searched the PostgreSQL mailing lists for "WAL contains references to
invalid pages" and found 19 messages. Some people encountered a similar
problem, and there were some discussions regarding those problems (Tom and
Simon Riggs commented), but those discussions did not reach a solution.

I also found a discussion which might relate to this problem. Does this fix
the problem?

[BUG] lag of minRecoveryPont in archive recovery
/messages/by-id/20121206.130458.170549097.horiguchi.kyotaro@lab.ntt.co.jp
Yes. Could you check whether you can reproduce the problem on the
latest REL9_2_STABLE?

I tried to reproduce the problem by running "pg_ctl stop -mi" against the
primary more than ten times on REL9_2_STABLE, but the problem did not
appear. However, I encountered the crash only once out of dozens of
failovers, possibly more than a hundred times, on PostgreSQL 9.1.6. So, I'm
not sure the problem is fixed in REL9_2_STABLE.
Can you reproduce the problem in REL9_1_STABLE?

Sorry, I pointed you to the wrong version, i.e., REL9_2_STABLE. Since you
encountered the problem in 9.1.6, you need to retry the test in
REL9_1_STABLE.
I'm wondering if the fix discussed in the above thread solves my problem. I
found the following differences between Horiguchi-san's case and my case:
(1)
Horiguchi-san says the bug outputs the message:
WARNING: page 0 of relation base/16384/16385 does not exist
On the other hand, I got the message:
WARNING: page 506747 of relation base/482272/482304 was uninitialized
(2)
Horiguchi-san produced the problem when he shut the standby immediately and
restarted it. However, I saw the problem during failover.
(3)
Horiguchi-san did not use any index, but in my case the WARNING message
refers to an index.

But there's a similar point. Horiguchi-san says the problem occurs after
DELETE+VACUUM. In my case, I shut the primary down while the application
was doing INSERT/UPDATE. As the below messages show, some vacuuming was
running just before the immediate shutdown:
...
LOG: automatic vacuum of table "GOLD.scm1.tbl1": index scans: 0
pages: 0 removed, 36743 remain
tuples: 0 removed, 73764 remain
system usage: CPU 0.09s/0.11u sec elapsed 0.66 sec
LOG: automatic analyze of table "GOLD.scm1.tbl1" system usage: CPU
0.00s/0.14u sec elapsed 0.32 sec
LOG: automatic vacuum of table "GOLD.scm1.tbl2": index scans: 0
pages: 0 removed, 12101 remain
tuples: 40657 removed, 44142 remain system usage: CPU 0.06s/0.06u sec
elapsed 0.30 sec
LOG: automatic analyze of table "GOLD.scm1.tbl2" system usage: CPU
0.00s/0.06u sec elapsed 0.14 sec
LOG: received immediate shutdown request
...

Could you tell me the details of the problem discussed and fixed in the
upcoming minor release? I would like to know the phenomenon and its
conditions, and whether it applies to my case.
/messages/by-id/20121206.130458.170549097.horiguchi.kyotaro@lab.ntt.co.jp
Could you read the discussion in the above thread?
Regards,
--
Fujii Masao
From: "Fujii Masao" <masao.fujii@gmail.com>
On Thu, Jan 24, 2013 at 11:53 PM, MauMau <maumau307@gmail.com> wrote:
I'm wondering if the fix discussed in the above thread solves my problem. I
found the following differences between Horiguchi-san's case and my case:
(1)
Horiguchi-san says the bug outputs the message:
WARNING: page 0 of relation base/16384/16385 does not exist
On the other hand, I got the message:
WARNING: page 506747 of relation base/482272/482304 was uninitialized
(2)
Horiguchi-san produced the problem when he shut the standby immediately and
restarted it. However, I saw the problem during failover.
(3)
Horiguchi-san did not use any index, but in my case the WARNING message
refers to an index.

But there's a similar point. Horiguchi-san says the problem occurs after
DELETE+VACUUM. In my case, I shut the primary down while the application
was doing INSERT/UPDATE. As the below messages show, some vacuuming was
running just before the immediate shutdown:
...
LOG: automatic vacuum of table "GOLD.scm1.tbl1": index scans: 0
pages: 0 removed, 36743 remain
tuples: 0 removed, 73764 remain
system usage: CPU 0.09s/0.11u sec elapsed 0.66 sec
LOG: automatic analyze of table "GOLD.scm1.tbl1" system usage: CPU
0.00s/0.14u sec elapsed 0.32 sec
LOG: automatic vacuum of table "GOLD.scm1.tbl2": index scans: 0
pages: 0 removed, 12101 remain
tuples: 40657 removed, 44142 remain system usage: CPU 0.06s/0.06u sec
elapsed 0.30 sec
LOG: automatic analyze of table "GOLD.scm1.tbl2" system usage: CPU
0.00s/0.06u sec elapsed 0.14 sec
LOG: received immediate shutdown request
...

Could you tell me the details of the problem discussed and fixed in the
upcoming minor release? I would like to know the phenomenon and its
conditions, and whether it applies to my case.

/messages/by-id/20121206.130458.170549097.horiguchi.kyotaro@lab.ntt.co.jp
Could you read the discussion in the above thread?
Yes, I've just read the discussion (it was difficult for me...)
Although you said the fix will solve my problem, I don't feel it will. The
discussion is about the crash when the standby "re"starts after the primary
vacuums and truncates a table. On the other hand, in my case, the standby
crashed during failover (not at restart), emitting a message that some WAL
record refers to an "uninitialized" page (not a non-existent page) of an
"index" (not a table).
In addition, fujii_test.sh did not reproduce the mentioned crash on
PostgreSQL 9.1.6.
I'm sorry to cause you trouble, but could you elaborate on how the fix
relates to my case?
Regards
MauMau
On Sun, Jan 27, 2013 at 12:17 AM, MauMau <maumau307@gmail.com> wrote:
From: "Fujii Masao" <masao.fujii@gmail.com>
On Thu, Jan 24, 2013 at 11:53 PM, MauMau <maumau307@gmail.com> wrote:
I'm wondering if the fix discussed in the above thread solves my problem. I
found the following differences between Horiguchi-san's case and my case:
(1)
Horiguchi-san says the bug outputs the message:
WARNING: page 0 of relation base/16384/16385 does not exist
On the other hand, I got the message:
WARNING: page 506747 of relation base/482272/482304 was uninitialized
(2)
Horiguchi-san produced the problem when he shut the standby immediately and
restarted it. However, I saw the problem during failover.
(3)
Horiguchi-san did not use any index, but in my case the WARNING message
refers to an index.

But there's a similar point. Horiguchi-san says the problem occurs after
DELETE+VACUUM. In my case, I shut the primary down while the application
was doing INSERT/UPDATE. As the below messages show, some vacuuming was
running just before the immediate shutdown:
...
LOG: automatic vacuum of table "GOLD.scm1.tbl1": index scans: 0
pages: 0 removed, 36743 remain
tuples: 0 removed, 73764 remain
system usage: CPU 0.09s/0.11u sec elapsed 0.66 sec
LOG: automatic analyze of table "GOLD.scm1.tbl1" system usage: CPU
0.00s/0.14u sec elapsed 0.32 sec
LOG: automatic vacuum of table "GOLD.scm1.tbl2": index scans: 0
pages: 0 removed, 12101 remain
tuples: 40657 removed, 44142 remain system usage: CPU 0.06s/0.06u sec
elapsed 0.30 sec
LOG: automatic analyze of table "GOLD.scm1.tbl2" system usage: CPU
0.00s/0.06u sec elapsed 0.14 sec
LOG: received immediate shutdown request
...

Could you tell me the details of the problem discussed and fixed in the
upcoming minor release? I would like to know the phenomenon and its
conditions, and whether it applies to my case.

/messages/by-id/20121206.130458.170549097.horiguchi.kyotaro@lab.ntt.co.jp
Could you read the discussion in the above thread?
Yes, I've just read the discussion (it was difficult for me...)
Although you said the fix will solve my problem, I don't feel it will. The
discussion is about the crash when the standby "re"starts after the primary
vacuums and truncates a table. On the other hand, in my case, the standby
crashed during failover (not at restart), emitting a message that some WAL
record refers to an "uninitialized" page (not a non-existent page) of an
"index" (not a table).In addition, fujii_test.sh did not reproduce the mentioned crash on
PostgreSQL 9.1.6.I'm sorry to cause you trouble, but could you elaborate on how the fix
relates to my case?
Maybe I had not been understanding your problem correctly.
Could you show the self-contained test case which reproduces the problem?
Is the problem still reproducible in REL9_1_STABLE?
Regards,
--
Fujii Masao
From: "Fujii Masao" <masao.fujii@gmail.com>
On Sun, Jan 27, 2013 at 12:17 AM, MauMau <maumau307@gmail.com> wrote:
Although you said the fix will solve my problem, I don't feel it will. The
discussion is about the crash when the standby "re"starts after the primary
vacuums and truncates a table. On the other hand, in my case, the standby
crashed during failover (not at restart), emitting a message that some WAL
record refers to an "uninitialized" page (not a non-existent page) of an
"index" (not a table).

In addition, fujii_test.sh did not reproduce the mentioned crash on
PostgreSQL 9.1.6.

I'm sorry to cause you trouble, but could you elaborate on how the fix
relates to my case?

Maybe I had not been understanding your problem correctly.
Could you show the self-contained test case which reproduces the problem?
Is the problem still reproducible in REL9_1_STABLE?
As I said before, it's very hard to reproduce the problem. All I did
was repeat the following sequence:
1. Run "pg_ctl stop -mi" against the primary while the applications were
performing INSERT/UPDATE/SELECT.
2. Run "pg_ctl promote" against the synchronous streaming standby.
3. Run pg_basebackup on the stopped (original) primary to create a new
standby, and start the new standby.
I did this failover test dozens of times, probably more than a hundred, and
I encountered the crash only once.

Although I saw the problem only once, the result is catastrophic. So, I
really hope Heikki's patch (in cooperation with Horiguchi-san and you) could
fix the issue.

Does anything come to mind?
Regards
MauMau
On Sun, Jan 27, 2013 at 11:38 PM, MauMau <maumau307@gmail.com> wrote:
From: "Fujii Masao" <masao.fujii@gmail.com>
On Sun, Jan 27, 2013 at 12:17 AM, MauMau <maumau307@gmail.com> wrote:
Although you said the fix will solve my problem, I don't feel it will. The
discussion is about the crash when the standby "re"starts after the primary
vacuums and truncates a table. On the other hand, in my case, the standby
crashed during failover (not at restart), emitting a message that some WAL
record refers to an "uninitialized" page (not a non-existent page) of an
"index" (not a table).

In addition, fujii_test.sh did not reproduce the mentioned crash on
PostgreSQL 9.1.6.

I'm sorry to cause you trouble, but could you elaborate on how the fix
relates to my case?

Maybe I had not been understanding your problem correctly.
Could you show the self-contained test case which reproduces the problem?
Is the problem still reproducible in REL9_1_STABLE?

As I said before, it's very hard to reproduce the problem. All I did
was repeat the following sequence:
1. Run "pg_ctl stop -mi" against the primary while the applications were
performing INSERT/UPDATE/SELECT.
2. Run "pg_ctl promote" against the synchronous streaming standby.
3. Run pg_basebackup on the stopped (original) primary to create a new
standby, and start the new standby.

I did this failover test dozens of times, probably more than a hundred, and
I encountered the crash only once.

Although I saw the problem only once, the result is catastrophic. So, I
really hope Heikki's patch (in cooperation with Horiguchi-san and you) could
fix the issue.

Does anything come to mind?
Umm... from that information, it's hard to tell whether your problem has
been fixed in the latest 9.1. The bug fix which you mentioned consists of
two patches:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=7bffc9b7bf9e09ddeddc65117e49829f758e500d
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=970fb12de121941939e777764d6e0446c974bba3
The former seems not to be related to your problem because the problem
that patch fixed could basically happen only when restarting the standby.
The latter might be related to your problem....
Regards,
--
Fujii Masao
From: "Tom Lane" <tgl@sss.pgh.pa.us>
Since we've fixed a couple of relatively nasty bugs recently, the core
committee has determined that it'd be a good idea to push out PG update
releases soon. The current plan is to wrap on Monday Feb 4 for public
announcement Thursday Feb 7. If you're aware of any bug fixes you think
ought to get included, now's the time to get them done ...
I've just encountered another serious bug, which I hope can be fixed in the
upcoming minor release.
I'm using streaming replication with PostgreSQL 9.1.6 on Linux (RHEL6.2,
kernel 2.6.32). But this problem should happen regardless of the use of
streaming replication.
When I ran "pg_ctl stop -mi" against the primary, some applications
connected to the primary did not stop. The cause was that the backends was
deadlocked in quickdie() with some call stack like the following. I'm sorry
to have left the stack trace file on the testing machine, so I'll show you
the precise stack trace tomorrow.
some lock function
malloc()
gettext()
errhint()
quickdie()
<signal handler called because of SIGQUIT>
free()
...
PostgresMain()
...
The root cause is that gettext() is called in the signal handler quickdie()
via errhint(). As you know, malloc() cannot safely be called from a signal
handler:
http://www.gnu.org/software/libc/manual/html_node/Nonreentrancy.html#Nonreentrancy
[Excerpt]
On most systems, malloc and free are not reentrant, because they use a
static data structure which records what memory blocks are free. As a
result, no library functions that allocate or free memory are reentrant.
This includes functions that allocate space to store a result.
And gettext() calls malloc(), as reported below:
http://lists.gnu.org/archive/html/bug-coreutils/2005-04/msg00056.html
I think the solution is the typical one. That is, to just remember the
receipt of SIGQUIT by setting a global variable and call siglongjmp() in
quickdie(), and perform tasks currently done in quickdie() when sigsetjmp()
returns in PostgresMain().
What do you think about the solution? Could you include the fix? If it's
okay and you want, I'll submit the patch.
Regards
MauMau
"MauMau" <maumau307@gmail.com> writes:
When I ran "pg_ctl stop -mi" against the primary, some applications
connected to the primary did not stop. ...
The root cause is that gettext() is called in the signal handler quickdie()
via errhint().
Yeah, it's a known hazard that quickdie() operates like that.
I think the solution is the typical one. That is, to just remember the
receipt of SIGQUIT by setting a global variable and call siglongjmp() in
quickdie(), and perform tasks currently done in quickdie() when sigsetjmp()
returns in PostgresMain().
I think this cure is considerably worse than the disease. As stated,
it's not a fix at all: longjmp'ing out of a signal handler is no better
defined than what happens now, in fact it's probably even less safe.
We could just set a flag and wait for the mainline code to notice,
but that would make SIGQUIT hardly any stronger than SIGTERM --- in
particular it couldn't get you out of any loop that wasn't checking for
interrupts.
The long and the short of it is that SIGQUIT is the emergency-stop panic
button. You don't use it for routine shutdowns --- you use it when
there is a damn good reason to and you're prepared to do some manual
cleanup if necessary.
http://en.wikipedia.org/wiki/Big_Red_Switch
What do think about the solution? Could you include the fix?
Even if we had an arguably-better solution, I'd be disinclined to
risk cramming it into stable branches on such short notice.
What might make sense on short notice is to strengthen the
documentation's cautions against using SIGQUIT unnecessarily.
regards, tom lane
On 2013-01-30 10:23:09 -0500, Tom Lane wrote:
"MauMau" <maumau307@gmail.com> writes:
When I ran "pg_ctl stop -mi" against the primary, some applications
connected to the primary did not stop. ...
The root cause is that gettext() is called in the signal handler quickdie()
via errhint().

Yeah, it's a known hazard that quickdie() operates like that.
What about not translating those? The messages are static and all memory
needed by postgres should be pre-allocated.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
On 2013-01-30 10:23:09 -0500, Tom Lane wrote:
Yeah, it's a known hazard that quickdie() operates like that.
What about not translating those? The messages are static and all memory
needed by postgres should be pre-allocated.
That would reduce our exposure slightly, but hardly to zero. For
instance, if SIGQUIT happened in the midst of handling a regular error,
ErrorContext might be pretty full already, necessitating further malloc
requests. I thought myself about suggesting that quickdie do something
to disable gettext(), but it doesn't seem like that would make it enough
safer to justify the loss of user-friendliness for non English speakers.
I think the conflict between "we don't want SIGQUIT to interrupt this"
and "we do want SIGQUIT to interrupt that" is pretty fundamental, and
there's probably not any bulletproof solution (or at least none that
would have reasonable development/maintenance cost). If we had more
confidence that there were no major loops lacking CHECK_FOR_INTERRUPTS
calls, maybe the set-a-flag approach would be acceptable ... but I
sure don't have such confidence.
regards, tom lane
From: "Tom Lane" <tgl@sss.pgh.pa.us>
"MauMau" <maumau307@gmail.com> writes:
I think the solution is the typical one. That is, to just remember the
receipt of SIGQUIT by setting a global variable and call siglongjmp() in
quickdie(), and perform tasks currently done in quickdie() when sigsetjmp()
returns in PostgresMain().

I think this cure is considerably worse than the disease. As stated,
it's not a fix at all: longjmp'ing out of a signal handler is no better
defined than what happens now, in fact it's probably even less safe.
We could just set a flag and wait for the mainline code to notice,
but that would make SIGQUIT hardly any stronger than SIGTERM --- in
particular it couldn't get you out of any loop that wasn't checking for
interrupts.
Oh, I was careless. You are right, my suggestion is not a fix at all
because free() would continue to hold some lock after siglongjmp(), which
malloc() tries to acquire.
The long and the short of it is that SIGQUIT is the emergency-stop panic
button. You don't use it for routine shutdowns --- you use it when
there is a damn good reason to and you're prepared to do some manual
cleanup if necessary.
How about the case where some backend crashes due to a bug of PostgreSQL?
In this case, postmaster sends SIGQUIT to all backends, too. The instance
is expected to disappear cleanly and quickly. Doesn't the hanging backend
harm the restart of the instance?
How about using SIGKILL instead of SIGQUIT? The purpose of SIGQUIT is to
shut down the processes quickly, and SIGKILL is the best signal for that
purpose. The WARNING message would not be sent to clients, but that does
not justify the inability to shut down immediately.
Regards
MauMau
"MauMau" <maumau307@gmail.com> writes:
From: "Tom Lane" <tgl@sss.pgh.pa.us>
The long and the short of it is that SIGQUIT is the emergency-stop panic
button. You don't use it for routine shutdowns --- you use it when
there is a damn good reason to and you're prepared to do some manual
cleanup if necessary.
How about the case where some backend crashes due to a bug of PostgreSQL?
In this case, postmaster sends SIGQUIT to all backends, too. The instance
is expected to disappear cleanly and quickly. Doesn't the hanging backend
harm the restart of the instance?
[ shrug... ] That isn't guaranteed, and never has been --- for
instance, the process might have SIGQUIT blocked, perhaps as a result
of third-party code we have no control over.
How about using SIGKILL instead of SIGQUIT?
Because then we couldn't notify clients at all. One practical
disadvantage of that is that it would become quite hard to tell from
the outside which client session actually crashed, which is frequently
useful to know.
This isn't an area that admits of quick-fix solutions --- everything
we might do has disadvantages. Also, the lack of complaints to date
shows that the problem is not so large as to justify panic responses.
I'm not really inclined to mess around with a tradeoff that's been
working pretty well for a dozen years or more.
regards, tom lane
This isn't an area that admits of quick-fix solutions --- everything
we might do has disadvantages. Also, the lack of complaints to date
shows that the problem is not so large as to justify panic responses.
I'm not really inclined to mess around with a tradeoff that's been
working pretty well for a dozen years or more.
What about adding a caution to the doc, something like:
"pg_ctl -m -i stop" may cause a PostgreSQL hang if native language support is enabled.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
On 2013-01-31 08:27:13 +0900, Tatsuo Ishii wrote:
This isn't an area that admits of quick-fix solutions --- everything
we might do has disadvantages. Also, the lack of complaints to date
shows that the problem is not so large as to justify panic responses.
I'm not really inclined to mess around with a tradeoff that's been
working pretty well for a dozen years or more.

What about adding a caution to the doc, something like:
"pg_ctl -m -i stop" may cause a PostgreSQL hang if native language support is enabled.
That doesn't entirely solve the problem, see quote and reply in
6845.1359561252@sss.pgh.pa.us
I think adding errmsg_raw() or somesuch that doesn't allocate any memory
and only accepts constant strings could solve the problem more
completely, at the obvious price of not allowing translated strings
directly.
Those could be pre-translated during startup, but that's mighty ugly.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
On 2013-01-31 08:27:13 +0900, Tatsuo Ishii wrote:
What about adding a caution to the doc something like:
"pg_ctl -m -i stop" may cause a PostgreSQL hang if native laguage support enabled.
That doesn't entirely solve the problem, see quote and reply in
6845.1359561252@sss.pgh.pa.us
I think adding errmsg_raw() or somesuch that doesn't allocate any memory
and only accepts constant strings could solve the problem more
completely, at the obvious price of not allowing translated strings
directly.
I really doubt that this would make a measurable difference in the
probability of failure. The OP's case looks like it might not have
occurred if we weren't translating, but (a) that's not actually proven,
and (b) there are any number of other, equally low-probability, reasons
to have a problem here. Please note for instance that elog.c would
still be doing a whole lot of palloc's even if the passed strings were
not copied.
I think if we want to make it bulletproof we'd have to do what the
OP suggested and switch to SIGKILL. I'm not enamored of that for the
reasons I mentioned --- but one idea that might dodge the disadvantages
is to have the postmaster wait a few seconds and then SIGKILL any
backends that hadn't exited.
regards, tom lane