Problem with PostgreSQL 9.2.7 and make check on AIX 7.1
Hello,
I am still using PostgreSQL 8.2 for an older application.
I have compiled 8.2.11 (same as on the old server) and
8.2.23 (latest 8.2) on AIX 7.1 TL3 SP1 (latest level) with
the IBM C/C++ V12 compiler.
Unfortunately I see the following errors:
...
2014-02-24 10:07:30 CET LOG: received fast shutdown request <--- here
2014-02-24 10:07:30 CET LOG: shutting down
2014-02-24 10:07:30 CET LOG: database system is shut down
2014-02-24 10:07:36 CET LOG: database system was shut down at 2014-02-24 10:07:30 CET
2014-02-24 10:07:36 CET LOG: checkpoint record is at 0/E9F073E0
2014-02-24 10:07:36 CET LOG: redo record is at 0/E9F073E0; undo record is at 0/0; shutdown TRUE
2014-02-24 10:07:36 CET LOG: next transaction ID: 0/7161900; next OID: 39134
2014-02-24 10:07:36 CET LOG: next MultiXactId: 1; next MultiXactOffset: 0
2014-02-24 10:07:36 CET LOG: database system is ready
2014-02-24 10:08:35 CET LOG: received fast shutdown request <--- here
2014-02-24 10:08:35 CET LOG: shutting down
2014-02-24 10:08:35 CET LOG: database system is shut down
2014-02-24 10:10:45 CET LOG: database system was shut down at 2014-02-24 10:08:35 CET
2014-02-24 10:10:45 CET LOG: checkpoint record is at 0/E9F07430
2014-02-24 10:10:45 CET LOG: redo record is at 0/E9F07430; undo record is at 0/0; shutdown TRUE
2014-02-24 10:10:45 CET LOG: next transaction ID: 0/7161950; next OID: 39134
2014-02-24 10:10:45 CET LOG: next MultiXactId: 1; next MultiXactOffset: 0
2014-02-24 10:10:45 CET LOG: database system is ready
I had set a higher log level to see all client statements.
Unfortunately, the log file shows that there were no running
statements when the fast shutdown was executed.
I never had this problem on AIX 5.3. Do you have any ideas?
So I have decided to give 9.2.7 a try. During make check I
found two failures, or to be more precise, two hangs.
I have changed parallel_schedule to run most tests sequentially.
The following tests show a problem:
1. problem
plpgsql.out <- this test hangs in statement_timeout.
...
end$$ language plpgsql;
set statement_timeout to 2000;
select blockme(); <- hang
2. problem
prepared_xacts.out <- this test hangs in statement_timeout.
...
-- pxtest3 should be locked because of the pending DROP
set statement_timeout to 2000;
SELECT * FROM pxtest3; <- hang
These tests do not move forward even if I wait 10 minutes.
So my conclusion is that the statement_timeout does not work as expected.
Bye
Rainer
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
Hello,
There is one update:
The tests plpgsql and prepared_xacts run without a problem on AIX 7.1
with PostgreSQL 8.2.11 and 8.2.23.
Bye
Rainer
On 24.02.2014 11:29, Rainer Tammer wrote:
[...]
Rainer Tammer <pgsql@spg.schulergroup.com> writes:
> I have compiled 8.2.11 (same as on the old server) and
> 8.2.23 (latest 8.2) on AIX 7.1 TL3 SP1 (latest level) with
> the IBM C/C++ V12 compiler.
> Unfortunately I see the following errors:
> 2014-02-24 10:07:30 CET LOG: received fast shutdown request <--- here
If this is the log of a "make check" run, that seems as-expected.
Otherwise, something is sending the postmaster process a SIGINT.
> So I have decided to give 9.2.7 a try. During the make check I have
> found two failures - to be more precise, two hangs.
Hm. Unfortunately, you're kind of on your own to debug this; AFAIK
there are no active Postgres developers who use AIX. It's been awhile
since there was an active AIX buildfarm machine either, so that it
would not exactly be astonishing to find that we'd inadvertently
broken something for that platform. (And I'm not sure there ever
was a buildfarm member running AIX 7.1 anyway; according to
http://buildfarm.postgresql.org/cgi-bin/show_members.pl
grebe was running 5.3 when last heard from, half a year ago.)
We're still willing to support AIX, but we can't do it without help
from users of that platform. If you send in a patch for whatever
is broken, we'll almost certainly accept it (in some form). But
it would be a good idea to set up a buildfarm animal so that any
future breakage gets detected in a more timely fashion. See
http://buildfarm.postgresql.org/index.html
regards, tom lane
On Mon, Feb 24, 2014 at 3:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hm. Unfortunately, you're kind of on your own to debug this; AFAIK
> there are no active Postgres developers who use AIX. It's been awhile
> since there was an active AIX buildfarm machine either, so that it
> would not exactly be astonishing to find that we'd inadvertently
> broken something for that platform. (And I'm not sure there ever
> was a buildfarm member running AIX 7.1 anyway; according to
> http://buildfarm.postgresql.org/cgi-bin/show_members.pl
> grebe was running 5.3 when last heard from, half a year ago.)
We retired the AIX boxes, and aren't adding more, so it's not
likely to be us that retrieve this support. :-(
There might be some elderly servers still kicking around, but
I imagine that new licenses would need to be purchased in
order to have AIX 7.x run on them, which might readily cost
more than the hardware's worth.
I was glad to help provide buildfarm support on AIX; it certainly
was helpful to get issues dealt with, and meant that pretty much
all "usual" Postgres functionality (including contrib modules)
worked well, which wasn't the case pre-buildfarm.
> We're still willing to support AIX, but we can't do it without help
> from users of that platform. If you send in a patch for whatever
> is broken, we'll almost certainly accept it (in some form). But
> it would be a good idea to set up a buildfarm animal so that any
> future breakage gets detected in a more timely fashion. See
> http://buildfarm.postgresql.org/index.html
Indeed, if someone's keen on having AIX supported, then it's
important to be able to spare enough resources for a buildfarm
node. It doesn't mean that you'll need to devote development
resources to deal with every problem that comes along; I
recall a number of cases where I saw email threads go past
where Tom (or someone else) was poking away at a problem
noticed because grebe "turned red."
Someone with useful contacts at IBM might ask if they have
a server hiding somewhere that could be used as a buildfarm
animal.
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"
Hello,
First of all: Thanks for the reply.
1. fast shutdown
The unexpected "LOG: received fast shutdown request" is happening
on an installed instance. I have found other articles on the web which
describe the same problem (other platform) - unfortunately there was
no real solution to it.
I am not too familiar with the PostgreSQL source code, but the only
location where this message is generated is:
---[postmaster.c]-------
...
/*
 * pmdie -- signal handler for processing various postmaster signals.
 */
static void
pmdie(SIGNAL_ARGS)
{
...
        case SIGINT:
            /*
             * Fast Shutdown:
             *
             * Abort all children with SIGTERM (rollback active transactions
             * and exit) and shut down when they are gone.
             */
            if (Shutdown >= FastShutdown)
                break;
            Shutdown = FastShutdown;
            ereport(LOG,
                    (errmsg("received fast shutdown request")));    <------ HERE
            if (DLGetHead(BackendList) || AutoVacPID != 0)
            {
                if (!FatalError)
                {
                    ereport(LOG,
                            (errmsg("aborting any active transactions")));
                    SignalChildren(SIGTERM);
                    if (AutoVacPID != 0)
                        signal_child(AutoVacPID, SIGTERM);
                    /* reaper() does the rest */
                }
                break;
            }
---[postmaster.c]-------
I will try to add some code to get the source of the signal.
Would this help?
2. hang during make check
AIX 7.1 Technology Level 3
PostgreSQL 8.4.20 -> make check does finish without hang
PostgreSQL 9.0.16 -> hang
PostgreSQL 9.2.7 -> hang
As far as I can see the hang is caused by the "set statement_timeout to 2000;"
statement. Where would be a good starting point to diagnose this problem?
Would this place be a good start?
I have already checked whether HAVE_SETSID is causing the problem - unfortunately
that's not the case.
---[src/backend/storage/lmgr/proc.c]------
....
bool
enable_sig_alarm(int delayms, bool is_statement_timeout)
{
    TimestampTz fin_time;
    struct itimerval timeval;
    if (is_statement_timeout)
    {
....
static bool
CheckStatementTimeout(void)
{
    TimestampTz now;
    if (!statement_timeout_active)
        return true;            /* do nothing if not active */
    now = GetCurrentTimestamp();
    if (now >= statement_fin_time)
    {
        /* Time to die */
        statement_timeout_active = false;
        cancel_from_timeout = true;
#ifdef HAVE_SETSID
        /* try to signal whole process group */
        kill(-MyProcPid, SIGINT);
#endif
        kill(MyProcPid, SIGINT);
---[src/backend/storage/lmgr/proc.c]------
AIX 6.1 Technology Level 6
PostgreSQL 9.2.7 -> make check does finish without a hang.
That is strange. I will try to get a box with the latest AIX 6.1
Technology Level to check whether the tests pass there, too.
3. Smoker
There are several possibilities to set up a smoker.
a.) IBM provides Power 7/7+ machines with Power Linux and AIX
6.1/7.1. As far as I know the access is free.
This is the URL:
http://www-304.ibm.com/partnerworld/wps/servlet/ContentHandler/stg_com_sys_power-development-platform
Maybe I can help with the setup.
b.) Maybe I can set up a smoker.
Is external access to the smoker required?
Bye
Rainer Tammer
On 24.02.2014 21:21, Tom Lane wrote:
[...]
Rainer Tammer <pgsql@spg.schulergroup.com> writes:
> 1. fast shutdown
> The unexpected "LOG: received fast shutdown request" is happening
> on an installed instance. I have found other articles on the WEB which
> describe the same problem (other platform) - unfortunately there was
> no real solution to it.
As far as we can tell, any SIGINT of the postmaster must be coming from
outside the Postgres code. There are several places where SIGINT is
generated internally, but I've just been through all of them again and
it's pretty nearly impossible to believe that they could target the
postmaster process rather than some child process. If you've got a
way to instrument it and find out where the signal came from (eg what
PID sent it), that would be interesting information.
> 2. hang during make check
> PostgreSQL 8.4.20 -> make check does finish without hang
> PostgreSQL 9.0.16 -> hang
> PostgreSQL 9.2.7 -> hang
Interesting, since AFAIR there was no major surgery on the timeout
code in 9.0. If you'd said 9.3 broke it, that wouldn't be so
surprising ...
> As far as I can see the hang is caused by the "set statement_timeout to
> 2000;"
> statement. Where would be a good start point to diagnose this problem??
Well, the point is that the timeout is failing to happen. Is the SIGALRM
signal being blocked? If it is delivered, why doesn't that spring the
process off its wait? Anyway, I see you already found enable_sig_alarm
and CheckStatementTimeout, so those are reasonable places to start
injecting some additional logging. You might also need to instrument
the backend SIGINT handler, StatementCancelHandler in postgres.c.
regards, tom lane
Hello,
I will try to get some debug code into the SIGINT handler.
In the worst case I will start a system trace.
In the meantime I have built more versions on AIX 6.1.
I can see no failure on AIX 6.1, including 9.2.7.
I have installed the same C/C++ compiler on the AIX 6.1 box
as I have on the AIX 7.1 box - still the same result. All tests are OK.
So the problem is not compiler dependent.
I am currently upgrading an AIX 7.1 test LPAR on a Power 5 box.
This way I can check if the problem is dependent on AIX 7.1
or Power 7+. (Some years ago there was a problem with
Java on the newer CPUs.)
What code path is executed if the timeout passes and
the signal is sent?
- Where exactly is the signal sent?
--> Is the signal really sent?
- Where is the first entry in the handler?
--> Do we receive the signal?
Bye
Rainer
P.S.: What do you think of a smoker on the IBM developer cloud?
If this is interesting I might organize the setup.
On 25.02.2014 17:35, Tom Lane wrote:
[...]
Rainer Tammer <pgsql@spg.schulergroup.com> writes:
> What code path is executed if the timeout passes and
> the signal is sent?
Well, the general idea (pre 9.3) is that at the start of the statement,
enable_sig_alarm calculates a future timeout instant and calls setitimer()
to schedule a SIGALRM signal then. When the signal is delivered,
CheckStatementTimeout should do kill(MyProcPid, SIGINT), which should
lead to ProcessInterrupts calling ereport(ERROR), which will longjmp
back to the process idle loop. We need to narrow down which of these
steps is failing to happen before we can speculate much on what's wrong.
There are scenarios in which the SIGINT handler proper won't think it's
safe to call ProcessInterrupts immediately, but will just set a flag
to make that happen later. That should not apply in these test cases,
though.
regards, tom lane
Rainer Tammer wrote:
> P.S.: What do you think of a smoker on the IBM developer cloud?
> If this is interesting I might organize the setup.
Please do get some Power buildfarm members set up if it's within your,
err, powers. See here for instructions:
http://wiki.postgresql.org/wiki/PostgreSQL_Buildfarm_Howto
If you can get machines for all architectures across all supported
versions of AIX, that'd be great. Since these Power machines are not
very common, perhaps it'd be good if you can get one running Linux too;
we only have PowerPC running NetBSD and Mac OS X, nothing on POWER7.
There's no need for anyone to access the machines. There's a push
script that's used to upload the test results.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hello,
OK, so SIGALRM will kill the worker process if the timer runs down.
tammer 7078048 10289298 0 18:43:45 pts/1 0:00 gmake -C src/test/regress check
tammer 9175106 12386340 0 18:45:31 pts/1 0:00 /daten/source/postgresql-9.2.7/src/test/regress/./tmp_check/install//usr/local/pgsql-9.2.7/bin/psql -X -a -q -d regression
tammer 10289298 20512896 0 18:43:39 pts/1 0:00 gmake check
tammer 12189838 20840526 0 18:44:37 - 0:00 postgres: autovacuum launcher process
tammer 12386340 7078048 0 18:43:45 pts/1 0:00 ../../../src/test/regress/pg_regress --inputdir=. --temp-install=./tmp_check --top-builddir=../../.. --dlpath=. --schedule=./parallel_schedule
tammer 13172918 20840526 0 18:44:37 - 0:00 postgres: wal writer process
tammer 16777336 20840526 0 18:44:37 - 0:00 postgres: checkpointer process
tammer 18874572 20840526 0 18:45:31 - 0:00 postgres: tammer regression [local] SELECT waiting <------ here
tammer 20512896 15270014 0 18:36:19 pts/1 0:00 -ksh
tammer 20644046 20840526 0 18:44:37 - 0:00 postgres: writer process
tammer 20840526 12386340 0 18:44:37 pts/1 0:00 /daten/source/postgresql-9.2.7/src/test/regress/./tmp_check/install//usr/local/pgsql-9.2.7/bin/postgres -D /daten/source/postgresql-9.2.7/src/test/regress/./tmp_check/data -F -c listen_addresses=
tammer 24182972 20840526 0 18:44:37 - 0:00 postgres: stats collector process
The worker is hanging here:
root@adsmsrv4 rc:0 # dbx -a 18874572
Waiting to attach to process 18874572 ...
Successfully attached to postgres.
warning: Directory containing postgres could not be determined.
Apply 'use' command to initialize source path.
Type 'help' for help.
reading symbolic information ...
stopped in semop at 0xd02f8df0 ($t1)
0xd02f8df0 (semop+0xb0) 80410014 lwz r2,0x14(r1)
(dbx) where
semop(??, ??, ??) at 0xd02f8df0
PGSemaphoreLock(0x32438750, 0x1000001) at 0x10060958
ProcSleep(0x20212d18, 0x20039140) at 0x10114ab8
WaitOnLock(0x20212d18, 0x201f74f0) at 0x101269c0
LockAcquireExtended(0x2ff1dc90, 0x1, 0x0, 0x0, 0x1000001) at 0x10128384
LockAcquire(0x2ff1dc90, 0x1, 0x0, 0x0) at 0x101284a0
LockRelationOid(0xa35c, 0x1) at 0x10173d50
RangeVarGetRelidExtended(0x2020f5d8, 0x1, 0x1000001, 0x0, 0x0, 0x0) at 0x1009dcf8
relation_openrv_extended(0x2020f5d8, 0x1, 0x1000001) at 0x1004e76c
heap_openrv_extended(0x2020f5d8, 0x1, 0x1000001) at 0x1004eb98
parserOpenTable(0x2020f6d8, 0x2020f5d8, 0x1) at 0x101abeb0
addRangeTableEntry(0x2020f6d8, 0x2020f5d8, 0x0, 0x1000001, 0x1000001) at 0x101ac534
transformTableEntry(0x2020f6d8, 0x2020f5d8) at 0x1027f42c
transformFromClauseItem(0x2020f6d8, 0x2020f5d8, 0x2ff1e05c, 0x2ff1e060, 0x2ff1e064, 0x2ff1e068) at 0x1027f528
transformFromClause(0x2020f6d8, 0x2020f610) at 0x102800c8
transformSelectStmt(0x2020f6d8, 0x2020f628) at 0x10313f90
transformStmt(0x2020f6d8, 0x2020f628) at 0x1031559c
transformTopLevelStmt(0x2020f6d8, 0x2020f628) at 0x10315788
parse_analyze(0x2020f628, 0x2020ec10, 0x0, 0x0) at 0x1031583c
pg_analyze_and_rewrite(0x2020f628, 0x2020ec10, 0x0, 0x0) at 0x10062528
exec_simple_query(0x2020ec10) at 0x10066b74
PostgresMain(0x1, 0x201f6048, 0x201f5ed8, 0x201f5ec8) at 0x10067e4c
BackendRun(0x20234ad8) at 0x10117e54
BackendStartup(0x20234ad8) at 0x10119618
ServerLoop() at 0x10119c08
PostmasterMain(0x6, 0x201f56c8) at 0x1011b65c
main(0x6, 0x201f56c8) at 0x10000b1c
So the SELECT is waiting for the lock. The SIGALRM handler should send a SIGINT
to this worker process to terminate the worker.
Bye
Rainer
On 25.02.2014 18:18, Tom Lane wrote:
[...]
Hello,
OK, if no access from the outside is needed I can probably run smoke
builds on a Power5 on AIX 5.3, 6.1 and 7.1 LPAR.
And maybe we can get a Power 7+ LPAR at IBM. I will try to get a contact
within IBM.
Bye
Rainer
P.S.: Power with AIX now has a market share of ~50% compared to all
other Unix systems.
On 25.02.2014 18:47, Alvaro Herrera wrote:
[...]
Rainer Tammer <pgsql@spg.schulergroup.com> writes:
> The worker is hanging here:
> (dbx) where
> semop(??, ??, ??) at 0xd02f8df0
> PGSemaphoreLock(0x32438750, 0x1000001) at 0x10060958
> ProcSleep(0x20212d18, 0x20039140) at 0x10114ab8
> WaitOnLock(0x20212d18, 0x201f74f0) at 0x101269c0
> LockAcquireExtended(0x2ff1dc90, 0x1, 0x0, 0x0, 0x1000001) at 0x10128384
> LockAcquire(0x2ff1dc90, 0x1, 0x0, 0x0) at 0x101284a0
> LockRelationOid(0xa35c, 0x1) at 0x10173d50
> RangeVarGetRelidExtended(0x2020f5d8, 0x1, 0x1000001, 0x0, 0x0, 0x0) at 0x1009dcf8
Pretty much as expected. So the question is why the signal isn't getting
serviced; semop() should be interruptible. Given that you found it works
again on a more recent AIX release, maybe that's an OS bug?
It'd be worth adding some elog printouts to try to confirm whether the
signal handlers are getting entered at all. It seems possible that the
SIGALRM handler is entered but then the nested SIGINT occurrence is
not serviced for some reason.
regards, tom lane
Hello,
One note: it works on AIX 6.1 but not on 7.1. So it could be an OS problem.
I have opened a support call at IBM. Maybe they have an idea.
The semop() should be interrupted by SIGINT, right?
Bye
Rainer
On 25.02.2014 19:26, Tom Lane wrote:
[...]
Rainer Tammer <pgsql@spg.schulergroup.com> writes:
> The semop() should be interrupted by SIGINT, right?
Yeah. Note that we're expecting the SIGINT handler to do a longjmp,
so that it doesn't matter whether or not the semop would choose to
resume waiting after a signal. But it has to execute the handler.
regards, tom lane
Hello,
sorry to bother you again...
There is a pg_sema.c in src/backend/port.
This is linked to sysv_sema.c; there is also a posix_sema.c.
How do you select the one or the other?
Bye
Rainer
On 25.02.2014 19:26, Tom Lane wrote:
[...]
Rainer Tammer <pgsql@spg.schulergroup.com> writes:
> There is a pg_sema.c in src/backend/port.
> This is linked to sysv_sema.c, there is also a posix_sema.c.
> How do you select the one or the other?
The configure script chooses which to use for a particular platform.
Perhaps it should be making a different choice for AIX?
regards, tom lane
Hello,
Just to clarify my picture:
backend process:
tammer 12255262 13107322 0 07:52:10 - 0:00 postgres: tammer regression [local] SELECT waiting
If I send a SIGINT (-2) to the backend process, would this
be equivalent to the SIGINT sent by the SIGALRM handler?
Bye
Rainer
On 25.02.2014 19:47, Tom Lane wrote:
[...]
Hello,
so far:
- Switching to POSIX semaphores does not solve the problem...
- It's not CPU dependent, the problem does show on Power 5 and Power 7+
- The problem does show on AIX 7.1 (at least on Technology Level 3)
- The problem does not show on AIX 6.1
- If I send a SIGINT manually to the backend process nothing happens.
My next steps:
- Instrument the SIGALRM handler
- Instrument the SIGINT handler
Bye
Rainer
On 25.02.2014 20:29, Tom Lane wrote:
[...]
Hello,
OK, at least the first stage is working...
src/backend/storage/lmgr/proc.c
...
static bool
CheckStatementTimeout(void)
{
    TimestampTz now;

    if (!statement_timeout_active)
        return true;    /* do nothing if not active */

    elog(FATAL, "enter: CheckStatementTimeout - after statement_timeout_active check");

    now = GetCurrentTimestamp();
...
gmake check: prepared_xacts.out
...
-- pxtest3 should be locked because of the pending DROP
set statement_timeout to 2000;
SELECT * FROM pxtest3;
FATAL: enter: CheckStatementTimeout - after statement_timeout_active check
LINE 1: SELECT * FROM pxtest3;
^
FATAL: enter: CheckStatementTimeout - after statement_timeout_active check
FATAL: enter: CheckStatementTimeout - after statement_timeout_active check
LINE 1: SELECT * FROM pxtest3;
^
FATAL: enter: CheckStatementTimeout - after statement_timeout_active check
connection to server was lost
....
Next I moved the output further down in CheckStatementTimeout, to just before the kill():
src/backend/storage/lmgr/proc.c
...
static bool
CheckStatementTimeout(void)
{
    TimestampTz now;

    if (!statement_timeout_active)
        return true;    /* do nothing if not active */

    now = GetCurrentTimestamp();

    if (now >= statement_fin_time)
    {
        /* Time to die */
        statement_timeout_active = false;
        cancel_from_timeout = true;
        elog(FATAL, "enter: CheckStatementTimeout - next statement send kill");
#ifdef HAVE_SETSID
        /* try to signal whole process group */
        kill(-MyProcPid, SIGINT);
#endif
        kill(MyProcPid, SIGINT);
    }
...
gmake check: prepared_xacts.out
...
-- pxtest3 should be locked because of the pending DROP
set statement_timeout to 2000;
SELECT * FROM pxtest3;
FATAL: enter: CheckStatementTimeout - next statement send kill
LINE 1: SELECT * FROM pxtest3;
^
FATAL: enter: CheckStatementTimeout - next statement send kill
LINE 1: SELECT * FROM pxtest3;
^
connection to server was lost
So the next step is that the backend process (the one hanging in
the SELECT) should receive the SIGINT, correct?
Bye
Rainer
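The chain being traced here (timer fires, the SIGALRM handler sends SIGINT to its own process, the SIGINT handler runs) can also be reproduced outside the backend. The sketch below is an assumption-laden illustration rather than PostgreSQL code: the handler and function names are invented, and tracing uses plain write() on stderr because write() is async-signal-safe, whereas a FATAL elog() ends the session, as the "connection to server was lost" lines in the output above show.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t alarm_seen = 0;
static volatile sig_atomic_t sigint_seen = 0;

/* Trace a string literal with write(), which is async-signal-safe. */
#define TRACE(lit) \
    do { ssize_t rc_ = write(STDERR_FILENO, (lit), sizeof(lit) - 1); (void) rc_; } while (0)

/* Stand-in for the timeout check: note the event, then signal ourselves. */
static void demo_timeout_handler(int signo)
{
    int save_errno = errno;

    (void) signo;
    alarm_seen = 1;
    TRACE("timeout handler: sending SIGINT\n");
    kill(getpid(), SIGINT);
    errno = save_errno;
}

/* Stand-in for the cancel handler: only records that it was entered. */
static void demo_cancel_handler(int signo)
{
    int save_errno = errno;

    (void) signo;
    sigint_seen = 1;
    TRACE("cancel handler: entered\n");
    errno = save_errno;
}

/*
 * Arms a one-second alarm and waits for it.
 * Returns 1 if the SIGINT sent from the SIGALRM handler reached the
 * SIGINT handler, 0 if it did not, -1 if the setup failed.
 */
int timeout_delivers_sigint(void)
{
    struct sigaction sa;
    sigset_t block_alarm, old_mask, wait_mask;

    memset(&sa, 0, sizeof(sa));
    sigemptyset(&sa.sa_mask);
    sa.sa_handler = demo_cancel_handler;
    if (sigaction(SIGINT, &sa, NULL) != 0)
        return -1;
    sa.sa_handler = demo_timeout_handler;
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;

    alarm_seen = 0;
    sigint_seen = 0;

    /* Block SIGALRM until sigsuspend() so the wakeup cannot be missed. */
    sigemptyset(&block_alarm);
    sigaddset(&block_alarm, SIGALRM);
    if (sigprocmask(SIG_BLOCK, &block_alarm, &old_mask) != 0)
        return -1;
    wait_mask = old_mask;
    sigdelset(&wait_mask, SIGALRM);
    sigdelset(&wait_mask, SIGINT);

    alarm(1);
    while (!alarm_seen)
        sigsuspend(&wait_mask);

    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    return sigint_seen ? 1 : 0;
}
```

On a system where signal delivery works as expected, both trace lines should appear on stderr and the function should return 1.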
Hello,
So we are getting closer (if I did not instrument the wrong code):
src/backend/tcop/postgres.c
int
PostgresMain(int argc, char *argv[],
             const char *dbname,
             const char *username)
{
...
    if (am_walsender)
        WalSndSignals();
    else
    {
        pqsignal(SIGHUP, SigHupHandler);    /* set flag to read config file */
----> register handler here <-------------------------------------
        pqsignal(SIGINT, StatementCancelHandler);   /* cancel current query */
----> register handler here <-------------------------------------
        pqsignal(SIGTERM, die);     /* cancel current query and exit */
...
void
StatementCancelHandler(SIGNAL_ARGS)
{
    int save_errno = errno;

    elog(WARNING, "StatementCancelHandler() - entered");

    /*
     * Don't joggle the elbow of proc_exit
     */
    if (!proc_exit_inprogress)
    {
...
This part is never reached. Does this mean that the SIGINT got lost?
Or am I searching in the wrong place?
Your help is much appreciated.
Bye
Rainer
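A handler that is never entered does not necessarily mean the signal was lost; a signal that is blocked stays pending until it is unblocked. The sketch below shows that distinction with standard POSIX calls only. It is a standalone illustration under the assumption that blocking might be involved, which nothing in the output above confirms, and the names demo_pending_handler and blocked_sigint_stays_pending are hypothetical.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif
#include <signal.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t pending_handler_ran = 0;

static void demo_pending_handler(int signo)
{
    (void) signo;
    pending_handler_ran = 1;
}

/*
 * Sends SIGINT to the current process while SIGINT is blocked.
 * Returns 1 if the handler did not run while blocked, the signal was
 * reported by sigpending(), and the handler ran once it was unblocked.
 * Returns 0 if any of these did not hold, -1 if the setup failed.
 */
int blocked_sigint_stays_pending(void)
{
    struct sigaction sa, old_sa;
    sigset_t block, old_mask, pending;
    int ran_while_blocked, was_pending, ran_after_unblock;

    memset(&sa, 0, sizeof(sa));
    sigemptyset(&sa.sa_mask);
    sa.sa_handler = demo_pending_handler;
    if (sigaction(SIGINT, &sa, &old_sa) != 0)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGINT);
    if (sigprocmask(SIG_BLOCK, &block, &old_mask) != 0)
        return -1;

    pending_handler_ran = 0;
    kill(getpid(), SIGINT);             /* queued, not delivered */
    ran_while_blocked = pending_handler_ran;

    sigemptyset(&pending);
    sigpending(&pending);               /* ask what is waiting */
    was_pending = sigismember(&pending, SIGINT);

    sigprocmask(SIG_UNBLOCK, &block, NULL);     /* delivery happens here */
    ran_after_unblock = pending_handler_ran;

    /* Put the original mask and disposition back. */
    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    sigaction(SIGINT, &old_sa, NULL);

    return (!ran_while_blocked && was_pending == 1 && ran_after_unblock) ? 1 : 0;
}
```

Querying sigpending() in the hanging backend (for example from a debugger) would be one way to tell a pending SIGINT apart from one that never arrived.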