isolation test deadlocking on buildfarm member coypu
Hi,
Isolation tests seem to deadlock on buildfarm member coypu (NetBSD/powerpc 5.1).
Here are the processes for one deadlock (stuck for several days):
pgbuildfarm 2405 0.0 1.2 26948 7916 ? Is Wed02AM 0:00.25 postgres: pgbuildfarm isolationtest [local] INSERT waiting
pgbuildfarm 6559 0.0 0.4 10288 2684 ? I Wed02AM 0:00.05 ./pg_isolation_regress --psqldir=/home/pgbuildfarm/workdir/HEAD/inst/bin --inputdir=. --schedule=./isolation_schedule (pg_isolation_reg)
pgbuildfarm 8296 0.0 1.8 25924 11880 ? Ss Wed02AM 0:59.79 postgres: writer process
pgbuildfarm 10630 0.0 0.4 16212 2916 ? Ss Wed02AM 1:43.16 postgres: stats collector process
pgbuildfarm 11814 0.0 0.7 9084 4408 ? I Wed01AM 0:00.39 /usr/pkg/bin/perl ./run_branches.pl --run-one
pgbuildfarm 14785 0.0 0.2 3344 1276 ? I Wed02AM 0:00.03 gmake NO_LOCALE=1 installcheck
pgbuildfarm 15353 0.0 0.8 26948 4960 ? Ss Wed02AM 1:10.55 postgres: autovacuum launcher process
pgbuildfarm 16619 0.0 1.8 15992 11512 ? I Wed01AM 0:04.90 /usr/pkg/bin/perl ./run_build.pl --config build-farm.conf HEAD
pgbuildfarm 17532 0.0 0.5 25924 3052 ? Ss Wed02AM 0:34.49 postgres: wal writer process
pgbuildfarm 18073 0.0 1.0 25924 6844 ? I Wed02AM 0:19.83 /home/pgbuildfarm/workdir/HEAD/inst/bin/postgres -D data-C
pgbuildfarm 21310 0.0 1.2 26948 7948 ? Is Wed02AM 0:00.06 postgres: pgbuildfarm isolationtest [local] idle in transaction
pgbuildfarm 24137 0.0 0.2 3456 1064 ? Is Wed01AM 0:00.01 /bin/sh -c cd /home/pgbuildfarm/client-code && ./run_branches.pl --run-one
pgbuildfarm 26932 0.0 0.2 3456 1072 ? I Wed02AM 0:00.01 sh -c cd pgsql.16619/src/test/isolation && gmake NO_LOCALE=1 installcheck 2>&1
pgbuildfarm 28430 0.0 1.3 26948 8816 ? Ss Wed02AM 408:43.91 postgres: pgbuildfarm isolationtest [local] idle
pgbuildfarm 28809 0.0 0.4 10476 2844 ? R Wed02AM 27:48.24 ./isolationtester dbname=isolationtest
I killed one of the postgres processes, which led to a failure of the buildfarm perl script (probably due to the 2GB+ log file) and left the postgres cluster behind (which I had to clean up manually).
I've kept the source and install directories in case they're useful for investigation.
Here are the processes for a second deadlock (stuck for several hours):
pgbuildfarm 1423 0.0 0.4 16212 2888 ? Ss 2:11PM 0:04.07 postgres: stats collector process
pgbuildfarm 1604 0.0 1.2 26948 7724 ? Is 2:13PM 0:00.05 postgres: pgbuildfarm isolationtest [local] idle in transaction
pgbuildfarm 2380 0.0 0.4 10288 2624 ? I 2:12PM 0:00.05 ./pg_isolation_regress --psqldir=/home/pgbuildfarm/workdir/HEAD/inst/bin --inputdir=. --schedule=./isolation_schedule (pg_isolation_reg)
pgbuildfarm 3156 0.0 0.8 26948 4920 ? Ss 2:11PM 0:02.71 postgres: autovacuum launcher process
pgbuildfarm 3253 0.0 0.5 25924 3044 ? Ss 2:11PM 0:01.65 postgres: wal writer process
pgbuildfarm 9889 0.0 1.8 25924 11856 ? Ss 2:11PM 0:04.87 postgres: writer process
pgbuildfarm 11359 0.0 1.4 26948 9032 ? Ss 2:13PM 15:16.63 postgres: pgbuildfarm isolationtest [local] idle
pgbuildfarm 13738 0.0 1.2 26948 7860 ? Ss 2:13PM 0:00.06 postgres: pgbuildfarm isolationtest [local] UPDATE waiting
pgbuildfarm 14031 0.0 1.0 25924 6800 ? S 2:11PM 0:01.82 /home/pgbuildfarm/workdir/HEAD/inst/bin/postgres -D data-C
pgbuildfarm 16555 0.0 0.4 10476 2764 ? S 2:13PM 1:01.77 ./isolationtester dbname=isolationtest
pgbuildfarm 18038 0.0 0.2 3456 1008 ? I 2:11PM 0:00.01 sh -c cd pgsql.20707/src/test/isolation && gmake NO_LOCALE=1 installcheck 2>&1
pgbuildfarm 19057 0.0 0.2 3456 1000 ? Is 1:30PM 0:00.01 /bin/sh -c cd /home/pgbuildfarm/client-code && ./run_branches.pl --run-one
pgbuildfarm 20707 0.0 1.8 15992 11460 ? I 1:30PM 0:04.92 /usr/pkg/bin/perl ./run_build.pl --config build-farm.conf HEAD
pgbuildfarm 21749 0.0 0.2 3344 1248 ? I 2:11PM 0:00.04 gmake NO_LOCALE=1 installcheck
pgbuildfarm 23054 0.0 0.7 9084 4352 ? I 1:30PM 0:00.38 /usr/pkg/bin/perl ./run_branches.pl --run-one
pgbuildfarm 2125 0.0 0.0 2652 0 ttyp0 R+ 5:09PM 0:00.00 grep pgbuildf (bash)
pgbuildfarm 9930 0.0 0.3 2652 2072 ttyp0 S 5:03PM 0:00.09 bash
pgbuildfarm 29194 0.0 0.1 3552 872 ttyp0 O+ 5:09PM 0:00.01 ps -aux
pgbuildfarm@cube:workdir$date
Sat Jul 16 17:10:13 CEST 2011
The log fills with:
[4e21807d.2c5f:1060074] LOG: execute isolationtester_waiting: SELECT 1 WHERE pg_stat_get_backend_waiting($1)
[4e21807d.2c5f:1060075] DETAIL: parameters: $1 = '5'
[4e21807d.2c5f:1060076] LOG: execute isolationtester_waiting: SELECT 1 WHERE pg_stat_get_backend_waiting($1)
[4e21807d.2c5f:1060077] DETAIL: parameters: $1 = '5'
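For context: isolationtester_waiting is the prepared statement the test driver polls to decide whether a session is blocked, and pg_stat_get_backend_waiting() takes a backend slot number (as returned by pg_stat_get_backend_idset()), not a PID. A hand-run query over the same functions, purely as an illustration and only valid on servers of this vintage (pg_stat_get_backend_waiting() was later removed), might look like:
SELECT s.backendid,
       pg_stat_get_backend_pid(s.backendid) AS pid
FROM (SELECT pg_stat_get_backend_idset() AS backendid) AS s
WHERE pg_stat_get_backend_waiting(s.backendid);  -- true while that backend is waiting on a lock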
I killed process 1604 (postgres: pgbuildfarm isolationtest [local] idle in transaction). Again, the buildfarm perl script kicked into action. See http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=coypu&dt=2011-07-16%2011%3A30%3A02 for the logs.
Regards,
Rémi Zara
On Sat, Jul 16, 2011 at 05:50:45PM +0200, Rémi Zara wrote:
Isolation tests seem to deadlock on buildfarm member coypu (NetBSD/powerpc 5.1).
Thanks for the report and detailed analysis. I believe the patch here will fix
the problem:
http://archives.postgresql.org/message-id/20110716171121.GB2047@tornado.leadboat.com
If you're up for testing it locally, I'd be interested to hear how it works out.
Thanks,
nm
--
Noah Misch http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Excerpts from Noah Misch's message of Sat Jul 16 13:25:03 -0400 2011:
On Sat, Jul 16, 2011 at 05:50:45PM +0200, Rémi Zara wrote:
Isolation tests seem to deadlock on buildfarm member coypu (NetBSD/powerpc 5.1).
Thanks for the report and detailed analysis. I believe the patch here will fix
the problem:
http://archives.postgresql.org/message-id/20110716171121.GB2047@tornado.leadboat.com
If you're up for testing it locally, I'd be interested to hear how it works out.
Rémi,
I have committed Noah's patch, so please unblock coypu and locust so we
can see whether they are OK now.
Thanks,
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
After committing Noah's patch to fix the isolation tests, there have
been two more failures on Rémi's machines pika and coypu:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pika&dt=2011-07-24%2006%3A45%3A45
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=coypu&dt=2011-07-23%2021%3A50%3A54
In both cases, the failure is identical in fk-deadlock2:
================== pgsql.20950/src/test/isolation/regression.diffs ===================
*** /home/pgbuildfarm/workdir/HEAD/pgsql.20950/src/test/isolation/expected/fk-deadlock2.out Sun Jul 24 08:46:44 2011
--- /home/pgbuildfarm/workdir/HEAD/pgsql.20950/src/test/isolation/results/fk-deadlock2.out Sun Jul 24 15:11:42 2011
***************
*** 22,29 ****
step s2u1: UPDATE B SET Col2 = 1 WHERE BID = 2;
step s1u2: UPDATE B SET Col2 = 1 WHERE BID = 2; <waiting ...>
step s2u2: UPDATE B SET Col2 = 1 WHERE BID = 2;
- step s1u2: <... completed>
ERROR: deadlock detected
step s1c: COMMIT;
step s2c: COMMIT;
--- 22,29 ----
step s2u1: UPDATE B SET Col2 = 1 WHERE BID = 2;
step s1u2: UPDATE B SET Col2 = 1 WHERE BID = 2; <waiting ...>
step s2u2: UPDATE B SET Col2 = 1 WHERE BID = 2;
ERROR: deadlock detected
+ step s1u2: <... completed>
step s1c: COMMIT;
step s2c: COMMIT;
I think the only explanation necessary for this is that one process
reports its status before the other one. I think it would be enough to
add another variant of the expected file to fix this problem, but I
don't quite want to do that because we already have three of them, and I
think we would need to add one per existing expected file, so we'd end up
with six expected files, which would be a pain to work with.
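To make the race concrete, here is the general shape of a two-session deadlock (illustrative statements only; the real fk-deadlock2 spec uses a foreign-key setup and differs in detail):
BEGIN;                                  -- session 1
UPDATE b SET col2 = 1 WHERE bid = 1;    -- session 1
BEGIN;                                  -- session 2
UPDATE b SET col2 = 1 WHERE bid = 2;    -- session 2
UPDATE b SET col2 = 1 WHERE bid = 2;    -- session 1: blocks behind session 2
UPDATE b SET col2 = 1 WHERE bid = 1;    -- session 2: blocks behind session 1, completing the deadlock
Whichever backend's deadlock_timeout elapses first aborts with "ERROR: deadlock detected", releasing its locks; the other session's blocked statement then completes, and isolationtester can observe and report those two events in either order, which is all the diff above shows.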
Not quite sure what to do with this.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Tue, Jul 26, 2011 at 05:04:28PM -0400, Alvaro Herrera wrote:
I think the only explanation necessary for this is that one process
reports its status before the other one. I think it would be enough to
add another variant of the expected file to fix this problem, but I
don't quite want to do that because we already have three of them, and I
think we would need to add one per existing expected file, so we'd end up
with six expected files, which would be a pain to work with.
To really cover the problem in this way, we would need 16*3 variations covering
every permutation of the deadlocks detected when s1 runs the first command. I
see these other options:
1. Keep increasing the s1 deadlock_timeout until we stop getting these. This is
simple, but it proportionally slows the test suite for everyone. No value will
ever be guaranteed sufficient. (A sketch of the per-session setting follows these options.)
2. Post-process the output to ascribe the deadlock detections in a standard way
before comparing with expected output. This would also let us drop
deadlock_timeout arbitrarily low, giving a rapid test run.
3. Implement actual deadlock priorities, per discussion at
http://archives.postgresql.org/message-id/AANLkTimAqFzKV4Sc1DScruFft_Be78Y-oWPeuURCDnjR@mail.gmail.com
This is much more work, but it would let us drop deadlock_timeout arbitrarily
low and still get consistent results from the start.
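For option 1, the change implied is a per-session setting in the spec's setup; the placement and value below are illustrative assumptions, not the actual spec contents:
-- in session s1's setup block, something along the lines of
SET deadlock_timeout = '100s';
-- the larger this is relative to s2's timeout, the less likely s1's own
-- deadlock check fires first; but, as noted above, no value is guaranteed
-- sufficient.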
--
Noah Misch http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services