Lots of stuck queries after upgrade to 9.4

Started by Spiros Ioannouover 10 years ago25 messagesgeneral
Jump to latest
#1Spiros Ioannou
sivann@inaccess.com

We upgraded to 9.4.4 from 8.4 We have tried 9.3.x in the past with success.
Unfortunately in 9.4 gets stuck 2-3 times a week.

Lots of queries seem stuck, that is they have waiting:false. Queries can be
COMMIT or INSERT.
No blocking locks seem to exist.
strace reveals that all stuck processes they are in semop
semop(2064417, {{0, -1, 0}}, 1^C <unfinished ...>

ipcs reveals no shortage of semaphores..

Please any ideas very welcome, our service now in a very sad state..

*Spiros Ioannou IT Manager, inAccesswww.inaccess.com
<http://www.inaccess.com&gt;M: +30 6973-903808T: +30 210-6802-358*

#2Spiros Ioannou
sivann@inaccess.com
In reply to: Spiros Ioannou (#1)
Re: Lots of stuck queries after upgrade to 9.4

Os is debian 6

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Spiros Ioannou (#1)
Re: Lots of stuck queries after upgrade to 9.4

Spiros Ioannou <sivann@inaccess.com> writes:

We upgraded to 9.4.4 from 8.4 We have tried 9.3.x in the past with success.
Unfortunately in 9.4 gets stuck 2-3 times a week.

Lots of queries seem stuck, that is they have waiting:false. Queries can be
COMMIT or INSERT.
No blocking locks seem to exist.
strace reveals that all stuck processes they are in semop
semop(2064417, {{0, -1, 0}}, 1^C <unfinished ...>

Stack traces of the stuck processes might help determine what's going on.

https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

Please try to capture traces for both the COMMIT and INSERT cases,
as they're almost certainly different.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#4Spiros Ioannou
sivann@inaccess.com
In reply to: Tom Lane (#3)
Re: Lots of stuck queries after upgrade to 9.4

Hi Tom,
thank you for your input. The DB was stuck again, I attach all logs and
stack traces.

A stack trace from a COMMIT, an INSERT, an UPDATE, the wal writer, the
writer, and a sequence.

Stracing the commit was stuck at: semop(3145761, {{12, -1, 0}}, 1

Please tell me what else could I do to help debugging.

Is there a way to downgrade postgresql ?

*Spiros Ioannou IT Manager, inAccesswww.inaccess.com
<http://www.inaccess.com&gt;M: +30 6973-903808T: +30 210-6802-358*

On 20 July 2015 at 01:01, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Show quoted text

Spiros Ioannou <sivann@inaccess.com> writes:

We upgraded to 9.4.4 from 8.4 We have tried 9.3.x in the past with

success.

Unfortunately in 9.4 gets stuck 2-3 times a week.

Lots of queries seem stuck, that is they have waiting:false. Queries can

be

COMMIT or INSERT.
No blocking locks seem to exist.
strace reveals that all stuck processes they are in semop
semop(2064417, {{0, -1, 0}}, 1^C <unfinished ...>

Stack traces of the stuck processes might help determine what's going on.

https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

Please try to capture traces for both the COMMIT and INSERT cases,
as they're almost certainly different.

regards, tom lane

Attachments:

db9lock.tar.gzapplication/x-gzip; name=db9lock.tar.gzDownload
#5Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Spiros Ioannou (#4)
Re: Lots of stuck queries after upgrade to 9.4

Spiros Ioannou wrote:

Hi Tom,
thank you for your input. The DB was stuck again, I attach all logs and
stack traces.

A stack trace from a COMMIT, an INSERT, an UPDATE, the wal writer, the
writer, and a sequence.

Stracing the commit was stuck at: semop(3145761, {{12, -1, 0}}, 1

Hmm, isn't this an indication of problems in the WAL insertion logic?
Looks to me like there are deadlocks using lwlocks.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#6Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#5)
Re: Lots of stuck queries after upgrade to 9.4

On 2015-07-20 13:06:51 +0200, Alvaro Herrera wrote:

Spiros Ioannou wrote:

Hi Tom,
thank you for your input. The DB was stuck again, I attach all logs and
stack traces.

A stack trace from a COMMIT, an INSERT, an UPDATE, the wal writer, the
writer, and a sequence.

Stracing the commit was stuck at: semop(3145761, {{12, -1, 0}}, 1

Hmm, isn't this an indication of problems in the WAL insertion logic?
Looks to me like there are deadlocks using lwlocks.

Hm. The problem seems to be the WaitXLogInsertionsToFinish() call in
XLogFlush().

Spiros, I guess you have commit_delay enabled? If so, does disabling it
"fix" the issue?

Regards,

Andres

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#7Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#6)
Re: Lots of stuck queries after upgrade to 9.4

On 2015-07-20 13:22:42 +0200, Andres Freund wrote:

Hm. The problem seems to be the WaitXLogInsertionsToFinish() call in
XLogFlush().

These are the relevant stack traces:
db9lock/debuglog-commit.txt
#2 0x00007f7405bd44f4 in LWLockWaitForVar (l=0x7f70f2ab6680, valptr=0x7f70f2ab66a0, oldval=<optimized out>, newval=0xffffffffffffffff) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:1011
#3 0x00007f7405a0d3e6 in WaitXLogInsertionsToFinish (upto=121713318915952) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1755
#4 0x00007f7405a0e1d3 in XLogFlush (record=121713318911056) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:2849

db9lock/debuglog-insert-8276.txt
#1 0x00007f7405b77d91 in PGSemaphoreLock (sema=0x7f73ff6531d0, interruptOK=0 '\000') at pg_sema.c:421
#2 0x00007f7405bd4849 in LWLockAcquireCommon (val=<optimized out>, valptr=<optimized out>, mode=<optimized out>, l=<optimized out>) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:626
#3 LWLockAcquire (l=0x7f70ecaaa1a0, mode=LW_EXCLUSIVE) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:467
#4 0x00007f7405a0dcca in AdvanceXLInsertBuffer (upto=<optimized out>, opportunistic=<optimized out>) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:2161
#5 0x00007f7405a0e301 in GetXLogBuffer (ptr=121713318928384) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1848
#6 0x00007f7405a0e9c9 in CopyXLogRecordToWAL (EndPos=<optimized out>, StartPos=<optimized out>, rdata=0x7ffff1c21b90, isLogSwitch=<optimized out>, write_len=<optimized out>) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1494
#7 XLogInsert (rmid=<optimized out>, info=<optimized out>, rdata=<optimized out>) at /tmp/buildd/postgre

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#8Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#7)
Re: Lots of stuck queries after upgrade to 9.4

Heikki,

On 2015-07-20 13:27:12 +0200, Andres Freund wrote:

On 2015-07-20 13:22:42 +0200, Andres Freund wrote:

Hm. The problem seems to be the WaitXLogInsertionsToFinish() call in
XLogFlush().

These are the relevant stack traces:
db9lock/debuglog-commit.txt
#2 0x00007f7405bd44f4 in LWLockWaitForVar (l=0x7f70f2ab6680, valptr=0x7f70f2ab66a0, oldval=<optimized out>, newval=0xffffffffffffffff) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:1011
#3 0x00007f7405a0d3e6 in WaitXLogInsertionsToFinish (upto=121713318915952) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1755
#4 0x00007f7405a0e1d3 in XLogFlush (record=121713318911056) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:2849

db9lock/debuglog-insert-8276.txt
#1 0x00007f7405b77d91 in PGSemaphoreLock (sema=0x7f73ff6531d0, interruptOK=0 '\000') at pg_sema.c:421
#2 0x00007f7405bd4849 in LWLockAcquireCommon (val=<optimized out>, valptr=<optimized out>, mode=<optimized out>, l=<optimized out>) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:626
#3 LWLockAcquire (l=0x7f70ecaaa1a0, mode=LW_EXCLUSIVE) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:467
#4 0x00007f7405a0dcca in AdvanceXLInsertBuffer (upto=<optimized out>, opportunistic=<optimized out>) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:2161
#5 0x00007f7405a0e301 in GetXLogBuffer (ptr=121713318928384) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1848
#6 0x00007f7405a0e9c9 in CopyXLogRecordToWAL (EndPos=<optimized out>, StartPos=<optimized out>, rdata=0x7ffff1c21b90, isLogSwitch=<optimized out>, write_len=<optimized out>) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1494
#7 XLogInsert (rmid=<optimized out>, info=<optimized out>, rdata=<optimized out>) at /tmp/buildd/postgre

XLogFlush() has the following comment:
/*
* Re-check how far we can now flush the WAL. It's generally not
* safe to call WaitXLogInsertionsToFinish while holding
* WALWriteLock, because an in-progress insertion might need to
* also grab WALWriteLock to make progress. But we know that all
* the insertions up to insertpos have already finished, because
* that's what the earlier WaitXLogInsertionsToFinish() returned.
* We're only calling it again to allow insertpos to be moved
* further forward, not to actually wait for anyone.
*/
insertpos = WaitXLogInsertionsToFinish(insertpos);

but I don't think that's valid reasoning. WaitXLogInsertionsToFinish()
calls LWLockWaitForVar(oldval = InvalidXLogRecPtr), which will block if
there's a exlusive locker and some backend doesn't yet have set
initializedUpto. Which seems like a ossible state?

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#9Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#8)
Re: Lots of stuck queries after upgrade to 9.4

On 07/20/2015 03:01 PM, Andres Freund wrote:

Heikki,

On 2015-07-20 13:27:12 +0200, Andres Freund wrote:

On 2015-07-20 13:22:42 +0200, Andres Freund wrote:

Hm. The problem seems to be the WaitXLogInsertionsToFinish() call in
XLogFlush().

These are the relevant stack traces:
db9lock/debuglog-commit.txt
#2 0x00007f7405bd44f4 in LWLockWaitForVar (l=0x7f70f2ab6680, valptr=0x7f70f2ab66a0, oldval=<optimized out>, newval=0xffffffffffffffff) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:1011
#3 0x00007f7405a0d3e6 in WaitXLogInsertionsToFinish (upto=121713318915952) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1755
#4 0x00007f7405a0e1d3 in XLogFlush (record=121713318911056) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:2849

db9lock/debuglog-insert-8276.txt
#1 0x00007f7405b77d91 in PGSemaphoreLock (sema=0x7f73ff6531d0, interruptOK=0 '\000') at pg_sema.c:421
#2 0x00007f7405bd4849 in LWLockAcquireCommon (val=<optimized out>, valptr=<optimized out>, mode=<optimized out>, l=<optimized out>) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:626
#3 LWLockAcquire (l=0x7f70ecaaa1a0, mode=LW_EXCLUSIVE) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:467
#4 0x00007f7405a0dcca in AdvanceXLInsertBuffer (upto=<optimized out>, opportunistic=<optimized out>) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:2161
#5 0x00007f7405a0e301 in GetXLogBuffer (ptr=121713318928384) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1848
#6 0x00007f7405a0e9c9 in CopyXLogRecordToWAL (EndPos=<optimized out>, StartPos=<optimized out>, rdata=0x7ffff1c21b90, isLogSwitch=<optimized out>, write_len=<optimized out>) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1494
#7 XLogInsert (rmid=<optimized out>, info=<optimized out>, rdata=<optimized out>) at /tmp/buildd/postgre

XLogFlush() has the following comment:
/*
* Re-check how far we can now flush the WAL. It's generally not
* safe to call WaitXLogInsertionsToFinish while holding
* WALWriteLock, because an in-progress insertion might need to
* also grab WALWriteLock to make progress. But we know that all
* the insertions up to insertpos have already finished, because
* that's what the earlier WaitXLogInsertionsToFinish() returned.
* We're only calling it again to allow insertpos to be moved
* further forward, not to actually wait for anyone.
*/
insertpos = WaitXLogInsertionsToFinish(insertpos);

but I don't think that's valid reasoning. WaitXLogInsertionsToFinish()
calls LWLockWaitForVar(oldval = InvalidXLogRecPtr), which will block if
there's a exlusive locker and some backend doesn't yet have set
initializedUpto. Which seems like a ossible state?

A backend always updates its insert position before sleeping/acquiring
another lock, by calling WALInsertLockUpdateInsertingAt. So even though
another backend might indeed be in the
initializedUpto==InvalidXlogRecPtr state, it will get out of that state
before either by releasing the lock or updating initializedUpto, before
it will in turn do anything that might deadlock.

Clearly there's *something* wrong here, though, given the bug report...

- Heikki

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#10Spiros Ioannou
sivann@inaccess.com
In reply to: Heikki Linnakangas (#9)
Re: Lots of stuck queries after upgrade to 9.4

Happened again, another backtrace from a COMMIT process. I changed the
commit_delay to 0 (it was 4000 to help with our storage) and will report
back.

Already logging to debuglog.txt.
#0 0x00007f47b0789ec7 in semop () from /lib/libc.so.6
#1 0x00007f47b2513d91 in PGSemaphoreLock (sema=0x7f47abfe7ed0,
interruptOK=0 '\000') at pg_sema.c:421
#2 0x00007f47b25704f4 in LWLockWaitForVar (l=0x7f449f452680,
valptr=0x7f449f4526a0, oldval=<optimized out>, newval=0xffffffffffffffff)
at
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:1011
#3 0x00007f47b23a93e6 in WaitXLogInsertionsToFinish (upto=121780827906336)
at
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1755
#4 0x00007f47b23aa1d3 in XLogFlush (record=121780827898328) at
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:2849
#5 0x00007f47b2398f17 in RecordTransactionCommit () at
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xact.c:1196
#6 0x00007f47b239921b in CommitTransaction () at
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xact.c:1953
#7 0x00007f47b239a8e5 in CommitTransactionCommand () at
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xact.c:2732
#8 0x00007f47b257c105 in finish_xact_command () at
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/tcop/postgres.c:2437
#9 0x00007f47b2580a10 in exec_execute_message (max_rows=<optimized out>,
portal_name=<optimized out>) at
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/tcop/postgres.c:1974
#10 PostgresMain (argc=<optimized out>, argv=<optimized out>,
dbname=0x7f47b49b5730 "ifms_db", username=<optimized out>) at
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/tcop/postgres.c:4137
#11 0x00007f47b25280f5 in BackendRun (port=<optimized out>) at
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/postmaster/postmaster.c:4164
#12 BackendStartup (port=<optimized out>) at
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/postmaster/postmaster.c:3829
#13 ServerLoop () at
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/postmaster/postmaster.c:1597
#14 PostmasterMain (argc=<optimized out>, argv=<optimized out>) at
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/postmaster/postmaster.c:1244
#15 0x00007f47b24ba474 in main (argc=5, argv=0x7f47b49b4920) at
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/main/main.c:228
Detaching from program: /usr/lib/postgresql/9.4/bin/postgres, process 15601

*Spiros Ioannou IT Manager, inAccesswww.inaccess.com
<http://www.inaccess.com&gt;M: +30 6973-903808T: +30 210-6802-358*

On 20 July 2015 at 15:21, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Show quoted text

On 07/20/2015 03:01 PM, Andres Freund wrote:

Heikki,

On 2015-07-20 13:27:12 +0200, Andres Freund wrote:

On 2015-07-20 13:22:42 +0200, Andres Freund wrote:

Hm. The problem seems to be the WaitXLogInsertionsToFinish() call in
XLogFlush().

These are the relevant stack traces:
db9lock/debuglog-commit.txt
#2 0x00007f7405bd44f4 in LWLockWaitForVar (l=0x7f70f2ab6680,
valptr=0x7f70f2ab66a0, oldval=<optimized out>, newval=0xffffffffffffffff)
at
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:1011
#3 0x00007f7405a0d3e6 in WaitXLogInsertionsToFinish
(upto=121713318915952) at
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1755
#4 0x00007f7405a0e1d3 in XLogFlush (record=121713318911056) at
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:2849

db9lock/debuglog-insert-8276.txt
#1 0x00007f7405b77d91 in PGSemaphoreLock (sema=0x7f73ff6531d0,
interruptOK=0 '\000') at pg_sema.c:421
#2 0x00007f7405bd4849 in LWLockAcquireCommon (val=<optimized out>,
valptr=<optimized out>, mode=<optimized out>, l=<optimized out>) at
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:626
#3 LWLockAcquire (l=0x7f70ecaaa1a0, mode=LW_EXCLUSIVE) at
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:467
#4 0x00007f7405a0dcca in AdvanceXLInsertBuffer (upto=<optimized out>,
opportunistic=<optimized out>) at
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:2161
#5 0x00007f7405a0e301 in GetXLogBuffer (ptr=121713318928384) at
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1848
#6 0x00007f7405a0e9c9 in CopyXLogRecordToWAL (EndPos=<optimized out>,
StartPos=<optimized out>, rdata=0x7ffff1c21b90, isLogSwitch=<optimized
out>, write_len=<optimized out>) at
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1494
#7 XLogInsert (rmid=<optimized out>, info=<optimized out>,
rdata=<optimized out>) at /tmp/buildd/postgre

XLogFlush() has the following comment:
/*
* Re-check how far we can now flush the WAL.
It's generally not
* safe to call WaitXLogInsertionsToFinish while
holding
* WALWriteLock, because an in-progress insertion
might need to
* also grab WALWriteLock to make progress. But
we know that all
* the insertions up to insertpos have already
finished, because
* that's what the earlier
WaitXLogInsertionsToFinish() returned.
* We're only calling it again to allow insertpos
to be moved
* further forward, not to actually wait for
anyone.
*/
insertpos = WaitXLogInsertionsToFinish(insertpos);

but I don't think that's valid reasoning. WaitXLogInsertionsToFinish()
calls LWLockWaitForVar(oldval = InvalidXLogRecPtr), which will block if
there's a exlusive locker and some backend doesn't yet have set
initializedUpto. Which seems like a ossible state?

A backend always updates its insert position before sleeping/acquiring
another lock, by calling WALInsertLockUpdateInsertingAt. So even though
another backend might indeed be in the initializedUpto==InvalidXlogRecPtr
state, it will get out of that state before either by releasing the lock or
updating initializedUpto, before it will in turn do anything that might
deadlock.

Clearly there's *something* wrong here, though, given the bug report...

- Heikki

#11Andres Freund
andres@anarazel.de
In reply to: Spiros Ioannou (#10)
Re: Lots of stuck queries after upgrade to 9.4

Hi,

On 2015-07-20 15:58:33 +0300, Spiros Ioannou wrote:

Happened again, another backtrace from a COMMIT process. I changed the
commit_delay to 0 (it was 4000 to help with our storage) and will report
back.

What hardware & OS is this happening on?

Regards,

Andres

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#12Spiros Ioannou
sivann@inaccess.com
In reply to: Andres Freund (#11)
Re: Lots of stuck queries after upgrade to 9.4

Debian 6, x64, HP Proliant, dbsize: 3TB on EMC AX-4i (iscsi), 16 cores,
24GB RAM.
kernel: Linux vserver11 2.6.32-5-xen-amd64 #1 SMP Sun Dec 7 22:52:42 UTC
2014 x86_64 GNU/Linux
I attach postgresql.conf

*Spiros Ioannou IT Manager, inAccesswww.inaccess.com
<http://www.inaccess.com&gt;M: +30 6973-903808T: +30 210-6802-358*

On 20 July 2015 at 16:04, Andres Freund <andres@anarazel.de> wrote:

Show quoted text

Hi,

On 2015-07-20 15:58:33 +0300, Spiros Ioannou wrote:

Happened again, another backtrace from a COMMIT process. I changed the
commit_delay to 0 (it was 4000 to help with our storage) and will report
back.

What hardware & OS is this happening on?

Regards,

Andres

Attachments:

postgresql.confapplication/octet-stream; name=postgresql.confDownload
#13Spiros Ioannou
sivann@inaccess.com
In reply to: Andres Freund (#8)
Re: Lots of stuck queries after upgrade to 9.4

FYI we have an 9.3.5 with commit_delay = 4000 and commit_siblings = 5 with
a 8TB dataset which seems fine. (Runs on different - faster hardware
though).

*Spiros Ioannou IT Manager, inAccesswww.inaccess.com
<http://www.inaccess.com&gt;M: +30 6973-903808T: +30 210-6802-358*

On 20 July 2015 at 15:01, Andres Freund <andres@anarazel.de> wrote:

Show quoted text

Heikki,

On 2015-07-20 13:27:12 +0200, Andres Freund wrote:

On 2015-07-20 13:22:42 +0200, Andres Freund wrote:

Hm. The problem seems to be the WaitXLogInsertionsToFinish() call in
XLogFlush().

These are the relevant stack traces:
db9lock/debuglog-commit.txt
#2 0x00007f7405bd44f4 in LWLockWaitForVar (l=0x7f70f2ab6680,

valptr=0x7f70f2ab66a0, oldval=<optimized out>, newval=0xffffffffffffffff)
at
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:1011

#3 0x00007f7405a0d3e6 in WaitXLogInsertionsToFinish

(upto=121713318915952) at
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1755

#4 0x00007f7405a0e1d3 in XLogFlush (record=121713318911056) at

/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:2849

db9lock/debuglog-insert-8276.txt
#1 0x00007f7405b77d91 in PGSemaphoreLock (sema=0x7f73ff6531d0,

interruptOK=0 '\000') at pg_sema.c:421

#2 0x00007f7405bd4849 in LWLockAcquireCommon (val=<optimized out>,

valptr=<optimized out>, mode=<optimized out>, l=<optimized out>) at
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:626

#3 LWLockAcquire (l=0x7f70ecaaa1a0, mode=LW_EXCLUSIVE) at

/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:467

#4 0x00007f7405a0dcca in AdvanceXLInsertBuffer (upto=<optimized out>,

opportunistic=<optimized out>) at
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:2161

#5 0x00007f7405a0e301 in GetXLogBuffer (ptr=121713318928384) at

/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1848

#6 0x00007f7405a0e9c9 in CopyXLogRecordToWAL (EndPos=<optimized out>,

StartPos=<optimized out>, rdata=0x7ffff1c21b90, isLogSwitch=<optimized
out>, write_len=<optimized out>) at
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1494

#7 XLogInsert (rmid=<optimized out>, info=<optimized out>,

rdata=<optimized out>) at /tmp/buildd/postgre

XLogFlush() has the following comment:
/*
* Re-check how far we can now flush the WAL. It's
generally not
* safe to call WaitXLogInsertionsToFinish while
holding
* WALWriteLock, because an in-progress insertion
might need to
* also grab WALWriteLock to make progress. But we
know that all
* the insertions up to insertpos have already
finished, because
* that's what the earlier
WaitXLogInsertionsToFinish() returned.
* We're only calling it again to allow insertpos
to be moved
* further forward, not to actually wait for
anyone.
*/
insertpos = WaitXLogInsertionsToFinish(insertpos);

but I don't think that's valid reasoning. WaitXLogInsertionsToFinish()
calls LWLockWaitForVar(oldval = InvalidXLogRecPtr), which will block if
there's a exlusive locker and some backend doesn't yet have set
initializedUpto. Which seems like a ossible state?

#14Andres Freund
andres@anarazel.de
In reply to: Spiros Ioannou (#13)
Re: Lots of stuck queries after upgrade to 9.4

On 2015-07-20 17:00:52 +0300, Spiros Ioannou wrote:

FYI we have an 9.3.5 with commit_delay = 4000 and commit_siblings = 5 with
a 8TB dataset which seems fine. (Runs on different - faster hardware
though).

9.4 has a different xlog insertion algorithm (scaling much better), so
that unfortunately doesn't say very much...

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#15Spiros Ioannou
sivann@inaccess.com
In reply to: Andres Freund (#14)
Re: Lots of stuck queries after upgrade to 9.4

Well, so far with commit_delay=0 no problems. I will report back of couse
if something happens, but I believe that the problem may indeed be
solved/masked with that setting.

Rough description of our setup, or how to reproduce:
* Timeseries data in table , say, "measurements", size: 3-4TB, about 1000
inserts/second
* table measurements also has a trigger on insert to also insert on
measurements_a (for daily export purposes)
Just the above would cause a stuck query after a few days.

Now for exporting we run the following CTE query (measurements_b is an
empty table, measurements_a has about 5GB)
* WITH d_rows AS (DELETE FROM measurement_events_a RETURNING * ) INSERT
INTO measurement_events_b SELECT * FROM d_rows;
The above caused the problem to appear every time, after a 10-20 minutes.

Regards,
-Spiros

*Spiros Ioannou IT Manager, inAccesswww.inaccess.com
<http://www.inaccess.com&gt;M: +30 6973-903808T: +30 210-6802-358*

On 20 July 2015 at 17:02, Andres Freund <andres@anarazel.de> wrote:

Show quoted text

On 2015-07-20 17:00:52 +0300, Spiros Ioannou wrote:

FYI we have an 9.3.5 with commit_delay = 4000 and commit_siblings = 5

with

a 8TB dataset which seems fine. (Runs on different - faster hardware
though).

9.4 has a different xlog insertion algorithm (scaling much better), so
that unfortunately doesn't say very much...

#16Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Spiros Ioannou (#15)
Re: Lots of stuck queries after upgrade to 9.4

On 07/23/2015 11:31 AM, Spiros Ioannou wrote:

Well, so far with commit_delay=0 no problems. I will report back of couse
if something happens, but I believe that the problem may indeed be
solved/masked with that setting.

Rough description of our setup, or how to reproduce:
* Timeseries data in table , say, "measurements", size: 3-4TB, about 1000
inserts/second
* table measurements also has a trigger on insert to also insert on
measurements_a (for daily export purposes)
Just the above would cause a stuck query after a few days.

Now for exporting we run the following CTE query (measurements_b is an
empty table, measurements_a has about 5GB)
* WITH d_rows AS (DELETE FROM measurement_events_a RETURNING * ) INSERT
INTO measurement_events_b SELECT * FROM d_rows;
The above caused the problem to appear every time, after a 10-20 minutes.

I was able to reproduce something like this with pgbench, by running a
custom little module that calls the WaitXLogInsertionsToFinish() in a
tight loop, and checks that the value it returns moves monotonically
forward. With commit_delay on, once every minute or so, it moves backwards.

I'll investigate why that happens...

- Heikki

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#17Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#16)
Re: Lots of stuck queries after upgrade to 9.4

On 07/23/2015 02:36 PM, Heikki Linnakangas wrote:

On 07/23/2015 11:31 AM, Spiros Ioannou wrote:

Well, so far with commit_delay=0 no problems. I will report back of couse
if something happens, but I believe that the problem may indeed be
solved/masked with that setting.

Rough description of our setup, or how to reproduce:
* Timeseries data in table , say, "measurements", size: 3-4TB, about 1000
inserts/second
* table measurements also has a trigger on insert to also insert on
measurements_a (for daily export purposes)
Just the above would cause a stuck query after a few days.

Now for exporting we run the following CTE query (measurements_b is an
empty table, measurements_a has about 5GB)
* WITH d_rows AS (DELETE FROM measurement_events_a RETURNING * ) INSERT
INTO measurement_events_b SELECT * FROM d_rows;
The above caused the problem to appear every time, after a 10-20 minutes.

Hmm. With that CTE query, were there other queries running at the same time?

I was able to reproduce something like this with pgbench, by running a
custom little module that calls the WaitXLogInsertionsToFinish() in a
tight loop, and checks that the value it returns moves monotonically
forward. With commit_delay on, once every minute or so, it moves backwards.

I'll investigate why that happens...

I was able to debug the synthetic test case I created, but unfortunately
I don't think it explains the lock up you're seeing after all.

It's possible for WaitXLogInsertionsToFinish() to move backwards, in
this scenario:

1. Backend A acquires WALInsertLock 2, and reserves xlog between byte
positions 2100 - 2200
2. Backend B calls WaitXLogInsertionsToFinish(), which blocks on backend
A, which hasn't advertised any location yet.
3. Backend C acquires WALInsertLock 1, and reserves xlog between byte
positions 2200 - 2300
4. Backend C calls GetXLogBuffer(), and sees that the page is not in
cache yet. (It does not call WALInsertLockUpdateInsertingAt() yet,
because it's a bit slow or context-switched out)
5. Backend A initializes the page, completes inserting its WAL record,
and releases its WALInsertLock.
6. Backend B gets unblocked, seeing that the lock held by B is now free.
It calculated 2200 as the return value, which was the latest reserved
WAL position. (Backend C started after it began, so it didn't have to
wait for it)
7. Backend C calls WALInsertLockUpdateInsertingAt(), with a WAL position
pointing to the beginning of the page, 2000.

If you now call WALInsertLockUpdateInsertingAt() again, it will return
2000, because backend C is the only backend holding a lock, and its
advertised position is 2000. But the previous call calculated 2200.
GetXLogBuffer() always advertises a WAL position at the beginning of the
requested page, but that's a bit bogus.

However, AFAICS that is actually harmless. Backend C is not blocked. The
page it's looking for is certainly in cache at this point, so it can
continue without blocking. So I don't think this explains your lockup.

- Heikki

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#18Spiros Ioannou
sivann@inaccess.com
In reply to: Heikki Linnakangas (#17)
Re: Lots of stuck queries after upgrade to 9.4

Hmm. With that CTE query, were there other queries running at the same
time?

yes, a multitude of INSERT and one COMMIT as I remember. I've noticed that
in every stuck situation there was always 1 stuck COMMIT running. We do
mainly INSERT and SELECT, very rarely UPDATE or DELETE.
but the stuck queries happen also without the CTE. The trigger on insert
runs non-stop though.

I think I've noticed a different behaviour with 9.4 and the CTE but I'm
just 50% sure about that, but may be relevant or not:
in 9.3 while the CTE was running, I think the trigger-filled table was
locked and no inserts could be made to the primary table (which copies all
inserts to the trigger table). In 9.4 inserts can be made at a lower rate
even when CTE runs. As I said not 100% sure about the behaviour was like
that in 9.3.

I was able to debug the synthetic test case I created, but unfortunately I

don't think it explains the lock up you're seeing after all.

It's possible for WaitXLogInsertionsToFinish() to move backwards, in this
scenario:

1. Backend A acquires WALInsertLock 2, and reserves xlog between byte
positions 2100 - 2200
2. Backend B calls WaitXLogInsertionsToFinish(), which blocks on backend
A, which hasn't advertised any location yet.
3. Backend C acquires WALInsertLock 1, and reserves xlog between byte
positions 2200 - 2300
4. Backend C calls GetXLogBuffer(), and sees that the page is not in cache
yet. (It does not call WALInsertLockUpdateInsertingAt() yet, because it's a
bit slow or context-switched out)
5. Backend A initializes the page, completes inserting its WAL record, and
releases its WALInsertLock.
6. Backend B gets unblocked, seeing that the lock held by B is now free.
It calculated 2200 as the return value, which was the latest reserved WAL
position. (Backend C started after it began, so it didn't have to wait for
it)
7. Backend C calls WALInsertLockUpdateInsertingAt(), with a WAL position
pointing to the beginning of the page, 2000.

If you now call WALInsertLockUpdateInsertingAt() again, it will return
2000, because backend C is the only backend holding a lock, and its
advertised position is 2000. But the previous call calculated 2200.
GetXLogBuffer() always advertises a WAL position at the beginning of the
requested page, but that's a bit bogus.

However, AFAICS that is actually harmless. Backend C is not blocked. The
page it's looking for is certainly in cache at this point, so it can
continue without blocking. So I don't think this explains your lockup.

Thanks for the investigation. The only think I can offer is that it's still
running fine with commit_delay = 0. Perhaps a warning should be put on the
documentation until the cause is found.
I've found a similar issue online:
http://dba.stackexchange.com/questions/96957/postgresql-9-4-1-stuck-all-queries-when-making-multi-updates
thanks
-Spiros

#19Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Spiros Ioannou (#12)
Re: Lots of stuck queries after upgrade to 9.4

A-ha, I succeeded to reproduce this now on my laptop, with pgbench! It
seems to be important to have a very large number of connections:

pgbench -n -c400 -j4 -T600 -P5

That got stuck after a few minutes. I'm using commit_delay=100.

Now that I have something to work with, I'll investigate this more tomorrow.

- Heikki

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#20Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#19)
Re: Lots of stuck queries after upgrade to 9.4

On 07/28/2015 11:36 PM, Heikki Linnakangas wrote:

A-ha, I succeeded to reproduce this now on my laptop, with pgbench! It
seems to be important to have a very large number of connections:

pgbench -n -c400 -j4 -T600 -P5

That got stuck after a few minutes. I'm using commit_delay=100.

Now that I have something to work with, I'll investigate this more tomorrow.

Ok, it seems that this is caused by the same issue that I found with my
synthetic test case, after all. It is possible to get a lockup because
of it.

For the archives, here's a hopefully easier-to-understand explanation of
how the lockup happens. It involves three backends. A and C are
insertion WAL records, while B is flushing the WAL with commit_delay.
The byte positions 2000, 2100, 2200, and 2300 are offsets within a WAL
page. 2000 points to the beginning of the page, while the others are
later positions on the same page. WaitToFinish() is an abbreviation for
WaitXLogInsertionsToFinish(). "Update pos X" means a call to
WALInsertLockUpdateInsertingAt(X). "Reserve A-B" means a call to
ReserveXLogInsertLocation, which returned StartPos A and EndPos B.

Backend A Backend B Backend C
--------- --------- ---------
Acquire InsertLock 2
Reserve 2100-2200
Calls WaitToFinish()
reservedUpto is 2200
sees that Lock 1 is
free
Acquire InsertLock 1
Reserve 2200-2300
GetXLogBuffer(2200)
page not in cache
Update pos 2000
AdvanceXLInsertBuffer()
run until about to
acquire WALWriteLock
GetXLogBuffer(2100)
page not in cache
Update pos 2000
AdvanceXLInsertBuffer()
Acquire WALWriteLock
write out old page
initialize new page
Release WALWriteLock
finishes insertion
release InsertLock 2
WaitToFinish() continues
sees that lock 2 is
free. Returns 2200.

Acquire WALWriteLock
Call WaitToFinish(2200)
blocks on Lock 1,
whose initializedUpto
is 2000.

At this point, there is a deadlock between B and C. B is waiting for C
to release the lock or update its insertingAt value past 2200, while C
is waiting for WALInsertLock, held by B.

To fix that, let's fix GetXLogBuffer() to always advertise the exact
position, not the beginning of the page (except when inserting the first
record on the page, just after the page header, see comments).

This fixes the problem for me. I've been running pgbench for about 30
minutes without lockups now, while without the patch it locked up within
a couple of minutes. Spiros, can you easily test this patch in your
environment? Would be nice to get a confirmation that this fixes the
problem for you too.

- Heikki

Attachments:

fix-walinsertlock-deadlock-1.patchapplication/x-patch; name=fix-walinsertlock-deadlock-1.patchDownload+24-3
#21Spiros Ioannou
sivann@inaccess.com
In reply to: Heikki Linnakangas (#20)
#22Scott Marlowe
scott.marlowe@gmail.com
In reply to: Spiros Ioannou (#21)
#23Spiros Ioannou
sivann@inaccess.com
In reply to: Scott Marlowe (#22)
#24Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Heikki Linnakangas (#19)
#25Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Spiros Ioannou (#21)