BUG #14960: Queries stuck on BtreePage lock on parallel index scan

Started by Tim Warbergover 8 years ago3 messagesbugs
Jump to latest
#1Tim Warberg
tlw@monsido.com

The following bug has been logged on the website:

Bug reference: 14960
Logged by: Tim Warberg
Email address: tlw@monsido.com
PostgreSQL version: 10.1
Operating system: Ubuntu 16.04 LTS
Description:

Hi,

Ran into a issue on our PostgreSQL 10.1 cluster that seems to be a parallel
index scan bug. Our queue system scheduled 30 almost identical concurrent
queries where some of them was executed as parallel queries and all of them
became stuck on IPC BtreePage lock. We discovered this after they've been
stuck like that for about 10 hours [1]We had seen no errors and long running queue jobs is not uncommon. At the same time a autovacuum was
progressing one of the queried tables and it had become stuck in LWLock
buffer_content while vacuuming indexes. None of the query processes
responded to pg_cancel_backend nor pg_terminate_backend including the
autovacuum.
We eventually saw no other option than restarting PostgreSQL. After the
restart the queue system immediately restarted the queries and they all once
more became stuck on BtreePage lock. We then decided to disable parallel
queries (max_parallel_workers_per_gather=0) and restarted PostgreSQL once
more and this time the queries completed without issue.

The system had been running with 10.1 and parallel queries for about 19 days
before this without issues and at least once with the same query set. We've
also been running the same queries at the same concurrency on PostgreSQL 9.5
for months without issues.

Don't know if its possible to create the same condition in a isolated test
but we could try to reproduce on a snapshot if wanted. If so is there any
debug flags we should set or things we should look for?

Server runs on Google Cloud with 18 vcpu, 118GB memory, ZFS filesystem with
lz4 compression and the database takes up 4.69TB [2]2TB compressed and we've just done some cleanup and a lot of it was still not reclaimed by autovacuum when queries ran.

pg_stat_activity before restart:
https://gist.github.com/twarberg/514f382af7adec19248eba55f94a4e6a

pg_stat_activity after restart:
https://gist.github.com/twarberg/20443600478d8396e76ae1a699298217

[1]: We had seen no errors and long running queue jobs is not uncommon
[2]: 2TB compressed and we've just done some cleanup and a lot of it was still not reclaimed by autovacuum when queries ran.
still not reclaimed by autovacuum when queries ran.

#2Thomas Munro
thomas.munro@gmail.com
In reply to: Tim Warberg (#1)
Re: BUG #14960: Queries stuck on BtreePage lock on parallel index scan

On Mon, Dec 11, 2017 at 10:22 PM, <tlw@monsido.com> wrote:

The following bug has been logged on the website:

Bug reference: 14960
Logged by: Tim Warberg
Email address: tlw@monsido.com
PostgreSQL version: 10.1
Operating system: Ubuntu 16.04 LTS
Description:

Hi,

Ran into a issue on our PostgreSQL 10.1 cluster that seems to be a parallel
index scan bug. Our queue system scheduled 30 almost identical concurrent
queries where some of them was executed as parallel queries and all of them
became stuck on IPC BtreePage lock.

Hi Tim,

Thanks for the report. This seems to be the same as the bug that we
just analysed over here:

/messages/by-id/CAEepm=2xZUcOGP9V0O_G0=2P2wwXwPrkF=upWTCJSisUxMnuSg@mail.gmail.com

We discovered this after they've been
stuck like that for about 10 hours [1]. At the same time a autovacuum was
progressing one of the queried tables and it had become stuck in LWLock
buffer_content while vacuuming indexes. None of the query processes
responded to pg_cancel_backend nor pg_terminate_backend including the
autovacuum.

Hmm. This may be because we hold a BT_READ lock while waiting in
_bt_parallel_seize(). Here it's extended by the above-mentioned bug,
preventing others from acquiring an exclusive lock.

--
Thomas Munro
http://www.enterprisedb.com

#3Tim Warberg
tlw@monsido.com
In reply to: Thomas Munro (#2)
Re: BUG #14960: Queries stuck on BtreePage lock on parallel index scan

Hi Thomas,

Thanks for the quick reply.

On 11 Dec 2017, at 10.42, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

On Mon, Dec 11, 2017 at 10:22 PM, <tlw@monsido.com> wrote:

The following bug has been logged on the website:

Bug reference: 14960
Logged by: Tim Warberg
Email address: tlw@monsido.com
PostgreSQL version: 10.1
Operating system: Ubuntu 16.04 LTS
Description:

Hi,

Ran into a issue on our PostgreSQL 10.1 cluster that seems to be a parallel
index scan bug. Our queue system scheduled 30 almost identical concurrent
queries where some of them was executed as parallel queries and all of them
became stuck on IPC BtreePage lock.

Hi Tim,

Thanks for the report. This seems to be the same as the bug that we
just analysed over here:

/messages/by-id/CAEepm=2xZUcOGP9V0O_G0=2P2wwXwPrkF=upWTCJSisUxMnuSg@mail.gmail.com </messages/by-id/CAEepm=2xZUcOGP9V0O_G0=2P2wwXwPrkF=upWTCJSisUxMnuSg@mail.gmail.com&gt;

Checked for similar bug reports and related git commits when it happened Thursday last week but didn’t occur to me to double check today before submitting. Looks like You have all the necessary information from the other report.

We discovered this after they've been
stuck like that for about 10 hours [1]. At the same time a autovacuum was
progressing one of the queried tables and it had become stuck in LWLock
buffer_content while vacuuming indexes. None of the query processes
responded to pg_cancel_backend nor pg_terminate_backend including the
autovacuum.

Hmm. This may be because we hold a BT_READ lock while waiting in
_bt_parallel_seize(). Here it's extended by the above-mentioned bug,
preventing others from acquiring an exclusive lock.

--
Thomas Munro
http://www.enterprisedb.com <http://www.enterprisedb.com/&gt;

Regards,

Tim Warberg