We're leaking predicate locks in HEAD

Started by Tom Lane · over 6 years ago · 8 messages
#1 Tom Lane
tgl@sss.pgh.pa.us

After running the core regression tests with installcheck-parallel,
the pg_locks view sometimes shows me apparently-orphaned SIReadLock
entries. They accumulate over repeated test runs. Right now,
for example, I see

regression=# select * from pg_locks;
locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath
------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+------+-----------------+---------+----------
relation | 130144 | 12137 | | | | | | | | 3/7977 | 8924 | AccessShareLock | t | t
virtualxid | | | | | 3/7977 | | | | | 3/7977 | 8924 | ExclusiveLock | t | t
relation | 130144 | 136814 | | | | | | | | 22/536 | 8076 | SIReadLock | t | f
relation | 111195 | 118048 | | | | | | | | 19/665 | 6738 | SIReadLock | t | f
relation | 130144 | 134850 | | | | | | | | 12/3093 | 7984 | SIReadLock | t | f
(5 rows)

after having done a couple of installcheck iterations since starting the
postmaster.

The PIDs shown as holding those locks don't exist anymore, but digging
in the postmaster log shows that they were session backends during the
regression test runs. Furthermore, it seems like they usually were the
ones running either the triggers or portals tests.
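
A cross-check along these lines (just a sketch, joining against
pg_stat_activity to see which PIDs are still alive) picks out the
suspect entries:

select l.locktype, l.database, l.relation, l.pid, l.mode
  from pg_locks l
  left join pg_stat_activity a on a.pid = l.pid
 where l.mode = 'SIReadLock'
   and a.pid is null;   -- no live backend is holding the lock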

I don't see this behavior in v11 (though maybe I just didn't run it
long enough). In HEAD, a run adds one or two new entries more often
than not.

This is a pretty bad bug IMO --- quite aside from any ill effects
of the entries themselves, the leak seems fast enough that it'd run
a production installation out of locktable space before very long.

I'd have to say that my first suspicion falls on bb16aba50 ...

regards, tom lane

#2 Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#1)
Re: We're leaking predicate locks in HEAD

On Wed, May 8, 2019 at 5:46 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

After running the core regression tests with installcheck-parallel,
the pg_locks view sometimes shows me apparently-orphaned SIReadLock
entries. [...]

Ugh.

I'd have to say that my first suspicion falls on bb16aba50 ...

Investigating.

--
Thomas Munro
https://enterprisedb.com

#3 Andrew Dunstan
andrew.dunstan@2ndquadrant.com
In reply to: Tom Lane (#1)
Re: We're leaking predicate locks in HEAD

On 5/7/19 1:46 PM, Tom Lane wrote:

After running the core regression tests with installcheck-parallel,
the pg_locks view sometimes shows me apparently-orphaned SIReadLock
entries. They accumulate over repeated test runs.

Should we have a test for that, run at or near the end of the regression
tests? The buildfarm will actually do multiple runs like this if set up
to do parallel checks and test multiple locales.
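
Something of roughly this shape, say (placement and exact form to be
worked out), expecting zero leftover entries once all test sessions
have finished:

select count(*) as leaked_siread_locks
  from pg_locks
 where mode = 'SIReadLock';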

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#3)
Re: We're leaking predicate locks in HEAD

Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:

On 5/7/19 1:46 PM, Tom Lane wrote:

After running the core regression tests with installcheck-parallel,
the pg_locks view sometimes shows me apparently-orphaned SIReadLock
entries. They accumulate over repeated test runs.

Should we have a test for that, run at or near the end of the regression
tests? The buildfarm will actually do multiple runs like this if set up
to do parallel checks and test multiple locales.

No, I'm not excited about that idea; I think it'd have all the same
fragility as the late lamented "REINDEX pg_class" test. A given test
script has no business assuming that other test scripts aren't
legitimately taking out predicate locks, nor assuming that prior test
scripts are fully cleaned up when it runs.

regards, tom lane

#5 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#2)
1 attachment(s)
Re: We're leaking predicate locks in HEAD

On Wed, May 8, 2019 at 6:56 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Wed, May 8, 2019 at 5:46 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'd have to say that my first suspicion falls on bb16aba50 ...

Investigating.

Reproduced here. Once the system reaches a state where it's leaking
(which happens only occasionally for me during installcheck-parallel),
it keeps leaking for future SSI transactions. The cause is
SxactGlobalXmin getting stuck. The attached fixes it for me. I can't
remember why on earth I made that change, but it is quite clearly
wrong: you have to check every transaction, or you might never advance
SxactGlobalXmin.

--
Thomas Munro
https://enterprisedb.com

Attachments:

0001-Fix-SxactGlobalXmin-tracking.patch (application/octet-stream)
From 134754ef6c0e82b2e654de6e9b13c9a1a9a3b673 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 8 May 2019 14:44:32 +1200
Subject: [PATCH] Fix SxactGlobalXmin tracking.

Commit bb16aba50 broke the code that maintains SxactGlobalXmin, so
that it could get stuck.  When it stops advancing, later transactions
effectively leak resources.

Reported-by: Tom Lane
Discussion: https://postgr.es/m/16170.1557251214%40sss.pgh.pa.us
---
 src/backend/storage/lmgr/predicate.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 4e4d04bae37..15ef221c143 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -3635,8 +3635,7 @@ ReleasePredicateLocks(bool isCommit, bool isReadOnlySafe)
 	 * was launched.
 	 */
 	needToClear = false;
-	if (!isReadOnlySafe &&
-		TransactionIdEquals(MySerializableXact->xmin, PredXact->SxactGlobalXmin))
+	if (TransactionIdEquals(MySerializableXact->xmin, PredXact->SxactGlobalXmin))
 	{
 		Assert(PredXact->SxactGlobalXminCount > 0);
 		if (--(PredXact->SxactGlobalXminCount) == 0)
-- 
2.21.0

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#5)
Re: We're leaking predicate locks in HEAD

Thomas Munro <thomas.munro@gmail.com> writes:

Reproduced here. Once the system reaches a state where it's leaking
(which happens only occasionally for me during installcheck-parallel),
it keeps leaking for future SSI transactions. The cause is
SxactGlobalXmin getting stuck. The attached fixes it for me. I can't
remember why on earth I made that change, but it is quite clearly
wrong: you have to check every transaction, or you might never advance
SxactGlobalXmin.

Hm. So I don't have any opinion about whether this is a correct fix for
the leak, but I am quite distressed that the system failed to notice that
it was leaking predicate locks. Shouldn't there be the same sort of
leak-detection infrastructure that we have for most types of resources?

regards, tom lane

#7 Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#6)
Re: We're leaking predicate locks in HEAD

On Wed, May 8, 2019 at 3:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

Reproduced here. Once the system reaches a state where it's leaking
(which happens only occasionally for me during installcheck-parallel),
it keeps leaking for future SSI transactions. The cause is
SxactGlobalXmin getting stuck. The attached fixes it for me. I can't
remember why on earth I made that change, but it is quite clearly
wrong: you have to check every transaction, or you might never advance
SxactGlobalXmin.

Hm. So I don't have any opinion about whether this is a correct fix for
the leak, but I am quite distressed that the system failed to notice that
it was leaking predicate locks. Shouldn't there be the same sort of
leak-detection infrastructure that we have for most types of resources?

Well, it is hooked up to the usual release machinery, because it's in
ReleasePredicateLocks(), which is wired into the
RESOURCE_RELEASE_LOCKS phase of resowner.c. The thing is that lock
lifetime is tied to the last transaction with the oldest known xmin
among serializable transactions, not to the transaction that created
the locks.

More analysis: Lock clean-up is deferred until "... the last
serializable transaction with the oldest xmin among serializable
transactions completes", but I broke that by excluding read-only
transactions from the check, so SxactGlobalXminCount gets out of
sync. There's a read-only SSI transaction in
src/test/regress/sql/transactions.sql, but I think the reason the
problem manifests only intermittently with installcheck-parallel is
that sometimes the read-only optimisation kicks in (effectively
dropping us to plain old SI because there's no concurrent serializable
activity), so it doesn't take any locks at all, and sometimes the
read-only transaction doesn't have the oldest known xmin among
serializable transactions. However, if a read-write SSI transaction
has already taken a snapshot and holds the oldest xmin, and the
read-only one then starts with the same xmin, we get into trouble.
When the read-only one releases, we fail to decrement
SxactGlobalXminCount, and then we never call ClearOldPredicateLocks().
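
In SQL terms the interleaving is roughly this (a sketch only;
some_table is just a placeholder and the exact timing matters):

-- session 1: read-write SSI transaction, snapshot taken first, so it
-- holds the oldest xmin among serializable transactions
begin transaction isolation level serializable;
select * from some_table;    -- takes SIReadLocks

-- session 2: read-only SSI transaction starting with the same xmin
begin transaction isolation level serializable read only;
select * from some_table;
commit;   -- when this one is released, the broken check fails to
          -- decrement SxactGlobalXminCount

-- session 1 again
commit;   -- SxactGlobalXmin never advances, ClearOldPredicateLocks()
          -- is never reached, and SIReadLocks from this and later SSI
          -- transactions are left behind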

--
Thomas Munro
https://enterprisedb.com

#8 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#7)
Re: We're leaking predicate locks in HEAD

On Wed, May 8, 2019 at 4:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Wed, May 8, 2019 at 3:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

Reproduced here. Once the system reaches a state where it's leaking
(which happens only occasionally for me during installcheck-parallel),
it keeps leaking for future SSI transactions. The cause is
SxactGlobalXmin getting stuck. The attached fixes it for me. I can't
remember why on earth I made that change, but it is quite clearly
wrong: you have to check every transaction, or you might never advance
SxactGlobalXmin.

I pushed a version of that, thereby reverting the already-analysed
hunk and also another, similar hunk (which was probably harmless).

The second hunk dates from a time in development when I was treating
the final clean-up at commit time as a regular commit, but that failed
in PreCommit_CheckForSerializationFailure() because the DOOMED flag
was set by the earlier RO_SAFE partial release. The change was no
longer necessary, because final release of a partially released
read-only transaction is now done with isCommit forced to false.
(Before bb16aba50, it was done directly at RO_SAFE release time with
isCommit set to false, but bb16aba50 split the operation into two
phases, partial and then final, due to the extended object lifetime
requirement when sharing the SERIALIZABLEXACT with parallel workers.)

I'll update the open items page.

--
Thomas Munro
https://enterprisedb.com