Speed up transaction completion after many relations are accessed in a transaction
Hello,
The attached patch speeds up transaction completion when any prior transaction accessed many relations in the same session.
The transaction records each lock it acquires in a LOCALLOCK structure (keyed by a pair of lock object and lock mode), stored in a hash table in backend-local memory. The hash table automatically expands when the transaction acquires locks on many relations, but it never shrinks. When the transaction commits or aborts, it scans the entire hash table to find the LOCALLOCKs whose locks must be released.
The problem is that once some transaction accesses many relations, even subsequent transactions in the same session that access only a few relations take an unreasonably long time to complete, because they have to scan the expanded hash table.
The attached patch links LOCALLOCKs to PGPROC, so that releasing locks only requires scanning the list instead of the hash table. The hash table is kept because some functions need to find a particular LOCALLOCK quickly by its hash value.
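The core idea can be sketched in standalone C (illustrative names and structures, not the actual patch code): each held-lock entry embeds list links, so lock release walks only the entries actually held, rather than every bucket of a possibly-expanded hash table.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch only: LockEntry stands in for LOCALLOCK, which in the patch
 * stays in the local hash table but additionally carries list links. */
typedef struct LockEntry
{
    int                 lockid;     /* stands in for LOCALLOCKTAG */
    int                 nLocks;     /* reference count, as in LOCALLOCK */
    struct LockEntry   *prev;       /* embedded doubly-linked list links */
    struct LockEntry   *next;
} LockEntry;

/* circular list head; in the patch this lives beside the hash table */
static LockEntry head = {0, 0, &head, &head};

static void
remember_lock(LockEntry *e)
{
    e->next = head.next;
    e->prev = &head;
    head.next->prev = e;
    head.next = e;
}

/* O(number of held locks), independent of hash table size */
static int
release_all_locks(void)
{
    int         released = 0;
    LockEntry  *e = head.next;

    while (e != &head)
    {
        LockEntry *next = e->next;

        e->nLocks = 0;          /* stands in for releasing the lock */
        released++;
        e = next;
    }
    head.next = head.prev = &head;
    return released;
}
```

The hash table is untouched by release; it is still consulted only by lookups that need a specific entry by hash value.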
This problem was uncovered while evaluating partitioning performance. When the application PREPAREs a statement once and then repeats EXECUTE and COMMIT, the server creates a generic plan on the 6th EXECUTE. Unfortunately, creating the generic plan for UPDATE/DELETE currently accesses all partitions of the target table (this itself needs improvement), expanding the LOCALLOCK hash table. As a result, the 7th and later EXECUTEs get slower.
Imai-san confirmed performance improvement with this patch:
https://commitfest.postgresql.org/22/1993/
Regards
Takayuki Tsunakawa
Attachments:
faster-locallock-scan.patch (application/octet-stream, +31 -26)
"Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> writes:
The attached patch speeds up transaction completion when any prior transaction accessed many relations in the same session.
Hm. Putting a list header for a purely-local data structure into shared
memory seems quite ugly. Isn't there a better place to keep that?
Do we really want a dlist here at all? I'm concerned that bloating
LOCALLOCK will cost us when there are many locks involved. This patch
increases the size of LOCALLOCK by 25% if I counted right, which does
not seem like a negligible penalty.
My own thought about how to improve this situation was just to destroy
and recreate LockMethodLocalHash at transaction end (or start)
if its size exceeded $some-value. Leaving it permanently bloated seems
like possibly a bad idea, even if we get rid of all the hash_seq_searches
on it.
regards, tom lane
On Tue, 19 Feb 2019 at 12:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:
My own thought about how to improve this situation was just to destroy
and recreate LockMethodLocalHash at transaction end (or start)
if its size exceeded $some-value. Leaving it permanently bloated seems
like possibly a bad idea, even if we get rid of all the hash_seq_searches
on it.
That seems like a good idea. Although, it would be good to know that
it didn't add too much overhead dropping and recreating the table when
every transaction happened to obtain more locks than $some-value. If
it did, then maybe we could track the average number of locks of recent
transactions and just ditch the table after the locks are released if
the locks held by the last transaction exceeded the average *
1.something. No need to go near shared memory to do that.
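A minimal sketch of such an adaptive heuristic (hypothetical names and constants; real tuning would need benchmarking): keep an exponential moving average of locks acquired per transaction, and signal a rebuild when the table's capacity is far beyond what recent transactions actually needed.

```c
#include <assert.h>

/* running average of locks per transaction, seeded with a guess */
static double avg_locks = 16.0;

static int
should_rebuild_table(int locks_this_xact, long table_capacity)
{
    /* cheap EMA update: purely backend-local, no shared memory */
    avg_locks = 0.875 * avg_locks + 0.125 * locks_this_xact;

    /* rebuild if capacity has grown well past recent demand */
    return table_capacity > 8 * avg_locks + 64;
}
```

A few small transactions after one lock-heavy transaction would then pull the average back down and trigger the rebuild, restoring cheap end-of-transaction scans.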
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,
On 2019-02-19 12:52:08 +1300, David Rowley wrote:
On Tue, 19 Feb 2019 at 12:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:
My own thought about how to improve this situation was just to destroy
and recreate LockMethodLocalHash at transaction end (or start)
if its size exceeded $some-value. Leaving it permanently bloated seems
like possibly a bad idea, even if we get rid of all the hash_seq_searches
on it.
That seems like a good idea. Although, it would be good to know that
it didn't add too much overhead dropping and recreating the table when
every transaction happened to obtain more locks than $some-value. If
it did, then maybe we could track the average number of locks of recent
transactions and just ditch the table after the locks are released if
the locks held by the last transaction exceeded the average *
1.something. No need to go near shared memory to do that.
Isn't a large portion of benefits in this patch going to be mooted by
the locking improvements discussed in the other threads? I.e. there's
hopefully not going to be a ton of cases with low overhead where we
acquire a lot of locks and release them very soon after. Sure, for DDL
etc we will, but I can't see this mattering from a performance POV?
I'm not against doing something like Tom proposes, but heuristics with
magic constants like this tend to age poorly / are hard to tune well
across systems.
Greetings,
Andres Freund
David Rowley <david.rowley@2ndquadrant.com> writes:
On Tue, 19 Feb 2019 at 12:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:
My own thought about how to improve this situation was just to destroy
and recreate LockMethodLocalHash at transaction end (or start)
if its size exceeded $some-value. Leaving it permanently bloated seems
like possibly a bad idea, even if we get rid of all the hash_seq_searches
on it.
That seems like a good idea. Although, it would be good to know that
it didn't add too much overhead dropping and recreating the table when
every transaction happened to obtain more locks than $some-value. If
it did, then maybe we could track the average number of locks of recent
transactions and just ditch the table after the locks are released if
the locks held by the last transaction exceeded the average *
1.something. No need to go near shared memory to do that.
Yeah, I'd deliberately avoided saying how we'd choose $some-value ;-).
Making it adaptive might not be a bad plan.
regards, tom lane
Andres Freund <andres@anarazel.de> writes:
Isn't a large portion of benefits in this patch going to be mooted by
the locking improvements discussed in the other threads? I.e. there's
hopefully not going to be a ton of cases with low overhead where we
acquire a lot of locks and release them very soon after. Sure, for DDL
etc we will, but I can't see this mattering from a performance POV?
Mmm ... AIUI, the patches currently proposed can only help for what
David called "point lookup" queries. There are still going to be
queries that scan a large proportion of a partition tree, so if you've
got tons of partitions, you'll be concerned about this sort of thing.
I'm not against doing something like Tom proposes, but heuristics with
magic constants like this tend to age poorly / are hard to tune well
across systems.
I didn't say it had to be a constant ...
regards, tom lane
On Tue, 19 Feb 2019 at 12:56, Andres Freund <andres@anarazel.de> wrote:
Isn't a large portion of benefits in this patch going to be mooted by
the locking improvements discussed in the other threads? I.e. there's
hopefully not going to be a ton of cases with low overhead where we
acquire a lot of locks and release them very soon after. Sure, for DDL
etc we will, but I can't see this mattering from a performance POV?
I think this patch was born from Amit's partition planner improvement
patch. If not that one, which other threads did you have in mind?
A problem exists where, when using a PREPAREd statement to query a
partitioned table containing many partitions, a generic plan will never
be favoured over a custom plan, since the generic plan might not be able
to prune partitions the way the custom plan can. The actual problem is
that at some point we do need to generate a generic plan in order to
know it's more costly, and that requires locking possibly every
partition. When plan_cache_mode = auto, this is done
on the 6th execution of the statement. After Amit's partition planner
changes [1], the custom plan will only lock partitions that are not
pruned, so the 6th execution of the statement bloats the local lock
table.
[1]: https://commitfest.postgresql.org/22/1778/
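The decision David describes can be sketched roughly as follows, modeled loosely on PostgreSQL's choose_custom_plan() in plancache.c (simplified; the real function also applies fuzz factors and replanning costs). The first five executions use custom plans; on the sixth, a generic plan is built for costing, which is the step that must lock every partition.

```c
#include <assert.h>

/* Simplified stand-in for the cached plan source bookkeeping */
typedef struct CachedPlanSource
{
    int     num_custom_plans;
    double  total_custom_cost;
} CachedPlanSource;

/* Returns nonzero if a custom plan should be used for this execution
 * (sketch of the plan_cache_mode = auto policy). */
static int
choose_custom_plan(CachedPlanSource *src, double generic_cost)
{
    if (src->num_custom_plans < 5)
        return 1;               /* keep building custom plans at first */

    /* from the 6th execution on, compare against average custom cost */
    return generic_cost > src->total_custom_cost / src->num_custom_plans;
}
```

The point relevant to this thread: even when the comparison keeps choosing custom plans, the one-time generic-plan costing on the sixth execution has already bloated the local lock table for the rest of the session.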
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,
On 2019-02-18 19:01:06 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
Isn't a large portion of benefits in this patch going to be mooted by
the locking improvements discussed in the other threads? I.e. there's
hopefully not going to be a ton of cases with low overhead where we
acquire a lot of locks and release them very soon after. Sure, for DDL
etc we will, but I can't see this mattering from a performance POV?
Mmm ... AIUI, the patches currently proposed can only help for what
David called "point lookup" queries. There are still going to be
queries that scan a large proportion of a partition tree, so if you've
got tons of partitions, you'll be concerned about this sort of thing.
Agreed - but it seems not unlikely that for those the rest of the
planner / executor overhead will entirely swamp any improvement we could
make here. If I understand correctly the benchmarks here were made with
"point" update and select queries, although the reference in the first
post in this thread is a bit vague.
I'm not against doing something like Tom proposes, but heuristics with
magic constants like this tend to age poorly / are hard to tune well
across systems.
I didn't say it had to be a constant ...
Do you have a good idea?
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2019-02-18 19:01:06 -0500, Tom Lane wrote:
Mmm ... AIUI, the patches currently proposed can only help for what
David called "point lookup" queries. There are still going to be
queries that scan a large proportion of a partition tree, so if you've
got tons of partitions, you'll be concerned about this sort of thing.
Agreed - but it seems not unlikely that for those the rest of the
planner / executor overhead will entirely swamp any improvement we could
make here. If I understand correctly the benchmarks here were made with
"point" update and select queries, although the reference in the first
post in this thread is a bit vague.
I think what Maumau-san is on about here is that not only does your
$giant-query take a long time, but it has a permanent negative effect
on all subsequent transactions in the session. That seems worth
doing something about.
I didn't say it had to be a constant ...
Do you have a good idea?
I think David's on the right track --- keep some kind of moving average of
the LOCALLOCK table size for each transaction, and nuke it if it exceeds
some multiple of the recent average. Not sure offhand about how to get
the data cheaply --- it might not be sufficient to look at transaction
end, if we release LOCALLOCK entries before that (but do we?)
regards, tom lane
Hi,
On 2019-02-18 18:42:32 -0500, Tom Lane wrote:
"Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> writes:
The attached patch speeds up transaction completion when any prior transaction accessed many relations in the same session.
Hm. Putting a list header for a purely-local data structure into shared
memory seems quite ugly. Isn't there a better place to keep that?
Yea, I think it'd be just as fine to store that in a static
variable (best defined directly besides LockMethodLocalHash).
(Btw, I'd be entirely unsurprised if moving away from a dynahash for
LockMethodLocalHash would be beneficial)
Do we really want a dlist here at all? I'm concerned that bloating
LOCALLOCK will cost us when there are many locks involved. This patch
increases the size of LOCALLOCK by 25% if I counted right, which does
not seem like a negligible penalty.
It's currently
struct LOCALLOCK {
LOCALLOCKTAG tag; /* 0 20 */
/* XXX 4 bytes hole, try to pack */
LOCK * lock; /* 24 8 */
PROCLOCK * proclock; /* 32 8 */
uint32 hashcode; /* 40 4 */
/* XXX 4 bytes hole, try to pack */
int64 nLocks; /* 48 8 */
_Bool holdsStrongLockCount; /* 56 1 */
_Bool lockCleared; /* 57 1 */
/* XXX 2 bytes hole, try to pack */
int numLockOwners; /* 60 4 */
/* --- cacheline 1 boundary (64 bytes) --- */
int maxLockOwners; /* 64 4 */
/* XXX 4 bytes hole, try to pack */
LOCALLOCKOWNER * lockOwners; /* 72 8 */
/* size: 80, cachelines: 2, members: 10 */
/* sum members: 66, holes: 4, sum holes: 14 */
/* last cacheline: 16 bytes */
};
seems we could trivially squeeze most of the bytes for a dlist node out
of padding.
My own thought about how to improve this situation was just to destroy
and recreate LockMethodLocalHash at transaction end (or start)
if its size exceeded $some-value. Leaving it permanently bloated seems
like possibly a bad idea, even if we get rid of all the hash_seq_searches
on it.
OTOH, that'll force constant incremental resizing of the hashtable, for
workloads that regularly need a lot of locks. And I'd assume in most
cases if one transaction needs a lot of locks it's quite likely that
future ones will need a lot of locks, too.
Greetings,
Andres Freund
Hi,
On 2019-02-18 19:13:31 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2019-02-18 19:01:06 -0500, Tom Lane wrote:
Mmm ... AIUI, the patches currently proposed can only help for what
David called "point lookup" queries. There are still going to be
queries that scan a large proportion of a partition tree, so if you've
got tons of partitions, you'll be concerned about this sort of thing.
Agreed - but it seems not unlikely that for those the rest of the
planner / executor overhead will entirely swamp any improvement we could
make here. If I understand correctly the benchmarks here were made with
"point" update and select queries, although the reference in the first
post in this thread is a bit vague.
I think what Maumau-san is on about here is that not only does your
$giant-query take a long time, but it has a permanent negative effect
on all subsequent transactions in the session. That seems worth
doing something about.
Ah, yes, that makes sense. I'm inclined to think however that the
original approach presented in this thread is better than the
reset-the-whole-hashtable approach. Because:
I think David's on the right track --- keep some kind of moving average of
the LOCALLOCK table size for each transaction, and nuke it if it exceeds
some multiple of the recent average. Not sure offhand about how to get
the data cheaply --- it might not be sufficient to look at transaction
end, if we release LOCALLOCK entries before that (but do we?)
Seems too complicated for my taste. And it doesn't solve the issue of
having some transactions with few locks (say because the plan can be
nicely pruned) interspersed with transactions where a lot of locks are
acquired.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2019-02-18 18:42:32 -0500, Tom Lane wrote:
Do we really want a dlist here at all? I'm concerned that bloating
LOCALLOCK will cost us when there are many locks involved. This patch
increases the size of LOCALLOCK by 25% if I counted right, which does
not seem like a negligible penalty.
It's currently [ 80 bytes with several padding holes ]
seems we could trivially squeeze most of the bytes for a dlist node out
of padding.
Yeah, but if we want to rearrange the members into an illogical order
to save some space, we should do that independently of this patch ---
and then the overhead of this patch would be even worse than 25%.
regards, tom lane
Hi,
On 2019-02-18 19:24:54 -0500, Tom Lane wrote:
Yeah, but if we want to rearrange the members into an illogical order
to save some space, we should do that independently of this patch ---
Sure, we should do that. I don't buy the "illogical" bit, just moving
hashcode up to after tag isn't more or less logical, and saves most of
the padding, and moving the booleans to the end isn't better/worse
either.
You always bring up that argument. While I agree that sometimes the most
optimal ordering can be less natural, I think most of the time it vastly
overstates how intelligent the original ordering was. Often new elements
were either just added iteratively without consideration for padding, or
the attention to padding was paid in 32bit times.
I don't find
struct LOCALLOCK {
LOCALLOCKTAG tag; /* 0 20 */
uint32 hashcode; /* 20 4 */
LOCK * lock; /* 24 8 */
PROCLOCK * proclock; /* 32 8 */
int64 nLocks; /* 40 8 */
int numLockOwners; /* 48 4 */
int maxLockOwners; /* 52 4 */
LOCALLOCKOWNER * lockOwners; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
_Bool holdsStrongLockCount; /* 64 1 */
_Bool lockCleared; /* 65 1 */
/* size: 72, cachelines: 2, members: 10 */
/* padding: 6 */
/* last cacheline: 8 bytes */
};
less clear than
struct LOCALLOCK {
LOCALLOCKTAG tag; /* 0 20 */
/* XXX 4 bytes hole, try to pack */
LOCK * lock; /* 24 8 */
PROCLOCK * proclock; /* 32 8 */
uint32 hashcode; /* 40 4 */
/* XXX 4 bytes hole, try to pack */
int64 nLocks; /* 48 8 */
_Bool holdsStrongLockCount; /* 56 1 */
_Bool lockCleared; /* 57 1 */
/* XXX 2 bytes hole, try to pack */
int numLockOwners; /* 60 4 */
/* --- cacheline 1 boundary (64 bytes) --- */
int maxLockOwners; /* 64 4 */
/* XXX 4 bytes hole, try to pack */
LOCALLOCKOWNER * lockOwners; /* 72 8 */
/* size: 80, cachelines: 2, members: 10 */
/* sum members: 66, holes: 4, sum holes: 14 */
/* last cacheline: 16 bytes */
};
but it's smaller (although there's plenty of trailing space).
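The effect of the reordering can be checked with stand-in types of the same sizes (a 20-byte tag, 8-byte pointers); on a typical LP64 platform the original member order pads out to 80 bytes while the reordered one fits in 72, matching the pahole output above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint32_t w[5]; } Tag;     /* 20 bytes, like LOCALLOCKTAG */

/* original member order: three interior holes plus tail padding */
typedef struct
{
    Tag         tag;            /* 20 bytes + 4-byte hole */
    void       *lock;
    void       *proclock;
    uint32_t    hashcode;       /* + 4-byte hole */
    int64_t     nLocks;
    bool        holdsStrongLockCount;
    bool        lockCleared;    /* + 2-byte hole */
    int         numLockOwners;
    int         maxLockOwners;  /* + 4-byte hole */
    void       *lockOwners;
} OrigLayout;

/* reordered: hashcode fills the hole after tag, booleans moved last */
typedef struct
{
    Tag         tag;
    uint32_t    hashcode;
    void       *lock;
    void       *proclock;
    int64_t     nLocks;
    int         numLockOwners;
    int         maxLockOwners;
    void       *lockOwners;
    bool        holdsStrongLockCount;
    bool        lockCleared;    /* 6 bytes of trailing padding */
} ReorderedLayout;
```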
and then the overhead of this patch would be even worse than 25%.
IDK, we, including you, very often make largely independent improvements
to make the cost of something else more palatable. Why's that not OK
here? Especially because we're not comparing to an alternative where no
cost is added, keeping track of e.g. a running average of the hashtable
size isn't free either; nor does it help in the intermittent cases.
- Andres
Andres Freund <andres@anarazel.de> writes:
On 2019-02-18 19:24:54 -0500, Tom Lane wrote:
Yeah, but if we want to rearrange the members into an illogical order
to save some space, we should do that independently of this patch ---
Sure, we should do that. I don't buy the "illogical" bit, just moving
hashcode up to after tag isn't more or less logical, and saves most of
the padding, and moving the booleans to the end isn't better/worse
either.
I hadn't looked at the details closely, but if we can squeeze out the
padding space without any loss of intelligibility, sure let's do so.
I still say that's independent of whether to adopt this patch though.
but it's smaller (although there's plenty of trailing space).
I think you're supposing that these things are independently palloc'd, but
they aren't --- dynahash lays them out in arrays without palloc padding.
IDK, we, including you, very often make largely independent improvements
to make the cost of something else more palatable. Why's that not OK
here?
When we do that, we aren't normally talking about overheads as high as
25% (even more, if it's measured as I think it ought to be). What I'm
concerned about is that the patch is being advocated for cases where
there are lots of LOCALLOCK entries --- which is exactly where the
space overhead is going to hurt the most.
Especially because we're not comparing to an alternative where no
cost is added, keeping track of e.g. a running average of the hashtable
size isn't free either; nor does it help in the intermittent cases.
What I was hoping for --- though perhaps it's not achievable --- was
statistical overhead amounting to just a few more instructions per
transaction. Adding dlist linking would add more instructions per
hashtable entry/removal, which seems like it'd be a substantially
bigger time penalty. As for the intermittent-usage issue, that largely
depends on the details of the when-to-reset heuristic, which we don't
have a concrete design for yet. But I could certainly imagine it waiting
for a few transactions before deciding to chomp.
Anyway, I'm not trying to veto the patch in this form, just suggesting
that there are alternatives worth thinking about.
regards, tom lane
Hi,
On 2019-02-18 20:29:29 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
but it's smaller (although there's plenty of trailing space).
I think you're supposing that these things are independently palloc'd, but
they aren't --- dynahash lays them out in arrays without palloc padding.
I don't think that matters, given that the trailing six bytes are
included in sizeof() (and have to, to guarantee suitable alignment in
arrays etc).
Greetings,
Andres Freund
On Tue, 19 Feb 2019 at 00:20, Andres Freund <andres@anarazel.de> wrote:
On 2019-02-18 19:13:31 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2019-02-18 19:01:06 -0500, Tom Lane wrote:
Mmm ... AIUI, the patches currently proposed can only help for what
David called "point lookup" queries. There are still going to be
queries that scan a large proportion of a partition tree, so if you've
got tons of partitions, you'll be concerned about this sort of thing.
I think what Maumau-san is on about here is that not only does your
$giant-query take a long time, but it has a permanent negative effect
on all subsequent transactions in the session. That seems worth
doing something about.
Ah, yes, that makes sense. I'm inclined to think however that the
original approach presented in this thread is better than the
reset-the-whole-hashtable approach.
If it was just many-tables then blowing away the hash table would work fine.
The main issue seems to be with partitioning, not with the general case of
many-tables. For that case, it seems like resetting the hash table is too much.
Can we use our knowledge of the structure of locks, i.e. partition locks
are all children of the partitioned table, to do a better job?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2/12/19 7:33 AM, Tsunakawa, Takayuki wrote:
...
This problem was uncovered while evaluating partitioning performance.
When the application PREPAREs a statement once and then
EXECUTE-COMMIT repeatedly, the server creates a generic plan on the
6th EXECUTE. Unfortunately, creation of the generic plan of
UPDATE/DELETE currently accesses all partitions of the target table
(this itself needs improvement), expanding the LOCALLOCK hash table.
As a result, the 7th and later EXECUTEs get slower.
Imai-san confirmed performance improvement with this patch:
Can you quantify the effects? That is, how much slower/faster does it get?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Tomas Vondra [mailto:tomas.vondra@2ndquadrant.com]
On 2/12/19 7:33 AM, Tsunakawa, Takayuki wrote:
Imai-san confirmed performance improvement with this patch:
Can you quantify the effects? That is, how much slower/faster does it get?
Ugh, sorry, I wrote a wrong URL. The correct page is:
/messages/by-id/0F97FA9ABBDBE54F91744A9B37151A512787EC@g01jpexmbkw24
The quoted figures are:
[v20 + faster-locallock-scan.patch]
auto: 9,069 TPS
custom: 9,015 TPS
[v20]
auto: 8,037 TPS
custom: 8,933 TPS
In the original problematic case, plan_cache_mode = auto (the default), we can see about a 13% improvement.
Regards
Takayuki Tsunakawa
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Hm. Putting a list header for a purely-local data structure into shared
memory seems quite ugly. Isn't there a better place to keep that?
Agreed. I put it in a global variable.
Do we really want a dlist here at all? I'm concerned that bloating
LOCALLOCK will cost us when there are many locks involved. This patch
increases the size of LOCALLOCK by 25% if I counted right, which does
not seem like a negligible penalty.
To delete a LOCALLOCK in RemoveLocalLock(), we need a dlist; an slist requires the list iterator to be passed in from callers.
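The difference can be illustrated with a minimal intrusive list (illustrative only, not PostgreSQL's ilist.h implementation): with prev/next links, deletion needs only the node itself, whereas a singly-linked list would need the predecessor to patch up.

```c
#include <assert.h>

typedef struct Node
{
    struct Node *prev;
    struct Node *next;
} Node;

static void
dlist_push(Node *head, Node *n)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/* O(1) deletion knowing only the node -- what RemoveLocalLock() needs,
 * since it is called with just the LOCALLOCK in hand */
static void
dlist_delete(Node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}
```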
From: Andres Freund [mailto:andres@anarazel.de]
Sure, we should do that. I don't buy the "illogical" bit, just moving
hashcode up to after tag isn't more or less logical, and saves most of
the padding, and moving the booleans to the end isn't better/worse
either.
I don't find
Thanks, I've done it.
From: Simon Riggs [mailto:simon@2ndquadrant.com]
Can we use our knowledge of the structure of locks, i.e. partition locks
are all children of the partitioned table, to do a better job?
I couldn't come up with an idea.
Regards
Takayuki Tsunakawa
Attachments:
faster-locallock-scan_v2.patch (application/octet-stream, +35 -29)
On 2019-02-20 07:20, Tsunakawa, Takayuki wrote:
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Hm. Putting a list header for a purely-local data structure into shared
memory seems quite ugly. Isn't there a better place to keep that?
Agreed. I put it in a global variable.
I think there is agreement on the principles of this patch. Perhaps it
could be polished a bit.
Your changes in LOCALLOCK still refer to PGPROC, from your first version
of the patch.
I think the reordering of struct members could be done as a separate
preliminary patch.
Some more documentation in the comment before dlist_head LocalLocks to
explain this whole mechanism would be nice.
You posted a link to some performance numbers, but I didn't see the test
setup explained there. I'd like to get some more information on the
impact of this. Is there an effect with 100 tables, or do you need 100000?
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services