WIP: dynahash replacement for buffer table
A few years ago I started working on a concurrent hash table for
PostgreSQL. The hash table part of it worked, but I never did
anything with it, really. Amit mentioned to me earlier this week that
he was seeing contention inside the dynahash machinery, which inspired
me to go back and update the patch. I took the basic infrastructure
from before and used it to replace the buffer table. Patch is
attached.
The key idea here is that lookups are done without any locks, only
memory barriers; and inserts and deletes are done using atomic ops.
The algorithm is not strictly lock-free for reasons explained in the
comments in chash.c, but it's a lot less locky than what we have now,
so in theory you might think that would be a good thing.
I haven't had time to do much performance testing yet, but it looks
like this may be slower at low client counts and faster at high client
counts. However, my results weren't really reproducible, and I haven't
done comprehensive testing yet. What was really bizarre is that I
couldn't really pin down the cause of the slowness at low client
counts; a quick perf profile showed overhead concentrated in
CHashBucketScan, basically memory access latency for walking the
bucket chain. But the table is built to have a load factor of 1, so I
can't see why that should amount to much, or why it should be
significantly worse than for dynahash.
This patch contains assorted leftovers and is grotty in various ways,
but I'm sharing it anyway just to get it out there.
git branch also available at:
http://git.postgresql.org/gitweb/?p=users/rhaas/postgres.git;a=shortlog;h=refs/heads/chash2014
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment: chash-buftable-v1.patch (text/x-patch, charset=US-ASCII)
On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
A few years ago I started working on a concurrent hash table for
PostgreSQL. The hash table part of it worked, but I never did
anything with it, really. Amit mentioned to me earlier this week that
he was seeing contention inside the dynahash machinery, which inspired
me to go back and update the patch.
Interestingly I've benchmarked similar loads, even on the same machine
as Amit, and I do see tremendous time spent in dynahash.c. It's nearly
all cache misses in my tests though.
I took the basic infrastructure
from before and used it to replace the buffer table. Patch is
attached.
That's pretty cool. The algorithm is complex enough that I haven't fully
understood it yet, but it sounds sane on a first glance.
The key idea here is that lookups are done without any locks, only
memory barriers; and inserts and deletes are done using atomic ops.
Hm. I quickly looked and I see that you use two full barriers for every
lookup. That's pretty expensive. I think this should be doable using
only read/write barriers on the lookup side? Even on architectures where
read/write memory barriers have to be something other than a compiler barrier,
they're still usually an order of magnitude or so cheaper than full barriers.
The algorithm is not strictly lock-free for reasons explained in the
comments in chash.c, but it's a lot less locky than what we have now,
so in theory you might think that would be a good thing.
I don't see much reason to strive for full lock-freeness. If the locks
aren't noticeable in real world scenarios - who cares?
I haven't had time to do much performance testing yet, but it looks
like this may be slower at low client counts and faster at high client
counts. However, my results weren't really reproducible, and I haven't
done comprehensive testing yet. What was really bizarre is that I
couldn't really pin down the cause of the slowness at low client
counts; a quick perf profile showed overhead concentrated in
CHashBucketScan, basically memory access latency for walking the
bucket chain. But the table is built to have a load factor of 1, so I
can't see why that should amount to much, or why it should be
significantly worse than for dynahash.
With regard for using a hash table for the buffer mapping lock I'm
doubtful that any form of separate chaining is the right one. We
currently have a quite noticeable problem with the number of cache
misses in the buffer mapping hash (and also the heavyweight lock mgr) -
if we stay with hashes that's only going to be fundamentally lower than
dynahash if we change the type of hashing. I've had good, *preliminary*,
results using an open addressing + linear probing approach.
My guess is that the additional indirection via the arena explains the
difference in cache misses? But I might be completely off here.
It'd be quite interesting to see a perf stat -ddd of dynahash.c/chash
employing builds for some comparable load.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
I took the basic infrastructure from before and used it to replace the
buffer table. Patch is attached.
Hm. Unless I missed something you pretty much completely eradicated
locks from the buffer mapping code? That'd be pretty cool, but it's also
scary.
How confident are you that your conversion is correct? Not in the sense
that there aren't any buglets, but fundamentally.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Oct 14, 2014 at 7:55 PM, Andres Freund <andres@2ndquadrant.com>
wrote:
On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
A few years ago I started working on a concurrent hash table for
PostgreSQL. The hash table part of it worked, but I never did
anything with it, really. Amit mentioned to me earlier this week that
he was seeing contention inside the dynahash machinery, which inspired
me to go back and update the patch.
Interestingly I've benchmarked similar loads, even on the same machine
as Amit,
There is one catch here, for these profiles I am using Power-8 m/c
and the load is slightly higher (5000 scale factor).
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2014-10-14 20:30:45 +0530, Amit Kapila wrote:
On Tue, Oct 14, 2014 at 7:55 PM, Andres Freund <andres@2ndquadrant.com>
wrote:
On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
A few years ago I started working on a concurrent hash table for
PostgreSQL. The hash table part of it worked, but I never did
anything with it, really. Amit mentioned to me earlier this week that
he was seeing contention inside the dynahash machinery, which inspired
me to go back and update the patch.
Interestingly I've benchmarked similar loads, even on the same machine
as Amit,
There is one catch here, for these profiles I am using Power-8 m/c
and the load is slightly higher (5000 scale factor).
Ah, right. I don't think the scale factor changes much, but the
different architecture certainly does. As I said elsewhere, I would not
believe these profiles much until they're actually done with optimized
code...
I also think we need to be a bit careful about optimizing too much for
stock pgbench with working set >> s_b. The uniform readonly access
pattern you're profiling here isn't something super realistic. Still
valuable, but we should take it with a grain of salt.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Oct 14, 2014 at 10:47 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
I took the basic infrastructure from before and used it to replace the
buffer table. Patch is attached.
Hm. Unless I missed something you pretty much completely eradicated
locks from the buffer mapping code? That'd be pretty cool, but it's also
scary.
How confident are you that your conversion is correct? Not in the sense
that there aren't any buglets, but fundamentally.
It doesn't look particularly dangerous to me. Famous last words.
Basically, I think what we're doing right now is holding the buffer
mapping lock so that the buffer can't be renamed under us while we're
pinning it. If we don't do that, I think the only consequence is
that, by the time we get the buffer pin, the buffer might no longer be
the one we looked up. So we need to recheck. But assuming we do
that, I don't see an issue. In fact, it occurred to me while I was
cobbling this together that we might want to experiment with that
change independently of chash. We already know (from the
StrategyGetBuffer locking changes) that holding system-wide locks to
prevent people from twaddling the state of individual buffers can be a
loser. This case isn't nearly as bad, because we're only pinning one
buffer, rather than potentially all of them multiple times, and the
lock we're holding only affects 1/128th of the buffers, not all of
them.
The other application I had in mind for chash was SLRU stuff. I
haven't tried to write the code yet, but fundamentally, the same
considerations apply there. You do the lookup locklessly to find out
which buffer is thought to contain the SLRU page you want, but by the
time you lock the page the answer might be out of date. Assuming that
this doesn't happen many times in a row and that lookups are
relatively cheap, you could get rid of having any centralized lock for
SLRU, and just have per-page locks.
I'm not quite sure what fundamental dangers you're thinking about
here, but from what I understand from reading the literature, doing
lookups in a lock-free hash table needs to be thought about in a sort
of relativistic way. The different CPUs have slightly different views
of the world, so any answer you get may be obsolete by the time you
get it, and may according to some other CPU's view of the world have
been obsolete even at the time that it was copied. That's OK, because
it's just equivalent to some other CPU doing its hash table update
slightly sooner or later than it actually did it, within the bounds
imposed by memory barriers, which it could have done anyway. There is
no one source of truth. The result of all this is that some of the
synchronization responsibility is transferred to the caller. That's
obviously a bit more complex to reason about, but it hasn't stopped
lock-free hash tables from being wildly popular data structures.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Oct 14, 2014 at 8:38 PM, Andres Freund <andres@2ndquadrant.com>
wrote:
On 2014-10-14 20:30:45 +0530, Amit Kapila wrote:
On Tue, Oct 14, 2014 at 7:55 PM, Andres Freund <andres@2ndquadrant.com>
wrote:
On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
A few years ago I started working on a concurrent hash table for
PostgreSQL. The hash table part of it worked, but I never did
anything with it, really. Amit mentioned to me earlier this week
that
he was seeing contention inside the dynahash machinery, which
inspired
me to go back and update the patch.
Interestingly I've benchmarked similar loads, even on the same machine
as Amit,
There is one catch here, for these profiles I am using Power-8 m/c
and the load is slightly higher (5000 scale factor).
Ah, right. I don't think the scale factor changes much, but the
different architecture certainly does. As I said elsewhere, I would not
believe these profiles much until they're actually done with optimized
code...
Today, that m/c is not accessible, so will take a day or so to get the
optimized profiles and will post it once I am able to take the same.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Oct 14, 2014 at 10:25 AM, Andres Freund <andres@2ndquadrant.com> wrote:
The key idea here is that lookups are done without any locks, only
memory barriers; and inserts and deletes are done using atomic ops.
Hm. I quickly looked and I see that you use two full barriers for every
lookup. That's pretty expensive. I think this should be doable using
only read/write barriers on the lookup side? Even on architectures where
read/write memory barriers have to be something other than a compiler barrier,
they're still usually an order of magnitude or so cheaper than full barriers.
The code in CHashSearch shows the problem there: you need to STORE the
hazard pointer before you begin to do the LOAD operations to scan the
bucket, and you must finish all of those LOADs before you STORE the
NULL hazard pointer. A read or write barrier won't provide store/load
or load/store ordering.
I don't see much reason to strive for full lock-freeness. If the locks
aren't noticeable in real world scenarios - who cares?
That's my feeling too. ISTR that when I stress-tested the hash table
back in 2012 when I started this code, the concurrency was far better
than dynahash with 16 lwlock partitions. I don't remember off hand
exactly how I tested that, but it was with the code in hashtest.c. I
also seem to recall that it was possible to make the freelists get
VERY hot, but of course that was a workload where hash table updates
were the only thing happening, so I assume that on a workload where
we're actually doing work (like, copying the data in and out of kernel
buffers) that's not going to be a real problem unless you have
thousands of cores. But we'll see.
With regard for using a hash table for the buffer mapping lock I'm
doubtful that any form of separate chaining is the right one. We
currently have a quite noticeable problem with the number of cache
misses in the buffer mapping hash (and also the heavyweight lock mgr) -
if we stay with hashes that's only going to be fundamentally lower than
dynahash if we change the type of hashing. I've had good, *preliminary*,
results using an open addressing + linear probing approach.
I'm very skeptical of open addressing + linear probing. Such hash
tables tend to degrade over time, because you have to leave tombstones
behind to ensure that future scans advance far enough. There's really
no way to recover without rebuilding the whole hash table, and
eventually it degrades to linear search. If we're spending too much
time walking hash chains, I think the solution is to increase the
number of buckets so that the chains get shorter.
My guess is that the additional indirection via the arena explains the
difference in cache misses? But I might be completely off here.
The arena makes the calculation of the pointer address less
predictable, which might make things more difficult for the CPU
pipeline. But aside from that, I think it's the same basic idea: you
bounce from some sort of bucket header to the first element of the
chain and then, hopefully, you're done. Am I missing something?
It'd be quite interesting to see a perf stat -ddd of dynahash.c/chash
employing builds for some comparable load.
I'm hoping you can test this out on x86. All I have to work with are
the POWER machines, and the characteristics of those are somewhat
different. I can test it there, though.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2014-10-14 11:08:08 -0400, Robert Haas wrote:
On Tue, Oct 14, 2014 at 10:47 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
I took the basic infrastructure from before and used it to replace the
buffer table. Patch is attached.
Hm. Unless I missed something you pretty much completely eradicated
locks from the buffer mapping code? That'd be pretty cool, but it's also
scary.
How confident are you that your conversion is correct? Not in the sense
that there aren't any buglets, but fundamentally.
It doesn't look particularly dangerous to me. Famous last words.
Basically, I think what we're doing right now is holding the buffer
mapping lock so that the buffer can't be renamed under us while we're
pinning it.
What I'm afraid of is that there's hidden assumptions about the
consistency provided by the mapping locks.
If we don't do that, I think the only consequence is
that, by the time we get the buffer pin, the buffer might no longer be
the one we looked up. So we need to recheck. But assuming we do
that, I don't see an issue. In fact, it occurred to me while I was
cobbling this together that we might want to experiment with that
change independently of chash. We already know (from the
StrategyGetBuffer locking changes) that holding system-wide locks to
prevent people from twaddling the state of individual buffers can be a
loser. This case isn't nearly as bad, because we're only pinning one
buffer, rather than potentially all of them multiple times, and the
lock we're holding only affects 1/128th of the buffers, not all of
them.
Also IIRC at least Linux has targeted wakeup/time transfer logic when
using semaphores - and doesn't for spinlocks. So if you don't happen
to sleep while the lwlock's spinlock is held, the result will be
better. Only that we'll frequently sleep within that for very frequently
taken locks...
The other application I had in mind for chash was SLRU stuff. I
haven't tried to write the code yet, but fundamentally, the same
considerations apply there. You do the lookup locklessly to find out
which buffer is thought to contain the SLRU page you want, but by the
time you lock the page the answer might be out of date. Assuming that
this doesn't happen many times in a row and that lookups are
relatively cheap, you could get rid of having any centralized lock for
SLRU, and just have per-page locks.
Hm. I have to admit I haven't really looked enough into the slru code to
judge this. My impression so far wasn't that the locking itself was the
problem in most scenarios - that's not what you've seen?
I'm not quite sure what fundamental dangers you're thinking about
here
Oh, only in the context of the bufmgr.c conversion. Not more
generally. I agree that a lockfree hashtable is something quite
worthwile to have.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Oct 14, 2014 at 11:31 AM, Andres Freund <andres@2ndquadrant.com> wrote:
It doesn't look particularly dangerous to me. Famous last words.
Basically, I think what we're doing right now is holding the buffer
mapping lock so that the buffer can't be renamed under us while we're
pinning it.
What I'm afraid of is that there's hidden assumptions about the
consistency provided by the mapping locks.
That's certainly worth checking for, but I think the only code that
needs to be checked is the code that would formerly have run while
holding said lock. And there isn't that much of that.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2014-10-14 11:19:16 -0400, Robert Haas wrote:
On Tue, Oct 14, 2014 at 10:25 AM, Andres Freund <andres@2ndquadrant.com> wrote:
The key idea here is that lookups are done without any locks, only
memory barriers; and inserts and deletes are done using atomic ops.
Hm. I quickly looked and I see that you use two full barriers for every
lookup. That's pretty expensive. I think this should be doable using
only read/write barriers on the lookup side? Even on architectures where
read/write memory barriers have to be something other than a compiler barrier,
they're still usually an order of magnitude or so cheaper than full barriers.
The code in CHashSearch shows the problem there: you need to STORE the
hazard pointer before you begin to do the LOAD operations to scan the
bucket, and you must finish all of those LOADs before you STORE the
NULL hazard pointer. A read or write barrier won't provide store/load
or load/store ordering.
I'm not sure I understand the problem here - but a read + write barrier
should suffice? To avoid falling back to two full barriers, we'd need a
separate define pg_read_write_barrier(), but that's not a problem. IIUC
that should allow us to avoid emitting any actual code on x86.
With regard for using a hash table for the buffer mapping lock I'm
doubtful that any form of separate chaining is the right one. We
currently have a quite noticeable problem with the number of cache
misses in the buffer mapping hash (and also the heavyweight lock mgr) -
if we stay with hashes that's only going to be fundamentally lower than
dynahash if we change the type of hashing. I've had good, *preliminary*,
results using an open addressing + linear probing approach.
I'm very skeptical of open addressing + linear probing. Such hash
tables tend to degrade over time, because you have to leave tombstones
behind to ensure that future scans advance far enough.
You can try to be a bit smarter than a plain open addressing
approach. But yes, it has disadvantages.
There's really
no way to recover without rebuilding the whole hash table, and
eventually it degrades to linear search. If we're spending too much
time walking hash chains, I think the solution is to increase the
number of buckets so that the chains get shorter.
Might be worthwhile to try. I know that both my hand-rolled open
addressing and my hand-rolled chaining approach have significantly fewer
cache misses than dynahash for the same amount of work.
Hm. Possibly that's at least partially because of the segment stuff in
dynahash?
Anyway, you can get the size of the entire hashtable down quite a
bit. Primarily because there's no need to store a next pointer. There's
also not really any need for the 'hashvalue' in the bufmgr case
imo.
My guess is that the additional indirection via the arena explains the
difference in cache misses? But I might be completely off here.
The arena makes the calculation of the pointer address less
predictable, which might make things more difficult for the CPU
pipeline. But aside from that, I think it's the same basic idea: you
bounce from some sort of bucket header to the first element of the
chain and then, hopefully, you're done. Am I missing something?
I haven't really read much of the code yet (wrote most of this email on
my workstation before embarking on a plane), but afaics there's one
indirection to the bucket, and then one to the first node in the linked
list. Right?
In dynahash it's first to the segment, but there are only a few, so it's
likely to be cached. Then to the bucket - which, afaics, already can
contain the key.
So it's not really the arena, but that you chose not to store the first
element in a chain inline...
Greetings,
Andres Freund
On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund <andres@2ndquadrant.com> wrote:
The code in CHashSearch shows the problem there: you need to STORE the
hazard pointer before you begin to do the LOAD operations to scan the
bucket, and you must finish all of those LOADs before you STORE the
NULL hazard pointer. A read or write barrier won't provide store/load
or load/store ordering.
I'm not sure I understand the problem here - but a read + write barrier
should suffice? To avoid falling back to two full barriers, we'd need a
separate define pg_read_write_barrier(), but that's not a problem. IIUC
that should allow us to avoid emitting any actual code on x86.
Well, thinking about x86 specifically, it definitely needs at least
one mfence, after setting the hazard pointer and before beginning to
read the bucket chain. It probably doesn't need the second mfence,
before clearing the hazard pointer, because x86 allows loads to be
reordered before stores but not stores before loads. We could
certainly try removing the second fence on x86 for benchmarking
purposes, and we could also check whether mfence outperforms lock;
addl in that scenario.
I tested PPC, because that's what I have. I think we're emitting two
sync instructions there, but maybe lwsync or isync would be enough in
one or both cases. The first of these links indicates that lwsync
ought to be enough for both cases; the second says that we need a sync
after setting the hazard pointer but that an lwsync is enough before
clearing it. Or that's my reading anyway.
http://www.ibm.com/developerworks/systems/articles/powerpc.html
http://peeterjoot.wordpress.com/2010/07/23/use-of-lwsync-instead-of-isync-as-a-mutex-acquire-memory-barrier/
My guess is that the additional indirection via the arena explains the
difference in cache misses? But I might be completely off here.The arena makes the calculation of the pointer address less
predictable, which might make things more difficult for the CPU
pipeline. But aside from that, I think it's the same basic idea: you
bounce from some sort of bucket header to the first element of the
chain and then, hopefully, you're done. Am I missing something?
I haven't really read much of the code yet (wrote most of this email on
my workstation before embarking on a plane), but afaics there's one
indirection to the bucket, and then one to the first node in the linked
list. Right?
In dynahash it's first to the segment, but there are only a few, so it's
likely to be cached. Then to the bucket - which, afaics, already can
contain the key.
So it's not really the arena, but that you chose not to store the first
element in a chain inline...
Hmm, you have a point. I think the reason I did it that way is
because I don't know how to make the memory management work out
otherwise. If you store the first element inline, then even an empty
bucket uses up as much memory space as one with a single element.
More seriously, it breaks the deletion algorithm. There's no good way
to atomically swap out the bucket header if it is located in a fixed
position and bigger than the size of object the machine can manipulate
atomically.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2014-10-14 17:53:10 -0400, Robert Haas wrote:
On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund <andres@2ndquadrant.com> wrote:
The code in CHashSearch shows the problem there: you need to STORE the
hazard pointer before you begin to do the LOAD operations to scan the
bucket, and you must finish all of those LOADs before you STORE the
NULL hazard pointer. A read or write barrier won't provide store/load
or load/store ordering.
I'm not sure I understand the problem here - but a read + write barrier
should suffice? To avoid falling back to two full barriers, we'd need a
separate define pg_read_write_barrier(), but that's not a problem. IIUC
that should allow us to avoid emitting any actual code on x86.
Well, thinking about x86 specifically, it definitely needs at least
one mfence, after setting the hazard pointer and before beginning to
read the bucket chain.
Yes, I can see that now. I do wonder if that's not going to prove quite
expensive... I think I can see some ways around needing that. But I
think that needs some benchmarking first - no need to build an even more
complex solution if not needed.
The solution I'm thinking of is to essentially move away from hazard
pointers and store something like a generation counter per
backend. Which is updated less often, and in combination with other
operations. When a backend need to clean up and sees that there's a
straggler with a old generation it sends that backend a signal to ensure
it sets the latest generation.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Oct 14, 2014 at 6:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
With regard for using a hash table for the buffer mapping lock I'm
doubtful that any form of separate chaining is the right one. We
currently have a quite noticeable problem with the number of cache
misses in the buffer mapping hash (and also the heavyweight lock mgr) -
if we stay with hashes that's only going to be fundamentally lower than
dynahash if we change the type of hashing. I've had good, *preliminary*,
results using an open addressing + linear probing approach.
I'm very skeptical of open addressing + linear probing. Such hash
tables tend to degrade over time, because you have to leave tombstones
behind to ensure that future scans advance far enough. There's really
no way to recover without rebuilding the whole hash table, and
eventually it degrades to linear search. If we're spending too much
time walking hash chains, I think the solution is to increase the
number of buckets so that the chains get shorter.
What about cuckoo hashing? There was a recent paper on how to do fine
grained locking with cuckoo hashes. [1]
I'm imagining a bucketized cuckoo hash with 5 item buckets (5-way
associativity). This allows us to fit the bucket onto 2 regular sized
cache lines and have 8 bytes left over. Buckets would be protected by
seqlocks stored in the extra space. On the read side we would only
need 2 read barriers (basically free on x86), and we are guaranteed to
have an answer by fetching two pairs of cache lines. We can trade
memory bandwidth for latency by issuing prefetches (once we add
primitives for this). Alternatively we can trade insert speed for
lookup speed by using asymmetrically sized tables.
[1]: https://www.cs.princeton.edu/~mfreed/docs/cuckoo-eurosys14.pdf
Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
On 15/10/2014 10:32 AM, Ants Aasma wrote:
On Tue, Oct 14, 2014 at 6:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
With regard for using a hash table for the buffer mapping lock I'm
doubtful that any form of separate chaining is the right one. We
currently have a quite noticeable problem with the number of cache
misses in the buffer mapping hash (and also the heavyweight lock mgr) -
if we stay with hashes that's only going to be fundamentally lower than
dynahash if we change the type of hashing. I've had good, *preliminary*,
results using an open addressing + linear probing approach.I'm very skeptical of open addressing + linear probing. Such hash
tables tend to degrade over time, because you have to leave tombstones
behind to ensure that future scans advance far enough. There's really
no way to recover without rebuilding the whole hash table, and
eventually it degrades to linear search. If we're spending too much
time walking hash chains, I think the solution is to increase the
number of buckets so that the chains get shorter.What about cuckoo hashing? There was a recent paper on how to do fine
grained locking with cuckoo hashes. [1]

I'm imagining a bucketized cuckoo hash with 5 item buckets (5-way
associativity). This allows us to fit the bucket onto 2 regular sized
cache lines and have 8 bytes left over. Buckets would be protected by
seqlocks stored in the extra space. On the read side we would only
need 2 read barriers (basically free on x86), and we are guaranteed to
have an answer by fetching two pairs of cache lines. We can trade
memory bandwidth for latency by issuing prefetches (once we add
primitives for this). Alternatively we can trade insert speed for
lookup speed by using asymmetrically sized tables.

[1] https://www.cs.princeton.edu/~mfreed/docs/cuckoo-eurosys14.pdf
Actually, I'd go with second-chance hashing [2], same number of hash
functions but it's more stable (no infinite loops, for example). Most
probably the techniques from [1] would apply equally well.
[2]: http://www.eecs.harvard.edu/~michaelm/postscripts/infocom_hardware_submit.pdf
Ryan
On Wed, Oct 15, 2014 at 2:03 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-10-14 17:53:10 -0400, Robert Haas wrote:
On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund <andres@2ndquadrant.com> wrote:
The code in CHashSearch shows the problem there: you need to STORE the
hazard pointer before you begin to do the LOAD operations to scan the
bucket, and you must finish all of those LOADs before you STORE the
NULL hazard pointer. A read or write barrier won't provide store/load
or load/store ordering.

I'm not sure I understand the problem here - but a read + write barrier
should suffice? To avoid falling back to two full barriers, we'd need a
separate define pg_read_write_barrier(), but that's not a problem. IIUC
that should allow us to avoid emitting any actual code on x86.

Well, thinking about x86 specifically, it definitely needs at least
one mfence, after setting the hazard pointer and before beginning to
read the bucket chain.

Yes, I can see that now. I do wonder if that's not going to prove quite
expensive... I think I can see some ways around needing that. But I
think that needs some benchmarking first - no need to build an even more
complex solution if not needed.

The solution I'm thinking of is to essentially move away from hazard
pointers and store something like a generation counter per
backend. Which is updated less often, and in combination with other
operations. When a backend needs to clean up and sees that there's a
straggler with an old generation it sends that backend a signal to ensure
it sets the latest generation.
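A minimal single-process sketch of that generation-counter scheme (all names hypothetical; slot 0 stands in for "this backend", and raise() stands in for signalling another process with kill()):

```c
#include <signal.h>

#define MAX_BACKENDS 4

/* Per-backend generation, published opportunistically - e.g. piggybacked
 * on operations that are full barriers anyway, like lock acquisition. */
static volatile sig_atomic_t backend_gen[MAX_BACKENDS];
static volatile sig_atomic_t global_gen = 1;

/* Fallback for stragglers: the cleanup path signals any backend whose
 * published generation is stale, and the handler publishes the latest. */
static void
publish_gen_handler(int sig)
{
    (void) sig;
    backend_gen[0] = global_gen;    /* slot 0 stands in for "my backend" */
}

/* Returns 1 once no backend can still be using objects retired before
 * old_gen, signalling stragglers to catch up along the way. */
static int
generation_reclaim_check(sig_atomic_t old_gen, int nbackends)
{
    int         ok = 1;

    for (int i = 0; i < nbackends; i++)
        if (backend_gen[i] < old_gen)
        {
            raise(SIGUSR1);         /* kill(backend_pid[i], ...) in reality */
            if (backend_gen[i] < old_gen)
                ok = 0;             /* still stale: caller must wait/retry */
        }
    return ok;
}
```

In a real multi-process setting the caller would sleep and retry rather than assume the signal is handled synchronously; the raise()-to-self here only works because the straggler is the same process.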
It's possible that might work ... but on the timescale we're talking
about here, that's asking the garbage collecting process to wait for
practically geologic time.
Back when I first wrote this code, I spent a fair amount of time
looking at existing work in the area of lock-free hash tables.
Essentially all of the work I was able to find on this topic assumes a
threaded model - or more precisely, it assumes that shared memory
allocation is cheap and easy and you'll have no problem getting as
much of it as you need whenever you need it. This assumption often
isn't even spelled out explicitly: it's just assumed that that's the
programming environment you're working in. Finding highly parallel
algorithms that don't rely on memory allocation as a primitive is
hard. Hazard pointers are one of the few techniques I found that
seems like it can work in our architecture. I'm quite reluctant to
leave aside techniques that have been proven to work well in favor of
inventing something entirely novel to PostgreSQL.
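For reference, the reader-side protocol under discussion looks roughly like this (hypothetical names, not the actual chash.c code, and simplified: a full hazard-pointer chain walk must re-validate at each link, not just at the bucket head). The full fence is what provides the store/load ordering that plain read or write barriers cannot:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct HashNode
{
    uint32_t    key;
    uint32_t    val;
    struct HashNode *next;
} HashNode;

/* One hazard-pointer slot for this backend (one per backend in reality). */
static HashNode *volatile my_hazard;

/* Stand-in for pg_memory_barrier(); on x86 this must compile to mfence
 * (or a lock-prefixed instruction), not just a compiler barrier. */
static void full_fence(void) { __sync_synchronize(); }

static int
chash_search(HashNode *volatile *bucket, uint32_t key, uint32_t *val)
{
    HashNode   *n;

restart:
    n = *bucket;
    my_hazard = n;              /* STORE: advertise what we're about to read */
    full_fence();               /* store/load ordering: without it the store
                                 * can sit in the store buffer while we
                                 * already LOAD the chain, so the garbage
                                 * collector may miss our hazard pointer */
    if (n != *bucket)           /* LOAD: recheck; if the head changed before
                                 * our hazard pointer became visible, the
                                 * node might already have been recycled */
        goto restart;

    for (; n != NULL; n = n->next)
        if (n->key == key)
        {
            *val = n->val;
            my_hazard = NULL;   /* done: allow reclamation again */
            return 1;
        }

    my_hazard = NULL;
    return 0;
}
```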
That having been said, there is some literature on generation numbers,
and I think something like that could be made to work. It does have
some significant disadvantages, though. One is that a single process
which fails to update its generation number prevents all reclamation,
system-wide. In my algorithm, a process whose execution is
suspended while a hazard pointer is set prevents recycling of only one
of many garbage lists. A process searching for a reusable element can
most likely find some other garbage list to reclaim instead. Also,
a generation number implies that we update the value periodically,
rather than before and after each critical section. Anything that
forces garbage collection to be postponed longer than absolutely
necessary seems to me likely to be a loser. It might be a little
faster as long as we have free elements to allocate, but as soon as
we're out and have to wait longer than we otherwise would for garbage
collection, and all system activity halts as a result, even for a few
milliseconds, it's going to be a whole lot slower. Or at least, I
think.
That having been said, I don't know what to do about the fact that the
fence is too expensive. I don't know that we've really established
that that's the true root of the problem rather than some other
pedestrian optimization failure. But the existing code is using an
atomic operation to acquire a spinlock, then releasing it, walking the
bucket chain, and then using another atomic operation to acquire a
spinlock and then releasing it again. Surely a pure fence shouldn't
cost more than a spinlock cycle? Even with arguably one extra cache
line touch, that seems like it ought to be a win. But my intuitions
in this area are shaky.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2014-10-15 13:41:25 -0400, Robert Haas wrote:
On Wed, Oct 15, 2014 at 2:03 AM, Andres Freund <andres@2ndquadrant.com> wrote:
The solution I'm thinking of is to essentially move away from hazard
pointers and store something like a generation counter per
backend. Which is updated less often, and in combination with other
operations. When a backend needs to clean up and sees that there's a
straggler with an old generation it sends that backend a signal to ensure
it sets the latest generation.

It's possible that might work ... but on the timescale we're talking
about here, that's asking the garbage collecting process to wait for
practically geologic time.
I think it depends on what we're tying the generation increase to. We
very well could add an implicit generation increment to, say, lwlock
acquisition - which are full barriers anyway. And I don't think we'll
normally have that high a rate of garbage collection anyway - there
should be plenty of headroom.
Back when I first wrote this code, I spent a fair amount of time
looking at existing work in the area of lock-free hash tables.
Essentially all of the work I was able to find on this topic assumes a
threaded model - or more precisely, it assumes that shared memory
allocation is cheap and easy and you'll have no problem getting as
much of it as you need whenever you need it.
FWIW, I think many of the solutions that are actually used in practice
don't rely on that heavily. I know at least some that require memory to
be reserved beforehand, in special contexts.
This assumption often
isn't even spelled out explicitly: it's just assumed that that's the
programming environment you're working in. Finding highly parallel
algorithms that don't rely on memory allocation as a primitive is
hard. Hazard pointers are one of the few techniques I found that
seems like it can work in our architecture. I'm quite reluctant to
leave aside techniques that have been proven to work well in favor of
inventing something entirely novel to PostgreSQL.
I don't think something like generation numbers is really that new and
unproven - it's essentially a more trivial version of RCU. Which is used
quite heavily, and under different names.
That said, I'm far from convinced that they are the solution - they just
were the first thing that came to my mind thinking about the problem.
That having been said, there is some literature on generation numbers,
and I think something like that could be made to work. It does have
some significant disadvantages, though. One is that a single process
which fails to update its generation number prevents all reclamation,
system-wide. In my algorithm, a process whose execution is
suspended while a hazard pointer is set prevents recycling of only one
of many garbage lists. A process searching for a reusable element can
most likely find some other garbage list to reclaim instead.
Hm. Couldn't a similar scheme with several lists be used with generation
numbers?
Also, a generation number implies that we update the value
periodically, rather than before and after each critical section.
Hm. Don't think it necessarily has to do that.
Anything that forces garbage collection to be postponed longer than
absolutely necessary seems to me likely to be a loser. It might be a
little faster as long as we have free elements to allocate, but as
soon as we're out and have to wait longer than we otherwise would for
garbage collection, and all system activity halts as a result, even
for a few milliseconds, it's going to be a whole lot slower. Or at
least, I think.
I think it really depends on the user of the facility. If the generation
were e.g. also tied to lwlocks I couldn't see bufmgr.c outrunning that
realistically.
That having been said, I don't know what to do about the fact that the
fence is too expensive.
I'm far from sure that it's the problem at hand here - the reason I'm
wondering about the barriers is primarily that the buffer mapping hash
table is one of the top profile entries in all concurrent
workloads. The top one often even, after removing some locking
bottleneck. Nearly all of the time is spent there due to cache
misses. While I think we can get the table smaller and more efficient
for the same NBuffers value, realistically we need to cope with much
bigger NBuffers values.
Since cache misses are a problem that we're going to have to deal with,
restricting the processor's tool for efficiently dealing with that (out
of order execution, SMT), doesn't seem like a wise choice.
I don't know that we've really established
that that's the true root of the problem rather than some other
pedestrian optimization failure.
Absolutely.
But the existing code is using an
atomic operation to acquire a spinlock, then releasing it, walking the
bucket chain, and then using another atomic operation to acquire a
spinlock and then releasing it again.
Well, we don't do that for lookups right now. Just for
insertions/removals. Or are you talking about the buffer mapping lock?
Surely a pure fence shouldn't cost more than a spinlock cycle?
It really shouldn't - there's a difference right now that there's some
other thing the processor can do while dealing with the cache miss.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2014-10-14 17:53:10 -0400, Robert Haas wrote:
On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund <andres@2ndquadrant.com> wrote:
The code in CHashSearch shows the problem there: you need to STORE the
hazard pointer before you begin to do the LOAD operations to scan the
bucket, and you must finish all of those LOADs before you STORE the
NULL hazard pointer. A read or write barrier won't provide store/load
or load/store ordering.

I'm not sure I understand the problem here - but a read + write barrier
should suffice? To avoid falling back to two full barriers, we'd need a
separate define pg_read_write_barrier(), but that's not a problem. IIUC
that should allow us to avoid emitting any actual code on x86.

Well, thinking about x86 specifically, it definitely needs at least
one mfence, after setting the hazard pointer and before beginning to
read the bucket chain. It probably doesn't need the second mfence,
before clearing the hazard pointer, because x86 allows loads to be
reordered before stores but not stores before loads. We could
certainly try removing the second fence on x86 for benchmarking
purposes, and we could also check whether mfence outperforms lock;
addl in that scenario.
Hm. So I thought about this for a while this morning after I wasn't able
to unprogram my hotel room's alarm clock that a previous occupant set
way too early. Given that premise, take the following with a grain of
salt.
The reason for needing an mfence is that a read/write barrier doesn't
guarantee that the store is visible to other processes - just in which
order they become visible. Right? And that's essentially because the
write might sit in the process's store buffer.
So, how about *not* using a full barrier on the reader side (i.e. the
side that does the hazard pointer writes). But instead do something like
an atomic_fetch_add(hazard_ptr, 0) (i.e. lock; xaddl) on the side that
needs to scan the hazard pointers. That, combined with the write memory
barrier, should guarantee that the store buffers are emptied. Which is
pretty nice, because it moves the overhead to the rather infrequent
scanning of the hazard pointers - which needs to do lots of other atomic
ops anyway.
Sounds sane?
That's something that should best be simulated in an exhaustive x86
simulator, but it sounds sane - and it'd be quite awesome imo :)
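The asymmetric scheme above might be sketched as follows (hypothetical names, using GCC atomic builtins; whether the lock-prefixed RMW on the scanner side really suffices is exactly what would need verification in a memory-model simulator):

```c
#include <stdint.h>

#define MAX_BACKENDS 8

/* One hazard-pointer slot per backend (hypothetical layout). */
static uintptr_t hazard_ptr[MAX_BACKENDS];

/* Hot path (every lookup): publish the hazard pointer with an ordered
 * store only - no mfence.  The store may linger in this backend's
 * store buffer for a while. */
static void
publish_hazard(int backend, void *node)
{
    __atomic_store_n(&hazard_ptr[backend], (uintptr_t) node,
                     __ATOMIC_RELEASE);
}

/* Cold path (garbage collection): read each slot with a lock-prefixed
 * read-modify-write (fetch_add of 0, i.e. lock; xaddl).  On x86 the
 * locked RMW takes exclusive ownership of the cache line, so a
 * hazard-pointer store that was ordered before our RMW cannot be
 * missed - the expensive synchronization lands on the infrequent
 * scanning side instead of the lookup side. */
static uintptr_t
scan_hazard(int backend)
{
    return __atomic_fetch_add(&hazard_ptr[backend], 0, __ATOMIC_SEQ_CST);
}
```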
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 15/10/2014 11:41 AM, Robert Haas wrote:
On Wed, Oct 15, 2014 at 2:03 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-10-14 17:53:10 -0400, Robert Haas wrote:
On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund <andres@2ndquadrant.com> wrote:
The code in CHashSearch shows the problem there: you need to STORE the
hazard pointer before you begin to do the LOAD operations to scan the
bucket, and you must finish all of those LOADs before you STORE the
NULL hazard pointer. A read or write barrier won't provide store/load
or load/store ordering.

I'm not sure I understand the problem here - but a read + write barrier
should suffice? To avoid falling back to two full barriers, we'd need a
separate define pg_read_write_barrier(), but that's not a problem. IIUC
that should allow us to avoid emitting any actual code on x86.

Well, thinking about x86 specifically, it definitely needs at least
one mfence, after setting the hazard pointer and before beginning to
read the bucket chain.

Yes, I can see that now. I do wonder if that's not going to prove quite
expensive... I think I can see some ways around needing that. But I
think that needs some benchmarking first - no need to build an even more
complex solution if not needed.

The solution I'm thinking of is to essentially move away from hazard
pointers and store something like a generation counter per
backend. Which is updated less often, and in combination with other
operations. When a backend needs to clean up and sees that there's a
straggler with an old generation it sends that backend a signal to ensure
it sets the latest generation.

It's possible that might work ... but on the timescale we're talking
about here, that's asking the garbage collecting process to wait for
practically geologic time.

Back when I first wrote this code, I spent a fair amount of time
looking at existing work in the area of lock-free hash tables.
Essentially all of the work I was able to find on this topic assumes a
threaded model - or more precisely, it assumes that shared memory
allocation is cheap and easy and you'll have no problem getting as
much of it as you need whenever you need it. This assumption often
isn't even spelled out explicitly: it's just assumed that that's the
programming environment you're working in. Finding highly parallel
algorithms that don't rely on memory allocation as a primitive is
hard. Hazard pointers are one of the few techniques I found that
seems like it can work in our architecture. I'm quite reluctant to
leave aside techniques that have been proven to work well in favor of
inventing something entirely novel to PostgreSQL.

That having been said, there is some literature on generation numbers,
and I think something like that could be made to work. It does have
some significant disadvantages, though. One is that a single process
which fails to update its generation number prevents all reclamation,
system-wide. In my algorithm, a process whose execution is
suspended while a hazard pointer is set prevents recycling of only one
of many garbage lists. A process searching for a reusable element can
most likely find some other garbage list to reclaim instead. Also,
a generation number implies that we update the value periodically,
rather than before and after each critical section. Anything that
forces garbage collection to be postponed longer than absolutely
necessary seems to me likely to be a loser. It might be a little
faster as long as we have free elements to allocate, but as soon as
we're out and have to wait longer than we otherwise would for garbage
collection, and all system activity halts as a result, even for a few
milliseconds, it's going to be a whole lot slower. Or at least, I
think.

That having been said, I don't know what to do about the fact that the
fence is too expensive. I don't know that we've really established
that that's the true root of the problem rather than some other
pedestrian optimization failure. But the existing code is using an
atomic operation to acquire a spinlock, then releasing it, walking the
bucket chain, and then using another atomic operation to acquire a
spinlock and then releasing it again. Surely a pure fence shouldn't
cost more than a spinlock cycle? Even with arguably one extra cache
line touch, that seems like it ought to be a win. But my intuitions
in this area are shaky.
Why not use an RCU mechanism [1] and ditch the hazard pointers? Seems
like an ideal fit...
In brief, RCU has the following requirements:
* Read-heavy access pattern
* Writers must be able to make dead objects unreachable to new readers
(easily done for most data structures)
* Writers must be able to mark dead objects in such a way that
existing readers know to ignore their contents but can still
traverse the data structure properly (usually straightforward)
* Readers must occasionally inform the system that they are not
currently using any RCU-protected pointers (to allow resource
reclamation)
[1]: http://www.rdrop.com/~paulmck/RCU/
If the above are satisfied, then:
* Readers need no synchronization at all (not even compiler fences)
* Writers can use the synchronization mechanism of their choice to
coordinate with each other (lwlock, atomic CAS, etc.)
* Object reclamation can be performed in the background, or
synchronously (and incrementally, if desired)
I've had very good results implementing user-level RCU for my own
database projects. It slashes the complexity of reasoning about
concurrent reader accesses. Implemented properly, it requires only a bit
of shared memory, has extremely low overhead in the common case of no
stragglers, and requires only minimal communication except when
stragglers are present (and even then mostly between stragglers). It's
reasonably simple to implement in a userspace DBMS context (< 1kLoC with
comments, assertions, etc.). Just don't forget to quiesce all in-flight
back ends at regular intervals, or the system won't be able to reclaim
anything...
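As a concreteness check, the reclamation rule described above can be sketched with per-backend epochs (all names hypothetical; real code needs the appropriate barriers around the epoch stores, and a straggler-handling path as discussed upthread):

```c
#include <stdint.h>

#define MAX_BACKENDS 4
#define QUIESCENT    0          /* local epoch 0: not inside a read section */

static uint64_t global_epoch = 1;
static uint64_t local_epoch[MAX_BACKENDS];  /* all QUIESCENT initially */

/* Readers: no synchronization beyond publishing the epoch they entered at. */
static void rcu_read_lock(int id)   { local_epoch[id] = global_epoch; }
static void rcu_read_unlock(int id) { local_epoch[id] = QUIESCENT; }

/* Writers: unlink the dead object so new readers can't reach it, remember
 * the epoch, and bump the global epoch. */
static uint64_t
rcu_retire(void)
{
    return global_epoch++;      /* epoch at which the object was unlinked */
}

/* The object unlinked at unlink_epoch may be freed once every backend is
 * either quiescent or entered its read section after the unlink.  A
 * backend stuck at an old epoch is the straggler case that blocks
 * reclamation until it quiesces. */
static int
rcu_safe_to_reclaim(uint64_t unlink_epoch)
{
    for (int i = 0; i < MAX_BACKENDS; i++)
        if (local_epoch[i] != QUIESCENT && local_epoch[i] <= unlink_epoch)
            return 0;
    return 1;
}
```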
Thoughts?
Ryan
On Thu, Oct 16, 2014 at 8:03 AM, Ryan Johnson
<ryan.johnson@cs.utoronto.ca> wrote:
Why not use an RCU mechanism [1] and ditch the hazard pointers? Seems like
an ideal fit...

In brief, RCU has the following requirements:
Read-heavy access pattern
Writers must be able to make dead objects unreachable to new readers (easily
done for most data structures)
Writers must be able to mark dead objects in such a way that existing
readers know to ignore their contents but can still traverse the data
structure properly (usually straightforward)
Readers must occasionally inform the system that they are not currently
using any RCU-protected pointers (to allow resource reclamation)
Have a look at http://lwn.net/Articles/573424/ and specifically the
"URCU overview" section. Basically, that last requirement - that
readers inform the system that they are not currently using any
RCU-protected pointers - turns out to require either memory barriers
or signals.
All of the many techniques that have been developed in this area are
merely minor variations on a very old theme: set some kind of flag
variable in shared memory to let people know that you are reading a
shared data structure, and clear it when you are done. Then, other
people can figure out when it's safe to recycle memory that was
previously part of that data structure. In Linux's RCU, the flag
variable is "whether the process is currently scheduled on a CPU",
which is obviously not workable from user-space. Lacking that, you
need an explicit flag variable, which means you need memory barriers,
since the protected operation is a load and the flag variable is
updated via a store. You can try to avoid some of the overhead by
updating the flag variable less often (say, when a signal arrives) or
you can make it more fine-grained (in my case, we only prevent reclaim
of a fraction of the data structure at a time, rather than all of it)
or various other variants, but none of this is unfortunately so simple
as "apply technique X and your problem just goes away".
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company