[RFC, POC] Don't require a NBuffer sized PrivateRefCount array of local buffer pins

Started by Andres Freund · about 12 years ago · 28 messages · pgsql-hackers
#1 Andres Freund
andres@anarazel.de

Hi,

I've been annoyed at the amount of memory used by the backend local
PrivateRefCount array for a couple of reasons:

a) The performance impact of AtEOXact_Buffers() on Assert() enabled
builds is really, really annoying.
b) On larger nodes, the L1/2/3 cache impact of randomly accessing a
several-megabyte array at high frequency is noticeable. I've
seen access to it be the primary (yes, really) source of
pipeline stalls.
c) On nodes with a lot of shared memory, the sum of the per-backend
arrays is a significant amount of memory that could very well be
used more beneficially.

So what I have done in the attached proof of concept is to have a small
(8 currently) array of (buffer, pincount) that's searched linearly when
the refcount of a buffer is needed. When more than 8 buffers are pinned
a hashtable is used to look up the values.
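The lookup path can be sketched as follows (the names and the toy open-addressing table below are invented for illustration and differ from the actual patch):

```c
#include <stdint.h>

typedef int32_t Buffer;                 /* buffers are numbered from 1 */
#define REFCOUNT_ARRAY_ENTRIES 8
#define OVERFLOW_SIZE 1024              /* toy overflow hash table */

typedef struct
{
    Buffer  buffer;                     /* 0 means the slot is free */
    int32_t refcount;
} PrivateRefCountEntry;

static PrivateRefCountEntry array[REFCOUNT_ARRAY_ENTRIES];
static PrivateRefCountEntry overflow[OVERFLOW_SIZE];

/* Find (or create) the local refcount entry for a buffer. */
static int32_t *get_refcount(Buffer buf)
{
    /* fast path: linear scan of the small array */
    for (int i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
        if (array[i].buffer == buf)
            return &array[i].refcount;

    /* take a free array slot if one exists */
    for (int i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
        if (array[i].buffer == 0)
        {
            array[i].buffer = buf;
            array[i].refcount = 0;
            return &array[i].refcount;
        }

    /* slow path: more than 8 buffers pinned, use the hash table */
    for (uint32_t h = (uint32_t) buf * 2654435761u;; h++)
    {
        PrivateRefCountEntry *e = &overflow[h % OVERFLOW_SIZE];

        if (e->buffer == buf)
            return &e->refcount;
        if (e->buffer == 0)
        {
            e->buffer = buf;
            e->refcount = 0;
            return &e->refcount;
        }
    }
}

void pin_buffer(Buffer buf)   { (*get_refcount(buf))++; }
void unpin_buffer(Buffer buf) { (*get_refcount(buf))--; }
```

(Slot reclamation on unpin is left out to keep the sketch short.)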

That seems to work fairly well. On the few tests I could run on my
laptop - I've done this during a flight - it's a small performance win
in all cases I could test. While saving a fair amount of memory.

Alternatively we could just get rid of the idea of tracking this per
backend, relying on tracking via resource managers...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Make-backend-local-tracking-of-buffer-pins-more-effi.patch (text/x-patch; charset=us-ascii, +290 −101)
#2 Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#1)
Re: [RFC, POC] Don't require a NBuffer sized PrivateRefCount array of local buffer pins

On 21 March 2014 14:22, Andres Freund <andres@2ndquadrant.com> wrote:

That seems to work fairly well. On the few tests I could run on my
laptop - I've done this during a flight - it's a small performance win
in all cases I could test. While saving a fair amount of memory.

We've got to the stage now that saving this much memory is essential,
so this patch is a must-have.

The patch does all I would expect and no more, so approach and details
look good to me.

Performance? Discussed many years ago, but I suspect the micro-tuning
of those earlier patches wasn't as good as it is here.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3 Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#2)
Re: [RFC, POC] Don't require a NBuffer sized PrivateRefCount array of local buffer pins

On 2014-04-09 05:34:42 -0400, Simon Riggs wrote:

On 21 March 2014 14:22, Andres Freund <andres@2ndquadrant.com> wrote:

That seems to work fairly well. On the few tests I could run on my
laptop - I've done this during a flight - it's a small performance win
in all cases I could test. While saving a fair amount of memory.

We've got to the stage now that saving this much memory is essential,
so this patch is a must-have.

I think some patch like this is necessary - I am not 100% sure mine is
the one true approach here, but it certainly seems simple enough.

Performance? Discussed many years ago, but I suspect the micro-tuning
of those earlier patches wasn't as good as it is here.

It's a small win on small machines (my laptop, 16GB), so we need to
retest with 128GB shared_buffers or such on bigger ones. There
PrivateRefCount previously was the source of a large portion of the
cache misses...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#4 Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#2)
Re: [RFC, POC] Don't require a NBuffer sized PrivateRefCount array of local buffer pins

On Wed, Apr 9, 2014 at 5:34 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

We've got to the stage now that saving this much memory is essential,
so this patch is a must-have.

The patch does all I would expect and no more, so approach and details
look good to me.

Performance? Discussed many years ago, but I suspect the micro-tuning
of those earlier patches wasn't as good as it is here.

I think this approach is practically a slam-dunk when the number of
pins is small (as it typically is). I'm less clear what happens when
we overflow from the small array into the hashtable. That certainly
seems like it could be a loss, but how do we construct such a case to
test it? A session with lots of suspended queries? Can we generate a
regression by starting a few suspended queries to use up the array
elements, and then running a scan that pins and unpins many buffers?

One idea is: if we fill up all the array elements and still need
another one, evict all the elements to the hash table and then start
refilling the array. The advantage of that over what's done here is
that the active scan will always be using an array slot rather than
repeated hash table manipulations. I guess you'd still have to probe
the hash table repeatedly, but you'd avoid entering and removing items
frequently.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#5 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#4)
Re: [RFC, POC] Don't require a NBuffer sized PrivateRefCount array of local buffer pins

On 2014-04-09 08:22:15 -0400, Robert Haas wrote:

On Wed, Apr 9, 2014 at 5:34 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

We've got to the stage now that saving this much memory is essential,
so this patch is a must-have.

The patch does all I would expect and no more, so approach and details
look good to me.

Performance? Discussed many years ago, but I suspect the micro-tuning
of those earlier patches wasn't as good as it is here.

I think this approach is practically a slam-dunk when the number of
pins is small (as it typically is). I'm less clear what happens when
we overflow from the small array into the hashtable. That certainly
seems like it could be a loss, but how do we construct such a case to
test it? A session with lots of suspended queries? Can we generate a
regression by starting a few suspended queries to use up the array
elements, and then running a scan that pins and unpins many buffers?

I've tried to reproduce problems around this (when I wrote this), but
it's really hard to construct cases that need more than 8 pins. I've
tested performance for those cases by simply not using the array, and
while the performance suffers a bit, it's not that bad.

One idea is: if we fill up all the array elements and still need
another one, evict all the elements to the hash table and then start
refilling the array. The advantage of that over what's done here is
that the active scan will always be using an array slot rather than
repeated hash table manipulations. I guess you'd still have to probe
the hash table repeatedly, but you'd avoid entering and removing items
frequently.

We could do that, but my gut feeling is that it's not necessary. There'd
have to be some heuristic to avoid doing that all the time, otherwise we'd
probably regress.
I think the fact that we pin/unpin very frequently will keep frequently
used buffers in the array most of the time.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#6 Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Andres Freund (#5)
Re: [RFC, POC] Don't require a NBuffer sized PrivateRefCount array of local buffer pins

On Wed, Apr 9, 2014 at 6:02 PM, Andres Freund <andres@2ndquadrant.com> wrote:

I've tried to reproduce problems around this (when I wrote this), but
it's really hard to construct cases that need more than 8 pins. I've
tested performance for those cases by simply not using the array, and
while the performance suffers a bit, it's not that bad.

AFAIR this was suggested before and got rejected because constructing that
worst case and proving that the approach does not perform too badly was a
challenge. Having said that, I agree it's time to avoid that memory
allocation, especially with a large number of backends running with large
shared buffers.

An orthogonal issue I noted is that we never check for overflow in the ref
count itself. While I understand overflowing int32 counter will take a
large number of pins on the same buffer, it can still happen in the worst
case, no ? Or is there a theoretical limit on the number of pins on the
same buffer by a single backend ?

Thanks,
Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee

#7 Andres Freund
andres@anarazel.de
In reply to: Pavan Deolasee (#6)
Re: [RFC, POC] Don't require a NBuffer sized PrivateRefCount array of local buffer pins

On 2014-04-09 18:13:29 +0530, Pavan Deolasee wrote:

On Wed, Apr 9, 2014 at 6:02 PM, Andres Freund <andres@2ndquadrant.com> wrote:

I've tried to reproduce problems around this (when I wrote this), but
it's really hard to construct cases that need more than 8 pins. I've
tested performance for those cases by simply not using the array, and
while the performance suffers a bit, it's not that bad.

AFAIR this was suggested before and got rejected because constructing that
worst case and proving that the approach does not perform too badly was a
challenge. Having said that, I agree it's time to avoid that memory
allocation, especially with a large number of backends running with large
shared buffers.

Well, I've tested the worst case by making *all* pins go through the
hash table. And it didn't regress too badly, although it *was* visible
in the profile.
I've searched the archive and to my knowledge nobody has actually sent a
patch implementing this sort of scheme for pins, although there's been
talk about various ways to solve this.

An orthogonal issue I noted is that we never check for overflow in the ref
count itself. While I understand overflowing int32 counter will take a
large number of pins on the same buffer, it can still happen in the worst
case, no ? Or is there a theoretical limit on the number of pins on the
same buffer by a single backend ?

I think we'll die much earlier, because the resource owner array keeping
track of buffer pins will be larger than 1GB.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#8 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#5)
Re: [RFC, POC] Don't require a NBuffer sized PrivateRefCount array of local buffer pins

On Wed, Apr 9, 2014 at 8:32 AM, Andres Freund <andres@2ndquadrant.com> wrote:

I've tried to reproduce problems around this (when I wrote this), but
it's really hard to construct cases that need more than 8 pins. I've
tested performance for those cases by simply not using the array, and
while the performance suffers a bit, it's not that bad.

Suspended queries won't do it?

Also, it would be good to quantify "not that bad". Actually, this
thread is completely lacking any actual benchmark results...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#9 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#8)
Re: [RFC, POC] Don't require a NBuffer sized PrivateRefCount array of local buffer pins

On 2014-04-09 09:17:59 -0400, Robert Haas wrote:

On Wed, Apr 9, 2014 at 8:32 AM, Andres Freund <andres@2ndquadrant.com> wrote:

I've tried to reproduce problems around this (when I wrote this), but
it's really hard to construct cases that need more than 8 pins. I've
tested performance for those cases by simply not using the array, and
while the performance suffers a bit, it's not that bad.

Suspended queries won't do it?

What exactly do you mean by "suspended" queries? Defined and started
portals? Recursive query execution?

Also, it would be good to quantify "not that bad".

The 'not bad' comes from my memory of the benchmarks I'd done after
about 12h of flying around ;).

Yes, it needs real benchmarks. Probably won't get to it the next few
days tho.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#10 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#9)
Re: [RFC, POC] Don't require a NBuffer sized PrivateRefCount array of local buffer pins

On Wed, Apr 9, 2014 at 9:38 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-04-09 09:17:59 -0400, Robert Haas wrote:

On Wed, Apr 9, 2014 at 8:32 AM, Andres Freund <andres@2ndquadrant.com> wrote:

I've tried to reproduce problems around this (when I wrote this), but
it's really hard to construct cases that need more than 8 pins. I've
tested performance for those cases by simply not using the array, and
while the performance suffers a bit, it's not that bad.

Suspended queries won't do it?

What exactly do you mean by "suspended" queries? Defined and started
portals? Recursive query execution?

Open a cursor and fetch from it; leave it open while doing other things.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#7)
Re: [RFC, POC] Don't require a NBuffer sized PrivateRefCount array of local buffer pins

Andres Freund <andres@2ndquadrant.com> writes:

On 2014-04-09 18:13:29 +0530, Pavan Deolasee wrote:

An orthogonal issue I noted is that we never check for overflow in the ref
count itself. While I understand overflowing int32 counter will take a
large number of pins on the same buffer, it can still happen in the worst
case, no ? Or is there a theoretical limit on the number of pins on the
same buffer by a single backend ?

I think we'll die much earlier, because the resource owner array keeping
track of buffer pins will be larger than 1GB.

The number of pins is bounded, more or less, by the number of scan nodes
in your query plan. You'll have run out of memory trying to plan the
query, assuming you live that long.

The resource managers are interesting to bring up in this context.
That mechanism didn't exist when PrivateRefCount was invented.
Is there a way we could lay off the work onto the resource managers?
(I don't see one right at the moment, but I'm under-caffeinated still.)

regards, tom lane


#12 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#11)
Re: [RFC, POC] Don't require a NBuffer sized PrivateRefCount array of local buffer pins

On 2014-04-09 10:09:44 -0400, Tom Lane wrote:

The resource managers are interesting to bring up in this context.
That mechanism didn't exist when PrivateRefCount was invented.
Is there a way we could lay off the work onto the resource managers?
(I don't see one right at the moment, but I'm under-caffeinated still.)

Yea, that's something I've also considered, but I couldn't come up with
a way to do it that's both performant and not overly complicated.
There are some nasty issues with pins held by different ResourceOwners and
such, so even if we could provide sensible random access to check for
existing pins, it wouldn't be a simple thing.

It's not unreasonable to argue that we just shouldn't optimize for
several pins held by the same backend on the same buffer and always touch
the global count. Thanks to resource managers, the old reason for
PrivateRefCount, which was the need to be able to clean up remaining pins
in case of error, doesn't exist anymore.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#12)
Re: [RFC, POC] Don't require a NBuffer sized PrivateRefCount array of local buffer pins

Andres Freund <andres@2ndquadrant.com> writes:

It's not unreasonable to argue that we just shouldn't optimize for
several pins held by the same backend on the same buffer and always touch
the global count.

NAK. That would be a killer because of increased contention for buffer
headers. The code is full of places where a buffer's PrivateRefCount
jumps up and down a bit, for example when transferring a tuple into a
TupleTableSlot. (I said upthread that the number of pins is bounded by
the number of scan nodes, but actually it's probably some small multiple
of that --- eg a seqscan would hold its own pin on the current buffer,
and there'd be a slot or two holding the current tuple, each with its
own pin count.)

regards, tom lane


#14 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#13)
Re: [RFC, POC] Don't require a NBuffer sized PrivateRefCount array of local buffer pins

On 2014-04-09 10:26:25 -0400, Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:

It's not unreasonable to argue that we just shouldn't optimize for
several pins held by the same backend on the same buffer and always touch
the global count.

NAK.

Note I didn't implement it because I wasn't too convinced either ;)

That would be a killer because of increased contention for buffer
headers. The code is full of places where a buffer's PrivateRefCount
jumps up and down a bit, for example when transferring a tuple into a
TupleTableSlot.

On the other hand in those scenarios the backend is pretty likely to
already have the cacheline locally in exclusive mode...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#15 Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#11)
Re: [RFC, POC] Don't require a NBuffer sized PrivateRefCount array of local buffer pins

On 9 April 2014 15:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andres Freund <andres@2ndquadrant.com> writes:

On 2014-04-09 18:13:29 +0530, Pavan Deolasee wrote:

An orthogonal issue I noted is that we never check for overflow in the ref
count itself. While I understand overflowing int32 counter will take a
large number of pins on the same buffer, it can still happen in the worst
case, no ? Or is there a theoretical limit on the number of pins on the
same buffer by a single backend ?

I think we'll die much earlier, because the resource owner array keeping
track of buffer pins will be larger than 1GB.

The number of pins is bounded, more or less, by the number of scan nodes
in your query plan. You'll have run out of memory trying to plan the
query, assuming you live that long.

ISTM that there is a strong possibility that the last buffer pinned
will be the next buffer to be unpinned. We can use that to optimise
this.

If we store the last 8 buffers pinned in the fast array then we will
be very likely to hit the right buffer just by scanning the array.

So if we treat the fast array as a circular LRU, we get
* pinning a new buffer when array has an empty slot is O(1)
* pinning a new buffer when array is full causes us to move the LRU
into the hash table and then use that element
* unpinning a buffer will most often be O(1), which then leaves an
empty slot for next pin

Doing it that way means all usage is O(1) apart from when we use >8
pins concurrently and that usage does not follow the regular pattern.

The resource managers are interesting to bring up in this context.
That mechanism didn't exist when PrivateRefCount was invented.
Is there a way we could lay off the work onto the resource managers?
(I don't see one right at the moment, but I'm under-caffeinated still.)

Me neither. Good idea, but I think it would take a lot of refactoring
to do that.

We need to do something about this. We have complaints (via Heikki)
that we are using too much memory in idle backends and small configs,
plus we know we are using too much memory in larger servers. Reducing
the memory usage here will reduce CPU L2 cache churn as well as
increase available RAM.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#16 Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#15)
Re: [RFC, POC] Don't require a NBuffer sized PrivateRefCount array of local buffer pins

On 2014-06-22 12:38:04 +0100, Simon Riggs wrote:

On 9 April 2014 15:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andres Freund <andres@2ndquadrant.com> writes:

On 2014-04-09 18:13:29 +0530, Pavan Deolasee wrote:

An orthogonal issue I noted is that we never check for overflow in the ref
count itself. While I understand overflowing int32 counter will take a
large number of pins on the same buffer, it can still happen in the worst
case, no ? Or is there a theoretical limit on the number of pins on the
same buffer by a single backend ?

I think we'll die much earlier, because the resource owner array keeping
track of buffer pins will be larger than 1GB.

The number of pins is bounded, more or less, by the number of scan nodes
in your query plan. You'll have run out of memory trying to plan the
query, assuming you live that long.

ISTM that there is a strong possibility that the last buffer pinned
will be the next buffer to be unpinned. We can use that to optimise
this.

If we store the last 8 buffers pinned in the fast array then we will
be very likely to hit the right buffer just by scanning the array.

So if we treat the fast array as a circular LRU, we get
* pinning a new buffer when array has an empty slot is O(1)
* pinning a new buffer when array is full causes us to move the LRU
into the hash table and then use that element
* unpinning a buffer will most often be O(1), which then leaves an
empty slot for next pin

Doing it that way means all usage is O(1) apart from when we use >8
pins concurrently and that usage does not follow the regular pattern.

Even that case is O(1) in the average case since insertion into a
hashtable is O(1) on average...

I've started working on a patch that pretty much works like that. It
doesn't move things around in the array, because that seemed to perform
badly. That seems to make sense, because it'd require moving entries in
the relatively common case of two pages being pinned.
Instead it picks one array entry (chosen by [someint++ % NUM_ENTRIES]),
moves it to the hashtable, and puts the new item in the now-free slot. The
same happens if a lookup hits an entry in the hashtable: one entry is
moved from the array into the hashtable and the entry from the hashtable
takes the freed slot.
That seems to work nicely, but needs some cleanup. And benchmarks.

We need to do something about this. We have complaints (via Heikki)
that we are using too much memory in idle backends and small configs,
plus we know we are using too much memory in larger servers. Reducing
the memory usage here will reduce CPU L2 cache churn as well as
increase available RAM.

Yea, the buffer pin array currently is one of the biggest sources of
cache misses... In contrast to things like the buffer descriptors it's
not even shared between concurrent processes, so it's more wasteful,
even if small.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#17 Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#16)
Re: [RFC, POC] Don't require a NBuffer sized PrivateRefCount array of local buffer pins

On 22 June 2014 16:09, Andres Freund <andres@2ndquadrant.com> wrote:

So if we treat the fast array as a circular LRU, we get
* pinning a new buffer when array has an empty slot is O(1)
* pinning a new buffer when array is full causes us to move the LRU
into the hash table and then use that element
* unpinning a buffer will most often be O(1), which then leaves an
empty slot for next pin

Doing it that way means all usage is O(1) apart from when we use >8
pins concurrently and that usage does not follow the regular pattern.

Even that case is O(1) in the average case since insertion into a
hashtable is O(1) on average...

I've started working on a patch that pretty much works like that. It
doesn't move things around in the array, because that seemed to perform
badly. That seems to make sense, because it'd require moving entries in
the relatively common case of two pages being pinned.
Instead it picks one array entry (chosen by [someint++ % NUM_ENTRIES]),
moves it to the hashtable, and puts the new item in the now-free slot. The
same happens if a lookup hits an entry in the hashtable: one entry is
moved from the array into the hashtable and the entry from the hashtable
takes the freed slot.

Yes, that's roughly how the SLRU code works also, so sounds good.

That seems to work nicely, but needs some cleanup. And benchmarks.

ISTM that microbenchmarks won't reveal the beneficial L2 and RAM
effects of the patch, so I suggest we just need to do a pgbench, a
2-way nested join and a 10-way nested join with an objective of no
significant difference or better. The RAM and L2 effects are enough to
justify this, since it will help with both very small and very large
configs.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#18 Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#17)
Re: [RFC, POC] Don't require a NBuffer sized PrivateRefCount array of local buffer pins

On 2014-06-22 19:31:34 +0100, Simon Riggs wrote:

Yes, that's roughly how the SLRU code works also, so sounds good.

Heh. I rather see that as an argument for it sounding bad :)

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#19 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
Re: [RFC, POC] Don't require a NBuffer sized PrivateRefCount array of local buffer pins

Hi,

On 2014-03-21 19:22:31 +0100, Andres Freund wrote:

Hi,

I've been annoyed at the amount of memory used by the backend local
PrivateRefCount array for a couple of reasons:

a) The performance impact of AtEOXact_Buffers() on Assert() enabled
builds is really, really annoying.
b) On larger nodes, the L1/2/3 cache impact of randomly accessing a
several-megabyte array at high frequency is noticeable. I've
seen access to it be the primary (yes, really) source of
pipeline stalls.
c) On nodes with a lot of shared memory, the sum of the per-backend
arrays is a significant amount of memory that could very well be
used more beneficially.

So what I have done in the attached proof of concept is to have a small
(8 currently) array of (buffer, pincount) that's searched linearly when
the refcount of a buffer is needed. When more than 8 buffers are pinned
a hashtable is used to look up the values.

That seems to work fairly well. On the few tests I could run on my
laptop - I've done this during a flight - it's a small performance win
in all cases I could test. While saving a fair amount of memory.

Here's the next version of this patch. The major change is that newly
pinned/looked-up buffers always go into the array, even when we're
already spilling into the hash table. To get a free slot, a preexisting
entry (chosen via PrivateRefCountArray[PrivateRefCountClock++ %
REFCOUNT_ARRAY_ENTRIES]) is displaced into the hash table. That way the
concern that frequently used buffers get 'stuck' in the hashtable while
infrequently used ones sit in the array is ameliorated.
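The displacement scheme can be sketched like this (PrivateRefCountClock and REFCOUNT_ARRAY_ENTRIES are named in the mail; everything else, including the toy hash table standing in for the real one, is invented, and the swap-back of a hashtable hit into the array is omitted for brevity):

```c
#include <stdint.h>

typedef int32_t Buffer;
#define InvalidBuffer 0
#define REFCOUNT_ARRAY_ENTRIES 8
#define HASH_SIZE 1024

typedef struct
{
    Buffer  buffer;
    int32_t refcount;
} PrivateRefCountEntry;

static PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES];
static PrivateRefCountEntry hashtab[HASH_SIZE]; /* toy stand-in for the real table */
static uint32_t PrivateRefCountClock = 0;

/* lookup only: NULL if the buffer has not spilled to the hash table */
static PrivateRefCountEntry *hash_find(Buffer buf)
{
    for (uint32_t h = (uint32_t) buf * 2654435761u;; h++)
    {
        PrivateRefCountEntry *e = &hashtab[h % HASH_SIZE];

        if (e->buffer == buf)
            return e;
        if (e->buffer == InvalidBuffer)
            return 0;
    }
}

/* find-or-create in the toy hash table */
static PrivateRefCountEntry *hash_enter(Buffer buf)
{
    for (uint32_t h = (uint32_t) buf * 2654435761u;; h++)
    {
        PrivateRefCountEntry *e = &hashtab[h % HASH_SIZE];

        if (e->buffer == buf || e->buffer == InvalidBuffer)
        {
            e->buffer = buf;
            return e;
        }
    }
}

/* a newly pinned buffer always gets an array slot; when the array is
 * full, the clock-chosen victim is displaced into the hash table */
static PrivateRefCountEntry *reserve_array_entry(void)
{
    for (int i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
        if (PrivateRefCountArray[i].buffer == InvalidBuffer)
            return &PrivateRefCountArray[i];

    PrivateRefCountEntry *victim =
        &PrivateRefCountArray[PrivateRefCountClock++ % REFCOUNT_ARRAY_ENTRIES];

    hash_enter(victim->buffer)->refcount = victim->refcount;
    victim->buffer = InvalidBuffer;
    return victim;
}

void pin_buffer(Buffer buf)
{
    for (int i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
        if (PrivateRefCountArray[i].buffer == buf)
        {
            PrivateRefCountArray[i].refcount++;
            return;
        }

    PrivateRefCountEntry *e = hash_find(buf);

    if (e == 0)
    {
        /* new pin: always gets an array slot */
        e = reserve_array_entry();
        e->buffer = buf;
        e->refcount = 0;
    }
    e->refcount++;
}
```

For example, pinning a ninth distinct buffer displaces the clock-chosen victim into the hash table while the new pin takes the freed array slot.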

The biggest concern previously were some benchmarks. I'm not entirely
sure where to get a good testcase for this that's not completely
artificial - most simpler testcases don't pin many buffers. I've played
a bit around and it's a slight performance win in pgbench read only and
mixed workloads, but not enough to get excited about alone.

When asserts are enabled, the story is different. The admittedly extreme
case of readonly pgbench scale 350, with 6GB shared_buffers and 128
clients goes from 3204.489825 to 39277.077448 TPS. So a) above is
definitely improved :)

The memory savings are clearly visible. During a pgbench scale 350, -cj
128 read-only run, the following shell snippet
for pid in $(pgrep -U andres postgres); do
grep VmData /proc/$pid/status;
done | \
awk 'BEGIN { sum = 0 } {sum += $2;} END { if (NR > 0) print sum/NR; else print 0;print sum;print NR}'

shows:

before:
AVG: 4626.06
TOT: 619892
NR: 134

after:
AVG: 1610.37
TOT: 217400
NR: 135

So, the patch is succeeding on c).

On its own, in pgbench scale 350 -cj 128 -S -T10 the numbers are:
before:
166171.039778, 165488.531353, 165045.182215, 161492.094693 (excluding connections establishing)
after
175812.388869, 171600.928377, 168317.370893, 169860.008865 (excluding connections establishing)

so, a bit of a performance win.

-j 16, -c 16 -S -T10:
before:
159757.637878 161287.658276 164003.676018 160687.951017 162941.627683
after:
160628.774342 163981.064787 151239.151102 164763.851903 165219.220209

I'm too tired to do continue with write tests now, but I don't see a
reason why they should be more meaningful... We really need a test with
more complex queries I'm afraid.

Anyway, I think at this stage this needs somebody to closely look at the
code. I don't think there's going to be any really surprising
performance revelations here.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Make-backend-local-tracking-of-buffer-pins-memory-ef.patch (text/x-patch; charset=us-ascii, +370 −79)
#20 Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Andres Freund (#19)
Re: [RFC, POC] Don't require a NBuffer sized PrivateRefCount array of local buffer pins

On 8/26/14, 6:52 PM, Andres Freund wrote:

On 2014-03-21 19:22:31 +0100, Andres Freund wrote:

Hi,

I've been annoyed at the amount of memory used by the backend local
PrivateRefCount array for a couple of reasons:

a) The performance impact of AtEOXact_Buffers() on Assert() enabled
builds is really, really annoying.
b) On larger nodes, the L1/2/3 cache impact of randomly accessing a
several-megabyte array at high frequency is noticeable. I've
seen access to it be the primary (yes, really) source of
pipeline stalls.
c) On nodes with a lot of shared memory, the sum of the per-backend
arrays is a significant amount of memory that could very well be
used more beneficially.

So what I have done in the attached proof of concept is to have a small
(8 currently) array of (buffer, pincount) that's searched linearly when
the refcount of a buffer is needed. When more than 8 buffers are pinned
a hashtable is used to look up the values.

That seems to work fairly well. On the few tests I could run on my
laptop - I've done this during a flight - it's a small performance win
in all cases I could test. While saving a fair amount of memory.

Here's the next version of this patch. The major change is that newly

<snip>

The memory savings are clearly visible. During a pgbench scale 350, -cj
128 read-only run, the following shell snippet
for pid in $(pgrep -U andres postgres); do
grep VmData /proc/$pid/status;
done | \
awk 'BEGIN { sum = 0 } {sum += $2;} END { if (NR > 0) print sum/NR; else print 0;print sum;print NR}'

shows:

before:
AVG: 4626.06
TOT: 619892
NR: 134

after:
AVG: 1610.37
TOT: 217400
NR: 135

These results look very encouraging, especially thinking about the cache impact. It occurs to me that it'd also be nice to have some stats available on how this is performing; perhaps a dtrace probe for whenever we overflow to the hash table, and one that shows maximum usage for a statement? (Presumably that's not much extra code or overhead...)
--
Jim C. Nasby, Data Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net


#21 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#19)
#22 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#19)
#23 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#22)
#24 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#21)
#25 Andres Freund
andres@anarazel.de
In reply to: Jim Nasby (#20)
#26 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#23)
#27 Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Andres Freund (#25)
#28 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#21)