profiling connection overhead
Per previous threats, I spent some time tonight running oprofile
(using the directions Tom Lane was foolish enough to provide me back
in May). I took testlibpq.c and hacked it up to just connect to the
server and then disconnect in a tight loop without doing anything
useful, hoping to measure the overhead of starting up a new
connection. Ha, ha, funny about that:
120899 18.0616 postgres AtProcExit_Buffers
56891 8.4992 libc-2.11.2.so memset
30987 4.6293 libc-2.11.2.so memcpy
26944 4.0253 postgres hash_search_with_hash_value
26554 3.9670 postgres AllocSetAlloc
20407 3.0487 libc-2.11.2.so _int_malloc
17269 2.5799 libc-2.11.2.so fread
13005 1.9429 ld-2.11.2.so do_lookup_x
11850 1.7703 ld-2.11.2.so _dl_fixup
10194 1.5229 libc-2.11.2.so _IO_file_xsgetn
In English: the #1 overhead here is actually something that happens
when processes EXIT, not when they start. Essentially all the time is
in two lines:
56920 6.6006 : for (i = 0; i < NBuffers; i++)
: {
98745 11.4507 : if (PrivateRefCount[i] != 0)
Non-default configs:
max_connections = 100
shared_buffers = 480MB
work_mem = 4MB
maintenance_work_mem = 128MB
checkpoint_segments = 30
random_page_cost = 2.0
Anything we can do about this? That's a lot of overhead, and it'd be
a lot worse on a big machine with 8GB of shared_buffers.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Nov 24, 2010 at 12:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Per previous threats, I spent some time tonight running oprofile
(using the directions Tom Lane was foolish enough to provide me back
in May). I took testlibpq.c and hacked it up to just connect to the
server and then disconnect in a tight loop without doing anything
useful, hoping to measure the overhead of starting up a new
connection.
Oh, right: attachments.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 24.11.2010 07:07, Robert Haas wrote:
Per previous threats, I spent some time tonight running oprofile
(using the directions Tom Lane was foolish enough to provide me back
in May). I took testlibpq.c and hacked it up to just connect to the
server and then disconnect in a tight loop without doing anything
useful, hoping to measure the overhead of starting up a new
connection. Ha, ha, funny about that:
120899 18.0616 postgres AtProcExit_Buffers
56891 8.4992 libc-2.11.2.so memset
30987 4.6293 libc-2.11.2.so memcpy
26944 4.0253 postgres hash_search_with_hash_value
26554 3.9670 postgres AllocSetAlloc
20407 3.0487 libc-2.11.2.so _int_malloc
17269 2.5799 libc-2.11.2.so fread
13005 1.9429 ld-2.11.2.so do_lookup_x
11850 1.7703 ld-2.11.2.so _dl_fixup
10194 1.5229 libc-2.11.2.so _IO_file_xsgetn
In English: the #1 overhead here is actually something that happens
when processes EXIT, not when they start. Essentially all the time is
in two lines:
56920 6.6006 : for (i = 0; i < NBuffers; i++)
: {
98745 11.4507 : if (PrivateRefCount[i] != 0)
Oh, that's quite surprising.
Anything we can do about this? That's a lot of overhead, and it'd be
a lot worse on a big machine with 8GB of shared_buffers.
Micro-optimizing that search for the non-zero value helps a little bit
(attached). Reduces the percentage shown by oprofile from about 16% to
12% on my laptop.
For bigger gains, I think you need to somehow make the PrivateRefCount
smaller. Perhaps only use one byte for each buffer instead of int32, and
use some sort of an overflow list for the rare case that a buffer is
pinned more than 255 times. Or make it a hash table instead of a simple
lookup array. But whatever you do, you have to be very careful to not
add overhead to PinBuffer/UnPinBuffer, those can already be quite high
in oprofile reports of real applications. It might be worth
experimenting a bit, at the moment PrivateRefCount takes up 5MB of
memory per 1GB of shared_buffers. Machines with a high shared_buffers
setting have no shortage of memory, but a large array like that might
waste a lot of precious CPU cache.
Now, the other question is whether this really matters. Even if we eliminate
that loop in AtProcExit_Buffers altogether, will connect/disconnect still
be so slow that you have to use a connection pooler if you do that a lot?
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Attachments:
optimize-AtProcExit_Buffers-1.patch (text/x-diff, +10/-1)
On Wed, Nov 24, 2010 at 2:10 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Anything we can do about this? That's a lot of overhead, and it'd be
a lot worse on a big machine with 8GB of shared_buffers.
Micro-optimizing that search for the non-zero value helps a little bit
(attached). Reduces the percentage shown by oprofile from about 16% to 12%
on my laptop.
For bigger gains,
The first optimization that occurred to me was "remove the loop
altogether". I could maybe see needing to do something like this if
we're recovering from an error, but why do we need to do this (except
perhaps to fail an assertion) if we're exiting cleanly? Even a
session-lifetime buffer-pin leak would be quite disastrous, one would
think.
Now, the other question is whether this really matters. Even if we eliminate
that loop in AtProcExit_Buffers altogether, will connect/disconnect still be
so slow that you have to use a connection pooler if you do that a lot?
Oh, I'm sure this isn't going to be nearly enough to fix that problem,
but every little bit helps; and if we never do the first optimization,
we'll never get to #30 or wherever it is that it really starts to move
the needle.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, Nov 24, 2010 at 2:10 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Micro-optimizing that search for the non-zero value helps a little bit
(attached). Reduces the percentage shown by oprofile from about 16% to 12%
on my laptop.
That "micro-optimization" looks to me like your compiler leaves
something to be desired.
The first optimization that occurred to me was "remove the loop
altogether".
Or make it execute only in assert-enabled mode, perhaps.
This check had some use back in the bad old days, but the ResourceOwner
mechanism has probably removed a lot of the argument for it.
The counter-argument might be that failing to remove a buffer pin would
be disastrous; but I can't see that it'd be worse than failing to remove
an LWLock, and we have no belt-and-suspenders-too loop for those.
regards, tom lane
On Wed, Nov 24, 2010 at 10:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
The first optimization that occurred to me was "remove the loop
altogether".Or make it execute only in assert-enabled mode, perhaps.
This check had some use back in the bad old days, but the ResourceOwner
mechanism has probably removed a lot of the argument for it.
Yeah, that's what I was thinking - this code would have been a good
backstop when our cleanup mechanisms were not as robust as they seem
to be today. But making the check execute only in assert-enabled mode
doesn't seem right, since the check actually acts to mask other coding
errors, rather than reveal them. Maybe we replace the check with one
that only occurs in an Assert-enabled build and just loops through and
does Assert(PrivateRefCount[i] == 0). I'm not sure exactly where this
gets called in the shutdown sequence, though - is it sensible to
Assert() here?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, Nov 24, 2010 at 10:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Or make it execute only in assert-enabled mode, perhaps.
But making the check execute only in assert-enabled mode
doesn't seem right, since the check actually acts to mask other coding
errors, rather than reveal them. Maybe we replace the check with one
that only occurs in an Assert-enabled build and just loops through and
does Assert(PrivateRefCount[i] == 0).
Yeah, that would be sensible. There is precedent for this elsewhere
too; I think there's a similar setup for checking buffer refcounts
during transaction cleanup.
I'm not sure exactly where this
gets called in the shutdown sequence, though - is it sensible to
Assert() here?
Assert is sensible anywhere.
regards, tom lane
On Wed, Nov 24, 2010 at 11:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, Nov 24, 2010 at 10:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Or make it execute only in assert-enabled mode, perhaps.
But making the check execute only in assert-enabled mode
doesn't seem right, since the check actually acts to mask other coding
errors, rather than reveal them. Maybe we replace the check with one
that only occurs in an Assert-enabled build and just loops through and
does Assert(PrivateRefCount[i] == 0).
Yeah, that would be sensible. There is precedent for this elsewhere
too; I think there's a similar setup for checking buffer refcounts
during transaction cleanup.
I'm not sure exactly where this
gets called in the shutdown sequence, though - is it sensible to
Assert() here?
Assert is sensible anywhere.
OK, patch attached. Here's what oprofile output looks like with this applied:
3505 10.4396 libc-2.11.2.so memset
2051 6.1089 libc-2.11.2.so memcpy
1686 5.0217 postgres AllocSetAlloc
1642 4.8907 postgres hash_search_with_hash_value
1247 3.7142 libc-2.11.2.so _int_malloc
1096 3.2644 libc-2.11.2.so fread
855 2.5466 ld-2.11.2.so do_lookup_x
723 2.1535 ld-2.11.2.so _dl_fixup
645 1.9211 ld-2.11.2.so strcmp
620 1.8467 postgres MemoryContextAllocZero
Somehow I don't think I'm going to get much further with this without
figuring out how to get oprofile to cough up a call graph.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
AtProcExit_Buffers.patch (application/octet-stream, +16/-17)
Robert Haas <robertmhaas@gmail.com> writes:
OK, patch attached.
Two comments:
1. A comment would help, something like "Assert we released all buffer pins".
2. AtProcExit_LocalBuffers should be redone the same way, for
consistency (it likely won't make any performance difference).
Note the comment for AtProcExit_LocalBuffers, too; that probably
needs to be changed along the lines of "If we missed any, and
assertions aren't enabled, we'll fail later in DropRelFileNodeBuffers
while trying to drop the temp rels".
regards, tom lane
On Wed, Nov 24, 2010 at 1:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
OK, patch attached.
Two comments:
Revised patch attached.
I tried configuring oprofile with --callgraph=10 and then running
oprofile with -c, but it gives kooky looking output I can't interpret.
For example:
6 42.8571 postgres record_in
8 57.1429 postgres pg_perm_setlocale
17035 5.7219 libc-2.11.2.so memcpy
17035 100.000 libc-2.11.2.so memcpy [self]
Not that helpful. :-(
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
AtProcExit_Buffers-v2.patch (application/octet-stream, +24/-16)
On Wed, Nov 24, 2010 at 01:20:36PM -0500, Robert Haas wrote:
On Wed, Nov 24, 2010 at 1:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
OK, patch attached.
Two comments:
Revised patch attached.
I tried configuring oprofile with --callgraph=10 and then running
oprofile with -c, but it gives kooky looking output I can't interpret.
For example:
6 42.8571 postgres record_in
8 57.1429 postgres pg_perm_setlocale
17035 5.7219 libc-2.11.2.so memcpy
17035 100.000 libc-2.11.2.so memcpy [self]
Not that helpful. :-(
Have a look at the wiki:
http://wiki.postgresql.org/wiki/Profiling_with_OProfile#Additional_analysis
Robert Haas
Regards,
Gerhard Heift
On Wed, Nov 24, 2010 at 1:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I tried configuring oprofile with --callgraph=10 and then running
oprofile with -c, but it gives kooky looking output I can't interpret.
It looks like the trick is to compile with -fno-omit-frame-pointer.
New profiling run:
27563 10.3470 libc-2.11.2.so memset
15162 5.6917 libc-2.11.2.so memcpy
13471 5.0569 postgres hash_search_with_hash_value
13465 5.0547 postgres AllocSetAlloc
9513 3.5711 libc-2.11.2.so _int_malloc
8729 3.2768 libc-2.11.2.so fread
6336 2.3785 ld-2.11.2.so do_lookup_x
5788 2.1728 ld-2.11.2.so _dl_fixup
4995 1.8751 postgres MemoryContextAllocZero
4978 1.8687 ld-2.11.2.so strcmp
Full results, and call graph, attached. The first obvious fact is
that most of the memset overhead appears to be coming from
InitCatCache.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wednesday 24 November 2010 19:01:32 Robert Haas wrote:
Somehow I don't think I'm going to get much further with this without
figuring out how to get oprofile to cough up a call graph.
I think to do that sensibly you need CFLAGS="-O2 -fno-omit-frame-pointer"...
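Putting the pieces of this subthread together, the workflow being described looks roughly like the following (a sketch only; the opcontrol/opreport flags are the ones mentioned in this thread, and paths are assumptions):

```shell
# Rebuild with frame pointers kept so oprofile can walk the stack.
CFLAGS="-O2 -fno-omit-frame-pointer" ./configure
make && make install

# Profile with call-graph collection enabled.
opcontrol --callgraph=10 --start
# ... run the connect/disconnect loop against the server here ...
opcontrol --shutdown

# -c asks opreport for the call graph rather than a flat profile.
opreport -c
```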
Gerhard Heift <ml-postgresql-20081012-3518@gheift.de> writes:
On Wed, Nov 24, 2010 at 01:20:36PM -0500, Robert Haas wrote:
I tried configuring oprofile with --callgraph=10 and then running
oprofile with -c, but it gives kooky looking output I can't interpret.
Have a look at the wiki:
http://wiki.postgresql.org/wiki/Profiling_with_OProfile#Additional_analysis
The critical piece of information is there now, but it wasn't a minute
ago.
regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
Revised patch attached.
The asserts in AtProcExit_LocalBuffers are a bit pointless since
you forgot to remove the code that forcibly zeroes LocalRefCount[]...
otherwise +1.
regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
Full results, and call graph, attached. The first obvious fact is
that most of the memset overhead appears to be coming from
InitCatCache.
AFAICT that must be the palloc0 calls that are zeroing out (mostly)
the hash bucket headers. I don't see any real way to make that cheaper
other than to cut the initial sizes of the hash tables (and add support
for expanding them later, which is lacking in catcache ATM). Not
convinced that that creates any net savings --- it might just save
some cycles at startup in exchange for more cycles later, in typical
backend usage.
Making those hashtables expansible wouldn't be a bad thing in itself,
mind you.
regards, tom lane
On Wed, Nov 24, 2010 at 3:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Full results, and call graph, attached. The first obvious fact is
that most of the memset overhead appears to be coming from
InitCatCache.
AFAICT that must be the palloc0 calls that are zeroing out (mostly)
the hash bucket headers. I don't see any real way to make that cheaper
other than to cut the initial sizes of the hash tables (and add support
for expanding them later, which is lacking in catcache ATM). Not
convinced that that creates any net savings --- it might just save
some cycles at startup in exchange for more cycles later, in typical
backend usage.
Making those hashtables expansible wouldn't be a bad thing in itself,
mind you.
The idea I had was to go the other way and say, hey, if these hash
tables can't be expanded anyway, let's put them on the BSS instead of
heap-allocating them. Any new pages we request from the OS will be
zeroed anyway, but with palloc we then have to re-zero the allocated
block anyway because palloc can return memory that's been used,
freed, and reused. However, for anything that only needs to be
allocated once and never freed, and whose size can be known at compile
time, that's not an issue.
In fact, it wouldn't be that hard to relax the "known at compile time"
constraint either. We could just declare:
char lotsa_zero_bytes[NUM_ZERO_BYTES_WE_NEED];
...and then peel off chunks.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wednesday 24 November 2010 21:47:32 Robert Haas wrote:
On Wed, Nov 24, 2010 at 3:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Full results, and call graph, attached. The first obvious fact is
that most of the memset overhead appears to be coming from
InitCatCache.
AFAICT that must be the palloc0 calls that are zeroing out (mostly)
the hash bucket headers. I don't see any real way to make that cheaper
other than to cut the initial sizes of the hash tables (and add support
for expanding them later, which is lacking in catcache ATM). Not
convinced that that creates any net savings --- it might just save
some cycles at startup in exchange for more cycles later, in typical
backend usage.
Making those hashtables expansible wouldn't be a bad thing in itself,
mind you.
The idea I had was to go the other way and say, hey, if these hash
tables can't be expanded anyway, let's put them on the BSS instead of
heap-allocating them. Any new pages we request from the OS will be
zeroed anyway, but with palloc we then have to re-zero the allocated
block anyway because palloc can return memory that's been used,
freed, and reused. However, for anything that only needs to be
allocated once and never freed, and whose size can be known at compile
time, that's not an issue.In fact, it wouldn't be that hard to relax the "known at compile time"
constraint either. We could just declare:char lotsa_zero_bytes[NUM_ZERO_BYTES_WE_NEED];
...and then peel off chunks.
Won't this just cause loads of additional pagefaults after fork() when those
pages are used the first time and then a second time when first written to (to
copy it)?
Andres
On Wed, Nov 24, 2010 at 3:53 PM, Andres Freund <andres@anarazel.de> wrote:
On Wednesday 24 November 2010 21:47:32 Robert Haas wrote:
On Wed, Nov 24, 2010 at 3:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Full results, and call graph, attached. The first obvious fact is
that most of the memset overhead appears to be coming from
InitCatCache.AFAICT that must be the palloc0 calls that are zeroing out (mostly)
the hash bucket headers. I don't see any real way to make that cheaper
other than to cut the initial sizes of the hash tables (and add support
for expanding them later, which is lacking in catcache ATM). Not
convinced that that creates any net savings --- it might just save
some cycles at startup in exchange for more cycles later, in typical
backend usage.
Making those hashtables expansible wouldn't be a bad thing in itself,
mind you.
The idea I had was to go the other way and say, hey, if these hash
tables can't be expanded anyway, let's put them on the BSS instead of
heap-allocating them. Any new pages we request from the OS will be
zeroed anyway, but with palloc we then have to re-zero the allocated
block anyway because palloc can return memory that's been used,
freed, and reused. However, for anything that only needs to be
allocated once and never freed, and whose size can be known at compile
time, that's not an issue.In fact, it wouldn't be that hard to relax the "known at compile time"
constraint either. We could just declare:char lotsa_zero_bytes[NUM_ZERO_BYTES_WE_NEED];
...and then peel off chunks.
Won't this just cause loads of additional pagefaults after fork() when those
pages are used the first time and then a second time when first written to (to
copy it)?
Aren't we incurring those page faults anyway, for whatever memory
palloc is handing out? The heap is no different from bss; we just
move the pointer with sbrk().
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, Nov 24, 2010 at 3:53 PM, Andres Freund <andres@anarazel.de> wrote:
The idea I had was to go the other way and say, hey, if these hash
tables can't be expanded anyway, let's put them on the BSS instead of
heap-allocating them.
Won't this just cause loads of additional pagefaults after fork() when those
pages are used the first time and then a second time when first written to (to
copy it)?
Aren't we incurring those page faults anyway, for whatever memory
palloc is handing out? The heap is no different from bss; we just
move the pointer with sbrk().
I think you're missing the real point, which is that the cost you're
measuring here probably isn't so much memset() as faulting in large
chunks of address space. Avoiding the explicit memset() likely will
save little in real runtime --- it'll just make sure the initial-touch
costs are more distributed and harder to measure. But in any case I
think this idea is a nonstarter because it gets in the way of making
those hashtables expansible, which we *do* need to do eventually.
(You might be able to confirm or disprove this theory if you ask
oprofile to count memory access stalls instead of CPU clock cycles...)
regards, tom lane