Storing pg_stat_statements query texts externally, pg_stat_statements in core
Attached patch allows pg_stat_statements to store arbitrarily long
query texts, using an external, transparently managed lookaside file.
This is of great practical benefit to certain types of users who need
to understand the execution costs of queries with excessively long
texts, often because the query was generated by an ORM. This is a
common complaint of Heroku customers, though I'm sure they're not
alone there.
As I mentioned on a previous occasion, since query fingerprinting was
added, the query text is nothing more than a way of displaying the
entries to interested users. As such, it seems perfectly reasonable to
store those texts externally, out of shared memory - the function
pg_stat_statements() itself (which the view is defined as a call to)
isn't particularly performance sensitive. Creating a brand
new pg_stat_statements entry is already costly, since it necessitates
an exclusive lock that blocks everything, so the additional cost of
storing the query text when that happens is assumed to be of marginal
concern. Better to have more entries to begin with, so that eviction
doesn't need to happen very frequently - using far less memory per
entry makes it easier for the user to increase pg_stat_statements.max,
which in turn makes excessive exclusive locking less likely. Having
said all that, the code bends over backwards to make sure that
physical I/O is as unlikely as possible while the exclusive lock is
held. There are numerous tricks employed where cache pressure is a
concern, which I won't go into since it's all well commented in the
patch.
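Still, to give a feel for the basic layout (a minimal sketch with
illustrative field names, not the patch's exact definitions), each
shared hashtable entry now carries a reference into the external file
instead of the text itself:

typedef struct pgssEntrySketch
{
    /* key fields: user OID, database OID, query fingerprint, ... */
    Size        query_offset;   /* start of this text in lookaside file */
    int         query_len;      /* length of query text, in bytes */
    /* ... fixed-size counters: calls, total_time, rows, ... */
} pgssEntrySketch;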
Maybe we should have a warning when eviction occurs too frequently? I
also think that we might want to take another look at making the
normalization logic less strict in terms of differentiating queries
that a reasonable person might consider equivalent. This is obviously
rather fuzzy, but I've had complaints about things like
pg_stat_statements being overly fussy in seeing row and array
comparisons as distinct from each other. This is most probably a
discussion for another thread, but it is informed by one of the same
concerns - making expensive cache pressure less likely.
This incidentally furthers the case for pg_stat_statements in core
(making it similar to pg_stat_user_functions - not turned on by
default, but easily available, even on busy production systems that
cannot afford a restart and didn't know they needed it until
performance tanked 5 minutes ago). The amount of shared memory now
used (and therefore arguably wasted by reserving a little shared
memory just in case pg_stat_statements becomes useful) is greatly
reduced. That's really a separate discussion, though, which is why I
haven't done the additional work of integrating pg_stat_statements
into core here. Doing a restart to enable pg_stat_statements is, in my
experience, only acceptable infrequently - restarts are generally to
be avoided at all costs.
I can see a day when the query hash is actually a general query cache
identifier, at which time the query text will also be stored outside
of the pgss hash table proper.
Refactoring
===========
I've decided to remove the encoding id from the pgss entry key (it's
now just in the main entry struct) - I don't think it makes sense
anymore. It is clearly no longer true that it's a "notational
convenience" (and hasn't been since 9.1). Like query texts themselves,
it is something that exists for the sole purpose of displaying
statistics to users, and as such cannot possibly result in incorrectly
failing to differentiate or spuriously differentiating two entries.
Now, I suppose it's true that since we sometimes directly hash query
texts (e.g. utility statements), the encoding might matter, assuming
it could vary. But since the encoding invariably comes from the
database encoding anyway, and we always differentiate on databaseid to
begin with, I fail to see how that could possibly matter.
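Schematically, the key is now reduced to the fields that actually
determine entry identity (again a sketch with illustrative names, not
the literal patch):

typedef struct pgssHashKeySketch
{
    Oid         userid;     /* user OID */
    Oid         dbid;       /* database OID */
    uint32      queryid;    /* query fingerprint */
    /*
     * The encoding used to live here; it is now kept in the entry
     * proper, since it only matters when displaying texts to users.
     */
} pgssHashKeySketch;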
I've bumped PGSS_FILE_HEADER, just in case a pg_stat_statements temp
file survives a pg_upgrade. I think that some minor tweaks made by
Noah to pg_stat_statements (as part of commit b560ec1b) ought to have
necessitated doing so anyway, but no matter.
The role of time-series aggregation
===================================
Over on the min/max pg_stat_statements storage thread, I've stated
that I think the way forward is for third-party tools to aggregate
deltas fairly aggressively - maybe even as often as every 3 seconds.
So I'm slightly concerned that I may be
hampering a future tool that needs to aggregate statistics very
aggressively. For that reason, we should probably provide an
alternative function/view that identifies pg_stat_statements entries
by hashid + databaseid + userid only, so any overhead from reading big
query strings from disk with a shared lock held is eliminated.
Thoughts?
--
Peter Geoghegan
Attachments:
On 11/14/13, 3:17 AM, Peter Geoghegan wrote:
Attached patch allows pg_stat_statements to store arbitrarily long
query texts, using an external, transparently managed lookaside file.
Compiler warnings with fortify settings:
gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -fpic -I. -I. -I../../src/include -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2 -c -o pg_stat_statements.o pg_stat_statements.c -MMD -MP -MF .deps/pg_stat_statements.Po
pg_stat_statements.c: In function 'entry_reset':
pg_stat_statements.c:1827:11: warning: ignoring return value of 'ftruncate', declared with attribute warn_unused_result [-Wunused-result]
pg_stat_statements.c: In function 'garbage_collect_query_texts':
pg_stat_statements.c:1779:11: warning: ignoring return value of 'ftruncate', declared with attribute warn_unused_result [-Wunused-result]
On Fri, Nov 15, 2013 at 8:51 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
Compiler warnings with fortify settings:
I'll post a revision with fixes for those. Another concern is that I
see some race conditions that only occur when the pg_stat_statements
cache is under a lot of pressure with a lot of concurrency, which
weren't revealed by my original testing. I'm working on a fix for that.
--
Peter Geoghegan
On Sat, Nov 16, 2013 at 4:36 PM, Peter Geoghegan <pg@heroku.com> wrote:
I'll post a revision with fixes for those. Another concern is that I
see some race conditions that only occur when the pg_stat_statements
cache is under a lot of pressure with a lot of concurrency, which
weren't revealed by my original testing. I'm working on a fix for that.
Revision attached.
--
Peter Geoghegan
Attachments:
With a pg_stat_statements.max of 1,000, the additional amount of
shared memory required on my system with the patch applied is only
192KiB. That compares with about 1.17MiB for the same setting in
master's version of pg_stat_statements. Surely with this huge
reduction, we should revisit that default - I think that a default
setting of 5,000 for pg_stat_statements.max makes more sense.
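(Back-of-the-envelope, assuming I have master's accounting right: each
entry there reserves track_activity_query_size bytes - 1KiB by default
- for its query text, so 1,000 entries spend about 1MiB on texts
alone, which accounts for most of that 1.17MiB. With the texts
externalized, only the fixed-size key and counters remain, at roughly
200 bytes per entry.)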
entry_dealloc() requires an exclusive lock, which locks all queries
out of so much as recording their costs. With a lot of cache pressure
this could be really expensive. In my opinion that risk alone
justifies the change.
Without adding another GUC, we might even go so far as to have a
mechanism like checkpoint_warning warn when entry_dealloc() calls
occur too frequently...
--
Peter Geoghegan
On Sun, Nov 17, 2013 at 1:15 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Sat, Nov 16, 2013 at 4:36 PM, Peter Geoghegan <pg@heroku.com> wrote:
I'll post a revision with fixes for those. Another concern is that I
see some race conditions that only occur when the pg_stat_statements
cache is under a lot of pressure with a lot of concurrency, which
weren't revealed by my original testing. I'm working on a fix for that.
Revision attached.
The patch doesn't apply cleanly against master due to a recent change
to pg_stat_statements. Could you update the patch?
Regards,
--
Fujii Masao
On Sat, Dec 7, 2013 at 9:26 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
The patch doesn't apply cleanly against master due to a recent change
to pg_stat_statements. Could you update the patch?
Revision is attached, including changes recently discussed on another
thread [1] to allow avoiding sending query texts where that's not
desirable, and an increase in the default pg_stat_statements.max to 5,000.
I've found myself tweaking things a bit more here and there. I've
added additional clarification around why we do an in-place garbage
collection (i.e. why we don't just write a new file and atomically
rename -- grep "over-engineer" to find what I mean). I also fully
documented the pg_stat_statements function. If we want external tool
authors to use it to avoid sending query texts, they have to know
about it in the first place.
I now use the pg_stat_tmp directory for my temp file (so
pg_stat_statements behaves more like the statistics collector). The
stats_temp_directory GUC is not used, for reasons that a patch comment
explains.
I've fuzz-tested this throughout with the "make installcheck-parallel"
regression tests. I found it useful to run that in one terminal, while
running pgbench with multiple clients in another. The pgbench script
looks something like:
"""
select * from pg_stat_statements;
select pg_stat_statements_reset();
"""
pg_stat_statements_reset() was temporarily rigged to perform a garbage
collection (and not a reset) for the benefit of this stress-test. The
function pg_stat_statements(), the garbage collection function and
pgss_store() will complain loudly if they notice anything out of the
ordinary. I was not able to demonstrate any problems through this
technique (though I think it focused my thinking on the right areas
during development). Clearly, missed subtleties around managing the
external file under locking, so as to ensure correct behavior (that no
session can observe something it shouldn't, or miss something it
should), will be something that a committer will want to pay
particular attention to. I go to some lengths to avoid doing this with
only a shared lock.
FYI, without hacking things up, it's quite hard to make external query
text garbage collection occur - need_gc_qtexts() will almost always
say no. It will only automatically happen once with
pg_stat_statements.max set to 1000 when the standard regression tests
are run. This is a really extreme workload in terms of causing
pg_stat_statements churn, which shows just how unlikely garbage
collection is in the real world. Still, it ought to be totally robust
when it happens.
In passing, I did this:
/*
! * Rename file into place, to atomically commit to serializing.
*/
Because at this juncture, there ought to be no file to replace (not
since Magnus had pg_stat_statements not keep the main global file
around for as long as the server is running, in the style of the
statistics collector).
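For anyone following along, the idiom being committed to there is the
usual write-then-rename pattern - roughly this sketch (illustrative
names, assuming the usual PostgreSQL includes; error handling mostly
elided):

static void
serialize_sketch(const char *final_path)
{
    char        tmp_path[MAXPGPATH];
    FILE       *file;

    snprintf(tmp_path, sizeof(tmp_path), "%s.tmp", final_path);
    file = AllocateFile(tmp_path, PG_BINARY_W);
    if (file == NULL)
        return;
    /* ... write the complete serialized state to "file" here ... */
    FreeFile(file);             /* flushes and closes */
    /* on POSIX, rename() atomically replaces any existing target */
    if (rename(tmp_path, final_path) != 0)
        elog(LOG, "could not rename \"%s\" to \"%s\": %m",
             tmp_path, final_path);
}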
Thanks
[1]: /messages/by-id/CAM3SWZSVFf-vBNC8sc-0CpWZxVQH49REYBcHb95t95fcrGSa6A@mail.gmail.com
--
Peter Geoghegan
Attachments:
On Mon, Dec 9, 2013 at 7:31 PM, Peter Geoghegan <pg@heroku.com> wrote:
I go to some lengths to avoid doing this with only a shared lock.
I should have said: I go to great lengths to do this with only a
shared lock, and not an exclusive one (see the gc_count stuff).
--
Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes:
On Sat, Dec 7, 2013 at 9:26 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
The patch doesn't apply cleanly against master due to a recent change
to pg_stat_statements. Could you update the patch?
Revision is attached, including changes recently discussed on another
thread [1] to allow avoiding sending query texts where that's not
desirable, and an increase in the default pg_stat_statements.max to 5,000.
I see this is marked as ready for committer. Where does it stand in
relation to the other long-running thread about "calls under-estimation
propagation"? I was surprised to find that there isn't any CommitFest
entry linked to that thread, so I'm wondering if that proposal is
abandoned or what. If it's not, is committing this going to blow up
that patch?
BTW, I'm also thinking that the "detected_version" kluge is about ready
to collapse of its own weight, or at least is clearly going to break in
future. What we need to do is embed the API version in the C name of the
pg_stat_statements function instead.
regards, tom lane
On Mon, Jan 20, 2014 at 12:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I see this is marked as ready for committer. Where does it stand in
relation to the other long-running thread about "calls under-estimation
propagation"? I was surprised to find that there isn't any CommitFest
entry linked to that thread, so I'm wondering if that proposal is
abandoned or what. If it's not, is committing this going to blow up
that patch?
I believe that proposal was withdrawn. I think the conclusion there
was that we should just expose queryid and be done with it. In a way,
exposing the queryid enabled this work, because it provides an
identifier that can be used instead of sending large query texts with
each call.
BTW, I'm also thinking that the "detected_version" kluge is about ready
to collapse of its own weight, or at least is clearly going to break in
future. What we need to do is embed the API version in the C name of the
pg_stat_statements function instead.
I agree that it isn't scalable.
--
Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes:
On Mon, Jan 20, 2014 at 12:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
BTW, I'm also thinking that the "detected_version" kluge is about ready
to collapse of its own weight, or at least is clearly going to break in
future. What we need to do is embed the API version in the C name of the
pg_stat_statements function instead.
I agree that it isn't scalable.
Yeah. It was more or less tolerable as long as we were only using it in
connection with identifying the set of output columns, but using it to
identify the expected input arguments too seems darn rickety. I'll
refactor so there are separate C call points for each supported API
version. (Well, I guess it's too late to separate 1.0 and 1.1, but
we can make 1.2 and future versions separate.)
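(For concreteness, a sketch of that shape, assuming a shared internal
worker function - names illustrative, not a committed design:)

Datum
pg_stat_statements_1_2(PG_FUNCTION_ARGS)
{
    /* the API version is implied by which C entry point was called */
    pg_stat_statements_internal(fcinfo, PGSS_V1_2);
    return (Datum) 0;
}

/*
 * The existing "pg_stat_statements" C function would remain the shared
 * entry point for the 1.0/1.1 API.
 */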
regards, tom lane
BTW, what is the value of using posix_fadvise() in warm_fcache?
ISTM that if you're worried about waiting for I/O while holding the
LWLock (as I concur you should be), this coding is entirely inadequate
for preventing that, as the whole point of posix_fadvise is that it
won't wait around for the I/O to happen. It strains credulity to
the breaking point to imagine that the kernel will have gotten
around to reading the file in the small number of nanoseconds that
will elapse before you start needing the data. Unless of course the file
is already in buffers, but then the whole thing was a waste of code.
I think we should flush the posix_fadvise() code path as being neither
useful nor portable, and just do the actual read() as in the other
path (which btw is lacking an essential fseek).
Actually, I think the whole thing is rather badly designed, as warm
cache or no you're still proposing to do thousands of kernel calls
while holding a potentially contended LWLock. I suggest what we
do is (1) read in the whole file, (2) acquire the lock, (3)
re-read the whole file in the same buffer, (4) work from the buffer.
It might be possible to optimize the re-read away in the common case
that the file didn't actually change since we started to read it;
we'd need to put a bit more reporting into qtext_store probably, so
that it's possible to tell if any space is reserved but still unwritten.
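(For concreteness, a minimal sketch of the read in steps (1) and (3),
with illustrative names and the usual PostgreSQL includes assumed:)

static char *
slurp_qtexts_sketch(const char *path, Size *buffer_len)
{
    FILE       *file = AllocateFile(path, PG_BINARY_R);
    struct stat st;
    char       *buf;

    if (file == NULL)
        return NULL;
    if (fstat(fileno(file), &st) != 0)
    {
        FreeFile(file);
        return NULL;
    }
    buf = (char *) palloc(st.st_size);
    /* one sequential read of the whole file, not one seek per entry */
    if (fread(buf, 1, st.st_size, file) != (size_t) st.st_size)
    {
        pfree(buf);
        FreeFile(file);
        return NULL;
    }
    FreeFile(file);
    *buffer_len = (Size) st.st_size;
    return buf;
}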
BTW shouldn't there be an fflush() in qtext_store?
We could also eliminate some palloc/pfree cycles by using the string
directly in the buffer (at the small cost of needing to include the
strings' trailing nulls in the file).
regards, tom lane
On Tue, Jan 21, 2014 at 6:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
BTW, what is the value of using posix_fadvise() in warm_fcache?
ISTM that if you're worried about waiting for I/O while holding the
LWLock (as I concur you should be), this coding is entirely inadequate
for preventing that, as the whole point of posix_fadvise is that it
won't wait around for the I/O to happen. It strains credulity to
the breaking point to imagine that the kernel will have gotten
around to reading the file in the small number of nanoseconds that
will elapse before you start needing the data. Unless of course the file
is already in buffers, but then the whole thing was a waste of code.
Not necessarily. Under Linux for example, POSIX_FADV_SEQUENTIAL "sets
the readahead window to the default size for the backing device". That
seems like a useful thing, though I will admit that it's hard to
quantify how useful here. If it is useful, and furthermore if it's
more useful than just reading the file into a temp buffer (which, I
admit, is not clear), then why not do it where we can? Adopting
support for the non-portable posix_fadvise is a sunk cost.
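(The call in question is purely advisory, and amounts to roughly this,
where fp is the open stdio stream:)

/*
 * Hint that the file will be read sequentially; this returns
 * immediately, whether or not the kernel starts any readahead.
 */
(void) posix_fadvise(fileno(fp), 0, 0, POSIX_FADV_SEQUENTIAL);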
I think we should flush the posix_fadvise() code path as being neither
useful nor portable, and just do the actual read() as in the other
path
Perhaps.
Actually, I think the whole thing is rather badly designed, as warm
cache or no you're still proposing to do thousands of kernel calls
while holding a potentially contended LWLock. I suggest what we
do is (1) read in the whole file, (2) acquire the lock, (3)
re-read the whole file in the same buffer, (4) work from the buffer.
fseeko() and fread() are part of the standard library, not any kernel.
I don't believe that what I've done in qtext_fetch() implies thousands
of kernel calls, or thousands of context switches.
Now, using a buffer not managed by libc might be better, but I should
note what holding that LWLock actually represents. As you know, it's a
shared lock that does not block the updating of existing entries. It
will only block the creation of entirely new ones. That ought to be
very rare. You may recall that when I worked with you on
pg_stat_statements in the past, I characterized the rate of growth as
being logarithmic. Once the system has been running for half an hour,
new entries become very infrequent indeed. Admittedly, I may be
failing to consider a worst case there, but in the average case this
works fine. Encouraging automated tools to lazily retrieve query texts
is a big part of why I thought this might be the least worst thing.
It's not obvious to me that always avoiding blocking new entries while
performing physical I/O is important enough to warrant the additional
complexity of copying to a buffer first, and mostly working off that.
Also, you must consider that there may have been a garbage collection
in the interim, so you have to get the shared lock twice (once to
check the garbage collection count, and later to guard against your
buffer having been invalidated by a garbage collection) instead of
just getting the shared lock once. I discussed this with Pavel,
actually. Getting the shared lock twice on every pg_stat_statements()
call doesn't seem very good either. Especially since you'll have to
open the file twice, *with* the shared lock the second time around,
when your scheme misses new entries. Your scheme won't work out under
exactly the same circumstances that mine doesn't do too well in, which
is when there is a new entry to consider. When that doesn't happen,
holding a shared lock is not that important.
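(Sketched in outline - illustrative names, not the patch's exact code,
and read_whole_qtext_file() is hypothetical - that double-check
amounts to:)

uint64      gc_before;
char       *texts;

LWLockAcquire(pgss->lock, LW_SHARED);
gc_before = pgss->gc_count;         /* remember the GC epoch */
LWLockRelease(pgss->lock);

texts = read_whole_qtext_file();    /* I/O with no lock held */

LWLockAcquire(pgss->lock, LW_SHARED);
if (pgss->gc_count != gc_before)
{
    /*
     * A garbage collection invalidated the buffer; it must be read
     * again, this time while the shared lock is still held.
     */
}
/* ... emit rows, resolving each entry's (offset, len) into texts ... */
LWLockRelease(pgss->lock);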
It might be possible to optimize the re-read away in the common case
that the file didn't actually change since we started to read it;
we'd need to put a bit more reporting into qtext_store probably, so
that it's possible to tell if any space is reserved but still unwritten.
So, to be clear, you'd still be reading from the file if that proved
necessary? I think you'll have to, because it seems useful that
pg_stat_statements offers a limited form of consistency, in that you
only see entries at a point in time, even if the figures themselves
are not exactly consistent. Maybe I've missed something, but it's not
obvious to me that you're any better off as compared to just using a
FILE/stream.
BTW shouldn't there be an fflush() in qtext_store?
I don't think so, no. Whenever qtext_store() callers don't fflush()
before releasing their exclusive lock, they FreeFile() before doing so
instead. In the case of pgss_shmem_startup() there is no lock held
anyway, because it doesn't matter, for the same reason it doesn't
matter that we're modifying shmem structures in a way that ordinarily
necessitates an exclusive lock. Why fflush() every qtext_store() call
if there is expected to be more than one, as is the case for several
callers?
FreeFile() calls fclose(). fclose() is described by the standard as
flushing its stream. It is my understanding that calling fflush() on a
stream immediately prior to calling fclose() on the same stream is
always superfluous.
We could also eliminate some palloc/pfree cycles by using the string
directly in the buffer (at the small cost of needing to include the
strings' trailing nulls in the file).
It might be useful to do that anyway, for sanity checking purposes. So
+1 from me.
--
Peter Geoghegan
On Tue, Jan 21, 2014 at 8:01 PM, Peter Geoghegan <pg@heroku.com> wrote:
Not necessarily. Under Linux for example, POSIX_FADV_SEQUENTIAL "sets
the readahead window to the default size for the backing device"
Excuse me; I meant to put "POSIX_FADV_SEQUENTIAL doubles this size
[default size for the backing device]".
--
Peter Geoghegan
On Tue, Jan 21, 2014 at 8:01 PM, Peter Geoghegan <pg@heroku.com> wrote:
Actually, I think the whole thing is rather badly designed, as warm
cache or no you're still proposing to do thousands of kernel calls
while holding a potentially contended LWLock. I suggest what we
do is (1) read in the whole file, (2) acquire the lock, (3)
re-read the whole file in the same buffer, (4) work from the buffer.
fseeko() and fread() are part of the standard library, not any kernel.
I don't believe that what I've done in qtext_fetch() implies thousands
of kernel calls, or thousands of context switches.
I ran an strace on a pg_stat_statements backend with a full ~5,000
entries. It seems that glibc is actually content to always make
lseek() and read() syscalls in the event of an fseek(), even though it
does maintain its own buffer and could in principle have a lot more
gumption about that [1]. I agree that we should copy everything into a
buffer in one go.
I think that making the LWLocking duration as brief as possible isn't
critically important. Automated tools will only be interested in new
query texts (implying possible I/O while share locking) as they
observe new entries. But new entries are the only thing worth
considering that may potentially block on that shared lock (although
not as they write out their query texts, which almost always requires
only a shared lock). New entries ought to be a very rare occurrence
compared to the execution of queries - certainly, on the busy
production systems I use pg_stat_statements on, it may be weeks before
a new query is observed (uh, with the possible exception of *my*
ad-hoc pg_stat_statements queries). On my laptop, "explain analyze
select * from pg_stat_statements" indicates a total runtime of ~9ms
with 5,000 regression test entries, even with the (still modest)
overhead of doing all those read()/lseek() calls. So we're talking
about a small impact on a new entry, which will only be affected once,
and which was always going to have to write out its query text to file
anyway. On top of all this, in general it's very improbable that any
given new entry will be affected at all, because two unlikely things
need to happen at the same time for that to be possible.
The most important aspect of keeping the overhead of
pg_stat_statements low is that there not be too much cache pressure.
Obviously the huge reduction in its shared memory footprint that this
patch allows helps with that.
[1]: http://www.pixelbeat.org/programming/stdio_buffering/
--
Peter Geoghegan
On Tue, Jan 21, 2014 at 8:01 PM, Peter Geoghegan <pg@heroku.com> wrote:
BTW shouldn't there be an fflush() in qtext_store?
I don't think so, no. Whenever qtext_store() callers don't fflush()
before releasing their exclusive lock, they FreeFile() before doing so
instead. In the case of pgss_shmem_startup() there is no lock held
anyway, because it doesn't matter, for the same reason it doesn't
matter that we're modifying shmem structures in a way that ordinarily
necessitates an exclusive lock. Why fflush() every qtext_store() call
if there is expected to be more than one, as is the case for several
callers?
Having said all that, it occurs to me now that I could have been
smarter about where I fflush(). I'm pretty sure pgss_store() should
have two fflush() calls. The first can occur with just a shared LWLock
held, before the lock strength promotion interim (you probably noticed
that I no longer generate a normalized query text in that interim, for
reasons that are obvious). The second fflush() may occur only in the
very unlikely event of a garbage collection necessitating a new
qtext_store() call with an exclusive lock held, a few lines after the
first fflush(). No need to fflush() with the exclusive lock the vast
majority of the time.
--
Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes:
On Tue, Jan 21, 2014 at 8:01 PM, Peter Geoghegan <pg@heroku.com> wrote:
Actually, I think the whole thing is rather badly designed, as warm
cache or no you're still proposing to do thousands of kernel calls
while holding a potentially contended LWLock. I suggest what we
do is (1) read in the whole file, (2) acquire the lock, (3)
re-read the whole file in the same buffer, (4) work from the buffer.
fseeko() and fread() are part of the standard library, not any kernel.
I don't believe that what I've done in qtext_fetch() implies thousands
of kernel calls, or thousands of context switches.
I ran an strace on a pg_stat_statements backend with a full ~5,000
entries. It seems that glibc is actually content to always make
lseek() and read() syscalls in the event of an fseek(), even though it
does maintain its own buffer and could in principle have a lot more
gumption about that [1]. I agree that we should copy everything into a
buffer in one go.
Yeah, that was what I thought might happen. Even if glibc were smarter,
a lot of other libc implementations aren't. Also, ISTM that traversal
of the hash table will generally lead to more-or-less-random access into
the text file. So unless you assume that libc has mmap'd the whole
file, it'd have to do a lot of kernel calls anyway to get different pages
of the file into its buffer at different times.
I think that "read the whole file" is probably going to net out not any
more complex as far as our code goes, and it'll be making a lot fewer
assumptions about how smart the libc level is and what's happening at
the kernel level. I'll have a go at coding it, anyway.
I think that making the LWLocking duration as brief as possible isn't
critically important.
Maybe, but I don't like the idea that calling pg_stat_statements()
frequently could result in random glitches in response time for other
queries.
Automated tools will only be interested in new
query texts (implying possible I/O while share locking) as they
observe new entries.
This argument would be more plausible if there were a way to fetch
the text for a previously-observed entry without invoking
pg_stat_statements(). I'm surprised you've not submitted a patch
to add such a function.
regards, tom lane
Peter Geoghegan <pg@heroku.com> writes:
I ran an strace on a pg_stat_statements backend with a full ~5,000
entries.
BTW, have you got any sort of test scenario you could share for this
purpose? I'm sure I could build something, but if you've already
done it ...
regards, tom lane
On Wed, Jan 22, 2014 at 1:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
BTW, have you got any sort of test scenario you could share for this
purpose? I'm sure I could build something, but if you've already
done it ...
I simply ran the standard regression tests, and then straced a backend
as it executed the pgss-calling query. I'm not sure that I've
understood your question.
As previously noted, my approach to testing this patch involved
variations of:
* running "make installcheck-parallel"
* concurrently, running pgbench with multiple clients, each calling
pg_stat_statements() and pg_stat_statements_reset()
The latter was sometimes rigged to do a direct garbage collection to
stress test
things. My expectation was that the sanity checking done everywhere
would complain if there were any race conditions or other bugs. This
was pretty effective as a smoke test during development.
--
Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes:
On Wed, Jan 22, 2014 at 1:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
BTW, have you got any sort of test scenario you could share for this
purpose? I'm sure I could build something, but if you've already
done it ...
I simply ran the standard regression tests, and then straced a backend
as it executed the pgss-calling query. I'm not sure that I've
understood your question.
Hm, are there 5K distinct queries in the regression tests? I'd have
thought they weren't enough to fill a large pgss hash.
regards, tom lane