posix_fadvise v22

Started by Bruce Momjian over 17 years ago -- 41 messages -- pgsql-hackers
#1 Bruce Momjian
bruce@momjian.us

Here's an update to eliminate two small bitrot conflicts.

Attachments:

posix_fadvise_v22.diff.gz (application/octet-stream)
#2 ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Bruce Momjian (#1)
Re: posix_fadvise v22

Hello,

Gregory Stark <stark@enterprisedb.com> wrote:

Here's an update to eliminate two small bitrot conflicts.

I read your patch with interest, but found some trivial bad manners.

* LET_OS_MANAGE_FILESIZE is already obsoleted.
You don't have to cope with the option.

* Type mismatch in prefetch_pages
A variable prefetch_pages is defined as "unsigned" or "int"
in some places. Why don't you define it only once in a header
and include the header in source files?

* Assignment to prefetch_pages
What does "+0.99" mean here?
    [assign_io_concurrency()]
    + 			prefetch_pages = new_prefetch_pages+0.99;
You want to do as follows, right?
    + 			prefetch_pages = (int) ceil(new_prefetch_pages);

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#3 Bruce Momjian
bruce@momjian.us
In reply to: ITAGAKI Takahiro (#2)
Re: posix_fadvise v22

I'll send another patch with at least 1 and 3 fixed and hunt around
again for a header file to put this GUC into.

On 10 Dec 2008, at 04:22, ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> wrote:

Hello,

Gregory Stark <stark@enterprisedb.com> wrote:

Here's an update to eliminate two small bitrot conflicts.

I read your patch with interest, but found some trivial bad manners.

* LET_OS_MANAGE_FILESIZE is already obsoleted.
You don't have to cope with the option.

Huh, I didn't realize that. I guess the idea is that users just
configure a very large segment size to get the old behaviour.

* Type mismatch in prefetch_pages
A variable prefetch_pages is defined as "unsigned" or "int"
in some places. Why don't you define it only once in a header
and include the header in source files?

Just... Which header?

* Assignment to prefetch_pages
What does "+0.99" mean here?
[assign_io_concurrency()]
+            prefetch_pages = new_prefetch_pages+0.99;
You want to do as follows, right?
+            prefetch_pages = (int) ceil(new_prefetch_pages);

Sure

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#3)
Re: posix_fadvise v22

Greg Stark <greg.stark@enterprisedb.com> writes:

A variable prefetch_pages is defined as "unsigned" or "int"
in some places. Why don't you define it only once in a header
and include the header in source files?

Just... Which header?

MHO: the header that goes with the source file that is most concerned with
implementing the variable's behavior (which is also the file that should
have the actual variable definition).

regards, tom lane

#5 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#4)
Re: posix_fadvise v22

On Thu, Dec 11, 2008 at 4:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Greg Stark <greg.stark@enterprisedb.com> writes:

A variable prefetch_pages is defined as "unsigned" or "int"
in some places. Why don't you define it only once in a header
and include the header in source files?

Just... Which header?

MHO: the header that goes with the source file that is most concerned with
implementing the variable's behavior (which is also the file that should
have the actual variable definition).

Well, the trick here is that the variable actually affects how many
PrefetchBuffer() calls *callers* should make. The callers are various
places which are doing lots of ReadBuffer calls and know what buffers
they'll need in the future. The main places are in
nodeBitmapHeapScan.c and nbtsearch.c. Neither of those is a remotely
suitable home for it.

I think I'm settling on the same place as the PrefetchBuffer()
prototype, since anyone who needs prefetch_pages will need that as
well (except for guc.c). So I'll put it in bufmgr.h for now.

--
greg

#6 Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#5)
Re: posix_fadvise v22

Here's the update

I also skimmed through and cleaned up a couple of other things. There's
*still* one function prototype for which I don't see what header file to use:
the one in port/posix_fadvise.c, which contains one function with one caller,
guc.c.

Attachments:

posix_fadvise_v23.diff.gz (application/octet-stream)
#7 Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#6)
Re: posix_fadvise v22

I tried this on my laptop running FC9, and because I forgot to run
autoconf, I got this error message when I tried to turn on
posix_fadvise.

rhaas=# set effective_io_concurrency to 3;
ERROR: could not determine if this system has a working posix_fadvise
DETAIL: Check configure.log produced by configure for more information

Am I correct in thinking that the only thing we're really checking for
here is whether a trivial posix_fadvise() call returns success? If
so, is this test really worth doing? It seems to me that since users
can always switch off the feature by leaving effective_io_concurrency
set to the default value of 1, there is not much value in having a
configure test that forcibly disables it. If the user has a broken
posix_fadvise() and later fixes it, they shouldn't have to recompile
PostgreSQL to use this feature, especially in this day when the build
system and the run system are often different. A user who somehow
ends up with RPMs that generate this error message will be utterly at
a loss as to what to do about it.

One minor nit: If we're going to keep this test, we should change the
detail string to say config.log rather than configure.log, as that is
the actual file name.

...Robert


On Thu, Dec 11, 2008 at 4:35 PM, Gregory Stark <stark@enterprisedb.com> wrote:

Here's the update

I also skimmed through and cleaned up a couple of other things. There's
*still* one function prototype for which I don't see what header file to use:
the one in port/posix_fadvise.c, which contains one function with one caller,
guc.c.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#7)
Re: posix_fadvise v22

"Robert Haas" <robertmhaas@gmail.com> writes:

Am I correct in thinking that the only thing we're really checking for
here is whether a trivial posix_fadvise() call returns success? If
so, is this test really worth doing?

Runtime tests performed during configure are generally a bad idea to
start with --- it's impossible to do any such thing in a
cross-compilation scenario, for example.

regards, tom lane

#9 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#8)
Re: posix_fadvise v22

On Thu, Jan 1, 2009 at 3:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Robert Haas" <robertmhaas@gmail.com> writes:

Am I correct in thinking that the only thing we're really checking for
here is whether a trivial posix_fadvise() call returns success? If
so, is this test really worth doing?

Runtime tests performed during configure are generally a bad idea to
start with --- it's impossible to do any such thing in a
cross-compilation scenario, for example.

OK, here's an update of Greg's patch with the runtime configure test
ripped out, some minor documentation tweaks, and a few unnecessary
whitespace diff hunks quashed. I think this is about ready for
committer review. The only thing I haven't been able to do is
demonstrate that this change actually produces a performance
improvement. Either I'm testing the wrong thing, or it just doesn't
provide any benefit on a single-spindle system. However, I believe
that Greg has previously posted some fairly impressive performance
results, so I'm not sure that my shortcomings in this area should be a
bar to having a committer pick this one up. If more testing is
needed, it would at least be helpful to have a committer specify what
areas they are concerned about.

...Robert

Attachments:

posix_fadvise_v23_rh1.diff.gz (application/x-gzip)
#10 Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#9)
Re: posix_fadvise v22

In theory there should be no benefit on a single-spindle system. There
could be a slight benefit due to reordering of I/O, but only on a RAID
array would you see a significant speedup -- which should be roughly
equal to the number of spindles.

What would be interesting is whether you see a noticeable speed
*decrease* from having prefetching enabled when it isn't helping,
either due to having everything fit in shared buffers or everything
fit in the filesystem cache (the latter should be more of a hit).

Even if there is, it doesn't really worry me. By default the feature is
disabled, and you should only really turn it on if you do have a RAID
array and want an individual query to make use of it.

Now that there's an actual run-time sysconf check for the buggy glibc
called by the guc function we arguably don't need the autoconf
check_run check anymore anyways.

--
Greg

On 1 Jan 2009, at 21:43, "Robert Haas" <robertmhaas@gmail.com> wrote:


On Thu, Jan 1, 2009 at 3:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Robert Haas" <robertmhaas@gmail.com> writes:

Am I correct in thinking that the only thing we're really checking
for
here is whether a trivial posix_fadvise() call returns success? If
so, is this test really worth doing?

Runtime tests performed during configure are generally a bad idea to
start with --- it's impossible to do any such thing in a
cross-compilation scenario, for example.

OK, here's an update of Greg's patch with the runtime configure test
ripped out, some minor documentation tweaks, and a few unnecessary
whitespace diff hunks quashed. I think this is about ready for
committer review. The only thing I haven't been able to do is
demonstrate that this change actually produces a performance
improvement. Either I'm testing the wrong thing, or it just doesn't
provide any benefit on a single-spindle system. However, I believe
that Greg has previously posted some fairly impressive performance
results, so I'm not sure that my shortcomings in this area should be a
bar to having a committer pick this one up. If more testing is
needed, it would at least be helpful to have a committer specify what
areas they are concerned about.

...Robert
<posix_fadvise_v23_rh1.diff.gz>

#11 Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#10)
Re: posix_fadvise v22

Now that there's an actual run-time sysconf check for the buggy glibc called
by the guc function we arguably don't need the autoconf check_run check
anymore anyways.

Isn't that the check I just removed for you, or are you talking about
some other check that can also be removed?

...Robert

#12 Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#11)
Re: posix_fadvise v22

Sorry for top-posting -- phone mail client sucks.

I thought the autoconf ac_run_check was the test that people were
questioning. That calls posix_fadvise to see if it crashes at
configure time.

The guc run-time check is checking for known-buggy versions of glibc
using sysconf to check what version of glibc you have.

--
Greg

On 1 Jan 2009, at 23:11, "Robert Haas" <robertmhaas@gmail.com> wrote:


Now that there's an actual run-time sysconf check for the buggy
glibc called
by the guc function we arguably don't need the autoconf check_run
check
anymore anyways.

Isn't that the check I just removed for you, or are you talking about
some other check that can also be removed?

...Robert

#13 Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#12)
Re: posix_fadvise v22

On Thu, Jan 1, 2009 at 11:49 PM, Greg Stark <greg.stark@enterprisedb.com> wrote:

Sorry for top-posting -- phone mail client sucks.

I thought the autoconf ac_run_check was the test that people were
questioning. That calls posix_fadvise to see if it crashes at configure
time.

Yes, that's what I removed.

The guc run-time check is checking for known-buggy versions of glibc using
sysconf to check what version of glibc you have.

Right - that check is still in my updated patch.

I think the confusion may stem from the fact that Tom and I used the
word "runtime" to refer to the ac_run_check thing, because it is
checking something about the runtime environment (namely, whether
posix_fadvise works or not) at configure-time.

In any event, it seems as though we are all on the same page.

...Robert

#14 Greg Smith
gsmith@gregsmith.com
In reply to: Robert Haas (#9)
Re: posix_fadvise v22

On Thu, 1 Jan 2009, Robert Haas wrote:

The only thing I haven't been able to do is demonstrate that this change
actually produces a performance improvement. Either I'm testing the
wrong thing, or it just doesn't provide any benefit on a single-spindle
system.

When I did a round of testing on the earlier prefetch test program Greg
Stark put together, one of my single-spindle Linux systems didn't show any
real benefit. So as long as you didn't see performance degrade, your not
seeing any improvement isn't bad news.

I've got a stack of hardware I can do performance testing of this patch
on, what I haven't been able to find time for is setting up any sort of
test harness right now. If you or Greg have any benchmark or test program
you could suggest that should show off the improvements here, I'd be glad
to run it on a bunch of systems and report back--I've already got a stack
of candidate ones I ran the earlier tests on to compare results against.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#15 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Smith (#14)
Re: posix_fadvise v22

Greg Smith <gsmith@gregsmith.com> writes:

On Thu, 1 Jan 2009, Robert Haas wrote:

The only thing I haven't been able to do is demonstrate that this change
actually produces a performance improvement. Either I'm testing the
wrong thing, or it just doesn't provide any benefit on a single-spindle
system.

When I did a round of testing on the earlier prefetch test program Greg
Stark put together, one of my single-spindle Linux systems didn't show any
real benefit. So as long as you didn't see performance degrade, your not
seeing any improvement isn't bad news.

ISTM that you *should* be able to see an improvement on even
single-spindle systems, due to better overlapping of CPU and I/O effort.
If the test case is either 100% CPU-bound or 100% I/O-bound then no,
but for anything in between there ought to be improvement.

regards, tom lane

#16 Greg Smith
gsmith@gregsmith.com
In reply to: Tom Lane (#15)
Re: posix_fadvise v22

On Fri, 2 Jan 2009, Tom Lane wrote:

ISTM that you *should* be able to see an improvement on even
single-spindle systems, due to better overlapping of CPU and I/O effort.

The earlier synthetic tests I did:

http://archives.postgresql.org/pgsql-hackers/2008-09/msg01401.php

Showed a substantial speedup even in the single spindle case on a couple
of systems, but one didn't really seem to benefit. So we could theorize
that Robert's test system is more like that one. If someone can help out
with making a more formal test case showing this in action, I'll dig into
the details of what's different between that system and the others.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#17 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Smith (#16)
Re: posix_fadvise v22

Greg Smith <gsmith@gregsmith.com> writes:

On Fri, 2 Jan 2009, Tom Lane wrote:

ISTM that you *should* be able to see an improvement on even
single-spindle systems, due to better overlapping of CPU and I/O effort.

The earlier synthetic tests I did:
http://archives.postgresql.org/pgsql-hackers/2008-09/msg01401.php
Showed a substantial speedup even in the single spindle case on a couple
of systems, but one didn't really seem to benefit. So we could theorize
that Robert's test system is more like that one. If someone can help out
with making a more formal test case showing this in action, I'll dig into
the details of what's different between that system and the others.

Well, I claim that if you start with a query that's about 50% CPU and
50% I/O effort, you ought to be able to get something approaching 2X
speedup if this patch really works. Consider something like

create function waste_time(int) returns int as $$
begin
    for i in 1 .. $1 loop
        null;
    end loop;
    return 1;
end $$ language plpgsql;

select count(waste_time(42)) from very_large_table;

In principle you should be able to adjust the constant so that vmstat
shows about 50% CPU busy, and then enabling fadvise should improve
matters significantly.

Now the above proposed test case is too simple because it will generate
a seqscan, and if the kernel is not completely brain-dead it will not
need any fadvise hinting to do read-ahead. But you should be able to
adapt the idea for whatever indexscan-based test case you are really
using.

Note: on a multi-CPU system you need to take vmstat or top numbers with
a grain of salt, since they might consider "one CPU 50% busy" as
"system only 50/N % busy".

regards, tom lane

#18 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#17)
Re: posix_fadvise v22

Tom Lane <tgl@sss.pgh.pa.us> writes:

In principle you should be able to adjust the constant so that vmstat
shows about 50% CPU busy, and then enabling fadvise should improve
matters significantly.

I think in practice individual queries don't interleave much CPU with I/O
work. A single random page fetch is 5ms, which is an awful lot of CPU cycles
to be sinking somewhere. In practice I think this is going to be single-digit
percentages.

Aside from big sorts and such, which tend not to be interleaved with I/O, the
main time I've seen queries have a significant CPU load is when the data is
mostly in cache -- in which case prefetching would be hard pressed to help at
all.

We could construct a synthetic case but the main point of this feature is to
make use of raid arrays that are currently going idle, not to pick up a few
percentage points for single spindle systems.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning

#19 Bruce Momjian
bruce@momjian.us
In reply to: Greg Smith (#16)
Re: posix_fadvise v22

Greg Smith wrote:

On Fri, 2 Jan 2009, Tom Lane wrote:

ISTM that you *should* be able to see an improvement on even
single-spindle systems, due to better overlapping of CPU and I/O effort.

The earlier synthetic tests I did:

http://archives.postgresql.org/pgsql-hackers/2008-09/msg01401.php

Showed a substantial speedup even in the single spindle case on a couple
of systems, but one didn't really seem to benefit. So we could theorize
that Robert's test system is more like that one. If someone can help out
with making a more formal test case showing this in action, I'll dig into
the details of what's different between that system and the others.

I think for an I/O-bound workload on a single drive system you would
need a drive that did some kind of tagged queuing (reordering/grouping)
of requests to see a benefit from the patch.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#20 Robert Haas
robertmhaas@gmail.com
In reply to: Greg Smith (#14)
Re: posix_fadvise v22

I've got a stack of hardware I can do performance testing of this patch on,
what I haven't been able to find time for is setting up any sort of test
harness right now. If you or Greg have any benchmark or test program you
could suggest that should show off the improvements here, I'd be glad to run
it on a bunch of systems and report back--I've already got a stack of
candidate ones I ran the earlier tests on to compare results against.

Unfortunately I don't have anything useful. I took the skewed TPC-H
data that Lawrence Ramon posted and created a table based on lineitem
using something along the lines of:

SELECT *, g FROM lineitem, generate_series(1,8) AS g;

Then I made an index on one of the columns and ran some queries that
ended up being planned as bitmap index scans. The disk seemed to be
doing its thing, but it didn't do its thing any faster when I changed
effective_io_concurrency to 4 (in fact there might have been a small
slowdown but I didn't take the time to measure carefully, so I can't
refute the null hypothesis).

...Robert

#21 Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#20)
#22 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#18)
#23 Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#21)
#24 Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#23)
#25 Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#24)
#26 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#22)
#27 Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#25)
#28 Peter Eisentraut
peter_e@gmx.net
In reply to: Bruce Momjian (#12)
#29 Bruce Momjian
bruce@momjian.us
In reply to: Peter Eisentraut (#28)
#30 Peter Eisentraut
peter_e@gmx.net
In reply to: Bruce Momjian (#29)
#31 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#30)
#32 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#31)
#33 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#31)
#34 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#9)
#35 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#34)
#36 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#34)
#37 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#35)
#38 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#9)
#39 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#38)
#40 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#39)
#41 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#34)