Seq scans roadmap
Here's my roadmap for the "scan-resistant buffer cache" and
"synchronized scans" patches.
1. Fix the current vacuum behavior of throwing dirty buffers to the
freelist, forcing a lot of WAL flushes. Instead, use a backend-private
ring of shared buffers that are recycled. This is what Simon's
"scan-resistant buffer manager" did.
The theory here is that if a page is read in by vacuum, it's unlikely to
be accessed in the near future, therefore it should be recycled. If
vacuum doesn't dirty the page, it's best to reuse the buffer immediately
for the next page. However, if the buffer is dirty (and not just because
we set hint bits), we ought to delay writing it to disk until the
corresponding WAL record has been flushed to disk.
Simon's patch used a fixed size ring of buffers that are recycled, but I
think the ring should be dynamically sized. Start with a small ring, and
whenever you need to do a WAL flush to write a dirty buffer, increase
the ring size. On every full iteration through the ring, decrease its
size to trim down an unnecessarily large ring.
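The grow-on-WAL-flush, shrink-per-pass policy could be sketched roughly like this. This is a toy model, not the actual patch: the names, the bounds, and the choice to double on growth are all invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define RING_MIN  8
#define RING_MAX  256

typedef struct {
    size_t size;     /* current number of buffers in the ring */
    size_t next;     /* index of the buffer to recycle next */
} VacuumRing;

void ring_init(VacuumRing *ring) {
    ring->size = RING_MIN;
    ring->next = 0;
}

/* Advance to the next buffer; returns 1 when a full pass completed. */
int ring_advance(VacuumRing *ring) {
    ring->next++;
    if (ring->next >= ring->size) {
        ring->next = 0;
        /* Full iteration through the ring: trim an unnecessarily
         * large ring by one buffer. */
        if (ring->size > RING_MIN)
            ring->size--;
        return 1;
    }
    return 0;
}

/* Called when recycling the next buffer would force a WAL flush:
 * grow the ring so the WAL record has more time to reach disk
 * before the buffer comes around again. */
void ring_wal_flush_needed(VacuumRing *ring) {
    if (ring->size < RING_MAX)
        ring->size *= 2;
}
```

The point of the shrink-on-full-pass rule is that the ring settles at the smallest size that avoids WAL flushes, rather than staying at its high-water mark.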
This only alters the behavior of vacuums, and it's pretty safe to say it
won't get worse than what we have now. In the future, we can use the
buffer ring for seqscans as well; more on that in step 3.
2. Implement the list/table of last/ongoing seq scan positions. This is
Jeff's "synchronized scans" patch. When a seq scan starts on a table
larger than some threshold, it starts from where the previous seq scan
is currently, or where it ended. This will synchronize the scans so that
for two concurrent scans the total I/O is halved in the best case. There
should be no other effect on performance.
If you have a partitioned table, a union of multiple tables, or any
other plan where multiple seq scans are performed in arbitrary order,
this change won't change the order the partitions are scanned in, and
therefore won't ensure they will be synchronized.
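As a rough illustration of the shared scan-position table, here is a toy sketch. The fixed-size hash scheme, the sizes, and all names are invented; the real patch's data structure may well differ.

```c
#include <assert.h>
#include <stdint.h>

#define SCAN_HINTS 64

typedef struct {
    uint32_t relid;      /* relation the hint belongs to (0 = empty) */
    uint32_t block;      /* last block number reported by a scanner */
} ScanHint;

static ScanHint scan_hints[SCAN_HINTS];

static unsigned hint_slot(uint32_t relid) {
    return relid % SCAN_HINTS;
}

/* A running scan periodically reports where it currently is. */
void report_scan_position(uint32_t relid, uint32_t block) {
    ScanHint *h = &scan_hints[hint_slot(relid)];
    h->relid = relid;
    h->block = block;
}

/* A starting scan asks where to begin; falls back to block 0 when
 * no previous scan has reported a position for this relation. */
uint32_t get_scan_start(uint32_t relid) {
    ScanHint *h = &scan_hints[hint_slot(relid)];
    if (h->relid == relid)
        return h->block;
    return 0;
}
```

A second scan that starts from the reported position trails the first one through already-cached pages, which is where the "total I/O is halved in the best case" figure comes from.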
Now that we have both pieces of the puzzle in place, it's time to
consider what more we can do with them:
3A. To take advantage of the "cache trail" of a previous seq scan, scan
backwards from where the previous seq scan ended, until you hit a
buffer that's not in cache.
This will allow taking advantage of the buffer cache even if the table
doesn't fit completely in RAM. That can make a big difference if the
table size is just slightly bigger than RAM, and can avoid the nasty
surprise when a table grows beyond RAM size and queries start taking
minutes instead of seconds.
This should be a non-controversial change on its own from a performance
point of view. No query should get slower, and some will become faster.
But see step 3B:
3B. Currently, sequential scans on a large table spoil the buffer cache
by evicting other pages from the cache. In CVS HEAD, as soon as the
table is larger than shared_buffers, the pages in the buffer won't be
used to speed up running the same query again, and there's no reason to
believe the pages read in would be more useful than any other page in
the database, and in particular the pages that were in the buffer cache
before the huge seq scan. If the table being scanned is > 5 *
shared_buffers, the scan will evict every other page from the cache if
there's no other activity in the database (max usage_count is 5).
If the table is much larger than shared_buffers, say 10 times as large,
even with the 3A change to read the pages that are in cache first, using
all shared_buffers to cache the table will only speed up the query by
10%. We should not spoil the cache for such a small gain; we should use
the local buffer ring strategy instead. It's better to make queries that
are slow anyway a little bit slower than to make queries that are
normally really fast slow.
As you may notice, 3A and 3B are at odds with each other. We can
implement both, but you can't use both strategies in the same scan.
Therefore we need to have decision logic of some kind to figure out
which strategy is optimal.
A simple heuristic is to decide based on the table size:
< 0.1 * shared_buffers -> start from page 0, keep in cache (like we do now)
< 5 * shared_buffers -> start from last read page, keep in cache
> 5 * shared_buffers -> start from last read page, use buffer ring
I'm not sure about the constants, we might need to make them GUC
variables as Simon argued, but that would be the general approach.
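The heuristic above could look something like this sketch. The thresholds are the ones from the list; the enum and function names are invented for illustration, and in a real patch the constants might be GUC variables as discussed.

```c
#include <assert.h>

typedef enum {
    SCAN_FROM_START_CACHED,   /* start at page 0, keep pages in cache */
    SCAN_SYNC_CACHED,         /* start at last read page, keep in cache */
    SCAN_SYNC_RING            /* start at last read page, use buffer ring */
} ScanStrategy;

/* Pick a scan strategy from the table size relative to the size of
 * the shared buffer pool, both measured in pages. */
ScanStrategy choose_scan_strategy(double table_pages, double shared_buffers) {
    if (table_pages < 0.1 * shared_buffers)
        return SCAN_FROM_START_CACHED;
    if (table_pages < 5.0 * shared_buffers)
        return SCAN_SYNC_CACHED;
    return SCAN_SYNC_RING;
}
```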
In the future, I'm envisioning a smarter algorithm to size the local
buffer ring. Take all buffers with usage_count=0, plus a few with
usage_count=1 from the clock sweep. That way, if there are a lot of
buffers in the buffer cache that are seldom used, we'll use more buffers
to cache the large scan, and vice versa. And no matter how large the scan,
it wouldn't blow all buffers from the cache. But if you execute the same
query again, the buffers left in the cache from the last scan were
apparently useful, so we use a bigger ring this time.
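That sizing idea can be sketched as a simple count over the pool's usage counts. The array standing in for the shared buffer pool and all names here are illustrative; a real implementation would piggyback on the clock sweep itself.

```c
#include <assert.h>
#include <stddef.h>

/* Size a scan's buffer ring as: every buffer with usage_count == 0,
 * plus at most 'max_ones' buffers with usage_count == 1. Seldom-used
 * pools yield a big ring; busy pools yield a small one. */
size_t size_ring_from_usage(const int *usage_counts, size_t nbuffers,
                            size_t max_ones)
{
    size_t ring = 0, ones = 0;
    for (size_t i = 0; i < nbuffers; i++) {
        if (usage_counts[i] == 0)
            ring++;
        else if (usage_counts[i] == 1 && ones < max_ones) {
            ring++;
            ones++;
        }
    }
    return ring;
}
```

Note how this gets the feedback effect described above for free: buffers that survived the last scan with a raised usage_count are excluded, so a re-run of the same query naturally gets a ring sized around the pages that were not useful.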
I'm going to do this incrementally, and we'll see how far we get for
8.3. We might push 3A and/or 3B to 8.4. First, I'm going to finish up
Simon's patch (step 1), run some performance tests with vacuum, and
submit a patch for that. Then I'll move to Jeff's patch (step 2).
Thoughts? Everyone happy with the roadmap?
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki,
On 3A: In practice, the popular modern OS'es (BSD/Linux/Solaris/etc)
implement dynamic I/O caching. Experiments have shown that the benefit
of re-using the PG buffer cache on large sequential scans is vanishingly
small when the buffer cache size is small compared to the system memory.
Since this is a normal and recommended situation (OS I/O cache is
auto-tuning and easy to administer, etc), IMO the effort to optimize
buffer cache reuse for seq scans > 1 x buffer cache size is not
worthwhile.
On 3B: The scenario described is "multiple readers seq scanning large
table and sharing bufcache", but in practice this is not a common
situation. The common situation is "multiple queries joining several
small tables to one or more large tables that are >> 1 x bufcache". In
the common scenario, the dominant factor is the ability to keep the
small tables in bufcache (or I/O cache for that matter) while running
the I/O bound large table scans as fast as possible.
To that point - an important factor in achieving max I/O rate for large
tables (> 1 x bufcache) is avoiding the pollution of the CPU L2 cache.
This is commonly in the range of 512 KB to 2 MB, which is only important
when considering a bound on the size of the ring buffer. The effect has
been demonstrated to be significant - in the 20%+ range. Another thing
to consider is the use of readahead inside the heapscan, in which case
sizes >= 32KB are very effective.
We've implemented both ideas (ring buffer, readahead) and see very
significant improvements in I/O and query speeds on DSS workloads. I
would expect benefits on OLTP as well.
The modifications you suggest here may not have the following
properties:
- don't pollute bufcache for seqscan of tables > 1 x bufcache
- for tables > 1 x bufcache use a ringbuffer for I/O that is ~ 32KB to
minimize L2 cache pollution
- Luke
-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of
Heikki Linnakangas
Sent: Tuesday, May 08, 2007 3:40 AM
To: PostgreSQL-development
Cc: Jeff Davis; Simon Riggs
Subject: [HACKERS] Seq scans roadmap
[...]
Luke Lonergan wrote:
On 3A: In practice, the popular modern OS'es (BSD/Linux/Solaris/etc)
implement dynamic I/O caching. The experiments have shown that benefit
of re-using PG buffer cache on large sequential scans is vanishingly
small when the buffer cache size is small compared to the system memory.
Since this is a normal and recommended situation (OS I/O cache is
auto-tuning and easy to administer, etc), IMO the effort to optimize
buffer cache reuse for seq scans > 1 x buffer cache size is not
worthwhile.
That's interesting. Care to share the results of the experiments you
ran? I was thinking of running tests of my own with varying table sizes.
The main motivation here is to avoid the sudden drop in performance when
a table grows big enough not to fit in RAM. See attached diagram for
what I mean. Maybe you're right and the effect isn't that bad in practice.
I'm thinking of attacking 3B first anyway, because it seems much simpler
to implement.
On 3B: The scenario described is "multiple readers seq scanning large
table and sharing bufcache", but in practice this is not a common
situation. The common situation is "multiple queries joining several
small tables to one or more large tables that are >> 1 x bufcache". In
the common scenario, the dominant factor is the ability to keep the
small tables in bufcache (or I/O cache for that matter) while running
the I/O bound large table scans as fast as possible.
How is that different from what I described?
To that point - an important factor in achieving max I/O rate for large
tables (> 1 x bufcache) is avoiding the pollution of the CPU L2 cache.
This is commonly in the range of 512KB -> 2MB, which is only important
when considering a bound on the size of the ring buffer. The effect has
been demonstrated to be significant - in the 20%+ range. Another thing
to consider is the use of readahead inside the heapscan, in which case
sizes >= 32KB are very effective.
Yeah I remember the discussion on the L2 cache a while back.
What do you mean with using readahead inside the heapscan? Starting an
async read request?
The modifications you suggest here may not have the following
properties:
- don't pollute bufcache for seqscan of tables > 1 x bufcache
- for tables > 1 x bufcache use a ringbuffer for I/O that is ~ 32KB to
minimize L2 cache pollution
So the difference is that you don't want 3A (the take advantage of pages
already in buffer cache) strategy at all, and want the buffer ring
strategy to kick in earlier instead. Am I reading you correctly?
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki,
That's interesting. Care to share the results of the
experiments you ran? I was thinking of running tests of my
own with varying table sizes.
Yah - it may take a while - you might get there faster.
There are some interesting effects to look at between I/O cache
performance and PG bufcache, and at those speeds the only tool I've
found that actually measures scan rate in PG is VACUUM. "SELECT
COUNT(*)" measures CPU consumption in the aggregation node, not scan
rate.
Note that the copy from I/O cache to PG bufcache is where the L2 effect
is seen.
The main motivation here is to avoid the sudden drop in
performance when a table grows big enough not to fit in RAM.
See attached diagram for what I mean. Maybe you're right and
the effect isn't that bad in practice.
There are going to be two performance drops, first when the table
doesn't fit into PG bufcache, the second when it doesn't fit in bufcache
+ I/O cache. The second is severe, the first is almost insignificant
(for common queries).
How is that different from what I described?
My impression of your descriptions is that they overvalue the case where
there are multiple scanners of a large (> 1x bufcache) table such that
they can share the "first load" of the bufcache, e.g. your 10% benefit
for table = 10x bufcache argument. I think this is a non-common
workload, rather there are normally many small tables and several large
tables such that sharing the PG bufcache is irrelevant to the query
speed.
Yeah I remember the discussion on the L2 cache a while back.
What do you mean with using readahead inside the heapscan?
Starting an async read request?
Nope - just reading N buffers ahead for seqscans. Subsequent calls use
previously read pages. The objective is to issue contiguous reads to
the OS in sizes greater than the PG page size (which is much smaller
than what is needed for fast sequential I/O).
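The bookkeeping behind that kind of batched readahead can be sketched as follows. This only models when a new contiguous read must be issued; the actual I/O call is elided, and the constants and names are illustrative (32 pages of 8 KB giving 256 KB reads, matching the sizes discussed later in this thread).

```c
#include <assert.h>
#include <stdint.h>

#define BATCH_PAGES 32          /* 32 * 8 KB pages = 256 KB per read */

typedef struct {
    uint32_t batch_start;       /* first page of the current batch */
    uint32_t batch_filled;      /* pages valid in the batch (0 = none) */
} ReadAhead;

/* Returns 1 when the caller must issue a new contiguous read of
 * BATCH_PAGES pages starting at 'page'; returns 0 when the page is
 * already covered by the previously issued batch. */
int readahead_need_io(ReadAhead *ra, uint32_t page) {
    if (ra->batch_filled &&
        page >= ra->batch_start &&
        page < ra->batch_start + ra->batch_filled)
        return 0;
    ra->batch_start = page;
    ra->batch_filled = BATCH_PAGES;
    return 1;
}
```

With this scheme, only one page request in 32 reaches the OS as an I/O call, and that call is large enough to keep sequential bandwidth up.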
The modifications you suggest here may not have the following
properties:
- don't pollute bufcache for seqscan of tables > 1 x bufcache
- for tables > 1 x bufcache use a ringbuffer for I/O that is ~ 32KB to
minimize L2 cache pollution
So the difference is that you don't want 3A (the take
advantage of pages already in buffer cache) strategy at all,
and want the buffer ring strategy to kick in earlier instead.
Am I reading you correctly?
Yes, I think the ring buffer strategy should be used when the table size
is > 1 x bufcache and the ring buffer should be of a fixed size smaller
than L2 cache (32KB - 128KB seems to work well).
- Luke
Luke Lonergan wrote:
What do you mean with using readahead inside the heapscan?
Starting an async read request?
Nope - just reading N buffers ahead for seqscans. Subsequent calls use
previously read pages. The objective is to issue contiguous reads to
the OS in sizes greater than the PG page size (which is much smaller
than what is needed for fast sequential I/O).
Are you filling multiple buffers in the buffer cache with a single
read-call? The OS should be doing readahead for us anyway, so I don't
see how just issuing multiple ReadBuffers one after each other helps.
Yes, I think the ring buffer strategy should be used when the table size
is > 1 x bufcache and the ring buffer should be of a fixed size smaller
than L2 cache (32KB - 128KB seems to work well).
I think we want to let the ring grow larger than that for updating
transactions and vacuums, though, to avoid the WAL flush problem.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Nope - just reading N buffers ahead for seqscans. Subsequent
calls use previously read pages. The objective is to issue
contiguous reads to the OS in sizes greater than the PG page
size (which is much smaller than what is needed for fast
sequential I/O).
Problem here is that either you issue the large read into throwaway
private memory and hope that when you later read 8k you get the page
from OS buffercache, or you need ScatterGather IO and a way to grab 32
buffers at once.
Yes, I think the ring buffer strategy should be used when the
table size is > 1 x bufcache and the ring buffer should be of
a fixed size smaller than L2 cache (32KB - 128KB seems to work well).
How do you merge those two objectives? It seems the ring buffer needs to
be at least as large as the contiguous read size.
Thus you would need at least a 256k ring buffer. Better yet, make the
ring buffer twice the IO size, so two sessions can alternately take the
lead while the other session still holds a previous page. Modern L2
caches are 8 MB, so 512k seems no problem?
Andreas
What do you mean with using readahead inside the heapscan?
Starting an async read request?
Nope - just reading N buffers ahead for seqscans. Subsequent calls
use previously read pages. The objective is to issue contiguous reads
to the OS in sizes greater than the PG page size (which is much
smaller than what is needed for fast sequential I/O).
Are you filling multiple buffers in the buffer cache with a
single read-call?
yes, needs vector or ScatterGather IO.
The OS should be doing readahead for us
anyway, so I don't see how just issuing multiple ReadBuffers
one after each other helps.
Last time I looked, OS readahead was only comparable to 32k blocked
reads. 256k blocked reads still perform way better. Also, when the OS
is confronted with an IO storm, the 256k reads perform way better than
OS readahead.
Andreas
"Zeugswetter Andreas ADI SD" <ZeugswetterA@spardat.at> writes:
Are you filling multiple buffers in the buffer cache with a
single read-call?
yes, needs vector or ScatterGather IO.
I would expect that to get only moderate improvement. To get the full benefit
I would think you would want to either fire off a separate thread to do the
read-ahead, use libaio, or funnel the read-ahead requests to a separate thread
like our bgwriter only it would be a bgreader or something like that.
The OS should be doing readahead for us
anyway, so I don't see how just issuing multiple ReadBuffers
one after each other helps.
Last time I looked OS readahead was only comparable to 32k blocked reads.
256k blocked reads still perform way better. Also when the OS is confronted
with an IO storm the 256k reads perform way better than OS readahead.
Well that's going to depend on the OS. Last I checked Linux's readahead logic
is pretty straightforward and doesn't try to do any better than 32k readahead
and is easily fooled. However I wouldn't be surprised if that's changed.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
On Tue, 2007-05-08 at 11:40 +0100, Heikki Linnakangas wrote:
I'm going to do this incrementally, and we'll see how far we get for
8.3. We might push 3A and/or 3B to 8.4. First, I'm going to finish up
Simon's patch (step 1), run some performance tests with vacuum, and
submit a patch for that. Then I'll move to Jeff's patch (step 2).
I think it's best to postpone 3A (one aspect of my patch). There are a
couple reasons:
1) The results didn't show enough benefit yet.
2) Complex interactions with Simon's patch.
I think it's an area of research for the future, but for now I just want
to concentrate on the most important aspect of my patch: the
synchronized scanning ( #2 in your list ).
Regards,
Jeff Davis
On Tue, 2007-05-08 at 07:47 -0400, Luke Lonergan wrote:
Heikki,
On 3A: In practice, the popular modern OS'es (BSD/Linux/Solaris/etc)
implement dynamic I/O caching. The experiments have shown that benefit
of re-using PG buffer cache on large sequential scans is vanishingly
small when the buffer cache size is small compared to the system memory.
Since this is a normal and recommended situation (OS I/O cache is
auto-tuning and easy to administer, etc), IMO the effort to optimize
buffer cache reuse for seq scans > 1 x buffer cache size is not
worthwhile.
I think 3A is still an interesting idea, but I agree that it needs some
results to back it up. Let's save 3A for the next release so that we
have more time to see.
To that point - an important factor in achieving max I/O rate for large
tables (> 1 x bufcache) is avoiding the pollution of the CPU L2 cache.
This is commonly in the range of 512KB -> 2MB, which is only important
when considering a bound on the size of the ring buffer. The effect has
been demonstrated to be significant - in the 20%+ range. Another thing
to consider is the use of readahead inside the heapscan, in which case
sizes >= 32KB are very effective.
One thing I'd like us to keep in mind is to have a reasonable number of
buffers active for a sequential scan. If the number is too small, my
sync scans might not even work. Right now my patch only reports every 16
pages, so 32KB (=4 pages) is not going to work for sync scans (I suppose
only testing will tell).
Regards,
Jeff Davis
Are you filling multiple buffers in the buffer cache with a single
read-call?
yes, needs vector or ScatterGather IO.
I would expect that to get only moderate improvement.
The vast improvement comes from 256k blocksize.
To get
the full benefit I would think you would want to either fire
off a separate thread to do the read-ahead, use libaio, or
funnel the read-ahead requests to a separate thread like our
bgwriter only it would be a bgreader or something like that.
I like bgreader :-) But that looks even more difficult than grabbing 32
[scattered or contiguous] buffers at once.
Especially in a situation where there is no concurrent load it would be
nice to do CPU work while waiting for the next read ahead IO. If there
is enough parallel CPU load, it is actually not so important. So I
expect that on a high-load server you get nearly all of the benefit
without any sort of aio.
The OS should be doing readahead for us anyway, so I don't see how
just issuing multiple ReadBuffers one after each other helps.
Last time I looked OS readahead was only comparable to 32k blocked
reads. 256k blocked reads still perform way better. Also when the OS is
confronted with an IO storm the 256k reads perform way better than
OS readahead.
Well that's going to depend on the OS. Last I checked Linux's
readahead logic is pretty straightforward and doesn't try to
do any better than 32k readahead and is easily fooled.
However I wouldn't be surprised if that's changed.
My test was on AIX, 32 or 64k seem quite common, at least as default
setting.
Also on some OS's (like HPUX) OS readahead and writebehind strategy
changes with large IO blocksizes, imho beneficially.
Andreas
On Tue, 2007-05-08 at 11:40 +0100, Heikki Linnakangas wrote:
Here's my roadmap for the "scan-resistant buffer cache" and
"synchronized scans" patches.
1. Fix the current vacuum behavior of throwing dirty buffers to the
freelist, forcing a lot of WAL flushes. Instead, use a backend-private
ring of shared buffers that are recycled. This is what Simon's
"scan-resistant buffer manager" did.
The theory here is that if a page is read in by vacuum, it's unlikely to
be accessed in the near future, therefore it should be recycled. If
vacuum doesn't dirty the page, it's best to reuse the buffer immediately
for the next page. However, if the buffer is dirty (and not just because
we set hint bits), we ought to delay writing it to disk until the
corresponding WAL record has been flushed to disk.
Simon's patch used a fixed size ring of buffers that are recycled, but I
think the ring should be dynamically sized. Start with a small ring, and
whenever you need to do a WAL flush to write a dirty buffer, increase
the ring size. On every full iteration through the ring, decrease its
size to trim down an unnecessarily large ring.
This only alters the behavior of vacuums, and it's pretty safe to say it
won't get worse than what we have now.
I think that's too much code; why not just leave it as it is? Would a
dynamic buffer be substantially better? If not, why bother?
In the future, we can use the
buffer ring for seqscans as well; more on that on step 3.
There was clear benefit for that. You sound like you are suggesting to
remove the behaviour for Seq Scans, which wouldn't make much sense??
[...]
As you may notice, 3A and 3B are at odds with each other. We can
implement both, but you can't use both strategies in the same scan.
Not sure I've seen any evidence of that.
Most scans will be solo and so should use the ring buffer, since there
is clear evidence of that. If there were evidence to suggest the two
patches conflict then we should turn off the ring buffer only when
concurrent scans are in progress (while remembering that concurrent
scans will not typically overlap as much as the synch scan tests show
and so for much of their execution they too will be solo).
Therefore we need to have decision logic of some kind to figure out
which strategy is optimal.
A simple heuristic is to decide based on the table size:
< 0.1 * shared_buffers -> start from page 0, keep in cache
(like we do now)
< 5 * shared_buffers -> start from last read page, keep in cache
> 5 * shared_buffers -> start from last read page, use buffer ring
I'm not sure about the constants, we might need to make them GUC
variables as Simon argued, but that would be the general approach.
If you want to hardcode it, I'd say use the ring buffer on scans of 1000
blocks or more, or we have a GUC. Sizing things to shared_buffers isn't
appropriate because of the effects of partitioning, as I argued in my
last post, which I think is still valid.
Thoughts? Everyone happy with the roadmap?
I think separating the patches is now the best way forward, though both
are good.
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
Hi,
In reference to the seq scans roadmap, I have just submitted a patch
that addresses some of the concerns.
The patch does this:
1. for small relations (smaller than 60% of the bufferpool), use the
current logic
2. for big relations:
- use a ring buffer in heap scan
- pin the first 12 pages when the scan starts
- on consumption of every 4 pages, read and pin the next 4 pages
- invalidate used pages of the scan so they do not force out
other useful pages
4 files changed:
bufmgr.c, bufmgr.h, heapam.c, relscan.h
If there is interest, I can submit another scan patch that returns
N tuples at a time, instead of the current one-at-a-time interface. This
improves code locality and further improves performance by another
10-20%.
For TPCH 1G tables, we are seeing more than 20% improvement in scans
on the same hardware.
-------------------------------------------------------------------------
----- PATCHED VERSION
-------------------------------------------------------------------------
gptest=# select count(*) from lineitem;
count
---------
6001215
(1 row)
Time: 2117.025 ms
-------------------------------------------------------------------------
----- ORIGINAL CVS HEAD VERSION
-------------------------------------------------------------------------
gptest=# select count(*) from lineitem;
count
---------
6001215
(1 row)
Time: 2722.441 ms
Suggestions for improvement are welcome.
Regards,
-cktan
Greenplum, Inc.
On May 8, 2007, at 5:57 AM, Heikki Linnakangas wrote:
Luke Lonergan wrote:

What do you mean with using readahead inside the heapscan?
Starting an async read request?

Nope - just reading N buffers ahead for seqscans. Subsequent calls use
previously read pages. The objective is to issue contiguous reads to
the OS in sizes greater than the PG page size (which is much smaller
than what is needed for fast sequential I/O).

Are you filling multiple buffers in the buffer cache with a single
read-call? The OS should be doing readahead for us anyway, so I don't
see how just issuing multiple ReadBuffers one after each other helps.

Yes, I think the ring buffer strategy should be used when the table
size is > 1 x bufcache and the ring buffer should be of a fixed size
smaller than L2 cache (32KB - 128KB seems to work well).

I think we want to let the ring grow larger than that for updating
transactions and vacuums, though, to avoid the WAL flush problem.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend
In reference to the seq scans roadmap, I have just submitted a patch
that addresses some of the concerns. The patch does this:
1. for small relation (smaller than 60% of bufferpool), use the
current logic
2. for big relation:
- use a ring buffer in heap scan
- pin first 12 pages when scan starts
- on consumption of every 4 pages, read and pin the next 4 pages
- invalidate used pages in the scan so they do not force out other
useful pages
A few comments regarding the effects:
I do not see how this speedup could be caused by readahead, so what are
the effects? (It should make no difference to do the CPU work for
count(*) in between reading each block when the pages are not dirtied.)
Is the improvement solely reduced CPU because no search for a free
buffer is needed, and/or L2 cache locality?
What effect does the advance pinning have, avoiding vacuum?
A 16 x 8k page ring is too small to allow the needed IO blocksize of
256k. The readahead is done 4 x one page at a time (= 32k).
What is the reasoning behind 1/4 ring for readahead (why not 1/2)? Is
3/4 the trail for followers and bgwriter?
I think, in anticipation of doing a single IO call for more than one
page, the KillAndReadBuffer function should be split into two parts:
one that does the killing for n pages, and one that does the reading
for n pages. Killing n before reading n would also have the positive
effect of grouping perhaps-needed writes (not interleaving them with
the reads).
I think the 60% Nbuffers is a very good starting point. I would only
introduce a GUC when we see evidence that it is needed (I agree with
Simon's partitioning comments, but I'd still wait and see).
Andreas
Also, that patch doesn't address the VACUUM issue at all. And
using a small fixed size ring with scans that do updates can
be devastating. I'm experimenting with different ring sizes
for COPY at the moment. Too small ring leads to a lot of WAL
flushes, it's basically the same problem we have with VACUUM
in CVS HEAD.
My first take on that would be to simply abandon any dirty (and actually
also any still pinned) buffer from the ring and replace the ring slot
with a buffer from the freelist.
If the freelist is empty and LSN allows writing the buffer, write it
(and maybe try to group these).
If the LSN does not allow the write, replace the slot with a buffer from
LRU.
Andreas
Zeugswetter Andreas ADI SD wrote:
In reference to the seq scans roadmap, I have just submitted a patch
that addresses some of the concerns. The patch does this:
1. for small relation (smaller than 60% of bufferpool), use the
current logic
2. for big relation:
- use a ring buffer in heap scan
- pin first 12 pages when scan starts
- on consumption of every 4 pages, read and pin the next 4 pages
- invalidate used pages in the scan so they do not force out other
useful pages

A few comments regarding the effects:
I do not see how this speedup could be caused by readahead, so what are
the effects?
I was wondering that as well. We'd really need to test all the changes
separately to see where the improvements are really coming from.
Also, that patch doesn't address the VACUUM issue at all. And using a
small fixed size ring with scans that do updates can be devastating. I'm
experimenting with different ring sizes for COPY at the moment. Too
small ring leads to a lot of WAL flushes, it's basically the same
problem we have with VACUUM in CVS HEAD.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Zeugswetter Andreas ADI SD wrote:
Also, that patch doesn't address the VACUUM issue at all. And using a
small fixed size ring with scans that do updates can be devastating.
I'm experimenting with different ring sizes for COPY at the moment. Too
small a ring leads to a lot of WAL flushes; it's basically the same
problem we have with VACUUM in CVS HEAD.

My first take on that would be to simply abandon any dirty (and
actually also any still pinned) buffer from the ring and replace the
ring slot with a buffer from the freelist.
If the freelist is empty and LSN allows writing the buffer, write it
(and maybe try to group these).
If the LSN does not allow the write, replace the slot with a buffer from
LRU.
That would effectively disable the ring for COPY and the 2nd phase of
VACUUM.
One problem with looking at the LSN is that you need the content lock to
read it, and I wouldn't want to add any new locking. It could be done
inside FlushBuffer when we hold the lock anyway, but I'm afraid the
changes would be pretty invasive.
I'm struggling to get a grip of what the optimal ring size is under
various circumstances. Some thoughts I have this far:
- a small ring gives better L2 cache behavior
- for read-only queries, and for queries that just set hint bits, 1
buffer is enough
- a small ring with a query that writes WAL (COPY, mass updates, 2nd
phase of VACUUM) leads to a lot of WAL flushes, which can become a
bottleneck
But all these assumptions need to be validated. I'm setting up tests
with different ring sizes and queries to get a clear picture of this:
- VACUUM on a clean table
- VACUUM on a table with 1 dead tuple per page
- read-only scan, large table
- read-only scan, table fits in OS cache
- COPY
In addition, I'm going to run VACUUM in a DBT-2 test to see the effect
on other queries running concurrently.
I think a ring that grows when WAL flushes occur covers all the use
cases reasonably well, but I need to do the testing...
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote:
However, it caught me by total surprise that the performance with 1
buffer is so horrible. Using 2 buffers is enough to avoid whatever the
issue is with just 1 buffer. I have no idea what's causing that. There
must be some interaction that I don't understand.
Ok, I found the reason for that. I was using this query for the selects:
SELECT COUNT(*) FROM (SELECT 1 FROM stock_copytest LIMIT 10000000) AS a;
Stock_copytest is larger than RAM size, that's why I used the LIMIT to
make the result set memory resident. That had the side effect that
apparently the limit-node kept the single buffer pinned which defeated
the buffer ring completely. To avoid issues like that we apparently want
to use 2-4 buffers instead of just 1.
I'll review my test methodology and keep testing...
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote:
But all these assumptions need to be validated. I'm setting up tests
with different ring sizes and queries to get a clear picture of this:
- VACUUM on a clean table
- VACUUM on a table with 1 dead tuple per page
- read-only scan, large table
- read-only scan, table fits in OS cache
- COPY
Just to keep you guys informed, here's my results on a read-only scan on
a table bigger than shared_buffers but smaller than RAM:
select-1 | 00:00:10.853831
select-1 | 00:00:10.380667
select-1 | 00:00:11.530528
select-2 | 00:00:08.634105
select-2 | 00:00:02.674084
select-4 | 00:00:02.65664
select-8 | 00:00:02.662922
select-16 | 00:00:02.682475
select-32 | 00:00:02.693163
select-64 | 00:00:02.722031
select-128 | 00:00:02.873645
select-256 | 00:00:03.185586
select-512 | 00:00:03.534285
select-1024 | 00:00:03.741867
lshw utility tells me that this server has 32KB of L1 cache and 4MB of
L2 cache. The performance starts to drop between 64-128 buffers, which
is 512 - 1024 KB, so I'm not sure how it's related to cache size but
using a small number of buffers is clearly better than using a large number.
However, it caught me by total surprise that the performance with 1
buffer is so horrible. Using 2 buffers is enough to avoid whatever the
issue is with just 1 buffer. I have no idea what's causing that. There
must be some interaction that I don't understand.
All the numbers are quite repeatable, I ran the same test script many
times. The runtime of the first select-2 test however varied between
3-10 seconds, somehow the bad karma from using just 1 buffer in the
earlier test carries over to the next test.
I'm not sure what to think about this, but I'll set up more test
scenarios with VACUUM and COPY.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
The patch has no effect on scans that do updates. The
KillAndReadBuffer routine does not force out a buffer if the dirty
bit is set. So updated pages revert to the current performance
characteristics.
-cktan
GreenPlum, Inc.
On May 10, 2007, at 5:22 AM, Heikki Linnakangas wrote:
Zeugswetter Andreas ADI SD wrote:

In reference to the seq scans roadmap, I have just submitted a patch
that addresses some of the concerns. The patch does this:
1. for small relation (smaller than 60% of bufferpool), use the
current logic
2. for big relation:
- use a ring buffer in heap scan
- pin first 12 pages when scan starts
- on consumption of every 4 pages, read and pin the next 4 pages
- invalidate used pages in the scan so they do not force out other
useful pages

A few comments regarding the effects:
I do not see how this speedup could be caused by readahead, so what are
the effects?

I was wondering that as well. We'd really need to test all the changes
separately to see where the improvements are really coming from.

Also, that patch doesn't address the VACUUM issue at all. And using a
small fixed size ring with scans that do updates can be devastating.
I'm experimenting with different ring sizes for COPY at the moment. Too
small a ring leads to a lot of WAL flushes; it's basically the same
problem we have with VACUUM in CVS HEAD.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com