Experimental ARC implementation
Attached is a first trial implementation of the Adaptive Replacement
Cache (ARC). The patch is against CVS HEAD of 7.4.
The algorithm is based on what's described in these papers:
http://www.almaden.ibm.com/StorageSystems/autonomic_storage/ARC/arcfast.pdf
http://www.almaden.ibm.com/StorageSystems/autonomic_storage/ARC/oneup.pdf
kindly pointed to by Cuong Bui a few days ago.
To work with PostgreSQL and to mostly keep our separation between the
buffer manager and the cache replacement policy, I implemented it as a
set of (hopefully) more strategy-independent, separate calls that mostly
replace freelist.c. The main reason the code differs visibly from the
pseudocode found in oneup.pdf is that it needs to be concurrency safe.
In particular, the underlying list structures can be changed by other
backends during IO, and that happens halfway through the algorithm.
The major change to the existing freelist is an additional layer of
indirection. The original buffer descriptors now only hold a cache page
and its clean/dirty information and such. The new layer is the cache
directory, whose size is different from the number of data blocks in the
buffer cache.
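As a rough illustration of that indirection (a sketch only; the actual
struct and field names in the patch differ), a directory entry carries
the ARC list membership while the buffer descriptor keeps the data page,
and the ghost lists B1/B2 remember evicted tags without holding data:

```c
#include <assert.h>

/* The four ARC queues from arcfast.pdf/oneup.pdf. */
typedef enum
{
    STRAT_LIST_T1,  /* recency list: pages seen once, still cached   */
    STRAT_LIST_T2,  /* frequency list: pages seen at least twice     */
    STRAT_LIST_B1,  /* ghost list: recently evicted from T1, no data */
    STRAT_LIST_B2   /* ghost list: recently evicted from T2, no data */
} StratList;

typedef struct
{
    int        buf_id;  /* index into the buffer pool, or -1 for a
                         * ghost entry that keeps only the page tag   */
    StratList  list;    /* which of the four ARC queues we are on     */
} CacheDirEntry;

/* The directory is larger than the buffer pool: T1+T2 hold at most
 * nbuffers real pages, while B1+B2 remember as many evicted tags again. */
int
dir_entries_needed(int nbuffers)
{
    return 2 * nbuffers;
}
```

With shared_buffers = 8192 this gives a 16384-entry directory, matching
the later remark in the thread that the buffer count is half the
directory size.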
With a TPC-C like transaction profile and a buffer cache size that
allows 10-15% of the database to stay resident, I am not able to see any
dramatic change in behaviour or performance. At most the difference
could be called a slight relief. But that transaction profile is not a
good mix compared to any real world application anyway, so I wouldn't
read too much into that result. It's also possible that I made some
mistakes in the implementation, so keep a critical eye on that.
When playing with it please keep a few facts in mind. A 64MB cache
(shared_buffers=8192) needs about 10 minutes of full transaction load to
populate; in benchmark terms this is called ramp-up time. The algorithm
is aiming at scan resistance, so a pure index access approach will not
show any significant changes. Therefore, don't tell me that a 1x scale,
10 client by 1000 transactions pgbench run doesn't show any performance
boost; it cannot.
I will follow up shortly with an approach that integrates Tom's delay
mechanism plus my first READ_BY_VACUUM hack into one combined experiment.
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
Attachment: cache_strategy_ARC.74.diff (+741 -627)
Jan Wieck wrote:
I will follow up shortly with an approach that integrates Tom's delay
mechanism plus my first READ_BY_VACUUM hack into one combined experiment.
Okay,
the attached patch contains the three changes already discussed and one
additional one. I also made a few adjustments:
1) ARC policy. Has no configurable options, as it is fully self-tuning.
2) vacuum_page_delay. I changed the algorithm so it does not usleep() per
page. The usleep() of vacuum_page_delay milliseconds is now done every
vacuum_page_groupsize pages. This makes sense especially if one can only
tune sleeps in 10 ms intervals.
3) Pages faulted in by VACUUM are placed onto the T1 head (LRU position)
in the ARC strategy. Thereby vacuum is even more conservative about
cache pollution than a sequential scan now.
4) A new config option lazy_checkpoint_time causes automatic timeout
controlled checkpoints (the background ones done by the postmaster, and
only those) to spread the BufferSync() out over the specified number of
seconds. This activity does not include the smgrsync(), which is what
actually causes the kernel to force the stuff out to disk. But it gives
the kernel time to do something already, thereby possibly shortening the
IO burst caused by smgrsync().
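A minimal sketch of the grouped delay from point 2 (the function and
variable names here are illustrative, not the patch's actual
identifiers): instead of trying to sleep a few milliseconds per page,
which a 10 ms scheduler tick cannot honor, VACUUM naps once per group of
pages.

```c
#include <assert.h>

static int vacuum_page_delay = 100;      /* ms to sleep per group        */
static int vacuum_page_groupsize = 25;   /* pages processed between naps */
static int vacuum_pages_since_nap = 0;

/* Called once per page processed by VACUUM; returns 1 when it napped.
 * Sleeping 100 ms every 25 pages is far more accurate on a 10 ms tick
 * than attempting a 4 ms sleep after every single page. */
int
vacuum_delay_point(void)
{
    if (vacuum_page_delay <= 0)
        return 0;
    if (++vacuum_pages_since_nap >= vacuum_page_groupsize)
    {
        vacuum_pages_since_nap = 0;
        /* usleep(vacuum_page_delay * 1000L);  -- nap here */
        return 1;
    }
    return 0;
}
```

Over 100 pages this naps 4 times for 400 ms total, the same average rate
as 4 ms per page but with sleeps the scheduler can actually deliver.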
I started with
vacuum_page_delay = 100
vacuum_page_groupsize = 25
checkpoint_timeout = 600
lazy_checkpoint_time = 300
and the system runs a lot smoother than before. The only performance
drop not yet addressed occurs after all buffers have been flushed and
the final sync is done during the checkpoint. And I don't have any idea
yet how to tackle that one.
Comments/Feedback?
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
Jan Wieck wrote:
Jan Wieck wrote:
I will follow up shortly with an approach that integrates Tom's delay
mechanism plus my first READ_BY_VACUUM hack into one combined experiment.
Okay,
the attached patch contains the three changes already discussed and one
additional one.
Ooopsy
the B1/B2 queue length adjustment in that one was totally nonsense. This
one behaves much better.
I added a DEBUG1 elog every 10 seconds to monitor the cache hit rates
and cache size adjustments. It's pretty neat to watch how it responds
when you run an OLTP kind of load, then issue VACUUM and run big
sequential reporting suckers in parallel.
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
Jan Wieck wrote:
Jan Wieck wrote:
Jan Wieck wrote:
I will follow up shortly with an approach that integrates Tom's delay
mechanism plus my first READ_BY_VACUUM hack into one combined experiment.
Okay,
the attached patch contains the three changes already discussed and one
additional one.
Ooopsy
Ooops-2
but I'm getting closer.
I guess I polluted the list enough. The latest patch is now here:
http://developer.postgresql.org/~wieck/all_performance.v3.74.diff.gz
This one now correctly keeps T1len+B1len at about the number of buffers,
which is half the directory size. The former versions favored T1 too much.
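For reference, the list-length invariants the ARC papers require, and
that this version now holds, can be written down as a simple check (an
illustrative sketch, not code from the patch; c is the number of
buffers):

```c
#include <assert.h>

/* ARC invariants with c buffers and a 2c-entry directory:
 *   |T1| + |T2| <= c        (only c real pages exist)
 *   |T1| + |B1| <= c        (the recency side must not overgrow)
 *   |T1|+|T2|+|B1|+|B2| <= 2c   (the directory itself)      */
int
arc_sizes_ok(int t1, int t2, int b1, int b2, int c)
{
    if (t1 + t2 > c)
        return 0;
    if (t1 + b1 > c)
        return 0;
    if (t1 + t2 + b1 + b2 > 2 * c)
        return 0;
    return 1;
}
```

The earlier versions violated the second line by letting T1 grow at
B1's expense, which is what the "favored T1 too much" remark refers to.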
It also contains the beginnings of the discussed background buffer
writer. Thus far, the BufferSync() done at a checkpoint writes out all
dirty blocks in their LRU order, spread over a configurable time
(lazy_checkpoint_time in seconds). But that means at least that, while
the checkpoint is running, the backends should not need to flush dirty
buffers themselves, since all the candidates they get for replacement
are clean. My plan is to create another background process very similar
to the checkpointer and to let it run forever, basically looping over
that BufferSync() with a flag telling it that it's the bg_writer.
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
My plan is to create another background process very similar to the
checkpointer and to let it run forever, basically looping over that
BufferSync() with a flag telling it that it's the bg_writer.
Why not use the checkpointer itself in between checkpoints? Use a min
and a max dirty setting like Informix: start writing when more than max
are dirty, stop when at min. This avoids writing single pages (which is
slow, since they cannot be grouped together by the OS).
Andreas
Zeugswetter Andreas SB SD wrote:
My plan is to create another background process very similar to the
checkpointer and to let it run forever, basically looping over that
BufferSync() with a flag telling it that it's the bg_writer.
Why not use the checkpointer itself in between checkpoints? Use a min
and a max dirty setting like Informix: start writing when more than max
are dirty, stop when at min. This avoids writing single pages (which is
slow, since they cannot be grouped together by the OS).
The current approach is similar ... if I stretch the IO and syncing over
the entire 150-300 second checkpoint interval, grouping 50 pages then
sync()+nap, the system purrs along pretty nicely and without any peaks.
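To illustrate the arithmetic of that stretching (a sketch under assumed
names; the patch's real pacing code differs): given the number of dirty
buffers, the group size, and lazy_checkpoint_time, the nap per group
falls out directly.

```c
#include <assert.h>

/* How many milliseconds to nap after each group of writes so that the
 * whole BufferSync() pass takes roughly lazy_checkpoint_time seconds. */
int
napmsec_per_group(int ndirty, int groupsize, int lazy_checkpoint_time)
{
    int ngroups = (ndirty + groupsize - 1) / groupsize;  /* ceiling */

    if (ngroups <= 0)
        return 0;                    /* nothing dirty, nothing to pace */
    return (lazy_checkpoint_time * 1000) / ngroups;
}
```

With 8192 dirty buffers, groups of 50, and a 300 second window this
comes out to naps of roughly 1.8 seconds between 50-page bursts, which
matches the smooth behaviour described above.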
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
Why not use the checkpointer itself in between checkpoints? Use a min
and a max dirty setting like Informix: start writing when more than max
are dirty, stop when at min. This avoids writing single pages (which is
slow, since they cannot be grouped together by the OS).
The current approach is similar ... if I stretch the IO and syncing over
the entire 150-300 second checkpoint interval, grouping 50 pages then
sync()+nap, the system purrs along pretty nicely and without any peaks.
But how do you handle a write-IO-bound system then? My thought was to
let the checkpointer write dirty pages in between checkpoints with a
min/max, but still try to do the checkpoint as fast as possible. I don't
think stretching the checkpoint is good, since it needs to write hot
pages, which the in-between IO should avoid doing. The checkpointer
would have two tasks that it handles alternately: checkpoint, or flush
LRU from max to min.
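The min/max scheme Andreas describes can be sketched as a two-state
machine (purely illustrative; names and thresholds are made up, and
Informix's actual implementation is of course not shown here):

```c
#include <assert.h>

typedef enum { WRITER_IDLE, WRITER_FLUSHING } WriterState;

/* One decision step of an Informix-style dirty-page watermark writer:
 * start flushing once ndirty exceeds max_dirty, keep flushing until it
 * has fallen back to min_dirty, so writes always come in large batches
 * the OS can group, never as single pages. */
WriterState
watermark_step(WriterState state, int ndirty, int min_dirty, int max_dirty)
{
    if (state == WRITER_IDLE && ndirty > max_dirty)
        return WRITER_FLUSHING;
    if (state == WRITER_FLUSHING && ndirty <= min_dirty)
        return WRITER_IDLE;
    return state;               /* hysteresis: no change in between */
}
```

The gap between min and max is the hysteresis that guarantees each
flushing episode covers at least max - min pages.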
Andreas
Zeugswetter Andreas SB SD wrote:
Why not use the checkpointer itself in between checkpoints? Use a min
and a max dirty setting like Informix: start writing when more than max
are dirty, stop when at min. This avoids writing single pages (which is
slow, since they cannot be grouped together by the OS).
The current approach is similar ... if I stretch the IO and syncing over
the entire 150-300 second checkpoint interval, grouping 50 pages then
sync()+nap, the system purrs along pretty nicely and without any peaks.
But how do you handle a write-IO-bound system then? My thought was to
let the checkpointer write dirty pages in between checkpoints with a
min/max, but still try to do the checkpoint as fast as possible. I don't
think stretching the checkpoint is good, since it needs to write hot
pages, which the in-between IO should avoid doing. The checkpointer
would have two tasks that it handles alternately: checkpoint, or flush
LRU from max to min.
Andreas
By actually moving a lot of the IO work into the checkpointer. It asks
the buffer strategy for the order in which dirty blocks would currently
get evicted from the cache, and the checkpointer then flushes them in
that order. Your "hot pages" will be found at the end of that list and
thus flushed last in the checkpoint, which is also why it's good to keep
them dirty longer.
The problem with the checkpointer flushing as fast as possible is that
the entire system literally freezes. In my tests I use something that
resembles the transaction profile of a TPC-C, including the thinking and
keying times. Those are important because they make the load very
realistic. A stock 7.4RC1 handles a properly scaled DB with new_order
response times of 0.2 to 1.5 seconds, but when the checkpoint occurs, it
can't keep up and the response times go up to anything between 20-60
seconds. What makes the situation worse is that in the meantime, all
simulated terminals hit the "send" button again, which leads to a
transaction pileup right during the checkpoint. It takes a while until
the system recovers from that.
If the system is write-bound, the checkpointer will find so many dirty
blocks that it has no time to nap and will burst them out as fast as
possible anyway. Well, at least that's the theory.
PostgreSQL with the non-overwriting storage concept can never have
hot-written pages for a long time anyway, can it? They fill up and cool
down until vacuum.
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
Jan Wieck wrote:
It also contains the beginnings of the discussed background buffer
writer. Thus far, the BufferSync() done at a checkpoint writes out all
dirty blocks in their LRU order, spread over a configurable time
(lazy_checkpoint_time in seconds). But that means at least that, while
the checkpoint is running, the backends should not need to flush dirty
buffers themselves, since all the candidates they get for replacement
are clean. My plan is to create another background process very similar
to the checkpointer and to let it run forever, basically looping over
that BufferSync() with a flag telling it that it's the bg_writer.
Have you considered having the background writer check the pages it is
about to write to see if they can be added to the FSM, thereby reducing
the need for vacuum? Seems we would need to add a statistics parameter
so pg_autovacuum would know how many tuples the background writer added
to the freespace map, so it doesn't vacuum a table that doesn't need it.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Jan Wieck wrote:
If the system is write-bound, the checkpointer will find so many dirty
blocks that it has no time to nap and will burst them out as fast as
possible anyway. Well, at least that's the theory.
PostgreSQL with the non-overwriting storage concept can never have
hot-written pages for a long time anyway, can it? They fill up and cool
down until vacuum.
Another idea on removing sync() --- if we are going to use fsync() on
each file during checkpoint (open, fsync, close), seems we could keep a
hash of written block dbid/relfilenode pairs and cycle through that on
checkpoint. We could keep the hash in shared memory, and dump it to a
backing store when it gets full, or just have it exist in buffer writer
process memory (so it can grow) and have backends that do their own
buffer writes all open a single file in append mode and write the pairs
to the file, or something like that, and the checkpoint process can read
from there.
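A toy sketch of that bookkeeping (illustrative only; a real version
would use a shared hash table and handle overflow as described above,
and the names here are made up): record the (dbid, relfilenode) pair of
every written block, deduplicated, so the checkpoint can open and
fsync each touched file exactly once instead of calling sync().

```c
#include <assert.h>

typedef struct
{
    unsigned int dbid;
    unsigned int relfilenode;
} PendingSync;

/* Note one written block's underlying file; returns the new count of
 * distinct files. The linear scan stands in for the proposed hash. */
int
pending_sync_add(PendingSync *tab, int n,
                 unsigned int dbid, unsigned int rfn)
{
    for (int i = 0; i < n; i++)
        if (tab[i].dbid == dbid && tab[i].relfilenode == rfn)
            return n;           /* already queued for fsync */
    tab[n].dbid = dbid;
    tab[n].relfilenode = rfn;
    return n + 1;
}
```

At checkpoint time the table would be walked once, doing
open/fsync/close per entry, then cleared for the next cycle.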
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Hey - now that we have a branch, is Bruce going to start committing the
pgpatches2 list?
Chris
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Have you considered having the background writer check the pages it is
about to write to see if they can be added to the FSM, thereby reducing
the need for vacuum?
The 7.4 rewrite of FSM depends on the assumption that all the free space
in a given relation is reported to FSM in one batch (ie, at the end of a
VACUUM pass). This solves problems in both speed (page-at-a-time update
of FSM was horrendously expensive) and space allocation policy (we now
use the number of "interesting" pages in each relation as input
information for the allocation policy). To do anything like the above,
you'd need to find different solutions to these problems.
regards, tom lane
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Have you considered having the background writer check the pages it is
about to write to see if they can be added to the FSM, thereby reducing
the need for vacuum?
The 7.4 rewrite of FSM depends on the assumption that all the free space
in a given relation is reported to FSM in one batch (ie, at the end of a
VACUUM pass). This solves problems in both speed (page-at-a-time update
of FSM was horrendously expensive) and space allocation policy (we now
use the number of "interesting" pages in each relation as input
information for the allocation policy). To do anything like the above,
you'd need to find different solutions to these problems.
Yea, shame. I never liked sequentially scanning a huge table just to
find the few free tuples.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Christopher Kings-Lynne wrote:
Hey - now that we have a branch, is Bruce going to start committing the
pgpatches2 list?
Yes, once my email backlog is cleared --- probably early next week.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Have you considered having the background writer check the pages it is
about to write to see if they can be added to the FSM, thereby reducing
the need for vacuum? Seems we would need to add a statistics parameter
so pg_autovacuum would know how many tuples the background writer added
to the freespace map, so it doesn't vacuum a table that doesn't need it.
This would suffer from the previously mentioned problem of having to pull in
index pages and dirty them when it's trying to flush and clean pages.
Conceivably it could just count up the dead tuples and provide that
information to something like pg_autovacuum so it knows when it's time to run
a vacuum. I don't see that as all that much of a win over the current
heuristics. At best it means a big batch update will trigger a vacuum sooner
so you don't have to manually run vacuum to avoid overflowing the fsm.
--
greg
Greg Stark wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Have you considered having the background writer check the pages it is
about to write to see if they can be added to the FSM, thereby reducing
the need for vacuum? Seems we would need to add a statistics parameter
so pg_autovacuum would know how many tuples the background writer added
to the freespace map, so it doesn't vacuum a table that doesn't need it.
This would suffer from the previously mentioned problem of having to pull in
index pages and dirty them when it's trying to flush and clean pages.
I am confused why the index would be involved in this.
Conceivably it could just count up the dead tuples and provide that
information to something like pg_autovacuum so it knows when it's time to run
a vacuum. I don't see that as all that much of a win over the current
heuristics. At best it means a big batch update will trigger a vacuum sooner
so you don't have to manually run vacuum to avoid overflowing the fsm.
Yea, probably. Another idea would be for tuple reuse to favor pages
already in the cache so it doesn't have to read in a page from disk.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian wrote:
Greg Stark wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Have you considered having the background writer check the pages it is
about to write to see if they can be added to the FSM, thereby reducing
the need for vacuum? Seems we would need to add a statistics parameter
so pg_autovacuum would know how many tuples the background writer added
to the freespace map, so it doesn't vacuum a table that doesn't need it.
This would suffer from the previously mentioned problem of having to pull in
index pages and dirty them when it's trying to flush and clean pages.
I am confused why the index would be involved in this.
Looks like Greg mixed up background vacuuming with background FSMing;
the problem he describes is a vacuum-the-page-at-hand problem, not
related to the FSM.
Conceivably it could just count up the dead tuples and provide that
information to something like pg_autovacuum so it knows when it's time to run
a vacuum. I don't see that as all that much of a win over the current
heuristics. At best it means a big batch update will trigger a vacuum sooner
so you don't have to manually run vacuum to avoid overflowing the fsm.
Yea, probably. Another idea would be for tuple reuse to favor pages
already in the cache so it doesn't have to read in a page from disk.
It has little chance of remembering how many dead tuples were added
between this flush and its corresponding read, or more precisely, since
the last time this buffer got flushed. And we certainly don't want to
modify the on-disk block structure just for that.
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
Bruce Momjian wrote:
Jan Wieck wrote:
If the system is write-bound, the checkpointer will find so many dirty
blocks that it has no time to nap and will burst them out as fast as
possible anyway. Well, at least that's the theory.
PostgreSQL with the non-overwriting storage concept can never have
hot-written pages for a long time anyway, can it? They fill up and cool
down until vacuum.
Another idea on removing sync() --- if we are going to use fsync() on
each file during checkpoint (open, fsync, close), seems we could keep a
hash of written block dbid/relfilenode pairs and cycle through that on
checkpoint. We could keep the hash in shared memory, and dump it to a
backing store when it gets full, or just have it exist in buffer writer
process memory (so it can grow) and have backends that do their own
buffer writes all open a single file in append mode and write the pairs
to the file, or something like that, and the checkpoint process can read
from there.
I am not really aiming at removing sync() altogether. We already know
that open/fsync/close does not guarantee flushing dirty OS buffers for
which another process might so far only have done open/write. And you
really don't want to remove all the vfd logic, or fsync on every write
done by a backend.
What frequent fdatasync/fsync during a constantly ongoing checkpoint
will do is significantly lower the physical write storm happening at the
sync(), which is causing huge problems right now.
The reason why people blame vacuum that much is that not only does it
replace the buffer cache with useless garbage, it also leaves that
garbage to be flushed by other backends or the checkpointer, and it
rapidly fills WAL, causing exactly that checkpoint we don't have the IO
bandwidth for right now! They only see that vacuum is running, and if
they kill it the system returns to a healthy state after a while ...
easy enough, but only half the story.
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
Jan Wieck wrote:
Bruce Momjian wrote:
Jan Wieck wrote:
If the system is write-bound, the checkpointer will find so many dirty
blocks that it has no time to nap and will burst them out as fast as
possible anyway. Well, at least that's the theory.
PostgreSQL with the non-overwriting storage concept can never have
hot-written pages for a long time anyway, can it? They fill up and cool
down until vacuum.
Another idea on removing sync() --- if we are going to use fsync() on
each file during checkpoint (open, fsync, close), seems we could keep a
hash of written block dbid/relfilenode pairs and cycle through that on
checkpoint. We could keep the hash in shared memory, and dump it to a
backing store when it gets full, or just have it exist in buffer writer
process memory (so it can grow) and have backends that do their own
buffer writes all open a single file in append mode and write the pairs
to the file, or something like that, and the checkpoint process can read
from there.
I am not really aiming at removing sync() altogether. We already know
that open/fsync/close does not guarantee flushing dirty OS buffers for
which another process might so far only have done open/write. And you
We do know this? How? I thought someone listed the standard saying it
should work. I need this for Win32 anyway.
really don't want to remove all the vfd logic or fsync on every write
done by a backend.
No, certainly not --- that is a big loser.
What frequent fdatasync/fsync during a constantly ongoing checkpoint
will do is significantly lower the physical write storm happening at the
sync(), which is causing huge problems right now.
Frankly, I don't see that, because sync() is syncing everything on that
machine, including other file systems. Reducing our own load from sync
will not help with other applications writing to the drives.
The reason why people blame vacuum that much is that not only does it
replaces the buffer cache with useless garbage, it also leaves that
garbage to be flushed by other backends or the checkpointer and it
rapidly fills WAL, causing exactly that checkpoint we don't have the IO
bandwidth for right now! They only see that vacuum is running, and if
they kill it the system returns to a healty state after a while ... easy
enought but only half the story.
Right, vacuum triggers WAL/checkpoint.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian wrote:
Jan Wieck wrote:
What frequent fdatasync/fsync during a constantly ongoing checkpoint
will do is significantly lower the physical write storm happening at the
sync(), which is causing huge problems right now.
Frankly, I don't see that, because sync() is syncing everything on that
machine, including other file systems. Reducing our own load from sync
will not help with other applications writing to the drives.
You have 4 kids, Bruce. If you buy only two lollipops, how many of them
can share the room unattended?
What I described is absolutely sufficient for a dedicated DB server. We
will be able to coordinate the resources between the various components
of PostgreSQL, no doubt. Everyone who has significant performance
problems because of I/O saturation, and is still keeping other I/O heavy
applications on the same box instead of separating the things, is either
not serious or dumb ... or both.
Jan
PS: I know your kids can, but it serves too well ... ;-)
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #