Some ideas about Vacuum
Hi,
Maybe I am reposting something that has already been discussed to death in this
forum. I searched the archives and couldn't find anything immediately.
With my relatively small experience in performance testing and tuning,
one of the rules of thumb for getting performance is "Don't do it, if you
don't need to do it". When we look at clearing older versions of tuples
from our tables in PostgreSQL, we can't help thinking about how it is done
in other databases. Comparing the Oracle undo-log approach with the
PostgreSQL vacuum approach, the pattern looks a lot like C++ memory
de-allocation versus Java garbage collection.
So, as you may all know, the thing that worries us about vacuum is
that it goes to places where it need not go. That's where the Dead Space Map
comes in. The dead space map, if implemented correctly, would guide vacuum
to look only at places where there has been some delete/update/insert
activity since the last vacuum. This is accomplished at the cost of some
very small overhead on inserts/deletes/updates.
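To make the idea concrete, here is a minimal, self-contained C sketch of such
a map: one bit per heap block, set from the DML path and scanned by vacuum.
The names (dsm_mark_block, dsm_next_dirty) and sizes are invented for
illustration; this is not how PostgreSQL actually implements anything.

    /*
     * A minimal sketch of a dead space map, assuming one bit per heap block:
     * the DML path sets a block's bit, and vacuum visits only the blocks
     * whose bits are set.
     */
    #include <stdint.h>
    #include <stdio.h>

    #define REL_BLOCKS 1024                      /* heap blocks in the relation */
    #define DSM_WORDS  ((REL_BLOCKS + 63) / 64)  /* 64 blocks tracked per word  */

    static uint64_t dsm[DSM_WORDS];

    /* Called cheaply from insert/update/delete: this block may hold dead tuples. */
    static void dsm_mark_block(uint32_t blkno)
    {
        dsm[blkno / 64] |= UINT64_C(1) << (blkno % 64);
    }

    /* Called from vacuum: find the next marked block at or after *blkno. */
    static int dsm_next_dirty(uint32_t *blkno)
    {
        for (uint32_t b = *blkno; b < REL_BLOCKS; b++)
        {
            if (dsm[b / 64] & (UINT64_C(1) << (b % 64)))
            {
                *blkno = b;
                return 1;
            }
        }
        return 0;
    }

    int main(void)
    {
        dsm_mark_block(3);
        dsm_mark_block(517);          /* pretend only two blocks saw changes */

        uint32_t blk = 0;
        while (dsm_next_dirty(&blk))
        {
            printf("vacuum visits block %u only\n", (unsigned) blk);
            dsm[blk / 64] &= ~(UINT64_C(1) << (blk % 64));   /* clear after cleanup */
            blk++;
        }
        return 0;
    }

With something like this, vacuuming a 1024-block table that saw changes in
only two blocks touches just those two blocks instead of all 1024.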
The dead space map is like an undo log, if we think of its role as getting rid
of the older versions of data. Instead of moving the tuples to a separate
location, it guides the vacuum process to do the cleanup. Maybe we can
even think of something like a dead space log, which need not be a bitmap. In
this log, transactions would enter their transaction ids and ctids, which
can then be scanned by the vacuum process. While this would take more space, it
would have less contention compared to the dead space map. To me, as far
as I can see, the only advantage of the dead space map over a dead space log
is disk space.
It strikes me that the WAL is already doing just that; I think you
can follow my line of thought. If we can ask the vacuum process to scan the WAL,
it can get all the relevant details on where it needs to go. One
optimization that could be added here is to somehow make the archiver do
double duty, helping vacuum while it does the archiving. For people who
have switched off archiving, this would not be a benefit.
One main restriction this places on the WAL is that a WAL segment can only
be archived after all the transactions in it have completed. In other
words, the WAL needs to be given enough space to survive the longest
transaction in the database. It would be possible to avoid this by
asking the vacuum process to take the necessary information out of the WAL,
store it somewhere, and then wait for the long-running transaction to
complete.
The information of interest in the WAL is only the table
inserts/updates/deletes. So there is a point in reading further only if
everyone agrees that this is a good idea so far.
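As an illustration of what that filtering could look like, here is a hedged
sketch in C. The record layout, the RM_HEAP constant and the function names
are invented stand-ins, not the real xlog reader API; the point is only that
most WAL records can be discarded and the rest reduce to (relation, block)
pairs for vacuum to visit.

    /*
     * A rough sketch of the filtering a vacuum-side WAL reader would do,
     * assuming a simplified record layout invented for illustration: keep
     * heap insert/update/delete records, ignore everything else, and emit
     * the (relation, block) pairs to visit.
     */
    #include <stdint.h>
    #include <stdio.h>

    #define RM_HEAP 10                /* hypothetical id for heap change records */

    typedef struct
    {
        uint8_t  rmgr;                /* which resource manager wrote the record */
        uint32_t relfilenode;         /* which relation was touched              */
        uint32_t blkno;               /* which heap block was touched            */
    } WalRecordStub;

    static void collect_dead_space(const WalRecordStub *recs, int nrecs)
    {
        for (int i = 0; i < nrecs; i++)
        {
            if (recs[i].rmgr != RM_HEAP)
                continue;             /* index, commit, checkpoint records: skip */
            printf("relation %u, block %u may need vacuuming\n",
                   (unsigned) recs[i].relfilenode, (unsigned) recs[i].blkno);
        }
    }

    int main(void)
    {
        WalRecordStub segment[] = {
            { RM_HEAP, 16384, 42 },   /* an UPDATE touched block 42 of rel 16384 */
            { 0,       0,     0  },   /* e.g. a commit record: irrelevant here   */
            { RM_HEAP, 16390, 7  },   /* a DELETE touched block 7 of rel 16390   */
        };

        collect_dead_space(segment, 3);
        return 0;
    }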
Ultimately, what has been achieved so far is that we have turned the
sequential scan the vacuum process makes over each table into a few random
I/Os. Of course, there are optimizations possible to group the random I/Os
and extract some sequential I/O out of them. But we still need to do a full
scan of every index out there. HOT may have saved some work over
there, but I am pessimistic and wondering how it could be
improved further. So it strikes me that we can do the same thing for the
indexes that we just did with the tables: convert a sequential scan of the
whole structure into a random scan of a few blocks. We can read the necessary
key information from the dead tuples, group it, hit the index in just those
blocks, and clean them up.
I can already hear people saying that it is not always possible to go
back from the table to the index. There is this culprit called the unstable
function-based index, whose structure stops us from going back from the table
to the index. So for now we should restrict the above approach to normal
indexes (not the function-based ones). I hope it would still give a good
benefit.
Of course, vacuum could convert the few random scans back into a sequential
scan if required, by referring to table statistics.
Thoughts on the idea?
Thanks,
Gokul.
P.S.: Please soften the harshness of any objections/opposing views a little.
Hi,
Gokulakannan Somasundaram wrote:
If we can ask the Vacuum process to scan
the WAL log, it can get all the relevant details on where it needs to
go.
You seem to be assuming that only a few tuples have changed between
vacuums, so that the WAL could quickly guide the VACUUM processes to the
areas where cleaning is necessary.
Let's drop that assumption, because by default, autovacuum_scale_factor
is 20%, so a VACUUM process normally kicks in after 20% of the tuples
have changed (disk space is cheap, I/O isn't). Additionally, there's a
default nap time of one minute - and VACUUM is forced to take at least
that much of a nap.
So it's easily possible to have more dead tuples than live ones. In such
cases, scanning the WAL can easily take *longer* than scanning the
table, because the amount of WAL to read would be bigger.
One main restriction this places on the WAL is that a WAL segment can only
be archived after all the transactions in it have completed. In other
words, the WAL needs to be given enough space to survive the longest
transaction in the database. It would be possible to avoid this by asking
the vacuum process to take the necessary information out of the WAL, store
it somewhere, and then wait for the long-running transaction to complete.
That would result in even more I/O...
The information of interest in the WAL is only the table
inserts/updates/deletes. So there is a point in reading further only if
everyone agrees that this is a good idea so far.
Well, that's the information of interest; the question is where to store
it. Maintaining a dead space map looks a lot cheaper to me than relying
on the WAL to store that information.
Ultimately, what has been achieved so far is that we have turned the
sequential scan the vacuum process makes over each table into a few random
I/Os. Of course, there are optimizations possible to group the random I/Os
and extract some sequential I/O out of them. But we still need to do a full
scan of every index out there. HOT may have saved some work over there, but
I am pessimistic and wondering how it could be improved further. So it
strikes me that we can do the same thing for the indexes that we just did
with the tables: convert a sequential scan of the whole structure into a
random scan of a few blocks. We can read the necessary key information from
the dead tuples, group it, hit the index in just those blocks, and clean
them up.
Sorry, I don't quite get what you are talking about here. What do
indexes have to do with dead space? Why not just keep acting at the
block level?
I can already hear people saying that it is not always possible to
go back from the table to the index. There is this culprit called the
unstable function-based index.
No, there's no such thing. Citing the Very Fine Postgres Manual on CREATE
INDEX [1]: "All functions and operators used in an index definition must be
"immutable", that is, their results must depend only on their arguments and
never on any outside influence".
Of course, you can mark any function IMMUTABLE and get unstable function-based
indexes, but that turns into a giant foot gun very quickly.
P.S.: Please soften the harshness of any objections/opposing views a little.
I'm just pointing at things that conflict with my knowledge,
assumptions and beliefs, all of which might be erroneous, plain wrong or
completely mad. ;-)
Regards
Markus
[1]: the Very Fine Postgres Manual on CREATE INDEX: http://www.postgresql.org/docs/8.3/static/sql-createindex.html
So it's easily possible to have more dead tuples than live ones. In such
cases, scanning the WAL can easily take *longer* than scanning the
table, because the amount of WAL to read would be bigger.
Yes... I made a wrong assumption there... so the idea is totally
useless.
Thanks,
Gokul.
"Markus Schiltknecht" <markus@bluegap.ch> writes:
Hi,
Gokulakannan Somasundaram wrote:
If we can ask the Vacuum process to scan the WAL log, it can get all the
relevant details on where it needs to go.
That's an interesting thought. I think your caveats are right, but with some
more work it might be possible to work it out. For example, a background
process could process the WAL and accumulate an array of possibly-dead tuples
to process in batches. It would wait whenever it sees an xid that isn't yet
past globalxmin, and keep accumulating until it has enough to make a pass
worthwhile.
I think a bigger issue with this approach is that it ties all your tables
together. You can't process one table frequently while some other table has
some long-lived deleted tuples.
I'm also not sure it really buys us anything over having a second
dead-space-map data structure. The WAL is much larger and serves other
purposes which would limit what we can do with it.
You seem to be assuming that only a few tuples have changed between vacuums, so
that the WAL could quickly guide the VACUUM processes to the areas where
cleaning is necessary.
Let's drop that assumption, because by default, autovacuum_scale_factor is 20%,
so a VACUUM process normally kicks in after 20% of the tuples have changed
(disk space is cheap, I/O isn't). Additionally, there's a default nap time of
one minute - and VACUUM is forced to take at least that much of a nap.
I think this is exactly backwards. The goal should be to improve vacuum, then
adjust the autovacuum_scale_factor as low as we can. As vacuum gets cheaper
the scale factor can go lower and lower. We shouldn't allow the existing
autovacuum behaviour to control the way vacuum works.
As a side point, "disk is cheap, I/O isn't" is a weird statement. The more
disk you use, the more I/O you'll have to do to work with the data. I still
maintain that the default autovacuum_scale_factor is *far* too liberal. If I
had my druthers it would be 5%. But that's mostly informed by TPCC experience;
in real life the actual value will vary depending on the width of your records
and the relative length of your transactions versus your transaction rate. The
TPCC experience is with ~400-byte records and many short transactions.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning
On Wed, 2008-01-09 at 15:10 +0000, Gregory Stark wrote:
The goal should be to improve vacuum, then
adjust the autovacuum_scale_factor as low as we can. As vacuum gets
cheaper the scale factor can go lower and lower. We shouldn't allow
the existing autovacuum behaviour to control the way vacuum works.
Very much agreed.
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
Hi,
Gregory Stark wrote:
That's an interesting thought. I think your caveats are right, but with some
more work it might be possible to work it out. For example, a background
process could process the WAL and accumulate an array of possibly-dead tuples
to process in batches. It would wait whenever it sees an xid that isn't yet
past globalxmin, and keep accumulating until it has enough to make a pass
worthwhile.
I don't understand why one would want to go via the WAL; that only
creates needless I/O. Better to accumulate the data right away, during the
inserts, updates and deletes. Spilling the accumulated data to disk, if
absolutely required, would presumably still result in less I/O.
I think a bigger issue with this approach is that it ties all your tables
together. You can't process one table frequently while some other table has
some long-lived deleted tuples.
Don't use the WAL as the source of that information, and that issue is gone.
I'm also not sure it really buys us anything over having a second
dead-space-map data structure. The WAL is much larger and serves other
purposes which would limit what we can do with it.
Exactly.
You seem to be assuming that only a few tuples have changed between vacuums, so
that the WAL could quickly guide the VACUUM processes to the areas where
cleaning is necessary.
Let's drop that assumption, because by default, autovacuum_scale_factor is 20%,
so a VACUUM process normally kicks in after 20% of the tuples have changed
(disk space is cheap, I/O isn't). Additionally, there's a default nap time of
one minute - and VACUUM is forced to take at least that much of a nap.
I think this is exactly backwards. The goal should be to improve vacuum, then
adjust the autovacuum_scale_factor as low as we can. As vacuum gets cheaper
the scale factor can go lower and lower.
But you can't lower it endlessly; it's still a compromise, because it
also means reducing the number of tuples cleaned per scan, which works
against the goal of minimizing the overall I/O cost of vacuuming.
We shouldn't allow the existing
autovacuum behaviour to control the way vacuum works.
That's a point.
As a side point, "disk is cheap, I/O isn't" is a weird statement. The more
disk you use the more I/O you'll have to do to work with the data.
That's only true as long as you need *all* your data to work with it.
I still maintain that the default autovacuum_scale_factor is *far* too liberal.
If I had my druthers it would be 5%. But that's mostly informed by TPCC
experience; in real life the actual value will vary depending on the width of
your records and the relative length of your transactions versus your
transaction rate. The TPCC experience is with ~400-byte records and many short
transactions.
Hm... 5% vs 20% would mean 4x as many vacuum scans, but only about 15% more
table size (105% vs 120% of the live data), right? Granted, that extra 15% is
also taken from memory and caches, resulting in additional I/O... Still, these
numbers surprise me. Or am I missing something?
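Just to show the arithmetic I'm using (a back-of-the-envelope model, nothing
more): the number of vacuum scans for a fixed amount of update churn scales
roughly as 1/scale_factor, and the steady-state table size as roughly
(1 + scale_factor) times the live data.

    /*
     * Back-of-the-envelope check of the numbers above: scans grow as
     * 1/scale_factor, steady-state size as (1 + scale_factor).
     */
    #include <stdio.h>

    int main(void)
    {
        double low = 0.05, high = 0.20;   /* the two thresholds being compared */

        printf("scans at %.0f%%: %.0fx as many as at %.0f%%\n",
               low * 100, high / low, high * 100);
        printf("steady-state size: %.0f%% vs %.0f%% of the live data\n",
               (1 + low) * 100, (1 + high) * 100);
        return 0;
    }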
Regards
Markus
Markus,
I was rethinking what you said. I feel that if we read the WAL
through the archiver (where the archiver is switched on), which in any case
reads the entire WAL, it might save some CPU cycles on updates, inserts and
deletes.
The question is about reducing I/O, and I have no doubt about that.
But if we put the WAL on a separate disk and make vacuum scan
through it (in case the archiver is absent), it would take the I/O off the
disk containing the data. Essentially the I/O effects are separated: we
might end up doing more I/O overall, but it would not affect the OLTP
transactions.
I would also like to clarify one more thing. I am not asking to
drop the DSM approach; I am just thinking of building the DSM by
reading through the WAL, instead of asking the inserts, updates and
deletes to do the DSM maintenance.
Of course, if someone places both the WAL and the data files on the
same disk drives, this would reduce performance. But can we take that
hit?
I think what Gregory is getting at is: "If we schedule the vacuum
after 20% of the table has changed, then we essentially say we need 120% of
the disk space, and hence our select operations might end up doing more I/O."
Please put forward your suggestions.
Hi All,
Essentially, concluding:
a) If there is an archiver running, we put slightly more CPU load on the
archiver to help build the DSM.
b) If there is no archiver, and the DBA places the WAL on a separate disk,
vacuum will do extra I/O on that disk to build the DSM.
c) If someone has neither enabled the archiver nor is ready to spare a disk
for the WAL, this approach reduces the performance of that setup.
Are my conclusions right?
If they are, what percentage of installations fall into the third category?
(Field experts out there!!)
If that percentage is large, we should stop this line of thinking.
Thanks,
Gokul.
Hi,
Gokulakannan Somasundaram wrote:
But I am just thinking of building the DSM
by reading through the WAL, instead of asking the inserts, updates
and deletes to do the DSM maintenance.
What's the advantage of that? What's wrong with collecting the
information for DSM at transaction processing time? The overhead is
certainly smaller than the overhead for doing it later on.
I think what Gregory is getting at is: "If we schedule the vacuum
after 20% of the table has changed, then we essentially say we need 120% of
the disk space, and hence our select operations might end up doing more I/O."
Well, full sequential scans end up doing more I/O, but not the index scans
typical for OLTP. So if autovacuum is the only thing doing full
sequential scans, you'd better reduce the number of full scans, instead
of saving only some percentage per scan, no?
Of course, depending on how much of your table fits in RAM, you also
need to consider the space savings in RAM... However, I'm assuming a
reasonably low ratio of RAM size to table size.
Regards
Markus
On Jan 10, 2008 3:43 PM, Markus Schiltknecht <markus@bluegap.ch> wrote:
Hi,
Gokulakannan Somasundaram wrote:
But I am just thinking of building the DSM by reading through the WAL,
instead of asking the inserts, updates and deletes to do the DSM maintenance.
What's the advantage of that? What's wrong with collecting the
information for DSM at transaction processing time? The overhead is
certainly smaller than the overhead for doing it later on.
The overhead... is because of the contention. Am I missing something
here? While vacuum is reading the DSM, other operations may not be able to
update the bits. We need to put the DSM in shared memory if all the processes
are going to update it, whereas if vacuum builds the DSM itself, it could
well live in process-local memory. I can also think of things like false
sharing that could be avoided, but I think the main issue is contention.
I think what Gregory is getting at is: "If we schedule the vacuum
after 20% of the table has changed, then we essentially say we need 120% of
the disk space, and hence our select operations might end up doing more I/O."
Well, full sequential scans end up doing more I/O, but not the index scans
typical for OLTP. So if autovacuum is the only thing doing full
sequential scans, you'd better reduce the number of full scans, instead
of saving only some percentage per scan, no?
Even with indexes, we might end up reading dead tuples; we would mark them
with LP_DEAD. So the overhead is smaller, but it is there. Of course, it is
natural to have some background jobs running during OLTP, and they will be
affected.
Of course, depending on how much of your table fits in RAM, you also
need to consider the space savings in RAM... However, I'm assuming a
reasonably low ratio of RAM size to table size.
That's another one.
Thanks,
Gokul.
Hi,
Gokulakannan Somasundaram wrote:
because of the contention. Am I missing something
here? While vacuum is reading the DSM, other operations may not be able to
update the bits. We need to put the DSM in shared memory if all the
processes are going to update it, whereas if vacuum builds the DSM itself,
it could well live in process-local memory. I can also think of things like
false sharing that could be avoided, but I think the main issue is
contention.
Ah, I begin to understand where you are coming from now, yes. However,
(ab-)using the WAL and archiver still doesn't look like a good idea to me.
Even with indexes, we might end up reading dead tuples; we would mark them
with LP_DEAD. So the overhead is smaller, but it is there.
That's a good point, yes.
Of course, it is natural to have some background jobs running during OLTP,
and they will be affected.
Agreed.
Regards
Markus
Sorry Greg, I missed this part before.
On Jan 9, 2008 8:40 PM, Gregory Stark <stark@enterprisedb.com> wrote:
"Markus Schiltknecht" <markus@bluegap.ch> writes:
Hi,
Gokulakannan Somasundaram wrote:
If we can ask the Vacuum process to scan the WAL log, it can get all
the
relevant details on where it needs to go.
That's an interesting thought. I think your caveats are right but with
some
more work it might be possible to work it out. For example if a background
process processed the WAL and accumulated an array of possibly-dead tuples
to
process in batch. It would wait whenever it sees an xid which isn't yet
past
globalxmin, and keep accumulating until it has enough to make it
worthwhile
doing a pass.I think a bigger issue with this approach is that it ties all your tables
together. You can't process one table frequently while some other table
has
some long-lived deleted tuples.
I am not able to clearly understand what you are saying here. It ties all
the tables together, yes. There are two options here:
a) Do we really need to vacuum table by table? Say we read 'n' WAL
segments and accumulate the data. We can sort the result by relation name
and block number and then go ahead with the vacuum. In this way, vacuum
works only at the database level. Why do we need to process one table more
frequently than another?
b) We can create a DSM for each table separately, and vacuum will use the WAL
information to update it. In this way, we can vacuum table by table.
I'm also not sure it really buys us anything over having a second
dead-space-map data structure. The WAL is much larger and serves other
purposes which would limit what we can do with it.
OK. One obvious advantage is that it avoids the contention on the DSM between
the DML operations and the vacuum process. Since the vacuum process would have
much more information about what has happened in the database, some new
structures become possible. For example, I have been thinking of changing our
current index structure so that it would not hold duplicate entries for
different versions of the same row. Whenever there is an update, only the
indexes on the columns that actually changed would get updated. Vacuum would
then play the role of changing the tid an index tuple points to whenever it
vacuums an older version.
It would be possible to create such structures, which could be synchronized
asynchronously. Another example would be asynchronous materialized views.
But pushing those future plans aside, don't you think this would reduce the
contention that the DML operations would otherwise face?
Thanks,
Gokul.
Hi,
Gokulakannan Somasundaram wrote:
I'm also not sure it really buys us anything over having a second
dead-space-map data structure. The WAL is much larger and serves other
purposes which would limit what we can do with it.
OK. One obvious advantage is that it avoids the contention on the DSM between
the DML operations and the vacuum process.
Do you have evidence that this contention is bad enough to justify the
additional WAL reading from disk? (Assuming no WAL archiving.)
IMO we can get about any granularity we want for DSM update locking,
depending on how we arrange the DSM bits.
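For example (a toy sketch with invented names and sizes, not actual PostgreSQL
shared-memory code), the bitmap could be partitioned so that each chunk of
heap blocks has its own lock; backends updating distant parts of a table then
never contend on the same lock, and vacuum only blocks updates to the
partition it is currently reading.

    /*
     * Toy sketch of partitioned DSM update locking: one lock per chunk of
     * heap blocks, so the contention footprint of a single bit update is
     * limited to its partition.
     */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    #define REL_BLOCKS      65536
    #define BLOCKS_PER_PART 4096                 /* heap blocks covered by one lock */
    #define NPARTITIONS     (REL_BLOCKS / BLOCKS_PER_PART)

    static uint64_t        dsm[REL_BLOCKS / 64];
    static pthread_mutex_t dsm_lock[NPARTITIONS];

    static void dsm_mark_block(uint32_t blkno)
    {
        pthread_mutex_t *lock = &dsm_lock[blkno / BLOCKS_PER_PART];

        pthread_mutex_lock(lock);                /* only close neighbours contend */
        dsm[blkno / 64] |= UINT64_C(1) << (blkno % 64);
        pthread_mutex_unlock(lock);
    }

    int main(void)
    {
        for (int i = 0; i < NPARTITIONS; i++)
            pthread_mutex_init(&dsm_lock[i], NULL);

        dsm_mark_block(10);      /* backend A, near the start of the table    */
        dsm_mark_block(60000);   /* backend B, near the end: a different lock */
        printf("%d partitions, no shared lock between those two updates\n",
               NPARTITIONS);
        return 0;
    }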
Since the vacuum process would have much more information about what has
happened in the database,
Why should that be? IMO, collecting the information at transaction time
can give you exactly the same information, if not better.
some new structures become possible. For example, I have been thinking of
changing our current index structure so that it would not hold duplicate
entries for different versions of the same row. Whenever there is an update,
only the indexes on the columns that actually changed would get updated.
Vacuum would then play the role of changing the tid an index tuple points to
whenever it vacuums an older version.
Huh? The index would then point only to the old tuple until a VACUUM
comes by, right? How are subsequent transactions expected to find the new
tuple before that VACUUM?
Regards
Markus
Markus Schiltknecht <markus@bluegap.ch> writes:
Since the vacuum process would have much more information about what has
happened in the database,
Why should that be? IMO, collecting the information at transaction time
can give you exactly the same information, if not better.
Well, one of the principal arguments for having VACUUM at all is that it
off-loads required maintenance effort from foreground transaction code
paths. I'm not really going to be in favor of solutions that put more
work into the transaction code paths (HOT already did more of that than
I would like :-(). OTOH, I agree that scanning the WAL log doesn't
really sound like something well-matched to this problem either.
regards, tom lane
Hi,
Tom Lane wrote:
Well, one of the principal arguments for having VACUUM at all is that it
off-loads required maintenance effort from foreground transaction code
paths.
Off-loading doesn't mean we don't have to do the work, so it's obviously
a compromise.
AFAICT, having to write some DSM blocks from foreground transaction code
paths may well be worth it overall, if it saves VACUUM from doing much
more I/O.
Especially if the bgwriter can defer the I/O to after commit time (which
I'm thinking of as another form of off-loading work from foreground
transaction code).
Regards
Markus
Hi,
Please find my answers inline
Do you have evidence that this contention is bad enough to justify the
additional WAL reading from disk? (Assuming no WAL archiving.)
In a broad sense, the DSM is a bitmap index with some optimizations to make
updates more efficient. As you may know, bitmap index designs don't scale
very well under concurrency. If you pack a lot of information into a small
space, I feel it is bound to affect concurrency. Let us discuss it in detail.
The DSM, I believe, aims to achieve the following objectives:
a) to find the blocks that need to be vacuumed,
b) to find the blocks where freezing is required,
c) to find the blocks that are visible to everyone.
The DSM might get split into multiple maps, like visibility maps (already
proposed by Heikki), vacuum maps and freeze maps. When inserts happen, the
map has to be extended, and the DSM block has to be locked to do so. Say a
DSM block covers some 60K data blocks; then any updates/deletes hitting those
blocks have to wait for that duration. This is just one example I can think
of off-hand; maybe the people who are implementing it can throw more light on
the synchronization points.
IMO we can get about any granularity we want for DSM update locking,
depending on how we arrange the DSM bits.
I don't quite understand this.
Since the vacuum process would have much more information about what has
happened in the database,
Why should that be? IMO, collecting the information at transaction time
can give you exactly the same information, if not better.
My argument is: if we have already collected that information in the WAL,
why should we collect it again?
some new structures become possible. For example, I have been thinking of
changing our current index structure so that it would not hold duplicate
entries for different versions of the same row. Whenever there is an update,
only the indexes on the columns that actually changed would get updated.
Vacuum would then play the role of changing the tid an index tuple points to
whenever it vacuums an older version.
Huh? The index would then point only to the old tuple until a VACUUM comes
by, right? How are subsequent transactions expected to find the new tuple
before that VACUUM?
You are right; we have already discussed this. In the vacuum approach,
we travel forward in time: we find the oldest version and reach the new
one by following the ctid in the old tuple. In the undo-log
approach, it is the reverse: we go to the latest version and travel back
in time. It's interesting to see how the theory of relativity gets applied in
database science, right?
So say we have 'n' versions of the same row in the index. Right now we have
'n' index tuples pointing to 'n' blocks in the heap; we read all 'n' index
tuples and visit all the versions of the row in the table. If this changes,
there will be one index tuple pointing to the oldest heap tuple, and from
there we navigate to all the newer versions. The advantage is obvious: the
index becomes smaller, and updates do not touch an index unless the data it
covers has actually changed.
I hope that was clear; please ask if it is not.
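Here is a toy illustration of that navigation, with simplified structures and
a simplified visibility rule (nothing like the real heap code): the index
points at the oldest version, and a reader follows the forward chain (the
ctid-style link) until it reaches the version its snapshot can see.

    /*
     * Toy forward-version-chain walk: the index entry points at the oldest
     * tuple; readers follow the chain to the version visible to them.
     */
    #include <stdio.h>

    typedef struct HeapTupleStub
    {
        int                   xmin;    /* transaction that created this version  */
        int                   xmax;    /* transaction that replaced it, 0 = live */
        struct HeapTupleStub *next;    /* link to the newer version              */
        const char           *value;
    } HeapTupleStub;

    /* Walk from the version the index points at to the version visible to a
     * snapshot that sees everything committed before snapshot_xid. */
    static const HeapTupleStub *fetch_visible(const HeapTupleStub *oldest,
                                              int snapshot_xid)
    {
        const HeapTupleStub *t = oldest;

        while (t->xmax != 0 && t->xmax < snapshot_xid && t->next != NULL)
            t = t->next;               /* this version was already replaced      */
        return t;
    }

    int main(void)
    {
        HeapTupleStub v3 = { 300,   0, NULL, "newest version" };
        HeapTupleStub v2 = { 200, 300, &v3,  "middle version" };
        HeapTupleStub v1 = { 100, 200, &v2,  "oldest version (index points here)" };

        printf("snapshot 250 sees: %s\n", fetch_visible(&v1, 250)->value);
        printf("snapshot 999 sees: %s\n", fetch_visible(&v1, 999)->value);
        return 0;
    }

A reader with an old snapshot stops early in the chain; a reader with a new
snapshot walks to the end, until vacuum eventually shortens the chain and
re-points the index entry.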
Thanks,
Gokul.
One more application of the same idea is asynchronous materialized views. I
hope you agree that asynchronous materialized views would have to be updated
from the WAL. If the WAL can be used for that purpose, why can't we multiplex
it?
Thanks,
Gokul.
Well, one of the principal arguments for having VACUUM at all is that it
off-loads required maintenance effort from foreground transaction code
paths. I'm not really going to be in favor of solutions that put more
work into the transaction code paths (HOT already did more of that than
I would like :-(). OTOH, I agree that scanning the WAL log doesn't
really sound like something well-matched to this problem either.
Tom, don't you like the idea of building some more structures around the WAL,
like asynchronous materialized views? Indexes, if implemented as stated,
would remove the HOT code from the transaction path (as you may know).
I am also slightly doubtful of the argument that doing full-table scans and
full index scans for vacuum is efficient. Can you please advise me on why we
should not use a read-only pass over the WAL?
Thanks,
Gokul.
Gokulakannan Somasundaram wrote:
Well, one of the principal arguments for having VACUUM at all is that it
off-loads required maintenance effort from foreground transaction code
paths. I'm not really going to be in favor of solutions that put more
work into the transaction code paths (HOT already did more of that than
I would like :-(). OTOH, I agree that scanning the WAL log doesn't
really sound like something well-matched to this problem either.
Tom, don't you like the idea of building some more structures around the WAL,
like asynchronous materialized views? Indexes, if implemented as stated,
would remove the HOT code from the transaction path (as you may know).
I am also slightly doubtful of the argument that doing full-table scans and
full index scans for vacuum is efficient. Can you please advise me on why we
should not use a read-only pass over the WAL?
I haven't been paying close attention to this thread, but there are a
couple of general issues with using the WAL for this kind of thing. First
of all, one extremely cool feature of PostgreSQL is that transaction
size is not limited by WAL space, unlike on many other DBMSs. I think
many of the proposed ideas for reading WAL would require us to keep all
WAL available back to the beginning of the oldest running transaction.
Another issue is that reading WAL is inherently not very scalable.
There's only one WAL for the whole cluster, and it needs to be read
sequentially, so it can easily become a bottleneck on large systems.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
I haven't been paying close attention to this thread, but there are a
couple of general issues with using the WAL for this kind of thing. First
of all, one extremely cool feature of PostgreSQL is that transaction
size is not limited by WAL space, unlike on many other DBMSs. I think
many of the proposed ideas for reading WAL would require us to keep all
WAL available back to the beginning of the oldest running transaction.
Initially I thought this might be required. But the current idea is that
vacuum maintains a DSM per relation and updates it once a WAL segment is
switched. So if WAL logging is currently happening in segment 2, the first
segment can be scanned to update the DSM.
Another issue is that reading WAL is inherently not very scalable.
There's only one WAL for the whole cluster, and it needs to be read
sequentially, so it can easily become a bottleneck on large systems.
Let me try to understand what would become a problem here. We would have
only one process, which would open the WAL (one segment at a time) and
update the DSMs. The limitation would be that we must finish reading a
segment before the WAL recycles it. What else do you think would be a
problem?
Thanks,
Gokul.
Heikki Linnakangas wrote:
Another issue is that reading WAL is inherently not very scalable. There's
only one WAL for the whole cluster, and it needs to be read sequentially,
so it can easily become a bottleneck on large systems.
I have wondered why we do it this way. Is there a problem with
having one WAL per database, and another for general operations? That
last WAL would carry changes to shared tables, as well as global stuff
like "create database" or "create tablespace".
Of course, it means a lot more files, and a PITR setup becomes a bit more
complex.
One obvious problem is that it would no longer be true that the WAL disk
head never seeks. But is there much benefit to that these days? People
either have a big RAID on which the WAL resides alongside all the data, or,
on low-cost systems, the whole thing is on a single disk or a small RAID.
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.