VACUUM Question
We have 2 tables we expect to grow by up to 50,000 rows per day each depending on the customer. In normal operation we will most likely never update or delete from these tables as they are for historical reporting. (Eventually we may but a limit on the amount of data and delete older than X months or such)
We intend to create a number of indexes based upon the reporting search criteria.
What would the best setup be for VACUUM, ANALYSE, REINDEX. Alot of the infor refers to data hanging around from deletes and updates which in normal course we will not do on these tables?
Oisin
If you really are just inserting, and never updating or deleting, then you
will never need to vacuum the table, rather you will just need to ANALYSE
the table. If you use autovacuum that is exactly what it will do.
As for Reindex, I'm not entirely sure, I don't think you would benefit
from reindex because you aren't updating or deleting. Can anyone comment
on this? Is is possibile that a table with lots of inserts resulting in
lots of page splits etc could ever benifit form REINDEX?
Matt
Show quoted text
We have 2 tables we expect to grow by up to 50,000 rows per day each
depending on the customer. In normal operation we will most likely never
update or delete from these tables as they are for historical reporting.
(Eventually we may but a limit on the amount of data and delete older than
X months or such)
We intend to create a number of indexes based upon the reporting search
criteria.What would the best setup be for VACUUM, ANALYSE, REINDEX. Alot of the
infor refers to data hanging around from deletes and updates which in
normal course we will not do on these tables?Oisin
On Thu, Jan 26, 2006 at 02:35:42PM -0500, Matthew T. O'Connor wrote:
If you really are just inserting, and never updating or deleting, then you
will never need to vacuum the table, rather you will just need to ANALYSE
the table.
That's not quite true; the table must still be vacuumed occasionally
to prevent transaction ID wraparound failure, else you risk losing
data.
http://www.postgresql.org/docs/8.1/interactive/maintenance.html#VACUUM-FOR-WRAPAROUND
--
Michael Fuhr
matthew@zeut.net ("Matthew T. O'Connor") writes:
If you really are just inserting, and never updating or deleting,
then you will never need to vacuum the table, rather you will just
need to ANALYSE the table. If you use autovacuum that is exactly
what it will do.
"Never" is a pretty long time...
You need a VACUUM every 2^31 transactions, but since there needs to be
such a vacuum for the whole database, that one will do...
As for Reindex, I'm not entirely sure, I don't think you would benefit
from reindex because you aren't updating or deleting. Can anyone comment
on this? Is is possibile that a table with lots of inserts resulting in
lots of page splits etc could ever benifit form REINDEX?
I could imagine a CLUSTER doing some good, and if that's the case,
REINDEX could have some favorable results. But you'd better have a
real specific model as to why that would be...
--
let name="cbbrowne" and tld="ntlug.org" in name ^ "@" ^ tld;;
http://cbbrowne.com/info/spreadsheets.html
Oh, boy, virtual memory! Now I'm gonna make myself a really *big*
RAMdisk!
Michael Fuhr <mike@fuhr.org> writes:
On Thu, Jan 26, 2006 at 02:35:42PM -0500, Matthew T. O'Connor wrote:
If you really are just inserting, and never updating or deleting, then you
will never need to vacuum the table, rather you will just need to ANALYSE
the table.
That's not quite true; the table must still be vacuumed occasionally
to prevent transaction ID wraparound failure,
Also, somebody made a real good point about rolled-back insertions.
Even if the only command you ever apply to the table is INSERT, you
could still have dead rows in the table if some of those transactions
occasionally roll back.
regards, tom lane
Also, somebody made a real good point about rolled-back insertions.
Even if the only command you ever apply to the table is INSERT, you
could still have dead rows in the table if some of those transactions
occasionally roll back.
hmm... That's true. I don't think autovacuum doesn't anything to account
for the concept of rolledback inserts. I suppose in most real world
situations that number is going to be small enough to be ignored, but not
in all cases. Is there anyway for the stats system to report the
information about rolledback inserts? In fact autovacuum probably has a
similar deficiency for rolled back deletes but not a rolled back update.
Anyone think this is enough of an issue that it needs more attention?
Matt
"Matthew T. O'Connor" <matthew@zeut.net> writes:
Also, somebody made a real good point about rolled-back insertions.
Even if the only command you ever apply to the table is INSERT, you
could still have dead rows in the table if some of those transactions
occasionally roll back.
hmm... That's true. I don't think autovacuum doesn't anything to account
for the concept of rolledback inserts.
I think this is the fault of the stats system design. AFAICT from a
quick look at the code, inserted/updated/deleted tuples are reported
to the collector in the same way regardless of whether the sending
transaction committed or rolled back. I think this is unquestionably
a bug, at least for autovacuum's purposes --- though it might be OK
for the original intent of the stats system, which was simply to track
activity levels.
Any thoughts about how it ought to work?
regards, tom lane
Tom Lane wrote:
I think this is the fault of the stats system design. AFAICT from a
quick look at the code, inserted/updated/deleted tuples are reported
to the collector in the same way regardless of whether the sending
transaction committed or rolled back. I think this is unquestionably
a bug, at least for autovacuum's purposes --- though it might be OK
for the original intent of the stats system, which was simply to track
activity levels.Any thoughts about how it ought to work?
I don't remember exactly how it works -- I think the activity (insert,
update, delete) counters are kept separately from commit/rollback
status, right? Maybe we should keep three separate counters: "current
transaction counters" and "counters for transactions that were
aborted/committed". We only send the latter counts, and the former are
added to them when the transaction ends.
--
Alvaro Herrera http://www.advogato.org/person/alvherre
"Aprende a avergonzarte m�s ante ti que ante los dem�s" (Dem�crito)
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Tom Lane wrote:
I think this is unquestionably
a bug, at least for autovacuum's purposes --- though it might be OK
for the original intent of the stats system, which was simply to track
activity levels.Any thoughts about how it ought to work?
I don't remember exactly how it works -- I think the activity (insert,
update, delete) counters are kept separately from commit/rollback
status, right? Maybe we should keep three separate counters: "current
transaction counters" and "counters for transactions that were
aborted/committed". We only send the latter counts, and the former are
added to them when the transaction ends.
My question was at a higher level, actually: *what* should we be
counting?
I think doubling the number of counters in the stats system, which is
what you seem to be proposing, is probably not acceptable --- we've
already got a problem with the stats file becoming unreasonably bulky.
We need to figure out exactly which counts there is adequate reason
to be tracking.
I don't, for instance, see any percentage in tracking block-level I/O
operations separately for committed and rolled-back transactions.
Those numbers are certainly things you watch only for total activity,
and a failed xact is just as much system load as a committed one.
regards, tom lane
Tom Lane wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Tom Lane wrote:
I think this is unquestionably
a bug, at least for autovacuum's purposes --- though it might be OK
for the original intent of the stats system, which was simply to track
activity levels.Any thoughts about how it ought to work?
I don't remember exactly how it works -- I think the activity (insert,
update, delete) counters are kept separately from commit/rollback
status, right? Maybe we should keep three separate counters: "current
transaction counters" and "counters for transactions that were
aborted/committed". We only send the latter counts, and the former are
added to them when the transaction ends.My question was at a higher level, actually: *what* should we be
counting?
Oh, I see. Do you think small incremental improvements to the stat
system will buy us much? I think we should be thinking big here, i.e.
rewrite most stuff instead. In the meantime, we should fix the minor
issues but not spend too much time on it; IMHO anyway.
I can devote some time to it starting from, say, mid february, which is
when I think I'm going to have more time to spend on community stuff.
(I've been spending the last couple of months on PL/php and internal
Command Prompt stuff.)
--
Alvaro Herrera http://www.amazon.com/gp/registry/5ZYLFMCVHXC
"XML!" Exclaimed C++. "What are you doing here? You're not a programming
language."
"Tell that to the people who use me," said XML.
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Tom Lane wrote:
My question was at a higher level, actually: *what* should we be
counting?
Oh, I see. Do you think small incremental improvements to the stat
system will buy us much? I think we should be thinking big here, i.e.
rewrite most stuff instead.
Uh, I wasn't aware of any proposals for a major rewrite. What did you
have in mind?
regards, tom lane
Tom Lane wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Tom Lane wrote:
My question was at a higher level, actually: *what* should we be
counting?Oh, I see. Do you think small incremental improvements to the stat
system will buy us much? I think we should be thinking big here, i.e.
rewrite most stuff instead.Uh, I wasn't aware of any proposals for a major rewrite. What did you
have in mind?
Nothing yet really. But we mentioned it in the recent past. (About not
using UDP, and also about the property of not having each backend load
the whole stat file every time.)
--
Alvaro Herrera http://www.PlanetPostgreSQL.org
"This is a foot just waiting to be shot" (Andrew Dunstan)
Tom Lane wrote:
hmm... That's true. I don't think autovacuum doesn't anything to account
for the concept of rolledback inserts.I think this is the fault of the stats system design. AFAICT from a
quick look at the code, inserted/updated/deleted tuples are reported
to the collector in the same way regardless of whether the sending
transaction committed or rolled back. I think this is unquestionably
a bug, at least for autovacuum's purposes --- though it might be OK
for the original intent of the stats system, which was simply to track
activity levels.Any thoughts about how it ought to work?
The fact that autovacuum bases it's decisions on info from the stats
system is arguably an abuse of the original design. However I don't
know of a better source of information at the moment. It has always
been my vision that autovacuum will eventually incorporate additional
information sources to make better informed decisions. There has always
been discussion of using the FSM to help clue us in as to when we need
another vacuum. Perhaps the addition of the vacuum space map that
people are talking about will also help.
None of this directly addresses the question of what the stats system
*should* track, but perhaps it is wrongheaded to totally redesign the
stats system for the purposes of autovacuum. As a quick semi-fix,
perhaps autovacuum should look at the number of rollbacks vs. commits in
an attempt to determine the accuracy of the stats. For example if 50%
of the transactions are getting rolled back, then autovacuum might
include 50% of the inserts in the count towards the vacuum threshold.
Obviously this isn't perfect, but it probably gets us closer to reality
with the information already available.
Thoughts?
Matt
"Matthew T. O'Connor" <matthew@zeut.net> writes:
None of this directly addresses the question of what the stats system
*should* track, but perhaps it is wrongheaded to totally redesign the
stats system for the purposes of autovacuum.
I'd argue it's fine: there are tons of people using row-level stats
via autovacuum, and (AFAICT) just about nobody using 'em for any other
purpose. Certainly you never see anyone suggesting them as a tool for
investigating problems on pgsql-performance. Sure, it's a repurposing
of the stats subsystem, but we should be willing to do that when field
experience teaches us what's really needed.
As a quick semi-fix, perhaps autovacuum should look at the number of
rollbacks vs. commits in an attempt to determine the accuracy of the
stats. For example if 50% of the transactions are getting rolled
back, then autovacuum might include 50% of the inserts in the count
towards the vacuum threshold. Obviously this isn't perfect, but it
probably gets us closer to reality with the information already
available.
But all that we have is *global* counts. Inferring ratios applicable to
particular tables seems a big stretch to me. Any given application is
likely to have some types of transactions that roll back much more often
than others.
One thing we could do is tie the stats message sending more tightly to
top-level transaction commit/abort. (It's already effectively true that
we send stats only after commit/abort, but we send 'em from the wrong
place, ie PostgresMain.) Then the semantics of the message could be
read as "here's what I did before committing" or "here's what I did
before aborting" and the collector could interpret the counts
accordingly. However, this still fails in the case where a committed
top-level transaction includes some failed subtransactions. I think
the only full solution will involve backends doing some extra work at
subtransaction commit/abort so that they can report properly classified
update counts.
regards, tom lane
Tom Lane wrote:
I'd argue it's fine: there are tons of people using row-level stats
via autovacuum, and (AFAICT) just about nobody using 'em for any other
purpose. Certainly you never see anyone suggesting them as a tool for
investigating problems on pgsql-performance. Sure, it's a repurposing
of the stats subsystem, but we should be willing to do that when field
experience teaches us what's really needed.
OK.
But all that we have is *global* counts. Inferring ratios applicable to
particular tables seems a big stretch to me. Any given application is
likely to have some types of transactions that roll back much more often
than others.
Certainly a stretch, but it's far from fatal, the worst case scenario is
we occasionally vacuum a table that might not need it, whereas the way
things stand right now, the worst case is that we never vacuum that
might have a lot of slack space. BTW, I'm only arguing this based on
what is in the stats system now and pondering how we might improve
things if the stats system isn't changed to directly address this problem.
One thing we could do is tie the stats message sending more tightly to
top-level transaction commit/abort. (It's already effectively true that
we send stats only after commit/abort, but we send 'em from the wrong
place, ie PostgresMain.) Then the semantics of the message could be
read as "here's what I did before committing" or "here's what I did
before aborting" and the collector could interpret the counts
accordingly. However, this still fails in the case where a committed
top-level transaction includes some failed subtransactions. I think
the only full solution will involve backends doing some extra work at
subtransaction commit/abort so that they can report properly classified
update counts.
Any guess as to the performance implications?
"Matthew T. O'Connor" <matthew@zeut.net> writes:
Tom Lane wrote:
the only full solution will involve backends doing some extra work at
subtransaction commit/abort so that they can report properly classified
update counts.
Any guess as to the performance implications?
Pushing some counts from one place to another doesn't seem that
expensive, but it'd be nice to avoid scanning a lot of unrelated
table-stats entries to find the ones that have to be adjusted.
Not sure what it'll take exactly.
Or we could blow it off for the time being. Certainly, getting
things right at the top-transaction level would already be a big
leg up in accuracy from where we are, and I don't think that would
be hard at all.
regards, tom lane
Tom,
I'd argue it's fine: there are tons of people using row-level stats
via autovacuum, and (AFAICT) just about nobody using 'em for any other
purpose. Certainly you never see anyone suggesting them as a tool for
investigating problems on pgsql-performance.
Actually, I use the stats for performance tuning. However, I can't say
that I care about the exact numbers; I'm just looking for columns which
get "lots" of seq scans or indexes that don't get used.
--Josh
On 1/27/2006 10:53 AM, Alvaro Herrera wrote:
Tom Lane wrote:
I think this is the fault of the stats system design. AFAICT from a
quick look at the code, inserted/updated/deleted tuples are reported
to the collector in the same way regardless of whether the sending
transaction committed or rolled back. I think this is unquestionably
a bug, at least for autovacuum's purposes --- though it might be OK
for the original intent of the stats system, which was simply to track
activity levels.Any thoughts about how it ought to work?
I don't remember exactly how it works -- I think the activity (insert,
update, delete) counters are kept separately from commit/rollback
status, right? Maybe we should keep three separate counters: "current
transaction counters" and "counters for transactions that were
aborted/committed". We only send the latter counts, and the former are
added to them when the transaction ends.
It's not a bug. More some "don't bother" attitude. The stats were not
intended to be precise. Their sole purpose at the time of design and
development was to give clues for missing or useless indexes, identify
candidates to move onto different spindles and things like autovacuum.
If the number of aborts you're riding significantly disturbs your
statistic collection, you should look at some design issues with your
own application instead.
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
On 1/27/2006 10:56 AM, Tom Lane wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Tom Lane wrote:
I think this is unquestionably
a bug, at least for autovacuum's purposes --- though it might be OK
for the original intent of the stats system, which was simply to track
activity levels.Any thoughts about how it ought to work?
I don't remember exactly how it works -- I think the activity (insert,
update, delete) counters are kept separately from commit/rollback
status, right? Maybe we should keep three separate counters: "current
transaction counters" and "counters for transactions that were
aborted/committed". We only send the latter counts, and the former are
added to them when the transaction ends.My question was at a higher level, actually: *what* should we be
counting?I think doubling the number of counters in the stats system, which is
what you seem to be proposing, is probably not acceptable --- we've
already got a problem with the stats file becoming unreasonably bulky.
We need to figure out exactly which counts there is adequate reason
to be tracking.I don't, for instance, see any percentage in tracking block-level I/O
operations separately for committed and rolled-back transactions.
Those numbers are certainly things you watch only for total activity,
and a failed xact is just as much system load as a committed one.
Right.
The problem is that different questions need different counting
semantics. If one looks at an application profile, he wants to know how
many inserts, updates and deletes the app has issued, probably
regardless of the final transaction outcome. But for autovac he wants to
be able to figure out how many dead tuples there would be, so for that
purpose inserts in a rolled back xact would count as ins+del where
deletes would not count at all.
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
On Thu, Jan 26, 2006 at 04:14:45PM -0500, Chris Browne wrote:
As for Reindex, I'm not entirely sure, I don't think you would benefit
from reindex because you aren't updating or deleting. Can anyone comment
on this? Is is possibile that a table with lots of inserts resulting in
lots of page splits etc could ever benifit form REINDEX?I could imagine a CLUSTER doing some good, and if that's the case,
REINDEX could have some favorable results. But you'd better have a
real specific model as to why that would be...
Aside from the cluster case, are there any issues with how page splits
in the b-tree are done that could lead to better performance after a
REINDEX?
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461