Data loss, vacuum, transaction wrap-around
I want to see if there is a consensus out there.
We've all known that data loss "could" happen if vacuum is not run and you
perform more than two billion transactions. These days, with faster and
bigger computers and disks, it is more likely that this problem can be hit
in months -- not years.
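To make the failure mode concrete: XIDs are compared on a circular 32-bit number line, and a tuple's creating transaction looks like it is "in the past" only while the counter is less than about 2^31 steps ahead of it. Once the counter advances beyond that without the tuple being frozen by vacuum, the tuple appears to be "in the future" and silently becomes invisible. A rough Python sketch of the comparison (illustrative only; the real logic is C code along the lines of TransactionIdPrecedes):

```python
# Circular comparison of 32-bit transaction IDs: a is "older" than b
# when the signed 32-bit difference (a - b) is negative.
def xid_precedes(a: int, b: int) -> bool:
    diff = (a - b) & 0xFFFFFFFF       # wrap the difference to 32 bits
    if diff >= 0x80000000:            # reinterpret as a signed value
        diff -= 0x100000000
    return diff < 0

current = 100
old_tuple_xmin = 50
print(xid_precedes(old_tuple_xmin, current))   # True: tuple looks committed in the past

# Advance the counter past the 2^31 horizon without ever freezing the tuple:
current = (old_tuple_xmin + 2**31 + 10) % 2**32
print(xid_precedes(old_tuple_xmin, current))   # False: the old tuple now looks
                                               # "in the future" -- effectively lost
```

The sketch shows why the loss is silent: nothing is overwritten on disk; the rows simply stop passing visibility checks.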
To me, the WORST thing a program can do is lose data. (Certainly this is
bad for a database.) I don't think there is any real excuse for this.
While the 2b transaction problem was always there, it seemed so remote
that I never obsessed about it. Now that it seems like a real problem that
more than one user has hit, I am worried.
In fact, I think it is so bad, that I think we need to back-port a fix to
previous versions and issue a notice of some kind.
Here are my suggestions:
(1) As Tom has already said, at some point start issuing warnings in the
log that vacuum needs to be run.
(2) At some point, stop accepting transactions on anything but template1,
issuing an error saying that vacuum needs to be run.
(3) Provide a way -- via psql on template1, the standalone "postgres"
backend, or some "vacuumall" program -- to open the database in single-user
mode (or through template1) and vacuum it.
(4) This should remain even after autovacuum is in place. If for some
reason auto vacuum is installed but not running, we still need to protect
the data from a stupid admin. (Last time I looked, auto vacuum used
various stats, and that may be something an admin disables.)
(5) Vacuum could check for a wrap-around condition in the database cluster
and take it upon itself to run more broadly even if it was directed only
towards a table.
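As a sketch of how suggestions (1) and (2) could fit together, here is an illustrative trip-wire policy in Python; the thresholds, function, and names are made up for the example, not anything PostgreSQL actually implements:

```python
# Hypothetical trip-wire policy: warn well before wraparound, and refuse
# ordinary transactions with a comfortable safety margin below the
# 2^31 hard limit.  All numbers here are illustrative.
WARN_AGE = 1_500_000_000   # start logging "run VACUUM" warnings
STOP_AGE = 2_000_000_000   # stop accepting transactions; only vacuum allowed

def xid_policy(current_xid: int, oldest_unfrozen_xid: int) -> str:
    # Age of the oldest unfrozen XID, measured on the circular counter.
    age = (current_xid - oldest_unfrozen_xid) % 2**32
    if age >= STOP_AGE:
        return "stop"   # e.g. allow connections only via template1/single-user
    if age >= WARN_AGE:
        return "warn"   # emit a log message telling the admin to vacuum
    return "ok"
```

The point of the two-stage design is that the "stop" threshold still leaves hundreds of millions of XIDs of headroom in which a vacuum can be run.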
We've been saying that MySQL is OK if you don't care about your data; I
would hate it if people started using this issue against PostgreSQL.
pgsql@mohawksoft.com writes:
In fact, I think it is so bad, that I think we need to back-port a fix to
previous versions and issue a notice of some kind.
They already do issue notices --- see VACUUM.
A real fix (eg the forcible stop we were talking about earlier) will not
be reasonable to back-port.
regards, tom lane
Tom Lane wrote:
pgsql@mohawksoft.com writes:
In fact, I think it is so bad, that I think we need to back-port a fix to
previous versions and issue a notice of some kind.
They already do issue notices --- see VACUUM.
A real fix (eg the forcible stop we were talking about earlier) will not
be reasonable to back-port.
I hope this question isn't too stupid....
Is it possible to create a "vacuum wraparound" or "vacuum xidreset"
command which would do the work required to fix the wraparound problem,
without being as expensive as a normal vacuum of an entire database?
More suggestions:
(1) At startup, the postmaster checks the XID; if it is close to a problem,
force a vacuum.
(2) At "sig term" shutdown, can the postmaster start a vacuum?
(3) When the XID count goes past the "trip wire" can it spontaneously
issue a vacuum?
NOTE:
Suggestions 1 and 2 are for 8.0 and prior; 3 is for releases later than 8.0.1.
"Matthew T. O'Connor" <matthew@zeut.net> writes:
I hope this question isn't too stupid....
Is it possible to create a "vacuum wraparound" or "vacuum xidreset"
command which would do the work required to fix the wraparound problem,
without being as expensive as a normal vacuum of an entire database?
I don't think it'd be worth the trouble. You could skip index cleanup
if you didn't actually delete any tuples, but you'd still have to do all
of the scanning work. The cases where people think they don't need to
do vacuum (because the table didn't have any deleted tuples) wouldn't
get any cheaper at all.
regards, tom lane
Tom Lane wrote:
pgsql@mohawksoft.com writes:
In fact, I think it is so bad, that I think we need to back-port a fix to
previous versions and issue a notice of some kind.
They already do issue notices --- see VACUUM.
A real fix (eg the forcible stop we were talking about earlier) will not
be reasonable to back-port.
Would at least an automated warning mechanism be a reasonable backport?
Sincerely,
Joshua D. Drake
regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?
--
Command Prompt, Inc., your source for PostgreSQL replication,
professional support, programming, managed services, shared
and dedicated hosting. Home of the Open Source Projects plPHP,
plPerlNG, pgManage, and pgPHPtoolkit.
Contact us now at: +1-503-667-4564 - http://www.commandprompt.com
"Joshua D. Drake" <jd@commandprompt.com> writes:
Tom Lane wrote:
A real fix (eg the forcible stop we were talking about earlier) will not
be reasonable to back-port.
Would at least an automated warning mechanism be a reasonable backport?
No, because the hard part of the problem actually is detecting that the
condition exists in a reasonably cheap way. The check in VACUUM is
really extremely expensive, which is why we don't make it except after
completing a database-wide vacuum. Once we have an XID limit value
sitting in shared memory then the code to use it (generate warnings
and/or error out) is simple; it's initializing that value during
postmaster start that I consider too complicated/risky to stick into
existing versions.
There is another issue here, which is that I have no faith that the
people who actually need this are going to be clueful enough to update
to 7.4.8 or 7.3.10 or whatever they'd need...
regards, tom lane
pgsql@mohawksoft.com writes:
More suggestions:
(1) At startup, the postmaster checks the XID; if it is close to a problem,
force a vacuum.
Useless to a system that's run 24x7; also presumes the existence of a
complete solution anyway (since getting the postmaster to find that out
is the hard part).
(2) At "sig term" shutdown, can the postmaster start a vacuum?
Certainly not. We have to assume that SIGTERM means we are under a
short-term sentence of death from init. And if it's a manual stop
it doesn't sound much better: the sort of DBA that needs this "feature"
is likely to decide he should kill -9 the postmaster because it's taking
too long to shut down.
(3) When the XID count goes past the "trip wire" can it spontaneously
issue a vacuum?
Only in the database you're connected to, which very likely isn't where
the problem is. Moreover, having N backends all decide they need to do
this at once doesn't sound like a winner. Furthermore, this still
presumes the existence of the hard part of the solution, which is
knowing where the trip point is.
regards, tom lane
There is another issue here, which is that I have no faith that the
people who actually need this are going to be clueful enough to update
to 7.4.8 or 7.3.10 or whatever they'd need...
Well I can't argue with that one ;)
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
(3) When the XID count goes past the "trip wire" can it spontaneously
issue a vacuum?
Only in the database you're connected to, which very likely isn't where
the problem is. Moreover, having N backends all decide they need to do
this at once doesn't sound like a winner. Furthermore, this still
presumes the existence of the hard part of the solution, which is
knowing where the trip point is.
Alright, I have a suggestion. If the database kept an "oldest xid" for each
table then there wouldn't be any expensive work to scan the table looking for
an oldest xid. The only time "oldest xid" needs to be updated is when vacuum
is run, which is precisely when it would be known.
There could be a per-database "oldest xid" that any vacuum on any table
updates (by skimming all the "oldest xid"s for the current database). If
that's stored in the shared pg_database table then it's accessible regardless
of what database you connect to, no?
Then on every connection and every n-thousandth transaction you just have to
check the "oldest xid" for all the databases, and make sure the difference
between the oldest one and the current xid is reasonable.
--
greg
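The bookkeeping Greg describes can be modeled in a few lines; the names and structures below are illustrative, not the actual PostgreSQL catalogs:

```python
# Toy model: each vacuum records the oldest XID remaining in the table it
# just processed; the per-database value is the minimum over all tables,
# and a cheap periodic check compares it with the current XID counter.
table_oldest_xid = {}   # table name -> oldest XID left after its last vacuum

def record_vacuum(table: str, oldest_xid: int) -> None:
    table_oldest_xid[table] = oldest_xid

def database_oldest_xid() -> int:
    return min(table_oldest_xid.values())

def wraparound_risky(current_xid: int, danger_age: int = 2_000_000_000) -> bool:
    # Age measured on the circular 32-bit counter.
    return (current_xid - database_oldest_xid()) % 2**32 >= danger_age
```

The appeal of the scheme is exactly what Greg notes: updating the value is free at vacuum time, and reading it later is a dictionary lookup rather than a table scan.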
Greg Stark <gsstark@mit.edu> writes:
There could be a per-database "oldest xid" that any vacuum on any table
updates (by skimming all the "oldest xid"s for the current database). If
that's stored in the shared pg_database table then it's accessible regardless
of what database you connect to, no?
You mean like pg_database.datfrozenxid?
The problem is not that we don't have the data. The problem is getting
the data to where it's needed, which is GetNewTransactionId(). That
routine cannot be responsible for finding it out for itself, because we
haven't yet started a transaction at the time where we need to know
if it's safe to consume an XID.
regards, tom lane
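A rough model of the split Tom describes: the expensive part (looking at every database's datfrozenxid) runs rarely and yields one number, so that handing out a new XID costs only a comparison against that cached limit. Illustrative Python; the real mechanism would be C code and shared memory, and the names here are made up:

```python
SAFETY_MARGIN = 1_000_000   # illustrative headroom below the 2^31 horizon

def compute_xid_stop_limit(datfrozenxids: list[int]) -> int:
    # The expensive, rarely-run part: find the oldest frozen XID across all
    # databases and place the stop limit ~2^31 ahead of it, minus a margin.
    oldest = min(datfrozenxids)
    return (oldest + 2**31 - SAFETY_MARGIN) % 2**32

def may_assign_xid(next_xid: int, stop_limit: int) -> bool:
    # The cheap, per-transaction part: a single circular comparison
    # against the precomputed limit held (conceptually) in shared memory.
    return (stop_limit - next_xid) % 2**32 < 2**31
```

As Tom says, the comparison itself is trivial; the hard engineering problem is initializing `stop_limit` reliably at postmaster start, before any transaction machinery is available.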
On Sat, 19 Feb 2005 04:10 am, Tom Lane wrote:
pgsql@mohawksoft.com writes:
In fact, I think it is so bad, that I think we need to back-port a fix to
previous versions and issue a notice of some kind.
They already do issue notices --- see VACUUM.
A real fix (eg the forcible stop we were talking about earlier) will not
be reasonable to back-port.
Not to be rude, but if backporting is not an option, why do we not just
focus on the job of getting autovacuum into 8.1, and not have to think
about how a patch that will warn users will work?
Regards
Russell Smith
In <4216A911.9020704@dunslane.net>, on 02/18/05
at 09:48 PM, Andrew Dunstan <andrew@dunslane.net> said:
Russell Smith wrote:
On Sat, 19 Feb 2005 04:10 am, Tom Lane wrote:
pgsql@mohawksoft.com writes:
In fact, I think it is so bad, that I think we need to back-port a fix to
previous versions and issue a notice of some kind.
They already do issue notices --- see VACUUM.
A real fix (eg the forcible stop we were talking about earlier) will not
be reasonable to back-port.
Not to be rude, but if backporting is not an option, why do we not just
focus on the job of getting autovacuum into 8.1, and not have to think
about how a patch that will warn users will work?
What if autovacuum is turned off for some reason? Or fails? A more
graceful failure along the lines suggested would be a good thing, ISTM.
I agree with Tom about not backpatching, though. The situation seems
analogous with a car owner who neglects the clear instructions in the
manual to perform regular oil changes and then finds to his great
surprise that the car stops running. It's hardly the manufacturer's
fault.
cheers
andrew
I never did like car analogies...
At least a car comes with an IDIOT light or maybe even an oil pressure
gauge.
Something like that (the idiot light) is missing from PostgreSQL. The oil
pressure gauge would be good to have, kind of like a gauge that lets you
know when you are about to run out of fuel.
Of course the best thing is the auto-fill gas tank.
--
-----------------------------------------------------------
lsunley@mb.sympatico.ca
-----------------------------------------------------------
On Sat, 19 Feb 2005 04:10 am, Tom Lane wrote:
pgsql@mohawksoft.com writes:
In fact, I think it is so bad, that I think we need to back-port a fix to
previous versions and issue a notice of some kind.
They already do issue notices --- see VACUUM.
A real fix (eg the forcible stop we were talking about earlier) will not
be reasonable to back-port.
Not to be rude, but if backporting is not an option, why do we not just
focus on the job of getting autovacuum into 8.1, and not have to think
about how a patch that will warn users will work?
Unless I'm mistaken, even "autovacuum" may not be enough. AFAIK,
autovacuum depends on the statistics daemon, and some admins may turn that
off for performance. Even then, how unlikely is it that autovacuum fails
to run?
I think there should be a 100% no-data-loss fail-safe. Anything less is a
cop-out. I can't see one successful argument that starts with data loss
and ends with maintenance.
Russell Smith wrote:
On Sat, 19 Feb 2005 04:10 am, Tom Lane wrote:
pgsql@mohawksoft.com writes:
In fact, I think it is so bad, that I think we need to back-port a fix to
previous versions and issue a notice of some kind.
They already do issue notices --- see VACUUM.
A real fix (eg the forcible stop we were talking about earlier) will not
be reasonable to back-port.
Not to be rude, but if backporting is not an option, why do we not just
focus on the job of getting autovacuum into 8.1, and not have to think
about how a patch that will warn users will work?
What if autovacuum is turned off for some reason? Or fails? A more
graceful failure along the lines suggested would be a good thing, ISTM.
I agree with Tom about not backpatching, though. The situation seems
analogous with a car owner who neglects the clear instructions in the
manual to perform regular oil changes and then finds to his great
surprise that the car stops running. It's hardly the manufacturer's fault.
cheers
andrew
pgsql@mohawksoft.com writes:
I think there should be a 100% no data loss fail safe.
Possibly we need to recalibrate our expectations here. The current
situation is that PostgreSQL will not lose data if:
1. Your disk drive doesn't screw up (eg, lie about write complete,
or just plain die on you).
2. Your kernel and filesystem don't screw up.
3. You follow the instructions about routine vacuuming.
4. You don't hit any bugs that we don't know about.
I agree that it's a nice idea to be able to eliminate assumption #3 from
our list of gotchas, but the big picture is that it's hard to believe
that doing this will make for a quantum jump in the overall level of
reliability. I think I listed the risks in roughly the right order of
severity ...
I'm willing to fix this for 8.1 (and am already in process of drafting a
patch), especially since it ties into some other known problems such as
the pg_pwd/pg_group files not being properly reconstructed after PITR
recovery. But I think that a "Chinese fire drill" is not called for,
and backpatching a significant but poorly tested change falls into that
category IMHO.
regards, tom lane
pgsql@mohawksoft.com writes:
I think there should be a 100% no data loss fail safe.
OK, maybe I was overly broad in my statement, but I assumed a context that
I guess you missed. Don't you think that in normal operations, i.e. with
no hardware or OS failure, we should see any data loss as unacceptable?
If a bug causes data loss, it is a big deal right?
Possibly we need to recalibrate our expectations here. The current
situation is that PostgreSQL will not lose data if:
1. Your disk drive doesn't screw up (eg, lie about write complete,
or just plain die on you).
2. Your kernel and filesystem don't screw up.
3. You follow the instructions about routine vacuuming.
4. You don't hit any bugs that we don't know about.
See, here is where I strongly disagree. Items (1) and (2) are completely
out of our control and no one would blame PostgreSQL.
Item (4) is an issue with all software, every now and then people hit bugs
and the bug is reported and assumed to get fixed.
Item (3) is just nasty: RTFM or else, sucka! I think it is a very
user-hostile stance.
I agree that it's a nice idea to be able to eliminate assumption #3 from
our list of gotchas, but the big picture is that it's hard to believe
that doing this will make for a quantum jump in the overall level of
reliability. I think I listed the risks in roughly the right order of
severity ...
Sometimes the edge conditions of a problem are not so obscure. I think (3)
is a huge issue; imagine I'm in this meeting:
DBA: We can't use PostgreSQL; if we forget to do normal maintenance we'll
lose all our data.
ME: Well, there is an amount of truth in that, but we just won't forget.
DBA: Sorry, I don't trust it.
CTO: Mark, I think joe has some serious issues that need to be resolved
before we can move on this.
Boom!! Lost.
I'm willing to fix this for 8.1 (and am already in process of drafting a
patch), especially since it ties into some other known problems such as
the pg_pwd/pg_group files not being properly reconstructed after PITR
recovery. But I think that a "Chinese fire drill" is not called for,
and backpatching a significant but poorly tested change falls into that
category IMHO.
regards, tom lane
[ Shrugs ] and looks at other database systems ...
CA has put Ingres into Open Source last year.
Very reliable system with a replicator worth looking at.
Just a thought.
In <c2d9e70e050219112379204df4@mail.gmail.com>, on 02/19/05
at 02:23 PM, Jaime Casanova <systemguards@gmail.com> said:
On Fri, 18 Feb 2005 22:35:31 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
pgsql@mohawksoft.com writes:
I think there should be a 100% no data loss fail safe.
Possibly we need to recalibrate our expectations here. The current
situation is that PostgreSQL will not lose data if:
1. Your disk drive doesn't screw up (eg, lie about write complete,
or just plain die on you).
2. Your kernel and filesystem don't screw up.
3. You follow the instructions about routine vacuuming.
4. You don't hit any bugs that we don't know about.
I'm not an expert but a happy user. My opinion is:
1) there is nothing to do about #1 and #2.
2) #4 is not a big problem because of the speed with which developers fix
those when a bug is found.
3) All databases have some type of maintenance routine; in Informix, for
example, we have "update statistics" (and there are others for Oracle).
Of course they are for performance reasons, but vacuum is too, and it
additionally handles XID wraparound.
So, to have a maintenance routine in PostgreSQL is not bad. *Bad* is to
have a DBA(1) with no clue about the tool he is using. Tools that do too
much are an incentive to hire *no clue* people.
(1) DBA: DataBase Administrator or DataBase Annihilator???
regards,
Jaime Casanova
Bad mouthing the people who use your software is a good way to make sure
no one uses the software.
The catastrophic failure of the database because a maintenance function is
not performed is a problem with the software, not with the people using
it.
On Fri, 18 Feb 2005 22:35:31 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
pgsql@mohawksoft.com writes:
I think there should be a 100% no data loss fail safe.
Possibly we need to recalibrate our expectations here. The current
situation is that PostgreSQL will not lose data if:
1. Your disk drive doesn't screw up (eg, lie about write complete,
or just plain die on you).
2. Your kernel and filesystem don't screw up.
3. You follow the instructions about routine vacuuming.
4. You don't hit any bugs that we don't know about.
I'm not an expert but a happy user. My opinion is:
1) there is nothing to do about #1 and #2.
2) #4 is not a big problem because of the speed with which developers fix
those when a bug is found.
3) All databases have some type of maintenance routine; in Informix, for
example, we have "update statistics" (and there are others for Oracle).
Of course they are for performance reasons, but vacuum is too, and it
additionally handles XID wraparound.
So, to have a maintenance routine in PostgreSQL is not bad. *Bad* is to
have a DBA(1) with no clue about the tool he is using. Tools that do too
much are an incentive to hire *no clue* people.
(1) DBA: DataBase Administrator or DataBase Annihilator???
regards,
Jaime Casanova