Losing records when server hang
Hi,
I'm observing the following:
If I commit the following records 1,2,3,4,5,6,7,8,9,10 to the database
and the server hangs, I could lose records 5,6,7,8,9 but record 10 is
there. How is this possible and do anyone know how Postgresql
physically writes the records?
Thanks,
thomas.
On Sun, 2004-08-08 at 19:43, lec wrote:
Hi,
I'm observing the following:
If I commit the following records 1,2,3,4,5,6,7,8,9,10 to the database
and the server hangs, I could lose records 5,6,7,8,9 but record 10 is
there. How is this possible and do anyone know how Postgresql
physically writes the records?
Assuming a properly function storage subsystem and a kernel that does
not lie about fsync, this is not possible.
Are you running on top of IDE drives with the write cache enabled?
Most, if not all, IDE drives lie about FSYNC and can lose data if power
is disconnected abrubtly.
Or if you are running with fsync=false in postgresql.conf the same thing
can happen. Otherwise, not, it should not happen. More info perhaps?
On Sun, Aug 08, 2004 at 08:36:36PM -0600, Scott Marlowe wrote:
On Sun, 2004-08-08 at 19:43, lec wrote:
If I commit the following records 1,2,3,4,5,6,7,8,9,10 to the database
and the server hangs, I could lose records 5,6,7,8,9 but record 10 is
there. How is this possible and do anyone know how Postgresql
physically writes the records?Assuming a properly function storage subsystem and a kernel that does
not lie about fsync, this is not possible.
It is actually possible if he uses several backends to do the job, and
transaction inserting record 10 commits before the hang, and 5,6,7,8,9
don't.
If this is only one backend, then I'd love to see how did he do that.
--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"The ability to monopolize a planet is insignificant
next to the power of the source"
On Sun, 2004-08-08 at 21:26, Alvaro Herrera Munoz wrote:
On Sun, Aug 08, 2004 at 08:36:36PM -0600, Scott Marlowe wrote:
On Sun, 2004-08-08 at 19:43, lec wrote:
If I commit the following records 1,2,3,4,5,6,7,8,9,10 to the database
and the server hangs, I could lose records 5,6,7,8,9 but record 10 is
there. How is this possible and do anyone know how Postgresql
physically writes the records?Assuming a properly function storage subsystem and a kernel that does
not lie about fsync, this is not possible.It is actually possible if he uses several backends to do the job, and
transaction inserting record 10 commits before the hang, and 5,6,7,8,9
don't.
Yeah, but he explicitly said he'd committed 1 through 10. Unless he
didn't understand what is meant by commit. I.e. committing AND
receiving the ack for that commit. Until the database says it
committed, nothing's been committed, so he might have thought just
firing the insert query was committing. I hadn't really thought of that
angle.
Is that the case, lec?
If this is only one backend, then I'd love to see how did he do that.
Me too :-)
"Scott Marlowe" <smarlowe@qwest.net> writes:
On Sun, 2004-08-08 at 21:26, Alvaro Herrera Munoz wrote:
If this is only one backend, then I'd love to see how did he do that.
Me too :-)
Given a disk drive that lies about write complete (which is to to say,
most consumer-grade IDE gear), it's at least theoretically possible.
But it doesn't sound real likely, especially if the machine didn't
actually lose power.
I was wondering what he meant by "hang", as well as what he meant by
"commit". The overall lack of detail in the report is notable.
regards, tom lane
Scott Marlowe wrote:
On Sun, 2004-08-08 at 21:26, Alvaro Herrera Munoz wrote:
On Sun, Aug 08, 2004 at 08:36:36PM -0600, Scott Marlowe wrote:
On Sun, 2004-08-08 at 19:43, lec wrote:
If I commit the following records 1,2,3,4,5,6,7,8,9,10 to the database
and the server hangs, I could lose records 5,6,7,8,9 but record 10 is
there. How is this possible and do anyone know how Postgresql
physically writes the records?Assuming a properly function storage subsystem and a kernel that does
not lie about fsync, this is not possible.
I'm using Redhat 7.3, kernel 2.4.18
It is actually possible if he uses several backends to do the job, and
transaction inserting record 10 commits before the hang, and 5,6,7,8,9
don't.
Just 1 backend.
Yeah, but he explicitly said he'd committed 1 through 10. Unless he
didn't understand what is meant by commit. I.e. committing AND
receiving the ack for that commit. Until the database says it
committed, nothing's been committed, so he might have thought just
firing the insert query was committing. I hadn't really thought of that
angle.Is that the case, lec?
I explicitly 'COMMIT'
If this is only one backend, then I'd love to see how did he do that.
Me too :-)
That's exactly leaving me puzzled. I don't know if it has anything to
do with the SCSI controller or hardware related stuff. The controller
is a RAID, configured are RAID-5.
--lec
Tom Lane wrote:
"Scott Marlowe" <smarlowe@qwest.net> writes:
On Sun, 2004-08-08 at 21:26, Alvaro Herrera Munoz wrote:
If this is only one backend, then I'd love to see how did he do that.
Me too :-)
Given a disk drive that lies about write complete (which is to to say,
most consumer-grade IDE gear), it's at least theoretically possible.
But it doesn't sound real likely, especially if the machine didn't
actually lose power.
It's a SCSI, RAID-5 on a Dell server.
I was wondering what he meant by "hang", as well as what he meant by
"commit". The overall lack of detail in the report is notable.
The hardware actually "hang". The Dell engineers came and replaced the
motherboard but couldn't tell what the actual fault was.
Commit as in 'COMMIT'. 'Records' 1,2,3,4,5,6,7,8,9,10 are actually
transactions. I'm as puzzled as to why I lost the transactions in the
middle but got the last transaction.
Show quoted text
regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster
Import Notes
Reply to msg id not found: 13758695.1092031490542.JavaMail.root@hercules
Scott Marlowe wrote:
On Sun, 2004-08-08 at 19:43, lec wrote:
Hi,
I'm observing the following:
If I commit the following records 1,2,3,4,5,6,7,8,9,10 to the database
and the server hangs, I could lose records 5,6,7,8,9 but record 10 is
there. How is this possible and do anyone know how Postgresql
physically writes the records?Assuming a properly function storage subsystem and a kernel that does
not lie about fsync, this is not possible.Are you running on top of IDE drives with the write cache enabled?
Most, if not all, IDE drives lie about FSYNC and can lose data if power
is disconnected abrubtly.
SCSI
Or if you are running with fsync=false in postgresql.conf the same thing
can happen. Otherwise, not, it should not happen. More info perhaps?
I didn't change the default fsync, which I believe is fsync=true.
lec <limec@streamyx.com> writes:
It's a SCSI, RAID-5 on a Dell server.
The hardware actually "hang". The Dell engineers came and replaced the
motherboard but couldn't tell what the actual fault was.
Commit as in 'COMMIT'. 'Records' 1,2,3,4,5,6,7,8,9,10 are actually
transactions. I'm as puzzled as to why I lost the transactions in the
middle but got the last transaction.
I'm puzzled too. I don't suppose you have the postmaster log from when
it was trying to recover from the crash? Or even better, copies of the
WAL files?
A possible theory has to do with corruption of the WAL log. For
instance, transactions 1-10 are all down to disk in WAL (or at least the
kernel told postgres the writes were done) and for one reason or another
the buffer manager chances to flush the page containing record 10 out
to its data file before the other records' pages. Now the system hangs.
After reboot, if the WAL log is unreadable beyond transaction 1 then the
database would come up with transaction 1 replayed, 2-10 not replayed,
but 10's data is out there anyway.
However this would seem to imply disk drive misfeasance above and beyond
your motherboard problem.
regards, tom lane
Tom Lane wrote:
I'm puzzled too. I don't suppose you have the postmaster log from when
it was trying to recover from the crash? Or even better, copies of the
WAL files?
I never knew where the log files are :(
Tom Lane wrote:
lec <limec@streamyx.com> writes:
It's a SCSI, RAID-5 on a Dell server.
The hardware actually "hang". The Dell engineers came and replaced the
motherboard but couldn't tell what the actual fault was.Commit as in 'COMMIT'. 'Records' 1,2,3,4,5,6,7,8,9,10 are actually
transactions. I'm as puzzled as to why I lost the transactions in the
middle but got the last transaction.I'm puzzled too. I don't suppose you have the postmaster log from when
it was trying to recover from the crash? Or even better, copies of the
WAL files?A possible theory has to do with corruption of the WAL log. For
instance, transactions 1-10 are all down to disk in WAL (or at least the
kernel told postgres the writes were done) and for one reason or another
the buffer manager chances to flush the page containing record 10 out
to its data file before the other records' pages. Now the system hangs.
After reboot, if the WAL log is unreadable beyond transaction 1 then the
database would come up with transaction 1 replayed, 2-10 not replayed,
but 10's data is out there anyway.However this would seem to imply disk drive misfeasance above and beyond
your motherboard problem.
Well, no. How about this theory:
1) everything is ok:
the backend executes write()/fsync() for transactions 1-5
2) hardware fails some how at MB level (imagine CPU/RAM overheating):
RAM gets corrupted - kernel starts oopsing (but goes on)
meanwhile, the backend executes write()/fsync() for transactions 6-10,
but randomly corrupted data gets written to disk.
3) unrecoverable kernel error occurs, the show stops.
On recover, transactions 6-9 don't even look like valid log entries, while
10, for some reason, does (maybe only data is corrupted).
I'm not familiar with the details of WAL files and post-crash recovery,
but is that possible? Or does the process stop at the first failure?
Anyway, if your CPU/RAM is failing, no DB technology can save you. You
need redundant CPU/RAM units to perform the same operations concurrenly,
and the hardware to validate the results on a 2vs1 basis at least.
Ask NASA, I think they know what "mission critical" actually means. :)
Really, when the hardware starts flipping random bits in RAM, you can't
even know how long it's being going on, can be hours w/o the kernel panic
or hang at all. No one knows how good is your data. There's no point in
recovering a transaction if the data inside is corrupted.
.TM.
--
____/ ____/ /
/ / / Marco Colombo
___/ ___ / / Technical Manager
/ / / ESI s.r.l.
_____/ _____/ _/ Colombo@ESI.it
On Mon, 2004-08-09 at 09:07, lec wrote:
Scott Marlowe wrote:
On Sun, 2004-08-08 at 21:26, Alvaro Herrera Munoz wrote:
On Sun, Aug 08, 2004 at 08:36:36PM -0600, Scott Marlowe wrote:
On Sun, 2004-08-08 at 19:43, lec wrote:
If I commit the following records 1,2,3,4,5,6,7,8,9,10 to the database
and the server hangs, I could lose records 5,6,7,8,9 but record 10 is
there. How is this possible and do anyone know how Postgresql
physically writes the records?Assuming a properly function storage subsystem and a kernel that does
not lie about fsync, this is not possible.I'm using Redhat 7.3, kernel 2.4.18
It is actually possible if he uses several backends to do the job, and
transaction inserting record 10 commits before the hang, and 5,6,7,8,9
don't.Just 1 backend.
Yeah, but he explicitly said he'd committed 1 through 10. Unless he
didn't understand what is meant by commit. I.e. committing AND
receiving the ack for that commit. Until the database says it
committed, nothing's been committed, so he might have thought just
firing the insert query was committing. I hadn't really thought of that
angle.Is that the case, lec?
I explicitly 'COMMIT'
If this is only one backend, then I'd love to see how did he do that.
Me too :-)
That's exactly leaving me puzzled. I don't know if it has anything to
do with the SCSI controller or hardware related stuff. The controller
is a RAID, configured are RAID-5.
Does that RAID controller have NON battery backed cache?
Marco Colombo <marco@esi.it> writes:
Tom Lane wrote:
However this would seem to imply disk drive misfeasance above and beyond
your motherboard problem.
Well, no. How about this theory:
1) everything is ok:
the backend executes write()/fsync() for transactions 1-5
2) hardware fails some how at MB level (imagine CPU/RAM overheating):
RAM gets corrupted - kernel starts oopsing (but goes on)
meanwhile, the backend executes write()/fsync() for transactions 6-10,
but randomly corrupted data gets written to disk.
3) unrecoverable kernel error occurs, the show stops.
On recover, transactions 6-9 don't even look like valid log entries, while
10, for some reason, does (maybe only data is corrupted).
I'm not familiar with the details of WAL files and post-crash recovery,
but is that possible? Or does the process stop at the first failure?
Recovery will stop at the first corrupted record, so it would not happen
like that. But you are right, the MB failure alone might have been
enough to corrupt the outgoing WAL log data and thus produce the
scenario I described. Once Postgres *thinks* transactions 1-10 are
safely down to disk in the WAL log, it will feel free to update the data
files in any random order that seems convenient. So the write of record
10 could have occurred before the rest, and if that happened not to get
corrupted by the MB problem, we could see the result lec describes.
Of course this is all guesswork since we have no direct evidence to look
at, but it seems fairly plausible.
Anyway, if your CPU/RAM is failing, no DB technology can save you.
Agreed. Software certainly cannot make any guarantees if it can't even
execute correctly ...
regards, tom lane
Scott Marlowe wrote:
On Mon, 2004-08-09 at 09:07, lec wrote:
Scott Marlowe wrote:
On Sun, 2004-08-08 at 21:26, Alvaro Herrera Munoz wrote:
On Sun, Aug 08, 2004 at 08:36:36PM -0600, Scott Marlowe wrote:
On Sun, 2004-08-08 at 19:43, lec wrote:
If I commit the following records 1,2,3,4,5,6,7,8,9,10 to the database
and the server hangs, I could lose records 5,6,7,8,9 but record 10 is
there. How is this possible and do anyone know how Postgresql
physically writes the records?Assuming a properly function storage subsystem and a kernel that does
not lie about fsync, this is not possible.I'm using Redhat 7.3, kernel 2.4.18
It is actually possible if he uses several backends to do the job, and
transaction inserting record 10 commits before the hang, and 5,6,7,8,9
don't.Just 1 backend.
Yeah, but he explicitly said he'd committed 1 through 10. Unless he
didn't understand what is meant by commit. I.e. committing AND
receiving the ack for that commit. Until the database says it
committed, nothing's been committed, so he might have thought just
firing the insert query was committing. I hadn't really thought of that
angle.Is that the case, lec?
I explicitly 'COMMIT'
If this is only one backend, then I'd love to see how did he do that.
Me too :-)
That's exactly leaving me puzzled. I don't know if it has anything to
do with the SCSI controller or hardware related stuff. The controller
is a RAID, configured are RAID-5.Does that RAID controller have NON battery backed cache?
I'm not sure.
Tom Lane wrote:
Marco Colombo <marco@esi.it> writes:
Tom Lane wrote:
However this would seem to imply disk drive misfeasance above and beyond
your motherboard problem.Well, no. How about this theory:
1) everything is ok:
the backend executes write()/fsync() for transactions 1-52) hardware fails some how at MB level (imagine CPU/RAM overheating):
RAM gets corrupted - kernel starts oopsing (but goes on)
meanwhile, the backend executes write()/fsync() for transactions 6-10,
but randomly corrupted data gets written to disk.3) unrecoverable kernel error occurs, the show stops.
On recover, transactions 6-9 don't even look like valid log entries, while
10, for some reason, does (maybe only data is corrupted).I'm not familiar with the details of WAL files and post-crash recovery,
but is that possible? Or does the process stop at the first failure?Recovery will stop at the first corrupted record, so it would not happen
like that. But you are right, the MB failure alone might have been
enough to corrupt the outgoing WAL log data and thus produce the
scenario I described. Once Postgres *thinks* transactions 1-10 are
safely down to disk in the WAL log, it will feel free to update the data
files in any random order that seems convenient. So the write of record
10 could have occurred before the rest, and if that happened not to get
corrupted by the MB problem, we could see the result lec describes.Of course this is all guesswork since we have no direct evidence to look
at, but it seems fairly plausible.Anyway, if your CPU/RAM is failing, no DB technology can save you.
Agreed. Software certainly cannot make any guarantees if it can't even
execute correctly ...
Same here. I don't even want to have to prove anything if the hardware
isn't reliable but the "management" queries about the lost transactions,
blaming on system/software/database. I could prove to them that the lost
transactions were due to the system hang, but transaction #10 being
there makes my reasoning doubtful.
Thanks for all your feedbacks and reasoning.
--lec
Import Notes
Reply to msg id not found: 15354891.1092079790659.JavaMail.root@hercules
On Mon, 2004-08-09 at 19:29, lec wrote:
Scott Marlowe wrote:
Does that RAID controller have NON battery backed cache?
I'm not sure.
Find out. If it does, it's an accident looking for a place to happen.
Unless it's set to write thru not write back. If you're gonna have a
caching controller set to write back it MUST have battery backed cache.
lec wrote:
Scott Marlowe wrote:
On Sun, 2004-08-08 at 19:43, lec wrote:
Hi,
I'm observing the following:
If I commit the following records 1,2,3,4,5,6,7,8,9,10 to the database
and the server hangs, I could lose records 5,6,7,8,9 but record 10 is
there. How is this possible and do anyone know how Postgresql
physically writes the records?Assuming a properly function storage subsystem and a kernel that does
not lie about fsync, this is not possible.Are you running on top of IDE drives with the write cache enabled?
Most, if not all, IDE drives lie about FSYNC and can lose data if power
is disconnected abrubtly.SCSI
Or if you are running with fsync=false in postgresql.conf the same thing
can happen. Otherwise, not, it should not happen. More info perhaps?I didn't change the default fsync, which I believe is fsync=true.
Could your SCSI controller be misbehaving? Could this be leading to a
system hang (say when swap space is being called back into memory)?
Best Wishes,
Chris Travers
Metatron Technology Consulting
Chris Travers wrote:
lec wrote:
Scott Marlowe wrote:
On Sun, 2004-08-08 at 19:43, lec wrote:
Hi,
I'm observing the following:
If I commit the following records 1,2,3,4,5,6,7,8,9,10 to the
database and the server hangs, I could lose records 5,6,7,8,9 but
record 10 is there. How is this possible and do anyone know how
Postgresql physically writes the records?Assuming a properly function storage subsystem and a kernel that does
not lie about fsync, this is not possible.Are you running on top of IDE drives with the write cache enabled?
Most, if not all, IDE drives lie about FSYNC and can lose data if power
is disconnected abrubtly.SCSI
Or if you are running with fsync=false in postgresql.conf the same
thing
can happen. Otherwise, not, it should not happen. More info perhaps?I didn't change the default fsync, which I believe is fsync=true.
Could your SCSI controller be misbehaving? Could this be leading to a
system hang (say when swap space is being called back into memory)?
The first hardware that got replaced was the SCSI controller. After
that there were still hardware "hanging" and eventually the system
engineers replaced the whole server and they weren't sure what the fault
was. The lost transactions happened not on the first server hang. I'm
more curious to know why the transactions in the middle got lost but the
last transaction was there.
Thanks,
lec
On Tue, 2004-08-10 at 19:30, lec wrote:
Chris Travers wrote:
The first hardware that got replaced was the SCSI controller. After
that there were still hardware "hanging" and eventually the system
engineers replaced the whole server and they weren't sure what the fault
was. The lost transactions happened not on the first server hang. I'm
more curious to know why the transactions in the middle got lost but the
last transaction was there.
With questionable hardware (failing memory, scsi / raid controller,
etc.) all failure modes are pretty much possible. PostgreSQL is good,
but it can't make up for bad hardware.
I installed PG 8.0 Beta 1 in my machine. Everything seemed ok... The
problem occurred when I tried to install the programs in contrib
directory...
I commented out dbsize from the Makefile... And everything installed
successfully... Following is compiling output of dbsize... Am I doing
anything wrong or GETDATABASEPATH is broken?
gcc -O2
-fno-strict-aliasing
-Wall
-Wmissing-prototypes
-Wmissing-declarations
-fpic
-I.
-I../../src/include
-D_GNU_SOURCE
-c
-o dbsize.o dbsize.c
dbsize.c: In function`database_size': dbsize.c:58: error: too few
arguments to function`GetDatabasePath' gmake: *** [dbsize.o] Error 1
Current configuration of PG:
pg_config --configure
--prefix=/usr/local/postgresql-8.0
--with-java
--with-perl
--enable-nls
--with-openssl
--with-tcl
--with-pgport=5435
--enable-thread-safety
-----------------------------
Cordialmente,
Renê Salomão
rene@ibiz.com.br
IBIZ Tecnologia - www.ibiz.com.br
(011) 5579-3178
--------------------------------