Silent data loss in its pure form

Started by Alex Ignatov, almost 10 years ago, 6 messages, general
#1 Alex Ignatov
a.ignatov@postgrespro.ru

Following this bug report from Red Hat:
https://bugzilla.redhat.com/show_bug.cgi?id=845233

It raises a dangerous issue:

If for any reason your data file is zeroed after a power loss (a well-known
issue on XFS in the past), then when you run
select count(*) from you_table
you get zero if the table fit in a single 1GB (default) segment file, or
otherwise some number != count(*) from you_table before the power loss.
No errors, nothing suspicious in the logs. No checksum errors. Nothing.

Silent data loss in its pure form.

And thank all the gods if you notice it before the backups that still
contain good data are recycled.
Keep this in mind while checking your "backups" in any form (pg_dump or
the more dangerous and terse PITR file backup).

Your data is always in danger with the "zeroed data file is a normal file"
paradigm.

--
Alex Ignatov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
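
The symptom described above (a relation file that is empty or full of zeros, with nothing in the logs) can at least be looked for from outside the server. Below is a minimal sketch in Python, assuming a stopped cluster and the default 8192-byte page size; the function names and the example path in the usage note are hypothetical, and this is not a PostgreSQL tool. Note that zero-length files and all-zero pages can be perfectly legitimate (an empty table, a freshly extended relation), so the output is only a list of candidates to compare against what you expect.

```python
import os
from pathlib import Path

# Assumption: the cluster uses PostgreSQL's default 8 kB page size.
BLOCK_SIZE = 8192


def is_all_zero(path, block_size=BLOCK_SIZE):
    """Return True if the file is non-empty and every byte in it is zero."""
    zero_block = bytes(block_size)
    seen_data = False
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            seen_data = True
            # The last chunk may be shorter than a full block.
            if chunk != zero_block[:len(chunk)]:
                return False
    return seen_data


def find_suspicious_files(data_dir):
    """Yield (path, reason) for regular files that are empty or all zeros."""
    for root, _dirs, files in os.walk(data_dir):
        for name in sorted(files):
            path = Path(root) / name
            if not path.is_file():
                continue
            if path.stat().st_size == 0:
                yield path, "zero length"
            elif is_all_zero(path):
                yield path, "all zero bytes"
```

Usage would be something like `for path, reason in find_suspicious_files("/var/lib/pgsql/data/base"): print(path, reason)`, where the directory is only an example location.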

#2 Scott Marlowe
scott.marlowe@gmail.com
In reply to: Alex Ignatov (#1)
Re: Silent data loss in its pure form

On Mon, May 30, 2016 at 10:57 AM, Alex Ignatov <a.ignatov@postgrespro.ru> wrote:

> Following this bug report from Red Hat:
> https://bugzilla.redhat.com/show_bug.cgi?id=845233
>
> Your data is always in danger with the "zeroed data file is a normal file"
> paradigm.

That bug shows as having been fixed in 2012. Are there any modern,
supported distros that would still have it? It sounds really bad btw.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3 Alex Ignatov
a.ignatov@postgrespro.ru
In reply to: Scott Marlowe (#2)
Re: Silent data loss in its pure form

On Monday, May 30, 2016 20:14, Scott Marlowe <scott.marlowe@gmail.com> wrote:

> That bug shows as having been fixed in 2012. Are there any modern,
> supported distros that would still have it? It sounds really bad btw.

It is not about modern distros; it is about possible silent data loss on old distros. We have replication and some form of data checksumming, but we are powerless against this XFS bug just because "a zeroed file is your good friend in Postgres".

With the "zero file is a good file" paradigm and this XFS bug, PG as it is now is a "colossus with feet of clay". It can do many things, but it can't even tell us that something is wrong with our precious data. There is no need for prevention or some other AI magic once the zero doomsday has come. What we need now is an error report about a suspicious zeroed file, so that we know something went wrong and that we have to do recovery.

Today, PG "power loss" recovery combined with this XFS bug poisons our assurance that recovery went well. It "went well" even with a zeroed file. That is not healthy behavior. It is like walking through a minefield with your eyes closed.

I think it is a very dangerous approach to data to have data files without any header in them and without any file checking, at least at PG start. With this known XFS bug it can lead to undetected and unavoidable loss of data.
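
To make the request for "an error report about a suspicious zeroed file" concrete, here is a minimal sketch of one way to approximate it outside the server: record file sizes in a manifest, then after a crash report every file that had data before and is now empty or missing. This is only an illustration in Python, not an existing PostgreSQL feature; the function names and the JSON manifest format are hypothetical. A manifest taken from a running cluster goes stale quickly, and legitimate operations (TRUNCATE, DROP, vacuum truncation) will also show up, so the result is a hint, not proof of loss.

```python
import json
import os


def snapshot_sizes(data_dir):
    """Map each file's path (relative to data_dir) to its size in bytes."""
    sizes = {}
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            full = os.path.join(root, name)
            sizes[os.path.relpath(full, data_dir)] = os.path.getsize(full)
    return sizes


def save_manifest(data_dir, manifest_path):
    """Write the current size snapshot to manifest_path as JSON."""
    with open(manifest_path, "w") as f:
        json.dump(snapshot_sizes(data_dir), f)


def files_truncated_to_zero(data_dir, manifest_path):
    """Return files that had data in the manifest but are now empty or missing."""
    with open(manifest_path) as f:
        before = json.load(f)
    now = snapshot_sizes(data_dir)
    return sorted(
        path for path, old_size in before.items()
        if old_size > 0 and now.get(path, 0) == 0
    )
```

The manifest is best kept outside the data directory, ideally on a different filesystem, so that it does not share the fate of the files it describes.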

#4 David G. Johnston
david.g.johnston@gmail.com
In reply to: Alex Ignatov (#3)
Re: Silent data loss in its pure form

On Mon, May 30, 2016 at 4:22 PM, Alex Ignatov <a.ignatov@postgrespro.ru>
wrote:

> I think it is a very dangerous approach to data to have data files
> without any header in them and without any file checking, at least at
> PG start.
> With this known XFS bug it can lead to undetected and unavoidable loss
> of data.

For those not following -general this is basically an extension of the
following thread.

"Deleting a table file does not raise an error when the table is touched
afterwards, why?"

/messages/by-id/184509399.5590018.1464622534207.JavaMail.zimbra@dbi-services.com

David J.

#5 Alex Ignatov
a.ignatov@postgrespro.ru
In reply to: David G. Johnston (#4)
Re: Silent data loss in its pure form

On Monday, May 30, 2016 23:44, David G. Johnston <david.g.johnston@gmail.com> wrote:

> For those not following -general this is basically an extension of the
> following thread.
>
> "Deleting a table file does not raise an error when the table is touched
> afterwards, why?"
>
> /messages/by-id/184509399.5590018.1464622534207.JavaMail.zimbra@dbi-services.com

This is not an extension of that thread. It is about the XFS bug and how PG ignores a zeroed file even during power-loss recovery. That thread is just a starting point for an important topic: how to silently lose your data with a broken XFS and PG. The key word is "silently", without any human intervention, together with the "zero-length file is a good file" paradigm. It is not at all like unlinking files by hand.

Alex Ignatov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

#6 Alex Ignatov
a.ignatov@postgrespro.ru
In reply to: Alex Ignatov (#5)
Re: Silent data loss in its pure form

On 31.05.2016 0:12, Alex Ignatov wrote:

> This is not an extension of that thread. It is about the XFS bug and how
> PG ignores a zeroed file even during power-loss recovery.

It can also happen on ext4 with delayed allocation.
http://www.pointsoftware.ch/en/4-ext4-vs-ext3-filesystem-and-why-delayed-allocation-is-bad/
So the issue becomes more serious than just the "XFS constantly wiped my
file" meme.

So in total we have at least two filesystems that can truncate files to
zero length after a power loss. One can do it "by design", with the "wrong"
delayed allocation mount option; the other just because it had a bug in an
old kernel.

Alex Ignatov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
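
For application code that writes its own files on such filesystems, the commonly recommended defence against ending up with a zero-length file after a crash is to write to a temporary file, fsync it, rename it over the target, and then fsync the directory. Below is a minimal Python sketch of that pattern for POSIX systems; `durable_write` is a hypothetical helper name, and this is a general application-level technique, not a description of how PostgreSQL manages its relation files.

```python
import contextlib
import os
import tempfile


def durable_write(path, data):
    """Replace path with data so that a crash should leave old or new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so the rename stays
    # within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # force the data to disk before the rename
        os.replace(tmp_path, path)  # atomic rename over the destination
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)
        raise
    # fsync the directory so the rename itself reaches disk (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The point of the ordering is that the rename is only issued after the new data is known to be on disk, which is exactly the step that delayed allocation makes unsafe to skip.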