Is full_page_writes=off safe in conjunction with PITR?
While thinking about the patch I just made to allow full_page_writes to
be turned off again, it struck me that this patch only fixes the problem
for post-crash XLOG replay. There is still a hazard if the variable is
turned off in a PITR master system. The reason is that while a base
backup is being taken, the backup-taker might read an inconsistent state
of a page and include that in the backup. This is not a problem if
full_page_writes is ON --- it's logically equivalent to a torn page
write and will be fixed on the slave by XLOG replay. But it *is* a
problem if full_page_writes is OFF, for exactly the same reason that
torn page writes are a problem.
I think we had originally argued that there was no problem anyway
because the kernel should cause the page write to appear atomic to other
processes (since we issue it in a single write() command). But that's
only true if the backup-taker reads in units that are multiples of
BLCKSZ. If the backup-taker reads, say, 4K at a time then it's
certainly possible that it gets a later version of the second half of a
page than it got of the first half. I don't know about you, but I sure
don't feel comfortable making assumptions at that level about the
behavior of tar or cpio.
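The torn-read scenario above is easy to demonstrate. Below is a minimal Python sketch (not PostgreSQL code; the file and the interleaving are contrived to make the race deterministic) showing how a reader working in 4K units can capture half of an old page and half of a new one, even though the writer issues a single 8K write():

```python
import os
import tempfile

BLCKSZ = 8192      # database page size
READ_CHUNK = 4096  # a backup tool reading in 4K units

tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"\x01" * BLCKSZ)   # version 1 of an 8K page on disk
tmp.close()

reader = open(tmp.name, "rb")
first_half = reader.read(READ_CHUNK)   # backup reads the first 4K ...

# ... then the database overwrites the whole page with version 2 in a
# single write(); even if that write is atomic, the reader already
# holds the old first half.
with open(tmp.name, "r+b") as db:
    db.write(b"\x02" * BLCKSZ)

second_half = reader.read(READ_CHUNK)  # ... and reads the new second 4K
reader.close()
os.unlink(tmp.name)

# The backed-up page mixes two versions: a torn page.
page_in_backup = first_half + second_half
print(page_in_backup[:1], page_in_backup[-1:])  # b'\x01' b'\x02'
```

With full_page_writes on, replay of the first WAL record touching this page overwrites it wholesale and the tear is harmless; with it off, the mixed page survives into the restored cluster.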
I fear we still have to disable full_page_writes (force it ON) if
XLogArchivingActive is on. Comments?
regards, tom lane
On Friday, 2006-04-14 at 16:40, Tom Lane wrote:
I think we had originally argued that there was no problem anyway
because the kernel should cause the page write to appear atomic to other
processes (since we issue it in a single write() command). But that's
only true if the backup-taker reads in units that are multiples of
BLCKSZ. If the backup-taker reads, say, 4K at a time then it's
certainly possible that it gets a later version of the second half of a
page than it got of the first half. I don't know about you, but I sure
don't feel comfortable making assumptions at that level about the
behavior of tar or cpio.
I fear we still have to disable full_page_writes (force it ON) if
XLogArchivingActive is on. Comments?
Why not just tell the backup-taker to take backups using 8K pages?
---------------
Hannu
Hannu Krosing <hannu@skype.net> writes:
On Friday, 2006-04-14 at 16:40, Tom Lane wrote:
If the backup-taker reads, say, 4K at a time then it's
certainly possible that it gets a later version of the second half of a
page than it got of the first half. I don't know about you, but I sure
don't feel comfortable making assumptions at that level about the
behavior of tar or cpio.
I fear we still have to disable full_page_writes (force it ON) if
XLogArchivingActive is on. Comments?
Why not just tell the backup-taker to take backups using 8K pages?
How? (No, I don't think tar's blocksize options control this
necessarily --- those indicate the blocking factor on the *tape*.
And not everyone uses tar anyway.)
Even if this would work for all popular backup programs, it seems
far too fragile: the consequence of forgetting the switch would be
silent data corruption, which you might not notice until the slave
had been in live operation for some time.
regards, tom lane
Quoting Tom Lane <tgl@sss.pgh.pa.us>:
I fear we still have to disable full_page_writes (force it ON) if
XLogArchivingActive is on. Comments?
Yeah - if you are enabling PITR, then you care about safety and integrity, so it
makes sense (well, to me anyway).
Cheers
Mark
* Tom Lane:
I think we had originally argued that there was no problem anyway
because the kernel should cause the page write to appear atomic to other
processes (since we issue it in a single write() command).
I doubt Linux makes any such guarantees. See this recent thread on
linux-kernel: <http://marc.theaimsgroup.com/?t=114489284200003>
On Friday, 2006-04-14 at 17:31, Tom Lane wrote:
Hannu Krosing <hannu@skype.net> writes:
On Friday, 2006-04-14 at 16:40, Tom Lane wrote:
If the backup-taker reads, say, 4K at a time then it's
certainly possible that it gets a later version of the second half of a
page than it got of the first half. I don't know about you, but I sure
don't feel comfortable making assumptions at that level about the
behavior of tar or cpio.
I fear we still have to disable full_page_writes (force it ON) if
XLogArchivingActive is on. Comments?
Why not just tell the backup-taker to take backups using 8K pages?
How?
use find + dd, or whatever. I just don't want it to be made universally
unavailable just because some users *might* use a file/disk-level
backup solution which is incompatible.
(No, I don't think tar's blocksize options control this
necessarily --- those indicate the blocking factor on the *tape*.
And not everyone uses tar anyway.)
If I'm desperate enough to get the 2x reduction of WAL writes, I may
even write my own backup solution.
Even if this would work for all popular backup programs, it seems
far too fragile: the consequence of forgetting the switch would be
silent data corruption, which you might not notice until the slave
had been in live operation for some time.
We may declare only one solution to be supported by us with
XLogArchivingActive, say a GNU tar modified to read in Nx8K blocks
(pg_tar :p).
I guess that even if we can control what the operating system does, it is
still possible to get a torn page using some SAN solution, where you can
freeze the image for backup independently of the OS.
----------------
Hannu
Tom Lane wrote:
Hannu Krosing <hannu@skype.net> writes:
On Friday, 2006-04-14 at 16:40, Tom Lane wrote:
If the backup-taker reads, say, 4K at a time then it's
certainly possible that it gets a later version of the second half of a
page than it got of the first half. I don't know about you, but I sure
don't feel comfortable making assumptions at that level about the
behavior of tar or cpio.
I fear we still have to disable full_page_writes (force it ON) if
XLogArchivingActive is on. Comments?
Why not just tell the backup-taker to take backups using 8K pages?
How? (No, I don't think tar's blocksize options control this
necessarily --- those indicate the blocking factor on the *tape*.
And not everyone uses tar anyway.)
Even if this would work for all popular backup programs, it seems
far too fragile: the consequence of forgetting the switch would be
silent data corruption, which you might not notice until the slave
had been in live operation for some time.
Yea, it is a problem. Even a 10k read is going to read 2k into the next
page.
I am thinking we should throw an error on pg_start_backup() and
pg_stop_backup if full_page_writes is off. Seems archive_command and
full_page_writes can still be used if we are not in the process of doing
a file system backup.
In fact, could we have pg_start_backup() turn on full_page_writes and
have pg_stop_backup() turn it off, if postgresql.conf has it off?
--
Bruce Momjian http://candle.pha.pa.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <pgman@candle.pha.pa.us> writes:
I am thinking we should throw an error on pg_start_backup() and
pg_stop_backup if full_page_writes is off.
No, we'll just change the test in xlog.c so that fullPageWrites is
ignored if XLogArchivingActive.
Seems archive_command and
full_page_writes can still be used if we are not in the process of doing
a file system backup.
Think harder: we are only safe if the first write to a given page after
it's mis-copied by the archiver is a full page write. The requirement
therefore continues after pg_stop_backup. Unless you want to add
infrastructure to keep track for *every page* in the DB of whether it's
been fully written since the last backup?
regards, tom lane
Hannu Krosing <hannu@skype.net> writes:
If I'm desperate enough to get the 2x reduction of WAL writes, I may
even write my own backup solution.
Given Florian's concern, sounds like you might have to write your own
kernel too. In which case, generating a variant build of Postgres
that allows full_page_writes to be disabled is certainly not beyond
your powers. But for the ordinary mortal DBA, I think this combination
is just too unsafe to even consider.
regards, tom lane
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
I am thinking we should throw an error on pg_start_backup() and
pg_stop_backup if full_page_writes is off.
No, we'll just change the test in xlog.c so that fullPageWrites is
ignored if XLogArchivingActive.
We should probably throw a LOG message too.
Seems archive_command and
full_page_writes can still be used if we are not in the process of doing
a file system backup.
Think harder: we are only safe if the first write to a given page after
it's mis-copied by the archiver is a full page write. The requirement
therefore continues after pg_stop_backup. Unless you want to add
infrastructure to keep track for *every page* in the DB of whether it's
been fully written since the last backup?
Ah, yea.
--
Bruce Momjian
On Saturday, 2006-04-15 at 11:49, Tom Lane wrote:
Hannu Krosing <hannu@skype.net> writes:
If I'm desperate enough to get the 2x reduction of WAL writes, I may
even write my own backup solution.
Given Florian's concern, sounds like you might have to write your own
kernel too. In which case, generating a variant build of Postgres
that allows full_page_writes to be disabled is certainly not beyond
your powers. But for the ordinary mortal DBA, I think this combination
is just too unsafe to even consider.
I guess that writing our own pg_tar, which cooperates with postgres
backends to get full pages, is still in the realm of possible things,
even on kernels which don't guarantee atomic visibility of write() calls.
But until such a tool is included in the distribution, it is a good idea indeed
to disallow full_page_writes=off when doing PITR.
--------------
Hannu
Hannu Krosing wrote:
On Saturday, 2006-04-15 at 11:49, Tom Lane wrote:
Hannu Krosing <hannu@skype.net> writes:
If I'm desperate enough to get the 2x reduction of WAL writes, I may
even write my own backup solution.
Given Florian's concern, sounds like you might have to write your own
kernel too. In which case, generating a variant build of Postgres
that allows full_page_writes to be disabled is certainly not beyond
your powers. But for the ordinary mortal DBA, I think this combination
is just too unsafe to even consider.
I guess that writing our own pg_tar, which cooperates with postgres
backends to get full pages, is still in the realm of possible things,
even on kernels which don't guarantee atomic visibility of write() calls.
But until such a tool is included in the distribution, it is a good idea indeed
to disallow full_page_writes=off when doing PITR.
The cost/benefit of that seems very discouraging. Most backup
applications allow for a block size to be specified, so it isn't
unreasonable to assume that people who really want PITR and
full_page_writes can easily set the block size to 8k. However, I don't
think we are going to allow that to be configured --- you would have to
hack up our backend code to allow the combination.
--
Bruce Momjian
On 4/16/06, Bruce Momjian <pgman@candle.pha.pa.us> wrote:
Hannu Krosing wrote:
I guess that writing our own pg_tar, which cooperates with postgres
backends to get full pages, is still in the realm of possible things,
even on kernels which don't guarantee atomic visibility of write() calls.
But until such a tool is included in the distribution, it is a good idea indeed
to disallow full_page_writes=off when doing PITR.
The cost/benefit of that seems very discouraging. Most backup
applications allow for a block size to be specified, so it isn't
unreasonable to assume that people who really want PITR and
full_page_writes can easily set the block size to 8k. However, I don't
think we are going to allow that to be configured --- you would have to
hack up our backend code to allow the combination.
The problem is that they allow configuring the _target_ block size,
not the reading block size. I did some tests with strace:
* GNU cpio version 2.5
allows changing only the output block size; the input block is 512
bytes. Maybe it uses the device's block size?
* tar (GNU tar) 1.15.1
the '-b' and '--record-size' options also change the input block
size, but to get 8192-byte output blocks, the first read is 7680
bytes to make room for the tar header. The rest of the reads are indeed 8192
bytes, but that won't help us anymore.
* cp (coreutils) 5.2.1
fixed block size of 4096 bytes.
* rsync version 2.6.5
it does not have a way to change the input block size, but it seems
that it reads in 32k blocks, or the full file if length < 32k.
So we should probably document that rsync is the only working solution.
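For comparison, the read pattern none of the tools above provide is simple to state. A hypothetical sketch of what a PG-aware copier (the pg_tar idea upthread) would do — function name invented, and note it still rests on the kernel-atomicity assumption Tom and Florian question:

```python
import os
import tempfile

BLCKSZ = 8192  # must match the server's page size

def copy_in_page_units(src_path, dst_path, blcksz=BLCKSZ):
    """Copy a relation file reading exactly one database page per read().

    Each read() then lines up with the single 8K write() the backend
    issues per page -- safe only if the kernel makes such writes
    atomically visible to concurrent readers, which is itself in doubt.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            page = src.read(blcksz)
            if not page:
                break
            dst.write(page)

# Demo: three pages of distinct content survive the page-unit copy intact.
src = tempfile.NamedTemporaryFile(delete=False)
src.write(b"A" * BLCKSZ + b"B" * BLCKSZ + b"C" * BLCKSZ)
src.close()
dst_path = src.name + ".copy"
copy_in_page_units(src.name, dst_path)
with open(dst_path, "rb") as f:
    copied = f.read()
os.unlink(src.name)
os.unlink(dst_path)
```

Even this only removes the mixed-read-size hazard; it does nothing about SAN-level snapshots or kernels that tear a write() internally.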
--
marko
On Sat, 2006-04-15 at 11:45 -0400, Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
I am thinking we should throw an error on pg_start_backup() and
pg_stop_backup if full_page_writes is off.
No, we'll just change the test in xlog.c so that fullPageWrites is
ignored if XLogArchivingActive.
I can see the danger of which you speak, but does it necessarily apply
to all forms of backup?
Seems archive_command and
full_page_writes can still be used if we are not in the process of doing
a file system backup.
Think harder: we are only safe if the first write to a given page after
it's mis-copied by the archiver is a full page write. The requirement
therefore continues after pg_stop_backup. Unless you want to add
infrastructure to keep track for *every page* in the DB of whether it's
been fully written since the last backup?
It seems that we should write an API to allow a backup device to ask for
blocks from the database.
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com/
"Marko Kreen" <markokr@gmail.com> writes:
So we should probably document that rsync is the only working solution.
No, we're just turning off the variable. One experiment on one version
of rsync doesn't prove it's "safe", even if there weren't the kernel-
behavior issue to consider.
regards, tom lane
On Sat, Apr 15, 2006 at 01:31:58PM +0300, Hannu Krosing wrote:
(No, I don't think tar's blocksize options control this
necessarily --- those indicate the blocking factor on the *tape*.
And not everyone uses tar anyway.)
If I'm desperate enough to get the 2x reduction of WAL writes, I may
even write my own backup solution.
I must be missing something obvious, but why don't we compress the
xlogs? They appear to be quite compressible (>75%) with standard gzip...
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.
On Sunday, 2006-04-16 at 11:31, Tom Lane wrote:
"Marko Kreen" <markokr@gmail.com> writes:
So we should probably document that rsync is the only working solution.
No, we're just turning off the variable. One experiment on one version
of rsync doesn't prove it's "safe", even if there weren't the kernel-
behavior issue to consider.
But if we do need to consider the kernel-level behaviour mentioned, then
the whole PITR thing becomes an impossibility. Consider the case when we
get a torn page during the initial copy with tar/cpio/rsync/whatever,
and no WAL record updates it.
In that case we will just have a torn page in backup with no way to fix
it.
-------------
Hannu
Hannu Krosing <hannu@skype.net> writes:
But if we do need to consider the kernel-level behaviour mentioned, then
the whole PITR thing becomes an impossibility. Consider the case when we
get a torn page during the initial copy with tar/cpio/rsync/whatever,
and no WAL record updates it.
The only way the backup program could read a torn page is if the
database is writing that page concurrently, in which case there must
be a WAL record for the action.
This was all thought through carefully when the PITR mechanism was
designed, and it is solid -- as long as we are doing full-page writes.
Unfortunately, certain people forced that feature into 8.1 without
adequate review of the system's assumptions ...
regards, tom lane
Simon Riggs <simon@2ndquadrant.com> writes:
On Sat, 2006-04-15 at 11:45 -0400, Tom Lane wrote:
No, we'll just change the test in xlog.c so that fullPageWrites is
ignored if XLogArchivingActive.
I can see the danger of which you speak, but does it necessarily apply
to all forms of backup?
No, but the problem is we're not sure which forms are safe; it appears
to depend on poorly-documented details of behavior of both the kernel
and the backup program --- details that might well vary from one version
to the next even of the "same" program. Given the variety of platforms
PG runs on, I can't see us expending the effort to try to monitor which
combinations it might be safe to not use full_page_writes with.
It seems that we should write an API to allow a backup device to ask for
blocks from the database.
I don't think we have the manpower or interest to develop and maintain
our own backup tool --- or tools, actually, as you'd at least want a tar
replacement and an rsync replacement. Oracle might be able to afford
to throw programmers at that sort of thing, but where are you going to
get volunteers for tasks as mind-numbing as maintaining a PG-specific
tar replacement?
regards, tom lane
Martijn van Oosterhout <kleptog@svana.org> writes:
I must be missing something obvious, but why don't we compress the
xlogs? They appear to be quite compressable (>75%) with standard gzip...
Might be worth experimenting with, but I'm a bit dubious. We've seen
several tests showing that XLogInsert's calculation of a CRC for each
WAL record is a bottleneck (that's why we backed off from 64-bit CRC
to 32-bit recently). I'd think that any nontrivial compression
algorithm would be vastly slower than CRC ...
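The speed comparison is easy to measure roughly. A back-of-envelope sketch using Python's zlib (the sample record is invented and repetitive, so its compression ratio flatters gzip; the >75% figure is Martijn's observation, not reproduced here, and timings are machine-dependent):

```python
import time
import zlib

# Fake WAL-like payload: repetitive text, so it compresses very well.
record = b"INSERT INTO t VALUES (42, 'hello');" * 256

t0 = time.perf_counter()
crc = zlib.crc32(record)           # the per-record CRC XLogInsert pays now
t_crc = time.perf_counter() - t0

t0 = time.perf_counter()
compressed = zlib.compress(record, 6)  # gzip-style deflate at default level
t_zip = time.perf_counter() - t0

print("saved: %.0f%%" % (100 * (1 - len(compressed) / len(record))))
print("compress took %.1fx the CRC time" % (t_zip / t_crc))
```

On typical hardware the deflate pass costs a large multiple of the CRC pass, which is the substance of the objection: the bandwidth savings are real, but the CPU cost lands directly on the WAL-insert path.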
regards, tom lane