Is pg_control file crashsafe?

Started by Alex Ignatov over 9 years ago, 18 messages
#1Alex Ignatov
a.ignatov@postgrespro.ru

Hello everyone!
We have an issue with a truncated pg_control file on Windows after a power failure. My questions are:
1) Is pg_control protected from, say, a power crash or a partial write?
2) How does PG update pg_control? By writing into it directly, or by writing to a temp file and then renaming it to pg_control so that the update is atomic?
3) Can PG keep multiple copies of pg_control to be more fault tolerant?

PS: During some experiments we found that there is currently no way to do crash recovery with a "restored" version of pg_control (based on some manipulations with pg_resetxlog). Only by running pg_resetxlog with its parameters set to values taken from the WAL file (via pg_xlogdump) can we at least start PG and see that its state is as of the last checkpoint. But we have no real confidence that PG is in a consistent state (the pg_resetxlog docs say as much).

Alex Ignatov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

#2Bruce Momjian
bruce@momjian.us
In reply to: Alex Ignatov (#1)
Re: Is pg_control file crashsafe?

On Thu, Apr 28, 2016 at 09:58:00PM +0000, Alex Ignatov wrote:

Hello everyone!
We have some issue with truncated pg_control file on Windows after power
failure.
My questions is :
1) Is pg_control protected from say , power crash or partial write?
2) How PG update pg_control? By writing in it or writing in some temp file and
after that rename it to pg_control to be atomic?

We write pg_control in one write() OS call:

if (write(fd, buffer, PG_CONTROL_SIZE) != PG_CONTROL_SIZE)

3) Can PG have multiple pg_control copy to be more fault tolerant?

PS During some experiments we found that at present time there is no any method
to do crash recovery with "restored" version of pg_control (based on some
manipulations with pg_resetxlog ).
Only by using pg_resetxlog and setting it parameters to values taken from wal
file (pg_xlogdump)we can at least start PG and saw that PG state is at the
moment of last check point. But we have no real confidence that PG is in
consistent state(also docs on pg_resetxlogs told us about it too)

We have talked about improving the reliability of pg_control, but
failures are so rare we have never done anything to improve it. I know
Tatsuo has talked about making pg_control more reliable, so I am CC'ing
him.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Alex Ignatov
a.ignatov@postgrespro.ru
In reply to: Bruce Momjian (#2)
Re: Is pg_control file crashsafe?

On 01.05.2016 0:55, Bruce Momjian wrote:

On Thu, Apr 28, 2016 at 09:58:00PM +0000, Alex Ignatov wrote:

Hello everyone!
We have some issue with truncated pg_control file on Windows after power
failure.
My questions is :
1) Is pg_control protected from say , power crash or partial write?
2) How PG update pg_control? By writing in it or writing in some temp file and
after that rename it to pg_control to be atomic?

We write pg_control in one write() OS call:

if (write(fd, buffer, PG_CONTROL_SIZE) != PG_CONTROL_SIZE)

3) Can PG have multiple pg_control copy to be more fault tolerant?

PS During some experiments we found that at present time there is no any method
to do crash recovery with "restored" version of pg_control (based on some
manipulations with pg_resetxlog ).
Only by using pg_resetxlog and setting it parameters to values taken from wal
file (pg_xlogdump)we can at least start PG and saw that PG state is at the
moment of last check point. But we have no real confidence that PG is in
consistent state(also docs on pg_resetxlogs told us about it too)

We have talked about improving the reliability of pg_control, but
failures are so rare we have never done anything to improve it. I know
Tatsuo has talked about making pg_control more reliable, so I am CC'ing
him.

Oh! Good. Thank you!
It is rare, but as we have now seen, it is our reality too. One of our customers
had this issue last week =)

I think that rename can help a little bit. At least on some FS it is
an atomic operation.

--
Alex Ignatov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alex Ignatov (#3)
Re: Is pg_control file crashsafe?

Alex Ignatov <a.ignatov@postgrespro.ru> writes:

I think that rename can help a little bit. At least on some FS it is
atomic operation.

Writing a single sector ought to be atomic too. I'm very skeptical that
it'll be an improvement to just move the risk from one filesystem
operation to another; especially not to one where there's not even a
terribly portable way to request fsync.

regards, tom lane


#5Andres Freund
andres@anarazel.de
In reply to: Alex Ignatov (#1)
Re: Is pg_control file crashsafe?

Hi,

On 2016-04-28 21:58:00 +0000, Alex Ignatov wrote:

We have some issue with truncated pg_control file on Windows after
power failure. My questions is: 1) Is pg_control protected from say,
power crash or partial write?

It should be. I think to make progress on this thread we're going to
need a bit more detail about the exact corruption. Did the length of
the file change? Did the checksum fail? Did you just observe too old
contents?

Greetings,

Andres Freund


#6Alex Ignatov
a.ignatov@postgrespro.ru
In reply to: Andres Freund (#5)
Re: Is pg_control file crashsafe?


On 03.05.2016 2:21, Andres Freund wrote:

Hi,

On 2016-04-28 21:58:00 +0000, Alex Ignatov wrote:

We have some issue with truncated pg_control file on Windows after
power failure.My questions is : 1) Is pg_control protected from say ,
power crash or partial write?

It should be. I think to make progress on this thread we're going to
need a bit more details about the exact corruption. Was the length of
the file change? Did the checksum fail? Did you just observe too old
contents?

Greetings,

Andres Freund

The length was 0 bytes after the crash. It was Windows with NTFS on SSDs in RAID 1.
The file was zeroed after the power loss.

Alex Ignatov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


#7Alex Ignatov
a.ignatov@postgrespro.ru
In reply to: Tom Lane (#4)
Re: Is pg_control file crashsafe?

On 03.05.2016 2:17, Tom Lane wrote:

Alex Ignatov <a.ignatov@postgrespro.ru> writes:

I think that rename can help a little bit. At least on some FS it is
atomic operation.

Writing a single sector ought to be atomic too. I'm very skeptical that
it'll be an improvement to just move the risk from one filesystem
operation to another; especially not to one where there's not even a
terribly portable way to request fsync.

regards, tom lane

pg_control is 8k long (I think that is the length of one page in the default PG
compile settings).
I also think that an 8k write can be atomic. Even if writing one
sector is atomic, nobody can say which sector of the 8k pg_control
record will be written first. It could be the last sector or, say, sector
number 10 of 16. That is why I mentioned renaming from a tmp file to
pg_control. Renaming in a FS is usually an atomic operation, so after power
loss we have either the old version of pg_control or the new one, but
not a torn pg_control file.
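
The temp-file-and-rename idea described above can be sketched roughly as follows. This is only an illustration under POSIX assumptions, not PostgreSQL code: replace_file_atomically and the ".tmp" suffix are hypothetical names, and the sketch does not fsync the containing directory, which is needed to make the rename itself durable and is the portability concern raised in #4. Behaviour on Windows/NTFS is not covered.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper, not PostgreSQL code: replace "path" with "len" bytes
 * from "buf" by writing a temporary file and renaming it over the target.
 * Returns 0 on success, -1 on failure.
 */
static int
replace_file_atomically(const char *path, const void *buf, size_t len)
{
    char    tmppath[1024];
    int     n;
    int     fd;

    assert(path != NULL && buf != NULL);

    n = snprintf(tmppath, sizeof(tmppath), "%s.tmp", path);
    if (n < 0 || (size_t) n >= sizeof(tmppath))
        return -1;

    fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* The new contents must reach stable storage before the rename. */
    if (write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        unlink(tmppath);
        return -1;
    }
    if (close(fd) != 0)
    {
        unlink(tmppath);
        return -1;
    }

    /* POSIX rename() replaces an existing target atomically. */
    if (rename(tmppath, path) != 0)
    {
        unlink(tmppath);
        return -1;
    }
    return 0;
}
```

A caller would pass the control file path and the full new contents; a reader then sees either the old file or the new one, provided the filesystem honours the rename semantics.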

Alex Ignatov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


#8Amit Kapila
amit.kapila16@gmail.com
In reply to: Alex Ignatov (#7)
Re: Is pg_control file crashsafe?

On Wed, May 4, 2016 at 4:02 PM, Alex Ignatov <a.ignatov@postgrespro.ru>
wrote:

On 03.05.2016 2:17, Tom Lane wrote:

Alex Ignatov <a.ignatov@postgrespro.ru> writes:

I think that rename can help a little bit. At least on some FS it is
atomic operation.

Writing a single sector ought to be atomic too. I'm very skeptical that
it'll be an improvement to just move the risk from one filesystem
operation to another; especially not to one where there's not even a
terribly portable way to request fsync.

regards, tom lane

pg_control is 8k long(i think it is legth of one page in default PG
compile settings).
I also think that 8k recording can be atomic. Even if recording of one
sector is atomic nobody can say about what sector from 8k record of
pg_control should be written first. It can be last sector or say sector
number 10 from 16.

The actual data written is always sizeof(ControlFileData) which should be
less than one sector. I think it is only possible that we get a torn write
for pg_control, if while writing + fsyncing, the filesystem maps that data
to different sectors.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#8)
Re: Is pg_control file crashsafe?

Amit Kapila <amit.kapila16@gmail.com> writes:

On Wed, May 4, 2016 at 4:02 PM, Alex Ignatov <a.ignatov@postgrespro.ru>
wrote:

On 03.05.2016 2:17, Tom Lane wrote:

Writing a single sector ought to be atomic too.

pg_control is 8k long(i think it is legth of one page in default PG
compile settings).

The actual data written is always sizeof(ControlFileData) which should be
less than one sector.

Yes. We don't care what happens to the rest of the file as long as the
first sector's worth is updated atomically. See the comments for
PG_CONTROL_SIZE and the code in ReadControlFile/WriteControlFile.
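
As a rough illustration of that layout (a small payload at the start of a zero-padded 8K file), here is a minimal sketch. It is not the actual WriteControlFile code: SketchControlData, the SKETCH_* constants and write_control_sketch are hypothetical, and the 512-byte sector size is an assumption.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define SKETCH_SECTOR_SIZE   512    /* assumed device sector size */
#define SKETCH_CONTROL_SIZE  8192   /* file size, mirroring the 8K discussed here */

/* Hypothetical stand-in for ControlFileData; the real struct has more fields. */
typedef struct SketchControlData
{
    unsigned long long checkpoint_location;
    unsigned int       state;
    unsigned int       crc;
} SketchControlData;

/* The meaningful payload has to fit into the first sector. */
_Static_assert(sizeof(SketchControlData) <= SKETCH_SECTOR_SIZE,
               "payload exceeds one sector");

/* Write the payload followed by zero padding, then fsync. Returns 0 or -1. */
static int
write_control_sketch(const char *path, const SketchControlData *data)
{
    char    buffer[SKETCH_CONTROL_SIZE];
    int     fd;

    assert(path != NULL && data != NULL);

    /* Zero-pad so the bytes after the payload are well defined. */
    memset(buffer, 0, sizeof(buffer));
    memcpy(buffer, data, sizeof(*data));

    /* No O_TRUNC: an existing file is overwritten in place. */
    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    if (write(fd, buffer, sizeof(buffer)) != (ssize_t) sizeof(buffer) ||
        fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Only the first sizeof(SketchControlData) bytes carry information in this sketch; the rest is padding, which is why a torn write beyond the first sector would not matter to a reader.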

We could change to a different PG_CONTROL_SIZE pretty easily, and there's
certainly room to argue that reducing it to 512 or 1024 would be more
efficient. I think the motivation for setting it at 8K was basically
"we're already assuming that 8K writes are efficient, so let's assume
it here too". But since the file is only written once per checkpoint,
efficiency is not really a key selling point anyway. If you could make
an argument that some other size would reduce the risk of failures,
it would be interesting --- but I suspect any such argument would be
very dependent on the quirks of a specific file system.

One point worth considering is that on most file systems, rewriting
a fraction of a page is *less* efficient than rewriting a full page,
because the kernel first has to read in the old contents to fill
the disk buffer it's going to partially overwrite with new data.
This motivates against trying to reduce the write size too much.

regards, tom lane


#10Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#9)
Re: Is pg_control file crashsafe?

On Wed, May 4, 2016 at 8:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

On Wed, May 4, 2016 at 4:02 PM, Alex Ignatov <a.ignatov@postgrespro.ru>
wrote:

On 03.05.2016 2:17, Tom Lane wrote:

Writing a single sector ought to be atomic too.

pg_control is 8k long(i think it is legth of one page in default PG
compile settings).

The actual data written is always sizeof(ControlFileData) which should be
less than one sector.

Yes. We don't care what happens to the rest of the file as long as the
first sector's worth is updated atomically. See the comments for
PG_CONTROL_SIZE and the code in ReadControlFile/WriteControlFile.

We could change to a different PG_CONTROL_SIZE pretty easily, and there's
certainly room to argue that reducing it to 512 or 1024 would be more
efficient. I think the motivation for setting it at 8K was basically
"we're already assuming that 8K writes are efficient, so let's assume
it here too". But since the file is only written once per checkpoint,
efficiency is not really a key selling point anyway. If you could make
an argument that some other size would reduce the risk of failures,
it would be interesting --- but I suspect any such argument would be
very dependent on the quirks of a specific file system.

How about using 512 bytes as a write size and perform direct writes rather
than going via OS buffer cache for control file? Alex, is the issue
reproducible (to ensure that if we try to solve it in some way, do we have
way to test it as well)?

One point worth considering is that on most file systems, rewriting
a fraction of a page is *less* efficient than rewriting a full page,
because the kernel first has to read in the old contents to fill
the disk buffer it's going to partially overwrite with new data.
This motivates against trying to reduce the write size too much.

Yes, you are very much right and I have observed that recently during my
work on WAL Re-Writes [1]. However, I think that won't be the issue if we
use direct writes for control file.

[1]: /messages/by-id/CAA4eK1+=O33dZZ=jBtjXBFyD67R5dLcqFyOMj4f-qmFXBP1OOQ@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#10)
Re: Is pg_control file crashsafe?

Amit Kapila <amit.kapila16@gmail.com> writes:

How about using 512 bytes as a write size and perform direct writes rather
than going via OS buffer cache for control file?

Wouldn't that fail outright under a lot of implementations of direct write;
ie the request needs to be page-aligned, for some not-very-determinate
value of page size?

To repeat, I'm pretty hesitant to change this logic. While this is not
the first report we've ever heard of loss of pg_control, I believe I could
count those reports without running out of fingers on one hand --- and
that's counting since the last century. It will take quite a lot of
evidence to convince me that some other implementation will be more
reliable. If you just come and present a patch to use direct write, or
rename, or anything else for that matter, I'm going to reject it out of
hand unless you provide very strong evidence that it's going to be more
reliable than the current code across all the systems we support.

regards, tom lane


#12Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Tom Lane (#11)
Re: Is pg_control file crashsafe?

On Thu, May 5, 2016 at 4:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

How about using 512 bytes as a write size and perform direct writes rather
than going via OS buffer cache for control file?

Wouldn't that fail outright under a lot of implementations of direct write;
ie the request needs to be page-aligned, for some not-very-determinate
value of page size?

To repeat, I'm pretty hesitant to change this logic. While this is not
the first report we've ever heard of loss of pg_control, I believe I could
count those reports without running out of fingers on one hand --- and
that's counting since the last century. It will take quite a lot of
evidence to convince me that some other implementation will be more
reliable. If you just come and present a patch to use direct write, or
rename, or anything else for that matter, I'm going to reject it out of
hand unless you provide very strong evidence that it's going to be more
reliable than the current code across all the systems we support.

I'm not sure how those ideas address the reported problem anyway: the
*length* was unexpectedly zero after a crash. UpdateControlFile
doesn't change the length of the control file, since it doesn't
specify O_TRUNC or O_APPEND and it always writes the same size. So it
seems like a pretty weird failure mode affecting filesystem metadata
(which I wouldn't expect to change anyway, but I would expect to be
journaled if it did), not a file-contents-atomicity problem. Whether
or not the page cache is involved in a write to a preallocated file
doesn't seem relevant to a case of unexpected truncation, and the
atomic rename trick doesn't seem relevant either unless someone with
expert knowledge of NTFS could explain how a crash could lead to
truncation in the first place, and how rename would help.
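
The point about the length can be checked with a small experiment. The sketch below is a hypothetical helper under POSIX assumptions, not UpdateControlFile itself: it opens an existing file without O_TRUNC or O_APPEND, overwrites a prefix in place, and reports the file length before and after. The length stays the same as long as the write does not extend past the existing end of file.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: overwrite the first "len" bytes of an existing file
 * in place and report the file length before and after the write.
 * Returns 0 on success, -1 on failure.
 */
static int
overwrite_in_place(const char *path, const void *buf, size_t len,
                   off_t *size_before, off_t *size_after)
{
    int     fd;

    assert(path != NULL && buf != NULL);

    /* Neither O_TRUNC nor O_APPEND: existing contents and length are kept. */
    fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    *size_before = lseek(fd, 0, SEEK_END);
    if (*size_before < 0 || lseek(fd, 0, SEEK_SET) != 0)
        goto fail;

    if (write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0)
        goto fail;

    *size_after = lseek(fd, 0, SEEK_END);
    if (*size_after < 0)
        goto fail;

    return close(fd);

fail:
    close(fd);
    return -1;
}
```

Run against a preallocated 8192-byte file, both sizes come back as 8192, which is why a zero-length file after a crash looks like a metadata problem rather than a torn write.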

--
Thomas Munro
http://www.enterprisedb.com


#13Amit Kapila
amit.kapila16@gmail.com
In reply to: Thomas Munro (#12)
Re: Is pg_control file crashsafe?

On Thu, May 5, 2016 at 11:52 AM, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:

On Thu, May 5, 2016 at 4:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

How about using 512 bytes as a write size and perform direct writes rather
than going via OS buffer cache for control file?

Wouldn't that fail outright under a lot of implementations of direct write;
ie the request needs to be page-aligned, for some not-very-determinate
value of page size?

Right, it should be at least page size.

To repeat, I'm pretty hesitant to change this logic. While this is not
the first report we've ever heard of loss of pg_control, I believe I could
count those reports without running out of fingers on one hand --- and
that's counting since the last century. It will take quite a lot of
evidence to convince me that some other implementation will be more
reliable. If you just come and present a patch to use direct write, or
rename, or anything else for that matter, I'm going to reject it out of
hand unless you provide very strong evidence that it's going to be more
reliable than the current code across all the systems we support.

I'm not sure how those ideas address the reported problem anyway: the
*length* was unexpectedly zero after a crash. UpdateControlFile
doesn't change the length of the control file, since it doesn't
specify O_TRUNC or O_APPEND and it always writes the same size. So it
seems like a pretty weird failure mode affecting filesystem metadata
(which I wouldn't expect to change anyway, but I would expect to be
journaled if it did), not a file-contents-atomicity problem. Whether
or not the page cache is involved in a write to a preallocated file
doesn't seem relevant to a case of unexpected truncation, and the
atomic rename trick doesn't seem relevant either unless someone with
expert knowledge of NTFS could explain how a crash could lead to
truncation in the first place, and how rename would help.

I think the real reason for the truncation is not known or not discussed here.
It seems to me that the ideas are being discussed on the mere speculation
that the current way of writing can lead to corruption in certain cases. I
think it would be better to first dig into the actual cause of the problem.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#14Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#11)
Re: Is pg_control file crashsafe?

On 2016-05-05 00:32:29 -0400, Tom Lane wrote:

To repeat, I'm pretty hesitant to change this logic. While this is not
the first report we've ever heard of loss of pg_control, I believe I could
count those reports without running out of fingers on one hand --- and
that's counting since the last century. It will take quite a lot of
evidence to convince me that some other implementation will be more
reliable. If you just come and present a patch to use direct write, or
rename, or anything else for that matter, I'm going to reject it out of
hand unless you provide very strong evidence that it's going to be more
reliable than the current code across all the systems we support.

https://lwn.net/SubscriberLink/686150/9697c313930fbe99/ :

"Jeff Moyer pointed out that sector tearing can happen on block devices
like SSDs, which is not what users expect. "
"Actually, what I said was that sector tearing doesn't usually happen on
SSDs due to the nature of the FTL. Traditional storage, however, never
guaranteed sector atomicity, but it usually does provide it."

FWIW, at the LSF/MM session Robert and I attended I talked to a Seagate
and a WD (IIRC) employee, and their answer echoed the second comment
from above. It's unlikely, but entirely possible to get torn sectors
after power outages. What's worse, if you get one it's entirely possible
that future *reads* will not just return torn contents, but an error.

Greetings,

Andres Freund


#15Greg Stark
stark@mit.edu
In reply to: Tom Lane (#11)
Re: Is pg_control file crashsafe?

On 5 May 2016 12:32 am, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

To repeat, I'm pretty hesitant to change this logic. While this is not
the first report we've ever heard of loss of pg_control, I believe I could
count those reports without running out of fingers on one hand --- and
that's counting since the last century. It will take quite a lot of
evidence to convince me that some other implementation will be more
reliable. If you just come and present a patch to use direct write, or
rename, or anything else for that matter, I'm going to reject it out of
hand unless you provide very strong evidence that it's going to be more
reliable than the current code across all the systems we support.

One thing we could do without much worry of being less reliable would be to
keep two copies of pg_control. Write one, fsync, then write to the other
and fsync that one.
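
A minimal sketch of that write ordering, assuming POSIX and two hypothetical file names supplied by the caller (write_and_sync and write_two_copies are illustrative names, not PostgreSQL functions):

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Overwrite "path" with "len" bytes from "buf" and fsync it. Returns 0 or -1. */
static int
write_and_sync(const char *path, const void *buf, size_t len)
{
    int     fd;

    assert(path != NULL && buf != NULL);

    fd = open(path, O_WRONLY | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    if (write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}

/*
 * Write the first copy and make it durable before touching the second one,
 * so that at most one copy is in the middle of an update at any moment.
 */
static int
write_two_copies(const char *primary, const char *secondary,
                 const void *buf, size_t len)
{
    if (write_and_sync(primary, buf, len) != 0)
        return -1;
    return write_and_sync(secondary, buf, len);
}
```

The sketch covers only the writer. A reader would still need some way, such as the checksum stored in the file, to decide which copy is valid after a crash; that part is not shown.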

Oracle keeps a copy of the old control file so that you can always go back
to an older version if a hardware or software bug corrupts it. But they
keep a lot more data in their control file and they can be quite large.

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Stark (#15)
Re: Is pg_control file crashsafe?

Greg Stark <stark@mit.edu> writes:

One thing we could do without much worry of being less reliable would be to
keep two copies of pg_control. Write one, fsync, then write to the other
and fsync that one.

Hmm, interesting thought. Without knowing more about the filesystem
problem that the OP had, it's hard to tell whether this would have saved
us; but in principle it sounds like it would be more reliable.

regards, tom lane


#17Alex Ignatov
a.ignatov@postgrespro.ru
In reply to: Greg Stark (#15)
Re: Is pg_control file crashsafe?

On 06.05.2016 0:42, Greg Stark wrote:

On 5 May 2016 12:32 am, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

To repeat, I'm pretty hesitant to change this logic. While this is not
the first report we've ever heard of loss of pg_control, I believe I could
count those reports without running out of fingers on one hand --- and
that's counting since the last century. It will take quite a lot of
evidence to convince me that some other implementation will be more
reliable. If you just come and present a patch to use direct write, or
rename, or anything else for that matter, I'm going to reject it out of
hand unless you provide very strong evidence that it's going to be more
reliable than the current code across all the systems we support.

One thing we could do without much worry of being less reliable would be
to keep two copies of pg_control. Write one, fsync, then write to the
other and fsync that one.

Oracle keeps a copy of the old control file so that you can always go
back to an older version if a hardware or software bug corrupts it. But
they keep a lot more data in their control file and they can be quite large.

Oracle can create more than one copy of the control file. The copies are
identical, not an old copy and a current one. Their advice is simply to
store these copies on separate storage to be more fault tolerant.

PS: By the way, in point 3 of my initial post about "is pg_control safe" I
wrote some thoughts about multiple copies of the pg_control file. Glad to see
that we share the same view on this issue.


#18Alex Ignatov
a.ignatov@postgrespro.ru
In reply to: Amit Kapila (#10)
Re: Is pg_control file crashsafe?

On 05.05.2016 7:16, Amit Kapila wrote:

On Wed, May 4, 2016 at 8:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

On Wed, May 4, 2016 at 4:02 PM, Alex Ignatov <a.ignatov@postgrespro.ru>
wrote:

On 03.05.2016 2:17, Tom Lane wrote:

Writing a single sector ought to be atomic too.

pg_control is 8k long(i think it is legth of one page in default PG
compile settings).

The actual data written is always sizeof(ControlFileData) which should be
less than one sector.

Yes. We don't care what happens to the rest of the file as long as the
first sector's worth is updated atomically. See the comments for
PG_CONTROL_SIZE and the code in ReadControlFile/WriteControlFile.

We could change to a different PG_CONTROL_SIZE pretty easily, and there's
certainly room to argue that reducing it to 512 or 1024 would be more
efficient. I think the motivation for setting it at 8K was basically
"we're already assuming that 8K writes are efficient, so let's assume
it here too". But since the file is only written once per checkpoint,
efficiency is not really a key selling point anyway. If you could make
an argument that some other size would reduce the risk of failures,
it would be interesting --- but I suspect any such argument would be
very dependent on the quirks of a specific file system.

How about using 512 bytes as a write size and perform direct writes
rather than going via OS buffer cache for control file? Alex, is the
issue reproducible (to ensure that if we try to solve it in some way, do
we have way to test it as well)?

One point worth considering is that on most file systems, rewriting
a fraction of a page is *less* efficient than rewriting a full page,
because the kernel first has to read in the old contents to fill
the disk buffer it's going to partially overwrite with new data.
This motivates against trying to reduce the write size too much.

Yes, you are very much right and I have observed that recently during my
work on WAL Re-Writes [1]. However, I think that won't be the issue if
we use direct writes for control file.

[1] -
/messages/by-id/CAA4eK1+=O33dZZ=jBtjXBFyD67R5dLcqFyOMj4f-qmFXBP1OOQ@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Hi!
No, the issue happened only once. Also, attempts to reproduce it have not
been successful so far.
