odd buildfarm failure - "pg_ctl: control file appears to be corrupt"
Hi,
My buildfarm animal grassquit just showed an odd failure [1] in REL_11_STABLE:
ok 10 - standby is in recovery
# Running: pg_ctl -D /mnt/resource/bf/build/grassquit/REL_11_STABLE/pgsql.build/src/bin/pg_ctl/tmp_check/t_003_promote_standby2_data/pgdata promote
waiting for server to promote....pg_ctl: control file appears to be corrupt
not ok 11 - pg_ctl promote of standby runs
# Failed test 'pg_ctl promote of standby runs'
# at /mnt/resource/bf/build/grassquit/REL_11_STABLE/pgsql.build/../pgsql/src/test/perl/TestLib.pm line 474.
I didn't find other references to this kind of failure, nor has the error
recurred on grassquit.
I don't immediately see a way for this message to be hit that doesn't indicate
a bug somewhere. We should be updating the control file in an atomic way and
reading it in an atomic way.
The failure has to be happening in wait_for_postmaster_promote(), because the
standby2 is actually successfully promoted.
Greetings,
Andres Freund
[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2022-11-22%2016%3A33%3A57
On Tue, Nov 22, 2022 at 05:42:24PM -0800, Andres Freund wrote:
The failure has to be happening in wait_for_postmaster_promote(), because the
standby2 is actually successfully promoted.
That's the one under -fsanitize=address. It really smells like
a bug with a race condition written all over it.
--
Michael
On 2022-Nov-22, Andres Freund wrote:
ok 10 - standby is in recovery
# Running: pg_ctl -D /mnt/resource/bf/build/grassquit/REL_11_STABLE/pgsql.build/src/bin/pg_ctl/tmp_check/t_003_promote_standby2_data/pgdata promote
waiting for server to promote....pg_ctl: control file appears to be corrupt
not ok 11 - pg_ctl promote of standby runs
# Failed test 'pg_ctl promote of standby runs'
# at /mnt/resource/bf/build/grassquit/REL_11_STABLE/pgsql.build/../pgsql/src/test/perl/TestLib.pm line 474.
This triggered me on this proposal I saw yesterday
/messages/by-id/02fe0063-bf77-90d0-3cf5-e9fe7c2a487b@postgrespro.ru
I think trying to store more stuff in pg_control is dangerous and we
should resist it.
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
On Wed, Nov 23, 2022 at 2:42 PM Andres Freund <andres@anarazel.de> wrote:
The failure has to be happening in wait_for_postmaster_promote(), because the
standby2 is actually successfully promoted.
I assume this is ext4. Presumably anything that reads the
controlfile, like pg_ctl, pg_checksums, pg_resetwal,
pg_control_system(), ... by reading without interlocking against
writes could see garbage. I have lost track of the versions and the
thread, but I worked out at some point by experimentation that this
only started relatively recently for concurrent read() and write(),
but always happened with concurrent pread() and pwrite(). The control
file uses the non-p variants which didn't mash old/new data like
grated cheese under concurrency due to some implementation detail, but
now does.
On Wed, Nov 23, 2022 at 11:03 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Wed, Nov 23, 2022 at 2:42 PM Andres Freund <andres@anarazel.de> wrote:
The failure has to be happening in wait_for_postmaster_promote(), because the
standby2 is actually successfully promoted.
I assume this is ext4. Presumably anything that reads the
controlfile, like pg_ctl, pg_checksums, pg_resetwal,
pg_control_system(), ... by reading without interlocking against
writes could see garbage. I have lost track of the versions and the
thread, but I worked out at some point by experimentation that this
only started relatively recently for concurrent read() and write(),
but always happened with concurrent pread() and pwrite(). The control
file uses the non-p variants which didn't mash old/new data like
grated cheese under concurrency due to some implementation detail, but
now does.
As for what to do about it, some ideas:
1. Use advisory range locking. (This would be an advisory version of
what many other filesystems do automatically, AFAIK. Does Windows
have a thing like POSIX file locking, or need it here?)
2. Retry after a short time on checksum failure. The probability is
already minuscule, and becomes pretty close to 0 if we read thrice
100ms apart.
3. Some scheme that involves renaming the file into place. (That
might be a pain on Windows; it only works for the relmap thing because
all readers and writers are in the backend and use an LWLock to avoid
silly handle semantics.)
4. ???
First thought is that 2 is the appropriate level of complexity for this
rare and stupid problem.
Thomas Munro <thomas.munro@gmail.com> writes:
On Wed, Nov 23, 2022 at 11:03 PM Thomas Munro <thomas.munro@gmail.com> wrote:
I assume this is ext4. Presumably anything that reads the
controlfile, like pg_ctl, pg_checksums, pg_resetwal,
pg_control_system(), ... by reading without interlocking against
writes could see garbage. I have lost track of the versions and the
thread, but I worked out at some point by experimentation that this
only started relatively recently for concurrent read() and write(),
but always happened with concurrent pread() and pwrite(). The control
file uses the non-p variants which didn't mash old/new data like
grated cheese under concurrency due to some implementation detail, but
now does.
Ugh.
As for what to do about it, some ideas:
2. Retry after a short time on checksum failure. The probability is
already minuscule, and becomes pretty close to 0 if we read thrice
100ms apart.
First thought is that 2 is the appropriate level of complexity for this
rare and stupid problem.
Yeah, I was thinking the same. A variant could be "repeat until
we see the same calculated checksum twice".
regards, tom lane
On Thu, Nov 24, 2022 at 11:05 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
On Wed, Nov 23, 2022 at 11:03 PM Thomas Munro <thomas.munro@gmail.com> wrote:
As for what to do about it, some ideas:
2. Retry after a short time on checksum failure. The probability is
already minuscule, and becomes pretty close to 0 if we read thrice
100ms apart.
First thought is that 2 is the appropriate level of complexity for this
rare and stupid problem.
Yeah, I was thinking the same. A variant could be "repeat until
we see the same calculated checksum twice".
Hmm. While writing a comment to explain why that's good enough, I
realised it's not really true for a standby that control file writes
are always expected to be far apart in time. XLogFlush->
UpdateMinRecoveryPoint() could coincide badly with our N attempts for
any small N and for any nap time, which I think makes your idea better
than mine.
With some cartoon-level understanding of what's going on (to wit: I
think the kernel just pins the page but doesn't use a page-level
content lock or range lock, so what you're seeing is raw racing memcpy
calls and unsynchronised cache line effects), I guess you'd be fairly
likely to make "progress" in seeing more new data even if you didn't
sleep in between, but who knows. So I have a 10ms sleep to make
progress very likely; given your algorithm it doesn't matter if you
didn't make all the progress, just some. Since this is reachable from
SQL, I think we also need a CFI call so you can't get uninterruptibly
stuck here?
I wrote a stupid throw-away function to force a write. If you have an
ext4 system to hand (xfs, zfs, apfs, ufs, others don't suffer from
this) you can do:
do $$ begin for i in 1..100000000 loop perform
pg_update_control_file(); end loop; end; $$;
... while you also do:
select pg_control_system();
\watch 0.001
... and you'll soon see:
ERROR: calculated CRC checksum does not match value stored in file
The attached draft patch fixes it.
Attachments:
0001-XXX-Dirty-hack-to-clobber-control-file-for-testing.patch (text/x-patch; +18 −1)
0002-Try-to-tolerate-concurrent-reads-and-writes-of-contr.patch (text/x-patch; +25 −1)
On Thu, Nov 24, 2022 at 2:02 PM Thomas Munro <thomas.munro@gmail.com> wrote:
... and you'll soon see:
ERROR: calculated CRC checksum does not match value stored in file
I forgot to mention: this reproducer only seems to work if fsync =
off. I don't know why, but I recall that was true also for bug
#17064.
Hello!
On 24.11.2022 04:02, Thomas Munro wrote:
On Thu, Nov 24, 2022 at 11:05 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
ERROR: calculated CRC checksum does not match value stored in file
The attached draft patch fixes it.
I tried to catch this error on my PC, but failed to do so within a reasonable time.
The 1s interval is probably too long for me.
It seems there is a more reliable way to reproduce this bug with the 0001 patch applied:
At the first backend:
do $$ begin loop perform pg_update_control_file(); end loop; end; $$;
At the second one:
do $$ begin loop perform pg_control_system(); end loop; end; $$;
It fails almost immediately with:
"ERROR: calculated CRC checksum does not match value stored in file"
both with fsync = off and fsync = on.
Checked it out for master and REL_11_STABLE.
I also checked for a few hours that patch 0002 fixes this error,
but I have some questions about its logical structure.
The equality between the previous and newly calculated CRC is checked only
if the last CRC calculation was wrong, i.e. not equal to the value stored in the file.
It is very unlikely that in this case the previous and new CRC can match, so, in fact,
the loop will spin until the CRC is calculated correctly. In other words,
this algorithm is practically equivalent to an infinite loop of reading from the file
and calculating the CRC while (EQ_CRC32C(crc, ControlFile->crc) != true).
But then it can be simplified significantly by removing the checksum equality checks
and the bool first_try, and by limiting the maximum number of iterations
with some constant, e.g. in a for loop, to avoid a theoretically possible freeze.
Or maybe use the variant suggested by Tom Lane, i.e., as far as I understand,
repeat the file_open-read-close-calculate_crc sequence twice without a pause between
them and check both calculated CRCs for equality. If they match, exit and return
the bool result of comparing the last calculation with the value from the file;
if not, take a pause and repeat everything from the beginning.
In this case, there is no need to check *crc_ok_p inside get_controlfile()
as in the present version; I think it's more logically correct,
since this variable is intended for top-level functions and the logic
inside get_controlfile() should not depend on its value.
I also found a warning in the 0001 patch for the master branch. On my PC gcc gives:
xlog.c:2507:1: warning: no previous prototype for ‘pg_update_control_file’ [-Wmissing-prototypes]
2507 | pg_update_control_file()
I fixed it by adding #include "utils/fmgrprotos.h" to xlog.c and
PG_FUNCTION_ARGS to pg_update_control_file().
With the best wishes,
--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Tue, Jan 31, 2023 at 2:10 PM Anton A. Melnikov <aamelnikov@inbox.ru> wrote:
Also checked for a few hours that patch 0002 fixes this error,
but there are some questions about its logical structure.
Hi Anton,
Thanks for looking!
The equality between the previous and newly calculated CRC is checked only
if the last CRC calculation was wrong, i.e. not equal to the value stored in the file.
It is very unlikely that in this case the previous and new CRC can match, so, in fact,
the loop will spin until the CRC is calculated correctly. In other words,
this algorithm is practically equivalent to an infinite loop of reading from the file
and calculating the CRC while (EQ_CRC32C(crc, ControlFile->crc) != true).
Maybe it's unlikely that two samples will match while running that
torture test, because it's overwriting the file as fast as it can.
But the idea is that a real system isn't writing the control file most
of the time.
But then it can be simplified significantly by removing the checksum equality checks
and the bool first_try, and by limiting the maximum number of iterations
with some constant, e.g. in a for loop, to avoid a theoretically possible freeze.
Yeah, I was thinking that we should also put a limit on the loop, just
to be cautious.
Primary servers write the control file very infrequently. Standbys do it
more frequently, while writing data out, maybe every few seconds on a
busy system writing out a lot of data. UpdateMinRecoveryPoint() makes
some effort to avoid updating the file too often. You definitely see
bursts of repeated flushes that might send this thing in a loop for a
while if the timings were exactly wrong, but usually with periodic
gaps; I haven't really studied the expected behaviour too closely.
Or maybe use the variant suggested by Tom Lane, i.e., as far as I understand,
repeat the file_open-read-close-calculate_crc sequence twice without a pause between
them and check both calculated CRCs for equality. If they match, exit and return
the bool result of comparing the last calculation with the value from the file;
if not, take a pause and repeat everything from the beginning.
Hmm. Would it be good enough to do two read() calls with no sleep in
between? How sure are we that a concurrent write will manage to
change at least one bit that our second read can see? I guess it's
likely, but maybe hypervisors, preemptible kernels, I/O interrupts or
a glitch in the matrix could decrease our luck? I really don't know.
So I figured I should add a small sleep between the reads to change
the odds in our favour. But we don't want to slow down all reads of
the control file with a sleep, do we? So I thought we should only
bother doing this slow stuff if the actual CRC check fails, a low
probability event.
Clearly there is an element of speculation or superstition here. I
don't know what else to do if both PostgreSQL and ext4 decided not to
add interlocking. Maybe we should rethink that. How bad would it
really be if control file access used POSIX file locking? I mean, the
writer is going to *fsync* the file, so it's not like one more wafer
thin system call is going to hurt too much.
On Tue, Jan 31, 2023 at 5:09 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Clearly there is an element of speculation or superstition here. I
don't know what else to do if both PostgreSQL and ext4 decided not to
add interlocking. Maybe we should rethink that. How bad would it
really be if control file access used POSIX file locking? I mean, the
writer is going to *fsync* the file, so it's not like one more wafer
thin system call is going to hurt too much.
Here's an experimental patch for that alternative. I wonder if
someone would want to be able to turn it off for some reason -- maybe
some NFS problem? It's less back-patchable, but maybe more
principled?
I don't know if Windows suffers from this type of problem.
Unfortunately its equivalent functionality LockFile() looks non-ideal
for this purpose: if your program crashes, the documentation is very
vague on when exactly it is released by the OS, but it's not
immediately on process exit. That seems non-ideal for a control file
you might want to access again very soon after a crash, to be able to
recover.
A thought in passing: if UpdateMinRecoveryPoint() performance is an
issue, maybe we should figure out how to use fdatasync() instead of
fsync().
Attachments:
0001-Lock-pg_control-while-reading-or-writing.patch (text/x-patch; +69 −1)
Hi, Thomas!
There are two variants of the patch now.
1) As for the first workaround:
On 31.01.2023 07:09, Thomas Munro wrote:
Maybe it's unlikely that two samples will match while running that
torture test, because it's overwriting the file as fast as it can.
But the idea is that a real system isn't writing the control file most
of the time.
........
Yeah, I was thinking that we should also put a limit on the loop, just
to be cautious.
At first I didn't understand that the equality condition between the previously
calculated CRC and the current one on the second and later attempts was intended
for the case when the pg_control file is really corrupted.
Indeed, by adding a few debugging variables and running the torture test,
I found that there were ~4000 CRC errors (~0.0003%) in ~125 million reads from the file,
and there was no case where the CRC error appeared twice in a row.
So a second, let alone a third, successive occurrence of a CRC error
can be neglected, and for this workaround a simpler and maybe clearer
algorithm seems possible.
For instance:
for (try = 0; try < 3; try++)
{
    open, read from and close pg_control;
    calculate crc;
    *crc_ok_p = EQ_CRC32C(crc, ControlFile->crc);
    if (*crc_ok_p)
        break;
}
2) As for the second variant of the patch with POSIX locks:
On 31.01.2023 14:38, Thomas Munro wrote:
On Tue, Jan 31, 2023 at 5:09 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Clearly there is an element of speculation or superstition here. I
don't know what else to do if both PostgreSQL and ext4 decided not to
add interlocking. Maybe we should rethink that. How bad would it
really be if control file access used POSIX file locking? I mean, the
writer is going to *fsync* the file, so it's not like one more wafer
thin system call is going to hurt too much.
Here's an experimental patch for that alternative. I wonder if
someone would want to be able to turn it off for some reason -- maybe
some NFS problem? It's less back-patchable, but maybe more
principled?
It looks very strange to me that there are cases where the cluster data
is stored on NFS. I can't figure out how this can be useful.
I think this variant of the patch is a proper solution
to the problem, not a workaround. I found no problems on Linux.
+1 for this variant.
Might add a custom error message for EDEADLK,
since it is absent from errcode_for_file_access()?
I don't know if Windows suffers from this type of problem.
Unfortunately its equivalent functionality LockFile() looks non-ideal
for this purpose: if your program crashes, the documentation is very
vague on when exactly it is released by the OS, but it's not
immediately on process exit. That seems non-ideal for a control file
you might want to access again very soon after a crash, to be able to
recover.
Unfortunately I've not had time to reproduce the problem and apply this patch on
Windows yet, but I'm going to do it soon on Windows 10. If a CRC error
occurs there, then we might use the workaround from the first
variant of the patch.
A thought in passing: if UpdateMinRecoveryPoint() performance is an
issue, maybe we should figure out how to use fdatasync() instead of
fsync().
Maybe choose it in accordance with the GUC wal_sync_method?
Sincerely yours,
--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Wed, Feb 1, 2023 at 5:04 PM Anton A. Melnikov <aamelnikov@inbox.ru> wrote:
On 31.01.2023 14:38, Thomas Munro wrote:
Here's an experimental patch for that alternative. I wonder if
someone would want to be able to turn it off for some reason -- maybe
some NFS problem? It's less back-patchable, but maybe more
principled?
It looks very strange to me that there are cases where the cluster data
is stored on NFS. I can't figure out how this can be useful.
Heh. There are many interesting failure modes, but people do it. I
guess my more general question when introducing any new system call
into this code is how some unusual system I can't even access is going
to break. Maybe some obscure filesystem will fail with EOPNOTSUPP, or
take 5 seconds and then fail because there is no lock server
configured or whatever, so that's why I don't think we can back-patch
it, and we probably need a way to turn it off.
I think this variant of the patch is a proper solution
to the problem, not a workaround. I found no problems on Linux.
+1 for this variant.
I prefer it too.
Might add a custom error message for EDEADLK,
since it is absent from errcode_for_file_access()?
Ah, good thought. I think it shouldn't happen™, so it's OK that
errcode_for_file_access() would classify it as ERRCODE_INTERNAL_ERROR.
Other interesting errors are:
ENOLCK: system limits exceeded; PANIC seems reasonable
EOPNOTSUPP: this file doesn't support locking (seen on FreeBSD man
pages, not on POSIX)
I don't know if Windows suffers from this type of problem.
Unfortunately its equivalent functionality LockFile() looks non-ideal
for this purpose: if your program crashes, the documentation is very
vague on when exactly it is released by the OS, but it's not
immediately on process exit. That seems non-ideal for a control file
you might want to access again very soon after a crash, to be able to
recover.
Unfortunately I've not had time to reproduce the problem and apply this patch on
Windows yet, but I'm going to do it soon on Windows 10. If a CRC error
occurs there, then we might use the workaround from the first
variant of the patch.
Thank you for investigating. I am afraid to read your results.
One idea would be to proceed with LockFile() for Windows, with a note
suggesting you file a bug with your OS vendor if you ever need it to
get unstuck. Googling this subject, I read that MongoDB used to
suffer from stuck lock files, until an OS bug report led to recent
versions releasing locks more promptly. I find that sufficiently
scary that I would want to default the feature to off on Windows, even
if your testing shows that it does really need it.
A thought in passing: if UpdateMinRecoveryPoint() performance is an
issue, maybe we should figure out how to use fdatasync() instead of
fsync().
Maybe choose it in accordance with the GUC wal_sync_method?
Here's a patch like that. I don't know if it's a good idea for
wal_sync_method to affect other kinds of files or not, but, then, it
already does (fsync_writethrough changes this behaviour). I would
only want to consider this if we also stop choosing "open_datasync" as
a default on at least Windows.
Attachments:
0001-Apply-wal_sync_method-to-pg_control-file.patch (text/x-patch; +70 −16)
Hi, Thomas!
Thanks for your rapid answer and sorry for my delay with reply.
On 01.02.2023 09:45, Thomas Munro wrote:
Might add a custom error message for EDEADLK
since it is absent from errcode_for_file_access()?
Ah, good thought. I think it shouldn't happen™, so it's OK that
errcode_for_file_access() would classify it as ERRCODE_INTERNAL_ERROR.
Yes, I also think that is impossible since the lock is taken on
the entire file, so ERRCODE_INTERNAL_ERROR will be right here.
Other interesting errors are:
ENOLCK: system limits exceeded; PANIC seems reasonable
EOPNOTSUPP: this file doesn't support locking (seen on FreeBSD man
pages, not on POSIX)
Agreed that ENOLCK is a PANIC, or at least FATAL. Maybe it's even better
to make it FATAL to allow other backends to survive?
As for EOPNOTSUPP, maybe fall back to the workaround from the
first variant of the patch? (In my previous letter I forgot the pause
after the break;, of course.)
I don't know if Windows suffers from this type of problem.
Unfortunately its equivalent functionality LockFile() looks non-ideal
for this purpose: if your program crashes, the documentation is very
vague on when exactly it is released by the OS, but it's not
immediately on process exit. That seems non-ideal for a control file
you might want to access again very soon after a crash, to be able to
recover.
Unfortunately I've not had time to reproduce the problem and apply this patch on
Windows yet, but I'm going to do it soon on Windows 10. If a CRC error
occurs there, then we might use the workaround from the first
variant of the patch.
Thank you for investigating. I am afraid to read your results.
At first it seemed to me that this is not a problem at all, since MSDN
guarantees sector-by-sector atomicity:
"Physical Sector: The unit for which read and write operations to the device
are completed in a single operation. This is the unit of atomic write..."
https://learn.microsoft.com/en-us/windows/win32/fileio/file-buffering
https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-writefile
(Of course, only if the 512 bytes lie at the beginning of the file at zero
offset, but this is our case. The current size of ControlFileData is
296 bytes at offset = 0.)
I tried to verify this fact experimentally and was very surprised.
Unfortunately it crashed in an hour during torture test:
2023-02-13 15:10:52.675 MSK [5704] LOG: starting PostgreSQL 16devel, compiled by Visual C++ build 1929, 64-bit
2023-02-13 15:10:52.768 MSK [5704] LOG: database system is ready to accept connections
@@@@@@ sizeof(ControlFileData) = 296
.........
2023-02-13 16:10:41.997 MSK [9380] ERROR: calculated CRC checksum does not match value stored in file
But fortunately, this only happens when fsync=off.
I also did several experiments with fsync=on and found more appropriate behavior:
the stress test with sizeof(ControlFileData) = 512+8 = 520 bytes failed after 4.5 hours,
but the other one, with the ordinary sizeof(ControlFileData) = 296, did not crash in more than 12 hours.
It seems in that case the behavior corresponds to MSDN. So if it were possible
to use fsync() under Windows when the GUC fsync is off, it might be a solution
to this problem. If so, there would be no need to lock the pg_control file under Windows at all.
May be choose it in accordance with GUC wal_sync_method?
Here's a patch like that. I don't know if it's a good idea for
wal_sync_method to affect other kinds of files or not, but, then, it
already does (fsync_writethrough changes this behaviour).
+1. Looks like it needs a little fix:
+++ b/src/common/controldata_utils.c
@@ -316,7 +316,7 @@ update_controlfile(const char *DataDir,
if (pg_fsync(fd) != 0)
ereport(PANIC,
(errcode_for_file_access(),
- errmsg("could not fdatasync file \"%s\": %m",
+ errmsg("could not fsync file \"%s\": %m",
ControlFilePath)));
And it may be combined with 0001-Lock-pg_control-while-reading-or-writing.patch
I would
only want to consider this if we also stop choosing "open_datasync" as
a default on at least Windows.
I didn't quite understand this point. Could you clarify it for me, please? If the performance
of UpdateMinRecoveryPoint() weren't a problem, we could just use fsync on all platforms.
Sincerely yours,
--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi, Thomas!
On 14.02.2023 06:38, Anton A. Melnikov wrote:
I also did several experiments with fsync=on and found more appropriate behavior:
the stress test with sizeof(ControlFileData) = 512+8 = 520 bytes failed after 4.5 hours,
but the other one, with the ordinary sizeof(ControlFileData) = 296, did not crash in more than 12 hours.
Nonetheless it crashed after 18 hours:
2023-02-13 18:07:21.476 MSK [7640] LOG: starting PostgreSQL 16devel, compiled by Visual C++ build 1929, 64-bit
2023-02-13 18:07:21.483 MSK [7640] LOG: listening on IPv6 address "::1", port 5432
2023-02-13 18:07:21.483 MSK [7640] LOG: listening on IPv4 address "127.0.0.1", port 5432
2023-02-13 18:07:21.556 MSK [1940] LOG: database system was shut down at 2023-02-13 18:07:12 MSK
2023-02-13 18:07:21.590 MSK [7640] LOG: database system is ready to accept connections
@@@@@@@@@@@@@ sizeof(ControlFileData) = 296
2023-02-13 18:12:21.545 MSK [9532] LOG: checkpoint starting: time
2023-02-13 18:12:21.583 MSK [9532] LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.003 s, sync=0.009 s, total=0.038 s; sync files=2, longest=0.005 s, average=0.005 s; distance=0 kB, estimate=0 kB; lsn=0/17AC388, redo lsn=0/17AC350
2023-02-14 12:12:21.738 MSK [8676] ERROR: calculated CRC checksum does not match value stored in file
2023-02-14 12:12:21.738 MSK [8676] CONTEXT: SQL statement "SELECT pg_control_system()"
PL/pgSQL function inline_code_block line 1 at PERFORM
2023-02-14 12:12:21.738 MSK [8676] STATEMENT: do $$ begin loop perform pg_control_system(); end loop; end; $$;
So all of the following is incorrect:
It seems in that case the behavior corresponds to MSDN. So if it were possible
to use fsync() under Windows when the GUC fsync is off, it might be a solution
to this problem. If so, there would be no need to lock the pg_control file under Windows at all.
and cannot be a solution.
Sincerely yours,
--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Tue, Feb 14, 2023 at 4:38 PM Anton A. Melnikov <aamelnikov@inbox.ru> wrote:
At first it seemed to me that this is not a problem at all, since MSDN
guarantees sector-by-sector atomicity:
"Physical Sector: The unit for which read and write operations to the device
are completed in a single operation. This is the unit of atomic write..."
https://learn.microsoft.com/en-us/windows/win32/fileio/file-buffering
https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-writefile
(Of course, only if the 512 bytes lie at the beginning of the file at zero
offset, but this is our case. The current size of ControlFileData is
296 bytes at offset = 0.)
There are two kinds of atomicity that we rely on for the control file today:
* atomicity on power loss (= device property, in case of overwrite filesystems)
* atomicity of concurrent reads and writes (= VFS or kernel buffer
pool interlocking policy)
I assume that documentation is talking about the first thing (BTW, I
suspect that documentation is also now wrong in one special case: NTFS
filesystems mounted on DAX devices don't have sectors or sector-based
atomicity unless you turn on BTT and slow them down [1]; that might
eventually be something for PostgreSQL to think about, and it applies
to other OSes too).
With this patch we would stop relying on the second thing. Supposedly
POSIX requires read/write atomicity, and many file systems offer it in
a strong form (internal range locking) or maybe a weak accidental form
(page-level locking). Since some extremely popular implementations
just don't care, and Windows isn't even a POSIX, we just have to do it
ourselves.
BTW there are at least two other places where PostgreSQL already knows
that concurrent reads and writes are possibly non-atomic (and we also
don't even try to get the alignment right, making the question moot):
pg_basebackup, which enables full_page_writes implicitly if you didn't
have the GUC on already, and pg_rewind, which refuses to run if you
haven't enabled full_page_writes explicitly (as recently discussed on
another thread; that's an annoying difference, and also an
annoying behaviour if you know your storage doesn't really need it!)
I tried to verify this fact experimentally and was very surprised.
Unfortunately it crashed in an hour during torture test:
2023-02-13 15:10:52.675 MSK [5704] LOG: starting PostgreSQL 16devel, compiled by Visual C++ build 1929, 64-bit
2023-02-13 15:10:52.768 MSK [5704] LOG: database system is ready to accept connections
@@@@@@ sizeof(ControlFileData) = 296
.........
2023-02-13 16:10:41.997 MSK [9380] ERROR: calculated CRC checksum does not match value stored in file
Thanks. I'd seen reports of this in discussions on the 'net, but
those had no authoritative references or supporting test results. The
fact that fsync made it take longer (in your following email) makes
sense and matches Linux. It inserts a massive pause in the middle of
the experiment loop, affecting the probabilities.
Therefore, we need a solution for Windows too. I tried to write the
equivalent code, in the attached. I learned a few things: Windows
locks are mandatory (that is, if you lock a file, reads/writes can
fail, unlike Unix). Combined with async release, that means we
probably also need to lock the file in xlog.c, when reading it in
xlog.c:ReadControlFile() (see comments). Before I added that, on a CI
run, I found that the read in there would sometimes fail, and adding
the locking fixed that. I am a bit confused, though, because I
expected that to be necessary only if someone managed to crash while
holding the write lock, which the CI tests shouldn't do. Do you have
any ideas?
While contemplating what else a mandatory file lock might break, I
remembered that basebackup.c also reads the control file. Hrmph. Not
addressed yet; I guess it might need to acquire/release around
sendFile(sink, XLOG_CONTROL_FILE, ...)?
I would
only want to consider this if we also stop choosing "open_datasync" as
a default on at least Windows.
I didn't quite understand this point. Could you clarify it for me, please? If the performance
of UpdateMinRecoveryPoint() wasn't a problem, we could just use fsync on all platforms.
The level of durability would be weakened on Windows. On all systems
except Linux and FreeBSD, we default to wal_sync_method=open_datasync,
so then we would start using FILE_FLAG_WRITE_THROUGH for the control
file too. We know from reading and experimentation that
FILE_FLAG_WRITE_THROUGH doesn't flush the drive cache on consumer
Windows hardware, but its fdatasync-like thing does[2]. I have not
thought too hard about the consequences of using different durability
levels for different categories of file, but messing with write
ordering can't be good for crash recovery, so I'd rather increase WAL
durability than decrease control file durability. If a Windows user
complains that it makes their fancy non-volatile cache slow down, they
can always adjust the settings in PostgreSQL, their OS, or their
drivers etc. I think we should just make fdatasync the default on all
systems.
Here's a patch like that.
[1]: https://learn.microsoft.com/en-us/windows-server/storage/storage-spaces/persistent-memory-direct-access
[2]: /messages/by-id/CA+hUKG+a-7r4GpADsasCnuDBiqC1c31DAQQco2FayVtB9V3sQw@mail.gmail.com
Attachments:
0001-Lock-pg_control-while-reading-or-writing.patch (+151 -1)
0002-Make-wal_sync_method-fdatasync-the-default-on-all-OS.patch (+9 -38)
0003-Apply-wal_sync_method-to-pg_control-file.patch (+70 -16)
On Fri, Feb 17, 2023 at 4:21 PM Thomas Munro <thomas.munro@gmail.com> wrote:
While contemplating what else a mandatory file lock might break, I
remembered that basebackup.c also reads the control file. Hrmph. Not
addressed yet; I guess it might need to acquire/release around
sendFile(sink, XLOG_CONTROL_FILE, ...)?
If we go this way, I suppose, in theory at least, someone with
external pg_backup_start()-based tools might also want to hold the
lock while copying pg_control. Otherwise they might fail to open it
on Windows (where that patch uses a mandatory lock) or copy garbage on
Linux (as they can today, I assume), with non-zero probability -- at
least when copying files from a hot standby. Or backup tools might
want to get the file contents through some entirely different
mechanism that does the right interlocking (whatever that might be,
maybe inside the server). Perhaps this is not so much the localised
systems programming curiosity I thought it was, and has implications
that'd need to be part of the documented low-level backup steps. It
makes me like the idea a bit less. It'd be good to hear from backup
gurus what they think about that.
One cute hack I thought of to make the file lock effectively advisory
on Windows is to lock a byte range *past the end* of the file (the
documentation says you can do that). That shouldn't stop programs
that want to read the file without locking and don't know/care about
our scheme (that is, pre-existing backup tools that haven't considered
this problem and remain oblivious or accept the (low) risk of torn
reads), but will block other participants in our scheme.
If we went back to the keep-rereading-until-it-stops-changing model,
then an external backup tool would need to be prepared to do that too,
in theory at least. Maybe some already do something like that?
Or maybe the problem is/was too theoretical before...
Hi, Thomas!
On 17.02.2023 06:21, Thomas Munro wrote:
There are two kinds of atomicity that we rely on for the control file today:
* atomicity on power loss (= device property, in case of overwrite filesystems)
* atomicity of concurrent reads and writes (= VFS or kernel buffer
pool interlocking policy)
I assume that documentation is talking about the first thing
I think this is true, as the documentation speaks only about write
operations, not about reads.
(BTW, I
suspect that documentation is also now wrong in one special case: NTFS
filesystems mounted on DAX devices don't have sectors or sector-based
atomicity unless you turn on BTT and slow them down[1]; that might
eventually be something for PostgreSQL to think about, and it applies
to other OSes too).
Very interesting find! For instance, Intel® Optane™ Persistent Memory
modules already reach 512GB and could potentially be used for cluster
data. As a first step it would be good to understand what Microsoft
means by large memory pages and what sizes they are talking about,
that is, where the reasonable boundary for using BTT lies; I suppose
this will help to choose whether to use BTT or to write our own
DAX-aware code.
With this patch we would stop relying on the second thing. Supposedly
POSIX requires read/write atomicity, and many file systems offer it in
a strong form (internal range locking) or maybe a weak accidental form
(page-level locking). Since some extremely popular implementations
just don't care, and Windows isn't even a POSIX, we just have to do it
ourselves.
Yes, indeed. But only unless it's an atomic or transactional
filesystem; in such a case there would be almost nothing to do.
Another thing is that such systems do not seem to exist in reality,
although the idea has been discussed many times. I've googled some
information on this topic, e.g. an LWN article [1], the txfs research
prototype [2], and Microsoft's own development, Transactional NTFS [3].
BTW there are at least two other places where PostgreSQL already knows
that concurrent reads and writes are possibly non-atomic (and we also
don't even try to get the alignment right, making the question moot):
pg_basebackup, which enables full_page_writes implicitly if you didn't
have the GUC on already, and pg_rewind, which refuses to run if you
haven't enabled full_page_writes explicitly (as recently discussed on
another thread recently; that's an annoying difference, and also an
annoying behaviour if you know your storage doesn't really need it!)
It seems like a good topic for a separate thread and patch. Would you
provide a link to the thread you mentioned, please?
Therefore, we need a solution for Windows too. I tried to write the
equivalent code, in the attached. I learned a few things: Windows
locks are mandatory (that is, if you lock a file, reads/writes can
fail, unlike Unix). Combined with async release, that means we
probably also need to lock the file in xlog.c, when reading it in
xlog.c:ReadControlFile() (see comments). Before I added that, on a CI
run, I found that the read in there would sometimes fail, and adding
the locking fixed that. I am a bit confused, though, because I
expected that to be necessary only if someone managed to crash while
holding the write lock, which the CI tests shouldn't do. Do you have
any ideas?
Unfortunately, no ideas so far. Was it a pg_control CRC failure or an
I/O error? Were the logs of such a failure saved somewhere? I would
like to see them if possible.
While contemplating what else a mandatory file lock might break, I
remembered that basebackup.c also reads the control file. Hrmph. Not
addressed yet; I guess it might need to acquire/release around
sendFile(sink, XLOG_CONTROL_FILE, ...)?
Here, could we possibly pass a boolean flag into sendFile()? When it
is true, take the lock after OpenTransientFile() and release it before
CloseTransientFile() when under Windows.
There are also two other places where reads or writes of pg_control
occur: WriteControlFile() in xlog.c and read_controlfile() in
pg_resetwal.c. For the second, locking is definitely not necessary, as
the server is stopped. The first seems safe too, since BootStrapXLOG(),
from which WriteControlFile() is called, runs only once at system
installation.
Since I've smoothly moved on to the code review: a suggestion, at your
discretion, to add error messages to get_controlfile() and
update_controlfile() in case unlock_controlfile() fails.
I think we should just make fdatasync the default on all
systems.
Agreed. And maybe make UPDATE_CONTROLFILE_FDATASYNC the default case
in UpdateControlFile(), since fdatasync is now the default on all
systems and the file's metadata is of no interest?
On 22.02.2023 03:10, Thomas Munro wrote:
If we go this way, I suppose, in theory at least, someone with
external pg_backup_start()-based tools might also want to hold the
lock while copying pg_control. Otherwise they might fail to open it
on Windows (where that patch uses a mandatory lock)
As for external pg_backup_start()-based tools: if somebody wants to
take the lock while copying pg_control, I suppose that's a normal
case. They may have to wait a bit until we release the lock, as in
lock_controlfile(). Moreover, this is a very good idea, as it
guarantees the integrity of pg_control, provided they use the waiting
F_SETLKW rather than the non-waiting F_SETLK.
Backing up locked files is commonplace on Windows, and all standard
backup tools handle it well via VSS (Volume Shadow Copy), including
the built-in Windows backup tool. Just in case, I tested it
experimentally. During the torture test, a first attempt to copy
pg_control predictably caught:
Error: Cannot read C:\pgbins\master\data\global\pg_control!
"The process cannot access the file because another process has locked a portion of the file."
But a copy using the built-in Windows backup utility succeeded via VSS.
One cute hack I thought of to make the file lock effectively advisory
on Windows is to lock a byte range *past the end* of the file (the
documentation says you can do that). That shouldn't stop programs
that want to read the file without locking and don't know/care about
our scheme (that is, pre-existing backup tools that haven't considered
this problem and remain oblivious or accept the (low) risk of torn
reads), but will block other participants in our scheme.
A very interesting idea. It makes sense when someone is using external
backup tools that cannot work with VSS. But whether such tools are
really used under Windows is highly doubtful, I guess, since they
would fail to back up many other applications and the Windows system
itself. Let me second your suggestion that it would be good to hear
from backup gurus what they think about that.
or copy garbage on
Linux (as they can today, I assume), with non-zero probability -- at
least when copying files from a hot standby.
Or backup tools might
want to get the file contents through some entirely different
mechanism that does the right interlocking (whatever that might be,
maybe inside the server). Perhaps this is not so much the localised
systems programming curiosity I thought it was, and has implications
that'd need to be part of the documented low-level backup steps. It
makes me like the idea a bit less. It'd be good to hear from backup
gurus what they think about that.
If we went back to the keep-rereading-until-it-stops-changing model,
then an external backup tool would need to be prepared to do that too,
in theory at least. Maybe some already do something like that?
Or maybe the problem is/was too theoretical before...
As far as I understand, this problem has always existed, but its
probability is extremely small in practice, as the docs [4] directly
point out:
"So while it is theoretically a weak spot, pg_control does not seem
to be a problem in practice."
Here's a patch like that.
In this patch, the problem is solved for the live database and perhaps
remains for some possible cases of an external backup. On the whole, I
think it can be stated that this is a sensible step forward.
Just like last time, I ran a long stress test under Windows with the
current patch. There were no errors for more than 3 days, even with
fsync=off.
[1]: https://lwn.net/Articles/789600/
[2]: https://github.com/ut-osa/txfs
[3]: https://en.wikipedia.org/wiki/Transactional_NTFS
[4]: https://www.postgresql.org/docs/devel/wal-internals.html
With the best wishes!
--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Fri, Feb 24, 2023 at 11:12 PM Anton A. Melnikov <aamelnikov@inbox.ru> wrote:
On 17.02.2023 06:21, Thomas Munro wrote:
BTW there are at least two other places where PostgreSQL already knows
that concurrent reads and writes are possibly non-atomic (and we also
don't even try to get the alignment right, making the question moot):
pg_basebackup, which enables full_page_writes implicitly if you didn't
have the GUC on already, and pg_rewind, which refuses to run if you
haven't enabled full_page_writes explicitly (as recently discussed on
another thread recently; that's an annoying difference, and also an
annoying behaviour if you know your storage doesn't really need it!)
It seems a good topic for a separate thread patch. Would you provide a
link to the thread you mentioned please?
/messages/by-id/367d01a7-90bb-9b70-4cda-248e81cc475c@cosium.com
Therefore, we need a solution for Windows too. I tried to write the
equivalent code, in the attached. I learned a few things: Windows
locks are mandatory (that is, if you lock a file, reads/writes can
fail, unlike Unix). Combined with async release, that means we
probably also need to lock the file in xlog.c, when reading it in
xlog.c:ReadControlFile() (see comments). Before I added that, on a CI
run, I found that the read in there would sometimes fail, and adding
the locking fixed that. I am a bit confused, though, because I
expected that to be necessary only if someone managed to crash while
holding the write lock, which the CI tests shouldn't do. Do you have
any ideas?
Unfortunately, no ideas so far. Was it a pg_control CRC or I/O errors?
Maybe logs of such a fail were saved somewhere? I would like to see
them if possible.
I think it was this one:
https://cirrus-ci.com/task/5004082033721344
For example, see subscription/011_generated which failed like this:
2023-02-16 06:57:25.724 GMT [5736][not initialized] PANIC: could not
read file "global/pg_control": Permission denied
That was fixed after I changed it to also do locking in xlog.c
ReadControlFile(), in the version you tested. There must be something
I don't understand going on, because that cluster wasn't even running
before: it had just been created by initdb.
But, anyway, I have a new idea that makes that whole problem go away
(though I'd still like to understand what happened there):
With the "pseudo-advisory lock for Windows" idea from the last email,
we don't need to bother with file system level locking in many places.
Just in controldata_utils.c, for FE/BE interlocking (that is, we don't
need to use that for interlocking of concurrent reads and writes that
are entirely in the backend, because we already have an LWLock that we
could use more consistently). Changes:
1. xlog.c mostly uses no locking
2. basebackup.c now acquires ControlFileLock
3. only controldata_utils.c uses the new file locking, for FE-vs-BE interlocking
4. lock past the end (pseudo-advisory locking for Windows)
Note that when we recover from a basebackup or pg_backup_start()
backup, we use the backup label to find a redo start location in the
WAL (see read_backup_label()), BUT we still read the copied pg_control
file (one that might be too "new", so we don't use its redo pointer).
So it had better not be torn, or the recovery will fail. So, in this
version I protected that sendFile() with ControlFileLock. But...
Isn't that a bit strange? To go down this path we would also need to
document the need to copy the control file with the file locked to
avoid a rare failure, in the pg_backup_start() documentation. That's
annoying (I don't even know how to do it with easy shell commands;
maybe we should provide a program that locks and cats the file...?).
Could we make better use of the safe copy that we have in the log?
Then the pg_backup_start() subproblem would disappear. Conceptually,
that'd be just like the way we use FPI for data pages copied during a
backup. I'm not sure about any of that, though, it's just an idea,
not tested.
Or maybe the problem is/was too theoretical before...
As far as i understand, this problem has always been, but the probability of
this is extremely small in practice, which is directly pointed in
the docs [4]:
"So while it is theoretically a weak spot, pg_control does not seem
to be a problem in practice."
I guess that was talking about power loss atomicity again? Knowledge
of the read/write atomicity problem seems to be less evenly
distributed (and I think it became more likely in Linux > 3.something;
and the Windows situation possibly hadn't been examined by anyone
before).
Here's a patch like that.
In this patch, the problem is solved for the live database and
maybe remains for some possible cases of an external backup. In a whole,
i think it can be stated that this is a sensible step forward.
Just like last time, i ran a long stress test under windows with current patch.
There were no errors for more than 3 days even with fsync=off.
Thanks!
Attachments:
v3-0001-Lock-pg_control-while-reading-or-writing.patch (+142 -1)
v3-0002-Make-wal_sync_method-fdatasync-the-default-on-all.patch (+9 -38)
v3-0003-Apply-wal_sync_method-to-pg_control-file.patch (+70 -16)
Hi, Thomas!
On 04.03.2023 00:39, Thomas Munro wrote:
It seems a good topic for a separate thread patch. Would you provide a
link to the thread you mentioned please?
/messages/by-id/367d01a7-90bb-9b70-4cda-248e81cc475c@cosium.com
Thanks! The important words there:
But I fail to see how full_page_writes prevents this since it only act on writes
It ensures the damage is later repaired during WAL replay. Which can only
happen if the WAL contains the necessary information to do so - the full page
writes.
Together with the docs about restoring a corrupted pg_control in the
pg_resetwal utility description, these words made me wonder whether we
should save the contents of pg_control into WAL at the beginning and
end of each checkpoint and teach pg_resetwal to read them. It would be
like a periodic backup of pg_control to a safe place.
This thought has nothing to do with this patch and this thread, and in
general I'm not sure whether it has any practical value or whether, on
the contrary, it may lead to some difficulties. If it seems sensible,
we could think about it further.
For example, see subscription/011_generated which failed like this:
2023-02-16 06:57:25.724 GMT [5736][not initialized] PANIC: could not
read file "global/pg_control": Permission denied
That was fixed after I changed it to also do locking in xlog.c
ReadControlFile(), in the version you tested. There must be something
I don't understand going on, because that cluster wasn't even running
before: it had just been created by initdb.
But, anyway, I have a new idea that makes that whole problem go away
(though I'd still like to understand what happened there):
It seems to be a race between the first read of pg_control in
PostmasterMain(), via LocalProcessControlFile(false), and the second
one in SubPostmasterMain() here:
/*
* (re-)read control file, as it contains config. The postmaster will
* already have read this, but this process doesn't know about that.
*/
LocalProcessControlFile(false);
which crashed, according to the crash log: crashlog-postgres.exe_19a0_2023-02-16_06-57-26-675.txt
With the "pseudo-advisory lock for Windows" idea from the last email,
we don't need to bother with file system level locking in many places.
Just in controldata_utils.c, for FE/BE interlocking (that is, we don't
need to use that for interlocking of concurrent reads and writes that
are entirely in the backend, because we already have an LWLock that we
could use more consistently). Changes:
1. xlog.c mostly uses no locking
2. basebackup.c now acquires ControlFileLock
3. only controldata_utils.c uses the new file locking, for FE-vs-BE interlocking
4. lock past the end (pseudo-advisory locking for Windows)
Although the change in 1 reintroduces the problem described above, 4
fixes it. And I did not find any other places where ReadControlFile()
can be called in different processes, so that's all OK.
Thanks to 4, it is no longer necessary to use VSS to copy the
pg_control file; it can be copied in the ordinary way even during the
torture test. This is very good.
I really like the idea of using an LWLock where possible.
In general, I think these changes make the patch more lightweight and fast.
Also, I ran tests for more than a day in stress mode with fsync=off
under Windows and Linux and found no problems. The patch-tester also
passes without errors.
I would like to move this patch to RFC, since I don't see any problems
in either the code or the tests, but the pg_backup_start() subproblem
confuses me. Maybe move it to a separate patch in a distinct thread,
as there are a number of suggestions and questions to discuss, such as:
Note that when we recover from a basebackup or pg_backup_start()
backup, we use the backup label to find a redo start location in the
WAL (see read_backup_label()), BUT we still read the copied pg_control
file (one that might be too "new", so we don't use its redo pointer).
So it had better not be torn, or the recovery will fail. So, in this
version I protected that sendFile() with ControlFileLock. But...
Isn't that a bit strange? To go down this path we would also need to
document the need to copy the control file with the file locked to
avoid a rare failure, in the pg_backup_start() documentation. That's
annoying (I don't even know how to do it with easy shell commands;
maybe we should provide a program that locks and cats the file...?).
The variant with a separate utility looks good, with a recommendation
in the docs to use it for copying pg_control.
It also seems possible to expose a function in SQL, such as
export_pg_control('dst_path') with the destination path as an
argument, to be called before pg_backup_stop().
Or pass the pg_control destination path to pg_backup_stop() as an
extra argument.
Or save pg_control to a predetermined location during pg_backup_stop()
and specify in the docs that one needs to copy it from there. I
suppose we have the right to ask the user to perform some manual
manipulation here, as with backup_label.
Could we make better use of the safe copy that we have in the log?
Then the pg_backup_start() subproblem would disappear. Conceptually,
that'd be just like the way we use FPI for data pages copied during a
backup. I'm not sure about any of that, though, it's just an idea,
not tested.
Sorry, I didn't understand the question about the log. Would you
please explain what kind of log you meant and where I can find the
creation of this safe copy in the code?
With the best wishes,
--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company