subscriptionCheck failures on nightjar

Started by Tom Lane, about 7 years ago, 44 messages, pgsql-hackers
#1 Tom Lane
tgl@sss.pgh.pa.us

nightjar just did this:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nightjar&dt=2019-02-11%2004%3A33%3A07

The critical bit seems to be that the publisher side of the
010_truncate.pl test failed like so:

2019-02-10 23:55:58.765 EST [40771] sub3 LOG: statement: BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ
2019-02-10 23:55:58.765 EST [40771] sub3 LOG: received replication command: CREATE_REPLICATION_SLOT "sub3_16414_sync_16394" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
2019-02-10 23:55:58.798 EST [40728] sub1 PANIC: could not open file "pg_logical/snapshots/0-160B578.snap": No such file or directory
2019-02-10 23:55:58.800 EST [40771] sub3 LOG: logical decoding found consistent point at 0/160B578
2019-02-10 23:55:58.800 EST [40771] sub3 DETAIL: There are no running transactions.

I'm not sure what to make of that, but I notice that nightjar has
failed subscriptionCheck seven times since mid-December, and every one
of those shows this same PANIC. Meanwhile, no other buildfarm member
has produced such a failure. It smells like a race condition with
a rather tight window, but that's just a guess.

So: (1) what's causing the failure? (2) could we respond with
something less than take-down-the-whole-database when a failure
happens in this area?

regards, tom lane

#2 Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#1)
Re: subscriptionCheck failures on nightjar

On Mon, Feb 11, 2019 at 7:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

2019-02-10 23:55:58.798 EST [40728] sub1 PANIC: could not open file "pg_logical/snapshots/0-160B578.snap": No such file or directory

<pokes at totally unfamiliar code>

They get atomically renamed into place, which seems kosher even if
snapshots for the same LSN are created concurrently by different
backends (and tracing syscalls confirms that that does occasionally
happen). It's hard to believe that nightjar's rename() ceased to be
atomic a couple of months ago. It looks like the only way for files
to get unlinked after that is by CheckPointSnapBuild() deciding they
are too old.

Hmm. Could this be relevant, and cause a well timed checkpoint to
unlink files too soon?

2019-02-12 21:52:58.304 EST [22922] WARNING: out of logical
replication worker slots

--
Thomas Munro
http://www.enterprisedb.com

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#2)
Re: subscriptionCheck failures on nightjar

Thomas Munro <thomas.munro@enterprisedb.com> writes:

On Mon, Feb 11, 2019 at 7:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

2019-02-10 23:55:58.798 EST [40728] sub1 PANIC: could not open file "pg_logical/snapshots/0-160B578.snap": No such file or directory

<pokes at totally unfamiliar code>

They get atomically renamed into place, which seems kosher even if
snapshots for the same LSN are created concurrently by different
backends (and tracing syscalls confirms that that does occasionally
happen). It's hard to believe that nightjar's rename() ceased to be
atomic a couple of months ago. It looks like the only way for files
to get unlinked after that is by CheckPointSnapBuild() deciding they
are too old.

Hmm. Could this be relevant, and cause a well timed checkpoint to
unlink files too soon?
2019-02-12 21:52:58.304 EST [22922] WARNING: out of logical
replication worker slots

I've managed to reproduce this locally, and obtained this PANIC:

log/010_truncate_publisher.log:2019-02-13 11:29:04.759 EST [9973] sub3 PANIC: could not open file "pg_logical/snapshots/0-16067B0.snap": No such file or directory

with this stack trace:

#0 0x0000000801ab610c in kill () from /lib/libc.so.7
#1 0x0000000801ab4d3b in abort () from /lib/libc.so.7
#2 0x000000000089202e in errfinish (dummy=Variable "dummy" is not available.
) at elog.c:552
#3 0x000000000075d339 in fsync_fname_ext (
fname=0x7fffffffba20 "pg_logical/snapshots/0-16067B0.snap", isdir=Variable "isdir" is not available.
)
at fd.c:3372
#4 0x0000000000730c75 in SnapBuildSerialize (builder=0x80243c118,
lsn=23095216) at snapbuild.c:1669
#5 0x0000000000731297 in SnapBuildProcessRunningXacts (builder=0x80243c118,
lsn=23095216, running=0x8024424f0) at snapbuild.c:1110
#6 0x0000000000722eac in LogicalDecodingProcessRecord (ctx=0x802414118,
record=0x8024143d8) at decode.c:318
#7 0x0000000000736f30 in XLogSendLogical () at walsender.c:2826
#8 0x00000000007389c7 in WalSndLoop (send_data=0x736ed0 <XLogSendLogical>)
at walsender.c:2184
#9 0x0000000000738c3d in StartLogicalReplication (cmd=Variable "cmd" is not available.
) at walsender.c:1118
#10 0x0000000000739895 in exec_replication_command (
cmd_string=0x80240e118 "START_REPLICATION SLOT \"sub3\" LOGICAL 0/0 (proto_version '1', publication_names '\"pub3\"')") at walsender.c:1536
#11 0x000000000078b272 in PostgresMain (argc=-14464, argv=Variable "argv" is not available.
) at postgres.c:4252
#12 0x00000000007113fa in PostmasterMain (argc=-14256, argv=0x7fffffffcc60)
at postmaster.c:4382

So the problem seems to boil down to "somebody removed the snapshot
file between the time we renamed it into place and the time we tried
to fsync it".

I do find messages like the one you mention, but they are on the
subscriber not the publisher side, so I'm not sure if this matches
the scenario you have in mind?

log/010_truncate_subscriber.log:2019-02-13 11:29:02.343 EST [9970] WARNING: out of logical replication worker slots
log/010_truncate_subscriber.log:2019-02-13 11:29:02.344 EST [9970] WARNING: out of logical replication worker slots
log/010_truncate_subscriber.log:2019-02-13 11:29:03.401 EST [9970] WARNING: out of logical replication worker slots

Anyway, I think we might be able to fix this along the lines of

CloseTransientFile(fd);

+   /* ensure snapshot file is down to stable storage */
+   fsync_fname(tmppath, false);
    fsync_fname("pg_logical/snapshots", true);

/*
* We may overwrite the work from some other backend, but that's ok, our
* snapshot is valid as well, we'll just have done some superfluous work.
*/
if (rename(tmppath, path) != 0)
{
ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not rename file \"%s\" to \"%s\": %m",
tmppath, path)));
}

-   /* make sure we persist */
-   fsync_fname(path, false);
+   /* ensure we persist the file rename */
    fsync_fname("pg_logical/snapshots", true);

The existing code here seems simply wacky/unsafe to me regardless
of this race condition: couldn't it potentially result in a corrupt
snapshot file appearing with a valid name, if the system crashes
after persisting the rename but before it's pushed the data out?

I also wonder why bother with the directory sync just before the
rename.
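For readers following along, the ordering being proposed here can be illustrated with a standalone POSIX sketch (the helper name and paths are invented for illustration; this is not the actual snapbuild.c code): make the data durable while the file still has its temporary name, rename it into place, then fsync the containing directory so the rename itself is durable.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Illustrative sketch only: write data under a temporary name, push the
 * data to stable storage *before* the rename, then make the rename
 * durable by fsyncing the containing directory.
 * Returns 0 on success, -1 on failure.
 */
static int
durable_rename_sketch(const char *tmppath, const char *path,
					  const char *dir, const char *data)
{
	int			fd;
	int			dirfd;

	fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	if (write(fd, data, strlen(data)) != (ssize_t) strlen(data) ||
		fsync(fd) != 0)			/* data hits stable storage pre-rename */
	{
		close(fd);
		return -1;
	}
	close(fd);

	if (rename(tmppath, path) != 0)		/* atomically publish the file */
		return -1;

	dirfd = open(dir, O_RDONLY);		/* now persist the rename itself */
	if (dirfd < 0)
		return -1;
	if (fsync(dirfd) != 0)
	{
		close(dirfd);
		return -1;
	}
	close(dirfd);
	return 0;
}
```

With this ordering, a crash after the rename can expose at worst a fully-written file under the final name, never a renamed-but-unsynced one.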

regards, tom lane

#4 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#3)
Re: subscriptionCheck failures on nightjar

Hi,

On 2019-02-13 11:57:32 -0500, Tom Lane wrote:

I've managed to reproduce this locally, and obtained this PANIC:

Cool. How exactly?

Nice catch.

Anyway, I think we might be able to fix this along the lines of

CloseTransientFile(fd);

+   /* ensure snapshot file is down to stable storage */
+   fsync_fname(tmppath, false);
fsync_fname("pg_logical/snapshots", true);

/*
* We may overwrite the work from some other backend, but that's ok, our
* snapshot is valid as well, we'll just have done some superfluous work.
*/
if (rename(tmppath, path) != 0)
{
ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not rename file \"%s\" to \"%s\": %m",
tmppath, path)));
}

-   /* make sure we persist */
-   fsync_fname(path, false);
+   /* ensure we persist the file rename */
fsync_fname("pg_logical/snapshots", true);

Hm, but that's not the same? On some filesystems one needs the directory
fsync, on some the file fsync, and I think both in some cases. ISTM we
should just open the file before the rename, and then use fsync() on the
filehandle rather than the filename. Then a concurrent rename ought not
to hurt us?
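A minimal sketch of that idea (helper name invented for illustration, not actual PostgreSQL code): an fsync() issued on a descriptor opened before the rename operates on the inode rather than the name, so a concurrent rename or unlink of the path cannot make the fsync fail with ENOENT.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Illustrative sketch only: hold an fd across the rename and fsync
 * through it afterwards.  The descriptor follows the inode, so the
 * fsync cannot fail with ENOENT even if another process has since
 * renamed or removed the visible name.  Returns 0 on success.
 */
static int
fsync_via_preopened_fd(const char *tmppath, const char *path)
{
	int			fd;
	int			rc;

	fd = open(tmppath, O_RDWR);
	if (fd < 0)
		return -1;

	if (rename(tmppath, path) != 0)		/* the name changes ... */
	{
		close(fd);
		return -1;
	}

	rc = fsync(fd);						/* ... but the fd stays valid */
	close(fd);
	return rc;
}
```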

The existing code here seems simply wacky/unsafe to me regardless
of this race condition: couldn't it potentially result in a corrupt
snapshot file appearing with a valid name, if the system crashes
after persisting the rename but before it's pushed the data out?

What do you mean precisely with "before it's pushed the data out"?

I also wonder why bother with the directory sync just before the
rename.

Because on some FS/OS combinations the size of the renamed-into-place
file isn't guaranteed to be durable unless the directory was
fsynced. Otherwise there's the possibility to have incomplete data if we
crash after renaming, but before fsyncing.

Greetings,

Andres Freund

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#4)
Re: subscriptionCheck failures on nightjar

Andres Freund <andres@anarazel.de> writes:

On 2019-02-13 11:57:32 -0500, Tom Lane wrote:

I've managed to reproduce this locally, and obtained this PANIC:

Cool. How exactly?

Andrew told me that nightjar is actually running in a qemu VM,
so I set up freebsd 9.0 in a qemu VM, and boom. It took a bit
of fiddling with qemu parameters, but for such a timing-sensitive
problem, that's not surprising.

Anyway, I think we might be able to fix this along the lines of
[ fsync the data before renaming not after ]

Hm, but that's not the same? On some filesystems one needs the directory
fsync, on some the file fsync, and I think both in some cases.

Now that I look at it, there's a pg_fsync() just above this, so
I wonder why we need a second fsync on the file at all. fsync'ing
the directory is needed to ensure the directory entry is on disk;
but the file data should be out already, or else the kernel is
simply failing to honor fsync.

The existing code here seems simply wacky/unsafe to me regardless
of this race condition: couldn't it potentially result in a corrupt
snapshot file appearing with a valid name, if the system crashes
after persisting the rename but before it's pushed the data out?

What do you mean precisely with "before it's pushed the data out"?

Given the previous pg_fsync, this isn't an issue.

I also wonder why bother with the directory sync just before the
rename.

Because on some FS/OS combinations the size of the renamed-into-place
file isn't guaranteed to be durable unless the directory was
fsynced.

Bleah. But in any case, the rename should not create a situation
in which we need to fsync the file data again.

regards, tom lane

#6 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#5)
Re: subscriptionCheck failures on nightjar

Hi,

On 2019-02-13 12:37:35 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2019-02-13 11:57:32 -0500, Tom Lane wrote:

I've managed to reproduce this locally, and obtained this PANIC:

Cool. How exactly?

Andrew told me that nightjar is actually running in a qemu VM,
so I set up freebsd 9.0 in a qemu VM, and boom. It took a bit
of fiddling with qemu parameters, but for such a timing-sensitive
problem, that's not surprising.

Ah.

I also wonder why bother with the directory sync just before the
rename.

Because on some FS/OS combinations the size of the renamed-into-place
file isn't guaranteed to be durable unless the directory was
fsynced.

Bleah. But in any case, the rename should not create a situation
in which we need to fsync the file data again.

Well, it's not super well defined which of either you need to make the
rename durable, and it appears to differ between OSs. Any argument
against fixing it up like I suggested, by using an fd from before the
rename?

Greetings,

Andres Freund

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#6)
Re: subscriptionCheck failures on nightjar

Andres Freund <andres@anarazel.de> writes:

On 2019-02-13 12:37:35 -0500, Tom Lane wrote:

Bleah. But in any case, the rename should not create a situation
in which we need to fsync the file data again.

Well, it's not super well defined which of either you need to make the
rename durable, and it appears to differ between OSs. Any argument
against fixing it up like I suggested, by using an fd from before the
rename?

I'm unimpressed. You're speculating about the filesystem having random
deviations from POSIX behavior, and using that weak argument to justify a
totally untested technique having its own obvious portability hazards.
Who's to say that an fsync on a file opened before a rename is going to do
anything good after the rename? (On, eg, NFS there are obvious reasons
why it might not.)

Also, I wondered why this is coming out as a PANIC. I thought originally
that somebody must be causing this code to run in a critical section,
but it looks like the real issue is just that fsync_fname() uses
data_sync_elevel, which is

int
data_sync_elevel(int elevel)
{
return data_sync_retry ? elevel : PANIC;
}

I really really don't want us doing questionably-necessary fsyncs with a
PANIC as the result. Perhaps more to the point, the way this was coded,
the PANIC applies to open() failures in fsync_fname_ext() not just fsync()
failures; that's painting with too broad a brush isn't it?

regards, tom lane

#8 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#7)
Re: subscriptionCheck failures on nightjar

Hi,

On 2019-02-13 12:59:19 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2019-02-13 12:37:35 -0500, Tom Lane wrote:

Bleah. But in any case, the rename should not create a situation
in which we need to fsync the file data again.

Well, it's not super well defined which of either you need to make the
rename durable, and it appears to differ between OSs. Any argument
against fixing it up like I suggested, by using an fd from before the
rename?

I'm unimpressed. You're speculating about the filesystem having random
deviations from POSIX behavior, and using that weak argument to justify a
totally untested technique having its own obvious portability hazards.

Uhm, we've reproduced failures due to the lack of such fsyncs at some
point. And not some fringe OS, but ext4 (albeit with data=writeback).

I don't think POSIX has yet figured out what they actually think is
required:
http://austingroupbugs.net/view.php?id=672

I guess we could just ignore ENOENT in this case, that ought to be just
as safe as using the old fd.
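A rough sketch of that ENOENT-tolerant variant (helper name invented; this is not the committed fix): treat "file already gone" as success, on the theory that whoever unlinked the file has decided it no longer needs to be made durable.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Illustrative sketch only: fsync a file by name, but treat ENOENT
 * from open() as success rather than an error, since a concurrently
 * removed file evidently no longer needs to reach stable storage.
 * Returns 0 on success (including "already gone"), -1 otherwise.
 */
static int
fsync_fname_missing_ok(const char *fname)
{
	int			fd;
	int			rc;

	fd = open(fname, O_RDONLY);
	if (fd < 0)
		return (errno == ENOENT) ? 0 : -1;	/* concurrently removed: OK */

	rc = fsync(fd);
	close(fd);
	return rc;
}
```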

Also, I wondered why this is coming out as a PANIC. I thought originally
that somebody must be causing this code to run in a critical section,
but it looks like the real issue is just that fsync_fname() uses
data_sync_elevel, which is

int
data_sync_elevel(int elevel)
{
return data_sync_retry ? elevel : PANIC;
}

I really really don't want us doing questionably-necessary fsyncs with a
PANIC as the result.

Well, given the 'failed fsync throws dirty data away' issue, I don't
quite see what we can do otherwise. But:

Perhaps more to the point, the way this was coded, the PANIC applies
to open() failures in fsync_fname_ext() not just fsync() failures;
that's painting with too broad a brush isn't it?

That indeed seems wrong. Thomas? I'm not quite sure how to best fix
this though - I guess we could rename fsync_fname_ext's elevel parameter
to fsync_failure_elevel? It's not visible outside fd.c, so that'd not be
too bad?

Greetings,

Andres Freund

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#8)
Re: subscriptionCheck failures on nightjar

Andres Freund <andres@anarazel.de> writes:

On 2019-02-13 12:59:19 -0500, Tom Lane wrote:

Perhaps more to the point, the way this was coded, the PANIC applies
to open() failures in fsync_fname_ext() not just fsync() failures;
that's painting with too broad a brush isn't it?

That indeed seems wrong. Thomas? I'm not quite sure how to best fix
this though - I guess we could rename fsync_fname_ext's elevel parameter
to fsync_failure_elevel? It's not visible outside fd.c, so that'd not be
too bad?

Perhaps fsync_fname() should pass the elevel through as-is, and
then fsync_fname_ext() apply the data_sync_elevel() promotion only
to the fsync() call not the open()? I'm not sure.

The bigger picture here is that there are probably some call sites where
PANIC on open() failure is appropriate too. So having fsync_fname[_ext]
deciding what to do on its own is likely to be inadequate.

If we fix this by allowing ENOENT to not be an error in this particular
call case, that's going to involve an fsync_fname_ext API change anyway...

regards, tom lane

#10 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#9)
Re: subscriptionCheck failures on nightjar

Hi,

On 2019-02-13 13:24:03 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2019-02-13 12:59:19 -0500, Tom Lane wrote:

Perhaps more to the point, the way this was coded, the PANIC applies
to open() failures in fsync_fname_ext() not just fsync() failures;
that's painting with too broad a brush isn't it?

That indeed seems wrong. Thomas? I'm not quite sure how to best fix
this though - I guess we could rename fsync_fname_ext's elevel parameter
to fsync_failure_elevel? It's not visible outside fd.c, so that'd not be
too bad?

Perhaps fsync_fname() should pass the elevel through as-is, and
then fsync_fname_ext() apply the data_sync_elevel() promotion only
to the fsync() call not the open()? I'm not sure.

Yea, that's probably not a bad plan. It'd address your:

The bigger picture here is that there are probably some call sites where
PANIC on open() failure is appropriate too. So having fsync_fname[_ext]
deciding what to do on its own is likely to be inadequate.

Seems to me we ought to do this regardless of the bug discussed
here. But we'd need to be careful that we'd take the "more severe" error
between the passed in elevel and data_sync_elevel(). Otherwise we'd end
up downgrading errors...

If we fix this by allowing ENOENT to not be an error in this particular
call case, that's going to involve an fsync_fname_ext API change anyway...

I was kinda pondering just open coding it. I am not yet convinced that
my idea of just using an open FD isn't the least bad approach for the
issue at hand. What precisely is the NFS issue you're concerned about?

Right now fsync_fname_ext isn't exposed outside fd.c...

Greetings,

Andres Freund

#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#10)
Re: subscriptionCheck failures on nightjar

Andres Freund <andres@anarazel.de> writes:

I was kinda pondering just open coding it. I am not yet convinced that
my idea of just using an open FD isn't the least bad approach for the
issue at hand. What precisely is the NFS issue you're concerned about?

I'm not sure that fsync-on-FD after the rename will work, considering that
the issue here is that somebody might've unlinked the file altogether
before we get to doing the fsync. I don't have a hard time believing that
that might result in a failure report on NFS or similar. Yeah, it's
hypothetical, but the argument that we need a repeat fsync at all seems
equally hypothetical.

Right now fsync_fname_ext isn't exposed outside fd.c...

Mmm. That makes it easier to consider changing its API.

regards, tom lane

#12 Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#11)
Re: subscriptionCheck failures on nightjar

On Thu, Feb 14, 2019 at 8:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andres Freund <andres@anarazel.de> writes:

I was kinda pondering just open coding it. I am not yet convinced that
my idea of just using an open FD isn't the least bad approach for the
issue at hand. What precisely is the NFS issue you're concerned about?

I'm not sure that fsync-on-FD after the rename will work, considering that
the issue here is that somebody might've unlinked the file altogether
before we get to doing the fsync. I don't have a hard time believing that
that might result in a failure report on NFS or similar. Yeah, it's
hypothetical, but the argument that we need a repeat fsync at all seems
equally hypothetical.

Right now fsync_fname_ext isn't exposed outside fd.c...

Mmm. That makes it easier to consider changing its API.

Just to make sure I understand: it's OK for the file not to be there
when we try to fsync it by name, because a concurrent checkpoint can
remove it, having determined that we don't need it anymore? In other
words, we really needed either missing_ok=true semantics, or to use
the fd we already had instead of the name?

I found 3 examples of this failing with an ERROR (though not turning
the BF red, so nobody noticed) before the PANIC patch went in:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=nightjar&dt=2018-09-10%2020%3A54%3A21&stg=subscription-check
2018-09-10 17:20:09.247 EDT [23287] sub1 ERROR: could not open file
"pg_logical/snapshots/0-161D778.snap": No such file or directory
2018-09-10 17:20:09.247 EDT [23285] ERROR: could not receive data
from WAL stream: ERROR: could not open file
"pg_logical/snapshots/0-161D778.snap": No such file or directory

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=nightjar&dt=2018-08-31%2023%3A25%3A41&stg=subscription-check
2018-08-31 19:52:06.634 EDT [52724] sub1 ERROR: could not open file
"pg_logical/snapshots/0-161D718.snap": No such file or directory
2018-08-31 19:52:06.634 EDT [52721] ERROR: could not receive data
from WAL stream: ERROR: could not open file
"pg_logical/snapshots/0-161D718.snap": No such file or directory

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=nightjar&dt=2018-08-22%2021%3A49%3A18&stg=subscription-check
2018-08-22 18:10:29.422 EDT [44208] sub1 ERROR: could not open file
"pg_logical/snapshots/0-161D718.snap": No such file or directory
2018-08-22 18:10:29.422 EDT [44206] ERROR: could not receive data
from WAL stream: ERROR: could not open file
"pg_logical/snapshots/0-161D718.snap": No such file or directory

--
Thomas Munro
http://www.enterprisedb.com

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#12)
Re: subscriptionCheck failures on nightjar

Thomas Munro <thomas.munro@enterprisedb.com> writes:

I found 3 examples of this failing with an ERROR (though not turning
the BF red, so nobody noticed) before the PANIC patch went in:

Yeah, I suspected that had happened before with less-obvious consequences.
Now that we know where the problem is, you could probably make it highly
reproducible by inserting a sleep of a few msec between the rename and the
second fsync.

regards, tom lane

#14 Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#12)
Re: subscriptionCheck failures on nightjar

Hi,

On 2019-02-14 09:52:33 +1300, Thomas Munro wrote:

On Thu, Feb 14, 2019 at 8:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andres Freund <andres@anarazel.de> writes:

I was kinda pondering just open coding it. I am not yet convinced that
my idea of just using an open FD isn't the least bad approach for the
issue at hand. What precisely is the NFS issue you're concerned about?

I'm not sure that fsync-on-FD after the rename will work, considering that
the issue here is that somebody might've unlinked the file altogether
before we get to doing the fsync. I don't have a hard time believing that
that might result in a failure report on NFS or similar. Yeah, it's
hypothetical, but the argument that we need a repeat fsync at all seems
equally hypothetical.

Right now fsync_fname_ext isn't exposed outside fd.c...

Mmm. That makes it easier to consider changing its API.

Just to make sure I understand: it's OK for the file not to be there
when we try to fsync it by name, because a concurrent checkpoint can
remove it, having determined that we don't need it anymore? In other
words, we really needed either missing_ok=true semantics, or to use
the fd we already had instead of the name?

I'm not yet sure that that's actually something that's supposed to
happen, I got to spend some time analysing how this actually
happens. Normally the contents of the slot should actually prevent it
from being removed (as they're newer than
ReplicationSlotsComputeLogicalRestartLSN()). I kind of wonder if that's
a bug in the drop logic in newer releases.

Greetings,

Andres Freund

#15 Andrew Dunstan
andrew@dunslane.net
In reply to: Andres Freund (#8)
Re: subscriptionCheck failures on nightjar

On 2/13/19 1:12 PM, Andres Freund wrote:

Hi,

On 2019-02-13 12:59:19 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2019-02-13 12:37:35 -0500, Tom Lane wrote:

Bleah. But in any case, the rename should not create a situation
in which we need to fsync the file data again.

Well, it's not super well defined which of either you need to make the
rename durable, and it appears to differ between OSs. Any argument
against fixing it up like I suggested, by using an fd from before the
rename?

I'm unimpressed. You're speculating about the filesystem having random
deviations from POSIX behavior, and using that weak argument to justify a
totally untested technique having its own obvious portability hazards.

Uhm, we've reproduced failures due to the lack of such fsyncs at some
point. And not some fringe OS, but ext4 (albeit with data=writeback).

I don't think POSIX has yet figured out what they actually think is
required:
http://austingroupbugs.net/view.php?id=672

I guess we could just ignore ENOENT in this case, that ought to be just
as safe as using the old fd.

Also, I wondered why this is coming out as a PANIC. I thought originally
that somebody must be causing this code to run in a critical section,
but it looks like the real issue is just that fsync_fname() uses
data_sync_elevel, which is

int
data_sync_elevel(int elevel)
{
return data_sync_retry ? elevel : PANIC;
}

I really really don't want us doing questionably-necessary fsyncs with a
PANIC as the result.

Well, given the 'failed fsync throws dirty data away' issue, I don't
quite see what we can do otherwise. But:

Perhaps more to the point, the way this was coded, the PANIC applies
to open() failures in fsync_fname_ext() not just fsync() failures;
that's painting with too broad a brush isn't it?

That indeed seems wrong. Thomas? I'm not quite sure how to best fix
this though - I guess we could rename fsync_fname_ext's elevel parameter
to fsync_failure_elevel? It's not visible outside fd.c, so that'd not be
too bad?

Thread seems to have gone quiet ...

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#16 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#14)
Re: subscriptionCheck failures on nightjar

Andres Freund <andres@anarazel.de> writes:

On 2019-02-14 09:52:33 +1300, Thomas Munro wrote:

Just to make sure I understand: it's OK for the file not to be there
when we try to fsync it by name, because a concurrent checkpoint can
remove it, having determined that we don't need it anymore? In other
words, we really needed either missing_ok=true semantics, or to use
the fd we already had instead of the name?

I'm not yet sure that that's actually something that's supposed to
happen, I got to spend some time analysing how this actually
happens. Normally the contents of the slot should actually prevent it
from being removed (as they're newer than
ReplicationSlotsComputeLogicalRestartLSN()). I kind of wonder if that's
a bug in the drop logic in newer releases.

My animal dromedary just reproduced this failure, which we've previously
only seen on nightjar.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dromedary&dt=2019-06-26%2023%3A57%3A45

regards, tom lane

#17 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#16)
Re: subscriptionCheck failures on nightjar

I wrote:

My animal dromedary just reproduced this failure, which we've previously
only seen on nightjar.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dromedary&dt=2019-06-26%2023%3A57%3A45

Twice:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dromedary&dt=2019-06-28%2006%3A50%3A41

Since this is a live (if old) system, not some weird qemu emulation,
we can no longer pretend that it might not be reachable in the field.
I've added an open item.

regards, tom lane

#18 Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#14)
Re: subscriptionCheck failures on nightjar

On Wed, Feb 13, 2019 at 01:51:47PM -0800, Andres Freund wrote:

I'm not yet sure that that's actually something that's supposed to
happen, I got to spend some time analysing how this actually
happens. Normally the contents of the slot should actually prevent it
from being removed (as they're newer than
ReplicationSlotsComputeLogicalRestartLSN()). I kind of wonder if that's
a bug in the drop logic in newer releases.

In the same context, could it be a consequence of 9915de6c which has
introduced a conditional variable to control slot operations? This
could have exposed more easily a pre-existing race condition.
--
Michael

#19 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Michael Paquier (#18)
Re: subscriptionCheck failures on nightjar

On Tue, Aug 13, 2019 at 05:04:35PM +0900, Michael Paquier wrote:

On Wed, Feb 13, 2019 at 01:51:47PM -0800, Andres Freund wrote:

I'm not yet sure that that's actually something that's supposed to
happen, I got to spend some time analysing how this actually
happens. Normally the contents of the slot should actually prevent it
from being removed (as they're newer than
ReplicationSlotsComputeLogicalRestartLSN()). I kind of wonder if that's
a bug in the drop logic in newer releases.

In the same context, could it be a consequence of 9915de6c which has
introduced a conditional variable to control slot operations? This
could have exposed more easily a pre-existing race condition.
--

This is one of the remaining open items, and we don't seem to be moving
forward with it :-(

I'm willing to take a stab at it, but to do that I need a way to
reproduce it. Tom, you mentioned you've managed to reproduce it in a
qemu instance, but that it took some fiddling with qemu parameters or
something. Can you share what exactly was necessary?

An observation about the issue - while we started to notice this after
December, that's mostly because the PANIC patch went in shortly before.
We've however seen the issue before, as Thomas Munro mentioned in [1].

Those reports are from August, so it's quite possible something in the
first CF upset the code. And there's only a single commit in 2018-07
that seems related to logical decoding / snapshots [2], i.e. f49a80c:

commit f49a80c481f74fa81407dce8e51dea6956cb64f8
Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Tue Jun 26 16:38:34 2018 -0400

Fix "base" snapshot handling in logical decoding

...

The other reason to suspect this is related is that the fix also made it
to REL_11_STABLE at that time, and if you check the buildfarm data [3],
you'll see 11 fails on nightjar too, from time to time.

This means it's not a 12+ only issue, it's a live issue on 11. I don't
know if f49a80c is the culprit, or if it simply uncovered a pre-existing
bug (e.g. due to timing).

[1]: /messages/by-id/CAEepm=0wB7vgztC5sg2nmJ-H3bnrBT5GQfhUzP+Ffq-WT3g8VA@mail.gmail.com

[2]: https://commitfest.postgresql.org/18/1650/

[3]: https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=nightjar&br=REL_11_STABLE

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#20 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#19)
Re: subscriptionCheck failures on nightjar

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

I'm willing to take a stab at it, but to do that I need a way to
reproduce it. Tom, you mentioned you've managed to reproduce it in a
qemu instance, but that it took some fiddling with qemu parameters or
something. Can you share what exactly was necessary?

I don't recall exactly what I did anymore, and it was pretty fiddly
anyway. Upthread I suggested

Now that we know where the problem is, you could probably make it highly
reproducible by inserting a sleep of a few msec between the rename and the
second fsync.

so why not try that first?

regards, tom lane

#21 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#20)
#22 Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#19)
#23 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#22)
#24 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#23)
#25 Michael Paquier
michael@paquier.xyz
In reply to: Tomas Vondra (#24)
#26 Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Michael Paquier (#25)
#27 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Kuntal Ghosh (#26)
#28 Michael Paquier
michael@paquier.xyz
In reply to: Tomas Vondra (#27)
#29 Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Michael Paquier (#28)
#30 Michael Paquier
michael@paquier.xyz
In reply to: Kuntal Ghosh (#29)
#31 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Michael Paquier (#28)
#32 Andres Freund
andres@anarazel.de
In reply to: Kuntal Ghosh (#29)
#33 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#32)
#34 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#33)
#35 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#34)
#36 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#35)
#37 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#35)
#38 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#35)
#39 Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#38)
#40 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#38)
#41 Michael Paquier
michael@paquier.xyz
In reply to: Tomas Vondra (#31)
#42 Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#40)
#43 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#42)
#44 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Tom Lane (#43)