Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work
Hi,
Currently the server is erroring out when unable to remove/parse a
logical rewrite file in CheckPointLogicalRewriteHeap wasting the
amount of work the checkpoint has done and preventing the checkpoint
from finishing. This is unlike CheckPointSnapBuild does for snapshot
files i.e. it just emits a message at LOG level and continues if it is
unable to parse or remove the file. Attaching a small patch applying
the same idea to the mapping files.
Thoughts?
Regards,
Bharath Rupireddy.
Attachments:
v1-0001-Avoid-erroring-out-when-unable-to-remove-or-parse.patchapplication/octet-stream; name=v1-0001-Avoid-erroring-out-when-unable-to-remove-or-parse.patchDownload+19-3
On 12/31/21, 4:44 AM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:
Currently the server is erroring out when unable to remove/parse a
logical rewrite file in CheckPointLogicalRewriteHeap wasting the
amount of work the checkpoint has done and preventing the checkpoint
from finishing. This is unlike CheckPointSnapBuild does for snapshot
files i.e. it just emits a message at LOG level and continues if it is
unable to parse or remove the file. Attaching a small patch applying
the same idea to the mapping files.
This seems reasonable to me. AFAICT moving on to other files after an
error shouldn't cause any problems. In fact, it's probably beneficial
to try to clean up as much as possible so that the files do not
continue to build up.
The only feedback I have for the patch is that I don't think the new
comments are necessary.
Nathan
On Thu, Jan 13, 2022 at 3:47 AM Bossart, Nathan <bossartn@amazon.com> wrote:
On 12/31/21, 4:44 AM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:
Currently the server is erroring out when unable to remove/parse a
logical rewrite file in CheckPointLogicalRewriteHeap wasting the
amount of work the checkpoint has done and preventing the checkpoint
from finishing. This is unlike CheckPointSnapBuild does for snapshot
files i.e. it just emits a message at LOG level and continues if it is
unable to parse or remove the file. Attaching a small patch applying
the same idea to the mapping files.This seems reasonable to me. AFAICT moving on to other files after an
error shouldn't cause any problems. In fact, it's probably beneficial
to try to clean up as much as possible so that the files do not
continue to build up.
Thanks for the review Nathan!
The only feedback I have for the patch is that I don't think the new
comments are necessary.
I borrowed the comments as-is from the CheckPointSnapBuild introduced
by the commit b89e15105. IMO, let the comments be there as they
explain why we are not emitting ERRORs, however I will leave it to the
committer to decide on that.
Regards,
Bharath Rupireddy.
On 1/12/22, 10:03 PM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:
On Thu, Jan 13, 2022 at 3:47 AM Bossart, Nathan <bossartn@amazon.com> wrote:
The only feedback I have for the patch is that I don't think the new
comments are necessary.I borrowed the comments as-is from the CheckPointSnapBuild introduced
by the commit b89e15105. IMO, let the comments be there as they
explain why we are not emitting ERRORs, however I will leave it to the
committer to decide on that.
Huh, somehow I missed that when I looked at it yesterday. I'm going
to bump this one to ready-for-committer, then.
Nathan
Hi,
On 2021-12-31 18:12:37 +0530, Bharath Rupireddy wrote:
Currently the server is erroring out when unable to remove/parse a
logical rewrite file in CheckPointLogicalRewriteHeap wasting the
amount of work the checkpoint has done and preventing the checkpoint
from finishing.
This seems like it'd make failures to remove the files practically
invisible. Which'd have it's own set of problems?
What motivated proposing this change?
Greetings,
Andres Freund
On Fri, Jan 14, 2022 at 1:08 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-12-31 18:12:37 +0530, Bharath Rupireddy wrote:
Currently the server is erroring out when unable to remove/parse a
logical rewrite file in CheckPointLogicalRewriteHeap wasting the
amount of work the checkpoint has done and preventing the checkpoint
from finishing.This seems like it'd make failures to remove the files practically
invisible. Which'd have it's own set of problems?What motivated proposing this change?
We had an issue where there were many mapping files generated during
the crash recovery and end-of-recovery checkpoint was taking a lot of
time. We had to manually intervene and delete some of the mapping
files (although it may not sound sensible) to make end-of-recovery
checkpoint faster. Because of the race condition between manual
deletion and checkpoint deletion, the unlink error occurred which
crashed the server and the server entered the recovery again wasting
the entire earlier recovery work.
In summary, with the changes (emitting LOG-only messages for unlink
failures and continuing with the other files) proposed for
CheckPointLogicalRewriteHeap in this thread and the existing code in
CheckPointSnapBuild, I'm sure it will help not waste the recovery
that's has been done in case unlink fails for any reasons.
Regards,
Bharath Rupireddy.
Hi,
On Sat, Jan 15, 2022 at 02:04:12PM +0530, Bharath Rupireddy wrote:
We had an issue where there were many mapping files generated during
the crash recovery and end-of-recovery checkpoint was taking a lot of
time. We had to manually intervene and delete some of the mapping
files (although it may not sound sensible) to make end-of-recovery
checkpoint faster. Because of the race condition between manual
deletion and checkpoint deletion, the unlink error occurred which
crashed the server and the server entered the recovery again wasting
the entire earlier recovery work.
Maybe I'm missing something but wouldn't
https://commitfest.postgresql.org/36/3448/ better solve the problem?
On Sat, Jan 15, 2022 at 2:59 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
Hi,
On Sat, Jan 15, 2022 at 02:04:12PM +0530, Bharath Rupireddy wrote:
We had an issue where there were many mapping files generated during
the crash recovery and end-of-recovery checkpoint was taking a lot of
time. We had to manually intervene and delete some of the mapping
files (although it may not sound sensible) to make end-of-recovery
checkpoint faster. Because of the race condition between manual
deletion and checkpoint deletion, the unlink error occurred which
crashed the server and the server entered the recovery again wasting
the entire earlier recovery work.Maybe I'm missing something but wouldn't
https://commitfest.postgresql.org/36/3448/ better solve the problem?
The error can cause the new background process proposed there in that
thread to restart, which is again costly. Since we have LOG-only and
continue behavior in CheckPointSnapBuild already, having the same
behavior for CheckPointLogicalRewriteHeap helps a lot.
Regards,
Bharath Rupireddy.
Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes:
On Sat, Jan 15, 2022 at 2:59 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
Maybe I'm missing something but wouldn't
https://commitfest.postgresql.org/36/3448/ better solve the problem?
The error can cause the new background process proposed there in that
thread to restart, which is again costly. Since we have LOG-only and
continue behavior in CheckPointSnapBuild already, having the same
behavior for CheckPointLogicalRewriteHeap helps a lot.
[ stares at CheckPointLogicalRewriteHeap for awhile ... ]
This code has got more problems than that. It took me awhile to
absorb it, but we don't actually care about the contents of any of
those files; all of the information is encoded in the file *names*.
(This strikes me as probably not a very efficient design, compared
to putting the same data into a single text file; but for now I'll
assume we're not up for a complete rewrite.) That being the case,
I wonder what it is we expect fsync'ing the surviving files to do
exactly. We should be fsync'ing the directory not the files
themselves, no?
Other things that seem poorly thought out:
* Why is the check for "map-" prefix after, rather than before,
the lstat?
* Why is it okay to ignore lstat failure? Seems like we might
as well not even have the lstat.
* The sscanf on the file name would not notice trailing junk,
such as an editor backup marker. Is that okay?
As far as the patch itself goes, I agree that failure to unlink
is noncritical, because such a file would have no further effect
and we can just ignore it. I think I also agree that failure
of the sscanf is noncritical, because the implication of that
is that the file name doesn't conform to our expectations, which
means it's basically just like the check that causes us to
ignore names not starting with "map-". (Actually, isn't the
separate check for "map-" useless, given that sscanf will make
the equivalent check?)
I started out wondering why the patch didn't also change the loop's
other ERROR conditions to LOG. But we do want to ERROR if we're
unable to sync transient state down to disk, and that is what
the other steps (think they) are doing. It might be worth a
comment to point that out though, before someone decides that
if these errors are just LOG level then the others can be too.
Anyway, I think possibly we can drop the bottom half of the loop
(the part trying to fsync non-removed files) in favor of fsync'ing
the directory afterwards. Thoughts?
regards, tom lane
I wrote:
Anyway, I think possibly we can drop the bottom half of the loop
(the part trying to fsync non-removed files) in favor of fsync'ing
the directory afterwards. Thoughts?
Oh, scratch that --- *this* loop doesn't care about the file
contents, but other code does. However, don't we need a directory
fsync too? Or is that handled someplace else?
regards, tom lane
Hi,
On 2022-01-19 13:34:21 -0500, Tom Lane wrote:
Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes:
On Sat, Jan 15, 2022 at 2:59 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
Maybe I'm missing something but wouldn't
https://commitfest.postgresql.org/36/3448/ better solve the problem?The error can cause the new background process proposed there in that
thread to restart, which is again costly. Since we have LOG-only and
continue behavior in CheckPointSnapBuild already, having the same
behavior for CheckPointLogicalRewriteHeap helps a lot.[ stares at CheckPointLogicalRewriteHeap for awhile ... ]
This code has got more problems than that. It took me awhile to
absorb it, but we don't actually care about the contents of any of
those files; all of the information is encoded in the file *names*.
I'm not following - we *do* need the contents of the files? They're applied
as-needed in ApplyLogicalMappingFile().
(This strikes me as probably not a very efficient design, compared
to putting the same data into a single text file; but for now I'll
assume we're not up for a complete rewrite.) That being the case,
I wonder what it is we expect fsync'ing the surviving files to do
exactly. We should be fsync'ing the directory not the files
themselves, no?
Fsyncing the directory doesn't guarantee anything about the contents of
files. But, you're right, we need an fsync of the directory too.
Other things that seem poorly thought out:
* Why is the check for "map-" prefix after, rather than before,
the lstat?
It doesn't seem to matter much - there shouldn't be a meaningful amount of
other files in there.
* Why is it okay to ignore lstat failure? Seems like we might
as well not even have the lstat.
Yea, that seems odd, not sure why that ended up this way. I guess the aim
might have been to tolerate random files we don't have permissions for or
such?
* The sscanf on the file name would not notice trailing junk,
such as an editor backup marker. Is that okay?
I don't really see a problem with it - there shouldn't be other files matching
the pattern - but it couldn't hurt to check the pattern matches exhaustively.
As far as the patch itself goes, I agree that failure to unlink
is noncritical, because such a file would have no further effect
and we can just ignore it.
I don't agree. We iterate through the directory regularly on systems with
catalog changes + logical decoding. An ever increasing list of gunk will make
that more and more expensive. And I haven't heard a meaningful reason why we
would have map-* files that we can't remove.
Ignoring failures like this just makes problems much harder to debug and they
tend to bite harder for it.
I think I also agree that failure of the sscanf is noncritical, because the
implication of that is that the file name doesn't conform to our
expectations, which means it's basically just like the check that causes us
to ignore names not starting with "map-". (Actually, isn't the separate
check for "map-" useless, given that sscanf will make the equivalent check?)
Well, this way only files starting with "map-" are expected to conform to a
strict format, the rest is ignored?
Anyway, I think possibly we can drop the bottom half of the loop
(the part trying to fsync non-removed files) in favor of fsync'ing
the directory afterwards. Thoughts?
I don't think that'd be correct.
In short: We should add a directory fsync, I'm fine with improving the error
checking, but the rest seems like a net-negative with no convincing reasoning.
Greetings,
Andres Freund
On 1/19/22, 11:08 AM, "Andres Freund" <andres@anarazel.de> wrote:
On 2022-01-19 13:34:21 -0500, Tom Lane wrote:
As far as the patch itself goes, I agree that failure to unlink
is noncritical, because such a file would have no further effect
and we can just ignore it.I don't agree. We iterate through the directory regularly on systems with
catalog changes + logical decoding. An ever increasing list of gunk will make
that more and more expensive. And I haven't heard a meaningful reason why we
would have map-* files that we can't remove.
I think the other side of this is that we don't want checkpointing to
continually fail because of a noncritical failure. That could also
lead to problems down the road.
Ignoring failures like this just makes problems much harder to debug and they
tend to bite harder for it.
If such noncritical failures happened regularly, the server logs will
likely become filled with messages about it. Perhaps users may not
notice for a while, but I don't think the proposed patch would make
debugging excessively difficult.
Nathan
"Bossart, Nathan" <bossartn@amazon.com> writes:
I think the other side of this is that we don't want checkpointing to
continually fail because of a noncritical failure. That could also
lead to problems down the road.
Yeah, a persistent failure to complete checkpoints is very nasty.
Your disk will soon fill with unrecyclable WAL. I don't see how
that's better than a somewhat hypothetical performance issue.
regards, tom lane
I took the liberty of adjusting Bharath's patch based on the latest
feedback.
On 1/19/22, 10:35 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
We should be fsync'ing the directory not the files
themselves, no?
I added a directory sync at the end of CheckPointLogicalRewriteHeap(),
which IIUC is enough.
* Why is the check for "map-" prefix after, rather than before,
the lstat?
I swapped these checks. I stopped short of moving the sscanf() before
the lstat(), though, as 1) I don't think it will help very much and 2)
it seemed weird to start emitting "could not parse filename" logs for
non-regular files we presently skip silently.
* Why is it okay to ignore lstat failure? Seems like we might
as well not even have the lstat.
I added error checking for lstat().
* The sscanf on the file name would not notice trailing junk,
such as an editor backup marker. Is that okay?
I think this is okay. The absolute worst that would happen would be
that the extra file would be deleted. This might eventually become a
problem if files with the same prefix format were created by the
server. However, CheckPointSnapBuild() already has this problem with
temporary files, and it claims not to need any extra handling:
* temporary filenames from SnapBuildSerialize() include the LSN and
* everything but are postfixed by .$pid.tmp. We can just remove them
* the same as other files because there can be none that are
* currently being written that are older than cutoff.
(Actually, isn't the
separate check for "map-" useless, given that sscanf will make
the equivalent check?)
The only benefit I see from the extra "map-" check is that it'll avoid
"could not parse filename" logs for files that clearly aren't related
to the task at hand. I don't know if this is expected during normal
operation at the moment. I've left the "map-" check for now.
I started out wondering why the patch didn't also change the loop's
other ERROR conditions to LOG. But we do want to ERROR if we're
unable to sync transient state down to disk, and that is what
the other steps (think they) are doing. It might be worth a
comment to point that out though, before someone decides that
if these errors are just LOG level then the others can be too.
I added such a comment.
I also updated CheckPointSnapBuild() and UpdateLogicalMappings() with
similar adjustments.
Nathan
Attachments:
v2-0001-fix-up-code-in-directory-scans-for-logical-replic.patchapplication/octet-stream; name=v2-0001-fix-up-code-in-directory-scans-for-logical-replic.patchDownload+41-17
Hi,
On 2022-01-20 19:15:08 +0000, Bossart, Nathan wrote:
* Why is it okay to ignore lstat failure? Seems like we might
as well not even have the lstat.I added error checking for lstat().
It seems odd to change a bunch of things to not be errors anymore, but then
add new sources of errors. If we try to deal with concurrent deletions or
permission issues - otherwise what's the point of making unlink() not an error
anymore - why do we expect to be able to lstat()?
I also updated CheckPointSnapBuild() and UpdateLogicalMappings() with
similar adjustments.
FWIW, I still think the ERROR->LOG changes are bad idea. The whole thing of
"oh, let's just ignore stuff that we don't expect and soldier on" has bitten
us over and over again. It makes us less robust, not more robust.
It's also just about impossible to monitor for problems that emit LOG.
I'd be more on board accepting some selective errors. E.g. not erroring on
ENOENT, but continuing to error on others (most likely ENOACCESS). I think we
should *not* change the
+ /* + * We just log a message if a file doesn't fit the pattern, it's + * probably some editor's lock/state file or similar... + */
An editor's lock file that starts with map- would presumably be the whole
filename plus an additional file-ending. But this check won't catch those.
+ * Unlike failures to unlink() old logical rewrite files, we must + * ERROR if we're unable to sync transient state down to disk. This + * allows replay to assume that everything written out before + * checkpoint start is persisted. */
Hm, not quite happy with the second bit. Checkpointed state being durable
isn't about replay directly, it's about the basic property of a checkpoint
being fulfilled?
pgstat_report_wait_start(WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC); if (pg_fsync(fd) != 0) @@ -1282,4 +1305,7 @@ CheckPointLogicalRewriteHeap(void) } } FreeDir(mappings_dir); + + /* persist directory entries to disk */ + fsync_fname("pg_logical/mappings", true); }
This is probably worth backpatching, if you split it out I'll do so.
Greetings,
Andres Freund
Thanks for your feedback.
On 1/20/22, 11:47 AM, "Andres Freund" <andres@anarazel.de> wrote:
It seems odd to change a bunch of things to not be errors anymore, but then
add new sources of errors. If we try to deal with concurrent deletions or
permission issues - otherwise what's the point of making unlink() not an error
anymore - why do we expect to be able to lstat()?
My reasoning for making lstat() an ERROR was that there's a chance we
need to fsync() the file, and if we can't fsync() a file for whatever
reason, we definitely want to ERROR. I suppose we could conditionally
ERROR based on the file name, but I don't know if that's really worth
the complexity.
I'd be more on board accepting some selective errors. E.g. not erroring on
ENOENT, but continuing to error on others (most likely ENOACCESS). I think we
should *not* change the
I think this approach would work for the use-case Bharath mentioned
upthread. In any case, if deleting a file fails because the file was
already deleted, there's no point in ERROR-ing. I think filtering
errors is a bit trickier for lstat(). If we would've fsync'd the file
but lstat() gives us ENOENT, we may have a problem. (However, there's
also a good chance we wouldn't notice such problems if the race didn't
occur.) I'll play around with it.
+ /* + * We just log a message if a file doesn't fit the pattern, it's + * probably some editor's lock/state file or similar... + */An editor's lock file that starts with map- would presumably be the whole
filename plus an additional file-ending. But this check won't catch those.
Right, it will either fsync() or unlink() those. I'll work on the
comment. Or do you think it's worth validating that there are no
trailing characters? I looked into that a bit earlier, and the code
felt excessive to me, but I don't have a strong opinion here.
+ * Unlike failures to unlink() old logical rewrite files, we must + * ERROR if we're unable to sync transient state down to disk. This + * allows replay to assume that everything written out before + * checkpoint start is persisted. */Hm, not quite happy with the second bit. Checkpointed state being durable
isn't about replay directly, it's about the basic property of a checkpoint
being fulfilled?
I'll work on this. FWIW I modeled this based on the comment above the
function. Do you think that should be adjusted as well?
+ /* persist directory entries to disk */ + fsync_fname("pg_logical/mappings", true);This is probably worth backpatching, if you split it out I'll do so.
Sure thing, will do shortly.
Nathan
On 1/20/22, 12:24 PM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
On 1/20/22, 11:47 AM, "Andres Freund" <andres@anarazel.de> wrote:
+ /* persist directory entries to disk */ + fsync_fname("pg_logical/mappings", true);This is probably worth backpatching, if you split it out I'll do so.
Sure thing, will do shortly.
Here's this part.
Nathan
Attachments:
v3-0001-persist-directory-entries-for-logical-rewrite-fil.patchapplication/octet-stream; name=v3-0001-persist-directory-entries-for-logical-rewrite-fil.patchDownload+3-1
On 2022-01-20 20:41:16 +0000, Bossart, Nathan wrote:
Here's this part.
And pushed to all branches. Thanks.
On Fri, Jan 21, 2022 at 11:49:56AM -0800, Andres Freund wrote:
On 2022-01-20 20:41:16 +0000, Bossart, Nathan wrote:
Here's this part.
And pushed to all branches. Thanks.
Thanks!
I spent some time thinking about the right way to proceed here, and I came
up with the attached patches. The first patch just adds error checking for
various lstat() calls in the replication code. If lstat() fails, then it
probably doesn't make sense to try to continue processing the file.
The second patch changes some nearby calls to ereport() to ERROR. If these
failures are truly unexpected, and we don't intend to support use-cases
like concurrent manual deletion, then failing might be the right way to go.
I think it's a shame that such failures could cause checkpointing to
continually fail, but that topic is already being discussed elsewhere [0]/messages/by-id/C1EE64B0-D4DB-40F3-98C8-0CED324D34CB@amazon.com.
[0]: /messages/by-id/C1EE64B0-D4DB-40F3-98C8-0CED324D34CB@amazon.com
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com/
On Thu, Jan 27, 2022 at 6:31 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
I spent some time thinking about the right way to proceed here, and I came
up with the attached patches. The first patch just adds error checking for
various lstat() calls in the replication code. If lstat() fails, then it
probably doesn't make sense to try to continue processing the file.The second patch changes some nearby calls to ereport() to ERROR. If these
failures are truly unexpected, and we don't intend to support use-cases
like concurrent manual deletion, then failing might be the right way to go.
I think it's a shame that such failures could cause checkpointing to
continually fail, but that topic is already being discussed elsewhere [0].[0] /messages/by-id/C1EE64B0-D4DB-40F3-98C8-0CED324D34CB@amazon.com
After an off-list discussion with Andreas, proposing here a patch that
basically replaces ReadDir call with ReadDirExtended and gets rid of
lstat entirely. With this chance, the checkpoint will only care about
the snapshot and mapping files and not fail if it finds other files in
the directories. Removing lstat enables us to make things faster as we
avoid a bunch of extra system calls - one lstat call per each mapping
or snapshot file.
This patch also includes, converting "could not parse filename" and
"could not remove file" errors to LOG messages in
CheckPointLogicalRewriteHeap. This will enable checkpoint not to waste
the amount of work that it has done in case it is unable to
parse/remove the file, for whatever reasons.
Regards,
Bharath Rupireddy.