pg_receivewal fails to stream when a partially initialized .partial file is present in the WAL receiver directory
Hi Hackers,
I noticed that pg_receivewal fails to stream when the partial file to write
is not fully initialized, failing with an error message like the one below.
This requires an extra step of deleting the partial file that is not fully
initialized before starting pg_receivewal. Attaching a simple patch that
creates a temp file, fully initializes it, and renames it to the desired
WAL segment name.
"error: write-ahead log file "000000010000000000000003.partial" has 8396800
bytes, should be 0 or 16777216"
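The pad-then-rename idea in the patch can be sketched roughly as below. This is a minimal standalone sketch with hypothetical names (create_padded_segment, the .tmp suffix); the actual patch works inside dir_open_for_write in walmethods.c:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Pad a file with zeros up to seg_size, writing under a temporary name
 * first and renaming it into place only once it is fully initialized.
 * A crash mid-padding leaves only the .tmp file behind, never an
 * undersized file under the target name.
 */
static int
create_padded_segment(const char *target, size_t seg_size, size_t blck_sz)
{
	char		tmppath[1024];
	char		zeros[8192];	/* blck_sz must not exceed this */
	size_t		written = 0;
	int			fd;

	snprintf(tmppath, sizeof(tmppath), "%s.tmp", target);
	unlink(tmppath);			/* discard any leftover from a crash */

	fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	memset(zeros, 0, sizeof(zeros));
	while (written < seg_size)
	{
		size_t		chunk = (seg_size - written < blck_sz) ?
			seg_size - written : blck_sz;

		if (write(fd, zeros, chunk) != (ssize_t) chunk)
		{
			close(fd);
			unlink(tmppath);	/* no partially padded file left behind */
			return -1;
		}
		written += chunk;
	}

	if (fsync(fd) != 0 || close(fd) != 0)
	{
		unlink(tmppath);
		return -1;
	}

	/* Only a fully initialized file ever appears under the target name. */
	return rename(tmppath, target);
}
```

With this shape, a restart after a crash either finds a complete target file or a disposable .tmp file, never an undersized target.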
Thanks,
Satya
Attachments:
pg_receivewal_init_fix.patch (application/octet-stream, +53 −11)
On Sun, Jan 02, 2022 at 09:27:43PM -0800, SATYANARAYANA NARLAPURAM wrote:
I noticed that pg_receivewal fails to stream when the partial file to write
is not fully initialized and fails with the error message something like
below. This requires an extra step of deleting the partial file that is not
fully initialized before starting the pg_receivewal. Attaching a simple
patch that creates a temp file, fully initialize it and rename the file to
the desired wal segment name.
Are you referring to the pre-padding when creating a new partial
segment, aka when we write chunks of XLOG_BLCKSZ full of zeros until
the file is fully created? What kind of error did you see? I guess
that a write() with ENOSPC would be more likely, but you got a
different problem? I don't disagree with improving such cases, but we
should not do things so as there is a risk of leaving behind an
infinite set of segments in case of repeated errors, and partial
segments are already a kind of temporary file.
- if (dir_data->sync)
+ if (shouldcreatetempfile)
+ {
+ if (durable_rename(tmpsuffixpath, targetpath) != 0)
+ {
+ close(fd);
+ unlink(tmpsuffixpath);
+ return NULL;
+ }
+ }
durable_rename() does a set of fsync()'s, but --no-sync should not
flush any data.
--
Michael
Thanks Michael!
On Sun, Jan 2, 2022 at 11:56 PM Michael Paquier <michael@paquier.xyz> wrote:
On Sun, Jan 02, 2022 at 09:27:43PM -0800, SATYANARAYANA NARLAPURAM wrote:
I noticed that pg_receivewal fails to stream when the partial file to
write
is not fully initialized and fails with the error message something like
below. This requires an extra step of deleting the partial file that is not
fully initialized before starting the pg_receivewal. Attaching a simple
patch that creates a temp file, fully initialize it and rename the file to
the desired wal segment name.
Are you referring to the pre-padding when creating a new partial
segment, aka when we write chunks of XLOG_BLCKSZ full of zeros until
the file is fully created? What kind of error did you see? I guess
that a write() with ENOSPC would be more likely, but you got a
different problem?
I see two cases, 1/ when no space is left on the device and 2/ when the
process is taken down forcibly (a VM/container crash)
I don't disagree with improving such cases, but we
should not do things so as there is a risk of leaving behind an
infinite set of segments in case of repeated errors
Do you see a problem with the proposed patch where it leaves files behind?
At least in my testing I don't see any files left behind.
, and partial
segments are already a kind of temporary file.
If the .partial file exists but is not zero-padded up to the WAL segment
size (WalSegSz), then open_walfile fails with the below error. I have two
options here: 1/ continue padding the existing partial file with zeros up
to WalSegSz, or 2/ create a temp file as I did in the patch. I thought
the latter is safer because it can handle corrupt cases as described below.
Thoughts?
* When streaming to files, if an existing file exists we verify that it's
* either empty (just created), or a complete WalSegSz segment (in which
* case it has been created and padded). Anything else indicates a corrupt
* file. Compressed files have no need for padding, so just ignore this
* case.
- if (dir_data->sync)
+ if (shouldcreatetempfile)
+ {
+ if (durable_rename(tmpsuffixpath, targetpath) != 0)
+ {
+ close(fd);
+ unlink(tmpsuffixpath);
+ return NULL;
+ }
+ }
durable_rename() does a set of fsync()'s, but --no-sync should not
flush any data.
I need to look into this further; without this I am seeing random file
close and rename failures that disconnect the stream. Also, it appears we
are calling durable_rename when we are closing the file (dir_close) even
without --no-sync. Should we worry about the padding case?
--
Michael
Hello
pg_receivewal creates this .partial WAL file during WAL streaming and it is already treating this file as a temporary file. It will fill this .partial file with zeroes up to 16777216 by default before streaming real WAL data on it.
If your .partial file is only 8396800 bytes, this could mean that pg_receivewal was terminated abruptly while it was appending zeroes, or that your system ran out of disk space. Do you have any error message?
If this is the case, the uninitialized .partial file should still be all zeroes, so it should be OK to delete it and have pg_receivewal recreate a new .partial file.
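Cary's observation, that an interrupted pre-padding pass leaves only zeroes behind, suggests a simple safety check before deleting such a file. A standalone sketch (file_is_all_zeros is a hypothetical helper, not part of any posted patch):

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Return 1 if every byte of the file is zero, 0 if any byte is set,
 * -1 on I/O error. An interrupted pre-padding pass should leave a
 * file for which this returns 1, making it safe to delete and let
 * pg_receivewal recreate the segment from scratch.
 */
static int
file_is_all_zeros(const char *path)
{
	char		buf[8192];
	ssize_t		n;
	int			fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	while ((n = read(fd, buf, sizeof(buf))) > 0)
	{
		for (ssize_t i = 0; i < n; i++)
		{
			if (buf[i] != 0)
			{
				close(fd);
				return 0;		/* real data present, do not delete */
			}
		}
	}
	close(fd);
	return (n == 0) ? 1 : -1;
}
```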
Also, in your patch, you are using pad_to_size argument in function dir_open_for_write to determine if it needs to create a temp file, but I see that this function is always given a pad_to_size = 16777216 , and never 0. Am I missing something?
Cary Huang
===========
HighGo Software Canada
On Tue, Jan 4, 2022 at 1:40 AM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:
On Sun, Jan 2, 2022 at 11:56 PM Michael Paquier <michael@paquier.xyz> wrote:
On Sun, Jan 02, 2022 at 09:27:43PM -0800, SATYANARAYANA NARLAPURAM wrote:
I noticed that pg_receivewal fails to stream when the partial file to write
is not fully initialized and fails with the error message something like
below. This requires an extra step of deleting the partial file that is not
fully initialized before starting the pg_receivewal. Attaching a simple
patch that creates a temp file, fully initialize it and rename the file to
the desired wal segment name.
Are you referring to the pre-padding when creating a new partial
segment, aka when we write chunks of XLOG_BLCKSZ full of zeros until
the file is fully created? What kind of error did you see? I guess
that a write() with ENOSPC would be more likely, but you got a
different problem?
I see two cases, 1/ when no space is left on the device and 2/ when the process is taken down forcibly (a VM/container crash)
Yeah, these cases can occur leaving uninitialized .partial files which
can be a problem for both pg_receivewal and pg_basebackup that uses
dir_open_for_write (CreateWalDirectoryMethod).
I don't disagree with improving such cases, but we
should not do things so as there is a risk of leaving behind an
infinite set of segments in case of repeated errors
Do you see a problem with the proposed patch that leaves the files behind, at least in my testing I don't see any files left behind?
With the proposed patch, it doesn't leave the unpadded .partial files.
Also, the v2 patch always removes a leftover .partial.temp file before
it creates a new one.
, and partial
segments are already a kind of temporary file.
if the .partial file exists with not zero-padded up to the wal segment size (WalSegSz), then open_walfile fails with the below error. I have two options here, 1/ to continue padding the existing partial file and let it zero up to WalSegSz , 2/create a temp file as I did in the patch. I thought the latter is safe because it can handle corrupt cases as described below. Thoughts?
The temp file approach looks clean.
- if (dir_data->sync)
+ if (shouldcreatetempfile)
+ {
+ if (durable_rename(tmpsuffixpath, targetpath) != 0)
+ {
+ close(fd);
+ unlink(tmpsuffixpath);
+ return NULL;
+ }
+ }
durable_rename() does a set of fsync()'s, but --no-sync should not
flush any data.
Fixed this in v2.
Another thing I found while working on this is the way the
dir_open_for_write does padding - it doesn't retry in case of partial
writes of blocks of size XLOG_BLCKSZ, unlike what core postgres does
with pg_pwritev_with_retry in XLogFileInitInternal. Maybe
dir_open_for_write should use the same approach. Thoughts?
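The retry behaviour being referred to can be sketched as follows. This is a simplified stand-in for pg_pwritev_with_retry, which actually operates on iovec arrays via pwritev; the helper name here is hypothetical:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Keep writing until all bytes land or a real error occurs. A short
 * write (fewer bytes written than requested, with no error) is retried
 * rather than treated as failure, mirroring the pg_pwritev_with_retry
 * idea from core postgres.
 */
static ssize_t
pwrite_all(int fd, const char *buf, size_t nbytes, off_t offset)
{
	size_t		total = 0;

	while (total < nbytes)
	{
		ssize_t		rc = pwrite(fd, buf + total, nbytes - total,
								offset + total);

		if (rc < 0)
		{
			if (errno == EINTR)
				continue;		/* interrupted by a signal, just retry */
			return -1;			/* genuine error, e.g. ENOSPC */
		}
		total += rc;			/* partial write: loop to write the rest */
	}
	return (ssize_t) total;
}
```

A padding loop built on such a helper would not mistake a transient short write for a fatal error.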
I fixed couple of issues with v1 (which was using static local
variables in dir_open_for_write, not using durable_rename/rename for
dir_data->sync true/false cases, not considering compression method
none while setting shouldcreatetempfile true), improved comments and
added commit message.
Please review the v2 further.
Regards,
Bharath Rupireddy.
Attachments:
v2-0001-Avoid-partially-padded-files-under-pg_receivewal-.patch (application/octet-stream, +90 −10)
At Sat, 9 Apr 2022 18:03:01 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in
On Tue, Jan 4, 2022 at 1:40 AM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:
On Sun, Jan 2, 2022 at 11:56 PM Michael Paquier <michael@paquier.xyz> wrote:
Are you referring to the pre-padding when creating a new partial
segment, aka when we write chunks of XLOG_BLCKSZ full of zeros until
the file is fully created? What kind of error did you see? I guess
that a write() with ENOSPC would be more likely, but you got a
different problem?
I see two cases, 1/ when no space is left on the device and 2/ when the process is taken down forcibly (a VM/container crash)
Yeah, these cases can occur leaving uninitialized .partial files which
can be a problem for both pg_receivewal and pg_basebackup that uses
dir_open_for_write (CreateWalDirectoryMethod).
I don't disagree with improving such cases, but we
should not do things so as there is a risk of leaving behind an
infinite set of segments in case of repeated errors
Do you see a problem with the proposed patch that leaves the files behind, at least in my testing I don't see any files left behind?
I guess that Michael took this patch as creating a temp file name such
like "tmp.n" very time finding an incomplete file.
With the proposed patch, it doesn't leave the unpadded .partial files.
Also, the v2 patch always removes a leftover .partial.temp file before
it creates a new one.
, and partial
segments are already a kind of temporary file.
I'm not sure this is true for pg_receivewal case. The .partial file
is not a temporary file but the current working file for the tool.
if the .partial file exists with not zero-padded up to the wal segment size (WalSegSz), then open_walfile fails with the below error. I have two options here, 1/ to continue padding the existing partial file and let it zero up to WalSegSz , 2/create a temp file as I did in the patch. I thought the latter is safe because it can handle corrupt cases as described below. Thoughts?
I think this patch shouldn't involve pg_basebackup. I agree with Cary
that deleting the erroring file should be fine.
We already skip complete (= non-.partial) WAL files with a wrong
size in FindStreamingStart, so we can error out while suggesting a hint.
$ pg_receivewal -D xlog -p 5432 -h /tmp
pg_receivewal: error: segment file "0000000100000022000000F5.partial" has incorrect size 8404992
hint: You can continue after removing the file.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Sorry for the terrible typos..
At Sat, 9 Apr 2022 18:03:01 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in
On Tue, Jan 4, 2022 at 1:40 AM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:
On Sun, Jan 2, 2022 at 11:56 PM Michael Paquier <michael@paquier.xyz> wrote:
Are you referring to the pre-padding when creating a new partial
segment, aka when we write chunks of XLOG_BLCKSZ full of zeros until
the file is fully created? What kind of error did you see? I guess
that a write() with ENOSPC would be more likely, but you got a
different problem?
I see two cases, 1/ when no space is left on the device and 2/ when the process is taken down forcibly (a VM/container crash)
Yeah, these cases can occur leaving uninitialized .partial files which
can be a problem for both pg_receivewal and pg_basebackup that uses
dir_open_for_write (CreateWalDirectoryMethod).
I don't disagree with improving such cases, but we
should not do things so as there is a risk of leaving behind an
infinite set of segments in case of repeated errors
Do you see a problem with the proposed patch that leaves the files behind, at least in my testing I don't see any files left behind?
I guess that Michael took this patch as creating a temp file with a
name such like "tmp.n" every time finding an incomplete file.
With the proposed patch, it doesn't leave the unpadded .partial files.
Also, the v2 patch always removes a leftover .partial.temp file before
it creates a new one.
, and partial
segments are already a kind of temporary file.
I'm not sure this is true for pg_receivewal case. The .partial file
is not a temporary file but the current working file for the tool.
if the .partial file exists with not zero-padded up to the wal segment size (WalSegSz), then open_walfile fails with the below error. I have two options here, 1/ to continue padding the existing partial file and let it zero up to WalSegSz , 2/create a temp file as I did in the patch. I thought the latter is safe because it can handle corrupt cases as described below. Thoughts?
I think this patch shouldn't involve pg_basebackup. I agree with Cary
that deleting the erroring file should be fine.
We already skip complete (= non-.partial) WAL files with a wrong
size in FindStreamingStart, so we can error out while suggesting a hint.
$ pg_receivewal -D xlog -p 5432 -h /tmp
pg_receivewal: error: segment file "0000000100000022000000F5.partial" has incorrect size 8404992
hint: You can continue after removing the file.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Sun, Apr 10, 2022 at 11:16 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:
Sorry for the terrible typos..
At Sat, 9 Apr 2022 18:03:01 +0530, Bharath Rupireddy <
bharath.rupireddyforpostgres@gmail.com> wrote in
On Tue, Jan 4, 2022 at 1:40 AM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:
On Sun, Jan 2, 2022 at 11:56 PM Michael Paquier <michael@paquier.xyz>
wrote:
Are you referring to the pre-padding when creating a new partial
segment, aka when we write chunks of XLOG_BLCKSZ full of zeros until
the file is fully created? What kind of error did you see? I guess
that a write() with ENOSPC would be more likely, but you got a
different problem?
I see two cases, 1/ when no space is left on the device and 2/ when
the process is taken down forcibly (a VM/container crash)
Yeah, these cases can occur leaving uninitialized .partial files which
can be a problem for both pg_receivewal and pg_basebackup that uses
dir_open_for_write (CreateWalDirectoryMethod).
I don't disagree with improving such cases, but we
should not do things so as there is a risk of leaving behind an
infinite set of segments in case of repeated errors
Do you see a problem with the proposed patch that leaves the files
behind, at least in my testing I don't see any files left behind?
I guess that Michael took this patch as creating a temp file with a
name such like "tmp.n" every time finding an incomplete file.
With the proposed patch, it doesn't leave the unpadded .partial files.
Also, the v2 patch always removes a leftover .partial.temp file before
it creates a new one.
, and partial
segments are already a kind of temporary file.
I'm not sure this is true for pg_receivewal case. The .partial file
is not a temporary file but the current working file for the tool.
Correct. The idea is to make sure the file is fully allocated before
treating it as a current file.
if the .partial file exists with not zero-padded up to the wal segment
size (WalSegSz), then open_walfile fails with the below error. I have two
options here, 1/ to continue padding the existing partial file and let it
zero up to WalSegSz , 2/create a temp file as I did in the patch. I thought
the latter is safe because it can handle corrupt cases as described below.
Thoughts?
I think this patch shouldn't involve pg_basebackup. I agree to Cary
that deleting the erroring file should be fine.
We already "skipping" (complete = non-.partial) WAL files with a wrong
size in FindStreamingStart so we can error-out with suggesting a hint.
$ pg_receivewal -D xlog -p 5432 -h /tmp
pg_receivewal: error: segment file "0000000100000022000000F5.partial" has
incorrect size 8404992
hint: You can continue after removing the file.
The idea here is to make pg_receivewal self-sufficient and reduce
human/third-party tool interaction. The ideal case is running pg_receivewal
as a service for WAL archiving.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Apr 11, 2022 at 01:21:23PM -0700, SATYANARAYANA NARLAPURAM wrote:
Correct. The idea is to make sure the file is fully allocated before
treating it as a current file.
Another problem comes to compression, as the pre-padding cannot be
applied in this case because zlib and lz4 don't know the size of the
compressed segment until we reach 16MB of data received, but you can
get a good estimate as long as you know how much space is left on a
device. FWIW, I had to deal with this problem a couple of years ago
for the integration of an archiver in a certain thing, and the
requirement was that the WAL archiver service had to be a maximum
self-aware and automated, which is what you wish to achieve here. It
basically came down to measure how much WAL one wishes to keep in the
WAL archives for the sizing of the disk partition storing the
archives (aka how much back in time you want to go), in combination to
how much WAL would get produced on a rather-linear production load.
Another thing is that you never really want to stress too much your
partition so as it gets filled at 100%, as there could be opened files
and the kind that consume more space than the actual amount of data
stored, but you'd usually want to keep up to 70~90% of it. At the
end, we finished with:
- A dependency on statvfs(), which is not portable to WIN32, to find
out how much space was left on the partition (f_blocks*f_bsize for
the total size and f_bfree*f_bsize for the free size I guess, by
looking at its man page).
- Control the amount of WAL to keep around using a percentage rate of
maximum disk space allowed (or just a percentage of free disk space),
with pg_receivewal doing a cleanup of up to WalSegSz worth of data for
the oldest segments. The segments of the oldest TLIs are removed
first. For any compression algorithm, unlinking this much amount of
data is not necessary but that's fine as you usually just remove one
compressed or uncompressed segment per cycle, as it does not matter
with dozens of gigs worth of WAL archives, or even more.
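The statvfs()-based free-space probe Michael describes might look like the sketch below. The helper name is hypothetical; note that f_bavail counts blocks available to unprivileged processes, while f_bfree also includes root-reserved blocks:

```c
#include <assert.h>
#include <sys/statvfs.h>

/*
 * Return 1 if at least needed_bytes are free on the filesystem
 * containing path, 0 if not, -1 on error. Not portable to WIN32,
 * where GetDiskFreeSpaceEx() would be the rough equivalent.
 */
static int
enough_free_space(const char *path, unsigned long long needed_bytes)
{
	struct statvfs vfs;

	if (statvfs(path, &vfs) != 0)
		return -1;
	/* f_frsize is the fragment size the block counts are expressed in. */
	return ((unsigned long long) vfs.f_bavail * vfs.f_frsize
			>= needed_bytes) ? 1 : 0;
}
```

A percentage-based quota policy would additionally use f_blocks * f_frsize for the partition's total size, as described above.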
--
Michael
On Tue, Apr 12, 2022 at 5:34 AM Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Apr 11, 2022 at 01:21:23PM -0700, SATYANARAYANA NARLAPURAM wrote:
Correct. The idea is to make sure the file is fully allocated before
treating it as a current file.
Another problem comes to compression, as the pre-padding cannot be
applied in this case because zlib and lz4 don't know the size of the
compressed segment until we reach 16MB of data received, but you can
get a good estimate as long as you know how much space is left on a
device. FWIW, I had to deal with this problem a couple of years ago
for the integration of an archiver in a certain thing, and the
requirement was that the WAL archiver service had to be a maximum
self-aware and automated, which is what you wish to achieve here. It
basically came down to measure how much WAL one wishes to keep in the
WAL archives for the sizing of the disk partition storing the
archives (aka how much back in time you want to go), in combination to
how much WAL would get produced on a rather-linear production load.
Another thing is that you never really want to stress too much your
partition so as it gets filled at 100%, as there could be opened files
and the kind that consume more space than the actual amount of data
stored, but you'd usually want to keep up to 70~90% of it. At the
end, we finished with:
- A dependency to statvfs(), which is not portable on WIN32, to find
out how much space was left on the partition (f_blocks*f_bsize for
the total size and f_bfree*f_bsize for the free size I guess, by
looking at its man page).
- Control the amount of WAL to keep around using a percentage rate of
maximum disk space allowed (or just a percentage of free disk space),
with pg_receivewal doing a cleanup of up to WalSegSz worth of data for
the oldest segments. The segments of the oldest TLIs are removed
first. For any compression algorithm, unlinking this much amount of
data is not necessary but that's fine as you usually just remove one
compressed or uncompressed segment per cycle, at it does not matter
with dozens of gigs worth of WAL archives, or even more.
Thanks for sharing this. Will the write operations (in
dir_open_for_write) for PG_COMPRESSION_GZIP and PG_COMPRESSION_LZ4
take longer compared to prepadding for non-compressed files?
I would like to know if there's any problem with the proposed fix.
I think we need the same fix proposed in this thread for
tar_open_for_write as well because it also does prepadding for
non-compressed files.
In general, I agree that making pg_receivewal self-aware and
automating things by itself is really a great idea. This will avoid
manual effort. For instance, pg_receivewal can try with different
streaming start LSNs (restart_lsn of its slot or server insert LSN)
not just the latest LSN found in its target directory which will
particularly be helpful in case its source server has changed the
timeline or for some reason unable to serve the WAL.
Regards,
Bharath Rupireddy.
On Mon, Apr 18, 2022 at 02:50:17PM +0530, Bharath Rupireddy wrote:
Thanks for sharing this. Will the write operations (in
dir_open_for_write) for PG_COMPRESSION_GZIP and PG_COMPRESSION_LZ4
take longer compared to prepadding for non-compressed files?
The first write operations for gzip and lz4 consists in writing their
respective headers in the resulting file, which should be a couple of
dozen bytes, at most. So that's surely going to be cheaper than the
pre-padding done for a full segment with the flush induced after
writing WalSegSz bytes worth of zeros.
I would like to know if there's any problem with the proposed fix.
There is nothing done for the case of compressed segments, meaning
that you would see the same problem when in the middle of writing a
segment compressed with gzip or lz4, and that's what you want to avoid
here. So the important part is
not the pre-padding, it is to make sure that there is enough space
reserved for the handling of a full segment before beginning the work
on it.
--
Michael
On Tue, Apr 19, 2022 at 10:42 AM Michael Paquier <michael@paquier.xyz> wrote:
I would like to know if there's any problem with the proposed fix.
There is nothing done for the case of compressed segments, meaning
that you would see the same problem when being in the middle of
writing a segment compressed with gzip or lz4 in the middle of writing
it, and that's what you want to avoid here. So the important part is
not the pre-padding, it is to make sure that there is enough space
reserved for the handling of a full segment before beginning the work
on it.
Right. We find enough disk space and go to write and suddenly the
write operations fail for some reason or the VM crashes because of a
reason other than disk space. I think the foolproof solution is to
figure out the available disk space before prepadding or compressing
and also use the
write-first-to-temp-file-and-then-rename-it-to-original-file as
proposed in the earlier patches in this thread.
Having said that, is there a portable way that we can find out the
disk space available? I read your response upthread that statvfs isn't
portable to WIN32 platforms. So, we just say that part of the fix we
proposed here (checking disk space before prepadding or compressing)
isn't supported on WIN32 and we just do the temp file thing for WIN32
alone?
Regards,
Bharath Rupireddy.
On Fri, Apr 22, 2022 at 07:17:37PM +0530, Bharath Rupireddy wrote:
Right. We find enough disk space and go to write and suddenly the
write operations fail for some reason or the VM crashes because of a
reason other than disk space. I think the foolproof solution is to
figure out the available disk space before prepadding or compressing
and also use the
write-first-to-temp-file-and-then-rename-it-to-original-file as
proposed in the earlier patches in this thread.
Yes, what would count here is only the amount of free space in a
partition. The total amount of space available becomes handy once you
begin introducing things like percentage-based quota policies for the
disk when archiving. The free amount of space could be used to define
a policy based on the maximum number of bytes you need to leave
around, as well, but this is not perfect science as this depends on
what FSes decide to do underneath. There are a couple of designs
possible here. When I had to deal with my upthread case I chose one,
as I only had to worry about Linux; it does not mean that
this is the best choice that would fit with the long-term community
picture. This comes down to how much pg_receivewal should handle
automatically, and how it should handle it.
Having said that, is there a portable way that we can find out the
disk space available? I read your response upthread that statvfs isn't
portable to WIN32 platforms. So, we just say that part of the fix we
proposed here (checking disk space before prepadding or compressing)
isn't supported on WIN32 and we just do the temp file thing for WIN32
alone?
Something like GetDiskFreeSpaceA() would do the trick on WIN32.
https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-getdiskfreespacea
When it comes to compression, creating a temporary file would only
lead to its deletion, mostly, and extra I/O is never free. Anyway,
why would you need the extra logic of the temporary file at all?
That's basically what the .partial file is as pg_receivewal begins
streaming at the beginning of a segment, to the partial file, each
time it sees fit to restart a streaming cycle.
--
Michael
On Mon, Apr 25, 2022 at 6:38 AM Michael Paquier <michael@paquier.xyz> wrote:
On Fri, Apr 22, 2022 at 07:17:37PM +0530, Bharath Rupireddy wrote:
Right. We find enough disk space and go to write and suddenly the
write operations fail for some reason or the VM crashes because of a
reason other than disk space. I think the foolproof solution is to
figure out the available disk space before prepadding or compressing
and also use the
write-first-to-temp-file-and-then-rename-it-to-original-file as
proposed in the earlier patches in this thread.
Yes, what would count here is only the amount of free space in a
partition. The total amount of space available becomes handy once you
begin introducing things like percentage-based quota policies for the
disk when archiving. The free amount of space could be used to define
a policy based on the maximum number of bytes you need to leave
around, as well, but this is not perfect science as this depends of
what FSes decide to do underneath. There are a couple of designs
possible here. When I had to deal with my upthread case I have chosen
one as I had no need to worry only about Linux, it does not mean that
this is the best choice that would fit with the long-term community
picture. This comes down to how much pg_receivewal should handle
automatically, and how it should handle it.
Thanks. I'm not sure why we are only thinking of crashes due to
out-of-disk space. Figuring out free disk space before writing a huge
file (say a WAL file) is a problem in itself for core postgres as
well, not just pg_receivewal.
I think we are off-track a bit here. Let me illustrate what the
whole problem is and the idea:
If the node/VM on which pg_receivewal runs goes down/crashes or fails
during a write operation while padding the target WAL file (the .partial
file) with zeros, the unfilled target WAL file (let me call this file
a partially padded .partial file) will be left over, and subsequent
reads/writes to it will fail with the "write-ahead log file \"%s\"
has %zd bytes, should be 0 or %d" error, which requires manual
intervention to remove it. In a service, this manual intervention is
what we would like to avoid. Let's not bother much right now with
compressed file writes (for now at least) as they don't have a
prepadding phase.
The proposed solution is to make the prepadding atomic - prepad the
XXXX.partial file under the name XXXX.partial.tmp and, after the
prepadding, rename it (durably if the sync option is chosen for
pg_receivewal) to XXXX.partial. Before prepadding XXXX.partial.tmp,
delete XXXX.partial.tmp if it exists.
The above problem isn't unique to pg_receivewal alone, pg_basebackup
too uses CreateWalDirectoryMethod and dir_open_for_write via
ReceiveXlogStream.
IMHO, pg_receivewal checking for available disk space before writing
any file should better be discussed separately?
Regards,
Bharath Rupireddy.
On Mon, Apr 25, 2022 at 5:17 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
On Mon, Apr 25, 2022 at 6:38 AM Michael Paquier <michael@paquier.xyz> wrote:
On Fri, Apr 22, 2022 at 07:17:37PM +0530, Bharath Rupireddy wrote:
Right. We find enough disk space and go to write and suddenly the
write operations fail for some reason or the VM crashes because of a
reason other than disk space. I think the foolproof solution is to
figure out the available disk space before prepadding or compressing
and also use the
write-first-to-temp-file-and-then-rename-it-to-original-file as
proposed in the earlier patches in this thread.
Yes, what would count here is only the amount of free space in a
partition. The total amount of space available becomes handy once you
begin introducing things like percentage-based quota policies for the
disk when archiving. The free amount of space could be used to define
a policy based on the maximum number of bytes you need to leave
around, as well, but this is not perfect science as this depends of
what FSes decide to do underneath. There are a couple of designs
possible here. When I had to deal with my upthread case I have chosen
one as I had no need to worry only about Linux, it does not mean that
this is the best choice that would fit with the long-term community
picture. This comes down to how much pg_receivewal should handle
automatically, and how it should handle it.
Thanks. I'm not sure why we are just thinking of crashes due to
out-of-disk space. Figuring out free disk space before writing a huge
file (say a WAL file) is a problem in itself to the core postgres as
well, not just pg_receivewal.
I think we are off-track a bit here. Let me illustrate what's the
whole problem is and the idea:
If the node/VM on which pg_receivewal runs, goes down/crashes or fails
during write operation while padding the target WAL file (the .partial
file) with zeros, the unfilled target WAL file ((let me call this file
a partially padded .partial file) will be left over and subsequent
reads/writes to that it will fail with "write-ahead log file \"%s\"
has %zd bytes, should be 0 or %d" error which requires manual
intervention to remove it. In a service, this manual intervention is
what we would like to avoid. Let's not much bother right now for
compressed file writes (for now at least) as they don't have a
prepadding phase.
The proposed solution is to make the prepadding atomic - prepad the
XXXX.partial file as XXXX.partial.tmp name and after the prepadding
rename (durably if sync option is chosen for pg_receivewal) to
XXXX.partial. Before prepadding XXXX.partial.tmp, delete the
XXXX.partial.tmp if it exists.
The above problem isn't unique to pg_receivewal alone, pg_basebackup
too uses CreateWalDirectoryMethod and dir_open_for_write via
ReceiveXlogStream.IMHO, pg_receivewal checking for available disk space before writing
any file should better be discussed separately?
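The prepad-to-temp-then-rename idea can be sketched as below. This is only an illustration of the technique, not the patch itself: the function name prepad_atomic and the error handling are made up, and the real code lives in walmethods.c and honors dir_data->sync rather than always fsyncing.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/stat.h>

#define WAL_SEG_SIZE (16 * 1024 * 1024)
#define XLOG_BLCKSZ 8192

/*
 * Pre-pad a WAL segment atomically: write zeros into "<target>.tmp"
 * and only rename it to <target> once it is fully sized, so a crash
 * mid-padding can never leave behind a short <target> file.
 */
static int
prepad_atomic(const char *targetpath)
{
	char		tmppath[1024];
	char		zeros[XLOG_BLCKSZ] = {0};
	int			fd;
	ssize_t		written;

	snprintf(tmppath, sizeof(tmppath), "%s.tmp", targetpath);

	/* A leftover tmp file from a failed attempt is safe to discard. */
	unlink(tmppath);

	fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	for (written = 0; written < WAL_SEG_SIZE; written += XLOG_BLCKSZ)
	{
		if (write(fd, zeros, XLOG_BLCKSZ) != XLOG_BLCKSZ)
		{
			close(fd);
			unlink(tmppath);	/* don't leave a short tmp file around */
			return -1;
		}
	}

	/* Make the padded contents durable before exposing the file. */
	if (fsync(fd) != 0 || close(fd) != 0)
	{
		unlink(tmppath);
		return -1;
	}

	/* rename() within a directory is atomic on POSIX filesystems. */
	if (rename(tmppath, targetpath) != 0)
	{
		unlink(tmppath);
		return -1;
	}
	return 0;
}
```

A crash at any point leaves either no target file or a fully padded one; the worst case is a leftover .tmp file, which the next run deletes before retrying.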
Here's the v3 patch after rebasing.
I just would like to reiterate the issue the patch is trying to solve:
At times (when no space is left on the device or when the process is
taken down forcibly (VM/container crash)), there can be leftover
uninitialized .partial files (note that padding i.e. filling 16MB WAL
files with all zeros is done in non-compression cases) due to which
pg_receivewal fails to come up after the crash. To address this, the
proposed patch makes the padding of 16MB WAL files atomic (first write a
.temp file, pad it and durably rename it to the original .partial
file, ready to be filled with received WAL records). This approach is
similar to how core postgres achieves atomicity while creating a new
WAL file (see how XLogFileInitInternal() creates the xlogtemp.%d file
first and then how InstallXLogFileSegment() durably renames it to the
original WAL file).
Thoughts?
Regards,
Bharath Rupireddy.
Attachments:
v3-0001-Avoid-partially-padded-files-under-pg_receivewal-.patch (+90 -10)
Hi Bharath,
Idea to atomically allocate WAL file by creating tmp file and renaming it
is nice.
I have one question though:
How will a partially created temp file be cleaned up if the VM crashes
or we run out of disk space? Does it end up creating multiple leftover
files, one for every VM crash/out-of-disk event, during the
pg_receivewal run?
Thoughts?
Thanks,
Mahendrakar.
On Mon, 25 Jul 2022 at 16:42, Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
On Sun, Jul 31, 2022 at 8:36 PM mahendrakar s
<mahendrakarforpg@gmail.com> wrote:
Hi Bharath,
Idea to atomically allocate WAL file by creating tmp file and renaming it is nice.
Thanks for reviewing it.
I have one question though:
How will a partially created temp file be cleaned up if the VM crashes
or we run out of disk space? Does it end up creating multiple leftover
files, one for every VM crash/out-of-disk event, during the
pg_receivewal run? Thoughts?

It is handled in the patch, see [1].
Attaching v4 patch herewith which now uses the temporary file suffix
'.tmp' as opposed to the v3 patch's '.temp'. This is just to be in sync
with other atomic file write code in the core - autoprewarm,
pg_stat_statements, slot, basebackup, replorigin, snapbuild, receivelog
and so on.
Please review the v4 patch.
[1]
+ /*
+ * Actual file doesn't exist. Now, create a temporary file pad it
+ * and rename to the target file. The temporary file may exist from
+ * the last failed attempt (failed during partial padding or
+ * renaming or some other). If it exists, let's play safe and
+ * delete it before creating a new one.
+ */
+ snprintf(tmpsuffixpath, sizeof(tmpsuffixpath), "%s.%s",
+ targetpath, "temp");
+
+ if (dir_existsfile(tmpsuffixpath))
+ {
+ if (unlink(tmpsuffixpath) != 0)
+ {
+ dir_data->lasterrno = errno;
+ return NULL;
+ }
+ }
--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
Attachments:
v4-0001-Avoid-partially-padded-files-under-pg_receivewal-.patch (+89 -10)
On Thu, Aug 4, 2022 at 11:59 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Please review the v4 patch.
I've done some more testing today (hacked the code a bit by adding
pg_usleep(10000L); in the pre-padding loop and crashing the pg_receivewal
process to produce the warning [1]) and found that the better place to
remove leftover ".partial.tmp" files is in FindStreamingStart(),
because there we already traverse all the files in the target directory
and can remove any ".partial.tmp" file(s) found along the way.

Please review the v5 patch further.

[1] pg_receivewal: warning: segment file
"0000000100000006000000B9.partial" has incorrect size 15884288,
skipping
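The cleanup-during-scan approach can be sketched as follows. This is an illustration only: the function name cleanup_partial_tmp is made up, and the real patch does this inline in pg_receivewal's FindStreamingStart() directory walk rather than in a separate pass.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * While scanning the target directory, remove any leftover
 * ".partial.tmp" files from a previous crashed pre-padding attempt.
 * Returns the number of files removed, or -1 on error.
 */
static int
cleanup_partial_tmp(const char *dirpath)
{
	DIR		   *dir;
	struct dirent *de;
	const char *suffix = ".partial.tmp";
	size_t		slen = strlen(suffix);
	int			removed = 0;

	if ((dir = opendir(dirpath)) == NULL)
		return -1;

	while ((de = readdir(dir)) != NULL)
	{
		size_t	len = strlen(de->d_name);
		char	path[1024];

		/* only touch files carrying the pre-padding temp suffix */
		if (len <= slen || strcmp(de->d_name + len - slen, suffix) != 0)
			continue;

		snprintf(path, sizeof(path), "%s/%s", dirpath, de->d_name);
		if (unlink(path) == 0)
			removed++;
	}
	closedir(dir);
	return removed;
}
```

Doing the cleanup where the directory is scanned anyway avoids an extra readdir() pass at startup.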
--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
Attachments:
v5-0001-Avoid-partially-padded-files-under-pg_receivewal-.patch (+111 -10)
Hi Bharath,
I reviewed your patch. Minor comments.
1. Why are we not using durable_unlink instead of unlink to remove the
partial tmp files?
2. The below could be a simple if (shouldcreatetempfile) {} else {},
as in the error case we need to return NULL.
+ if (errno != ENOENT || !shouldcreatetempfile)
+ {
+ dir_data->lasterrno = errno;
+ return NULL;
+ }
+ else if (shouldcreatetempfile)
+ {
+ /*
+ * Actual file doesn't exist. Now, create a temporary file pad it
+ * and rename to the target file. The temporary file may exist from
+ * the last failed attempt (failed during partial padding or
+ * renaming or some other). If it exists, let's play safe and
+ * delete it before creating a new one.
+ */
+ snprintf(tmpsuffixpath, MAXPGPATH, "%s.tmp", targetpath);
+
+ if (dir_existsfile(tmpsuffixpath))
+ {
+ if (unlink(tmpsuffixpath) != 0)
+ {
+ dir_data->lasterrno = errno;
+ return NULL;
+ }
+ }
+
+ fd = open(tmpsuffixpath, flags | O_CREAT, pg_file_create_mode);
+ if (fd < 0)
+ {
+ dir_data->lasterrno = errno;
+ return NULL;
+ }
+ }
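The restructure suggested in (2) would read roughly like this. It is a sketch with stand-in names (open_target, last_errno) rather than the patch's actual dir_open_for_write()/dir_data->lasterrno code; the point is that the `else if (shouldcreatetempfile)` test is redundant once the error branch has returned.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static int last_errno;			/* stand-in for dir_data->lasterrno */

/*
 * Open targetpath; if it doesn't exist and we are allowed to, create
 * the "<target>.tmp" file that the caller will pad and rename.  The
 * error branch returns immediately, so no "else if" test is needed.
 */
static int
open_target(const char *targetpath, bool shouldcreatetempfile)
{
	char	tmppath[1024];
	int		fd = open(targetpath, O_WRONLY);

	if (fd >= 0)
		return fd;				/* target already exists, use it */

	if (!shouldcreatetempfile || errno != ENOENT)
	{
		last_errno = errno;		/* error case: record errno, bail out */
		return -1;
	}

	/* target missing and temp-file creation allowed: make the tmp file */
	snprintf(tmppath, sizeof(tmppath), "%s.tmp", targetpath);
	unlink(tmppath);			/* discard any leftover from a crash */
	fd = open(tmppath, O_WRONLY | O_CREAT, 0600);
	if (fd < 0)
		last_errno = errno;
	return fd;					/* caller pads and renames on success */
}
```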
On Mon, 8 Aug 2022 at 11:59, Bharath Rupireddy <
bharath.rupireddyforpostgres@gmail.com> wrote:
On Fri, Aug 19, 2022 at 1:37 PM mahendrakar s
<mahendrakarforpg@gmail.com> wrote:
Hi Bharath,
I reviewed your patch. Minor comments.
Thanks.
1. Why are we not using durable_unlink instead of unlink to remove the partial tmp files?
durable_unlink() issues an fsync on the parent directory; if used,
those fsync() calls would happen once per partial.tmp file. Moreover,
durable_unlink() is backend-only, not available for tools, i.e.
FRONTEND code. If we don't durably remove the partial.tmp file, it
will get deleted in the next cycle anyway, so no problem there.
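For reference, the extra work durable_unlink() does beyond a plain unlink() is roughly the following. This is a frontend-style sketch under the name durable_unlink_sketch, not the backend function itself (which also has error reporting via ereport):

```c
#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Unlink a file and then fsync its parent directory, so the removal
 * of the directory entry itself reaches disk.  That per-file fsync of
 * the parent directory is the cost being discussed above.
 */
static int
durable_unlink_sketch(const char *path)
{
	char	parent[1024];
	int		dirfd;

	if (unlink(path) != 0)
		return -1;

	/* dirname() may modify its argument, so work on a copy */
	strncpy(parent, path, sizeof(parent) - 1);
	parent[sizeof(parent) - 1] = '\0';

	dirfd = open(dirname(parent), O_RDONLY);
	if (dirfd < 0)
		return -1;
	if (fsync(dirfd) != 0)
	{
		close(dirfd);
		return -1;
	}
	return close(dirfd);
}
```

Skipping the directory fsync just means that, after a crash, the unlinked tmp file may reappear - which is harmless here since it is deleted again on the next attempt.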
2. Below could be a simple if(shouldcreatetempfile){} else{} as in error case we need to return NULL.
Yeah, that way it is much simpler.
Please review the attached v6 patch.
--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/