improving basebackup.c's file-reading code
Hi,
basebackup.c's code to read from files uses fread() and friends. This
is not great, because it's not documented to set errno. See commit
286af0ce12117bc673b97df6228d1a666594d247 and related discussion. It
seems like a better idea would be to use pg_pgread(), which not only
does set errno, but also lets us eliminate a bit of code that uses
fseek().
There are a couple of other things here that can also be improved. One
is that it seems like a good idea to set a wait event while doing I/O
here, as we do elsewhere. Another is that it seems like a good idea to
report short reads in a non-confusing, non-wrong sort of way. I here
adopted the convention previously mentioned in
/messages/by-id/20200128020303.GA1552@paquier.xyz
Patch, for v14, attached.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
v1-0001-Improve-server-code-to-read-files-as-part-of-a-ba.patchapplication/octet-stream; name=v1-0001-Improve-server-code-to-read-files-as-part-of-a-ba.patchDownload
From 48da23e5a719dac972fe8cc0a1c7153d7e54ab19 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 27 Apr 2020 12:04:04 -0400
Subject: [PATCH v1] Improve server code to read files as part of a base
backup.
Don't use fread(), since that doesn't necessarily set errno. We could
use read() instead, but it's even better to use pg_pread(), which
allows us to avoid some extra calls to seek to the desired location in
the file.
Also, advertise a wait event while reading from a file, as we do for
most other places where we're reading data from files.
---
doc/src/sgml/monitoring.sgml | 7 +-
src/backend/postmaster/pgstat.c | 3 +
src/backend/replication/basebackup.c | 143 ++++++++++++++-------------
src/include/pgstat.h | 3 +-
4 files changed, 88 insertions(+), 68 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 6562cc400b..1f193592a5 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1536,7 +1536,12 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting in a cost-based vacuum delay point.</entry>
</row>
<row>
- <entry morerows="68"><literal>IO</literal></entry>
+ <entry morerows="69"><literal>IO</literal></entry>
+ <entry><literal>BaseBackupRead</literal></entry>
+ <entry>Waiting for base backup to read from a file.</entry>
+ </row>
+ <row>
+ <entry><literal>IO</literal></entry>
<entry><literal>BufFileRead</literal></entry>
<entry>Waiting for a read from a buffered file.</entry>
</row>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 50eea2e8a8..89cf8fbc84 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3926,6 +3926,9 @@ pgstat_get_wait_io(WaitEventIO w)
switch (w)
{
+ case WAIT_EVENT_BASEBACKUP_READ:
+ event_name = "BaseBackupRead";
+ break;
case WAIT_EVENT_BUFFILE_READ:
event_name = "BufFileRead";
break;
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index fbdc28ec39..ee1060ab8b 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -79,6 +79,8 @@ static int compareWalFileNames(const ListCell *a, const ListCell *b);
static void throttle(size_t increment);
static void update_basebackup_progress(int64 delta);
static bool is_checksummed_file(const char *fullpath, const char *filename);
+static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
+ const char *filename, bool partial_read_ok);
/* Was the backup currently in-progress initiated in recovery mode? */
static bool backup_started_in_recovery = false;
@@ -96,18 +98,6 @@ static char *statrelpath = NULL;
*/
#define THROTTLING_FREQUENCY 8
-/*
- * Checks whether we encountered any error in fread(). fread() doesn't give
- * any clue what has happened, so we check with ferror(). Also, neither
- * fread() nor ferror() set errno, so we just throw a generic error.
- */
-#define CHECK_FREAD_ERROR(fp, filename) \
-do { \
- if (ferror(fp)) \
- ereport(ERROR, \
- (errmsg("could not read from file \"%s\"", filename))); \
-} while (0)
-
/* The actual number of bytes, transfer of which may cause sleep. */
static uint64 throttling_sample;
@@ -595,7 +585,7 @@ perform_base_backup(basebackup_options *opt)
foreach(lc, walFileList)
{
char *walFileName = (char *) lfirst(lc);
- FILE *fp;
+ int fd;
char buf[TAR_SEND_SIZE];
size_t cnt;
pgoff_t len = 0;
@@ -603,8 +593,8 @@ perform_base_backup(basebackup_options *opt)
snprintf(pathbuf, MAXPGPATH, XLOGDIR "/%s", walFileName);
XLogFromFileName(walFileName, &tli, &segno, wal_segment_size);
- fp = AllocateFile(pathbuf, "rb");
- if (fp == NULL)
+ fd = OpenTransientFile(pathbuf, O_RDONLY | PG_BINARY);
+ if (fd < 0)
{
int save_errno = errno;
@@ -621,7 +611,7 @@ perform_base_backup(basebackup_options *opt)
errmsg("could not open file \"%s\": %m", pathbuf)));
}
- if (fstat(fileno(fp), &statbuf) != 0)
+ if (fstat(fd, &statbuf) != 0)
ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not stat file \"%s\": %m",
@@ -637,9 +627,10 @@ perform_base_backup(basebackup_options *opt)
/* send the WAL file itself */
_tarWriteHeader(pathbuf, NULL, &statbuf, false);
- while ((cnt = fread(buf, 1,
- Min(sizeof(buf), wal_segment_size - len),
- fp)) > 0)
+ while ((cnt = basebackup_read_file(fd, buf,
+ Min(sizeof(buf),
+ wal_segment_size - len),
+ len, pathbuf, true)) > 0)
{
CheckXLogRemoved(segno, tli);
/* Send the chunk as a CopyData message */
@@ -655,8 +646,6 @@ perform_base_backup(basebackup_options *opt)
break;
}
- CHECK_FREAD_ERROR(fp, pathbuf);
-
if (len != wal_segment_size)
{
CheckXLogRemoved(segno, tli);
@@ -667,7 +656,7 @@ perform_base_backup(basebackup_options *opt)
/* wal_segment_size is a multiple of 512, so no need for padding */
- FreeFile(fp);
+ CloseTransientFile(fd);
/*
* Mark file as archived, otherwise files can get archived again
@@ -1561,7 +1550,7 @@ sendFile(const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid,
backup_manifest_info *manifest, const char *spcoid)
{
- FILE *fp;
+ int fd;
BlockNumber blkno = 0;
bool block_retry = false;
char buf[TAR_SEND_SIZE];
@@ -1580,8 +1569,8 @@ sendFile(const char *readfilename, const char *tarfilename,
pg_checksum_init(&checksum_ctx, manifest->checksum_type);
- fp = AllocateFile(readfilename, "rb");
- if (fp == NULL)
+ fd = OpenTransientFile(readfilename, O_RDONLY | PG_BINARY);
+ if (fd < 0)
{
if (errno == ENOENT && missing_ok)
return false;
@@ -1623,8 +1612,27 @@ sendFile(const char *readfilename, const char *tarfilename,
}
}
- while ((cnt = fread(buf, 1, Min(sizeof(buf), statbuf->st_size - len), fp)) > 0)
+ /*
+ * Loop until we read at least the amount of data the caller told us to
+ * expect. The file could be longer, if it was extended while we were
+ * sending it, but for a base backup we can ignore such extended data. It
+ * will be restored from WAL.
+ */
+ while (len < statbuf->st_size)
{
+ /* Try to read some more data. */
+ cnt = basebackup_read_file(fd, buf,
+ Min(sizeof(buf), statbuf->st_size - len),
+ len, readfilename, true);
+
+ /*
+ * If we hit end-of-file, a concurrent truncation must have occurred.
+ * That's not an error condition, because WAL replay will fix things
+ * up.
+ */
+ if (cnt == 0)
+ break;
+
/*
* The checksums are verified at block level, so we iterate over the
* buffer in chunks of BLCKSZ, after making sure that
@@ -1675,16 +1683,15 @@ sendFile(const char *readfilename, const char *tarfilename,
*/
if (block_retry == false)
{
- /* Reread the failed block */
- if (fseek(fp, -(cnt - BLCKSZ * i), SEEK_CUR) == -1)
- {
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not fseek in file \"%s\": %m",
- readfilename)));
- }
+ int reread_cnt;
- if (fread(buf + BLCKSZ * i, 1, BLCKSZ, fp) != BLCKSZ)
+ /* Reread the failed block */
+ reread_cnt =
+ basebackup_read_file(fd, buf + BLCKSZ * i,
+ BLCKSZ, len + BLCKSZ * i,
+ readfilename,
+ false);
+ if (reread_cnt == 0)
{
/*
* If we hit end-of-file, a concurrent
@@ -1694,24 +1701,8 @@ sendFile(const char *readfilename, const char *tarfilename,
* code that handles that case. (We must fix
* up cnt first, though.)
*/
- if (feof(fp))
- {
- cnt = BLCKSZ * i;
- break;
- }
-
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not reread block %d of file \"%s\": %m",
- blkno, readfilename)));
- }
-
- if (fseek(fp, cnt - BLCKSZ * i - BLCKSZ, SEEK_CUR) == -1)
- {
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not fseek in file \"%s\": %m",
- readfilename)));
+ cnt = BLCKSZ * i;
+ break;
}
/* Set flag so we know a retry was attempted */
@@ -1754,20 +1745,8 @@ sendFile(const char *readfilename, const char *tarfilename,
len += cnt;
throttle(cnt);
-
- if (feof(fp) || len >= statbuf->st_size)
- {
- /*
- * Reached end of file. The file could be longer, if it was
- * extended while we were sending it, but for a base backup we can
- * ignore such extended data. It will be restored from WAL.
- */
- break;
- }
}
- CHECK_FREAD_ERROR(fp, readfilename);
-
/* If the file was truncated while we were sending it, pad it with zeros */
if (len < statbuf->st_size)
{
@@ -1796,7 +1775,7 @@ sendFile(const char *readfilename, const char *tarfilename,
update_basebackup_progress(pad);
}
- FreeFile(fp);
+ CloseTransientFile(fd);
if (checksum_failures > 1)
{
@@ -1982,3 +1961,35 @@ update_basebackup_progress(int64 delta)
pgstat_progress_update_multi_param(nparam, index, val);
}
+
+/*
+ * Read some data from a file, setting a wait event and reporting any error
+ * encountered.
+ *
+ * If partial_read_ok is false, also report an error if the number of bytes
+ * read is not equal to the number of bytes requested.
+ *
+ * Returns the number of bytes read.
+ */
+static int
+basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
+ const char *filename, bool partial_read_ok)
+{
+ int rc;
+
+ pgstat_report_wait_start(WAIT_EVENT_BASEBACKUP_READ);
+ rc = pg_pread(fd, buf, nbytes, offset);
+ pgstat_report_wait_end();
+
+ if (rc < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m", filename)));
+ if (!partial_read_ok && rc > 0 && rc != nbytes)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": read %d of %zu",
+ filename, rc, nbytes)));
+
+ return rc;
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index b8041d9988..3ae40d305a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -913,7 +913,8 @@ typedef enum
*/
typedef enum
{
- WAIT_EVENT_BUFFILE_READ = PG_WAIT_IO,
+ WAIT_EVENT_BASEBACKUP_READ = PG_WAIT_IO,
+ WAIT_EVENT_BUFFILE_READ,
WAIT_EVENT_BUFFILE_WRITE,
WAIT_EVENT_CONTROL_FILE_READ,
WAIT_EVENT_CONTROL_FILE_SYNC,
--
2.24.2 (Apple Git-127)
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: not tested
Documentation: tested, passed
The idea and the patch looks good to me.
It makes sense to change fread to basebackup_read_file which internally calls pg_pread which is a portable function as opposed to read. As you've mentioned, this is much better for error handling.
I guess it is more of a personal choice, but I would suggest making the while conditions consistent as well. The while loop in the patch @ line 121 conditions over return value of "basebackup_read_file" whereas @ line 177, it has a condition "(len < statbuf->st_size)".
The new status of this patch is: Ready for Committer
On Wed, Apr 29, 2020 at 5:52 AM Hamid Akhtar <hamid.akhtar@gmail.com> wrote:
The idea and the patch looks good to me.
It makes sense to change fread to basebackup_read_file which internally calls pg_pread which is a portable function as opposed to read. As you've mentioned, this is much better for error handling.
Thanks for the review. I have now committed the patch, after rebasing
and adjusting one comment slightly.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 17 Jun 2020, at 17:45, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Apr 29, 2020 at 5:52 AM Hamid Akhtar <hamid.akhtar@gmail.com> wrote:
The idea and the patch looks good to me.
It makes sense to change fread to basebackup_read_file which internally calls pg_pread which is a portable function as opposed to read. As you've mentioned, this is much better for error handling.
Thanks for the review. I have now committed the patch, after rebasing
and adjusting one comment slightly.
As this went in, can we close the 2020-07 CF entry or is there anything left in
the patchseries?
cheers ./daniel
On Thu, Jun 25, 2020 at 10:29 AM Daniel Gustafsson <daniel@yesql.se> wrote:
As this went in, can we close the 2020-07 CF entry or is there anything left in
the patchseries?
Done. Thanks for the reminder.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company