pread() and pwrite()
Hello hackers,
A couple of years ago, Oskari Saarenmaa proposed a patch[1] to adopt
$SUBJECT. Last year, Tobias Oberstein argued again that we should do
that[2]. At the end of that thread there was a +1 from multiple
committers in support of getting it done for PostgreSQL 12. Since
Oskari hasn't returned, I decided to dust off his patch and start a
new thread.
Here is a rebase and an initial review. I plan to address the
problems myself, unless Oskari wants to do that, in which case I'll
happily review and hopefully eventually commit it. It's possible I
introduced new bugs while rebasing since basically everything moved
around, but it passes check-world here with and without HAVE_PREAD and
HAVE_PWRITE defined.
Let me summarise what's going on in the patch:
1. FileRead() and FileWrite() are replaced with FileReadAt() and
FileWriteAt(), taking a new offset argument. Now we can do one
syscall instead of two for random reads and writes.
2. fd.c no longer tracks seek position, so we lose a bunch of cruft
from the vfd machinery. FileSeek() now only supports SEEK_SET and
SEEK_END.
This is taking the position that we no longer want to be able to do
file IO with implicit positioning at all. I think that seems
reasonable: it's nice to drop all that position tracking and caching
code, and the seek-based approach would be incompatible with any
multithreading plans. I'm not even sure I'd bother adding "At" to the
function names. If there are any extensions that want the old
interface they will fail to compile either way. Note that the BufFile
interface remains the same, so hopefully that covers many use cases.
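As a rough illustration of the "explicit positioning" idea, here is a minimal sketch outside of fd.c. The names read_at() and demo_positioned_read() and the temporary file path are assumptions made for this example, not part of the patch. It shows that pread() takes the offset as an argument and leaves the descriptor's implicit position alone, which is the property that makes seek position tracking unnecessary.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read at an explicit offset with one system call. */
ssize_t
read_at(int fd, void *buf, size_t len, off_t offset)
{
    assert(buf != NULL);
    return pread(fd, buf, len, offset);
}

/*
 * Demo: returns 0 if a positioned read returned the expected bytes and left
 * the descriptor's implicit position untouched, -1 otherwise.
 */
int
demo_positioned_read(void)
{
    char        path[] = "/tmp/pread_sketch_XXXXXX";
    char        buf[5];
    int         fd = mkstemp(path);
    int         ok;

    if (fd < 0)
        return -1;
    unlink(path);               /* file stays usable through fd until close */

    ok = write(fd, "hello world", 11) == 11 &&  /* position is now 11 */
        read_at(fd, buf, 5, 6) == 5 &&
        memcmp(buf, "world", 5) == 0 &&
        lseek(fd, 0, SEEK_CUR) == 11;   /* pread() did not move it */

    close(fd);
    return ok ? 0 : -1;
}
```

Because no state is shared through the file position, two callers using the same descriptor cannot disturb each other's offsets, which is presumably why the seek-based approach is described as incompatible with multithreading.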
I guess the only remaining reason to use FileSeek() is to get the file
size? So I wonder why SEEK_SET remains valid in the patch... if my
suspicion is correct that only SEEK_END still has a reason to exist,
perhaps we should just kill FileSeek() and add FileSize() or something
instead?
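A FileSize()-style helper could look roughly like the sketch below. It works on a raw file descriptor rather than a VFD, and file_size() and demo_file_size() are hypothetical names chosen for this example; the real function would also need to go through FileAccess() first.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical FileSize()-style helper: report the size of an open file.
 * Returns -1 with errno set if lseek() fails (for example on a pipe).
 */
off_t
file_size(int fd)
{
    assert(fd >= 0);
    return lseek(fd, 0, SEEK_END);
}

/* Demo: returns the size seen after an 11-byte positioned write, or -1. */
off_t
demo_file_size(void)
{
    char        path[] = "/tmp/filesize_sketch_XXXXXX";
    int         fd = mkstemp(path);
    off_t       size;

    if (fd < 0)
        return -1;
    unlink(path);               /* file stays usable through fd until close */

    if (pwrite(fd, "hello world", 11, 0) != 11)
    {
        close(fd);
        return -1;
    }
    size = file_size(fd);
    close(fd);
    return size;
}
```

lseek(fd, 0, SEEK_END) does move the implicit position, but that is harmless once every read and write passes its own offset.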
pgstat_report_wait_start(wait_event_info);
+#ifdef HAVE_PREAD
+ returnCode = pread(vfdP->fd, buffer, amount, offset);
+#else
+ lseek(VfdCache[file].fd, offset, SEEK_SET);
returnCode = read(vfdP->fd, buffer, amount);
+#endif
pgstat_report_wait_end();
This obviously lacks error handling for lseek().
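One way the non-pread() fallback could check the lseek() result is sketched below, on a raw descriptor and without the wait event reporting. read_at_fallback() and demo_fallback() are hypothetical names for this example, not code from the patch.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical fallback for systems without pread(): seek, then read.
 * If the seek fails we return -1 with errno from lseek() and never read.
 */
ssize_t
read_at_fallback(int fd, void *buf, size_t len, off_t offset)
{
    assert(buf != NULL);
    if (lseek(fd, offset, SEEK_SET) < 0)
        return -1;
    return read(fd, buf, len);
}

/* Demo: returns 0 if both the success and the failure path behave, else -1. */
int
demo_fallback(void)
{
    char        path[] = "/tmp/fallback_sketch_XXXXXX";
    char        buf[5];
    int         pipefd[2];
    int         fd = mkstemp(path);
    int         ok;

    if (fd < 0)
        return -1;
    unlink(path);
    if (pipe(pipefd) < 0)
    {
        close(fd);
        return -1;
    }

    ok = write(fd, "hello world", 11) == 11 &&
        read_at_fallback(fd, buf, 5, 6) == 5 &&
        memcmp(buf, "world", 5) == 0;

    /* A pipe cannot seek: the helper must fail before it ever calls read(). */
    errno = 0;
    ok = ok &&
        read_at_fallback(pipefd[0], buf, 5, 0) == -1 &&
        errno == ESPIPE;

    close(pipefd[0]);
    close(pipefd[1]);
    close(fd);
    return ok ? 0 : -1;
}
```

Without the check, a failed seek would be followed by a read from whatever position the descriptor happened to have, silently returning the wrong data.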
I wonder if anyone would want separate wait_event tracking for the
lseek() (though this codepath would be used by almost nobody,
especially if someone adds Windows support, so it's probably not worth
bothering with).
I suppose we could assume that if you have pread() you must also have
pwrite() and save on ./configure cycles.
I will add this to the next Festival of Commits, though clearly it
needs a bit more work before the festivities begin.
[1]: /messages/by-id/a86bd200-ebbe-d829-e3ca-0c4474b2fcb7@ohmu.fi
[2]: /messages/by-id/b8748d39-0b19-0514-a1b9-4e5a28e6a208@gmail.com
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
0001-Use-pread-pwrite-instead-of-lseek-read-write-v1.patch (application/octet-stream)
From f3b150c81bf3ffb29efcfbcbaf4f0c0f0700de48 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Thu, 12 Jul 2018 13:14:02 +1200
Subject: [PATCH] Use pread()/pwrite() instead of lseek() + read()/write().
Cut down on system calls by doing random IO with a single system call,
on operating systems that support it.
*WIP*
Author: Oskari Saarenmaa
Reviewed-by: Thomas Munro
Discussion:
Discussion: https://postgr.es/m/b8748d39-0b19-0514-a1b9-4e5a28e6a208%40gmail.com
Discussion: https://postgr.es/m/a86bd200-ebbe-d829-e3ca-0c4474b2fcb7%40ohmu.fi
---
configure | 2 +-
configure.in | 2 +-
src/backend/access/heap/rewriteheap.c | 4 +-
src/backend/storage/file/buffile.c | 58 ++-----
src/backend/storage/file/fd.c | 221 ++++++--------------------
src/backend/storage/smgr/md.c | 33 +---
src/include/pg_config.h.in | 6 +
src/include/storage/fd.h | 4 +-
8 files changed, 73 insertions(+), 257 deletions(-)
diff --git a/configure b/configure
index f891914ed99..1d8de9ab919 100755
--- a/configure
+++ b/configure
@@ -14915,7 +14915,7 @@ fi
LIBS_including_readline="$LIBS"
LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
-for ac_func in cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range utime utimes wcstombs_l
+for ac_func in cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pread pstat pthread_is_threaded_np pwrite readlink setproctitle setsid shm_open symlink sync_file_range utime utimes wcstombs_l
do :
as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.in b/configure.in
index 5712419a274..9b043b23f45 100644
--- a/configure.in
+++ b/configure.in
@@ -1540,7 +1540,7 @@ PGAC_FUNC_WCSTOMBS_L
LIBS_including_readline="$LIBS"
LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
-AC_CHECK_FUNCS([cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range utime utimes wcstombs_l])
+AC_CHECK_FUNCS([cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pread pstat pthread_is_threaded_np pwrite readlink setproctitle setsid shm_open symlink sync_file_range utime utimes wcstombs_l])
AC_REPLACE_FUNCS(fseeko)
case $host_os in
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index ed7ba181c79..8b308cd8eb9 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -922,8 +922,8 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
* Note that we deviate from the usual WAL coding practices here,
* check the above "Logical rewrite support" comment for reasoning.
*/
- written = FileWrite(src->vfd, waldata_start, len,
- WAIT_EVENT_LOGICAL_REWRITE_WRITE);
+ written = FileWriteAt(src->vfd, waldata_start, len, src->off,
+ WAIT_EVENT_LOGICAL_REWRITE_WRITE);
if (written != len)
ereport(ERROR,
(errcode_for_file_access(),
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index efbede76297..84354290ec7 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -67,12 +67,6 @@ struct BufFile
int numFiles; /* number of physical files in set */
/* all files except the last have length exactly MAX_PHYSICAL_FILESIZE */
File *files; /* palloc'd array with numFiles entries */
- off_t *offsets; /* palloc'd array with numFiles entries */
-
- /*
- * offsets[i] is the current seek position of files[i]. We use this to
- * avoid making redundant FileSeek calls.
- */
bool isInterXact; /* keep open over transactions? */
bool dirty; /* does buffer need to be written? */
@@ -116,7 +110,6 @@ makeBufFileCommon(int nfiles)
BufFile *file = (BufFile *) palloc(sizeof(BufFile));
file->numFiles = nfiles;
- file->offsets = (off_t *) palloc0(sizeof(off_t) * nfiles);
file->isInterXact = false;
file->dirty = false;
file->resowner = CurrentResourceOwner;
@@ -170,10 +163,7 @@ extendBufFile(BufFile *file)
file->files = (File *) repalloc(file->files,
(file->numFiles + 1) * sizeof(File));
- file->offsets = (off_t *) repalloc(file->offsets,
- (file->numFiles + 1) * sizeof(off_t));
file->files[file->numFiles] = pfile;
- file->offsets[file->numFiles] = 0L;
file->numFiles++;
}
@@ -396,7 +386,6 @@ BufFileClose(BufFile *file)
FileClose(file->files[i]);
/* release the buffer space */
pfree(file->files);
- pfree(file->offsets);
pfree(file);
}
@@ -422,27 +411,17 @@ BufFileLoadBuffer(BufFile *file)
file->curOffset = 0L;
}
- /*
- * May need to reposition physical file.
- */
- thisfile = file->files[file->curFile];
- if (file->curOffset != file->offsets[file->curFile])
- {
- if (FileSeek(thisfile, file->curOffset, SEEK_SET) != file->curOffset)
- return; /* seek failed, read nothing */
- file->offsets[file->curFile] = file->curOffset;
- }
-
/*
* Read whatever we can get, up to a full bufferload.
*/
- file->nbytes = FileRead(thisfile,
- file->buffer,
- sizeof(file->buffer),
- WAIT_EVENT_BUFFILE_READ);
+ thisfile = file->files[file->curFile];
+ file->nbytes = FileReadAt(thisfile,
+ file->buffer,
+ sizeof(file->buffer),
+ file->curOffset,
+ WAIT_EVENT_BUFFILE_READ);
if (file->nbytes < 0)
file->nbytes = 0;
- file->offsets[file->curFile] += file->nbytes;
/* we choose not to advance curOffset here */
if (file->nbytes > 0)
@@ -491,23 +470,14 @@ BufFileDumpBuffer(BufFile *file)
if ((off_t) bytestowrite > availbytes)
bytestowrite = (int) availbytes;
- /*
- * May need to reposition physical file.
- */
thisfile = file->files[file->curFile];
- if (file->curOffset != file->offsets[file->curFile])
- {
- if (FileSeek(thisfile, file->curOffset, SEEK_SET) != file->curOffset)
- return; /* seek failed, give up */
- file->offsets[file->curFile] = file->curOffset;
- }
- bytestowrite = FileWrite(thisfile,
- file->buffer + wpos,
- bytestowrite,
- WAIT_EVENT_BUFFILE_WRITE);
+ bytestowrite = FileWriteAt(thisfile,
+ file->buffer + wpos,
+ bytestowrite,
+ file->curOffset,
+ WAIT_EVENT_BUFFILE_WRITE);
if (bytestowrite <= 0)
return; /* failed to write */
- file->offsets[file->curFile] += bytestowrite;
file->curOffset += bytestowrite;
wpos += bytestowrite;
@@ -807,7 +777,6 @@ BufFileSize(BufFile *file)
lastFileSize = FileSeek(file->files[file->numFiles - 1], 0, SEEK_END);
if (lastFileSize < 0)
return -1;
- file->offsets[file->numFiles - 1] = lastFileSize;
return ((file->numFiles - 1) * (off_t) MAX_PHYSICAL_FILESIZE) +
lastFileSize;
@@ -849,13 +818,8 @@ BufFileAppend(BufFile *target, BufFile *source)
target->files = (File *)
repalloc(target->files, sizeof(File) * newNumFiles);
- target->offsets = (off_t *)
- repalloc(target->offsets, sizeof(off_t) * newNumFiles);
for (i = target->numFiles; i < newNumFiles; i++)
- {
target->files[i] = source->files[i - target->numFiles];
- target->offsets[i] = source->offsets[i - target->numFiles];
- }
target->numFiles = newNumFiles;
return startBlock;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 8dd51f17674..b8450ed20d6 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -16,8 +16,8 @@
* including base tables, scratch files (e.g., sort and hash spool
* files), and random calls to C library routines like system(3); it
* is quite easy to exceed system limits on the number of open files a
- * single process can have. (This is around 256 on many modern
- * operating systems, but can be as low as 32 on others.)
+ * single process can have. (This is around 1024 on many modern
+ * operating systems, but may be lower on others.)
*
* VFDs are managed as an LRU pool, with actual OS file descriptors
* being opened and closed as needed. Obviously, if a routine is
@@ -167,15 +167,6 @@ int max_safe_fds = 32; /* default if not changed */
#define FileIsNotOpen(file) (VfdCache[file].fd == VFD_CLOSED)
-/*
- * Note: a VFD's seekPos is normally always valid, but if for some reason
- * an lseek() fails, it might become set to FileUnknownPos. We can struggle
- * along without knowing the seek position in many cases, but in some places
- * we have to fail if we don't have it.
- */
-#define FileUnknownPos ((off_t) -1)
-#define FilePosIsUnknown(pos) ((pos) < 0)
-
/* these are the assigned bits in fdstate below: */
#define FD_DELETE_AT_CLOSE (1 << 0) /* T = delete when closed */
#define FD_CLOSE_AT_EOXACT (1 << 1) /* T = close at eoXact */
@@ -189,7 +180,6 @@ typedef struct vfd
File nextFree; /* link to next free VFD, if in freelist */
File lruMoreRecently; /* doubly linked recency-of-use list */
File lruLessRecently;
- off_t seekPos; /* current logical file position, or -1 */
off_t fileSize; /* current size of file (0 if not temporary) */
char *fileName; /* name of file, or NULL for unused VFD */
/* NB: fileName is malloc'd, and must be free'd when closing the VFD */
@@ -407,9 +397,7 @@ pg_fdatasync(int fd)
/*
* pg_flush_data --- advise OS that the described dirty data should be flushed
*
- * offset of 0 with nbytes 0 means that the entire file should be flushed;
- * in this case, this function may have side-effects on the file's
- * seek position!
+ * offset of 0 with nbytes 0 means that the entire file should be flushed
*/
void
pg_flush_data(int fd, off_t offset, off_t nbytes)
@@ -1029,22 +1017,6 @@ LruDelete(File file)
vfdP = &VfdCache[file];
- /*
- * Normally we should know the seek position, but if for some reason we
- * have lost track of it, try again to get it. If we still can't get it,
- * we have a problem: we will be unable to restore the file seek position
- * when and if the file is re-opened. But we can't really throw an error
- * and refuse to close the file, or activities such as transaction cleanup
- * will be broken.
- */
- if (FilePosIsUnknown(vfdP->seekPos))
- {
- vfdP->seekPos = lseek(vfdP->fd, (off_t) 0, SEEK_CUR);
- if (FilePosIsUnknown(vfdP->seekPos))
- elog(LOG, "could not seek file \"%s\" before closing: %m",
- vfdP->fileName);
- }
-
/*
* Close the file. We aren't expecting this to fail; if it does, better
* to leak the FD than to mess up our internal state.
@@ -1113,33 +1085,6 @@ LruInsert(File file)
{
++nfile;
}
-
- /*
- * Seek to the right position. We need no special case for seekPos
- * equal to FileUnknownPos, as lseek() will certainly reject that
- * (thus completing the logic noted in LruDelete() that we will fail
- * to re-open a file if we couldn't get its seek position before
- * closing).
- */
- if (vfdP->seekPos != (off_t) 0)
- {
- if (lseek(vfdP->fd, vfdP->seekPos, SEEK_SET) < 0)
- {
- /*
- * If we fail to restore the seek position, treat it like an
- * open() failure.
- */
- int save_errno = errno;
-
- elog(LOG, "could not seek file \"%s\" after re-opening: %m",
- vfdP->fileName);
- (void) close(vfdP->fd);
- vfdP->fd = VFD_CLOSED;
- --nfile;
- errno = save_errno;
- return -1;
- }
- }
}
/*
@@ -1406,7 +1351,6 @@ PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
/* Saved flags are adjusted to be OK for re-opening file */
vfdP->fileFlags = fileFlags & ~(O_CREAT | O_TRUNC | O_EXCL);
vfdP->fileMode = fileMode;
- vfdP->seekPos = 0;
vfdP->fileSize = 0;
vfdP->fdstate = 0x0;
vfdP->resowner = NULL;
@@ -1820,7 +1764,6 @@ FileClose(File file)
/*
* FilePrefetch - initiate asynchronous read of a given range of the file.
- * The logical seek position is unaffected.
*
* Currently the only implementation of this function is using posix_fadvise
* which is the simplest standardized interface that accomplishes this.
@@ -1867,10 +1810,6 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
file, VfdCache[file].fileName,
(int64) offset, (int64) nbytes));
- /*
- * Caution: do not call pg_flush_data with nbytes = 0, it could trash the
- * file's seek position. We prefer to define that as a no-op here.
- */
if (nbytes <= 0)
return;
@@ -1884,7 +1823,8 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
}
int
-FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
+FileReadAt(File file, char *buffer, int amount, off_t offset,
+ uint32 wait_event_info)
{
int returnCode;
Vfd *vfdP;
@@ -1893,7 +1833,7 @@ FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
DO_DB(elog(LOG, "FileRead: %d (%s) " INT64_FORMAT " %d %p",
file, VfdCache[file].fileName,
- (int64) VfdCache[file].seekPos,
+ (int64) offset,
amount, buffer));
returnCode = FileAccess(file);
@@ -1904,16 +1844,15 @@ FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
retry:
pgstat_report_wait_start(wait_event_info);
+#ifdef HAVE_PREAD
+ returnCode = pread(vfdP->fd, buffer, amount, offset);
+#else
+ lseek(VfdCache[file].fd, offset, SEEK_SET);
returnCode = read(vfdP->fd, buffer, amount);
+#endif
pgstat_report_wait_end();
- if (returnCode >= 0)
- {
- /* if seekPos is unknown, leave it that way */
- if (!FilePosIsUnknown(vfdP->seekPos))
- vfdP->seekPos += returnCode;
- }
- else
+ if (returnCode < 0)
{
/*
* Windows may run out of kernel buffers and return "Insufficient
@@ -1939,16 +1878,14 @@ retry:
/* OK to retry if interrupted */
if (errno == EINTR)
goto retry;
-
- /* Trouble, so assume we don't know the file position anymore */
- vfdP->seekPos = FileUnknownPos;
}
return returnCode;
}
int
-FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
+FileWriteAt(File file, char *buffer, int amount, off_t offset,
+ uint32 wait_event_info)
{
int returnCode;
Vfd *vfdP;
@@ -1957,7 +1894,7 @@ FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
DO_DB(elog(LOG, "FileWrite: %d (%s) " INT64_FORMAT " %d %p",
file, VfdCache[file].fileName,
- (int64) VfdCache[file].seekPos,
+ (int64) offset,
amount, buffer));
returnCode = FileAccess(file);
@@ -1976,26 +1913,13 @@ FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
*/
if (temp_file_limit >= 0 && (vfdP->fdstate & FD_TEMP_FILE_LIMIT))
{
- off_t newPos;
-
- /*
- * Normally we should know the seek position, but if for some reason
- * we have lost track of it, try again to get it. Here, it's fine to
- * throw an error if we still can't get it.
- */
- if (FilePosIsUnknown(vfdP->seekPos))
- {
- vfdP->seekPos = lseek(vfdP->fd, (off_t) 0, SEEK_CUR);
- if (FilePosIsUnknown(vfdP->seekPos))
- elog(ERROR, "could not seek file \"%s\": %m", vfdP->fileName);
- }
+ off_t past_write = offset + amount;
- newPos = vfdP->seekPos + amount;
- if (newPos > vfdP->fileSize)
+ if (past_write > vfdP->fileSize)
{
uint64 newTotal = temporary_files_size;
- newTotal += newPos - vfdP->fileSize;
+ newTotal += past_write - vfdP->fileSize;
if (newTotal > (uint64) temp_file_limit * (uint64) 1024)
ereport(ERROR,
(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
@@ -2007,7 +1931,12 @@ FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
retry:
errno = 0;
pgstat_report_wait_start(wait_event_info);
- returnCode = write(vfdP->fd, buffer, amount);
+#ifdef HAVE_PWRITE
+ returnCode = pwrite(VfdCache[file].fd, buffer, amount, offset);
+#else
+ lseek(VfdCache[file].fd, offset, SEEK_SET);
+ returnCode = write(VfdCache[file].fd, buffer, amount);
+#endif
pgstat_report_wait_end();
/* if write didn't set errno, assume problem is no disk space */
@@ -2016,10 +1945,6 @@ retry:
if (returnCode >= 0)
{
- /* if seekPos is unknown, leave it that way */
- if (!FilePosIsUnknown(vfdP->seekPos))
- vfdP->seekPos += returnCode;
-
/*
* Maintain fileSize and temporary_files_size if it's a temp file.
*
@@ -2029,19 +1954,19 @@ retry:
*/
if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
{
- off_t newPos = vfdP->seekPos;
+ off_t past_write = offset + amount;
- if (newPos > vfdP->fileSize)
+ if (past_write > vfdP->fileSize)
{
- temporary_files_size += newPos - vfdP->fileSize;
- vfdP->fileSize = newPos;
+ temporary_files_size += past_write - vfdP->fileSize;
+ vfdP->fileSize = past_write;
}
}
}
else
{
/*
- * See comments in FileRead()
+ * See comments in FileReadAt()
*/
#ifdef WIN32
DWORD error = GetLastError();
@@ -2060,9 +1985,6 @@ retry:
/* OK to retry if interrupted */
if (errno == EINTR)
goto retry;
-
- /* Trouble, so assume we don't know the file position anymore */
- vfdP->seekPos = FileUnknownPos;
}
return returnCode;
@@ -2103,79 +2025,30 @@ FileSeek(File file, off_t offset, int whence)
vfdP = &VfdCache[file];
- if (FileIsNotOpen(file))
+ switch (whence)
{
- switch (whence)
- {
- case SEEK_SET:
- if (offset < 0)
- {
- errno = EINVAL;
- return (off_t) -1;
- }
- vfdP->seekPos = offset;
- break;
- case SEEK_CUR:
- if (FilePosIsUnknown(vfdP->seekPos) ||
- vfdP->seekPos + offset < 0)
- {
- errno = EINVAL;
- return (off_t) -1;
- }
- vfdP->seekPos += offset;
- break;
- case SEEK_END:
+ case SEEK_SET:
+ if (offset < 0)
+ elog(ERROR, "invalid seek offset: " INT64_FORMAT,
+ (int64) offset);
+ return offset;
+
+ case SEEK_END:
+ if (FileIsNotOpen(file))
+ {
if (FileAccess(file) < 0)
return (off_t) -1;
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- default:
- elog(ERROR, "invalid whence: %d", whence);
- break;
- }
- }
- else
- {
- switch (whence)
- {
- case SEEK_SET:
- if (offset < 0)
- {
- errno = EINVAL;
- return (off_t) -1;
- }
- if (vfdP->seekPos != offset)
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- case SEEK_CUR:
- if (offset != 0 || FilePosIsUnknown(vfdP->seekPos))
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- case SEEK_END:
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- default:
- elog(ERROR, "invalid whence: %d", whence);
- break;
- }
- }
+ }
+ return lseek(VfdCache[file].fd, offset, whence);
+ break;
- return vfdP->seekPos;
-}
+ default:
+ elog(ERROR, "invalid whence: %d", whence);
+ break;
+ }
-/*
- * XXX not actually used but here for completeness
- */
-#ifdef NOT_USED
-off_t
-FileTell(File file)
-{
- Assert(FileIsValid(file));
- DO_DB(elog(LOG, "FileTell %d (%s)",
- file, VfdCache[file].fileName));
- return VfdCache[file].seekPos;
+ return -1;
}
-#endif
int
FileTruncate(File file, off_t offset, uint32 wait_event_info)
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index f4374d077be..b5bf364caa1 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -522,22 +522,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- /*
- * Note: because caller usually obtained blocknum by calling mdnblocks,
- * which did a seek(SEEK_END), this seek is often redundant and will be
- * optimized away by fd.c. It's not redundant, however, if there is a
- * partial page at the end of the file. In that case we want to try to
- * overwrite the partial page with a full page. It's also not redundant
- * if bufmgr.c had to dump another buffer of the same file to make room
- * for the new page's buffer.
- */
- if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not seek to block %u in file \"%s\": %m",
- blocknum, FilePathName(v->mdfd_vfd))));
-
- if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
+ if ((nbytes = FileWriteAt(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
{
if (nbytes < 0)
ereport(ERROR,
@@ -748,13 +733,7 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not seek to block %u in file \"%s\": %m",
- blocknum, FilePathName(v->mdfd_vfd))));
-
- nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_READ);
+ nbytes = FileReadAt(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_READ);
TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
@@ -824,13 +803,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not seek to block %u in file \"%s\": %m",
- blocknum, FilePathName(v->mdfd_vfd))));
-
- nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_WRITE);
+ nbytes = FileWriteAt(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_WRITE);
TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index f9fb92f31c1..2be0b12407e 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -434,6 +434,9 @@
/* Define to 1 if the assembler supports PPC's LWARX mutex hint bit. */
#undef HAVE_PPC_LWARX_MUTEX_HINT
+/* Define to 1 if you have the `pread' function. */
+#undef HAVE_PREAD
+
/* Define to 1 if you have the `pstat' function. */
#undef HAVE_PSTAT
@@ -449,6 +452,9 @@
/* Have PTHREAD_PRIO_INHERIT. */
#undef HAVE_PTHREAD_PRIO_INHERIT
+/* Define to 1 if you have the `pwrite' function. */
+#undef HAVE_PWRITE
+
/* Define to 1 if you have the `random' function. */
#undef HAVE_RANDOM
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8e7c9728f4b..cbbb6786c21 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -68,8 +68,8 @@ extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fil
extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount, uint32 wait_event_info);
-extern int FileRead(File file, char *buffer, int amount, uint32 wait_event_info);
-extern int FileWrite(File file, char *buffer, int amount, uint32 wait_event_info);
+extern int FileReadAt(File file, char *buffer, int amount, off_t offset, uint32 wait_event_info);
+extern int FileWriteAt(File file, char *buffer, int amount, off_t offset, uint32 wait_event_info);
extern int FileSync(File file, uint32 wait_event_info);
extern off_t FileSeek(File file, off_t offset, int whence);
extern int FileTruncate(File file, off_t offset, uint32 wait_event_info);
--
2.17.0
On Thu, Jul 12, 2018 at 1:55 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> I guess the only remaining reason to use FileSeek() is to get the file
> size? So I wonder why SEEK_SET remains valid in the patch... if my
> suspicion is correct that only SEEK_END still has a reason to exist,
> perhaps we should just kill FileSeek() and add FileSize() or something
> instead?
Done.
> pgstat_report_wait_start(wait_event_info);
> +#ifdef HAVE_PREAD
> + returnCode = pread(vfdP->fd, buffer, amount, offset);
> +#else
> + lseek(VfdCache[file].fd, offset, SEEK_SET);
> returnCode = read(vfdP->fd, buffer, amount);
> +#endif
> pgstat_report_wait_end();
> This obviously lacks error handling for lseek().
Fixed.
Updated the main WAL IO routines to use pread()/pwrite() too.
Not super heavily tested yet.
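With pwrite() there is no implicit position to advance after a short write, so a write loop has to move the offset along itself. The sketch below shows that pattern on a raw descriptor. write_all_at() and demo_write_all_at() are hypothetical names, and treating a zero-byte write as ENOSPC is an assumption borrowed from the "assume problem is no disk space" comment in fd.c.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write all "nleft" bytes starting at "offset",
 * retrying after short writes and EINTR. Returns 0 on success, -1 on error.
 */
int
write_all_at(int fd, const char *from, size_t nleft, off_t offset)
{
    assert(from != NULL);
    while (nleft > 0)
    {
        ssize_t     written = pwrite(fd, from, nleft, offset);

        if (written < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted before writing, retry */
            return -1;
        }
        if (written == 0)
        {
            errno = ENOSPC;     /* assumption: no progress means disk full */
            return -1;
        }
        nleft -= (size_t) written;
        from += written;
        offset += written;      /* pwrite() does not do this for us */
    }
    return 0;
}

/* Demo: returns 0 if data written at offset 4 reads back correctly, else -1. */
int
demo_write_all_at(void)
{
    char        path[] = "/tmp/pwrite_sketch_XXXXXX";
    char        buf[11];
    int         fd = mkstemp(path);
    int         ok;

    if (fd < 0)
        return -1;
    unlink(path);

    ok = write_all_at(fd, "hello world", 11, 4) == 0 &&
        lseek(fd, 0, SEEK_END) == 15 &&
        pread(fd, buf, 11, 4) == 11 &&
        memcmp(buf, "hello world", 11) == 0;

    close(fd);
    return ok ? 0 : -1;
}
```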
An idea for how to handle Windows, in a follow-up patch: add a file
src/backend/port/win32/file.c that defines pgwin32_pread() and
pgwin32_pwrite(). These would wrap ReadFile()/WriteFile(), passing in
an "OVERLAPPED" struct that carries the offset, and set errno on
error. Then set up the macros so that Windows can use them as pread()
and pwrite(). It might also be necessary to open all files with
FILE_FLAG_OVERLAPPED.
Does any Windows hacker have a better idea, and/or want to try to
write that patch? Otherwise I'll eventually try to do some long
distance hacking on AppVeyor.
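To make the Windows idea a little more concrete, here is an untested sketch of what such a wrapper might look like. The name sketch_pread(), the int/int64_t signature, and the crude mapping of every failure to EIO are all assumptions for illustration; a real version would translate GetLastError() properly. The non-Windows branch simply calls pread() so that the example also compiles elsewhere.

```c
#ifdef _WIN32
#include <windows.h>
#include <io.h>
#else
#define _XOPEN_SOURCE 700
#include <unistd.h>
#endif
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the proposed pgwin32_pread(). */
int
sketch_pread(int fd, void *buf, unsigned int size, int64_t offset)
{
#ifdef _WIN32
    OVERLAPPED  overlapped;
    HANDLE      handle;
    DWORD       nread = 0;

    assert(buf != NULL);
    handle = (HANDLE) _get_osfhandle(fd);
    if (handle == INVALID_HANDLE_VALUE)
    {
        errno = EBADF;
        return -1;
    }

    /* The OVERLAPPED struct carries the 64-bit offset as two 32-bit halves. */
    memset(&overlapped, 0, sizeof(overlapped));
    overlapped.Offset = (DWORD) (offset & 0xFFFFFFFF);
    overlapped.OffsetHigh = (DWORD) (offset >> 32);

    if (!ReadFile(handle, buf, size, &nread, &overlapped))
    {
        if (GetLastError() == ERROR_HANDLE_EOF)
            return 0;           /* reading at or past EOF */
        errno = EIO;            /* assumption: placeholder error mapping */
        return -1;
    }
    return (int) nread;
#else
    assert(buf != NULL);
    return (int) pread(fd, buf, size, (off_t) offset);
#endif
}

/* Demo: returns 0 if a positioned read and a read at EOF behave, else -1. */
int
demo_sketch_pread(void)
{
    char        buf[5];
    FILE       *f = tmpfile();
    int         ok;

    if (f == NULL)
        return -1;
    ok = fwrite("hello world", 1, 11, f) == 11 &&
        fflush(f) == 0 &&
        sketch_pread(fileno(f), buf, 5, 6) == 5 &&
        memcmp(buf, "world", 5) == 0 &&
        sketch_pread(fileno(f), buf, 5, 11) == 0;
    fclose(f);
    return ok ? 0 : -1;
}
```

One caveat worth checking on a real Windows build: with a handle opened for synchronous I/O, ReadFile() with an OVERLAPPED offset may still update the handle's file pointer, unlike POSIX pread(). That would not matter if nothing relies on the implicit position any more.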
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
0001-Use-pread-pwrite-instead-of-lseek-read-write-v2.patch (application/octet-stream)
From b0659f64b8e0eb2bd16f0ff8ba75486dee2d8810 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Thu, 12 Jul 2018 13:14:02 +1200
Subject: [PATCH] Use pread()/pwrite() instead of lseek() + read()/write().
Cut down on system calls by doing random IO using POSIX.1-2008
offset-based IO routines, where available.
Author: Oskari Saarenmaa, Thomas Munro
Reviewed-by: Thomas Munro
Discussion: https://postgr.es/m/CAEepm=02rapCpPR3ZGF2vW=SBHSdFYO_bz_f-wwWJonmA3APgw@mail.gmail.com
Discussion: https://postgr.es/m/b8748d39-0b19-0514-a1b9-4e5a28e6a208%40gmail.com
Discussion: https://postgr.es/m/a86bd200-ebbe-d829-e3ca-0c4474b2fcb7%40ohmu.fi
---
configure | 2 +-
configure.in | 2 +-
src/backend/access/heap/rewriteheap.c | 4 +-
src/backend/access/transam/xlog.c | 13 ++
src/backend/storage/file/buffile.c | 62 ++------
src/backend/storage/file/fd.c | 219 +++++---------------------
src/backend/storage/smgr/md.c | 35 +---
src/include/pg_config.h.in | 6 +
src/include/storage/fd.h | 12 +-
9 files changed, 82 insertions(+), 273 deletions(-)
diff --git a/configure b/configure
index f891914ed99..1d8de9ab919 100755
--- a/configure
+++ b/configure
@@ -14915,7 +14915,7 @@ fi
LIBS_including_readline="$LIBS"
LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
-for ac_func in cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range utime utimes wcstombs_l
+for ac_func in cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pread pstat pthread_is_threaded_np pwrite readlink setproctitle setsid shm_open symlink sync_file_range utime utimes wcstombs_l
do :
as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.in b/configure.in
index 5712419a274..9b043b23f45 100644
--- a/configure.in
+++ b/configure.in
@@ -1540,7 +1540,7 @@ PGAC_FUNC_WCSTOMBS_L
LIBS_including_readline="$LIBS"
LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
-AC_CHECK_FUNCS([cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range utime utimes wcstombs_l])
+AC_CHECK_FUNCS([cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pread pstat pthread_is_threaded_np pwrite readlink setproctitle setsid shm_open symlink sync_file_range utime utimes wcstombs_l])
AC_REPLACE_FUNCS(fseeko)
case $host_os in
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index ed7ba181c79..8b308cd8eb9 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -922,8 +922,8 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
* Note that we deviate from the usual WAL coding practices here,
* check the above "Logical rewrite support" comment for reasoning.
*/
- written = FileWrite(src->vfd, waldata_start, len,
- WAIT_EVENT_LOGICAL_REWRITE_WRITE);
+ written = FileWriteAt(src->vfd, waldata_start, len, src->off,
+ WAIT_EVENT_LOGICAL_REWRITE_WRITE);
if (written != len)
ereport(ERROR,
(errcode_for_file_access(),
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3ee6d5c4676..da03815f4ce 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2477,6 +2477,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
Size nleft;
int written;
+#ifndef HAVE_PWRITE
/* Need to seek in the file? */
if (openLogOff != startoffset)
{
@@ -2488,6 +2489,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
startoffset)));
openLogOff = startoffset;
}
+#endif
/* OK to write the page(s) */
from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2497,7 +2499,11 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
{
errno = 0;
pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
+#ifdef HAVE_PWRITE
+ written = pwrite(openLogFile, from, nleft, startoffset);
+#else
written = write(openLogFile, from, nleft);
+#endif
pgstat_report_wait_end();
if (written <= 0)
{
@@ -2512,6 +2518,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
}
nleft -= written;
from += written;
+ startoffset += written;
} while (nleft > 0);
/* Update state for write */
@@ -11791,6 +11798,7 @@ retry:
/* Read the requested page */
readOff = targetPageOff;
+#ifndef HAVE_PREAD
if (lseek(readFile, (off_t) readOff, SEEK_SET) < 0)
{
char fname[MAXFNAMELEN];
@@ -11804,9 +11812,14 @@ retry:
fname, readOff)));
goto next_record_is_invalid;
}
+#endif
pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+#ifdef HAVE_PREAD
+ r = pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
+#else
r = read(readFile, readBuf, XLOG_BLCKSZ);
+#endif
if (r != XLOG_BLCKSZ)
{
char fname[MAXFNAMELEN];
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index efbede76297..12e1b32c9bf 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -67,12 +67,6 @@ struct BufFile
int numFiles; /* number of physical files in set */
/* all files except the last have length exactly MAX_PHYSICAL_FILESIZE */
File *files; /* palloc'd array with numFiles entries */
- off_t *offsets; /* palloc'd array with numFiles entries */
-
- /*
- * offsets[i] is the current seek position of files[i]. We use this to
- * avoid making redundant FileSeek calls.
- */
bool isInterXact; /* keep open over transactions? */
bool dirty; /* does buffer need to be written? */
@@ -116,7 +110,6 @@ makeBufFileCommon(int nfiles)
BufFile *file = (BufFile *) palloc(sizeof(BufFile));
file->numFiles = nfiles;
- file->offsets = (off_t *) palloc0(sizeof(off_t) * nfiles);
file->isInterXact = false;
file->dirty = false;
file->resowner = CurrentResourceOwner;
@@ -170,10 +163,7 @@ extendBufFile(BufFile *file)
file->files = (File *) repalloc(file->files,
(file->numFiles + 1) * sizeof(File));
- file->offsets = (off_t *) repalloc(file->offsets,
- (file->numFiles + 1) * sizeof(off_t));
file->files[file->numFiles] = pfile;
- file->offsets[file->numFiles] = 0L;
file->numFiles++;
}
@@ -396,7 +386,6 @@ BufFileClose(BufFile *file)
FileClose(file->files[i]);
/* release the buffer space */
pfree(file->files);
- pfree(file->offsets);
pfree(file);
}
@@ -422,27 +411,17 @@ BufFileLoadBuffer(BufFile *file)
file->curOffset = 0L;
}
- /*
- * May need to reposition physical file.
- */
- thisfile = file->files[file->curFile];
- if (file->curOffset != file->offsets[file->curFile])
- {
- if (FileSeek(thisfile, file->curOffset, SEEK_SET) != file->curOffset)
- return; /* seek failed, read nothing */
- file->offsets[file->curFile] = file->curOffset;
- }
-
/*
* Read whatever we can get, up to a full bufferload.
*/
- file->nbytes = FileRead(thisfile,
- file->buffer,
- sizeof(file->buffer),
- WAIT_EVENT_BUFFILE_READ);
+ thisfile = file->files[file->curFile];
+ file->nbytes = FileReadAt(thisfile,
+ file->buffer,
+ sizeof(file->buffer),
+ file->curOffset,
+ WAIT_EVENT_BUFFILE_READ);
if (file->nbytes < 0)
file->nbytes = 0;
- file->offsets[file->curFile] += file->nbytes;
/* we choose not to advance curOffset here */
if (file->nbytes > 0)
@@ -491,23 +470,14 @@ BufFileDumpBuffer(BufFile *file)
if ((off_t) bytestowrite > availbytes)
bytestowrite = (int) availbytes;
- /*
- * May need to reposition physical file.
- */
thisfile = file->files[file->curFile];
- if (file->curOffset != file->offsets[file->curFile])
- {
- if (FileSeek(thisfile, file->curOffset, SEEK_SET) != file->curOffset)
- return; /* seek failed, give up */
- file->offsets[file->curFile] = file->curOffset;
- }
- bytestowrite = FileWrite(thisfile,
- file->buffer + wpos,
- bytestowrite,
- WAIT_EVENT_BUFFILE_WRITE);
+ bytestowrite = FileWriteAt(thisfile,
+ file->buffer + wpos,
+ bytestowrite,
+ file->curOffset,
+ WAIT_EVENT_BUFFILE_WRITE);
if (bytestowrite <= 0)
return; /* failed to write */
- file->offsets[file->curFile] += bytestowrite;
file->curOffset += bytestowrite;
wpos += bytestowrite;
@@ -803,11 +773,10 @@ BufFileSize(BufFile *file)
{
off_t lastFileSize;
- /* Get the size of the last physical file by seeking to end. */
- lastFileSize = FileSeek(file->files[file->numFiles - 1], 0, SEEK_END);
+ /* Get the size of the last physical file. */
+ lastFileSize = FileSize(file->files[file->numFiles - 1]);
if (lastFileSize < 0)
return -1;
- file->offsets[file->numFiles - 1] = lastFileSize;
return ((file->numFiles - 1) * (off_t) MAX_PHYSICAL_FILESIZE) +
lastFileSize;
@@ -849,13 +818,8 @@ BufFileAppend(BufFile *target, BufFile *source)
target->files = (File *)
repalloc(target->files, sizeof(File) * newNumFiles);
- target->offsets = (off_t *)
- repalloc(target->offsets, sizeof(off_t) * newNumFiles);
for (i = target->numFiles; i < newNumFiles; i++)
- {
target->files[i] = source->files[i - target->numFiles];
- target->offsets[i] = source->offsets[i - target->numFiles];
- }
target->numFiles = newNumFiles;
return startBlock;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 8dd51f17674..f866e760d91 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -16,8 +16,8 @@
* including base tables, scratch files (e.g., sort and hash spool
* files), and random calls to C library routines like system(3); it
* is quite easy to exceed system limits on the number of open files a
- * single process can have. (This is around 256 on many modern
- * operating systems, but can be as low as 32 on others.)
+ * single process can have. (This is around 1024 on many modern
+ * operating systems, but may be lower on others.)
*
* VFDs are managed as an LRU pool, with actual OS file descriptors
* being opened and closed as needed. Obviously, if a routine is
@@ -167,15 +167,6 @@ int max_safe_fds = 32; /* default if not changed */
#define FileIsNotOpen(file) (VfdCache[file].fd == VFD_CLOSED)
-/*
- * Note: a VFD's seekPos is normally always valid, but if for some reason
- * an lseek() fails, it might become set to FileUnknownPos. We can struggle
- * along without knowing the seek position in many cases, but in some places
- * we have to fail if we don't have it.
- */
-#define FileUnknownPos ((off_t) -1)
-#define FilePosIsUnknown(pos) ((pos) < 0)
-
/* these are the assigned bits in fdstate below: */
#define FD_DELETE_AT_CLOSE (1 << 0) /* T = delete when closed */
#define FD_CLOSE_AT_EOXACT (1 << 1) /* T = close at eoXact */
@@ -189,7 +180,6 @@ typedef struct vfd
File nextFree; /* link to next free VFD, if in freelist */
File lruMoreRecently; /* doubly linked recency-of-use list */
File lruLessRecently;
- off_t seekPos; /* current logical file position, or -1 */
off_t fileSize; /* current size of file (0 if not temporary) */
char *fileName; /* name of file, or NULL for unused VFD */
/* NB: fileName is malloc'd, and must be free'd when closing the VFD */
@@ -407,9 +397,7 @@ pg_fdatasync(int fd)
/*
* pg_flush_data --- advise OS that the described dirty data should be flushed
*
- * offset of 0 with nbytes 0 means that the entire file should be flushed;
- * in this case, this function may have side-effects on the file's
- * seek position!
+ * offset of 0 with nbytes 0 means that the entire file should be flushed
*/
void
pg_flush_data(int fd, off_t offset, off_t nbytes)
@@ -1029,22 +1017,6 @@ LruDelete(File file)
vfdP = &VfdCache[file];
- /*
- * Normally we should know the seek position, but if for some reason we
- * have lost track of it, try again to get it. If we still can't get it,
- * we have a problem: we will be unable to restore the file seek position
- * when and if the file is re-opened. But we can't really throw an error
- * and refuse to close the file, or activities such as transaction cleanup
- * will be broken.
- */
- if (FilePosIsUnknown(vfdP->seekPos))
- {
- vfdP->seekPos = lseek(vfdP->fd, (off_t) 0, SEEK_CUR);
- if (FilePosIsUnknown(vfdP->seekPos))
- elog(LOG, "could not seek file \"%s\" before closing: %m",
- vfdP->fileName);
- }
-
/*
* Close the file. We aren't expecting this to fail; if it does, better
* to leak the FD than to mess up our internal state.
@@ -1113,33 +1085,6 @@ LruInsert(File file)
{
++nfile;
}
-
- /*
- * Seek to the right position. We need no special case for seekPos
- * equal to FileUnknownPos, as lseek() will certainly reject that
- * (thus completing the logic noted in LruDelete() that we will fail
- * to re-open a file if we couldn't get its seek position before
- * closing).
- */
- if (vfdP->seekPos != (off_t) 0)
- {
- if (lseek(vfdP->fd, vfdP->seekPos, SEEK_SET) < 0)
- {
- /*
- * If we fail to restore the seek position, treat it like an
- * open() failure.
- */
- int save_errno = errno;
-
- elog(LOG, "could not seek file \"%s\" after re-opening: %m",
- vfdP->fileName);
- (void) close(vfdP->fd);
- vfdP->fd = VFD_CLOSED;
- --nfile;
- errno = save_errno;
- return -1;
- }
- }
}
/*
@@ -1406,7 +1351,6 @@ PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
/* Saved flags are adjusted to be OK for re-opening file */
vfdP->fileFlags = fileFlags & ~(O_CREAT | O_TRUNC | O_EXCL);
vfdP->fileMode = fileMode;
- vfdP->seekPos = 0;
vfdP->fileSize = 0;
vfdP->fdstate = 0x0;
vfdP->resowner = NULL;
@@ -1820,7 +1764,6 @@ FileClose(File file)
/*
* FilePrefetch - initiate asynchronous read of a given range of the file.
- * The logical seek position is unaffected.
*
* Currently the only implementation of this function is using posix_fadvise
* which is the simplest standardized interface that accomplishes this.
@@ -1867,10 +1810,6 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
file, VfdCache[file].fileName,
(int64) offset, (int64) nbytes));
- /*
- * Caution: do not call pg_flush_data with nbytes = 0, it could trash the
- * file's seek position. We prefer to define that as a no-op here.
- */
if (nbytes <= 0)
return;
@@ -1884,7 +1823,8 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
}
int
-FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
+FileReadAt(File file, char *buffer, int amount, off_t offset,
+ uint32 wait_event_info)
{
int returnCode;
Vfd *vfdP;
@@ -1893,7 +1833,7 @@ FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
DO_DB(elog(LOG, "FileRead: %d (%s) " INT64_FORMAT " %d %p",
file, VfdCache[file].fileName,
- (int64) VfdCache[file].seekPos,
+ (int64) offset,
amount, buffer));
returnCode = FileAccess(file);
@@ -1904,16 +1844,16 @@ FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
retry:
pgstat_report_wait_start(wait_event_info);
- returnCode = read(vfdP->fd, buffer, amount);
+#ifdef HAVE_PREAD
+ returnCode = pread(vfdP->fd, buffer, amount, offset);
+#else
+ returnCode = lseek(VfdCache[file].fd, offset, SEEK_SET);
+ if (returnCode >= 0)
+ returnCode = read(vfdP->fd, buffer, amount);
+#endif
pgstat_report_wait_end();
- if (returnCode >= 0)
- {
- /* if seekPos is unknown, leave it that way */
- if (!FilePosIsUnknown(vfdP->seekPos))
- vfdP->seekPos += returnCode;
- }
- else
+ if (returnCode < 0)
{
/*
* Windows may run out of kernel buffers and return "Insufficient
@@ -1939,16 +1879,14 @@ retry:
/* OK to retry if interrupted */
if (errno == EINTR)
goto retry;
-
- /* Trouble, so assume we don't know the file position anymore */
- vfdP->seekPos = FileUnknownPos;
}
return returnCode;
}
int
-FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
+FileWriteAt(File file, char *buffer, int amount, off_t offset,
+ uint32 wait_event_info)
{
int returnCode;
Vfd *vfdP;
@@ -1957,7 +1895,7 @@ FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
DO_DB(elog(LOG, "FileWrite: %d (%s) " INT64_FORMAT " %d %p",
file, VfdCache[file].fileName,
- (int64) VfdCache[file].seekPos,
+ (int64) offset,
amount, buffer));
returnCode = FileAccess(file);
@@ -1976,26 +1914,13 @@ FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
*/
if (temp_file_limit >= 0 && (vfdP->fdstate & FD_TEMP_FILE_LIMIT))
{
- off_t newPos;
+ off_t past_write = offset + amount;
- /*
- * Normally we should know the seek position, but if for some reason
- * we have lost track of it, try again to get it. Here, it's fine to
- * throw an error if we still can't get it.
- */
- if (FilePosIsUnknown(vfdP->seekPos))
- {
- vfdP->seekPos = lseek(vfdP->fd, (off_t) 0, SEEK_CUR);
- if (FilePosIsUnknown(vfdP->seekPos))
- elog(ERROR, "could not seek file \"%s\": %m", vfdP->fileName);
- }
-
- newPos = vfdP->seekPos + amount;
- if (newPos > vfdP->fileSize)
+ if (past_write > vfdP->fileSize)
{
uint64 newTotal = temporary_files_size;
- newTotal += newPos - vfdP->fileSize;
+ newTotal += past_write - vfdP->fileSize;
if (newTotal > (uint64) temp_file_limit * (uint64) 1024)
ereport(ERROR,
(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
@@ -2007,7 +1932,13 @@ FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
retry:
errno = 0;
pgstat_report_wait_start(wait_event_info);
- returnCode = write(vfdP->fd, buffer, amount);
+#ifdef HAVE_PWRITE
+ returnCode = pwrite(VfdCache[file].fd, buffer, amount, offset);
+#else
+ returnCode = lseek(VfdCache[file].fd, offset, SEEK_SET);
+ if (returnCode >= 0)
+ returnCode = write(VfdCache[file].fd, buffer, amount);
+#endif
pgstat_report_wait_end();
/* if write didn't set errno, assume problem is no disk space */
@@ -2016,10 +1947,6 @@ retry:
if (returnCode >= 0)
{
- /* if seekPos is unknown, leave it that way */
- if (!FilePosIsUnknown(vfdP->seekPos))
- vfdP->seekPos += returnCode;
-
/*
* Maintain fileSize and temporary_files_size if it's a temp file.
*
@@ -2029,19 +1956,19 @@ retry:
*/
if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
{
- off_t newPos = vfdP->seekPos;
+ off_t past_write = offset + amount;
- if (newPos > vfdP->fileSize)
+ if (past_write > vfdP->fileSize)
{
- temporary_files_size += newPos - vfdP->fileSize;
- vfdP->fileSize = newPos;
+ temporary_files_size += past_write - vfdP->fileSize;
+ vfdP->fileSize = past_write;
}
}
}
else
{
/*
- * See comments in FileRead()
+ * See comments in FileReadAt()
*/
#ifdef WIN32
DWORD error = GetLastError();
@@ -2060,9 +1987,6 @@ retry:
/* OK to retry if interrupted */
if (errno == EINTR)
goto retry;
-
- /* Trouble, so assume we don't know the file position anymore */
- vfdP->seekPos = FileUnknownPos;
}
return returnCode;
@@ -2090,92 +2014,25 @@ FileSync(File file, uint32 wait_event_info)
}
off_t
-FileSeek(File file, off_t offset, int whence)
+FileSize(File file)
{
Vfd *vfdP;
Assert(FileIsValid(file));
- DO_DB(elog(LOG, "FileSeek: %d (%s) " INT64_FORMAT " " INT64_FORMAT " %d",
- file, VfdCache[file].fileName,
- (int64) VfdCache[file].seekPos,
- (int64) offset, whence));
+ DO_DB(elog(LOG, "FileSize %d (%s)",
+ file, VfdCache[file].fileName));
vfdP = &VfdCache[file];
if (FileIsNotOpen(file))
{
- switch (whence)
- {
- case SEEK_SET:
- if (offset < 0)
- {
- errno = EINVAL;
- return (off_t) -1;
- }
- vfdP->seekPos = offset;
- break;
- case SEEK_CUR:
- if (FilePosIsUnknown(vfdP->seekPos) ||
- vfdP->seekPos + offset < 0)
- {
- errno = EINVAL;
- return (off_t) -1;
- }
- vfdP->seekPos += offset;
- break;
- case SEEK_END:
- if (FileAccess(file) < 0)
- return (off_t) -1;
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- default:
- elog(ERROR, "invalid whence: %d", whence);
- break;
- }
- }
- else
- {
- switch (whence)
- {
- case SEEK_SET:
- if (offset < 0)
- {
- errno = EINVAL;
- return (off_t) -1;
- }
- if (vfdP->seekPos != offset)
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- case SEEK_CUR:
- if (offset != 0 || FilePosIsUnknown(vfdP->seekPos))
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- case SEEK_END:
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- default:
- elog(ERROR, "invalid whence: %d", whence);
- break;
- }
+ if (FileAccess(file) < 0)
+ return (off_t) -1;
}
- return vfdP->seekPos;
-}
-
-/*
- * XXX not actually used but here for completeness
- */
-#ifdef NOT_USED
-off_t
-FileTell(File file)
-{
- Assert(FileIsValid(file));
- DO_DB(elog(LOG, "FileTell %d (%s)",
- file, VfdCache[file].fileName));
- return VfdCache[file].seekPos;
+ return lseek(VfdCache[file].fd, 0, SEEK_END);
}
-#endif
int
FileTruncate(File file, off_t offset, uint32 wait_event_info)
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index f4374d077be..56277806c8b 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -522,22 +522,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- /*
- * Note: because caller usually obtained blocknum by calling mdnblocks,
- * which did a seek(SEEK_END), this seek is often redundant and will be
- * optimized away by fd.c. It's not redundant, however, if there is a
- * partial page at the end of the file. In that case we want to try to
- * overwrite the partial page with a full page. It's also not redundant
- * if bufmgr.c had to dump another buffer of the same file to make room
- * for the new page's buffer.
- */
- if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not seek to block %u in file \"%s\": %m",
- blocknum, FilePathName(v->mdfd_vfd))));
-
- if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
+ if ((nbytes = FileWriteAt(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
{
if (nbytes < 0)
ereport(ERROR,
@@ -748,13 +733,7 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not seek to block %u in file \"%s\": %m",
- blocknum, FilePathName(v->mdfd_vfd))));
-
- nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_READ);
+ nbytes = FileReadAt(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_READ);
TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
@@ -824,13 +803,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not seek to block %u in file \"%s\": %m",
- blocknum, FilePathName(v->mdfd_vfd))));
-
- nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_WRITE);
+ nbytes = FileWriteAt(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_WRITE);
TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
@@ -1979,7 +1952,7 @@ _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
{
off_t len;
- len = FileSeek(seg->mdfd_vfd, 0L, SEEK_END);
+ len = FileSize(seg->mdfd_vfd);
if (len < 0)
ereport(ERROR,
(errcode_for_file_access(),
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index f9fb92f31c1..2be0b12407e 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -434,6 +434,9 @@
/* Define to 1 if the assembler supports PPC's LWARX mutex hint bit. */
#undef HAVE_PPC_LWARX_MUTEX_HINT
+/* Define to 1 if you have the `pread' function. */
+#undef HAVE_PREAD
+
/* Define to 1 if you have the `pstat' function. */
#undef HAVE_PSTAT
@@ -449,6 +452,9 @@
/* Have PTHREAD_PRIO_INHERIT. */
#undef HAVE_PTHREAD_PRIO_INHERIT
+/* Define to 1 if you have the `pwrite' function. */
+#undef HAVE_PWRITE
+
/* Define to 1 if you have the `random' function. */
#undef HAVE_RANDOM
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8e7c9728f4b..a823442dd1a 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -15,7 +15,7 @@
/*
* calls:
*
- * File {Close, Read, Write, Seek, Tell, Sync}
+ * File {Close, Read, Write, Size, Sync}
* {Path Name Open, Allocate, Free} File
*
* These are NOT JUST RENAMINGS OF THE UNIX ROUTINES.
@@ -42,10 +42,6 @@
#include <dirent.h>
-/*
- * FileSeek uses the standard UNIX lseek(2) flags.
- */
-
typedef int File;
@@ -68,10 +64,10 @@ extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fil
extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount, uint32 wait_event_info);
-extern int FileRead(File file, char *buffer, int amount, uint32 wait_event_info);
-extern int FileWrite(File file, char *buffer, int amount, uint32 wait_event_info);
+extern int FileReadAt(File file, char *buffer, int amount, off_t offset, uint32 wait_event_info);
+extern int FileWriteAt(File file, char *buffer, int amount, off_t offset, uint32 wait_event_info);
extern int FileSync(File file, uint32 wait_event_info);
-extern off_t FileSeek(File file, off_t offset, int whence);
+extern off_t FileSize(File file);
extern int FileTruncate(File file, off_t offset, uint32 wait_event_info);
extern void FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info);
extern char *FilePathName(File file);
--
2.17.0
On 20/07/18 01:50, Thomas Munro wrote:
An idea for how to handle Windows, in a follow-up patch: add a file
src/backend/port/win32/file.c that defines pgwin32_pread() and
pgwin32_pwrite(), wrapping WriteFile()/ReadFile() and passing in an
"OVERLAPPED" struct with the offset and sets errno on error, then set
up the macros so that Windows can use them as pread(), pwrite(). It
might also be necessary to open all files with FILE_FLAG_OVERLAPPED.
Does any Windows hacker have a better idea, and/or want to try to
write that patch? Otherwise I'll eventually try to do some long
distance hacking on AppVeyor.
No objections, if you want to make the effort. But IMHO the lseek+read
fallback is good enough on Windows. Unless you were thinking that we
could then remove the !HAVE_PREAD fallback altogether. Are there any
other platforms out there that don't have pread/pwrite that we care about?
- Heikki
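To make the trade-off in the lseek+read fallback concrete, here is a minimal sketch, assuming a POSIX system with a writable /tmp. The names fallback_pread() and demo_fallback_vs_pread() are hypothetical and do not appear in the patch. It only illustrates that both paths return the same bytes, while the fallback moves the descriptor's file position and pread() does not.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700		/* for pread() and mkstemp() declarations */
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of the patch: emulate pread() with
 * lseek() + read().  Unlike pread(), this moves the descriptor's file
 * position as a side effect.
 */
ssize_t
fallback_pread(int fd, void *buf, size_t amount, off_t offset)
{
	assert(buf != NULL);
	if (lseek(fd, offset, SEEK_SET) < 0)
		return -1;				/* errno was set by lseek() */
	return read(fd, buf, amount);
}

/*
 * Demo: both calls return the same bytes, but only the fallback moves
 * the file position.  Returns 0 if that holds, -1 otherwise.
 */
int
demo_fallback_vs_pread(void)
{
	char		path[] = "/tmp/pread_demo_XXXXXX";
	char		a[5];
	char		b[5];
	int			fd = mkstemp(path);
	int			ok;

	if (fd < 0)
		return -1;
	unlink(path);				/* file vanishes once fd is closed */

	ok = write(fd, "hello world", 11) == 11 &&
		lseek(fd, 0, SEEK_SET) == 0 &&
		pread(fd, a, 5, 6) == 5 &&
		lseek(fd, 0, SEEK_CUR) == 0 &&	/* pread() left the position alone */
		fallback_pread(fd, b, 5, 6) == 5 &&
		lseek(fd, 0, SEEK_CUR) == 11 &&	/* fallback moved it to 6 + 5 */
		memcmp(a, "world", 5) == 0 &&
		memcmp(a, b, 5) == 0;

	close(fd);
	return ok ? 0 : -1;
}
```

Since the patch removes all seek position tracking from fd.c, the side effect of the fallback should be harmless there, as long as every caller passes an explicit offset.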
On Thu, Jul 12, 2018 at 01:55:31PM +1200, Thomas Munro wrote:
A couple of years ago, Oskari Saarenmaa proposed a patch[1] to adopt
$SUBJECT. Last year, Tobias Oberstein argued again that we should do
that[2]. At the end of that thread there was a +1 from multiple
committers in support of getting it done for PostgreSQL 12. Since
Oskari hasn't returned, I decided to dust off his patch and start a
new thread.
Thanks for picking this up - I was meaning to get back to this, but have
unfortunately been too busy with other projects.
1. FileRead() and FileWrite() are replaced with FileReadAt() and
FileWriteAt(), taking a new offset argument. Now we can do one
syscall instead of two for random reads and writes.
[...] I'm not even sure I'd bother adding "At" to the
function names. If there are any extensions that want the old
interface they will fail to compile either way. Note that the BufFile
interface remains the same, so hopefully that covers many use cases.
IIRC I used the "At" suffixes in my first version of the patch before
completely removing the functions which didn't take an offset argument.
Now that they're gone I agree that we could just drop the "At" suffix;
the "at" suffix is also used by various POSIX functions to operate
relative to a specific directory, which may just add to the confusion.
I guess the only remaining reason to use FileSeek() is to get the file
size? So I wonder why SEEK_SET remains valid in the patch... if my
suspicion is correct that only SEEK_END still has a reason to exist,
perhaps we should just kill FileSeek() and add FileSize() or something
instead?
I see you did this in your updated patch :+1:
Happy to see this patch move forward.
/ Oskari
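As an illustration of the explicit-offset style discussed above, here is a minimal sketch of a write loop built on pwrite(). The names pwrite_all() and demo_pwrite_all() are hypothetical and not taken from the patch, and treating a zero-byte write as ENOSPC is an assumption of this sketch. With no implicit file position to rely on, the caller has to advance the offset itself after a short write, which is what the added "startoffset += written" line in XLogWrite() does.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700		/* for pread(), pwrite() and mkstemp() */
#endif
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write all of buf at the given offset, retrying
 * after short writes and EINTR.  Returns 0 on success, -1 on error.
 */
int
pwrite_all(int fd, const char *buf, size_t nleft, off_t offset)
{
	assert(buf != NULL);
	while (nleft > 0)
	{
		ssize_t		written = pwrite(fd, buf, nleft, offset);

		if (written < 0)
		{
			if (errno == EINTR)
				continue;		/* interrupted, just retry */
			return -1;
		}
		if (written == 0)
		{
			errno = ENOSPC;		/* assumption: no progress means no space */
			return -1;
		}
		nleft -= (size_t) written;
		buf += written;
		offset += written;		/* nothing else tracks the position for us */
	}
	return 0;
}

/* Demo: write two pieces out of order, then read the whole file back. */
int
demo_pwrite_all(void)
{
	char		path[] = "/tmp/pwrite_demo_XXXXXX";
	char		back[11];
	int			fd = mkstemp(path);
	int			ok;

	if (fd < 0)
		return -1;
	unlink(path);				/* file vanishes once fd is closed */

	ok = pwrite_all(fd, "world", 5, 6) == 0 &&
		pwrite_all(fd, "hello ", 6, 0) == 0 &&
		pread(fd, back, 11, 0) == 11 &&
		memcmp(back, "hello world", 11) == 0;

	close(fd);
	return ok ? 0 : -1;
}
```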
Heikki Linnakangas <hlinnaka@iki.fi> writes:
No objections, if you want to make the effort. But IMHO the lseek+read
fallback is good enough on Windows. Unless you were thinking that we
could then remove the !HAVE_PREAD fallback altogether. Are there any
other platforms out there that don't have pread/pwrite that we care about?
AFAICT, macOS has them as far back as we care about (prairiedog does).
HPUX 10.20 (gaur/pademelon) does not, so personally I'd like to keep
the lseek+read workaround. Don't know about the oldest Solaris critters
we have in the buildfarm. FreeBSD has had 'em at least since 4.0 (1994);
didn't check the other BSDen.
SUS v2 (POSIX 1997) does specify both functions, so we could insist on
their presence without breaking any of our own portability guidelines.
However, if we have to have some workaround anyway for Windows, it
seems like including an lseek+read code path is reasonable so that we
needn't retire those oldest buildfarm critters.
regards, tom lane
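For readers following along, the compile-time choice being discussed can be sketched roughly as below. This is a standalone illustration, not code from the patch: compat_pwrite() and demo_compat_pwrite() are hypothetical names, and HAVE_PWRITE is only defined here if you pass -DHAVE_PWRITE yourself, whereas in the tree it would come from configure via pg_config.h.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700		/* for pwrite() and mkstemp() declarations */
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: use pwrite() where configure found it, and fall
 * back to lseek() + write() elsewhere.
 */
ssize_t
compat_pwrite(int fd, const void *buf, size_t amount, off_t offset)
{
	assert(buf != NULL);
#ifdef HAVE_PWRITE
	return pwrite(fd, buf, amount, offset);
#else
	if (lseek(fd, offset, SEEK_SET) < 0)
		return -1;				/* errno was set by lseek() */
	return write(fd, buf, amount);
#endif
}

/* Demo: overwrite two bytes in the middle of a file and read it back. */
int
demo_compat_pwrite(void)
{
	char		path[] = "/tmp/compat_demo_XXXXXX";
	char		back[6];
	int			fd = mkstemp(path);
	int			ok;

	if (fd < 0)
		return -1;
	unlink(path);				/* file vanishes once fd is closed */

	ok = compat_pwrite(fd, "abcdef", 6, 0) == 6 &&
		compat_pwrite(fd, "XY", 2, 2) == 2 &&
		lseek(fd, 0, SEEK_SET) == 0 &&
		read(fd, back, 6) == 6 &&
		memcmp(back, "abXYef", 6) == 0;

	close(fd);
	return ok ? 0 : -1;
}
```

The fallback branch costs two system calls and is not atomic with respect to anything else sharing the descriptor, so it is best seen as a portability crutch for the older platforms mentioned above.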
On 20 Jul 2018, at 17:34, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
No objections, if you want to make the effort. But IMHO the lseek+read
fallback is good enough on Windows. Unless you were thinking that we
could then remove the !HAVE_PREAD fallback altogether. Are there any
other platforms out there that don't have pread/pwrite that we care about?
AFAICT, macOS has them as far back as we care about (prairiedog does).
HPUX 10.20 (gaur/pademelon) does not, so personally I'd like to keep
the lseek+read workaround. Don't know about the oldest Solaris critters
we have in the buildfarm. FreeBSD has had 'em at least since 4.0 (1994);
didn't check the other BSDen.
The OpenBSD box I have access to has pwrite/pread, and has had them for some
time if I read the manpage right.
cheers ./daniel
On Fri, Jul 20, 2018 at 8:14 PM, Oskari Saarenmaa <os@ohmu.fi> wrote:
On Thu, Jul 12, 2018 at 01:55:31PM +1200, Thomas Munro wrote:
[...] I'm not even sure I'd bother adding "At" to the
function names. If there are any extensions that want the old
interface they will fail to compile either way. Note that the BufFile
interface remains the same, so hopefully that covers many use cases.
IIRC I used the "At" suffixes in my first version of the patch before
completely removing the functions which didn't take an offset argument.
Now that they're gone I agree that we could just drop the "At" suffix;
the "at" suffix is also used by various POSIX functions to operate
relative to a specific directory, which may just add to the confusion.
Done. Rebased.
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
0001-Use-pread-pwrite-instead-of-lseek-read-write-v3.patch
From 6dcb4bbe5a9fd3fd6a303432f19188fea06828b9 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Thu, 12 Jul 2018 13:14:02 +1200
Subject: [PATCH] Use pread()/pwrite() instead of lseek() + read()/write().
Cut down on system calls by doing random IO using POSIX.1-2008
offset-based IO routines, where available. Remove the code for
tracking the 'virtual' seek position. The only reason left to
call FileSeek() was to get the file's size, so provide a new
function FileSize() instead.
Author: Oskari Saarenmaa, Thomas Munro
Reviewed-by: Thomas Munro
Discussion: https://postgr.es/m/CAEepm=02rapCpPR3ZGF2vW=SBHSdFYO_bz_f-wwWJonmA3APgw@mail.gmail.com
Discussion: https://postgr.es/m/b8748d39-0b19-0514-a1b9-4e5a28e6a208%40gmail.com
Discussion: https://postgr.es/m/a86bd200-ebbe-d829-e3ca-0c4474b2fcb7%40ohmu.fi
---
configure | 2 +-
configure.in | 2 +-
src/backend/access/heap/rewriteheap.c | 2 +-
src/backend/access/transam/xlog.c | 13 ++
src/backend/storage/file/buffile.c | 46 +-----
src/backend/storage/file/fd.c | 217 +++++---------------------
src/backend/storage/smgr/md.c | 35 +----
src/include/pg_config.h.in | 6 +
src/include/storage/fd.h | 12 +-
9 files changed, 72 insertions(+), 263 deletions(-)
diff --git a/configure b/configure
index 26652133d53..90d937f39ce 100755
--- a/configure
+++ b/configure
@@ -14916,7 +14916,7 @@ fi
LIBS_including_readline="$LIBS"
LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
-for ac_func in cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimes wcstombs_l
+for ac_func in cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pread pstat pthread_is_threaded_np pwrite readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimes wcstombs_l
do :
as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.in b/configure.in
index 397f6bc7651..e9c65559091 100644
--- a/configure.in
+++ b/configure.in
@@ -1540,7 +1540,7 @@ PGAC_FUNC_WCSTOMBS_L
LIBS_including_readline="$LIBS"
LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
-AC_CHECK_FUNCS([cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimes wcstombs_l])
+AC_CHECK_FUNCS([cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pread pstat pthread_is_threaded_np pwrite readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimes wcstombs_l])
AC_REPLACE_FUNCS(fseeko)
case $host_os in
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index ed7ba181c79..4bcb9c9d204 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -922,7 +922,7 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
* Note that we deviate from the usual WAL coding practices here,
* check the above "Logical rewrite support" comment for reasoning.
*/
- written = FileWrite(src->vfd, waldata_start, len,
+ written = FileWrite(src->vfd, waldata_start, len, src->off,
WAIT_EVENT_LOGICAL_REWRITE_WRITE);
if (written != len)
ereport(ERROR,
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 493f1db7b97..c277f53489e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2477,6 +2477,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
Size nleft;
int written;
+#ifndef HAVE_PWRITE
/* Need to seek in the file? */
if (openLogOff != startoffset)
{
@@ -2488,6 +2489,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
startoffset)));
openLogOff = startoffset;
}
+#endif
/* OK to write the page(s) */
from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2497,7 +2499,11 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
{
errno = 0;
pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
+#ifdef HAVE_PWRITE
+ written = pwrite(openLogFile, from, nleft, startoffset);
+#else
written = write(openLogFile, from, nleft);
+#endif
pgstat_report_wait_end();
if (written <= 0)
{
@@ -2512,6 +2518,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
}
nleft -= written;
from += written;
+ startoffset += written;
} while (nleft > 0);
/* Update state for write */
@@ -11794,6 +11801,7 @@ retry:
/* Read the requested page */
readOff = targetPageOff;
+#ifndef HAVE_PREAD
if (lseek(readFile, (off_t) readOff, SEEK_SET) < 0)
{
char fname[MAXFNAMELEN];
@@ -11807,9 +11815,14 @@ retry:
fname, readOff)));
goto next_record_is_invalid;
}
+#endif
pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+#ifdef HAVE_PREAD
+ r = pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
+#else
r = read(readFile, readBuf, XLOG_BLCKSZ);
+#endif
if (r != XLOG_BLCKSZ)
{
char fname[MAXFNAMELEN];
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index efbede76297..c773358ec5c 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -67,12 +67,6 @@ struct BufFile
int numFiles; /* number of physical files in set */
/* all files except the last have length exactly MAX_PHYSICAL_FILESIZE */
File *files; /* palloc'd array with numFiles entries */
- off_t *offsets; /* palloc'd array with numFiles entries */
-
- /*
- * offsets[i] is the current seek position of files[i]. We use this to
- * avoid making redundant FileSeek calls.
- */
bool isInterXact; /* keep open over transactions? */
bool dirty; /* does buffer need to be written? */
@@ -116,7 +110,6 @@ makeBufFileCommon(int nfiles)
BufFile *file = (BufFile *) palloc(sizeof(BufFile));
file->numFiles = nfiles;
- file->offsets = (off_t *) palloc0(sizeof(off_t) * nfiles);
file->isInterXact = false;
file->dirty = false;
file->resowner = CurrentResourceOwner;
@@ -170,10 +163,7 @@ extendBufFile(BufFile *file)
file->files = (File *) repalloc(file->files,
(file->numFiles + 1) * sizeof(File));
- file->offsets = (off_t *) repalloc(file->offsets,
- (file->numFiles + 1) * sizeof(off_t));
file->files[file->numFiles] = pfile;
- file->offsets[file->numFiles] = 0L;
file->numFiles++;
}
@@ -396,7 +386,6 @@ BufFileClose(BufFile *file)
FileClose(file->files[i]);
/* release the buffer space */
pfree(file->files);
- pfree(file->offsets);
pfree(file);
}
@@ -422,27 +411,17 @@ BufFileLoadBuffer(BufFile *file)
file->curOffset = 0L;
}
- /*
- * May need to reposition physical file.
- */
- thisfile = file->files[file->curFile];
- if (file->curOffset != file->offsets[file->curFile])
- {
- if (FileSeek(thisfile, file->curOffset, SEEK_SET) != file->curOffset)
- return; /* seek failed, read nothing */
- file->offsets[file->curFile] = file->curOffset;
- }
-
/*
* Read whatever we can get, up to a full bufferload.
*/
+ thisfile = file->files[file->curFile];
file->nbytes = FileRead(thisfile,
file->buffer,
sizeof(file->buffer),
+ file->curOffset,
WAIT_EVENT_BUFFILE_READ);
if (file->nbytes < 0)
file->nbytes = 0;
- file->offsets[file->curFile] += file->nbytes;
/* we choose not to advance curOffset here */
if (file->nbytes > 0)
@@ -491,23 +470,14 @@ BufFileDumpBuffer(BufFile *file)
if ((off_t) bytestowrite > availbytes)
bytestowrite = (int) availbytes;
- /*
- * May need to reposition physical file.
- */
thisfile = file->files[file->curFile];
- if (file->curOffset != file->offsets[file->curFile])
- {
- if (FileSeek(thisfile, file->curOffset, SEEK_SET) != file->curOffset)
- return; /* seek failed, give up */
- file->offsets[file->curFile] = file->curOffset;
- }
bytestowrite = FileWrite(thisfile,
file->buffer + wpos,
bytestowrite,
+ file->curOffset,
WAIT_EVENT_BUFFILE_WRITE);
if (bytestowrite <= 0)
return; /* failed to write */
- file->offsets[file->curFile] += bytestowrite;
file->curOffset += bytestowrite;
wpos += bytestowrite;
@@ -803,11 +773,10 @@ BufFileSize(BufFile *file)
{
off_t lastFileSize;
- /* Get the size of the last physical file by seeking to end. */
- lastFileSize = FileSeek(file->files[file->numFiles - 1], 0, SEEK_END);
+ /* Get the size of the last physical file. */
+ lastFileSize = FileSize(file->files[file->numFiles - 1]);
if (lastFileSize < 0)
return -1;
- file->offsets[file->numFiles - 1] = lastFileSize;
return ((file->numFiles - 1) * (off_t) MAX_PHYSICAL_FILESIZE) +
lastFileSize;
@@ -849,13 +818,8 @@ BufFileAppend(BufFile *target, BufFile *source)
target->files = (File *)
repalloc(target->files, sizeof(File) * newNumFiles);
- target->offsets = (off_t *)
- repalloc(target->offsets, sizeof(off_t) * newNumFiles);
for (i = target->numFiles; i < newNumFiles; i++)
- {
target->files[i] = source->files[i - target->numFiles];
- target->offsets[i] = source->offsets[i - target->numFiles];
- }
target->numFiles = newNumFiles;
return startBlock;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 8dd51f17674..a380f794014 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -16,8 +16,8 @@
* including base tables, scratch files (e.g., sort and hash spool
* files), and random calls to C library routines like system(3); it
* is quite easy to exceed system limits on the number of open files a
- * single process can have. (This is around 256 on many modern
- * operating systems, but can be as low as 32 on others.)
+ * single process can have. (This is around 1024 on many modern
+ * operating systems, but may be lower on others.)
*
* VFDs are managed as an LRU pool, with actual OS file descriptors
* being opened and closed as needed. Obviously, if a routine is
@@ -167,15 +167,6 @@ int max_safe_fds = 32; /* default if not changed */
#define FileIsNotOpen(file) (VfdCache[file].fd == VFD_CLOSED)
-/*
- * Note: a VFD's seekPos is normally always valid, but if for some reason
- * an lseek() fails, it might become set to FileUnknownPos. We can struggle
- * along without knowing the seek position in many cases, but in some places
- * we have to fail if we don't have it.
- */
-#define FileUnknownPos ((off_t) -1)
-#define FilePosIsUnknown(pos) ((pos) < 0)
-
/* these are the assigned bits in fdstate below: */
#define FD_DELETE_AT_CLOSE (1 << 0) /* T = delete when closed */
#define FD_CLOSE_AT_EOXACT (1 << 1) /* T = close at eoXact */
@@ -189,7 +180,6 @@ typedef struct vfd
File nextFree; /* link to next free VFD, if in freelist */
File lruMoreRecently; /* doubly linked recency-of-use list */
File lruLessRecently;
- off_t seekPos; /* current logical file position, or -1 */
off_t fileSize; /* current size of file (0 if not temporary) */
char *fileName; /* name of file, or NULL for unused VFD */
/* NB: fileName is malloc'd, and must be free'd when closing the VFD */
@@ -407,9 +397,7 @@ pg_fdatasync(int fd)
/*
* pg_flush_data --- advise OS that the described dirty data should be flushed
*
- * offset of 0 with nbytes 0 means that the entire file should be flushed;
- * in this case, this function may have side-effects on the file's
- * seek position!
+ * offset of 0 with nbytes 0 means that the entire file should be flushed
*/
void
pg_flush_data(int fd, off_t offset, off_t nbytes)
@@ -1029,22 +1017,6 @@ LruDelete(File file)
vfdP = &VfdCache[file];
- /*
- * Normally we should know the seek position, but if for some reason we
- * have lost track of it, try again to get it. If we still can't get it,
- * we have a problem: we will be unable to restore the file seek position
- * when and if the file is re-opened. But we can't really throw an error
- * and refuse to close the file, or activities such as transaction cleanup
- * will be broken.
- */
- if (FilePosIsUnknown(vfdP->seekPos))
- {
- vfdP->seekPos = lseek(vfdP->fd, (off_t) 0, SEEK_CUR);
- if (FilePosIsUnknown(vfdP->seekPos))
- elog(LOG, "could not seek file \"%s\" before closing: %m",
- vfdP->fileName);
- }
-
/*
* Close the file. We aren't expecting this to fail; if it does, better
* to leak the FD than to mess up our internal state.
@@ -1113,33 +1085,6 @@ LruInsert(File file)
{
++nfile;
}
-
- /*
- * Seek to the right position. We need no special case for seekPos
- * equal to FileUnknownPos, as lseek() will certainly reject that
- * (thus completing the logic noted in LruDelete() that we will fail
- * to re-open a file if we couldn't get its seek position before
- * closing).
- */
- if (vfdP->seekPos != (off_t) 0)
- {
- if (lseek(vfdP->fd, vfdP->seekPos, SEEK_SET) < 0)
- {
- /*
- * If we fail to restore the seek position, treat it like an
- * open() failure.
- */
- int save_errno = errno;
-
- elog(LOG, "could not seek file \"%s\" after re-opening: %m",
- vfdP->fileName);
- (void) close(vfdP->fd);
- vfdP->fd = VFD_CLOSED;
- --nfile;
- errno = save_errno;
- return -1;
- }
- }
}
/*
@@ -1406,7 +1351,6 @@ PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
/* Saved flags are adjusted to be OK for re-opening file */
vfdP->fileFlags = fileFlags & ~(O_CREAT | O_TRUNC | O_EXCL);
vfdP->fileMode = fileMode;
- vfdP->seekPos = 0;
vfdP->fileSize = 0;
vfdP->fdstate = 0x0;
vfdP->resowner = NULL;
@@ -1820,7 +1764,6 @@ FileClose(File file)
/*
* FilePrefetch - initiate asynchronous read of a given range of the file.
- * The logical seek position is unaffected.
*
* Currently the only implementation of this function is using posix_fadvise
* which is the simplest standardized interface that accomplishes this.
@@ -1867,10 +1810,6 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
file, VfdCache[file].fileName,
(int64) offset, (int64) nbytes));
- /*
- * Caution: do not call pg_flush_data with nbytes = 0, it could trash the
- * file's seek position. We prefer to define that as a no-op here.
- */
if (nbytes <= 0)
return;
@@ -1884,7 +1823,8 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
}
int
-FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
+FileRead(File file, char *buffer, int amount, off_t offset,
+ uint32 wait_event_info)
{
int returnCode;
Vfd *vfdP;
@@ -1893,7 +1833,7 @@ FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
DO_DB(elog(LOG, "FileRead: %d (%s) " INT64_FORMAT " %d %p",
file, VfdCache[file].fileName,
- (int64) VfdCache[file].seekPos,
+ (int64) offset,
amount, buffer));
returnCode = FileAccess(file);
@@ -1904,16 +1844,16 @@ FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
retry:
pgstat_report_wait_start(wait_event_info);
- returnCode = read(vfdP->fd, buffer, amount);
+#ifdef HAVE_PREAD
+ returnCode = pread(vfdP->fd, buffer, amount, offset);
+#else
+ returnCode = lseek(VfdCache[file].fd, offset, SEEK_SET);
+ if (returnCode >= 0)
+ returnCode = read(vfdP->fd, buffer, amount);
+#endif
pgstat_report_wait_end();
- if (returnCode >= 0)
- {
- /* if seekPos is unknown, leave it that way */
- if (!FilePosIsUnknown(vfdP->seekPos))
- vfdP->seekPos += returnCode;
- }
- else
+ if (returnCode < 0)
{
/*
* Windows may run out of kernel buffers and return "Insufficient
@@ -1939,16 +1879,14 @@ retry:
/* OK to retry if interrupted */
if (errno == EINTR)
goto retry;
-
- /* Trouble, so assume we don't know the file position anymore */
- vfdP->seekPos = FileUnknownPos;
}
return returnCode;
}
int
-FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
+FileWrite(File file, char *buffer, int amount, off_t offset,
+ uint32 wait_event_info)
{
int returnCode;
Vfd *vfdP;
@@ -1957,7 +1895,7 @@ FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
DO_DB(elog(LOG, "FileWrite: %d (%s) " INT64_FORMAT " %d %p",
file, VfdCache[file].fileName,
- (int64) VfdCache[file].seekPos,
+ (int64) offset,
amount, buffer));
returnCode = FileAccess(file);
@@ -1976,26 +1914,13 @@ FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
*/
if (temp_file_limit >= 0 && (vfdP->fdstate & FD_TEMP_FILE_LIMIT))
{
- off_t newPos;
+ off_t past_write = offset + amount;
- /*
- * Normally we should know the seek position, but if for some reason
- * we have lost track of it, try again to get it. Here, it's fine to
- * throw an error if we still can't get it.
- */
- if (FilePosIsUnknown(vfdP->seekPos))
- {
- vfdP->seekPos = lseek(vfdP->fd, (off_t) 0, SEEK_CUR);
- if (FilePosIsUnknown(vfdP->seekPos))
- elog(ERROR, "could not seek file \"%s\": %m", vfdP->fileName);
- }
-
- newPos = vfdP->seekPos + amount;
- if (newPos > vfdP->fileSize)
+ if (past_write > vfdP->fileSize)
{
uint64 newTotal = temporary_files_size;
- newTotal += newPos - vfdP->fileSize;
+ newTotal += past_write - vfdP->fileSize;
if (newTotal > (uint64) temp_file_limit * (uint64) 1024)
ereport(ERROR,
(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
@@ -2007,7 +1932,13 @@ FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
retry:
errno = 0;
pgstat_report_wait_start(wait_event_info);
- returnCode = write(vfdP->fd, buffer, amount);
+#ifdef HAVE_PWRITE
+ returnCode = pwrite(VfdCache[file].fd, buffer, amount, offset);
+#else
+ returnCode = lseek(VfdCache[file].fd, offset, SEEK_SET);
+ if (returnCode >= 0)
+ returnCode = write(VfdCache[file].fd, buffer, amount);
+#endif
pgstat_report_wait_end();
/* if write didn't set errno, assume problem is no disk space */
@@ -2016,10 +1947,6 @@ retry:
if (returnCode >= 0)
{
- /* if seekPos is unknown, leave it that way */
- if (!FilePosIsUnknown(vfdP->seekPos))
- vfdP->seekPos += returnCode;
-
/*
* Maintain fileSize and temporary_files_size if it's a temp file.
*
@@ -2029,12 +1956,12 @@ retry:
*/
if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
{
- off_t newPos = vfdP->seekPos;
+ off_t past_write = offset + amount;
- if (newPos > vfdP->fileSize)
+ if (past_write > vfdP->fileSize)
{
- temporary_files_size += newPos - vfdP->fileSize;
- vfdP->fileSize = newPos;
+ temporary_files_size += past_write - vfdP->fileSize;
+ vfdP->fileSize = past_write;
}
}
}
@@ -2060,9 +1987,6 @@ retry:
/* OK to retry if interrupted */
if (errno == EINTR)
goto retry;
-
- /* Trouble, so assume we don't know the file position anymore */
- vfdP->seekPos = FileUnknownPos;
}
return returnCode;
@@ -2090,92 +2014,25 @@ FileSync(File file, uint32 wait_event_info)
}
off_t
-FileSeek(File file, off_t offset, int whence)
+FileSize(File file)
{
Vfd *vfdP;
Assert(FileIsValid(file));
- DO_DB(elog(LOG, "FileSeek: %d (%s) " INT64_FORMAT " " INT64_FORMAT " %d",
- file, VfdCache[file].fileName,
- (int64) VfdCache[file].seekPos,
- (int64) offset, whence));
+ DO_DB(elog(LOG, "FileSize %d (%s)",
+ file, VfdCache[file].fileName));
vfdP = &VfdCache[file];
if (FileIsNotOpen(file))
{
- switch (whence)
- {
- case SEEK_SET:
- if (offset < 0)
- {
- errno = EINVAL;
- return (off_t) -1;
- }
- vfdP->seekPos = offset;
- break;
- case SEEK_CUR:
- if (FilePosIsUnknown(vfdP->seekPos) ||
- vfdP->seekPos + offset < 0)
- {
- errno = EINVAL;
- return (off_t) -1;
- }
- vfdP->seekPos += offset;
- break;
- case SEEK_END:
- if (FileAccess(file) < 0)
- return (off_t) -1;
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- default:
- elog(ERROR, "invalid whence: %d", whence);
- break;
- }
- }
- else
- {
- switch (whence)
- {
- case SEEK_SET:
- if (offset < 0)
- {
- errno = EINVAL;
- return (off_t) -1;
- }
- if (vfdP->seekPos != offset)
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- case SEEK_CUR:
- if (offset != 0 || FilePosIsUnknown(vfdP->seekPos))
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- case SEEK_END:
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- default:
- elog(ERROR, "invalid whence: %d", whence);
- break;
- }
+ if (FileAccess(file) < 0)
+ return (off_t) -1;
}
- return vfdP->seekPos;
-}
-
-/*
- * XXX not actually used but here for completeness
- */
-#ifdef NOT_USED
-off_t
-FileTell(File file)
-{
- Assert(FileIsValid(file));
- DO_DB(elog(LOG, "FileTell %d (%s)",
- file, VfdCache[file].fileName));
- return VfdCache[file].seekPos;
+ return lseek(VfdCache[file].fd, 0, SEEK_END);
}
-#endif
int
FileTruncate(File file, off_t offset, uint32 wait_event_info)
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index f4374d077be..86013a5c8b2 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -522,22 +522,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- /*
- * Note: because caller usually obtained blocknum by calling mdnblocks,
- * which did a seek(SEEK_END), this seek is often redundant and will be
- * optimized away by fd.c. It's not redundant, however, if there is a
- * partial page at the end of the file. In that case we want to try to
- * overwrite the partial page with a full page. It's also not redundant
- * if bufmgr.c had to dump another buffer of the same file to make room
- * for the new page's buffer.
- */
- if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not seek to block %u in file \"%s\": %m",
- blocknum, FilePathName(v->mdfd_vfd))));
-
- if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
+ if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
{
if (nbytes < 0)
ereport(ERROR,
@@ -748,13 +733,7 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not seek to block %u in file \"%s\": %m",
- blocknum, FilePathName(v->mdfd_vfd))));
-
- nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_READ);
+ nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_READ);
TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
@@ -824,13 +803,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not seek to block %u in file \"%s\": %m",
- blocknum, FilePathName(v->mdfd_vfd))));
-
- nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_WRITE);
+ nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_WRITE);
TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
@@ -1979,7 +1952,7 @@ _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
{
off_t len;
- len = FileSeek(seg->mdfd_vfd, 0L, SEEK_END);
+ len = FileSize(seg->mdfd_vfd);
if (len < 0)
ereport(ERROR,
(errcode_for_file_access(),
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index b7e469670f4..73b96f00e3a 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -434,6 +434,9 @@
/* Define to 1 if the assembler supports PPC's LWARX mutex hint bit. */
#undef HAVE_PPC_LWARX_MUTEX_HINT
+/* Define to 1 if you have the `pread' function. */
+#undef HAVE_PREAD
+
/* Define to 1 if you have the `pstat' function. */
#undef HAVE_PSTAT
@@ -449,6 +452,9 @@
/* Have PTHREAD_PRIO_INHERIT. */
#undef HAVE_PTHREAD_PRIO_INHERIT
+/* Define to 1 if you have the `pwrite' function. */
+#undef HAVE_PWRITE
+
/* Define to 1 if you have the `random' function. */
#undef HAVE_RANDOM
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8e7c9728f4b..f8b6fa8ece5 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -15,7 +15,7 @@
/*
* calls:
*
- * File {Close, Read, Write, Seek, Tell, Sync}
+ * File {Close, Read, Write, Size, Tell, Sync}
* {Path Name Open, Allocate, Free} File
*
* These are NOT JUST RENAMINGS OF THE UNIX ROUTINES.
@@ -42,10 +42,6 @@
#include <dirent.h>
-/*
- * FileSeek uses the standard UNIX lseek(2) flags.
- */
-
typedef int File;
@@ -68,10 +64,10 @@ extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fil
extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount, uint32 wait_event_info);
-extern int FileRead(File file, char *buffer, int amount, uint32 wait_event_info);
-extern int FileWrite(File file, char *buffer, int amount, uint32 wait_event_info);
+extern int FileRead(File file, char *buffer, int amount, off_t offset, uint32 wait_event_info);
+extern int FileWrite(File file, char *buffer, int amount, off_t offset, uint32 wait_event_info);
extern int FileSync(File file, uint32 wait_event_info);
-extern off_t FileSeek(File file, off_t offset, int whence);
+extern off_t FileSize(File file);
extern int FileTruncate(File file, off_t offset, uint32 wait_event_info);
extern void FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info);
extern char *FilePathName(File file);
--
2.17.0
On Sat, Jul 21, 2018 at 3:34 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
No objections, if you want to make the effort. But IMHO the lseek+read
fallback is good enough on Windows. Unless you were thinking that we
could then remove the !HAVE_PREAD fallback altogether. Are there any
other platforms out there that don't have pread/pwrite that we care about?
AFAICT, macOS has them as far back as we care about (prairiedog does).
HPUX 10.20 (gaur/pademelon) does not, so personally I'd like to keep
the lseek+read workaround. Don't know about the oldest Solaris critters
we have in the buildfarm. FreeBSD has had 'em at least since 4.0 (2000);
didn't check the other BSDen.
SUS v2 (POSIX 1997) does specify both functions, so we could insist on
their presence without breaking any of our own portability guidelines.
However, if we have to have some workaround anyway for Windows, it
seems like including an lseek+read code path is reasonable so that we
needn't retire those oldest buildfarm critters.
Yeah it seems useful and cheap to carry the lseek() fallback. But
actually there is a good reason to implement proper pread/pwrite
(equivalent) on Windows: this patch removes the position tracking, so
that the fallback code generates *more* lseek() calls than current
master. For example with sequential reads today we are smart enough
to skip redundant lseek() calls, but this patch removes those smarts.
I doubt anyone cares about that on HPUX 10.20 but I don't think we
should do that on Windows.
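To make the cost concrete, here is a minimal sketch of the two paths. It is
not code from the patch: fallback_pread() and compare_paths() are
hypothetical names, and the scratch file under /tmp is only there to keep
the example self-contained. The point is that the fallback always issues
lseek() followed by read(), even when the read follows on sequentially,
because nothing remembers the position any more.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper, not from the patch: emulate pread() with
 * lseek() + read().  With no cached position this is always two
 * system calls, and it moves the descriptor's file position.
 */
ssize_t
fallback_pread(int fd, void *buf, size_t len, off_t offset)
{
	if (lseek(fd, offset, SEEK_SET) < 0)
		return -1;
	return read(fd, buf, len);
}

/*
 * Read the same 4-byte chunk through both paths from a scratch file.
 * Returns 0 if both paths return the expected data, -1 otherwise.
 */
int
compare_paths(void)
{
	char		path[] = "/tmp/pread_demo_XXXXXX";
	char		a[4];
	char		b[4];
	int			fd = mkstemp(path);
	int			ok;

	if (fd < 0)
		return -1;
	unlink(path);				/* file goes away when fd is closed */
	if (write(fd, "abcdefgh", 8) != 8)
	{
		close(fd);
		return -1;
	}

	/* One system call; the file position is left alone. */
	ok = pread(fd, a, 4, 4) == 4;
	/* Two system calls for the same result. */
	ok = ok && fallback_pread(fd, b, 4, 4) == 4;
	ok = ok && memcmp(a, b, 4) == 0 && memcmp(a, "efgh", 4) == 0;

	close(fd);
	return ok ? 0 : -1;
}
```

Master avoids the extra lseek() in the sequential case by comparing the
requested offset with the cached seekPos; once that cache is gone, the only
way to get back to one call per read is a real offset-based primitive.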
--
Thomas Munro
http://www.enterprisedb.com
Hi,
On 07/26/2018 10:04 PM, Thomas Munro wrote:
Done. Rebased.
This needs a rebase again.
Once resolved the patch passes make check-world, and a strace analysis
shows the associated read()/write() have been turned into
pread64()/pwrite64(). All lseek()'s are SEEK_END's.
Best regards,
Jesper
Hi,
On 09/05/2018 02:42 PM, Jesper Pedersen wrote:
On 07/26/2018 10:04 PM, Thomas Munro wrote:
Done. Rebased.
This needs a rebase again.
Would it be of benefit to update these call sites
* slru.c
- SlruPhysicalReadPage
- SlruPhysicalWritePage
* xlogutils.c
- XLogRead
* pg_receivewal.c
- FindStreamingStart
* rewriteheap.c
- heap_xlog_logical_rewrite
* walreceiver.c
- XLogWalRcvWrite
too ?
Best regards,
Jesper
On Fri, Sep 7, 2018 at 2:17 AM Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
This needs a rebase again.
Done.
Would it be of benefit to update these call sites
* slru.c
- SlruPhysicalReadPage
- SlruPhysicalWritePage
* xlogutils.c
- XLogRead
* pg_receivewal.c
- FindStreamingStart
* rewriteheap.c
- heap_xlog_logical_rewrite
* walreceiver.c
- XLogWalRcvWrite
It certainly wouldn't hurt... but more pressing to get this committed
would be Windows support IMHO. I think the thing to do is to open
files with the FILE_FLAG_OVERLAPPED flag, and then use ReadFile() and
WriteFile() with an LPOVERLAPPED struct that holds an offset, but I'm
not sure if I can write that myself. I tried doing some semi-serious
Windows development for the fsyncgate patch using only AppVeyor CI a
couple of weeks ago and it was like visiting the dentist.
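A rough sketch of the offset-in-OVERLAPPED idea follows, with several
assumptions stated up front: portable_pread() and
portable_pread_selftest() are hypothetical names, the Windows branch uses
an ordinary synchronous handle obtained with _get_osfhandle() rather than
one opened with FILE_FLAG_OVERLAPPED, the errno mapping is deliberately
crude, and the Windows branch has not been compiled or run here. On other
platforms the wrapper simply defers to pread().

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#ifdef _WIN32
#include <windows.h>
#include <io.h>
#define FD_OF(f) _fileno(f)
#else
#include <unistd.h>
#define FD_OF(f) fileno(f)
#endif

/*
 * Hypothetical wrapper: read len bytes at the given offset.
 * On Windows the offset travels in an OVERLAPPED struct passed to
 * ReadFile(); elsewhere we call pread().
 */
long
portable_pread(int fd, void *buf, unsigned int len, int64_t offset)
{
#ifdef _WIN32
	OVERLAPPED	ov = {0};
	HANDLE		h = (HANDLE) _get_osfhandle(fd);
	DWORD		nread;

	if (h == INVALID_HANDLE_VALUE)
	{
		errno = EBADF;
		return -1;
	}
	/* The 64-bit offset is split across two 32-bit fields. */
	ov.Offset = (DWORD) ((uint64_t) offset & 0xFFFFFFFF);
	ov.OffsetHigh = (DWORD) ((uint64_t) offset >> 32);
	if (!ReadFile(h, buf, len, &nread, &ov))
	{
		if (GetLastError() == ERROR_HANDLE_EOF)
			return 0;			/* read at or past end of file */
		errno = EIO;			/* crude mapping, for the sketch only */
		return -1;
	}
	return (long) nread;
#else
	return (long) pread(fd, buf, len, (off_t) offset);
#endif
}

/*
 * Write ten bytes to a temporary file and read three back from
 * offset 2.  Returns 0 on success, -1 on any failure.
 */
int
portable_pread_selftest(void)
{
	char		buf[3];
	FILE	   *f = tmpfile();
	int			ok;

	if (f == NULL)
		return -1;
	if (fputs("0123456789", f) == EOF || fflush(f) != 0)
	{
		fclose(f);
		return -1;
	}
	ok = portable_pread(FD_OF(f), buf, 3, 2) == 3 &&
		memcmp(buf, "234", 3) == 0;
	/* A read starting exactly at end of file reports 0 bytes. */
	ok = ok && portable_pread(FD_OF(f), buf, 3, 10) == 0;
	fclose(f);
	return ok ? 0 : -1;
}
```

One caveat with this variant: as far as I understand it, ReadFile() on a
synchronous handle still advances the handle's file position, so it is not
a perfect stand-in for pread(). That should not matter once nothing relies
on implicit positioning.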
On Thu, Sep 6, 2018 at 6:42 AM Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
Once resolved the patch passes make check-world, and a strace analysis
shows the associated read()/write() have been turned into
pread64()/pwrite64(). All lseek()'s are SEEK_END's.
Yeah :-) Just for fun, here is the truss output for a single pgbench
transaction here:
recvfrom(9,"B\0\0\0\^R\0P0_4\0\0\0\0\0\0\^A"...,8192,0,NULL,0x0) = 41 (0x29)
sendto(9,"2\0\0\0\^Dn\0\0\0\^DC\0\0\0\nBEG"...,27,0,NULL,0) = 27 (0x1b)
recvfrom(9,"B\0\0\0&\0P0_5\0\0\0\0\^B\0\0\0"...,8192,0,NULL,0x0) = 61 (0x3d)
pread(22,"\0\0\0\0\^P\M^C\M-@P\0\0\0\0\M-X"...,8192,0x960a000) = 8192 (0x2000)
pread(20,"\0\0\0\0\M^X\^D\M^Iq\0\0\0\0\^T"...,8192,0x380fe000) = 8192 (0x2000)
sendto(9,"2\0\0\0\^Dn\0\0\0\^DC\0\0\0\rUPD"...,30,0,NULL,0) = 30 (0x1e)
recvfrom(9,"B\0\0\0\^]\0P0_6\0\0\0\0\^A\0\0"...,8192,0,NULL,0x0) = 52 (0x34)
sendto(9,"2\0\0\0\^DT\0\0\0!\0\^Aabalance"...,75,0,NULL,0) = 75 (0x4b)
recvfrom(9,"B\0\0\0"\0P0_7\0\0\0\0\^B\0\0\0"...,8192,0,NULL,0x0) = 57 (0x39)
sendto(9,"2\0\0\0\^Dn\0\0\0\^DC\0\0\0\rUPD"...,30,0,NULL,0) = 30 (0x1e)
recvfrom(9,"B\0\0\0!\0P0_8\0\0\0\0\^B\0\0\0"...,8192,0,NULL,0x0) = 56 (0x38)
lseek(29,0x0,SEEK_END) = 8192 (0x2000)
sendto(9,"2\0\0\0\^Dn\0\0\0\^DC\0\0\0\rUPD"...,30,0,NULL,0) = 30 (0x1e)
recvfrom(9,"B\0\0\0003\0P0_9\0\0\0\0\^D\0\0"...,8192,0,NULL,0x0) = 74 (0x4a)
sendto(9,"2\0\0\0\^Dn\0\0\0\^DC\0\0\0\^OIN"...,32,0,NULL,0) = 32 (0x20)
recvfrom(9,"B\0\0\0\^S\0P0_10\0\0\0\0\0\0\^A"...,8192,0,NULL,0x0) = 42 (0x2a)
pwrite(33,"\M^X\M-P\^E\0\^A\0\0\0\0\M-`\M^A"...,16384,0x81e000) = 16384 (0x4000)
fdatasync(0x21) = 0 (0x0)
sendto(9,"2\0\0\0\^Dn\0\0\0\^DC\0\0\0\vCOM"...,28,0,NULL,0) = 28 (0x1c)
There is only one lseek() left. I actually have another patch that
gets rid of even that (by caching sizes in SMgrRelation using a shared
invalidation counter which I'm not yet sure about). Then pgbench's
7-round-trip transaction makes only the strictly necessary 18 syscalls
(every one an explainable network message, disk page or sync).
Unpatched master has 5 extra lseek()s.
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
0001-Use-pread-pwrite-instead-of-lseek-read-write-v4.patch (application/octet-stream)
From ca8e5ebd68b956b57499500884bbb2df04cdf4a7 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Thu, 12 Jul 2018 13:14:02 +1200
Subject: [PATCH] Use pread()/pwrite() instead of lseek() + read()/write().
Cut down on system calls by doing random IO using POSIX.1-2008
offset-based IO routines, where available. Remove the code for
tracking the 'virtual' seek position. The only reason left to
call FileSeek() was to get the file's size, so provide a new
function FileSize() instead.
Author: Oskari Saarenmaa, Thomas Munro
Reviewed-by: Thomas Munro, Jesper Pedersen
Discussion: https://postgr.es/m/CAEepm=02rapCpPR3ZGF2vW=SBHSdFYO_bz_f-wwWJonmA3APgw@mail.gmail.com
Discussion: https://postgr.es/m/b8748d39-0b19-0514-a1b9-4e5a28e6a208%40gmail.com
Discussion: https://postgr.es/m/a86bd200-ebbe-d829-e3ca-0c4474b2fcb7%40ohmu.fi
---
configure | 2 +-
configure.in | 2 +-
src/backend/access/heap/rewriteheap.c | 2 +-
src/backend/access/transam/xlog.c | 13 ++
src/backend/storage/file/buffile.c | 46 +-----
src/backend/storage/file/fd.c | 217 +++++---------------------
src/backend/storage/smgr/md.c | 35 +----
src/include/pg_config.h.in | 6 +
src/include/storage/fd.h | 12 +-
9 files changed, 72 insertions(+), 263 deletions(-)
diff --git a/configure b/configure
index c6a44a9078a..69760cd4ca2 100755
--- a/configure
+++ b/configure
@@ -15060,7 +15060,7 @@ fi
LIBS_including_readline="$LIBS"
LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
-for ac_func in cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimes wcstombs_l
+for ac_func in cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pread pstat pthread_is_threaded_np pwrite readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimes wcstombs_l
do :
as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.in b/configure.in
index 3ada48b5f95..6eabba1c5b3 100644
--- a/configure.in
+++ b/configure.in
@@ -1544,7 +1544,7 @@ PGAC_FUNC_WCSTOMBS_L
LIBS_including_readline="$LIBS"
LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
-AC_CHECK_FUNCS([cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimes wcstombs_l])
+AC_CHECK_FUNCS([cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pread pstat pthread_is_threaded_np pwrite readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimes wcstombs_l])
AC_REPLACE_FUNCS(fseeko)
case $host_os in
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 85f92973c95..5f573bafda6 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -922,7 +922,7 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
* Note that we deviate from the usual WAL coding practices here,
* check the above "Logical rewrite support" comment for reasoning.
*/
- written = FileWrite(src->vfd, waldata_start, len,
+ written = FileWrite(src->vfd, waldata_start, len, src->off,
WAIT_EVENT_LOGICAL_REWRITE_WRITE);
if (written != len)
ereport(ERROR,
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3025d0badb8..51a915a389f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2484,6 +2484,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
Size nleft;
int written;
+#ifndef HAVE_PWRITE
/* Need to seek in the file? */
if (openLogOff != startoffset)
{
@@ -2495,6 +2496,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
startoffset)));
openLogOff = startoffset;
}
+#endif
/* OK to write the page(s) */
from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2504,7 +2506,11 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
{
errno = 0;
pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
+#ifdef HAVE_PWRITE
+ written = pwrite(openLogFile, from, nleft, startoffset);
+#else
written = write(openLogFile, from, nleft);
+#endif
pgstat_report_wait_end();
if (written <= 0)
{
@@ -2519,6 +2525,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
}
nleft -= written;
from += written;
+ startoffset += written;
} while (nleft > 0);
/* Update state for write */
@@ -11818,6 +11825,7 @@ retry:
/* Read the requested page */
readOff = targetPageOff;
+#ifndef HAVE_PREAD
if (lseek(readFile, (off_t) readOff, SEEK_SET) < 0)
{
char fname[MAXFNAMELEN];
@@ -11831,9 +11839,14 @@ retry:
fname, readOff)));
goto next_record_is_invalid;
}
+#endif
pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+#ifdef HAVE_PREAD
+ r = pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
+#else
r = read(readFile, readBuf, XLOG_BLCKSZ);
+#endif
if (r != XLOG_BLCKSZ)
{
char fname[MAXFNAMELEN];
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index e93813d9737..dd687dfe71f 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -67,12 +67,6 @@ struct BufFile
int numFiles; /* number of physical files in set */
/* all files except the last have length exactly MAX_PHYSICAL_FILESIZE */
File *files; /* palloc'd array with numFiles entries */
- off_t *offsets; /* palloc'd array with numFiles entries */
-
- /*
- * offsets[i] is the current seek position of files[i]. We use this to
- * avoid making redundant FileSeek calls.
- */
bool isInterXact; /* keep open over transactions? */
bool dirty; /* does buffer need to be written? */
@@ -116,7 +110,6 @@ makeBufFileCommon(int nfiles)
BufFile *file = (BufFile *) palloc(sizeof(BufFile));
file->numFiles = nfiles;
- file->offsets = (off_t *) palloc0(sizeof(off_t) * nfiles);
file->isInterXact = false;
file->dirty = false;
file->resowner = CurrentResourceOwner;
@@ -170,10 +163,7 @@ extendBufFile(BufFile *file)
file->files = (File *) repalloc(file->files,
(file->numFiles + 1) * sizeof(File));
- file->offsets = (off_t *) repalloc(file->offsets,
- (file->numFiles + 1) * sizeof(off_t));
file->files[file->numFiles] = pfile;
- file->offsets[file->numFiles] = 0L;
file->numFiles++;
}
@@ -396,7 +386,6 @@ BufFileClose(BufFile *file)
FileClose(file->files[i]);
/* release the buffer space */
pfree(file->files);
- pfree(file->offsets);
pfree(file);
}
@@ -422,27 +411,17 @@ BufFileLoadBuffer(BufFile *file)
file->curOffset = 0L;
}
- /*
- * May need to reposition physical file.
- */
- thisfile = file->files[file->curFile];
- if (file->curOffset != file->offsets[file->curFile])
- {
- if (FileSeek(thisfile, file->curOffset, SEEK_SET) != file->curOffset)
- return; /* seek failed, read nothing */
- file->offsets[file->curFile] = file->curOffset;
- }
-
/*
* Read whatever we can get, up to a full bufferload.
*/
+ thisfile = file->files[file->curFile];
file->nbytes = FileRead(thisfile,
file->buffer.data,
sizeof(file->buffer),
+ file->curOffset,
WAIT_EVENT_BUFFILE_READ);
if (file->nbytes < 0)
file->nbytes = 0;
- file->offsets[file->curFile] += file->nbytes;
/* we choose not to advance curOffset here */
if (file->nbytes > 0)
@@ -491,23 +470,14 @@ BufFileDumpBuffer(BufFile *file)
if ((off_t) bytestowrite > availbytes)
bytestowrite = (int) availbytes;
- /*
- * May need to reposition physical file.
- */
thisfile = file->files[file->curFile];
- if (file->curOffset != file->offsets[file->curFile])
- {
- if (FileSeek(thisfile, file->curOffset, SEEK_SET) != file->curOffset)
- return; /* seek failed, give up */
- file->offsets[file->curFile] = file->curOffset;
- }
bytestowrite = FileWrite(thisfile,
file->buffer.data + wpos,
bytestowrite,
+ file->curOffset,
WAIT_EVENT_BUFFILE_WRITE);
if (bytestowrite <= 0)
return; /* failed to write */
- file->offsets[file->curFile] += bytestowrite;
file->curOffset += bytestowrite;
wpos += bytestowrite;
@@ -803,11 +773,10 @@ BufFileSize(BufFile *file)
{
off_t lastFileSize;
- /* Get the size of the last physical file by seeking to end. */
- lastFileSize = FileSeek(file->files[file->numFiles - 1], 0, SEEK_END);
+ /* Get the size of the last physical file. */
+ lastFileSize = FileSize(file->files[file->numFiles - 1]);
if (lastFileSize < 0)
return -1;
- file->offsets[file->numFiles - 1] = lastFileSize;
return ((file->numFiles - 1) * (off_t) MAX_PHYSICAL_FILESIZE) +
lastFileSize;
@@ -849,13 +818,8 @@ BufFileAppend(BufFile *target, BufFile *source)
target->files = (File *)
repalloc(target->files, sizeof(File) * newNumFiles);
- target->offsets = (off_t *)
- repalloc(target->offsets, sizeof(off_t) * newNumFiles);
for (i = target->numFiles; i < newNumFiles; i++)
- {
target->files[i] = source->files[i - target->numFiles];
- target->offsets[i] = source->offsets[i - target->numFiles];
- }
target->numFiles = newNumFiles;
return startBlock;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 8dd51f17674..a380f794014 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -16,8 +16,8 @@
* including base tables, scratch files (e.g., sort and hash spool
* files), and random calls to C library routines like system(3); it
* is quite easy to exceed system limits on the number of open files a
- * single process can have. (This is around 256 on many modern
- * operating systems, but can be as low as 32 on others.)
+ * single process can have. (This is around 1024 on many modern
+ * operating systems, but may be lower on others.)
*
* VFDs are managed as an LRU pool, with actual OS file descriptors
* being opened and closed as needed. Obviously, if a routine is
@@ -167,15 +167,6 @@ int max_safe_fds = 32; /* default if not changed */
#define FileIsNotOpen(file) (VfdCache[file].fd == VFD_CLOSED)
-/*
- * Note: a VFD's seekPos is normally always valid, but if for some reason
- * an lseek() fails, it might become set to FileUnknownPos. We can struggle
- * along without knowing the seek position in many cases, but in some places
- * we have to fail if we don't have it.
- */
-#define FileUnknownPos ((off_t) -1)
-#define FilePosIsUnknown(pos) ((pos) < 0)
-
/* these are the assigned bits in fdstate below: */
#define FD_DELETE_AT_CLOSE (1 << 0) /* T = delete when closed */
#define FD_CLOSE_AT_EOXACT (1 << 1) /* T = close at eoXact */
@@ -189,7 +180,6 @@ typedef struct vfd
File nextFree; /* link to next free VFD, if in freelist */
File lruMoreRecently; /* doubly linked recency-of-use list */
File lruLessRecently;
- off_t seekPos; /* current logical file position, or -1 */
off_t fileSize; /* current size of file (0 if not temporary) */
char *fileName; /* name of file, or NULL for unused VFD */
/* NB: fileName is malloc'd, and must be free'd when closing the VFD */
@@ -407,9 +397,7 @@ pg_fdatasync(int fd)
/*
* pg_flush_data --- advise OS that the described dirty data should be flushed
*
- * offset of 0 with nbytes 0 means that the entire file should be flushed;
- * in this case, this function may have side-effects on the file's
- * seek position!
+ * offset of 0 with nbytes 0 means that the entire file should be flushed
*/
void
pg_flush_data(int fd, off_t offset, off_t nbytes)
@@ -1029,22 +1017,6 @@ LruDelete(File file)
vfdP = &VfdCache[file];
- /*
- * Normally we should know the seek position, but if for some reason we
- * have lost track of it, try again to get it. If we still can't get it,
- * we have a problem: we will be unable to restore the file seek position
- * when and if the file is re-opened. But we can't really throw an error
- * and refuse to close the file, or activities such as transaction cleanup
- * will be broken.
- */
- if (FilePosIsUnknown(vfdP->seekPos))
- {
- vfdP->seekPos = lseek(vfdP->fd, (off_t) 0, SEEK_CUR);
- if (FilePosIsUnknown(vfdP->seekPos))
- elog(LOG, "could not seek file \"%s\" before closing: %m",
- vfdP->fileName);
- }
-
/*
* Close the file. We aren't expecting this to fail; if it does, better
* to leak the FD than to mess up our internal state.
@@ -1113,33 +1085,6 @@ LruInsert(File file)
{
++nfile;
}
-
- /*
- * Seek to the right position. We need no special case for seekPos
- * equal to FileUnknownPos, as lseek() will certainly reject that
- * (thus completing the logic noted in LruDelete() that we will fail
- * to re-open a file if we couldn't get its seek position before
- * closing).
- */
- if (vfdP->seekPos != (off_t) 0)
- {
- if (lseek(vfdP->fd, vfdP->seekPos, SEEK_SET) < 0)
- {
- /*
- * If we fail to restore the seek position, treat it like an
- * open() failure.
- */
- int save_errno = errno;
-
- elog(LOG, "could not seek file \"%s\" after re-opening: %m",
- vfdP->fileName);
- (void) close(vfdP->fd);
- vfdP->fd = VFD_CLOSED;
- --nfile;
- errno = save_errno;
- return -1;
- }
- }
}
/*
@@ -1406,7 +1351,6 @@ PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
/* Saved flags are adjusted to be OK for re-opening file */
vfdP->fileFlags = fileFlags & ~(O_CREAT | O_TRUNC | O_EXCL);
vfdP->fileMode = fileMode;
- vfdP->seekPos = 0;
vfdP->fileSize = 0;
vfdP->fdstate = 0x0;
vfdP->resowner = NULL;
@@ -1820,7 +1764,6 @@ FileClose(File file)
/*
* FilePrefetch - initiate asynchronous read of a given range of the file.
- * The logical seek position is unaffected.
*
* Currently the only implementation of this function is using posix_fadvise
* which is the simplest standardized interface that accomplishes this.
@@ -1867,10 +1810,6 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
file, VfdCache[file].fileName,
(int64) offset, (int64) nbytes));
- /*
- * Caution: do not call pg_flush_data with nbytes = 0, it could trash the
- * file's seek position. We prefer to define that as a no-op here.
- */
if (nbytes <= 0)
return;
@@ -1884,7 +1823,8 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
}
int
-FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
+FileRead(File file, char *buffer, int amount, off_t offset,
+ uint32 wait_event_info)
{
int returnCode;
Vfd *vfdP;
@@ -1893,7 +1833,7 @@ FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
DO_DB(elog(LOG, "FileRead: %d (%s) " INT64_FORMAT " %d %p",
file, VfdCache[file].fileName,
- (int64) VfdCache[file].seekPos,
+ (int64) offset,
amount, buffer));
returnCode = FileAccess(file);
@@ -1904,16 +1844,16 @@ FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
retry:
pgstat_report_wait_start(wait_event_info);
- returnCode = read(vfdP->fd, buffer, amount);
+#ifdef HAVE_PREAD
+ returnCode = pread(vfdP->fd, buffer, amount, offset);
+#else
+ returnCode = lseek(VfdCache[file].fd, offset, SEEK_SET);
+ if (returnCode >= 0)
+ returnCode = read(vfdP->fd, buffer, amount);
+#endif
pgstat_report_wait_end();
- if (returnCode >= 0)
- {
- /* if seekPos is unknown, leave it that way */
- if (!FilePosIsUnknown(vfdP->seekPos))
- vfdP->seekPos += returnCode;
- }
- else
+ if (returnCode < 0)
{
/*
* Windows may run out of kernel buffers and return "Insufficient
@@ -1939,16 +1879,14 @@ retry:
/* OK to retry if interrupted */
if (errno == EINTR)
goto retry;
-
- /* Trouble, so assume we don't know the file position anymore */
- vfdP->seekPos = FileUnknownPos;
}
return returnCode;
}
int
-FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
+FileWrite(File file, char *buffer, int amount, off_t offset,
+ uint32 wait_event_info)
{
int returnCode;
Vfd *vfdP;
@@ -1957,7 +1895,7 @@ FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
DO_DB(elog(LOG, "FileWrite: %d (%s) " INT64_FORMAT " %d %p",
file, VfdCache[file].fileName,
- (int64) VfdCache[file].seekPos,
+ (int64) offset,
amount, buffer));
returnCode = FileAccess(file);
@@ -1976,26 +1914,13 @@ FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
*/
if (temp_file_limit >= 0 && (vfdP->fdstate & FD_TEMP_FILE_LIMIT))
{
- off_t newPos;
+ off_t past_write = offset + amount;
- /*
- * Normally we should know the seek position, but if for some reason
- * we have lost track of it, try again to get it. Here, it's fine to
- * throw an error if we still can't get it.
- */
- if (FilePosIsUnknown(vfdP->seekPos))
- {
- vfdP->seekPos = lseek(vfdP->fd, (off_t) 0, SEEK_CUR);
- if (FilePosIsUnknown(vfdP->seekPos))
- elog(ERROR, "could not seek file \"%s\": %m", vfdP->fileName);
- }
-
- newPos = vfdP->seekPos + amount;
- if (newPos > vfdP->fileSize)
+ if (past_write > vfdP->fileSize)
{
uint64 newTotal = temporary_files_size;
- newTotal += newPos - vfdP->fileSize;
+ newTotal += past_write - vfdP->fileSize;
if (newTotal > (uint64) temp_file_limit * (uint64) 1024)
ereport(ERROR,
(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
@@ -2007,7 +1932,13 @@ FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
retry:
errno = 0;
pgstat_report_wait_start(wait_event_info);
- returnCode = write(vfdP->fd, buffer, amount);
+#ifdef HAVE_PWRITE
+ returnCode = pwrite(VfdCache[file].fd, buffer, amount, offset);
+#else
+ returnCode = lseek(VfdCache[file].fd, offset, SEEK_SET);
+ if (returnCode >= 0)
+ returnCode = write(VfdCache[file].fd, buffer, amount);
+#endif
pgstat_report_wait_end();
/* if write didn't set errno, assume problem is no disk space */
@@ -2016,10 +1947,6 @@ retry:
if (returnCode >= 0)
{
- /* if seekPos is unknown, leave it that way */
- if (!FilePosIsUnknown(vfdP->seekPos))
- vfdP->seekPos += returnCode;
-
/*
* Maintain fileSize and temporary_files_size if it's a temp file.
*
@@ -2029,12 +1956,12 @@ retry:
*/
if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
{
- off_t newPos = vfdP->seekPos;
+ off_t past_write = offset + amount;
- if (newPos > vfdP->fileSize)
+ if (past_write > vfdP->fileSize)
{
- temporary_files_size += newPos - vfdP->fileSize;
- vfdP->fileSize = newPos;
+ temporary_files_size += past_write - vfdP->fileSize;
+ vfdP->fileSize = past_write;
}
}
}
@@ -2060,9 +1987,6 @@ retry:
/* OK to retry if interrupted */
if (errno == EINTR)
goto retry;
-
- /* Trouble, so assume we don't know the file position anymore */
- vfdP->seekPos = FileUnknownPos;
}
return returnCode;
@@ -2090,92 +2014,25 @@ FileSync(File file, uint32 wait_event_info)
}
off_t
-FileSeek(File file, off_t offset, int whence)
+FileSize(File file)
{
Vfd *vfdP;
Assert(FileIsValid(file));
- DO_DB(elog(LOG, "FileSeek: %d (%s) " INT64_FORMAT " " INT64_FORMAT " %d",
- file, VfdCache[file].fileName,
- (int64) VfdCache[file].seekPos,
- (int64) offset, whence));
+ DO_DB(elog(LOG, "FileSize %d (%s)",
+ file, VfdCache[file].fileName));
vfdP = &VfdCache[file];
if (FileIsNotOpen(file))
{
- switch (whence)
- {
- case SEEK_SET:
- if (offset < 0)
- {
- errno = EINVAL;
- return (off_t) -1;
- }
- vfdP->seekPos = offset;
- break;
- case SEEK_CUR:
- if (FilePosIsUnknown(vfdP->seekPos) ||
- vfdP->seekPos + offset < 0)
- {
- errno = EINVAL;
- return (off_t) -1;
- }
- vfdP->seekPos += offset;
- break;
- case SEEK_END:
- if (FileAccess(file) < 0)
- return (off_t) -1;
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- default:
- elog(ERROR, "invalid whence: %d", whence);
- break;
- }
- }
- else
- {
- switch (whence)
- {
- case SEEK_SET:
- if (offset < 0)
- {
- errno = EINVAL;
- return (off_t) -1;
- }
- if (vfdP->seekPos != offset)
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- case SEEK_CUR:
- if (offset != 0 || FilePosIsUnknown(vfdP->seekPos))
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- case SEEK_END:
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- default:
- elog(ERROR, "invalid whence: %d", whence);
- break;
- }
+ if (FileAccess(file) < 0)
+ return (off_t) -1;
}
- return vfdP->seekPos;
-}
-
-/*
- * XXX not actually used but here for completeness
- */
-#ifdef NOT_USED
-off_t
-FileTell(File file)
-{
- Assert(FileIsValid(file));
- DO_DB(elog(LOG, "FileTell %d (%s)",
- file, VfdCache[file].fileName));
- return VfdCache[file].seekPos;
+ return lseek(VfdCache[file].fd, 0, SEEK_END);
}
-#endif
int
FileTruncate(File file, off_t offset, uint32 wait_event_info)
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index f4374d077be..86013a5c8b2 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -522,22 +522,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- /*
- * Note: because caller usually obtained blocknum by calling mdnblocks,
- * which did a seek(SEEK_END), this seek is often redundant and will be
- * optimized away by fd.c. It's not redundant, however, if there is a
- * partial page at the end of the file. In that case we want to try to
- * overwrite the partial page with a full page. It's also not redundant
- * if bufmgr.c had to dump another buffer of the same file to make room
- * for the new page's buffer.
- */
- if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not seek to block %u in file \"%s\": %m",
- blocknum, FilePathName(v->mdfd_vfd))));
-
- if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
+ if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
{
if (nbytes < 0)
ereport(ERROR,
@@ -748,13 +733,7 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not seek to block %u in file \"%s\": %m",
- blocknum, FilePathName(v->mdfd_vfd))));
-
- nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_READ);
+ nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_READ);
TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
@@ -824,13 +803,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not seek to block %u in file \"%s\": %m",
- blocknum, FilePathName(v->mdfd_vfd))));
-
- nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_WRITE);
+ nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_WRITE);
TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
@@ -1979,7 +1952,7 @@ _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
{
off_t len;
- len = FileSeek(seg->mdfd_vfd, 0L, SEEK_END);
+ len = FileSize(seg->mdfd_vfd);
if (len < 0)
ereport(ERROR,
(errcode_for_file_access(),
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 4094e22776c..1c1e7b74c23 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -443,6 +443,9 @@
/* Define to 1 if the assembler supports PPC's LWARX mutex hint bit. */
#undef HAVE_PPC_LWARX_MUTEX_HINT
+/* Define to 1 if you have the `pread' function. */
+#undef HAVE_PREAD
+
/* Define to 1 if you have the `pstat' function. */
#undef HAVE_PSTAT
@@ -458,6 +461,9 @@
/* Have PTHREAD_PRIO_INHERIT. */
#undef HAVE_PTHREAD_PRIO_INHERIT
+/* Define to 1 if you have the `pwrite' function. */
+#undef HAVE_PWRITE
+
/* Define to 1 if you have the `random' function. */
#undef HAVE_RANDOM
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8e7c9728f4b..f8b6fa8ece5 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -15,7 +15,7 @@
/*
* calls:
*
- * File {Close, Read, Write, Seek, Tell, Sync}
+ * File {Close, Read, Write, Size, Sync}
* {Path Name Open, Allocate, Free} File
*
* These are NOT JUST RENAMINGS OF THE UNIX ROUTINES.
@@ -42,10 +42,6 @@
#include <dirent.h>
-/*
- * FileSeek uses the standard UNIX lseek(2) flags.
- */
-
typedef int File;
@@ -68,10 +64,10 @@ extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fil
extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount, uint32 wait_event_info);
-extern int FileRead(File file, char *buffer, int amount, uint32 wait_event_info);
-extern int FileWrite(File file, char *buffer, int amount, uint32 wait_event_info);
+extern int FileRead(File file, char *buffer, int amount, off_t offset, uint32 wait_event_info);
+extern int FileWrite(File file, char *buffer, int amount, off_t offset, uint32 wait_event_info);
extern int FileSync(File file, uint32 wait_event_info);
-extern off_t FileSeek(File file, off_t offset, int whence);
+extern off_t FileSize(File file);
extern int FileTruncate(File file, off_t offset, uint32 wait_event_info);
extern void FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info);
extern char *FilePathName(File file);
--
2.17.0
On Wed, Sep 19, 2018 at 1:48 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Fri, Sep 7, 2018 at 2:17 AM Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
> > This needs a rebase again.
And again, due to the conflict with ppoll in AC_CHECK_FUNCS.
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
0001-Use-pread-pwrite-instead-of-lseek-read-write-v5.patch (application/octet-stream)
From 7fb80c97650a58e9cf15bb8c6a09e9b034471af4 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Thu, 12 Jul 2018 13:14:02 +1200
Subject: [PATCH] Use pread()/pwrite() instead of lseek() + read()/write().
Cut down on system calls by doing random IO using POSIX.1-2008
offset-based IO routines, where available. Remove the code for
tracking the 'virtual' seek position. The only reason left to
call FileSeek() was to get the file's size, so provide a new
function FileSize() instead.
Author: Oskari Saarenmaa, Thomas Munro
Reviewed-by: Thomas Munro, Jesper Pedersen
Discussion: https://postgr.es/m/CAEepm=02rapCpPR3ZGF2vW=SBHSdFYO_bz_f-wwWJonmA3APgw@mail.gmail.com
Discussion: https://postgr.es/m/b8748d39-0b19-0514-a1b9-4e5a28e6a208%40gmail.com
Discussion: https://postgr.es/m/a86bd200-ebbe-d829-e3ca-0c4474b2fcb7%40ohmu.fi
---
configure | 2 +-
configure.in | 2 +-
src/backend/access/heap/rewriteheap.c | 2 +-
src/backend/access/transam/xlog.c | 13 ++
src/backend/storage/file/buffile.c | 46 +-----
src/backend/storage/file/fd.c | 217 +++++---------------------
src/backend/storage/smgr/md.c | 35 +----
src/include/pg_config.h.in | 6 +
src/include/storage/fd.h | 12 +-
9 files changed, 72 insertions(+), 263 deletions(-)
diff --git a/configure b/configure
index 6414ec1ea6d..d026ba75248 100755
--- a/configure
+++ b/configure
@@ -15100,7 +15100,7 @@ fi
LIBS_including_readline="$LIBS"
LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
-for ac_func in cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimes wcstombs_l
+for ac_func in cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pread pstat pthread_is_threaded_np pwrite readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimes wcstombs_l
do :
as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.in b/configure.in
index 158d5a1ac82..a598e5be04c 100644
--- a/configure.in
+++ b/configure.in
@@ -1571,7 +1571,7 @@ PGAC_FUNC_WCSTOMBS_L
LIBS_including_readline="$LIBS"
LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
-AC_CHECK_FUNCS([cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimes wcstombs_l])
+AC_CHECK_FUNCS([cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pread pstat pthread_is_threaded_np pwrite readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimes wcstombs_l])
AC_REPLACE_FUNCS(fseeko)
case $host_os in
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 85f92973c95..5f573bafda6 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -922,7 +922,7 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
* Note that we deviate from the usual WAL coding practices here,
* check the above "Logical rewrite support" comment for reasoning.
*/
- written = FileWrite(src->vfd, waldata_start, len,
+ written = FileWrite(src->vfd, waldata_start, len, src->off,
WAIT_EVENT_LOGICAL_REWRITE_WRITE);
if (written != len)
ereport(ERROR,
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5abaeb005b3..157c8465bd0 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2484,6 +2484,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
Size nleft;
int written;
+#ifndef HAVE_PWRITE
/* Need to seek in the file? */
if (openLogOff != startoffset)
{
@@ -2495,6 +2496,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
startoffset)));
openLogOff = startoffset;
}
+#endif
/* OK to write the page(s) */
from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2504,7 +2506,11 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
{
errno = 0;
pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
+#ifdef HAVE_PWRITE
+ written = pwrite(openLogFile, from, nleft, startoffset);
+#else
written = write(openLogFile, from, nleft);
+#endif
pgstat_report_wait_end();
if (written <= 0)
{
@@ -2519,6 +2525,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
}
nleft -= written;
from += written;
+ startoffset += written;
} while (nleft > 0);
/* Update state for write */
@@ -11819,6 +11826,7 @@ retry:
/* Read the requested page */
readOff = targetPageOff;
+#ifndef HAVE_PREAD
if (lseek(readFile, (off_t) readOff, SEEK_SET) < 0)
{
char fname[MAXFNAMELEN];
@@ -11832,9 +11840,14 @@ retry:
fname, readOff)));
goto next_record_is_invalid;
}
+#endif
pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+#ifdef HAVE_PREAD
+ r = pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
+#else
r = read(readFile, readBuf, XLOG_BLCKSZ);
+#endif
if (r != XLOG_BLCKSZ)
{
char fname[MAXFNAMELEN];
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index e93813d9737..dd687dfe71f 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -67,12 +67,6 @@ struct BufFile
int numFiles; /* number of physical files in set */
/* all files except the last have length exactly MAX_PHYSICAL_FILESIZE */
File *files; /* palloc'd array with numFiles entries */
- off_t *offsets; /* palloc'd array with numFiles entries */
-
- /*
- * offsets[i] is the current seek position of files[i]. We use this to
- * avoid making redundant FileSeek calls.
- */
bool isInterXact; /* keep open over transactions? */
bool dirty; /* does buffer need to be written? */
@@ -116,7 +110,6 @@ makeBufFileCommon(int nfiles)
BufFile *file = (BufFile *) palloc(sizeof(BufFile));
file->numFiles = nfiles;
- file->offsets = (off_t *) palloc0(sizeof(off_t) * nfiles);
file->isInterXact = false;
file->dirty = false;
file->resowner = CurrentResourceOwner;
@@ -170,10 +163,7 @@ extendBufFile(BufFile *file)
file->files = (File *) repalloc(file->files,
(file->numFiles + 1) * sizeof(File));
- file->offsets = (off_t *) repalloc(file->offsets,
- (file->numFiles + 1) * sizeof(off_t));
file->files[file->numFiles] = pfile;
- file->offsets[file->numFiles] = 0L;
file->numFiles++;
}
@@ -396,7 +386,6 @@ BufFileClose(BufFile *file)
FileClose(file->files[i]);
/* release the buffer space */
pfree(file->files);
- pfree(file->offsets);
pfree(file);
}
@@ -422,27 +411,17 @@ BufFileLoadBuffer(BufFile *file)
file->curOffset = 0L;
}
- /*
- * May need to reposition physical file.
- */
- thisfile = file->files[file->curFile];
- if (file->curOffset != file->offsets[file->curFile])
- {
- if (FileSeek(thisfile, file->curOffset, SEEK_SET) != file->curOffset)
- return; /* seek failed, read nothing */
- file->offsets[file->curFile] = file->curOffset;
- }
-
/*
* Read whatever we can get, up to a full bufferload.
*/
+ thisfile = file->files[file->curFile];
file->nbytes = FileRead(thisfile,
file->buffer.data,
sizeof(file->buffer),
+ file->curOffset,
WAIT_EVENT_BUFFILE_READ);
if (file->nbytes < 0)
file->nbytes = 0;
- file->offsets[file->curFile] += file->nbytes;
/* we choose not to advance curOffset here */
if (file->nbytes > 0)
@@ -491,23 +470,14 @@ BufFileDumpBuffer(BufFile *file)
if ((off_t) bytestowrite > availbytes)
bytestowrite = (int) availbytes;
- /*
- * May need to reposition physical file.
- */
thisfile = file->files[file->curFile];
- if (file->curOffset != file->offsets[file->curFile])
- {
- if (FileSeek(thisfile, file->curOffset, SEEK_SET) != file->curOffset)
- return; /* seek failed, give up */
- file->offsets[file->curFile] = file->curOffset;
- }
bytestowrite = FileWrite(thisfile,
file->buffer.data + wpos,
bytestowrite,
+ file->curOffset,
WAIT_EVENT_BUFFILE_WRITE);
if (bytestowrite <= 0)
return; /* failed to write */
- file->offsets[file->curFile] += bytestowrite;
file->curOffset += bytestowrite;
wpos += bytestowrite;
@@ -803,11 +773,10 @@ BufFileSize(BufFile *file)
{
off_t lastFileSize;
- /* Get the size of the last physical file by seeking to end. */
- lastFileSize = FileSeek(file->files[file->numFiles - 1], 0, SEEK_END);
+ /* Get the size of the last physical file. */
+ lastFileSize = FileSize(file->files[file->numFiles - 1]);
if (lastFileSize < 0)
return -1;
- file->offsets[file->numFiles - 1] = lastFileSize;
return ((file->numFiles - 1) * (off_t) MAX_PHYSICAL_FILESIZE) +
lastFileSize;
@@ -849,13 +818,8 @@ BufFileAppend(BufFile *target, BufFile *source)
target->files = (File *)
repalloc(target->files, sizeof(File) * newNumFiles);
- target->offsets = (off_t *)
- repalloc(target->offsets, sizeof(off_t) * newNumFiles);
for (i = target->numFiles; i < newNumFiles; i++)
- {
target->files[i] = source->files[i - target->numFiles];
- target->offsets[i] = source->offsets[i - target->numFiles];
- }
target->numFiles = newNumFiles;
return startBlock;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 8dd51f17674..a380f794014 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -16,8 +16,8 @@
* including base tables, scratch files (e.g., sort and hash spool
* files), and random calls to C library routines like system(3); it
* is quite easy to exceed system limits on the number of open files a
- * single process can have. (This is around 256 on many modern
- * operating systems, but can be as low as 32 on others.)
+ * single process can have. (This is around 1024 on many modern
+ * operating systems, but may be lower on others.)
*
* VFDs are managed as an LRU pool, with actual OS file descriptors
* being opened and closed as needed. Obviously, if a routine is
@@ -167,15 +167,6 @@ int max_safe_fds = 32; /* default if not changed */
#define FileIsNotOpen(file) (VfdCache[file].fd == VFD_CLOSED)
-/*
- * Note: a VFD's seekPos is normally always valid, but if for some reason
- * an lseek() fails, it might become set to FileUnknownPos. We can struggle
- * along without knowing the seek position in many cases, but in some places
- * we have to fail if we don't have it.
- */
-#define FileUnknownPos ((off_t) -1)
-#define FilePosIsUnknown(pos) ((pos) < 0)
-
/* these are the assigned bits in fdstate below: */
#define FD_DELETE_AT_CLOSE (1 << 0) /* T = delete when closed */
#define FD_CLOSE_AT_EOXACT (1 << 1) /* T = close at eoXact */
@@ -189,7 +180,6 @@ typedef struct vfd
File nextFree; /* link to next free VFD, if in freelist */
File lruMoreRecently; /* doubly linked recency-of-use list */
File lruLessRecently;
- off_t seekPos; /* current logical file position, or -1 */
off_t fileSize; /* current size of file (0 if not temporary) */
char *fileName; /* name of file, or NULL for unused VFD */
/* NB: fileName is malloc'd, and must be free'd when closing the VFD */
@@ -407,9 +397,7 @@ pg_fdatasync(int fd)
/*
* pg_flush_data --- advise OS that the described dirty data should be flushed
*
- * offset of 0 with nbytes 0 means that the entire file should be flushed;
- * in this case, this function may have side-effects on the file's
- * seek position!
+ * offset of 0 with nbytes 0 means that the entire file should be flushed
*/
void
pg_flush_data(int fd, off_t offset, off_t nbytes)
@@ -1029,22 +1017,6 @@ LruDelete(File file)
vfdP = &VfdCache[file];
- /*
- * Normally we should know the seek position, but if for some reason we
- * have lost track of it, try again to get it. If we still can't get it,
- * we have a problem: we will be unable to restore the file seek position
- * when and if the file is re-opened. But we can't really throw an error
- * and refuse to close the file, or activities such as transaction cleanup
- * will be broken.
- */
- if (FilePosIsUnknown(vfdP->seekPos))
- {
- vfdP->seekPos = lseek(vfdP->fd, (off_t) 0, SEEK_CUR);
- if (FilePosIsUnknown(vfdP->seekPos))
- elog(LOG, "could not seek file \"%s\" before closing: %m",
- vfdP->fileName);
- }
-
/*
* Close the file. We aren't expecting this to fail; if it does, better
* to leak the FD than to mess up our internal state.
@@ -1113,33 +1085,6 @@ LruInsert(File file)
{
++nfile;
}
-
- /*
- * Seek to the right position. We need no special case for seekPos
- * equal to FileUnknownPos, as lseek() will certainly reject that
- * (thus completing the logic noted in LruDelete() that we will fail
- * to re-open a file if we couldn't get its seek position before
- * closing).
- */
- if (vfdP->seekPos != (off_t) 0)
- {
- if (lseek(vfdP->fd, vfdP->seekPos, SEEK_SET) < 0)
- {
- /*
- * If we fail to restore the seek position, treat it like an
- * open() failure.
- */
- int save_errno = errno;
-
- elog(LOG, "could not seek file \"%s\" after re-opening: %m",
- vfdP->fileName);
- (void) close(vfdP->fd);
- vfdP->fd = VFD_CLOSED;
- --nfile;
- errno = save_errno;
- return -1;
- }
- }
}
/*
@@ -1406,7 +1351,6 @@ PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
/* Saved flags are adjusted to be OK for re-opening file */
vfdP->fileFlags = fileFlags & ~(O_CREAT | O_TRUNC | O_EXCL);
vfdP->fileMode = fileMode;
- vfdP->seekPos = 0;
vfdP->fileSize = 0;
vfdP->fdstate = 0x0;
vfdP->resowner = NULL;
@@ -1820,7 +1764,6 @@ FileClose(File file)
/*
* FilePrefetch - initiate asynchronous read of a given range of the file.
- * The logical seek position is unaffected.
*
* Currently the only implementation of this function is using posix_fadvise
* which is the simplest standardized interface that accomplishes this.
@@ -1867,10 +1810,6 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
file, VfdCache[file].fileName,
(int64) offset, (int64) nbytes));
- /*
- * Caution: do not call pg_flush_data with nbytes = 0, it could trash the
- * file's seek position. We prefer to define that as a no-op here.
- */
if (nbytes <= 0)
return;
@@ -1884,7 +1823,8 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
}
int
-FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
+FileRead(File file, char *buffer, int amount, off_t offset,
+ uint32 wait_event_info)
{
int returnCode;
Vfd *vfdP;
@@ -1893,7 +1833,7 @@ FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
DO_DB(elog(LOG, "FileRead: %d (%s) " INT64_FORMAT " %d %p",
file, VfdCache[file].fileName,
- (int64) VfdCache[file].seekPos,
+ (int64) offset,
amount, buffer));
returnCode = FileAccess(file);
@@ -1904,16 +1844,16 @@ FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
retry:
pgstat_report_wait_start(wait_event_info);
- returnCode = read(vfdP->fd, buffer, amount);
+#ifdef HAVE_PREAD
+ returnCode = pread(vfdP->fd, buffer, amount, offset);
+#else
+ returnCode = lseek(VfdCache[file].fd, offset, SEEK_SET);
+ if (returnCode >= 0)
+ returnCode = read(vfdP->fd, buffer, amount);
+#endif
pgstat_report_wait_end();
- if (returnCode >= 0)
- {
- /* if seekPos is unknown, leave it that way */
- if (!FilePosIsUnknown(vfdP->seekPos))
- vfdP->seekPos += returnCode;
- }
- else
+ if (returnCode < 0)
{
/*
* Windows may run out of kernel buffers and return "Insufficient
@@ -1939,16 +1879,14 @@ retry:
/* OK to retry if interrupted */
if (errno == EINTR)
goto retry;
-
- /* Trouble, so assume we don't know the file position anymore */
- vfdP->seekPos = FileUnknownPos;
}
return returnCode;
}
int
-FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
+FileWrite(File file, char *buffer, int amount, off_t offset,
+ uint32 wait_event_info)
{
int returnCode;
Vfd *vfdP;
@@ -1957,7 +1895,7 @@ FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
DO_DB(elog(LOG, "FileWrite: %d (%s) " INT64_FORMAT " %d %p",
file, VfdCache[file].fileName,
- (int64) VfdCache[file].seekPos,
+ (int64) offset,
amount, buffer));
returnCode = FileAccess(file);
@@ -1976,26 +1914,13 @@ FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
*/
if (temp_file_limit >= 0 && (vfdP->fdstate & FD_TEMP_FILE_LIMIT))
{
- off_t newPos;
+ off_t past_write = offset + amount;
- /*
- * Normally we should know the seek position, but if for some reason
- * we have lost track of it, try again to get it. Here, it's fine to
- * throw an error if we still can't get it.
- */
- if (FilePosIsUnknown(vfdP->seekPos))
- {
- vfdP->seekPos = lseek(vfdP->fd, (off_t) 0, SEEK_CUR);
- if (FilePosIsUnknown(vfdP->seekPos))
- elog(ERROR, "could not seek file \"%s\": %m", vfdP->fileName);
- }
-
- newPos = vfdP->seekPos + amount;
- if (newPos > vfdP->fileSize)
+ if (past_write > vfdP->fileSize)
{
uint64 newTotal = temporary_files_size;
- newTotal += newPos - vfdP->fileSize;
+ newTotal += past_write - vfdP->fileSize;
if (newTotal > (uint64) temp_file_limit * (uint64) 1024)
ereport(ERROR,
(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
@@ -2007,7 +1932,13 @@ FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
retry:
errno = 0;
pgstat_report_wait_start(wait_event_info);
- returnCode = write(vfdP->fd, buffer, amount);
+#ifdef HAVE_PWRITE
+ returnCode = pwrite(VfdCache[file].fd, buffer, amount, offset);
+#else
+ returnCode = lseek(VfdCache[file].fd, offset, SEEK_SET);
+ if (returnCode >= 0)
+ returnCode = write(VfdCache[file].fd, buffer, amount);
+#endif
pgstat_report_wait_end();
/* if write didn't set errno, assume problem is no disk space */
@@ -2016,10 +1947,6 @@ retry:
if (returnCode >= 0)
{
- /* if seekPos is unknown, leave it that way */
- if (!FilePosIsUnknown(vfdP->seekPos))
- vfdP->seekPos += returnCode;
-
/*
* Maintain fileSize and temporary_files_size if it's a temp file.
*
@@ -2029,12 +1956,12 @@ retry:
*/
if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
{
- off_t newPos = vfdP->seekPos;
+ off_t past_write = offset + amount;
- if (newPos > vfdP->fileSize)
+ if (past_write > vfdP->fileSize)
{
- temporary_files_size += newPos - vfdP->fileSize;
- vfdP->fileSize = newPos;
+ temporary_files_size += past_write - vfdP->fileSize;
+ vfdP->fileSize = past_write;
}
}
}
@@ -2060,9 +1987,6 @@ retry:
/* OK to retry if interrupted */
if (errno == EINTR)
goto retry;
-
- /* Trouble, so assume we don't know the file position anymore */
- vfdP->seekPos = FileUnknownPos;
}
return returnCode;
@@ -2090,92 +2014,25 @@ FileSync(File file, uint32 wait_event_info)
}
off_t
-FileSeek(File file, off_t offset, int whence)
+FileSize(File file)
{
Vfd *vfdP;
Assert(FileIsValid(file));
- DO_DB(elog(LOG, "FileSeek: %d (%s) " INT64_FORMAT " " INT64_FORMAT " %d",
- file, VfdCache[file].fileName,
- (int64) VfdCache[file].seekPos,
- (int64) offset, whence));
+ DO_DB(elog(LOG, "FileSize %d (%s)",
+ file, VfdCache[file].fileName));
vfdP = &VfdCache[file];
if (FileIsNotOpen(file))
{
- switch (whence)
- {
- case SEEK_SET:
- if (offset < 0)
- {
- errno = EINVAL;
- return (off_t) -1;
- }
- vfdP->seekPos = offset;
- break;
- case SEEK_CUR:
- if (FilePosIsUnknown(vfdP->seekPos) ||
- vfdP->seekPos + offset < 0)
- {
- errno = EINVAL;
- return (off_t) -1;
- }
- vfdP->seekPos += offset;
- break;
- case SEEK_END:
- if (FileAccess(file) < 0)
- return (off_t) -1;
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- default:
- elog(ERROR, "invalid whence: %d", whence);
- break;
- }
- }
- else
- {
- switch (whence)
- {
- case SEEK_SET:
- if (offset < 0)
- {
- errno = EINVAL;
- return (off_t) -1;
- }
- if (vfdP->seekPos != offset)
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- case SEEK_CUR:
- if (offset != 0 || FilePosIsUnknown(vfdP->seekPos))
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- case SEEK_END:
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- default:
- elog(ERROR, "invalid whence: %d", whence);
- break;
- }
+ if (FileAccess(file) < 0)
+ return (off_t) -1;
}
- return vfdP->seekPos;
-}
-
-/*
- * XXX not actually used but here for completeness
- */
-#ifdef NOT_USED
-off_t
-FileTell(File file)
-{
- Assert(FileIsValid(file));
- DO_DB(elog(LOG, "FileTell %d (%s)",
- file, VfdCache[file].fileName));
- return VfdCache[file].seekPos;
+ return lseek(VfdCache[file].fd, 0, SEEK_END);
}
-#endif
int
FileTruncate(File file, off_t offset, uint32 wait_event_info)
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index f4374d077be..86013a5c8b2 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -522,22 +522,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- /*
- * Note: because caller usually obtained blocknum by calling mdnblocks,
- * which did a seek(SEEK_END), this seek is often redundant and will be
- * optimized away by fd.c. It's not redundant, however, if there is a
- * partial page at the end of the file. In that case we want to try to
- * overwrite the partial page with a full page. It's also not redundant
- * if bufmgr.c had to dump another buffer of the same file to make room
- * for the new page's buffer.
- */
- if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not seek to block %u in file \"%s\": %m",
- blocknum, FilePathName(v->mdfd_vfd))));
-
- if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
+ if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
{
if (nbytes < 0)
ereport(ERROR,
@@ -748,13 +733,7 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not seek to block %u in file \"%s\": %m",
- blocknum, FilePathName(v->mdfd_vfd))));
-
- nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_READ);
+ nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_READ);
TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
@@ -824,13 +803,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not seek to block %u in file \"%s\": %m",
- blocknum, FilePathName(v->mdfd_vfd))));
-
- nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_WRITE);
+ nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_WRITE);
TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
@@ -1979,7 +1952,7 @@ _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
{
off_t len;
- len = FileSeek(seg->mdfd_vfd, 0L, SEEK_END);
+ len = FileSize(seg->mdfd_vfd);
if (len < 0)
ereport(ERROR,
(errcode_for_file_access(),
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 90dda8ea050..1b02ec0ad90 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -438,6 +438,9 @@
/* Define to 1 if you have the `ppoll' function. */
#undef HAVE_PPOLL
+/* Define to 1 if you have the `pread' function. */
+#undef HAVE_PREAD
+
/* Define to 1 if you have the `pstat' function. */
#undef HAVE_PSTAT
@@ -453,6 +456,9 @@
/* Have PTHREAD_PRIO_INHERIT. */
#undef HAVE_PTHREAD_PRIO_INHERIT
+/* Define to 1 if you have the `pwrite' function. */
+#undef HAVE_PWRITE
+
/* Define to 1 if you have the `random' function. */
#undef HAVE_RANDOM
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8e7c9728f4b..f8b6fa8ece5 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -15,7 +15,7 @@
/*
* calls:
*
- * File {Close, Read, Write, Seek, Tell, Sync}
+ * File {Close, Read, Write, Size, Sync}
* {Path Name Open, Allocate, Free} File
*
* These are NOT JUST RENAMINGS OF THE UNIX ROUTINES.
@@ -42,10 +42,6 @@
#include <dirent.h>
-/*
- * FileSeek uses the standard UNIX lseek(2) flags.
- */
-
typedef int File;
@@ -68,10 +64,10 @@ extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fil
extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount, uint32 wait_event_info);
-extern int FileRead(File file, char *buffer, int amount, uint32 wait_event_info);
-extern int FileWrite(File file, char *buffer, int amount, uint32 wait_event_info);
+extern int FileRead(File file, char *buffer, int amount, off_t offset, uint32 wait_event_info);
+extern int FileWrite(File file, char *buffer, int amount, off_t offset, uint32 wait_event_info);
extern int FileSync(File file, uint32 wait_event_info);
-extern off_t FileSeek(File file, off_t offset, int whence);
+extern off_t FileSize(File file);
extern int FileTruncate(File file, off_t offset, uint32 wait_event_info);
extern void FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info);
extern char *FilePathName(File file);
--
2.17.1 (Apple Git-112)
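For readers skimming the diff: the FileSize() that replaces FileSeek() in fd.h comes down to a single lseek(fd, 0, SEEK_END). Below is a minimal standalone sketch of that idea, using a raw descriptor instead of a VFD. The names file_size_by_seek() and demo_file_size(), and the scratch file under /tmp, are made up for this illustration and are not part of the patch.

```c
#define _XOPEN_SOURCE 700		/* expose pwrite() and mkstemp() */
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for FileSize(): report a file's length by seeking
 * to its end.  The seek moves the descriptor's position, which is harmless
 * only as long as every read and write passes an explicit offset.
 */
static off_t
file_size_by_seek(int fd)
{
	assert(fd >= 0);
	return lseek(fd, 0, SEEK_END);	/* (off_t) -1 on failure, errno is set */
}

/* Write 10 bytes at offset 0 of an unlinked scratch file and return its size. */
static off_t
demo_file_size(void)
{
	char		path[] = "/tmp/filesize_demo_XXXXXX";
	int			fd = mkstemp(path);
	off_t		size = -1;

	if (fd < 0)
		return -1;
	unlink(path);				/* the file goes away once fd is closed */
	if (pwrite(fd, "0123456789", 10, 0) == 10)
		size = file_size_by_seek(fd);
	close(fd);
	return size;
}
```

Calling demo_file_size() should report 10 on a system where /tmp is writable. Because nothing else depends on the descriptor's position any more, the side effect of seeking to the end does not need to be undone.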
Hi Thomas,
On 9/18/18 9:48 PM, Thomas Munro wrote:
It certainly wouldn't hurt... but more pressing to get this committed
would be Windows support IMHO. I think the thing to do is to open
files with the FILE_FLAG_OVERLAPPED flag, and then use ReadFile() and
WriteFile() with an LPOVERLAPPED struct that holds an offset, but I'm
not sure if I can write that myself. I tried doing some semi-serious
Windows development for the fsyncgate patch using only AppVeyor CI a
couple of weeks ago and it was like visiting the dentist.
Sorry, no idea about this. Maybe Magnus can provide some feedback?
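For context on the Windows idea quoted above, here is a rough sketch of a positioned read built on ReadFile() with an OVERLAPPED offset. It has not been tested on Windows and makes assumptions that go beyond the thread: read_at() and demo_read_at() are hypothetical names, the handle is assumed to be an ordinary synchronous one (a handle opened with FILE_FLAG_OVERLAPPED can make ReadFile() report ERROR_IO_PENDING and would need completion handling), and on other platforms the helper simply calls pread().

```c
#define _XOPEN_SOURCE 700		/* expose pread() and fileno() on POSIX systems */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#ifdef _WIN32
#include <windows.h>
#include <io.h>
#else
#include <sys/types.h>
#include <unistd.h>
#endif

/*
 * Hypothetical helper: read "len" bytes starting at "offset" without a
 * separate seek call.  Returns bytes read, 0 at end of file, -1 on error.
 */
static long
read_at(int fd, void *buf, unsigned int len, long long offset)
{
	assert(fd >= 0 && offset >= 0);
#ifdef _WIN32
	{
		OVERLAPPED	ov;
		DWORD		nread = 0;
		HANDLE		h = (HANDLE) _get_osfhandle(fd);

		if (h == INVALID_HANDLE_VALUE)
			return -1;
		memset(&ov, 0, sizeof(ov));
		/* the 64-bit offset is split across two 32-bit fields */
		ov.Offset = (DWORD) (offset & 0xFFFFFFFF);
		ov.OffsetHigh = (DWORD) (offset >> 32);
		if (!ReadFile(h, buf, len, &nread, &ov))
			return (GetLastError() == ERROR_HANDLE_EOF) ? 0 : -1;
		return (long) nread;
	}
#else
	return (long) pread(fd, buf, len, (off_t) offset);
#endif
}

/* Returns 1 if bytes 4..6 of "abcdefgh" come back as "efg". */
static int
demo_read_at(void)
{
	FILE	   *f = tmpfile();
	char		buf[3];
	long		n;

	if (f == NULL)
		return 0;
	fputs("abcdefgh", f);
	fflush(f);					/* push stdio's buffer out before using the fd */
	n = read_at(fileno(f), buf, 3, 4);
	fclose(f);
	return n == 3 && memcmp(buf, "efg", 3) == 0;
}
```

One caveat, as far as I know: on a synchronous handle ReadFile() also updates the file pointer, so the Windows branch is not an exact equivalent of pread(). That should not matter for code that never relies on the implicit position.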
On Thu, Sep 6, 2018 at 6:42 AM Jesper Pedersen
Once resolved the patch passes make check-world, and a strace analysis
shows the associated read()/write() have been turned into
pread64()/pwrite64(). All lseek()'s are SEEK_END's.
Yeah :-)
Thanks for v5 too.
Best regards,
Jesper
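The strace result reported above (read()/write() turned into pread64()/pwrite64(), with only SEEK_END seeks left) follows from how the two styles treat the descriptor's position. The sketch below is an illustration only, not code from the patch: read_with_seek() and demo_position_semantics() are invented names, and the scratch file under /tmp is an assumption.

```c
#define _XOPEN_SOURCE 700		/* expose pread() and mkstemp() */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Two system calls: position the descriptor, then read from that position. */
static ssize_t
read_with_seek(int fd, void *buf, size_t len, off_t offset)
{
	assert(fd >= 0);
	/* compare as off_t; narrowing lseek()'s result to int could mangle large offsets */
	if (lseek(fd, offset, SEEK_SET) == (off_t) -1)
		return -1;
	return read(fd, buf, len);
}

/*
 * Returns 1 if both approaches read the same bytes and only the seek-based
 * one moved the descriptor's position.
 */
static int
demo_position_semantics(void)
{
	char		path[] = "/tmp/pread_demo_XXXXXX";
	int			fd = mkstemp(path);
	char		a[4];
	char		b[4];
	off_t		pos_after_pread;
	off_t		pos_after_seek_read;
	int			ok = 0;

	if (fd < 0)
		return 0;
	unlink(path);				/* the file goes away once fd is closed */
	if (write(fd, "abcdefgh", 8) == 8 &&	/* position is now 8 */
		pread(fd, a, 4, 2) == 4)			/* one system call, position untouched */
	{
		pos_after_pread = lseek(fd, 0, SEEK_CUR);
		if (read_with_seek(fd, b, 4, 2) == 4)
		{
			pos_after_seek_read = lseek(fd, 0, SEEK_CUR);
			ok = memcmp(a, b, 4) == 0 &&
				pos_after_pread == 8 &&
				pos_after_seek_read == 6;
		}
	}
	close(fd);
	return ok;
}
```

Since pread() neither reads nor writes the shared position, there is nothing for fd.c to track or restore when a descriptor is closed and reopened behind the scenes.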
On Fri, Sep 28, 2018 at 2:03 AM Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
Thanks for v5 too.
Rebased again. Patches that touch AC_CHECK_FUNCS are fun like that!
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
0001-Use-pread-pwrite-instead-of-lseek-read-write-v6.patch
From e5cd7f0f8af107646613c9d22a6a399340887b56 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Thu, 12 Jul 2018 13:14:02 +1200
Subject: [PATCH] Use pread()/pwrite() instead of lseek() + read()/write().
Cut down on system calls by doing random IO using POSIX.1-2008
offset-based IO routines, where available. Remove the code for
tracking the 'virtual' seek position. The only reason left to
call FileSeek() was to get the file's size, so provide a new
function FileSize() instead.
Author: Oskari Saarenmaa, Thomas Munro
Reviewed-by: Thomas Munro, Jesper Pedersen
Discussion: https://postgr.es/m/CAEepm=02rapCpPR3ZGF2vW=SBHSdFYO_bz_f-wwWJonmA3APgw@mail.gmail.com
Discussion: https://postgr.es/m/b8748d39-0b19-0514-a1b9-4e5a28e6a208%40gmail.com
Discussion: https://postgr.es/m/a86bd200-ebbe-d829-e3ca-0c4474b2fcb7%40ohmu.fi
---
configure | 2 +-
configure.in | 2 +-
src/backend/access/heap/rewriteheap.c | 2 +-
src/backend/access/transam/xlog.c | 13 ++
src/backend/storage/file/buffile.c | 46 +-----
src/backend/storage/file/fd.c | 217 +++++---------------------
src/backend/storage/smgr/md.c | 35 +----
src/include/pg_config.h.in | 6 +
src/include/storage/fd.h | 12 +-
9 files changed, 72 insertions(+), 263 deletions(-)
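A note on the xlog.c part of the change summarised above: once write() becomes pwrite(), a loop that handles short writes has to advance the offset itself, because the kernel no longer keeps a position on the caller's behalf. Here is a minimal standalone sketch of such a loop. write_all_at() and demo_write_all_at() are hypothetical names, and the ENOSPC fallback only imitates the "assume problem is no disk space" convention visible in the patch.

```c
#define _XOPEN_SOURCE 700		/* expose pread(), pwrite() and mkstemp() */
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all "nleft" bytes at "offset", retrying after short writes and EINTR. */
static int
write_all_at(int fd, const char *from, size_t nleft, off_t offset)
{
	assert(fd >= 0);
	while (nleft > 0)
	{
		ssize_t		written;

		errno = 0;
		written = pwrite(fd, from, nleft, offset);
		if (written <= 0)
		{
			if (errno == EINTR)
				continue;		/* interrupted before writing anything: retry */
			if (errno == 0)
				errno = ENOSPC; /* no errno set: assume we ran out of disk space */
			return -1;
		}
		nleft -= (size_t) written;
		from += written;
		offset += written;		/* pwrite() does not advance any position for us */
	}
	return 0;
}

/* Returns 1 if 11 bytes written at offset 100 read back intact and the size is 111. */
static int
demo_write_all_at(void)
{
	char		path[] = "/tmp/pwrite_demo_XXXXXX";
	int			fd = mkstemp(path);
	char		back[11];
	int			ok = 0;

	if (fd < 0)
		return 0;
	unlink(path);				/* the file goes away once fd is closed */
	if (write_all_at(fd, "hello world", 11, 100) == 0 &&
		pread(fd, back, 11, 100) == 11)
		ok = memcmp(back, "hello world", 11) == 0 &&
			lseek(fd, 0, SEEK_END) == 111;
	close(fd);
	return ok;
}
```

Forgetting the offset += written step would make every retry overwrite the same range, which is the kind of mistake the added startoffset update in XLogWrite() guards against.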
diff --git a/configure b/configure
index 0448c6bfebf..a0ee8c569cf 100755
--- a/configure
+++ b/configure
@@ -15100,7 +15100,7 @@ fi
LIBS_including_readline="$LIBS"
LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
-for ac_func in cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open strchrnul symlink sync_file_range utime utimes wcstombs_l
+for ac_func in cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pread pstat pthread_is_threaded_np pwrite readlink setproctitle setproctitle_fast setsid shm_open strchrnul symlink sync_file_range utime utimes wcstombs_l
do :
as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.in b/configure.in
index 23b5bb867bb..a4531ff3036 100644
--- a/configure.in
+++ b/configure.in
@@ -1571,7 +1571,7 @@ PGAC_FUNC_WCSTOMBS_L
LIBS_including_readline="$LIBS"
LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
-AC_CHECK_FUNCS([cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open strchrnul symlink sync_file_range utime utimes wcstombs_l])
+AC_CHECK_FUNCS([cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pread pstat pthread_is_threaded_np pwrite readlink setproctitle setproctitle_fast setsid shm_open strchrnul symlink sync_file_range utime utimes wcstombs_l])
AC_REPLACE_FUNCS(fseeko)
case $host_os in
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 85f92973c95..5f573bafda6 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -922,7 +922,7 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
* Note that we deviate from the usual WAL coding practices here,
* check the above "Logical rewrite support" comment for reasoning.
*/
- written = FileWrite(src->vfd, waldata_start, len,
+ written = FileWrite(src->vfd, waldata_start, len, src->off,
WAIT_EVENT_LOGICAL_REWRITE_WRITE);
if (written != len)
ereport(ERROR,
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7375a78ffcf..c5b99c93237 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2484,6 +2484,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
Size nleft;
int written;
+#ifndef HAVE_PWRITE
/* Need to seek in the file? */
if (openLogOff != startoffset)
{
@@ -2495,6 +2496,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
startoffset)));
openLogOff = startoffset;
}
+#endif
/* OK to write the page(s) */
from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2504,7 +2506,11 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
{
errno = 0;
pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
+#ifdef HAVE_PWRITE
+ written = pwrite(openLogFile, from, nleft, startoffset);
+#else
written = write(openLogFile, from, nleft);
+#endif
pgstat_report_wait_end();
if (written <= 0)
{
@@ -2519,6 +2525,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
}
nleft -= written;
from += written;
+ startoffset += written;
} while (nleft > 0);
/* Update state for write */
@@ -11827,6 +11834,7 @@ retry:
/* Read the requested page */
readOff = targetPageOff;
+#ifndef HAVE_PREAD
if (lseek(readFile, (off_t) readOff, SEEK_SET) < 0)
{
char fname[MAXFNAMELEN];
@@ -11840,9 +11848,14 @@ retry:
fname, readOff)));
goto next_record_is_invalid;
}
+#endif
pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+#ifdef HAVE_PREAD
+ r = pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
+#else
r = read(readFile, readBuf, XLOG_BLCKSZ);
+#endif
if (r != XLOG_BLCKSZ)
{
char fname[MAXFNAMELEN];
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index e93813d9737..dd687dfe71f 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -67,12 +67,6 @@ struct BufFile
int numFiles; /* number of physical files in set */
/* all files except the last have length exactly MAX_PHYSICAL_FILESIZE */
File *files; /* palloc'd array with numFiles entries */
- off_t *offsets; /* palloc'd array with numFiles entries */
-
- /*
- * offsets[i] is the current seek position of files[i]. We use this to
- * avoid making redundant FileSeek calls.
- */
bool isInterXact; /* keep open over transactions? */
bool dirty; /* does buffer need to be written? */
@@ -116,7 +110,6 @@ makeBufFileCommon(int nfiles)
BufFile *file = (BufFile *) palloc(sizeof(BufFile));
file->numFiles = nfiles;
- file->offsets = (off_t *) palloc0(sizeof(off_t) * nfiles);
file->isInterXact = false;
file->dirty = false;
file->resowner = CurrentResourceOwner;
@@ -170,10 +163,7 @@ extendBufFile(BufFile *file)
file->files = (File *) repalloc(file->files,
(file->numFiles + 1) * sizeof(File));
- file->offsets = (off_t *) repalloc(file->offsets,
- (file->numFiles + 1) * sizeof(off_t));
file->files[file->numFiles] = pfile;
- file->offsets[file->numFiles] = 0L;
file->numFiles++;
}
@@ -396,7 +386,6 @@ BufFileClose(BufFile *file)
FileClose(file->files[i]);
/* release the buffer space */
pfree(file->files);
- pfree(file->offsets);
pfree(file);
}
@@ -422,27 +411,17 @@ BufFileLoadBuffer(BufFile *file)
file->curOffset = 0L;
}
- /*
- * May need to reposition physical file.
- */
- thisfile = file->files[file->curFile];
- if (file->curOffset != file->offsets[file->curFile])
- {
- if (FileSeek(thisfile, file->curOffset, SEEK_SET) != file->curOffset)
- return; /* seek failed, read nothing */
- file->offsets[file->curFile] = file->curOffset;
- }
-
/*
* Read whatever we can get, up to a full bufferload.
*/
+ thisfile = file->files[file->curFile];
file->nbytes = FileRead(thisfile,
file->buffer.data,
sizeof(file->buffer),
+ file->curOffset,
WAIT_EVENT_BUFFILE_READ);
if (file->nbytes < 0)
file->nbytes = 0;
- file->offsets[file->curFile] += file->nbytes;
/* we choose not to advance curOffset here */
if (file->nbytes > 0)
@@ -491,23 +470,14 @@ BufFileDumpBuffer(BufFile *file)
if ((off_t) bytestowrite > availbytes)
bytestowrite = (int) availbytes;
- /*
- * May need to reposition physical file.
- */
thisfile = file->files[file->curFile];
- if (file->curOffset != file->offsets[file->curFile])
- {
- if (FileSeek(thisfile, file->curOffset, SEEK_SET) != file->curOffset)
- return; /* seek failed, give up */
- file->offsets[file->curFile] = file->curOffset;
- }
bytestowrite = FileWrite(thisfile,
file->buffer.data + wpos,
bytestowrite,
+ file->curOffset,
WAIT_EVENT_BUFFILE_WRITE);
if (bytestowrite <= 0)
return; /* failed to write */
- file->offsets[file->curFile] += bytestowrite;
file->curOffset += bytestowrite;
wpos += bytestowrite;
@@ -803,11 +773,10 @@ BufFileSize(BufFile *file)
{
off_t lastFileSize;
- /* Get the size of the last physical file by seeking to end. */
- lastFileSize = FileSeek(file->files[file->numFiles - 1], 0, SEEK_END);
+ /* Get the size of the last physical file. */
+ lastFileSize = FileSize(file->files[file->numFiles - 1]);
if (lastFileSize < 0)
return -1;
- file->offsets[file->numFiles - 1] = lastFileSize;
return ((file->numFiles - 1) * (off_t) MAX_PHYSICAL_FILESIZE) +
lastFileSize;
@@ -849,13 +818,8 @@ BufFileAppend(BufFile *target, BufFile *source)
target->files = (File *)
repalloc(target->files, sizeof(File) * newNumFiles);
- target->offsets = (off_t *)
- repalloc(target->offsets, sizeof(off_t) * newNumFiles);
for (i = target->numFiles; i < newNumFiles; i++)
- {
target->files[i] = source->files[i - target->numFiles];
- target->offsets[i] = source->offsets[i - target->numFiles];
- }
target->numFiles = newNumFiles;
return startBlock;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 8dd51f17674..a380f794014 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -16,8 +16,8 @@
* including base tables, scratch files (e.g., sort and hash spool
* files), and random calls to C library routines like system(3); it
* is quite easy to exceed system limits on the number of open files a
- * single process can have. (This is around 256 on many modern
- * operating systems, but can be as low as 32 on others.)
+ * single process can have. (This is around 1024 on many modern
+ * operating systems, but may be lower on others.)
*
* VFDs are managed as an LRU pool, with actual OS file descriptors
* being opened and closed as needed. Obviously, if a routine is
@@ -167,15 +167,6 @@ int max_safe_fds = 32; /* default if not changed */
#define FileIsNotOpen(file) (VfdCache[file].fd == VFD_CLOSED)
-/*
- * Note: a VFD's seekPos is normally always valid, but if for some reason
- * an lseek() fails, it might become set to FileUnknownPos. We can struggle
- * along without knowing the seek position in many cases, but in some places
- * we have to fail if we don't have it.
- */
-#define FileUnknownPos ((off_t) -1)
-#define FilePosIsUnknown(pos) ((pos) < 0)
-
/* these are the assigned bits in fdstate below: */
#define FD_DELETE_AT_CLOSE (1 << 0) /* T = delete when closed */
#define FD_CLOSE_AT_EOXACT (1 << 1) /* T = close at eoXact */
@@ -189,7 +180,6 @@ typedef struct vfd
File nextFree; /* link to next free VFD, if in freelist */
File lruMoreRecently; /* doubly linked recency-of-use list */
File lruLessRecently;
- off_t seekPos; /* current logical file position, or -1 */
off_t fileSize; /* current size of file (0 if not temporary) */
char *fileName; /* name of file, or NULL for unused VFD */
/* NB: fileName is malloc'd, and must be free'd when closing the VFD */
@@ -407,9 +397,7 @@ pg_fdatasync(int fd)
/*
* pg_flush_data --- advise OS that the described dirty data should be flushed
*
- * offset of 0 with nbytes 0 means that the entire file should be flushed;
- * in this case, this function may have side-effects on the file's
- * seek position!
+ * offset of 0 with nbytes 0 means that the entire file should be flushed
*/
void
pg_flush_data(int fd, off_t offset, off_t nbytes)
@@ -1029,22 +1017,6 @@ LruDelete(File file)
vfdP = &VfdCache[file];
- /*
- * Normally we should know the seek position, but if for some reason we
- * have lost track of it, try again to get it. If we still can't get it,
- * we have a problem: we will be unable to restore the file seek position
- * when and if the file is re-opened. But we can't really throw an error
- * and refuse to close the file, or activities such as transaction cleanup
- * will be broken.
- */
- if (FilePosIsUnknown(vfdP->seekPos))
- {
- vfdP->seekPos = lseek(vfdP->fd, (off_t) 0, SEEK_CUR);
- if (FilePosIsUnknown(vfdP->seekPos))
- elog(LOG, "could not seek file \"%s\" before closing: %m",
- vfdP->fileName);
- }
-
/*
* Close the file. We aren't expecting this to fail; if it does, better
* to leak the FD than to mess up our internal state.
@@ -1113,33 +1085,6 @@ LruInsert(File file)
{
++nfile;
}
-
- /*
- * Seek to the right position. We need no special case for seekPos
- * equal to FileUnknownPos, as lseek() will certainly reject that
- * (thus completing the logic noted in LruDelete() that we will fail
- * to re-open a file if we couldn't get its seek position before
- * closing).
- */
- if (vfdP->seekPos != (off_t) 0)
- {
- if (lseek(vfdP->fd, vfdP->seekPos, SEEK_SET) < 0)
- {
- /*
- * If we fail to restore the seek position, treat it like an
- * open() failure.
- */
- int save_errno = errno;
-
- elog(LOG, "could not seek file \"%s\" after re-opening: %m",
- vfdP->fileName);
- (void) close(vfdP->fd);
- vfdP->fd = VFD_CLOSED;
- --nfile;
- errno = save_errno;
- return -1;
- }
- }
}
/*
@@ -1406,7 +1351,6 @@ PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
/* Saved flags are adjusted to be OK for re-opening file */
vfdP->fileFlags = fileFlags & ~(O_CREAT | O_TRUNC | O_EXCL);
vfdP->fileMode = fileMode;
- vfdP->seekPos = 0;
vfdP->fileSize = 0;
vfdP->fdstate = 0x0;
vfdP->resowner = NULL;
@@ -1820,7 +1764,6 @@ FileClose(File file)
/*
* FilePrefetch - initiate asynchronous read of a given range of the file.
- * The logical seek position is unaffected.
*
* Currently the only implementation of this function is using posix_fadvise
* which is the simplest standardized interface that accomplishes this.
@@ -1867,10 +1810,6 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
file, VfdCache[file].fileName,
(int64) offset, (int64) nbytes));
- /*
- * Caution: do not call pg_flush_data with nbytes = 0, it could trash the
- * file's seek position. We prefer to define that as a no-op here.
- */
if (nbytes <= 0)
return;
@@ -1884,7 +1823,8 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
}
int
-FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
+FileRead(File file, char *buffer, int amount, off_t offset,
+ uint32 wait_event_info)
{
int returnCode;
Vfd *vfdP;
@@ -1893,7 +1833,7 @@ FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
DO_DB(elog(LOG, "FileRead: %d (%s) " INT64_FORMAT " %d %p",
file, VfdCache[file].fileName,
- (int64) VfdCache[file].seekPos,
+ (int64) offset,
amount, buffer));
returnCode = FileAccess(file);
@@ -1904,16 +1844,16 @@ FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
retry:
pgstat_report_wait_start(wait_event_info);
- returnCode = read(vfdP->fd, buffer, amount);
+#ifdef HAVE_PREAD
+ returnCode = pread(vfdP->fd, buffer, amount, offset);
+#else
+ returnCode = lseek(VfdCache[file].fd, offset, SEEK_SET);
+ if (returnCode >= 0)
+ returnCode = read(vfdP->fd, buffer, amount);
+#endif
pgstat_report_wait_end();
- if (returnCode >= 0)
- {
- /* if seekPos is unknown, leave it that way */
- if (!FilePosIsUnknown(vfdP->seekPos))
- vfdP->seekPos += returnCode;
- }
- else
+ if (returnCode < 0)
{
/*
* Windows may run out of kernel buffers and return "Insufficient
@@ -1939,16 +1879,14 @@ retry:
/* OK to retry if interrupted */
if (errno == EINTR)
goto retry;
-
- /* Trouble, so assume we don't know the file position anymore */
- vfdP->seekPos = FileUnknownPos;
}
return returnCode;
}
int
-FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
+FileWrite(File file, char *buffer, int amount, off_t offset,
+ uint32 wait_event_info)
{
int returnCode;
Vfd *vfdP;
@@ -1957,7 +1895,7 @@ FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
DO_DB(elog(LOG, "FileWrite: %d (%s) " INT64_FORMAT " %d %p",
file, VfdCache[file].fileName,
- (int64) VfdCache[file].seekPos,
+ (int64) offset,
amount, buffer));
returnCode = FileAccess(file);
@@ -1976,26 +1914,13 @@ FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
*/
if (temp_file_limit >= 0 && (vfdP->fdstate & FD_TEMP_FILE_LIMIT))
{
- off_t newPos;
+ off_t past_write = offset + amount;
- /*
- * Normally we should know the seek position, but if for some reason
- * we have lost track of it, try again to get it. Here, it's fine to
- * throw an error if we still can't get it.
- */
- if (FilePosIsUnknown(vfdP->seekPos))
- {
- vfdP->seekPos = lseek(vfdP->fd, (off_t) 0, SEEK_CUR);
- if (FilePosIsUnknown(vfdP->seekPos))
- elog(ERROR, "could not seek file \"%s\": %m", vfdP->fileName);
- }
-
- newPos = vfdP->seekPos + amount;
- if (newPos > vfdP->fileSize)
+ if (past_write > vfdP->fileSize)
{
uint64 newTotal = temporary_files_size;
- newTotal += newPos - vfdP->fileSize;
+ newTotal += past_write - vfdP->fileSize;
if (newTotal > (uint64) temp_file_limit * (uint64) 1024)
ereport(ERROR,
(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
@@ -2007,7 +1932,13 @@ FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
retry:
errno = 0;
pgstat_report_wait_start(wait_event_info);
- returnCode = write(vfdP->fd, buffer, amount);
+#ifdef HAVE_PWRITE
+ returnCode = pwrite(VfdCache[file].fd, buffer, amount, offset);
+#else
+ returnCode = lseek(VfdCache[file].fd, offset, SEEK_SET);
+ if (returnCode >= 0)
+ returnCode = write(VfdCache[file].fd, buffer, amount);
+#endif
pgstat_report_wait_end();
/* if write didn't set errno, assume problem is no disk space */
@@ -2016,10 +1947,6 @@ retry:
if (returnCode >= 0)
{
- /* if seekPos is unknown, leave it that way */
- if (!FilePosIsUnknown(vfdP->seekPos))
- vfdP->seekPos += returnCode;
-
/*
* Maintain fileSize and temporary_files_size if it's a temp file.
*
@@ -2029,12 +1956,12 @@ retry:
*/
if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
{
- off_t newPos = vfdP->seekPos;
+ off_t past_write = offset + amount;
- if (newPos > vfdP->fileSize)
+ if (past_write > vfdP->fileSize)
{
- temporary_files_size += newPos - vfdP->fileSize;
- vfdP->fileSize = newPos;
+ temporary_files_size += past_write - vfdP->fileSize;
+ vfdP->fileSize = past_write;
}
}
}
@@ -2060,9 +1987,6 @@ retry:
/* OK to retry if interrupted */
if (errno == EINTR)
goto retry;
-
- /* Trouble, so assume we don't know the file position anymore */
- vfdP->seekPos = FileUnknownPos;
}
return returnCode;
@@ -2090,92 +2014,25 @@ FileSync(File file, uint32 wait_event_info)
}
off_t
-FileSeek(File file, off_t offset, int whence)
+FileSize(File file)
{
Vfd *vfdP;
Assert(FileIsValid(file));
- DO_DB(elog(LOG, "FileSeek: %d (%s) " INT64_FORMAT " " INT64_FORMAT " %d",
- file, VfdCache[file].fileName,
- (int64) VfdCache[file].seekPos,
- (int64) offset, whence));
+ DO_DB(elog(LOG, "FileSize %d (%s)",
+ file, VfdCache[file].fileName));
vfdP = &VfdCache[file];
if (FileIsNotOpen(file))
{
- switch (whence)
- {
- case SEEK_SET:
- if (offset < 0)
- {
- errno = EINVAL;
- return (off_t) -1;
- }
- vfdP->seekPos = offset;
- break;
- case SEEK_CUR:
- if (FilePosIsUnknown(vfdP->seekPos) ||
- vfdP->seekPos + offset < 0)
- {
- errno = EINVAL;
- return (off_t) -1;
- }
- vfdP->seekPos += offset;
- break;
- case SEEK_END:
- if (FileAccess(file) < 0)
- return (off_t) -1;
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- default:
- elog(ERROR, "invalid whence: %d", whence);
- break;
- }
- }
- else
- {
- switch (whence)
- {
- case SEEK_SET:
- if (offset < 0)
- {
- errno = EINVAL;
- return (off_t) -1;
- }
- if (vfdP->seekPos != offset)
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- case SEEK_CUR:
- if (offset != 0 || FilePosIsUnknown(vfdP->seekPos))
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- case SEEK_END:
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- default:
- elog(ERROR, "invalid whence: %d", whence);
- break;
- }
+ if (FileAccess(file) < 0)
+ return (off_t) -1;
}
- return vfdP->seekPos;
-}
-
-/*
- * XXX not actually used but here for completeness
- */
-#ifdef NOT_USED
-off_t
-FileTell(File file)
-{
- Assert(FileIsValid(file));
- DO_DB(elog(LOG, "FileTell %d (%s)",
- file, VfdCache[file].fileName));
- return VfdCache[file].seekPos;
+ return lseek(VfdCache[file].fd, 0, SEEK_END);
}
-#endif
int
FileTruncate(File file, off_t offset, uint32 wait_event_info)
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index f4374d077be..86013a5c8b2 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -522,22 +522,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- /*
- * Note: because caller usually obtained blocknum by calling mdnblocks,
- * which did a seek(SEEK_END), this seek is often redundant and will be
- * optimized away by fd.c. It's not redundant, however, if there is a
- * partial page at the end of the file. In that case we want to try to
- * overwrite the partial page with a full page. It's also not redundant
- * if bufmgr.c had to dump another buffer of the same file to make room
- * for the new page's buffer.
- */
- if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not seek to block %u in file \"%s\": %m",
- blocknum, FilePathName(v->mdfd_vfd))));
-
- if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
+ if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
{
if (nbytes < 0)
ereport(ERROR,
@@ -748,13 +733,7 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not seek to block %u in file \"%s\": %m",
- blocknum, FilePathName(v->mdfd_vfd))));
-
- nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_READ);
+ nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_READ);
TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
@@ -824,13 +803,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not seek to block %u in file \"%s\": %m",
- blocknum, FilePathName(v->mdfd_vfd))));
-
- nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_WRITE);
+ nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_WRITE);
TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
@@ -1979,7 +1952,7 @@ _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
{
off_t len;
- len = FileSeek(seg->mdfd_vfd, 0L, SEEK_END);
+ len = FileSize(seg->mdfd_vfd);
if (len < 0)
ereport(ERROR,
(errcode_for_file_access(),
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 7894caa8c12..660dec5afb4 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -438,6 +438,9 @@
/* Define to 1 if you have the `ppoll' function. */
#undef HAVE_PPOLL
+/* Define to 1 if you have the `pread' function. */
+#undef HAVE_PREAD
+
/* Define to 1 if you have the `pstat' function. */
#undef HAVE_PSTAT
@@ -453,6 +456,9 @@
/* Have PTHREAD_PRIO_INHERIT. */
#undef HAVE_PTHREAD_PRIO_INHERIT
+/* Define to 1 if you have the `pwrite' function. */
+#undef HAVE_PWRITE
+
/* Define to 1 if you have the `random' function. */
#undef HAVE_RANDOM
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8e7c9728f4b..f8b6fa8ece5 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -15,7 +15,7 @@
/*
* calls:
*
- * File {Close, Read, Write, Seek, Tell, Sync}
+ * File {Close, Read, Write, Size, Sync}
* {Path Name Open, Allocate, Free} File
*
* These are NOT JUST RENAMINGS OF THE UNIX ROUTINES.
@@ -42,10 +42,6 @@
#include <dirent.h>
-/*
- * FileSeek uses the standard UNIX lseek(2) flags.
- */
-
typedef int File;
@@ -68,10 +64,10 @@ extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fil
extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount, uint32 wait_event_info);
-extern int FileRead(File file, char *buffer, int amount, uint32 wait_event_info);
-extern int FileWrite(File file, char *buffer, int amount, uint32 wait_event_info);
+extern int FileRead(File file, char *buffer, int amount, off_t offset, uint32 wait_event_info);
+extern int FileWrite(File file, char *buffer, int amount, off_t offset, uint32 wait_event_info);
extern int FileSync(File file, uint32 wait_event_info);
-extern off_t FileSeek(File file, off_t offset, int whence);
+extern off_t FileSize(File file);
extern int FileTruncate(File file, off_t offset, uint32 wait_event_info);
extern void FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info);
extern char *FilePathName(File file);
--
2.17.1 (Apple Git-112)
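For anyone skimming the diff rather than applying it, here is a minimal standalone sketch of the idea behind the new FileRead() and FileSize(): read at an explicit offset with pread() where configure found it, otherwise fall back to lseek() + read(), and get the file size with lseek(SEEK_END). The helper names read_at(), size_of() and read_at_demo() are made up for illustration and are not in the patch; only HAVE_PREAD comes from it.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): read up to "len" bytes at
 * "offset".  HAVE_PREAD is the configure-generated macro the patch adds.
 */
static ssize_t
read_at(int fd, void *buf, size_t len, off_t offset)
{
#ifdef HAVE_PREAD
    /* One syscall; the descriptor's file position is left untouched. */
    return pread(fd, buf, len, offset);
#else
    /* Two syscalls; this moves the descriptor's file position. */
    if (lseek(fd, offset, SEEK_SET) < 0)
        return -1;
    return read(fd, buf, len);
#endif
}

/* Hypothetical helper: file size via lseek(SEEK_END), like FileSize(). */
static off_t
size_of(int fd)
{
    return lseek(fd, 0, SEEK_END);
}

/* Self-check against an unlinked scratch file; returns 0 on success. */
int
read_at_demo(void)
{
    char        path[] = "/tmp/read_at_demoXXXXXX";
    char        buf[4];
    ssize_t     n;
    int         fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);

    n = write(fd, "0123456789", 10);
    assert(n == 10);
    assert(size_of(fd) == 10);

    /* Explicit offsets: the order of the reads does not matter. */
    n = read_at(fd, buf, sizeof(buf), 3);
    assert(n == 4 && memcmp(buf, "3456", 4) == 0);
    n = read_at(fd, buf, sizeof(buf), 0);
    assert(n == 4 && memcmp(buf, "0123", 4) == 0);

    close(fd);
    return 0;
}
```

Built standalone, HAVE_PREAD is undefined and the fallback branch is what gets compiled; pass -DHAVE_PREAD to exercise the pread() branch. Callers see the same results either way, since nothing depends on the implicit file position any more.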
Thomas Munro <thomas.munro@enterprisedb.com> writes:
Rebased again. Patches that touch AC_CHECK_FUNCS are fun like that!
Yeah, I've been burnt by that too recently. It occurs to me we could make
that at least a little less painful if we formatted the macro with one
line per function name:
AC_CHECK_FUNCS([
cbrt
clock_gettime
fdatasync
...
wcstombs_l
])
You'd still get conflicts in configure itself, of course, but that
doesn't require manual work to resolve -- just re-run autoconf.
regards, tom lane
On Tue, Oct 9, 2018 at 2:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
Rebased again. Patches that touch AC_CHECK_FUNCS are fun like that!
Yeah, I've been burnt by that too recently. It occurs to me we could make
that at least a little less painful if we formatted the macro with one
line per function name:
AC_CHECK_FUNCS([
cbrt
clock_gettime
fdatasync
...
wcstombs_l
])
You'd still get conflicts in configure itself, of course, but that
doesn't require manual work to resolve -- just re-run autoconf.
+1, was about to suggest the same!
--
Thomas Munro
http://www.enterprisedb.com
Thomas Munro <thomas.munro@enterprisedb.com> writes:
On Tue, Oct 9, 2018 at 2:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Yeah, I've been burnt by that too recently. It occurs to me we could make
that at least a little less painful if we formatted the macro with one
line per function name:
AC_CHECK_FUNCS([
cbrt
clock_gettime
fdatasync
...
wcstombs_l
])
You'd still get conflicts in configure itself, of course, but that
doesn't require manual work to resolve -- just re-run autoconf.
+1, was about to suggest the same!
Sold, I'll go do it.
regards, tom lane
I wrote:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
On Tue, Oct 9, 2018 at 2:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Yeah, I've been burnt by that too recently. It occurs to me we could make
that at least a little less painful if we formatted the macro with one
line per function name:
+1, was about to suggest the same!
Sold, I'll go do it.
Learned a few new things about M4 along the way :-( ... but done.
You'll need to rebase the pread patch again of course.
regards, tom lane
On 10/08/2018 09:55 PM, Tom Lane wrote:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
Rebased again. Patches that touch AC_CHECK_FUNCS are fun like that!
Yeah, I've been burnt by that too recently. It occurs to me we could make
that at least a little less painful if we formatted the macro with one
line per function name:
AC_CHECK_FUNCS([
cbrt
clock_gettime
fdatasync
...
wcstombs_l
])
You'd still get conflicts in configure itself, of course, but that
doesn't require manual work to resolve -- just re-run autoconf.
By and large I think it's better not to submit patches with changes to
configure, but to let the committer run autoconf.
You can avoid getting such changes in your patches by doing something
like this:
git config diff.nodiff.command /bin/true
echo configure diff=nodiff >> .git/info/attributes
If you actually want to turn this off and see any diffs in configure, run
git diff --no-ext-diff
It's also possible to supply a filter expression to 'git diff'.
OTOH, this will probably confuse the heck out of the cfbot patch checker.
cheers
andrew
--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-10-09 14:32:29 -0400, Andrew Dunstan wrote:
On 10/08/2018 09:55 PM, Tom Lane wrote:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
Rebased again. Patches that touch AC_CHECK_FUNCS are fun like that!
Yeah, I've been burnt by that too recently. It occurs to me we could make
that at least a little less painful if we formatted the macro with one
line per function name:
AC_CHECK_FUNCS([
cbrt
clock_gettime
fdatasync
...
wcstombs_l
])
You'd still get conflicts in configure itself, of course, but that
doesn't require manual work to resolve -- just re-run autoconf.
By and large I think it's better not to submit patches with changes to
configure, but to let the committer run autoconf.
OTOH, this will probably confuse the heck out of the cfbot patch checker.
And make life harder for reviewers.
-1 on this one.
Greetings,
Andres Freund
On 10/09/2018 02:37 PM, Andres Freund wrote:
On 2018-10-09 14:32:29 -0400, Andrew Dunstan wrote:
On 10/08/2018 09:55 PM, Tom Lane wrote:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
Rebased again. Patches that touch AC_CHECK_FUNCS are fun like that!
Yeah, I've been burnt by that too recently. It occurs to me we could make
that at least a little less painful if we formatted the macro with one
line per function name:
AC_CHECK_FUNCS([
cbrt
clock_gettime
fdatasync
...
wcstombs_l
])
You'd still get conflicts in configure itself, of course, but that
doesn't require manual work to resolve -- just re-run autoconf.
By and large I think it's better not to submit patches with changes to
configure, but to let the committer run autoconf.
OTOH, this will probably confuse the heck out of the cfbot patch checker.
And make life harder for reviewers.
-1 on this one.
Maybe I'm thinking back to the time when we used to use a bunch of old
versions of autoconf ...
cheers
andrew
--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Oct 9, 2018 at 5:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
On Tue, Oct 9, 2018 at 2:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Yeah, I've been burnt by that too recently. It occurs to me we could make
that at least a little less painful if we formatted the macro with one
line per function name:
+1, was about to suggest the same!
Sold, I'll go do it.
Learned a few new things about M4 along the way :-( ... but done.
You'll need to rebase the pread patch again of course.
Thanks, much nicer. Rebased.
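One detail of the rebased patch that is easy to miss: pwrite() never moves a file position, so a caller that loops on short writes has to advance the offset itself, which is what the added "startoffset += written" in XLogWrite() is for. A minimal sketch of that loop, assuming pwrite() is available; write_all_at() and write_all_at_demo() are hypothetical names, not functions in the patch:

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): write all "len" bytes starting
 * at "offset", retrying after short writes.  Returns 0 on success, -1 on
 * error with errno set.
 */
static int
write_all_at(int fd, const char *from, size_t len, off_t offset)
{
    while (len > 0)
    {
        ssize_t     written;

        errno = 0;
        written = pwrite(fd, from, len, offset);
        if (written <= 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted, just retry */
            /* as in fd.c: if errno wasn't set, assume no disk space */
            if (errno == 0)
                errno = ENOSPC;
            return -1;
        }
        len -= (size_t) written;
        from += written;
        offset += written;      /* pwrite() does not advance this for us */
    }
    return 0;
}

/* Self-check against an unlinked scratch file; returns 0 on success. */
int
write_all_at_demo(void)
{
    char        path[] = "/tmp/write_all_at_demoXXXXXX";
    char        buf[10];
    ssize_t     n;
    int         rc;
    int         fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);

    /* Write the second half first; explicit offsets make the order moot. */
    rc = write_all_at(fd, "hello", 5, 5);
    assert(rc == 0);
    rc = write_all_at(fd, "01234", 5, 0);
    assert(rc == 0);

    n = pread(fd, buf, sizeof(buf), 0);
    assert(n == 10 && memcmp(buf, "01234hello", 10) == 0);

    close(fd);
    return 0;
}
```

With plain write() the kernel advances the position after each short write, so the loop only needed to bump the buffer pointer; with pwrite() forgetting the offset bump would rewrite the same region on every retry.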
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
0001-Use-pread-pwrite-instead-of-lseek-read-write-v7.patch (application/octet-stream)
From ab2be9b8d61f4aa9bdc768927d2ebb590a84283b Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Thu, 12 Jul 2018 13:14:02 +1200
Subject: [PATCH] Use pread()/pwrite() instead of lseek() + read()/write().
Cut down on system calls by doing random IO using POSIX.1-2008
offset-based IO routines, where available. Remove the code for
tracking the 'virtual' seek position. The only reason left to
call FileSeek() was to get the file's size, so provide a new
function FileSize() instead.
Author: Oskari Saarenmaa, Thomas Munro
Reviewed-by: Thomas Munro, Jesper Pedersen
Discussion: https://postgr.es/m/CAEepm=02rapCpPR3ZGF2vW=SBHSdFYO_bz_f-wwWJonmA3APgw@mail.gmail.com
Discussion: https://postgr.es/m/b8748d39-0b19-0514-a1b9-4e5a28e6a208%40gmail.com
Discussion: https://postgr.es/m/a86bd200-ebbe-d829-e3ca-0c4474b2fcb7%40ohmu.fi
---
configure | 2 +-
configure.in | 2 +
src/backend/access/heap/rewriteheap.c | 2 +-
src/backend/access/transam/xlog.c | 13 ++
src/backend/storage/file/buffile.c | 46 +-----
src/backend/storage/file/fd.c | 217 +++++---------------------
src/backend/storage/smgr/md.c | 35 +----
src/include/pg_config.h.in | 6 +
src/include/storage/fd.h | 12 +-
9 files changed, 73 insertions(+), 262 deletions(-)
diff --git a/configure b/configure
index b7250d7f5b8..f59ee041345 100755
--- a/configure
+++ b/configure
@@ -15129,7 +15129,7 @@ fi
LIBS_including_readline="$LIBS"
LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
-for ac_func in cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open strchrnul symlink sync_file_range utime utimes wcstombs_l
+for ac_func in cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pread pstat pthread_is_threaded_np pwrite readlink setproctitle setproctitle_fast setsid shm_open strchrnul symlink sync_file_range utime utimes wcstombs_l
do :
as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.in b/configure.in
index de5f777333b..3c5073d940e 100644
--- a/configure.in
+++ b/configure.in
@@ -1610,8 +1610,10 @@ AC_CHECK_FUNCS(m4_normalize([
poll
posix_fallocate
ppoll
+ pread
pstat
pthread_is_threaded_np
+ pwrite
readlink
setproctitle
setproctitle_fast
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 85f92973c95..5f573bafda6 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -922,7 +922,7 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
* Note that we deviate from the usual WAL coding practices here,
* check the above "Logical rewrite support" comment for reasoning.
*/
- written = FileWrite(src->vfd, waldata_start, len,
+ written = FileWrite(src->vfd, waldata_start, len, src->off,
WAIT_EVENT_LOGICAL_REWRITE_WRITE);
if (written != len)
ereport(ERROR,
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7375a78ffcf..c5b99c93237 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2484,6 +2484,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
Size nleft;
int written;
+#ifndef HAVE_PWRITE
/* Need to seek in the file? */
if (openLogOff != startoffset)
{
@@ -2495,6 +2496,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
startoffset)));
openLogOff = startoffset;
}
+#endif
/* OK to write the page(s) */
from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2504,7 +2506,11 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
{
errno = 0;
pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
+#ifdef HAVE_PWRITE
+ written = pwrite(openLogFile, from, nleft, startoffset);
+#else
written = write(openLogFile, from, nleft);
+#endif
pgstat_report_wait_end();
if (written <= 0)
{
@@ -2519,6 +2525,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
}
nleft -= written;
from += written;
+ startoffset += written;
} while (nleft > 0);
/* Update state for write */
@@ -11827,6 +11834,7 @@ retry:
/* Read the requested page */
readOff = targetPageOff;
+#ifndef HAVE_PREAD
if (lseek(readFile, (off_t) readOff, SEEK_SET) < 0)
{
char fname[MAXFNAMELEN];
@@ -11840,9 +11848,14 @@ retry:
fname, readOff)));
goto next_record_is_invalid;
}
+#endif
pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+#ifdef HAVE_PREAD
+ r = pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
+#else
r = read(readFile, readBuf, XLOG_BLCKSZ);
+#endif
if (r != XLOG_BLCKSZ)
{
char fname[MAXFNAMELEN];
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index e93813d9737..dd687dfe71f 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -67,12 +67,6 @@ struct BufFile
int numFiles; /* number of physical files in set */
/* all files except the last have length exactly MAX_PHYSICAL_FILESIZE */
File *files; /* palloc'd array with numFiles entries */
- off_t *offsets; /* palloc'd array with numFiles entries */
-
- /*
- * offsets[i] is the current seek position of files[i]. We use this to
- * avoid making redundant FileSeek calls.
- */
bool isInterXact; /* keep open over transactions? */
bool dirty; /* does buffer need to be written? */
@@ -116,7 +110,6 @@ makeBufFileCommon(int nfiles)
BufFile *file = (BufFile *) palloc(sizeof(BufFile));
file->numFiles = nfiles;
- file->offsets = (off_t *) palloc0(sizeof(off_t) * nfiles);
file->isInterXact = false;
file->dirty = false;
file->resowner = CurrentResourceOwner;
@@ -170,10 +163,7 @@ extendBufFile(BufFile *file)
file->files = (File *) repalloc(file->files,
(file->numFiles + 1) * sizeof(File));
- file->offsets = (off_t *) repalloc(file->offsets,
- (file->numFiles + 1) * sizeof(off_t));
file->files[file->numFiles] = pfile;
- file->offsets[file->numFiles] = 0L;
file->numFiles++;
}
@@ -396,7 +386,6 @@ BufFileClose(BufFile *file)
FileClose(file->files[i]);
/* release the buffer space */
pfree(file->files);
- pfree(file->offsets);
pfree(file);
}
@@ -422,27 +411,17 @@ BufFileLoadBuffer(BufFile *file)
file->curOffset = 0L;
}
- /*
- * May need to reposition physical file.
- */
- thisfile = file->files[file->curFile];
- if (file->curOffset != file->offsets[file->curFile])
- {
- if (FileSeek(thisfile, file->curOffset, SEEK_SET) != file->curOffset)
- return; /* seek failed, read nothing */
- file->offsets[file->curFile] = file->curOffset;
- }
-
/*
* Read whatever we can get, up to a full bufferload.
*/
+ thisfile = file->files[file->curFile];
file->nbytes = FileRead(thisfile,
file->buffer.data,
sizeof(file->buffer),
+ file->curOffset,
WAIT_EVENT_BUFFILE_READ);
if (file->nbytes < 0)
file->nbytes = 0;
- file->offsets[file->curFile] += file->nbytes;
/* we choose not to advance curOffset here */
if (file->nbytes > 0)
@@ -491,23 +470,14 @@ BufFileDumpBuffer(BufFile *file)
if ((off_t) bytestowrite > availbytes)
bytestowrite = (int) availbytes;
- /*
- * May need to reposition physical file.
- */
thisfile = file->files[file->curFile];
- if (file->curOffset != file->offsets[file->curFile])
- {
- if (FileSeek(thisfile, file->curOffset, SEEK_SET) != file->curOffset)
- return; /* seek failed, give up */
- file->offsets[file->curFile] = file->curOffset;
- }
bytestowrite = FileWrite(thisfile,
file->buffer.data + wpos,
bytestowrite,
+ file->curOffset,
WAIT_EVENT_BUFFILE_WRITE);
if (bytestowrite <= 0)
return; /* failed to write */
- file->offsets[file->curFile] += bytestowrite;
file->curOffset += bytestowrite;
wpos += bytestowrite;
@@ -803,11 +773,10 @@ BufFileSize(BufFile *file)
{
off_t lastFileSize;
- /* Get the size of the last physical file by seeking to end. */
- lastFileSize = FileSeek(file->files[file->numFiles - 1], 0, SEEK_END);
+ /* Get the size of the last physical file. */
+ lastFileSize = FileSize(file->files[file->numFiles - 1]);
if (lastFileSize < 0)
return -1;
- file->offsets[file->numFiles - 1] = lastFileSize;
return ((file->numFiles - 1) * (off_t) MAX_PHYSICAL_FILESIZE) +
lastFileSize;
@@ -849,13 +818,8 @@ BufFileAppend(BufFile *target, BufFile *source)
target->files = (File *)
repalloc(target->files, sizeof(File) * newNumFiles);
- target->offsets = (off_t *)
- repalloc(target->offsets, sizeof(off_t) * newNumFiles);
for (i = target->numFiles; i < newNumFiles; i++)
- {
target->files[i] = source->files[i - target->numFiles];
- target->offsets[i] = source->offsets[i - target->numFiles];
- }
target->numFiles = newNumFiles;
return startBlock;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 8dd51f17674..a380f794014 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -16,8 +16,8 @@
* including base tables, scratch files (e.g., sort and hash spool
* files), and random calls to C library routines like system(3); it
* is quite easy to exceed system limits on the number of open files a
- * single process can have. (This is around 256 on many modern
- * operating systems, but can be as low as 32 on others.)
+ * single process can have. (This is around 1024 on many modern
+ * operating systems, but may be lower on others.)
*
* VFDs are managed as an LRU pool, with actual OS file descriptors
* being opened and closed as needed. Obviously, if a routine is
@@ -167,15 +167,6 @@ int max_safe_fds = 32; /* default if not changed */
#define FileIsNotOpen(file) (VfdCache[file].fd == VFD_CLOSED)
-/*
- * Note: a VFD's seekPos is normally always valid, but if for some reason
- * an lseek() fails, it might become set to FileUnknownPos. We can struggle
- * along without knowing the seek position in many cases, but in some places
- * we have to fail if we don't have it.
- */
-#define FileUnknownPos ((off_t) -1)
-#define FilePosIsUnknown(pos) ((pos) < 0)
-
/* these are the assigned bits in fdstate below: */
#define FD_DELETE_AT_CLOSE (1 << 0) /* T = delete when closed */
#define FD_CLOSE_AT_EOXACT (1 << 1) /* T = close at eoXact */
@@ -189,7 +180,6 @@ typedef struct vfd
File nextFree; /* link to next free VFD, if in freelist */
File lruMoreRecently; /* doubly linked recency-of-use list */
File lruLessRecently;
- off_t seekPos; /* current logical file position, or -1 */
off_t fileSize; /* current size of file (0 if not temporary) */
char *fileName; /* name of file, or NULL for unused VFD */
/* NB: fileName is malloc'd, and must be free'd when closing the VFD */
@@ -407,9 +397,7 @@ pg_fdatasync(int fd)
/*
* pg_flush_data --- advise OS that the described dirty data should be flushed
*
- * offset of 0 with nbytes 0 means that the entire file should be flushed;
- * in this case, this function may have side-effects on the file's
- * seek position!
+ * offset of 0 with nbytes 0 means that the entire file should be flushed
*/
void
pg_flush_data(int fd, off_t offset, off_t nbytes)
@@ -1029,22 +1017,6 @@ LruDelete(File file)
vfdP = &VfdCache[file];
- /*
- * Normally we should know the seek position, but if for some reason we
- * have lost track of it, try again to get it. If we still can't get it,
- * we have a problem: we will be unable to restore the file seek position
- * when and if the file is re-opened. But we can't really throw an error
- * and refuse to close the file, or activities such as transaction cleanup
- * will be broken.
- */
- if (FilePosIsUnknown(vfdP->seekPos))
- {
- vfdP->seekPos = lseek(vfdP->fd, (off_t) 0, SEEK_CUR);
- if (FilePosIsUnknown(vfdP->seekPos))
- elog(LOG, "could not seek file \"%s\" before closing: %m",
- vfdP->fileName);
- }
-
/*
* Close the file. We aren't expecting this to fail; if it does, better
* to leak the FD than to mess up our internal state.
@@ -1113,33 +1085,6 @@ LruInsert(File file)
{
++nfile;
}
-
- /*
- * Seek to the right position. We need no special case for seekPos
- * equal to FileUnknownPos, as lseek() will certainly reject that
- * (thus completing the logic noted in LruDelete() that we will fail
- * to re-open a file if we couldn't get its seek position before
- * closing).
- */
- if (vfdP->seekPos != (off_t) 0)
- {
- if (lseek(vfdP->fd, vfdP->seekPos, SEEK_SET) < 0)
- {
- /*
- * If we fail to restore the seek position, treat it like an
- * open() failure.
- */
- int save_errno = errno;
-
- elog(LOG, "could not seek file \"%s\" after re-opening: %m",
- vfdP->fileName);
- (void) close(vfdP->fd);
- vfdP->fd = VFD_CLOSED;
- --nfile;
- errno = save_errno;
- return -1;
- }
- }
}
/*
@@ -1406,7 +1351,6 @@ PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
/* Saved flags are adjusted to be OK for re-opening file */
vfdP->fileFlags = fileFlags & ~(O_CREAT | O_TRUNC | O_EXCL);
vfdP->fileMode = fileMode;
- vfdP->seekPos = 0;
vfdP->fileSize = 0;
vfdP->fdstate = 0x0;
vfdP->resowner = NULL;
@@ -1820,7 +1764,6 @@ FileClose(File file)
/*
* FilePrefetch - initiate asynchronous read of a given range of the file.
- * The logical seek position is unaffected.
*
* Currently the only implementation of this function is using posix_fadvise
* which is the simplest standardized interface that accomplishes this.
@@ -1867,10 +1810,6 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
file, VfdCache[file].fileName,
(int64) offset, (int64) nbytes));
- /*
- * Caution: do not call pg_flush_data with nbytes = 0, it could trash the
- * file's seek position. We prefer to define that as a no-op here.
- */
if (nbytes <= 0)
return;
@@ -1884,7 +1823,8 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
}
int
-FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
+FileRead(File file, char *buffer, int amount, off_t offset,
+ uint32 wait_event_info)
{
int returnCode;
Vfd *vfdP;
@@ -1893,7 +1833,7 @@ FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
DO_DB(elog(LOG, "FileRead: %d (%s) " INT64_FORMAT " %d %p",
file, VfdCache[file].fileName,
- (int64) VfdCache[file].seekPos,
+ (int64) offset,
amount, buffer));
returnCode = FileAccess(file);
@@ -1904,16 +1844,16 @@ FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
retry:
pgstat_report_wait_start(wait_event_info);
- returnCode = read(vfdP->fd, buffer, amount);
+#ifdef HAVE_PREAD
+ returnCode = pread(vfdP->fd, buffer, amount, offset);
+#else
+ returnCode = lseek(VfdCache[file].fd, offset, SEEK_SET);
+ if (returnCode >= 0)
+ returnCode = read(vfdP->fd, buffer, amount);
+#endif
pgstat_report_wait_end();
- if (returnCode >= 0)
- {
- /* if seekPos is unknown, leave it that way */
- if (!FilePosIsUnknown(vfdP->seekPos))
- vfdP->seekPos += returnCode;
- }
- else
+ if (returnCode < 0)
{
/*
* Windows may run out of kernel buffers and return "Insufficient
@@ -1939,16 +1879,14 @@ retry:
/* OK to retry if interrupted */
if (errno == EINTR)
goto retry;
-
- /* Trouble, so assume we don't know the file position anymore */
- vfdP->seekPos = FileUnknownPos;
}
return returnCode;
}
int
-FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
+FileWrite(File file, char *buffer, int amount, off_t offset,
+ uint32 wait_event_info)
{
int returnCode;
Vfd *vfdP;
@@ -1957,7 +1895,7 @@ FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
DO_DB(elog(LOG, "FileWrite: %d (%s) " INT64_FORMAT " %d %p",
file, VfdCache[file].fileName,
- (int64) VfdCache[file].seekPos,
+ (int64) offset,
amount, buffer));
returnCode = FileAccess(file);
@@ -1976,26 +1914,13 @@ FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
*/
if (temp_file_limit >= 0 && (vfdP->fdstate & FD_TEMP_FILE_LIMIT))
{
- off_t newPos;
+ off_t past_write = offset + amount;
- /*
- * Normally we should know the seek position, but if for some reason
- * we have lost track of it, try again to get it. Here, it's fine to
- * throw an error if we still can't get it.
- */
- if (FilePosIsUnknown(vfdP->seekPos))
- {
- vfdP->seekPos = lseek(vfdP->fd, (off_t) 0, SEEK_CUR);
- if (FilePosIsUnknown(vfdP->seekPos))
- elog(ERROR, "could not seek file \"%s\": %m", vfdP->fileName);
- }
-
- newPos = vfdP->seekPos + amount;
- if (newPos > vfdP->fileSize)
+ if (past_write > vfdP->fileSize)
{
uint64 newTotal = temporary_files_size;
- newTotal += newPos - vfdP->fileSize;
+ newTotal += past_write - vfdP->fileSize;
if (newTotal > (uint64) temp_file_limit * (uint64) 1024)
ereport(ERROR,
(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
@@ -2007,7 +1932,13 @@ FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
retry:
errno = 0;
pgstat_report_wait_start(wait_event_info);
- returnCode = write(vfdP->fd, buffer, amount);
+#ifdef HAVE_PWRITE
+ returnCode = pwrite(VfdCache[file].fd, buffer, amount, offset);
+#else
+ returnCode = lseek(VfdCache[file].fd, offset, SEEK_SET);
+ if (returnCode >= 0)
+ returnCode = write(VfdCache[file].fd, buffer, amount);
+#endif
pgstat_report_wait_end();
/* if write didn't set errno, assume problem is no disk space */
@@ -2016,10 +1947,6 @@ retry:
if (returnCode >= 0)
{
- /* if seekPos is unknown, leave it that way */
- if (!FilePosIsUnknown(vfdP->seekPos))
- vfdP->seekPos += returnCode;
-
/*
* Maintain fileSize and temporary_files_size if it's a temp file.
*
@@ -2029,12 +1956,12 @@ retry:
*/
if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
{
- off_t newPos = vfdP->seekPos;
+ off_t past_write = offset + amount;
- if (newPos > vfdP->fileSize)
+ if (past_write > vfdP->fileSize)
{
- temporary_files_size += newPos - vfdP->fileSize;
- vfdP->fileSize = newPos;
+ temporary_files_size += past_write - vfdP->fileSize;
+ vfdP->fileSize = past_write;
}
}
}
@@ -2060,9 +1987,6 @@ retry:
/* OK to retry if interrupted */
if (errno == EINTR)
goto retry;
-
- /* Trouble, so assume we don't know the file position anymore */
- vfdP->seekPos = FileUnknownPos;
}
return returnCode;
@@ -2090,92 +2014,25 @@ FileSync(File file, uint32 wait_event_info)
}
off_t
-FileSeek(File file, off_t offset, int whence)
+FileSize(File file)
{
Vfd *vfdP;
Assert(FileIsValid(file));
- DO_DB(elog(LOG, "FileSeek: %d (%s) " INT64_FORMAT " " INT64_FORMAT " %d",
- file, VfdCache[file].fileName,
- (int64) VfdCache[file].seekPos,
- (int64) offset, whence));
+ DO_DB(elog(LOG, "FileSize %d (%s)",
+ file, VfdCache[file].fileName));
vfdP = &VfdCache[file];
if (FileIsNotOpen(file))
{
- switch (whence)
- {
- case SEEK_SET:
- if (offset < 0)
- {
- errno = EINVAL;
- return (off_t) -1;
- }
- vfdP->seekPos = offset;
- break;
- case SEEK_CUR:
- if (FilePosIsUnknown(vfdP->seekPos) ||
- vfdP->seekPos + offset < 0)
- {
- errno = EINVAL;
- return (off_t) -1;
- }
- vfdP->seekPos += offset;
- break;
- case SEEK_END:
- if (FileAccess(file) < 0)
- return (off_t) -1;
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- default:
- elog(ERROR, "invalid whence: %d", whence);
- break;
- }
- }
- else
- {
- switch (whence)
- {
- case SEEK_SET:
- if (offset < 0)
- {
- errno = EINVAL;
- return (off_t) -1;
- }
- if (vfdP->seekPos != offset)
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- case SEEK_CUR:
- if (offset != 0 || FilePosIsUnknown(vfdP->seekPos))
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- case SEEK_END:
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- default:
- elog(ERROR, "invalid whence: %d", whence);
- break;
- }
+ if (FileAccess(file) < 0)
+ return (off_t) -1;
}
- return vfdP->seekPos;
-}
-
-/*
- * XXX not actually used but here for completeness
- */
-#ifdef NOT_USED
-off_t
-FileTell(File file)
-{
- Assert(FileIsValid(file));
- DO_DB(elog(LOG, "FileTell %d (%s)",
- file, VfdCache[file].fileName));
- return VfdCache[file].seekPos;
+ return lseek(VfdCache[file].fd, 0, SEEK_END);
}
-#endif
int
FileTruncate(File file, off_t offset, uint32 wait_event_info)
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index f4374d077be..86013a5c8b2 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -522,22 +522,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- /*
- * Note: because caller usually obtained blocknum by calling mdnblocks,
- * which did a seek(SEEK_END), this seek is often redundant and will be
- * optimized away by fd.c. It's not redundant, however, if there is a
- * partial page at the end of the file. In that case we want to try to
- * overwrite the partial page with a full page. It's also not redundant
- * if bufmgr.c had to dump another buffer of the same file to make room
- * for the new page's buffer.
- */
- if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not seek to block %u in file \"%s\": %m",
- blocknum, FilePathName(v->mdfd_vfd))));
-
- if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
+ if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
{
if (nbytes < 0)
ereport(ERROR,
@@ -748,13 +733,7 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not seek to block %u in file \"%s\": %m",
- blocknum, FilePathName(v->mdfd_vfd))));
-
- nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_READ);
+ nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_READ);
TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
@@ -824,13 +803,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not seek to block %u in file \"%s\": %m",
- blocknum, FilePathName(v->mdfd_vfd))));
-
- nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_WRITE);
+ nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_WRITE);
TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
@@ -1979,7 +1952,7 @@ _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
{
off_t len;
- len = FileSeek(seg->mdfd_vfd, 0L, SEEK_END);
+ len = FileSize(seg->mdfd_vfd);
if (len < 0)
ereport(ERROR,
(errcode_for_file_access(),
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 9798bd24b44..5a996e75572 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -438,6 +438,9 @@
/* Define to 1 if you have the `ppoll' function. */
#undef HAVE_PPOLL
+/* Define to 1 if you have the `pread' function. */
+#undef HAVE_PREAD
+
/* Define to 1 if you have the `pstat' function. */
#undef HAVE_PSTAT
@@ -453,6 +456,9 @@
/* Have PTHREAD_PRIO_INHERIT. */
#undef HAVE_PTHREAD_PRIO_INHERIT
+/* Define to 1 if you have the `pwrite' function. */
+#undef HAVE_PWRITE
+
/* Define to 1 if you have the `random' function. */
#undef HAVE_RANDOM
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8e7c9728f4b..f8b6fa8ece5 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -15,7 +15,7 @@
/*
* calls:
*
- * File {Close, Read, Write, Seek, Tell, Sync}
+ * File {Close, Read, Write, Size, Sync}
* {Path Name Open, Allocate, Free} File
*
* These are NOT JUST RENAMINGS OF THE UNIX ROUTINES.
@@ -42,10 +42,6 @@
#include <dirent.h>
-/*
- * FileSeek uses the standard UNIX lseek(2) flags.
- */
-
typedef int File;
@@ -68,10 +64,10 @@ extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fil
extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount, uint32 wait_event_info);
-extern int FileRead(File file, char *buffer, int amount, uint32 wait_event_info);
-extern int FileWrite(File file, char *buffer, int amount, uint32 wait_event_info);
+extern int FileRead(File file, char *buffer, int amount, off_t offset, uint32 wait_event_info);
+extern int FileWrite(File file, char *buffer, int amount, off_t offset, uint32 wait_event_info);
extern int FileSync(File file, uint32 wait_event_info);
-extern off_t FileSeek(File file, off_t offset, int whence);
+extern off_t FileSize(File file);
extern int FileTruncate(File file, off_t offset, uint32 wait_event_info);
extern void FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info);
extern char *FilePathName(File file);
--
2.17.1 (Apple Git-112)
Hi Thomas,
On 10/9/18 4:56 PM, Thomas Munro wrote:
Thanks, much nicer. Rebased.
This still applies, and passes make check-world.
I wonder what the commit policy is on this if the Windows part isn't
included. I read Heikki's comment [1] as saying it would be OK to commit
this, benefiting all platforms that have pread/pwrite.
The functions in [2] could be a follow-up patch as well.
[1]: /messages/by-id/6cc7c8dd-29f9-7d75-d18a-99f19c076d10@iki.fi
[2]: /messages/by-id/c2f56d0a-cadd-3df1-ae48-b84dc8128c37@redhat.com
Best regards,
Jesper
On Sat, Nov 3, 2018 at 2:07 AM Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
This still applies, and passes make check-world.
I wonder what the commit policy is on this, if the Windows part isn't
included. I read Heikki's comment [1] as it would be ok to commit
benefiting all platforms that has pread/pwrite.
Here's a patch to add Windows support by supplying
src/backend/port/win32/pread.c. Thoughts?
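For anyone following along, here is a rough sketch (not part of the patch) of the property a native POSIX pread() has and that the emulation gives up: the read happens at an explicit offset and the descriptor's own position stays where it was. The helper name position_after_pread and the scratch file path are made up for illustration.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical demo helper (not from the patch).  Writes "hello world" to
 * a scratch file, rewinds to 0, reads 5 bytes at offset 6 with pread(),
 * and returns the descriptor's position afterwards (-1 on any failure).
 */
off_t
position_after_pread(char *out, size_t outlen)
{
	char		path[] = "/tmp/pread_demo_XXXXXX";
	int			fd;
	off_t		pos = -1;

	assert(outlen >= 6);		/* 5 bytes plus a terminator */

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);				/* file goes away once fd is closed */

	if (write(fd, "hello world", 11) == 11 &&
		lseek(fd, 0, SEEK_SET) == 0 &&
		pread(fd, out, 5, 6) == 5)
	{
		out[5] = '\0';
		/* a native pread() does not move the position set by lseek() */
		pos = lseek(fd, 0, SEEK_CUR);
	}
	close(fd);
	return pos;
}
```

With a native pread() the helper returns 0 and the buffer holds "world". An emulation that moves the file position would typically return something else, which only matters if read() and pread() are mixed on the same fd.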
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
0001-Use-pread-pwrite-instead-of-lseek-read-write-v8.patch (application/octet-stream)
From cf01770ad7ab0fdac8decc55f4a2105616dde885 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Thu, 12 Jul 2018 13:14:02 +1200
Subject: [PATCH 1/3] Use pread()/pwrite() instead of lseek() + read()/write().
Cut down on system calls by doing random IO using POSIX.1-2008
offset-based IO routines, where available. Remove the code for
tracking the 'virtual' seek position. The only reason left to
call FileSeek() was to get the file's size, so provide a new
function FileSize() instead.
Author: Oskari Saarenmaa, Thomas Munro
Reviewed-by: Thomas Munro, Jesper Pedersen
Discussion: https://postgr.es/m/CAEepm=02rapCpPR3ZGF2vW=SBHSdFYO_bz_f-wwWJonmA3APgw@mail.gmail.com
Discussion: https://postgr.es/m/b8748d39-0b19-0514-a1b9-4e5a28e6a208%40gmail.com
Discussion: https://postgr.es/m/a86bd200-ebbe-d829-e3ca-0c4474b2fcb7%40ohmu.fi
---
configure | 2 +-
configure.in | 2 +
src/backend/access/heap/rewriteheap.c | 2 +-
src/backend/access/transam/xlog.c | 13 ++
src/backend/storage/file/buffile.c | 46 +-----
src/backend/storage/file/fd.c | 217 +++++---------------------
src/backend/storage/smgr/md.c | 35 +----
src/include/pg_config.h.in | 6 +
src/include/storage/fd.h | 12 +-
9 files changed, 73 insertions(+), 262 deletions(-)
diff --git a/configure b/configure
index 0686941331c..69a2a6a87e1 100755
--- a/configure
+++ b/configure
@@ -15131,7 +15131,7 @@ fi
LIBS_including_readline="$LIBS"
LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
-for ac_func in cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open strchrnul symlink sync_file_range utime utimes wcstombs_l
+for ac_func in cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pread pstat pthread_is_threaded_np pwrite readlink setproctitle setproctitle_fast setsid shm_open strchrnul symlink sync_file_range utime utimes wcstombs_l
do :
as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.in b/configure.in
index 7586deb7ee6..ea4a4c43ece 100644
--- a/configure.in
+++ b/configure.in
@@ -1611,8 +1611,10 @@ AC_CHECK_FUNCS(m4_normalize([
poll
posix_fallocate
ppoll
+ pread
pstat
pthread_is_threaded_np
+ pwrite
readlink
setproctitle
setproctitle_fast
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 71277889649..c5db75afa1f 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -935,7 +935,7 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
* Note that we deviate from the usual WAL coding practices here,
* check the above "Logical rewrite support" comment for reasoning.
*/
- written = FileWrite(src->vfd, waldata_start, len,
+ written = FileWrite(src->vfd, waldata_start, len, src->off,
WAIT_EVENT_LOGICAL_REWRITE_WRITE);
if (written != len)
ereport(ERROR,
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 246869bba29..8724c8fb012 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2478,6 +2478,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
Size nleft;
int written;
+#ifndef HAVE_PWRITE
/* Need to seek in the file? */
if (openLogOff != startoffset)
{
@@ -2489,6 +2490,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
startoffset)));
openLogOff = startoffset;
}
+#endif
/* OK to write the page(s) */
from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2498,7 +2500,11 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
{
errno = 0;
pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
+#ifdef HAVE_PWRITE
+ written = pwrite(openLogFile, from, nleft, startoffset);
+#else
written = write(openLogFile, from, nleft);
+#endif
pgstat_report_wait_end();
if (written <= 0)
{
@@ -2513,6 +2519,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
}
nleft -= written;
from += written;
+ startoffset += written;
} while (nleft > 0);
/* Update state for write */
@@ -11821,6 +11828,7 @@ retry:
/* Read the requested page */
readOff = targetPageOff;
+#ifndef HAVE_PREAD
if (lseek(readFile, (off_t) readOff, SEEK_SET) < 0)
{
char fname[MAXFNAMELEN];
@@ -11834,9 +11842,14 @@ retry:
fname, readOff)));
goto next_record_is_invalid;
}
+#endif
pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+#ifdef HAVE_PREAD
+ r = pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
+#else
r = read(readFile, readBuf, XLOG_BLCKSZ);
+#endif
if (r != XLOG_BLCKSZ)
{
char fname[MAXFNAMELEN];
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index e93813d9737..dd687dfe71f 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -67,12 +67,6 @@ struct BufFile
int numFiles; /* number of physical files in set */
/* all files except the last have length exactly MAX_PHYSICAL_FILESIZE */
File *files; /* palloc'd array with numFiles entries */
- off_t *offsets; /* palloc'd array with numFiles entries */
-
- /*
- * offsets[i] is the current seek position of files[i]. We use this to
- * avoid making redundant FileSeek calls.
- */
bool isInterXact; /* keep open over transactions? */
bool dirty; /* does buffer need to be written? */
@@ -116,7 +110,6 @@ makeBufFileCommon(int nfiles)
BufFile *file = (BufFile *) palloc(sizeof(BufFile));
file->numFiles = nfiles;
- file->offsets = (off_t *) palloc0(sizeof(off_t) * nfiles);
file->isInterXact = false;
file->dirty = false;
file->resowner = CurrentResourceOwner;
@@ -170,10 +163,7 @@ extendBufFile(BufFile *file)
file->files = (File *) repalloc(file->files,
(file->numFiles + 1) * sizeof(File));
- file->offsets = (off_t *) repalloc(file->offsets,
- (file->numFiles + 1) * sizeof(off_t));
file->files[file->numFiles] = pfile;
- file->offsets[file->numFiles] = 0L;
file->numFiles++;
}
@@ -396,7 +386,6 @@ BufFileClose(BufFile *file)
FileClose(file->files[i]);
/* release the buffer space */
pfree(file->files);
- pfree(file->offsets);
pfree(file);
}
@@ -422,27 +411,17 @@ BufFileLoadBuffer(BufFile *file)
file->curOffset = 0L;
}
- /*
- * May need to reposition physical file.
- */
- thisfile = file->files[file->curFile];
- if (file->curOffset != file->offsets[file->curFile])
- {
- if (FileSeek(thisfile, file->curOffset, SEEK_SET) != file->curOffset)
- return; /* seek failed, read nothing */
- file->offsets[file->curFile] = file->curOffset;
- }
-
/*
* Read whatever we can get, up to a full bufferload.
*/
+ thisfile = file->files[file->curFile];
file->nbytes = FileRead(thisfile,
file->buffer.data,
sizeof(file->buffer),
+ file->curOffset,
WAIT_EVENT_BUFFILE_READ);
if (file->nbytes < 0)
file->nbytes = 0;
- file->offsets[file->curFile] += file->nbytes;
/* we choose not to advance curOffset here */
if (file->nbytes > 0)
@@ -491,23 +470,14 @@ BufFileDumpBuffer(BufFile *file)
if ((off_t) bytestowrite > availbytes)
bytestowrite = (int) availbytes;
- /*
- * May need to reposition physical file.
- */
thisfile = file->files[file->curFile];
- if (file->curOffset != file->offsets[file->curFile])
- {
- if (FileSeek(thisfile, file->curOffset, SEEK_SET) != file->curOffset)
- return; /* seek failed, give up */
- file->offsets[file->curFile] = file->curOffset;
- }
bytestowrite = FileWrite(thisfile,
file->buffer.data + wpos,
bytestowrite,
+ file->curOffset,
WAIT_EVENT_BUFFILE_WRITE);
if (bytestowrite <= 0)
return; /* failed to write */
- file->offsets[file->curFile] += bytestowrite;
file->curOffset += bytestowrite;
wpos += bytestowrite;
@@ -803,11 +773,10 @@ BufFileSize(BufFile *file)
{
off_t lastFileSize;
- /* Get the size of the last physical file by seeking to end. */
- lastFileSize = FileSeek(file->files[file->numFiles - 1], 0, SEEK_END);
+ /* Get the size of the last physical file. */
+ lastFileSize = FileSize(file->files[file->numFiles - 1]);
if (lastFileSize < 0)
return -1;
- file->offsets[file->numFiles - 1] = lastFileSize;
return ((file->numFiles - 1) * (off_t) MAX_PHYSICAL_FILESIZE) +
lastFileSize;
@@ -849,13 +818,8 @@ BufFileAppend(BufFile *target, BufFile *source)
target->files = (File *)
repalloc(target->files, sizeof(File) * newNumFiles);
- target->offsets = (off_t *)
- repalloc(target->offsets, sizeof(off_t) * newNumFiles);
for (i = target->numFiles; i < newNumFiles; i++)
- {
target->files[i] = source->files[i - target->numFiles];
- target->offsets[i] = source->offsets[i - target->numFiles];
- }
target->numFiles = newNumFiles;
return startBlock;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 8dd51f17674..a380f794014 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -16,8 +16,8 @@
* including base tables, scratch files (e.g., sort and hash spool
* files), and random calls to C library routines like system(3); it
* is quite easy to exceed system limits on the number of open files a
- * single process can have. (This is around 256 on many modern
- * operating systems, but can be as low as 32 on others.)
+ * single process can have. (This is around 1024 on many modern
+ * operating systems, but may be lower on others.)
*
* VFDs are managed as an LRU pool, with actual OS file descriptors
* being opened and closed as needed. Obviously, if a routine is
@@ -167,15 +167,6 @@ int max_safe_fds = 32; /* default if not changed */
#define FileIsNotOpen(file) (VfdCache[file].fd == VFD_CLOSED)
-/*
- * Note: a VFD's seekPos is normally always valid, but if for some reason
- * an lseek() fails, it might become set to FileUnknownPos. We can struggle
- * along without knowing the seek position in many cases, but in some places
- * we have to fail if we don't have it.
- */
-#define FileUnknownPos ((off_t) -1)
-#define FilePosIsUnknown(pos) ((pos) < 0)
-
/* these are the assigned bits in fdstate below: */
#define FD_DELETE_AT_CLOSE (1 << 0) /* T = delete when closed */
#define FD_CLOSE_AT_EOXACT (1 << 1) /* T = close at eoXact */
@@ -189,7 +180,6 @@ typedef struct vfd
File nextFree; /* link to next free VFD, if in freelist */
File lruMoreRecently; /* doubly linked recency-of-use list */
File lruLessRecently;
- off_t seekPos; /* current logical file position, or -1 */
off_t fileSize; /* current size of file (0 if not temporary) */
char *fileName; /* name of file, or NULL for unused VFD */
/* NB: fileName is malloc'd, and must be free'd when closing the VFD */
@@ -407,9 +397,7 @@ pg_fdatasync(int fd)
/*
* pg_flush_data --- advise OS that the described dirty data should be flushed
*
- * offset of 0 with nbytes 0 means that the entire file should be flushed;
- * in this case, this function may have side-effects on the file's
- * seek position!
+ * offset of 0 with nbytes 0 means that the entire file should be flushed
*/
void
pg_flush_data(int fd, off_t offset, off_t nbytes)
@@ -1029,22 +1017,6 @@ LruDelete(File file)
vfdP = &VfdCache[file];
- /*
- * Normally we should know the seek position, but if for some reason we
- * have lost track of it, try again to get it. If we still can't get it,
- * we have a problem: we will be unable to restore the file seek position
- * when and if the file is re-opened. But we can't really throw an error
- * and refuse to close the file, or activities such as transaction cleanup
- * will be broken.
- */
- if (FilePosIsUnknown(vfdP->seekPos))
- {
- vfdP->seekPos = lseek(vfdP->fd, (off_t) 0, SEEK_CUR);
- if (FilePosIsUnknown(vfdP->seekPos))
- elog(LOG, "could not seek file \"%s\" before closing: %m",
- vfdP->fileName);
- }
-
/*
* Close the file. We aren't expecting this to fail; if it does, better
* to leak the FD than to mess up our internal state.
@@ -1113,33 +1085,6 @@ LruInsert(File file)
{
++nfile;
}
-
- /*
- * Seek to the right position. We need no special case for seekPos
- * equal to FileUnknownPos, as lseek() will certainly reject that
- * (thus completing the logic noted in LruDelete() that we will fail
- * to re-open a file if we couldn't get its seek position before
- * closing).
- */
- if (vfdP->seekPos != (off_t) 0)
- {
- if (lseek(vfdP->fd, vfdP->seekPos, SEEK_SET) < 0)
- {
- /*
- * If we fail to restore the seek position, treat it like an
- * open() failure.
- */
- int save_errno = errno;
-
- elog(LOG, "could not seek file \"%s\" after re-opening: %m",
- vfdP->fileName);
- (void) close(vfdP->fd);
- vfdP->fd = VFD_CLOSED;
- --nfile;
- errno = save_errno;
- return -1;
- }
- }
}
/*
@@ -1406,7 +1351,6 @@ PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
/* Saved flags are adjusted to be OK for re-opening file */
vfdP->fileFlags = fileFlags & ~(O_CREAT | O_TRUNC | O_EXCL);
vfdP->fileMode = fileMode;
- vfdP->seekPos = 0;
vfdP->fileSize = 0;
vfdP->fdstate = 0x0;
vfdP->resowner = NULL;
@@ -1820,7 +1764,6 @@ FileClose(File file)
/*
* FilePrefetch - initiate asynchronous read of a given range of the file.
- * The logical seek position is unaffected.
*
* Currently the only implementation of this function is using posix_fadvise
* which is the simplest standardized interface that accomplishes this.
@@ -1867,10 +1810,6 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
file, VfdCache[file].fileName,
(int64) offset, (int64) nbytes));
- /*
- * Caution: do not call pg_flush_data with nbytes = 0, it could trash the
- * file's seek position. We prefer to define that as a no-op here.
- */
if (nbytes <= 0)
return;
@@ -1884,7 +1823,8 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
}
int
-FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
+FileRead(File file, char *buffer, int amount, off_t offset,
+ uint32 wait_event_info)
{
int returnCode;
Vfd *vfdP;
@@ -1893,7 +1833,7 @@ FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
DO_DB(elog(LOG, "FileRead: %d (%s) " INT64_FORMAT " %d %p",
file, VfdCache[file].fileName,
- (int64) VfdCache[file].seekPos,
+ (int64) offset,
amount, buffer));
returnCode = FileAccess(file);
@@ -1904,16 +1844,16 @@ FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
retry:
pgstat_report_wait_start(wait_event_info);
- returnCode = read(vfdP->fd, buffer, amount);
+#ifdef HAVE_PREAD
+ returnCode = pread(vfdP->fd, buffer, amount, offset);
+#else
+ returnCode = lseek(VfdCache[file].fd, offset, SEEK_SET);
+ if (returnCode >= 0)
+ returnCode = read(vfdP->fd, buffer, amount);
+#endif
pgstat_report_wait_end();
- if (returnCode >= 0)
- {
- /* if seekPos is unknown, leave it that way */
- if (!FilePosIsUnknown(vfdP->seekPos))
- vfdP->seekPos += returnCode;
- }
- else
+ if (returnCode < 0)
{
/*
* Windows may run out of kernel buffers and return "Insufficient
@@ -1939,16 +1879,14 @@ retry:
/* OK to retry if interrupted */
if (errno == EINTR)
goto retry;
-
- /* Trouble, so assume we don't know the file position anymore */
- vfdP->seekPos = FileUnknownPos;
}
return returnCode;
}
int
-FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
+FileWrite(File file, char *buffer, int amount, off_t offset,
+ uint32 wait_event_info)
{
int returnCode;
Vfd *vfdP;
@@ -1957,7 +1895,7 @@ FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
DO_DB(elog(LOG, "FileWrite: %d (%s) " INT64_FORMAT " %d %p",
file, VfdCache[file].fileName,
- (int64) VfdCache[file].seekPos,
+ (int64) offset,
amount, buffer));
returnCode = FileAccess(file);
@@ -1976,26 +1914,13 @@ FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
*/
if (temp_file_limit >= 0 && (vfdP->fdstate & FD_TEMP_FILE_LIMIT))
{
- off_t newPos;
+ off_t past_write = offset + amount;
- /*
- * Normally we should know the seek position, but if for some reason
- * we have lost track of it, try again to get it. Here, it's fine to
- * throw an error if we still can't get it.
- */
- if (FilePosIsUnknown(vfdP->seekPos))
- {
- vfdP->seekPos = lseek(vfdP->fd, (off_t) 0, SEEK_CUR);
- if (FilePosIsUnknown(vfdP->seekPos))
- elog(ERROR, "could not seek file \"%s\": %m", vfdP->fileName);
- }
-
- newPos = vfdP->seekPos + amount;
- if (newPos > vfdP->fileSize)
+ if (past_write > vfdP->fileSize)
{
uint64 newTotal = temporary_files_size;
- newTotal += newPos - vfdP->fileSize;
+ newTotal += past_write - vfdP->fileSize;
if (newTotal > (uint64) temp_file_limit * (uint64) 1024)
ereport(ERROR,
(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
@@ -2007,7 +1932,13 @@ FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
retry:
errno = 0;
pgstat_report_wait_start(wait_event_info);
- returnCode = write(vfdP->fd, buffer, amount);
+#ifdef HAVE_PWRITE
+ returnCode = pwrite(VfdCache[file].fd, buffer, amount, offset);
+#else
+ returnCode = lseek(VfdCache[file].fd, offset, SEEK_SET);
+ if (returnCode >= 0)
+ returnCode = write(VfdCache[file].fd, buffer, amount);
+#endif
pgstat_report_wait_end();
/* if write didn't set errno, assume problem is no disk space */
@@ -2016,10 +1947,6 @@ retry:
if (returnCode >= 0)
{
- /* if seekPos is unknown, leave it that way */
- if (!FilePosIsUnknown(vfdP->seekPos))
- vfdP->seekPos += returnCode;
-
/*
* Maintain fileSize and temporary_files_size if it's a temp file.
*
@@ -2029,12 +1956,12 @@ retry:
*/
if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
{
- off_t newPos = vfdP->seekPos;
+ off_t past_write = offset + amount;
- if (newPos > vfdP->fileSize)
+ if (past_write > vfdP->fileSize)
{
- temporary_files_size += newPos - vfdP->fileSize;
- vfdP->fileSize = newPos;
+ temporary_files_size += past_write - vfdP->fileSize;
+ vfdP->fileSize = past_write;
}
}
}
@@ -2060,9 +1987,6 @@ retry:
/* OK to retry if interrupted */
if (errno == EINTR)
goto retry;
-
- /* Trouble, so assume we don't know the file position anymore */
- vfdP->seekPos = FileUnknownPos;
}
return returnCode;
@@ -2090,92 +2014,25 @@ FileSync(File file, uint32 wait_event_info)
}
off_t
-FileSeek(File file, off_t offset, int whence)
+FileSize(File file)
{
Vfd *vfdP;
Assert(FileIsValid(file));
- DO_DB(elog(LOG, "FileSeek: %d (%s) " INT64_FORMAT " " INT64_FORMAT " %d",
- file, VfdCache[file].fileName,
- (int64) VfdCache[file].seekPos,
- (int64) offset, whence));
+ DO_DB(elog(LOG, "FileSize %d (%s)",
+ file, VfdCache[file].fileName));
vfdP = &VfdCache[file];
if (FileIsNotOpen(file))
{
- switch (whence)
- {
- case SEEK_SET:
- if (offset < 0)
- {
- errno = EINVAL;
- return (off_t) -1;
- }
- vfdP->seekPos = offset;
- break;
- case SEEK_CUR:
- if (FilePosIsUnknown(vfdP->seekPos) ||
- vfdP->seekPos + offset < 0)
- {
- errno = EINVAL;
- return (off_t) -1;
- }
- vfdP->seekPos += offset;
- break;
- case SEEK_END:
- if (FileAccess(file) < 0)
- return (off_t) -1;
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- default:
- elog(ERROR, "invalid whence: %d", whence);
- break;
- }
- }
- else
- {
- switch (whence)
- {
- case SEEK_SET:
- if (offset < 0)
- {
- errno = EINVAL;
- return (off_t) -1;
- }
- if (vfdP->seekPos != offset)
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- case SEEK_CUR:
- if (offset != 0 || FilePosIsUnknown(vfdP->seekPos))
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- case SEEK_END:
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- default:
- elog(ERROR, "invalid whence: %d", whence);
- break;
- }
+ if (FileAccess(file) < 0)
+ return (off_t) -1;
}
- return vfdP->seekPos;
-}
-
-/*
- * XXX not actually used but here for completeness
- */
-#ifdef NOT_USED
-off_t
-FileTell(File file)
-{
- Assert(FileIsValid(file));
- DO_DB(elog(LOG, "FileTell %d (%s)",
- file, VfdCache[file].fileName));
- return VfdCache[file].seekPos;
+ return lseek(VfdCache[file].fd, 0, SEEK_END);
}
-#endif
int
FileTruncate(File file, off_t offset, uint32 wait_event_info)
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index f4374d077be..86013a5c8b2 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -522,22 +522,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- /*
- * Note: because caller usually obtained blocknum by calling mdnblocks,
- * which did a seek(SEEK_END), this seek is often redundant and will be
- * optimized away by fd.c. It's not redundant, however, if there is a
- * partial page at the end of the file. In that case we want to try to
- * overwrite the partial page with a full page. It's also not redundant
- * if bufmgr.c had to dump another buffer of the same file to make room
- * for the new page's buffer.
- */
- if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not seek to block %u in file \"%s\": %m",
- blocknum, FilePathName(v->mdfd_vfd))));
-
- if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
+ if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
{
if (nbytes < 0)
ereport(ERROR,
@@ -748,13 +733,7 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not seek to block %u in file \"%s\": %m",
- blocknum, FilePathName(v->mdfd_vfd))));
-
- nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_READ);
+ nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_READ);
TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
@@ -824,13 +803,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not seek to block %u in file \"%s\": %m",
- blocknum, FilePathName(v->mdfd_vfd))));
-
- nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_WRITE);
+ nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_WRITE);
TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
@@ -1979,7 +1952,7 @@ _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
{
off_t len;
- len = FileSeek(seg->mdfd_vfd, 0L, SEEK_END);
+ len = FileSize(seg->mdfd_vfd);
if (len < 0)
ereport(ERROR,
(errcode_for_file_access(),
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 9798bd24b44..5a996e75572 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -438,6 +438,9 @@
/* Define to 1 if you have the `ppoll' function. */
#undef HAVE_PPOLL
+/* Define to 1 if you have the `pread' function. */
+#undef HAVE_PREAD
+
/* Define to 1 if you have the `pstat' function. */
#undef HAVE_PSTAT
@@ -453,6 +456,9 @@
/* Have PTHREAD_PRIO_INHERIT. */
#undef HAVE_PTHREAD_PRIO_INHERIT
+/* Define to 1 if you have the `pwrite' function. */
+#undef HAVE_PWRITE
+
/* Define to 1 if you have the `random' function. */
#undef HAVE_RANDOM
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8e7c9728f4b..f8b6fa8ece5 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -15,7 +15,7 @@
/*
* calls:
*
- * File {Close, Read, Write, Seek, Tell, Sync}
+ * File {Close, Read, Write, Size, Sync}
* {Path Name Open, Allocate, Free} File
*
* These are NOT JUST RENAMINGS OF THE UNIX ROUTINES.
@@ -42,10 +42,6 @@
#include <dirent.h>
-/*
- * FileSeek uses the standard UNIX lseek(2) flags.
- */
-
typedef int File;
@@ -68,10 +64,10 @@ extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fil
extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount, uint32 wait_event_info);
-extern int FileRead(File file, char *buffer, int amount, uint32 wait_event_info);
-extern int FileWrite(File file, char *buffer, int amount, uint32 wait_event_info);
+extern int FileRead(File file, char *buffer, int amount, off_t offset, uint32 wait_event_info);
+extern int FileWrite(File file, char *buffer, int amount, off_t offset, uint32 wait_event_info);
extern int FileSync(File file, uint32 wait_event_info);
-extern off_t FileSeek(File file, off_t offset, int whence);
+extern off_t FileSize(File file);
extern int FileTruncate(File file, off_t offset, uint32 wait_event_info);
extern void FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info);
extern char *FilePathName(File file);
--
2.19.1
0002-Supply-pread-pwrite-implementations-for-Windows-v8.patch (application/octet-stream)
From bbe406e14f75eb198f7cb581513bfd3c4a02481a Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Sat, 3 Nov 2018 23:11:29 +1300
Subject: [PATCH 2/3] Supply pread()/pwrite() implementations for Windows.
Emulate POSIX pread()/pwrite() with the OVERLAPPED interface.
The emulation is not perfect, as the file position is changed, but
that is OK for our purposes. We don't plan to mix read() and
pread() calls on the same fd.
Author: Thomas Munro
Reviewed-by:
Discussion:
---
src/backend/port/win32/Makefile | 2 +-
src/backend/port/win32/pread.c | 69 +++++++++++++++++++++++++++++++++
src/include/pg_config.h.win32 | 6 +++
src/include/port/win32_port.h | 4 ++
4 files changed, 80 insertions(+), 1 deletion(-)
create mode 100644 src/backend/port/win32/pread.c
diff --git a/src/backend/port/win32/Makefile b/src/backend/port/win32/Makefile
index a6ace93e261..9539bd22673 100644
--- a/src/backend/port/win32/Makefile
+++ b/src/backend/port/win32/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/port/win32
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = timer.o socket.o signal.o mingwcompat.o
+OBJS = pread.o timer.o socket.o signal.o mingwcompat.o
ifeq ($(have_win32_dbghelp), yes)
OBJS += crashdump.o
endif
diff --git a/src/backend/port/win32/pread.c b/src/backend/port/win32/pread.c
new file mode 100644
index 00000000000..7984a1f5a4c
--- /dev/null
+++ b/src/backend/port/win32/pread.c
@@ -0,0 +1,69 @@
+/*-------------------------------------------------------------------------
+ *
+ * pread.c
+ * Microsoft Windows Win32 pread() and pwrite() implementations.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/port/win32/pread.c
+ *
+ * Note that these implementations change the current file position, unlike
+ * the POSIX functions, so should not be mixed with regular read() and
+ * write() calls.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifdef WIN32
+
+#include "postgres.h"
+#include <Windows.h>
+
+ssize_t
+pread(int fd, void *buf, size_t size, off_t offset)
+{
+ OVERLAPPED overlapped = {0};
+ HANDLE handle;
+ DWORD result;
+
+ handle = (HANDLE) _get_osfhandle(fd);
+ if (handle == INVALID_HANDLE_VALUE)
+ {
+ errno = EBADF;
+ return -1;
+ }
+
+ overlapped.Offset = (uint32) offset;
+ if (!ReadFile(handle, buf, size, &result, &overlapped))
+ {
+ _dosmaperr(GetLastError());
+ return -1;
+ }
+ return result;
+}
+
+ssize_t
+pwrite(int fd, void *buf, size_t size, off_t offset)
+{
+ OVERLAPPED overlapped = {0};
+ HANDLE handle;
+ DWORD result;
+
+ handle = (HANDLE) _get_osfhandle(fd);
+ if (handle == INVALID_HANDLE_VALUE)
+ {
+ errno = EBADF;
+ return -1;
+ }
+
+ overlapped.Offset = (uint32) offset;
+ if (!WriteFile(handle, buf, size, &result, &overlapped))
+ {
+ _dosmaperr(GetLastError());
+ return -1;
+ }
+ return result;
+}
+
+#endif
diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32
index f7a051d1127..2172102aa7c 100644
--- a/src/include/pg_config.h.win32
+++ b/src/include/pg_config.h.win32
@@ -322,12 +322,18 @@
/* Define to 1 if you have the `ppoll' function. */
/* #undef HAVE_PPOLL */
+/* Define to 1 if you have the `pread' function. */
+#define HAVE_PREAD 1
+
/* Define to 1 if you have the `pstat' function. */
/* #undef HAVE_PSTAT */
/* Define to 1 if the PS_STRINGS thing exists. */
/* #undef HAVE_PS_STRINGS */
+/* Define to 1 if you have the `pwrite' function. */
+#define HAVE_PWRITE 1
+
/* Define to 1 if you have the `random' function. */
/* #undef HAVE_RANDOM */
diff --git a/src/include/port/win32_port.h b/src/include/port/win32_port.h
index 360dbdf3a75..dd76cb16811 100644
--- a/src/include/port/win32_port.h
+++ b/src/include/port/win32_port.h
@@ -512,6 +512,10 @@ typedef unsigned short mode_t;
#define isnan(x) _isnan(x)
#endif
+/* in backend/port/win32/pread.c */
+extern ssize_t pread(int fd, void *buf, size_t nbyte, off_t offset);
+extern ssize_t pwrite(int fd, void *buf, size_t nbyte, off_t offset);
+
/* Pulled from Makefile.port in MinGW */
#define DLSUFFIX ".dll"
--
2.19.1
On Sun, Nov 4, 2018 at 12:03 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Sat, Nov 3, 2018 at 2:07 AM Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
This still applies, and passes make check-world.
I wonder what the commit policy is on this, if the Windows part isn't
included. I read Heikki's comment [1] as saying it would be OK to commit,
benefiting all platforms that have pread/pwrite.
Here's a patch to add Windows support by supplying
src/backend/port/win32/pread.c. Thoughts?
If we do that, I suppose we might as well supply implementations for
HP-UX 10.20 as well, and then we can get rid of the conditional macro
stuff at various call sites and use pread() and pwrite() freely.
Here's a version that does it that way. One question is whether the
caveat mentioned in patch 0001 is acceptable.
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
0001-Supply-pread-pwrite-where-missing-v9.patch
From 2394d026016797231a0e5595460db5a040c04ae2 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Sat, 3 Nov 2018 23:11:29 +1300
Subject: [PATCH 1/3] Supply pread()/pwrite() where missing.
Emulate POSIX pread()/pwrite() with lseek() or Win32 OVERLAPPED.
The emulation is not perfect, as the file position is changed, but
that is OK as long as we don't mix read() and pread() calls on the
same fd.
Author: Thomas Munro
Reviewed-by:
Discussion: https://postgr.es/m/CAEepm=02rapCpPR3ZGF2vW=SBHSdFYO_bz_f-wwWJonmA3APgw@mail.gmail.com
---
configure | 47 +++++++++++++++++++++++++++++
configure.in | 3 ++
src/include/pg_config.h.in | 14 +++++++++
src/include/pg_config.h.win32 | 14 +++++++++
src/include/port.h | 8 +++++
src/port/pread.c | 56 +++++++++++++++++++++++++++++++++++
src/port/pwrite.c | 56 +++++++++++++++++++++++++++++++++++
src/tools/msvc/Mkvcbuild.pm | 1 +
8 files changed, 199 insertions(+)
create mode 100644 src/port/pread.c
create mode 100644 src/port/pwrite.c
diff --git a/configure b/configure
index 0686941331c..d42fb317513 100755
--- a/configure
+++ b/configure
@@ -15309,6 +15309,27 @@ cat >>confdefs.h <<_ACEOF
#define HAVE_DECL_STRNLEN $ac_have_decl
_ACEOF
+ac_fn_c_check_decl "$LINENO" "pread" "ac_cv_have_decl_pread" "$ac_includes_default"
+if test "x$ac_cv_have_decl_pread" = xyes; then :
+ ac_have_decl=1
+else
+ ac_have_decl=0
+fi
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE_DECL_PREAD $ac_have_decl
+_ACEOF
+ac_fn_c_check_decl "$LINENO" "pwrite" "ac_cv_have_decl_pwrite" "$ac_includes_default"
+if test "x$ac_cv_have_decl_pwrite" = xyes; then :
+ ac_have_decl=1
+else
+ ac_have_decl=0
+fi
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE_DECL_PWRITE $ac_have_decl
+_ACEOF
+
# This is probably only present on macOS, but may as well check always
ac_fn_c_check_decl "$LINENO" "F_FULLFSYNC" "ac_cv_have_decl_F_FULLFSYNC" "#include <fcntl.h>
"
@@ -15543,6 +15564,32 @@ esac
fi
+ac_fn_c_check_func "$LINENO" "pread" "ac_cv_func_pread"
+if test "x$ac_cv_func_pread" = xyes; then :
+ $as_echo "#define HAVE_PREAD 1" >>confdefs.h
+
+else
+ case " $LIBOBJS " in
+ *" pread.$ac_objext "* ) ;;
+ *) LIBOBJS="$LIBOBJS pread.$ac_objext"
+ ;;
+esac
+
+fi
+
+ac_fn_c_check_func "$LINENO" "pwrite" "ac_cv_func_pwrite"
+if test "x$ac_cv_func_pwrite" = xyes; then :
+ $as_echo "#define HAVE_PWRITE 1" >>confdefs.h
+
+else
+ case " $LIBOBJS " in
+ *" pwrite.$ac_objext "* ) ;;
+ *) LIBOBJS="$LIBOBJS pwrite.$ac_objext"
+ ;;
+esac
+
+fi
+
ac_fn_c_check_func "$LINENO" "random" "ac_cv_func_random"
if test "x$ac_cv_func_random" = xyes; then :
$as_echo "#define HAVE_RANDOM 1" >>confdefs.h
diff --git a/configure.in b/configure.in
index 7586deb7ee6..2b1513aa436 100644
--- a/configure.in
+++ b/configure.in
@@ -1647,6 +1647,7 @@ fi
AC_CHECK_DECLS(fdatasync, [], [], [#include <unistd.h>])
AC_CHECK_DECLS([strlcat, strlcpy, strnlen])
+AC_CHECK_DECLS([pread, pwrite])
# This is probably only present on macOS, but may as well check always
AC_CHECK_DECLS(F_FULLFSYNC, [], [], [#include <fcntl.h>])
@@ -1701,6 +1702,8 @@ AC_REPLACE_FUNCS(m4_normalize([
getrusage
inet_aton
mkdtemp
+ pread
+ pwrite
random
rint
srandom
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 9798bd24b44..2a80368a745 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -158,6 +158,14 @@
don't. */
#undef HAVE_DECL_POSIX_FADVISE
+/* Define to 1 if you have the declaration of `pread', and to 0 if you don't.
+ */
+#undef HAVE_DECL_PREAD
+
+/* Define to 1 if you have the declaration of `pwrite', and to 0 if you don't.
+ */
+#undef HAVE_DECL_PWRITE
+
/* Define to 1 if you have the declaration of `RTLD_GLOBAL', and to 0 if you
don't. */
#undef HAVE_DECL_RTLD_GLOBAL
@@ -438,6 +446,9 @@
/* Define to 1 if you have the `ppoll' function. */
#undef HAVE_PPOLL
+/* Define to 1 if you have the `pread' function. */
+#undef HAVE_PREAD
+
/* Define to 1 if you have the `pstat' function. */
#undef HAVE_PSTAT
@@ -453,6 +464,9 @@
/* Have PTHREAD_PRIO_INHERIT. */
#undef HAVE_PTHREAD_PRIO_INHERIT
+/* Define to 1 if you have the `pwrite' function. */
+#undef HAVE_PWRITE
+
/* Define to 1 if you have the `random' function. */
#undef HAVE_RANDOM
diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32
index f7a051d1127..f857661ef2f 100644
--- a/src/include/pg_config.h.win32
+++ b/src/include/pg_config.h.win32
@@ -147,6 +147,14 @@
don't. */
#define HAVE_DECL_STRTOULL 1
+/* Define to 1 if you have the declaration of `pread', and to 0 if you don't.
+ */
+#define HAVE_DECL_PREAD 0
+
+/* Define to 1 if you have the declaration of `pwrite', and to 0 if you don't.
+ */
+#define HAVE_DECL_PWRITE 0
+
/* Define to 1 if you have the `dlopen' function. */
/* #undef HAVE_DLOPEN */
@@ -322,12 +330,18 @@
/* Define to 1 if you have the `ppoll' function. */
/* #undef HAVE_PPOLL */
+/* Define to 1 if you have the `pread' function. */
+/* #undef HAVE_PREAD */
+
/* Define to 1 if you have the `pstat' function. */
/* #undef HAVE_PSTAT */
/* Define to 1 if the PS_STRINGS thing exists. */
/* #undef HAVE_PS_STRINGS */
+/* Define to 1 if you have the `pwrite' function. */
+/* #undef HAVE_PWRITE */
+
/* Define to 1 if you have the `random' function. */
/* #undef HAVE_RANDOM */
diff --git a/src/include/port.h b/src/include/port.h
index 3a53bcf2e4b..bcf03a5a7ac 100644
--- a/src/include/port.h
+++ b/src/include/port.h
@@ -392,6 +392,14 @@ extern double rint(double x);
extern int inet_aton(const char *cp, struct in_addr *addr);
#endif
+#if !HAVE_DECL_PREAD
+extern ssize_t pread(int fd, void *buf, size_t nbyte, off_t offset);
+#endif
+
+#if !HAVE_DECL_PWRITE
+extern ssize_t pwrite(int fd, const void *buf, size_t nbyte, off_t offset);
+#endif
+
#if !HAVE_DECL_STRLCAT
extern size_t strlcat(char *dst, const char *src, size_t siz);
#endif
diff --git a/src/port/pread.c b/src/port/pread.c
new file mode 100644
index 00000000000..6844353fc97
--- /dev/null
+++ b/src/port/pread.c
@@ -0,0 +1,56 @@
+/*-------------------------------------------------------------------------
+ *
+ * pread.c
+ * Implementation of pread(2) for platforms that lack one.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pread.c
+ *
+ * Note that this implementation changes the current file position, unlike
+ * the POSIX function, so should not be mixed with regular read() calls
+ * on the same file descriptor.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#include "postgres.h"
+
+#ifdef WIN32
+#include <windows.h>
+#else
+#include <unistd.h>
+#endif
+
+ssize_t
+pread(int fd, void *buf, size_t size, off_t offset)
+{
+#ifdef WIN32
+ OVERLAPPED overlapped = {0};
+ HANDLE handle;
+ DWORD result;
+
+ handle = (HANDLE) _get_osfhandle(fd);
+ if (handle == INVALID_HANDLE_VALUE)
+ {
+ errno = EBADF;
+ return -1;
+ }
+
+ overlapped.Offset = (uint32) offset;
+ if (!ReadFile(handle, buf, size, &result, &overlapped))
+ {
+ _dosmaperr(GetLastError());
+ return -1;
+ }
+
+ return result;
+#else
+ if (lseek(fd, offset, SEEK_SET) < 0)
+ return -1;
+
+ return read(fd, buf, size);
+#endif
+}
diff --git a/src/port/pwrite.c b/src/port/pwrite.c
new file mode 100644
index 00000000000..50217a767b8
--- /dev/null
+++ b/src/port/pwrite.c
@@ -0,0 +1,56 @@
+/*-------------------------------------------------------------------------
+ *
+ * pwrite.c
+ * Implementation of pwrite(2) for platforms that lack one.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pwrite.c
+ *
+ * Note that this implementation changes the current file position, unlike
+ * the POSIX function, so should not be mixed with regular write() calls
+ * on the same file descriptor.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#include "postgres.h"
+
+#ifdef WIN32
+#include <windows.h>
+#else
+#include <unistd.h>
+#endif
+
+ssize_t
+pwrite(int fd, const void *buf, size_t size, off_t offset)
+{
+#ifdef WIN32
+ OVERLAPPED overlapped = {0};
+ HANDLE handle;
+ DWORD result;
+
+ handle = (HANDLE) _get_osfhandle(fd);
+ if (handle == INVALID_HANDLE_VALUE)
+ {
+ errno = EBADF;
+ return -1;
+ }
+
+ overlapped.Offset = offset;
+ if (!WriteFile(handle, buf, size, &result, &overlapped))
+ {
+ _dosmaperr(GetLastError());
+ return -1;
+ }
+
+ return result;
+#else
+ if (lseek(fd, offset, SEEK_SET) < 0)
+ return -1;
+
+ return write(fd, buf, size);
+#endif
+}
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index 708579d9dfb..b562044fa71 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -97,6 +97,7 @@ sub mkvcbuild
srandom.c getaddrinfo.c gettimeofday.c inet_net_ntop.c kill.c open.c
erand48.c snprintf.c strlcat.c strlcpy.c dirmod.c noblock.c path.c
dirent.c dlopen.c getopt.c getopt_long.c
+ pread.c pwrite.c
pg_strong_random.c pgcheckdir.c pgmkdirp.c pgsleep.c pgstrcasecmp.c
pqsignal.c mkdtemp.c qsort.c qsort_arg.c quotes.c system.c
sprompt.c strerror.c tar.c thread.c
--
2.19.1
0002-Use-pread-pwrite-instead-of-lseek-read-write-v9.patch
From 6bca083407ed2a2ebcfcfe84e1ef959e85f7a101 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Thu, 12 Jul 2018 13:14:02 +1200
Subject: [PATCH 2/3] Use pread()/pwrite() instead of lseek() + read()/write().
Cut down on system calls by doing random IO using the POSIX
offset-based IO routines. Remove the code for tracking the 'virtual'
seek position. The only reason left to call FileSeek() was to get
the file's size, so provide a new function FileSize() instead.
Author: Oskari Saarenmaa, Thomas Munro
Reviewed-by: Thomas Munro, Jesper Pedersen
Discussion: https://postgr.es/m/CAEepm=02rapCpPR3ZGF2vW=SBHSdFYO_bz_f-wwWJonmA3APgw@mail.gmail.com
Discussion: https://postgr.es/m/b8748d39-0b19-0514-a1b9-4e5a28e6a208%40gmail.com
Discussion: https://postgr.es/m/a86bd200-ebbe-d829-e3ca-0c4474b2fcb7%40ohmu.fi
---
src/backend/access/heap/rewriteheap.c | 2 +-
src/backend/access/transam/xlog.c | 30 +---
src/backend/storage/file/buffile.c | 46 +-----
src/backend/storage/file/fd.c | 205 ++++----------------------
src/backend/storage/smgr/md.c | 35 +----
src/include/storage/fd.h | 12 +-
6 files changed, 42 insertions(+), 288 deletions(-)
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 71277889649..c5db75afa1f 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -935,7 +935,7 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
* Note that we deviate from the usual WAL coding practices here,
* check the above "Logical rewrite support" comment for reasoning.
*/
- written = FileWrite(src->vfd, waldata_start, len,
+ written = FileWrite(src->vfd, waldata_start, len, src->off,
WAIT_EVENT_LOGICAL_REWRITE_WRITE);
if (written != len)
ereport(ERROR,
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 246869bba29..353b749dd1a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2478,18 +2478,6 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
Size nleft;
int written;
- /* Need to seek in the file? */
- if (openLogOff != startoffset)
- {
- if (lseek(openLogFile, (off_t) startoffset, SEEK_SET) < 0)
- ereport(PANIC,
- (errcode_for_file_access(),
- errmsg("could not seek in log file %s to offset %u: %m",
- XLogFileNameP(ThisTimeLineID, openLogSegNo),
- startoffset)));
- openLogOff = startoffset;
- }
-
/* OK to write the page(s) */
from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
nbytes = npages * (Size) XLOG_BLCKSZ;
@@ -2498,7 +2486,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
{
errno = 0;
pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
- written = write(openLogFile, from, nleft);
+ written = pwrite(openLogFile, from, nleft, startoffset);
pgstat_report_wait_end();
if (written <= 0)
{
@@ -2513,6 +2501,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
}
nleft -= written;
from += written;
+ startoffset += written;
} while (nleft > 0);
/* Update state for write */
@@ -11821,22 +11810,9 @@ retry:
/* Read the requested page */
readOff = targetPageOff;
- if (lseek(readFile, (off_t) readOff, SEEK_SET) < 0)
- {
- char fname[MAXFNAMELEN];
- int save_errno = errno;
-
- XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
- errno = save_errno;
- ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
- (errcode_for_file_access(),
- errmsg("could not seek in log segment %s to offset %u: %m",
- fname, readOff)));
- goto next_record_is_invalid;
- }
pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
- r = read(readFile, readBuf, XLOG_BLCKSZ);
+ r = pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
if (r != XLOG_BLCKSZ)
{
char fname[MAXFNAMELEN];
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index e93813d9737..dd687dfe71f 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -67,12 +67,6 @@ struct BufFile
int numFiles; /* number of physical files in set */
/* all files except the last have length exactly MAX_PHYSICAL_FILESIZE */
File *files; /* palloc'd array with numFiles entries */
- off_t *offsets; /* palloc'd array with numFiles entries */
-
- /*
- * offsets[i] is the current seek position of files[i]. We use this to
- * avoid making redundant FileSeek calls.
- */
bool isInterXact; /* keep open over transactions? */
bool dirty; /* does buffer need to be written? */
@@ -116,7 +110,6 @@ makeBufFileCommon(int nfiles)
BufFile *file = (BufFile *) palloc(sizeof(BufFile));
file->numFiles = nfiles;
- file->offsets = (off_t *) palloc0(sizeof(off_t) * nfiles);
file->isInterXact = false;
file->dirty = false;
file->resowner = CurrentResourceOwner;
@@ -170,10 +163,7 @@ extendBufFile(BufFile *file)
file->files = (File *) repalloc(file->files,
(file->numFiles + 1) * sizeof(File));
- file->offsets = (off_t *) repalloc(file->offsets,
- (file->numFiles + 1) * sizeof(off_t));
file->files[file->numFiles] = pfile;
- file->offsets[file->numFiles] = 0L;
file->numFiles++;
}
@@ -396,7 +386,6 @@ BufFileClose(BufFile *file)
FileClose(file->files[i]);
/* release the buffer space */
pfree(file->files);
- pfree(file->offsets);
pfree(file);
}
@@ -422,27 +411,17 @@ BufFileLoadBuffer(BufFile *file)
file->curOffset = 0L;
}
- /*
- * May need to reposition physical file.
- */
- thisfile = file->files[file->curFile];
- if (file->curOffset != file->offsets[file->curFile])
- {
- if (FileSeek(thisfile, file->curOffset, SEEK_SET) != file->curOffset)
- return; /* seek failed, read nothing */
- file->offsets[file->curFile] = file->curOffset;
- }
-
/*
* Read whatever we can get, up to a full bufferload.
*/
+ thisfile = file->files[file->curFile];
file->nbytes = FileRead(thisfile,
file->buffer.data,
sizeof(file->buffer),
+ file->curOffset,
WAIT_EVENT_BUFFILE_READ);
if (file->nbytes < 0)
file->nbytes = 0;
- file->offsets[file->curFile] += file->nbytes;
/* we choose not to advance curOffset here */
if (file->nbytes > 0)
@@ -491,23 +470,14 @@ BufFileDumpBuffer(BufFile *file)
if ((off_t) bytestowrite > availbytes)
bytestowrite = (int) availbytes;
- /*
- * May need to reposition physical file.
- */
thisfile = file->files[file->curFile];
- if (file->curOffset != file->offsets[file->curFile])
- {
- if (FileSeek(thisfile, file->curOffset, SEEK_SET) != file->curOffset)
- return; /* seek failed, give up */
- file->offsets[file->curFile] = file->curOffset;
- }
bytestowrite = FileWrite(thisfile,
file->buffer.data + wpos,
bytestowrite,
+ file->curOffset,
WAIT_EVENT_BUFFILE_WRITE);
if (bytestowrite <= 0)
return; /* failed to write */
- file->offsets[file->curFile] += bytestowrite;
file->curOffset += bytestowrite;
wpos += bytestowrite;
@@ -803,11 +773,10 @@ BufFileSize(BufFile *file)
{
off_t lastFileSize;
- /* Get the size of the last physical file by seeking to end. */
- lastFileSize = FileSeek(file->files[file->numFiles - 1], 0, SEEK_END);
+ /* Get the size of the last physical file. */
+ lastFileSize = FileSize(file->files[file->numFiles - 1]);
if (lastFileSize < 0)
return -1;
- file->offsets[file->numFiles - 1] = lastFileSize;
return ((file->numFiles - 1) * (off_t) MAX_PHYSICAL_FILESIZE) +
lastFileSize;
@@ -849,13 +818,8 @@ BufFileAppend(BufFile *target, BufFile *source)
target->files = (File *)
repalloc(target->files, sizeof(File) * newNumFiles);
- target->offsets = (off_t *)
- repalloc(target->offsets, sizeof(off_t) * newNumFiles);
for (i = target->numFiles; i < newNumFiles; i++)
- {
target->files[i] = source->files[i - target->numFiles];
- target->offsets[i] = source->offsets[i - target->numFiles];
- }
target->numFiles = newNumFiles;
return startBlock;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 8dd51f17674..3e476298616 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -16,8 +16,8 @@
* including base tables, scratch files (e.g., sort and hash spool
* files), and random calls to C library routines like system(3); it
* is quite easy to exceed system limits on the number of open files a
- * single process can have. (This is around 256 on many modern
- * operating systems, but can be as low as 32 on others.)
+ * single process can have. (This is around 1024 on many modern
+ * operating systems, but may be lower on others.)
*
* VFDs are managed as an LRU pool, with actual OS file descriptors
* being opened and closed as needed. Obviously, if a routine is
@@ -167,15 +167,6 @@ int max_safe_fds = 32; /* default if not changed */
#define FileIsNotOpen(file) (VfdCache[file].fd == VFD_CLOSED)
-/*
- * Note: a VFD's seekPos is normally always valid, but if for some reason
- * an lseek() fails, it might become set to FileUnknownPos. We can struggle
- * along without knowing the seek position in many cases, but in some places
- * we have to fail if we don't have it.
- */
-#define FileUnknownPos ((off_t) -1)
-#define FilePosIsUnknown(pos) ((pos) < 0)
-
/* these are the assigned bits in fdstate below: */
#define FD_DELETE_AT_CLOSE (1 << 0) /* T = delete when closed */
#define FD_CLOSE_AT_EOXACT (1 << 1) /* T = close at eoXact */
@@ -189,7 +180,6 @@ typedef struct vfd
File nextFree; /* link to next free VFD, if in freelist */
File lruMoreRecently; /* doubly linked recency-of-use list */
File lruLessRecently;
- off_t seekPos; /* current logical file position, or -1 */
off_t fileSize; /* current size of file (0 if not temporary) */
char *fileName; /* name of file, or NULL for unused VFD */
/* NB: fileName is malloc'd, and must be free'd when closing the VFD */
@@ -407,9 +397,7 @@ pg_fdatasync(int fd)
/*
* pg_flush_data --- advise OS that the described dirty data should be flushed
*
- * offset of 0 with nbytes 0 means that the entire file should be flushed;
- * in this case, this function may have side-effects on the file's
- * seek position!
+ * offset of 0 with nbytes 0 means that the entire file should be flushed
*/
void
pg_flush_data(int fd, off_t offset, off_t nbytes)
@@ -1029,22 +1017,6 @@ LruDelete(File file)
vfdP = &VfdCache[file];
- /*
- * Normally we should know the seek position, but if for some reason we
- * have lost track of it, try again to get it. If we still can't get it,
- * we have a problem: we will be unable to restore the file seek position
- * when and if the file is re-opened. But we can't really throw an error
- * and refuse to close the file, or activities such as transaction cleanup
- * will be broken.
- */
- if (FilePosIsUnknown(vfdP->seekPos))
- {
- vfdP->seekPos = lseek(vfdP->fd, (off_t) 0, SEEK_CUR);
- if (FilePosIsUnknown(vfdP->seekPos))
- elog(LOG, "could not seek file \"%s\" before closing: %m",
- vfdP->fileName);
- }
-
/*
* Close the file. We aren't expecting this to fail; if it does, better
* to leak the FD than to mess up our internal state.
@@ -1113,33 +1085,6 @@ LruInsert(File file)
{
++nfile;
}
-
- /*
- * Seek to the right position. We need no special case for seekPos
- * equal to FileUnknownPos, as lseek() will certainly reject that
- * (thus completing the logic noted in LruDelete() that we will fail
- * to re-open a file if we couldn't get its seek position before
- * closing).
- */
- if (vfdP->seekPos != (off_t) 0)
- {
- if (lseek(vfdP->fd, vfdP->seekPos, SEEK_SET) < 0)
- {
- /*
- * If we fail to restore the seek position, treat it like an
- * open() failure.
- */
- int save_errno = errno;
-
- elog(LOG, "could not seek file \"%s\" after re-opening: %m",
- vfdP->fileName);
- (void) close(vfdP->fd);
- vfdP->fd = VFD_CLOSED;
- --nfile;
- errno = save_errno;
- return -1;
- }
- }
}
/*
@@ -1406,7 +1351,6 @@ PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
/* Saved flags are adjusted to be OK for re-opening file */
vfdP->fileFlags = fileFlags & ~(O_CREAT | O_TRUNC | O_EXCL);
vfdP->fileMode = fileMode;
- vfdP->seekPos = 0;
vfdP->fileSize = 0;
vfdP->fdstate = 0x0;
vfdP->resowner = NULL;
@@ -1820,7 +1764,6 @@ FileClose(File file)
/*
* FilePrefetch - initiate asynchronous read of a given range of the file.
- * The logical seek position is unaffected.
*
* Currently the only implementation of this function is using posix_fadvise
* which is the simplest standardized interface that accomplishes this.
@@ -1867,10 +1810,6 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
file, VfdCache[file].fileName,
(int64) offset, (int64) nbytes));
- /*
- * Caution: do not call pg_flush_data with nbytes = 0, it could trash the
- * file's seek position. We prefer to define that as a no-op here.
- */
if (nbytes <= 0)
return;
@@ -1884,7 +1823,8 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
}
int
-FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
+FileRead(File file, char *buffer, int amount, off_t offset,
+ uint32 wait_event_info)
{
int returnCode;
Vfd *vfdP;
@@ -1893,7 +1833,7 @@ FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
DO_DB(elog(LOG, "FileRead: %d (%s) " INT64_FORMAT " %d %p",
file, VfdCache[file].fileName,
- (int64) VfdCache[file].seekPos,
+ (int64) offset,
amount, buffer));
returnCode = FileAccess(file);
@@ -1904,16 +1844,10 @@ FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
retry:
pgstat_report_wait_start(wait_event_info);
- returnCode = read(vfdP->fd, buffer, amount);
+ returnCode = pread(vfdP->fd, buffer, amount, offset);
pgstat_report_wait_end();
- if (returnCode >= 0)
- {
- /* if seekPos is unknown, leave it that way */
- if (!FilePosIsUnknown(vfdP->seekPos))
- vfdP->seekPos += returnCode;
- }
- else
+ if (returnCode < 0)
{
/*
* Windows may run out of kernel buffers and return "Insufficient
@@ -1939,16 +1873,14 @@ retry:
/* OK to retry if interrupted */
if (errno == EINTR)
goto retry;
-
- /* Trouble, so assume we don't know the file position anymore */
- vfdP->seekPos = FileUnknownPos;
}
return returnCode;
}
int
-FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
+FileWrite(File file, char *buffer, int amount, off_t offset,
+ uint32 wait_event_info)
{
int returnCode;
Vfd *vfdP;
@@ -1957,7 +1889,7 @@ FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
DO_DB(elog(LOG, "FileWrite: %d (%s) " INT64_FORMAT " %d %p",
file, VfdCache[file].fileName,
- (int64) VfdCache[file].seekPos,
+ (int64) offset,
amount, buffer));
returnCode = FileAccess(file);
@@ -1976,26 +1908,13 @@ FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
*/
if (temp_file_limit >= 0 && (vfdP->fdstate & FD_TEMP_FILE_LIMIT))
{
- off_t newPos;
+ off_t past_write = offset + amount;
- /*
- * Normally we should know the seek position, but if for some reason
- * we have lost track of it, try again to get it. Here, it's fine to
- * throw an error if we still can't get it.
- */
- if (FilePosIsUnknown(vfdP->seekPos))
- {
- vfdP->seekPos = lseek(vfdP->fd, (off_t) 0, SEEK_CUR);
- if (FilePosIsUnknown(vfdP->seekPos))
- elog(ERROR, "could not seek file \"%s\": %m", vfdP->fileName);
- }
-
- newPos = vfdP->seekPos + amount;
- if (newPos > vfdP->fileSize)
+ if (past_write > vfdP->fileSize)
{
uint64 newTotal = temporary_files_size;
- newTotal += newPos - vfdP->fileSize;
+ newTotal += past_write - vfdP->fileSize;
if (newTotal > (uint64) temp_file_limit * (uint64) 1024)
ereport(ERROR,
(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
@@ -2007,7 +1926,7 @@ FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
retry:
errno = 0;
pgstat_report_wait_start(wait_event_info);
- returnCode = write(vfdP->fd, buffer, amount);
+ returnCode = pwrite(VfdCache[file].fd, buffer, amount, offset);
pgstat_report_wait_end();
/* if write didn't set errno, assume problem is no disk space */
@@ -2016,10 +1935,6 @@ retry:
if (returnCode >= 0)
{
- /* if seekPos is unknown, leave it that way */
- if (!FilePosIsUnknown(vfdP->seekPos))
- vfdP->seekPos += returnCode;
-
/*
* Maintain fileSize and temporary_files_size if it's a temp file.
*
@@ -2029,12 +1944,12 @@ retry:
*/
if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
{
- off_t newPos = vfdP->seekPos;
+ off_t past_write = offset + amount;
- if (newPos > vfdP->fileSize)
+ if (past_write > vfdP->fileSize)
{
- temporary_files_size += newPos - vfdP->fileSize;
- vfdP->fileSize = newPos;
+ temporary_files_size += past_write - vfdP->fileSize;
+ vfdP->fileSize = past_write;
}
}
}
@@ -2060,9 +1975,6 @@ retry:
/* OK to retry if interrupted */
if (errno == EINTR)
goto retry;
-
- /* Trouble, so assume we don't know the file position anymore */
- vfdP->seekPos = FileUnknownPos;
}
return returnCode;
@@ -2090,92 +2002,25 @@ FileSync(File file, uint32 wait_event_info)
}
off_t
-FileSeek(File file, off_t offset, int whence)
+FileSize(File file)
{
Vfd *vfdP;
Assert(FileIsValid(file));
- DO_DB(elog(LOG, "FileSeek: %d (%s) " INT64_FORMAT " " INT64_FORMAT " %d",
- file, VfdCache[file].fileName,
- (int64) VfdCache[file].seekPos,
- (int64) offset, whence));
+ DO_DB(elog(LOG, "FileSize %d (%s)",
+ file, VfdCache[file].fileName));
vfdP = &VfdCache[file];
if (FileIsNotOpen(file))
{
- switch (whence)
- {
- case SEEK_SET:
- if (offset < 0)
- {
- errno = EINVAL;
- return (off_t) -1;
- }
- vfdP->seekPos = offset;
- break;
- case SEEK_CUR:
- if (FilePosIsUnknown(vfdP->seekPos) ||
- vfdP->seekPos + offset < 0)
- {
- errno = EINVAL;
- return (off_t) -1;
- }
- vfdP->seekPos += offset;
- break;
- case SEEK_END:
- if (FileAccess(file) < 0)
- return (off_t) -1;
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- default:
- elog(ERROR, "invalid whence: %d", whence);
- break;
- }
+ if (FileAccess(file) < 0)
+ return (off_t) -1;
}
- else
- {
- switch (whence)
- {
- case SEEK_SET:
- if (offset < 0)
- {
- errno = EINVAL;
- return (off_t) -1;
- }
- if (vfdP->seekPos != offset)
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- case SEEK_CUR:
- if (offset != 0 || FilePosIsUnknown(vfdP->seekPos))
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- case SEEK_END:
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- default:
- elog(ERROR, "invalid whence: %d", whence);
- break;
- }
- }
-
- return vfdP->seekPos;
-}
-/*
- * XXX not actually used but here for completeness
- */
-#ifdef NOT_USED
-off_t
-FileTell(File file)
-{
- Assert(FileIsValid(file));
- DO_DB(elog(LOG, "FileTell %d (%s)",
- file, VfdCache[file].fileName));
- return VfdCache[file].seekPos;
+ return lseek(VfdCache[file].fd, 0, SEEK_END);
}
-#endif
int
FileTruncate(File file, off_t offset, uint32 wait_event_info)
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index f4374d077be..86013a5c8b2 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -522,22 +522,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- /*
- * Note: because caller usually obtained blocknum by calling mdnblocks,
- * which did a seek(SEEK_END), this seek is often redundant and will be
- * optimized away by fd.c. It's not redundant, however, if there is a
- * partial page at the end of the file. In that case we want to try to
- * overwrite the partial page with a full page. It's also not redundant
- * if bufmgr.c had to dump another buffer of the same file to make room
- * for the new page's buffer.
- */
- if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not seek to block %u in file \"%s\": %m",
- blocknum, FilePathName(v->mdfd_vfd))));
-
- if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
+ if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
{
if (nbytes < 0)
ereport(ERROR,
@@ -748,13 +733,7 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not seek to block %u in file \"%s\": %m",
- blocknum, FilePathName(v->mdfd_vfd))));
-
- nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_READ);
+ nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_READ);
TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
@@ -824,13 +803,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not seek to block %u in file \"%s\": %m",
- blocknum, FilePathName(v->mdfd_vfd))));
-
- nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_WRITE);
+ nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_WRITE);
TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
@@ -1979,7 +1952,7 @@ _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
{
off_t len;
- len = FileSeek(seg->mdfd_vfd, 0L, SEEK_END);
+ len = FileSize(seg->mdfd_vfd);
if (len < 0)
ereport(ERROR,
(errcode_for_file_access(),
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8e7c9728f4b..f8b6fa8ece5 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -15,7 +15,7 @@
/*
* calls:
*
- * File {Close, Read, Write, Seek, Tell, Sync}
+ * File {Close, Read, Write, Size, Tell, Sync}
* {Path Name Open, Allocate, Free} File
*
* These are NOT JUST RENAMINGS OF THE UNIX ROUTINES.
@@ -42,10 +42,6 @@
#include <dirent.h>
-/*
- * FileSeek uses the standard UNIX lseek(2) flags.
- */
-
typedef int File;
@@ -68,10 +64,10 @@ extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fil
extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount, uint32 wait_event_info);
-extern int FileRead(File file, char *buffer, int amount, uint32 wait_event_info);
-extern int FileWrite(File file, char *buffer, int amount, uint32 wait_event_info);
+extern int FileRead(File file, char *buffer, int amount, off_t offset, uint32 wait_event_info);
+extern int FileWrite(File file, char *buffer, int amount, off_t offset, uint32 wait_event_info);
extern int FileSync(File file, uint32 wait_event_info);
-extern off_t FileSeek(File file, off_t offset, int whence);
+extern off_t FileSize(File file);
extern int FileTruncate(File file, off_t offset, uint32 wait_event_info);
extern void FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info);
extern char *FilePathName(File file);
--
2.19.1
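For anyone following the md.c hunks above, here is a minimal standalone sketch (not PostgreSQL code) of what the new call pattern amounts to: one positioned read per block, and lseek(SEEK_END) used only to learn the file size. The names DEMO_BLCKSZ, demo_read_block, demo_nblocks and demo_make_relation are made up for illustration, and segment handling (RELSEG_SIZE) and ereport-style error reporting are left out.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define DEMO_BLCKSZ 8192        /* hypothetical stand-in for BLCKSZ */

/* Read one block with a single positioned read; no separate seek call. */
ssize_t
demo_read_block(int fd, unsigned blocknum, char *buffer)
{
    off_t seekpos = (off_t) DEMO_BLCKSZ * blocknum;

    return pread(fd, buffer, DEMO_BLCKSZ, seekpos);
}

/* Count whole blocks; the size comes from lseek(SEEK_END). */
long
demo_nblocks(int fd)
{
    off_t len = lseek(fd, 0, SEEK_END);

    if (len < 0)
        return -1;
    return (long) (len / DEMO_BLCKSZ);
}

/* Create an unlinked temp file of nblocks blocks; block i is filled with 'A' + i. */
int
demo_make_relation(unsigned nblocks)
{
    char path[] = "/tmp/md_demo_XXXXXX";
    char block[DEMO_BLCKSZ];
    int fd = mkstemp(path);
    unsigned i;

    if (fd < 0)
        return -1;
    unlink(path);
    for (i = 0; i < nblocks; i++)
    {
        memset(block, 'A' + (int) i, sizeof(block));
        if (pwrite(fd, block, sizeof(block), (off_t) DEMO_BLCKSZ * i)
            != (ssize_t) sizeof(block))
        {
            close(fd);
            return -1;
        }
    }
    return fd;
}
```

Note that demo_nblocks() moves the descriptor's file position to the end, which does not matter here because every read and write names its own offset.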
Hi Thomas,
On 11/5/18 7:08 AM, Thomas Munro wrote:
On Sun, Nov 4, 2018 at 12:03 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Sat, Nov 3, 2018 at 2:07 AM Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
This still applies, and passes make check-world.
I wonder what the commit policy is on this, if the Windows part isn't
included. I read Heikki's comment [1] as saying it would be OK to commit
this for the benefit of all platforms that have pread/pwrite.
Here's a patch to add Windows support by supplying
src/backend/port/win32/pread.c. Thoughts?
If we do that, I suppose we might as well supply implementations for
HP-UX 10.20 as well, and then we can get rid of the conditional macro
stuff at various call sites and use pread() and pwrite() freely.
Here's a version that does it that way. One question is whether the
caveat mentioned in patch 0001 is acceptable.
Passed check-world, but I can't verify the 0001 patch. Reading the
API, it looks OK to me.
I guess the caveat in 0001 is ok, as it is a side-effect of the
underlying API.
Best regards,
Jesper
On 2018-Nov-04, Thomas Munro wrote:
Here's a patch to add Windows support by supplying
src/backend/port/win32/pread.c. Thoughts?
Hmm, so how easy is it to detect that somebody runs read/write on fds where
pread/pwrite have occurred? I guess for data files it's easy to detect
since you'd quickly end up with corrupted files, but what about other
kinds of files? I wonder if we should be worrying about using this
interface somewhere other than fd.c and forgetting about the limitation.
Say, what happens if we patch some place in xlog.c after this patch gets
in, using write() instead of pwrite()?
I suppose the safest approach is to use lseek (or whatever) to fix up
the position after the pread/pwrite -- but we don't want to pay the
price of an additional syscall. Are there any other options? Is there
a way to prevent read/write from being used on a file handle?
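To make the hazard concrete, here is a minimal sketch (not taken from the patch) comparing POSIX pread() with an lseek()+read() emulation of the kind under discussion, plus a variant that puts the position back afterwards. The names emulated_pread, restoring_pread, make_demo_file and position_after are invented for this example. The restoring variant costs four system calls where pread() needs one, which is the price mentioned above.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Emulation by seek-then-read: leaves the file position after the data read. */
ssize_t
emulated_pread(int fd, void *buf, size_t size, off_t offset)
{
    if (lseek(fd, offset, SEEK_SET) < 0)
        return -1;
    return read(fd, buf, size);
}

/* Emulation that saves and restores the position, at the cost of two more lseek() calls. */
ssize_t
restoring_pread(int fd, void *buf, size_t size, off_t offset)
{
    off_t saved = lseek(fd, 0, SEEK_CUR);
    ssize_t n;
    int saved_errno;

    if (saved < 0)
        return -1;
    if (lseek(fd, offset, SEEK_SET) < 0)
        return -1;
    n = read(fd, buf, size);
    saved_errno = errno;
    if (lseek(fd, saved, SEEK_SET) < 0)
        return -1;
    errno = saved_errno;        /* report read()'s errno, not lseek()'s */
    return n;
}

/* Create an unlinked temp file holding "0123456789", positioned at offset 0. */
int
make_demo_file(void)
{
    char path[] = "/tmp/pread_demo_XXXXXX";
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);
    if (write(fd, "0123456789", 10) != 10 || lseek(fd, 0, SEEK_SET) < 0)
    {
        close(fd);
        return -1;
    }
    return fd;
}

typedef ssize_t (*pread_fn) (int, void *, size_t, off_t);

/* Read 2 bytes at offset 5 with fn, then report where the file position ended up. */
off_t
position_after(pread_fn fn)
{
    char buf[2];
    int fd = make_demo_file();
    off_t pos;

    if (fd < 0)
        return -1;
    if (fn(fd, buf, 2, 5) != 2)
    {
        close(fd);
        return -1;
    }
    pos = lseek(fd, 0, SEEK_CUR);
    close(fd);
    return pos;
}
```

A later plain read() on the same descriptor would start at offset 0 after pread() or restoring_pread(), but at offset 7 after emulated_pread().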
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Please remove Tell from line 18 in fd.h. To Küssnacht with him!
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
On 2018-Nov-04, Thomas Munro wrote:
Here's a patch to add Windows support by supplying
src/backend/port/win32/pread.c. Thoughts?
Hmm, so how easy is it to detect that somebody runs read/write on fds where
pread/pwrite have occurred? I guess for data files it's easy to detect
since you'd quickly end up with corrupted files, but what about other
kinds of files? I wonder if we should be worrying about using this
interface somewhere other than fd.c and forgetting about the limitation.
Yeah. I think the patch as presented is OK; it uses pread/pwrite only
inside fd.c, which is a reasonably non-leaky abstraction. But there's
definitely a hazard of somebody submitting a patch that depends on
using pread/pwrite elsewhere, and then that maybe not working.
What I suggest is that we *not* try to make this a completely transparent
substitute. Instead, make the functions exported by src/port/ be
"pg_pread" and "pg_pwrite", and inside fd.c we'd write something like
#ifdef HAVE_PREAD
#define pg_pread pread
#endif
and then refer to pg_pread/pg_pwrite in the body of that file. That
way, if someone refers to pread and expects standard functionality
from it, they'll get a failure on platforms not supporting it.
FWIW, I tested the given patches on HPUX 10.20; they compiled cleanly
and pass the core regression tests.
regards, tom lane
On Tue, Nov 6, 2018 at 5:07 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Please remove Tell from line 18 in fd.h. To Küssnacht with him!
Thanks, done. But what is this arrow sticking through my Mac laptop's
screen...?
On Tue, Nov 6, 2018 at 6:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
On 2018-Nov-04, Thomas Munro wrote:
Here's a patch to add Windows support by supplying
src/backend/port/win32/pread.c. Thoughts?
Hmm, so how easy is it to detect that somebody runs read/write on fds where
pread/pwrite have occurred? I guess for data files it's easy to detect
since you'd quickly end up with corrupted files, but what about other
kinds of files? I wonder if we should be worrying about using this
interface somewhere other than fd.c and forgetting about the limitation.
Yeah. I think the patch as presented is OK; it uses pread/pwrite only
inside fd.c, which is a reasonably non-leaky abstraction. But there's
definitely a hazard of somebody submitting a patch that depends on
using pread/pwrite elsewhere, and then that maybe not working.
What I suggest is that we *not* try to make this a completely transparent
substitute. Instead, make the functions exported by src/port/ be
"pg_pread" and "pg_pwrite", and inside fd.c we'd write something like
#ifdef HAVE_PREAD
#define pg_pread pread
#endif
and then refer to pg_pread/pg_pwrite in the body of that file. That
way, if someone refers to pread and expects standard functionality
from it, they'll get a failure on platforms not supporting it.
OK. But since we're using this from both fd.c and xlog.c, I put that
into src/include/port.h.
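As a rough illustration of that arrangement (a sketch only; the real declarations are in the attached 0001 patch), the prefixed name either forwards to the OS function or resolves to a replacement with weaker guarantees. DEMO_HAVE_PREAD, demo_pread and demo_read_at are made-up names standing in for HAVE_PREAD, pg_pread and a caller.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* DEMO_HAVE_PREAD is a hypothetical stand-in for configure's HAVE_PREAD. */
#ifdef DEMO_HAVE_PREAD
#define demo_pread pread
#else
/* Replacement: works, but moves the file position, unlike POSIX pread(). */
ssize_t
demo_pread(int fd, void *buf, size_t size, off_t offset)
{
    if (lseek(fd, offset, SEEK_SET) < 0)
        return -1;
    return read(fd, buf, size);
}
#endif

/*
 * A caller that only ever uses the prefixed name, and never mixes it with
 * plain read() on the same descriptor, behaves the same on both branches.
 */
int
demo_read_at(const char *contents, off_t offset, char *out, size_t len)
{
    char path[] = "/tmp/pgpread_demo_XXXXXX";
    size_t total = strlen(contents);
    int fd = mkstemp(path);
    ssize_t n;

    if (fd < 0)
        return -1;
    unlink(path);
    if (write(fd, contents, total) != (ssize_t) total)
    {
        close(fd);
        return -1;
    }
    n = demo_pread(fd, out, len, offset);
    close(fd);
    return (int) n;
}
```

Building with and without -DDEMO_HAVE_PREAD exercises both branches, much like the configure hack below does for the real patch.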
FWIW, I tested the given patches on HPUX 10.20; they compiled cleanly
and pass the core regression tests.
Thanks. I also tested the replacements by temporarily hacking my
configure script to look for the wrong function name:
-ac_fn_c_check_func "$LINENO" "pread" "ac_cv_func_pread"
+ac_fn_c_check_func "$LINENO" "preadx" "ac_cv_func_pread"
-ac_fn_c_check_func "$LINENO" "pwrite" "ac_cv_func_pwrite"
+ac_fn_c_check_func "$LINENO" "pwritex" "ac_cv_func_pwrite"
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
0001-Provide-pg_pread-and-pg_pwrite-for-random-I-O-v10.patch
From d127dbed79c51f0968a7c325129bf26c142c8123 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Sat, 3 Nov 2018 23:11:29 +1300
Subject: [PATCH 1/3] Provide pg_pread() and pg_pwrite() for random I/O.
Forward to POSIX pread() and pwrite(), or emulate them if unavailable.
The emulation is not perfect as the file position is changed, so
we'll put pg_ prefixes on the names to minimize the risk of confusion
in future patches that might inadvertently try to mix pread() and read()
on the same file descriptor.
Author: Thomas Munro
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/CAEepm=02rapCpPR3ZGF2vW=SBHSdFYO_bz_f-wwWJonmA3APgw@mail.gmail.com
---
configure | 26 +++++++++++++++++
configure.in | 2 ++
src/include/pg_config.h.in | 6 ++++
src/include/pg_config.h.win32 | 6 ++++
src/include/port.h | 17 +++++++++++
src/port/pread.c | 55 +++++++++++++++++++++++++++++++++++
src/port/pwrite.c | 55 +++++++++++++++++++++++++++++++++++
src/tools/msvc/Mkvcbuild.pm | 1 +
8 files changed, 168 insertions(+)
create mode 100644 src/port/pread.c
create mode 100644 src/port/pwrite.c
diff --git a/configure b/configure
index 0686941331c..443da848f57 100755
--- a/configure
+++ b/configure
@@ -15543,6 +15543,32 @@ esac
fi
+ac_fn_c_check_func "$LINENO" "pread" "ac_cv_func_pread"
+if test "x$ac_cv_func_pread" = xyes; then :
+ $as_echo "#define HAVE_PREAD 1" >>confdefs.h
+
+else
+ case " $LIBOBJS " in
+ *" pread.$ac_objext "* ) ;;
+ *) LIBOBJS="$LIBOBJS pread.$ac_objext"
+ ;;
+esac
+
+fi
+
+ac_fn_c_check_func "$LINENO" "pwrite" "ac_cv_func_pwrite"
+if test "x$ac_cv_func_pwrite" = xyes; then :
+ $as_echo "#define HAVE_PWRITE 1" >>confdefs.h
+
+else
+ case " $LIBOBJS " in
+ *" pwrite.$ac_objext "* ) ;;
+ *) LIBOBJS="$LIBOBJS pwrite.$ac_objext"
+ ;;
+esac
+
+fi
+
ac_fn_c_check_func "$LINENO" "random" "ac_cv_func_random"
if test "x$ac_cv_func_random" = xyes; then :
$as_echo "#define HAVE_RANDOM 1" >>confdefs.h
diff --git a/configure.in b/configure.in
index 7586deb7ee6..bed3d05e715 100644
--- a/configure.in
+++ b/configure.in
@@ -1701,6 +1701,8 @@ AC_REPLACE_FUNCS(m4_normalize([
getrusage
inet_aton
mkdtemp
+ pread
+ pwrite
random
rint
srandom
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 9798bd24b44..5a996e75572 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -438,6 +438,9 @@
/* Define to 1 if you have the `ppoll' function. */
#undef HAVE_PPOLL
+/* Define to 1 if you have the `pread' function. */
+#undef HAVE_PREAD
+
/* Define to 1 if you have the `pstat' function. */
#undef HAVE_PSTAT
@@ -453,6 +456,9 @@
/* Have PTHREAD_PRIO_INHERIT. */
#undef HAVE_PTHREAD_PRIO_INHERIT
+/* Define to 1 if you have the `pwrite' function. */
+#undef HAVE_PWRITE
+
/* Define to 1 if you have the `random' function. */
#undef HAVE_RANDOM
diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32
index f7a051d1127..894d658a204 100644
--- a/src/include/pg_config.h.win32
+++ b/src/include/pg_config.h.win32
@@ -322,12 +322,18 @@
/* Define to 1 if you have the `ppoll' function. */
/* #undef HAVE_PPOLL */
+/* Define to 1 if you have the `pread' function. */
+/* #undef HAVE_PREAD */
+
/* Define to 1 if you have the `pstat' function. */
/* #undef HAVE_PSTAT */
/* Define to 1 if the PS_STRINGS thing exists. */
/* #undef HAVE_PS_STRINGS */
+/* Define to 1 if you have the `pwrite' function. */
+/* #undef HAVE_PWRITE */
+
/* Define to 1 if you have the `random' function. */
/* #undef HAVE_RANDOM */
diff --git a/src/include/port.h b/src/include/port.h
index 3a53bcf2e4b..81583d557cd 100644
--- a/src/include/port.h
+++ b/src/include/port.h
@@ -392,6 +392,23 @@ extern double rint(double x);
extern int inet_aton(const char *cp, struct in_addr *addr);
#endif
+/*
+ * Windows and older Unix don't have pread(2) and pwrite(2). We have
+ * replacement functions, but they have slightly different semantics so we'll
+ * use a name with a pg_ prefix to avoid confusion.
+ */
+#ifdef HAVE_PREAD
+#define pg_pread pread
+#else
+extern ssize_t pg_pread(int fd, void *buf, size_t nbyte, off_t offset);
+#endif
+
+#ifdef HAVE_PWRITE
+#define pg_pwrite pwrite
+#else
+extern ssize_t pg_pwrite(int fd, const void *buf, size_t nbyte, off_t offset);
+#endif
+
#if !HAVE_DECL_STRLCAT
extern size_t strlcat(char *dst, const char *src, size_t siz);
#endif
diff --git a/src/port/pread.c b/src/port/pread.c
new file mode 100644
index 00000000000..a22d949cca5
--- /dev/null
+++ b/src/port/pread.c
@@ -0,0 +1,55 @@
+/*-------------------------------------------------------------------------
+ *
+ * pread.c
+ * Implementation of pread(2) for platforms that lack one.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pread.c
+ *
+ * Note that this implementation changes the current file position, unlike
+ * the POSIX function, so we use the name pg_pread().
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#include "postgres.h"
+
+#ifdef WIN32
+#include <windows.h>
+#else
+#include <unistd.h>
+#endif
+
+ssize_t
+pg_pread(int fd, void *buf, size_t size, off_t offset)
+{
+#ifdef WIN32
+ OVERLAPPED overlapped = {0};
+ HANDLE handle;
+ DWORD result;
+
+ handle = (HANDLE) _get_osfhandle(fd);
+ if (handle == INVALID_HANDLE_VALUE)
+ {
+ errno = EBADF;
+ return -1;
+ }
+
+ overlapped.Offset = offset;
+ if (!ReadFile(handle, buf, size, &result, &overlapped))
+ {
+ _dosmaperr(GetLastError());
+ return -1;
+ }
+
+ return result;
+#else
+ if (lseek(fd, offset, SEEK_SET) < 0)
+ return -1;
+
+ return read(fd, buf, size);
+#endif
+}
diff --git a/src/port/pwrite.c b/src/port/pwrite.c
new file mode 100644
index 00000000000..f3e228cf4f0
--- /dev/null
+++ b/src/port/pwrite.c
@@ -0,0 +1,55 @@
+/*-------------------------------------------------------------------------
+ *
+ * pwrite.c
+ * Implementation of pwrite(2) for platforms that lack one.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pwrite.c
+ *
+ * Note that this implementation changes the current file position, unlike
+ * the POSIX function, so we use the name pg_pwrite().
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#include "postgres.h"
+
+#ifdef WIN32
+#include <windows.h>
+#else
+#include <unistd.h>
+#endif
+
+ssize_t
+pg_pwrite(int fd, const void *buf, size_t size, off_t offset)
+{
+#ifdef WIN32
+ OVERLAPPED overlapped = {0};
+ HANDLE handle;
+ DWORD result;
+
+ handle = (HANDLE) _get_osfhandle(fd);
+ if (handle == INVALID_HANDLE_VALUE)
+ {
+ errno = EBADF;
+ return -1;
+ }
+
+ overlapped.Offset = offset;
+ if (!WriteFile(handle, buf, size, &result, &overlapped))
+ {
+ _dosmaperr(GetLastError());
+ return -1;
+ }
+
+ return result;
+#else
+ if (lseek(fd, offset, SEEK_SET) < 0)
+ return -1;
+
+ return write(fd, buf, size);
+#endif
+}
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index 708579d9dfb..b562044fa71 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -97,6 +97,7 @@ sub mkvcbuild
srandom.c getaddrinfo.c gettimeofday.c inet_net_ntop.c kill.c open.c
erand48.c snprintf.c strlcat.c strlcpy.c dirmod.c noblock.c path.c
dirent.c dlopen.c getopt.c getopt_long.c
+ pread.c pwrite.c
pg_strong_random.c pgcheckdir.c pgmkdirp.c pgsleep.c pgstrcasecmp.c
pqsignal.c mkdtemp.c qsort.c qsort_arg.c quotes.c system.c
sprompt.c strerror.c tar.c thread.c
--
2.19.1
0002-Use-pg_pread-and-pg_pwrite-for-data-files-and-WA-v10.patch
From 7af6e74020e416b6fc555e0e2c10151b37a84a18 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Thu, 12 Jul 2018 13:14:02 +1200
Subject: [PATCH 2/3] Use pg_pread() and pg_pwrite() for data files and WAL.
Cut down on system calls by doing random I/O using offset-based OS
routines where available. Remove the code for tracking the 'virtual'
seek position. The only reason left to call FileSeek() was to get
the file's size, so provide a new function FileSize() instead.
Author: Oskari Saarenmaa, Thomas Munro
Reviewed-by: Thomas Munro, Jesper Pedersen, Tom Lane, Alvaro Herrera
Discussion: https://postgr.es/m/CAEepm=02rapCpPR3ZGF2vW=SBHSdFYO_bz_f-wwWJonmA3APgw@mail.gmail.com
Discussion: https://postgr.es/m/b8748d39-0b19-0514-a1b9-4e5a28e6a208%40gmail.com
Discussion: https://postgr.es/m/a86bd200-ebbe-d829-e3ca-0c4474b2fcb7%40ohmu.fi
---
src/backend/access/heap/rewriteheap.c | 2 +-
src/backend/access/transam/xlog.c | 30 +---
src/backend/storage/file/buffile.c | 46 +-----
src/backend/storage/file/fd.c | 205 ++++----------------------
src/backend/storage/smgr/md.c | 35 +----
src/include/storage/fd.h | 12 +-
6 files changed, 42 insertions(+), 288 deletions(-)
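Reviewer's note on the XLogWrite() hunk in this patch: with offset-based writes the loop has to advance the offset itself after a partial write, which is what the added "startoffset += written" does. A minimal standalone sketch of that pattern follows, assuming plain POSIX pwrite(); demo_write_all_at is a hypothetical name, and treating a zero-byte write as ENOSPC is an assumption of this sketch rather than something XLogWrite() does.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write nleft bytes starting at offset, retrying after partial writes and
 * EINTR.  Returns 0 on success, -1 on error with errno set.
 */
int
demo_write_all_at(int fd, const char *from, size_t nleft, off_t offset)
{
    while (nleft > 0)
    {
        ssize_t written = pwrite(fd, from, nleft, offset);

        if (written < 0)
        {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (written == 0)
        {
            /* assumption of this sketch: no progress means out of space */
            errno = ENOSPC;
            return -1;
        }
        nleft -= (size_t) written;
        from += written;
        offset += written;      /* the kernel no longer tracks this for us */
    }
    return 0;
}
```

Forgetting the last line would rewrite the same offset on every retry, a mistake the old write()-based loop could not make because the kernel advanced the position.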
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 71277889649..c5db75afa1f 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -935,7 +935,7 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
* Note that we deviate from the usual WAL coding practices here,
* check the above "Logical rewrite support" comment for reasoning.
*/
- written = FileWrite(src->vfd, waldata_start, len,
+ written = FileWrite(src->vfd, waldata_start, len, src->off,
WAIT_EVENT_LOGICAL_REWRITE_WRITE);
if (written != len)
ereport(ERROR,
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 246869bba29..7eed5866d2e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2478,18 +2478,6 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
Size nleft;
int written;
- /* Need to seek in the file? */
- if (openLogOff != startoffset)
- {
- if (lseek(openLogFile, (off_t) startoffset, SEEK_SET) < 0)
- ereport(PANIC,
- (errcode_for_file_access(),
- errmsg("could not seek in log file %s to offset %u: %m",
- XLogFileNameP(ThisTimeLineID, openLogSegNo),
- startoffset)));
- openLogOff = startoffset;
- }
-
/* OK to write the page(s) */
from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
nbytes = npages * (Size) XLOG_BLCKSZ;
@@ -2498,7 +2486,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
{
errno = 0;
pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
- written = write(openLogFile, from, nleft);
+ written = pg_pwrite(openLogFile, from, nleft, startoffset);
pgstat_report_wait_end();
if (written <= 0)
{
@@ -2513,6 +2501,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
}
nleft -= written;
from += written;
+ startoffset += written;
} while (nleft > 0);
/* Update state for write */
@@ -11821,22 +11810,9 @@ retry:
/* Read the requested page */
readOff = targetPageOff;
- if (lseek(readFile, (off_t) readOff, SEEK_SET) < 0)
- {
- char fname[MAXFNAMELEN];
- int save_errno = errno;
-
- XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
- errno = save_errno;
- ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
- (errcode_for_file_access(),
- errmsg("could not seek in log segment %s to offset %u: %m",
- fname, readOff)));
- goto next_record_is_invalid;
- }
pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
- r = read(readFile, readBuf, XLOG_BLCKSZ);
+ r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
if (r != XLOG_BLCKSZ)
{
char fname[MAXFNAMELEN];
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index e93813d9737..dd687dfe71f 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -67,12 +67,6 @@ struct BufFile
int numFiles; /* number of physical files in set */
/* all files except the last have length exactly MAX_PHYSICAL_FILESIZE */
File *files; /* palloc'd array with numFiles entries */
- off_t *offsets; /* palloc'd array with numFiles entries */
-
- /*
- * offsets[i] is the current seek position of files[i]. We use this to
- * avoid making redundant FileSeek calls.
- */
bool isInterXact; /* keep open over transactions? */
bool dirty; /* does buffer need to be written? */
@@ -116,7 +110,6 @@ makeBufFileCommon(int nfiles)
BufFile *file = (BufFile *) palloc(sizeof(BufFile));
file->numFiles = nfiles;
- file->offsets = (off_t *) palloc0(sizeof(off_t) * nfiles);
file->isInterXact = false;
file->dirty = false;
file->resowner = CurrentResourceOwner;
@@ -170,10 +163,7 @@ extendBufFile(BufFile *file)
file->files = (File *) repalloc(file->files,
(file->numFiles + 1) * sizeof(File));
- file->offsets = (off_t *) repalloc(file->offsets,
- (file->numFiles + 1) * sizeof(off_t));
file->files[file->numFiles] = pfile;
- file->offsets[file->numFiles] = 0L;
file->numFiles++;
}
@@ -396,7 +386,6 @@ BufFileClose(BufFile *file)
FileClose(file->files[i]);
/* release the buffer space */
pfree(file->files);
- pfree(file->offsets);
pfree(file);
}
@@ -422,27 +411,17 @@ BufFileLoadBuffer(BufFile *file)
file->curOffset = 0L;
}
- /*
- * May need to reposition physical file.
- */
- thisfile = file->files[file->curFile];
- if (file->curOffset != file->offsets[file->curFile])
- {
- if (FileSeek(thisfile, file->curOffset, SEEK_SET) != file->curOffset)
- return; /* seek failed, read nothing */
- file->offsets[file->curFile] = file->curOffset;
- }
-
/*
* Read whatever we can get, up to a full bufferload.
*/
+ thisfile = file->files[file->curFile];
file->nbytes = FileRead(thisfile,
file->buffer.data,
sizeof(file->buffer),
+ file->curOffset,
WAIT_EVENT_BUFFILE_READ);
if (file->nbytes < 0)
file->nbytes = 0;
- file->offsets[file->curFile] += file->nbytes;
/* we choose not to advance curOffset here */
if (file->nbytes > 0)
@@ -491,23 +470,14 @@ BufFileDumpBuffer(BufFile *file)
if ((off_t) bytestowrite > availbytes)
bytestowrite = (int) availbytes;
- /*
- * May need to reposition physical file.
- */
thisfile = file->files[file->curFile];
- if (file->curOffset != file->offsets[file->curFile])
- {
- if (FileSeek(thisfile, file->curOffset, SEEK_SET) != file->curOffset)
- return; /* seek failed, give up */
- file->offsets[file->curFile] = file->curOffset;
- }
bytestowrite = FileWrite(thisfile,
file->buffer.data + wpos,
bytestowrite,
+ file->curOffset,
WAIT_EVENT_BUFFILE_WRITE);
if (bytestowrite <= 0)
return; /* failed to write */
- file->offsets[file->curFile] += bytestowrite;
file->curOffset += bytestowrite;
wpos += bytestowrite;
@@ -803,11 +773,10 @@ BufFileSize(BufFile *file)
{
off_t lastFileSize;
- /* Get the size of the last physical file by seeking to end. */
- lastFileSize = FileSeek(file->files[file->numFiles - 1], 0, SEEK_END);
+ /* Get the size of the last physical file. */
+ lastFileSize = FileSize(file->files[file->numFiles - 1]);
if (lastFileSize < 0)
return -1;
- file->offsets[file->numFiles - 1] = lastFileSize;
return ((file->numFiles - 1) * (off_t) MAX_PHYSICAL_FILESIZE) +
lastFileSize;
@@ -849,13 +818,8 @@ BufFileAppend(BufFile *target, BufFile *source)
target->files = (File *)
repalloc(target->files, sizeof(File) * newNumFiles);
- target->offsets = (off_t *)
- repalloc(target->offsets, sizeof(off_t) * newNumFiles);
for (i = target->numFiles; i < newNumFiles; i++)
- {
target->files[i] = source->files[i - target->numFiles];
- target->offsets[i] = source->offsets[i - target->numFiles];
- }
target->numFiles = newNumFiles;
return startBlock;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 8dd51f17674..6611edbbd2c 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -16,8 +16,8 @@
* including base tables, scratch files (e.g., sort and hash spool
* files), and random calls to C library routines like system(3); it
* is quite easy to exceed system limits on the number of open files a
- * single process can have. (This is around 256 on many modern
- * operating systems, but can be as low as 32 on others.)
+ * single process can have. (This is around 1024 on many modern
+ * operating systems, but may be lower on others.)
*
* VFDs are managed as an LRU pool, with actual OS file descriptors
* being opened and closed as needed. Obviously, if a routine is
@@ -167,15 +167,6 @@ int max_safe_fds = 32; /* default if not changed */
#define FileIsNotOpen(file) (VfdCache[file].fd == VFD_CLOSED)
-/*
- * Note: a VFD's seekPos is normally always valid, but if for some reason
- * an lseek() fails, it might become set to FileUnknownPos. We can struggle
- * along without knowing the seek position in many cases, but in some places
- * we have to fail if we don't have it.
- */
-#define FileUnknownPos ((off_t) -1)
-#define FilePosIsUnknown(pos) ((pos) < 0)
-
/* these are the assigned bits in fdstate below: */
#define FD_DELETE_AT_CLOSE (1 << 0) /* T = delete when closed */
#define FD_CLOSE_AT_EOXACT (1 << 1) /* T = close at eoXact */
@@ -189,7 +180,6 @@ typedef struct vfd
File nextFree; /* link to next free VFD, if in freelist */
File lruMoreRecently; /* doubly linked recency-of-use list */
File lruLessRecently;
- off_t seekPos; /* current logical file position, or -1 */
off_t fileSize; /* current size of file (0 if not temporary) */
char *fileName; /* name of file, or NULL for unused VFD */
/* NB: fileName is malloc'd, and must be free'd when closing the VFD */
@@ -407,9 +397,7 @@ pg_fdatasync(int fd)
/*
* pg_flush_data --- advise OS that the described dirty data should be flushed
*
- * offset of 0 with nbytes 0 means that the entire file should be flushed;
- * in this case, this function may have side-effects on the file's
- * seek position!
+ * offset of 0 with nbytes 0 means that the entire file should be flushed
*/
void
pg_flush_data(int fd, off_t offset, off_t nbytes)
@@ -1029,22 +1017,6 @@ LruDelete(File file)
vfdP = &VfdCache[file];
- /*
- * Normally we should know the seek position, but if for some reason we
- * have lost track of it, try again to get it. If we still can't get it,
- * we have a problem: we will be unable to restore the file seek position
- * when and if the file is re-opened. But we can't really throw an error
- * and refuse to close the file, or activities such as transaction cleanup
- * will be broken.
- */
- if (FilePosIsUnknown(vfdP->seekPos))
- {
- vfdP->seekPos = lseek(vfdP->fd, (off_t) 0, SEEK_CUR);
- if (FilePosIsUnknown(vfdP->seekPos))
- elog(LOG, "could not seek file \"%s\" before closing: %m",
- vfdP->fileName);
- }
-
/*
* Close the file. We aren't expecting this to fail; if it does, better
* to leak the FD than to mess up our internal state.
@@ -1113,33 +1085,6 @@ LruInsert(File file)
{
++nfile;
}
-
- /*
- * Seek to the right position. We need no special case for seekPos
- * equal to FileUnknownPos, as lseek() will certainly reject that
- * (thus completing the logic noted in LruDelete() that we will fail
- * to re-open a file if we couldn't get its seek position before
- * closing).
- */
- if (vfdP->seekPos != (off_t) 0)
- {
- if (lseek(vfdP->fd, vfdP->seekPos, SEEK_SET) < 0)
- {
- /*
- * If we fail to restore the seek position, treat it like an
- * open() failure.
- */
- int save_errno = errno;
-
- elog(LOG, "could not seek file \"%s\" after re-opening: %m",
- vfdP->fileName);
- (void) close(vfdP->fd);
- vfdP->fd = VFD_CLOSED;
- --nfile;
- errno = save_errno;
- return -1;
- }
- }
}
/*
@@ -1406,7 +1351,6 @@ PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
/* Saved flags are adjusted to be OK for re-opening file */
vfdP->fileFlags = fileFlags & ~(O_CREAT | O_TRUNC | O_EXCL);
vfdP->fileMode = fileMode;
- vfdP->seekPos = 0;
vfdP->fileSize = 0;
vfdP->fdstate = 0x0;
vfdP->resowner = NULL;
@@ -1820,7 +1764,6 @@ FileClose(File file)
/*
* FilePrefetch - initiate asynchronous read of a given range of the file.
- * The logical seek position is unaffected.
*
* Currently the only implementation of this function is using posix_fadvise
* which is the simplest standardized interface that accomplishes this.
@@ -1867,10 +1810,6 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
file, VfdCache[file].fileName,
(int64) offset, (int64) nbytes));
- /*
- * Caution: do not call pg_flush_data with nbytes = 0, it could trash the
- * file's seek position. We prefer to define that as a no-op here.
- */
if (nbytes <= 0)
return;
@@ -1884,7 +1823,8 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
}
int
-FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
+FileRead(File file, char *buffer, int amount, off_t offset,
+ uint32 wait_event_info)
{
int returnCode;
Vfd *vfdP;
@@ -1893,7 +1833,7 @@ FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
DO_DB(elog(LOG, "FileRead: %d (%s) " INT64_FORMAT " %d %p",
file, VfdCache[file].fileName,
- (int64) VfdCache[file].seekPos,
+ (int64) offset,
amount, buffer));
returnCode = FileAccess(file);
@@ -1904,16 +1844,10 @@ FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
retry:
pgstat_report_wait_start(wait_event_info);
- returnCode = read(vfdP->fd, buffer, amount);
+ returnCode = pg_pread(vfdP->fd, buffer, amount, offset);
pgstat_report_wait_end();
- if (returnCode >= 0)
- {
- /* if seekPos is unknown, leave it that way */
- if (!FilePosIsUnknown(vfdP->seekPos))
- vfdP->seekPos += returnCode;
- }
- else
+ if (returnCode < 0)
{
/*
* Windows may run out of kernel buffers and return "Insufficient
@@ -1939,16 +1873,14 @@ retry:
/* OK to retry if interrupted */
if (errno == EINTR)
goto retry;
-
- /* Trouble, so assume we don't know the file position anymore */
- vfdP->seekPos = FileUnknownPos;
}
return returnCode;
}
int
-FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
+FileWrite(File file, char *buffer, int amount, off_t offset,
+ uint32 wait_event_info)
{
int returnCode;
Vfd *vfdP;
@@ -1957,7 +1889,7 @@ FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
DO_DB(elog(LOG, "FileWrite: %d (%s) " INT64_FORMAT " %d %p",
file, VfdCache[file].fileName,
- (int64) VfdCache[file].seekPos,
+ (int64) offset,
amount, buffer));
returnCode = FileAccess(file);
@@ -1976,26 +1908,13 @@ FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
*/
if (temp_file_limit >= 0 && (vfdP->fdstate & FD_TEMP_FILE_LIMIT))
{
- off_t newPos;
+ off_t past_write = offset + amount;
- /*
- * Normally we should know the seek position, but if for some reason
- * we have lost track of it, try again to get it. Here, it's fine to
- * throw an error if we still can't get it.
- */
- if (FilePosIsUnknown(vfdP->seekPos))
- {
- vfdP->seekPos = lseek(vfdP->fd, (off_t) 0, SEEK_CUR);
- if (FilePosIsUnknown(vfdP->seekPos))
- elog(ERROR, "could not seek file \"%s\": %m", vfdP->fileName);
- }
-
- newPos = vfdP->seekPos + amount;
- if (newPos > vfdP->fileSize)
+ if (past_write > vfdP->fileSize)
{
uint64 newTotal = temporary_files_size;
- newTotal += newPos - vfdP->fileSize;
+ newTotal += past_write - vfdP->fileSize;
if (newTotal > (uint64) temp_file_limit * (uint64) 1024)
ereport(ERROR,
(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
@@ -2007,7 +1926,7 @@ FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
retry:
errno = 0;
pgstat_report_wait_start(wait_event_info);
- returnCode = write(vfdP->fd, buffer, amount);
+ returnCode = pg_pwrite(VfdCache[file].fd, buffer, amount, offset);
pgstat_report_wait_end();
/* if write didn't set errno, assume problem is no disk space */
@@ -2016,10 +1935,6 @@ retry:
if (returnCode >= 0)
{
- /* if seekPos is unknown, leave it that way */
- if (!FilePosIsUnknown(vfdP->seekPos))
- vfdP->seekPos += returnCode;
-
/*
* Maintain fileSize and temporary_files_size if it's a temp file.
*
@@ -2029,12 +1944,12 @@ retry:
*/
if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
{
- off_t newPos = vfdP->seekPos;
+ off_t past_write = offset + amount;
- if (newPos > vfdP->fileSize)
+ if (past_write > vfdP->fileSize)
{
- temporary_files_size += newPos - vfdP->fileSize;
- vfdP->fileSize = newPos;
+ temporary_files_size += past_write - vfdP->fileSize;
+ vfdP->fileSize = past_write;
}
}
}
@@ -2060,9 +1975,6 @@ retry:
/* OK to retry if interrupted */
if (errno == EINTR)
goto retry;
-
- /* Trouble, so assume we don't know the file position anymore */
- vfdP->seekPos = FileUnknownPos;
}
return returnCode;
@@ -2090,92 +2002,25 @@ FileSync(File file, uint32 wait_event_info)
}
off_t
-FileSeek(File file, off_t offset, int whence)
+FileSize(File file)
{
Vfd *vfdP;
Assert(FileIsValid(file));
- DO_DB(elog(LOG, "FileSeek: %d (%s) " INT64_FORMAT " " INT64_FORMAT " %d",
- file, VfdCache[file].fileName,
- (int64) VfdCache[file].seekPos,
- (int64) offset, whence));
+ DO_DB(elog(LOG, "FileSize %d (%s)",
+ file, VfdCache[file].fileName));
vfdP = &VfdCache[file];
if (FileIsNotOpen(file))
{
- switch (whence)
- {
- case SEEK_SET:
- if (offset < 0)
- {
- errno = EINVAL;
- return (off_t) -1;
- }
- vfdP->seekPos = offset;
- break;
- case SEEK_CUR:
- if (FilePosIsUnknown(vfdP->seekPos) ||
- vfdP->seekPos + offset < 0)
- {
- errno = EINVAL;
- return (off_t) -1;
- }
- vfdP->seekPos += offset;
- break;
- case SEEK_END:
- if (FileAccess(file) < 0)
- return (off_t) -1;
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- default:
- elog(ERROR, "invalid whence: %d", whence);
- break;
- }
+ if (FileAccess(file) < 0)
+ return (off_t) -1;
}
- else
- {
- switch (whence)
- {
- case SEEK_SET:
- if (offset < 0)
- {
- errno = EINVAL;
- return (off_t) -1;
- }
- if (vfdP->seekPos != offset)
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- case SEEK_CUR:
- if (offset != 0 || FilePosIsUnknown(vfdP->seekPos))
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- case SEEK_END:
- vfdP->seekPos = lseek(vfdP->fd, offset, whence);
- break;
- default:
- elog(ERROR, "invalid whence: %d", whence);
- break;
- }
- }
-
- return vfdP->seekPos;
-}
-/*
- * XXX not actually used but here for completeness
- */
-#ifdef NOT_USED
-off_t
-FileTell(File file)
-{
- Assert(FileIsValid(file));
- DO_DB(elog(LOG, "FileTell %d (%s)",
- file, VfdCache[file].fileName));
- return VfdCache[file].seekPos;
+ return lseek(VfdCache[file].fd, 0, SEEK_END);
}
-#endif
int
FileTruncate(File file, off_t offset, uint32 wait_event_info)
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index f4374d077be..86013a5c8b2 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -522,22 +522,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- /*
- * Note: because caller usually obtained blocknum by calling mdnblocks,
- * which did a seek(SEEK_END), this seek is often redundant and will be
- * optimized away by fd.c. It's not redundant, however, if there is a
- * partial page at the end of the file. In that case we want to try to
- * overwrite the partial page with a full page. It's also not redundant
- * if bufmgr.c had to dump another buffer of the same file to make room
- * for the new page's buffer.
- */
- if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not seek to block %u in file \"%s\": %m",
- blocknum, FilePathName(v->mdfd_vfd))));
-
- if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
+ if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
{
if (nbytes < 0)
ereport(ERROR,
@@ -748,13 +733,7 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not seek to block %u in file \"%s\": %m",
- blocknum, FilePathName(v->mdfd_vfd))));
-
- nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_READ);
+ nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_READ);
TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
@@ -824,13 +803,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
- if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not seek to block %u in file \"%s\": %m",
- blocknum, FilePathName(v->mdfd_vfd))));
-
- nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_WRITE);
+ nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_WRITE);
TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
@@ -1979,7 +1952,7 @@ _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
{
off_t len;
- len = FileSeek(seg->mdfd_vfd, 0L, SEEK_END);
+ len = FileSize(seg->mdfd_vfd);
if (len < 0)
ereport(ERROR,
(errcode_for_file_access(),
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8e7c9728f4b..1289589a46b 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -15,7 +15,7 @@
/*
* calls:
*
- * File {Close, Read, Write, Seek, Tell, Sync}
+ * File {Close, Read, Write, Size, Sync}
* {Path Name Open, Allocate, Free} File
*
* These are NOT JUST RENAMINGS OF THE UNIX ROUTINES.
@@ -42,10 +42,6 @@
#include <dirent.h>
-/*
- * FileSeek uses the standard UNIX lseek(2) flags.
- */
-
typedef int File;
@@ -68,10 +64,10 @@ extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fil
extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount, uint32 wait_event_info);
-extern int FileRead(File file, char *buffer, int amount, uint32 wait_event_info);
-extern int FileWrite(File file, char *buffer, int amount, uint32 wait_event_info);
+extern int FileRead(File file, char *buffer, int amount, off_t offset, uint32 wait_event_info);
+extern int FileWrite(File file, char *buffer, int amount, off_t offset, uint32 wait_event_info);
extern int FileSync(File file, uint32 wait_event_info);
-extern off_t FileSeek(File file, off_t offset, int whence);
+extern off_t FileSize(File file);
extern int FileTruncate(File file, off_t offset, uint32 wait_event_info);
extern void FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info);
extern char *FilePathName(File file);
--
2.19.1
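For anyone reading along, here is a minimal sketch (not part of the patch) of the two things the fd.c changes above rely on: a positioned write leaves the descriptor's seek position alone, and the file size can be found with lseek(SEEK_END) without any cached position. The demo_* names are hypothetical stand-ins, and the scratch-file path under /tmp is an assumption.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open an already-unlinked scratch file, or return -1. */
static int
demo_open_scratch(void)
{
	char		path[] = "/tmp/pwrite_demo_XXXXXX";
	int			fd = mkstemp(path);

	if (fd >= 0)
		unlink(path);
	return fd;
}

/* Stand-in for the patch's FileSize(): the size is where SEEK_END lands. */
static off_t
demo_file_size(int fd)
{
	return lseek(fd, 0, SEEK_END);
}

/*
 * Stand-in for the temp-file accounting in FileWrite(): how many bytes a
 * write of 'amount' bytes at 'offset' adds to a file of 'file_size' bytes.
 */
static off_t
demo_growth(off_t file_size, off_t offset, int amount)
{
	off_t		past_write = offset + amount;

	assert(offset >= 0 && amount >= 0);
	return (past_write > file_size) ? past_write - file_size : 0;
}
```

Note that demo_file_size() does move the kernel's seek position to the end of the file; that is harmless only because nothing else depends on the position once all reads and writes carry an explicit offset.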
Thomas Munro <thomas.munro@enterprisedb.com> writes:
On Tue, Nov 6, 2018 at 6:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
What I suggest is that we *not* try to make this a completely transparent
substitute. Instead, make the functions exported by src/port/ be
"pg_pread" and "pg_pwrite", ...
OK. But since we're using this from both fd.c and xlog.c, I put that
into src/include/port.h.
LGTM. I didn't bother to run an actual test cycle, since it's not
materially different from the previous version as far as portability
is concerned.
regards, tom lane
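To make the hazard discussed above concrete, here is a rough sketch of what a non-transparent substitute could look like. It is an illustration under assumptions, not the committed src/port code: demo_pread, demo_pread_fallback and DEMO_HAVE_PREAD are hypothetical names standing in for pg_pread, its fallback and HAVE_PREAD. The point is that an lseek()+read() emulation changes the file position, which the real pread() does not.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Illustrative fallback only: emulate pread() with lseek() + read().
 * Unlike the real pread(), this moves the descriptor's file position.
 */
static ssize_t
demo_pread_fallback(int fd, void *buf, size_t size, off_t offset)
{
	assert(buf != NULL);
	if (lseek(fd, offset, SEEK_SET) < 0)
		return -1;
	return read(fd, buf, size);
}

/* DEMO_HAVE_PREAD is a hypothetical stand-in for configure's HAVE_PREAD. */
#ifdef DEMO_HAVE_PREAD
#define demo_pread pread
#else
#define demo_pread demo_pread_fallback
#endif

/*
 * Read "world" out of a scratch file through demo_pread and return the
 * file position observed afterwards, or -1 on any failure.
 */
static off_t
demo_position_after_read(void)
{
	char		path[] = "/tmp/pread_demo_XXXXXX";
	char		buf[5];
	int			fd = mkstemp(path);
	off_t		pos;

	if (fd < 0)
		return -1;
	unlink(path);
	if (write(fd, "hello world", 11) != 11 ||
		lseek(fd, 0, SEEK_SET) != 0 ||
		demo_pread(fd, buf, sizeof(buf), 6) != 5 ||
		memcmp(buf, "world", 5) != 0)
	{
		close(fd);
		return -1;
	}
	pos = lseek(fd, 0, SEEK_CUR);
	close(fd);
	return pos;
}
```

With the fallback in use the position ends up just past the bytes read; built with -DDEMO_HAVE_PREAD it stays where it was. Code that mixes plain read()/write() with the emulated call would therefore behave differently across platforms, which is the argument for giving the wrapper its own name.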
Hi,
On 11/5/18 9:10 PM, Thomas Munro wrote:
On Tue, Nov 6, 2018 at 5:07 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Please remove Tell from line 18 in fd.h. To Küssnacht with him!
Thanks, done. But what is this arrow sticking through my Mac laptop's
screen...?
On Tue, Nov 6, 2018 at 6:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
On 2018-Nov-04, Thomas Munro wrote:
Here's a patch to add Windows support by supplying
src/backend/port/win32/pread.c. Thoughts?
Hmm, so how easy is it to detect that somebody runs read/write on fds where
pread/pwrite have occurred? I guess for data files it's easy to detect
since you'd quickly end up with corrupted files, but what about other
kinds of files? I wonder if we should be worrying about using this
interface somewhere other than fd.c and forgetting about the limitation.
Yeah. I think the patch as presented is OK; it uses pread/pwrite only
inside fd.c, which is a reasonably non-leaky abstraction. But there's
definitely a hazard of somebody submitting a patch that depends on
using pread/pwrite elsewhere, and then that maybe not working.
What I suggest is that we *not* try to make this a completely transparent
substitute. Instead, make the functions exported by src/port/ be
"pg_pread" and "pg_pwrite", and inside fd.c we'd write something like
#ifdef HAVE_PREAD
#define pg_pread pread
#endif
and then refer to pg_pread/pg_pwrite in the body of that file. That
way, if someone refers to pread and expects standard functionality
from it, they'll get a failure on platforms not supporting it.
OK. But since we're using this from both fd.c and xlog.c, I put that
into src/include/port.h.
FWIW, I tested the given patches on HPUX 10.20; they compiled cleanly
and pass the core regression tests.
Passes check-world, and includes the feedback on this thread.
New status: Ready for Committer
Best regards,
Jesper
On Wed, Nov 7, 2018 at 4:42 AM Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
Passes check-world, and includes the feedback on this thread.
New status: Ready for Committer
Thanks! Pushed. I'll keep an eye on the build farm to see if
anything breaks on Cygwin or some other frankenOS.
--
Thomas Munro
http://www.enterprisedb.com
Hi Thomas,
On 11/6/18 4:04 PM, Thomas Munro wrote:
On Wed, Nov 7, 2018 at 4:42 AM Jesper Pedersen
Thanks! Pushed. I'll keep an eye on the build farm to see if
anything breaks on Cygwin or some other frankenOS.
There is [1] on Andres' skink setup. Looking.
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2018-11-07%2001%3A01%3A01
Best regards,
Jesper
On 11/7/18 7:26 AM, Jesper Pedersen wrote:
Hi Thomas,
On 11/6/18 4:04 PM, Thomas Munro wrote:
On Wed, Nov 7, 2018 at 4:42 AM Jesper Pedersen
Thanks! Pushed. I'll keep an eye on the build farm to see if
anything breaks on Cygwin or some other frankenOS.
There is [1] on Andres' skink setup. Looking.
[1]
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2018-11-07%2001%3A01%3A01
And lousyjack, which uses a slightly different way of calling valgrind,
and thus got past initdb, found a bunch more:
<https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2018-11-07%2001%3A33%3A01>
cheers
andrew
--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 11/7/18 7:26 AM, Jesper Pedersen wrote:
On 11/6/18 4:04 PM, Thomas Munro wrote:
On Wed, Nov 7, 2018 at 4:42 AM Jesper Pedersen
Thanks! Pushed. I'll keep an eye on the build farm to see if
anything breaks on Cygwin or some other frankenOS.
There is [1] on Andres' skink setup. Looking.
Attached is a reproducer.
Adding the memset() command for the page makes valgrind happy.
Thoughts on how to proceed with this? The report in [1] shows that
there are a number of call sites where the page(s) aren't fully initialized.
[1] /messages/by-id/3fe1e38a-fb70-6260-9300-ce67ede21c32@redhat.com
Best regards,
Jesper
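The attachment itself is not reproduced here. As a rough idea of the kind of reproducer being described, and explicitly not Jesper's actual file, the sketch below writes a page-sized heap buffer with pwrite() either with or without a memset() first. DEMO_PAGE_SIZE, demo_write_page and the /tmp path are assumptions made for illustration.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define DEMO_PAGE_SIZE 8192		/* hypothetical; mirrors a typical BLCKSZ */

/*
 * Write one page to a scratch file with pwrite().  Only the first 16 bytes
 * are always filled in; the rest is zeroed only when zero_first is nonzero,
 * otherwise it is left exactly as malloc() returned it.
 * Returns pwrite()'s result, or -1 if setup fails.
 */
static ssize_t
demo_write_page(int zero_first)
{
	char		path[] = "/tmp/page_demo_XXXXXX";
	int			fd = mkstemp(path);
	char	   *page;
	ssize_t		rc;

	if (fd < 0)
		return -1;
	unlink(path);

	page = malloc(DEMO_PAGE_SIZE);
	if (page == NULL)
	{
		close(fd);
		return -1;
	}
	if (zero_first)
		memset(page, 0, DEMO_PAGE_SIZE);
	memcpy(page, "demo page header", 16);

	rc = pwrite(fd, page, DEMO_PAGE_SIZE, 0);

	free(page);
	close(fd);
	return rc;
}
```

Both variants write the same number of bytes, so the difference only shows up under a checker: run under valgrind's memcheck, the call without the memset() would be expected to draw a "Syscall param pwrite64(buf) points to uninitialised byte(s)" style complaint, while the zeroed variant should stay quiet.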
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
On 11/7/18 7:26 AM, Jesper Pedersen wrote:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2018-11-07%2001%3A01%3A01
And lousyjack, which uses a slightly different way of calling valgrind,
and thus got past initdb, found a bunch more:
<https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2018-11-07%2001%3A33%3A01>
I'm confused by this. Surely the pwrite-based code is writing exactly the
same data as before. Do we have to conclude that valgrind is complaining
about passing uninitialized data to pwrite() when it did not complain
about exactly the same thing for write()?
[ looks ... ] No, what we have to conclude is that the write-related
suppressions in src/tools/valgrind.supp need to be replaced or augmented
with pwrite-related ones.
regards, tom lane
On 11/7/18 9:30 AM, Tom Lane wrote:
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
On 11/7/18 7:26 AM, Jesper Pedersen wrote:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2018-11-07%2001%3A01%3A01
And lousyjack, which uses a slightly different way of calling valgrind,
and thus got past initdb, found a bunch more:
<https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2018-11-07%2001%3A33%3A01>
I'm confused by this. Surely the pwrite-based code is writing exactly the
same data as before. Do we have to conclude that valgrind is complaining
about passing uninitialized data to pwrite() when it did not complain
about exactly the same thing for write()?
[ looks ... ] No, what we have to conclude is that the write-related
suppressions in src/tools/valgrind.supp need to be replaced or augmented
with pwrite-related ones.
Yeah. I just trawled through the lousyjack logs and it looks like all
the cases it reported could be handled by:
{
padding_XLogRecData_pwrite
Memcheck:Param
pwrite64(buf)
...
fun:XLogWrite
}
cheers
andrew
--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Tom,
On 11/7/18 9:30 AM, Tom Lane wrote:
I'm confused by this. Surely the pwrite-based code is writing exactly the
same data as before. Do we have to conclude that valgrind is complaining
about passing uninitialized data to pwrite() when it did not complain
about exactly the same thing for write()?
[ looks ... ] No, what we have to conclude is that the write-related
suppressions in src/tools/valgrind.supp need to be replaced or augmented
with pwrite-related ones.
The attached patch fixes this for me.
Unfortunately pwrite* doesn't work for the pwrite64(buf) line.
Best regards,
Jesper
Attachment: valgrind-supp.patch (text/x-patch)
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index af03051260..2f3b602773 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -52,9 +52,10 @@
{
padding_XLogRecData_write
Memcheck:Param
- write(buf)
+ pwrite64(buf)
- ...
+ ...
+ fun:pwrite
fun:XLogWrite
}
On 11/7/18 10:05 AM, Jesper Pedersen wrote:
Hi Tom,
On 11/7/18 9:30 AM, Tom Lane wrote:
I'm confused by this. Surely the pwrite-based code is writing
exactly the
same data as before. Do we have to conclude that valgrind is
complaining
about passing uninitialized data to pwrite() when it did not complain
about exactly the same thing for write()?
[ looks ... ] No, what we have to conclude is that the write-related
suppressions in src/tools/valgrind.supp need to be replaced or augmented
with pwrite-related ones.
The attached patch fixes this for me.
Unfortunately pwrite* doesn't work for the pwrite64(buf) line.
Works for me. If there's no objection I will commit this.
cheers
andrew
--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Nov 8, 2018 at 4:27 AM Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:
On 11/7/18 10:05 AM, Jesper Pedersen wrote:
On 11/7/18 9:30 AM, Tom Lane wrote:
I'm confused by this. Surely the pwrite-based code is writing
exactly the
same data as before. Do we have to conclude that valgrind is
complaining
about passing uninitialized data to pwrite() when it did not complain
about exactly the same thing for write()?
[ looks ... ] No, what we have to conclude is that the write-related
suppressions in src/tools/valgrind.supp need to be replaced or augmented
with pwrite-related ones.
The attached patch fixes this for me.
Unfortunately pwrite* doesn't work for the pwrite64(buf) line.
Works for me. If there's no objection I will commit this.
Thanks for adjusting that. I suppose I would have known about this if
cfbot checked every patch with valgrind, which I might look into.
I'm a little confused about how an uninitialised value originating in
an OID list finishes up in an xlog buffer, considering that OIDs don't
have padding.
--
Thomas Munro
http://www.enterprisedb.com