Use pread and pwrite instead of lseek + write and read

Started by Oskari Saarenmaaover 9 years ago15 messages

os@ohmu.fi

over 9 years ago

1 attachment(s)

Hi,

The Uber blog post, among other things, pointed out that PG uses lseek +
read instead of pread. I didn't see any discussion around that and my
Google searches didn't find any posts about pread / pwrite for the past
10 years.

With that plus the "C++ port" thread in mind, I was wondering if it's
time to see if we could do better by just utilizing newer C and POSIX
features.

The attached patch replaces FileWrite and FileRead with FileWriteAt and
FileReadAt and removes most FileSeek calls. FileSeek is still around so
we can find the end of a file, but it's not used for anything else.

On my laptop a simple pgbench run (scale 100, 15 minutes) shows a 1.5%
performance improvement. A 1.5% performance improvement is small but
measurable - and IMV more importantly it allows us to drop more than 100
lines of backwards (compatible?) code; maybe we could start targeting
more recent platforms in v10?

Obviously this patch needs some more work before it could be merged, and
we probably still need a fallback for some platforms without pread and
pwrite (afaik Windows doesn't implement them.)

/ Oskari

Attachments:

use-pread-pwrite.patchapplication/x-patch; name=use-pread-pwrite.patchDownload

diff --git src/backend/access/heap/rewriteheap.c src/backend/access/heap/rewriteheap.c
index f9ce986..6dc5df3 100644
--- src/backend/access/heap/rewriteheap.c
+++ src/backend/access/heap/rewriteheap.c
@@ -918,7 +918,7 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
 		 * Note that we deviate from the usual WAL coding practices here,
 		 * check the above "Logical rewrite support" comment for reasoning.
 		 */
-		written = FileWrite(src->vfd, waldata_start, len);
+		written = FileWriteAt(src->vfd, waldata_start, len, src->off);
 		if (written != len)
 			ereport(ERROR,
 					(errcode_for_file_access(),
diff --git src/backend/storage/file/buffile.c src/backend/storage/file/buffile.c
index 042be79..39482fa 100644
--- src/backend/storage/file/buffile.c
+++ src/backend/storage/file/buffile.c
@@ -60,12 +60,6 @@ struct BufFile
 	int			numFiles;		/* number of physical files in set */
 	/* all files except the last have length exactly MAX_PHYSICAL_FILESIZE */
 	File	   *files;			/* palloc'd array with numFiles entries */
-	off_t	   *offsets;		/* palloc'd array with numFiles entries */
-
-	/*
-	 * offsets[i] is the current seek position of files[i].  We use this to
-	 * avoid making redundant FileSeek calls.
-	 */
 
 	bool		isTemp;			/* can only add files if this is TRUE */
 	bool		isInterXact;	/* keep open over transactions? */
@@ -108,8 +102,6 @@ makeBufFile(File firstfile)
 	file->numFiles = 1;
 	file->files = (File *) palloc(sizeof(File));
 	file->files[0] = firstfile;
-	file->offsets = (off_t *) palloc(sizeof(off_t));
-	file->offsets[0] = 0L;
 	file->isTemp = false;
 	file->isInterXact = false;
 	file->dirty = false;
@@ -143,10 +135,7 @@ extendBufFile(BufFile *file)
 
 	file->files = (File *) repalloc(file->files,
 									(file->numFiles + 1) * sizeof(File));
-	file->offsets = (off_t *) repalloc(file->offsets,
-									   (file->numFiles + 1) * sizeof(off_t));
 	file->files[file->numFiles] = pfile;
-	file->offsets[file->numFiles] = 0L;
 	file->numFiles++;
 }
 
@@ -210,7 +199,6 @@ BufFileClose(BufFile *file)
 		FileClose(file->files[i]);
 	/* release the buffer space */
 	pfree(file->files);
-	pfree(file->offsets);
 	pfree(file);
 }
 
@@ -241,23 +229,12 @@ BufFileLoadBuffer(BufFile *file)
 	}
 
 	/*
-	 * May need to reposition physical file.
-	 */
-	thisfile = file->files[file->curFile];
-	if (file->curOffset != file->offsets[file->curFile])
-	{
-		if (FileSeek(thisfile, file->curOffset, SEEK_SET) != file->curOffset)
-			return;				/* seek failed, read nothing */
-		file->offsets[file->curFile] = file->curOffset;
-	}
-
-	/*
 	 * Read whatever we can get, up to a full bufferload.
 	 */
-	file->nbytes = FileRead(thisfile, file->buffer, sizeof(file->buffer));
+	thisfile = file->files[file->curFile];
+	file->nbytes = FileReadAt(thisfile, file->buffer, sizeof(file->buffer), file->curOffset);
 	if (file->nbytes < 0)
 		file->nbytes = 0;
-	file->offsets[file->curFile] += file->nbytes;
 	/* we choose not to advance curOffset here */
 
 	pgBufferUsage.temp_blks_read++;
@@ -307,20 +284,10 @@ BufFileDumpBuffer(BufFile *file)
 				bytestowrite = (int) availbytes;
 		}
 
-		/*
-		 * May need to reposition physical file.
-		 */
 		thisfile = file->files[file->curFile];
-		if (file->curOffset != file->offsets[file->curFile])
-		{
-			if (FileSeek(thisfile, file->curOffset, SEEK_SET) != file->curOffset)
-				return;			/* seek failed, give up */
-			file->offsets[file->curFile] = file->curOffset;
-		}
-		bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite);
+		bytestowrite = FileWriteAt(thisfile, file->buffer + wpos, bytestowrite, file->curOffset);
 		if (bytestowrite <= 0)
 			return;				/* failed to write */
-		file->offsets[file->curFile] += bytestowrite;
 		file->curOffset += bytestowrite;
 		wpos += bytestowrite;
 
diff --git src/backend/storage/file/fd.c src/backend/storage/file/fd.c
index 03143f1..795cb12 100644
--- src/backend/storage/file/fd.c
+++ src/backend/storage/file/fd.c
@@ -16,8 +16,8 @@
  * including base tables, scratch files (e.g., sort and hash spool
  * files), and random calls to C library routines like system(3); it
  * is quite easy to exceed system limits on the number of open files a
- * single process can have.  (This is around 256 on many modern
- * operating systems, but can be as low as 32 on others.)
+ * single process can have.  (This is around 1024 on many modern
+ * operating systems, but may be lower on others.)
  *
  * VFDs are managed as an LRU pool, with actual OS file descriptors
  * being opened and closed as needed.  Obviously, if a routine is
@@ -174,7 +174,6 @@ typedef struct vfd
 	File		nextFree;		/* link to next free VFD, if in freelist */
 	File		lruMoreRecently;	/* doubly linked recency-of-use list */
 	File		lruLessRecently;
-	off_t		seekPos;		/* current logical file position */
 	off_t		fileSize;		/* current size of file (0 if not temporary) */
 	char	   *fileName;		/* name of file, or NULL for unused VFD */
 	/* NB: fileName is malloc'd, and must be free'd when closing the VFD */
@@ -970,10 +969,6 @@ LruDelete(File file)
 	/* delete the vfd record from the LRU ring */
 	Delete(file);
 
-	/* save the seek position */
-	vfdP->seekPos = lseek(vfdP->fd, (off_t) 0, SEEK_CUR);
-	Assert(vfdP->seekPos != (off_t) -1);
-
 	/* close the file */
 	if (close(vfdP->fd))
 		elog(ERROR, "could not close file \"%s\": %m", vfdP->fileName);
@@ -1038,15 +1033,6 @@ LruInsert(File file)
 			DO_DB(elog(LOG, "RE_OPEN SUCCESS"));
 			++nfile;
 		}
-
-		/* seek to the right position */
-		if (vfdP->seekPos != (off_t) 0)
-		{
-			off_t returnValue PG_USED_FOR_ASSERTS_ONLY;
-
-			returnValue = lseek(vfdP->fd, vfdP->seekPos, SEEK_SET);
-			Assert(returnValue != (off_t) -1);
-		}
 	}
 
 	/*
@@ -1270,7 +1256,6 @@ PathNameOpenFile(FileName fileName, int fileFlags, int fileMode)
 	/* Saved flags are adjusted to be OK for re-opening file */
 	vfdP->fileFlags = fileFlags & ~(O_CREAT | O_TRUNC | O_EXCL);
 	vfdP->fileMode = fileMode;
-	vfdP->seekPos = 0;
 	vfdP->fileSize = 0;
 	vfdP->fdstate = 0x0;
 	vfdP->resowner = NULL;
@@ -1563,7 +1548,7 @@ FileWriteback(File file, off_t offset, off_t nbytes)
 }
 
 int
-FileRead(File file, char *buffer, int amount)
+FileReadAt(File file, char *buffer, int amount, off_t offset)
 {
 	int			returnCode;
 
@@ -1571,7 +1556,7 @@ FileRead(File file, char *buffer, int amount)
 
 	DO_DB(elog(LOG, "FileRead: %d (%s) " INT64_FORMAT " %d %p",
 			   file, VfdCache[file].fileName,
-			   (int64) VfdCache[file].seekPos,
+			   (int64) offset,
 			   amount, buffer));
 
 	returnCode = FileAccess(file);
@@ -1579,11 +1564,9 @@ FileRead(File file, char *buffer, int amount)
 		return returnCode;
 
 retry:
-	returnCode = read(VfdCache[file].fd, buffer, amount);
+	returnCode = pread(VfdCache[file].fd, buffer, amount, offset);
 
-	if (returnCode >= 0)
-		VfdCache[file].seekPos += returnCode;
-	else
+	if (returnCode < 0)
 	{
 		/*
 		 * Windows may run out of kernel buffers and return "Insufficient
@@ -1609,16 +1592,13 @@ retry:
 		/* OK to retry if interrupted */
 		if (errno == EINTR)
 			goto retry;
-
-		/* Trouble, so assume we don't know the file position anymore */
-		VfdCache[file].seekPos = FileUnknownPos;
 	}
 
 	return returnCode;
 }
 
 int
-FileWrite(File file, char *buffer, int amount)
+FileWriteAt(File file, char *buffer, int amount, off_t offset)
 {
 	int			returnCode;
 
@@ -1626,7 +1606,7 @@ FileWrite(File file, char *buffer, int amount)
 
 	DO_DB(elog(LOG, "FileWrite: %d (%s) " INT64_FORMAT " %d %p",
 			   file, VfdCache[file].fileName,
-			   (int64) VfdCache[file].seekPos,
+			   (int64) offset,
 			   amount, buffer));
 
 	returnCode = FileAccess(file);
@@ -1643,7 +1623,7 @@ FileWrite(File file, char *buffer, int amount)
 	 */
 	if (temp_file_limit >= 0 && (VfdCache[file].fdstate & FD_TEMPORARY))
 	{
-		off_t		newPos = VfdCache[file].seekPos + amount;
+		off_t		newPos = offset + amount;
 
 		if (newPos > VfdCache[file].fileSize)
 		{
@@ -1660,7 +1640,7 @@ FileWrite(File file, char *buffer, int amount)
 
 retry:
 	errno = 0;
-	returnCode = write(VfdCache[file].fd, buffer, amount);
+	returnCode = pwrite(VfdCache[file].fd, buffer, amount, offset);
 
 	/* if write didn't set errno, assume problem is no disk space */
 	if (returnCode != amount && errno == 0)
@@ -1668,12 +1648,10 @@ retry:
 
 	if (returnCode >= 0)
 	{
-		VfdCache[file].seekPos += returnCode;
-
 		/* maintain fileSize and temporary_files_size if it's a temp file */
 		if (VfdCache[file].fdstate & FD_TEMPORARY)
 		{
-			off_t		newPos = VfdCache[file].seekPos;
+			off_t		newPos = offset + returnCode;
 
 			if (newPos > VfdCache[file].fileSize)
 			{
@@ -1685,7 +1663,7 @@ retry:
 	else
 	{
 		/*
-		 * See comments in FileRead()
+		 * See comments in FileReadAt()
 		 */
 #ifdef WIN32
 		DWORD		error = GetLastError();
@@ -1704,9 +1682,6 @@ retry:
 		/* OK to retry if interrupted */
 		if (errno == EINTR)
 			goto retry;
-
-		/* Trouble, so assume we don't know the file position anymore */
-		VfdCache[file].seekPos = FileUnknownPos;
 	}
 
 	return returnCode;
@@ -1736,78 +1711,33 @@ FileSeek(File file, off_t offset, int whence)
 
 	Assert(FileIsValid(file));
 
-	DO_DB(elog(LOG, "FileSeek: %d (%s) " INT64_FORMAT " " INT64_FORMAT " %d",
+	DO_DB(elog(LOG, "FileSeek: %d (%s) " INT64_FORMAT " %d",
 			   file, VfdCache[file].fileName,
-			   (int64) VfdCache[file].seekPos,
 			   (int64) offset, whence));
 
-	if (FileIsNotOpen(file))
+	switch (whence)
 	{
-		switch (whence)
-		{
-			case SEEK_SET:
-				if (offset < 0)
-					elog(ERROR, "invalid seek offset: " INT64_FORMAT,
-						 (int64) offset);
-				VfdCache[file].seekPos = offset;
-				break;
-			case SEEK_CUR:
-				VfdCache[file].seekPos += offset;
-				break;
-			case SEEK_END:
+		case SEEK_SET:
+			if (offset < 0)
+				elog(ERROR, "invalid seek offset: " INT64_FORMAT,
+					 (int64) offset);
+			return offset;
+
+		case SEEK_END:
+			if (FileIsNotOpen(file))
+			{
 				returnCode = FileAccess(file);
 				if (returnCode < 0)
 					return returnCode;
-				VfdCache[file].seekPos = lseek(VfdCache[file].fd,
-											   offset, whence);
-				break;
-			default:
-				elog(ERROR, "invalid whence: %d", whence);
-				break;
-		}
-	}
-	else
-	{
-		switch (whence)
-		{
-			case SEEK_SET:
-				if (offset < 0)
-					elog(ERROR, "invalid seek offset: " INT64_FORMAT,
-						 (int64) offset);
-				if (VfdCache[file].seekPos != offset)
-					VfdCache[file].seekPos = lseek(VfdCache[file].fd,
-												   offset, whence);
-				break;
-			case SEEK_CUR:
-				if (offset != 0 || VfdCache[file].seekPos == FileUnknownPos)
-					VfdCache[file].seekPos = lseek(VfdCache[file].fd,
-												   offset, whence);
-				break;
-			case SEEK_END:
-				VfdCache[file].seekPos = lseek(VfdCache[file].fd,
-											   offset, whence);
-				break;
-			default:
-				elog(ERROR, "invalid whence: %d", whence);
-				break;
-		}
+			}
+			return lseek(VfdCache[file].fd, offset, whence);
+
+		default:
+			elog(ERROR, "invalid whence: %d", whence);
 	}
-	return VfdCache[file].seekPos;
-}
 
-/*
- * XXX not actually used but here for completeness
- */
-#ifdef NOT_USED
-off_t
-FileTell(File file)
-{
-	Assert(FileIsValid(file));
-	DO_DB(elog(LOG, "FileTell %d (%s)",
-			   file, VfdCache[file].fileName));
-	return VfdCache[file].seekPos;
+	return -1;
 }
-#endif
 
 int
 FileTruncate(File file, off_t offset)
diff --git src/backend/storage/smgr/md.c src/backend/storage/smgr/md.c
index f329d15..5fc9a6b 100644
--- src/backend/storage/smgr/md.c
+++ src/backend/storage/smgr/md.c
@@ -527,22 +527,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 
 	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
 
-	/*
-	 * Note: because caller usually obtained blocknum by calling mdnblocks,
-	 * which did a seek(SEEK_END), this seek is often redundant and will be
-	 * optimized away by fd.c.  It's not redundant, however, if there is a
-	 * partial page at the end of the file. In that case we want to try to
-	 * overwrite the partial page with a full page.  It's also not redundant
-	 * if bufmgr.c had to dump another buffer of the same file to make room
-	 * for the new page's buffer.
-	 */
-	if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not seek to block %u in file \"%s\": %m",
-						blocknum, FilePathName(v->mdfd_vfd))));
-
-	if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ)) != BLCKSZ)
+	if ((nbytes = FileWriteAt(v->mdfd_vfd, buffer, BLCKSZ, seekpos)) != BLCKSZ)
 	{
 		if (nbytes < 0)
 			ereport(ERROR,
@@ -749,13 +734,7 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 
 	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
 
-	if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not seek to block %u in file \"%s\": %m",
-						blocknum, FilePathName(v->mdfd_vfd))));
-
-	nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ);
+	nbytes = FileReadAt(v->mdfd_vfd, buffer, BLCKSZ, seekpos);
 
 	TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
 									   reln->smgr_rnode.node.spcNode,
@@ -825,13 +804,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 
 	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
 
-	if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not seek to block %u in file \"%s\": %m",
-						blocknum, FilePathName(v->mdfd_vfd))));
-
-	nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
+	nbytes = FileWriteAt(v->mdfd_vfd, buffer, BLCKSZ, seekpos);
 
 	TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
 										reln->smgr_rnode.node.spcNode,
diff --git src/include/storage/fd.h src/include/storage/fd.h
index cbc2224..0b6b846 100644
--- src/include/storage/fd.h
+++ src/include/storage/fd.h
@@ -69,8 +69,8 @@ extern File PathNameOpenFile(FileName fileName, int fileFlags, int fileMode);
 extern File OpenTemporaryFile(bool interXact);
 extern void FileClose(File file);
 extern int	FilePrefetch(File file, off_t offset, int amount);
-extern int	FileRead(File file, char *buffer, int amount);
-extern int	FileWrite(File file, char *buffer, int amount);
+extern int	FileReadAt(File file, char *buffer, int amount, off_t offset);
+extern int	FileWriteAt(File file, char *buffer, int amount, off_t offset);
 extern int	FileSync(File file);
 extern off_t FileSeek(File file, off_t offset, int whence);
 extern int	FileTruncate(File file, off_t offset);

Victor Wagner

vitus@wagner.pp.ru

over 9 years ago

In reply to: Oskari Saarenmaa (#1)

Re: Use pread and pwrite instead of lseek + write and read

On Wed, 17 Aug 2016 10:58:09 +0300
Oskari Saarenmaa <os@ohmu.fi> wrote:

The attached patch replaces FileWrite and FileRead with FileWriteAt
and FileReadAt and removes most FileSeek calls. FileSeek is still
around so we can find the end of a file, but it's not used for
anything else.

It seems that configure test for availability of pread/pwrite functions
and corresponding #define is needed.

I don't think that all platforms, supported by PostgreSQL support this
API. Especially, I cannot find any mention of pread/pwrite in the Win32
except this thread on stackoverflow:

http://stackoverflow.com/questions/766477/are-there-equivalents-to-pread-on-different-platforms

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Magnus Hagander

magnus@hagander.net

over 9 years ago

In reply to: Victor Wagner (#2)

Re: Use pread and pwrite instead of lseek + write and read

On Wed, Aug 17, 2016 at 11:34 AM, Victor Wagner <vitus@wagner.pp.ru> wrote:

On Wed, 17 Aug 2016 10:58:09 +0300
Oskari Saarenmaa <os@ohmu.fi> wrote:

The attached patch replaces FileWrite and FileRead with FileWriteAt
and FileReadAt and removes most FileSeek calls. FileSeek is still
around so we can find the end of a file, but it's not used for
anything else.

It seems that configure test for availability of pread/pwrite functions
and corresponding #define is needed.

I don't think that all platforms, supported by PostgreSQL support this
API. Especially, I cannot find any mention of pread/pwrite in the Win32
except this thread on stackoverflow:

Yeah, Windows does not have those API calls, but it shouldn't be rocket
science to write a wrapper for it. The standard windows APIs can do the
same thing -- but they'll need access to the HANDLE for the file and not
the posix file descriptor.

It also has things like ReadFileScatter() (
https://msdn.microsoft.com/en-us/library/windows/desktop/aa365469(v=vs.85).aspx)
which is not the same, but might also be interesting as a future
improvement. That's clearly something different though, and out of scope
for this one. But IIRC that functionality was actually added for the sake
of SQLServer back in the days.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

Noname

ilmari@ilmari.org

over 9 years ago

In reply to: Magnus Hagander (#3)

Re: Use pread and pwrite instead of lseek + write and read

Magnus Hagander <magnus@hagander.net> writes:

[pread/pwrite]

Yeah, Windows does not have those API calls, but it shouldn't be rocket
science to write a wrapper for it. The standard windows APIs can do the
same thing -- but they'll need access to the HANDLE for the file and not
the posix file descriptor.

It also has things like ReadFileScatter() (
https://msdn.microsoft.com/en-us/library/windows/desktop/aa365469(v=vs.85).aspx)
which is not the same, but might also be interesting as a future
improvement.

That looks a lot like POSIX readv()
(http://pubs.opengroup.org/onlinepubs/9699919799/functions/readv.html),
and as far as I can tell it has the same issue as it in that it doesn't
take an offset argument, but requires you to seek first.

Linux and modern BSDs however have preadv()
(http://manpages.ubuntu.com/manpages/xenial/en/man2/preadv.2.html),
which takes an offset and an iovec array. I don't know if Windows and
other platforms have anything similar.

--
- Twitter seems more influential [than blogs] in the 'gets reported in
the mainstream press' sense at least. - Matt McLeod
- That'd be because the content of a tweet is easier to condense down
to a mainstream media article. - Calle Dybedahl

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Magnus Hagander

magnus@hagander.net

over 9 years ago

In reply to: Noname (#4)

Re: Use pread and pwrite instead of lseek + write and read

On Wed, Aug 17, 2016 at 12:41 PM, Dagfinn Ilmari Mannsåker <
ilmari@ilmari.org> wrote:

Magnus Hagander <magnus@hagander.net> writes:

[pread/pwrite]

Yeah, Windows does not have those API calls, but it shouldn't be rocket
science to write a wrapper for it. The standard windows APIs can do the
same thing -- but they'll need access to the HANDLE for the file and not
the posix file descriptor.

Just to be clear, I'm referring to the standard ReadFile() and WriteFile()
APIs here.

It also has things like ReadFileScatter() (
https://msdn.microsoft.com/en-us/library/windows/desktop/

aa365469(v=vs.85).aspx)

which is not the same, but might also be interesting as a future
improvement.

That looks a lot like POSIX readv()
(http://pubs.opengroup.org/onlinepubs/9699919799/functions/readv.html),
and as far as I can tell it has the same issue as it in that it doesn't
take an offset argument, but requires you to seek first.

Ah yeah, for some reason I keep getting readv() and pread(). Which solve a
different problem (see below about that function not having the same issues
on windows -- but it's still not the problem we're trying to solve here)

Linux and modern BSDs however have preadv()
(http://manpages.ubuntu.com/manpages/xenial/en/man2/preadv.2.html),
which takes an offset and an iovec array. I don't know if Windows and
other platforms have anything similar.

ReadFileScatter() can take the offset from OVERLAPPED (same as regular
ReadFile) and an array of FILE_SEGMENT_ELEMENT, same as preadv(). But the
APIs start looking more different the further down the rabbithole you go, I
think. But the capability is definitely there (and has been for ages so is
in all supported version).

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

Victor Wagner

vitus@wagner.pp.ru

over 9 years ago

In reply to: Magnus Hagander (#3)

Re: Use pread and pwrite instead of lseek + write and read

On Wed, 17 Aug 2016 12:17:35 +0200
Magnus Hagander <magnus@hagander.net> wrote:

On Wed, Aug 17, 2016 at 11:34 AM, Victor Wagner <vitus@wagner.pp.ru>
wrote:

I don't think that all platforms, supported by PostgreSQL support
this API. Especially, I cannot find any mention of pread/pwrite in
the Win32 except this thread on stackoverflow:

Yeah, Windows does not have those API calls, but it shouldn't be
rocket science to write a wrapper for it. The standard windows APIs
can do the same thing -- but they'll need access to the HANDLE for
the file and not the posix file descriptor.

There is _get_osfhandle function, which allows to find out Windows
HANDLE associated with posix file descriptor.

Really my question was - someone should write these wrappers into
src/port and add corresponding test to the configure and/or CMakefile
for this patch to be complete.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tom Lane

tgl@sss.pgh.pa.us

over 9 years ago

In reply to: Oskari Saarenmaa (#1)

Re: Use pread and pwrite instead of lseek + write and read

Oskari Saarenmaa <os@ohmu.fi> writes:

On my laptop a simple pgbench run (scale 100, 15 minutes) shows a 1.5%
performance improvement.

I would have hoped for a lot better result before anyone would propose
that we should deal with all the portability issues this'll create.

A 1.5% performance improvement is small but
measurable - and IMV more importantly it allows us to drop more than 100
lines of backwards (compatible?) code; maybe we could start targeting
more recent platforms in v10?

That's basically nonsense: we'll end up adding way more than that to
deal with platforms that haven't got these APIs.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Oskari Saarenmaa

os@ohmu.fi

over 9 years ago

In reply to: Tom Lane (#7)

1 attachment(s)

Re: Use pread and pwrite instead of lseek + write and read

17.08.2016, 16:40, Tom Lane kirjoitti:

Oskari Saarenmaa <os@ohmu.fi> writes:

On my laptop a simple pgbench run (scale 100, 15 minutes) shows a 1.5%
performance improvement.

I would have hoped for a lot better result before anyone would propose
that we should deal with all the portability issues this'll create.

AFAICT pread and pwrite are available on pretty much all operating
systems released in 2000s; it was added to Linux in 1997. Windows and
HP-UX 10.20 don't have it, but we can just simulate it using lseek +
read/write there without adding too much code.

A 1.5% performance improvement is small but
measurable - and IMV more importantly it allows us to drop more than 100
lines of backwards (compatible?) code; maybe we could start targeting
more recent platforms in v10?

That's basically nonsense: we'll end up adding way more than that to
deal with platforms that haven't got these APIs.

Attached an updated patch that adds a configure check and uses
lseek+read/write instead pread/pwrite when the latter aren't available.
The previous code ended up seeking anyway in most of the cases and
pgbench shows no performance regression on my Linux box.

8 files changed, 54 insertions(+), 168 deletions(-)

/ Oskari

Attachments:

use-pread-pwrite-v2.patchapplication/x-patch; name=use-pread-pwrite-v2.patchDownload

diff --git configure configure
index 45c8eef..4d50b52 100755
--- configure
+++ configure
@@ -12473,7 +12473,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range towlower utime utimes wcstombs wcstombs_l
+for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pread pstat pthread_is_threaded_np pwrite readlink setproctitle setsid shm_open symlink sync_file_range towlower utime utimes wcstombs wcstombs_l
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git configure.in configure.in
index c878b4e..5eddaca 100644
--- configure.in
+++ configure.in
@@ -1446,7 +1446,7 @@ PGAC_FUNC_WCSTOMBS_L
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-AC_CHECK_FUNCS([cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range towlower utime utimes wcstombs wcstombs_l])
+AC_CHECK_FUNCS([cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pread pstat pthread_is_threaded_np pwrite readlink setproctitle setsid shm_open symlink sync_file_range towlower utime utimes wcstombs wcstombs_l])
 
 AC_REPLACE_FUNCS(fseeko)
 case $host_os in
diff --git src/backend/access/heap/rewriteheap.c src/backend/access/heap/rewriteheap.c
index f9ce986..6dc5df3 100644
--- src/backend/access/heap/rewriteheap.c
+++ src/backend/access/heap/rewriteheap.c
@@ -918,7 +918,7 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
 		 * Note that we deviate from the usual WAL coding practices here,
 		 * check the above "Logical rewrite support" comment for reasoning.
 		 */
-		written = FileWrite(src->vfd, waldata_start, len);
+		written = FileWriteAt(src->vfd, waldata_start, len, src->off);
 		if (written != len)
 			ereport(ERROR,
 					(errcode_for_file_access(),
diff --git src/backend/storage/file/buffile.c src/backend/storage/file/buffile.c
index 042be79..39482fa 100644
--- src/backend/storage/file/buffile.c
+++ src/backend/storage/file/buffile.c
@@ -60,12 +60,6 @@ struct BufFile
 	int			numFiles;		/* number of physical files in set */
 	/* all files except the last have length exactly MAX_PHYSICAL_FILESIZE */
 	File	   *files;			/* palloc'd array with numFiles entries */
-	off_t	   *offsets;		/* palloc'd array with numFiles entries */
-
-	/*
-	 * offsets[i] is the current seek position of files[i].  We use this to
-	 * avoid making redundant FileSeek calls.
-	 */
 
 	bool		isTemp;			/* can only add files if this is TRUE */
 	bool		isInterXact;	/* keep open over transactions? */
@@ -108,8 +102,6 @@ makeBufFile(File firstfile)
 	file->numFiles = 1;
 	file->files = (File *) palloc(sizeof(File));
 	file->files[0] = firstfile;
-	file->offsets = (off_t *) palloc(sizeof(off_t));
-	file->offsets[0] = 0L;
 	file->isTemp = false;
 	file->isInterXact = false;
 	file->dirty = false;
@@ -143,10 +135,7 @@ extendBufFile(BufFile *file)
 
 	file->files = (File *) repalloc(file->files,
 									(file->numFiles + 1) * sizeof(File));
-	file->offsets = (off_t *) repalloc(file->offsets,
-									   (file->numFiles + 1) * sizeof(off_t));
 	file->files[file->numFiles] = pfile;
-	file->offsets[file->numFiles] = 0L;
 	file->numFiles++;
 }
 
@@ -210,7 +199,6 @@ BufFileClose(BufFile *file)
 		FileClose(file->files[i]);
 	/* release the buffer space */
 	pfree(file->files);
-	pfree(file->offsets);
 	pfree(file);
 }
 
@@ -241,23 +229,12 @@ BufFileLoadBuffer(BufFile *file)
 	}
 
 	/*
-	 * May need to reposition physical file.
-	 */
-	thisfile = file->files[file->curFile];
-	if (file->curOffset != file->offsets[file->curFile])
-	{
-		if (FileSeek(thisfile, file->curOffset, SEEK_SET) != file->curOffset)
-			return;				/* seek failed, read nothing */
-		file->offsets[file->curFile] = file->curOffset;
-	}
-
-	/*
 	 * Read whatever we can get, up to a full bufferload.
 	 */
-	file->nbytes = FileRead(thisfile, file->buffer, sizeof(file->buffer));
+	thisfile = file->files[file->curFile];
+	file->nbytes = FileReadAt(thisfile, file->buffer, sizeof(file->buffer), file->curOffset);
 	if (file->nbytes < 0)
 		file->nbytes = 0;
-	file->offsets[file->curFile] += file->nbytes;
 	/* we choose not to advance curOffset here */
 
 	pgBufferUsage.temp_blks_read++;
@@ -307,20 +284,10 @@ BufFileDumpBuffer(BufFile *file)
 				bytestowrite = (int) availbytes;
 		}
 
-		/*
-		 * May need to reposition physical file.
-		 */
 		thisfile = file->files[file->curFile];
-		if (file->curOffset != file->offsets[file->curFile])
-		{
-			if (FileSeek(thisfile, file->curOffset, SEEK_SET) != file->curOffset)
-				return;			/* seek failed, give up */
-			file->offsets[file->curFile] = file->curOffset;
-		}
-		bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite);
+		bytestowrite = FileWriteAt(thisfile, file->buffer + wpos, bytestowrite, file->curOffset);
 		if (bytestowrite <= 0)
 			return;				/* failed to write */
-		file->offsets[file->curFile] += bytestowrite;
 		file->curOffset += bytestowrite;
 		wpos += bytestowrite;
 
diff --git src/backend/storage/file/fd.c src/backend/storage/file/fd.c
index 03143f1..56348f5 100644
--- src/backend/storage/file/fd.c
+++ src/backend/storage/file/fd.c
@@ -16,8 +16,8 @@
  * including base tables, scratch files (e.g., sort and hash spool
  * files), and random calls to C library routines like system(3); it
  * is quite easy to exceed system limits on the number of open files a
- * single process can have.  (This is around 256 on many modern
- * operating systems, but can be as low as 32 on others.)
+ * single process can have.  (This is around 1024 on many modern
+ * operating systems, but may be lower on others.)
  *
  * VFDs are managed as an LRU pool, with actual OS file descriptors
  * being opened and closed as needed.  Obviously, if a routine is
@@ -174,7 +174,6 @@ typedef struct vfd
 	File		nextFree;		/* link to next free VFD, if in freelist */
 	File		lruMoreRecently;	/* doubly linked recency-of-use list */
 	File		lruLessRecently;
-	off_t		seekPos;		/* current logical file position */
 	off_t		fileSize;		/* current size of file (0 if not temporary) */
 	char	   *fileName;		/* name of file, or NULL for unused VFD */
 	/* NB: fileName is malloc'd, and must be free'd when closing the VFD */
@@ -970,10 +969,6 @@ LruDelete(File file)
 	/* delete the vfd record from the LRU ring */
 	Delete(file);
 
-	/* save the seek position */
-	vfdP->seekPos = lseek(vfdP->fd, (off_t) 0, SEEK_CUR);
-	Assert(vfdP->seekPos != (off_t) -1);
-
 	/* close the file */
 	if (close(vfdP->fd))
 		elog(ERROR, "could not close file \"%s\": %m", vfdP->fileName);
@@ -1038,15 +1033,6 @@ LruInsert(File file)
 			DO_DB(elog(LOG, "RE_OPEN SUCCESS"));
 			++nfile;
 		}
-
-		/* seek to the right position */
-		if (vfdP->seekPos != (off_t) 0)
-		{
-			off_t returnValue PG_USED_FOR_ASSERTS_ONLY;
-
-			returnValue = lseek(vfdP->fd, vfdP->seekPos, SEEK_SET);
-			Assert(returnValue != (off_t) -1);
-		}
 	}
 
 	/*
@@ -1270,7 +1256,6 @@ PathNameOpenFile(FileName fileName, int fileFlags, int fileMode)
 	/* Saved flags are adjusted to be OK for re-opening file */
 	vfdP->fileFlags = fileFlags & ~(O_CREAT | O_TRUNC | O_EXCL);
 	vfdP->fileMode = fileMode;
-	vfdP->seekPos = 0;
 	vfdP->fileSize = 0;
 	vfdP->fdstate = 0x0;
 	vfdP->resowner = NULL;
@@ -1563,7 +1548,7 @@ FileWriteback(File file, off_t offset, off_t nbytes)
 }
 
 int
-FileRead(File file, char *buffer, int amount)
+FileReadAt(File file, char *buffer, int amount, off_t offset)
 {
 	int			returnCode;
 
@@ -1571,7 +1556,7 @@ FileRead(File file, char *buffer, int amount)
 
 	DO_DB(elog(LOG, "FileRead: %d (%s) " INT64_FORMAT " %d %p",
 			   file, VfdCache[file].fileName,
-			   (int64) VfdCache[file].seekPos,
+			   (int64) offset,
 			   amount, buffer));
 
 	returnCode = FileAccess(file);
@@ -1579,11 +1564,14 @@ FileRead(File file, char *buffer, int amount)
 		return returnCode;
 
 retry:
+#ifdef HAVE_PREAD
+	returnCode = pread(VfdCache[file].fd, buffer, amount, offset);
+#else
+	lseek(VfdCache[file].fd, offset, SEEK_SET);
 	returnCode = read(VfdCache[file].fd, buffer, amount);
+#endif
 
-	if (returnCode >= 0)
-		VfdCache[file].seekPos += returnCode;
-	else
+	if (returnCode < 0)
 	{
 		/*
 		 * Windows may run out of kernel buffers and return "Insufficient
@@ -1609,16 +1597,13 @@ retry:
 		/* OK to retry if interrupted */
 		if (errno == EINTR)
 			goto retry;
-
-		/* Trouble, so assume we don't know the file position anymore */
-		VfdCache[file].seekPos = FileUnknownPos;
 	}
 
 	return returnCode;
 }
 
 int
-FileWrite(File file, char *buffer, int amount)
+FileWriteAt(File file, char *buffer, int amount, off_t offset)
 {
 	int			returnCode;
 
@@ -1626,7 +1611,7 @@ FileWrite(File file, char *buffer, int amount)
 
 	DO_DB(elog(LOG, "FileWrite: %d (%s) " INT64_FORMAT " %d %p",
 			   file, VfdCache[file].fileName,
-			   (int64) VfdCache[file].seekPos,
+			   (int64) offset,
 			   amount, buffer));
 
 	returnCode = FileAccess(file);
@@ -1643,7 +1628,7 @@ FileWrite(File file, char *buffer, int amount)
 	 */
 	if (temp_file_limit >= 0 && (VfdCache[file].fdstate & FD_TEMPORARY))
 	{
-		off_t		newPos = VfdCache[file].seekPos + amount;
+		off_t		newPos = offset + amount;
 
 		if (newPos > VfdCache[file].fileSize)
 		{
@@ -1660,7 +1645,12 @@ FileWrite(File file, char *buffer, int amount)
 
 retry:
 	errno = 0;
+#ifdef HAVE_PWRITE
+	returnCode = pwrite(VfdCache[file].fd, buffer, amount, offset);
+#else
+	lseek(VfdCache[file].fd, offset, SEEK_SET);
 	returnCode = write(VfdCache[file].fd, buffer, amount);
+#endif
 
 	/* if write didn't set errno, assume problem is no disk space */
 	if (returnCode != amount && errno == 0)
@@ -1668,12 +1658,10 @@ retry:
 
 	if (returnCode >= 0)
 	{
-		VfdCache[file].seekPos += returnCode;
-
 		/* maintain fileSize and temporary_files_size if it's a temp file */
 		if (VfdCache[file].fdstate & FD_TEMPORARY)
 		{
-			off_t		newPos = VfdCache[file].seekPos;
+			off_t		newPos = offset + returnCode;
 
 			if (newPos > VfdCache[file].fileSize)
 			{
@@ -1685,7 +1673,7 @@ retry:
 	else
 	{
 		/*
-		 * See comments in FileRead()
+		 * See comments in FileReadAt()
 		 */
 #ifdef WIN32
 		DWORD		error = GetLastError();
@@ -1704,9 +1692,6 @@ retry:
 		/* OK to retry if interrupted */
 		if (errno == EINTR)
 			goto retry;
-
-		/* Trouble, so assume we don't know the file position anymore */
-		VfdCache[file].seekPos = FileUnknownPos;
 	}
 
 	return returnCode;
@@ -1736,78 +1721,33 @@ FileSeek(File file, off_t offset, int whence)
 
 	Assert(FileIsValid(file));
 
-	DO_DB(elog(LOG, "FileSeek: %d (%s) " INT64_FORMAT " " INT64_FORMAT " %d",
+	DO_DB(elog(LOG, "FileSeek: %d (%s) " INT64_FORMAT " %d",
 			   file, VfdCache[file].fileName,
-			   (int64) VfdCache[file].seekPos,
 			   (int64) offset, whence));
 
-	if (FileIsNotOpen(file))
+	switch (whence)
 	{
-		switch (whence)
-		{
-			case SEEK_SET:
-				if (offset < 0)
-					elog(ERROR, "invalid seek offset: " INT64_FORMAT,
-						 (int64) offset);
-				VfdCache[file].seekPos = offset;
-				break;
-			case SEEK_CUR:
-				VfdCache[file].seekPos += offset;
-				break;
-			case SEEK_END:
+		case SEEK_SET:
+			if (offset < 0)
+				elog(ERROR, "invalid seek offset: " INT64_FORMAT,
+					 (int64) offset);
+			return offset;
+
+		case SEEK_END:
+			if (FileIsNotOpen(file))
+			{
 				returnCode = FileAccess(file);
 				if (returnCode < 0)
 					return returnCode;
-				VfdCache[file].seekPos = lseek(VfdCache[file].fd,
-											   offset, whence);
-				break;
-			default:
-				elog(ERROR, "invalid whence: %d", whence);
-				break;
-		}
-	}
-	else
-	{
-		switch (whence)
-		{
-			case SEEK_SET:
-				if (offset < 0)
-					elog(ERROR, "invalid seek offset: " INT64_FORMAT,
-						 (int64) offset);
-				if (VfdCache[file].seekPos != offset)
-					VfdCache[file].seekPos = lseek(VfdCache[file].fd,
-												   offset, whence);
-				break;
-			case SEEK_CUR:
-				if (offset != 0 || VfdCache[file].seekPos == FileUnknownPos)
-					VfdCache[file].seekPos = lseek(VfdCache[file].fd,
-												   offset, whence);
-				break;
-			case SEEK_END:
-				VfdCache[file].seekPos = lseek(VfdCache[file].fd,
-											   offset, whence);
-				break;
-			default:
-				elog(ERROR, "invalid whence: %d", whence);
-				break;
-		}
+			}
+			return lseek(VfdCache[file].fd, offset, whence);
+
+		default:
+			elog(ERROR, "invalid whence: %d", whence);
 	}
-	return VfdCache[file].seekPos;
-}
 
-/*
- * XXX not actually used but here for completeness
- */
-#ifdef NOT_USED
-off_t
-FileTell(File file)
-{
-	Assert(FileIsValid(file));
-	DO_DB(elog(LOG, "FileTell %d (%s)",
-			   file, VfdCache[file].fileName));
-	return VfdCache[file].seekPos;
+	return -1;
 }
-#endif
 
 int
 FileTruncate(File file, off_t offset)
diff --git src/backend/storage/smgr/md.c src/backend/storage/smgr/md.c
index f329d15..5fc9a6b 100644
--- src/backend/storage/smgr/md.c
+++ src/backend/storage/smgr/md.c
@@ -527,22 +527,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 
 	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
 
-	/*
-	 * Note: because caller usually obtained blocknum by calling mdnblocks,
-	 * which did a seek(SEEK_END), this seek is often redundant and will be
-	 * optimized away by fd.c.  It's not redundant, however, if there is a
-	 * partial page at the end of the file. In that case we want to try to
-	 * overwrite the partial page with a full page.  It's also not redundant
-	 * if bufmgr.c had to dump another buffer of the same file to make room
-	 * for the new page's buffer.
-	 */
-	if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not seek to block %u in file \"%s\": %m",
-						blocknum, FilePathName(v->mdfd_vfd))));
-
-	if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ)) != BLCKSZ)
+	if ((nbytes = FileWriteAt(v->mdfd_vfd, buffer, BLCKSZ, seekpos)) != BLCKSZ)
 	{
 		if (nbytes < 0)
 			ereport(ERROR,
@@ -749,13 +734,7 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 
 	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
 
-	if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not seek to block %u in file \"%s\": %m",
-						blocknum, FilePathName(v->mdfd_vfd))));
-
-	nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ);
+	nbytes = FileReadAt(v->mdfd_vfd, buffer, BLCKSZ, seekpos);
 
 	TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
 									   reln->smgr_rnode.node.spcNode,
@@ -825,13 +804,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 
 	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
 
-	if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not seek to block %u in file \"%s\": %m",
-						blocknum, FilePathName(v->mdfd_vfd))));
-
-	nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
+	nbytes = FileWriteAt(v->mdfd_vfd, buffer, BLCKSZ, seekpos);
 
 	TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
 										reln->smgr_rnode.node.spcNode,
diff --git src/include/pg_config.h.in src/include/pg_config.h.in
index b621ff2..7d6723a 100644
--- src/include/pg_config.h.in
+++ src/include/pg_config.h.in
@@ -382,6 +382,9 @@
 /* Define to 1 if the assembler supports PPC's LWARX mutex hint bit. */
 #undef HAVE_PPC_LWARX_MUTEX_HINT
 
+/* Define to 1 if you have the `pread' function. */
+#undef HAVE_PREAD
+
 /* Define to 1 if you have the `pstat' function. */
 #undef HAVE_PSTAT
 
@@ -400,6 +403,9 @@
 /* Define to 1 if you have the <pwd.h> header file. */
 #undef HAVE_PWD_H
 
+/* Define to 1 if you have the `pwrite' function. */
+#undef HAVE_PWRITE
+
 /* Define to 1 if you have the `random' function. */
 #undef HAVE_RANDOM
 
diff --git src/include/storage/fd.h src/include/storage/fd.h
index cbc2224..0b6b846 100644
--- src/include/storage/fd.h
+++ src/include/storage/fd.h
@@ -69,8 +69,8 @@ extern File PathNameOpenFile(FileName fileName, int fileFlags, int fileMode);
 extern File OpenTemporaryFile(bool interXact);
 extern void FileClose(File file);
 extern int	FilePrefetch(File file, off_t offset, int amount);
-extern int	FileRead(File file, char *buffer, int amount);
-extern int	FileWrite(File file, char *buffer, int amount);
+extern int	FileReadAt(File file, char *buffer, int amount, off_t offset);
+extern int	FileWriteAt(File file, char *buffer, int amount, off_t offset);
 extern int	FileSync(File file);
 extern off_t FileSeek(File file, off_t offset, int whence);
 extern int	FileTruncate(File file, off_t offset);

Robert Haas

robertmhaas@gmail.com

over 9 years ago

In reply to: Tom Lane (#7)

Re: Use pread and pwrite instead of lseek + write and read

On Wed, Aug 17, 2016 at 9:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Oskari Saarenmaa <os@ohmu.fi> writes:

On my laptop a simple pgbench run (scale 100, 15 minutes) shows a 1.5%
performance improvement.

I would have hoped for a lot better result before anyone would propose
that we should deal with all the portability issues this'll create.

A 1.5% performance improvement is small but
measurable - and IMV more importantly it allows us to drop more than 100
lines of backwards (compatible?) code; maybe we could start targeting
more recent platforms in v10?

That's basically nonsense: we'll end up adding way more than that to
deal with platforms that haven't got these APIs.

I don't understand why you think this would create non-trivial
portability issues.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10

Claudio Freire

klaussfreire@gmail.com

over 9 years ago

In reply to: Tom Lane (#7)

Re: Use pread and pwrite instead of lseek + write and read

On Wed, Aug 17, 2016 at 10:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Oskari Saarenmaa <os@ohmu.fi> writes:

On my laptop a simple pgbench run (scale 100, 15 minutes) shows a 1.5%
performance improvement.

I would have hoped for a lot better result before anyone would propose
that we should deal with all the portability issues this'll create.

A 1.5% performance improvement is small but
measurable - and IMV more importantly it allows us to drop more than 100
lines of backwards (compatible?) code; maybe we could start targeting
more recent platforms in v10?

That's basically nonsense: we'll end up adding way more than that to
deal with platforms that haven't got these APIs.

Perhaps a better target would then be to try and make use of readv and
writev, which should both be useful for sequential access of various
kinds and network I/O. For instance, when reading 10 sequential
buffers, a readv could fill 10 buffers at a time.

I remember a project where we got a linear improvement in thoughput by
using them for network I/O, because we were limited by packet
thoughput rather than byte thoughput, and using the iovec utilities
reduced the overhead considerably.

But all this is anecdotal evidence in any case, YMMV.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11

Tom Lane

tgl@sss.pgh.pa.us

over 9 years ago

In reply to: Robert Haas (#9)

Re: Use pread and pwrite instead of lseek + write and read

Robert Haas <robertmhaas@gmail.com> writes:

I don't understand why you think this would create non-trivial
portability issues.

The patch as submitted breaks entirely on platforms without pread/pwrite.
Yes, we can add a configure test and some shim functions to fix that,
but the argument that it makes the code shorter will get a lot weaker
once we do.

I agree that adding such functions is pretty trivial, but there are
reasons to think there are other hazards that are less trivial:

First, a self-contained shim function will necessarily do an lseek every
time, which means performance will get *worse* not better on non-pread
platforms. And yes, the existing logic to avoid lseeks fires often enough
to be worthwhile, particularly in seqscans.

Second, I wonder whether this will break any kernel's readahead detection.
I wouldn't be too surprised if successive reads (not preads) without
intervening lseeks are needed to trigger readahead on at least some
platforms. So there's a potential, both on platforms with pread and those
without, for this to completely destroy seqscan performance, with
penalties very far exceeding what we might save by avoiding some kernel
calls.

I'd be more excited about this if the claimed improvement were more than
1.5%, but you know as well as I do that that's barely above the noise
floor for most performance measurements. I'm left wondering why bother,
and why take any risk of de-optimizing on some platforms.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12

Robert Haas

robertmhaas@gmail.com

over 9 years ago

In reply to: Tom Lane (#11)

Re: Use pread and pwrite instead of lseek + write and read

On Wed, Aug 17, 2016 at 3:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

I don't understand why you think this would create non-trivial
portability issues.

The patch as submitted breaks entirely on platforms without pread/pwrite.
Yes, we can add a configure test and some shim functions to fix that,
but the argument that it makes the code shorter will get a lot weaker
once we do.

I agree that adding such functions is pretty trivial, but there are
reasons to think there are other hazards that are less trivial:

First, a self-contained shim function will necessarily do an lseek every
time, which means performance will get *worse* not better on non-pread
platforms. And yes, the existing logic to avoid lseeks fires often enough
to be worthwhile, particularly in seqscans.

Second, I wonder whether this will break any kernel's readahead detection.
I wouldn't be too surprised if successive reads (not preads) without
intervening lseeks are needed to trigger readahead on at least some
platforms. So there's a potential, both on platforms with pread and those
without, for this to completely destroy seqscan performance, with
penalties very far exceeding what we might save by avoiding some kernel
calls.

I'd be more excited about this if the claimed improvement were more than
1.5%, but you know as well as I do that that's barely above the noise
floor for most performance measurements. I'm left wondering why bother,
and why take any risk of de-optimizing on some platforms.

Well, I think you're pointing out some things that need to be figured
out, but I hardly think that's a good enough reason to pour cold water
on the whole approach. The number of lseeks we issue on many
workloads is absolutely appalling, and I don't think there's any
reason at all to assume that a 1.5% gain is as good as it gets. Even
if it is, a 1% speedup on a benchmark where the noise is 5-10% is just
as much of a speedup as a 1% speedup on a benchmark on a benchmark
where the noise is 0.1%. Faster is faster, and 1% improvements are
not so numerous that we can afford to ignore them when they pop up.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13

Tom Lane

tgl@sss.pgh.pa.us

over 9 years ago

In reply to: Robert Haas (#12)

Re: Use pread and pwrite instead of lseek + write and read

Robert Haas <robertmhaas@gmail.com> writes:

Well, I think you're pointing out some things that need to be figured
out, but I hardly think that's a good enough reason to pour cold water
on the whole approach.

If somebody feels like doing the legwork to find out if those performance
hazards are real (which I freely concede they may not be), fine. I'm just
saying this isn't a slam-dunk commit-it-and-move-on win.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14

Oskari Saarenmaa

os@ohmu.fi

over 9 years ago

In reply to: Tom Lane (#11)

Re: Use pread and pwrite instead of lseek + write and read

17.08.2016, 22:11, Tom Lane kirjoitti:

Robert Haas <robertmhaas@gmail.com> writes:

I don't understand why you think this would create non-trivial
portability issues.

The patch as submitted breaks entirely on platforms without pread/pwrite.
Yes, we can add a configure test and some shim functions to fix that,
but the argument that it makes the code shorter will get a lot weaker
once we do.

I posted an updated patch which just calls lseek + read/write, the
code's still a lot shorter.

I agree that adding such functions is pretty trivial, but there are
reasons to think there are other hazards that are less trivial:

First, a self-contained shim function will necessarily do an lseek every
time, which means performance will get *worse* not better on non-pread
platforms. And yes, the existing logic to avoid lseeks fires often enough
to be worthwhile, particularly in seqscans.

This will only regress on platforms without pread. The only relevant
such platform appears to be Windows which has equivalent APIs.

FWIW, I ran the same pgbench benchmarks on my Linux system where I
always used lseek() + read/write instead of pread and pwrite - they ran
slightly faster than the previous code which saved seek positions, but I
suppose a workload with lots of seqscans could be slower.

Unfortunately I didn't save the actual numbers anywhere, but I can rerun
the benchmarks if you're interested. The numbers were pretty stable
across multiple runs.

Second, I wonder whether this will break any kernel's readahead detection.
I wouldn't be too surprised if successive reads (not preads) without
intervening lseeks are needed to trigger readahead on at least some
platforms. So there's a potential, both on platforms with pread and those
without, for this to completely destroy seqscan performance, with
penalties very far exceeding what we might save by avoiding some kernel
calls.

At least Linux and FreeBSD don't seem to care how and why you read
pages, they'll do readahead regardless of the way you read files and
extend the readahead once you access previously readahead pages. They
disable readahead only if fadvise(POSIX_FADV_RANDOM) has been used.

I'd expect any kernel that implements mmap to also implement readahead
based on page usage rather than than the seek position. Do you know of
a kernel that would actually use the seek position for readahead?

I'd be more excited about this if the claimed improvement were more than
1.5%, but you know as well as I do that that's barely above the noise
floor for most performance measurements. I'm left wondering why bother,
and why take any risk of de-optimizing on some platforms.

I think it makes sense to try to optimize for the platforms that people
actually use for performance critical workloads, especially if it also
allows us to simplify the code and remove more lines than we add. It's
nice if the software still works on legacy platforms, but I don't think
we should be concerned about a hypothetical performance impact on
platforms no one uses in production anymore.

/ Oskari

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15

Tom Lane

tgl@sss.pgh.pa.us

over 9 years ago

In reply to: Oskari Saarenmaa (#14)

Re: Use pread and pwrite instead of lseek + write and read

Oskari Saarenmaa <os@ohmu.fi> writes:

17.08.2016, 22:11, Tom Lane kirjoitti:

I'd be more excited about this if the claimed improvement were more than
1.5%, but you know as well as I do that that's barely above the noise
floor for most performance measurements. I'm left wondering why bother,
and why take any risk of de-optimizing on some platforms.

I think it makes sense to try to optimize for the platforms that people
actually use for performance critical workloads, especially if it also
allows us to simplify the code and remove more lines than we add. It's
nice if the software still works on legacy platforms, but I don't think
we should be concerned about a hypothetical performance impact on
platforms no one uses in production anymore.

Well, my point remains that I see little value in messing with
long-established code if you can't demonstrate a benefit that's clearly
above the noise level. We don't really know whether this change might
have adverse impacts somewhere --- either performance-wise or bug-wise ---
and for that amount of benefit I don't want to take the trouble to find
out.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers