BufFileSize() doesn't work on a "shared" BufFiles

Started by Heikki Linnakangasover 7 years ago3 messages
#1Heikki Linnakangas
hlinnaka@iki.fi
1 attachment(s)

Hi,

I've been noodling around the tuplesort/logtape code, reviewing the
changes that were made in v11 to support parallel sorting, and also
rebasing my older patch to replace the polyphase merge with a simple
balanced merge.

Looking at TapeShare:

/*
* The approach tuplesort.c takes to parallel external sorts is that workers,
* whose state is almost the same as independent serial sorts, are made to
* produce a final materialized tape of sorted output in all cases. This is
* frozen, just like any case requiring a final materialized tape. However,
* there is one difference, which is that freezing will also export an
* underlying shared fileset BufFile for sharing. Freezing produces TapeShare
* metadata for the worker when this happens, which is passed along through
* shared memory to leader.
*
* The leader process can then pass an array of TapeShare metadata (one per
* worker participant) to LogicalTapeSetCreate(), alongside a handle to a
* shared fileset, which is sufficient to construct a new logical tapeset that
* consists of each of the tapes materialized by workers.
*
* Note that while logtape.c does create an empty leader tape at the end of the
* tapeset in the leader case, it can never be written to due to a restriction
* in the shared buffile infrastructure.
*/
typedef struct TapeShare
{
/*
* firstblocknumber is first block that should be read from materialized
* tape.
*
* buffilesize is the size of associated BufFile following freezing.
*/
long firstblocknumber;
off_t buffilesize;
} TapeShare;

Why is 'buffilesize' part of the exported state? The leader process can
easily call BufFileSize() itself, instead of having the worker process
pass it through shared memory, right? Wrong. I tried to change it that
way, and after some debugging, I realized that BufFileSize() doesn't
work if it's called on a "shared" BufFile. It always returns 0. That's
because it uses FileGetSize(), which in turn only works on a temporary
file. None of this is mentioned in the comments :-(.

Perhaps that would be OK, if it was properly commented. But it's not
actually hard to make BufFileSize() work on shared files, too, so I
think we should do that.

Another little bug I noticed is that BufFileAppend() incorrectly resets
the 'offsets' of the source files. You won't notice if you call
BufFileAppend() immediately after BufFileOpenShared(), but if you had
called BufFileSeek() or BufFileRead() on the shared BufFile handle
before calling BufFileAppend(), it would get confused.

I propose the attached patch.

- Heikki

Attachments:

0001-Fix-some-sloppiness-in-the-new-BufFileSize-and-BufFi.patchtext/x-patch; name=0001-Fix-some-sloppiness-in-the-new-BufFileSize-and-BufFi.patchDownload
From 6d8c1619c448bc66351eb51393950f898b5c3c5f Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 30 Apr 2018 20:05:59 +0300
Subject: [PATCH 1/1] Fix some sloppiness in the new BufFileSize() and
 BufFileAppend() funtions.

There were three related issues:

* BufFileAppend() incorrectly reset the seek position on the 'source'
  file. As a result, if you had called BufFileRead() on the file before
  calling BufFileAppend(), it got confused, and subsequent calls would
  read/write at wrong position.

* BufFileSize() did not work with files opened with BufFileOpenShared().

* FileGetSize() only worked on temporary files.

To fix, change the way BufFileSize() works so that it works on shared
files. Remove FileGetSize() altogetherm, as it's no longer needed. Remove
buffilesize from TapeShare struct, as the leader process can simply call
BufFileSize() to get the tape's size, there's no need to pass it through
shared memory.
---
 src/backend/storage/file/buffile.c | 18 ++++++++++++++----
 src/backend/storage/file/fd.c      | 10 ----------
 src/backend/utils/sort/logtape.c   | 11 ++++++++---
 src/backend/utils/sort/tuplesort.c |  1 -
 src/include/storage/fd.h           |  1 -
 src/include/utils/logtape.h        |  7 ++-----
 6 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 9cdddba510..d8a18dd3dc 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -802,14 +802,24 @@ BufFileTellBlock(BufFile *file)
 #endif
 
 /*
- * Return the current file size.  Counts any holes left behind by
- * BufFileViewAppend as part of the size.
+ * Return the current file size.
+ *
+ * Counts any holes left behind by BufFileAppend as part of the size.
+ * Returns -1 on error.
  */
 off_t
 BufFileSize(BufFile *file)
 {
+	off_t		lastFileSize;
+
+	/* Get the size of the last physical file by seeking to end. */
+	lastFileSize = FileSeek(file->files[file->numFiles - 1], 0, SEEK_END);
+	if (lastFileSize < 0)
+		return -1;
+	file->offsets[file->numFiles - 1] = lastFileSize;
+
 	return ((file->numFiles - 1) * (off_t) MAX_PHYSICAL_FILESIZE) +
-		FileGetSize(file->files[file->numFiles - 1]);
+		lastFileSize;
 }
 
 /*
@@ -853,7 +863,7 @@ BufFileAppend(BufFile *target, BufFile *source)
 	for (i = target->numFiles; i < newNumFiles; i++)
 	{
 		target->files[i] = source->files[i - target->numFiles];
-		target->offsets[i] = 0L;
+		target->offsets[i] = source->offsets[i - target->numFiles];
 	}
 	target->numFiles = newNumFiles;
 
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index afce5dadc0..441f18dcf5 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2256,16 +2256,6 @@ FileGetRawMode(File file)
 }
 
 /*
- * FileGetSize - returns the size of file
- */
-off_t
-FileGetSize(File file)
-{
-	Assert(FileIsValid(file));
-	return VfdCache[file].fileSize;
-}
-
-/*
  * Make room for another allocatedDescs[] array entry if needed and possible.
  * Returns true if an array element is available.
  */
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 19eb2fddca..a0d6c75c37 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -426,11 +426,17 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 	{
 		char		filename[MAXPGPATH];
 		BufFile    *file;
+		off_t		filesize;
 
 		lt = &lts->tapes[i];
 
 		pg_itoa(i, filename);
 		file = BufFileOpenShared(fileset, filename);
+		filesize = BufFileSize(file);
+		if (filesize < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not determine size of temporary file \"%s\"", filename)));
 
 		/*
 		 * Stash first BufFile, and concatenate subsequent BufFiles to that.
@@ -447,8 +453,8 @@ ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
 			lt->offsetBlockNumber = BufFileAppend(lts->pfile, file);
 		}
 		/* Don't allocate more for read buffer than could possibly help */
-		lt->max_size = Min(MaxAllocSize, shared[i].buffilesize);
-		tapeblocks = shared[i].buffilesize / BLCKSZ;
+		lt->max_size = Min(MaxAllocSize, filesize);
+		tapeblocks = filesize / BLCKSZ;
 		nphysicalblocks += tapeblocks;
 	}
 
@@ -938,7 +944,6 @@ LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum, TapeShare *share)
 	{
 		BufFileExportShared(lts->pfile);
 		share->firstblocknumber = lt->firstBlockNumber;
-		share->buffilesize = BufFileSize(lts->pfile);
 	}
 }
 
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index e6a8d22feb..9fb33b9035 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -4395,7 +4395,6 @@ tuplesort_initialize_shared(Sharedsort *shared, int nWorkers, dsm_segment *seg)
 	for (i = 0; i < nWorkers; i++)
 	{
 		shared->tapes[i].firstblocknumber = 0L;
-		shared->tapes[i].buffilesize = 0;
 	}
 }
 
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 548a832be9..8e7c9728f4 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -78,7 +78,6 @@ extern char *FilePathName(File file);
 extern int	FileGetRawDesc(File file);
 extern int	FileGetRawFlags(File file);
 extern mode_t FileGetRawMode(File file);
-extern off_t FileGetSize(File file);
 
 /* Operations used for sharing named temporary files */
 extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
diff --git a/src/include/utils/logtape.h b/src/include/utils/logtape.h
index 9bf1d80142..06dc734eb6 100644
--- a/src/include/utils/logtape.h
+++ b/src/include/utils/logtape.h
@@ -44,13 +44,10 @@ typedef struct LogicalTapeSet LogicalTapeSet;
 typedef struct TapeShare
 {
 	/*
-	 * firstblocknumber is first block that should be read from materialized
-	 * tape.
-	 *
-	 * buffilesize is the size of associated BufFile following freezing.
+	 * Currently, all the leader process needs is the location of the
+	 * materialized tape's first block.
 	 */
 	long		firstblocknumber;
-	off_t		buffilesize;
 } TapeShare;
 
 /*
-- 
2.11.0

In reply to: Heikki Linnakangas (#1)
Re: BufFileSize() doesn't work on a "shared" BufFiles

On Mon, Apr 30, 2018 at 10:18 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Perhaps that would be OK, if it was properly commented. But it's not
actually hard to make BufFileSize() work on shared files, too, so I think we
should do that.

I agree that this could use a comment, but I don't see much point in
adding the extra FileSeek(). The fact that fd.c is unwilling to track
filesize for non-temp files (where "non-temp" means a
PathNameOpenTemporaryFile()-returned/exported file) is due to
vfdP->fileSize being involved in temp_file_limit enforcement. Maybe a
FD_TEMP_FILE_LIMIT assertion within FileGetSize() would help?

Another little bug I noticed is that BufFileAppend() incorrectly resets the
'offsets' of the source files. You won't notice if you call BufFileAppend()
immediately after BufFileOpenShared(), but if you had called BufFileSeek()
or BufFileRead() on the shared BufFile handle before calling
BufFileAppend(), it would get confused.

Seems worth fixing.

--
Peter Geoghegan

#3Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#2)
Re: BufFileSize() doesn't work on a "shared" BufFiles

On 30/04/18 22:00, Peter Geoghegan wrote:

On Mon, Apr 30, 2018 at 10:18 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Perhaps that would be OK, if it was properly commented. But it's not
actually hard to make BufFileSize() work on shared files, too, so I think we
should do that.

I agree that this could use a comment, but I don't see much point in
adding the extra FileSeek(). The fact that fd.c is unwilling to track
filesize for non-temp files (where "non-temp" means a
PathNameOpenTemporaryFile()-returned/exported file) is due to
vfdP->fileSize being involved in temp_file_limit enforcement. Maybe a
FD_TEMP_FILE_LIMIT assertion within FileGetSize() would help?

Seems simpler to just make it work. It's not like the extra lseek() is
going to make any performance difference. Functions that work only under
certain conditions are annoying.

Pushed, thanks.

- Heikki