pendingOps table is not cleared with fsync=off

Started by Heikki Linnakangas · over 5 years ago · 9 messages
#1 Heikki Linnakangas
hlinnaka@iki.fi
1 attachment(s)

Hi!

I noticed that commit 3eb77eba5a changed the logic in
ProcessSyncRequests() (formerly mdsync()) so that if you have fsync=off,
the entries are not removed from the pendingOps hash table. I don't
think that was intended.

I propose the attached patch to move the test for enableFsync so that
the entries are removed from pendingOps again. It looks larger than it
really is because it re-indents the block of code that is now inside the
"if (enableFsync)" condition.

- Heikki

Attachments:

clear-pendingOps-with-fsync-off-1.patch (text/x-patch; charset=UTF-8)
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 8282a476b4..022366c172 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -315,14 +315,6 @@ ProcessSyncRequests(void)
 	{
 		int			failures;
 
-		/*
-		 * If fsync is off then we don't have to bother opening the file at
-		 * all.  (We delay checking until this point so that changing fsync on
-		 * the fly behaves sensibly.)
-		 */
-		if (!enableFsync)
-			continue;
-
 		/*
 		 * If the entry is new then don't process it this time; it is new.
 		 * Note "continue" bypasses the hash-remove call at the bottom of the
@@ -335,78 +327,88 @@ ProcessSyncRequests(void)
 		Assert((CycleCtr) (entry->cycle_ctr + 1) == sync_cycle_ctr);
 
 		/*
-		 * If in checkpointer, we want to absorb pending requests every so
-		 * often to prevent overflow of the fsync request queue.  It is
-		 * unspecified whether newly-added entries will be visited by
-		 * hash_seq_search, but we don't care since we don't need to process
-		 * them anyway.
-		 */
-		if (--absorb_counter <= 0)
-		{
-			AbsorbSyncRequests();
-			absorb_counter = FSYNCS_PER_ABSORB;
-		}
-
-		/*
-		 * The fsync table could contain requests to fsync segments that have
-		 * been deleted (unlinked) by the time we get to them. Rather than
-		 * just hoping an ENOENT (or EACCES on Windows) error can be ignored,
-		 * what we do on error is absorb pending requests and then retry.
-		 * Since mdunlink() queues a "cancel" message before actually
-		 * unlinking, the fsync request is guaranteed to be marked canceled
-		 * after the absorb if it really was this case. DROP DATABASE likewise
-		 * has to tell us to forget fsync requests before it starts deletions.
+		 * If fsync is off then we don't have to bother opening the file at
+		 * all.  (We delay checking until this point so that changing fsync on
+		 * the fly behaves sensibly.)
 		 */
-		for (failures = 0; !entry->canceled; failures++)
+		if (enableFsync)
 		{
-			char		path[MAXPGPATH];
-
-			INSTR_TIME_SET_CURRENT(sync_start);
-			if (syncsw[entry->tag.handler].sync_syncfiletag(&entry->tag,
-															path) == 0)
-			{
-				/* Success; update statistics about sync timing */
-				INSTR_TIME_SET_CURRENT(sync_end);
-				sync_diff = sync_end;
-				INSTR_TIME_SUBTRACT(sync_diff, sync_start);
-				elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
-				if (elapsed > longest)
-					longest = elapsed;
-				total_elapsed += elapsed;
-				processed++;
-
-				if (log_checkpoints)
-					elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
-						 processed,
-						 path,
-						 (double) elapsed / 1000);
-
-				break;			/* out of retry loop */
-			}
-
 			/*
-			 * It is possible that the relation has been dropped or truncated
-			 * since the fsync request was entered. Therefore, allow ENOENT,
-			 * but only if we didn't fail already on this file.
+			 * If in checkpointer, we want to absorb pending requests every so
+			 * often to prevent overflow of the fsync request queue.  It is
+			 * unspecified whether newly-added entries will be visited by
+			 * hash_seq_search, but we don't care since we don't need to
+			 * process them anyway.
 			 */
-			if (!FILE_POSSIBLY_DELETED(errno) || failures > 0)
-				ereport(data_sync_elevel(ERROR),
-						(errcode_for_file_access(),
-						 errmsg("could not fsync file \"%s\": %m",
-								path)));
-			else
-				ereport(DEBUG1,
-						(errcode_for_file_access(),
-						 errmsg("could not fsync file \"%s\" but retrying: %m",
-								path)));
+			if (--absorb_counter <= 0)
+			{
+				AbsorbSyncRequests();
+				absorb_counter = FSYNCS_PER_ABSORB;
+			}
 
 			/*
-			 * Absorb incoming requests and check to see if a cancel arrived
-			 * for this relation fork.
+			 * The fsync table could contain requests to fsync segments that
+			 * have been deleted (unlinked) by the time we get to them. Rather
+			 * than just hoping an ENOENT (or EACCES on Windows) error can be
+			 * ignored, what we do on error is absorb pending requests and
+			 * then retry. Since mdunlink() queues a "cancel" message before
+			 * actually unlinking, the fsync request is guaranteed to be
+			 * marked canceled after the absorb if it really was this case.
+			 * DROP DATABASE likewise has to tell us to forget fsync requests
+			 * before it starts deletions.
 			 */
-			AbsorbSyncRequests();
-			absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
-		}						/* end retry loop */
+			for (failures = 0; !entry->canceled; failures++)
+			{
+				char		path[MAXPGPATH];
+
+				INSTR_TIME_SET_CURRENT(sync_start);
+				if (syncsw[entry->tag.handler].sync_syncfiletag(&entry->tag,
+																path) == 0)
+				{
+					/* Success; update statistics about sync timing */
+					INSTR_TIME_SET_CURRENT(sync_end);
+					sync_diff = sync_end;
+					INSTR_TIME_SUBTRACT(sync_diff, sync_start);
+					elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
+					if (elapsed > longest)
+						longest = elapsed;
+					total_elapsed += elapsed;
+					processed++;
+
+					if (log_checkpoints)
+						elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
+							 processed,
+							 path,
+							 (double) elapsed / 1000);
+
+					break;		/* out of retry loop */
+				}
+
+				/*
+				 * It is possible that the relation has been dropped or
+				 * truncated since the fsync request was entered. Therefore,
+				 * allow ENOENT, but only if we didn't fail already on this
+				 * file.
+				 */
+				if (!FILE_POSSIBLY_DELETED(errno) || failures > 0)
+					ereport(data_sync_elevel(ERROR),
+							(errcode_for_file_access(),
+							 errmsg("could not fsync file \"%s\": %m",
+									path)));
+				else
+					ereport(DEBUG1,
+							(errcode_for_file_access(),
+							 errmsg("could not fsync file \"%s\" but retrying: %m",
+									path)));
+
+				/*
+				 * Absorb incoming requests and check to see if a cancel
+				 * arrived for this relation fork.
+				 */
+				AbsorbSyncRequests();
+				absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
+			}					/* end retry loop */
+		}
 
 		/* We are done with this entry, remove it */
 		if (hash_search(pendingOps, &entry->tag, HASH_REMOVE, NULL) == NULL)
#2 Thomas Munro
thomas.munro@gmail.com
In reply to: Heikki Linnakangas (#1)
Re: pendingOps table is not cleared with fsync=off

On Sat, May 9, 2020 at 9:21 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> I noticed that commit 3eb77eba5a changed the logic in
> ProcessSyncRequests() (formerly mdsync()) so that if you have fsync=off,
> the entries are not removed from the pendingOps hash table. I don't
> think that was intended.

Perhaps we got confused about what the comment "... so that changing
fsync on the fly behaves sensibly" means. Fsyncing everything you
missed when you turn it back on after a period running with it off
does sound a bit like behaviour that someone might want or expect,
though it probably isn't really enough to guarantee durability,
because requests queued here aren't the only fsyncs you missed while
you had it off, among other problems. Unfortunately, if you try that
on an assertion build, you get a failure anyway, so that probably
wasn't deliberate:

TRAP: FailedAssertion("(CycleCtr) (entry->cycle_ctr + 1) ==
sync_cycle_ctr", File: "sync.c", Line: 335)

> I propose the attached patch to move the test for enableFsync so that
> the entries are removed from pendingOps again. It looks larger than it
> really is because it re-indents the block of code that is now inside the
> "if (enableFsync)" condition.

Yeah, I found that git diff/show -w made it easier to understand that
change. LGTM, though I'd be tempted to use "goto skip" instead of
incurring that much indentation but up to you ...

#3 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Thomas Munro (#2)
Re: pendingOps table is not cleared with fsync=off

On 09/05/2020 02:53, Thomas Munro wrote:

> On Sat, May 9, 2020 at 9:21 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
>> I noticed that commit 3eb77eba5a changed the logic in
>> ProcessSyncRequests() (formerly mdsync()) so that if you have fsync=off,
>> the entries are not removed from the pendingOps hash table. I don't
>> think that was intended.
>
> Perhaps we got confused about what the comment "... so that changing
> fsync on the fly behaves sensibly" means. Fsyncing everything you
> missed when you turn it back on after a period running with it off
> does sound a bit like behaviour that someone might want or expect,
> though it probably isn't really enough to guarantee durability,
> because requests queued here aren't the only fsyncs you missed while
> you had it off, among other problems.

Yeah, you need to run "sync" after turning fsync=on to be safe; there's
no way around it.

> Unfortunately, if you try that on an assertion build, you get a
> failure anyway, so that probably wasn't deliberate:
>
> TRAP: FailedAssertion("(CycleCtr) (entry->cycle_ctr + 1) ==
> sync_cycle_ctr", File: "sync.c", Line: 335)

Ah, I didn't notice that.

>> I propose the attached patch to move the test for enableFsync so that
>> the entries are removed from pendingOps again. It looks larger than it
>> really is because it re-indents the block of code that is now inside the
>> "if (enableFsync)" condition.
>
> Yeah, I found that git diff/show -w made it easier to understand that
> change. LGTM, though I'd be tempted to use "goto skip" instead of
> incurring that much indentation but up to you ...

I considered a goto too, but I found it confusing. If we need any more
nesting here in the future, I think extracting the inner parts into a
function would be good.

Committed, thanks!

- Heikki

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#3)
Re: pendingOps table is not cleared with fsync=off

Heikki Linnakangas <hlinnaka@iki.fi> writes:

> On 09/05/2020 02:53, Thomas Munro wrote:
>
>> On Sat, May 9, 2020 at 9:21 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>>> I noticed that commit 3eb77eba5a changed the logic in
>>> ProcessSyncRequests() (formerly mdsync()) so that if you have fsync=off,
>>> the entries are not removed from the pendingOps hash table. I don't
>>> think that was intended.

I'm looking at this commit in connection with writing release notes
for next week's releases. Am I right in thinking that this bug leads
to indefinite bloat of the pendingOps hash table when fsync is off?
If so, that seems much more worth documenting than the assertion
failure.

regards, tom lane

#5 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Tom Lane (#4)
Re: pendingOps table is not cleared with fsync=off

On 06/08/2020 18:42, Tom Lane wrote:

> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>
>> On 09/05/2020 02:53, Thomas Munro wrote:
>>
>>> On Sat, May 9, 2020 at 9:21 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>>
>>>> I noticed that commit 3eb77eba5a changed the logic in
>>>> ProcessSyncRequests() (formerly mdsync()) so that if you have fsync=off,
>>>> the entries are not removed from the pendingOps hash table. I don't
>>>> think that was intended.
>
> I'm looking at this commit in connection with writing release notes
> for next week's releases. Am I right in thinking that this bug leads
> to indefinite bloat of the pendingOps hash table when fsync is off?
> If so, that seems much more worth documenting than the assertion
> failure.

That's correct.

- Heikki

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#5)
Re: pendingOps table is not cleared with fsync=off

Heikki Linnakangas <hlinnaka@iki.fi> writes:

> On 06/08/2020 18:42, Tom Lane wrote:
>
>> I'm looking at this commit in connection with writing release notes
>> for next week's releases. Am I right in thinking that this bug leads
>> to indefinite bloat of the pendingOps hash table when fsync is off?
>> If so, that seems much more worth documenting than the assertion
>> failure.
>
> That's correct.

OK, thanks for confirming.

regards, tom lane

#7 Shawn Debnath
sdn@amazon.com
In reply to: Thomas Munro (#2)
Re: pendingOps table is not cleared with fsync=off

On Sat, May 09, 2020 at 11:53:13AM +1200, Thomas Munro wrote:

> On Sat, May 9, 2020 at 9:21 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
>> I noticed that commit 3eb77eba5a changed the logic in
>> ProcessSyncRequests() (formerly mdsync()) so that if you have fsync=off,
>> the entries are not removed from the pendingOps hash table. I don't
>> think that was intended.
>
> Perhaps we got confused about what the comment "... so that changing
> fsync on the fly behaves sensibly" means. Fsyncing everything you
> missed when you turn it back on after a period running with it off
> does sound a bit like behaviour that someone might want or expect,
> though it probably isn't really enough to guarantee durability,
> because requests queued here aren't the only fsyncs you missed while
> you had it off, among other problems.

Good catch. Question is, are the users aware of the requirement to do a
manual fsync if they flip the fsync GUC off and then on? Should we do
this on their behalf to make a good faith attempt to ensure things are
flushed properly via an assign hook?

--
Shawn Debnath
Amazon Web Services (AWS)

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Shawn Debnath (#7)
Re: pendingOps table is not cleared with fsync=off

Shawn Debnath <sdn@amazon.com> writes:

> Good catch. Question is, are the users aware of the requirement to do a
> manual fsync if they flip the fsync GUC off and then on? Should we do
> this on their behalf to make a good faith attempt to ensure things are
> flushed properly via an assign hook?

No. Or at least, expecting that you can do that from an assign hook
is impossibly wrong-headed. GUC assign hooks can't have failure modes.

regards, tom lane

#9 Shawn Debnath
sdn@amazon.com
In reply to: Tom Lane (#8)
Re: pendingOps table is not cleared with fsync=off

On Mon, Aug 10, 2020 at 02:50:26PM -0400, Tom Lane wrote:

> Shawn Debnath <sdn@amazon.com> writes:
>
>> Good catch. Question is, are the users aware of the requirement to do a
>> manual fsync if they flip the fsync GUC off and then on? Should we do
>> this on their behalf to make a good faith attempt to ensure things are
>> flushed properly via an assign hook?
>
> No. Or at least, expecting that you can do that from an assign hook
> is impossibly wrong-headed. GUC assign hooks can't have failure modes.

Okay, agreed; I'll remind myself to drink more coffee next time.

If we think an fsync should be performed in this case, the assign hook
could set a flag to indicate the parameter was reset via SIGHUP. The next
call to ProcessSyncRequests() could check for this, do an fsync prior to
absorbing the newly submitted sync requests, and reset the flag.
fsync_pgdata() comes to mind, to be inclusive.

If folks are not inclined to do the fsync, the change is good as is.

--
Shawn Debnath
Amazon Web Services (AWS)