pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

Started by Justin Pryzbyalmost 6 years ago30 messages
#1Justin Pryzby
pryzby@telsasoft.com

While working on a patch, I noticed this pre-existing behavior, which seems to
be new since v11, maybe due to changes to SRF.

|postgres=# SELECT pg_ls_dir('.') LIMIT 1;
|WARNING: 1 temporary files and directories not closed at end-of-transaction
|pg_ls_dir | pg_dynshmem

|postgres=# SELECT pg_ls_waldir() LIMIT 1;
|WARNING: 1 temporary files and directories not closed at end-of-transaction
|-[ RECORD 1 ]+-------------------------------------------------------------
|pg_ls_waldir | (00000001000031920000007B,16777216,"2020-03-08 03:50:34-07")

Note, that doesn't happen with "SELECT * FROM".

I'm not sure what the solution is to that, but my patch was going to make it
worse rather than better for pg_ls_tmpdir.

--
Justin

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Justin Pryzby (#1)
Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

Justin Pryzby <pryzby@telsasoft.com> writes:

While working on a patch, I noticed this pre-existing behavior, which seems to
be new since v11, maybe due to changes to SRF.

|postgres=# SELECT pg_ls_dir('.') LIMIT 1;
|WARNING: 1 temporary files and directories not closed at end-of-transaction

Hmm, actually it looks to me like pg_ls_dir has been broken forever.
The reason the warning didn't show up before v11 is that CleanupTempFiles
didn't bleat about leaked "allocated" directories before that
(cf 9cb7db3f0).

I guess we ought to change that function to use returns-a-tuplestore
protocol instead of thinking it can hold a directory open across calls.
It's not hard to think of use-cases where the existing behavior would
cause issues worse than a nanny-ish WARNING, especially on platforms
with tight "ulimit -n" limits.

regards, tom lane

#3Justin Pryzby
pryzby@telsasoft.com
In reply to: Tom Lane (#2)
Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

On Sun, Mar 08, 2020 at 02:37:49PM -0400, Tom Lane wrote:

Justin Pryzby <pryzby@telsasoft.com> writes:

While working on a patch, I noticed this pre-existing behavior, which seems to
be new since v11, maybe due to changes to SRF.

|postgres=# SELECT pg_ls_dir('.') LIMIT 1;
|WARNING: 1 temporary files and directories not closed at end-of-transaction

Hmm, actually it looks to me like pg_ls_dir has been broken forever.
The reason the warning didn't show up before v11 is that CleanupTempFiles
didn't bleat about leaked "allocated" directories before that
(cf 9cb7db3f0).

I guess we ought to change that function to use returns-a-tuplestore
protocol instead of thinking it can hold a directory open across calls.
It's not hard to think of use-cases where the existing behavior would
cause issues worse than a nanny-ish WARNING, especially on platforms
with tight "ulimit -n" limits.

Thanks for the analysis.

Do you mean it should enumerate all files during the initial SRF call, or use
something other than the SRF_* macros ?

--
Justin

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Justin Pryzby (#3)
Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

Justin Pryzby <pryzby@telsasoft.com> writes:

On Sun, Mar 08, 2020 at 02:37:49PM -0400, Tom Lane wrote:

I guess we ought to change that function to use returns-a-tuplestore
protocol instead of thinking it can hold a directory open across calls.
It's not hard to think of use-cases where the existing behavior would
cause issues worse than a nanny-ish WARNING, especially on platforms
with tight "ulimit -n" limits.

Thanks for the analysis.

Do you mean it should enumerate all files during the initial SRF call, or use
something other than the SRF_* macros ?

It has to enumerate all the files during the first call. I suppose it
could do that and then still hand back the results one-at-a-time, but
there seems little point compared to filling a tuplestore immediately.
So probably the SRF_ macros are useless here.

Another possible solution is to register an exprstate-shutdown hook to
ensure the resource is cleaned up, but I'm not very happy with that
because it does nothing to prevent the hazard of overrunning the
available resources if you have several of these active at once.

I've just finished scanning the source code and concluding that all
of these functions are similarly broken:

pg_ls_dir
pg_ls_dir_files
pg_tablespace_databases
pg_logdir_ls_internal
pg_timezone_names
pgrowlocks

The first five risk leaking an open directory, the last risks leaking
an active tablescan and open relation.

I don't see anything in the documentation (either funcapi.h or
xfunc.sgml) warning that the function might not get run to completion,
either ...

regards, tom lane

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#4)
Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

I wrote:

I've just finished scanning the source code and concluding that all
of these functions are similarly broken:
pg_ls_dir
pg_ls_dir_files
pg_tablespace_databases
pg_logdir_ls_internal
pg_timezone_names
pgrowlocks

BTW, another thing I noticed while looking around is that some of
the functions using SRF_RETURN_DONE() think they should clean up
memory beforehand. This is a waste of code/cycles, as long as the
memory was properly allocated in funcctx->multi_call_memory_ctx,
because funcapi.c takes care of deleting that context.

We should probably document that *any* manual cleanup before
SRF_RETURN_DONE() is an antipattern. If you have to have cleanup,
it needs to be done via RegisterExprContextCallback instead.

regards, tom lane

#6Justin Pryzby
pryzby@telsasoft.com
In reply to: Tom Lane (#4)
1 attachment(s)
Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

On Sun, Mar 08, 2020 at 03:40:09PM -0400, Tom Lane wrote:

Justin Pryzby <pryzby@telsasoft.com> writes:

On Sun, Mar 08, 2020 at 02:37:49PM -0400, Tom Lane wrote:

I guess we ought to change that function to use returns-a-tuplestore
protocol instead of thinking it can hold a directory open across calls.

Thanks for the analysis.
Do you mean it should enumerate all files during the initial SRF call, or use
something other than the SRF_* macros ?

It has to enumerate all the files during the first call. I suppose it

I've just finished scanning the source code and concluding that all
of these functions are similarly broken:
pg_ls_dir_files

I patched this one to see what it looks like and to allow /hopefully/ moving
forward one way or another with the pg_ls_tmpfile() patch set (or at least
avoid trying to do anything there which is too inconsistent with this fix).

I don't see anything in the documentation (either funcapi.h or
xfunc.sgml) warning that the function might not get run to completion,
either ...

Also, at first glance, these seem to be passing constant "randomAccess=true"
rather than (bool) (rsinfo->allowedModes&SFRM_Materialize_Random)

$ git grep -wl SFRM_Materialize |xargs grep -l 'tuplestore_begin_heap(true'
contrib/dblink/dblink.c
contrib/pageinspect/brinfuncs.c
contrib/pg_stat_statements/pg_stat_statements.c
src/backend/access/transam/xlogfuncs.c
src/backend/commands/event_trigger.c
src/backend/commands/extension.c
src/backend/foreign/foreign.c
src/backend/replication/logical/launcher.c
src/backend/replication/logical/logicalfuncs.c
src/backend/replication/logical/origin.c
src/backend/replication/slotfuncs.c
src/backend/replication/walsender.c
src/backend/storage/ipc/shmem.c
src/backend/utils/adt/pgstatfuncs.c
src/backend/utils/misc/guc.c
src/backend/utils/misc/pg_config.c

--
Justin

Attachments:

v1-0001-pg_ls_dir_files-avoid-leaking-DIR-if-not-run-to-c.patchtext/x-diff; charset=us-asciiDownload
From 1fd2918d65ed8cc64158e407e56ed61b44e951db Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Tue, 10 Mar 2020 21:01:47 -0500
Subject: [PATCH v1] pg_ls_dir_files: avoid leaking DIR if not run to
 completion

Change to return a tuplestore rather than leaving directory open across calls.

Discussion: https://www.postgresql.org/message-id/20200308173103.GC1357%40telsasoft.com
See also: 9cb7db3f0
---
 src/backend/utils/adt/genfile.c | 90 +++++++++++++++------------------
 1 file changed, 42 insertions(+), 48 deletions(-)

diff --git a/src/backend/utils/adt/genfile.c b/src/backend/utils/adt/genfile.c
index 5bda2af87c..046d218ffa 100644
--- a/src/backend/utils/adt/genfile.c
+++ b/src/backend/utils/adt/genfile.c
@@ -527,67 +527,63 @@ pg_ls_dir_1arg(PG_FUNCTION_ARGS)
 static Datum
 pg_ls_dir_files(FunctionCallInfo fcinfo, const char *dir, bool missing_ok)
 {
-	FuncCallContext *funcctx;
 	struct dirent *de;
-	directory_fctx *fctx;
-
-	if (SRF_IS_FIRSTCALL())
-	{
-		MemoryContext oldcontext;
-		TupleDesc	tupdesc;
+	MemoryContext oldcontext;
+	TupleDesc	tupdesc;
+	ReturnSetInfo	*rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	bool		randomAccess = (rsinfo->allowedModes&SFRM_Materialize_Random) != 0;
+	DIR			*dirdesc;
+	Tuplestorestate *tupstore;
 
-		funcctx = SRF_FIRSTCALL_INIT();
-		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	/* check to see if caller supports us returning a tuplestore */
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("materialize mode required, but it is not "
+					 "allowed in this context")));
 
-		fctx = palloc(sizeof(directory_fctx));
+	/* The Tuplestore and TupleDesc should be created in ecxt_per_query_memory */
+	oldcontext = MemoryContextSwitchTo(rsinfo->econtext->ecxt_per_query_memory);
 
-		tupdesc = CreateTemplateTupleDesc(3);
-		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "name",
-						   TEXTOID, -1, 0);
-		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "size",
-						   INT8OID, -1, 0);
-		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "modification",
-						   TIMESTAMPTZOID, -1, 0);
-		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
 
-		fctx->location = pstrdup(dir);
-		fctx->dirdesc = AllocateDir(fctx->location);
+	tupstore = tuplestore_begin_heap(randomAccess, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
 
-		if (!fctx->dirdesc)
-		{
-			if (missing_ok && errno == ENOENT)
-			{
-				MemoryContextSwitchTo(oldcontext);
-				SRF_RETURN_DONE(funcctx);
-			}
-			else
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not open directory \"%s\": %m",
-								fctx->location)));
-		}
+	MemoryContextSwitchTo(oldcontext);
 
-		funcctx->user_fctx = fctx;
-		MemoryContextSwitchTo(oldcontext);
+	dirdesc = AllocateDir(dir);
+	if (!dirdesc)
+	{
+		if (missing_ok && errno == ENOENT)
+			return NULL; // XXX
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open directory \"%s\": %m",
+							dir)));
 	}
 
-	funcctx = SRF_PERCALL_SETUP();
-	fctx = (directory_fctx *) funcctx->user_fctx;
-
-	while ((de = ReadDir(fctx->dirdesc, fctx->location)) != NULL)
+	while ((de = ReadDir(dirdesc, dir)) != NULL)
 	{
 		Datum		values[3];
-		bool		nulls[3];
+		bool		nulls[3] = {0};
 		char		path[MAXPGPATH * 2];
 		struct stat attrib;
-		HeapTuple	tuple;
 
 		/* Skip hidden files */
 		if (de->d_name[0] == '.')
 			continue;
 
 		/* Get the file info */
-		snprintf(path, sizeof(path), "%s/%s", fctx->location, de->d_name);
+		snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
 		if (stat(path, &attrib) < 0)
 			ereport(ERROR,
 					(errcode_for_file_access(),
@@ -600,14 +596,12 @@ pg_ls_dir_files(FunctionCallInfo fcinfo, const char *dir, bool missing_ok)
 		values[0] = CStringGetTextDatum(de->d_name);
 		values[1] = Int64GetDatum((int64) attrib.st_size);
 		values[2] = TimestampTzGetDatum(time_t_to_timestamptz(attrib.st_mtime));
-		memset(nulls, 0, sizeof(nulls));
-
-		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
-		SRF_RETURN_NEXT(funcctx, HeapTupleGetDatum(tuple));
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
 	}
 
-	FreeDir(fctx->dirdesc);
-	SRF_RETURN_DONE(funcctx);
+	tuplestore_donestoring(tupstore);
+	FreeDir(dirdesc);
+	return NULL;
 }
 
 /* Function to return the list of files in the log directory */
-- 
2.17.0

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Justin Pryzby (#6)
Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

Justin Pryzby <pryzby@telsasoft.com> writes:

On Sun, Mar 08, 2020 at 02:37:49PM -0400, Tom Lane wrote:

I guess we ought to change that function to use returns-a-tuplestore
protocol instead of thinking it can hold a directory open across calls.

I patched this one to see what it looks like and to allow /hopefully/ moving
forward one way or another with the pg_ls_tmpfile() patch set (or at least
avoid trying to do anything there which is too inconsistent with this fix).

I reviewed this, added some test cases, and pushed it, so that we can see
if the buildfarm finds anything wrong. (I'm not expecting that, because
this should all be pretty portable, but you never know.) Assuming not,
we need to fix the other functions similarly, and then do something about
revising the documentation to warn against this coding style. Do you
want to have a go at that?

Also, at first glance, these seem to be passing constant "randomAccess=true"
rather than (bool) (rsinfo->allowedModes&SFRM_Materialize_Random)

Hm. Not a bug, but possibly a performance issue, if the tuplestore
gets big enough for that to matter. (I think it doesn't matter until
we start spilling to temp files.)

regards, tom lane

#8Justin Pryzby
pryzby@telsasoft.com
In reply to: Tom Lane (#7)
1 attachment(s)
Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

On Wed, Mar 11, 2020 at 03:32:38PM -0400, Tom Lane wrote:

I patched this one to see what it looks like and to allow /hopefully/ moving
forward one way or another with the pg_ls_tmpfile() patch set (or at least
avoid trying to do anything there which is too inconsistent with this fix).

I reviewed this, added some test cases, and pushed it, so that we can see

Thanks, tests were on my periphery..

| In passing, fix bogus error report for stat() failure: it was
| whining about the directory when it should be fingering the
| individual file. Doubtless a copy-and-paste error.

Thanks again ; that was my 0001 patch on the other thread. No rebase conflict
even ;)
/messages/by-id/20191228101650.GG12890@telsasoft.com

Do you want to have a go at that?

First draft attached. Note that I handled pg_ls_dir, even though I'm proposing
on the other thread to collapse/merge/meld it with pg_ls_dir_files [0]/messages/by-id/20200310183037.GA29065@telsasoft.com v9-0008-generalize-pg_ls_dir_files-and-retire-pg_ls_dir.patch.
Possibly that's a bad idea with tuplestore, due to returning a scalar vs a row
and needing to conditionally call CreateTemplateTupleDesc vs
get_call_result_type. I'll rebase that patch later today.

I didn't write test cases yet. Also didn't look for functions not on your
list.

I noticed this doesn't actually do anything, but kept it for now...except in
pg_ls_dir error case:

src/include/utils/tuplestore.h:/* tuplestore_donestoring() used to be required, but is no longer used */
src/include/utils/tuplestore.h:#define tuplestore_donestoring(state) ((void) 0)

I found a few documentation bits that I think aren't relevant, but could
possibly be misread to encourage the bad coding practice. This is about *sql*
functions:

|37.5.8. SQL Functions Returning Sets
|When an SQL function is declared as returning SETOF sometype, the function's
|final query is executed TO COMPLETION, and each row it outputs is returned as
|an element of the result set.
|...
|Set-returning functions in the select list are always evaluated as though they
|are on the inside of a nested-loop join with the rest of the FROM clause, so
|that the function(s) are run TO COMPLETION before the next row from the FROM
|clause is considered.

--
Justin

[0]: /messages/by-id/20200310183037.GA29065@telsasoft.com v9-0008-generalize-pg_ls_dir_files-and-retire-pg_ls_dir.patch
v9-0008-generalize-pg_ls_dir_files-and-retire-pg_ls_dir.patch

Attachments:

0001-SRF-avoid-leaking-resources-if-not-run-to-completion.patchtext/x-diff; charset=us-asciiDownload
From 43e7e5a9b679a4172808b248df2bc3365b6336e4 Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Wed, 11 Mar 2020 10:09:18 -0500
Subject: [PATCH] SRF: avoid leaking resources if not run to completion

Change to return a tuplestore populated immediately and returned in full.

Discussion: https://www.postgresql.org/message-id/20200308173103.GC1357%40telsasoft.com

See also: 9cb7db3f0 and 085b6b66
---
 contrib/adminpack/adminpack.c    |  83 +++++++++-------
 contrib/pgrowlocks/pgrowlocks.c  | 163 ++++++++++++++-----------------
 doc/src/sgml/xfunc.sgml          |  17 +++-
 src/backend/utils/adt/datetime.c | 100 ++++++++-----------
 src/backend/utils/adt/genfile.c  | 112 +++++++++++----------
 src/backend/utils/adt/misc.c     | 117 ++++++++++++----------
 src/include/funcapi.h            |   7 +-
 7 files changed, 306 insertions(+), 293 deletions(-)

diff --git a/contrib/adminpack/adminpack.c b/contrib/adminpack/adminpack.c
index bc45e79895..2afb999c6e 100644
--- a/contrib/adminpack/adminpack.c
+++ b/contrib/adminpack/adminpack.c
@@ -504,50 +504,57 @@ pg_logdir_ls_v1_1(PG_FUNCTION_ARGS)
 static Datum
 pg_logdir_ls_internal(FunctionCallInfo fcinfo)
 {
-	FuncCallContext *funcctx;
 	struct dirent *de;
-	directory_fctx *fctx;
+
+	ReturnSetInfo	*rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	MemoryContext	oldcontext;
+	TupleDesc		tupdesc;
+	Tuplestorestate	*tupstore;
+	bool			randomAccess;
+	DIR				*dirdesc;
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("materialize mode required, but it is not "
+					 "allowed in this context")));
 
 	if (strcmp(Log_filename, "postgresql-%Y-%m-%d_%H%M%S.log") != 0)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("the log_filename parameter must equal 'postgresql-%%Y-%%m-%%d_%%H%%M%%S.log'")));
 
-	if (SRF_IS_FIRSTCALL())
-	{
-		MemoryContext oldcontext;
-		TupleDesc	tupdesc;
-
-		funcctx = SRF_FIRSTCALL_INIT();
-		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+	/* The Tuplestore and TupleDesc should be created in ecxt_per_query_memory */
+	oldcontext = MemoryContextSwitchTo(rsinfo->econtext->ecxt_per_query_memory);
+	randomAccess = (rsinfo->allowedModes&SFRM_Materialize_Random) != 0;
+	tupstore = tuplestore_begin_heap(randomAccess, false, work_mem);
 
-		fctx = palloc(sizeof(directory_fctx));
+	/* Is this right ?? */
+	tupdesc = CreateTemplateTupleDesc(2);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "starttime",
+			TIMESTAMPOID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 2, "filename",
+			TEXTOID, -1, 0);
 
-		tupdesc = CreateTemplateTupleDesc(2);
-		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "starttime",
-						   TIMESTAMPOID, -1, 0);
-		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "filename",
-						   TEXTOID, -1, 0);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
 
-		funcctx->attinmeta = TupleDescGetAttInMetadata(tupdesc);
+	MemoryContextSwitchTo(oldcontext);
 
-		fctx->location = pstrdup(Log_directory);
-		fctx->dirdesc = AllocateDir(fctx->location);
-
-		if (!fctx->dirdesc)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not open directory \"%s\": %m",
-							fctx->location)));
-
-		funcctx->user_fctx = fctx;
-		MemoryContextSwitchTo(oldcontext);
-	}
-
-	funcctx = SRF_PERCALL_SETUP();
-	fctx = (directory_fctx *) funcctx->user_fctx;
+	dirdesc = AllocateDir(Log_directory);
+	if (!dirdesc)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open directory \"%s\": %m",
+						Log_directory)));
 
-	while ((de = ReadDir(fctx->dirdesc, fctx->location)) != NULL)
+	while ((de = ReadDir(dirdesc, Log_directory)) != NULL)
 	{
 		char	   *values[2];
 		HeapTuple	tuple;
@@ -584,13 +591,13 @@ pg_logdir_ls_internal(FunctionCallInfo fcinfo)
 		/* Seems the timestamp is OK; prepare and return tuple */
 
 		values[0] = timestampbuf;
-		values[1] = psprintf("%s/%s", fctx->location, de->d_name);
-
-		tuple = BuildTupleFromCStrings(funcctx->attinmeta, values);
+		values[1] = psprintf("%s/%s", Log_directory, de->d_name);
 
-		SRF_RETURN_NEXT(funcctx, HeapTupleGetDatum(tuple));
+		tuple = BuildTupleFromCStrings(TupleDescGetAttInMetadata(tupdesc), values);
+		tuplestore_puttuple(tupstore, tuple);
 	}
 
-	FreeDir(fctx->dirdesc);
-	SRF_RETURN_DONE(funcctx);
+	tuplestore_donestoring(tupstore);
+	FreeDir(dirdesc);
+	return (Datum) 0;
 }
diff --git a/contrib/pgrowlocks/pgrowlocks.c b/contrib/pgrowlocks/pgrowlocks.c
index a2c44a916c..908019e774 100644
--- a/contrib/pgrowlocks/pgrowlocks.c
+++ b/contrib/pgrowlocks/pgrowlocks.c
@@ -54,13 +54,6 @@ PG_FUNCTION_INFO_V1(pgrowlocks);
 
 #define NCHARS 32
 
-typedef struct
-{
-	Relation	rel;
-	TableScanDesc scan;
-	int			ncolumns;
-} MyData;
-
 #define		Atnum_tid		0
 #define		Atnum_xmax		1
 #define		Atnum_ismulti	2
@@ -71,82 +64,85 @@ typedef struct
 Datum
 pgrowlocks(PG_FUNCTION_ARGS)
 {
-	FuncCallContext *funcctx;
 	TableScanDesc scan;
 	HeapScanDesc hscan;
 	HeapTuple	tuple;
 	TupleDesc	tupdesc;
 	AttInMetadata *attinmeta;
-	Datum		result;
-	MyData	   *mydata;
 	Relation	rel;
 
-	if (SRF_IS_FIRSTCALL())
-	{
-		text	   *relname;
-		RangeVar   *relrv;
-		MemoryContext oldcontext;
-		AclResult	aclresult;
-
-		funcctx = SRF_FIRSTCALL_INIT();
-		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
-
-		/* Build a tuple descriptor for our result type */
-		if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
-			elog(ERROR, "return type must be a row type");
-
-		attinmeta = TupleDescGetAttInMetadata(tupdesc);
-		funcctx->attinmeta = attinmeta;
-
-		relname = PG_GETARG_TEXT_PP(0);
-		relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname));
-		rel = relation_openrv(relrv, AccessShareLock);
-
-		if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
-			ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-							errmsg("only heap AM is supported")));
-
-		if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-			ereport(ERROR,
-					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("\"%s\" is a partitioned table",
-							RelationGetRelationName(rel)),
-					 errdetail("Partitioned tables do not contain rows.")));
-		else if (rel->rd_rel->relkind != RELKIND_RELATION)
-			ereport(ERROR,
-					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("\"%s\" is not a table",
-							RelationGetRelationName(rel))));
+	ReturnSetInfo   *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	Tuplestorestate	*tupstore;
+	bool            randomAccess;
+
+	text	   *relname;
+	RangeVar   *relrv;
+	MemoryContext oldcontext;
+	AclResult	aclresult;
+	char	  **values;
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("materialize mode required, but it is not "
+					 "allowed in this context")));
+
+	/* The Tuplestore and TupleDesc should be created in ecxt_per_query_memory */
+	oldcontext = MemoryContextSwitchTo(rsinfo->econtext->ecxt_per_query_memory);
+	randomAccess = (rsinfo->allowedModes&SFRM_Materialize_Random) != 0;
+	tupstore = tuplestore_begin_heap(randomAccess, false, work_mem);
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	attinmeta = TupleDescGetAttInMetadata(tupdesc);
+
+	relname = PG_GETARG_TEXT_PP(0);
+	relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname));
+	rel = relation_openrv(relrv, AccessShareLock);
+
+	if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("only heap AM is supported")));
+
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("\"%s\" is a partitioned table",
+						RelationGetRelationName(rel)),
+				 errdetail("Partitioned tables do not contain rows.")));
+	else if (rel->rd_rel->relkind != RELKIND_RELATION)
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("\"%s\" is not a table",
+						RelationGetRelationName(rel))));
+
+	/*
+	 * check permissions: must have SELECT on table or be in
+	 * pg_stat_scan_tables
+	 */
+	aclresult = pg_class_aclcheck(RelationGetRelid(rel), GetUserId(),
+								  ACL_SELECT);
+	if (aclresult != ACLCHECK_OK)
+		aclresult = is_member_of_role(GetUserId(), DEFAULT_ROLE_STAT_SCAN_TABLES) ? ACLCHECK_OK : ACLCHECK_NO_PRIV;
+
+	if (aclresult != ACLCHECK_OK)
+		aclcheck_error(aclresult, get_relkind_objtype(rel->rd_rel->relkind),
+					   RelationGetRelationName(rel));
+
+	scan = table_beginscan(rel, GetActiveSnapshot(), 0, NULL);
+	hscan = (HeapScanDesc) scan;
 
-		/*
-		 * check permissions: must have SELECT on table or be in
-		 * pg_stat_scan_tables
-		 */
-		aclresult = pg_class_aclcheck(RelationGetRelid(rel), GetUserId(),
-									  ACL_SELECT);
-		if (aclresult != ACLCHECK_OK)
-			aclresult = is_member_of_role(GetUserId(), DEFAULT_ROLE_STAT_SCAN_TABLES) ? ACLCHECK_OK : ACLCHECK_NO_PRIV;
-
-		if (aclresult != ACLCHECK_OK)
-			aclcheck_error(aclresult, get_relkind_objtype(rel->rd_rel->relkind),
-						   RelationGetRelationName(rel));
-
-		scan = table_beginscan(rel, GetActiveSnapshot(), 0, NULL);
-		hscan = (HeapScanDesc) scan;
-		mydata = palloc(sizeof(*mydata));
-		mydata->rel = rel;
-		mydata->scan = scan;
-		mydata->ncolumns = tupdesc->natts;
-		funcctx->user_fctx = mydata;
-
-		MemoryContextSwitchTo(oldcontext);
-	}
+	MemoryContextSwitchTo(oldcontext);
 
-	funcctx = SRF_PERCALL_SETUP();
-	attinmeta = funcctx->attinmeta;
-	mydata = (MyData *) funcctx->user_fctx;
-	scan = mydata->scan;
-	hscan = (HeapScanDesc) scan;
+	values = (char **) palloc(tupdesc->natts * sizeof(char *));
 
 	/* scan the relation */
 	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
@@ -169,9 +165,6 @@ pgrowlocks(PG_FUNCTION_ARGS)
 		 */
 		if (htsu == TM_BeingModified)
 		{
-			char	  **values;
-
-			values = (char **) palloc(mydata->ncolumns * sizeof(char *));
 
 			values[Atnum_tid] = (char *) DirectFunctionCall1(tidout,
 															 PointerGetDatum(&tuple->t_self));
@@ -297,16 +290,7 @@ pgrowlocks(PG_FUNCTION_ARGS)
 
 			/* build a tuple */
 			tuple = BuildTupleFromCStrings(attinmeta, values);
-
-			/* make the tuple into a datum */
-			result = HeapTupleGetDatum(tuple);
-
-			/*
-			 * no need to pfree what we allocated; it's on a short-lived
-			 * memory context anyway
-			 */
-
-			SRF_RETURN_NEXT(funcctx, result);
+			tuplestore_puttuple(tupstore, tuple);
 		}
 		else
 		{
@@ -314,8 +298,9 @@ pgrowlocks(PG_FUNCTION_ARGS)
 		}
 	}
 
+	tuplestore_donestoring(tupstore);
 	table_endscan(scan);
-	table_close(mydata->rel, AccessShareLock);
+	table_close(rel, AccessShareLock);
 
-	SRF_RETURN_DONE(funcctx);
+	return (Datum) 0;
 }
diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml
index ca5e6efd7e..11311905c7 100644
--- a/doc/src/sgml/xfunc.sgml
+++ b/doc/src/sgml/xfunc.sgml
@@ -2812,7 +2812,7 @@ HeapTupleGetDatum(HeapTuple tuple)
     <title>Returning Sets</title>
 
     <para>
-     There is also a special API that provides support for returning
+     There is also a special, helper API that provides support for returning
      sets (multiple rows) from a C-language function.  A set-returning
      function must follow the version-1 calling conventions.  Also,
      source files must include <filename>funcapi.h</filename>, as
@@ -2820,10 +2820,19 @@ HeapTupleGetDatum(HeapTuple tuple)
     </para>
 
     <para>
-     A set-returning function (<acronym>SRF</acronym>) is called
-     once for each item it returns.  The <acronym>SRF</acronym> must
-     therefore save enough state to remember what it was doing and
+     The helper API allows writing a set-returning function
+     (<acronym>SRF</acronym>) using the ValuePerCall mode, in which the SRF is
+     called once for each item it returns.  The <acronym>SRF</acronym> must
+     then save enough state to remember what it was doing and
      return the next item on each call.
+     Note, however, that an SRF will possibly not be run to completion, due to
+     a LIMIT or a cursor, and should avoid leaving resources like DIR*s or
+     tablescans opened across calls.  To avoid leaking resources in those cases,
+     instead of using ValuePerCall mode, the SRF should populate and return a
+     tuplestore with SFRM_Materialize mode.  See fmgr/README.
+    </para>
+
+    <para>
      The structure <structname>FuncCallContext</structname> is provided to help
      control this process.  Within a function, <literal>fcinfo-&gt;flinfo-&gt;fn_extra</literal>
      is used to hold a pointer to <structname>FuncCallContext</structname>
diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c
index 4f109111d1..3c5a428654 100644
--- a/src/backend/utils/adt/datetime.c
+++ b/src/backend/utils/adt/datetime.c
@@ -4756,11 +4756,8 @@ Datum
 pg_timezone_names(PG_FUNCTION_ARGS)
 {
 	MemoryContext oldcontext;
-	FuncCallContext *funcctx;
 	pg_tzenum  *tzenum;
 	pg_tz	   *tz;
-	Datum		result;
-	HeapTuple	tuple;
 	Datum		values[4];
 	bool		nulls[4];
 	int			tzoff;
@@ -4770,58 +4767,45 @@ pg_timezone_names(PG_FUNCTION_ARGS)
 	Interval   *resInterval;
 	struct pg_tm itm;
 
-	/* stuff done only on the first call of the function */
-	if (SRF_IS_FIRSTCALL())
-	{
-		TupleDesc	tupdesc;
+	ReturnSetInfo   *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 
-		/* create a function context for cross-call persistence */
-		funcctx = SRF_FIRSTCALL_INIT();
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	bool            randomAccess;
 
-		/*
-		 * switch to memory context appropriate for multiple function calls
-		 */
-		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("materialize mode required, but it is not "
+					 "allowed in this context")));
 
-		/* initialize timezone scanning code */
-		tzenum = pg_tzenumerate_start();
-		funcctx->user_fctx = (void *) tzenum;
+	/* The Tuplestore and TupleDesc should be created in ecxt_per_query_memory */
+	oldcontext = MemoryContextSwitchTo(rsinfo->econtext->ecxt_per_query_memory);
+	randomAccess = (rsinfo->allowedModes&SFRM_Materialize_Random) != 0;
+	tupstore = tuplestore_begin_heap(randomAccess, false, work_mem);
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
 
-		/*
-		 * build tupdesc for result tuples. This must match this function's
-		 * pg_proc entry!
-		 */
-		tupdesc = CreateTemplateTupleDesc(4);
-		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "name",
-						   TEXTOID, -1, 0);
-		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "abbrev",
-						   TEXTOID, -1, 0);
-		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "utc_offset",
-						   INTERVALOID, -1, 0);
-		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "is_dst",
-						   BOOLOID, -1, 0);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
 
-		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
-		MemoryContextSwitchTo(oldcontext);
-	}
+	MemoryContextSwitchTo(oldcontext);
 
-	/* stuff done on every call of the function */
-	funcctx = SRF_PERCALL_SETUP();
-	tzenum = (pg_tzenum *) funcctx->user_fctx;
+	/* initialize timezone scanning code */
+	tzenum = pg_tzenumerate_start();
 
 	/* search for another zone to display */
 	for (;;)
 	{
-		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
 		tz = pg_tzenumerate_next(tzenum);
-		MemoryContextSwitchTo(oldcontext);
-
 		if (!tz)
-		{
-			pg_tzenumerate_end(tzenum);
-			funcctx->user_fctx = NULL;
-			SRF_RETURN_DONE(funcctx);
-		}
+			break;
 
 		/* Convert now() to local time in this zone */
 		if (timestamp2tm(GetCurrentTransactionStartTimestamp(),
@@ -4840,25 +4824,23 @@ pg_timezone_names(PG_FUNCTION_ARGS)
 		if (tzn && strlen(tzn) > 31)
 			continue;
 
-		/* Found a displayable zone */
-		break;
-	}
-
-	MemSet(nulls, 0, sizeof(nulls));
 
-	values[0] = CStringGetTextDatum(pg_get_timezone_name(tz));
-	values[1] = CStringGetTextDatum(tzn ? tzn : "");
+		MemSet(nulls, 0, sizeof(nulls));
+		values[0] = CStringGetTextDatum(pg_get_timezone_name(tz));
+		values[1] = CStringGetTextDatum(tzn ? tzn : "");
 
-	MemSet(&itm, 0, sizeof(struct pg_tm));
-	itm.tm_sec = -tzoff;
-	resInterval = (Interval *) palloc(sizeof(Interval));
-	tm2interval(&itm, 0, resInterval);
-	values[2] = IntervalPGetDatum(resInterval);
+		MemSet(&itm, 0, sizeof(struct pg_tm));
+		itm.tm_sec = -tzoff;
+		resInterval = (Interval *) palloc(sizeof(Interval));
+		tm2interval(&itm, 0, resInterval);
+		values[2] = IntervalPGetDatum(resInterval);
 
-	values[3] = BoolGetDatum(tm.tm_isdst > 0);
+		values[3] = BoolGetDatum(tm.tm_isdst > 0);
 
-	tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
-	result = HeapTupleGetDatum(tuple);
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
 
-	SRF_RETURN_NEXT(funcctx, result);
+	pg_tzenumerate_end(tzenum);
+	tuplestore_donestoring(tupstore);
+	return (Datum) 0;
 }
diff --git a/src/backend/utils/adt/genfile.c b/src/backend/utils/adt/genfile.c
index bcf9bd1b97..e348089418 100644
--- a/src/backend/utils/adt/genfile.c
+++ b/src/backend/utils/adt/genfile.c
@@ -36,13 +36,6 @@
 #include "utils/syscache.h"
 #include "utils/timestamp.h"
 
-typedef struct
-{
-	char	   *location;
-	DIR		   *dirdesc;
-	bool		include_dot_dirs;
-} directory_fctx;
-
 
 /*
  * Convert a "text" filename argument to C string, and check it's allowable.
@@ -447,67 +440,82 @@ pg_stat_file_1arg(PG_FUNCTION_ARGS)
 Datum
 pg_ls_dir(PG_FUNCTION_ARGS)
 {
-	FuncCallContext *funcctx;
-	struct dirent *de;
-	directory_fctx *fctx;
-	MemoryContext oldcontext;
+	ReturnSetInfo	*rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	MemoryContext	oldcontext;
+	TupleDesc		tupdesc;
+	Tuplestorestate *tupstore;
+	bool			randomAccess;
 
-	if (SRF_IS_FIRSTCALL())
+	bool			missing_ok = false;
+	bool			include_dot_dirs = false;
+	char			*location;
+	DIR				*dirdesc;
+	struct dirent	*de;
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("materialize mode required, but it is not "
+					 "allowed in this context")));
+
+	/* check the optional arguments */
+	if (PG_NARGS() == 3)
 	{
-		bool		missing_ok = false;
-		bool		include_dot_dirs = false;
+		if (!PG_ARGISNULL(1))
+			missing_ok = PG_GETARG_BOOL(1);
+		if (!PG_ARGISNULL(2))
+			include_dot_dirs = PG_GETARG_BOOL(2);
+	}
 
-		/* check the optional arguments */
-		if (PG_NARGS() == 3)
-		{
-			if (!PG_ARGISNULL(1))
-				missing_ok = PG_GETARG_BOOL(1);
-			if (!PG_ARGISNULL(2))
-				include_dot_dirs = PG_GETARG_BOOL(2);
-		}
+	/* The Tuplestore and TupleDesc should be created in ecxt_per_query_memory */
+	oldcontext = MemoryContextSwitchTo(rsinfo->econtext->ecxt_per_query_memory);
+	randomAccess = (rsinfo->allowedModes&SFRM_Materialize_Random) != 0;
+	tupstore = tuplestore_begin_heap(randomAccess, false, work_mem);
 
-		funcctx = SRF_FIRSTCALL_INIT();
-		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "column", TEXTOID, -1, 0);
 
-		fctx = palloc(sizeof(directory_fctx));
-		fctx->location = convert_and_check_filename(PG_GETARG_TEXT_PP(0));
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+	MemoryContextSwitchTo(oldcontext);
 
-		fctx->include_dot_dirs = include_dot_dirs;
-		fctx->dirdesc = AllocateDir(fctx->location);
+	location = convert_and_check_filename(PG_GETARG_TEXT_PP(0));
+	dirdesc = AllocateDir(location);
 
-		if (!fctx->dirdesc)
-		{
-			if (missing_ok && errno == ENOENT)
-			{
-				MemoryContextSwitchTo(oldcontext);
-				SRF_RETURN_DONE(funcctx);
-			}
-			else
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not open directory \"%s\": %m",
-								fctx->location)));
-		}
-		funcctx->user_fctx = fctx;
-		MemoryContextSwitchTo(oldcontext);
+	if (!dirdesc)
+	{
+		if (missing_ok && errno == ENOENT)
+			return (Datum) 0;
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open directory \"%s\": %m",
+							location)));
 	}
 
-	funcctx = SRF_PERCALL_SETUP();
-	fctx = (directory_fctx *) funcctx->user_fctx;
-
-	while ((de = ReadDir(fctx->dirdesc, fctx->location)) != NULL)
+	while ((de = ReadDir(dirdesc, location)) != NULL)
 	{
-		if (!fctx->include_dot_dirs &&
+		Datum		values[1];
+		bool		nulls[1] = {false};
+
+		if (!include_dot_dirs &&
 			(strcmp(de->d_name, ".") == 0 ||
 			 strcmp(de->d_name, "..") == 0))
 			continue;
 
-		SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(de->d_name));
+		values[0] = CStringGetTextDatum(de->d_name);
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
 	}
 
-	FreeDir(fctx->dirdesc);
-
-	SRF_RETURN_DONE(funcctx);
+	tuplestore_donestoring(tupstore);
+	FreeDir(dirdesc);
+	return (Datum) 0;
 }
 
 /*
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index 323e36b81c..789febebe3 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -194,72 +194,87 @@ current_query(PG_FUNCTION_ARGS)
 
 /* Function to find out which databases make use of a tablespace */
 
-typedef struct
-{
-	char	   *location;
-	DIR		   *dirdesc;
-} ts_db_fctx;
-
 Datum
 pg_tablespace_databases(PG_FUNCTION_ARGS)
 {
-	FuncCallContext *funcctx;
 	struct dirent *de;
-	ts_db_fctx *fctx;
 
-	if (SRF_IS_FIRSTCALL())
-	{
-		MemoryContext oldcontext;
-		Oid			tablespaceOid = PG_GETARG_OID(0);
+	MemoryContext oldcontext;
+	Oid			tablespaceOid = PG_GETARG_OID(0);
 
-		funcctx = SRF_FIRSTCALL_INIT();
-		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+	ReturnSetInfo	*rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc		tupdesc;
+	Tuplestorestate *tupstore;
+	bool			randomAccess;
 
-		fctx = palloc(sizeof(ts_db_fctx));
+	DIR				*dirdesc;
+	char			*location;
 
-		if (tablespaceOid == GLOBALTABLESPACE_OID)
-		{
-			fctx->dirdesc = NULL;
-			ereport(WARNING,
-					(errmsg("global tablespace never has databases")));
-		}
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("materialize mode required, but it is not "
+					 "allowed in this context")));
+
+	if (tablespaceOid == GLOBALTABLESPACE_OID)
+	{
+		dirdesc = NULL;
+		ereport(WARNING,
+				(errmsg("global tablespace never has databases")));
+	}
+	else
+	{
+		if (tablespaceOid == DEFAULTTABLESPACE_OID)
+			location = psprintf("base");
 		else
-		{
-			if (tablespaceOid == DEFAULTTABLESPACE_OID)
-				fctx->location = psprintf("base");
-			else
-				fctx->location = psprintf("pg_tblspc/%u/%s", tablespaceOid,
-										  TABLESPACE_VERSION_DIRECTORY);
+			location = psprintf("pg_tblspc/%u/%s", tablespaceOid,
+									  TABLESPACE_VERSION_DIRECTORY);
 
-			fctx->dirdesc = AllocateDir(fctx->location);
+		dirdesc = AllocateDir(location);
 
-			if (!fctx->dirdesc)
-			{
-				/* the only expected error is ENOENT */
-				if (errno != ENOENT)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not open directory \"%s\": %m",
-									fctx->location)));
-				ereport(WARNING,
-						(errmsg("%u is not a tablespace OID", tablespaceOid)));
-			}
+		if (!dirdesc)
+		{
+			/* the only expected error is ENOENT */
+			if (errno != ENOENT)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not open directory \"%s\": %m",
+								location)));
+			ereport(WARNING,
+					(errmsg("%u is not a tablespace OID", tablespaceOid)));
 		}
-		funcctx->user_fctx = fctx;
-		MemoryContextSwitchTo(oldcontext);
 	}
 
-	funcctx = SRF_PERCALL_SETUP();
-	fctx = (ts_db_fctx *) funcctx->user_fctx;
+	/* The Tuplestore and TupleDesc should be created in ecxt_per_query_memory */
+	oldcontext = MemoryContextSwitchTo(rsinfo->econtext->ecxt_per_query_memory);
+
+	randomAccess = (rsinfo->allowedModes&SFRM_Materialize_Random) != 0;
+	tupstore = tuplestore_begin_heap(randomAccess, false, work_mem);
 
-	if (!fctx->dirdesc)			/* not a tablespace */
-		SRF_RETURN_DONE(funcctx);
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "column", OIDOID, -1, 0);
 
-	while ((de = ReadDir(fctx->dirdesc, fctx->location)) != NULL)
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (!dirdesc)			/* not a tablespace */
+		return 0;
+
+	while ((de = ReadDir(dirdesc, location)) != NULL)
 	{
 		Oid			datOid = atooid(de->d_name);
 		char	   *subdir;
 		bool		isempty;
+		Datum		values[1];
+		bool		nulls[1] = {false};
 
 		/* this test skips . and .., but is awfully weak */
 		if (!datOid)
@@ -267,18 +282,20 @@ pg_tablespace_databases(PG_FUNCTION_ARGS)
 
 		/* if database subdir is empty, don't report tablespace as used */
 
-		subdir = psprintf("%s/%s", fctx->location, de->d_name);
+		subdir = psprintf("%s/%s", location, de->d_name);
 		isempty = directory_is_empty(subdir);
 		pfree(subdir);
 
 		if (isempty)
 			continue;			/* indeed, nothing in it */
 
-		SRF_RETURN_NEXT(funcctx, ObjectIdGetDatum(datOid));
+		values[0] = ObjectIdGetDatum(datOid);
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
 	}
 
-	FreeDir(fctx->dirdesc);
-	SRF_RETURN_DONE(funcctx);
+	tuplestore_donestoring(tupstore);
+	FreeDir(dirdesc);
+	return (Datum) 0;
 }
 
 
diff --git a/src/include/funcapi.h b/src/include/funcapi.h
index f9b75ae390..471100b7e5 100644
--- a/src/include/funcapi.h
+++ b/src/include/funcapi.h
@@ -234,7 +234,12 @@ extern Datum HeapTupleHeaderGetDatum(HeapTupleHeader tuple);
 /*----------
  *		Support for Set Returning Functions (SRFs)
  *
- * The basic API for SRFs looks something like:
+ * The basic API for SRFs using ValuePerCall mode looks something like this.
+ * Note that an SRF will possibly not be run to completion, due to a LIMIT or a
+ * cursor, and should avoid leaving resources like DIR*s or tablescans opened
+ * across calls.  To avoid leaking resources in those cases, instead of using
+ * ValuePerCall mode, the SRF should populate and return a tuplestore with
+ * SFRM_Materialize mode.  See fmgr/README.
  *
  * Datum
  * my_Set_Returning_Function(PG_FUNCTION_ARGS)
-- 
2.17.0

#9Justin Pryzby
pryzby@telsasoft.com
In reply to: Tom Lane (#5)
Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

On Sun, Mar 08, 2020 at 04:30:44PM -0400, Tom Lane wrote:

BTW, another thing I noticed while looking around is that some of
the functions using SRF_RETURN_DONE() think they should clean up
memory beforehand. This is a waste of code/cycles, as long as the
memory was properly allocated in funcctx->multi_call_memory_ctx,
because funcapi.c takes care of deleting that context.

We should probably document that *any* manual cleanup before
SRF_RETURN_DONE() is an antipattern. If you have to have cleanup,
it needs to be done via RegisterExprContextCallback instead.

This part appears to be already in place since
e4186762ffaa4188e16702e8f4f299ea70988b96:

|The memory context that is current when the SRF is called is a transient
|context that will be cleared between calls. This means that you do not need to
|call pfree on everything you allocated using palloc; it will go away anyway.
|However, if you want to allocate any data structures to live across calls, you
|need to put them somewhere else. The memory context referenced by
|multi_call_memory_ctx is a suitable location for any data that needs to survive
|until the SRF is finished running. In most cases, this means that you should
|switch into multi_call_memory_ctx while doing the first-call setup.

--
Justin

#10Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Justin Pryzby (#8)
Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

Nitpick: please see c4dcd9144ba6.

From: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Wed, 11 Mar 2020 10:09:18 -0500
Subject: [PATCH] SRF: avoid leaking resources if not run to completion

Change to return a tuplestore populated immediately and returned in full.

Discussion: /messages/by-id/20200308173103.GC1357@telsasoft.com

I wonder if this isn't saying that the whole value-per-call protocol is
bogus, in that it seems impossible to write a useful function with it.
Maybe we should add one final call with a special flag "function
shutdown" or something, so that these resources can be released if the
SRF isn't run to completion?

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#10)
Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

I wonder if this isn't saying that the whole value-per-call protocol is
bogus, in that it seems impossible to write a useful function with it.

Only if you have a *very* narrow definition of "useful function".
If you look through SRF_RETURN_DONE callers, only a small minority
are trying to do resource cleanup beforehand.

Maybe we should add one final call with a special flag "function
shutdown" or something, so that these resources can be released if the
SRF isn't run to completion?

We already have an appropriate mechanism for cleaning up resources,
ie RegisterExprContextCallback. I do not think what you're suggesting
could be made to work without an incompatible break in the API for
SRFs, so it's not an improvement over telling people more forcefully
about how to do resource cleanup correctly.

regards, tom lane

#12Justin Pryzby
pryzby@telsasoft.com
In reply to: Justin Pryzby (#8)
2 attachment(s)
Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

On Thu, Mar 12, 2020 at 07:11:56AM -0500, Justin Pryzby wrote:

Do you want to have a go at that?

First draft attached. Note that I handled pg_ls_dir, even though I'm proposing
on the other thread to collapse/merge/meld it with pg_ls_dir_files [0].
Possibly that's a bad idea with tuplestore, due to returning a scalar vs a row
and needing to conditionally call CreateTemplateTupleDesc vs
get_call_result_type. I'll rebase that patch later today.

I didn't write test cases yet. Also didn't look for functions not on your
list.

I noticed this doesn't actually do anything, but kept it for now...except in
pg_ls_dir error case:

src/include/utils/tuplestore.h:/* tuplestore_donestoring() used to be required, but is no longer used */
src/include/utils/tuplestore.h:#define tuplestore_donestoring(state) ((void) 0)

v2 attached - I will add to next CF in case you want to defer it until later.

--
Justin

Attachments:

v2-0001-SRF-avoid-leaking-resources-if-not-run-to-complet.patchtext/x-diff; charset=us-asciiDownload
From 9ae75f693adf246ad551e258198606fd63190a20 Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Wed, 11 Mar 2020 10:09:18 -0500
Subject: [PATCH v2 1/2] SRF: avoid leaking resources if not run to completion

Change to return a tuplestore populated immediately and returned in full.

Discussion: https://www.postgresql.org/message-id/20200308173103.GC1357%40telsasoft.com

See also: 9cb7db3f0 and 085b6b66
---
 contrib/adminpack/adminpack.c                |  82 +++++-----
 contrib/pgrowlocks/pgrowlocks.c              | 162 +++++++++----------
 doc/src/sgml/xfunc.sgml                      |  17 +-
 src/backend/utils/adt/datetime.c             |  99 +++++-------
 src/backend/utils/adt/genfile.c              | 111 +++++++------
 src/backend/utils/adt/misc.c                 | 116 +++++++------
 src/include/funcapi.h                        |   7 +-
 src/test/regress/expected/misc_functions.out |  18 +++
 src/test/regress/sql/misc_functions.sql      |   6 +
 9 files changed, 325 insertions(+), 293 deletions(-)

diff --git a/contrib/adminpack/adminpack.c b/contrib/adminpack/adminpack.c
index bc45e79895..070025f5a8 100644
--- a/contrib/adminpack/adminpack.c
+++ b/contrib/adminpack/adminpack.c
@@ -504,50 +504,56 @@ pg_logdir_ls_v1_1(PG_FUNCTION_ARGS)
 static Datum
 pg_logdir_ls_internal(FunctionCallInfo fcinfo)
 {
-	FuncCallContext *funcctx;
 	struct dirent *de;
-	directory_fctx *fctx;
+
+	ReturnSetInfo	*rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	MemoryContext	oldcontext;
+	TupleDesc		tupdesc;
+	Tuplestorestate	*tupstore;
+	bool			randomAccess;
+	DIR				*dirdesc;
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
 
 	if (strcmp(Log_filename, "postgresql-%Y-%m-%d_%H%M%S.log") != 0)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("the log_filename parameter must equal 'postgresql-%%Y-%%m-%%d_%%H%%M%%S.log'")));
 
-	if (SRF_IS_FIRSTCALL())
-	{
-		MemoryContext oldcontext;
-		TupleDesc	tupdesc;
-
-		funcctx = SRF_FIRSTCALL_INIT();
-		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+	/* The Tuplestore and TupleDesc should be created in ecxt_per_query_memory */
+	oldcontext = MemoryContextSwitchTo(rsinfo->econtext->ecxt_per_query_memory);
+	randomAccess = (rsinfo->allowedModes&SFRM_Materialize_Random) != 0;
+	tupstore = tuplestore_begin_heap(randomAccess, false, work_mem);
 
-		fctx = palloc(sizeof(directory_fctx));
+	/* Is this right ?? */
+	tupdesc = CreateTemplateTupleDesc(2);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "starttime",
+			TIMESTAMPOID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 2, "filename",
+			TEXTOID, -1, 0);
 
-		tupdesc = CreateTemplateTupleDesc(2);
-		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "starttime",
-						   TIMESTAMPOID, -1, 0);
-		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "filename",
-						   TEXTOID, -1, 0);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
 
-		funcctx->attinmeta = TupleDescGetAttInMetadata(tupdesc);
+	MemoryContextSwitchTo(oldcontext);
 
-		fctx->location = pstrdup(Log_directory);
-		fctx->dirdesc = AllocateDir(fctx->location);
-
-		if (!fctx->dirdesc)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not open directory \"%s\": %m",
-							fctx->location)));
-
-		funcctx->user_fctx = fctx;
-		MemoryContextSwitchTo(oldcontext);
-	}
-
-	funcctx = SRF_PERCALL_SETUP();
-	fctx = (directory_fctx *) funcctx->user_fctx;
+	dirdesc = AllocateDir(Log_directory);
+	if (!dirdesc)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open directory \"%s\": %m",
+						Log_directory)));
 
-	while ((de = ReadDir(fctx->dirdesc, fctx->location)) != NULL)
+	while ((de = ReadDir(dirdesc, Log_directory)) != NULL)
 	{
 		char	   *values[2];
 		HeapTuple	tuple;
@@ -584,13 +590,13 @@ pg_logdir_ls_internal(FunctionCallInfo fcinfo)
 		/* Seems the timestamp is OK; prepare and return tuple */
 
 		values[0] = timestampbuf;
-		values[1] = psprintf("%s/%s", fctx->location, de->d_name);
-
-		tuple = BuildTupleFromCStrings(funcctx->attinmeta, values);
+		values[1] = psprintf("%s/%s", Log_directory, de->d_name);
 
-		SRF_RETURN_NEXT(funcctx, HeapTupleGetDatum(tuple));
+		tuple = BuildTupleFromCStrings(TupleDescGetAttInMetadata(tupdesc), values);
+		tuplestore_puttuple(tupstore, tuple);
 	}
 
-	FreeDir(fctx->dirdesc);
-	SRF_RETURN_DONE(funcctx);
+	tuplestore_donestoring(tupstore);
+	FreeDir(dirdesc);
+	return (Datum) 0;
 }
diff --git a/contrib/pgrowlocks/pgrowlocks.c b/contrib/pgrowlocks/pgrowlocks.c
index a2c44a916c..baad83fcde 100644
--- a/contrib/pgrowlocks/pgrowlocks.c
+++ b/contrib/pgrowlocks/pgrowlocks.c
@@ -54,13 +54,6 @@ PG_FUNCTION_INFO_V1(pgrowlocks);
 
 #define NCHARS 32
 
-typedef struct
-{
-	Relation	rel;
-	TableScanDesc scan;
-	int			ncolumns;
-} MyData;
-
 #define		Atnum_tid		0
 #define		Atnum_xmax		1
 #define		Atnum_ismulti	2
@@ -71,82 +64,84 @@ typedef struct
 Datum
 pgrowlocks(PG_FUNCTION_ARGS)
 {
-	FuncCallContext *funcctx;
 	TableScanDesc scan;
 	HeapScanDesc hscan;
 	HeapTuple	tuple;
 	TupleDesc	tupdesc;
 	AttInMetadata *attinmeta;
-	Datum		result;
-	MyData	   *mydata;
 	Relation	rel;
 
-	if (SRF_IS_FIRSTCALL())
-	{
-		text	   *relname;
-		RangeVar   *relrv;
-		MemoryContext oldcontext;
-		AclResult	aclresult;
-
-		funcctx = SRF_FIRSTCALL_INIT();
-		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
-
-		/* Build a tuple descriptor for our result type */
-		if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
-			elog(ERROR, "return type must be a row type");
-
-		attinmeta = TupleDescGetAttInMetadata(tupdesc);
-		funcctx->attinmeta = attinmeta;
-
-		relname = PG_GETARG_TEXT_PP(0);
-		relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname));
-		rel = relation_openrv(relrv, AccessShareLock);
-
-		if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
-			ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-							errmsg("only heap AM is supported")));
-
-		if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-			ereport(ERROR,
-					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("\"%s\" is a partitioned table",
-							RelationGetRelationName(rel)),
-					 errdetail("Partitioned tables do not contain rows.")));
-		else if (rel->rd_rel->relkind != RELKIND_RELATION)
-			ereport(ERROR,
-					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("\"%s\" is not a table",
-							RelationGetRelationName(rel))));
+	ReturnSetInfo   *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	Tuplestorestate	*tupstore;
+	bool            randomAccess;
+
+	text	   *relname;
+	RangeVar   *relrv;
+	MemoryContext oldcontext;
+	AclResult	aclresult;
+	char	  **values;
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	/* The Tuplestore and TupleDesc should be created in ecxt_per_query_memory */
+	oldcontext = MemoryContextSwitchTo(rsinfo->econtext->ecxt_per_query_memory);
+	randomAccess = (rsinfo->allowedModes&SFRM_Materialize_Random) != 0;
+	tupstore = tuplestore_begin_heap(randomAccess, false, work_mem);
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	attinmeta = TupleDescGetAttInMetadata(tupdesc);
+
+	relname = PG_GETARG_TEXT_PP(0);
+	relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname));
+	rel = relation_openrv(relrv, AccessShareLock);
+
+	if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("only heap AM is supported")));
+
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("\"%s\" is a partitioned table",
+						RelationGetRelationName(rel)),
+				 errdetail("Partitioned tables do not contain rows.")));
+	else if (rel->rd_rel->relkind != RELKIND_RELATION)
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("\"%s\" is not a table",
+						RelationGetRelationName(rel))));
+
+	/*
+	 * check permissions: must have SELECT on table or be in
+	 * pg_stat_scan_tables
+	 */
+	aclresult = pg_class_aclcheck(RelationGetRelid(rel), GetUserId(),
+								  ACL_SELECT);
+	if (aclresult != ACLCHECK_OK)
+		aclresult = is_member_of_role(GetUserId(), DEFAULT_ROLE_STAT_SCAN_TABLES) ? ACLCHECK_OK : ACLCHECK_NO_PRIV;
+
+	if (aclresult != ACLCHECK_OK)
+		aclcheck_error(aclresult, get_relkind_objtype(rel->rd_rel->relkind),
+					   RelationGetRelationName(rel));
+
+	scan = table_beginscan(rel, GetActiveSnapshot(), 0, NULL);
+	hscan = (HeapScanDesc) scan;
 
-		/*
-		 * check permissions: must have SELECT on table or be in
-		 * pg_stat_scan_tables
-		 */
-		aclresult = pg_class_aclcheck(RelationGetRelid(rel), GetUserId(),
-									  ACL_SELECT);
-		if (aclresult != ACLCHECK_OK)
-			aclresult = is_member_of_role(GetUserId(), DEFAULT_ROLE_STAT_SCAN_TABLES) ? ACLCHECK_OK : ACLCHECK_NO_PRIV;
-
-		if (aclresult != ACLCHECK_OK)
-			aclcheck_error(aclresult, get_relkind_objtype(rel->rd_rel->relkind),
-						   RelationGetRelationName(rel));
-
-		scan = table_beginscan(rel, GetActiveSnapshot(), 0, NULL);
-		hscan = (HeapScanDesc) scan;
-		mydata = palloc(sizeof(*mydata));
-		mydata->rel = rel;
-		mydata->scan = scan;
-		mydata->ncolumns = tupdesc->natts;
-		funcctx->user_fctx = mydata;
-
-		MemoryContextSwitchTo(oldcontext);
-	}
+	MemoryContextSwitchTo(oldcontext);
 
-	funcctx = SRF_PERCALL_SETUP();
-	attinmeta = funcctx->attinmeta;
-	mydata = (MyData *) funcctx->user_fctx;
-	scan = mydata->scan;
-	hscan = (HeapScanDesc) scan;
+	values = (char **) palloc(tupdesc->natts * sizeof(char *));
 
 	/* scan the relation */
 	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
@@ -169,9 +164,6 @@ pgrowlocks(PG_FUNCTION_ARGS)
 		 */
 		if (htsu == TM_BeingModified)
 		{
-			char	  **values;
-
-			values = (char **) palloc(mydata->ncolumns * sizeof(char *));
 
 			values[Atnum_tid] = (char *) DirectFunctionCall1(tidout,
 															 PointerGetDatum(&tuple->t_self));
@@ -297,16 +289,7 @@ pgrowlocks(PG_FUNCTION_ARGS)
 
 			/* build a tuple */
 			tuple = BuildTupleFromCStrings(attinmeta, values);
-
-			/* make the tuple into a datum */
-			result = HeapTupleGetDatum(tuple);
-
-			/*
-			 * no need to pfree what we allocated; it's on a short-lived
-			 * memory context anyway
-			 */
-
-			SRF_RETURN_NEXT(funcctx, result);
+			tuplestore_puttuple(tupstore, tuple);
 		}
 		else
 		{
@@ -314,8 +297,9 @@ pgrowlocks(PG_FUNCTION_ARGS)
 		}
 	}
 
+	tuplestore_donestoring(tupstore);
 	table_endscan(scan);
-	table_close(mydata->rel, AccessShareLock);
+	table_close(rel, AccessShareLock);
 
-	SRF_RETURN_DONE(funcctx);
+	return (Datum) 0;
 }
diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml
index a91f887119..432758e75e 100644
--- a/doc/src/sgml/xfunc.sgml
+++ b/doc/src/sgml/xfunc.sgml
@@ -2812,7 +2812,7 @@ HeapTupleGetDatum(HeapTuple tuple)
     <title>Returning Sets</title>
 
     <para>
-     There is also a special API that provides support for returning
+     There is also a special, helper API that provides support for returning
      sets (multiple rows) from a C-language function.  A set-returning
      function must follow the version-1 calling conventions.  Also,
      source files must include <filename>funcapi.h</filename>, as
@@ -2820,10 +2820,19 @@ HeapTupleGetDatum(HeapTuple tuple)
     </para>
 
     <para>
-     A set-returning function (<acronym>SRF</acronym>) is called
-     once for each item it returns.  The <acronym>SRF</acronym> must
-     therefore save enough state to remember what it was doing and
+     The helper API allows writing a set-returning function
+     (<acronym>SRF</acronym>) using the ValuePerCall mode, in which the SRF is
+     called once for each item it returns.  The <acronym>SRF</acronym> must
+     then save enough state to remember what it was doing and
      return the next item on each call.
+     Note, however, that an SRF will possibly not be run to completion, due to
+     a LIMIT or a cursor, and should avoid leaving resources like DIR*s or
+     tablescans opened across calls.  To avoid leaking resources in those cases,
+     instead of using ValuePerCall mode, the SRF should populate and return a
+     tuplestore with SFRM_Materialize mode.  See fmgr/README.
+    </para>
+
+    <para>
      The structure <structname>FuncCallContext</structname> is provided to help
      control this process.  Within a function, <literal>fcinfo-&gt;flinfo-&gt;fn_extra</literal>
      is used to hold a pointer to <structname>FuncCallContext</structname>
diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c
index 4f109111d1..d0966c8305 100644
--- a/src/backend/utils/adt/datetime.c
+++ b/src/backend/utils/adt/datetime.c
@@ -4756,11 +4756,8 @@ Datum
 pg_timezone_names(PG_FUNCTION_ARGS)
 {
 	MemoryContext oldcontext;
-	FuncCallContext *funcctx;
 	pg_tzenum  *tzenum;
 	pg_tz	   *tz;
-	Datum		result;
-	HeapTuple	tuple;
 	Datum		values[4];
 	bool		nulls[4];
 	int			tzoff;
@@ -4770,58 +4767,44 @@ pg_timezone_names(PG_FUNCTION_ARGS)
 	Interval   *resInterval;
 	struct pg_tm itm;
 
-	/* stuff done only on the first call of the function */
-	if (SRF_IS_FIRSTCALL())
-	{
-		TupleDesc	tupdesc;
+	ReturnSetInfo   *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 
-		/* create a function context for cross-call persistence */
-		funcctx = SRF_FIRSTCALL_INIT();
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	bool            randomAccess;
 
-		/*
-		 * switch to memory context appropriate for multiple function calls
-		 */
-		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
 
-		/* initialize timezone scanning code */
-		tzenum = pg_tzenumerate_start();
-		funcctx->user_fctx = (void *) tzenum;
+	/* The Tuplestore and TupleDesc should be created in ecxt_per_query_memory */
+	oldcontext = MemoryContextSwitchTo(rsinfo->econtext->ecxt_per_query_memory);
+	randomAccess = (rsinfo->allowedModes&SFRM_Materialize_Random) != 0;
+	tupstore = tuplestore_begin_heap(randomAccess, false, work_mem);
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
 
-		/*
-		 * build tupdesc for result tuples. This must match this function's
-		 * pg_proc entry!
-		 */
-		tupdesc = CreateTemplateTupleDesc(4);
-		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "name",
-						   TEXTOID, -1, 0);
-		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "abbrev",
-						   TEXTOID, -1, 0);
-		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "utc_offset",
-						   INTERVALOID, -1, 0);
-		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "is_dst",
-						   BOOLOID, -1, 0);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
 
-		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
-		MemoryContextSwitchTo(oldcontext);
-	}
+	MemoryContextSwitchTo(oldcontext);
 
-	/* stuff done on every call of the function */
-	funcctx = SRF_PERCALL_SETUP();
-	tzenum = (pg_tzenum *) funcctx->user_fctx;
+	/* initialize timezone scanning code */
+	tzenum = pg_tzenumerate_start();
 
 	/* search for another zone to display */
 	for (;;)
 	{
-		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
 		tz = pg_tzenumerate_next(tzenum);
-		MemoryContextSwitchTo(oldcontext);
-
 		if (!tz)
-		{
-			pg_tzenumerate_end(tzenum);
-			funcctx->user_fctx = NULL;
-			SRF_RETURN_DONE(funcctx);
-		}
+			break;
 
 		/* Convert now() to local time in this zone */
 		if (timestamp2tm(GetCurrentTransactionStartTimestamp(),
@@ -4840,25 +4823,23 @@ pg_timezone_names(PG_FUNCTION_ARGS)
 		if (tzn && strlen(tzn) > 31)
 			continue;
 
-		/* Found a displayable zone */
-		break;
-	}
-
-	MemSet(nulls, 0, sizeof(nulls));
 
-	values[0] = CStringGetTextDatum(pg_get_timezone_name(tz));
-	values[1] = CStringGetTextDatum(tzn ? tzn : "");
+		MemSet(nulls, 0, sizeof(nulls));
+		values[0] = CStringGetTextDatum(pg_get_timezone_name(tz));
+		values[1] = CStringGetTextDatum(tzn ? tzn : "");
 
-	MemSet(&itm, 0, sizeof(struct pg_tm));
-	itm.tm_sec = -tzoff;
-	resInterval = (Interval *) palloc(sizeof(Interval));
-	tm2interval(&itm, 0, resInterval);
-	values[2] = IntervalPGetDatum(resInterval);
+		MemSet(&itm, 0, sizeof(struct pg_tm));
+		itm.tm_sec = -tzoff;
+		resInterval = (Interval *) palloc(sizeof(Interval));
+		tm2interval(&itm, 0, resInterval);
+		values[2] = IntervalPGetDatum(resInterval);
 
-	values[3] = BoolGetDatum(tm.tm_isdst > 0);
+		values[3] = BoolGetDatum(tm.tm_isdst > 0);
 
-	tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
-	result = HeapTupleGetDatum(tuple);
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
 
-	SRF_RETURN_NEXT(funcctx, result);
+	pg_tzenumerate_end(tzenum);
+	tuplestore_donestoring(tupstore);
+	return (Datum) 0;
 }
diff --git a/src/backend/utils/adt/genfile.c b/src/backend/utils/adt/genfile.c
index bcf9bd1b97..39a54d532f 100644
--- a/src/backend/utils/adt/genfile.c
+++ b/src/backend/utils/adt/genfile.c
@@ -36,13 +36,6 @@
 #include "utils/syscache.h"
 #include "utils/timestamp.h"
 
-typedef struct
-{
-	char	   *location;
-	DIR		   *dirdesc;
-	bool		include_dot_dirs;
-} directory_fctx;
-
 
 /*
  * Convert a "text" filename argument to C string, and check it's allowable.
@@ -447,67 +440,81 @@ pg_stat_file_1arg(PG_FUNCTION_ARGS)
 Datum
 pg_ls_dir(PG_FUNCTION_ARGS)
 {
-	FuncCallContext *funcctx;
-	struct dirent *de;
-	directory_fctx *fctx;
-	MemoryContext oldcontext;
+	ReturnSetInfo	*rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	MemoryContext	oldcontext;
+	TupleDesc		tupdesc;
+	Tuplestorestate *tupstore;
+	bool			randomAccess;
+
+	bool			missing_ok = false;
+	bool			include_dot_dirs = false;
+	char			*location;
+	DIR				*dirdesc;
+	struct dirent	*de;
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
 
-	if (SRF_IS_FIRSTCALL())
+	/* check the optional arguments */
+	if (PG_NARGS() == 3)
 	{
-		bool		missing_ok = false;
-		bool		include_dot_dirs = false;
+		if (!PG_ARGISNULL(1))
+			missing_ok = PG_GETARG_BOOL(1);
+		if (!PG_ARGISNULL(2))
+			include_dot_dirs = PG_GETARG_BOOL(2);
+	}
 
-		/* check the optional arguments */
-		if (PG_NARGS() == 3)
-		{
-			if (!PG_ARGISNULL(1))
-				missing_ok = PG_GETARG_BOOL(1);
-			if (!PG_ARGISNULL(2))
-				include_dot_dirs = PG_GETARG_BOOL(2);
-		}
+	/* The Tuplestore and TupleDesc should be created in ecxt_per_query_memory */
+	oldcontext = MemoryContextSwitchTo(rsinfo->econtext->ecxt_per_query_memory);
+	randomAccess = (rsinfo->allowedModes&SFRM_Materialize_Random) != 0;
+	tupstore = tuplestore_begin_heap(randomAccess, false, work_mem);
 
-		funcctx = SRF_FIRSTCALL_INIT();
-		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "column", TEXTOID, -1, 0);
 
-		fctx = palloc(sizeof(directory_fctx));
-		fctx->location = convert_and_check_filename(PG_GETARG_TEXT_PP(0));
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+	MemoryContextSwitchTo(oldcontext);
 
-		fctx->include_dot_dirs = include_dot_dirs;
-		fctx->dirdesc = AllocateDir(fctx->location);
+	location = convert_and_check_filename(PG_GETARG_TEXT_PP(0));
+	dirdesc = AllocateDir(location);
 
-		if (!fctx->dirdesc)
-		{
-			if (missing_ok && errno == ENOENT)
-			{
-				MemoryContextSwitchTo(oldcontext);
-				SRF_RETURN_DONE(funcctx);
-			}
-			else
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not open directory \"%s\": %m",
-								fctx->location)));
-		}
-		funcctx->user_fctx = fctx;
-		MemoryContextSwitchTo(oldcontext);
+	if (!dirdesc)
+	{
+		if (missing_ok && errno == ENOENT)
+			return (Datum) 0;
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open directory \"%s\": %m",
+							location)));
 	}
 
-	funcctx = SRF_PERCALL_SETUP();
-	fctx = (directory_fctx *) funcctx->user_fctx;
-
-	while ((de = ReadDir(fctx->dirdesc, fctx->location)) != NULL)
+	while ((de = ReadDir(dirdesc, location)) != NULL)
 	{
-		if (!fctx->include_dot_dirs &&
+		Datum		values[1];
+		bool		nulls[1] = {false};
+
+		if (!include_dot_dirs &&
 			(strcmp(de->d_name, ".") == 0 ||
 			 strcmp(de->d_name, "..") == 0))
 			continue;
 
-		SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(de->d_name));
+		values[0] = CStringGetTextDatum(de->d_name);
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
 	}
 
-	FreeDir(fctx->dirdesc);
-
-	SRF_RETURN_DONE(funcctx);
+	tuplestore_donestoring(tupstore);
+	FreeDir(dirdesc);
+	return (Datum) 0;
 }
 
 /*
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index 323e36b81c..9919087335 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -194,72 +194,86 @@ current_query(PG_FUNCTION_ARGS)
 
 /* Function to find out which databases make use of a tablespace */
 
-typedef struct
-{
-	char	   *location;
-	DIR		   *dirdesc;
-} ts_db_fctx;
-
 Datum
 pg_tablespace_databases(PG_FUNCTION_ARGS)
 {
-	FuncCallContext *funcctx;
 	struct dirent *de;
-	ts_db_fctx *fctx;
 
-	if (SRF_IS_FIRSTCALL())
-	{
-		MemoryContext oldcontext;
-		Oid			tablespaceOid = PG_GETARG_OID(0);
+	MemoryContext oldcontext;
+	Oid			tablespaceOid = PG_GETARG_OID(0);
 
-		funcctx = SRF_FIRSTCALL_INIT();
-		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+	ReturnSetInfo	*rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc		tupdesc;
+	Tuplestorestate *tupstore;
+	bool			randomAccess;
 
-		fctx = palloc(sizeof(ts_db_fctx));
+	DIR				*dirdesc;
+	char			*location;
 
-		if (tablespaceOid == GLOBALTABLESPACE_OID)
-		{
-			fctx->dirdesc = NULL;
-			ereport(WARNING,
-					(errmsg("global tablespace never has databases")));
-		}
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	if (tablespaceOid == GLOBALTABLESPACE_OID)
+	{
+		dirdesc = NULL;
+		ereport(WARNING,
+				(errmsg("global tablespace never has databases")));
+	}
+	else
+	{
+		if (tablespaceOid == DEFAULTTABLESPACE_OID)
+			location = psprintf("base");
 		else
-		{
-			if (tablespaceOid == DEFAULTTABLESPACE_OID)
-				fctx->location = psprintf("base");
-			else
-				fctx->location = psprintf("pg_tblspc/%u/%s", tablespaceOid,
-										  TABLESPACE_VERSION_DIRECTORY);
+			location = psprintf("pg_tblspc/%u/%s", tablespaceOid,
+									  TABLESPACE_VERSION_DIRECTORY);
 
-			fctx->dirdesc = AllocateDir(fctx->location);
+		dirdesc = AllocateDir(location);
 
-			if (!fctx->dirdesc)
-			{
-				/* the only expected error is ENOENT */
-				if (errno != ENOENT)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not open directory \"%s\": %m",
-									fctx->location)));
-				ereport(WARNING,
-						(errmsg("%u is not a tablespace OID", tablespaceOid)));
-			}
+		if (!dirdesc)
+		{
+			/* the only expected error is ENOENT */
+			if (errno != ENOENT)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not open directory \"%s\": %m",
+								location)));
+			ereport(WARNING,
+					(errmsg("%u is not a tablespace OID", tablespaceOid)));
 		}
-		funcctx->user_fctx = fctx;
-		MemoryContextSwitchTo(oldcontext);
 	}
 
-	funcctx = SRF_PERCALL_SETUP();
-	fctx = (ts_db_fctx *) funcctx->user_fctx;
+	/* The Tuplestore and TupleDesc should be created in ecxt_per_query_memory */
+	oldcontext = MemoryContextSwitchTo(rsinfo->econtext->ecxt_per_query_memory);
+
+	randomAccess = (rsinfo->allowedModes&SFRM_Materialize_Random) != 0;
+	tupstore = tuplestore_begin_heap(randomAccess, false, work_mem);
 
-	if (!fctx->dirdesc)			/* not a tablespace */
-		SRF_RETURN_DONE(funcctx);
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "column", OIDOID, -1, 0);
 
-	while ((de = ReadDir(fctx->dirdesc, fctx->location)) != NULL)
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (!dirdesc)			/* not a tablespace */
+		return 0;
+
+	while ((de = ReadDir(dirdesc, location)) != NULL)
 	{
 		Oid			datOid = atooid(de->d_name);
 		char	   *subdir;
 		bool		isempty;
+		Datum		values[1];
+		bool		nulls[1] = {false};
 
 		/* this test skips . and .., but is awfully weak */
 		if (!datOid)
@@ -267,18 +281,20 @@ pg_tablespace_databases(PG_FUNCTION_ARGS)
 
 		/* if database subdir is empty, don't report tablespace as used */
 
-		subdir = psprintf("%s/%s", fctx->location, de->d_name);
+		subdir = psprintf("%s/%s", location, de->d_name);
 		isempty = directory_is_empty(subdir);
 		pfree(subdir);
 
 		if (isempty)
 			continue;			/* indeed, nothing in it */
 
-		SRF_RETURN_NEXT(funcctx, ObjectIdGetDatum(datOid));
+		values[0] = ObjectIdGetDatum(datOid);
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
 	}
 
-	FreeDir(fctx->dirdesc);
-	SRF_RETURN_DONE(funcctx);
+	tuplestore_donestoring(tupstore);
+	FreeDir(dirdesc);
+	return (Datum) 0;
 }
 
 
diff --git a/src/include/funcapi.h b/src/include/funcapi.h
index f9b75ae390..471100b7e5 100644
--- a/src/include/funcapi.h
+++ b/src/include/funcapi.h
@@ -234,7 +234,12 @@ extern Datum HeapTupleHeaderGetDatum(HeapTupleHeader tuple);
 /*----------
  *		Support for Set Returning Functions (SRFs)
  *
- * The basic API for SRFs looks something like:
+ * The basic API for SRFs using ValuePerCall mode looks something like this.
+ * Note that an SRF will possibly not be run to completion, due to a LIMIT or a
+ * cursor, and should avoid leaving resources like DIR*s or tablescans opened
+ * across calls.  To avoid leaking resources in those cases, instead of using
+ * ValuePerCall mode, the SRF should populate and return a tuplestore with
+ * SFRM_Materialize mode.  See fmgr/README.
  *
  * Datum
  * my_Set_Returning_Function(PG_FUNCTION_ARGS)
diff --git a/src/test/regress/expected/misc_functions.out b/src/test/regress/expected/misc_functions.out
index e217b678d7..b56ce74586 100644
--- a/src/test/regress/expected/misc_functions.out
+++ b/src/test/regress/expected/misc_functions.out
@@ -180,6 +180,24 @@ select count(*) >= 0 as ok from pg_ls_archive_statusdir();
  t
 (1 row)
 
+select * from (select pg_ls_dir('.')a)a where a='base' limit 1;
+  a   
+------
+ base
+(1 row)
+
+select * from (select (pg_timezone_names()).name)ptn where name='UTC' limit 1;
+ name 
+------
+ UTC
+(1 row)
+
+select datname from (select pg_tablespace_databases(b.oid) as pts from pg_tablespace b where b.spcname!='pg_global')pts join pg_database db on pts.pts=db.oid limit 1;
+  datname  
+-----------
+ template1
+(1 row)
+
 --
 -- Test adding a support function to a subject function
 --
diff --git a/src/test/regress/sql/misc_functions.sql b/src/test/regress/sql/misc_functions.sql
index 1e11eb3554..5441b8b643 100644
--- a/src/test/regress/sql/misc_functions.sql
+++ b/src/test/regress/sql/misc_functions.sql
@@ -51,6 +51,12 @@ from (select pg_ls_waldir() w) ss where length((w).name) = 24 limit 1;
 
 select count(*) >= 0 as ok from pg_ls_archive_statusdir();
 
+select * from (select pg_ls_dir('.')a)a where a='base' limit 1;
+
+select * from (select (pg_timezone_names()).name)ptn where name='UTC' limit 1;
+
+select datname from (select pg_tablespace_databases(b.oid) as pts from pg_tablespace b where b.spcname!='pg_global')pts join pg_database db on pts.pts=db.oid limit 1;
+
 --
 -- Test adding a support function to a subject function
 --
-- 
2.17.0

v2-0002-Clean-up-existing-anti-pattern-of-freeing-SRF-res.patchtext/x-diff; charset=us-asciiDownload
From 7942df89eed700dfcf71294e9e0aed821b737281 Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Fri, 13 Mar 2020 19:53:42 -0500
Subject: [PATCH v2 2/2] Clean up existing anti-pattern of freeing SRF
 resources

See also: e4186762ffaa4188e16702e8f4f299ea70988b96

Discussion: https://www.postgresql.org/message-id/8034.1583699444%40sss.pgh.pa.us
---
 contrib/pageinspect/btreefuncs.c       | 17 ++++++-----------
 contrib/pageinspect/ginfuncs.c         |  5 +++--
 contrib/pageinspect/hashfuncs.c        |  8 +++-----
 doc/src/sgml/xfunc.sgml                | 12 +++++++++---
 src/backend/access/transam/multixact.c |  5 +----
 src/backend/tsearch/wparser.c          | 13 ++++---------
 src/backend/utils/adt/jsonfuncs.c      | 16 ++--------------
 src/backend/utils/adt/tsvector_op.c    |  2 +-
 src/include/funcapi.h                  |  7 ++++---
 9 files changed, 33 insertions(+), 52 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index e6a2fc1e15..0f244e4f30 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -502,12 +502,9 @@ bt_page_items(PG_FUNCTION_ARGS)
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
-	else
-	{
-		pfree(uargs->page);
-		pfree(uargs);
-		SRF_RETURN_DONE(fctx);
-	}
+
+	/* allocations in multi_call_memory_ctx are released automatically */
+	SRF_RETURN_DONE(fctx);
 }
 
 /*-------------------------------------------------------
@@ -590,11 +587,9 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
-	else
-	{
-		pfree(uargs);
-		SRF_RETURN_DONE(fctx);
-	}
+
+	/* allocations in multi_call_memory_ctx are released automatically */
+	SRF_RETURN_DONE(fctx);
 }
 
 /* Number of output arguments (columns) for bt_metap() */
diff --git a/contrib/pageinspect/ginfuncs.c b/contrib/pageinspect/ginfuncs.c
index 7e2cafab74..5109c4b133 100644
--- a/contrib/pageinspect/ginfuncs.c
+++ b/contrib/pageinspect/ginfuncs.c
@@ -260,6 +260,7 @@ gin_leafpage_items(PG_FUNCTION_ARGS)
 
 		SRF_RETURN_NEXT(fctx, result);
 	}
-	else
-		SRF_RETURN_DONE(fctx);
+
+	/* allocations in multi_call_memory_ctx are released automatically */
+	SRF_RETURN_DONE(fctx);
 }
diff --git a/contrib/pageinspect/hashfuncs.c b/contrib/pageinspect/hashfuncs.c
index 984ac33186..d024f1ac7b 100644
--- a/contrib/pageinspect/hashfuncs.c
+++ b/contrib/pageinspect/hashfuncs.c
@@ -374,11 +374,9 @@ hash_page_items(PG_FUNCTION_ARGS)
 
 		SRF_RETURN_NEXT(fctx, result);
 	}
-	else
-	{
-		pfree(uargs);
-		SRF_RETURN_DONE(fctx);
-	}
+
+	/* allocations in multi_call_memory_ctx are released automatically */
+	SRF_RETURN_DONE(fctx);
 }
 
 /* ------------------------------------------------
diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml
index 432758e75e..f3e96087fa 100644
--- a/doc/src/sgml/xfunc.sgml
+++ b/doc/src/sgml/xfunc.sgml
@@ -2861,7 +2861,8 @@ typedef struct FuncCallContext
      * OPTIONAL pointer to miscellaneous user-provided context information
      *
      * user_fctx is for use as a pointer to your own data to retain
-     * arbitrary context information between calls of your function.
+     * context information between calls of your function.
+     * It should probably not include resources other than allocated memory.
      */
     void *user_fctx;
 
@@ -2881,7 +2882,8 @@ typedef struct FuncCallContext
      * multi_call_memory_ctx is set by SRF_FIRSTCALL_INIT() for you, and used
      * by SRF_RETURN_DONE() for cleanup. It is the most appropriate memory
      * context for any memory that is to be reused across multiple calls
-     * of the SRF.
+     * of the SRF.  Memory palloc'd in this context will be pfree'd
+     * automatically.
      */
     MemoryContext multi_call_memory_ctx;
 
@@ -2935,6 +2937,7 @@ SRF_RETURN_NEXT(funcctx, result)
 SRF_RETURN_DONE(funcctx)
 </programlisting>
      to clean up and end the <acronym>SRF</acronym>.
+     Allocations within multi_call_memory_ctx are automatically pfree'd.
     </para>
 
     <para>
@@ -3004,7 +3007,10 @@ my_set_returning_function(PG_FUNCTION_ARGS)
     }
     else
     {
-        /* Here we are done returning items and just need to clean up: */
+        /*
+         * Here we are done returning items and just need to clean up, but no
+         * need to pfree allocations within multi_call_memory_ctx.
+         */
         <replaceable>user code</replaceable>
         SRF_RETURN_DONE(funcctx);
     }
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 50e98caaeb..cb92e049aa 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -3388,9 +3388,6 @@ pg_get_multixact_members(PG_FUNCTION_ARGS)
 		SRF_RETURN_NEXT(funccxt, HeapTupleGetDatum(tuple));
 	}
 
-	if (multi->nmembers > 0)
-		pfree(multi->members);
-	pfree(multi);
-
+	/* allocations in multi_call_memory_ctx are released automatically */
 	SRF_RETURN_DONE(funccxt);
 }
diff --git a/src/backend/tsearch/wparser.c b/src/backend/tsearch/wparser.c
index 88005c0519..85eaf041d6 100644
--- a/src/backend/tsearch/wparser.c
+++ b/src/backend/tsearch/wparser.c
@@ -104,9 +104,8 @@ tt_process_call(FuncCallContext *funcctx)
 		st->cur++;
 		return result;
 	}
-	if (st->list)
-		pfree(st->list);
-	pfree(st);
+
+	/* allocations in multi_call_memory_ctx are released automatically */
 	return (Datum) 0;
 }
 
@@ -245,12 +244,8 @@ prs_process_call(FuncCallContext *funcctx)
 		st->cur++;
 		return result;
 	}
-	else
-	{
-		if (st->list)
-			pfree(st->list);
-		pfree(st);
-	}
+
+	/* allocations in multi_call_memory_ctx are released automatically */
 	return (Datum) 0;
 }
 
diff --git a/src/backend/utils/adt/jsonfuncs.c b/src/backend/utils/adt/jsonfuncs.c
index f92861d8d2..c5a286dd58 100644
--- a/src/backend/utils/adt/jsonfuncs.c
+++ b/src/backend/utils/adt/jsonfuncs.c
@@ -535,7 +535,6 @@ jsonb_object_keys(PG_FUNCTION_ARGS)
 {
 	FuncCallContext *funcctx;
 	OkeysState *state;
-	int			i;
 
 	if (SRF_IS_FIRSTCALL())
 	{
@@ -598,12 +597,7 @@ jsonb_object_keys(PG_FUNCTION_ARGS)
 		SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(nxt));
 	}
 
-	/* cleanup to reduce or eliminate memory leaks */
-	for (i = 0; i < state->result_count; i++)
-		pfree(state->result[i]);
-	pfree(state->result);
-	pfree(state);
-
+	/* allocations in multi_call_memory_ctx are released automatically */
 	SRF_RETURN_DONE(funcctx);
 }
 
@@ -706,7 +700,6 @@ json_object_keys(PG_FUNCTION_ARGS)
 {
 	FuncCallContext *funcctx;
 	OkeysState *state;
-	int			i;
 
 	if (SRF_IS_FIRSTCALL())
 	{
@@ -755,12 +748,7 @@ json_object_keys(PG_FUNCTION_ARGS)
 		SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(nxt));
 	}
 
-	/* cleanup to reduce or eliminate memory leaks */
-	for (i = 0; i < state->result_count; i++)
-		pfree(state->result[i]);
-	pfree(state->result);
-	pfree(state);
-
+	/* allocations in multi_call_memory_ctx are released automatically */
 	SRF_RETURN_DONE(funcctx);
 }
 
diff --git a/src/backend/utils/adt/tsvector_op.c b/src/backend/utils/adt/tsvector_op.c
index 108dd998c7..e7b8fbc3ca 100644
--- a/src/backend/utils/adt/tsvector_op.c
+++ b/src/backend/utils/adt/tsvector_op.c
@@ -706,7 +706,7 @@ tsvector_unnest(PG_FUNCTION_ARGS)
 	}
 	else
 	{
-		pfree(tsin);
+		/* allocations in multi_call_memory_ctx are released automatically */
 		SRF_RETURN_DONE(funcctx);
 	}
 }
diff --git a/src/include/funcapi.h b/src/include/funcapi.h
index 471100b7e5..8f6db97fcf 100644
--- a/src/include/funcapi.h
+++ b/src/include/funcapi.h
@@ -77,7 +77,7 @@ typedef struct FuncCallContext
 	 * OPTIONAL pointer to miscellaneous user-provided context information
 	 *
 	 * user_fctx is for use as a pointer to your own struct to retain
-	 * arbitrary context information between calls of your function.
+	 * context information between calls of your function.
 	 */
 	void	   *user_fctx;
 
@@ -93,8 +93,9 @@ typedef struct FuncCallContext
 	/*
 	 * memory context used for structures that must live for multiple calls
 	 *
-	 * multi_call_memory_ctx is set by SRF_FIRSTCALL_INIT() for you, and used
-	 * by SRF_RETURN_DONE() for cleanup. It is the most appropriate memory
+	 * multi_call_memory_ctx is set by SRF_FIRSTCALL_INIT() for you, and
+	 * automatically cleaned up by SRF_RETURN_DONE(). It is the most
+	 * appropriate memory
 	 * context for any memory that is to be reused across multiple calls of
 	 * the SRF.
 	 */
-- 
2.17.0

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Justin Pryzby (#12)
Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

Justin Pryzby <pryzby@telsasoft.com> writes:

v2 attached - I will add to next CF in case you want to defer it until later.

Thanks, reviewed and pushed. Since this is a bug fix (at least in part)
I didn't want to wait.

regards, tom lane

#14Justin Pryzby
pryzby@telsasoft.com
In reply to: Tom Lane (#13)
Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

On Mon, Mar 16, 2020 at 09:38:50PM -0400, Tom Lane wrote:

Justin Pryzby <pryzby@telsasoft.com> writes:

v2 attached - I will add to next CF in case you want to defer it until later.

Thanks, reviewed and pushed. Since this is a bug fix (at least in part)
I didn't want to wait.

Thanks for fixing my test case and pushing.

--
Justin

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Justin Pryzby (#14)
Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

Justin Pryzby <pryzby@telsasoft.com> writes:

Thanks for fixing my test case and pushing.

The buildfarm just showed up another instability in the test cases
we added:

=========================== regression.diffs ================
diff -U3 /home/bf/build/buildfarm-idiacanthus/HEAD/pgsql.build/../pgsql/src/test/regress/expected/misc_functions.out /home/bf/build/buildfarm-idiacanthus/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/regress/results/misc_functions.out
--- /home/bf/build/buildfarm-idiacanthus/HEAD/pgsql.build/../pgsql/src/test/regress/expected/misc_functions.out	2020-03-17 08:14:50.292037956 +0100
+++ /home/bf/build/buildfarm-idiacanthus/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/regress/results/misc_functions.out	2020-03-28 13:55:12.490024822 +0100
@@ -169,11 +169,7 @@
 select (w).size = :segsize as ok
 from (select pg_ls_waldir() w) ss where length((w).name) = 24 limit 1;
- ok 
-----
- t
-(1 row)
-
+ERROR:  could not stat file "pg_wal/000000010000000000000078": No such file or directory
 select count(*) >= 0 as ok from pg_ls_archive_statusdir();
  ok 
 ----

It's pretty obvious what happened here: concurrent activity renamed or
removed the WAL segment between when we saw it in the directory and
when we tried to stat() it.

This seems like it would be just as much of a hazard for field usage
as it is for regression testing, so I propose that we fix these
directory-scanning functions to silently ignore ENOENT failures from
stat(). Are there any for which we should not do that?

regards, tom lane

#16Justin Pryzby
pryzby@telsasoft.com
In reply to: Tom Lane (#15)
Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

On Sat, Mar 28, 2020 at 01:13:54PM -0400, Tom Lane wrote:

The buildfarm just showed up another instability in the test cases
we added:

Yea, as you said, this is an issue with the *testcase*. The function behavior
didn't change, we just weren't previously exercising it.

select (w).size = :segsize as ok
from (select pg_ls_waldir() w) ss where length((w).name) = 24 limit 1;
- ok 
-----
- t
-(1 row)
-
+ERROR:  could not stat file "pg_wal/000000010000000000000078": No such file or directory
select count(*) >= 0 as ok from pg_ls_archive_statusdir();
ok 
----

It's pretty obvious what happened here: concurrent activity renamed or
removed the WAL segment between when we saw it in the directory and
when we tried to stat() it.

This seems like it would be just as much of a hazard for field usage
as it is for regression testing,

That's clearly true for pg_ls_waldir(), which call pg_ls_dir_files, and
includes some metadata columns.

so I propose that we fix these directory-scanning functions to silently
ignore ENOENT failures from stat(). Are there any for which we should not do
that?

I think it's reasonable to ignore transient ENOENT for tmpdir, logdir, and
probably archive_statusdir. That doesn't currently affect pg_ls_dir(), which
lists file but not metadata for an arbitrary dir, so doesn't call stat().

Note that dangling links in the other functions currently cause (wrong [0]Which you fixed in 085b6b667 and I previously fixed at: /messages/by-id/attachment/106478/v2-0001-BUG-in-errmsg.patch |$ sudo ln -s foo /var/log/postgresql/bar |ts=# SELECT * FROM pg_ls_logdir() ORDER BY 3; |ERROR: could not stat directory "/var/log/postgresql": No such file or directory)
error. I guess it should be documented if broken link is will be ignored due
to ENOENT.

Maybe we should lstat() the file to determine if it's a dangling link; if
lstat() fails, then skip it. Currently, we use stat(), which shows metdata of
a link's *target*. Maybe we'd change that.

Note that I have a patch which generalizes pg_ls_dir_files and makes
pg_ls_dir() a simple wrapper, so if that's pursued, they would behave the same
unless I add another flag to do otherwise (but behaving the same has its
merits). It already uses lstat() to show links to dirs as isdir=no, which was
needed to avoid recursing into links-to-dirs in the new helper function
pg_ls_dir_recurse(). https://commitfest.postgresql.org/26/2377/

[0]: Which you fixed in 085b6b667 and I previously fixed at: /messages/by-id/attachment/106478/v2-0001-BUG-in-errmsg.patch |$ sudo ln -s foo /var/log/postgresql/bar |ts=# SELECT * FROM pg_ls_logdir() ORDER BY 3; |ERROR: could not stat directory "/var/log/postgresql": No such file or directory
/messages/by-id/attachment/106478/v2-0001-BUG-in-errmsg.patch
|$ sudo ln -s foo /var/log/postgresql/bar
|ts=# SELECT * FROM pg_ls_logdir() ORDER BY 3;
|ERROR: could not stat directory "/var/log/postgresql": No such file or directory

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Justin Pryzby (#16)
Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

Justin Pryzby <pryzby@telsasoft.com> writes:

On Sat, Mar 28, 2020 at 01:13:54PM -0400, Tom Lane wrote:

so I propose that we fix these directory-scanning functions to silently
ignore ENOENT failures from stat(). Are there any for which we should not do
that?

Maybe we should lstat() the file to determine if it's a dangling link; if
lstat() fails, then skip it. Currently, we use stat(), which shows metdata of
a link's *target*. Maybe we'd change that.

Hm, good point that ENOENT could refer to a symlink's target. Still,
I'm not sure it's worth going out of our way to disambiguate that,
given that these directories aren't really supposed to contain symlinks.
(And on the third hand, if they aren't supposed to, then maybe these
functions needn't look through any symlinks? In which case just
substituting lstat for stat would resolve the ambiguity.)

Note that I have a patch which generalizes pg_ls_dir_files and makes
pg_ls_dir() a simple wrapper, so if that's pursued, they would behave the same
unless I add another flag to do otherwise (but behaving the same has its
merits). It already uses lstat() to show links to dirs as isdir=no, which was
needed to avoid recursing into links-to-dirs in the new helper function
pg_ls_dir_recurse(). https://commitfest.postgresql.org/26/2377/

I think we need a back-patchable fix for the ENOENT failure, seeing that
we back-patched the new regression test; intermittent buildfarm failures
are no fun in any branch. So new functions aren't too relevant here,
although it's fair to look ahead at whether the same behavior will be
appropriate for them.

regards, tom lane

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#17)
1 attachment(s)
Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

I wrote:

Justin Pryzby <pryzby@telsasoft.com> writes:

Maybe we should lstat() the file to determine if it's a dangling link; if
lstat() fails, then skip it. Currently, we use stat(), which shows metdata of
a link's *target*. Maybe we'd change that.

Hm, good point that ENOENT could refer to a symlink's target. Still,
I'm not sure it's worth going out of our way to disambiguate that,
given that these directories aren't really supposed to contain symlinks.
(And on the third hand, if they aren't supposed to, then maybe these
functions needn't look through any symlinks? In which case just
substituting lstat for stat would resolve the ambiguity.)

After looking at the callers of pg_ls_dir_files, and noticing that
it's already defined to ignore anything that's not a regular file,
I think switching to lstat makes sense.

I also grepped the other uses of ReadDir[Extended], and didn't see
any other ones that seemed desperately in need of changing.

So the attached seems like a sufficient fix.

regards, tom lane

Attachments:

ignore-concurrent-removals-in-pg_ls_dir_files.patchtext/x-diff; charset=us-ascii; name=ignore-concurrent-removals-in-pg_ls_dir_files.patchDownload
diff --git a/src/backend/utils/adt/genfile.c b/src/backend/utils/adt/genfile.c
index 01185f2..8429a12 100644
--- a/src/backend/utils/adt/genfile.c
+++ b/src/backend/utils/adt/genfile.c
@@ -596,10 +596,15 @@ pg_ls_dir_files(FunctionCallInfo fcinfo, const char *dir, bool missing_ok)
 
 		/* Get the file info */
 		snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
-		if (stat(path, &attrib) < 0)
+		if (lstat(path, &attrib) < 0)
+		{
+			/* Ignore concurrently-deleted files, else complain */
+			if (errno == ENOENT)
+				continue;
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not stat file \"%s\": %m", path)));
+		}
 
 		/* Ignore anything but regular files */
 		if (!S_ISREG(attrib.st_mode))
#19Justin Pryzby
pryzby@telsasoft.com
In reply to: Tom Lane (#18)
Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

On Sun, Mar 29, 2020 at 12:37:05PM -0400, Tom Lane wrote:

I wrote:

Justin Pryzby <pryzby@telsasoft.com> writes:

Maybe we should lstat() the file to determine if it's a dangling link; if
lstat() fails, then skip it. Currently, we use stat(), which shows metdata of
a link's *target*. Maybe we'd change that.

Hm, good point that ENOENT could refer to a symlink's target. Still,
I'm not sure it's worth going out of our way to disambiguate that,
given that these directories aren't really supposed to contain symlinks.
(And on the third hand, if they aren't supposed to, then maybe these
functions needn't look through any symlinks? In which case just
substituting lstat for stat would resolve the ambiguity.)

After looking at the callers of pg_ls_dir_files, and noticing that
it's already defined to ignore anything that's not a regular file,
I think switching to lstat makes sense.

Yea, only pg_ls_dir() shows special file types (and currently the others even
hide dirs).

The essence of your patch is to ignore ENOENT, but you also changed to use
lstat(), which seems unrelated. That means we'll now hide (non-broken)
symlinks. Is that intentional/needed ? I guess maybe you're trying to fix the
bug (?) that symlinks aren't skipped? If so, I guess it should be a separate
commit, or the commit message should say so. I think the doc update is already
handled by: 8b6d94cf6c8319bfd6bebf8b863a5db586c19c3b (we didn't used to say we
skipped specials, and now we say we do, and we'll to follow through RSN and
actually do it, too).

Show quoted text
diff --git a/src/backend/utils/adt/genfile.c b/src/backend/utils/adt/genfile.c
index 01185f2..8429a12 100644
--- a/src/backend/utils/adt/genfile.c
+++ b/src/backend/utils/adt/genfile.c
@@ -596,10 +596,15 @@ pg_ls_dir_files(FunctionCallInfo fcinfo, const char *dir, bool missing_ok)
/* Get the file info */
snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
-		if (stat(path, &attrib) < 0)
+		if (lstat(path, &attrib) < 0)
+		{
+			/* Ignore concurrently-deleted files, else complain */
+			if (errno == ENOENT)
+				continue;
ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not stat file \"%s\": %m", path)));
+		}

/* Ignore anything but regular files */
if (!S_ISREG(attrib.st_mode))

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Justin Pryzby (#19)
Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

Justin Pryzby <pryzby@telsasoft.com> writes:

On Sun, Mar 29, 2020 at 12:37:05PM -0400, Tom Lane wrote:

After looking at the callers of pg_ls_dir_files, and noticing that
it's already defined to ignore anything that's not a regular file,
I think switching to lstat makes sense.

Yea, only pg_ls_dir() shows special file types (and currently the others even
hide dirs).

The essence of your patch is to ignore ENOENT, but you also changed to use
lstat(), which seems unrelated. That means we'll now hide (non-broken)
symlinks. Is that intentional/needed ?

Well, the following comment says "ignore anything but regular files",
so I'm supposing that that is the behavior that we actually want here
and failed to implement correctly. There might be scope for
additional directory-reading functions, but I'd think you'd want
more information (such as the file type) returned from anything
that doesn't act this way.

In practice, since these directories shouldn't contain symlinks,
it's likely moot. The only place in PG data directories where
we actually expect symlinks is pg_tablespace ... and that contains
symlinks to directories, so that this function would ignore them
anyway.

regards, tom lane

#21Justin Pryzby
pryzby@telsasoft.com
In reply to: Tom Lane (#20)
Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

On Sun, Mar 29, 2020 at 01:22:04PM -0400, Tom Lane wrote:

Justin Pryzby <pryzby@telsasoft.com> writes:

On Sun, Mar 29, 2020 at 12:37:05PM -0400, Tom Lane wrote:

After looking at the callers of pg_ls_dir_files, and noticing that
it's already defined to ignore anything that's not a regular file,
I think switching to lstat makes sense.

Yea, only pg_ls_dir() shows special file types (and currently the others even
hide dirs).

The essence of your patch is to ignore ENOENT, but you also changed to use
lstat(), which seems unrelated. That means we'll now hide (non-broken)
symlinks. Is that intentional/needed ?

Well, the following comment says "ignore anything but regular files",
so I'm supposing that that is the behavior that we actually want here
and failed to implement correctly. There might be scope for
additional directory-reading functions, but I'd think you'd want
more information (such as the file type) returned from anything
that doesn't act this way.

Maybe pg_stat_file() deserves similar attention ? Right now, it'll fail on a
broken link. If we changed it to lstat(), then it'd work, but it'd also show
metadata for the *link* rather than its target.

Patch proposed as v14-0001 patch here may be relevant:
https://www.postgresql.org/message-id/20200317190401.GY26184%40telsasoft.com
-    indicating if it is a directory.  Typical usages include:
+    indicating if it is a directory (or a symbolic link to a directory).
...

In practice, since these directories shouldn't contain symlinks,
it's likely moot. The only place in PG data directories where
we actually expect symlinks is pg_tablespace ... and that contains
symlinks to directories, so that this function would ignore them
anyway.

I wouldn't hesitate to make symlinks, at least in log. It's surprising when
files are hidden, but I won't argue about the best behavior here.

I'm thinking of distributions or local configurations that use
/var/log/postgresql. I didn't remember or didn't realize, but it looks like
debian's packages use logging_collector=off and then launch postmaster with
2> /var/log/postgres/... It seems reasonable to do something like:
log/huge-querylog.csv => /zfs/compressed/...

--
Justin

#22Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Justin Pryzby (#21)
Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

Hello Justin,

Well, the following comment says "ignore anything but regular files",
so I'm supposing that that is the behavior that we actually want here
and failed to implement correctly. There might be scope for
additional directory-reading functions, but I'd think you'd want
more information (such as the file type) returned from anything
that doesn't act this way.

Maybe pg_stat_file() deserves similar attention ? Right now, it'll fail on a
broken link. If we changed it to lstat(), then it'd work, but it'd also show
metadata for the *link* rather than its target.

Yep. I think this traditional answer is the rational answer.

As I wrote about an earlier version of the patch, ISTM that instead of
reinventing, extending, adapting various ls variants (with/without
metadata, which show only files, which shows target of links, which shows
directory, etc.) we would just need *one* postgres "ls" implementation
which would be like "ls -la arg" (returns file type, dates), and then
everything else is a wrapper around that with appropriate filtering that
can be done at the SQL level, like you started with recurse.

It would reduce the amount of C code and I find the SQL-level approach
quite elegant.

--
Fabien.

#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Fabien COELHO (#22)
Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

Fabien COELHO <coelho@cri.ensmp.fr> writes:

As I wrote about an earlier version of the patch, ISTM that instead of
reinventing, extending, adapting various ls variants (with/without
metadata, which show only files, which shows target of links, which shows
directory, etc.) we would just need *one* postgres "ls" implementation
which would be like "ls -la arg" (returns file type, dates), and then
everything else is a wrapper around that with appropriate filtering that
can be done at the SQL level, like you started with recurse.

Yeah, I agree that some new function that can represent symlinks
explicitly in its output is the place to deal with this, for
people who want to deal with it.

In the meantime, there's still the question of what pg_ls_dir_files
should do exactly. Are we content to have it ignore symlinks?
I remain inclined to think that's the right thing given its current
brief.

regards, tom lane

#24Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Tom Lane (#23)
Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

Hello,

As I wrote about an earlier version of the patch, ISTM that instead of
reinventing, extending, adapting various ls variants (with/without
metadata, which show only files, which shows target of links, which shows
directory, etc.) we would just need *one* postgres "ls" implementation
which would be like "ls -la arg" (returns file type, dates), and then
everything else is a wrapper around that with appropriate filtering that
can be done at the SQL level, like you started with recurse.

Yeah, I agree that some new function that can represent symlinks
explicitly in its output is the place to deal with this, for
people who want to deal with it.

In the meantime, there's still the question of what pg_ls_dir_files
should do exactly. Are we content to have it ignore symlinks?
I remain inclined to think that's the right thing given its current
brief.

My 0.02€:

I agree that it is enough to reproduce the current behavior of various
existing pg_ls* functions, but on the other hand outputing a column type
char like ls (-, d, l…) looks like really no big deal. I'd say that the
only reason not to do it may be to pass this before feature freeze.

--
Fabien.

#25Justin Pryzby
pryzby@telsasoft.com
In reply to: Fabien COELHO (#24)
Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

On Tue, Mar 31, 2020 at 07:36:03AM +0200, Fabien COELHO wrote:

As I wrote about an earlier version of the patch, ISTM that instead of
reinventing, extending, adapting various ls variants (with/without
metadata, which show only files, which shows target of links, which shows
directory, etc.) we would just need *one* postgres "ls" implementation
which would be like "ls -la arg" (returns file type, dates), and then
everything else is a wrapper around that with appropriate filtering that
can be done at the SQL level, like you started with recurse.

Yeah, I agree that some new function that can represent symlinks
explicitly in its output is the place to deal with this, for
people who want to deal with it.

In the meantime, there's still the question of what pg_ls_dir_files
should do exactly. Are we content to have it ignore symlinks?
I remain inclined to think that's the right thing given its current
brief.

My 0.02€:

I agree that it is enough to reproduce the current behavior of various
existing pg_ls* functions, but on the other hand outputing a column type
char like ls (-, d, l…) looks like really no big deal. I'd say that the only
reason not to do it may be to pass this before feature freeze.

Remember, there's two threads here, and this one is about the bug in stable
releases ($SUBJECT), and now the instability in the test that was added with
its fix.

I suggest to leave stat() alone in your patch for stable releases. I think
it's okay if we change behavior so that a broken symlink is skipped instead of
erroring (as a side effect of skipping ENOENT with stat()). But not okay if we
change pg_ls_logdir() to hide symlinks in back braches.

--
Justin

#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Justin Pryzby (#25)
Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

Justin Pryzby <pryzby@telsasoft.com> writes:

I suggest to leave stat() alone in your patch for stable releases. I think
it's okay if we change behavior so that a broken symlink is skipped instead of
erroring (as a side effect of skipping ENOENT with stat()). But not okay if we
change pg_ls_logdir() to hide symlinks in back braches.

Meh. I'm not really convinced, but in the absence of anyone expressing
support for my position, I'll do it that way. I don't think it's worth
doing both a stat and lstat to tell the difference between file-is-gone
and file-is-a-broken-symlink.

regards, tom lane

#27Alexander Lakhin
exclusion@gmail.com
In reply to: Tom Lane (#26)
Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

Hello hackers,
31.03.2020 19:41, Tom Lane wrote:

Justin Pryzby <pryzby@telsasoft.com> writes:

I suggest to leave stat() alone in your patch for stable releases. I think
it's okay if we change behavior so that a broken symlink is skipped instead of
erroring (as a side effect of skipping ENOENT with stat()). But not okay if we
change pg_ls_logdir() to hide symlinks in back braches.

Meh. I'm not really convinced, but in the absence of anyone expressing
support for my position, I'll do it that way. I don't think it's worth
doing both a stat and lstat to tell the difference between file-is-gone
and file-is-a-broken-symlink.

As we've discovered in Bug #[16161]/messages/by-id/16161-7a985d2f1bbe8f71@postgresql.org, stat() for "concurrently-deleted
file" can also return ERROR_ACCESS_DENIED on Windows. It seems that
pg_upgradeCheck failures seen on
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=fairywren&amp;br=REL_13_STABLE
caused by the same issue.
Shouldn't pg_ls_dir_files() retry stat() on ERROR_ACCESS_DENIED just
like the pgwin32_open() does to ignore files in "delete pending" state?

[16161]: /messages/by-id/16161-7a985d2f1bbe8f71@postgresql.org
/messages/by-id/16161-7a985d2f1bbe8f71@postgresql.org

Best regards,
Alexander

#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Lakhin (#27)
Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

Alexander Lakhin <exclusion@gmail.com> writes:

Shouldn't pg_ls_dir_files() retry stat() on ERROR_ACCESS_DENIED just
like the pgwin32_open() does to ignore files in "delete pending" state?

That would soon lead us to changing every stat() caller in the system
to have Windows-specific looping logic. No thanks. If we need to do
this, let's put in a Windows wrapper layer comparable to pgwin32_open()
for open().

regards, tom lane

#29Alexander Lakhin
exclusion@gmail.com
In reply to: Tom Lane (#28)
Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

13.11.2020 23:16, Tom Lane wrote:

Alexander Lakhin <exclusion@gmail.com> writes:

Shouldn't pg_ls_dir_files() retry stat() on ERROR_ACCESS_DENIED just
like the pgwin32_open() does to ignore files in "delete pending" state?

That would soon lead us to changing every stat() caller in the system
to have Windows-specific looping logic. No thanks. If we need to do
this, let's put in a Windows wrapper layer comparable to pgwin32_open()
for open().

Maybe pgwin32_safestat() should perform this... For now it checks
GetLastError() for ERROR_DELETE_PENDING, but as we've found out
previously this error code in fact is not returned by the OS.
And if the fix is not going to be quick, probably it should be discussed
in another thread...

Best regards,
Alexander

#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Lakhin (#29)
Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

Alexander Lakhin <exclusion@gmail.com> writes:

13.11.2020 23:16, Tom Lane wrote:

That would soon lead us to changing every stat() caller in the system
to have Windows-specific looping logic. No thanks. If we need to do
this, let's put in a Windows wrapper layer comparable to pgwin32_open()
for open().

Maybe pgwin32_safestat() should perform this...

Uh ... now that you mention it, that's gone since bed90759f.

There is code in win32stat.c that purports to cope with this case, though
I've not tested it personally.

regards, tom lane