out of memory in crosstab()

Started by Amit Langoteabout 3 years ago2 messages

amitlangote09@gmail.com

about 3 years ago

1 attachment(s)

Hi,

A customer seems to have run into $subject. Here's a reproducer they shared:

CREATE TABLE test (id integer, category integer, rate numeric);
INSERT INTO test
SELECT x.id,
y.category,
random() * 10 AS rate
FROM generate_series(1, 1000000) AS x(id)
INNER JOIN generate_series(1, 25) AS y(category)
ON 1 = 1;
SELECT * FROM crosstab('SELECT id, category, rate FROM test ORDER BY
1, 2') AS final_result(id integer, "1" numeric, "2" numeric, "3"
numeric, "4" numeric, "5" numeric);
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
Time: 106095.766 ms (01:46.096)
!?> \q

With the following logged:

LOG: server process (PID 121846) was terminated by signal 9: Killed
DETAIL: Failed process was running: SELECT * FROM crosstab('SELECT
id, category, rate FROM test ORDER BY 1, 2') AS final_result(id
integer, "1" numeric, "2" numeric, "3" numeric, "4" numeric, "5"
numeric);

The problem seems to be spi_printtup() continuing to allocate memory
to expand _SPI_current->tuptable to store the result of crosstab()'s
input query that's executed using:

/* Retrieve the desired rows */
ret = SPI_execute(sql, true, 0);

Note that this asks SPI to retrieve and store *all* result rows of the
query in _SPI_current->tuptable, and if there happen to be so many
rows, as in the case of above example, spi_printtup() ends up asking
for a bit too much memory.

The easiest fix for this seems to be for crosstab() to use open a
cursor (SPI_cursor_open) and fetch the rows in batches
(SPI_cursor_fetch) rather than all in one go. I have implemented that
in the attached. Maybe the patch should address other functions that
potentially have the same problem.

I also wondered about fixing this by making _SPI_current->tuptable use
a tuplestore that can spill to disk as its backing store rather than a
plain C HeapTuple array, but haven't checked how big of a change that
would be; SPI_tuptable is referenced in many places across the tree.
Though I suspect that idea has enough merits to give that a try
someday.

Thoughts on whether this should be fixed and the fix be back-patched?

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachments:

v1-0001-Teach-crosstab-to-use-SPI_cursor_-interface.patchapplication/octet-stream; name=v1-0001-Teach-crosstab-to-use-SPI_cursor_-interface.patchDownload

From a2ae44429a68f9776d99285930b16f46c4c9197d Mon Sep 17 00:00:00 2001
From: amitlan <amitlangote09@gmail.com>
Date: Wed, 16 Nov 2022 10:49:45 +0900
Subject: [PATCH v1] Teach crosstab() to use SPI_cursor_* interface

Fixes memory issues for input queries that return many rows.
---
 contrib/tablefunc/tablefunc.c | 368 +++++++++++++++++++---------------
 1 file changed, 204 insertions(+), 164 deletions(-)

diff --git a/contrib/tablefunc/tablefunc.c b/contrib/tablefunc/tablefunc.c
index b967e6d4be..feebe497d7 100644
--- a/contrib/tablefunc/tablefunc.c
+++ b/contrib/tablefunc/tablefunc.c
@@ -356,12 +356,12 @@ crosstab(PG_FUNCTION_ARGS)
 {
 	char	   *sql = text_to_cstring(PG_GETARG_TEXT_PP(0));
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
-	Tuplestorestate *tupstore;
+	Tuplestorestate *tupstore = NULL;
 	TupleDesc	tupdesc;
 	uint64		call_cntr;
 	uint64		max_calls;
 	AttInMetadata *attinmeta;
-	SPITupleTable *spi_tuptable;
+	SPITupleTable *spi_tuptable = NULL;
 	TupleDesc	spi_tupdesc;
 	bool		firstpass;
 	char	   *lastrowid;
@@ -371,6 +371,9 @@ crosstab(PG_FUNCTION_ARGS)
 	MemoryContext oldcontext;
 	int			ret;
 	uint64		proc;
+	SPIPlanPtr	plan;
+	Portal		portal;
+	bool		tupstore_initialized = false;
 
 	/* check to see if caller supports us returning a tuplestore */
 	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
@@ -389,204 +392,244 @@ crosstab(PG_FUNCTION_ARGS)
 		/* internal error */
 		elog(ERROR, "crosstab: SPI_connect returned %d", ret);
 
-	/* Retrieve the desired rows */
-	ret = SPI_execute(sql, true, 0);
-	proc = SPI_processed;
-
-	/* If no qualifying tuples, fall out early */
-	if (ret != SPI_OK_SELECT || proc == 0)
-	{
-		SPI_finish();
-		rsinfo->isDone = ExprEndResult;
-		PG_RETURN_NULL();
-	}
+	/* Create an SPI plan and a cursor for fetching the result rows. */
+	if ((plan = SPI_prepare(sql, 0, NULL)) == NULL)
+		/* internal error */
+		elog(ERROR, "SPI_prepare(\"%s\") failed", sql);
 
-	spi_tuptable = SPI_tuptable;
-	spi_tupdesc = spi_tuptable->tupdesc;
+	if ((portal = SPI_cursor_open(NULL, plan, NULL, NULL, true)) == NULL)
+		/* internal error */
+		elog(ERROR, "SPI_cursor_open(\"%s\") failed", sql);
 
-	/*----------
-	 * The provided SQL query must always return three columns.
-	 *
-	 * 1. rowname
-	 *	the label or identifier for each row in the final result
-	 * 2. category
-	 *	the label or identifier for each column in the final result
-	 * 3. values
-	 *	the value for each column in the final result
-	 *----------
+	/*
+	 * Fetch rows from the cursor in batches to avoid overloading
+	 * SPI_tuptable and store them in the result tuplestore.
 	 */
-	if (spi_tupdesc->natts != 3)
-		ereport(ERROR,
-				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-				 errmsg("invalid source data SQL statement"),
-				 errdetail("The provided SQL must return 3 "
-						   "columns: rowid, category, and values.")));
-
-	/* get a tuple descriptor for our result type */
-	switch (get_call_result_type(fcinfo, NULL, &tupdesc))
+#define	CROSSTAB_BATCHSIZE	10000
+	for (;;)
 	{
-		case TYPEFUNC_COMPOSITE:
-			/* success */
-			break;
-		case TYPEFUNC_RECORD:
-			/* failed to determine actual type of RECORD */
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("function returning record called in context "
-							"that cannot accept type record")));
-			break;
-		default:
-			/* result type isn't composite */
-			ereport(ERROR,
-					(errcode(ERRCODE_DATATYPE_MISMATCH),
-					 errmsg("return type must be a row type")));
-			break;
-	}
+		/* Retrieve the desired rows */
+		SPI_cursor_fetch(portal, true, CROSSTAB_BATCHSIZE);
+		proc = SPI_processed;
 
-	/*
-	 * Check that return tupdesc is compatible with the data we got from SPI,
-	 * at least based on number and type of attributes
-	 */
-	if (!compatCrosstabTupleDescs(tupdesc, spi_tupdesc))
-		ereport(ERROR,
-				(errcode(ERRCODE_SYNTAX_ERROR),
-				 errmsg("return and sql tuple descriptions are " \
-						"incompatible")));
+		/* If no qualifying tuples, fall out early */
+		if (proc == 0)
+			goto done;
 
-	/*
-	 * switch to long-lived memory context
-	 */
-	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+		spi_tuptable = SPI_tuptable;
+		spi_tupdesc = spi_tuptable->tupdesc;
 
-	/* make sure we have a persistent copy of the result tupdesc */
-	tupdesc = CreateTupleDescCopy(tupdesc);
+		/*
+		 * If not already done, initialize the tuplestore to return the result
+		 * in.
+		 */
+		if (!tupstore_initialized)
+		{
+			/* Sanity checks. */
+
+			/*----------
+			 * The provided SQL query must always return three columns.
+			 *
+			 * 1. rowname
+			 *	the label or identifier for each row in the final result
+			 * 2. category
+			 *	the label or identifier for each column in the final result
+			 * 3. values
+			 *	the value for each column in the final result
+			 *----------
+			 */
+			if (spi_tupdesc->natts != 3)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid source data SQL statement"),
+						 errdetail("The provided SQL must return 3 "
+								   "columns: rowid, category, and values.")));
 
-	/* initialize our tuplestore in long-lived context */
-	tupstore =
-		tuplestore_begin_heap(rsinfo->allowedModes & SFRM_Materialize_Random,
-							  false, work_mem);
+			/* get a tuple descriptor for our result type */
+			switch (get_call_result_type(fcinfo, NULL, &tupdesc))
+			{
+				case TYPEFUNC_COMPOSITE:
+					/* success */
+					break;
+				case TYPEFUNC_RECORD:
+					/* failed to determine actual type of RECORD */
+					ereport(ERROR,
+							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("function returning record called in context "
+									"that cannot accept type record")));
+					break;
+				default:
+					/* result type isn't composite */
+					ereport(ERROR,
+							(errcode(ERRCODE_DATATYPE_MISMATCH),
+							 errmsg("return type must be a row type")));
+					break;
+			}
 
-	MemoryContextSwitchTo(oldcontext);
+			/*
+			 * Check that return tupdesc is compatible with the data we got
+			 * from SPI, at least based on number and type of attributes
+			 */
+			if (!compatCrosstabTupleDescs(tupdesc, spi_tupdesc))
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("return and sql tuple descriptions are " \
+								"incompatible")));
+			/*
+			 * switch to long-lived memory context
+			 */
+			oldcontext = MemoryContextSwitchTo(per_query_ctx);
 
-	/*
-	 * Generate attribute metadata needed later to produce tuples from raw C
-	 * strings
-	 */
-	attinmeta = TupleDescGetAttInMetadata(tupdesc);
+			/* make sure we have a persistent copy of the result tupdesc */
+			tupdesc = CreateTupleDescCopy(tupdesc);
 
-	/* total number of tuples to be examined */
-	max_calls = proc;
+			/* initialize our tuplestore in long-lived context */
+			tupstore =
+				tuplestore_begin_heap(rsinfo->allowedModes & SFRM_Materialize_Random,
+									  false, work_mem);
 
-	/* the return tuple always must have 1 rowid + num_categories columns */
-	num_categories = tupdesc->natts - 1;
+			MemoryContextSwitchTo(oldcontext);
 
-	firstpass = true;
-	lastrowid = NULL;
+			/*
+			 * Generate attribute metadata needed later to produce tuples from
+			 * raw C strings
+			 */
+			attinmeta = TupleDescGetAttInMetadata(tupdesc);
 
-	for (call_cntr = 0; call_cntr < max_calls; call_cntr++)
-	{
-		bool		skip_tuple = false;
-		char	  **values;
+			/* total number of tuples to be examined */
+			max_calls = proc;
 
-		/* allocate and zero space */
-		values = (char **) palloc0((1 + num_categories) * sizeof(char *));
+			/* the return tuple always must have 1 rowid + num_categories columns */
+			num_categories = tupdesc->natts - 1;
 
-		/*
-		 * now loop through the sql results and assign each value in sequence
-		 * to the next category
-		 */
-		for (i = 0; i < num_categories; i++)
-		{
-			HeapTuple	spi_tuple;
-			char	   *rowid;
+			tupstore_initialized = true;
+		}
 
-			/* see if we've gone too far already */
-			if (call_cntr >= max_calls)
-				break;
+		firstpass = true;
+		lastrowid = NULL;
 
-			/* get the next sql result tuple */
-			spi_tuple = spi_tuptable->vals[call_cntr];
+		for (call_cntr = 0; call_cntr < max_calls; call_cntr++)
+		{
+			bool		skip_tuple = false;
+			char	  **values;
 
-			/* get the rowid from the current sql result tuple */
-			rowid = SPI_getvalue(spi_tuple, spi_tupdesc, 1);
+			/* allocate and zero space */
+			values = (char **) palloc0((1 + num_categories) * sizeof(char *));
 
 			/*
-			 * If this is the first pass through the values for this rowid,
-			 * set the first column to rowid
+			 * now loop through the sql results and assign each value in sequence
+			 * to the next category
 			 */
-			if (i == 0)
+			for (i = 0; i < num_categories; i++)
 			{
-				xpstrdup(values[0], rowid);
+				HeapTuple	spi_tuple;
+				char	   *rowid;
+
+				/* see if we've gone too far already */
+				if (call_cntr >= max_calls)
+					break;
+
+				/* get the next sql result tuple */
+				spi_tuple = spi_tuptable->vals[call_cntr];
+
+				/* get the rowid from the current sql result tuple */
+				rowid = SPI_getvalue(spi_tuple, spi_tupdesc, 1);
 
 				/*
-				 * Check to see if the rowid is the same as that of the last
-				 * tuple sent -- if so, skip this tuple entirely
+				 * If this is the first pass through the values for this rowid,
+				 * set the first column to rowid
 				 */
-				if (!firstpass && xstreq(lastrowid, rowid))
+				if (i == 0)
 				{
+					xpstrdup(values[0], rowid);
+
+					/*
+					 * Check to see if the rowid is the same as that of the last
+					 * tuple sent -- if so, skip this tuple entirely
+					 */
+					if (!firstpass && xstreq(lastrowid, rowid))
+					{
+						xpfree(rowid);
+						skip_tuple = true;
+						break;
+					}
+				}
+
+				/*
+				 * If rowid hasn't changed on us, continue building the output
+				 * tuple.
+				 */
+				if (xstreq(rowid, values[0]))
+				{
+					/*
+					 * Get the next category item value, which is always
+					 * attribute number three.
+					 *
+					 * Be careful to assign the value to the array index based
+					 * on which category we are presently processing.
+					 */
+					values[1 + i] = SPI_getvalue(spi_tuple, spi_tupdesc, 3);
+
+					/*
+					 * increment the counter since we consume a row for each
+					 * category, but not for last pass because the outer loop will
+					 * do that for us
+					 */
+					if (i < (num_categories - 1))
+						call_cntr++;
+					xpfree(rowid);
+				}
+				else
+				{
+					/*
+					 * We'll fill in NULLs for the missing values, but we need
+					 * to decrement the counter since this sql result row
+					 * doesn't belong to the current output tuple.
+					 */
+					call_cntr--;
 					xpfree(rowid);
-					skip_tuple = true;
 					break;
 				}
 			}
 
-			/*
-			 * If rowid hasn't changed on us, continue building the output
-			 * tuple.
-			 */
-			if (xstreq(rowid, values[0]))
+			if (!skip_tuple)
 			{
-				/*
-				 * Get the next category item value, which is always attribute
-				 * number three.
-				 *
-				 * Be careful to assign the value to the array index based on
-				 * which category we are presently processing.
-				 */
-				values[1 + i] = SPI_getvalue(spi_tuple, spi_tupdesc, 3);
+				HeapTuple	tuple;
 
-				/*
-				 * increment the counter since we consume a row for each
-				 * category, but not for last pass because the outer loop will
-				 * do that for us
-				 */
-				if (i < (num_categories - 1))
-					call_cntr++;
-				xpfree(rowid);
+				/* build the tuple and store it */
+				tuple = BuildTupleFromCStrings(attinmeta, values);
+				tuplestore_puttuple(tupstore, tuple);
+				heap_freetuple(tuple);
 			}
-			else
-			{
-				/*
-				 * We'll fill in NULLs for the missing values, but we need to
-				 * decrement the counter since this sql result row doesn't
-				 * belong to the current output tuple.
-				 */
-				call_cntr--;
-				xpfree(rowid);
-				break;
-			}
-		}
-
-		if (!skip_tuple)
-		{
-			HeapTuple	tuple;
 
-			/* build the tuple and store it */
-			tuple = BuildTupleFromCStrings(attinmeta, values);
-			tuplestore_puttuple(tupstore, tuple);
-			heap_freetuple(tuple);
+			/* Remember current rowid */
+			xpfree(lastrowid);
+			xpstrdup(lastrowid, values[0]);
+			firstpass = false;
+
+			/* Clean up */
+			for (i = 0; i < num_categories + 1; i++)
+				if (values[i] != NULL)
+					pfree(values[i]);
+			pfree(values);
 		}
 
-		/* Remember current rowid */
-		xpfree(lastrowid);
-		xpstrdup(lastrowid, values[0]);
-		firstpass = false;
+		/* Free up the memory of this batch's rows. */
+		SPI_freetuptable(spi_tuptable);
+	}
+
+done:
+	/* release SPI related resources (and return to caller's context) */
+	SPI_freetuptable(spi_tuptable);
+	SPI_cursor_close(portal);
+	SPI_freeplan(plan);
+	SPI_finish();
 
-		/* Clean up */
-		for (i = 0; i < num_categories + 1; i++)
-			if (values[i] != NULL)
-				pfree(values[i]);
-		pfree(values);
+	/* If the query returned 0 zeros, let the caller know so. */
+	if (!tupstore_initialized)
+	{
+		rsinfo->isDone = ExprEndResult;
+		PG_RETURN_NULL();
 	}
 
 	/* let the caller know we're sending back a tuplestore */
@@ -594,9 +637,6 @@ crosstab(PG_FUNCTION_ARGS)
 	rsinfo->setResult = tupstore;
 	rsinfo->setDesc = tupdesc;
 
-	/* release SPI related resources (and return to caller's context) */
-	SPI_finish();
-
 	return (Datum) 0;
 }
 
-- 
2.35.3

Joe Conway

mail@joeconway.com

about 3 years ago

In reply to: Amit Langote (#1)

Re: out of memory in crosstab()

On 11/16/22 02:47, Amit Langote wrote:

A customer seems to have run into $subject. Here's a reproducer they shared:

With the following logged:

LOG: server process (PID 121846) was terminated by signal 9: Killed

That's the Linux OOM killer. Was this running in a container or under
systemd with memory.limit_in_bytes set? If so, perhaps they need a
higher setting.

The problem seems to be spi_printtup() continuing to allocate memory
to expand _SPI_current->tuptable to store the result of crosstab()'s
input query that's executed using:

/* Retrieve the desired rows */
ret = SPI_execute(sql, true, 0);

Note that this asks SPI to retrieve and store *all* result rows of the
query in _SPI_current->tuptable, and if there happen to be so many
rows, as in the case of above example, spi_printtup() ends up asking
for a bit too much memory.

check

The easiest fix for this seems to be for crosstab() to use open a
cursor (SPI_cursor_open) and fetch the rows in batches
(SPI_cursor_fetch) rather than all in one go. I have implemented that
in the attached. Maybe the patch should address other functions that
potentially have the same problem.

Seems reasonable. I didn't look that closely at the patch, but I do
think that there needs to be some justification for the selected batch
size and/or make it configurable.

I also wondered about fixing this by making _SPI_current->tuptable use
a tuplestore that can spill to disk as its backing store rather than a
plain C HeapTuple array, but haven't checked how big of a change that
would be; SPI_tuptable is referenced in many places across the tree.
Though I suspect that idea has enough merits to give that a try
someday.

Seems like a separate patch at the very least

Thoughts on whether this should be fixed and the fix be back-patched?

-1 on backpatching -- this is not a bug, and the changes are non-trivial

Joe

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com