A potential memory leak on Merge Join when Sort node is not below Materialize node

Started by Önder Kalacı over 3 years ago, 21 messages
#1 Önder Kalacı
onderkalaci@gmail.com

Hi hackers,

With PG 15 (rc1 or beta4), I'm observing an interesting memory pattern. I
have not seen a similar discussion on the mailing list. If I missed that,
please refer me there. The problem that I'm going to explain does not
happen on PG 13/14.

It seems like there is a memory leak(?) with $title. I'm still not sure
exactly what is going on, but I thought it would be useful to share at
least my initial investigation.

After running the query and waiting a few minutes (see the steps to
reproduce below), use pg_log_backend_memory_contexts() to dump the memory
contexts of the backend executing the command. The ExecutorState context
grows beyond 100GB, and depending on vm.overcommit_memory, you eventually
get an out-of-memory error or an OOM kill.
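As a sketch (not part of the original report), the dump can be requested
from a second session along these lines; 32224 is the backend PID that
appears in the log excerpt below:

```sql
-- From another session, find the backend running the join and ask it to
-- dump its memory contexts to the server log (available since PostgreSQL 14).
SELECT pid, state, query
FROM pg_stat_activity
WHERE query LIKE 'SELECT count(*) FROM t1 JOIN t2%';

-- Substitute the PID returned above (32224 in the excerpt below).
SELECT pg_log_backend_memory_contexts(32224);
```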

```
2022-09-28 17:33:38.155 CEST [32224] LOG: level: 2; PortalContext: 1024 total in 1 blocks; 592 free (0 chunks); 432 used: <unnamed>
2022-09-28 17:33:38.159 CEST [32224] LOG: level: 3; ExecutorState: 114923929600 total in 13710 blocks; 7783264 free (3 chunks); 114916146336 used
2022-09-28 17:33:38.159 CEST [32224] LOG: level: 4; TupleSort main: 8192 total in 1 blocks; 3928 free (0 chunks); 4264 used
2022-09-28 17:33:38.159 CEST [32224] LOG: level: 5; TupleSort sort: 295096 total in 8 blocks; 256952 free (67 chunks); 38144 used
2022-09-28 17:33:38.159 CEST [32224] LOG: level: 6; Caller tuples: 8192 total in 1 blocks (0 chunks); 7992 free (0 chunks); 200 used
2022-09-28 17:33:38.159 CEST [32224] LOG: level: 4; TupleSort main: 8192 total in 1 blocks; 3928 free (0 chunks); 4264 used
2022-09-28 17:33:38.159 CEST [32224] LOG: level: 5; TupleSort sort: 4309736 total in 18 blocks; 263864 free (59 chunks); 4045872 used
2022-09-28 17:33:38.159 CEST [32224] LOG: level: 6; Caller tuples: 8192 total in 1 blocks (0 chunks); 7992 free (0 chunks); 200 used
...
2022-09-28 17:33:38.160 CEST [32224] LOG: Grand total: 114930446784 bytes in 13972 blocks; 8802248 free (275 chunks); 114921644536 used
```

I observed this with a merge join involving a table and a set-returning
function. To simulate the problem with two tables, I use the following
steps:

```
CREATE TABLE t1 (a text);
CREATE TABLE t2 (a text);

-- make the text a little large by adding 100000000000
INSERT INTO t1 SELECT (100000000000+i%1000)::text FROM
generate_series(0,10000000) i;

-- make the text a little large by adding 100000000000
INSERT INTO t2 SELECT (100000000000+i%10000)::text FROM
generate_series(0,10000000) i;

-- to simplify the explain plan, not strictly necessary
SET max_parallel_workers_per_gather TO 0;

-- these two are necessary so that the problem is triggered
-- these are helpful to use Merge join and avoid materialization
SET enable_hashjoin TO false;
SET enable_material TO false;

-- the join is on a TEXT column
-- when the join is on an INT column with a similar setup, I do not observe
-- this problem (a sketch of the integer variant follows right after this block)
SELECT count(*) FROM t1 JOIN t2 USING (a);
```
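A minimal sketch of the integer comparison mentioned in the comments above
(the *_int table names are only illustrative, not from the original report):

```sql
-- Same data volume, but joining on a pass-by-value (bigint) column;
-- with this variant the ExecutorState context is not expected to blow up.
CREATE TABLE t1_int (a bigint);
CREATE TABLE t2_int (a bigint);

INSERT INTO t1_int SELECT 100000000000 + i % 1000  FROM generate_series(0, 10000000) i;
INSERT INTO t2_int SELECT 100000000000 + i % 10000 FROM generate_series(0, 10000000) i;

SET max_parallel_workers_per_gather TO 0;
SET enable_hashjoin TO false;
SET enable_material TO false;

SELECT count(*) FROM t1_int JOIN t2_int USING (a);
```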

The EXPLAIN output for the text-column query looks like the following:
```
explain SELECT count(*) FROM t1 JOIN t2 USING (a);
                                   QUERY PLAN
------------------------------------------------------------------------------------
 Aggregate  (cost=177735283.36..177735283.37 rows=1 width=8)
   ->  Merge Join  (cost=2556923.81..152703372.24 rows=10012764448 width=0)
         Merge Cond: (t1.a = t2.a)
         ->  Sort  (cost=1658556.19..1683556.63 rows=10000175 width=13)
               Sort Key: t1.a
               ->  Seq Scan on t1  (cost=0.00..154056.75 rows=10000175 width=13)
         ->  Sort  (cost=1658507.28..1683506.93 rows=9999861 width=13)
               Sort Key: t2.a
               ->  Seq Scan on t2  (cost=0.00..154053.61 rows=9999861 width=13)
(9 rows)
```

In the end, my investigation mostly led me to the following palloc(), where
we seem to allocate memory over and over again as the context grows:
```
(gdb) bt
#0  __GI___libc_malloc (bytes=bytes@entry=8388608) at malloc.c:3038
#1  0x00005589f3c55444 in AllocSetAlloc (context=0x5589f4896300, size=14) at aset.c:920
#2  0x00005589f3c5d763 in palloc (size=size@entry=14) at mcxt.c:1082
#3  0x00005589f3b1f553 in datumCopy (value=94051002161216, typByVal=typByVal@entry=false, typLen=<optimized out>) at datum.c:162
#4  0x00005589f3c6ed0b in tuplesort_getdatum (state=state@entry=0x5589f49274e0, forward=forward@entry=true, val=0x5589f48d7860, isNull=0x5589f48d7868, abbrev=abbrev@entry=0x0) at tuplesort.c:2675
#5  0x00005589f3947925 in ExecSort (pstate=0x5589f48d0a38) at nodeSort.c:200
#6  0x00005589f393d74c in ExecProcNode (node=0x5589f48d0a38) at ../../../src/include/executor/executor.h:259
#7  ExecMergeJoin (pstate=0x5589f4896cc8) at nodeMergejoin.c:871
#8  0x00005589f391fbc8 in ExecProcNode (node=0x5589f4896cc8) at ../../../src/include/executor/executor.h:259
#9  fetch_input_tuple (aggstate=aggstate@entry=0x5589f4896670) at nodeAgg.c:563
#10 0x00005589f3923742 in agg_retrieve_direct (aggstate=aggstate@entry=0x5589f4896670) at nodeAgg.c:2441
....
```

Could this be a bug, or am I missing anything?

Thanks,
Onder KALACI

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Önder Kalacı (#1)
Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

Önder Kalacı <onderkalaci@gmail.com> writes:

With PG 15 (rc1 or beta4), I'm observing an interesting memory pattern.

Yup, that is a leak. valgrind'ing it blames this call chain:

==00:00:16:12.228 4011013== 790,404,056 bytes in 60,800,312 blocks are definitely lost in loss record 1,108 of 1,108
==00:00:16:12.228 4011013== at 0x9A5104: palloc (mcxt.c:1170)
==00:00:16:12.228 4011013== by 0x89F8D9: datumCopy (datum.c:175)
==00:00:16:12.228 4011013== by 0x9B5BEE: tuplesort_getdatum (tuplesortvariants.c:882)
==00:00:16:12.228 4011013== by 0x6FA8B3: ExecSort (nodeSort.c:200)
==00:00:16:12.228 4011013== by 0x6F1E87: ExecProcNode (executor.h:259)
==00:00:16:12.228 4011013== by 0x6F1E87: ExecMergeJoin (nodeMergejoin.c:871)
==00:00:16:12.228 4011013== by 0x6D7800: ExecProcNode (executor.h:259)
==00:00:16:12.228 4011013== by 0x6D7800: fetch_input_tuple (nodeAgg.c:562)
==00:00:16:12.228 4011013== by 0x6DAE2E: agg_retrieve_direct (nodeAgg.c:2454)
==00:00:16:12.228 4011013== by 0x6DAE2E: ExecAgg (nodeAgg.c:2174)
==00:00:16:12.228 4011013== by 0x6C6122: ExecProcNode (executor.h:259)
==00:00:16:12.228 4011013== by 0x6C6122: ExecutePlan (execMain.c:1636)

and bisecting fingers this commit as the guilty party:

commit 91e9e89dccdfdf4216953d3d8f5515dcdef177fb
Author: David Rowley <drowley@postgresql.org>
Date: Thu Jul 22 14:03:19 2021 +1200

Make nodeSort.c use Datum sorts for single column sorts

Looks like that forgot that tuplesort_getdatum()'s result has to
be freed by the caller.

regards, tom lane

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#2)
Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

I wrote:

and bisecting fingers this commit as the guilty party:

commit 91e9e89dccdfdf4216953d3d8f5515dcdef177fb
Author: David Rowley <drowley@postgresql.org>
Date: Thu Jul 22 14:03:19 2021 +1200

Make nodeSort.c use Datum sorts for single column sorts

After looking at that for a little while, I wonder if we shouldn't
fix this by restricting the Datum-sort path to be used only with
pass-by-value data types. That'd require only a minor addition
to the new logic in ExecInitSort.

The alternative of inserting a pfree of the old value would complicate
the code nontrivially, I think, and really it would necessitate a
complete performance re-test. I'm wondering if the claimed speedup
for pass-by-ref types wasn't fictional and based on skipping the
required pfrees. Besides, if you think this code is hot enough that
you don't want to add a test-and-branch per tuple (a claim I also
doubt, BTW) then you probably don't want to add such overhead into
the pass-by-value case where the speedup is clear.

regards, tom lane

#4 David Rowley
dgrowleyml@gmail.com
In reply to: Tom Lane (#3)
1 attachment(s)
Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

Thanks for investigating this and finding the guilty commit.

On Thu, 29 Sept 2022 at 07:34, Tom Lane <tgl@sss.pgh.pa.us> wrote:

After looking at that for a little while, I wonder if we shouldn't
fix this by restricting the Datum-sort path to be used only with
pass-by-value data types. That'd require only a minor addition
to the new logic in ExecInitSort.

I'm also wondering if that's the best fix given the timing of this discovery.

The alternative of inserting a pfree of the old value would complicate
the code nontrivially, I think, and really it would necessitate a
complete performance re-test. I'm wondering if the claimed speedup
for pass-by-ref types wasn't fictional and based on skipping the
required pfrees. Besides, if you think this code is hot enough that
you don't want to add a test-and-branch per tuple (a claim I also
doubt, BTW) then you probably don't want to add such overhead into
the pass-by-value case where the speedup is clear.

I'm wondering if the best way to fix it if doing it that way would be
to invent tuplesort_getdatum_nocopy() which would be the same as
tuplesort_getdatum() except it wouldn't do the datumCopy for byref
types. It looks like tuplesort_gettupleslot() when copy==false just
directly stores the MinimalTuple that's in stup.tuple and shouldFree
is set to false.

Going by [1], it looks like I saw gains in test 6, which was a byref
Datum. Skipping the datumCopy() I imagine could only make the gains
slightly higher on that. That puts me a bit more on the fence about
the best fix for PG15.

I've attached a patch to restrict the optimisation to byval types in
the meantime.

David

[1]: /messages/by-id/CAApHDvrWV=v0qKsC9_BHqhCn9TusrNvCaZDz77StCO--fmgbKA@mail.gmail.com

Attachments:

datum_sort_fixes.patch (text/plain)
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 3c28d60c3e..740ad37717 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -220,6 +220,7 @@ SortState *
 ExecInitSort(Sort *node, EState *estate, int eflags)
 {
 	SortState  *sortstate;
+	TupleDesc	outerTupDesc;
 
 	SO1_printf("ExecInitSort: %s\n",
 			   "initializing sort node");
@@ -274,11 +275,13 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
 	ExecInitResultTupleSlotTL(&sortstate->ss.ps, &TTSOpsMinimalTuple);
 	sortstate->ss.ps.ps_ProjInfo = NULL;
 
+	outerTupDesc = ExecGetResultType(outerPlanState(sortstate));
+
 	/*
-	 * We perform a Datum sort when we're sorting just a single column,
+	 * We perform a Datum sort when we're sorting just a single byval column,
 	 * otherwise we perform a tuple sort.
 	 */
-	if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)
+	if (outerTupDesc->natts == 1 && outerTupDesc->attrs[0].attbyval)
 		sortstate->datumSort = true;
 	else
 		sortstate->datumSort = false;

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#4)
Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

David Rowley <dgrowleyml@gmail.com> writes:

I'm wondering if the best way to fix it if doing it that way would be
to invent tuplesort_getdatum_nocopy() which would be the same as
tuplesort_getdatum() except it wouldn't do the datumCopy for byref
types.

Yeah, perhaps. We'd need a clear spec on how long the Datum could
be presumed good --- probably till the next tuplesort_getdatum_nocopy
call, but that'd need to be checked --- and then check if that is
satisfactory for nodeSort's purposes.

If we had such a thing, I wonder if any of the other existing
tuplesort_getdatum callers would be happier with that. nodeAgg for
one is tediously freeing the result, but could we drop that logic?
(I hasten to add that I'm not proposing we touch that for v15.)

regards, tom lane

#6 David Rowley
dgrowleyml@gmail.com
In reply to: Tom Lane (#5)
Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

On Thu, 29 Sept 2022 at 08:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:

David Rowley <dgrowleyml@gmail.com> writes:

I'm wondering if the best way to fix it if doing it that way would be
to invent tuplesort_getdatum_nocopy() which would be the same as
tuplesort_getdatum() except it wouldn't do the datumCopy for byref
types.

Yeah, perhaps. We'd need a clear spec on how long the Datum could
be presumed good --- probably till the next tuplesort_getdatum_nocopy
call, but that'd need to be checked --- and then check if that is
satisfactory for nodeSort's purposes.

Yeah, I think the same rules around scope apply as
tuplesort_gettupleslot() with copy==false. We could do it by adding a
copy flag to the existing function, but I'd rather not add the
branching to that function. It's probably just better to duplicate it
and adjust.

If we had such a thing, I wonder if any of the other existing
tuplesort_getdatum callers would be happier with that. nodeAgg for
one is tediously freeing the result, but could we drop that logic?

Looking at process_ordered_aggregate_single(), it's likely more
efficient to use the nocopy version and just perform a datumCopy()
when we need to store the oldVal. At least, that would be more
efficient when many values are being skipped due to being the same as
the last one.

I've just pushed the disable byref Datums patch I posted earlier. I
only made a small adjustment to make use of the TupleDescAttr() macro.
Önder, thank you for the report.

David

#7 Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#5)
Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

On Wed, Sep 28, 2022 at 12:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

David Rowley <dgrowleyml@gmail.com> writes:

I'm wondering if the best way to fix it if doing it that way would be
to invent tuplesort_getdatum_nocopy() which would be the same as
tuplesort_getdatum() except it wouldn't do the datumCopy for byref
types.

Yeah, perhaps. We'd need a clear spec on how long the Datum could
be presumed good --- probably till the next tuplesort_getdatum_nocopy
call, but that'd need to be checked --- and then check if that is
satisfactory for nodeSort's purposes.

I am reminded of the discussion that led to bugfix commit c2d4eb1b
some years back.

As the commit message of that old bugfix notes, tuplesort_getdatum()
and tuplesort_gettupleslot() are "the odd ones out" among "get tuple"
routines (i.e. routines that get a tuple from a tuplesort by calling
tuplesort_gettuple_common()). We used to sometimes do that with
tuplesort_getindextuple() and possible other such routines, but the
need for that capability was eliminated on the caller side around the
same time as the bugfix went in.

--
Peter Geoghegan

#8 Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#7)
Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

On Wed, Sep 28, 2022 at 4:00 PM Peter Geoghegan <pg@bowt.ie> wrote:

I am reminded of the discussion that led to bugfix commit c2d4eb1b
some years back.

Also potentially relevant: the 2017 commit fa117ee4 anticipated adding
a "copy" argument to tuplesort_getdatum() (the same commit added such
a "copy" argument to tuplesort_gettupleslot()). I see that that still
hasn't happened to tuplesort_getdatum() all these years later. Might
be a good idea to do it in the next year or two, though.

If David is interested in pursuing this now then I certainly won't object.

--
Peter Geoghegan

#9 Michael Paquier
michael@paquier.xyz
In reply to: David Rowley (#6)
Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

On Thu, Sep 29, 2022 at 11:58:17AM +1300, David Rowley wrote:

I've just pushed the disable byref Datums patch I posted earlier. I
only made a small adjustment to make use of the TupleDescAttr() macro.
Önder, thank you for the report.

Wouldn't it be better to have 3a58176 reflect the non-optimization
path in the EXPLAIN output of a new regression test if none of the
existing tests are able to show any difference?
--
Michael

#10 David Rowley
dgrowleyml@gmail.com
In reply to: Michael Paquier (#9)
Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

On Thu, 29 Sept 2022 at 12:30, Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Sep 29, 2022 at 11:58:17AM +1300, David Rowley wrote:

I've just pushed the disable byref Datums patch I posted earlier. I
only made a small adjustment to make use of the TupleDescAttr() macro.
Önder, thank you for the report.

Wouldn't it be better to have 3a58176 reflect the non-optimization
path in the EXPLAIN output of a new regression test if none of the
existing tests are able to show any difference?

There's nothing in EXPLAIN that shows that this optimization occurs.
Or, are you proposing that you think there should be something? and
for 15??

David

#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#9)
Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

Michael Paquier <michael@paquier.xyz> writes:

Wouldn't it be better to have 3a58176 reflect the non-optimization
path in the EXPLAIN output of a new regression test if none of the
existing tests are able to show any difference?

This decision is not visible in EXPLAIN in any case.

regards, tom lane

#12 Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#11)
Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

On Wed, Sep 28, 2022 at 07:35:07PM -0400, Tom Lane wrote:

Michael Paquier <michael@paquier.xyz> writes:

Wouldn't it be better to have 3a58176 reflect the non-optimization
path in the EXPLAIN output of a new regression test if none of the
existing tests are able to show any difference?

This decision is not visible in EXPLAIN in any case.

Okay, thanks!
--
Michael

#13 David Rowley
dgrowleyml@gmail.com
In reply to: Peter Geoghegan (#8)
1 attachment(s)
Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

On Thu, 29 Sept 2022 at 12:07, Peter Geoghegan <pg@bowt.ie> wrote:

Also potentially relevant: the 2017 commit fa117ee4 anticipated adding
a "copy" argument to tuplesort_getdatum() (the same commit added such
a "copy" argument to tuplesort_gettupleslot()). I see that that still
hasn't happened to tuplesort_getdatum() all these years later. Might
be a good idea to do it in the next year or two, though.

If David is interested in pursuing this now then I certainly won't object.

Just while this is fresh in my head, I wrote some code to make this
happen. My preference would be not to add the "copy" param to the
existing function and instead just add a new function to prevent
additional branching.

The attached puts back the datum sort in nodeSort.c for byref types
and adjusts process_ordered_aggregate_single() to make use of this
function.

I did a quick benchmark to see if this helps DISTINCT aggregates any:

create table t1 (a varchar(32) not null, b varchar(32) not null);
insert into t1 select md5((x%10)::text),md5((x%10)::text) from
generate_Series(1,1000000)x;
vacuum freeze t1;
create index on t1(a);

With a work_mem of 256MB I get:

query = select max(distinct a), max(distinct b) from t1;

Master:
latency average = 313.197 ms

Patched:
latency average = 304.335 ms

So not a very impressive speedup there (about 3%)

Some excerpts from perf top show:

Master:
1.40% postgres [.] palloc
1.13% postgres [.] tuplesort_getdatum
0.77% postgres [.] datumCopy

Patched:
0.91% postgres [.] tuplesort_getdatum_nocopy
0.65% postgres [.] palloc

I stared for a while at the mode_final() function and thought maybe we
could use the nocopy variant there. I just didn't quite pluck up the
motivation to write any code to see if it could be made faster.
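
For illustration, a query of roughly this shape reaches mode_final() over a
pass-by-ref type against the benchmark table above (just a sketch of that
case, nothing here was benchmarked):

```sql
-- mode() is an ordered-set aggregate; with a varchar/text argument its final
-- function walks sorted by-reference datums, the code path where the nocopy
-- variant might conceivably help.
SELECT mode() WITHIN GROUP (ORDER BY a) FROM t1;
```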

David

Attachments:

add_tuplesort_getdatum_nocopy_function.patch (text/plain)
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index fe74e49814..b5f63b5a2b 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -878,8 +878,8 @@ process_ordered_aggregate_single(AggState *aggstate,
 	 * pfree them when they are no longer needed.
 	 */
 
-	while (tuplesort_getdatum(pertrans->sortstates[aggstate->current_set],
-							  true, newVal, isNull, &newAbbrevVal))
+	while (tuplesort_getdatum_nocopy(pertrans->sortstates[aggstate->current_set],
+									 true, newVal, isNull, &newAbbrevVal))
 	{
 		/*
 		 * Clear and select the working context for evaluation of the equality
@@ -900,24 +900,33 @@ process_ordered_aggregate_single(AggState *aggstate,
 											 pertrans->aggCollation,
 											 oldVal, *newVal)))))
 		{
-			/* equal to prior, so forget this one */
-			if (!pertrans->inputtypeByVal && !*isNull)
-				pfree(DatumGetPointer(*newVal));
+			MemoryContextSwitchTo(oldContext);
+			continue;
 		}
 		else
 		{
 			advance_transition_function(aggstate, pertrans, pergroupstate);
-			/* forget the old value, if any */
-			if (!oldIsNull && !pertrans->inputtypeByVal)
-				pfree(DatumGetPointer(oldVal));
-			/* and remember the new one for subsequent equality checks */
-			oldVal = *newVal;
+
+			MemoryContextSwitchTo(oldContext);
+
+			/*
+			 * Forget the old value, if any and remember the new one for
+			 * subsequent equality checks.
+			 */
+			if (!pertrans->inputtypeByVal)
+			{
+				if (!oldIsNull)
+					pfree(DatumGetPointer(oldVal));
+				if (!*isNull)
+					oldVal = datumCopy(*newVal, pertrans->inputtypeByVal,
+									   pertrans->inputtypeLen);
+			}
+			else
+				oldVal = *newVal;
 			oldAbbrevVal = newAbbrevVal;
 			oldIsNull = *isNull;
 			haveOldVal = true;
 		}
-
-		MemoryContextSwitchTo(oldContext);
 	}
 
 	if (!oldIsNull && !pertrans->inputtypeByVal)
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 37ad35704b..f418be30a1 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -197,8 +197,8 @@ ExecSort(PlanState *pstate)
 	if (node->datumSort)
 	{
 		ExecClearTuple(slot);
-		if (tuplesort_getdatum(tuplesortstate, ScanDirectionIsForward(dir),
-							   &(slot->tts_values[0]), &(slot->tts_isnull[0]), NULL))
+		if (tuplesort_getdatum_nocopy(tuplesortstate, ScanDirectionIsForward(dir),
+									  &(slot->tts_values[0]), &(slot->tts_isnull[0]), NULL))
 			ExecStoreVirtualTuple(slot);
 	}
 	else
@@ -278,10 +278,10 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
 	outerTupDesc = ExecGetResultType(outerPlanState(sortstate));
 
 	/*
-	 * We perform a Datum sort when we're sorting just a single byval column,
+	 * We perform a Datum sort when we're sorting just a single column,
 	 * otherwise we perform a tuple sort.
 	 */
-	if (outerTupDesc->natts == 1 && TupleDescAttr(outerTupDesc, 0)->attbyval)
+	if (outerTupDesc->natts == 1)
 		sortstate->datumSort = true;
 	else
 		sortstate->datumSort = false;
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index afa5bdbf04..a9d51234e6 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -886,6 +886,58 @@ tuplesort_getdatum(Tuplesortstate *state, bool forward,
 	return true;
 }
 
+/*
+ * Fetch the next Datum in either forward or back direction.
+ * Returns false if no more datums.
+ *
+ * If the Datum is pass-by-ref type, the returned value is a pointer directly
+ * into the tuplesort's stup.tuple.  This is only safe for callers that no
+ * longer require a reference to the datum beyond any subsequent manipulation
+ * of the tuplesort's state.
+ *
+ * Caller may optionally be passed back abbreviated value (on true return
+ * value) when abbreviation was used, which can be used to cheaply avoid
+ * equality checks that might otherwise be required.  Caller can safely make a
+ * determination of "non-equal tuple" based on simple binary inequality.  A
+ * NULL value will have a zeroed abbreviated value representation, which caller
+ * may rely on in abbreviated inequality check.
+ */
+bool
+tuplesort_getdatum_nocopy(Tuplesortstate *state, bool forward,
+						  Datum *val, bool *isNull, Datum *abbrev)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext = MemoryContextSwitchTo(base->sortcontext);
+	SortTuple	stup;
+
+	if (!tuplesort_gettuple_common(state, forward, &stup))
+	{
+		MemoryContextSwitchTo(oldcontext);
+		return false;
+	}
+
+	/* Ensure we copy into caller's memory context */
+	MemoryContextSwitchTo(oldcontext);
+
+	/* Record abbreviated key for caller */
+	if (base->sortKeys->abbrev_converter && abbrev)
+		*abbrev = stup.datum1;
+
+	if (stup.isnull1 || !base->tuples)
+	{
+		*val = stup.datum1;
+		*isNull = stup.isnull1;
+	}
+	else
+	{
+		/* use stup.tuple because stup.datum1 may be an abbreviation */
+		*val = PointerGetDatum(stup.tuple);
+		*isNull = false;
+	}
+
+	return true;
+}
+
 
 /*
  * Routines specialized for HeapTuple (actually MinimalTuple) case
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 4441274990..fc8fc780eb 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -441,6 +441,7 @@ extern HeapTuple tuplesort_getheaptuple(Tuplesortstate *state, bool forward);
 extern IndexTuple tuplesort_getindextuple(Tuplesortstate *state, bool forward);
 extern bool tuplesort_getdatum(Tuplesortstate *state, bool forward,
 							   Datum *val, bool *isNull, Datum *abbrev);
-
+extern bool tuplesort_getdatum_nocopy(Tuplesortstate *state, bool forward,
+									  Datum *val, bool *isNull, Datum *abbrev);
 
 #endif							/* TUPLESORT_H */

#14 Peter Geoghegan
pg@bowt.ie
In reply to: David Rowley (#13)
Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

On Wed, Sep 28, 2022 at 6:13 PM David Rowley <dgrowleyml@gmail.com> wrote:

Master:
latency average = 313.197 ms

Patched:
latency average = 304.335 ms

So not a very impressive speedup there (about 3%)

Worth a try, at least. Having a more consistent interface is valuable
in itself too.

--
Peter Geoghegan

#15 David Rowley
dgrowleyml@gmail.com
In reply to: Peter Geoghegan (#14)
Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

On Thu, 29 Sept 2022 at 14:32, Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Sep 28, 2022 at 6:13 PM David Rowley <dgrowleyml@gmail.com> wrote:

Master:
latency average = 313.197 ms

Patched:
latency average = 304.335 ms

So not a very impressive speedup there (about 3%)

Worth a try, at least. Having a more consistent interface is valuable
in itself too.

Just testing the datum sort in nodeSort.c with the same table as
before but using the query:

select b from t1 order by b offset 1000000;

Master:
latency average = 344.763 ms

Patched:
latency average = 268.374 ms

about 28% faster.

I'll take this to another thread and put it in the next CF

David

#16 Peter Geoghegan
pg@bowt.ie
In reply to: David Rowley (#15)
Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

On Wed, Sep 28, 2022 at 9:59 PM David Rowley <dgrowleyml@gmail.com> wrote:

select b from t1 order by b offset 1000000;

Master:
latency average = 344.763 ms

Patched:
latency average = 268.374 ms

about 28% faster.

That's more like it!

--
Peter Geoghegan

#17 Ronan Dunklau
ronan.dunklau@aiven.io
In reply to: David Rowley (#6)
Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

I've just pushed the disable byref Datums patch I posted earlier. I
only made a small adjustment to make use of the TupleDescAttr() macro.
Önder, thank you for the report.

Thank you David for taking care of this.

Yeah, I think the same rules around scope apply as
tuplesort_gettupleslot() with copy==false. We could do it by adding a
copy flag to the existing function, but I'd rather not add the
branching to that function. It's probably just better to duplicate it
and adjust.

For the record, I tried to see if gcc would optimize the function by
generating two different versions when copy is true or false, thus getting rid
of the branching while still having only one function to deal with. Using the
-fipa-cp-clone (or even the whole set of additional flags coming with -O3), it
does generate a special-case version of the function, but it seems to then
only be used by heapam_index_validate_scan and
percentile_cont_multi_final_common. This is from my investigation looking for
references to the specialized version in the DWARF debug information.

Regards,

--
Ronan Dunklau

#18 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ronan Dunklau (#17)
Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

Ronan Dunklau <ronan.dunklau@aiven.io> writes:

Yeah, I think the same rules around scope apply as
tuplesort_gettupleslot() with copy==false. We could do it by adding a
copy flag to the existing function, but I'd rather not add the
branching to that function. It's probably just better to duplicate it
and adjust.

For the record, I tried to see if gcc would optimize the function by
generating two different versions when copy is true or false, thus getting rid
of the branching while still having only one function to deal with.

TBH, I think this is completely ridiculous over-optimization.
There's exactly zero evidence that a second copy of the function
would improve performance, or do anything but contribute to code
bloat (which does have a distributed performance cost).

regards, tom lane

#19 Ronan Dunklau
ronan.dunklau@aiven.io
In reply to: Tom Lane (#18)
Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

On Thursday, 29 September 2022 at 16:10:03 CEST, Tom Lane wrote:

Ronan Dunklau <ronan.dunklau@aiven.io> writes:

Yeah, I think the same rules around scope apply as
tuplesort_gettupleslot() with copy==false. We could do it by adding a
copy flag to the existing function, but I'd rather not add the
branching to that function. It's probably just better to duplicate it
and adjust.

For the record, I tried to see if gcc would optimize the function by
generating two different versions when copy is true or false, thus getting
rid of the branching while still having only one function to deal with.

TBH, I think this is completely ridiculous over-optimization.
There's exactly zero evidence that a second copy of the function
would improve performance, or do anything but contribute to code
bloat (which does have a distributed performance cost).

I wasn't commenting on the merit of the optimization, but just that I tried to
get gcc to apply it itself, which it doesn't.

Regards,

--
Ronan Dunklau

#20 Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#18)
Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

On Thu, Sep 29, 2022 at 7:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

TBH, I think this is completely ridiculous over-optimization.
There's exactly zero evidence that a second copy of the function
would improve performance, or do anything but contribute to code
bloat (which does have a distributed performance cost).

I thought that that was unjustified myself.

--
Peter Geoghegan

#21 Önder Kalacı
onderkalaci@gmail.com
In reply to: David Rowley (#6)
Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

Hi David, Tom, all,

I've just pushed the disable byref Datums patch I posted earlier. I
only made a small adjustment to make use of the TupleDescAttr() macro.
Önder, thank you for the report.

With this commit, I re-ran the query patterns where we observed the
problem, and all looks good now. I wanted to share this information as an FYI.

Thanks for the quick turnaround!

Onder KALACI