PostgreSQL crashes with SIGSEGV
A customer recently reported a crash in a postgres backend. The backend
encountered a SIGSEGV, crashing during SELECTs from a fairly
complicated view using a grouping sets directive. I've managed to
reproduce it by tracking it down to a specific SELECT, but
unfortunately haven't yet managed to strip it down to a small,
repeatable test case that doesn't involve the whole (sensitive)
dataset. I'm reporting my findings so far in the hope that they
already help to track it down.
The crashed backend gives the following backtrace:
Program received signal SIGSEGV, Segmentation fault.
0x000000000099be84 in pfree (pointer=0x17194e8) at mcxt.c:1007
1007 context = ((StandardChunkHeader *)
(gdb) bt
#0 0x000000000099be84 in pfree (pointer=0x17194e8) at mcxt.c:1007
#1 0x00000000004798d8 in heap_free_minimal_tuple (mtup=0x17194e8) at
heaptuple.c:1403
#2 0x0000000000687202 in ExecClearTuple (slot=0x10a3f88) at
execTuples.c:455
#3 0x0000000000686b89 in ExecResetTupleTable (tupleTable=0x10a06a0,
shouldFree=0 '\000') at execTuples.c:169
#4 0x00000000006773ff in ExecEndPlan (planstate=0x10a0398,
estate=0x10a01f0) at execMain.c:1469
#5 0x0000000000675927 in standard_ExecutorEnd (queryDesc=0x10910c0) at
execMain.c:468
#6 0x0000000000675865 in ExecutorEnd (queryDesc=0x10910c0) at
execMain.c:439
#7 0x000000000062c71c in PortalCleanup (portal=0xf79ae0) at
portalcmds.c:280
#8 0x000000000099ce97 in PortalDrop (portal=0xf79ae0, isTopCommit=0
'\000') at portalmem.c:510
#9 0x0000000000813639 in exec_simple_query (
query_string=0xf3dbe0 "SELECT * FROM \"xxxx\".\"xxxx\" WHERE (xxxx
= 'xxxx') LIMIT 1000;") at postgres.c:1095
#10 0x00000000008177ea in PostgresMain (argc=1, argv=0xee9e10,
dbname=0xeb9a78 "xxxx", username=0xee9c78 "bernd")
at postgres.c:4072
#11 0x000000000078cc27 in BackendRun (port=0xedc700) at
postmaster.c:4342
#12 0x000000000078c377 in BackendStartup (port=0xedc700) at
postmaster.c:4016
#13 0x0000000000788983 in ServerLoop () at postmaster.c:1721
#14 0x0000000000787fb8 in PostmasterMain (argc=1, argv=0xeb7960) at
postmaster.c:1329
#15 0x00000000006d1b52 in main (argc=1, argv=0xeb7960) at main.c:228
I've tested this so far against very current REL9_6_STABLE and
REL9_5_STABLE and got them to crash with the same backtrace. The crash
depends on the chosen plan; experiments with work_mem show that the
crash seems to happen only when external sorts appear in the
execution plan.
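A sketch of how to check for that, using a synthetic grouping sets
query rather than the customer's view:

set work_mem = '64kB';
explain analyze
select g % 10, g % 7, count(*)
from generate_series(1, 100000) g
group by grouping sets ((g % 10), (g % 7));

An affected plan shows "Sort Method: external merge  Disk: ..." on the
Sort node.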
REL10_STABLE seems unaffected, as my extracted application query
doesn't crash there.
Running the query against REL9_6_STABLE with valgrind shows the
following results:
==00:00:01:33.067 13158== Invalid write of size 8
==00:00:01:33.067 13158== at 0x93D53B: AllocSetFree (aset.c:998)
==00:00:01:33.067 13158== by 0x93EC03: pfree (mcxt.c:1012)
==00:00:01:33.067 13158== by 0x476E34: heap_free_minimal_tuple
(heaptuple.c:1403)
==00:00:01:33.067 13158== by 0x6521A3: ExecClearTuple
(execTuples.c:455)
==00:00:01:33.067 13158== by 0x651D4B: ExecResetTupleTable
(execTuples.c:169)
==00:00:03:34.540 14381== Invalid read of size 8
==00:00:03:34.540 14381== at 0x99E29C: pfree (mcxt.c:1007)
==00:00:03:34.540 14381== by 0x4798D7: heap_free_minimal_tuple
(heaptuple.c:1403)
==00:00:03:34.540 14381== by 0x68753E: ExecClearTuple
(execTuples.c:455)
==00:00:03:34.540 14381== by 0x686EC5: ExecResetTupleTable
(execTuples.c:169)
==00:00:03:34.540 14381== by 0x67773B: ExecEndPlan (execMain.c:1469)
==00:00:03:34.540 14381== by 0x675C63: standard_ExecutorEnd
(execMain.c:468)
==00:00:03:34.540 14381== by 0x675BA1: ExecutorEnd (execMain.c:439)
==00:00:03:34.540 14381== by 0x62CA58: PortalCleanup
(portalcmds.c:280)
==00:00:03:34.540 14381== by 0x99F412: PortalDrop (portalmem.c:510)
==00:00:03:34.540 14381== by 0x8142A4: exec_simple_query
(postgres.c:1095)
==00:00:03:34.540 14381== by 0x818455: PostgresMain
(postgres.c:4072)
==00:00:03:34.540 14381== by 0x78CF63: BackendRun
(postmaster.c:4342)
==00:00:03:34.540 14381== Address 0x188cd220 is 474,080 bytes inside a
block of size 1,048,576 free'd
==00:00:03:34.540 14381== at 0x4C2FD18: free
(vg_replace_malloc.c:530)
==00:00:03:34.540 14381== by 0x99A8DB: AllocSetDelete (aset.c:653)
==00:00:03:34.540 14381== by 0x99C7A4: MemoryContextDelete
(mcxt.c:225)
==00:00:03:34.540 14381== by 0x99C855: MemoryContextDeleteChildren
(mcxt.c:245)
==00:00:03:34.540 14381== by 0x99C772: MemoryContextDelete
(mcxt.c:208)
==00:00:03:34.540 14381== by 0x9A5F10: tuplesort_end
(tuplesort.c:1198)
==00:00:03:34.540 14381== by 0x692568: ExecEndAgg (nodeAgg.c:3449)
==00:00:03:34.540 14381== by 0x67BB46: ExecEndNode
(execProcnode.c:755)
==00:00:03:34.540 14381== by 0x6AD652: ExecEndSubqueryScan
(nodeSubqueryscan.c:181)
==00:00:03:34.540 14381== by 0x67BA69: ExecEndNode
(execProcnode.c:697)
==00:00:03:34.540 14381== by 0x69DF6E: ExecEndLimit
(nodeLimit.c:438)
==00:00:03:34.540 14381== by 0x67BB9D: ExecEndNode
(execProcnode.c:779)
==00:00:03:34.540 14381== Block was alloc'd at
==00:00:03:34.540 14381== at 0x4C2EB6B: malloc
(vg_replace_malloc.c:299)
==00:00:03:34.540 14381== by 0x99AED4: AllocSetAlloc (aset.c:866)
==00:00:03:34.540 14381== by 0x99DC7A: palloc (mcxt.c:904)
==00:00:03:34.540 14381== by 0x4798F6: heap_copy_minimal_tuple
(heaptuple.c:1417)
==00:00:03:34.540 14381== by 0x6878D7: ExecCopySlotMinimalTuple
(execTuples.c:593)
==00:00:03:34.540 14381== by 0x9AC1D2: copytup_heap
(tuplesort.c:3998)
==00:00:03:34.540 14381== by 0x9A616C: tuplesort_puttupleslot
(tuplesort.c:1345)
==00:00:03:34.540 14381== by 0x68D7D8: fetch_input_tuple
(nodeAgg.c:601)
==00:00:03:34.540 14381== by 0x68FEC1: agg_retrieve_direct
(nodeAgg.c:2168)
==00:00:03:34.540 14381== by 0x68F9AB: ExecAgg (nodeAgg.c:1903)
==00:00:03:34.540 14381== by 0x67B70A: ExecProcNode
(execProcnode.c:503)
==00:00:03:34.540 14381== by 0x6AD40F: SubqueryNext
(nodeSubqueryscan.c:53)
==00:00:03:34.540 14381==
{
<insert_a_suppression_name_here>
Memcheck:Addr8
fun:pfree
fun:heap_free_minimal_tuple
fun:ExecClearTuple
fun:ExecResetTupleTable
fun:ExecEndPlan
fun:standard_ExecutorEnd
fun:ExecutorEnd
fun:PortalCleanup
fun:PortalDrop
fun:exec_simple_query
fun:PostgresMain
fun:BackendRun
}
==00:00:03:34.548 14381== Invalid read of size 4
==00:00:03:34.548 14381== at 0x99E2AE: pfree (mcxt.c:1010)
==00:00:03:34.548 14381== by 0x4798D7: heap_free_minimal_tuple
(heaptuple.c:1403)
==00:00:03:34.548 14381== by 0x68753E: ExecClearTuple
(execTuples.c:455)
==00:00:03:34.548 14381== by 0x686EC5: ExecResetTupleTable
(execTuples.c:169)
==00:00:03:34.548 14381== by 0x67773B: ExecEndPlan (execMain.c:1469)
==00:00:03:34.548 14381== by 0x675C63: standard_ExecutorEnd
(execMain.c:468)
==00:00:03:34.548 14381== by 0x675BA1: ExecutorEnd (execMain.c:439)
==00:00:03:34.548 14381== by 0x62CA58: PortalCleanup
(portalcmds.c:280)
==00:00:03:34.548 14381== by 0x99F412: PortalDrop (portalmem.c:510)
==00:00:03:34.548 14381== by 0x8142A4: exec_simple_query
(postgres.c:1095)
==00:00:03:34.548 14381== by 0x818455: PostgresMain
(postgres.c:4072)
==00:00:03:34.548 14381== by 0x78CF63: BackendRun
(postmaster.c:4342)
==00:00:03:34.548 14381== Address 0x7f7f7f7f7f7f7f7f is not stack'd,
malloc'd or (recently) free'd
==00:00:03:34.548 14381==
{
<insert_a_suppression_name_here>
Memcheck:Addr4
fun:pfree
fun:heap_free_minimal_tuple
fun:ExecClearTuple
fun:ExecResetTupleTable
fun:ExecEndPlan
fun:standard_ExecutorEnd
fun:ExecutorEnd
fun:PortalCleanup
fun:PortalDrop
fun:exec_simple_query
fun:PostgresMain
fun:BackendRun
}
==00:00:03:34.548 14381==
==00:00:03:34.548 14381== Process terminating with default action of
signal 11 (SIGSEGV): dumping core
==00:00:03:34.548 14381== General Protection Fault
==00:00:03:34.548 14381== at 0x99E2AE: pfree (mcxt.c:1010)
==00:00:03:34.548 14381== by 0x4798D7: heap_free_minimal_tuple
(heaptuple.c:1403)
==00:00:03:34.548 14381== by 0x68753E: ExecClearTuple
(execTuples.c:455)
==00:00:03:34.548 14381== by 0x686EC5: ExecResetTupleTable
(execTuples.c:169)
==00:00:03:34.548 14381== by 0x67773B: ExecEndPlan (execMain.c:1469)
==00:00:03:34.548 14381== by 0x675C63: standard_ExecutorEnd
(execMain.c:468)
==00:00:03:34.548 14381== by 0x675BA1: ExecutorEnd (execMain.c:439)
==00:00:03:34.548 14381== by 0x62CA58: PortalCleanup
(portalcmds.c:280)
==00:00:03:34.548 14381== by 0x99F412: PortalDrop (portalmem.c:510)
==00:00:03:34.548 14381== by 0x8142A4: exec_simple_query
(postgres.c:1095)
==00:00:03:34.548 14381== by 0x818455: PostgresMain
(postgres.c:4072)
==00:00:03:34.548 14381== by 0x78CF63: BackendRun
(postmaster.c:4342)
==00:00:03:35.088 14381==
==00:00:03:35.088 14381== Process terminating with default action of
signal 11 (SIGSEGV)
==00:00:03:35.088 14381== General Protection Fault
==00:00:03:35.088 14381== at 0x632016C: _dl_catch_error (in
/usr/lib64/libc-2.25.so)
==00:00:03:35.088 14381== by 0x631F8E6: __libc_dlclose (in
/usr/lib64/libc-2.25.so)
==00:00:03:35.088 14381== by 0x634B5E4: free_mem (in
/usr/lib64/libc-2.25.so)
==00:00:03:35.088 14381== by 0x634B1E1: __libc_freeres (in
/usr/lib64/libc-2.25.so)
==00:00:03:35.088 14381== by 0x4A296DB: _vgnU_freeres
(vg_preloaded.c:77)
==00:00:03:35.088 14381== by 0x18118EA7: ???
==00:00:03:35.088 14381== by 0x4798D7: heap_free_minimal_tuple
(heaptuple.c:1403)
==00:00:03:35.088 14381== by 0x68753E: ExecClearTuple
(execTuples.c:455)
==00:00:03:35.088 14381== by 0x686EC5: ExecResetTupleTable
(execTuples.c:169)
==00:00:03:35.088 14381== by 0x67773B: ExecEndPlan (execMain.c:1469)
==00:00:03:35.088 14381== by 0x675C63: standard_ExecutorEnd
(execMain.c:468)
==00:00:03:35.088 14381== by 0x675BA1: ExecutorEnd (execMain.c:439)
==00:00:03:35.088 14381==
==00:00:03:35.088 14381== HEAP SUMMARY:
==00:00:03:35.088 14381== in use at exit: 6,833,625 bytes in 531
blocks
==00:00:03:35.088 14381== total heap usage: 814,111 allocs, 1,264
frees, 118,761,978 bytes allocated
==00:00:03:35.088 14381==
==00:00:03:35.088 14381== For a detailed leak analysis, rerun with: --
leak-check=full
==00:00:03:35.088 14381==
==00:00:03:35.088 14381== For counts of detected and suppressed errors,
rerun with: -v
==00:00:03:35.088 14381== ERROR SUMMARY: 2 errors from 2 contexts
(suppressed: 35 from 6)
It seems the backend tries to free a minimal tuple at executor end
that is already gone: the memory context it was allocated in was
deleted earlier. ExecResetTupleTable() loops through a list with 70
entries; after six or seven rounds it encounters the first tuple slot
with tts_shouldFreeMin set (none of the slots before it have it set):
(gdb) p *slot
$6 = {type = T_TupleTableSlot, tts_isempty = 0 '\000', tts_shouldFree =
0 '\000', tts_shouldFreeMin = 1 '\001', tts_slow = 1 '\001',
tts_tuple = 0x10a6c48, tts_tupleDescriptor = 0x10d1be8, tts_mcxt =
0xf59668, tts_buffer = 0, tts_nvalid = 6, tts_values = 0x10d2640,
tts_isnull = 0x10d26d8 "", tts_mintuple = 0x171c168, tts_minhdr =
{t_len = 162, t_self = {ip_blkid = {bi_hi = 0, bi_lo = 0},
ip_posid = 0}, t_tableOid = 0, t_data = 0x171c160}, tts_off = 88}
(gdb) p list_length(tupleTable)
tts_mintuple is invalid, though:
(gdb) p *slot->tts_mintuple
Cannot access memory at address 0x171c168
The subsequent attempt to clear it in ExecClearTuple() then crashes
the backend.
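For reference, the slot-clearing logic that performs the fatal pfree()
looks roughly like this (a condensed paraphrase of the 9.6-era
execTuples.c, not a verbatim copy):

/* inside ExecClearTuple(slot) */
if (slot->tts_shouldFree)
    heap_freetuple(slot->tts_tuple);
if (slot->tts_shouldFreeMin)
    heap_free_minimal_tuple(slot->tts_mintuple); /* pfree()s the stale
                                                  * pointer -- crash */
slot->tts_shouldFree = false;
slot->tts_shouldFreeMin = false;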
Bernd
On Fri, Dec 8, 2017 at 12:47 AM, Bernd Helmle <mailings@oopsware.de> wrote:
A customer recently reported a crash in a postgres backend. The backend
encountered a SIGSEGV, crashing during SELECTs from a fairly
complicated view using a grouping sets directive. I've managed to
reproduce it by tracking it down to a specific SELECT, but
unfortunately haven't yet managed to strip it down to a small,
repeatable test case that doesn't involve the whole (sensitive)
dataset. I'm reporting my findings so far in the hope that they
already help to track it down.
Hmm. Even if you cannot produce an isolated test case, could it be
possible to get an idea of the shape of the SELECT query involved, and
of the schema plus the view? No need for sensitive data here. This
would help in reproducing a test case. Also, what are the sizes
involved? Would even a small data set with low work_mem trigger the
problem?
I've tested this so far against very current REL9_6_STABLE and
REL9_5_STABLE and got them to crash with the same backtrace. The crash
depends on the chosen plan; experiments with work_mem show that the
crash seems to happen only when external sorts appear in the execution
plan.
REL10_STABLE seems unaffected, as my extracted application query
doesn't crash there.
That's one thing to begin with. So HEAD is not affected either?
--
Michael
On Thu, Dec 7, 2017 at 7:47 AM, Bernd Helmle <mailings@oopsware.de> wrote:
I've tested this so far against very current REL9_6_STABLE and
REL9_5_STABLE and got them to crash with the same backtrace. The crash
depends on the chosen plan; experiments with work_mem show that the
crash seems to happen only when external sorts appear in the execution
plan.
REL10_STABLE seems unaffected, as my extracted application query
doesn't crash there.
Does "set replacement_sort_tuples = 0" change anything on
REL9_6_STABLE? If you only get a crash when there is very little
work_mem, then that might be a good avenue of investigation. Note that
my changes to external sorting started in REL9_6_STABLE, so they can't
be involved here.
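Something along these lines in the failing session, with everything
else kept the same:

set replacement_sort_tuples = 0;  -- disable replacement selection
-- then re-run the crashing SELECT and compare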
Are you aware of commit 512f67c8d02cc558f9c269cc848b0f0f788c4fe1,
which fixed a bug affecting external sorts? Are you sure that you have
that fix on REL9_5_STABLE + REL9_6_STABLE?
--
Peter Geoghegan
On Thu, Dec 7, 2017 at 7:47 AM, Bernd Helmle <mailings@oopsware.de> wrote:
It seems the backend tries to free a minimal tuple at executor end
that is already gone: the memory context it was allocated in was
deleted earlier. ExecResetTupleTable() loops through a list with 70
entries; after six or seven rounds it encounters the first tuple slot
with tts_shouldFreeMin set (none of the slots before it have it set):
On second thought, it seems more likely that the reason that
REL10_STABLE is unaffected is commit
3856cf9607f41245ec9462519c53f1109e781fc5. As of that commit (which
built on earlier v10 work), there is no question of memory for tuples
retrieved via tuplesort_gettupleslot() not belonging to the caller --
it must belong to the caller. The old interface already resulted in
bugs in early 9.6 point releases that looked similar to this one, so I
was glad to remove it. (Note also that this picture was slightly
complicated by the performance optimization commit
fa117ee40330db401da776e7b003f047098a7d4c that followed, which made
some callers opt out of copying when that was clearly safe, but that
probably isn't important.)
So, as you said, the question that we probably need to answer is: just
how did the grouping sets/nodeAgg.c code end up getting tuple memory
lifetime wrong? One good way to get more information is to rerun
Valgrind, but this time with track origins enabled. I myself run
Valgrind like this when I want to see the origin of memory involved in
an error. I specify:
$ valgrind --leak-check=no --gen-suppressions=all --trace-children=yes
--track-origins=yes --read-var-info=yes
--suppressions=/home/pg/postgresql/root/source/src/tools/valgrind.supp
-v postgres --log_line_prefix="%m %p " --log_statement=all
--shared_buffers=64MB 2>&1 | tee postmaster.log
(Probably the only change that you'll need to make is to run
Valgrind with the extra "--track-origins=yes".)
--track-origins=yes is usually something I use when I already know
that Valgrind will complain, but want more information about the
nature of the problem.
--
Peter Geoghegan
On Thursday, 2017-12-07 at 18:23 -0800, Peter Geoghegan wrote:
Does "set replacement_sort_tuples = 0" change anything on
REL9_6_STABLE? If you only get a crash when there is very little
work_mem, then that might be a good avenue of investigation. Note that
my changes to external sorting started in REL9_6_STABLE, so they can't
be involved here.
replacement_sort_tuples = 0 changes the picture, indeed. With that
setting, the query runs without problems in REL9_6_STABLE.
Are you aware of commit 512f67c8d02cc558f9c269cc848b0f0f788c4fe1,
which fixed a bug affecting external sorts? Are you sure that you have
that fix on REL9_5_STABLE + REL9_6_STABLE?
My test instances are built against a fresh pull from yesterday.
On Thursday, 2017-12-07 at 18:54 -0800, Peter Geoghegan wrote:
So, as you said, the question that we probably need to answer is: just
how did the grouping sets/nodeAgg.c code end up getting tuple memory
lifetime wrong? One good way to get more information is to rerun
Valgrind, but this time with track-origins enabled. I myself run
Valgrind like this when I want to see the origin of memory involved in
an error. I specify:
$ valgrind --leak-check=no --gen-suppressions=all --trace-children=yes
--track-origins=yes --read-var-info=yes
--suppressions=/home/pg/postgresql/root/source/src/tools/valgrind.supp
-v postgres --log_line_prefix="%m %p " --log_statement=all
--shared_buffers=64MB 2>&1 | tee postmaster.log
(Probably the only change that you'll need to make is to run Valgrind
with the extra "--track-origins=yes".)
--track-origins=yes is usually something I use when I already know
that Valgrind will complain, but want more information about the
nature of the problem.
That's what I already did. My valgrind invocation was this:
valgrind --leak-check=no --gen-suppressions=all \
--track-origins=yes --suppressions=src/tools/valgrind.supp \
--time-stamp=yes --trace-children=yes postgres
On Fri, Dec 8, 2017 at 6:19 PM, Bernd Helmle <mailings@oopsware.de> wrote:
valgrind --leak-check=no --gen-suppressions=all \
--track-origins=yes --suppressions=src/tools/valgrind.supp \
--time-stamp=yes --trace-children=yes postgres
I have been trying for a couple of hours to reproduce a failure using
views with grouping sets, enforcing an external sort with things like
this:
create table aa (a1 int, a2 int, a3 int, a4 int, a5 int, a6 int);
insert into aa SELECT (random() * 2000000000)::int,
(random() * 2000000000)::int,
(random() * 2000000000)::int,
(random() * 2000000000)::int,
(random() * 2000000000)::int,
(random() * 2000000000)::int FROM generate_series(1,1000000);
set work_mem = '128kB';
create or replace view aav as SELECT a1, a2, a3, a4, a5, avg(a6)
FROM aa GROUP BY GROUPING SETS (a1, a2, a3, a4, a5);
explain analyze select * from aav order by a1, a2, a3, a4, a5;
Not sure if you can provide this information, but what does the plan
of the query look like?
--
Michael
Michael Paquier writes:
Not sure if you can provide this information, but what does the plan
of the query look like?
We did some more reducing work on the original query and data. The
following testcase reproduces the double free reported by valgrind for
me when run against a vanilla REL9_6_STABLE build.
regards,
Andreas
--8<---------------cut here---------------start------------->8---
drop table if exists bug;
create table bug (n text, v text, b text, t text);
insert into bug
select i%9, i%9, i%16, i%4096 from generate_series(1,100000) g(i);
analyze bug;
explain select * from (
select thecube.nv
from ( select
(n || ' ') || coalesce(v, '') as nv
from bug
group by ((n || ' ') || coalesce(v, '')), cube(b, t)
) thecube
where nv = '8 8'
) sub limit 7000;
--8<---------------cut here---------------end--------------->8---
QUERY PLAN
---------------------------------------------------------------------------------------------------
Limit (cost=13422.32..15747.28 rows=7000 width=32)
-> Subquery Scan on thecube (cost=13422.32..27622.49 rows=42754 width=32)
-> GroupAggregate (cost=13422.32..27194.95 rows=42754 width=38)
Group Key: (((bug.n || ' '::text) || COALESCE(bug.v, ''::text))), bug.b, bug.t
Group Key: (((bug.n || ' '::text) || COALESCE(bug.v, ''::text))), bug.b
Group Key: (((bug.n || ' '::text) || COALESCE(bug.v, ''::text)))
Sort Key: (((bug.n || ' '::text) || COALESCE(bug.v, ''::text))), bug.t
Group Key: (((bug.n || ' '::text) || COALESCE(bug.v, ''::text))), bug.t
Filter: ((((bug.n || ' '::text) || COALESCE(bug.v, ''::text))) = '8 8'::text)
-> Sort (cost=13422.32..13672.32 rows=100000 width=42)
Sort Key: (((bug.n || ' '::text) || COALESCE(bug.v, ''::text))), bug.b, bug.t
-> Seq Scan on bug (cost=0.00..2041.00 rows=100000 width=42)
On Thu, Dec 14, 2017 at 10:47 AM, Andreas Seltenreich
<andreas.seltenreich@credativ.de> wrote:
We did some more reducing work on the original query and data. The
following testcase reproduces the double free reported by valgrind for
me when run against a vanilla REL9_6_STABLE build.
--8<---------------cut here---------------start------------->8---
drop table if exists bug;
create table bug (n text, v text, b text, t text);
insert into bug
select i%9, i%9, i%16, i%4096 from generate_series(1,100000) g(i);
analyze bug;
explain select * from (
select thecube.nv
from ( select
(n || ' ') || coalesce(v, '') as nv
from bug
group by ((n || ' ') || coalesce(v, '')), cube(b, t)
) thecube
where nv = '8 8'
) sub limit 7000;
--8<---------------cut here---------------end--------------->8---
I can reproduce this against REL9_6_STABLE, once work_mem is set to
4MB, and replacement_sort_tuples is set to 150000.
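For anyone following along, that amounts to running Andreas' script
after:

set work_mem = '4MB';
set replacement_sort_tuples = 150000;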
--
Peter Geoghegan
On Thu, Dec 14, 2017 at 12:06 PM, Peter Geoghegan <pg@bowt.ie> wrote:
I can reproduce this against REL9_6_STABLE, once work_mem is set to
4MB, and replacement_sort_tuples is set to 150000.
I've figured this one out. It's a double free. Consider this code:
(Although REL9_5_STABLE is also affected, the following remarks
reference REL9_6_STABLE-only code.)
* The slot receives a copied tuple (sometimes allocated in caller memory
* context) that will stay valid regardless of future manipulations of the
* tuplesort's state.
*/
bool
tuplesort_gettupleslot(/* SNIP */)
I wrote that comment block, in commit a5f0bd77a, an early bugfix for
9.6 (I actually mentioned this specific commit to Bernd in passing
already). This bug is very similar to the one fixed by a5f0bd77a, and
is arguably a broader version of that same bug. Note that I didn't do
anything with external sorts until 9.6, so I expect that this affects
all stable branches. That said, it may not be possible to produce a
hard crash on all versions, more or less by accident.
The function opens with this code:
MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
SortTuple stup;
bool should_free;
if (!tuplesort_gettuple_common(state, forward, &stup, &should_free))
stup.tuple = NULL;
MemoryContextSwitchTo(oldcontext);
/* SNIP */
...
As long as the slot receives a copied tuple that may or may not be in
caller's memory context (it may instead get allocated in a
tuplesort-private memory context), and as long as destroying a
tuplesort via tuplesort_end() counts as a "future manipulation", then
it's possible to crash with a double-pfree() while not violating
the tuplesort_gettupleslot() contract. The memory may or may not be
freed as part of tuplesort_end() teardown, based only on accidental
implementation factors (that's the first time the memory gets freed
when we do crash). Specifically, if state->status == TSS_SORTEDONTAPE,
we risk this crash on versions before Postgres 10 (more on why 10
appears to be okay tomorrow), when tuplesort_end() just so happens to
have been called (destroying the 2 memory contexts) before the
executor's superfluous pfree() within ExecResetTupleTable().
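To spell the sequence out schematically (names follow the 9.6
sources, but this is an illustration of the ordering, not literal
executor code):

/* fetch: tuple may live in a tuplesort-private context, yet the
 * slot is marked tts_shouldFreeMin = true */
tuplesort_gettupleslot(sortstate, true, slot, NULL);

/* node shutdown: MemoryContextDelete() frees that tuple (1st free) */
tuplesort_end(sortstate);

/* executor end: ExecResetTupleTable() -> ExecClearTuple() ->
 * heap_free_minimal_tuple() -> pfree() frees it again (2nd free) */
ExecClearTuple(slot);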
The ambiguity about who owns the tuple memory, and when, is the basic
problem, once again. One could argue that it's the caller's fault for
not knowing not to pfree() the tuple after tuplesort_end() is called,
but I don't buy that, because it seems like an undue burden on
callers. It seems okay to provide either a very weak, simple guarantee
about tuple memory lifetime, or a very strong, simple guarantee about
tuple memory lifetime. That's what routines like
tuplesort_gettupleslot() and tuplesort_getindextuple() should limit
their contracts to -- we've had enough problems in this area over the
years that that seems pretty clear to me.
ISTM that an appropriate fix is one that results in
tuplesort_gettupleslot() tuple memory being *consistently* allocated
within the caller's own memory context, at least on versions prior to
Postgres 10 (it's a bit more complicated in Postgres 10). I'll begin
work on a patch for this tomorrow. It's not obvious what approach to
take, because on 9.6+, memory is allocated in tuplecontext, not
sortcontext. A well-targeted fix will not burden low-level READTUP()
routines (which use tuplecontext) with direct knowledge of the problem.
--
Peter Geoghegan
On Thu, Dec 14, 2017 at 5:58 PM, Peter Geoghegan <pg@bowt.ie> wrote:
The ambiguity about who owns the tuple memory, and when, is the basic
problem, once again. One could argue that it's the caller's fault for
not knowing not to pfree() the tuple after tuplesort_end() is called,
but I don't buy that, because it seems like an undue burden on
callers. It seems okay to provide either a very weak, simple guarantee
about tuple memory lifetime, or a very strong, simple guarantee about
tuple memory lifetime. That's what routines like
tuplesort_gettupleslot() and tuplesort_getindextuple() should limit
their contracts to -- we've had enough problems in this area over the
years that that seems pretty clear to me.
ISTM that an appropriate fix is one that results in
tuplesort_gettupleslot() tuple memory being *consistently* allocated
within the caller's own memory context, at least on versions prior to
Postgres 10 (it's a bit more complicated in Postgres 10).
This took longer than expected. Attached patch, which targets 9.6,
shows what I have in mind. I'm going on vacation shortly, but wanted
to post this patch before I leave. It could definitely use some more
testing, since it was written under time pressure. I expect to be back
early in the new year, so if someone feels like taking this off my
hands, they're more than welcome to do so.
Postgres v10 isn't affected because there is no such thing as
should_free from commit f1f5ec1ef onwards (commit fa117ee40
reintroduced a no-copy optimization a while later, which looks like it
made things unsafe again, but it's actually fine -- the
ExecClearTuple() pfree() won't happen, so no crash). Everything is
fine on v10 because the design was improved for v10. On v10,
tuplesort_gettupleslot() calls ExecStoreMinimalTuple() with arguments
that indicate that the memory still belongs to the tuplesort, unless
it was an explicit copy made with heap_copy_minimal_tuple() there and
then, in which case it belongs to the caller. This works because on v10
we're always managing some temp buffers, regardless of state->status.
There is no question of a routine like readtup_alloc() allocating or
using memory without knowing about its lifetime, because it's always
using the smallish slab allocator buffers.
To put it another way, in v10 everything is fine because callers to
routines like tuplesort_gettupleslot() either have very limited
guarantees about fetched tuple memory that is owned by tuplesort (it
can go away almost immediately, on the next call to
tuplesort_gettupleslot()), or infinitely flexible guarantees (it's
unambiguously caller's copy, and lives in caller's memory context).
There is no awkward mix of these two situations, as is the case in
earlier versions. That awkward mix is: "You (caller) should free this
memory, and can rely on this memory sticking around until
tuplesort_end() is called, at which point I (tuplesort) will free it
if you haven't -- though I might *not* free it then, based on whether
or not I felt like using your memory context!". This is an accident
waiting to happen. Remember, the crash isn't a use-after-free; it's a
double-free caused by conflated responsibilities across two low-level,
refcount-like mechanisms. The idea of a tts_shouldFree=true tuple
whose memory exists in a context that isn't under the executor's
direct control is a fundamentally bogus idea. It kinda works most of
the time, but only to the extent that the lifetime of a tuplesort is
accidentally tied to the lifetime of an executor node, without the
tuple/slot making it into something like estate->es_tupleTable (as was
the case in Andreas' example).
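Schematically, the tail of tuplesort_gettupleslot() on v10 makes
ownership unambiguous either way (paraphrasing the v10 sources, post
fa117ee40):

if (copy)
    stup.tuple = heap_copy_minimal_tuple((MinimalTuple) stup.tuple);

/* shouldFree == copy: the slot frees the tuple only when it is
 * caller's own copy; otherwise tuplesort retains ownership */
ExecStoreMinimalTuple((MinimalTuple) stup.tuple, slot, copy);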
An alternative approach to fixing this issue, which I also want to
throw out there, is to perform a simple change within
ExecResetTupleTable() that forces "tts_shouldFree = false" on
ExecClearTuple()'d slots -- a second patch shows what I mean here. All
ExecResetTupleTable() callers pass "shouldFree = false" anyway,
because all memory is about to go away (within FreeExecutorState()).
This second approach seems like a real kludge to me, but it's also a
fix that requires a tiny tweak, rather than non-trivial tuplesort.c
memory management changes. I would feel worse about the idea of
applying this kludge if we had to rely on it forever, but since v10 is
unaffected anyway, we really don't. But it's still an awful kludge. I
find it hard to know what to do here (the vacation is overdue, it
seems).
(Note that v10 does have a minor, related bug, noted in commit
messages of both patches.)
--
Peter Geoghegan
Attachments:
WIP-kludge-fix.patch (text/x-patch)
From ed4909ff9ee88b6b2f66034f043e544dfcfda3bd Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 16 Dec 2017 13:48:01 -0800
Subject: [PATCH] Prevent double-free in ExecEndPlan().
This fixes a crash that can occur on at least 9.5 and 9.6, where some
grouping sets queries may call tuplesort_end() at node shutdown time,
and free tuple memory that the executor imagines it is responsible for,
and must free within ExecClearTuple(). This fix simply prevents the
memory from being freed via the tts_shouldFree mechanism, which is
safe but is also rather a kludge. It may be better than a more
invasive fix in
tuplesort.c, since no fix is really needed for v10+, because v10 changed
the contract that callers have with tuple-fetch tuplesort routines to
something more robust (it would be even more invasive to backpatch the
v10 tuplesort enhancements that make this a non-issue).
In passing, change the memory context switch point within
tuplesort_getdatum() to match tuplesort_gettupleslot(). It's not clear
that this can cause a crash, but it's still wrong for pass-by-value
Datum tuplesorts. Note that this tuplesort_getdatum() fix *is* required
for v10.
Author: Peter Geoghegan
Reported-By: Bernd Helmle
Analyzed-By: Peter Geoghegan, Andreas Seltenreich, Bernd Helmle
Discussion: https://www.postgresql.org/message-id/1512661638.9720.34.camel@oopsware.de
---
src/backend/commands/copy.c | 2 +-
src/backend/executor/execMain.c | 4 ++--
src/backend/executor/execTuples.c | 34 +++++++++++++++-------------------
src/backend/utils/sort/tuplesort.c | 4 ++--
src/include/executor/tuptable.h | 2 +-
5 files changed, 21 insertions(+), 25 deletions(-)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index ca72dd1..0a636b9 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2601,7 +2601,7 @@ CopyFrom(CopyState cstate)
pfree(values);
pfree(nulls);
- ExecResetTupleTable(estate->es_tupleTable, false);
+ ExecResetTupleTable(estate->es_tupleTable);
ExecCloseIndices(resultRelInfo);
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 10ccf59..d657644 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1466,7 +1466,7 @@ ExecEndPlan(PlanState *planstate, EState *estate)
* the TupleTableSlots, since the containing memory context is about to go
* away anyway.
*/
- ExecResetTupleTable(estate->es_tupleTable, false);
+ ExecResetTupleTable(estate->es_tupleTable);
/*
* close the result relation(s) if any, but hold locks until xact commit.
@@ -2903,7 +2903,7 @@ EvalPlanQualEnd(EPQState *epqstate)
}
/* throw away the per-estate tuple table */
- ExecResetTupleTable(estate->es_tupleTable, false);
+ ExecResetTupleTable(estate->es_tupleTable);
/* close any trigger target relations attached to this EState */
foreach(l, estate->es_trig_target_relations)
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index 533050d..d84966f 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -147,14 +147,13 @@ ExecAllocTableSlot(List **tupleTable)
* ExecResetTupleTable
*
* This releases any resources (buffer pins, tupdesc refcounts)
- * held by the tuple table, and optionally releases the memory
- * occupied by the tuple table data structure.
- * It is expected that this routine be called by EndPlan().
+ * held by the tuple table. It never releases the memory
+ * occupied by the tuple table data structure. It is expected
+ * that this routine be called by EndPlan().
* --------------------------------
*/
void
-ExecResetTupleTable(List *tupleTable, /* tuple table */
- bool shouldFree) /* true if we should free memory */
+ExecResetTupleTable(List *tupleTable) /* tuple table */
{
ListCell *lc;
@@ -165,6 +164,17 @@ ExecResetTupleTable(List *tupleTable, /* tuple table */
/* Sanity checks */
Assert(IsA(slot, TupleTableSlot));
+ /*
+ * Kludge: force slot shouldFree policy to match caller's.
+ *
+ * This can prevent problems when executor nodes were not sufficiently
+ * careful about tuple lifetime, for example when tuple comes from a
+ * utility like a tuplesort that doesn't heed tts_shouldFree
+ * conventions in managing underlying memory.
+ */
+ slot->tts_shouldFree = false;
+ slot->tts_shouldFreeMin = false;
+
/* Always release resources and reset the slot to empty */
ExecClearTuple(slot);
if (slot->tts_tupleDescriptor)
@@ -172,21 +182,7 @@ ExecResetTupleTable(List *tupleTable, /* tuple table */
ReleaseTupleDesc(slot->tts_tupleDescriptor);
slot->tts_tupleDescriptor = NULL;
}
-
- /* If shouldFree, release memory occupied by the slot itself */
- if (shouldFree)
- {
- if (slot->tts_values)
- pfree(slot->tts_values);
- if (slot->tts_isnull)
- pfree(slot->tts_isnull);
- pfree(slot);
- }
}
-
- /* If shouldFree, release the list structure */
- if (shouldFree)
- list_free(tupleTable);
}
/* --------------------------------
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 4dd5407..a6c8de4 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -2155,6 +2155,8 @@ tuplesort_getdatum(Tuplesortstate *state, bool forward,
return false;
}
+ MemoryContextSwitchTo(oldcontext);
+
/* Record abbreviated key for caller */
if (state->sortKeys->abbrev_converter && abbrev)
*abbrev = stup.datum1;
@@ -2175,8 +2177,6 @@ tuplesort_getdatum(Tuplesortstate *state, bool forward,
*isNull = false;
}
- MemoryContextSwitchTo(oldcontext);
-
return true;
}
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 5ac0b6a..39a678a 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -141,7 +141,7 @@ typedef struct TupleTableSlot
/* in executor/execTuples.c */
extern TupleTableSlot *MakeTupleTableSlot(void);
extern TupleTableSlot *ExecAllocTableSlot(List **tupleTable);
-extern void ExecResetTupleTable(List *tupleTable, bool shouldFree);
+extern void ExecResetTupleTable(List *tupleTable);
extern TupleTableSlot *MakeSingleTupleTableSlot(TupleDesc tupdesc);
extern void ExecDropSingleTupleTableSlot(TupleTableSlot *slot);
extern void ExecSetSlotDescriptor(TupleTableSlot *slot, TupleDesc tupdesc);
--
2.7.4
WIP-tuplesort-memcontext-fix.patch (text/x-patch)
From 8743b443d91b51371fe5df17795a88210aa0264a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 14 Dec 2017 21:22:41 -0800
Subject: [PATCH] Fix double free of tuple with grouping sets/tuplesort.c.
Have tuplesort_gettupleslot() copy the contents of its current table
slot into caller's memory context if and when caller is obligated to
free memory itself. Otherwise, continue to allocate the memory in
tuplesort-owned context, and never tell the caller to free it
themselves. Repeat this for all "fetch tuple" routines, to be
consistent.
This fixes a crash that can occur on at least 9.5 and 9.6, where some
grouping sets queries may call tuplesort_end() at node shutdown time,
and free tuple memory that the executor imagines it is responsible for,
and must free within ExecClearTuple().
In passing, change the memory context switch point within
tuplesort_getdatum() to match tuplesort_gettupleslot(). It's not clear
that this can cause a crash, but it's still wrong for pass-by-value
Datum tuplesorts.
Only the tuplesort_getdatum() fix is required for v10, because v10
changed the contract that callers have with tuple-fetch tuplesort
routines to something more robust. Fetch tuple routines create full
copies of tuples iff callers explicitly opt in on v10, accepting the
risk that it won't be safe to reference the tuple in the very next call
to the fetch routine.
Author: Peter Geoghegan
Reported-By: Bernd Helmle
Analyzed-By: Peter Geoghegan, Andreas Seltenreich, Bernd Helmle
Discussion: https://www.postgresql.org/message-id/1512661638.9720.34.camel@oopsware.de
---
src/backend/utils/sort/tuplesort.c | 191 ++++++++++++++++++++++++-------------
1 file changed, 124 insertions(+), 67 deletions(-)
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 4dd5407..f7f8b83 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -264,6 +264,7 @@ struct Tuplesortstate
int tapeRange; /* maxTapes-1 (Knuth's P) */
MemoryContext sortcontext; /* memory context holding most sort data */
MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
+ MemoryContext callercontext; /* tuplesort caller's own context */
LogicalTapeSet *tapeset; /* logtape.c object for tapes in a temp file */
/*
@@ -390,10 +391,11 @@ struct Tuplesortstate
* sorted order).
*/
int64 spacePerTape; /* Space (memory) for tuples (not slots) */
+ int64 *mergeosize; /* "overflow" allocation size */
char **mergetuples; /* Each tape's memory allocation */
char **mergecurrent; /* Current offset into each tape's memory */
char **mergetail; /* Last item's start point for each tape */
- char **mergeoverflow; /* Retail palloc() "overflow" for each tape */
+ char **mergeoverflow; /* "overflow" scratch buffer for each tape */
/*
* Variables for Algorithm D. Note that destTape is a "logical" tape
@@ -534,7 +536,11 @@ struct Tuplesortstate
* Note that this places a responsibility on readtup and copytup routines
* to use the right memory context for these tuples (and to not use the
* reset context for anything whose lifetime needs to span multiple
- * external sort runs).
+ * external sort runs). Note also that readtup routines need to make sure
+ * that memory that caller must free is in fact allocated in caller
+ * context, not in either of the two tuplesort memory contexts. Callers
+ * must either have full ownership of memory for returned tuples, or
+ * leave it all entirely up to us.
*/
/* When using this macro, beware of double evaluation of len */
@@ -691,6 +697,7 @@ tuplesort_begin_common(int workMem, bool randomAccess)
state->availMem = state->allowedMem;
state->sortcontext = sortcontext;
state->tuplecontext = tuplecontext;
+ state->callercontext = NULL;
state->tapeset = NULL;
state->memtupcount = 0;
@@ -721,6 +728,12 @@ tuplesort_begin_common(int workMem, bool randomAccess)
state->result_tape = -1; /* flag that result tape has not been formed */
+ /*
+ * Let caller initialize state->callercontext. The usage pattern for
+ * state->callercontext is that it must be set and used in externally
+ * callable functions. Functions with local linkage do a conventional
+ * switch-and-unswitch, to avoid clobbering state->callercontext.
+ */
MemoryContextSwitchTo(oldcontext);
return state;
@@ -733,11 +746,13 @@ tuplesort_begin_heap(TupleDesc tupDesc,
bool *nullsFirstFlags,
int workMem, bool randomAccess)
{
- Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
- MemoryContext oldcontext;
+ Tuplesortstate *state;
+ MemoryContext oldcontext = CurrentMemoryContext;
int i;
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ state = tuplesort_begin_common(workMem, randomAccess);
+ MemoryContextSwitchTo(state->sortcontext);
+ state->callercontext = oldcontext;
AssertArg(nkeys > 0);
@@ -804,14 +819,16 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
int workMem, bool randomAccess)
{
- Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+ Tuplesortstate *state;
ScanKey indexScanKey;
- MemoryContext oldcontext;
+ MemoryContext oldcontext = CurrentMemoryContext;
int i;
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ state = tuplesort_begin_common(workMem, randomAccess);
+ MemoryContextSwitchTo(state->sortcontext);
+ state->callercontext = oldcontext;
#ifdef TRACE_SORT
if (trace_sort)
@@ -898,12 +915,14 @@ tuplesort_begin_index_btree(Relation heapRel,
bool enforceUnique,
int workMem, bool randomAccess)
{
- Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+ Tuplesortstate *state;
ScanKey indexScanKey;
- MemoryContext oldcontext;
+ MemoryContext oldcontext = CurrentMemoryContext;
int i;
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ state = tuplesort_begin_common(workMem, randomAccess);
+ MemoryContextSwitchTo(state->sortcontext);
+ state->callercontext = oldcontext;
#ifdef TRACE_SORT
if (trace_sort)
@@ -974,10 +993,12 @@ tuplesort_begin_index_hash(Relation heapRel,
uint32 hash_mask,
int workMem, bool randomAccess)
{
- Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
- MemoryContext oldcontext;
+ Tuplesortstate *state;
+ MemoryContext oldcontext = CurrentMemoryContext;
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ state = tuplesort_begin_common(workMem, randomAccess);
+ MemoryContextSwitchTo(state->sortcontext);
+ state->callercontext = oldcontext;
#ifdef TRACE_SORT
if (trace_sort)
@@ -1010,12 +1031,14 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
bool nullsFirstFlag,
int workMem, bool randomAccess)
{
- Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
- MemoryContext oldcontext;
+ Tuplesortstate *state;
+ MemoryContext oldcontext = CurrentMemoryContext;
int16 typlen;
bool typbyval;
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ state = tuplesort_begin_common(workMem, randomAccess);
+ MemoryContextSwitchTo(state->sortcontext);
+ state->callercontext = oldcontext;
#ifdef TRACE_SORT
if (trace_sort)
@@ -1193,7 +1216,8 @@ tuplesort_end(Tuplesortstate *state)
/*
* Free the per-sort memory context, thereby releasing all working memory,
- * including the Tuplesortstate struct itself.
+ * including the Tuplesortstate struct itself. Caller tuples where
+ * should_free was set to TRUE remain allocated, however.
*/
MemoryContextDelete(state->sortcontext);
}
@@ -1335,9 +1359,10 @@ noalloc:
void
tuplesort_puttupleslot(Tuplesortstate *state, TupleTableSlot *slot)
{
- MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
SortTuple stup;
+ state->callercontext = MemoryContextSwitchTo(state->sortcontext);
+
/*
* Copy the given tuple into memory we control, and decrease availMem.
* Then call the common code.
@@ -1346,7 +1371,7 @@ tuplesort_puttupleslot(Tuplesortstate *state, TupleTableSlot *slot)
puttuple_common(state, &stup);
- MemoryContextSwitchTo(oldcontext);
+ MemoryContextSwitchTo(state->callercontext);
}
/*
@@ -1357,18 +1382,18 @@ tuplesort_puttupleslot(Tuplesortstate *state, TupleTableSlot *slot)
void
tuplesort_putheaptuple(Tuplesortstate *state, HeapTuple tup)
{
- MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
SortTuple stup;
/*
* Copy the given tuple into memory we control, and decrease availMem.
* Then call the common code.
*/
+ state->callercontext = MemoryContextSwitchTo(state->sortcontext);
COPYTUP(state, &stup, (void *) tup);
puttuple_common(state, &stup);
- MemoryContextSwitchTo(oldcontext);
+ MemoryContextSwitchTo(state->callercontext);
}
/*
@@ -1380,11 +1405,11 @@ tuplesort_putindextuplevalues(Tuplesortstate *state, Relation rel,
ItemPointer self, Datum *values,
bool *isnull)
{
- MemoryContext oldcontext = MemoryContextSwitchTo(state->tuplecontext);
SortTuple stup;
Datum original;
IndexTuple tuple;
+ state->callercontext = MemoryContextSwitchTo(state->tuplecontext);
stup.tuple = index_form_tuple(RelationGetDescr(rel), values, isnull);
tuple = ((IndexTuple) stup.tuple);
tuple->t_tid = *self;
@@ -1445,7 +1470,7 @@ tuplesort_putindextuplevalues(Tuplesortstate *state, Relation rel,
puttuple_common(state, &stup);
- MemoryContextSwitchTo(oldcontext);
+ MemoryContextSwitchTo(state->callercontext);
}
/*
@@ -1456,7 +1481,6 @@ tuplesort_putindextuplevalues(Tuplesortstate *state, Relation rel,
void
tuplesort_putdatum(Tuplesortstate *state, Datum val, bool isNull)
{
- MemoryContext oldcontext = MemoryContextSwitchTo(state->tuplecontext);
SortTuple stup;
/*
@@ -1470,6 +1494,7 @@ tuplesort_putdatum(Tuplesortstate *state, Datum val, bool isNull)
* abbreviated value if abbreviation is happening, otherwise it's
* identical to stup.tuple.
*/
+ state->callercontext = MemoryContextSwitchTo(state->tuplecontext);
if (isNull || !state->tuples)
{
@@ -1529,7 +1554,7 @@ tuplesort_putdatum(Tuplesortstate *state, Datum val, bool isNull)
puttuple_common(state, &stup);
- MemoryContextSwitchTo(oldcontext);
+ MemoryContextSwitchTo(state->callercontext);
}
/*
@@ -1743,7 +1768,7 @@ consider_abort_common(Tuplesortstate *state)
void
tuplesort_performsort(Tuplesortstate *state)
{
- MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ state->callercontext = MemoryContextSwitchTo(state->sortcontext);
#ifdef TRACE_SORT
if (trace_sort)
@@ -1816,7 +1841,7 @@ tuplesort_performsort(Tuplesortstate *state)
}
#endif
- MemoryContextSwitchTo(oldcontext);
+ MemoryContextSwitchTo(state->callercontext);
}
/*
@@ -1824,6 +1849,9 @@ tuplesort_performsort(Tuplesortstate *state)
* direction into *stup. Returns FALSE if no more tuples.
* If *should_free is set, the caller must pfree stup.tuple when done with it.
* Otherwise, caller should not use tuple following next call here.
+ *
+ * Note: Caller can expect tuple to be allocated in their own memory context
+ * when should_free is TRUE.
*/
static bool
tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
@@ -2046,22 +2074,24 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
* NULL value in leading attribute will set abbreviated value to zeroed
* representation, which caller may rely on in abbreviated inequality check.
*
- * The slot receives a copied tuple (sometimes allocated in caller memory
- * context) that will stay valid regardless of future manipulations of the
- * tuplesort's state.
+ * The slot receives a copied tuple that will stay valid regardless of future
+ * manipulations of the tuplesort's state. This includes calls to
+ * tuplesort_end() -- tuple will be allocated in caller's own context to make
+ * this work.
*/
bool
tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
TupleTableSlot *slot, Datum *abbrev)
{
- MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
SortTuple stup;
bool should_free;
+ state->callercontext = MemoryContextSwitchTo(state->sortcontext);
+
if (!tuplesort_gettuple_common(state, forward, &stup, &should_free))
stup.tuple = NULL;
- MemoryContextSwitchTo(oldcontext);
+ MemoryContextSwitchTo(state->callercontext);
if (stup.tuple)
{
@@ -2089,18 +2119,20 @@ tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
* Returns NULL if no more tuples. If *should_free is set, the
* caller must pfree the returned tuple when done with it.
* If it is not set, caller should not use tuple following next
- * call here.
+ * call here, and can expect tuple to be in their own memory
+ * context.
*/
HeapTuple
tuplesort_getheaptuple(Tuplesortstate *state, bool forward, bool *should_free)
{
- MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
SortTuple stup;
+ state->callercontext = MemoryContextSwitchTo(state->sortcontext);
+
if (!tuplesort_gettuple_common(state, forward, &stup, should_free))
stup.tuple = NULL;
- MemoryContextSwitchTo(oldcontext);
+ MemoryContextSwitchTo(state->callercontext);
return stup.tuple;
}
@@ -2116,13 +2148,14 @@ IndexTuple
tuplesort_getindextuple(Tuplesortstate *state, bool forward,
bool *should_free)
{
- MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
SortTuple stup;
+ state->callercontext = MemoryContextSwitchTo(state->sortcontext);
+
if (!tuplesort_gettuple_common(state, forward, &stup, should_free))
stup.tuple = NULL;
- MemoryContextSwitchTo(oldcontext);
+ MemoryContextSwitchTo(state->callercontext);
return (IndexTuple) stup.tuple;
}
@@ -2132,7 +2165,7 @@ tuplesort_getindextuple(Tuplesortstate *state, bool forward,
* Returns FALSE if no more datums.
*
* If the Datum is pass-by-ref type, the returned value is freshly palloc'd
- * and is now owned by the caller.
+ * in caller's context, and is now owned by the caller.
*
* Caller may optionally be passed back abbreviated value (on TRUE return
* value) when abbreviation was used, which can be used to cheaply avoid
@@ -2145,16 +2178,19 @@ bool
tuplesort_getdatum(Tuplesortstate *state, bool forward,
Datum *val, bool *isNull, Datum *abbrev)
{
- MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
SortTuple stup;
bool should_free;
+ state->callercontext = MemoryContextSwitchTo(state->sortcontext);
+
if (!tuplesort_gettuple_common(state, forward, &stup, &should_free))
{
- MemoryContextSwitchTo(oldcontext);
+ MemoryContextSwitchTo(state->callercontext);
return false;
}
+ MemoryContextSwitchTo(state->callercontext);
+
/* Record abbreviated key for caller */
if (state->sortKeys->abbrev_converter && abbrev)
*abbrev = stup.datum1;
@@ -2175,8 +2211,6 @@ tuplesort_getdatum(Tuplesortstate *state, bool forward,
*isNull = false;
}
- MemoryContextSwitchTo(oldcontext);
-
return true;
}
@@ -2188,8 +2222,6 @@ tuplesort_getdatum(Tuplesortstate *state, bool forward,
bool
tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples, bool forward)
{
- MemoryContext oldcontext;
-
/*
* We don't actually support backwards skip yet, because no callers need
* it. The API is designed to allow for that later, though.
@@ -2225,7 +2257,7 @@ tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples, bool forward)
* We could probably optimize these cases better, but for now it's
* not worth the trouble.
*/
- oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ state->callercontext = MemoryContextSwitchTo(state->sortcontext);
while (ntuples-- > 0)
{
SortTuple stup;
@@ -2234,14 +2266,14 @@ tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples, bool forward)
if (!tuplesort_gettuple_common(state, forward,
&stup, &should_free))
{
- MemoryContextSwitchTo(oldcontext);
+ MemoryContextSwitchTo(state->callercontext);
return false;
}
if (should_free && stup.tuple)
pfree(stup.tuple);
CHECK_FOR_INTERRUPTS();
}
- MemoryContextSwitchTo(oldcontext);
+ MemoryContextSwitchTo(state->callercontext);
return true;
default:
@@ -2363,6 +2395,7 @@ inittapes(Tuplesortstate *state)
state->mergelast = (int *) palloc0(maxTapes * sizeof(int));
state->mergeavailslots = (int *) palloc0(maxTapes * sizeof(int));
state->mergeavailmem = (int64 *) palloc0(maxTapes * sizeof(int64));
+ state->mergeosize = (int64 *) palloc0(maxTapes * sizeof(int64 *));
state->mergetuples = (char **) palloc0(maxTapes * sizeof(char *));
state->mergecurrent = (char **) palloc0(maxTapes * sizeof(char *));
state->mergetail = (char **) palloc0(maxTapes * sizeof(char *));
@@ -2978,6 +3011,7 @@ mergebatch(Tuplesortstate *state, int64 spacePerTape)
state->mergetuples[srcTape] = mergetuples;
state->mergecurrent[srcTape] = mergetuples;
state->mergetail[srcTape] = mergetuples;
+ state->mergeosize[srcTape] = 0;
state->mergeoverflow[srcTape] = NULL;
}
@@ -2993,7 +3027,8 @@ mergebatch(Tuplesortstate *state, int64 spacePerTape)
* reset to indicate that all memory may be reused.
*
* This routine must deal with fixing up the tuple that is about to be returned
- * to the client, due to "overflow" allocations.
+ * to the client, due to "overflow" allocations, while making sure that tuple
+ * memory ends up in caller's own context.
*/
static void
mergebatchone(Tuplesortstate *state, int srcTape, SortTuple *rtup,
@@ -3033,17 +3068,28 @@ mergebatchone(Tuplesortstate *state, int srcTape, SortTuple *rtup,
/*
* Handle an "overflow" retail palloc.
*
- * This is needed when we run out of tuple memory for the tape.
+ * This is needed when we run out of tuple memory for the tape. Note
+ * that we actually free the memory used for the overflow allocation
+ * directly, and provide a copy allocated in caller's context to
+ * caller.
*/
state->mergecurrent[srcTape] = state->mergetuples[srcTape];
state->mergetail[srcTape] = state->mergetuples[srcTape];
if (rtup->tuple)
{
+ Assert(state->mergeosize[srcTape] > 0);
Assert(rtup->tuple == (void *) state->mergeoverflow[srcTape]);
+ rtup->tuple = MemoryContextAlloc(state->callercontext,
+ state->mergeosize[srcTape]);
+ MOVETUP(rtup->tuple, state->mergeoverflow[srcTape],
+ state->mergeosize[srcTape]);
/* Caller should free palloc'd tuple */
*should_free = true;
+ /* Free scratch buffer */
+ pfree(state->mergeoverflow[srcTape]);
}
+ state->mergeosize[srcTape] = 0;
state->mergeoverflow[srcTape] = NULL;
}
}
@@ -3064,7 +3110,7 @@ mergebatchfreetape(Tuplesortstate *state, int srcTape, SortTuple *rtup,
Assert(state->status == TSS_FINALMERGE);
/*
- * Tuple may or may not already be an overflow allocation from
+ * Tuple may or may not already be caller-owned allocation from
* mergebatchone()
*/
if (!*should_free && rtup->tuple)
@@ -3072,17 +3118,15 @@ mergebatchfreetape(Tuplesortstate *state, int srcTape, SortTuple *rtup,
/*
* Final tuple still in tape's batch allocation.
*
- * Return palloc()'d copy to caller, and have it freed in a similar
- * manner to overflow allocation. Otherwise, we'd free batch memory
- * and pass back a pointer to garbage. Note that we deliberately
- * allocate this in the parent tuplesort context, to be on the safe
- * side.
+ * Return palloc()'d copy to caller, in caller's own memory context.
+ * Otherwise, we'd free batch memory and pass back a pointer to
+ * garbage.
*/
Size tuplen;
void *oldTuple = rtup->tuple;
tuplen = state->mergecurrent[srcTape] - state->mergetail[srcTape];
- rtup->tuple = MemoryContextAlloc(state->sortcontext, tuplen);
+ rtup->tuple = MemoryContextAlloc(state->callercontext, tuplen);
MOVETUP(rtup->tuple, oldTuple, tuplen);
*should_free = true;
}
@@ -3138,11 +3182,14 @@ mergebatchalloc(Tuplesortstate *state, int tapenum, Size tuplen)
* will be detected quickly, in a similar fashion to a LACKMEM()
* condition, and should not happen again before a new round of
* preloading for caller's tape. Note that we deliberately allocate
- * this in the parent tuplesort context, to be on the safe side.
+ * this in the parent tuplesort context, to be on the safe side. Note
+ * also that tuplesort caller will ultimately get their own copy of
+ * this, in their own memory context.
*
* Sometimes, this does not happen because merging runs out of slots
* before running out of memory.
*/
+ state->mergeosize[tapenum] = tuplen;
ret = state->mergeoverflow[tapenum] =
MemoryContextAlloc(state->sortcontext, tuplen);
}
@@ -3460,7 +3507,7 @@ dumpbatch(Tuplesortstate *state, bool alltuples)
void
tuplesort_rescan(Tuplesortstate *state)
{
- MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ state->callercontext = MemoryContextSwitchTo(state->sortcontext);
Assert(state->randomAccess);
@@ -3486,7 +3533,7 @@ tuplesort_rescan(Tuplesortstate *state)
break;
}
- MemoryContextSwitchTo(oldcontext);
+ MemoryContextSwitchTo(state->callercontext);
}
/*
@@ -3495,7 +3542,7 @@ tuplesort_rescan(Tuplesortstate *state)
void
tuplesort_markpos(Tuplesortstate *state)
{
- MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ state->callercontext = MemoryContextSwitchTo(state->sortcontext);
Assert(state->randomAccess);
@@ -3517,7 +3564,7 @@ tuplesort_markpos(Tuplesortstate *state)
break;
}
- MemoryContextSwitchTo(oldcontext);
+ MemoryContextSwitchTo(state->callercontext);
}
/*
@@ -3527,7 +3574,7 @@ tuplesort_markpos(Tuplesortstate *state)
void
tuplesort_restorepos(Tuplesortstate *state)
{
- MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ state->callercontext = MemoryContextSwitchTo(state->sortcontext);
Assert(state->randomAccess);
@@ -3550,7 +3597,7 @@ tuplesort_restorepos(Tuplesortstate *state)
break;
}
- MemoryContextSwitchTo(oldcontext);
+ MemoryContextSwitchTo(state->callercontext);
}
/*
@@ -3887,7 +3934,8 @@ markrunend(Tuplesortstate *state, int tapenum)
* from tape's batch allocation. Otherwise, callers must pfree() or
* reset tuple child memory context, and account for that with a
* FREEMEM(). Currently, this only ever needs to happen in WRITETUP()
- * routines.
+ * routines, or when memory is allocated in caller's own context, and
+ * so must be freed on caller's own terms.
*/
static void *
readtup_alloc(Tuplesortstate *state, int tapenum, Size tuplen)
@@ -3903,15 +3951,24 @@ readtup_alloc(Tuplesortstate *state, int tapenum, Size tuplen)
*/
return mergebatchalloc(state, tapenum, tuplen);
}
- else
+ else if (state->status == TSS_BUILDRUNS)
{
char *ret;
- /* Batch allocation yet to be performed */
+ /*
+ * Batch allocation yet to be performed (non-final merge). Use
+ * tuplecontext to avoid fragmentation, much like tuplesort_putXXX and
+ * copytup_XXX routines during initial run generation.
+ */
ret = MemoryContextAlloc(state->tuplecontext, tuplen);
USEMEM(state, GetMemoryChunkSpace(ret));
return ret;
}
+ else
+ {
+ /* Must be allocated within tuplesort caller's context */
+ return MemoryContextAlloc(state->callercontext, tuplen);
+ }
}
--
2.7.4
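A minimal standalone model of the patch's core idea may help here: the
caller's memory context is saved at the usual MemoryContextSwitchTo()
entry point and used for any allocation the caller will own. This is
plain C, not backend code; the Arena type and every name below are
invented for illustration only.

#include <stdio.h>
#include <stdlib.h>

typedef struct Arena { const char *name; } Arena;

static Arena caller_arena = {"caller"};
static Arena sort_arena = {"sort"};
static Arena *current_arena = &caller_arena;

/* analogous to MemoryContextSwitchTo(): returns the previous context */
static Arena *
switch_to(Arena *arena)
{
    Arena *old = current_arena;
    current_arena = arena;
    return old;
}

static void *
arena_alloc(Arena *arena, size_t size)
{
    printf("allocating %zu bytes in \"%s\" context\n", size, arena->name);
    return malloc(size);
}

typedef struct SortState
{
    Arena *sortcontext;    /* owned by the sort, reset at end */
    Arena *callercontext;  /* saved on entry, as the patch does */
} SortState;

/* analogous to a public fetch routine such as tuplesort_gettupleslot() */
static void *
sort_fetch(SortState *state, size_t tuplen)
{
    void *tuple;

    /* enter the sort's own context, remembering the caller's */
    state->callercontext = switch_to(state->sortcontext);

    /* ... internal merge work would happen here, in sortcontext ... */

    /* the tuple handed back is allocated in the *caller's* context */
    tuple = arena_alloc(state->callercontext, tuplen);

    switch_to(state->callercontext);
    return tuple;
}

int
main(void)
{
    SortState state = {&sort_arena, NULL};
    void *tuple = sort_fetch(&state, 64);

    /* tuple survives any later reset of sortcontext; caller frees it */
    free(tuple);
    return 0;
}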
On Sun, Dec 17, 2017 at 10:40 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Thu, Dec 14, 2017 at 5:58 PM, Peter Geoghegan <pg@bowt.ie> wrote:
The ambiguity about who owns the tuple memory when is the basic
problem, once again. One could argue that it's the caller's fault for
not knowing to not pfree() the tuple after tuplesort_end() is called,
but I don't buy that because it seems like an undue burden on callers
to do that. It seems okay to either provide a very weak, simple
guarantee about tuple memory lifetime, or very strong simple guarantee
about tuple memory lifetime. That's what routines like
tuplesort_gettupleslot() and tuplesort_getindextuple() should limit
their contracts to -- we've had enough problems in this area over the
years that that seems pretty clear to me.
ISTM that an appropriate fix is one that results in having
tuplesort_gettupleslot() tuple memory that is *consistently* allocated
within caller's own memory context, at least on versions prior to
Postgres 10 (it's a bit more complicated in Postgres 10).
This took longer than expected. Attached patch, which targets 9.6,
shows what I have in mind. I'm going on vacation shortly, but wanted
to post this patch before I leave. It could definitely use some more
testing, since it was written under time pressure. I expect to be back
early in the new year, so if someone feels like taking this off my
hands, they're more than welcome to do so.
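To make the ownership hazard concrete, here is a standalone sketch of
the crashing call pattern in plain C -- not the actual tuplesort code;
all names are invented. The sort's context reset at end() frees
everything the sort allocated, so a caller that was told should_free
and then pfree()s the same tuple frees it twice.

#include <stdlib.h>

/* toy stand-in for a tuplesort whose context owns all its allocations */
typedef struct Sort { void *last_tuple; } Sort;

/* analogous to a fetch routine reporting should_free = true */
static void *
sort_get(Sort *sort, int *should_free)
{
    sort->last_tuple = malloc(64); /* lives in the sort's own "context" */
    *should_free = 1;              /* caller is told it owns the tuple  */
    return sort->last_tuple;
}

/* analogous to tuplesort_end(): the context reset frees everything */
static void
sort_end(Sort *sort)
{
    free(sort->last_tuple);
    sort->last_tuple = NULL;
}

int
main(void)
{
    Sort sort = {NULL};
    int should_free;
    void *tuple = sort_get(&sort, &should_free);

    sort_end(&sort);  /* e.g. grouping sets shutting the node down */

    if (should_free)
        free(tuple);  /* double free: the reported crash in miniature */
    return 0;
}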
Could you add that to the next commit fest registered as a bug fix? We
don't want to lose track of that and this should be reviewed.
--
Michael
On Sun, Dec 17, 2017 at 2:46 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
Could you add that to the next commit fest registered as a bug fix? We
don't want to lose track of that and this should be reviewed.
Done.
--
Peter Geoghegan
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed
Hi,
All information is related to WIP-tuplesort-memcontext-fix.patch.
The patch applies and fixes the bug on REL9_6_STABLE and LGTM.
However, REL9_5_STABLE is affected as well, and the patch is not
applicable to it. That may warrant a separate entry, since the
differences in tuplesort there may require some changes to the fix too.
The new status of this patch is: Ready for Committer
Aleksandr Parfenov <a.parfenov@postgrespro.ru> writes:
The new status of this patch is: Ready for Committer
I don't feel particularly comfortable committing a patch that
was clearly labeled as a rushed draft by its author.
Peter, where do you stand on this work?
In a quick look at the patches, WIP-kludge-fix.patch seems clearly
unacceptable for back-patching because it changes the signature and
behavior of ExecResetTupleTable, which external code might well be using.
regards, tom lane
On Wed, Jan 17, 2018 at 11:23 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Aleksandr Parfenov <a.parfenov@postgrespro.ru> writes:
The new status of this patch is: Ready for Committer
I don't feel particularly comfortable committing a patch that
was clearly labeled as a rushed draft by its author.
Peter, where do you stand on this work?
I would like to take another pass over
WIP-tuplesort-memcontext-fix.patch, to be on the safe side. I'm
currently up to my neck in parallel CREATE INDEX work, though, and
would prefer to avoid context switching for a week or two, if
possible. How time sensitive do you think this is?
I think we'll end up doing this:
* Committing the minimal modifications made (in both WIP patches) to
tuplesort_getdatum() to both v10 and master branches.
tuplesort_getdatum() must follow the example of
tuplesort_gettupleslot() on these branches, since
tuplesort_gettupleslot() already manages to get everything right in
recent releases. (There is no known tuplesort_getdatum() crash on
these versions, but this still seems necessary on general principle.)
* Committing something pretty close to
WIP-tuplesort-memcontext-fix.patch to 9.6.
* Committing another fix to 9.5. This fix will apply the same
principles as WIP-tuplesort-memcontext-fix.patch, but will be simpler
mechanically, since the whole batch memory mechanism added to
tuplesort.c for 9.6 isn't involved.
I'm not sure whether or not we should also apply this
still-to-be-written 9.5 patch to 9.4 and 9.3, since those versions
don't have grouping sets, and so cannot crash. ISTM that we should
leave them alone, since tuplesort has had this problem forever.
Perhaps you should go ahead and commit a patch with just the changes
to tuplesort_getdatum() now. This part seems low risk, and worth doing
in a single, (almost) consistent pass over the back branches. This
part is a trivial backpatch, and could be thought of as an independent
problem. (It will be nice to get v10 and master branches completely
out of the way quickly.)
In a quick look at the patches, WIP-kludge-fix.patch seems clearly
unacceptable for back-patching because it changes the signature and
behavior of ExecResetTupleTable, which external code might well be using.
The signature change isn't required, and it was silly of me to add it.
But I also really dislike the general approach it takes, and mostly
posted it because I thought that it was a useful counterpoint.
--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
On Wed, Jan 17, 2018 at 11:23 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I don't feel particularly comfortable committing a patch that
was clearly labeled as a rushed draft by its author.
Peter, where do you stand on this work?
I would like to take another pass over
WIP-tuplesort-memcontext-fix.patch, to be on the safe side. I'm
currently up to my neck in parallel CREATE INDEX work, though, and
would prefer to avoid context switching for a week or two, if
possible. How time sensitive do you think this is?
Probably not very. It'd be nice to have it done by the next minor
releases, ie before 5-Feb ... but given that these bugs are years
old, missing that deadline would not be catastrophic.
I'm not sure whether or not we should also apply this
still-to-be-written 9.5 patch to 9.4 and 9.3, since those versions
don't have grouping sets, and so cannot crash. ISTM that we should
leave them alone, since tuplesort has had this problem forever.
+1. If the problem isn't known to be reproducible in those branches,
the risk of adding new bugs seems to outweigh any benefit.
regards, tom lane
On Wed, Jan 17, 2018 at 12:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Probably not very. It'd be nice to have it done by the next minor
releases, ie before 5-Feb ... but given that these bugs are years
old, missing that deadline would not be catastrophic.
Got it.
I'm not sure whether or not we should also apply this
still-to-be-written 9.5 patch to 9.4 and 9.3, since those versions
don't have grouping sets, and so cannot crash. ISTM that we should
leave them alone, since tuplesort has had this problem forever.
+1. If the problem isn't known to be reproducible in those branches,
the risk of adding new bugs seems to outweigh any benefit.
You could make the same objection to changing tuplesort_getdatum()
outside of the master branch, though. I think that going back further
than that for the (arguably independent) tuplesort_getdatum() subset
fix might still be a good idea. I wonder where you stand on this.
--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
On Wed, Jan 17, 2018 at 12:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
+1. If the problem isn't known to be reproducible in those branches,
the risk of adding new bugs seems to outweigh any benefit.
You could make the same objection to changing tuplesort_getdatum()
outside of the master branch, though. I think that going back further
than that for the (arguably independent) tuplesort_getdatum() subset
fix might still be a good idea. I wonder where you stand on this.
I haven't been following the thread very closely, so I don't have an
opinion on that.
regards, tom lane
On Wed, Jan 17, 2018 at 1:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
You could make the same objection to changing tuplesort_getdatum()
outside of the master branch, though. I think that going back further
than that for the (arguably independent) tuplesort_getdatum() subset
fix might still be a good idea. I wonder where you stand on this.
I haven't been following the thread very closely, so I don't have an
opinion on that.
A complicating factor for this fix of mine is that mode_final() seems
to have its own ideas about tuple memory lifetime, over and above what
tuplesort_getdatum() explicitly promises, as can be seen here:
/*
* Note: we *cannot* clean up the tuplesort object here, because the value
* to be returned is allocated inside its sortcontext. We could use
* datumCopy to copy it out of there, but it doesn't seem worth the
* trouble, since the cleanup callback will clear the tuplesort later.
*/
My WIP-tuplesort-memcontext-fix.patch fix is premised on the idea that
nodeAgg.c/grouping sets got it right: nodeAgg.c should be able to
continue to assume that in "owning" the memory used for a tuple (in a
table slot), it has it in its own memory context -- otherwise, the
whole tts_shouldFree tuple slot mechanism is prone to double-frees.
This comment directly contradicts/undermines that premise.
ISTM that either grouping sets or mode_final() must necessarily be
wrong, because each oversteps, and infers a different contract from
tuplesort tuple fetching routines (different assumptions about memory
contexts are made in each case). Only one can be right, unless it's
okay to have one rule for tuplesort_getdatum() and another for
tuplesort_gettupleslot() (which seems questionable to me). I still
think that grouping sets is right (and that mode_final() is wrong). Do
you?
--
Peter Geoghegan
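Stated as code, the two inferred contracts look like this. A standalone
sketch in plain C with invented names; each caller is correct only
under its own rule, which is why tuplesort has to pick exactly one.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct Sort { char *mem; } Sort;

static void sort_start(Sort *s) { s->mem = strdup("result"); }
static void sort_end(Sort *s)   { free(s->mem); s->mem = NULL; }

/* rule 1, "caller owns": fetch hands back a caller-owned copy */
static char *
fetch_copy(Sort *s)
{
    return strdup(s->mem);
}

/* rule 2, "sort owns": fetch hands back a pointer into the sort */
static char *
fetch_pointer(Sort *s)
{
    return s->mem;
}

int
main(void)
{
    /* grouping-sets-style caller: correct only under rule 1 */
    Sort a;
    sort_start(&a);
    char *t1 = fetch_copy(&a);
    sort_end(&a);        /* node shutdown, like tuplesort_end() */
    printf("%s\n", t1);  /* still valid, because the caller owns it */
    free(t1);

    /* mode_final()-style caller: correct only under rule 2 */
    Sort b;
    sort_start(&b);
    char *t2 = fetch_pointer(&b);
    printf("%s\n", t2);  /* must happen before the cleanup below */
    sort_end(&b);        /* deferred cleanup callback */
    return 0;
}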
On Wed, Jan 17, 2018 at 2:23 PM, Peter Geoghegan <pg@bowt.ie> wrote:
A complicating factor for this fix of mine is that mode_final() seems
to have its own ideas about tuple memory lifetime, over and above what
tuplesort_getdatum() explicitly promises, as can be seen here:
/*
* Note: we *cannot* clean up the tuplesort object here, because the value
* to be returned is allocated inside its sortcontext. We could use
* datumCopy to copy it out of there, but it doesn't seem worth the
* trouble, since the cleanup callback will clear the tuplesort later.
*/
ISTM that either grouping sets or mode_final() must necessarily be
wrong, because each oversteps, and infers a different contract from
tuplesort tuple fetching routines (different assumptions about memory
contexts are made in each case). Only one can be right, unless it's
okay to have one rule for tuplesort_getdatum() and another for
tuplesort_gettupleslot() (which seems questionable to me). I still
think that grouping sets is right (and that mode_final() is wrong). Do
you?
It would be nice to get an opinion on this mode_final() + tuplesort
memory lifetime business from you, Tom.
Note that you removed the quoted comment within be0ebb65, back in October.
--
Peter Geoghegan
On 18 January 2018 at 03:23, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Aleksandr Parfenov <a.parfenov@postgrespro.ru> writes:
The new status of this patch is: Ready for Committer
I don't feel particularly comfortable committing a patch that
was clearly labeled as a rushed draft by its author.
Peter, where do you stand on this work?
In a quick look at the patches, WIP-kludge-fix.patch seems clearly
unacceptable for back-patching because it changes the signature and
behavior of ExecResetTupleTable, which external code might well be using.
Definitely is using, in the case of BDR and pglogical. But we can patch in
a version check easily enough.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Feb 6, 2018 at 5:54 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
In a quick look at the patches, WIP-kludge-fix.patch seems clearly
unacceptable for back-patching because it changes the signature and
behavior of ExecResetTupleTable, which external code might well be using.
Definitely is using, in the case of BDR and pglogical. But we can patch in a
version check easily enough.
That won't be necessary. The WIP-kludge-fix.patch approach is never
going to be used, and was only really posted for illustrative
purposes.
--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
On Wed, Jan 17, 2018 at 2:23 PM, Peter Geoghegan <pg@bowt.ie> wrote:
A complicating factor for this fix of mine is that mode_final() seems
to have its own ideas about tuple memory lifetime, over and above what
tuplesort_getdatum() explicitly promises, as can be seen here:
/*
* Note: we *cannot* clean up the tuplesort object here, because the value
* to be returned is allocated inside its sortcontext. We could use
* datumCopy to copy it out of there, but it doesn't seem worth the
* trouble, since the cleanup callback will clear the tuplesort later.
*/
ISTM that either grouping sets or mode_final() must necessarily be
wrong, because each oversteps, and infers a different contract from
tuplesort tuple fetching routines (different assumptions about memory
contexts are made in each case). Only one can be right, unless it's
okay to have one rule for tuplesort_getdatum() and another for
tuplesort_gettupleslot() (which seems questionable to me). I still
think that grouping sets is right (and that mode_final() is wrong). Do
you?
It would be nice to get an opinion on this mode_final() + tuplesort
memory lifetime business from you, Tom.
I'm fairly sure that that bit in mode_final() was just a hack to make
it work. If there's a better, more principled way, let's go for it.
Note that you removed the quoted comment within be0ebb65, back in October.
There were multiple instances of basically that same comment before.
AFAICS I just consolidated them into one place, in the header comment for
ordered_set_shutdown.
regards, tom lane
On Wed, Feb 7, 2018 at 4:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Peter Geoghegan <pg@bowt.ie> writes:
It would be nice to get an opinion on this mode_final() + tuplesort
memory lifetime business from you, Tom.
I'm fairly sure that that bit in mode_final() was just a hack to make
it work. If there's a better, more principled way, let's go for it.
This is the more principled way. It is much easier to make every
single tuplesort caller on every release branch follow this rule (or
those on 9.5+, at least).
My big concern with making tuplesort_getdatum() deliberately copy into
caller's context is that that could introduce memory leaks in caller
code that evolved in a world where tuplesort_end() frees
pass-by-reference datum memory. Seems like the only risky case is
certain ordered set aggregate code, though. And, it's probably just a
matter of fixing the comments. I'll read through the bug report thread
that led up to October's commit be0ebb65 [1] tomorrow, to make sure of
this.
I just noticed that process_ordered_aggregate_single() says that the
behavior I propose for tuplesort_getdatum() is what it *already*
expects:
/*
* Note: if input type is pass-by-ref, the datums returned by the sort are
* freshly palloc'd in the per-query context, so we must be careful to
* pfree them when they are no longer needed.
*/
This supports the idea that ordered set aggregate code is the odd one
out when it comes to beliefs about tuplesort memory contexts. Even
just among tuplesort_getdatum() callers, though even more so among
tuplesort callers in general. One simple rule for all tuplesort fetch
routines is the high-level goal here.
Note that you removed the quoted comment within be0ebb65, back in October.
There were multiple instances of basically that same comment before.
AFAICS I just consolidated them into one place, in the header comment for
ordered_set_shutdown.
I see. So, to restate my earlier remarks in terms of the newer
organization: The last line in that header comment will need to be
removed, since it will become false under my scheme. The line is:
"Note that many of the finalfns could *not* free the tuplesort object,
at least not without extra data copying, because what they return is a
pointer to a datum inside the tuplesort object".
[1]: /messages/by-id/CAB4ELO5RZhOamuT9Xsf72ozbenDLLXZKSk07FiSVsuJNZB861A@mail.gmail.com
--
Peter Geoghegan
On Wed, Feb 7, 2018 at 7:02 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Feb 7, 2018 at 4:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Peter Geoghegan <pg@bowt.ie> writes:
It would be nice to get an opinion on this mode_final() + tuplesort
memory lifetime business from you, Tom.
I'm fairly sure that that bit in mode_final() was just a hack to make
it work. If there's a better, more principled way, let's go for it.
This is the more principled way. It is much easier to make every
single tuplesort caller on every release branch follow this rule (or
those on 9.5+, at least).
I now think that my approach to fixing 9.6 in
WIP-tuplesort-memcontext-fix.patch is too complicated to justify
backpatching. I had the right idea there, and have no reason to think
it won't work, but it now seems like the complexity simply isn't worth
it. The advantage of WIP-tuplesort-memcontext-fix.patch was that it
avoided an extra copy within tuplesort_gettupleslot() on the earlier
Postgres versions it targeted (the versions where that function does
not have a "copy" argument), by arranging to make sure that low-level
routines have tuplesort caller context passed all the way down.
However, now that I consider the frequency that
WIP-tuplesort-memcontext-fix.patch would avoid such copying given real
9.6 workloads, its approach looks rather unappealing -- we should
instead just do a copy in all cases.
Another way of putting it is that it now seems like the approach taken
in bugfix commit d8589946d should be taken even further for 9.6, so
that we *always* copy for the tuplesort_gettupleslot() caller, rather
than just copying in the most common cases. We'd also sometimes have
to free the redundant memory allocated by tuplesort_gettuple_common()
within tuplesort_gettupleslot() if we went this way -- the should_free
= true case would have tuplesort_gettuple_common() do a pfree() after
copying. Needing a pfree() is a consequence of allocating memory for
caller, and then copying it for caller when we know that we're using
caller's memory context. A bit weird, but certainly very simple.
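A standalone sketch of that copy-then-free arrangement, in plain C with
invented names (not the actual patch code):

#include <stdlib.h>
#include <string.h>

/* stand-in for tuplesort_gettuple_common(): may return memory the
 * caller must free (should_free = 1), or memory the sort still owns */
static void *
internal_fetch(size_t len, int *should_free)
{
    *should_free = 1;
    return calloc(1, len);
}

/* stand-in for a public fetch routine under the "always copy" plan */
static void *
fetch_for_caller(size_t len)
{
    int should_free;
    void *internal = internal_fetch(len, &should_free);
    void *copy = malloc(len);  /* caller-context allocation */

    memcpy(copy, internal, len);

    /* eagerly free the now-redundant internal allocation */
    if (should_free)
        free(internal);

    return copy;  /* valid even after the sort itself is gone */
}

int
main(void)
{
    void *tuple = fetch_for_caller(64);
    free(tuple);
    return 0;
}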
New plan:
* For 9.5 and 9.6, the approach taken in bugfix commit d8589946d
should be taken even further -- we should always copy. Moreover, we
should always copy within tuplesort_getdatum(), for the same reasons.
* For 9.5, 9.6, 10, and master, we should make sure that
tuplesort_getdatum() uses the caller's memory context. The fact that
it doesn't already do so seems like a simple oversight. We should do
this to be consistent with tuplesort_gettupleslot(). (This isn't
critical, but seems like a good idea.)
* For 9.5, 9.6, 10, and master, we should adjust some comments from
tuplesort_getdatum() callers, so that they no longer say that
tuplesort datum tuple memory lives in tuplesort context. That won't be
true anymore.
Anyone have an opinion on this?
The advantages of this approach are:
- It's far simpler than WIP-tuplesort-memcontext-fix.patch, and can be
applied to 9.5 and 9.6 with only small adjustments.
- It leaves all branches essentially consistent with v10+. v10+ gets
everything right already (except for that one minor
tuplesort_getdatum() + caller context issue), and it seems sensible to
treat v10 as a kind of model to follow here.
There are also some disadvantages for this new plan, though:
- There is a slightly awkward question for tuplesort_getdatum() in
9.6: Is tuplesort_getdatum() *always* explicitly copying an acceptable
overhead, given that tuplesort_getdatum() is not known to cause a
crash? I doubt so myself, since tuplesort_getdatum() *always* copies
on Postgres v10+ anyway, and even on 9.6 copying is already the common
case.
- There is a new overhead in 9.5. As I said, 9.6 mostly already copies
anyway, since it already has d8589946d -- 9.5 never got that commit.
This is very similar to the situation we faced about a year ago with
d8589946d on 9.6, since there isn't going to be much extra copying
than the copying that d8589946d already implies. ISTM that d8589946d
set a precedent that makes the situation that this creates for 9.5
okay today.
--
Peter Geoghegan
On Mon, Feb 12, 2018 at 6:15 PM, Peter Geoghegan <pg@bowt.ie> wrote:
* For 9.5 and 9.6, the approach taken in bugfix commit d8589946d
should be taken even further -- we should always copy. Moreover, we
should always copy within tuplesort_getdatum(), for the same reasons.
* For 9.5, 9.6, 10, and master, we should make sure that
tuplesort_getdatum() uses the caller's memory context. The fact that
it doesn't already do so seems like a simple oversight. We should do
this to be consistent with tuplesort_gettupleslot(). (This isn't
critical, but seems like a good idea.)
* For 9.5, 9.6, 10, and master, we should adjust some comments from
tuplesort_getdatum() callers, so that they no longer say that
tuplesort datum tuple memory lives in tuplesort context. That won't be
true anymore.
Attached patches do it that way. I'm happy with what I came up with,
which is a lot simpler than my first approach. The extra copying seems
likely to be well worth it, since it is fairly isolated in practice,
especially on 9.6. There is no extra copying from v10+, since they
don't need the first fix at all.
--
Peter Geoghegan
Attachments:
REL9_5_STABLE-fix-double-free-tuplesort.patch
From 083b4f26446b51b4a3d839ca44c58144aecadc34 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 9 Feb 2018 15:02:08 -0800
Subject: [PATCH] Fix double free of tuple with grouping sets/tuplesort.c.
Have tuplesort_gettupleslot() copy tuple contents for slot into caller's
memory context in all cases. This necessitates an extra copy on 9.5 and
9.6, though only for cases where an explicit copy was not already
required (mostly just TSS_SORTEDONTAPE fetches, though all
TSS_FINALMERGE fetches on 9.5). Do the same for tuplesort_getdatum().
This fixes a crash that can occur on at least 9.5 and 9.6, where some
grouping sets queries may call tuplesort_end() at node shutdown time,
and free tuple memory that the executor/a tuple slot imagines it is
responsible for.
In the case of all other fetch routines, continue to allocate the memory
in the tuplesort-owned context (only change comments). This is a little
inconsistent with what we now do for tuplesort_gettupleslot() and
tuplesort_getdatum(), but that's preferable to adding new copy overhead
in the backbranches where it's clearly unnecessary. These other fetch
routines provide the weakest possible guarantees about tuple memory
lifespan from v10 on, anyway, so this actually seems more consistent
overall.
In passing, change the memory context switch point within
tuplesort_getdatum() to match tuplesort_gettupleslot(), so that we
actually allocate memory in the right context there (caller's context),
which seems like an independent bug. It's not clear that this can cause
a crash, but it's still wrong.
Only the latter, less important fix is required on v10 and master,
because in v10 there is already no question of not having to make a copy
in caller's context anyway (see commits e94568ec and 3856cf96).
Author: Peter Geoghegan
Reported-By: Bernd Helmle
Analyzed-By: Peter Geoghegan, Andreas Seltenreich, Bernd Helmle
Discussion: https://www.postgresql.org/message-id/1512661638.9720.34.camel@oopsware.de
Backpatch: 9.5-, where the issue is known to cause a crash.
---
src/backend/utils/adt/orderedsetaggs.c | 21 +++++------
src/backend/utils/sort/tuplesort.c | 68 +++++++++++++++++++++++++++++-----
2 files changed, 67 insertions(+), 22 deletions(-)
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 39ed85b..1a10202 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -457,10 +457,9 @@ percentile_disc_final(PG_FUNCTION_ARGS)
elog(ERROR, "missing row in percentile_disc");
/*
- * Note: we *cannot* clean up the tuplesort object here, because the value
- * to be returned is allocated inside its sortcontext. We could use
- * datumCopy to copy it out of there, but it doesn't seem worth the
- * trouble, since the cleanup callback will clear the tuplesort later.
+ * Note: we could clean up the tuplesort object here, but it doesn't seem
+ * worth the trouble, since the cleanup callback will clear the tuplesort
+ * later.
*/
/* We shouldn't have stored any nulls, but do the right thing anyway */
@@ -575,10 +574,9 @@ percentile_cont_final_common(FunctionCallInfo fcinfo,
}
/*
- * Note: we *cannot* clean up the tuplesort object here, because the value
- * to be returned may be allocated inside its sortcontext. We could use
- * datumCopy to copy it out of there, but it doesn't seem worth the
- * trouble, since the cleanup callback will clear the tuplesort later.
+ * Note: we could clean up the tuplesort object here, but it doesn't seem
+ * worth the trouble, since the cleanup callback will clear the tuplesort
+ * later.
*/
PG_RETURN_DATUM(val);
@@ -1089,10 +1087,9 @@ mode_final(PG_FUNCTION_ARGS)
pfree(DatumGetPointer(last_val));
/*
- * Note: we *cannot* clean up the tuplesort object here, because the value
- * to be returned is allocated inside its sortcontext. We could use
- * datumCopy to copy it out of there, but it doesn't seem worth the
- * trouble, since the cleanup callback will clear the tuplesort later.
+ * Note: we could clean up the tuplesort object here, but it doesn't seem
+ * worth the trouble, since the cleanup callback will clear the tuplesort
+ * later.
*/
if (mode_freq)
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 5676c07..294f2bd 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1664,6 +1664,12 @@ tuplesort_performsort(Tuplesortstate *state)
* Internal routine to fetch the next tuple in either forward or back
* direction into *stup. Returns FALSE if no more tuples.
* If *should_free is set, the caller must pfree stup.tuple when done with it.
+ * Otherwise, caller should not use tuple following next call here.
+ *
+ * Note: Public tuplesort fetch routine callers cannot rely on tuple being
+ * allocated in their own memory context when should_free is TRUE. It may be
+ * necessary to create a new copy of the tuple to meet the requirements of
+ * public fetch routine callers.
*/
static bool
tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
@@ -1865,6 +1871,12 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
* Fetch the next tuple in either forward or back direction.
* If successful, put tuple in slot and return TRUE; else, clear the slot
* and return FALSE.
+ *
+ * The slot receives a copied tuple that will stay valid regardless of future
+ * manipulations of the tuplesort's state. This includes calls to
+ * tuplesort_end() -- tuple will be allocated in caller's own context to make
+ * this work (this differs from similar routines for other types of
+ * tuplesorts).
*/
bool
tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
@@ -1881,7 +1893,26 @@ tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
if (stup.tuple)
{
- ExecStoreMinimalTuple((MinimalTuple) stup.tuple, slot, should_free);
+ void *tuple = stup.tuple;
+
+ /*
+ * Callers rely on memory being in their own memory context, which is
+ * not guaranteed by tuplesort_gettuple_common(), even when should_free
+ * is set to TRUE. We must always copy here, since our interface does
+ * not allow callers to opt into arrangement where tuple memory can go
+ * away on the next call here, or after tuplesort_end() is called.
+ */
+ stup.tuple = heap_copy_minimal_tuple((MinimalTuple) tuple);
+
+ /*
+ * Free local copy eagerly. It would be very invasive to get
+ * tuplesort_gettuple_common() to allocate tuple in caller's context
+ * for us, so we just do this instead.
+ */
+ if (should_free)
+ pfree(tuple);
+
+ ExecStoreMinimalTuple((MinimalTuple) stup.tuple, slot, true);
return true;
}
else
@@ -1895,6 +1926,8 @@ tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
* Fetch the next tuple in either forward or back direction.
* Returns NULL if no more tuples. If *should_free is set, the
* caller must pfree the returned tuple when done with it.
+ * If it is not set, caller should not use tuple following next
+ * call here. It's never okay to use it after tuplesort_end().
*/
HeapTuple
tuplesort_getheaptuple(Tuplesortstate *state, bool forward, bool *should_free)
@@ -1914,6 +1947,8 @@ tuplesort_getheaptuple(Tuplesortstate *state, bool forward, bool *should_free)
* Fetch the next index tuple in either forward or back direction.
* Returns NULL if no more tuples. If *should_free is set, the
* caller must pfree the returned tuple when done with it.
+ * If it is not set, caller should not use tuple following next
+ * call here. It's never okay to use it after tuplesort_end().
*/
IndexTuple
tuplesort_getindextuple(Tuplesortstate *state, bool forward,
@@ -1935,7 +1970,8 @@ tuplesort_getindextuple(Tuplesortstate *state, bool forward,
* Returns FALSE if no more datums.
*
* If the Datum is pass-by-ref type, the returned value is freshly palloc'd
- * and is now owned by the caller.
+ * in caller's context, and is now owned by the caller (this differs from
+ * similar routines for other types of tuplesorts).
*/
bool
tuplesort_getdatum(Tuplesortstate *state, bool forward,
@@ -1950,6 +1986,8 @@ tuplesort_getdatum(Tuplesortstate *state, bool forward,
MemoryContextSwitchTo(oldcontext);
return false;
}
+ /* Copy into caller's memory context */
+ MemoryContextSwitchTo(oldcontext);
if (stup.isnull1 || state->datumTypeByVal)
{
@@ -1958,16 +1996,26 @@ tuplesort_getdatum(Tuplesortstate *state, bool forward,
}
else
{
- /* use stup.tuple because stup.datum1 may be an abbreviation */
-
- if (should_free)
- *val = PointerGetDatum(stup.tuple);
- else
- *val = datumCopy(PointerGetDatum(stup.tuple), false, state->datumTypeLen);
+ /*
+ * Callers rely on memory being in their own memory context, which is
+ * not guaranteed by tuplesort_gettuple_common(), even when should_free
+ * is set to TRUE. We must always copy here, since our interface does
+ * not allow callers to opt into arrangement where tuple memory can go
+ * away on the next call here, or after tuplesort_end() is called.
+ *
+ * Use stup.tuple because stup.datum1 may be an abbreviation.
+ */
+ *val = datumCopy(PointerGetDatum(stup.tuple), false, state->datumTypeLen);
*isNull = false;
- }
- MemoryContextSwitchTo(oldcontext);
+ /*
+ * Free local copy eagerly. It would be very invasive to get
+ * tuplesort_gettuple_common() to allocate tuple in caller's context
+ * for us, so we just do this instead.
+ */
+ if (should_free)
+ pfree(stup.tuple);
+ }
return true;
}
--
2.7.4
master-fix-double-free-tuplesort.patch
From 61ee080f3763e5a4112e9507f81d60fb8e889e99 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 9 Feb 2018 15:40:44 -0800
Subject: [PATCH] Fix double free of tuple with grouping sets/tuplesort.c.
Have tuplesort_gettupleslot() copy tuple contents for slot into caller's
memory context in all cases. This necessitates an extra copy on 9.5 and
9.6, though only for cases where an explicit copy was not already
required (mostly just TSS_SORTEDONTAPE fetches, though all
TSS_FINALMERGE fetches on 9.5). Do the same for tuplesort_getdatum().
This fixes a crash that can occur on at least 9.5 and 9.6, where some
grouping sets queries may call tuplesort_end() at node shutdown time,
and free tuple memory that the executor/a tuple slot imagines it is
responsible for.
In the case of all other fetch routines, continue to allocate the memory
in the tuplesort-owned context (only change comments). This is a little
inconsistent with what we now do for tuplesort_gettupleslot() and
tuplesort_getdatum(), but that's preferable to adding new copy overhead
in the backbranches where it's clearly unnecessary. These other fetch
routines provide the weakest possible guarantees about tuple memory
lifespan from v10 on, anyway, so this actually seems more consistent
overall.
In passing, change the memory context switch point within
tuplesort_getdatum() to match tuplesort_gettupleslot(), so that we
actually allocate memory in the right context there (caller's context),
which seems like an independent bug. It's not clear that this can cause
a crash, but it's still wrong.
Only the latter, less important fix is required on v10 and master,
because in v10 there is already no question of not having to make a copy
in caller's context anyway (see commits e94568ec and 3856cf96).
Author: Peter Geoghegan
Reported-By: Bernd Helmle
Analyzed-By: Peter Geoghegan, Andreas Seltenreich, Bernd Helmle
Discussion: https://www.postgresql.org/message-id/1512661638.9720.34.camel@oopsware.de
Backpatch: 9.5-, where the issue is known to cause a crash.
---
src/backend/utils/adt/orderedsetaggs.c | 5 +----
src/backend/utils/sort/tuplesort.c | 19 ++++++++++---------
2 files changed, 11 insertions(+), 13 deletions(-)
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 50b34fc..ed36851 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -329,10 +329,7 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
*
* In the case where we're not expecting multiple finalfn calls, we could
* arguably rely on the finalfn to clean up; but it's easier and more testable
- * if we just do it the same way in either case. Note that many of the
- * finalfns could *not* free the tuplesort object, at least not without extra
- * data copying, because what they return is a pointer to a datum inside the
- * tuplesort object.
+ * if we just do it the same way in either case.
*/
static void
ordered_set_shutdown(Datum arg)
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 041bdc2..7fec18e 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -2148,11 +2148,12 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
* representation, which caller may rely on in abbreviated inequality check.
*
* If copy is true, the slot receives a copied tuple that will stay valid
- * regardless of future manipulations of the tuplesort's state. Memory is
- * owned by the caller. If copy is false, the slot will just receive a
- * pointer to a tuple held within the tuplesort, which is more efficient, but
- * only safe for callers that are prepared to have any subsequent manipulation
- * of the tuplesort's state invalidate slot contents.
+ * regardless of future manipulations of the tuplesort's state. This includes
+ * calls to tuplesort_end() -- tuple will be allocated in caller's own context
+ * to make this work. If copy is false, the slot will just receive a pointer
+ * to a tuple held within the tuplesort, which is more efficient, but only safe
+ * for callers that are prepared to have any subsequent manipulation of the
+ * tuplesort's state invalidate slot contents.
*/
bool
tuplesort_gettupleslot(Tuplesortstate *state, bool forward, bool copy,
@@ -2230,8 +2231,8 @@ tuplesort_getindextuple(Tuplesortstate *state, bool forward)
* Returns false if no more datums.
*
* If the Datum is pass-by-ref type, the returned value is freshly palloc'd
- * and is now owned by the caller (this differs from similar routines for
- * other types of tuplesorts).
+ * in caller's context, and is now owned by the caller (this differs from
+ * similar routines for other types of tuplesorts).
*
* Caller may optionally be passed back abbreviated value (on true return
* value) when abbreviation was used, which can be used to cheaply avoid
@@ -2252,6 +2253,8 @@ tuplesort_getdatum(Tuplesortstate *state, bool forward,
MemoryContextSwitchTo(oldcontext);
return false;
}
+ /* Copy into caller's memory context */
+ MemoryContextSwitchTo(oldcontext);
/* Record abbreviated key for caller */
if (state->sortKeys->abbrev_converter && abbrev)
@@ -2269,8 +2272,6 @@ tuplesort_getdatum(Tuplesortstate *state, bool forward,
*isNull = false;
}
- MemoryContextSwitchTo(oldcontext);
-
return true;
}
--
2.7.4
REL9_6_STABLE-fix-double-free-tuplesort.patch
From fcbdb2e4fb76a75a0c56ca258f99df888b174dc3 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 14 Dec 2017 21:22:41 -0800
Subject: [PATCH] Fix double free of tuple with grouping sets/tuplesort.c.
Have tuplesort_gettupleslot() copy tuple contents for slot into caller's
memory context in all cases. This necessitates an extra copy on 9.5 and
9.6, though only for cases where an explicit copy was not already
required (mostly just TSS_SORTEDONTAPE fetches, though all
TSS_FINALMERGE fetches on 9.5). Do the same for tuplesort_getdatum().
This fixes a crash that can occur on at least 9.5 and 9.6, where some
grouping sets queries may call tuplesort_end() at node shutdown time,
and free tuple memory that the executor/a tuple slot imagines it is
responsible for.
In the case of all other fetch routines, continue to allocate the memory
in the tuplesort-owned context (only change comments). This is a little
inconsistent with what we now do for tuplesort_gettupleslot() and
tuplesort_getdatum(), but that's preferable to adding new copy overhead
in the backbranches where it's clearly unnecessary. These other fetch
routines provide the weakest possible guarantees about tuple memory
lifespan from v10 on, anyway, so this actually seems more consistent
overall.
In passing, change the memory context switch point within
tuplesort_getdatum() to match tuplesort_gettupleslot(), so that we
actually allocate memory in the right context there (caller's context),
which seems like an independent bug. It's not clear that this can cause
a crash, but it's still wrong.
Only the latter, less important fix is required on v10 and master,
because in v10 there is already no question of not having to make a copy
in caller's context anyway (see commits e94568ec and 3856cf96).
Author: Peter Geoghegan
Reported-By: Bernd Helmle
Analyzed-By: Peter Geoghegan, Andreas Seltenreich, Bernd Helmle
Discussion: https://www.postgresql.org/message-id/1512661638.9720.34.camel@oopsware.de
Backpatch: 9.5-, where the issue is known to cause a crash.
---
src/backend/utils/adt/orderedsetaggs.c | 21 +++++-----
src/backend/utils/sort/tuplesort.c | 74 +++++++++++++++++++++++++---------
2 files changed, 63 insertions(+), 32 deletions(-)
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index fe44d56..4d1e201 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -457,10 +457,9 @@ percentile_disc_final(PG_FUNCTION_ARGS)
elog(ERROR, "missing row in percentile_disc");
/*
- * Note: we *cannot* clean up the tuplesort object here, because the value
- * to be returned is allocated inside its sortcontext. We could use
- * datumCopy to copy it out of there, but it doesn't seem worth the
- * trouble, since the cleanup callback will clear the tuplesort later.
+ * Note: we could clean up the tuplesort object here, but it doesn't seem
+ * worth the trouble, since the cleanup callback will clear the tuplesort
+ * later.
*/
/* We shouldn't have stored any nulls, but do the right thing anyway */
@@ -575,10 +574,9 @@ percentile_cont_final_common(FunctionCallInfo fcinfo,
}
/*
- * Note: we *cannot* clean up the tuplesort object here, because the value
- * to be returned may be allocated inside its sortcontext. We could use
- * datumCopy to copy it out of there, but it doesn't seem worth the
- * trouble, since the cleanup callback will clear the tuplesort later.
+ * Note: we could clean up the tuplesort object here, but it doesn't seem
+ * worth the trouble, since the cleanup callback will clear the tuplesort
+ * later.
*/
PG_RETURN_DATUM(val);
@@ -1097,10 +1095,9 @@ mode_final(PG_FUNCTION_ARGS)
pfree(DatumGetPointer(last_val));
/*
- * Note: we *cannot* clean up the tuplesort object here, because the value
- * to be returned is allocated inside its sortcontext. We could use
- * datumCopy to copy it out of there, but it doesn't seem worth the
- * trouble, since the cleanup callback will clear the tuplesort later.
+ * Note: we could clean up the tuplesort object here, but it doesn't seem
+ * worth the trouble, since the cleanup callback will clear the tuplesort
+ * later.
*/
if (mode_freq)
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 4dd5407..544f06d 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1824,6 +1824,11 @@ tuplesort_performsort(Tuplesortstate *state)
* direction into *stup. Returns FALSE if no more tuples.
* If *should_free is set, the caller must pfree stup.tuple when done with it.
* Otherwise, caller should not use tuple following next call here.
+ *
+ * Note: Public tuplesort fetch routine callers cannot rely on tuple being
+ * allocated in their own memory context when should_free is TRUE. It may be
+ * necessary to create a new copy of the tuple to meet the requirements of
+ * public fetch routine callers.
*/
static bool
tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
@@ -2046,9 +2051,11 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
* NULL value in leading attribute will set abbreviated value to zeroed
* representation, which caller may rely on in abbreviated inequality check.
*
- * The slot receives a copied tuple (sometimes allocated in caller memory
- * context) that will stay valid regardless of future manipulations of the
- * tuplesort's state.
+ * The slot receives a copied tuple that will stay valid regardless of future
+ * manipulations of the tuplesort's state. This includes calls to
+ * tuplesort_end() -- tuple will be allocated in caller's own context to make
+ * this work (this differs from similar routines for other types of
+ * tuplesorts).
*/
bool
tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
@@ -2065,16 +2072,30 @@ tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
if (stup.tuple)
{
+ void *tuple = stup.tuple;
+
/* Record abbreviated key for caller */
if (state->sortKeys->abbrev_converter && abbrev)
*abbrev = stup.datum1;
- if (!should_free)
- {
- stup.tuple = heap_copy_minimal_tuple((MinimalTuple) stup.tuple);
- should_free = true;
- }
- ExecStoreMinimalTuple((MinimalTuple) stup.tuple, slot, should_free);
+ /*
+ * Callers rely on memory being in their own memory context, which is
+ * not guaranteed by tuplesort_gettuple_common(), even when should_free
+ * is set to TRUE. We must always copy here, since our interface does
+ * not allow callers to opt into arrangement where tuple memory can go
+ * away on the next call here, or after tuplesort_end() is called.
+ */
+ stup.tuple = heap_copy_minimal_tuple((MinimalTuple) tuple);
+
+ /*
+ * Free local copy eagerly. It would be very invasive to get
+ * tuplesort_gettuple_common() to allocate tuple in caller's context
+ * for us, so we just do this instead.
+ */
+ if (should_free)
+ pfree(tuple);
+
+ ExecStoreMinimalTuple((MinimalTuple) stup.tuple, slot, true);
return true;
}
else
@@ -2089,7 +2110,7 @@ tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
* Returns NULL if no more tuples. If *should_free is set, the
* caller must pfree the returned tuple when done with it.
* If it is not set, caller should not use tuple following next
- * call here.
+ * call here. It's never okay to use it after tuplesort_end().
*/
HeapTuple
tuplesort_getheaptuple(Tuplesortstate *state, bool forward, bool *should_free)
@@ -2110,7 +2131,7 @@ tuplesort_getheaptuple(Tuplesortstate *state, bool forward, bool *should_free)
* Returns NULL if no more tuples. If *should_free is set, the
* caller must pfree the returned tuple when done with it.
* If it is not set, caller should not use tuple following next
- * call here.
+ * call here. It's never okay to use it after tuplesort_end().
*/
IndexTuple
tuplesort_getindextuple(Tuplesortstate *state, bool forward,
@@ -2132,7 +2153,8 @@ tuplesort_getindextuple(Tuplesortstate *state, bool forward,
* Returns FALSE if no more datums.
*
* If the Datum is pass-by-ref type, the returned value is freshly palloc'd
- * and is now owned by the caller.
+ * in caller's context, and is now owned by the caller (this differs from
+ * similar routines for other types of tuplesorts).
*
* Caller may optionally be passed back abbreviated value (on TRUE return
* value) when abbreviation was used, which can be used to cheaply avoid
@@ -2154,6 +2176,8 @@ tuplesort_getdatum(Tuplesortstate *state, bool forward,
MemoryContextSwitchTo(oldcontext);
return false;
}
+ /* Copy into caller's memory context */
+ MemoryContextSwitchTo(oldcontext);
/* Record abbreviated key for caller */
if (state->sortKeys->abbrev_converter && abbrev)
@@ -2166,16 +2190,26 @@ tuplesort_getdatum(Tuplesortstate *state, bool forward,
}
else
{
- /* use stup.tuple because stup.datum1 may be an abbreviation */
-
- if (should_free)
- *val = PointerGetDatum(stup.tuple);
- else
- *val = datumCopy(PointerGetDatum(stup.tuple), false, state->datumTypeLen);
+ /*
+ * Callers rely on memory being in their own memory context, which is
+ * not guaranteed by tuplesort_gettuple_common(), even when should_free
+ * is set to TRUE. We must always copy here, since our interface does
+ * not allow callers to opt into arrangement where tuple memory can go
+ * away on the next call here, or after tuplesort_end() is called.
+ *
+ * Use stup.tuple because stup.datum1 may be an abbreviation.
+ */
+ *val = datumCopy(PointerGetDatum(stup.tuple), false, state->datumTypeLen);
*isNull = false;
- }
- MemoryContextSwitchTo(oldcontext);
+ /*
+ * Free local copy eagerly. It would be very invasive to get
+ * tuplesort_gettuple_common() to allocate tuple in caller's context
+ * for us, so we just do this instead.
+ */
+ if (should_free)
+ pfree(stup.tuple);
+ }
return true;
}
--
2.7.4
REL_10_STABLE-fix-double-free-tuplesort.patch
From f7a480807127e7b2fa5ee356981bbfc23ae0cd60 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 9 Feb 2018 14:58:08 -0800
Subject: [PATCH] Fix double free of tuple with grouping sets/tuplesort.c.
Have tuplesort_gettupleslot() copy tuple contents for slot into caller's
memory context in all cases. This necessitates an extra copy on 9.5 and
9.6, though only for cases where an explicit copy was not already
required (mostly just TSS_SORTEDONTAPE fetches, though all
TSS_FINALMERGE fetches on 9.5). Do the same for tuplesort_getdatum().
This fixes a crash that can occur on at least 9.5 and 9.6, where some
grouping sets queries may call tuplesort_end() at node shutdown time,
and free tuple memory that the executor/a tuple slot imagines it is
responsible for.
In the case of all other fetch routines, continue to allocate the memory
in the tuplesort-owned context (only change comments). This is a little
inconsistent with what we now do for tuplesort_gettupleslot() and
tuplesort_getdatum(), but that's preferable to adding new copy overhead
in the backbranches where it's clearly unnecessary. These other fetch
routines provide the weakest possible guarantees about tuple memory
lifespan from v10 on, anyway, so this actually seems more consistent
overall.
In passing, change the memory context switch point within
tuplesort_getdatum() to match tuplesort_gettupleslot(), so that we
actually allocate memory in the right context there (caller's context),
which seems like an independent bug. It's not clear that this can cause
a crash, but it's still wrong.
Only the latter, less important fix is required on v10 and master,
because in v10 there is already no question of not having to make a copy
in caller's context anyway (see commits e94568ec and 3856cf96).
Author: Peter Geoghegan
Reported-By: Bernd Helmle
Analyzed-By: Peter Geoghegan, Andreas Seltenreich, Bernd Helmle
Discussion: https://www.postgresql.org/message-id/1512661638.9720.34.camel@oopsware.de
Backpatch: 9.5-, where the issue is known to cause a crash.
---
src/backend/utils/adt/orderedsetaggs.c | 21 +++++++++------------
src/backend/utils/sort/tuplesort.c | 19 ++++++++++---------
2 files changed, 19 insertions(+), 21 deletions(-)
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 8502fcf..c1d50eb 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -458,10 +458,9 @@ percentile_disc_final(PG_FUNCTION_ARGS)
elog(ERROR, "missing row in percentile_disc");
/*
- * Note: we *cannot* clean up the tuplesort object here, because the value
- * to be returned is allocated inside its sortcontext. We could use
- * datumCopy to copy it out of there, but it doesn't seem worth the
- * trouble, since the cleanup callback will clear the tuplesort later.
+ * Note: we could clean up the tuplesort object here, but it doesn't seem
+ * worth the trouble, since the cleanup callback will clear the tuplesort
+ * later.
*/
/* We shouldn't have stored any nulls, but do the right thing anyway */
@@ -576,10 +575,9 @@ percentile_cont_final_common(FunctionCallInfo fcinfo,
}
/*
- * Note: we *cannot* clean up the tuplesort object here, because the value
- * to be returned may be allocated inside its sortcontext. We could use
- * datumCopy to copy it out of there, but it doesn't seem worth the
- * trouble, since the cleanup callback will clear the tuplesort later.
+ * Note: we could clean up the tuplesort object here, but it doesn't seem
+ * worth the trouble, since the cleanup callback will clear the tuplesort
+ * later.
*/
PG_RETURN_DATUM(val);
@@ -1098,10 +1096,9 @@ mode_final(PG_FUNCTION_ARGS)
pfree(DatumGetPointer(last_val));
/*
- * Note: we *cannot* clean up the tuplesort object here, because the value
- * to be returned is allocated inside its sortcontext. We could use
- * datumCopy to copy it out of there, but it doesn't seem worth the
- * trouble, since the cleanup callback will clear the tuplesort later.
+ * Note: we could clean up the tuplesort object here, but it doesn't seem
+ * worth the trouble, since the cleanup callback will clear the tuplesort
+ * later.
*/
if (mode_freq)
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 59cd28e..1dd568c 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -2103,11 +2103,12 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
* representation, which caller may rely on in abbreviated inequality check.
*
* If copy is true, the slot receives a copied tuple that will stay valid
- * regardless of future manipulations of the tuplesort's state. Memory is
- * owned by the caller. If copy is false, the slot will just receive a
- * pointer to a tuple held within the tuplesort, which is more efficient, but
- * only safe for callers that are prepared to have any subsequent manipulation
- * of the tuplesort's state invalidate slot contents.
+ * regardless of future manipulations of the tuplesort's state. This includes
+ * calls to tuplesort_end() -- tuple will be allocated in caller's own context
+ * to make this work. If copy is false, the slot will just receive a pointer
+ * to a tuple held within the tuplesort, which is more efficient, but only safe
+ * for callers that are prepared to have any subsequent manipulation of the
+ * tuplesort's state invalidate slot contents.
*/
bool
tuplesort_gettupleslot(Tuplesortstate *state, bool forward, bool copy,
@@ -2185,8 +2186,8 @@ tuplesort_getindextuple(Tuplesortstate *state, bool forward)
* Returns FALSE if no more datums.
*
* If the Datum is pass-by-ref type, the returned value is freshly palloc'd
- * and is now owned by the caller (this differs from similar routines for
- * other types of tuplesorts).
+ * in caller's context, and is now owned by the caller (this differs from
+ * similar routines for other types of tuplesorts).
*
* Caller may optionally be passed back abbreviated value (on TRUE return
* value) when abbreviation was used, which can be used to cheaply avoid
@@ -2207,6 +2208,8 @@ tuplesort_getdatum(Tuplesortstate *state, bool forward,
MemoryContextSwitchTo(oldcontext);
return false;
}
+ /* Copy into caller's memory context */
+ MemoryContextSwitchTo(oldcontext);
/* Record abbreviated key for caller */
if (state->sortKeys->abbrev_converter && abbrev)
@@ -2224,8 +2227,6 @@ tuplesort_getdatum(Tuplesortstate *state, bool forward,
*isNull = false;
}
- MemoryContextSwitchTo(oldcontext);
-
return true;
}
--
2.7.4
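The switch-point part of the fix can also be modeled standalone (plain
C, invented names, not the committed code): the fetch routine switches
back to the caller's context before making the copy, so the copy lands
where the caller expects it.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct Arena { const char *name; } Arena;

static Arena *current_arena;

/* analogous to MemoryContextSwitchTo() */
static Arena *
switch_to(Arena *arena)
{
    Arena *old = current_arena;
    current_arena = arena;
    return old;
}

static char *
copy_in_current(const char *s)
{
    printf("copy made in \"%s\" context\n", current_arena->name);
    return strdup(s);
}

/* models tuplesort_getdatum() after the fix */
static char *
get_datum(Arena *sortcontext, const char *internal_value)
{
    Arena *oldcontext = switch_to(sortcontext);

    /* ... fetch work happens here, in sortcontext ... */

    /* switch back BEFORE copying, so the copy is caller-owned */
    switch_to(oldcontext);
    return copy_in_current(internal_value);
}

int
main(void)
{
    Arena caller = {"caller"};
    Arena sort = {"sort"};
    char *val;

    current_arena = &caller;
    val = get_datum(&sort, "datum");
    free(val);
    return 0;
}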
At Thu, 22 Feb 2018 16:22:47 -0800, Peter Geoghegan <pg@bowt.ie> wrote in <CAH2-Wz=QW6FNpLEfQpFKmKiu_WkfxCpmPDmp6ZUf=7SrARsNpQ@mail.gmail.com>
On Mon, Feb 12, 2018 at 6:15 PM, Peter Geoghegan <pg@bowt.ie> wrote:
* For 9.5 and 9.6, the approach taken in bugfix commit d8589946d
should be taken even further -- we should always copy. Moreover, we
should always copy within tuplesort_getdatum(), for the same reasons.
* For 9.5, 9.6, 10, and master, we should make sure that
tuplesort_getdatum() uses the caller's memory context. The fact that
it doesn't already do so seems like a simple oversight. We should do
this to be consistent with tuplesort_gettupleslot(). (This isn't
critical, but seems like a good idea.)
* For 9.5, 9.6, 10, and master, we should adjust some comments from
tuplesort_getdatum() callers, so that they no longer say that
tuplesort datum tuple memory lives in tuplesort context. That won't be
true anymore.
Attached patches do it that way. I'm happy with what I came up with,
which is a lot simpler than my first approach. The extra copying seems
likely to be well worth it, since it is fairly isolated in practice,
especially on 9.6. There is no extra copying from v10+, since they
don't need the first fix at all.
FWIW, I examined the case myself, and confirmed that the cause is
the shouldFree-driven freeing of the tuple after tuplesort_end has
already run. As a matter of principle, since the caller is declared
responsible for freeing the result, tuplesort shouldn't free it out
of the caller's control. Even where this doesn't currently cause a
use-after-free problem, it could.
From this point of view, the patch for 9.5 and 9.6 looks fine and
actually fixes the problem.
For 10+, copying is controlled by the caller, but only
tuplesort_getdatum() failed to make the copy in the caller's
context. That is just an oversight, and the fix looks reasonable.
I'm not the best person to judge the comment wording, but it makes
sense to me.
If no one objects, I'll mark this as Ready for Commit in a couple
of days.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Mar 26, 2018 at 5:14 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Attached patches do it that way. I'm happy with what I came up with,
which is a lot simpler than my first approach. The extra copying seems
likely to be well worth it, since it is fairly isolated in practice,
especially on 9.6. There is no extra copying from v10+, since they
don't need the first fix at all.
If no one objects, I'll mark this as Ready for Commit in a couple
of days.
Thank you for the review, Horiguchi-san. It's hard to decide how
important each goal is when coming up with a back-patchable fix like
this. When the goals are somewhat in competition with each other, a
second or a third opinion is particularly appreciated.
--
Peter Geoghegan
At Mon, 26 Mar 2018 20:40:51 -0700, Peter Geoghegan <pg@bowt.ie> wrote in <CAH2-Wzk4MTSPWnCv3ENz9HZrtiJGut8TOtLTaf56AxmJ9VbcEA@mail.gmail.com>
On Mon, Mar 26, 2018 at 5:14 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Attached patches do it that way. I'm happy with what I came up with,
which is a lot simpler than my first approach. The extra copying seems
likely to be well worth it, since it is fairly isolated in practice,
especially on 9.6. There is no extra copying from v10+, since they
don't need the first fix at all.
If no one objects, I'll mark this as Ready for Commit in a couple
of days.
Thank you for the review, Horiguchi-san. It's hard to decide how
important each goal is when coming up with a back-patchable fix like
this. When the goals are somewhat in competition with each other, a
second or a third opinion is particularly appreciated.
Understood. I'll wait for the other opinions.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Mar 27, 2018 at 2:23 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
If no one objects, I'll mark this as Ready for Commit in a couple
of days.
Thank you for the review, Horiguchi-san. It's hard to decide how
important each goal is when coming up with a back-patchable fix like
this. When the goals are somewhat in competition with each other, a
second or a third opinion is particularly appreciated.
Understood. I'll wait for the other opinions.
I wasn't specifically requesting that you not mark the patch as ready
for committer, actually. I was just expressing that your input was
valuable. Sorry for being unclear.
--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
On Mon, Mar 26, 2018 at 5:14 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
If no one objects, I'll mark this as Ready for Commit in a couple
of days.
Thank you for the review, Horiguchi-san. It's hard to decide how
important each goal is when coming up with a back-patchable fix like
this. When the goals are somewhat in competition with each other, a
second or a third opinion is particularly appreciated.
It looks good to me. The only real objection would be if someone came
up with a test case proving that there's a significant performance
degradation from the extra copies. But given that these are back
branches, it would take a pretty steep penalty for me to want to take
the risk of refactoring to avoid that.
I've pushed it with some cosmetic adjustments.
regards, tom lane
On Wed, Mar 28, 2018 at 10:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
It looks good to me. The only real objection would be if someone came
up with a test case proving that there's a significant performance
degradation from the extra copies. But given that these are back
branches, it would take a pretty steep penalty for me to want to take
the risk of refactoring to avoid that.
I've pushed it with some cosmetic adjustments.
Thank you, Tom.
--
Peter Geoghegan