Shared detoast Datum proposal
Problem:
--------
Toast works well for its claimed purposes. However, the current detoast
infrastructure doesn't share the detoasted datum within a query's
lifespan, which causes inefficiency in cases like this:
SELECT big_jsonb_col->'a', big_jsonb_col->'b', big_jsonb_col->'c'
FROM t;
In the above case, big_jsonb_col will be detoasted 3 times for every
record.
A more common case may be:
CREATE TABLE t(a numeric);
insert into t select i FROM generate_series(1, 10)i;
SELECT * FROM t WHERE a > 3;
Here the value needs to be detoasted because of VARATT_CAN_MAKE_SHORT,
and it is detoasted twice: once in "a > 3", and once in the targetlist,
where we need to call numeric_out.
Proposal
--------
When we access a desired toasted column in the EEOP_{SCAN|INNER|OUTER}_VAR
steps, we can detoast it immediately and save the result back into
slot->tts_values. This way, when any other expression looks up the
Datum, it gets the detoasted version. The planner decides which Vars
should use this feature; the executor manages the detoast action and the
memory management.
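To illustrate the idea (not PostgreSQL code - the struct and function
names below are invented), a toy sketch of "detoast once, reuse
afterwards" could look like this, with a bitmask standing in for the
slot's bookkeeping:

```c
#include <assert.h>
#include <string.h>

#define MAX_ATTRS 8

/* Toy stand-in for a TupleTableSlot: values[] plus bookkeeping. */
typedef struct ToySlot
{
    const char *values[MAX_ATTRS];  /* stands in for slot->tts_values */
    unsigned    detoasted;          /* bitmask: which attrs are expanded */
    int         detoast_calls;      /* how many "real" detoasts happened */
} ToySlot;

/* Pretend a value whose first byte is 'T' is toasted; "detoasting"
 * just skips that byte.  The expanded pointer is stored back into
 * values[], so later accesses reuse it instead of detoasting again. */
static const char *
toy_access(ToySlot *slot, int attnum)
{
    if (slot->values[attnum][0] == 'T' &&
        !(slot->detoasted & (1u << attnum)))
    {
        slot->values[attnum] = slot->values[attnum] + 1; /* "expand" */
        slot->detoasted |= 1u << attnum;
        slot->detoast_calls++;
    }
    return slot->values[attnum];
}
```

However many expressions read the attribute, the expensive path runs
only once per tuple.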
Planner design
---------------
1. This feature only happens at the plan node where the detoast would
already happen with the existing code. For example:
SELECT t1.toast_col, t2.toast_col FROM t1 join t2 USING (toast_col);
Detoasting of toast_col happens only at the Join node's slot, even if
we have a projection on t1 or t2 at the scan node (except for the
parameterized path case).
However if
SELECT t1.toast_col, t2.toast_col
FROM t1 join t2
USING (toast_col)
WHERE t1.toast_col > 'a';
the detoast *may* happen at the scan level of t1, since "t1.toast_col >
'a'" accesses the Var within a function call (the '>' operator), which
causes a detoast. (However, it should not happen if it is under a Sort
node; for details see Planner design section 2.)
On the implementation side, I added "Bitmapset *reference_attrs;" to the
Scan node, which shows whether a Var should be accessed with the
pre-detoast way in the expression execution engine. The value is
maintained at the create_plan/set_plan_refs stage.
Two similar fields are added to the Join node:
Bitmapset *outer_reference_attrs;
Bitmapset *inner_reference_attrs;
In both cases, I track the level of the walker/mutator; if the level is
greater than 1 when we access a Var, 'var->varattno - 1' is added to the
bitmapset. Some special nodes should be ignored; see
increase_level_for_pre_detoast for details.
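The walker rule above can be sketched with a toy expression tree (all
types and names here are invented for illustration; the real code walks
PostgreSQL's Node tree):

```c
#include <assert.h>
#include <stddef.h>

/* Toy expression node: either a bare Var or a function/operator call. */
typedef struct ToyExpr
{
    enum { TOY_VAR, TOY_FUNC } kind;
    int varattno;                 /* valid for TOY_VAR, 1-based */
    struct ToyExpr *args[2];      /* valid for TOY_FUNC, may be NULL */
} ToyExpr;

/* A Var's (varattno - 1) is collected only when the walker is at
 * level > 1, i.e. the Var is nested inside some other expression node
 * (such as an operator call) and will therefore be detoasted anyway. */
static void
collect_reference_attrs(ToyExpr *node, int level, unsigned *attrs)
{
    if (node == NULL)
        return;
    if (node->kind == TOY_VAR)
    {
        if (level > 1)            /* bare Vars at level 1 are skipped */
            *attrs |= 1u << (node->varattno - 1);
        return;
    }
    for (int i = 0; i < 2; i++)
        collect_reference_attrs(node->args[i], level + 1, attrs);
}
```

A Var referenced as "toast_col > 'a'" is collected; a Var that only
appears bare in a targetlist is not.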
2. We also need to make sure the detoasted datum does not increase the
work_mem usage for nodes like Sort/Hash etc.; all such nodes can be
found by searching for the 'CP_SMALL_TLIST' flag.
If a node is under such a Sort/Hash-like node, we do some extra
checking to see whether a Var is a *direct input* of such a node. If
yes, we can't detoast it in advance; otherwise, we know the Var is
discarded before it reaches these nodes, and we can still use the
shared detoast feature.
The simplest cases to show this are:
2.1
Query-1
explain (verbose) select * from t1 where b > 'a';
-- b can be detoast in advance.
Query-2
explain (verbose) select * from t1 where b > 'a' order by c;
-- b can't be detoasted in advance since that would make the Sort use
more work_mem.
Query-3
explain (verbose) select a, c from t1 where b > 'a' order by c;
-- b can be pre-detoasted, since it is discarded before it goes to the
Sort node. In this case it doesn't do anything good, but for a complex
case like Query-4, it does.
Query-4
explain (costs off, verbose)
select t3.*
from t1, t2, t3
where t2.c > '999999999999999'
and t2.c = t1.c
and t3.b = t1.b;
QUERY PLAN
--------------------------------------------------------------
Hash Join
Output: t3.a, t3.b, t3.c
Hash Cond: (t3.b = t1.b)
-> Seq Scan on public.t3
Output: t3.a, t3.b, t3.c
-> Hash
Output: t1.b
-> Nested Loop
Output: t1.b <-- Note here 3
-> Seq Scan on public.t2
Output: t2.a, t2.b, t2.c
Filter: (t2.c > '9999...'::text) <--- Note Here 1
-> Index Scan using t1_c_idx on public.t1
Output: t1.a, t1.b, t1.c
Index Cond: (t1.c = t2.c) <--- Note Here 2
(15 rows)
In this case, the detoasted datum for t2.c can be shared: it benefits
"t2.c = t1.c" and does no harm to the Hash node.
Execution side
--------------
Once we decide a Var should be pre-detoasted for a given plan node, a
special step named EEOP_{INNER/OUTER/SCAN}_VAR_TOAST is created during
the ExecInitExpr stage. These special steps are introduced to avoid any
impact on the unrelated EEOP_{INNER/OUTER/SCAN}_VAR code paths.
slot->tts_values is used to store the detoasted datum so that any other
expression can access it easily.
Because of the above design, the detoasted datum should have the same
lifespan as any other slot->tts_values[*], so the default
ecxt_per_tuple_memory is not suitable for this. In the end I used
slot->tts_mcxt to hold the memory, maintaining these lifecycles in
execTuples.c. To record which datums in slot->tts_values are
pre-detoasted, a "Bitmapset *pre_detoasted_attrs" is added to the slot.
During the execution of these steps, code like the following is used:
static inline void
ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
{
    if (!slot->tts_isnull[attnum] &&
        VARATT_IS_EXTENDED(slot->tts_values[attnum]))
    {
        Datum oldDatum;
        MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);

        oldDatum = slot->tts_values[attnum];
        slot->tts_values[attnum] = PointerGetDatum(detoast_attr(
                                        (struct varlena *) oldDatum));
        Assert(slot->tts_nvalid > attnum);
        if (oldDatum != slot->tts_values[attnum])
            slot->pre_detoasted_attrs = bms_add_member(slot->pre_detoasted_attrs,
                                                       attnum);
        MemoryContextSwitchTo(old);
    }
}
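The cleanup side of this lifecycle could be sketched like this (a toy
model, not the real execTuples.c code; all names are invented): clearing
the slot frees exactly the attributes recorded as pre-detoasted and
resets the set:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ATTRS 8

/* Toy slot: values[] plus a bitmask standing in for the
 * pre_detoasted_attrs bitmapset described above. */
typedef struct ToySlot
{
    char     *values[MAX_ATTRS];
    unsigned  pre_detoasted;
} ToySlot;

/* Store an "expanded" copy in slot-owned memory and record the attr. */
static void
toy_store_detoasted(ToySlot *slot, int attnum, const char *expanded)
{
    slot->values[attnum] = malloc(strlen(expanded) + 1);
    strcpy(slot->values[attnum], expanded);
    slot->pre_detoasted |= 1u << attnum;
}

/* Clearing the slot must free exactly the recorded attributes;
 * ordinary (non-detoasted) values are not slot-owned and stay alone. */
static void
toy_clear_slot(ToySlot *slot)
{
    for (int i = 0; i < MAX_ATTRS; i++)
        if (slot->pre_detoasted & (1u << i))
            free(slot->values[i]);
    slot->pre_detoasted = 0;
}
```

This is why tts_values and pre_detoasted_attrs must always be kept in
sync, as noted in the known issues below.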
Testing
-------
- shared_detoast_slow.sql is used to test the planner-related code
changes. 'set jit to off' will enable more INFO logs about which Var is
pre-detoasted at which node level.
- cases where the patch doesn't help much:
create table w(a int, b numeric);
insert into w select i, i from generate_series(1, 1000000)i;
Q3:
select a from w where a > 0;
Q4:
select b from w where b > 0;
pgbench -n -f 1.sql postgres -T 10 -M prepared
Each query was run 5 times and the average value calculated.
| Qry No | Master | patched | perf | comment |
|--------+---------+---------+-------+-----------------------------------------|
| 3 | 309.438 | 308.411 | 0 | nearly zero impact on them |
| 4 | 431.735 | 420.833 | +2.6% | save the detoast effort for numeric_out |
- good case
setup:
create table b(big jsonb);
insert into b select
jsonb_object_agg( x::text,
random()::text || random()::text || random()::text )
from generate_series(1,600) f(x);
insert into b select (select big from b) from generate_series(1, 1000)i;
workload:
Q1:
select big->'1', big->'2', big->'3', big->'5', big->'10' from b;
Q2:
select 1 from b where length(big->>'1') > 0 and length(big->>'2') > 2;
| No | Master | patched |
|----+--------+---------|
| 1 | 1677 | 357 |
| 2 | 638 | 332 |
Some Known issues:
------------------
1. Currently only Scan & Join nodes are considered for this feature.
2. JIT is not adapted for this purpose yet.
3. This feature builds a strong relationship between slot->tts_values
and slot->pre_detoasted_attrs. For example, if slot->tts_values[1]
holds a detoasted datum, then 1 is in the slot->pre_detoasted_attrs
bitmapset. However, if someone changes the value directly, like
"slot->tts_values[1] = 2;", without touching the slot's
pre_detoasted_attrs, then trouble comes. The good thing is that the
detoast can only happen on Scan/Join nodes, so no other slot can
have such an issue. I ran the following command to find such cases;
it looks like none of them involves a slot from a Scan/Join node.
egrep -nri 'slot->tts_values\[[^\]]*\] = *
Any thoughts?
--
Best Regards
Andy Fan
Attachments:
v1-0001-shared-detoast-feature.patch (text/x-diff, +1084 -107)
Andy Fan <zhihuifan1213@163.com> writes:
Some Known issues:
------------------
1. Currently only Scan & Join nodes are considered for this feature.
2. JIT is not adapted for this purpose yet.
JIT is adapted for this feature in v2. Any feedback is welcome.
--
Best Regards
Andy Fan
Attachments:
v2-0001-shared-detoast-feature.patch (text/x-diff, +1117 -107)
On Mon, 1 Jan 2024 at 19:26, Andy Fan <zhihuifan1213@163.com> wrote:
Andy Fan <zhihuifan1213@163.com> writes:
Some Known issues:
------------------
1. Currently only Scan & Join nodes are considered for this feature.
2. JIT is not adapted for this purpose yet.
JIT is adapted for this feature in v2. Any feedback is welcome.
One of the tests was aborted at CFBOT [1] with:
[09:47:00.735] dumping /tmp/cores/postgres-11-28182.core for
/tmp/cirrus-ci-build/build/tmp_install//usr/local/pgsql/bin/postgres
[09:47:01.035] [New LWP 28182]
[09:47:01.748] [Thread debugging using libthread_db enabled]
[09:47:01.748] Using host libthread_db library
"/lib/x86_64-linux-gnu/libthread_db.so.1".
[09:47:09.392] Core was generated by `postgres: postgres regression
[local] SELECT '.
[09:47:09.392] Program terminated with signal SIGSEGV, Segmentation fault.
[09:47:09.392] #0 0x00007fa4eed4a5a1 in ?? ()
[09:47:11.123]
[09:47:11.123] Thread 1 (Thread 0x7fa4f8050a40 (LWP 28182)):
[09:47:11.123] #0 0x00007fa4eed4a5a1 in ?? ()
[09:47:11.123] No symbol table info available.
...
...
[09:47:11.123] #4 0x00007fa4ebc7a186 in LLVMOrcGetSymbolAddress () at
/build/llvm-toolchain-11-HMpQvg/llvm-toolchain-11-11.0.1/llvm/lib/ExecutionEngine/Orc/OrcCBindings.cpp:124
[09:47:11.123] No locals.
[09:47:11.123] #5 0x00007fa4eed6fc7a in llvm_get_function
(context=0x564b1813a8a0, funcname=0x7fa4eed4a570 "AWAVATSH\201",
<incomplete sequence \354\210>) at
../src/backend/jit/llvm/llvmjit.c:460
[09:47:11.123] addr = 94880527996960
[09:47:11.123] __func__ = "llvm_get_function"
[09:47:11.123] #6 0x00007fa4eed902e1 in ExecRunCompiledExpr
(state=0x0, econtext=0x564b18269d20, isNull=0x7ffc11054d5f) at
../src/backend/jit/llvm/llvmjit_expr.c:2577
[09:47:11.123] cstate = <optimized out>
[09:47:11.123] func = 0x564b18269d20
[09:47:11.123] #7 0x0000564b1698e614 in ExecEvalExprSwitchContext
(isNull=0x7ffc11054d5f, econtext=0x564b18269d20, state=0x564b182ad820)
at ../src/include/executor/executor.h:355
[09:47:11.123] retDatum = <optimized out>
[09:47:11.123] oldContext = 0x564b182680d0
[09:47:11.123] retDatum = <optimized out>
[09:47:11.123] oldContext = <optimized out>
[09:47:11.123] #8 ExecProject (projInfo=0x564b182ad818) at
../src/include/executor/executor.h:389
[09:47:11.123] econtext = 0x564b18269d20
[09:47:11.123] state = 0x564b182ad820
[09:47:11.123] slot = 0x564b182ad788
[09:47:11.123] isnull = false
[09:47:11.123] econtext = <optimized out>
[09:47:11.123] state = <optimized out>
[09:47:11.123] slot = <optimized out>
[09:47:11.123] isnull = <optimized out>
[09:47:11.123] #9 ExecMergeJoin (pstate=<optimized out>) at
../src/backend/executor/nodeMergejoin.c:836
[09:47:11.123] node = <optimized out>
[09:47:11.123] joinqual = 0x0
[09:47:11.123] otherqual = 0x0
[09:47:11.123] qualResult = <optimized out>
[09:47:11.123] compareResult = <optimized out>
[09:47:11.123] innerPlan = <optimized out>
[09:47:11.123] innerTupleSlot = <optimized out>
[09:47:11.123] outerPlan = <optimized out>
[09:47:11.123] outerTupleSlot = <optimized out>
[09:47:11.123] econtext = 0x564b18269d20
[09:47:11.123] doFillOuter = false
[09:47:11.123] doFillInner = false
[09:47:11.123] __func__ = "ExecMergeJoin"
[09:47:11.123] #10 0x0000564b169275b9 in ExecProcNodeFirst
(node=0x564b18269db0) at ../src/backend/executor/execProcnode.c:464
[09:47:11.123] No locals.
[09:47:11.123] #11 0x0000564b169a2675 in ExecProcNode
(node=0x564b18269db0) at ../src/include/executor/executor.h:273
[09:47:11.123] No locals.
[09:47:11.123] #12 ExecRecursiveUnion (pstate=0x564b182684a0) at
../src/backend/executor/nodeRecursiveunion.c:115
[09:47:11.123] node = 0x564b182684a0
[09:47:11.123] outerPlan = 0x564b18268d00
[09:47:11.123] innerPlan = 0x564b18269db0
[09:47:11.123] plan = 0x564b182ddc78
[09:47:11.123] slot = <optimized out>
[09:47:11.123] isnew = false
[09:47:11.123] #13 0x0000564b1695a421 in ExecProcNode
(node=0x564b182684a0) at ../src/include/executor/executor.h:273
[09:47:11.123] No locals.
[09:47:11.123] #14 CteScanNext (node=0x564b183a6830) at
../src/backend/executor/nodeCtescan.c:103
[09:47:11.123] cteslot = <optimized out>
[09:47:11.123] estate = <optimized out>
[09:47:11.123] dir = ForwardScanDirection
[09:47:11.123] forward = true
[09:47:11.123] tuplestorestate = 0x564b183a5cd0
[09:47:11.123] eof_tuplestore = <optimized out>
[09:47:11.123] slot = 0x564b183a6be0
[09:47:11.123] #15 0x0000564b1692e22b in ExecScanFetch
(node=node@entry=0x564b183a6830,
accessMtd=accessMtd@entry=0x564b1695a183 <CteScanNext>,
recheckMtd=recheckMtd@entry=0x564b16959db3 <CteScanRecheck>) at
../src/backend/executor/execScan.c:132
[09:47:11.123] estate = <optimized out>
[09:47:11.123] #16 0x0000564b1692e332 in ExecScan
(node=0x564b183a6830, accessMtd=accessMtd@entry=0x564b1695a183
<CteScanNext>, recheckMtd=recheckMtd@entry=0x564b16959db3
<CteScanRecheck>) at ../src/backend/executor/execScan.c:181
[09:47:11.123] econtext = 0x564b183a6b50
[09:47:11.123] qual = 0x0
[09:47:11.123] projInfo = 0x0
[09:47:11.123] #17 0x0000564b1695a5ea in ExecCteScan
(pstate=<optimized out>) at ../src/backend/executor/nodeCtescan.c:164
[09:47:11.123] node = <optimized out>
[09:47:11.123] #18 0x0000564b169a81da in ExecProcNode
(node=0x564b183a6830) at ../src/include/executor/executor.h:273
[09:47:11.123] No locals.
[09:47:11.123] #19 ExecSort (pstate=0x564b183a6620) at
../src/backend/executor/nodeSort.c:149
[09:47:11.123] plannode = 0x564b182dfb78
[09:47:11.123] outerNode = 0x564b183a6830
[09:47:11.123] tupDesc = <optimized out>
[09:47:11.123] tuplesortopts = <optimized out>
[09:47:11.123] node = 0x564b183a6620
[09:47:11.123] estate = 0x564b182681d0
[09:47:11.123] dir = ForwardScanDirection
[09:47:11.123] tuplesortstate = 0x564b18207ff0
[09:47:11.123] slot = <optimized out>
[09:47:11.123] #20 0x0000564b169275b9 in ExecProcNodeFirst
(node=0x564b183a6620) at ../src/backend/executor/execProcnode.c:464
[09:47:11.123] No locals.
[09:47:11.123] #21 0x0000564b16913d01 in ExecProcNode
(node=0x564b183a6620) at ../src/include/executor/executor.h:273
[09:47:11.123] No locals.
[09:47:11.123] #22 ExecutePlan (estate=estate@entry=0x564b182681d0,
planstate=0x564b183a6620, use_parallel_mode=<optimized out>,
operation=operation@entry=CMD_SELECT,
sendTuples=sendTuples@entry=true, numberTuples=numberTuples@entry=0,
direction=ForwardScanDirection, dest=0x564b182e2728,
execute_once=true) at ../src/backend/executor/execMain.c:1670
[09:47:11.123] slot = <optimized out>
[09:47:11.123] current_tuple_count = 0
[09:47:11.123] #23 0x0000564b16914024 in standard_ExecutorRun
(queryDesc=0x564b181ba200, direction=ForwardScanDirection, count=0,
execute_once=<optimized out>) at
../src/backend/executor/execMain.c:365
[09:47:11.123] estate = 0x564b182681d0
[09:47:11.123] operation = CMD_SELECT
[09:47:11.123] dest = 0x564b182e2728
[09:47:11.123] sendTuples = true
[09:47:11.123] oldcontext = 0x564b181ba100
[09:47:11.123] __func__ = "standard_ExecutorRun"
[09:47:11.123] #24 0x0000564b1691418f in ExecutorRun
(queryDesc=queryDesc@entry=0x564b181ba200,
direction=direction@entry=ForwardScanDirection, count=count@entry=0,
execute_once=<optimized out>) at
../src/backend/executor/execMain.c:309
[09:47:11.123] No locals.
[09:47:11.123] #25 0x0000564b16d208af in PortalRunSelect
(portal=portal@entry=0x564b1817ae10, forward=forward@entry=true,
count=0, count@entry=9223372036854775807,
dest=dest@entry=0x564b182e2728) at ../src/backend/tcop/pquery.c:924
[09:47:11.123] queryDesc = 0x564b181ba200
[09:47:11.123] direction = <optimized out>
[09:47:11.123] nprocessed = <optimized out>
[09:47:11.123] __func__ = "PortalRunSelect"
[09:47:11.123] #26 0x0000564b16d2405b in PortalRun
(portal=portal@entry=0x564b1817ae10,
count=count@entry=9223372036854775807,
isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true,
dest=dest@entry=0x564b182e2728, altdest=altdest@entry=0x564b182e2728,
qc=0x7ffc110551e0) at ../src/backend/tcop/pquery.c:768
[09:47:11.123] _save_exception_stack = 0x7ffc11055290
[09:47:11.123] _save_context_stack = 0x0
[09:47:11.123] _local_sigjmp_buf = {{__jmpbuf = {1,
3431825231787999889, 94880528213800, 94880526221072, 94880526741008,
94880526221000, -3433879442176195951, -8991885832768699759},
__mask_was_saved = 0, __saved_mask = {__val = {140720594047343, 688,
94880526749216, 94880511205302, 1, 140720594047343, 94880526999808, 8,
94880526213088, 112, 179, 94880526221072, 94880526221048,
94880526221000, 94880508502828, 2}}}}
[09:47:11.123] _do_rethrow = <optimized out>
[09:47:11.123] result = <optimized out>
[09:47:11.123] nprocessed = <optimized out>
[09:47:11.123] saveTopTransactionResourceOwner = 0x564b18139ad8
[09:47:11.123] saveTopTransactionContext = 0x564b181229f0
[09:47:11.123] saveActivePortal = 0x0
[09:47:11.123] saveResourceOwner = 0x564b18139ad8
[09:47:11.123] savePortalContext = 0x0
[09:47:11.123] saveMemoryContext = 0x564b181229f0
[09:47:11.123] __func__ = "PortalRun"
[09:47:11.123] #27 0x0000564b16d1d098 in exec_simple_query
(query_string=query_string@entry=0x564b180fa0e0 "with recursive
search_graph(f, t, label) as (\n\tselect * from graph0 g\n\tunion
all\n\tselect g.*\n\tfrom graph0 g, search_graph sg\n\twhere g.f =
sg.t\n) search depth first by f, t set seq\nselect * from search"...)
at ../src/backend/tcop/postgres.c:1273
[09:47:11.123] cmdtaglen = 6
[09:47:11.123] snapshot_set = <optimized out>
[09:47:11.123] per_parsetree_context = 0x0
[09:47:11.123] plantree_list = 0x564b182e26d8
[09:47:11.123] parsetree = 0x564b180fbec8
[09:47:11.123] commandTag = <optimized out>
[09:47:11.123] qc = {commandTag = CMDTAG_UNKNOWN, nprocessed = 0}
[09:47:11.123] querytree_list = <optimized out>
[09:47:11.123] portal = 0x564b1817ae10
[09:47:11.123] receiver = 0x564b182e2728
[09:47:11.123] format = 0
[09:47:11.123] cmdtagname = <optimized out>
[09:47:11.123] parsetree_item__state = {l = <optimized out>, i
= <optimized out>}
[09:47:11.123] dest = DestRemote
[09:47:11.123] oldcontext = 0x564b181229f0
[09:47:11.123] parsetree_list = 0x564b180fbef8
[09:47:11.123] parsetree_item = 0x564b180fbf10
[09:47:11.123] save_log_statement_stats = false
[09:47:11.123] was_logged = false
[09:47:11.123] use_implicit_block = false
[09:47:11.123] msec_str =
"\004\000\000\000\000\000\000\000\346[\376\026KV\000\000pR\005\021\374\177\000\000\335\000\000\000\000\000\000"
[09:47:11.123] __func__ = "exec_simple_query"
[09:47:11.123] #28 0x0000564b16d1fe33 in PostgresMain
(dbname=<optimized out>, username=<optimized out>) at
../src/backend/tcop/postgres.c:4653
[09:47:11.123] query_string = 0x564b180fa0e0 "with recursive
search_graph(f, t, label) as (\n\tselect * from graph0 g\n\tunion
all\n\tselect g.*\n\tfrom graph0 g, search_graph sg\n\twhere g.f =
sg.t\n) search depth first by f, t set seq\nselect * from search"...
[09:47:11.123] firstchar = <optimized out>
[09:47:11.123] input_message = {data = 0x564b180fa0e0 "with
recursive search_graph(f, t, label) as (\n\tselect * from graph0
g\n\tunion all\n\tselect g.*\n\tfrom graph0 g, search_graph
sg\n\twhere g.f = sg.t\n) search depth first by f, t set seq\nselect *
from search"..., len = 221, maxlen = 1024, cursor = 221}
[09:47:11.123] local_sigjmp_buf = {{__jmpbuf =
{94880520543032, -8991887800862096751, 0, 4, 140720594048004, 1,
-3433879442247499119, -8991885813233991023}, __mask_was_saved = 1,
__saved_mask = {__val = {4194304, 1, 140346553036196, 94880526187920,
15616, 15680, 94880508418872, 0, 94880526187920, 15616,
94880520537224, 4, 140720594048004, 1, 94880508502235, 1}}}}
[09:47:11.123] send_ready_for_query = false
[09:47:11.123] idle_in_transaction_timeout_enabled = false
[09:47:11.123] idle_session_timeout_enabled = false
[09:47:11.123] __func__ = "PostgresMain"
[09:47:11.123] #29 0x0000564b16bcc4e4 in BackendRun
(port=port@entry=0x564b18126f50) at
../src/backend/postmaster/postmaster.c:4464
[1]: https://cirrus-ci.com/task/4765094966460416
Regards,
Vignesh
Hi,
vignesh C <vignesh21@gmail.com> writes:
On Mon, 1 Jan 2024 at 19:26, Andy Fan <zhihuifan1213@163.com> wrote:
Andy Fan <zhihuifan1213@163.com> writes:
Some Known issues:
------------------
1. Currently only Scan & Join nodes are considered for this feature.
2. JIT is not adapted for this purpose yet.
JIT is adapted for this feature in v2. Any feedback is welcome.
One of the tests was aborted at CFBOT [1] with:
[09:47:00.735] dumping /tmp/cores/postgres-11-28182.core for
/tmp/cirrus-ci-build/build/tmp_install//usr/local/pgsql/bin/postgres
[09:47:01.035] [New LWP 28182]
There was a bug in the JIT part; here is the fix. Thanks for taking care
of this!
--
Best Regards
Andy Fan
Attachments:
v3-0001-shared-detoast-feature.patch (text/x-diff, +1119 -109)
Andy Fan <zhihuifan1213@163.com> writes:
One of the tests was aborted at CFBOT [1] with:
[09:47:00.735] dumping /tmp/cores/postgres-11-28182.core for
/tmp/cirrus-ci-build/build/tmp_install//usr/local/pgsql/bin/postgres
[09:47:01.035] [New LWP 28182]
There was a bug in the JIT part; here is the fix. Thanks for taking care
of this!
Fixed a GCC warning in cirrus-ci; hope everything is fine now.
--
Best Regards
Andy Fan
Attachments:
v4-0001-shared-detoast-feature.patch (text/x-diff, +1114 -107)
2024-01 Commitfest.
Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.
======
[1]: https://commitfest.postgresql.org/46/4759/
[2]: https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4759
Kind Regards,
Peter Smith.
Hi,
Peter Smith <smithpb2250@gmail.com> writes:
2024-01 Commitfest.
Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.
======
[1] https://commitfest.postgresql.org/46/4759/
[2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4759
v5 attached; it should fix the above issue. This version also introduces
a data structure called Bitset, which has APIs similar to Bitmapset but
with the ability to reset all bits without recycling its allocated
memory; this is important for this feature.
commit 44754fb03accb0dec9710a962a334ee73eba3c49 (HEAD -> shared_detoast_value_v2)
Author: yizhi.fzh <yizhi.fzh@alibaba-inc.com>
Date: Tue Jan 23 13:38:34 2024 +0800
shared detoast feature.
commit 14a6eafef9ff4926b8b877d694de476657deee8a
Author: yizhi.fzh <yizhi.fzh@alibaba-inc.com>
Date: Mon Jan 22 15:48:33 2024 +0800
Introduce a Bitset data struct.
While Bitmapset is designed for a variable length of bits, Bitset is
designed for a fixed length of bits; the fixed length must be specified
at the bitset_init stage and kept unchanged over the whole lifespan.
Because of this, some operations on Bitset are simpler than on
Bitmapset.
bitset_clear unsets all the bits but keeps the allocated memory; this
capability is impossible for Bitmapset for some solid reasons, and it
is the main reason to add this data structure.
Also, for performance, the Bitset functions removed some unlikely
checks, replacing them with Asserts.
[1]: /messages/by-id/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ+ZYA@mail.gmail.com
[2]: /messages/by-id/875xzqxbv5.fsf@163.com
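Hypothetically, the described Bitset semantics - fixed length chosen at
init, and a clear that keeps the allocation - could be sketched like
this (word layout and names are illustrative only, not the patch's
actual implementation):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy fixed-length bitset; the length is fixed at init time. */
typedef struct Bitset
{
    int       nwords;
    unsigned  words[];            /* flexible array, sized at init */
} Bitset;

static Bitset *
bitset_init(int nbits)
{
    int nwords = (nbits + 31) / 32;
    Bitset *bs = calloc(1, sizeof(Bitset) + nwords * sizeof(unsigned));
    bs->nwords = nwords;
    return bs;
}

static void
bitset_add_member(Bitset *bs, int bit)
{
    bs->words[bit / 32] |= 1u << (bit % 32);
}

static int
bitset_is_member(const Bitset *bs, int bit)
{
    return (bs->words[bit / 32] >> (bit % 32)) & 1u;
}

/* The key operation: bits go away, the allocation stays, so the set
 * can be reused for every tuple without fresh allocations. */
static void
bitset_clear(Bitset *bs)
{
    memset(bs->words, 0, bs->nwords * sizeof(unsigned));
}
```

This is what makes per-tuple reuse cheap compared with freeing and
rebuilding a Bitmapset each time.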
I didn't write a good commit message for commit 2; people who are
interested in this can see the first message in this thread for the
explanation.
I think anyone whose customers use lots of jsonb can probably get
benefits from this. The precondition is that the toast value is
accessed more than once, including by the jsonb_out function. I think
that would not be rare.
--
Best Regards
Andy Fan
Hi Andy,
It looks like v5 is missing in your mail. Could you please check and
resend it?
Thanks,
Michael.
On 1/23/24 08:44, Andy Fan wrote:
Hi,
Peter Smith<smithpb2250@gmail.com> writes:
2024-01 Commitfest.
Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.
======
[1] https://commitfest.postgresql.org/46/4759/
[2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4759
v5 attached; it should fix the above issue. This version also introduces
a data structure called Bitset, which has APIs similar to Bitmapset but
with the ability to reset all bits without recycling its allocated
memory; this is important for this feature.
commit 44754fb03accb0dec9710a962a334ee73eba3c49 (HEAD -> shared_detoast_value_v2)
Author: yizhi.fzh <yizhi.fzh@alibaba-inc.com>
Date: Tue Jan 23 13:38:34 2024 +0800
shared detoast feature.
commit 14a6eafef9ff4926b8b877d694de476657deee8a
Author: yizhi.fzh <yizhi.fzh@alibaba-inc.com>
Date: Mon Jan 22 15:48:33 2024 +0800
Introduce a Bitset data struct.
While Bitmapset is designed for a variable length of bits, Bitset is
designed for a fixed length of bits; the fixed length must be specified
at the bitset_init stage and kept unchanged over the whole lifespan.
Because of this, some operations on Bitset are simpler than on
Bitmapset.
bitset_clear unsets all the bits but keeps the allocated memory; this
capability is impossible for Bitmapset for some solid reasons, and it
is the main reason to add this data structure.
Also, for performance, the Bitset functions removed some unlikely
checks, replacing them with Asserts.
[1] /messages/by-id/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ+ZYA@mail.gmail.com
[2] /messages/by-id/875xzqxbv5.fsf@163.com
I didn't write a good commit message for commit 2; people who are
interested in this can see the first message in this thread for the
explanation.
I think anyone whose customers use lots of jsonb can probably get
benefits from this. The precondition is that the toast value is
accessed more than once, including by the jsonb_out function. I think
that would not be rare.
--
Michael Zhilin
Postgres Professional
+7(925)3366270
https://www.postgrespro.ru
Michael Zhilin <m.zhilin@postgrespro.ru> writes:
Hi Andy,
It looks like v5 is missing in your mail. Could you please check and resend it?
ha, yes.. v5 is really attached this time.
commit eee0b2058f912d0d56282711c5d88bc0b1b75c2f (HEAD -> shared_detoast_value_v3)
Author: yizhi.fzh <yizhi.fzh@alibaba-inc.com>
Date: Tue Jan 23 13:38:34 2024 +0800
shared detoast feature.
details at /messages/by-id/87il4jrk1l.fsf@163.com
commit eeca405f5ae87e7d4e5496de989ac7b5173bcaa9
Author: yizhi.fzh <yizhi.fzh@alibaba-inc.com>
Date: Mon Jan 22 15:48:33 2024 +0800
Introduce a Bitset data struct.
While Bitmapset is designed for a variable length of bits, Bitset is
designed for a fixed length of bits; the fixed length must be specified
at the bitset_init stage and kept unchanged over the whole lifespan.
Because of this, some operations on Bitset are simpler than on
Bitmapset.
bitset_clear unsets all the bits but keeps the allocated memory; this
capability is impossible for Bitmapset for some solid reasons, and it
is the main reason to add this data structure.
Also, for performance, the Bitset functions removed some unlikely
checks, replacing them with Asserts.
[1]: /messages/by-id/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ+ZYA@mail.gmail.com
[2]: /messages/by-id/875xzqxbv5.fsf@163.com
As for the commit "Introduce a Bitset data struct.", the test coverage
is 100% now, so it would be great if people could review this first.
--
Best Regards
Andy Fan
Hi,
I did another round of self-review. Comments, variable names, and the
order of function definitions are improved so that it reads as
smoothly as possible. v6 attached.
--
Best Regards
Andy Fan
Hi,
I took a quick look on this thread/patch today, so let me share a couple
initial thoughts. I may not have a particularly coherent/consistent
opinion on the patch or what would be a better way to do this yet, but
perhaps it'll start a discussion ...
The goal of the patch (as I understand it) is essentially to cache
detoasted values, so that the value does not need to be detoasted
repeatedly in different parts of the plan. I think that's perfectly
sensible and worthwhile goal - detoasting is not cheap, and complex
plans may easily spend a lot of time on it.
That being said, the approach seems somewhat invasive, and touching
parts I wouldn't have expected to need a change to implement this. For
example, I certainly would not have guessed the patch to need changes in
createplan.c or setrefs.c.
Perhaps it really needs to do these things, but neither the thread nor
the comments are very enlightening as for why it's needed :-( In many
cases I can guess, but I'm not sure my guess is correct. And comments in
code generally describe what's happening locally / next line, not the
bigger picture & why it's happening.
IIUC we walk the plan to decide which Vars should be detoasted (and
cached) once, and which cases should not do that because it'd inflate
the amount of data we need to keep in a Sort, Hash etc. Not sure if
there's a better way to do this - it depends on what happens in the
upper parts of the plan, so we can't decide while building the paths.
But maybe we could decide this while transforming the paths into a plan?
(I realize the JIT thread nearby needs to do something like that in
create_plan, and in that one I suggested maybe walking the plan would be
a better approach, so I may be contradicting myself a little bit.).
In any case, set_plan_forbid_pre_detoast_vars_recurse should probably
explain the overall strategy / reasoning in a bit more detail. Maybe
it's somewhere in this thread, but that's not great for reviewers.
Similar for the setrefs.c changes. It seems a bit suspicious to piggy
back the new code into fix_scan_expr/fix_scan_list and similar code.
Those functions have a pretty clearly defined purpose; I'm not sure we
want to extend them to also deal with this new thing. (FWIW I'd 100%
have done it this way if I hacked on a PoC of this, to make it work.
But I'm not sure it's the right solution.)
I don't know what to think about the Bitset - maybe it's necessary, but
how would I know? I don't have any way to measure the benefits, because
the 0002 patch uses it right away. I think it should be done the other
way around, i.e. the patch should introduce the main feature first
(using the traditional Bitmapset), and then add Bitset on top of that.
That way we could easily measure the impact and see if it's useful.
On the whole, my biggest concern is memory usage & leaks. It's not
difficult to already have problems with large detoasted values, and if
we start keeping more of them, that may get worse. Or at least that's my
intuition - it can't really get better by keeping the values longer, right?
The other thing is the risk of leaks (in the sense of keeping detoasted
values longer than expected). I see the values are allocated in
tts_mcxt, and maybe that's the right solution - not sure.
FWIW while looking at the patch, I couldn't help but to think about
expanded datums. There's similarity in what these two features do - keep
detoasted values for a while, so that we don't need to do the expensive
processing if we access them repeatedly. Of course, expanded datums are
not meant to be long-lived, while "shared detoasted values" are meant to
exist (potentially) for the query duration. But maybe there's something
we could learn from expanded datums? For example how the varlena pointer
is leveraged to point to the expanded object.
For example, what if we add a "TOAST cache" as a query-level hash table,
and modify the detoasting to first check the hash table (with the TOAST
pointer as a key)? It'd be fairly trivial to enforce a memory limit on
the hash table, evict values from it, etc. And it wouldn't require any
of the createplan/setrefs changes, I think ...
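As a toy model of that suggestion (all names invented; a real version
would use dynahash/simplehash and a proper eviction policy), the
lookup-before-detoast idea might look like:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_SLOTS 4

/* Toy query-level cache keyed by the TOAST pointer's identity. */
typedef struct ToastCacheEntry
{
    const void *key;              /* stands in for the TOAST pointer */
    char       *value;            /* the detoasted copy */
} ToastCacheEntry;

typedef struct ToastCache
{
    ToastCacheEntry entries[CACHE_SLOTS];
    int             nused;
    int             detoast_calls;
} ToastCache;

/* Check the cache first; only fall back to the "expensive" detoast
 * on a miss.  A crude FIFO eviction enforces the size cap. */
static const char *
cache_detoast(ToastCache *cache, const void *key, const char *raw)
{
    for (int i = 0; i < cache->nused; i++)
        if (cache->entries[i].key == key)
            return cache->entries[i].value;     /* cache hit */

    if (cache->nused == CACHE_SLOTS)            /* cap reached: evict */
    {
        free(cache->entries[0].value);
        memmove(&cache->entries[0], &cache->entries[1],
                (CACHE_SLOTS - 1) * sizeof(ToastCacheEntry));
        cache->nused--;
    }
    cache->detoast_calls++;                     /* the expensive path */
    cache->entries[cache->nused].key = key;
    cache->entries[cache->nused].value = malloc(strlen(raw) + 1);
    strcpy(cache->entries[cache->nused].value, raw); /* pretend detoast */
    return cache->entries[cache->nused++].value;
}
```

Compared with the slot-based approach, the memory cap and eviction are
easy to enforce here, at the cost of a lookup on every access.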
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
Hi Tomas,
I took a quick look on this thread/patch today, so let me share a couple
initial thoughts. I may not have a particularly coherent/consistent
opinion on the patch or what would be a better way to do this yet, but
perhaps it'll start a discussion ...
Thank you for this!
The goal of the patch (as I understand it) is essentially to cache
detoasted values, so that the value does not need to be detoasted
repeatedly in different parts of the plan. I think that's perfectly
sensible and worthwhile goal - detoasting is not cheap, and complex
plans may easily spend a lot of time on it.
exactly.
That being said, the approach seems somewhat invasive, and touching
parts I wouldn't have expected to need a change to implement this. For
example, I certainly would not have guessed the patch to need changes in
createplan.c or setrefs.c.
Perhaps it really needs to do these things, but neither the thread nor
the comments are very enlightening as for why it's needed :-( In many
cases I can guess, but I'm not sure my guess is correct. And comments in
code generally describe what's happening locally / next line, not the
bigger picture & why it's happening.
There was an explanation at [1], but it probably is too high level.
Writing proper comments is challenging for me, but I am happy to try
more. At the end of this message I explain the data workflow; I feel
that will be useful for reviewers.
IIUC we walk the plan to decide which Vars should be detoasted (and
cached) once, and which cases should not do that because it'd inflate
the amount of data we need to keep in a Sort, Hash etc.
Exactly.
Not sure if
there's a better way to do this - it depends on what happens in the
upper parts of the plan, so we can't decide while building the paths.
I'd say I did this intentionally. Deciding such things in paths would be
more expensive than doing it at the create_plan stage, IMO.
But maybe we could decide this while transforming the paths into a plan?
(I realize the JIT thread nearby needs to do something like that in
create_plan, and in that one I suggested maybe walking the plan would be
a better approach, so I may be contradicting myself a little bit.).
I think that's pretty similar to what I'm doing now, except that I do it
*just after* create_plan. This is because create_plan doesn't always
transform the path to a plan in a top-down manner; the known exception
is create_mergejoin_plan, so I have to walk the tree just after
create_plan is done.
In the create_mergejoin_plan, the Sort node is created *after* the
subplan for the Sort is created.
/* Recursively process the path tree, demanding the correct tlist result */
plan = create_plan_recurse(root, best_path, CP_EXACT_TLIST);
+ /*
+ * After the plan tree is completely built, we start to walk it to decide
+ * which expressions should not use the shared-detoast feature.
+ */
+ set_plan_forbid_pre_detoast_vars_recurse(plan, NIL);
In any case, set_plan_forbid_pre_detoast_vars_recurse should probably
explain the overall strategy / reasoning in a bit more detail. Maybe
it's somewhere in this thread, but that's not great for reviewers.
A lesson learnt, thanks.
Here is a revised version of the comments from the latest patch:
/*
* set_plan_forbid_pre_detoast_vars_recurse
* Walk the Plan tree top-down to gather the Vars which should stay
* as small as possible and record them in Plan.forbid_pre_detoast_vars.
*
* plan: the plan node to walk right now.
* small_tlist: a list of nodes whose subplan should provide them as
* small as possible.
*/
static void
set_plan_forbid_pre_detoast_vars_recurse(Plan *plan, List *small_tlist)
Similar for the setrefs.c changes. It seems a bit suspicious to piggy
back the new code into fix_scan_expr/fix_scan_list and similar code.
Those functions have a pretty clearly defined purpose, not sure we want
to also extend them to deal with this new thing. (FWIW I'd 100%
have done it this way if I hacked on a PoC of this, to make it work. But I'm
not sure it's the right solution.)
The main reason for doing so is that I want to share the same walking
effort as fix_scan_expr; otherwise I would have to walk the plan for
every expression again. I thought of this as a best practice in the past
and thought we could treat the pre_detoast_attrs as a valuable side
effect :(
I don't know what to think about the Bitset - maybe it's necessary, but
how would I know? I don't have any way to measure the benefits, because
the 0002 patch uses it right away.
Here is a revised version of the comments from the latest patch; the
second paragraph explains this decision.
/*
* The attributes whose values in tts_values[*] are the detoasted version;
* that memory needs some extra clean-up. It can't be put into
* ecxt_per_tuple_memory since much of it needs a longer life span,
* for example the Datums in an outer join. The memory is put into
* TupleTableSlot.tts_mcxt and cleared whenever the tts_values[*] are
* invalidated.
*
* Bitset rather than Bitmapset is chosen here because when all the members
* of a Bitmapset are deleted, the allocated memory is deallocated
* automatically, which is too expensive in this case since we need to
* delete all the members in each ExecClearTuple and repopulate them again
* when filling the detoast datum into tts_values[*]. This happens
* again and again in an execution cycle.
*
* These values are populated by the EEOP_{INNER/OUTER/SCAN}_VAR_TOAST steps.
*/
Bitset *pre_detoasted_attrs;
I think it should be done the other
way around, i.e. the patch should introduce the main feature first
(using the traditional Bitmapset), and then add Bitset on top of that.
That way we could easily measure the impact and see if it's useful.
Actually v4 used Bitmapset, and then both perf and pgbench's tps
indicated it is too expensive; after talking with David at [2], I
introduced Bitset and used it here. The test case I used comes from [1].
IIRC, there was a 5% performance difference because of this.
create table w(a int, b numeric);
insert into w select i, i from generate_series(1, 1000000)i;
select b from w where b > 0;
To reproduce the difference, we can replace the bitset_clear() with
bitset_free(slot->pre_detoasted_attrs);
slot->pre_detoasted_attrs = bitset_init(slot->tts_tupleDescriptor->natts);
in ExecFreePreDetoastDatum; then it works the same as Bitmapset.
On the whole, my biggest concern is memory usage & leaks. It's not
difficult to already have problems with large detoasted values, and if
we start keeping more of them, that may get worse. Or at least that's my
intuition - it can't really get better by keeping the values longer, right?
The other thing is the risk of leaks (in the sense of keeping detoasted
values longer than expected). I see the values are allocated in
tts_mcxt, and maybe that's the right solution - not sure.
About the memory usage: first, it has the same lifespan as the
tts_values[*], which can be released pretty quickly, as soon as the
values of the tuple are no longer needed. It is true that we keep the
detoast version longer than before, but that's something we have to pay,
I think.
Leaks may happen since tts_mcxt is reset at the end of the *executor*. So if
we forget to release the memory when the tts_values[*] are invalidated
somehow, the memory will be leaked until the end of the executor. I think
that would be enough to cause an issue. Currently, besides releasing such
memory in ExecClearTuple, I also release it whenever we set
tts_nvalid to 0. The theory used here is:
/*
* tts_values is treated as invalidated once tts_nvalid is set to 0, so
* let's free the pre-detoast datums.
*/
ExecFreePreDetoastDatum(slot);
I will do more testing on the memory leak aspect. Since there are so many
operations against the slot, like ExecCopySlot etc., I don't know how to
test it fully; the method in my mind now is to use TPC-H with a 10GB data
size and monitor the query's runtime memory usage.
FWIW while looking at the patch, I couldn't help but to think about
expanded datums. There's similarity in what these two features do - keep
detoasted values for a while, so that we don't need to do the expensive
processing if we access them repeatedly.
Could you provide some keywords or function names for the expanded datum
here? I probably missed this.
Of course, expanded datums are
not meant to be long-lived, while "shared detoasted values" are meant to
exist (potentially) for the query duration.
Hmm, actually the "shared detoast value" just lives in
TupleTableSlot->tts_values[*], rather than for the whole query duration.
The simple case is:
SELECT * FROM t WHERE a_text LIKE 'abc%';
when we scan to the next tuple, the detoast value for the previous tuple
will be released.
But maybe there's something
we could learn from expanded datums? For example how the varlena pointer
is leveraged to point to the expanded object.
Maybe. Currently I just use detoast_attr to get the desired version; I'd
be pleased if we have a more effective way.
if (!slot->tts_isnull[attnum] &&
VARATT_IS_EXTENDED(slot->tts_values[attnum]))
{
Datum oldDatum;
MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);
oldDatum = slot->tts_values[attnum];
slot->tts_values[attnum] = PointerGetDatum(detoast_attr( (struct varlena *) oldDatum));
Assert(slot->tts_nvalid > attnum);
Assert(oldDatum != slot->tts_values[attnum]);
bitset_add_member(slot->pre_detoasted_attrs, attnum);
MemoryContextSwitchTo(old);
}
For example, what if we add a "TOAST cache" as a query-level hash table,
and modify the detoasting to first check the hash table (with the TOAST
pointer as a key)? It'd be fairly trivial to enforce a memory limit on
the hash table, evict values from it, etc. And it wouldn't require any
of the createplan/setrefs changes, I think ...
Hmm, I am not sure I understand you correctly on this part. In the
current patch, to avoid a run-time (ExecExprInterp) check of whether we
should detoast and save the datum, I defined 3 extra steps so that
the *extra check itself* is not needed for unnecessary attributes,
for example a datum of int type, or a detoast datum which should not be
saved back to tts_values[*] for the small_tlist reason. However, whether
these steps are generated is based on the output of the createplan/setrefs
changes. Take the INNER_VAR for example:
In ExecInitExprRec:
switch (variable->varno)
{
case INNER_VAR:
if (is_join_plan(plan) &&
bms_is_member(attnum,
((JoinState *) state->parent)->inner_pre_detoast_attrs))
{
scratch.opcode = EEOP_INNER_VAR_TOAST;
}
else
{
scratch.opcode = EEOP_INNER_VAR;
}
}
The data workflow is:
1. set_plan_forbid_pre_detoast_vars_recurse (in createplan.c)
decides which Vars should *not* be pre-detoasted for the small_tlist
reason and records them in Plan.forbid_pre_detoast_vars.
2. fix_scan_expr (in setrefs.c) tracks which Vars should be
detoasted for the specific plan node and records them in it. Currently
only Scan and Join nodes support this feature.
typedef struct Scan
{
...
/*
* Records the Var's varattno - 1 when the Var is accessed indirectly by
* an expression, like a > 3. However "a IS [NOT] NULL" is not included
* since it doesn't access tts_values[*] at all.
*
* This is essential information for figuring out which attrs should use
* the pre-detoast-attrs logic.
*/
Bitmapset *reference_attrs;
} Scan;
typedef struct Join
{
..
/*
* Records the Var's varattno - 1 when the Var is accessed indirectly by
* an expression, like a > 3. However "a IS [NOT] NULL" is not included
* since it doesn't access tts_values[*] at all.
*
* This is essential information for figuring out which attrs should use
* the pre-detoast-attrs logic.
*/
Bitmapset *outer_reference_attrs;
Bitmapset *inner_reference_attrs;
} Join;
3. During the InitPlan stage, we maintain
PlanState.xxx_pre_detoast_attrs and generate different StepOps for them.
4. At the ExecExprInterp stage, only the new StepOps do the extra check
to see if the detoast should happen; the other steps don't need this
check at all.
If we avoid the createplan/setrefs.c changes, probably some unrelated
StepOps would need the extra check as well?
When I worked on the UniqueKey feature, I maintained a
UniqueKey.README to summarize all the topics discussed in the threads;
the README is designed to save effort for reviewers. I think I
should apply the same logic for this feature.
Thank you very much for your feedback!
v7 attached, just some comments and Assert changes.
[1]: /messages/by-id/87il4jrk1l.fsf@163.com
[2]: /messages/by-id/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ+ZYA@mail.gmail.com
--
Best Regards
Andy Fan
On 2/20/24 19:38, Andy Fan wrote:
...
I think it should be done the other
way around, i.e. the patch should introduce the main feature first
(using the traditional Bitmapset), and then add Bitset on top of that.
That way we could easily measure the impact and see if it's useful.
Actually v4 used Bitmapset, and then both perf and pgbench's tps
indicated it is too expensive; after talking with David at [2], I
introduced Bitset and used it here. The test case I used comes from [1].
IIRC, there was a 5% performance difference because of this.
create table w(a int, b numeric);
insert into w select i, i from generate_series(1, 1000000)i;
select b from w where b > 0;
To reproduce the difference, we can replace the bitset_clear() with
bitset_free(slot->pre_detoasted_attrs);
slot->pre_detoasted_attrs = bitset_init(slot->tts_tupleDescriptor->natts);
in ExecFreePreDetoastDatum; then it works the same as Bitmapset.
I understand the bitset was not introduced until v5, after noticing the
bitmapset is not quite efficient. But I still think the patches should
be the other way around, i.e. the main feature first, then the bitset as
an optimization.
That allows everyone to observe the improvement on their own (without
having to tweak the code), and it also doesn't require commit of the
bitset part before it gets actually used by anything.
On the whole, my biggest concern is memory usage & leaks. It's not
difficult to already have problems with large detoasted values, and if
we start keeping more of them, that may get worse. Or at least that's my
intuition - it can't really get better by keeping the values longer, right?
The other thing is the risk of leaks (in the sense of keeping detoasted
values longer than expected). I see the values are allocated in
tts_mcxt, and maybe that's the right solution - not sure.
About the memory usage: first, it has the same lifespan as the
tts_values[*], which can be released pretty quickly, as soon as the
values of the tuple are no longer needed. It is true that we keep the
detoast version longer than before, but that's something we have to pay,
I think.
Leaks may happen since tts_mcxt is reset at the end of the *executor*. So if
we forget to release the memory when the tts_values[*] are invalidated
somehow, the memory will be leaked until the end of the executor. I think
that would be enough to cause an issue. Currently, besides releasing such
memory in ExecClearTuple, I also release it whenever we set
tts_nvalid to 0. The theory used here is:
/*
* tts_values is treated as invalidated once tts_nvalid is set to 0, so
* let's free the pre-detoast datums.
*/
ExecFreePreDetoastDatum(slot);
I will do more testing on the memory leak aspect. Since there are so many
operations against the slot, like ExecCopySlot etc., I don't know how to
test it fully; the method in my mind now is to use TPC-H with a 10GB data
size and monitor the query's runtime memory usage.
I think this is exactly the "high level design" description that should
be in a comment, somewhere.
FWIW while looking at the patch, I couldn't help but to think about
expanded datums. There's similarity in what these two features do - keep
detoasted values for a while, so that we don't need to do the expensive
processing if we access them repeatedly.
Could you provide some keywords or function names for the expanded datum
here? I probably missed this.
see src/include/utils/expandeddatum.h
Of course, expanded datums are
not meant to be long-lived, while "shared detoasted values" are meant to
exist (potentially) for the query duration.
Hmm, actually the "shared detoast value" just lives in
TupleTableSlot->tts_values[*], rather than for the whole query duration.
The simple case is:
SELECT * FROM t WHERE a_text LIKE 'abc%';
when we scan to the next tuple, the detoast value for the previous tuple
will be released.
But if the (detoasted) value is passed to the next executor node, it'll
be kept, right?
But maybe there's something
we could learn from expanded datums? For example how the varlena pointer
is leveraged to point to the expanded object.
Maybe. Currently I just use detoast_attr to get the desired version; I'd
be pleased if we have a more effective way.
if (!slot->tts_isnull[attnum] &&
VARATT_IS_EXTENDED(slot->tts_values[attnum]))
{
Datum oldDatum;
MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);
oldDatum = slot->tts_values[attnum];
slot->tts_values[attnum] = PointerGetDatum(detoast_attr( (struct varlena *) oldDatum));
Assert(slot->tts_nvalid > attnum);
Assert(oldDatum != slot->tts_values[attnum]);
bitset_add_member(slot->pre_detoasted_attrs, attnum);
MemoryContextSwitchTo(old);
}
Right. FWIW I'm not sure if expanded objects are similar enough to be a
useful inspiration.
Unrelated question - could this actually end up being too aggressive?
That is, could it detoast attributes that we end up not needing? For
example, what if the tuple gets eliminated for some reason (e.g. because
of a restriction on the table, or not having a match in a join)? Won't
we detoast the tuple only to throw it away?
For example, what if we add a "TOAST cache" as a query-level hash table,
and modify the detoasting to first check the hash table (with the TOAST
pointer as a key)? It'd be fairly trivial to enforce a memory limit on
the hash table, evict values from it, etc. And it wouldn't require any
of the createplan/setrefs changes, I think ...
Hmm, I am not sure I understand you correctly on this part. In the
current patch, to avoid a run-time (ExecExprInterp) check of whether we
should detoast and save the datum, I defined 3 extra steps so that
the *extra check itself* is not needed for unnecessary attributes,
for example a datum of int type, or a detoast datum which should not be
saved back to tts_values[*] for the small_tlist reason. However, whether
these steps are generated is based on the output of the createplan/setrefs
changes. Take the INNER_VAR for example:
In ExecInitExprRec:
switch (variable->varno)
{
case INNER_VAR:
if (is_join_plan(plan) &&
bms_is_member(attnum,
((JoinState *) state->parent)->inner_pre_detoast_attrs))
{
scratch.opcode = EEOP_INNER_VAR_TOAST;
}
else
{
scratch.opcode = EEOP_INNER_VAR;
}
}
The data workflow is:
1. set_plan_forbid_pre_detoast_vars_recurse (in createplan.c)
decides which Vars should *not* be pre-detoasted for the small_tlist
reason and records them in Plan.forbid_pre_detoast_vars.
2. fix_scan_expr (in setrefs.c) tracks which Vars should be
detoasted for the specific plan node and records them in it. Currently
only Scan and Join nodes support this feature.
typedef struct Scan
{
...
/*
* Records the Var's varattno - 1 when the Var is accessed indirectly by
* an expression, like a > 3. However "a IS [NOT] NULL" is not included
* since it doesn't access tts_values[*] at all.
*
* This is essential information for figuring out which attrs should use
* the pre-detoast-attrs logic.
*/
Bitmapset *reference_attrs;
} Scan;
typedef struct Join
{
..
/*
* Records the Var's varattno - 1 when the Var is accessed indirectly by
* an expression, like a > 3. However "a IS [NOT] NULL" is not included
* since it doesn't access tts_values[*] at all.
*
* This is essential information for figuring out which attrs should use
* the pre-detoast-attrs logic.
*/
Bitmapset *outer_reference_attrs;
Bitmapset *inner_reference_attrs;
} Join;
Is it actually necessary to add new fields to these nodes? Also, the
names are not particularly descriptive of the purpose - it'd be better
to have "detoast" in the name, instead of generic "reference".
3. During the InitPlan stage, we maintain
PlanState.xxx_pre_detoast_attrs and generate different StepOps for them.
4. At the ExecExprInterp stage, only the new StepOps do the extra check
to see if the detoast should happen; the other steps don't need this
check at all.
If we avoid the createplan/setrefs.c changes, probably some unrelated
StepOps would need the extra check as well?
When I worked on the UniqueKey feature, I maintained a
UniqueKey.README to summarize all the topics discussed in the threads;
the README is designed to save effort for reviewers. I think I
should apply the same logic for this feature.
Good idea. Either that (a separate README), or a comment in a header of
some suitable .c/.h file (I prefer that, because that's kinda obvious
when reading the code; I often don't notice a README exists next to it).
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I saw your reply when I started to write a more high-level
document. Thanks for the step-by-step help!
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
On 2/20/24 19:38, Andy Fan wrote:
...
I think it should be done the other
way around, i.e. the patch should introduce the main feature first
(using the traditional Bitmapset), and then add Bitset on top of that.
That way we could easily measure the impact and see if it's useful.
Actually v4 used Bitmapset, and then both perf and pgbench's tps
indicated it is too expensive; after talking with David at [2], I
introduced Bitset and used it here. The test case I used comes from [1].
IIRC, there was a 5% performance difference because of this.
create table w(a int, b numeric);
insert into w select i, i from generate_series(1, 1000000)i;
select b from w where b > 0;
To reproduce the difference, we can replace the bitset_clear() with
bitset_free(slot->pre_detoasted_attrs);
slot->pre_detoasted_attrs = bitset_init(slot->tts_tupleDescriptor->natts);
in ExecFreePreDetoastDatum; then it works the same as Bitmapset.
I understand the bitset was not introduced until v5, after noticing the
bitmapset is not quite efficient. But I still think the patches should
be the other way around, i.e. the main feature first, then the bitset as
an optimization.
That allows everyone to observe the improvement on their own (without
having to tweak the code), and it also doesn't require commit of the
bitset part before it gets actually used by anything.
I'm starting to think this is the better way rather than the opposite. The
next version will be:
0001: shared detoast datum feature with a high-level introduction.
0002: introduce Bitset and use it in the shared-detoast-datum feature, with
the test case to show the improvement.
I will do more testing on the memory leak aspect. Since there are so many
operations against the slot, like ExecCopySlot etc., I don't know how to
test it fully; the method in my mind now is to use TPC-H with a 10GB data
size and monitor the query's runtime memory usage.
I think this is exactly the "high level design" description that should
be in a comment, somewhere.
Got it.
Of course, expanded datums are
not meant to be long-lived, while "shared detoasted values" are meant to
exist (potentially) for the query duration.
Hmm, actually the "shared detoast value" just lives in
TupleTableSlot->tts_values[*], rather than for the whole query duration.
The simple case is:
SELECT * FROM t WHERE a_text LIKE 'abc%';
when we scan to the next tuple, the detoast value for the previous tuple
will be released.
But if the (detoasted) value is passed to the next executor node, it'll
be kept, right?
Yes, and only one copy exists for all the executor nodes.
Unrelated question - could this actually end up being too aggressive?
That is, could it detoast attributes that we end up not needing? For
example, what if the tuple gets eliminated for some reason (e.g. because
of a restriction on the table, or not having a match in a join)? Won't
we detoast the tuple only to throw it away?
The detoast datum has exactly the same lifespan as the other
tts_values[*]. If the tuple gets eliminated for any reason, those detoast
datums still exist until the slot is cleared to store the next tuple.
typedef struct Join
{
..
/*
* Records the Var's varattno - 1 when the Var is accessed indirectly by
* an expression, like a > 3. However "a IS [NOT] NULL" is not included
* since it doesn't access tts_values[*] at all.
*
* This is essential information for figuring out which attrs should use
* the pre-detoast-attrs logic.
*/
Bitmapset *outer_reference_attrs;
Bitmapset *inner_reference_attrs;
} Join;
Is it actually necessary to add new fields to these nodes? Also, the
names are not particularly descriptive of the purpose - it'd be better
to have "detoast" in the name, instead of generic "reference".
Because of the way the data is transformed, I think we must add the
fields to keep such information; it is then used to initialize the
necessary information in PlanState. Maybe I have a fixed mindset, but
I can't think of a way to avoid that right now.
I used 'reference' rather than 'detoast' because of some implementation
issues. In createplan.c and setrefs.c, I can't check the atttyplen
effectively, so even a Var of int type is still held here, which may
have nothing to do with detoasting.
When I worked on the UniqueKey feature, I maintained a
UniqueKey.README to summarize all the topics discussed in the threads;
the README is designed to save effort for reviewers. I think I
should apply the same logic for this feature.
Good idea. Either that (a separate README), or a comment in a header of
some suitable .c/.h file (I prefer that, because that's kinda obvious
when reading the code; I often don't notice a README exists next to it).
Great, I'll try this starting tomorrow.
--
Best Regards
Andy Fan
Hi,
Good idea. Either that (a separate README), or a comment in a header of
some suitable .c/.h file (I prefer that, because that's kinda obvious
when reading the code; I often don't notice a README exists next to it).
Great, I'll try this starting tomorrow.
I have made it. I chose a README because this feature changes
createplan.c, setrefs.c, execExpr.c and execExprInterp.c, so putting the
high-level design into any one of them looks inappropriate. So the
high-level design is here, and the detailed design for each step is in
the comments around the code. Hope this is helpful!
The problem:
-------------
In the current expression engine, a toasted datum is detoasted when
required, but the result is discarded immediately, either by pfree-ing it
right away or leaving it for ResetExprContext. (Arguments about which one
to use come up sometimes.) The more serious problem is that detoasting is
expensive, especially for data types like jsonb or array, where the value
might be very large. In the below example, the detoasting happens twice.
SELECT jb_col->'a', jb_col->'b' FROM t;
With shared-detoast-datum, we only need to detoast once for each
tuple, and we discard the result as soon as the tuple is no longer
needed. FWIW this issue may exist for small numeric and text values as
well, because of the short-header varlena feature where the length header
uses 1 byte rather than 4 bytes.
Current Design
--------------
The high-level design is to let createplan.c and setrefs.c decide which
Vars can use this feature, and then have the executor save the detoast
datum back to slot->tts_values[*] during the ExprEvalSteps
EEOP_{INNER|OUTER|SCAN}_VAR_TOAST. The reasons include:
- The existing expression engine reads datums from tts_values[*], so no
extra work needs to be done there.
- We reuse the lifespan of the TupleTableSlot system to manage memory. It
is natural to think of the detoast datum as a tts_value that just happens
to be in detoasted form. Since we already have a clear lifespan for
TupleTableSlot, with ExecClearTuple, ExecCopySlot etc., it is easy to
reuse them for managing the detoast datum's memory.
- The existing projection method copies the detoasted datum (a plain
8-byte Datum) automatically to the next node's slot while keeping the
ownership unchanged, so only the slot where the detoast really happened
is in charge of its lifespan.
So the final data structure change is adding the below field to TupleTableSlot.
typedef struct TupleTableSlot
{
..
/*
* The attributes whose values in tts_values[*] are the detoasted version;
* that memory needs some extra clean-up. It can't be put into
* ecxt_per_tuple_memory since much of it needs a longer life
* span.
*
* The memory is put into TupleTableSlot.tts_mcxt and cleared
* whenever the tts_values[*] are invalidated.
*/
Bitmapset *pre_detoast_attrs;
};
Assume that which Vars should use this feature has already been decided
in createplan.c and setrefs.c. The 3 new ExprEvalSteps
EEOP_{INNER,OUTER,SCAN}_VAR_TOAST are then used. During the evaluation
of these steps, the below code runs.
static inline void
ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
{
if (!slot->tts_isnull[attnum] &&
VARATT_IS_EXTENDED(slot->tts_values[attnum]))
{
Datum oldDatum;
MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);
oldDatum = slot->tts_values[attnum];
slot->tts_values[attnum] = PointerGetDatum(detoast_attr(
(struct varlena *) oldDatum));
Assert(slot->tts_nvalid > attnum);
Assert(oldDatum != slot->tts_values[attnum]);
slot->pre_detoasted_attrs = bms_add_member(slot->pre_detoasted_attrs, attnum);
MemoryContextSwitchTo(old);
}
}
Since I don't want an extra run-time check to decide whether a detoast
should happen, I introduced the 3 new steps.
When do we free the detoast datum? It depends on when the slot's
tts_values[*] are invalidated. ExecClearTuple is the clear case, but any
TupleTableSlotOps callback which sets tts_nvalid = 0 tells us no one will
use the datums in tts_values[*], so it is time to release them, based on
slot.pre_detoast_attrs, as well.
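That invalidation rule can be sketched with stand-in types. SlotSketch, free_pre_detoast_datums and slot_set_nvalid below are hypothetical simplifications (a bool array stands in for the Bitset, malloc/free for tts_mcxt allocations), not the patch's actual code:

```c
/* Toy model of the cleanup rule: whenever tts_nvalid drops to 0, every
 * datum we detoasted into tts_values[*] must be freed.  Hypothetical
 * simplified types; not PostgreSQL's TupleTableSlot. */
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define MAX_ATTRS 8

typedef struct SlotSketch
{
    int     tts_nvalid;
    void   *tts_values[MAX_ATTRS];
    bool    pre_detoasted[MAX_ATTRS];   /* stands in for the Bitset */
} SlotSketch;

/* Free every datum we detoasted into tts_values[*]; called from
 * ExecClearTuple-like paths and from anything that sets tts_nvalid = 0. */
static void
free_pre_detoast_datums(SlotSketch *slot)
{
    for (int i = 0; i < MAX_ATTRS; i++)
    {
        if (slot->pre_detoasted[i])
        {
            free(slot->tts_values[i]);  /* pfree(...) in PostgreSQL */
            slot->tts_values[i] = NULL;
            slot->pre_detoasted[i] = false;
        }
    }
}

/* Models the invariant: invalidating tts_values implies the cleanup. */
static void
slot_set_nvalid(SlotSketch *slot, int nvalid)
{
    if (nvalid == 0)
        free_pre_detoast_datums(slot);
    slot->tts_nvalid = nvalid;
}
```

The point of the sketch is that the cleanup hangs off the *invalidation* of tts_values[*], not off any particular node, which is why any TupleTableSlotOps path that zeroes tts_nvalid must go through it.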
Now we come to the createplan.c/setrefs.c part, which decides which Vars
should use the shared detoast feature. The guidelines for this are:
1. A detoast is needed for a given expression.
2. It should not break the CP_SMALL_TLIST design. Since we save the
detoast datum back into tts_values[*], the tuple becomes bigger; if we
did this blindly, it would be harmful to the ORDER / HASH style nodes.
A high-level data flow is:
1. In createplan.c, we walk the plan tree to gather the CP_SMALL_TLIST
information caused by SORT/HASH style nodes, and save
it to Plan.forbid_pre_detoast_vars via the function
set_plan_forbid_pre_detoast_vars_recurse.
2. In setrefs.c, fix_{scan|join}_expr recurses down to the Vars of each
expression, so it is a good time to track the attribute numbers and
see whether each Var is accessed directly or indirectly. Usually an
indirect access of a Var means a detoast would happen, for
example in an expression like a > 3; however some known expressions, like
"Var IS NULL", are ignored. The output is {Scan|Join}.xxx_reference_attrs.
As a result, the final information is added to the Scan and
Join plan nodes.
typedef struct Scan
{
/*
* Records the Var's varattno - 1 when the Var is accessed indirectly by
* an expression, like a > 3. However "a IS [NOT] NULL" is not included
* since it doesn't access tts_values[*] at all.
*
* This is essential information for figuring out which attrs should use
* the pre-detoast-attrs logic.
*/
Bitmapset *reference_attrs;
} Scan;
typedef struct Join
{
..
/*
* Records the Var's varattno - 1 when the Var is accessed indirectly by
* an expression, like a > 3. However "a IS [NOT] NULL" is not included
* since it doesn't access tts_values[*] at all.
*
* This is essential information for figuring out which attrs should use
* the pre-detoast-attrs logic.
*/
Bitmapset *outer_reference_attrs;
Bitmapset *inner_reference_attrs;
} Join;
Note that I used '_reference_' rather than '_detoast_' here because at
this stage I still don't know whether the attribute is toastable; that
is known only at the MakeTupleTableSlot stage.
3. At the InitPlan stage, we calculate the final xxx_pre_detoast_attrs
in ScanState & JoinState. These are passed into the expression
engine at the ExecInitExprRec stage, where the EEOP_{INNER|OUTER|SCAN}
_VAR_TOAST steps are finally generated; then everything is connected
with ExecSlotDetoastDatum!
Testing
-------
Case 1:
create table t (a numeric);
insert into t select i from generate_series(1, 100000)i;
cat 1.sql
select * from t where a > 0;
In this test, current master runs detoast twice for each datum: once in
numeric_gt and once in numeric_out. This feature makes it detoast only once.
pgbench -f 1.sql -n postgres -T 10 -M prepared
master: 30.218 ms
patched(Bitmapset): 30.881ms
Then we can see the perf report as below:
- 7.34% 0.00% postgres postgres [.] ExecSlotDetoastDatum (inlined)
- ExecSlotDetoastDatum (inlined)
- 3.47% bms_add_member
- 3.06% bms_make_singleton (inlined)
- palloc0
1.30% AllocSetAlloc
- 5.99% 0.00% postgres postgres [.] ExecFreePreDetoastDatum (inlined)
- ExecFreePreDetoastDatum (inlined)
2.64% bms_next_member
1.17% bms_del_members
0.94% AllocSetFree
One of the reasons is that Bitmapset deallocates its memory when all the
bits are deleted (due to commit 00b41463c), so we have to allocate memory
again the next time a member is added. This allocate/free cycle is
executed 100000 times in the above workload.
Then I introduced a bitset data structure (commit 0002), which is much
like Bitmapset but does not deallocate its memory when all the bits are
unset, and used it in this feature (commit 0003). The result then
became: 28.715ms
- 5.22% 0.00% postgres postgres [.] ExecFreePreDetoastDatum (inlined)
- ExecFreePreDetoastDatum (inlined)
- 2.82% bitset_next_member
1.69% bms_next_member_internal (inlined)
0.95% bitset_next_member
0.66% AllocSetFree
Here we can see that the expensive calls are bitset_next_member on
slot->pre_detoast_attrs and pfree. If we put the detoasted datums into
a dedicated memory context, we can save the cost of bitset_next_member,
since we can discard all the memory at once with MemoryContextReset
instead of AllocSetFree (commit 0004). The result then became 27.840ms.
So the final results for case 1:
master: 30.218 ms
patched(Bitmapset): 30.881 ms
patched(bitset): 28.715 ms
patched(bitset + tts_value_mctx): 27.840 ms
Big jsonbs test:
create table b(big jsonb);
insert into b select
jsonb_object_agg(x::text,
random()::text || random()::text || random()::text)
from generate_series(1,600) f(x);
insert into b select (select big from b) from generate_series(1, 1000)i;
explain analyze
select big->'1', big->'2', big->'3', big->'5', big->'10' from b;
master: 702.224 ms
patched: 133.306 ms
Memory usage test:
I ran the TPC-H scale 10 workload against both the master and patched
versions; the memory usage looks stable.
In progress work:
I'm still running TPC-H scale 100 to see if anything interesting shows
up; that is in progress. As for scale 10:
master: 55s
patched: 56s
The difference is because the q9 plan changed a bit; finding the real
reason needs some more time. Since this patch doesn't impact best-path
generation, the plan change doesn't look reasonable to me.
--
Best Regards
Andy Fan
Attachments:
v8-0001-Shared-detoast-feature.patch (+1072 -107)
v8-0002-Introduce-a-Bitset-data-struct.patch (+441 -13)
v8-0003-Use-bitset-instead-of-Bitmapset-for-pre_detoast_a.patch (+17 -11)
v8-0004-Adding-tts_value_mctx-to-TupleTableSlot.patch (+53 -39)
v8-0005-Provide-a-building-option-for-review-purpose.patch (+25 -1)
Hi!
I see this as a very promising initiative, but some issues come to mind.
When we store and detoast large values, say 1GB - that's a very likely
scenario, we have such cases from prod systems - we would end up using a
lot of shared memory to keep these values alive, only to discard them
later. Also, toasted values are not always used immediately and as a
whole; i.e., jsonb values are fully detoasted (we're working on this
right now) to extract the smallest value from a big json, and these
values are not worth keeping in memory. For text values too, we often do
not need the whole value to be detoasted and kept in memory.
What do you think?
--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/
Hi,
When we store and detoast large values, say 1GB - that's a very likely
scenario, we have such cases from prod systems - we would end up using a
lot of shared memory to keep these values alive, only to discard them later.
First, this feature makes sure that if the original expression would not
detoast a datum, this feature will not detoast it either. Given that, if
there is a 1GB datum, in the past it took 1GB of memory as well, though
it might be discarded sooner than with the patched version. But the
patched version will *not* keep it for too long either: once
slot->tts_values[*] is invalidated, the memory is released immediately.
For example:
create table t(a text, b int);
insert into t select '1-gb-length-text' from generate_series(1, 100);
select * from t where a like '%gb%';
The patched version will take only 1GB of extra memory at a time.
Are you worried about this case or some other cases?
Also, toasted values are not always used immediately and as a whole;
i.e., jsonb values are fully detoasted (we're working on this right now)
to extract the smallest value from a big json, and these values are not
worth keeping in memory. For text values too, we often do not need the
whole value to be detoasted.
I'm not sure how often this is true, especially for the text data type
you mentioned. I can't imagine why people would acquire just a piece of
a text value, or how we could design a toasting system to satisfy such a
case without detoasting the whole value as the first step. But for the
jsonb or array data types, I think it is more common. However, if we
design toasting like that, I'm not sure whether it would slow down other
use cases, for example detoasting the whole value. I feel both ways have
their use cases, somewhat like heap store vs. column store.
FWIW, as I showed in the test cases, this feature is helpful not only
for big datums; it also helps with small toast datums.
--
Best Regards
Andy Fan
On 2/26/24 14:22, Andy Fan wrote:
...
Also, toasted values are not always used immediately and as a whole;
i.e., jsonb values are fully detoasted (we're working on this right now)
to extract the smallest value from a big json, and these values are not
worth keeping in memory. For text values too, we often do not need the
whole value to be detoasted.

I'm not sure how often this is true, especially for the text data type
you mentioned. I can't imagine why people would acquire just a piece of
a text value, or how we could design a toasting system to satisfy such a
case without detoasting the whole value as the first step. But for the
jsonb or array data types, I think it is more common. However, if we
design toasting like that, I'm not sure whether it would slow down other
use cases, for example detoasting the whole value. I feel both ways have
their use cases, somewhat like heap store vs. column store.
Any substr/starts_with call benefits from this optimization, and such
calls don't seem that uncommon. I certainly can imagine people doing
this fairly often. In any case, it's a long-standing behavior /
optimization, and I don't think we can just dismiss it this quickly.
Is there a way to identify cases that are likely to benefit from this
slicing, and disable the detoasting for them? We already disable it for
other cases, so maybe we can do this here too?
Or maybe there's a way to do the detoasting lazily, and only keep the
detoasted value when it's requesting the whole value. Or perhaps even
better, remember what part we detoasted, and maybe it'll be enough for
future requests?
I'm not sure how difficult this would be with the proposed approach,
which eagerly detoasts the value into tts_values. I think it'd be easier
to implement with the TOAST cache I briefly described ...
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
Tomas, we already have a working jsonb partial detoast prototype,
and currently I'm porting it to the recent master. Due to the size
of the changes and very invasive nature it takes a lot of effort,
but it is already done. I'm also trying to make the core patch
less invasive. Actually, it is a subject for a separate topic, as soon
as I make it work on the current master we'll propose it
to the community.
Andy, thank you! I'll check the last patch set out and reply in a day or
two.
--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/
On 2/26/24 14:22, Andy Fan wrote:
...
Also, toasted values are not always used immediately and as a whole;
i.e., jsonb values are fully detoasted (we're working on this right now)
to extract the smallest value from a big json, and these values are not
worth keeping in memory. For text values too, we often do not need the
whole value to be detoasted.

I'm not sure how often this is true, especially for the text data type
you mentioned. I can't imagine why people would acquire just a piece of
a text value, or how we could design a toasting system to satisfy such a
case without detoasting the whole value as the first step. But for the
jsonb or array data types, I think it is more common. However, if we
design toasting like that, I'm not sure whether it would slow down other
use cases, for example detoasting the whole value. I feel both ways have
their use cases, somewhat like heap store vs. column store.

Any substr/starts_with call benefits from this optimization, and such
calls don't seem that uncommon. I certainly can imagine people doing
this fairly often.
This leads me to pay more attention to the pg_detoast_datum_slice use
case, which I had overlooked and which causes some regression here. Thanks!
As you said later:
Is there a way to identify cases that are likely to benefit from this
slicing, and disable the detoasting for them? We already disable it for
other cases, so maybe we can do this here too?
I think the answer is yes, if our goal is to disable the whole-value
detoast for some special calls like substr/starts_with. In the current
patch, we have:
/*
 * increase_level_for_pre_detoast
 *		Check if the given Expr could detoast a Var directly; if so,
 *		increase the level via ctx, recording the outcome in
 *		ctx->level_added.
 */
static inline void
increase_level_for_pre_detoast(Node *node, intermediate_level_context *ctx)
{
	/* The following nodes cannot detoast a Var directly. */
	if (IsA(node, List) || IsA(node, TargetEntry) || IsA(node, NullTest))
	{
		ctx->level_added = false;
	}
	else if (IsA(node, FuncExpr) &&
			 castNode(FuncExpr, node)->funcid == F_PG_COLUMN_COMPRESSION)
	{
		/* let's not detoast first so that pg_column_compression works. */
		ctx->level_added = false;
	}
While it works, the blacklist approach doesn't look great.
In any case, it's a long-standing behavior /
optimization, and I don't think we can just dismiss it this quickly.
I agree; that's why I said both approaches have their use cases. By
"both" I mean the other method is the "partial detoast prototype" that
Nikita has told me about before. While I'm sure it has many use cases, I
also feel it is complex and I have many questions in mind, but I'd like
to see the design before asking them.
Or maybe there's a way to do the detoasting lazily, and only keep the
detoasted value when it's requesting the whole value. Or perhaps even
better, remember what part we detoasted, and maybe it'll be enough for
future requests?

I'm not sure how difficult this would be with the proposed approach,
which eagerly detoasts the value into tts_values. I think it'd be
easier to implement with the TOAST cache I briefly described ...
I can understand the benefits of the TOAST cache, but the following
issues don't look good to me, IIUC:
1) We can't put the result into tts_values[*], since without the
planner's decision we don't know whether this would break the
small-tlist logic. But if we put it into the cache only, a cache lookup
becomes an unavoidable overhead.
2) It is hard to decide which entry should be evicted without attaching
it to the TupleTableSlot's life-cycle; without that, we can't guarantee
that the entry we keep is the entry we will reuse soon.
--
Best Regards
Andy Fan