Some ExecSeqScan optimizations
Hi,
I’ve been looking into possible optimizations for ExecSeqScan() and
chatting with colleagues about cases where it shows up prominently in
analytics-style query plans.
For example, in queries like SELECT agg(somecol) FROM big_table WHERE
<condition>, ExecScan() often dominates the profile. Digging into it,
I found two potential sources of overhead:
1. Run-time checks for PlanState.qual and PlanState.ps_ProjInfo
nullness: these checks are done repeatedly, which seems unnecessary if
we know the values at plan init time.
2. Overhead from ExecScanFetch() when EvalPlanQual() isn’t relevant:
Andres pointed out that ExecScanFetch() introduces unnecessary
overhead even in the common case where EvalPlanQual() isn’t
applicable.
To address (1), I tried assigning specialized functions to
PlanState.ExecProcNode in ExecInitSeqScan() based on whether qual or
projInfo are NULL. Inspired by David Rowley’s suggestion to look at
ExecHashJoinImpl(), I wrote variants like ExecSeqScanNoQual() (for
qual == NULL) and ExecSeqScanNoProj() (for projInfo == NULL). These
call a local version of ExecScan() that lives in nodeSeqScan.c, marked
always-inline. This local copy takes qual and projInfo as arguments,
letting compilers inline and optimize unnecessary branches away.
For (2), the local ExecScan() copy avoids the generic ExecScanFetch()
logic, simplifying things further when EvalPlanQual() doesn’t apply.
That has the additional benefit of allowing SeqNext() to be called
directly instead of via an indirect function pointer. This reduces the
overhead of indirect calls and enables better compiler optimizations
like inlining.
Junwang Zhao helped with creating a benchmark to test the patch, the
results of which can be accessed in the spreadsheet at [1]https://docs.google.com/spreadsheets/d/1AsJOUgIfSsYIJUJwbXk4aO9FVOFOrBCvrfmdQYkHIw4/edit?usp=sharing. The
results show that the patch makes the latency of queries of shape
`SELECT agg(somecol or *) FROM big_table WHERE <condition>` generally
faster with up to 5% improvement in some cases.
Would love to hear thoughts.
--
Thanks, Amit Langote
[1]: https://docs.google.com/spreadsheets/d/1AsJOUgIfSsYIJUJwbXk4aO9FVOFOrBCvrfmdQYkHIw4/edit?usp=sharing
Attachments:
v1-0001-Introduce-optimized-ExecSeqScan-variants-for-tail.patchapplication/octet-stream; name=v1-0001-Introduce-optimized-ExecSeqScan-variants-for-tail.patchDownload+166-2
On Sat, 21 Dec 2024 at 00:41, Amit Langote <amitlangote09@gmail.com> wrote:
To address (1), I tried assigning specialized functions to
PlanState.ExecProcNode in ExecInitSeqScan() based on whether qual or
projInfo are NULL. Inspired by David Rowley’s suggestion to look at
ExecHashJoinImpl(), I wrote variants like ExecSeqScanNoQual() (for
qual == NULL) and ExecSeqScanNoProj() (for projInfo == NULL). These
call a local version of ExecScan() that lives in nodeSeqScan.c, marked
always-inline. This local copy takes qual and projInfo as arguments,
letting compilers inline and optimize unnecessary branches away.
I tested the performance of this and I do see close to a 5%
performance increase in TPC-H Q1. Nice.
I'm a little concerned with the method the patch takes where it copies
most of ExecScan and includes it in nodeSeqscan.c. If there are any
future changes to ExecScan, someone might forget to propagate those
changes into nodeSeqscan.c's version. What if instead you moved
ExecScan() into a header file and made it static inline? That way
callers would get their own inlined copy with the callback functions
inlined too, which for nodeSeqscan is good, since the recheck callback
does nothing.
Just as an additional reason for why I think this might be a better
idea is that the patch doesn't seem to quite keep things equivalent as
in the process of having ExecSeqScanNoEPQImpl() directly call
SeqNext() without going through ExecScanFetch is that you've lost a
call to CHECK_FOR_INTERRUPTS().
On the other hand, one possible drawback from making ExecScan a static
inline is that any non-core code that uses ExecScan won't get any bug
fixes if we were to fix some bug in ExecScan in a minor release unless
the extension is compiled again. That could be fixed by keeping
ExecScan as an extern function and maybe just having ExecScanExtended
as the static inline version.
Another thing I wondered about is the naming conversion you're using
for these ExecSeqScan variant functions.
+ExecSeqScanNoQualNoProj(PlanState *pstate)
+ExecSeqScanNoQual(PlanState *pstate)
+ExecSeqScanNoProj(PlanState *pstate)
+ExecSeqScanNoEPQ(PlanState *pstate)
I think it's better to have a naming convention that aims to convey
what the function does do rather than what it does not do.
I've attached my workings of what I was messing around with. It seems
to perform about the same as your version. I think maybe we'd need
some sort of execScan.h instead of where I've stuffed the functions
in.
It would also be good if there was some way to give guarantees to the
compiler that a given pointer isn't NULL. For example in:
return ExecScanExtended(&node->ss,
(ExecScanAccessMtd) SeqNext,
(ExecScanRecheckMtd) SeqRecheck,
NULL,
pstate->qual,
NULL);
It would be good if when ExecScanExtended is inlined the compiler
wouldn't emit code for the "if (qual == NULL)" ... part. I don't know
if there's any way to do that. I thought I'd mention it in case
someone can think of a way... I guess you could add another parameter
that gets passed as a const and have the "if" test look at that
instead, that's a bit ugly though.
David
Attachments:
inline_ExecScan.patch.txttext/plain; charset=US-ASCII; name=inline_ExecScan.patch.txtDownload+386-239
On Mon, Jan 6, 2025 at 10:18 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Sat, 21 Dec 2024 at 00:41, Amit Langote <amitlangote09@gmail.com> wrote:
To address (1), I tried assigning specialized functions to
PlanState.ExecProcNode in ExecInitSeqScan() based on whether qual or
projInfo are NULL. Inspired by David Rowley’s suggestion to look at
ExecHashJoinImpl(), I wrote variants like ExecSeqScanNoQual() (for
qual == NULL) and ExecSeqScanNoProj() (for projInfo == NULL). These
call a local version of ExecScan() that lives in nodeSeqScan.c, marked
always-inline. This local copy takes qual and projInfo as arguments,
letting compilers inline and optimize unnecessary branches away.I tested the performance of this and I do see close to a 5%
performance increase in TPC-H Q1. Nice.
Thanks David for looking at this.
I'm a little concerned with the method the patch takes where it copies
most of ExecScan and includes it in nodeSeqscan.c. If there are any
future changes to ExecScan, someone might forget to propagate those
changes into nodeSeqscan.c's version. What if instead you moved
ExecScan() into a header file and made it static inline? That way
callers would get their own inlined copy with the callback functions
inlined too, which for nodeSeqscan is good, since the recheck callback
does nothing.
Yeah, having an inline-able version of ExecScan() in a separate header
sounds better than what I proposed.
Just as an additional reason for why I think this might be a better
idea is that the patch doesn't seem to quite keep things equivalent as
in the process of having ExecSeqScanNoEPQImpl() directly call
SeqNext() without going through ExecScanFetch is that you've lost a
call to CHECK_FOR_INTERRUPTS().
Yeah, that was clearly a bug in my patch.
On the other hand, one possible drawback from making ExecScan a static
inline is that any non-core code that uses ExecScan won't get any bug
fixes if we were to fix some bug in ExecScan in a minor release unless
the extension is compiled again. That could be fixed by keeping
ExecScan as an extern function and maybe just having ExecScanExtended
as the static inline version.
Yes, keeping ExecScan()'s interface unchanged seems better for the
considerations you mention.
Another thing I wondered about is the naming conversion you're using
for these ExecSeqScan variant functions.+ExecSeqScanNoQualNoProj(PlanState *pstate) +ExecSeqScanNoQual(PlanState *pstate) +ExecSeqScanNoProj(PlanState *pstate) +ExecSeqScanNoEPQ(PlanState *pstate)I think it's better to have a naming convention that aims to convey
what the function does do rather than what it does not do.
Agreed.
I've attached my workings of what I was messing around with. It seems
to perform about the same as your version. I think maybe we'd need
some sort of execScan.h instead of where I've stuffed the functions
in.
I've done that in the attached v2.
It would also be good if there was some way to give guarantees to the
compiler that a given pointer isn't NULL. For example in:return ExecScanExtended(&node->ss,
(ExecScanAccessMtd) SeqNext,
(ExecScanRecheckMtd) SeqRecheck,
NULL,
pstate->qual,
NULL);It would be good if when ExecScanExtended is inlined the compiler
wouldn't emit code for the "if (qual == NULL)" ... part. I don't know
if there's any way to do that. I thought I'd mention it in case
someone can think of a way... I guess you could add another parameter
that gets passed as a const and have the "if" test look at that
instead, that's a bit ugly though.
I too am not sure of a way short of breaking ExecScanExtended() down
into individual functions, each for the following cases:
1. qual != NULL && projInfo != NULL
2. qual != NULL (&& projInfo == NULL)
3. projInfo != NULL (&& qual == NULL)
So basically, mirroring the variants we now have in nodeSeqScan.c to
the execScan.h. To avoid inlining of the EPQ code when epqstate ==
NULL, rename ExecScanFetch() to ExecScanGetEPQTuple() and move the
(*accessMtd) call to the caller when epqstate == NULL.
CHECK_FOR_INTERRUPTS() is now repeated at every place that needs it.
Attached 0002 shows a PoC of that.
--
Thanks, Amit Langote
Attachments:
v2-0001-Refactor-ExecScan-to-inline-scan-filtering-and-pr.patchapplication/octet-stream; name=v2-0001-Refactor-ExecScan-to-inline-scan-filtering-and-pr.patchDownload+366-204
v2-0002-Break-ExecScanExtended-into-variants-based-on-qua.patchapplication/octet-stream; name=v2-0002-Break-ExecScanExtended-into-variants-based-on-qua.patchDownload+216-74
On Fri, 10 Jan 2025 at 02:46, Amit Langote <amitlangote09@gmail.com> wrote:
On Mon, Jan 6, 2025 at 10:18 PM David Rowley <dgrowleyml@gmail.com> wrote:
I've attached my workings of what I was messing around with. It seems
to perform about the same as your version. I think maybe we'd need
some sort of execScan.h instead of where I've stuffed the functions
in.I've done that in the attached v2.
I think 0001 looks ok, aside from what the attached fixes. (at least
one is my mistake)
Did you test the performance of 0002? I didn't look at it hard enough
to understand what you've done. I can look if performance tests show
that it might be worthwhile considering.
David
Attachments:
minor_fixes.txttext/plain; charset=US-ASCII; name=minor_fixes.txtDownload+4-5
On Fri, Jan 10, 2025 at 1:06 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 10 Jan 2025 at 02:46, Amit Langote <amitlangote09@gmail.com> wrote:
On Mon, Jan 6, 2025 at 10:18 PM David Rowley <dgrowleyml@gmail.com> wrote:
I've attached my workings of what I was messing around with. It seems
to perform about the same as your version. I think maybe we'd need
some sort of execScan.h instead of where I've stuffed the functions
in.I've done that in the attached v2.
I think 0001 looks ok, aside from what the attached fixes. (at least
one is my mistake)
Oops, thanks for the fixes. Attaching an updated version.
Did you test the performance of 0002? I didn't look at it hard enough
to understand what you've done.
I reran the test suite I used before and I don't see a consistent
improvement due to 0002 or perhaps rather degradation. I've saved the
results in the sheet named 2025-01-10 in the spreadsheet at [1].
Comparing the latency for the query `select count(*) from test_table
where <first_column> = <nonexistant_value>` (where test_table has 30
integer columns and 1 million rows in it) between v17, master, and the
patched (0001 or 0001+0002) shows an improvement of close to 10% with
the patch.
-- v17
select count(*) from test_table where a_1 = 1000000;
count
-------
0
(1 row)
Time: 286.618 ms
-- master
select count(*) from test_table where a_1 = 1000000;
count
-------
0
(1 row)
Time: 283.564 ms
-- patched (0001+0002)
select count(*) from test_table where a_1 = 1000000;
count
-------
0
(1 row)
Time: 260.547 ms
Note that I turned off Gather for these tests, because then I find
that the improvements to ExecScan() are better measurable.
I can look if performance tests show
that it might be worthwhile considering.
Sure, that would be great.
--
Thanks, Amit Langote
Attachments:
v3-0002-Break-ExecScanExtended-into-variants-based-on-qua.patchapplication/octet-stream; name=v3-0002-Break-ExecScanExtended-into-variants-based-on-qua.patchDownload+215-73
v3-0001-Refactor-ExecScan-to-inline-scan-filtering-and-pr.patchapplication/octet-stream; name=v3-0001-Refactor-ExecScan-to-inline-scan-filtering-and-pr.patchDownload+365-204
Amit Langote писал(а) 2025-01-10 16:22:
On Fri, Jan 10, 2025 at 1:06 PM David Rowley <dgrowleyml@gmail.com>
wrote:On Fri, 10 Jan 2025 at 02:46, Amit Langote <amitlangote09@gmail.com>
wrote:On Mon, Jan 6, 2025 at 10:18 PM David Rowley <dgrowleyml@gmail.com> wrote:
I've attached my workings of what I was messing around with. It seems
to perform about the same as your version. I think maybe we'd need
some sort of execScan.h instead of where I've stuffed the functions
in.I've done that in the attached v2.
I think 0001 looks ok, aside from what the attached fixes. (at least
one is my mistake)Oops, thanks for the fixes. Attaching an updated version.
Did you test the performance of 0002? I didn't look at it hard enough
to understand what you've done.I reran the test suite I used before and I don't see a consistent
improvement due to 0002 or perhaps rather degradation. I've saved the
results in the sheet named 2025-01-10 in the spreadsheet at [1].Comparing the latency for the query `select count(*) from test_table
where <first_column> = <nonexistant_value>` (where test_table has 30
integer columns and 1 million rows in it) between v17, master, and the
patched (0001 or 0001+0002) shows an improvement of close to 10% with
the patch.-- v17
select count(*) from test_table where a_1 = 1000000;
count
-------
0
(1 row)
Time: 286.618 ms-- master
select count(*) from test_table where a_1 = 1000000;
count
-------
0
(1 row)
Time: 283.564 ms-- patched (0001+0002)
select count(*) from test_table where a_1 = 1000000;
count
-------
0
(1 row)
Time: 260.547 msNote that I turned off Gather for these tests, because then I find
that the improvements to ExecScan() are better measurable.I can look if performance tests show
that it might be worthwhile considering.Sure, that would be great.
Hi
Could you clarify, how do you get this improvements (283 ms to 260 ms)
in this patch?
I see additional code ( if ... else if ... else if ...) and the same
function declared
as inline, but it is called by pointer as before, and it does not
matter, that it is
declared as inline.
In case of query
select count(*) from test_table where a_1 = 1000000;
I would expect increase of query time due to additional if...else . It
is not clear
what code was eliminated to decrease query time.
--
Best regards,
Vladlen Popolitov.
On Fri, 10 Jan 2025 at 22:53, Vladlen Popolitov
<v.popolitov@postgrespro.ru> wrote:
In case of query
select count(*) from test_table where a_1 = 1000000;
I would expect increase of query time due to additional if...else . It
is not clear
what code was eliminated to decrease query time.
Are you talking about the code added to ExecInitSeqScan() to determine
which node function to call? If so, that's only called during executor
startup. The idea here is to reduce the branching during execution by
calling one of those special functions which has a more specialised
version of the ExecScan code for the particular purpose it's going to
be used for.
David
On Fri, 10 Jan 2025 at 22:22, Amit Langote <amitlangote09@gmail.com> wrote:
On Fri, Jan 10, 2025 at 1:06 PM David Rowley <dgrowleyml@gmail.com> wrote:
I can look if performance tests show
that it might be worthwhile considering.Sure, that would be great.
What I wanted to know was if 0002 shows any additional gains over just
0001. If there isn't any, I didn't see the point in looking at it.
David
On Fri, Jan 10, 2025 at 7:36 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 10 Jan 2025 at 22:53, Vladlen Popolitov
<v.popolitov@postgrespro.ru> wrote:In case of query
select count(*) from test_table where a_1 = 1000000;
I would expect increase of query time due to additional if...else . It
is not clear
what code was eliminated to decrease query time.Are you talking about the code added to ExecInitSeqScan() to determine
which node function to call? If so, that's only called during executor
startup. The idea here is to reduce the branching during execution by
calling one of those special functions which has a more specialised
version of the ExecScan code for the particular purpose it's going to
be used for.
Looks like I hadn't mentioned this key aspect of the patch in the
commit message, so did that in the attached.
Vladlen, does what David wrote and the new commit message answer your
question(s)?
--
Thanks, Amit Langote
Attachments:
v4-0001-Refactor-ExecScan-to-inline-scan-filtering-and-pr.patchapplication/octet-stream; name=v4-0001-Refactor-ExecScan-to-inline-scan-filtering-and-pr.patchDownload+365-204
Amit Langote писал(а) 2025-01-10 18:22:
On Fri, Jan 10, 2025 at 7:36 PM David Rowley <dgrowleyml@gmail.com>
wrote:On Fri, 10 Jan 2025 at 22:53, Vladlen Popolitov
<v.popolitov@postgrespro.ru> wrote:In case of query
select count(*) from test_table where a_1 = 1000000;
I would expect increase of query time due to additional if...else . It
is not clear
what code was eliminated to decrease query time.Are you talking about the code added to ExecInitSeqScan() to determine
which node function to call? If so, that's only called during executor
startup. The idea here is to reduce the branching during execution by
calling one of those special functions which has a more specialised
version of the ExecScan code for the particular purpose it's going to
be used for.Looks like I hadn't mentioned this key aspect of the patch in the
commit message, so did that in the attached.Vladlen, does what David wrote and the new commit message answer your
question(s)?
Hi Amit,
Yes, David clarified the idea, but it is still hard to believe in 5% of
improvements.
The query
select count(*) from test_table where a_1 = 1000000;
has both qual and projection, and ExecScanExtended() will be generated
similar to ExecScan() (the same not NULL values to check in if()).
Do you have some scripts to reproduce your benchmark?
--
Best regards,
Vladlen Popolitov.
On Fri, Jan 10, 2025 at 10:49 PM Vladlen Popolitov
<v.popolitov@postgrespro.ru> wrote:
Amit Langote писал(а) 2025-01-10 18:22:
On Fri, Jan 10, 2025 at 7:36 PM David Rowley <dgrowleyml@gmail.com>
wrote:On Fri, 10 Jan 2025 at 22:53, Vladlen Popolitov
<v.popolitov@postgrespro.ru> wrote:In case of query
select count(*) from test_table where a_1 = 1000000;
I would expect increase of query time due to additional if...else . It
is not clear
what code was eliminated to decrease query time.Are you talking about the code added to ExecInitSeqScan() to determine
which node function to call? If so, that's only called during executor
startup. The idea here is to reduce the branching during execution by
calling one of those special functions which has a more specialised
version of the ExecScan code for the particular purpose it's going to
be used for.Looks like I hadn't mentioned this key aspect of the patch in the
commit message, so did that in the attached.Vladlen, does what David wrote and the new commit message answer your
question(s)?Hi Amit,
Yes, David clarified the idea, but it is still hard to believe in 5% of
improvements.
The query
select count(*) from test_table where a_1 = 1000000;
has both qual and projection, and ExecScanExtended() will be generated
similar to ExecScan() (the same not NULL values to check in if()).
Do you have some scripts to reproduce your benchmark?
The benchmark is provided [0]https://docs.google.com/spreadsheets/d/1AsJOUgIfSsYIJUJwbXk4aO9FVOFOrBCvrfmdQYkHIw4/edit?usp=sharing.
Here is a rough comparison of compiled variants' assembly code.
<ExecSeqScan>: start 2b8590, end 2b868c => 252
<ExecSeqScanWithProject>: start 2b8034, end 2b8140 => 268
<ExecSeqScanWithQual>: start 2b8144, end 2b831c => 472
<ExecSeqScanWithQualProject>: start 2b8320, end 2b858c => 620
Before Amit's optimization, it was basically called the
ExecSeqScanWithQualProject, you
can see the other 3 variants all have some reduction in function size.
--
Best regards,Vladlen Popolitov.
[0]: https://docs.google.com/spreadsheets/d/1AsJOUgIfSsYIJUJwbXk4aO9FVOFOrBCvrfmdQYkHIw4/edit?usp=sharing
--
Regards
Junwang Zhao
On Sat, Jan 11, 2025 at 4:57 PM Junwang Zhao <zhjwpku@gmail.com> wrote:
On Fri, Jan 10, 2025 at 10:49 PM Vladlen Popolitov
<v.popolitov@postgrespro.ru> wrote:Amit Langote писал(а) 2025-01-10 18:22:
On Fri, Jan 10, 2025 at 7:36 PM David Rowley <dgrowleyml@gmail.com>
wrote:On Fri, 10 Jan 2025 at 22:53, Vladlen Popolitov
<v.popolitov@postgrespro.ru> wrote:In case of query
select count(*) from test_table where a_1 = 1000000;
I would expect increase of query time due to additional if...else . It
is not clear
what code was eliminated to decrease query time.Are you talking about the code added to ExecInitSeqScan() to determine
which node function to call? If so, that's only called during executor
startup. The idea here is to reduce the branching during execution by
calling one of those special functions which has a more specialised
version of the ExecScan code for the particular purpose it's going to
be used for.Looks like I hadn't mentioned this key aspect of the patch in the
commit message, so did that in the attached.Vladlen, does what David wrote and the new commit message answer your
question(s)?Hi Amit,
Yes, David clarified the idea, but it is still hard to believe in 5% of
improvements.
The query
select count(*) from test_table where a_1 = 1000000;
has both qual and projection, and ExecScanExtended() will be generated
similar to ExecScan() (the same not NULL values to check in if()).
Do you have some scripts to reproduce your benchmark?The benchmark is provided [0].
Here is a rough comparison of compiled variants' assembly code.
<ExecSeqScan>: start 2b8590, end 2b868c => 252
<ExecSeqScanWithProject>: start 2b8034, end 2b8140 => 268
<ExecSeqScanWithQual>: start 2b8144, end 2b831c => 472
<ExecSeqScanWithQualProject>: start 2b8320, end 2b858c => 620
Here is my compile options:
meson setup $HOME/build --prefix=$HOME/pgsql --buildtype=release
and I use `objdump -D postgres | less` to see the assembly code.
Before Amit's optimization, it was basically called the
ExecSeqScanWithQualProject, you
can see the other 3 variants all have some reduction in function size.--
Best regards,Vladlen Popolitov.
[0] https://docs.google.com/spreadsheets/d/1AsJOUgIfSsYIJUJwbXk4aO9FVOFOrBCvrfmdQYkHIw4/edit?usp=sharing
--
Regards
Junwang Zhao
--
Regards
Junwang Zhao
Hi Amit,
On Fri, Jan 10, 2025 at 7:22 PM Amit Langote <amitlangote09@gmail.com> wrote:
On Fri, Jan 10, 2025 at 7:36 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 10 Jan 2025 at 22:53, Vladlen Popolitov
<v.popolitov@postgrespro.ru> wrote:In case of query
select count(*) from test_table where a_1 = 1000000;
I would expect increase of query time due to additional if...else . It
is not clear
what code was eliminated to decrease query time.Are you talking about the code added to ExecInitSeqScan() to determine
which node function to call? If so, that's only called during executor
startup. The idea here is to reduce the branching during execution by
calling one of those special functions which has a more specialised
version of the ExecScan code for the particular purpose it's going to
be used for.Looks like I hadn't mentioned this key aspect of the patch in the
commit message, so did that in the attached.
Thanks for updating the patch. While seeing the patch, the es_epq_active
confused me a little bit mostly because its name, a field name ending with
"active" typically suggests a boolean value, but here it is not, should we
change it to sth like es_epqstate? However this is not related to this patch,
I can start a new thread if you think this is worth a patch.
There is one tiny indent issue(my IDE does this automatically), which I
guess you will fix before committing.
- EPQState *epqstate;
+ EPQState *epqstate;
Vladlen, does what David wrote and the new commit message answer your
question(s)?--
Thanks, Amit Langote
--
Regards
Junwang Zhao
Hi Vladlen,
On Fri, Jan 10, 2025 at 11:49 PM Vladlen Popolitov
<v.popolitov@postgrespro.ru> wrote:
Amit Langote писал(а) 2025-01-10 18:22:
On Fri, Jan 10, 2025 at 7:36 PM David Rowley <dgrowleyml@gmail.com>
wrote:On Fri, 10 Jan 2025 at 22:53, Vladlen Popolitov
<v.popolitov@postgrespro.ru> wrote:In case of query
select count(*) from test_table where a_1 = 1000000;
I would expect increase of query time due to additional if...else . It
is not clear
what code was eliminated to decrease query time.Are you talking about the code added to ExecInitSeqScan() to determine
which node function to call? If so, that's only called during executor
startup. The idea here is to reduce the branching during execution by
calling one of those special functions which has a more specialised
version of the ExecScan code for the particular purpose it's going to
be used for.Looks like I hadn't mentioned this key aspect of the patch in the
commit message, so did that in the attached.Vladlen, does what David wrote and the new commit message answer your
question(s)?Hi Amit,
Yes, David clarified the idea, but it is still hard to believe in 5% of
improvements.
The query
select count(*) from test_table where a_1 = 1000000;
has both qual and projection, and ExecScanExtended() will be generated
similar to ExecScan() (the same not NULL values to check in if()).
Yes, I've noticed that if the plan for the above query contains a
projection, like when it contains a Gather node, the inlined version
of ExecScanExtended() will look more or less the same as the full
ExecScan(). There won't be noticeable speedup with the patch in that
case.
However, I ran the benchmark tests with Gather disabled such that I
get a plan without projection, which uses an inlined version that
doesn't have branches related to projection. I illustrate my example
below.
Do you have some scripts to reproduce your benchmark?
Use these steps. Set max_parallel_workers_per_gather to 0,
shared_buffers to 512MB. Compile the patch using --buildtype=release.
create table foo (a int, b int, c int, d int, e int);
insert into foo select i, i, i, i, i from generate_series(1, 10000000) i;
-- pg_prewarm: to ensure that no buffers lead to I/O to reduce noise
select pg_size_pretty(pg_prewarm('foo'));
select count(*) from foo where a = 10000000;
Times I get on v17, master, and with the patch for the above query are
as follows:
v17: 173, 173, 174 ms
master: 173, 175, 169 ms
Patched: 160, 161, 158 ms
Please let me know if you're still unable to reproduce such numbers
with the steps I described.
--
Thanks, Amit Langote
Hi Junwang,
On Sat, Jan 11, 2025 at 7:39 PM Junwang Zhao <zhjwpku@gmail.com> wrote:
Hi Amit,
On Fri, Jan 10, 2025 at 7:22 PM Amit Langote <amitlangote09@gmail.com> wrote:
On Fri, Jan 10, 2025 at 7:36 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 10 Jan 2025 at 22:53, Vladlen Popolitov
<v.popolitov@postgrespro.ru> wrote:In case of query
select count(*) from test_table where a_1 = 1000000;
I would expect increase of query time due to additional if...else . It
is not clear
what code was eliminated to decrease query time.Are you talking about the code added to ExecInitSeqScan() to determine
which node function to call? If so, that's only called during executor
startup. The idea here is to reduce the branching during execution by
calling one of those special functions which has a more specialised
version of the ExecScan code for the particular purpose it's going to
be used for.Looks like I hadn't mentioned this key aspect of the patch in the
commit message, so did that in the attached.Thanks for updating the patch. While seeing the patch, the es_epq_active
confused me a little bit mostly because its name, a field name ending with
"active" typically suggests a boolean value, but here it is not, should we
change it to sth like es_epqstate? However this is not related to this patch,
I can start a new thread if you think this is worth a patch.
Yeah, the name has confused me as well from time to time.
Though it might be a good idea to dig the thread that led to the
introduction of this field to find out if the naming has some logic
we're missing.
You may start a new thread to get the attention of other folks who
might have some clue.
There is one tiny indent issue(my IDE does this automatically), which I
guess you will fix before committing.- EPQState *epqstate; + EPQState *epqstate;
Thanks for the heads up.
--
Thanks, Amit Langote
Here's v5 with a few commit message tweaks.
Barring objections, I would like to push this early next week.
--
Thanks, Amit Langote
Attachments:
v5-0001-Refactor-ExecScan-to-allow-inlining-of-its-core-l.patchapplication/octet-stream; name=v5-0001-Refactor-ExecScan-to-allow-inlining-of-its-core-l.patchDownload+365-204
On Fri, Jan 17, 2025 at 2:05 PM Amit Langote <amitlangote09@gmail.com> wrote:
Here's v5 with a few commit message tweaks.
Barring objections, I would like to push this early next week.
Pushed yesterday. Thank you all.
--
Thanks, Amit Langote
Hi,
On 2025-01-22 10:07:51 +0900, Amit Langote wrote:
On Fri, Jan 17, 2025 at 2:05 PM Amit Langote <amitlangote09@gmail.com> wrote:
Here's v5 with a few commit message tweaks.
Barring objections, I would like to push this early next week.
Pushed yesterday. Thank you all.
While looking at a profile I recently noticed that ExecSeqScanWithQual() had a
runtime branch to test whether qual is NULL, which seemed a bit silly. I think
we should use pg_assume(), which I just added to avoid a compiler warning, to
improve the code generation here.
The performance gain unsurprisingly isn't significant (but seems repeatably
measureable), but it does cut out a fair bit of unnecessary code.
andres@awork3:/srv/dev/build/postgres/m-dev-optimize$ size executor_nodeSeqscan.c.*o
text data bss dec hex filename
3330 0 0 3330 d02 executor_nodeSeqscan.c.assume.o
3834 0 0 3834 efa executor_nodeSeqscan.c.o
A 13% reduction in actual code size isn't bad for such a small change, imo.
I have a separate question as well - do we need to call ResetExprContext() if
we neither qual, projection nor epq? I see a small gain by avoiding that.
Greetings,
Andres Freund
Attachments:
v1-0001-Optimize-seqscan-code-generation-using-pg_assume.patchtext/x-diff; charset=us-asciiDownload+8-5
Hi Andres,
On Thu, Jul 10, 2025 at 8:34 AM Andres Freund <andres@anarazel.de> wrote:
On 2025-01-22 10:07:51 +0900, Amit Langote wrote:
On Fri, Jan 17, 2025 at 2:05 PM Amit Langote <amitlangote09@gmail.com> wrote:
Here's v5 with a few commit message tweaks.
Barring objections, I would like to push this early next week.
Pushed yesterday. Thank you all.
While looking at a profile I recently noticed that ExecSeqScanWithQual() had a
runtime branch to test whether qual is NULL, which seemed a bit silly. I think
we should use pg_assume(), which I just added to avoid a compiler warning, to
improve the code generation here.
+1. I think this might be what David was getting at in his first
message in this thread.
The performance gain unsurprisingly isn't significant (but seems repeatably
measureable), but it does cut out a fair bit of unnecessary code.andres@awork3:/srv/dev/build/postgres/m-dev-optimize$ size executor_nodeSeqscan.c.*o
text data bss dec hex filename
3330 0 0 3330 d02 executor_nodeSeqscan.c.assume.o
3834 0 0 3834 efa executor_nodeSeqscan.c.oA 13% reduction in actual code size isn't bad for such a small change, imo.
Yeah, that seems worthwhile. I had been a bit concerned about code
size growth from having four variant functions with at least some
duplication, so this is a nice offset.
Thanks for the patch.
+ /*
+ * Use pg_assume() for != NULL tests to make the compiler realize no
+ * runtime check for the field is needed in ExecScanExtended().
+ */
I propose changing "to make the compiler realize no runtime check" to
"so the compiler can optimize away the runtime check", assuming that
is what it means.
Also, I assume you intentionally avoided repeating the comment in all
the variant functions.
I have a separate question as well - do we need to call ResetExprContext() if
we neither qual, projection nor epq? I see a small gain by avoiding that.
You're referring to this block, I assume:
/*
* If we have neither a qual to check nor a projection to do, just skip
* all the overhead and return the raw scan tuple.
*/
if (!qual && !projInfo)
{
ResetExprContext(econtext);
return ExecScanFetch(node, epqstate, accessMtd, recheckMtd);
}
Yeah, I think it's fine to remove ResetExprContext() here. When I
looked at it before, I left it in because I was unsure whether
accessMtd() might leak memory into the per-tuple context. But on
second thought, that seems unlikely? Would you like me to do it or do
you have a patch in your tree already?
--
Thanks, Amit Langote
Hi,
On 2025-07-10 17:28:50 +0900, Amit Langote wrote:
On Thu, Jul 10, 2025 at 8:34 AM Andres Freund <andres@anarazel.de> wrote:
On 2025-01-22 10:07:51 +0900, Amit Langote wrote:
On Fri, Jan 17, 2025 at 2:05 PM Amit Langote <amitlangote09@gmail.com> wrote:
Here's v5 with a few commit message tweaks.
Barring objections, I would like to push this early next week.
Pushed yesterday. Thank you all.
While looking at a profile I recently noticed that ExecSeqScanWithQual() had a
runtime branch to test whether qual is NULL, which seemed a bit silly. I think
we should use pg_assume(), which I just added to avoid a compiler warning, to
improve the code generation here.+1. I think this might be what David was getting at in his first
message in this thread.
Indeed.
The performance gain unsurprisingly isn't significant (but seems repeatably
measureable), but it does cut out a fair bit of unnecessary code.andres@awork3:/srv/dev/build/postgres/m-dev-optimize$ size executor_nodeSeqscan.c.*o
text data bss dec hex filename
3330 0 0 3330 d02 executor_nodeSeqscan.c.assume.o
3834 0 0 3834 efa executor_nodeSeqscan.c.oA 13% reduction in actual code size isn't bad for such a small change, imo.
Yeah, that seems worthwhile. I had been a bit concerned about code
size growth from having four variant functions with at least some
duplication, so this is a nice offset.
I'm rather surprised by just how much the size reduces...
I built nodeSeqscan.c with -ffunction-sections and looked at the size with
size --format=sysv:
Before:
.text.SeqRecheck 6 0
.rodata.str1.8 135 0
.text.unlikely.SeqNext 53 0
.text.SeqNext 178 0
.text.ExecSeqScanEPQ 20 0
.text.ExecSeqScanWithProject 289 0
.text.unlikely.ExecSeqScanWithQual 53 0
.text.ExecSeqScanWithQual 441 0
.text.unlikely.ExecSeqScanWithQualProject 53 0
.text.ExecSeqScanWithQualProject 811 0
.text.unlikely.ExecSeqScan 53 0
.text.ExecSeqScan 245 0
.text.ExecInitSeqScan 287 0
.text.ExecEndSeqScan 33 0
.text.ExecReScanSeqScan 63 0
.text.ExecSeqScanEstimate 88 0
.text.ExecSeqScanInitializeDSM 114 0
.text.ExecSeqScanReInitializeDSM 34 0
.text.ExecSeqScanInitializeWorker 64 0
After:
.text.SeqRecheck 6 0
.rodata.str1.8 135 0
.text.unlikely.SeqNext 53 0
.text.SeqNext 178 0
.text.ExecSeqScanEPQ 20 0
.text.ExecSeqScanWithProject 209 0
.text.unlikely.ExecSeqScanWithQual 53 0
.text.ExecSeqScanWithQual 373 0
.text.unlikely.ExecSeqScanWithQualProject 53 0
.text.ExecSeqScanWithQualProject 474 0
.text.unlikely.ExecSeqScan 53 0
.text.ExecSeqScan 245 0
.text.ExecInitSeqScan 287 0
.text.ExecEndSeqScan 33 0
.text.ExecReScanSeqScan 63 0
.text.ExecSeqScanEstimate 88 0
.text.ExecSeqScanInitializeDSM 114 0
.text.ExecSeqScanReInitializeDSM 34 0
.text.ExecSeqScanInitializeWorker 64 0
I'm rather baffled that the size of ExecSeqScanWithQualProject goes from 811
to 474, just due to those null checks being removed... But I'll take it.
Thanks for the patch.
+ /* + * Use pg_assume() for != NULL tests to make the compiler realize no + * runtime check for the field is needed in ExecScanExtended(). + */I propose changing "to make the compiler realize no runtime check" to
"so the compiler can optimize away the runtime check", assuming that
is what it means.
It does. I don't really see a meaningful difference between the comments?
Also, I assume you intentionally avoided repeating the comment in all
the variant functions.
Yea, it didn't seem helpful to do so.
I have a separate question as well - do we need to call ResetExprContext() if
we neither qual, projection nor epq? I see a small gain by avoiding that.You're referring to this block, I assume:
/*
* If we have neither a qual to check nor a projection to do, just skip
* all the overhead and return the raw scan tuple.
*/
if (!qual && !projInfo)
{
ResetExprContext(econtext);
return ExecScanFetch(node, epqstate, accessMtd, recheckMtd);
}
Yep.
Yeah, I think it's fine to remove ResetExprContext() here. When I
looked at it before, I left it in because I was unsure whether
accessMtd() might leak memory into the per-tuple context.
It's a good question. I think I unfortunately found a problematic case,
ForeignNext().
I wonder if we instead can MemoryContextReset cheaper, by avoiding a function
call for the common case that no reset is needed. Right now we can't just
check ->isReset in an inline function, because we also delete children. I
wonder if we could define isReset so that creating a child context unsets
isReset?
Greetings,
Andres Freund