Improved scanner performance

Started by Peter Eisentrautover 23 years ago7 messages
#1Peter Eisentraut
peter_e@gmx.net

I've been poking at the scanner a bit using the large literal test case
from the other day

http://archives.postgresql.org/pgsql-hackers/2002-04/msg00811.php

I've been able to reduce the wall-clock run time of that test from 3:37
min to 2:13 min and the base_yylex() per-call time from 137ms to 24ms.

I've used 'flex -8 -CFa' and restructured the code to avoid looping over
and copying the input string half a dozen times. For instance, instead of
scanstr(), the escape sequences are resolved as the input is scanned, and
instead of the myinput() routine I use the function yy_scan_buffer()
provided by flex for scanning in-memory strings. (This would make the
code flex-dependent, but in reality it already is anyway.)

The "before" profile was:

% cumulative self self total
time seconds seconds calls ms/call ms/call name
23.51 6.65 6.65 110 60.45 137.27 base_yylex
23.19 13.21 6.56 11 596.36 1089.69 pq_getstring
19.16 18.63 5.42 74882482 0.00 0.00 pq_getbyte
14.99 22.87 4.24 11 385.45 385.46 scanstr
9.61 25.59 2.72 23 118.26 118.26 yy_get_previous_state
3.78 26.66 1.07 34 31.47 31.47 myinput
3.64 27.69 1.03 22 46.82 46.82 textin
1.48 28.11 0.42 34 12.35 43.82 yy_get_next_buffer

The "after" profile is:

% cumulative self self total
time seconds seconds calls ms/call ms/call name
40.30 5.65 5.65 11 513.64 943.64 pq_getstring
33.74 10.38 4.73 74882482 0.00 0.00 pq_getbyte
18.90 13.03 2.65 110 24.09 24.09 base_yylex
6.85 13.99 0.96 22 43.64 43.64 textin
0.07 14.00 0.01 86 0.12 0.12 heap_fetch

--
Peter Eisentraut peter_e@gmx.net

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#1)
Re: Improved scanner performance

Peter Eisentraut <peter_e@gmx.net> writes:

I've used 'flex -8 -CFa' and restructured the code to avoid looping over
and copying the input string half a dozen times. For instance, instead of
scanstr(), the escape sequences are resolved as the input is scanned, and
instead of the myinput() routine I use the function yy_scan_buffer()
provided by flex for scanning in-memory strings. (This would make the
code flex-dependent, but in reality it already is anyway.)

Yes, we've been requiring flex-only features for years, so that aspect
of it doesn't bother me. Any downsides to the changes? (For instance,
I had the idea that -CF would enlarge the lexer tables quite a bit ---
what's the change in executable size?)

The "after" profile is:

% cumulative self self total
time seconds seconds calls ms/call ms/call name
40.30 5.65 5.65 11 513.64 943.64 pq_getstring
33.74 10.38 4.73 74882482 0.00 0.00 pq_getbyte
18.90 13.03 2.65 110 24.09 24.09 base_yylex
6.85 13.99 0.96 22 43.64 43.64 textin
0.07 14.00 0.01 86 0.12 0.12 heap_fetch

Looks like inlining pq_getbyte into pq_getstring would be worth doing
too.

regards, tom lane

#3Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#2)
Re: Improved scanner performance

Tom Lane writes:

I had the idea that -CF would enlarge the lexer tables quite a bit ---
what's the change in executable size?)

+150 kB

I've also looked at -CFe, which is supposedly the next slowest level, but
it doesn't do nearly as well.

--
Peter Eisentraut peter_e@gmx.net

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#3)
Re: Improved scanner performance

Peter Eisentraut <peter_e@gmx.net> writes:

Tom Lane writes:

I had the idea that -CF would enlarge the lexer tables quite a bit ---
what's the change in executable size?)

+150 kB

I've also looked at -CFe, which is supposedly the next slowest level, but
it doesn't do nearly as well.

Ouch; that sounds like about a ten percent increase in the size of
the backend executable. That's enough to reach my threshold of pain;
is the long-literal issue worth that much?

How much of your reported improvement is due to -CFa, and how much to
the coding improvements you made?

regards, tom lane

#5Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#4)
Re: Improved scanner performance

Tom Lane writes:

Peter Eisentraut <peter_e@gmx.net> writes:

Tom Lane writes:

I had the idea that -CF would enlarge the lexer tables quite a bit ---
what's the change in executable size?)

+150 kB

I've also looked at -CFe, which is supposedly the next slowest level, but
it doesn't do nearly as well.

Ouch; that sounds like about a ten percent increase in the size of
the backend executable. That's enough to reach my threshold of pain;
is the long-literal issue worth that much?

Here's a breakdown of the postmaster file sizes and the wall-clock run
time of the long-literal test:

no options 1749912 1m58.688s
-CFe 1754315 1m49.223s
-CF 1817621 1m43.780s
-CFa 1890197 1m45.600s

(These numbers are different than yesterday's because they don't have
profiling and debugging overhead.)

Seeing this, I think -CF should be OK space and time-wise.

How much of your reported improvement is due to -CFa, and how much to
the coding improvements you made?

As I recall it, probably a third of the overall improvement came from
using -CF[a].

--
Peter Eisentraut peter_e@gmx.net

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#5)
Re: Improved scanner performance

Peter Eisentraut <peter_e@gmx.net> writes:

Here's a breakdown of the postmaster file sizes and the wall-clock run
time of the long-literal test:

no options 1749912 1m58.688s
-CFe 1754315 1m49.223s
-CF 1817621 1m43.780s
-CFa 1890197 1m45.600s

Seeing this, I think -CF should be OK space and time-wise.

Looks like a reasonable compromise to me too.

regards, tom lane

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#1)
Re: Improved scanner performance

BTW, here is what I get on an HP box for the same test you described
(a dozen trivial SELECTs using string literals between 5MB and 10MB
long), using latest sources:

% cumulative self self total
time seconds seconds calls ms/call ms/call name
47.51 8.19 8.19 chunks
26.16 12.70 4.51 129 34.96 35.97 base_yylex
12.30 14.82 2.12 1521 1.39 1.39 strlen
6.79 15.99 1.17 13 90.00 90.00 pq_getstring
4.18 16.71 0.72 chunk2
2.55 17.15 0.44 _recv_sys
0.29 17.20 0.05 _mcount

"chunks" is the inner loop of memcpy() --- evidently all the time is
being spent just copying those big literals around.

We could probably avoid some of that copying if we had a cleaner
approach to parsetree handling, ie, no scribbling on one's input.
Then operations like eval_const_expressions wouldn't feel compelled
to copy parsetree nodes that they weren't modifying. But I think
we've gotten all the low-hanging fruit for now.

At least on the backend side. Did you notice that psql was chewing
up three times more CPU than the backend in this test??

regards, tom lane