Proposed refactoring of planner header files
In <6044.1548524131@sss.pgh.pa.us> I worried about how much planner
stuff that patch might end up dragging into files that contain
planner support functions, and suggested that we could refactor the
planner's header files to reduce the inclusion footprint. Attached
are some proposed patches to improve the situation. The work isn't
fully done yet, but I was hoping to get buy-in on this approach
before going further.
The basic idea here is to create a new header file, which I chose
to call optimizer/optimizer.h, and put into it just the stuff that
"outside" callers of the planner might need. For this purpose
"outside" can be approximated as "doesn't really need to know what
is in relation.h", ie Paths and related data structures. I expect
that planner support functions will mostly be working with parsetree
data structures for their functions, so they should fit that
restriction. In some cases they need to be able to pass a PlannerInfo
pointer through to some planner function they want to call, but they
can treat the pointer as an opaque type. This worked out pretty well,
as I was able to eliminate uses of all other optimizer/ headers (and,
therefore, relation.h) from all but four or five files outside
backend/optimizer/. The holdouts are mostly places that are pretty
much in bed with the planner anyway, such as statistics/dependencies.c.
I did not attempt to narrow the API used by FDWs, so file_fdw and
postgres_fdw are two of the main places that still need other
optimizer/ headers. It might be useful to do a similar exercise
focusing on the API seen by FDWs, but that's for another time.
Also, I didn't work on tightening selfuncs.c's dependencies.
While I don't have a big problem with considering selfuncs.c to be
in bed with the planner, that's risky in that whatever dependencies
selfuncs.c has may well apply to extensions' selectivity estimators too.
What I'm thinking about doing there is trying to split selfuncs.c into
two parts, one being infrastructure that can be tightly tied to the
core planner (and, likely, get moved under backend/optimizer/) and the
other being estimators that use a limited API and can serve as models
for extension code. But I haven't tried to do that yet, and would like
to get the attached committed first.
There are three patches attached:
0001 takes some very trivial support functions out of clauses.c and
puts them into the generic node support headers (nodeFuncs.h and
makefuncs.h) where they arguably should have been all along. I also
took the opportunity to rename and_clause() and friends into
is_andclause() etc, to make it clearer that they are node-type-testing
functions not node-construction functions, and improved the style a
bit by using "static inline" where practical.
0002 adjusts the API of flatten_join_alias_vars() and some subsidiary
functions so that they take a Query not a PlannerInfo to define the
context they're using for Var transformation. This makes it greatly
less ugly for parse_agg.c to call that function. Without this change
it's impossible for parse_agg.c to be decoupled from relation.h.
It likely also saves some tiny number of cycles, by removing one level
of pointer indirection within that processing.
0003 then creates optimizer.h, moves relevant declarations there, and
adjusts #includes as needed.
Since I was intentionally trying to limit what optimizer.h pulls in,
and in particular not let it include relation.h, I needed an opaque
typedef for PlannerInfo. On the other hand relation.h also needs to
typedef that. I fixed that with a method that we've not used in our
code AFAIK, but is really common in system headers: there's a #define
symbol to remember whether we've created the typedef, and including
both headers in either order will work fine.
optimizer.h exposes a few of the planner's GUCs, but just the basic
cost parameters, which are likely to be useful to planner support
functions. Another choice is to expose all of them, but I'm not
sure that's a great idea --- see gripe below for an example of why
that can encourage broken coding.
I intentionally limited 0003 to just do header refactoring, not code
motion, so there are some other follow-on tasks I'm thinking about.
Notably:
I'm really unhappy that force_parallel_mode and
parallel_leader_participation are being treated as planner GUCs.
They are not that, IMO, because they also affect the behavior of
the executor, cf HandleParallelMessage, ExecGather, ExecGatherMerge.
This is somewhere between ill-considered and outright broken: what
happens if the values change between planning and execution? I think
we probably need to fix things so that those variables do not need to
be looked at by the executor, carrying them forward in the plan
tree if necessary. Then they'd not need to be exposed by
optimizer.h, and indeed I think the mentioned modules wouldn't
need any optimizer inclusions anymore.
Most everything that's being exposed from tlist.c and var.c could be
argued to be generic parsetree-manipulation support that should be
somewhere else, perhaps backend/nodes/ or backend/parser/. If we
moved those functions, I think we could get to a place where
backend/parser/ doesn't use any optimizer headers at all, which seems
like a good idea from a modularity standpoint.
Likewise, maybe expand_function_arguments() should be elsewhere.
It seems like evaluate_expr() probably doesn't belong in the planner
at all; it looks like an executor function doesn't it?
It seems possible that cost_qual_eval() should be exposed by
optimizer.h, but to do so we'd need to expose struct QualCost,
which is a creature of relation.h ATM. Seeing that typedef Cost
is already in nodes.h, maybe it wouldn't be too awful to put
QualCost there too?
I would have exposed estimate_rel_size, which is needed by
access/hash/hash.c, except that it requires Relation and
BlockNumber typedefs. The incremental value from keeping
hash.c from using plancat.h probably isn't worth widening
optimizer.h's #include footprint further. Also, I wonder
whether that whole area needs a rethink for pluggable storage.
Comments?
regards, tom lane
Attachments:
0001-move-trivial-node-support.patchtext/x-diff; charset=us-ascii; name=0001-move-trivial-node-support.patchDownload+277-304
0002-flatten-join-alias-with-query.patchtext/x-diff; charset=us-ascii; name=0002-flatten-join-alias-with-query.patchDownload+32-41
0003-refactor-optimizer-headers.patchtext/x-diff; charset=us-ascii; name=0003-refactor-optimizer-headers.patchDownload+277-303
Hi,
On 2019-01-28 15:17:19 -0500, Tom Lane wrote:
In <6044.1548524131@sss.pgh.pa.us> I worried about how much planner
stuff that patch might end up dragging into files that contain
planner support functions, and suggested that we could refactor the
planner's header files to reduce the inclusion footprint. Attached
are some proposed patches to improve the situation. The work isn't
fully done yet, but I was hoping to get buy-in on this approach
before going further.
The basic idea here is to create a new header file, which I chose
to call optimizer/optimizer.h, and put into it just the stuff that
"outside" callers of the planner might need. For this purpose
"outside" can be approximated as "doesn't really need to know what
is in relation.h", ie Paths and related data structures. I expect
that planner support functions will mostly be working with parsetree
data structures for their functions, so they should fit that
restriction. In some cases they need to be able to pass a PlannerInfo
pointer through to some planner function they want to call, but they
can treat the pointer as an opaque type. This worked out pretty well,
as I was able to eliminate uses of all other optimizer/ headers (and,
therefore, relation.h) from all but four or five files outside
backend/optimizer/. The holdouts are mostly places that are pretty
much in bed with the planner anyway, such as statistics/dependencies.c.
Without having studied the patch in any detail, that seems a worthwhile
goal to me.
Since I was intentionally trying to limit what optimizer.h pulls in,
and in particular not let it include relation.h, I needed an opaque
typedef for PlannerInfo. On the other hand relation.h also needs to
typedef that. I fixed that with a method that we've not used in our
code AFAIK, but is really common in system headers: there's a #define
symbol to remember whether we've created the typedef, and including
both headers in either order will work fine.
Ugh, isn't it nicer to just use the underlying struct type instead of
that? Or alternatively we could expand our compiler demands to require
that redundant typedefs are allowed - I'm not sure there's any animal
left that doesn't support that (rather than triggering an error it via
an intentionally set flag).
I'm really unhappy that force_parallel_mode and
parallel_leader_participation are being treated as planner GUCs.
They are not that, IMO, because they also affect the behavior of
the executor, cf HandleParallelMessage, ExecGather, ExecGatherMerge.
This is somewhere between ill-considered and outright broken: what
happens if the values change between planning and execution? I think
we probably need to fix things so that those variables do not need to
be looked at by the executor, carrying them forward in the plan
tree if necessary. Then they'd not need to be exposed by
optimizer.h, and indeed I think the mentioned modules wouldn't
need any optimizer inclusions anymore.
Stylistically I agree with that (and it's what I ended up doing for JIT
compilation as well), but do you see concrete harm here? I don't think
it's super likely that anybody changes these on the fly and expects
immediate effects?
I would have exposed estimate_rel_size, which is needed by
access/hash/hash.c, except that it requires Relation and
BlockNumber typedefs. The incremental value from keeping
hash.c from using plancat.h probably isn't worth widening
optimizer.h's #include footprint further. Also, I wonder
whether that whole area needs a rethink for pluggable storage.
The pluggable storage patchset currently makes estimate_rel_size() a
callback for table-like objects. While using BlockNumber isn't all that
pretty (it's one of a few tableam functions dealing with blocks, the
others being samplescan and bitmapscan). I've been wondering whether we
should change the API to return the size in a non-block oriented way,
but given the current costing I wasn't sure that's worth much.
There's an XXX in the changed code wondering whether we should make
estimate_rel_size() also call a callback for indexes, that'd e.g. allow
better per-index logic (c.f. the function's comment about gist).
What kind of rejiggering were you thinking of re pluggable storage?
Greetings,
Andres Freund
On Mon, Jan 28, 2019 at 3:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm really unhappy that force_parallel_mode and
parallel_leader_participation are being treated as planner GUCs.
They are not that, IMO, because they also affect the behavior of
the executor, cf HandleParallelMessage, ExecGather, ExecGatherMerge.
This is somewhere between ill-considered and outright broken: what
happens if the values change between planning and execution?
The only use of parallel_leader_participation at plan time seems to be
to twiddle the costing, and the use of it in the executor is to decide
whether to have the leader participate. So if the values differ,
you'll get a plan running a behavior for which plan selection was not
optimized. I don't know whether it's useful to intentionally allow
this so that you can see how the same plan behaves under the other
setting, or whether it's just a wart we'd be better off without. It
might be confusing, though, if you change the setting and it doesn't
force a replan.
Similarly, the use of force_parallel_mode in HandleParallelMessage()
really just affects what CONTEXT lines you get. That may be
ill-considered but it doesn't seem outright broken. Here again, I'm
not really sure that forcibly preserving the plan-time value at
execution time would really result in happier users.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Andres Freund <andres@anarazel.de> writes:
On 2019-01-28 15:17:19 -0500, Tom Lane wrote:
Since I was intentionally trying to limit what optimizer.h pulls in,
and in particular not let it include relation.h, I needed an opaque
typedef for PlannerInfo. On the other hand relation.h also needs to
typedef that. I fixed that with a method that we've not used in our
code AFAIK, but is really common in system headers: there's a #define
symbol to remember whether we've created the typedef, and including
both headers in either order will work fine.
Ugh, isn't it nicer to just use the underlying struct type instead of
that?
No, because that'd mean that anyplace relying on optimizer.h would also
have to write "struct PlannerInfo" rather than "PlannerInfo"; the
effects wouldn't be limited to the header itself.
Or alternatively we could expand our compiler demands to require
that redundant typedefs are allowed - I'm not sure there's any animal
left that doesn't support that (rather than triggering an error it via
an intentionally set flag).
I'd be happy with that if it actually works, but I strongly suspect
that it does not. Or can you cite chapter and verse where C99
requires it to work? My own compiler complains about "redefinition
of typedef 'foo'".
I'm really unhappy that force_parallel_mode and
parallel_leader_participation are being treated as planner GUCs.
They are not that, IMO, because they also affect the behavior of
the executor, cf HandleParallelMessage, ExecGather, ExecGatherMerge.
This is somewhere between ill-considered and outright broken: what
happens if the values change between planning and execution?
Stylistically I agree with that (and it's what I ended up doing for JIT
compilation as well), but do you see concrete harm here?
Well, I don't know: I haven't tried to trace the actual effects if
the values change. Seems like they could be pretty bad, in the worst
case if this wasn't thought through, which it looks like it wasn't.
In any case, if these settings are somehow global rather than being
genuinely planner GUCs, then the planner shouldn't be responsible
for defining them ... maybe they belong in miscadmin.h?
I would have exposed estimate_rel_size, which is needed by
access/hash/hash.c, except that it requires Relation and
BlockNumber typedefs. The incremental value from keeping
hash.c from using plancat.h probably isn't worth widening
optimizer.h's #include footprint further. Also, I wonder
whether that whole area needs a rethink for pluggable storage.
What kind of rejiggering were you thinking of re pluggable storage?
I wasn't; I was just thinking that I didn't want to advertise it
as a stable globally-accessible API if it's likely to get whacked
around soon.
regards, tom lane
Hi,
On 2019-01-28 15:50:22 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2019-01-28 15:17:19 -0500, Tom Lane wrote:
Since I was intentionally trying to limit what optimizer.h pulls in,
and in particular not let it include relation.h, I needed an opaque
typedef for PlannerInfo. On the other hand relation.h also needs to
typedef that. I fixed that with a method that we've not used in our
code AFAIK, but is really common in system headers: there's a #define
symbol to remember whether we've created the typedef, and including
both headers in either order will work fine.Ugh, isn't it nicer to just use the underlying struct type instead of
that?No, because that'd mean that anyplace relying on optimizer.h would also
have to write "struct PlannerInfo" rather than "PlannerInfo"; the
effects wouldn't be limited to the header itself.
Why? It can be called with the typedef'd version, or not. Unless it
calling code has on-stack pointers to it, it ought to never deal with
PlannerInfo vs struct PlannerInfo. In a lot of cases the code calling
the function will either get the PlannerInfo from somewhere (in which
case that'll either have the typedef'd version or not).
Or alternatively we could expand our compiler demands to require
that redundant typedefs are allowed - I'm not sure there's any animal
left that doesn't support that (rather than triggering an error it via
an intentionally set flag).I'd be happy with that if it actually works, but I strongly suspect
that it does not. Or can you cite chapter and verse where C99
requires it to work? My own compiler complains about "redefinition
of typedef 'foo'".
It's not required by C99, it however is required by C11. But a lot of
compilers have allowed it as an extension for a long time (like before
C99), unless suppressed by some option. I think that's partially because
C++ has allowed it for longer. I don't know how many of the BF
compilers could be made to accept that - I'd be very suprised if yours couldn't.
I would have exposed estimate_rel_size, which is needed by
access/hash/hash.c, except that it requires Relation and
BlockNumber typedefs. The incremental value from keeping
hash.c from using plancat.h probably isn't worth widening
optimizer.h's #include footprint further. Also, I wonder
whether that whole area needs a rethink for pluggable storage.What kind of rejiggering were you thinking of re pluggable storage?
I wasn't; I was just thinking that I didn't want to advertise it
as a stable globally-accessible API if it's likely to get whacked
around soon.
Ah. So far the signature didn't need to change, and I'm not aware of
pending issues requiring it.
Greetings,
Andres Freund
Hi,
On 2019-01-28 13:02:11 -0800, Andres Freund wrote:
It's not required by C99, it however is required by C11. But a lot of
compilers have allowed it as an extension for a long time (like before
C99), unless suppressed by some option. I think that's partially because
C++ has allowed it for longer. I don't know how many of the BF
compilers could be made to accept that - I'd be very suprised if yours couldn't.
Hm, it's only in gcc 4.6, so that's probably too recent.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2019-01-28 15:50:22 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
Ugh, isn't it nicer to just use the underlying struct type instead of
that?
No, because that'd mean that anyplace relying on optimizer.h would also
have to write "struct PlannerInfo" rather than "PlannerInfo"; the
effects wouldn't be limited to the header itself.
Why? It can be called with the typedef'd version, or not.
Because I don't want users of the header to have to declare, say, local
variables as "struct PlannerInfo *root" instead of "PlannerInfo *root".
The former is not project style and I will not accept forcing that in
order to save a couple of #ifdefs in headers. I don't actually understand
what you find so ugly about it.
One idea that would save a lot of random "struct foo" hacks in headers
is to allow nodes.h to include "typedef struct MyNode MyNode" for any
recognized node type. We could either do that just for nodes we've
found it useful for, or pre-emptively for the whole list.
regards, tom lane
Andres Freund <andres@anarazel.de> writes:
On 2019-01-28 13:02:11 -0800, Andres Freund wrote:
It's not required by C99, it however is required by C11. But a lot of
compilers have allowed it as an extension for a long time (like before
C99), unless suppressed by some option.
Hm, it's only in gcc 4.6, so that's probably too recent.
Yeah, I tried it with RHEL6's gcc 4.4.7, and it doesn't work
(and AFAICS there is no option that would make it work). A lot
of the buildfarm is running compilers older than that.
I fear we're still 10 years away from being able to demand C11
support ...
regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Jan 28, 2019 at 3:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm really unhappy that force_parallel_mode and
parallel_leader_participation are being treated as planner GUCs.
The only use of parallel_leader_participation at plan time seems to be
to twiddle the costing, and the use of it in the executor is to decide
whether to have the leader participate. So if the values differ,
you'll get a plan running a behavior for which plan selection was not
optimized. I don't know whether it's useful to intentionally allow
this so that you can see how the same plan behaves under the other
setting, or whether it's just a wart we'd be better off without. It
might be confusing, though, if you change the setting and it doesn't
force a replan.
Well, that puts it at the ill-considered end of the spectrum instead
of the outright-broken end, but I still say it's a bad idea. Planner
GUCs ought to control the produced plan, not other behaviors.
regards, tom lane
I wrote:
... I didn't work on tightening selfuncs.c's dependencies.
While I don't have a big problem with considering selfuncs.c to be
in bed with the planner, that's risky in that whatever dependencies
selfuncs.c has may well apply to extensions' selectivity estimators too.
What I'm thinking about doing there is trying to split selfuncs.c into
two parts, one being infrastructure that can be tightly tied to the
core planner (and, likely, get moved under backend/optimizer/) and the
other being estimators that use a limited API and can serve as models
for extension code.
I spent some time looking at this, wondering whether it'd be practical
to write selectivity-estimating code that hasn't #included pathnodes.h
(nee relation.h). It seems not: even pretty high-level functions
such as eqjoinsel() need access to fields like RelOptInfo.rows and
SpecialJoinInfo.jointype. Now, there are only a few such fields, so
conceivably we could provide accessor functions in optimizer.h for
the commonly-used fields and keep the struct types opaque. I'm not
excited about that though; it's unlike how we do things elsewhere in
Postgres and the mere savings of one #include dependency doesn't seem
to justify it. So I'm thinking that planner support functions that
want to do selectivity estimation will still end up pulling in
pathnodes.h via selfuncs.h, and we'll just live with that.
However ... there are three pretty clearly identifiable groups of
functions in selfuncs.c: operator-specific estimators, support
functions, and AM-specific indexscan cost estimation functions.
There's a case to be made for splitting them into three separate files.
One point is that selfuncs.c is just too large; at over 8K lines,
it's currently the 7th-largest hand-maintained C file in our tree.
Another point is that as things stand, there's a strong temptation
to bury useful functionality in static functions that can't be
gotten at by extension estimators. Separating the support functions
might help keep us more honest on that point. (Or not.)
I'm not sure whether those arguments are strong enough to justify
the added pain-in-the-neck factor from moving a bunch of code
around. That complicates back-patching bug fixes and it makes it
harder to trace the git history of the code. So I've got mixed
emotions about whether it's worth doing anything.
Thoughts?
regards, tom lane
On Wed, Jan 30, 2019 at 10:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
However ... there are three pretty clearly identifiable groups of
functions in selfuncs.c: operator-specific estimators, support
functions, and AM-specific indexscan cost estimation functions.
There's a case to be made for splitting them into three separate files.
One point is that selfuncs.c is just too large; at over 8K lines,
it's currently the 7th-largest hand-maintained C file in our tree.
Another point is that as things stand, there's a strong temptation
to bury useful functionality in static functions that can't be
gotten at by extension estimators. Separating the support functions
might help keep us more honest on that point. (Or not.)I'm not sure whether those arguments are strong enough to justify
the added pain-in-the-neck factor from moving a bunch of code
around. That complicates back-patching bug fixes and it makes it
harder to trace the git history of the code. So I've got mixed
emotions about whether it's worth doing anything.
I think splitting it along these lines would be pretty sensible. I'm
not really a fan of giant source files, especially if there's no
really good reason to put all that stuff in one source file ... then
again, I'm also not volunteering to do the work.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, Jan 30, 2019 at 10:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
However ... there are three pretty clearly identifiable groups of
functions in selfuncs.c: operator-specific estimators, support
functions, and AM-specific indexscan cost estimation functions.
There's a case to be made for splitting them into three separate files.
I think splitting it along these lines would be pretty sensible. I'm
not really a fan of giant source files, especially if there's no
really good reason to put all that stuff in one source file ... then
again, I'm also not volunteering to do the work.
I'm happy to do the work if there's consensus on what to do. After
a few moments' thought, I'd suggest:
1. Move the indexscan cost estimation functions into a new file
adt/index_selfuncs.c. Most of them already are declared in
utils/index_selfuncs.h, and we'd move the remaining declarations
(primarily, genericcostestimate()) to that file as well. This
would affect #includes in contrib/bloom/ and any 3rd-party index
AMs that might be using genericcostestimate().
2a. Leave the support functions in selfuncs.c with declarations in
utils/selfuncs.h, and move the operator-specific estimators to
a new file, perhaps opr_selfuncs.c. This would minimize pain for
existing includers of selfuncs.h, most of which (particularly outside
core) are presumably interested in the support functions.
2b. Alternatively, leave the operator-specific estimators where they
are, and make new files under optimizer/ for the support functions.
I think 2b would make more sense in the abstract, but it would also
result in more #include churn for extensions. OTOH we already
forced some churn in that area with the planner-header-refactoring
patch, so the delta cost is probably not that much.
In any case, I'm hoping to get rid of the exported declarations for
pattern selectivity (Pattern_Prefix_Status, pattern_fixed_prefix,
etc) altogether. Those are exported because indxpath.c currently
calls them from its wired-in lossy-index-clause logic. When the
dust settles, all that code should be associated with planner
support functions for the LIKE and regex operators, and will not
need to be visible outside of whatever module we put those in.
But that patch is yet to be written.
Opinions?
regards, tom lane
On Thu, Jan 31, 2019 at 2:46 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm happy to do the work if there's consensus on what to do. After
a few moments' thought, I'd suggest:1. Move the indexscan cost estimation functions into a new file
adt/index_selfuncs.c. Most of them already are declared in
utils/index_selfuncs.h, and we'd move the remaining declarations
(primarily, genericcostestimate()) to that file as well. This
would affect #includes in contrib/bloom/ and any 3rd-party index
AMs that might be using genericcostestimate().2a. Leave the support functions in selfuncs.c with declarations in
utils/selfuncs.h, and move the operator-specific estimators to
a new file, perhaps opr_selfuncs.c. This would minimize pain for
existing includers of selfuncs.h, most of which (particularly outside
core) are presumably interested in the support functions.2b. Alternatively, leave the operator-specific estimators where they
are, and make new files under optimizer/ for the support functions.
I do not have a powerful opinion on exactly what to do here, but I
offer the following for what it's worth:
- I do not really think much of the selfuncs.c naming. Nobody who is
not deeply immersed in the PostgreSQL code knows what a "selfunc" is.
Therefore, breaking selfuncs.c into opr_selfuncs.c and
index_selfuncs.c doesn't strike me as the ideal naming choice. I
would suggest looking for a name that is less steeped in venerable
tradition; perhaps estimator.c or indexsupport.c or selectivity.c or
whatever could be considered for whatever new files we are creating.
- I have always found the placement of selfuncs.c a bit strange: why
is it in utils/adt? Perhaps there is an argument for having the
things that are specific to a particular type or a particular operator
in that directory, but if so, shouldn't they be grouped with the type
or operator to which they relate, rather than each other? And if
they're not specific to a certain thing, or if grouping them with that
thing would suck because of what we'd have to #include or some other
grounds, then why not put them in the optimizer? I think that a large
fraction of this code is actually generic, and putting it in the
optimizer directory someplace would be more logical than what we have
now. Alternatively, since these have to be exposed via pg_proc, maybe
the shims should go in this directory and the core logic elsewhere.
Again, I don't have a strong opinion here, but I've never thought this
made a ton of sense the way it is.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
I do not have a powerful opinion on exactly what to do here, but I
offer the following for what it's worth:
- I do not really think much of the selfuncs.c naming. Nobody who is
not deeply immersed in the PostgreSQL code knows what a "selfunc" is.
Therefore, breaking selfuncs.c into opr_selfuncs.c and
index_selfuncs.c doesn't strike me as the ideal naming choice. I
would suggest looking for a name that is less steeped in venerable
tradition; perhaps estimator.c or indexsupport.c or selectivity.c or
whatever could be considered for whatever new files we are creating.
Hm. I'm not really persuaded. There isn't any part of this code that
isn't "steeped in venerable tradition"; the problem I'm ultimately
trying to fix here can fairly be characterized as 25-year-old technical
debt. I see your point that adt/ is not where new people might look for
selectivity estimators, but we should also have some sympathy for people
who *do* know the code base expecting to find these functions more or
less where they are.
- I have always found the placement of selfuncs.c a bit strange: why
is it in utils/adt? Perhaps there is an argument for having the
things that are specific to a particular type or a particular operator
in that directory, but if so, shouldn't they be grouped with the type
or operator to which they relate, rather than each other?
We do already have that to some extent:
src/backend/tsearch/ts_selfuncs.c
src/backend/utils/adt/array_selfuncs.c
src/backend/utils/adt/rangetypes_selfuncs.c
src/backend/utils/adt/geo_selfuncs.c
src/backend/utils/adt/network_selfuncs.c
src/backend/utils/adt/selfuncs.c
src/include/utils/index_selfuncs.h
src/include/utils/selfuncs.h
contrib/intarray/_int_selfuncs.c
Breaking up selfuncs.c would carry this a bit further, but giving it
a random new name doesn't seem to me to really improve matters.
Now, if we move the support functions to someplace under optimizer/,
we do have to think about a name for that. I was considering
optimizer/utils/selsupport.c and optimizer/selsupport.h, but I'm
certainly not wedded to that.
regards, tom lane
Pursuant to this discussion about breaking up selfuncs.c, here's
a proposed patch to shove all the planner logic associated with LIKE
and regex operators into the recently-created like_support.c file.
This takes a bit under 1400 lines out of selfuncs.c.
I was fairly pleased at how neatly this broke up. I had to export
ineq_histogram_selectivity() and var_eq_const() from selfuncs.c
because some of the moved functions called those; and I also exported
var_eq_non_const() because it seemed to make sense for that to be
visible if var_eq_const() is. But otherwise there aren't connections
between the two files.
I made pattern_fixed_prefix() and make_greater_string() static in
like_support.c, since all of their callers are there now. It's
possible that somebody is using those from an extension, in which
case I wouldn't have a big problem with re-exporting them, but
let's wait and see if there's actually any complaints.
This isn't quite purely a code-motion patch: I failed to resist the
temptation to refactor patternsel() slightly so that the LIKE/regex
support functions can call it for SupportRequestSelectivity. That
means that invoking LIKE/regex through function syntax is exactly
on par with invoking it through operator syntax, so far as the
planner is concerned: not only can we extract lossy index conditions
from something like "WHERE like(textcolumn, 'foo%bar')", but we get
the right selectivity estimate too. (HEAD can only do the former.)
I should emphasize that I'm not that excited about whether calling
like() as a function is well-supported, and I'd certainly not propose
that we make any large-scale effort to make this work for other
built-in operators. But I think that extensions like PostGIS may
very well have reason to want their functions and operators to have
parity, so there should be an example in the core code of how to
do it.
Barring objections I'll push this shortly. I'm not done looking
at selfuncs.c, but this is a separable piece.
regards, tom lane
Attachments:
split-out-pattern-selfuncs-1.patchtext/x-diff; charset=us-ascii; name=split-out-pattern-selfuncs-1.patchDownload+1455-1373
[ moving this discussion to a more appropriate thread; let's keep the
original thread for discussing whether bloom actually needs fixed ]
Over in
/messages/by-id/CAMkU=1yHfC+Gu84UFsz4hWn=C7tgQFMLiEQcto1Y-8WDE96vaw@mail.gmail.com
Jeff Janes <jeff.janes@gmail.com> writes:
On Tue, Feb 12, 2019 at 4:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm just in the midst of refactoring that stuff, so if you have
suggestions, let's hear 'em.
The goal would be that I can copy the entire definition of
genericcostestimate into blcost.c, change the function's name, and get it
to compile. I don't know the correct way accomplish that.
The private functions that genericcostestimate needs are
get_index_quals, add_predicate_to_quals, other_operands_eval_cost,
and orderby_operands_eval_cost. I don't mind exporting those,
although they should get names more appropriate to being global,
and I think the latter two should be merged, as attached.
Another thing that I've been thinking about here is that
deconstruct_indexquals() and the IndexQualInfo struct probably ought
to go away. That was useful code given our previous convoluted data
structure for index quals, but now its purposes of hiding clause
commutation and associating the right index column with each clause
are vestigial. The other goal that I originally had in inventing
that code was to allow cost estimation methods to disregard the
details of which clause type each indexqual actually is. But looking
at how it's being used, it's generally not the case that callers get
to ignore that, which is unsurprising when you think about it: there's
enough behavioral difference between say OpExpr and ScalarArrayOpExpr
that it's kinda silly to think we could really paper it over.
So the attached patch includes a proof-of-concept for getting rid of
IndexQualInfo. I didn't carry it to completion: gincostestimate is
still using that struct. The thing that was blocking me was that
given the definition that IndexClause.indexquals is NIL if we're
supposed to use the original clause as-is, it's really hard to write
code that processes all the indexquals without duplicating logic.
That is, you tend to end up with something along the lines of
this, inside a loop over the IndexClauses:
if (iclause->indexquals == NIL)
do_something_with(iclause->rinfo);
else
foreach(lc, iclause->indexquals)
do_something_with(lfirst(lc));
If you look at deconstruct_indexquals you'll see that I got around
that by making a subroutine for do_something_with, but it's not
convenient to do that if the code needs access to local variables
in the surrounding function. In btcostestimate I instead did
if (iclause->indexquals)
indexquals = iclause->indexquals;
else
indexquals = list_make1(iclause->rinfo);
foreach(lc, indexquals)
do_something_with(lfirst(lc));
but I don't like that much either. It'd be better I think if
we changed the definition of IndexClause.indexquals to always
be nonempty, and just be a singleton list of the original
iclause->rinfo in the simple case. I'd rejected that approach
to begin with on the grounds that (a) letting it be NIL saves
a list_make1() in the common case, and (b) it seemed a bit
dodgy to have the original clause doubly-linked in this structure.
But argument (a) falls apart completely if places like btcostestimate
are doing their own list_make1 because the data structure is too
hard to work with. And I don't think (b) holds much water either,
since we doubly-link RestrictInfos all over the place.
So I'm thinking about going and changing the definition of that
data structure before it's set in stone.
Comments? Does anyone know of third-party AMs that are using
IndexQualInfo and would be sad to have it go away?
regards, tom lane