Detection of nested function calls
Hi all,
The Oslandia team has been involved in the PostGIS project for years, with a
current focus on PostGIS 3D support.
With PostGIS queries, nested function calls that manipulate geometries
are quite common, e.g.: SELECT ST_Union(ST_Intersection(a.geom,
ST_Buffer(b.geom, 50)))
PostGIS functions that manipulate geometries have to unserialize their
input geometries from the 'flat' varlena representation to their own,
and serialize the processed geometries back when returning.
But in such nested call queries, this serialization-unserialization
process is pure overhead.
Avoiding it could lead to a real performance gain [1],
especially when the internal type takes time to serialize (and with
new PostGIS types like rasters or 3D geometries it's really meaningful).
So we thought having a way for user functions to know if they are part
of a nested call could allow them to avoid this serialization phase.
The idea would be to have a boolean flag reachable from a user function
(within FunctionCallInfoData) that says if the current function is
nested or not.
We already investigated such a modification and here is where we are up
to now :
- we modified the parser with a new boolean member 'nested' to the
FuncExpr struct. Within the parser, we know if a function call is nested
into another one and then we can mark the FuncExpr
- the executor has been modified so it can take into account this
nested member and pass it to the FunctionCallInfoData structure before
evaluating the function
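The two steps above can be pictured with a small standalone sketch. It does not use the PostgreSQL headers and is not the attached patch: ToyFuncExpr, ToyCallInfo, mark_nested and make_call_info are hypothetical stand-ins for FuncExpr, FunctionCallInfoData and the parser/executor changes described.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for FuncExpr: a call with up to two argument sub-expressions.
 * A NULL argument stands for a plain column reference. */
typedef struct ToyFuncExpr {
    const char *name;
    struct ToyFuncExpr *args[2];
    bool nested;                /* hypothetical flag, set at parse time */
} ToyFuncExpr;

/* Toy stand-in for FunctionCallInfoData. */
typedef struct ToyCallInfo {
    bool nested;
} ToyCallInfo;

/* "Parser" pass: a call that appears as an argument of another call is marked. */
void mark_nested(ToyFuncExpr *expr, bool is_argument)
{
    if (expr == NULL)
        return;
    expr->nested = is_argument;
    for (size_t i = 0; i < 2; i++)
        mark_nested(expr->args[i], true);
}

/* "Executor" step: copy the flag into the call info before evaluation. */
ToyCallInfo make_call_info(const ToyFuncExpr *expr)
{
    ToyCallInfo info = { expr->nested };
    return info;
}
```

For ST_Union(ST_Intersection(a.geom, ST_Buffer(b.geom, 50))), this marks ST_Intersection and ST_Buffer as nested and leaves ST_Union unmarked, so only the outermost call would have to serialize its result.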
We are working on a PostGIS branch that takes advantage of this
functionality [2].
A first draft of the patch is attached.
Obviously, even though this is about a PostGIS use case, this subject
could be helpful for any other queries using both nested functions and
serialization.
I am quite new to PostgreSQL hacking, so I'm sure there is room for
improvement. But what about this first proposal?
I'll be at the PGDay conf in Dublin next week, so we could discuss this
topic.
[1]: Talking about performance, we already investigated such a
"pass-by-reference" mechanism with PostGIS. Taking a dummy function
"st_copy" that only copies its input geometry to its output with 4
levels of nesting gives encouraging results (passing geometries by
reference is more than 2x faster than (un)serializing):
https://github.com/Oslandia/sfcgal-tests/blob/master/bench/report_serialization_referenced_vs_native.pdf
[2]: https://github.com/Oslandia/postgis/tree/nested_ref_passing
--
Hugo Mercier
Oslandia
Attachments:
nested_calls.patch (text/x-patch, +107 -56)
Hello
2013/10/25 Hugo Mercier <hugo.mercier@oslandia.com>
I am quite new to postgresql hacking, so I'm sure there is room for
improvements. But, what about this first proposal ?
I am not sure, if this solution is enough - what will be done if I store
some values in PL/pgSQL variables?
Regards
Pavel
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 25/10/2013 14:29, Pavel Stehule wrote:
Hello
2013/10/25 Hugo Mercier <hugo.mercier@oslandia.com>
I am quite new to postgresql hacking, so I'm sure there is room for
improvements. But, what about this first proposal ?
I am not sure, if this solution is enough - what will be done if I store
some values in PL/pgSQL variables?
You mean if you store the result of a (nested) function evaluation in a
PL/pgSQL variable ?
Then no nesting will be detected by the parser and in this case the user
function must ensure its result is serialized, since it could be stored
(in a variable or a table) at any time.
--
Hugo Mercier
Oslandia
2013/10/25 Hugo Mercier <hugo.mercier@oslandia.com>
On 25/10/2013 14:29, Pavel Stehule wrote:
I am not sure, if this solution is enough - what will be done if I store
some values in PL/pgSQL variables?
You mean if you store the result of a (nested) function evaluation in a
PL/pgSQL variable ?
Then no nesting will be detected by the parser and in this case the user
function must ensure its result is serialized, since it could be stored
(in a variable or a table) at any time.
ok
I remember thinking about a similar optimization when I worked on the SQL/XML
implementation, so the same optimization could be used there.
Regards
Pavel
Hugo Mercier <hugo.mercier@oslandia.com> writes:
PostGIS functions that manipulate geometries have to unserialize their
input geometries from the 'flat' varlena representation to their own,
and serialize the processed geometries back when returning.
But in such nested call queries, this serialization-unserialization
process is just an overhead.
This is a reasonable thing to worry about, not just for PostGIS types but
for many container types such as arrays --- it'd be nice to be able to
work with an in-memory representation that wasn't just a contiguous blob
of data. For instance, assignment to an array element might become a
constant-time operation even when working with variable-length datatypes.
So we thought having a way for user functions to know if they are part
of a nested call could allow them to avoid this serialization phase.
However, this seems like a completely wrong way to go at it. In the first
place, it wouldn't help for situations like a complex value stored in a
plpgsql variable. In the second, I don't think that what you are
describing scales to any more than the most trivial situations. What
about functions with more than one complex-type input, for example? And
you'd need to be certain that every single function taking or returning
the datatype gets updated at exactly the same time, else it'll break.
I think the right way to attack it is to create some way for a Datum
value to indicate, at runtime, whether it's a flat value or an in-memory
representation. Any given function returning the type could choose to
return either representation. The datatype would have to provide a way
to serialize the in-memory representation, when and if it came time to
store it in a table. To avoid breaking functions that hadn't yet been
taught about the new representation, we'd probably want to redefine the
existing DETOAST macros as also invoking this datatype flattening
function, and then you'd need to use some new access macro if you wanted
visibility of the non-flat representation. (This assumes that the whole
thing is only applicable to toastable datatypes, but that seems like a
reasonable restriction.)
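A minimal standalone sketch of that idea, not tied to the real Datum or DETOAST definitions: every name here (ToyDatum, ValueKind, flatten, get_flat, get_expanded_or_null) is hypothetical. The value carries a runtime tag, the legacy accessor always yields a flat blob, and a separate accessor exposes the non-flat form to functions that know about it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef enum { VALUE_FLAT, VALUE_EXPANDED } ValueKind;

/* Expanded form: an int array that can be edited in place. */
typedef struct { size_t len; int *items; } ExpandedArray;

/* Flat form: a length header followed by the items in one contiguous blob. */
typedef struct { size_t len; int items[]; } FlatArray;

typedef struct {
    ValueKind kind;   /* runtime tag checked by every accessor */
    void *ptr;        /* FlatArray* or ExpandedArray*, depending on kind */
} ToyDatum;

/* Datatype-provided flattening function. */
FlatArray *flatten(const ExpandedArray *e)
{
    FlatArray *f = malloc(sizeof(FlatArray) + e->len * sizeof(int));
    if (f == NULL)
        abort();
    f->len = e->len;
    memcpy(f->items, e->items, e->len * sizeof(int));
    return f;
}

/* Legacy accessor: always returns a flat value.
 * *owned is set to 1 when the caller must free() the result. */
FlatArray *get_flat(const ToyDatum *d, int *owned)
{
    if (d->kind == VALUE_EXPANDED) {
        *owned = 1;
        return flatten(d->ptr);
    }
    *owned = 0;
    return d->ptr;
}

/* New accessor: functions aware of the expanded form ask for it directly. */
ExpandedArray *get_expanded_or_null(const ToyDatum *d)
{
    return d->kind == VALUE_EXPANDED ? d->ptr : NULL;
}
```

Functions that only call get_flat keep working unchanged, which is what allows functions to be converted one at a time.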
Another thing that would have to be attacked in order to make the
plpgsql-variable case work is that you'd need some design for copying such
Datums in-memory, and perhaps a reference count mechanism to optimize away
unnecessary copies. Your idea of tying the optimization to the nested
function call scenario would avoid the need to solve this problem, but
I think it's too narrow a scope to justify all the other work that'd be
involved.
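As a rough illustration of the reference-count idea, here is a standalone sketch with hypothetical names (SharedGeom, geom_create, geom_ref, geom_unref). It only shows how a copy can become a counter increment; it says nothing about error handling or memory context lifetimes.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    int refcount;
    double *coords;
    size_t ncoords;
} SharedGeom;

SharedGeom *geom_create(size_t ncoords)
{
    SharedGeom *g = malloc(sizeof *g);
    if (g == NULL)
        abort();
    g->coords = calloc(ncoords, sizeof(double));
    if (g->coords == NULL && ncoords > 0)
        abort();
    g->ncoords = ncoords;
    g->refcount = 1;
    return g;
}

/* "Copy" into a variable: share the object instead of duplicating it. */
SharedGeom *geom_ref(SharedGeom *g)
{
    g->refcount++;
    return g;
}

/* Drop one reference; returns 1 when the object was actually freed. */
int geom_unref(SharedGeom *g)
{
    if (--g->refcount > 0)
        return 0;
    free(g->coords);
    free(g);
    return 1;
}
```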
Some colleagues of mine at Salesforce have been playing with ideas like
this, though last I heard they were nowhere near having a submittable
patch.
regards, tom lane
On 2013-10-25 10:18:27 -0400, Tom Lane wrote:
I think the right way to attack it is to create some way for a Datum
value to indicate, at runtime, whether it's a flat value or an in-memory
representation. Any given function returning the type could choose to
return either representation. The datatype would have to provide a way
to serialize the in-memory representation, when and if it came time to
store it in a table. To avoid breaking functions that hadn't yet been
taught about the new representation, we'd probably want to redefine the
existing DETOAST macros as also invoking this datatype flattening
function, and then you'd need to use some new access macro if you wanted
visibility of the non-flat representation. (This assumes that the whole
thing is only applicable to toastable datatypes, but that seems like a
reasonable restriction.)
That sounds reasonable, and we have most of the infrastructure for it
since the "indirect toast" thing got in.
Another thing that would have to be attacked in order to make the
plpgsql-variable case work is that you'd need some design for copying such
Datums in-memory, and perhaps a reference count mechanism to optimize away
unnecessary copies. Your idea of tying the optimization to the nested
function call scenario would avoid the need to solve this problem, but
I think it's too narrow a scope to justify all the other work that'd be
involved.
I've thought about refcounting Datums several times, but I always got
stuck when thinking about how to deal with memory context resets and errors.
Any ideas about that?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
On 2013-10-25 10:18:27 -0400, Tom Lane wrote:
I think the right way to attack it is to create some way for a Datum
value to indicate, at runtime, whether it's a flat value or an in-memory
representation.
That sounds reasonable, and we have most of the infrastructure for it
since the "indirect toast" thing got in.
Oh really? I hadn't been paying much attention to that, but obviously
I better go back and study it.
I've thought about refcounting Datums several times, but I always got
stuck when thinking about how to deal with memory context resets and errors.
Any ideas about that?
Not yet. But it makes no sense to claim that a Datum could have a
reference that's longer-lived than the memory context it's in, so
I'm not sure the context reset case is really a problem.
regards, tom lane
On 25/10/2013 16:18, Tom Lane wrote:
Hugo Mercier <hugo.mercier@oslandia.com> writes:
PostGIS functions that manipulate geometries have to unserialize their
input geometries from the 'flat' varlena representation to their own,
and serialize the processed geometries back when returning.
But in such nested call queries, this serialization-unserialization
process is just an overhead.
This is a reasonable thing to worry about, not just for PostGIS types but
for many container types such as arrays --- it'd be nice to be able to
work with an in-memory representation that wasn't just a contiguous blob
of data. For instance, assignment to an array element might become a
constant-time operation even when working with variable-length datatypes.
So we thought having a way for user functions to know if they are part
of a nested call could allow them to avoid this serialization phase.
However, this seems like a completely wrong way to go at it. In the first
place, it wouldn't help for situations like a complex value stored in a
plpgsql variable. In the second, I don't think that what you are
describing scales to any more than the most trivial situations. What
about functions with more than one complex-type input, for example? And
you'd need to be certain that every single function taking or returning
the datatype gets updated at exactly the same time, else it'll break.
About plpgsql variables: no, there won't be any optimization in that
case. At the time the function result has to be stored in a variable, it
must be serialized.
About functions with more than one complex-type input, as long as all
parameters are of the same type, there is no problem with that.
But if your function deals with more than one complex type AND you want
to avoid serialization on each parameter, then yes, each type must be
aware of this possible optimization (choose whether to serialize or not).
I don't understand what you mean by "be certain that every single
function ... gets updated at exactly the same time". Could you develop ?
I think the right way to attack it is to create some way for a Datum
value to indicate, at runtime, whether it's a flat value or an in-memory
representation. Any given function returning the type could choose to
return either representation. The datatype would have to provide a way
to serialize the in-memory representation, when and if it came time to
store it in a table. To avoid breaking functions that hadn't yet been
taught about the new representation, we'd probably want to redefine the
existing DETOAST macros as also invoking this datatype flattening
function, and then you'd need to use some new access macro if you wanted
visibility of the non-flat representation. (This assumes that the whole
thing is only applicable to toastable datatypes, but that seems like a
reasonable restriction.)
You're totally right. That is very close to what I am working on with
PostGIS.
This is still early work, but for some details :
https://github.com/Oslandia/postgis/blob/nested_ref_passing/postgis/lwgeom_ref.h
Basically, the 'geometry' type of PostGIS is here extended with a flag
saying if the data is actual 'flat' data or a plain pointer. And if this
is a pointer, a type identifier is stored.
And there is a new DETOAST macro (here POSTGIS_DETOAST_DATUM) that will
test if the Datum is a pointer or not and if it is the case, call
corresponding unserializing functions. So you can avoid copies if your
function is aware of that, and the change for existing functions will be
minimum.
https://github.com/Oslandia/postgis/blob/nested_ref_passing/postgis/lwgeom_ref.c
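The sketch below is one possible reading of that description, not the contents of lwgeom_ref.h or lwgeom_ref.c. All names (ToyGeomDatum, REF_TYPE_POINT, toy_detoast, point_serialize) are hypothetical, and it assumes that a function unaware of pointers wants flat data, so a pointer datum is serialized on demand through a routine selected by the stored type identifier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

enum { REF_TYPE_POINT = 0, REF_TYPE_COUNT };

typedef struct { double x, y; } ToyPoint;

/* Hypothetical datum: either 'flat' holds the data, or 'ptr' refers to a live object. */
typedef struct {
    int is_pointer;
    int type_id;        /* meaningful only when is_pointer != 0 */
    void *ptr;
    char flat[64];
} ToyGeomDatum;

typedef void (*serialize_fn)(const void *obj, char *out, size_t outlen);

void point_serialize(const void *obj, char *out, size_t outlen)
{
    const ToyPoint *p = obj;
    snprintf(out, outlen, "POINT(%g %g)", p->x, p->y);
}

/* One serializer per type identifier. */
serialize_fn serializers[REF_TYPE_COUNT] = { point_serialize };

/* Role of the detoast-style macro: hand unaware callers a flat value. */
const char *toy_detoast(ToyGeomDatum *d)
{
    if (d->is_pointer) {
        serializers[d->type_id](d->ptr, d->flat, sizeof d->flat);
        d->is_pointer = 0;
        d->ptr = NULL;
    }
    return d->flat;
}
```

A function that is aware of the flag would check is_pointer itself and use ptr directly, skipping the conversion.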
You said "when and if it came time to store it in a table". And, that is
exactly the point of this 'nested' boolean: when do you know that it is
time to store in a table, from a function point of view, otherwise ?
Another thing that would have to be attacked in order to make the
plpgsql-variable case work is that you'd need some design for copying such
Datums in-memory, and perhaps a reference count mechanism to optimize away
unnecessary copies. Your idea of tying the optimization to the nested
function call scenario would avoid the need to solve this problem, but
I think it's too narrow a scope to justify all the other work that'd be
involved.
Do you think it must necessarily cover the plpgsql variable case to be
acceptable ?
--
Hugo Mercier
Oslandia
On 2013-10-25 11:01:28 -0400, Tom Lane wrote:
Andres Freund <andres@2ndquadrant.com> writes:
On 2013-10-25 10:18:27 -0400, Tom Lane wrote:
I think the right way to attack it is to create some way for a Datum
value to indicate, at runtime, whether it's a flat value or an in-memory
representation.
That sounds reasonable, and we have most of the infrastructure for it
since the "indirect toast" thing got in.
Oh really? I hadn't been paying much attention to that, but obviously
I better go back and study it.
Well, it has the infrastructure for adding further types of
varattrib_1b_e types and for computing the size independently. So you
can easily add a new type of toast datum. There still needs to be
handling for it in tuptoaster.c et al, but that's not surprising ;)
I've thought about refcounting Datums several times, but I always got
stuck when thinking about how to deal with memory context resets and errors.
Any ideas about that?
Not yet. But it makes no sense to claim that a Datum could have a
reference that's longer-lived than the memory context it's in, so
I'm not sure the context reset case is really a problem.
Given how short-lived many of the contexts used for expression
evaluation are, that might restrict the usefulness quite a bit. I think
at the very least it has to be allowed that a Datum also gets used in
child contexts.
But that already opens the door to refcount leakage when the child
context gets destroyed.
I wonder if this needs mcxt.c/aset.c support to be useful.
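To make the leakage concern concrete, here is a toy model of what context-level support might look like. It is a sketch under assumptions, not mcxt.c or aset.c code: ToyContext, ctx_hold and ctx_reset are hypothetical, and the context simply remembers the references it took so a reset can give them back.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HELD 8

typedef struct { int refcount; } SharedValue;

/* Toy context that records which shared values it holds a reference to. */
typedef struct {
    SharedValue *held[MAX_HELD];
    size_t nheld;
} ToyContext;

/* Take a reference on behalf of the context; returns 0 if the table is full. */
int ctx_hold(ToyContext *ctx, SharedValue *v)
{
    if (ctx->nheld == MAX_HELD)
        return 0;
    v->refcount++;
    ctx->held[ctx->nheld++] = v;
    return 1;
}

/* Reset (or destroy) the context: release every reference it took. */
void ctx_reset(ToyContext *ctx)
{
    for (size_t i = 0; i < ctx->nheld; i++)
        ctx->held[i]->refcount--;
    ctx->nheld = 0;
}
```

Without the bookkeeping in ctx_reset, a value referenced from a destroyed child context would keep an inflated count forever.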
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hugo Mercier <hugo.mercier@oslandia.com> writes:
On 25/10/2013 16:18, Tom Lane wrote:
However, this seems like a completely wrong way to go at it. In the first
place, it wouldn't help for situations like a complex value stored in a
plpgsql variable. In the second, I don't think that what you are
describing scales to any more than the most trivial situations. What
about functions with more than one complex-type input, for example? And
you'd need to be certain that every single function taking or returning
the datatype gets updated at exactly the same time, else it'll break.
About functions with more than one complex-type input, as long as all
parameters are of the same type, there is no problem with that.
How do you tell the difference between
foo(col1, bar(col2))
foo(bar(col1), col2)
I don't understand what you mean by "be certain that every single
function ... gets updated at exactly the same time". Could you develop ?
If you're tying this to the syntax of the expression, then bar() *must*
return a non-serialized value when and only when foo() is expecting that,
therefore their implementations must change at the same time. Perhaps
that's workable for PostGIS, but it's a complete nonstarter for
widely-known datatypes like arrays, where affected functions might be
spread through any number of extensions. We need a design that permits
incremental fixing of functions that work with a deserializable datatype.
Another point worth worrying about is that not all expressions are
function calls, nor do all function calls arise from expressions.
Chasing down all the corner cases and making sure they work properly
in a syntax-driven approach is going to be a headache.
Basically, the 'geometry' type of PostGIS is here extended with a flag
saying if the data is actual 'flat' data or a plain pointer. And if this
is a pointer, a type identifier is stored.
If you're doing that, why do you need the decoration on the FuncExpr
expressions? Can't you just look at your input datums and see if they're
flat or not?
regards, tom lane
On 25/10/2013 17:20, Tom Lane wrote:
Hugo Mercier <hugo.mercier@oslandia.com> writes:
On 25/10/2013 16:18, Tom Lane wrote:
How do you tell the difference between
foo(col1, bar(col2))
foo(bar(col1), col2)
I'm still not sure I understand ...
I assume foo() takes two arguments of type A.
bar() can take one argument of type A or of another type B.
In bar(), you would have the choice to return either a plain A
or a pointer to A. Because bar() knows its call is nested (by foo()),
it can decide to return a pointer to A.
foo() is then evaluated and we assume it knows A can be a pointer.
foo() then knows its nesting level is 0 and must return something
serialized in that case.
I don't understand what you mean by "be certain that every single
function ... gets updated at exactly the same time". Could you develop ?
If you're tying this to the syntax of the expression, then bar() *must*
return a non-serialized value when and only when foo() is expecting that,
therefore their implementations must change at the same time. Perhaps
that's workable for PostGIS, but it's a complete nonstarter for
widely-known datatypes like arrays, where affected functions might be
spread through any number of extensions. We need a design that permits
incremental fixing of functions that work with a deserializable datatype.
Yes.
It could work for each user type assuming each function working with
this type is aware of this pointer/serialized nature, including extensions.
So you have to, at least, recompile every extension depending on that
type. Which ... limits the interest for very general types, I have to
admit.
Another point worth worrying about is that not all expressions are
function calls, nor do all function calls arise from expressions.
Chasing down all the corner cases and making sure they work properly
in a syntax-driven approach is going to be a headache.
We could add this 'nesting' detection to operators (and probably other
constructs that I don't know about) little by little.
Isn't optimizing only function calls enough as a first step?
Basically, the 'geometry' type of PostGIS is here extended with a flag
saying if the data is actual 'flat' data or a plain pointer. And if this
is a pointer, a type identifier is stored.
If you're doing that, why do you need the decoration on the FuncExpr
expressions? Can't you just look at your input datums and see if they're
flat or not?
If a function returns a pointer whatever the nesting level is, you could
end up with something storing a raw pointer, which is bad. You could
eventually add a way to detect that what you stored was a pointer and
that your data no longer exists (be NULL ?) when read back, but you
basically end up with users manipulating pointers, which is bad.
If you want to make it transparent to the user, you need to know the
nesting level to decide whether you could just pass it to something that
is aware of this pointer (nesting level >=1) or serialize it back
(nesting level == 0).
--
Hugo Mercier
Oslandia
Hugo Mercier <hugo.mercier@oslandia.com> writes:
On 25/10/2013 17:20, Tom Lane wrote:
How do you tell the difference between
foo(col1, bar(col2))
foo(bar(col1), col2)
I'm still not sure I understand ...
I assume foo() takes two arguments of type A.
bar() can take one argument of type A or of another type B.
I was assuming everything was the same datatype in this example, ie
col1, col2, and the result of bar() are all type A.
The point I'm trying to make is that in the first case, foo would be
receiving a first argument that was flat and a second that was not flat;
while in the second case, it would be receiving a first argument that was
not flat and a second that was flat. The expression labeling you're
proposing does not help it tell the difference. What's more, you're
proposing that the labeling be made by generic code that can't possibly
know what bar() is really going to do.
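The two cases can be written down as data to show what the label does and does not carry. This is only an illustration with hypothetical types (CallLabel, ToyArg, FooCall); it assumes bar() returns a non-flat value whenever it is nested.

```c
#include <assert.h>
#include <stdbool.h>

/* What a static, per-call label tells foo(): only whether foo itself is nested. */
typedef struct { bool nested; } CallLabel;

/* What foo() actually needs: the representation of each argument. */
typedef struct { bool is_flat; } ToyArg;

typedef struct {
    CallLabel label;
    ToyArg args[2];
} FooCall;

/* foo(col1, bar(col2)): the first argument is a column (flat),
 * the second comes from bar() (assumed non-flat). */
FooCall foo_col_then_bar(void)
{
    FooCall c = { { false }, { { true }, { false } } };
    return c;
}

/* foo(bar(col1), col2): the roles of the two arguments are swapped. */
FooCall foo_bar_then_col(void)
{
    FooCall c = { { false }, { { false }, { true } } };
    return c;
}
```

Both calls carry the same label while their arguments differ in representation, so foo() has to inspect the values themselves.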
In bar(), you would have the choice to return either a plain A
or a pointer to A. Because bar() knows its call is nested (by foo()),
it can decide to return a pointer to A.
foo() is then evaluated and we assume it knows A can be a pointer.
foo() then knows its nesting level is 0 and must return something
serialized in that case.
Whoa. That's the most fragile, assumption-filled way you could possibly
go about this. In general, bar() cannot be expected to know whether the
outer function is able to take a non-flat parameter value. And you've
glossed over how foo() would know whether its input was flat or not.
Another point here is that there's no good reason to suppose that a
function should return a flattened value just because it's at the outer
level of its syntactic expression. For example, if we're doing a plain
SELECT foo(...) FROM ..., the next thing that will happen with that value
is it'll be fed to the output function for the datatype. Maybe that
output function would like to have a non-flat input value, too, to save
the time of transforming back to that representation. On the other hand,
if it's a SELECT ... ORDER BY ... and the planner chooses to do the ORDER
BY with a final sort step, we'll probably have to flatten the value to
pass it through sorting. (Or possibly not --- perhaps we could just pass
the toast token through sorting?) There are a lot of considerations here
and it's really unreasonable to expect that static expression labeling
will be able to do the right thing every time.
Basically the only way to make this work reliably is for Datums to be
self-identifying as to whether they're flat or structured values; then
make code do the right thing on-the-fly at runtime depending on what kind
of Datum it gets. Once you've done that, I don't see that parse-time
labeling of expression nesting adds anything useful. As Andres said,
the provisions for toasted datums are a good precedent, and none of that
depends on parse-time decisions.
regards, tom lane
On Fri, Oct 25, 2013 at 10:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Hugo Mercier <hugo.mercier@oslandia.com> writes:
PostGIS functions that manipulate geometries have to unserialize their
input geometries from the 'flat' varlena representation to their own,
and serialize the processed geometries back when returning.
But in such nested call queries, this serialization-unserialization
process is just an overhead.
This is a reasonable thing to worry about, not just for PostGIS types but
for many container types such as arrays --- it'd be nice to be able to
work with an in-memory representation that wasn't just a contiguous blob
of data. For instance, assignment to an array element might become a
constant-time operation even when working with variable-length datatypes.
I bet numeric could benefit as well. Essentially all of the
operations on numeric start by transforming the on-disk representation
into an internal form used only for the duration of a single call, and
end by transforming the internal form of the result back to the
on-disk representation.
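A toy model of that pattern, with a decimal string standing in for the on-disk form and a long for the internal form. It is not the real numeric code; unpack, pack, add_flat, sum3_flat and sum3_internal are hypothetical names, and the counter only makes the number of conversions visible.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int conversions = 0;   /* counts every pack or unpack step */

/* On-disk form to internal form. */
long unpack(const char *flat)
{
    conversions++;
    return strtol(flat, NULL, 10);
}

/* Internal form back to on-disk form. */
void pack(long v, char *out, size_t n)
{
    conversions++;
    snprintf(out, n, "%ld", v);
}

/* One operation in the current style: flat in, flat out. */
void add_flat(const char *a, const char *b, char *out, size_t n)
{
    pack(unpack(a) + unpack(b), out, n);
}

/* (a + b) + c with every call working on flat values. */
void sum3_flat(const char *a, const char *b, const char *c, char *out, size_t n)
{
    char tmp[32];
    add_flat(a, b, tmp, sizeof tmp);
    add_flat(tmp, c, out, n);
}

/* Same sum, keeping the intermediate result in internal form. */
void sum3_internal(const char *a, const char *b, const char *c, char *out, size_t n)
{
    long ab = unpack(a) + unpack(b);
    pack(ab + unpack(c), out, n);
}
```

In this model, summing three values costs six conversions when every call is flat-to-flat and four when the intermediate result stays in internal form.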
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 25/10/2013 18:44, Tom Lane wrote:
Hugo Mercier <hugo.mercier@oslandia.com> writes:
On 25/10/2013 17:20, Tom Lane wrote:
How do you tell the difference between
foo(col1, bar(col2))
foo(bar(col1), col2)
I'm still not sure I understand ...
I assume foo() takes two arguments of type A.
bar() can take one argument of type A or of another type B.
I was assuming everything was the same datatype in this example, ie
col1, col2, and the result of bar() are all type A.
The point I'm trying to make is that in the first case, foo would be
receiving a first argument that was flat and a second that was not flat;
while in the second case, it would be receiving a first argument that was
not flat and a second that was flat. The expression labeling you're
proposing does not help it tell the difference.
No, it does not. It's then up to the data type to store whether it is
flat or not. And every function manipulating this type is assumed to be
aware of this flat/non-flat flagging.
Another point here is that there's no good reason to suppose that a
function should return a flattened value just because it's at the outer
level of its syntactic expression. For example, if we're doing a plain
SELECT foo(...) FROM ..., the next thing that will happen with that value
is it'll be fed to the output function for the datatype. Maybe that
output function would like to have a non-flat input value, too, to save
the time of transforming back to that representation. On the other hand,
if it's a SELECT ... ORDER BY ... and the planner chooses to do the ORDER
BY with a final sort step, we'll probably have to flatten the value to
pass it through sorting. (Or possibly not --- perhaps we could just pass
the toast token through sorting?) There are a lot of considerations here
and it's really unreasonable to expect that static expression labeling
will be able to do the right thing every time.
Again, my proposal is very conservative here. It does not expect to
optimize all spots where copies are not necessary. Only at some level
of function evaluation with ... some assumptions.
Basically the only way to make this work reliably is for Datums to be
self-identifying as to whether they're flat or structured values; then
make code do the right thing on-the-fly at runtime depending on what kind
of Datum it gets. Once you've done that, I don't see that parse-time
labeling of expression nesting adds anything useful. As Andres said,
the provisions for toasted datums are a good precedent, and none of that
depends on parse-time decisions.
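To make the run-time approach concrete, here is a minimal sketch in plain C. It is not PostgreSQL code and does not use the TOAST machinery; it only assumes that every value of the type starts with a one-byte tag saying which representation follows. All names (REP_FLAT, REP_STRUCTURED, StructuredGeom, FlatGeom, geom_kind, geom_flatten, geom_sum_x) are hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical representation tag, stored as the first byte of every value. */
enum { REP_FLAT = 0, REP_STRUCTURED = 1 };

/* Hypothetical in-memory form: holds a pointer, so it cannot be stored as is. */
typedef struct {
    uint8_t kind;       /* always REP_STRUCTURED */
    uint32_t npoints;
    double *coords;     /* 2 * npoints doubles: x0, y0, x1, y1, ... */
} StructuredGeom;

/* Hypothetical flat form: one contiguous block, safe to copy or store. */
typedef struct {
    uint8_t kind;       /* always REP_FLAT */
    uint32_t npoints;
    double coords[];    /* 2 * npoints doubles stored inline */
} FlatGeom;

/* Every value identifies itself; no hint from the parser is needed. */
int geom_kind(const void *value)
{
    return *(const uint8_t *) value;
}

/* Serialize a structured value into a newly allocated flat one. */
FlatGeom *geom_flatten(const StructuredGeom *g)
{
    size_t bytes = 2 * (size_t) g->npoints * sizeof(double);
    FlatGeom *f = malloc(sizeof(FlatGeom) + bytes);

    assert(f != NULL);
    f->kind = REP_FLAT;
    f->npoints = g->npoints;
    if (bytes > 0)
        memcpy(f->coords, g->coords, bytes);
    return f;
}

/* A consumer that accepts either form and branches at run time. */
double geom_sum_x(const void *value)
{
    const double *coords;
    uint32_t n;
    double sum = 0.0;

    if (geom_kind(value) == REP_STRUCTURED) {
        const StructuredGeom *g = value;
        coords = g->coords;
        n = g->npoints;
    } else {
        const FlatGeom *f = value;
        coords = f->coords;
        n = f->npoints;
    }
    for (uint32_t i = 0; i < n; i++)
        sum += coords[2 * i];
    return sum;
}
```

With this layout, foo(col1, bar(col2)) and foo(bar(col1), col2) need no special handling: each argument is inspected on its own, whichever position it arrives in.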
This is something I have to investigate, thanks for pointing it out.
What I've understood so far is that there is room for new flags in the
TOAST mechanism, so the idea would be to add a new strategy where opaque
pointers could be stored. And it would then require a way for extensions
to register their own "(de)toasting" functions, right?
--
Hugo Mercier
Oslandia
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2013-10-28 09:13:06 +0100, Hugo Mercier wrote:
It's then up to the data type to store whether it is
flat or not. And every function manipulating this type is assumed to be
aware of this flat/non-flat flagging.
But what if the in-memory type contains pointers and is copied or
spilled to disk? There needs to be a mechanism handling that case.
This is something I have to investigate, thanks for pointing it out.
What I've understood so far is that there is room for new flags in the
TOAST mechanism, so the idea would be to add a new strategy where opaque
pointers could be stored. And it would then require a way for extensions
to register their own "(de)toasting" functions, right?
I think we'd need another argument to CREATE FUNCTION like SERIALIZE
pointing to a function that has to return data that can be stored
on disk. Deserialization would be up to individual functions.
Depending on the specification this might turn out to be slightly
invasive, tuplestore/sort et al probably have to care...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
2013/10/28 Andres Freund <andres@2ndquadrant.com>
I think we'd need another argument to CREATE FUNCTION like SERIALIZE
pointing to a function that has to return data that can be stored
on disk. Deserialization would be up to individual functions.
Depending on the specification this might turn out to be slightly
invasive, tuplestore/sort et al probably have to care...
Then you need a function that prepares a clone of the unpacked data too.
Regards
Pavel
On 2013-10-28 10:12:41 +0100, Pavel Stehule wrote:
Then you need a function that prepares a clone of the unpacked data too.
Why? In those cases we can (and should) just store the on-disk
representation.
Greetings,
Andres Freund
PS: Could you please try to trim the quoted emails a bit?
On 28/10/2013 09:39, Andres Freund wrote:
But what if the in-memory type contains pointers and is copied or
spilled to disk? There needs to be a mechanism handling that case.
It must not happen. The 'nested' boolean may be seen as "everything
returning from this function may be stored on disk at any time, so
serialize it" for nested==0.
If there is another mechanism to tell, inside a function, if the result
will be "stored" (stored on disk, copied to another context, ...) or
not, then I'll be happy with that.
I think we'd need another argument to CREATE FUNCTION like SERIALIZE
pointing to a function that has to return data that can be stored
on disk. Deserialization would be up to individual functions.
Either as an argument to CREATE FUNCTION or to CREATE TYPE, right?
Ok, so a user function calls PG_DETOAST to get its input. The most
deeply nested one will get it straight from where it is stored.
Then the function can decide to deserialize it into its own format,
process it, and return it as is, probably with a call to
PG_RETURN(pointer). Nested functions will still get their inputs from
PG_DETOAST and can use them directly.
But for the last function in the nesting chain, how will the pointer be
serialized back to something storable? I.e. who will call the
serialize function declared in CREATE (FUNCTION|TYPE)?
--
Hugo Mercier
Oslandia
On 2013-10-28 10:29:59 +0100, Hugo Mercier wrote:
It must not happen. The 'nested' boolean may be seen as "everything
returning from this function may be stored on disk at any time, so
serialize it" for nested==0.
I don't think that's sufficient. There'll be lots of places where you'd
need to special-case hack this logic.
Think of SELECT aggregate(somefunc(foo)) FROM ... GROUP BY something_else;
If there is another mechanism to tell, inside a function, if the result
will be "stored" (stored on disk, copied to another context, ...) or
not, then I'll be happy with that.
I don't think telling the function that is the right approach at all.
I think we'd need another argument to CREATE FUNCTION like SERIALIZE
pointing to a function that has to return data that can be stored
on disk. Deserialization would be up to individual functions.
Either as an argument to CREATE FUNCTION or to CREATE TYPE, right?
Err, CREATE TYPE, yes.
Ok, so a user function calls PG_DETOAST to get its input. The most
deeply nested one will get it straight from where it is stored.
Then the function can decide to deserialize it into its own format,
process it, and return it as is, probably with a call to
PG_RETURN(pointer). Nested functions will still get their inputs from
PG_DETOAST and can use them directly.
But for the last function in the nesting chain, how will the pointer be
serialized back to something storable? I.e. who will call the
serialize function declared in CREATE (FUNCTION|TYPE)?
Something around toast_insert_or_update(). We'd need to set
HeapTupleHasExternal() for those kinds of tuples or similar so it gets
called, but that shouldn't be the biggest problem.
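As a rough illustration of who calls the serializer, here is a minimal sketch in plain C. It assumes a per-type serialize hook (standing in for the SERIALIZE clause on CREATE TYPE discussed above) and a storage-side routine that invokes it only for values that are not already flat. Nothing here is PostgreSQL API: TypeEntry, Value, IntList, intlist_serialize and prepare_for_storage are hypothetical names, and prepare_for_storage merely stands in for whatever code path would sit near toast_insert_or_update().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-type hook: turns the in-memory form into storable bytes. */
typedef void *(*serialize_fn)(const void *in_memory, size_t *out_len);

/* Hypothetical type registry entry, as if filled in by CREATE TYPE. */
typedef struct {
    const char *name;
    serialize_fn serialize;
} TypeEntry;

/* Hypothetical value wrapper; is_flat plays the role of a header flag. */
typedef struct {
    const TypeEntry *type;
    int is_flat;
    const void *data;
    size_t len;          /* meaningful only when is_flat is set */
} Value;

/* Example in-memory form that contains a pointer. */
typedef struct {
    size_t n;
    int *items;
} IntList;

/* Example hook for IntList: copy the items into one contiguous buffer. */
void *intlist_serialize(const void *in_memory, size_t *out_len)
{
    const IntList *list = in_memory;
    size_t bytes = list->n * sizeof(int);
    void *flat = malloc(bytes ? bytes : 1);

    assert(flat != NULL);
    if (bytes > 0)
        memcpy(flat, list->items, bytes);
    *out_len = bytes;
    return flat;
}

/* Storage-side entry point: functions never serialize on their own account;
 * the hook runs only here, when a non-flat value is about to be stored. */
void *prepare_for_storage(const Value *v, size_t *out_len)
{
    if (!v->is_flat)
        return v->type->serialize(v->data, out_len);

    void *copy = malloc(v->len ? v->len : 1);
    assert(copy != NULL);
    if (v->len > 0)
        memcpy(copy, v->data, v->len);
    *out_len = v->len;
    return copy;
}
```

In this arrangement the last function in a nesting chain returns its pointer like any other, and serialization happens only if the result actually reaches the storage path.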
Greetings,
Andres Freund
2013/10/28 Andres Freund <andres@2ndquadrant.com>
Why? In those cases we can (and should) just store the on-disk
representation.
ok,
Pavel