percentile value check can be slow

Started by jotpe · about 8 years ago · 9 messages
#1jotpe
jotpe@posteo.de

I tried to enter invalid percentile fractions, and was astonished that
they seem to be checked only after much work has been done.

psql (11devel)
Type "help" for help.

jotpe=# \timing
Timing is on.
jotpe=# select percentile_cont(array[0,0.25,0.5,1,1,null,2]) within
group(order by txk) from klima_tag;
ERROR: percentile value 2 is not between 0 and 1
Time: 19155,565 ms (00:19,156)
jotpe=# select count(*) from klima_tag;
count
----------
13950214
(1 row)

Time: 933,847 ms
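
For comparison, the same call restricted to valid fractions (a sketch
against the same table) completes without the error:

jotpe=# select percentile_cont(array[0,0.25,0.5,1]) within
group(order by txk) from klima_tag;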

Best regards Johannes

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: jotpe (#1)
Re: percentile value check can be slow

jotpe <jotpe@posteo.de> writes:

I tried to enter invalid percentile fractions, and was astonished that
they seem to be checked only after much work has been done.

IIRC, only the aggregate's final-function is concerned with direct
arguments, so it's not all that astonishing.
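
For reference, that wiring is visible in the catalogs; a quick check,
assuming a stock installation:

SELECT aggkind, aggnumdirectargs, aggtransfn, aggfinalfn
FROM pg_aggregate
WHERE aggfnoid = 'percentile_cont(float8[], float8)'::regprocedure;
-- aggkind 'o' marks an ordered-set aggregate; only aggfinalfn ever
-- receives the aggnumdirectargs direct arguments.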

regards, tom lane

#3David Fetter
david@fetter.org
In reply to: Tom Lane (#2)
Re: percentile value check can be slow

On Sat, Nov 18, 2017 at 10:44:36AM -0500, Tom Lane wrote:

jotpe <jotpe@posteo.de> writes:

I tried to enter invalid percentile fractions, and was astonished
that they seem to be checked only after much work has been done.

IIRC, only the aggregate's final-function is concerned with direct
arguments, so it's not all that astonishing.

It may not be surprising from the point of view of a systems
programmer, but it's pretty surprising that this check is deferred
until many seconds in, when the system has all the information it
needs to establish this before execution begins.

I'm not sure I see an easy way to do this check early, but it's worth
trying on grounds of POLA violation. I have a couple of ideas on how
to do this, one less invasive but hinky, the other a lot more invasive
but better overall.

Ugly Hack With Ugly Performance Consequences:
Inject a subtransaction at the start of execution that supplies an
empty input to the final function with the supplied direct
arguments.

Bigger Lift:
Require a separate recognizer function for direct arguments and fire
it during post-parse analysis. Perhaps this could be called
recognizer along with the corresponding mrecognizer. It's not
clear that it's sane to have separate ones, but I thought I'd
bring it up for completeness.

Way Bigger Lift, As Far As I Can Tell, But More Fun For Users:
Allow optional CHECK constraints in CREATE AGGREGATE for direct
arguments.
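
A purely hypothetical sketch of what that last option might look like
(CREATE AGGREGATE has no such clause today; the DIRECT CHECK syntax
below is invented for illustration):

CREATE AGGREGATE percentile_cont(float8[] ORDER BY float8) (
    SFUNC = ordered_set_transition,
    STYPE = internal,
    FINALFUNC = percentile_cont_float8_multi_final,
    -- hypothetical clause, evaluated once before execution begins:
    DIRECT CHECK (numrange(0, 1, '[]') @> ALL (VALUE))
);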

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#4Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: David Fetter (#3)
Re: percentile value check can be slow

Hi,

On 11/18/2017 10:30 PM, David Fetter wrote:

On Sat, Nov 18, 2017 at 10:44:36AM -0500, Tom Lane wrote:

jotpe <jotpe@posteo.de> writes:

I tried to enter invalid percentile fractions, and was astonished
that they seem to be checked only after much work has been done.

IIRC, only the aggregate's final-function is concerned with direct
arguments, so it's not all that astonishing.

It may not be surprising from the point of view of a systems
programmer, but it's pretty surprising that this check is deferred
until many seconds in, when the system has all the information it
needs to establish this before execution begins.

I'm not sure I see an easy way to do this check early, but it's worth
trying on grounds of POLA violation. I have a couple of ideas on how
to do this, one less invasive but hinky, the other a lot more invasive
but better overall.

Ugly Hack With Ugly Performance Consequences:
Inject a subtransaction at the start of execution that supplies an
empty input to the final function with the supplied direct
arguments.

I'm pretty sure you realize this is quite unlikely to get accepted.

Bigger Lift:
Require a separate recognizer function for direct arguments and fire
it during post-parse analysis. Perhaps this could be called
recognizer along with the corresponding mrecognizer. It's not
clear that it's sane to have separate ones, but I thought I'd
bring it up for completeness.

Is 'recognizer' an established term I should know? Is it the same
as 'validator' or is it something new/different?

Way Bigger Lift, As Far As I Can Tell, But More Fun For Users:
Allow optional CHECK constraints in CREATE AGGREGATE for direct
arguments.

How will any of the approaches deal with something like

select percentile_cont((select array_agg(v) from p))
within group (order by a) from t;

In this case the values are unknown at parse analysis time, so I
guess it does not really address that.
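
A quick throwaway sketch of that situation (table and column names as
in the query above):

CREATE TABLE t (a float8);
CREATE TABLE p (v float8);
INSERT INTO t VALUES (1);
INSERT INTO p VALUES (0.5), (2);  -- 2 is invalid, but only visible at run time

SELECT percentile_cont((SELECT array_agg(v) FROM p))
  WITHIN GROUP (ORDER BY a) FROM t;
-- ERROR:  percentile value 2 is not between 0 and 1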

FWIW while I won't stand in the way of improving this, I wonder if this
is really worth the additional complexity. If you get errors like this
with a static list of values, you will fix the list and you're done. If
the list is dynamic (computed in the query itself), you'll still get the
error much later during query execution.

So if you're getting so many failures like this that the "delayed
error reporting" becomes an issue, perhaps there's something wrong in
your stack and you should address that instead?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#5David Fetter
david@fetter.org
In reply to: Tomas Vondra (#4)
Re: percentile value check can be slow

On Sat, Nov 18, 2017 at 11:05:47PM +0100, Tomas Vondra wrote:

Hi,

On 11/18/2017 10:30 PM, David Fetter wrote:

On Sat, Nov 18, 2017 at 10:44:36AM -0500, Tom Lane wrote:

jotpe <jotpe@posteo.de> writes:

I tried to enter invalid percentile fractions, and was astonished
that they seem to be checked only after much work has been done.

IIRC, only the aggregate's final-function is concerned with direct
arguments, so it's not all that astonishing.

It may not be surprising from the point of view of a systems
programmer, but it's pretty surprising that this check is deferred
until many seconds in, when the system has all the information it
needs to establish this before execution begins.

I'm not sure I see an easy way to do this check early, but it's worth
trying on grounds of POLA violation. I have a couple of ideas on how
to do this, one less invasive but hinky, the other a lot more invasive
but better overall.

Ugly Hack With Ugly Performance Consequences:
Inject a subtransaction at the start of execution that supplies an
empty input to the final function with the supplied direct
arguments.

I'm pretty sure you realize this is quite unlikely to get accepted.

I should hope not, but I mentioned it because I'd like to get it on
the record as rejected.

Bigger Lift:
Require a separate recognizer function for direct arguments and fire
it during post-parse analysis. Perhaps this could be called
recognizer along with the corresponding mrecognizer. It's not
clear that it's sane to have separate ones, but I thought I'd
bring it up for completeness.

Is 'recognizer' an established term I should know? Is it the same
as 'validator' or is it something new/different?

I borrowed it from http://langsec.org/

I'm not entirely sure what you mean by a validator, but a recognizer
is something that gives a quick and sure read as to whether the input
is well-formed. In general, it's along the lines of a tokenizer, a
parser, and something that does very light post-parse analysis for
correctness of form.

For the case that started the thread, a recognizer would check
something along the lines of

CHECK('[0,1]' @> ALL(input_array))
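
That containment test can already be written with stock range
operators; a quick sketch, with literal values standing in for the
direct argument:

SELECT numrange(0, 1, '[]') @> ALL (ARRAY[0, 0.25, 0.5, 1]::numeric[]);  -- true
SELECT numrange(0, 1, '[]') @> ALL (ARRAY[0, 0.25, 2]::numeric[]);       -- false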

Way Bigger Lift, As Far As I Can Tell, But More Fun For Users:
Allow optional CHECK constraints in CREATE AGGREGATE for direct
arguments.

How will any of the approaches deal with something like

select percentile_cont((select array_agg(v) from p))
within group (order by a) from t;

In this case the values are unknown at parse analysis time, so I
guess it does not really address that.

It doesn't. Does it make sense to do a one-shot execution for cases
like that? It might well be worth it to do the aggregate once in
advance as a throw-away if the query execution time is already going
to take a while. Of course, you can break that one by making p a JOIN
to yet another thing...

FWIW while I won't stand in the way of improving this, I wonder if this
is really worth the additional complexity. If you get errors like this
with a static list of values, you will fix the list and you're done. If
the list is dynamic (computed in the query itself), you'll still get the
error much later during query execution.

So if you're getting so many failures like this that the "delayed
error reporting" becomes an issue, perhaps there's something wrong in
your stack and you should address that instead?

I'd like to think that getting something to fail quickly and cheaply
when it can will give our end users a better experience. Here,
"cheaply" refers to their computing resources and time. Clearly, not
having this happen in this case bothered Johannes enough to wade in
here.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#6Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: David Fetter (#5)
Re: percentile value check can be slow

Hi,

On 11/19/2017 03:10 AM, David Fetter wrote:

On Sat, Nov 18, 2017 at 11:05:47PM +0100, Tomas Vondra wrote:

Hi,

...

Is 'recognizer' an established term I should know? Is it the same
as 'validator' or is it something new/different?

I borrowed it from http://langsec.org/

I'm not entirely sure what you mean by a validator, but a recognizer
is something that gives a quick and sure read as to whether the input
is well-formed. In general, it's along the lines of a tokenizer, a
parser, and something that does very light post-parse analysis for
correctness of form.

For the case that started the thread, a recognizer would check
something along the lines of

CHECK('[0,1]' @> ALL(input_array))

OK, thanks. From what I understand, a recognizer is more about
recognizing whether a string is valid within a given formal language
(essentially, whether it's a well-formed program). That may not be the
right term for checks on parameter values.

OTOH we already have "validators" in a number of places - functions
checking various parameters, e.g. reloptions for FDWs, etc.

But I guess the naming can be solved later ...

Way Bigger Lift, As Far As I Can Tell, But More Fun For Users:
Allow optional CHECK constraints in CREATE AGGREGATE for direct
arguments.

How will any of the approaches deal with something like

select percentile_cont((select array_agg(v) from p))
within group (order by a) from t;

In this case the values are unknown at parse analysis time, so I
guess it does not really address that.

It doesn't. Does it make sense to do a one-shot execution for cases
like that? It might well be worth it to do the aggregate once in
advance as a throw-away if the query execution time is already going
to take a while. Of course, you can break that one by making p a JOIN
to yet another thing...

FWIW while I won't stand in the way of improving this, I wonder if this
is really worth the additional complexity. If you get errors like this
with a static list of values, you will fix the list and you're done. If
the list is dynamic (computed in the query itself), you'll still get the
error much later during query execution.

So if you're getting so many failures like this that the "delayed
error reporting" becomes an issue, perhaps there's something wrong in
your stack and you should address that instead?

I'd like to think that getting something to fail quickly and cheaply
when it can will give our end users a better experience. Here,
"cheaply" refers to their computing resources and time.

The trouble is, this increases execution time for everyone, including
people who carefully construct the parameter values. That seems rather
undesirable.

Clearly, not having this happen in this case bothered Johannes
enough to wade in here.

No. He was surprised the error is reported after a significant amount
of time, but he does not seem to claim that failing faster would be
valuable to him. That is your assumption, and I have my doubts about it.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#7jotpe
jotpe@posteo.de
In reply to: Tomas Vondra (#6)
Re: percentile value check can be slow

On 19.11.2017 13:23, Tomas Vondra wrote:

Hi,

On 11/19/2017 03:10 AM, David Fetter wrote:

On Sat, Nov 18, 2017 at 11:05:47PM +0100, Tomas Vondra wrote:

Hi,

...

Is 'recognizer' an established term I should know? Is it the same
as 'validator' or is it something new/different?

I borrowed it from http://langsec.org/

I'm not entirely sure what you mean by a validator, but a recognizer
is something that gives a quick and sure read as to whether the input
is well-formed. In general, it's along the lines of a tokenizer, a
parser, and something that does very light post-parse analysis for
correctness of form.

For the case that started the thread, a recognizer would check
something along the lines of

CHECK('[0,1]' @> ALL(input_array))

OK, thanks. From what I understand, a recognizer is more about
recognizing whether a string is valid within a given formal language
(essentially, whether it's a well-formed program). That may not be the
right term for checks on parameter values.

OTOH we already have "validators" in a number of places - functions
checking various parameters, e.g. reloptions for FDWs, etc.

But I guess the naming can be solved later ...

Way Bigger Lift, As Far As I Can Tell, But More Fun For Users:
Allow optional CHECK constraints in CREATE AGGREGATE for direct
arguments.

How will any of the approaches deal with something like

select percentile_cont((select array_agg(v) from p))
within group (order by a) from t;

In this case the values are unknown at parse analysis time, so I
guess it does not really address that.

It doesn't. Does it make sense to do a one-shot execution for cases
like that? It might well be worth it to do the aggregate once in
advance as a throw-away if the query execution time is already going
to take a while. Of course, you can break that one by making p a JOIN
to yet another thing...

FWIW while I won't stand in the way of improving this, I wonder if this
is really worth the additional complexity. If you get errors like this
with a static list of values, you will fix the list and you're done. If
the list is dynamic (computed in the query itself), you'll still get the
error much later during query execution.

So if you're getting so many failures like this that the "delayed
error reporting" becomes an issue, perhaps there's something wrong in
your stack and you should address that instead?

I'd like to think that getting something to fail quickly and cheaply
when it can will give our end users a better experience. Here,
"cheaply" refers to their computing resources and time.

The trouble is, this increases execution time for everyone, including
people who carefully construct the parameter values. That seems rather
undesirable.

Clearly, not having this happen in this case bothered Johannes
enough to wade in here.

No. He was surprised the error is reported after a significant amount
of time, but he does not seem to claim that failing faster would be
valuable to him. That is your assumption, and I have my doubts about it.

I did not know about the complexity that is needed to precheck the
parameters. I thought maybe it could be done easily.
If it's too hard to change that, I wouldn't want that improvement.

Best Regards
Johannes

#8David Fetter
david@fetter.org
In reply to: Tomas Vondra (#6)
Re: percentile value check can be slow

On Sun, Nov 19, 2017 at 01:23:42PM +0100, Tomas Vondra wrote:

Hi,

On 11/19/2017 03:10 AM, David Fetter wrote:

On Sat, Nov 18, 2017 at 11:05:47PM +0100, Tomas Vondra wrote:

Hi,

...

Is 'recognizer' an established term I should know? Is it the same
as 'validator' or is it something new/different?

I borrowed it from http://langsec.org/

I'm not entirely sure what you mean by a validator, but a recognizer
is something that gives a quick and sure read as to whether the input
is well-formed. In general, it's along the lines of a tokenizer, a
parser, and something that does very light post-parse analysis for
correctness of form.

For the case that started the thread, a recognizer would check
something along the lines of

CHECK('[0,1]' @> ALL(input_array))

OK, thanks. From what I understand, a recognizer is more about
recognizing whether a string is valid within a given formal language
(essentially, whether it's a well-formed program). That may not be the
right term for checks on parameter values.

There are two hard problems in computer science: naming things, cache
coherency, and off-by-one.

OTOH we already have "validators" in a number of places - functions
checking various parameters, e.g. reloptions for FDWs, etc.

But I guess the naming can be solved later ...

Indeed.

Way Bigger Lift, As Far As I Can Tell, But More Fun For Users:
Allow optional CHECK constraints in CREATE AGGREGATE for direct
arguments.

How will any of the approaches deal with something like

select percentile_cont((select array_agg(v) from p))
within group (order by a) from t;

In this case the values are unknown at parse analysis time, so I
guess it does not really address that.

It doesn't. Does it make sense to do a one-shot execution for cases
like that? It might well be worth it to do the aggregate once in
advance as a throw-away if the query execution time is already going
to take a while. Of course, you can break that one by making p a JOIN
to yet another thing...

FWIW while I won't stand in the way of improving this, I wonder if this
is really worth the additional complexity. If you get errors like this
with a static list of values, you will fix the list and you're done. If
the list is dynamic (computed in the query itself), you'll still get the
error much later during query execution.

So if you're getting so many failures like this that the "delayed
error reporting" becomes an issue, perhaps there's something wrong in
your stack and you should address that instead?

I'd like to think that getting something to fail quickly and cheaply
when it can will give our end users a better experience. Here,
"cheaply" refers to their computing resources and time.

The trouble is, this increases execution time for everyone, including
people who carefully construct the parameter values. That seems rather
undesirable.

I may be wrong but I'm pretty sure that a check for well-formed direct
parameters will not impose a significant cost on aggregates.

It occurs to me that this particular aggregate could take an array of
a domain defined along the lines of:

CREATE DOMAIN float4_0_1_closed AS float4
NOT NULL
CHECK(VALUE >= 0.0 AND VALUE <= 1.0);

Then the check would happen much earlier without adding a bunch of
potentially expensive machinery.
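
A quick sketch of the effect, using the domain above: the cast checks
every element up front, before any table rows are read.

SELECT ARRAY[0, 0.25, 0.5, 2]::float4_0_1_closed[];
-- ERROR:  value for domain float4_0_1_closed violates check constraint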

Clearly, not having this happen in this case bothered Johannes
enough to wade in here.

No. He was surprised the error is reported after a significant amount
of time, but he does not seem to claim that failing faster would be
valuable to him. That is your assumption, and I have my doubts about
it.

My mistake. I shouldn't have guessed when there was a better
alternative.

Johannes, could you help us understand your thinking in reporting
this?

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#9jotpe
jotpe@posteo.de
In reply to: David Fetter (#8)
Re: percentile value check can be slow

On 19.11.2017 18:49, David Fetter wrote:

On Sun, Nov 19, 2017 at 01:23:42PM +0100, Tomas Vondra wrote:

Hi,

On 11/19/2017 03:10 AM, David Fetter wrote:

On Sat, Nov 18, 2017 at 11:05:47PM +0100, Tomas Vondra wrote:

Hi,

...

Is 'recognizer' an established term I should know? Is it the same
as 'validator' or is it something new/different?

I borrowed it from http://langsec.org/

I'm not entirely sure what you mean by a validator, but a recognizer
is something that gives a quick and sure read as to whether the input
is well-formed. In general, it's along the lines of a tokenizer, a
parser, and something that does very light post-parse analysis for
correctness of form.

For the case that started the thread, a recognizer would check
something along the lines of

CHECK('[0,1]' @> ALL(input_array))

OK, thanks. From what I understand, a recognizer is more about
recognizing whether a string is valid within a given formal language
(essentially, whether it's a well-formed program). That may not be the
right term for checks on parameter values.

There are two hard problems in computer science: naming things, cache
coherency, and off-by-one.

OTOH we already have "validators" in a number of places - functions
checking various parameters, e.g. reloptions for FDWs, etc.

But I guess the naming can be solved later ...

Indeed.

Way Bigger Lift, As Far As I Can Tell, But More Fun For Users:
Allow optional CHECK constraints in CREATE AGGREGATE for direct
arguments.

How will any of the approaches deal with something like

select percentile_cont((select array_agg(v) from p))
within group (order by a) from t;

In this case the values are unknown at parse analysis time, so I
guess it does not really address that.

It doesn't. Does it make sense to do a one-shot execution for cases
like that? It might well be worth it to do the aggregate once in
advance as a throw-away if the query execution time is already going
to take a while. Of course, you can break that one by making p a JOIN
to yet another thing...

FWIW while I won't stand in the way of improving this, I wonder if this
is really worth the additional complexity. If you get errors like this
with a static list of values, you will fix the list and you're done. If
the list is dynamic (computed in the query itself), you'll still get the
error much later during query execution.

So if you're getting so many failures like this that the "delayed
error reporting" becomes an issue, perhaps there's something wrong in
your stack and you should address that instead?

I'd like to think that getting something to fail quickly and cheaply
when it can will give our end users a better experience. Here,
"cheaply" refers to their computing resources and time.

The trouble is, this increases execution time for everyone, including
people who carefully construct the parameter values. That seems rather
undesirable.

I may be wrong but I'm pretty sure that a check for well-formed direct
parameters will not impose a significant cost on aggregates.

It occurs to me that this particular aggregate could take an array of
a domain defined along the lines of:

CREATE DOMAIN float4_0_1_closed AS float4
NOT NULL
CHECK(VALUE >= 0.0 AND VALUE <= 1.0);

Then the check would happen much earlier without adding a bunch of
potentially expensive machinery.

Clearly, not having this happen in this case bothered Johannes
enough to wade in here.

No. He was surprised the error is reported after a significant amount
of time, but he does not seem to claim that failing faster would be
valuable to him. That is your assumption, and I have my doubts about
it.

My mistake. I shouldn't have guessed when there was a better
alternative.

I already wrote that: I did not know about the complexity that is needed
to precheck the parameters. I thought maybe it could be done easily.
If it's too hard to change that, I wouldn't want that improvement.

Johannes, could you help us understand your thinking in reporting
this?

When executing this query I wanted to see what happens with wrong
fractions. I expected the query to fail quickly, but it seems
PostgreSQL has to read a lot before it checks (at the end) that one
parameter was outside the valid range.

I thought maybe there were technical reasons to do it that way, or
that it could simply be improved to fail fast.

Best regards.