Make query ID more portable

Started by Andrei Lepikhovover 4 years ago9 messageshackers
Jump to latest
#1Andrei Lepikhov
lepihov@gmail.com

Hi,

QueryID is good tool for query analysis. I want to improve core jumbling
machinery in two ways:
1. QueryID value should survive dump/restore of a database (use fully
qualified name of table instead of relid).
2. QueryID could represent more general class of queries: for example,
it can be independent from permutation of tables in a FROM clause.

See the patch in attachment as an POC. Main idea here is to break
JumbleState down to a 'clocations' part that can be really interested in
a post parse hook and a 'context data', that needed to build query or
subquery signature (hash) and, I guess, isn't really needed in any
extensions.

I think, it adds not much complexity and overhead. It still not
guaranteed equality of queryid on two instances with an equal schema,
but survives across an instance upgrade and allows to do some query
analysis on a replica node.

--
regards,
Andrey Lepikhov
Postgres Professional

Attachments:

0001-Make-queryid-more-portable.patchtext/x-patch; charset=UTF-8; name=0001-Make-queryid-more-portable.patchDownload+299-131
#2Julien Rouhaud
rjuju123@gmail.com
In reply to: Andrei Lepikhov (#1)
Re: Make query ID more portable

Hi,

On Tue, Oct 12, 2021 at 4:12 PM Andrey V. Lepikhov
<a.lepikhov@postgrespro.ru> wrote:

QueryID is good tool for query analysis. I want to improve core jumbling
machinery in two ways:
1. QueryID value should survive dump/restore of a database (use fully
qualified name of table instead of relid).
2. QueryID could represent more general class of queries: for example,
it can be independent from permutation of tables in a FROM clause.

See the patch in attachment as an POC. Main idea here is to break
JumbleState down to a 'clocations' part that can be really interested in
a post parse hook and a 'context data', that needed to build query or
subquery signature (hash) and, I guess, isn't really needed in any
extensions.

There have been quite a lot of threads about that in the past, and
almost every time people wanted to change how the hash was computed.
So it seems to me that extensions would actually be quite interested
in that. This is even more the case now that an extension can be used
to replace the queryid calculation only and keep the rest of the
extension relying on it as is.

I think, it adds not much complexity and overhead.

I think the biggest change in your patch is:

  case RTE_RELATION:
- APP_JUMB(rte->relid);
- JumbleExpr(jstate, (Node *) rte->tablesample);
+ {
+ char *relname = regclassout_ext(rte->relid, true);
+
+ APP_JUMB_STRING(relname);
+ JumbleExpr(jstate, (Node *) rte->tablesample, ctx);
  APP_JUMB(rte->inh);
  break;

Have you done any benchmark on OLTP workload? Adding catalog access
there is likely to add significant overhead.

Also, why only using the fully qualified relation name for stable
hashes? At least operators and functions should also be treated the
same way. If you do that you will probably have way too much overhead
to be usable in a busy production environment. Why not using the new
possibility of 3rd party extension for the queryid calculation that
exactly suits your need?

#3Andrei Lepikhov
lepihov@gmail.com
In reply to: Julien Rouhaud (#2)
Re: Make query ID more portable

On 12/10/21 13:35, Julien Rouhaud wrote:

Hi,

On Tue, Oct 12, 2021 at 4:12 PM Andrey V. Lepikhov
<a.lepikhov@postgrespro.ru> wrote:

See the patch in attachment as an POC. Main idea here is to break
JumbleState down to a 'clocations' part that can be really interested in
a post parse hook and a 'context data', that needed to build query or
subquery signature (hash) and, I guess, isn't really needed in any
extensions.

There have been quite a lot of threads about that in the past, and
almost every time people wanted to change how the hash was computed.
So it seems to me that extensions would actually be quite interested
in that. This is even more the case now that an extension can be used
to replace the queryid calculation only and keep the rest of the
extension relying on it as is.

Yes, I know. I have been using such self-made queryID for four years.
And I will use it further.
But core jumbling code is good, fast and much easier in support. The
purpose of this work is extending of jumbling to use in more flexible
way to avoid meaningless copying of this code to an extension.

I think, it adds not much complexity and overhead.

I think the biggest change in your patch is:

case RTE_RELATION:
- APP_JUMB(rte->relid);
- JumbleExpr(jstate, (Node *) rte->tablesample);
+ {
+ char *relname = regclassout_ext(rte->relid, true);
+
+ APP_JUMB_STRING(relname);
+ JumbleExpr(jstate, (Node *) rte->tablesample, ctx);
APP_JUMB(rte->inh);
break;

Have you done any benchmark on OLTP workload? Adding catalog access
there is likely to add significant overhead.

Yes, I should do benchmarking. But I guess, main goal of Query ID is
monitoring, that can be switched off, if necessary.
This part made for a demo. It can be replaced by a hook, for example.

Also, why only using the fully qualified relation name for stable
hashes? At least operators and functions should also be treated the
same way. If you do that you will probably have way too much overhead
to be usable in a busy production environment. Why not using the new
possibility of 3rd party extension for the queryid calculation that
exactly suits your need?

I fully agree with these arguments. This code is POC. Main part here is
breaking down JumbleState, using a local context for subqueries and
sorting of a range table entries hashes.
I think, we can call one routine (APP_JUMB_OBJECT(), as an example) for
all oids in this code. It would allow an extension to intercept this
call and replace oid with an arbitrary value.

--
regards,
Andrey Lepikhov
Postgres Professional

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrei Lepikhov (#3)
Re: Make query ID more portable

Andrey Lepikhov <a.lepikhov@postgrespro.ru> writes:

But core jumbling code is good, fast and much easier in support.

It won't be fast once you stick a bunch of catalog lookups into it.
I think this is fine as an extension, but it has no chance of being
accepted in core, just on performance grounds.

(I'm also not sure that the query ID calculation code is always/only
invoked in contexts where it's safe to do catalog accesses.)

A bigger issue is that query ID stability isn't something we are going
to promise on a large scale --- for example, what if a new release adds
some new fields to struct Query? So I'm not sure that "query IDs should
survive dump/reload" is a useful goal to consider. It's certainly not
something that could be reached by anything even remotely like the
existing code.

regards, tom lane

#5Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#4)
Re: Make query ID more portable

On Tue, Oct 12, 2021 at 09:40:47AM -0400, Tom Lane wrote:

Andrey Lepikhov <a.lepikhov@postgrespro.ru> writes:

But core jumbling code is good, fast and much easier in support.

It won't be fast once you stick a bunch of catalog lookups into it.
I think this is fine as an extension, but it has no chance of being
accepted in core, just on performance grounds.

(I'm also not sure that the query ID calculation code is always/only
invoked in contexts where it's safe to do catalog accesses.)

A bigger issue is that query ID stability isn't something we are going
to promise on a large scale --- for example, what if a new release adds
some new fields to struct Query? So I'm not sure that "query IDs should
survive dump/reload" is a useful goal to consider. It's certainly not
something that could be reached by anything even remotely like the
existing code.

Also, the current code handles renames of schemas and objects, but this
would not.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

#6Andrei Lepikhov
lepihov@gmail.com
In reply to: Bruce Momjian (#5)
Re: Make query ID more portable

On 12/10/21 18:45, Bruce Momjian wrote:

On Tue, Oct 12, 2021 at 09:40:47AM -0400, Tom Lane wrote:

Andrey Lepikhov <a.lepikhov@postgrespro.ru> writes:

But core jumbling code is good, fast and much easier in support.

Also, the current code handles renames of schemas and objects, but this
would not.

Yes, It is good option if an extension works only in the context of one
node. But my efforts are directed to the cross-instance usage of a
monitoring data. As an example, it may be useful for sharding.
Also, I guess for an user is essential to think that if he changed a
name of any object he also would change queries and reset monitoring
data, related on this object.

--
regards,
Andrey Lepikhov
Postgres Professional

#7Julien Rouhaud
rjuju123@gmail.com
In reply to: Andrei Lepikhov (#6)
Re: Make query ID more portable

On Thu, Oct 14, 2021 at 12:37 PM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:

On 12/10/21 18:45, Bruce Momjian wrote:

On Tue, Oct 12, 2021 at 09:40:47AM -0400, Tom Lane wrote:

Andrey Lepikhov <a.lepikhov@postgrespro.ru> writes:

But core jumbling code is good, fast and much easier in support.

Also, the current code handles renames of schemas and objects, but this
would not.

Yes, It is good option if an extension works only in the context of one
node. But my efforts are directed to the cross-instance usage of a
monitoring data. As an example, it may be useful for sharding.
Also, I guess for an user is essential to think that if he changed a
name of any object he also would change queries and reset monitoring
data, related on this object.

What if someone wants to allow any form of partitioning without
changing to the ID, or ignore the schema because it's a multi tenant
db with dedicated roles?

I think that there are just too many arbitrary decisions that could be
made on what exactly should be a query identifier to have a single
in-core implementation. If you do sharding, you already have to
properly configure each node, so configuring your custom query id
extension shouldn't be a big problem.

#8Andrei Lepikhov
lepihov@gmail.com
In reply to: Tom Lane (#4)
Re: Make query ID more portable

On 12/10/21 18:40, Tom Lane wrote:

Andrey Lepikhov <a.lepikhov@postgrespro.ru> writes:

But core jumbling code is good, fast and much easier in support.

A bigger issue is that query ID stability isn't something we are going
to promise on a large scale --- for example, what if a new release adds
some new fields to struct Query? So I'm not sure that "query IDs should
survive dump/reload" is a useful goal to consider. It's certainly not
something that could be reached by anything even remotely like the
existing code.

Thank you for the explanation.
I think the problem of queryId is that is encapsulates two different
meanings:
1. It allows an extension to match an query on post parse and execution
stages. In this sense, queryId should be as unique as possible for each
query.
2. For pg_stat_statements purposes (and my project too) it represents an
query class and should be stable against permutations of range table
entries, clauses, e.t.c. For example:
"SELECT * FROM a,b;" and "SELECT * FROM b,a;" should have the same queryId.

This issue may be solved in an extension with next approach:
1. Force as unique value for queryId as extension wants in a post parse hook
2. Generalize the JumbleQuery routine code to generate a kind of query
class signature.

The attached patch is a first sketch for such change.

--
regards,
Andrey Lepikhov
Postgres Professional

Attachments:

0001-Add-callback-for-jumbling-of-an-OID-value.patchtext/plain; charset=UTF-8; name=0001-Add-callback-for-jumbling-of-an-OID-value.patchDownload+67-53
#9Andrei Lepikhov
lepihov@gmail.com
In reply to: Julien Rouhaud (#7)
Re: Make query ID more portable

On 14/10/21 10:40, Julien Rouhaud wrote:

On Thu, Oct 14, 2021 at 12:37 PM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:

On 12/10/21 18:45, Bruce Momjian wrote:

On Tue, Oct 12, 2021 at 09:40:47AM -0400, Tom Lane wrote:

Andrey Lepikhov <a.lepikhov@postgrespro.ru> writes:

I think that there are just too many arbitrary decisions that could be
made on what exactly should be a query identifier to have a single
in-core implementation.

Yes, and I use such custom decision too. But core jumbling code
implements good idea and can be generalized for reuse. Patch from
previous letter and breaking down of JumbleState can allow coders to
implement their codes based on queryjumble.c module with smaller changes.

If you do sharding, you already have to
properly configure each node, so configuring your custom query id
extension shouldn't be a big problem.

My project is about adaptive query optimization techniques. It is not
obvious how to match (without a field in Query struct) a post parse and
an execution phases because of nested queries.
Also, if we use queryId in an extension, we interfere with
pg_stat_statements.

--
regards,
Andrey Lepikhov
Postgres Professional