Accounting for between-table correlation
I am having ongoing trouble with a pair of tables, the design of which is
beyond my control.
There is a 'primary' table with hundreds of millions of rows. There is then
a 'subclass' table, roughly 10% the size of the primary, which has
additional fields. The tables logically share a primary key field (although
that is not annotated in the schema).
Membership of the subclass table has high correlation with fields in the
primary table - it is very much not random. It seems query plans where the
two tables are joined are 'unstable'. Over time very different plans can
result even for unchanged queries and some plans are exceedingly
inefficient.
I think what is going on is that the query planner assumes many fewer rows
are going to join to the subtable than actually do (because of the strong
correlation).
Can anyone offer any advice on dealing with this scenario (or better
diagnosing it)?
Thank you,
Alex
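For concreteness, a schema of the shape described might look like the following sketch (all table and column names here are hypothetical, not the OP's actual schema):

```sql
-- 'big' stands in for the primary table; 'big_sub' for the ~10% subclass.
-- Note the shared key is NOT declared as a foreign key, matching the OP's schema.
CREATE TABLE big (
    id       bigint PRIMARY KEY,
    category text,     -- membership of big_sub correlates strongly with fields like this
    payload  text
);

CREATE TABLE big_sub (
    id    bigint PRIMARY KEY,  -- logically the same key as big.id
    extra text
);

-- A join of this shape is where the unstable plans would appear:
EXPLAIN
SELECT b.id, s.extra
FROM big b
JOIN big_sub s ON s.id = b.id
WHERE b.category = 'x';
```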
On 1/15/21 9:19 AM, Alexander Stoddard wrote:
Do the tables get analyzed on a regular basis?
--
Angular momentum makes the world go 'round.
Hi Alexander,
As per Ron, you are not supposed to ask your questions here.
According to him, we should keep doing research on the internet rather
than asking for support directly, even when you have done enough research,
and unless "Ron" is satisfied you have to keep on researching.
Regards
Atul
On Friday, January 15, 2021, Ron <ronljohnsonjr@gmail.com> wrote:
On Fri, Jan 15, 2021 at 9:29 AM Atul Kumar <akumar14871@gmail.com> wrote:
Mr Atul
With respect - - - -
When you asked your question I thought you had a valid question but
really hadn't done any research for an answer.
The list is here to help you help yourself.
I have found myself answering my own questions sometimes. But that
also means that I now have a better idea of what's going on.
Just demanding answers with all the steps - - - - well - - - you are
hindering your own learning.
I would suggest that instead of barging into others' threads
complaining that you didn't get the answer you wanted, you try
actually working on your own problem.
(I am not a senior listee here - - - - just not happy with someone who
is grumbly AND doesn't want to help themselves!)
(To the admins - - - - if I have overstepped - - - please advise!)
Regards
Hi Ron,
I have a simple mindset: if I don't know something about someone's
query, I don't respond.
Please learn not to interrupt or raise useless questions or
phrases on someone's query, because it makes you judgemental.
So please take this advice and do not interrupt if you can't help. It would
be helpful for all of us.
Regards
Atul
On Friday, January 15, 2021, o1bigtenor <o1bigtenor@gmail.com> wrote:
On Fri, Jan 15, 2021 at 8:44 AM Atul Kumar <akumar14871@gmail.com> wrote:
This advice isn't all that helpful as it presumes that the author of the
response actually understands that what they are writing is not helpful.
That isn't always obvious to the writer, and if it was it probably would
not have been written. Instead, if you find a piece of advice to be
unhelpful, and feel the need to say so, be explicit about why you believe
the response was not helpful.
As to the question at hand - it is correct that "analyze" being run or not
doesn't really matter here - PostgreSQL doesn't have cross-table statistics
that can be updated. That is about the most one can glean from the limited
information the OP provided. Now, if you can place the columns on the same
table, say in a materialized view, and then use the somewhat recent
multi-column statistics feature, there is probably room for improvement.
Otherwise, the question is basically an open-ended one and maybe someone
responds with some rules-of-thumb, or maybe they don't.
David J.
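A sketch of that suggestion, using hypothetical names (extended statistics require PostgreSQL 10 or later):

```sql
-- Materialize the join so the correlated columns live in a single relation.
-- 'big' and 'big_sub' are stand-ins for the OP's primary and subclass tables.
CREATE MATERIALIZED VIEW big_flat AS
SELECT b.id,
       b.category,
       (s.id IS NOT NULL) AS in_subclass,  -- subclass membership as a plain column
       s.extra
FROM big b
LEFT JOIN big_sub s ON s.id = b.id;

-- With both columns in one table, extended statistics can capture
-- the functional dependency between them for the planner.
CREATE STATISTICS big_flat_dep (dependencies)
    ON category, in_subclass FROM big_flat;

ANALYZE big_flat;
```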
Atul Kumar schrieb am 15.01.2021 um 16:29:
Ron's question was perfectly valid.
Missing and wrong statistics are one reason for the planner to choose a bad execution plan.
Maybe there are many "idle in transaction" sessions that prevent autovacuum from properly
analyzing those tables. Or maybe for some unknown reason autovacuum was turned off.
Or maybe they receive a lot of bulk loads which would require a manual
analyze.
So the question "are they analyzed on a regular basis" is a valid point and nowhere
did Ron say that the OP didn't do enough research. Ron was merely trying to
rule out one of the more obvious reasons.
Thomas
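Two quick checks along those lines, against the standard catalog views (no assumptions about the OP's schema):

```sql
-- When were the tables last analyzed, manually or by autovacuum?
SELECT relname, last_analyze, last_autoanalyze, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY relname;

-- Are there long-lived 'idle in transaction' sessions that could be
-- holding back autovacuum?
SELECT pid, state, xact_start, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;
```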
On Fri, Jan 15, 2021 at 9:10 AM Thomas Kellerer <shammat@gmx.net> wrote:
Ron's question was perfectly valid.
Missing and wrong statistics are one reason for the planner to choose a
bad execution plan.
Yeah, at first blush I didn't think analyze really mattered (and it mostly
doesn't because while you can keep the statistics up-to-date the
multi-table nature of the problem means they are only marginally helpful
here), but that just points out the under-specified nature of the original
posting. Taken as a simple question of "is there a way to work around the
lack of multi-table statistics" the analyze, and even the specific queries,
don't matter all that much. But it also would be much more useful if the
OP would choose a single problematic query and show the schema, query, and
explain results, hopefully both good and bad, and comment on how analyze
seems to affect the plan choice. But for the general question about
overcoming our statistics limitations the analyze point is not relevant.
David J.
Unfortunately I'm not free to share the specific schema or the query plans.
They derive from an upstream vendor that is 'protective' of their data
model. To get to a proper example I'll need to recreate the behavior with
generic data in a generified schema.
I apologize for being frustratingly vague. I do feel like an idiot for not
at least saying this was with version PG 11.10.
It has been beneficial to me that the replies so far appear to validate my
understanding that no version of postgres has cross table correlation
statistics in the planner.
Analyze is done frequently, and I think at sufficient sample size. The
context is a large data-analysis setting and the data is changed via
bulk ETL, not OLTP. The effect of analyzing is just instability - the plan
can flip in either direction (without the underlying data changing) between
giving an answer in minutes and timing out after tens of hours. That could
be indicative of too small a sample, but I think in this case it is more
that the statistics aren't useful, because it is cross-table correlation
that can't be accounted for.
The 'fast plans' use parallel seq scans. The 'slow plans' use index
scans. It appears a good query plan correctly predicts it should be bulk
processing the tables, but bad ones get fooled into thrashing (hard disk, not
SSD) by predicting too few rows to join between the tables.
If anyone has similar experiences and is generous enough with their time to
share possible solutions/work arounds then I'm most grateful. If my
description is too vague to be worthwhile answering then I quite understand
and apologize for the time wasted in reading.
Thank you.
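One way to quantify the misestimate on the problem join is to compare the planner's row estimate against reality (table names hypothetical, standing in for the actual join):

```sql
-- In the output, 'rows=' on each plan node is the planner's estimate;
-- 'actual ... rows=' is what really happened. A large gap on the join
-- node is the misestimate being described.
EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
SELECT b.id, s.extra
FROM big b
JOIN big_sub s ON s.id = b.id;
```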
On Fri, Jan 15, 2021 at 10:26 AM David G. Johnston <
david.g.johnston@gmail.com> wrote:
On 1/15/21 9:21 AM, Alexander Stoddard wrote:
So to be clear, the process imports the data, then you run a query and
it completes in x time, you then ANALYZE the same data and it runs in y
time. Is that correct?
--
Adrian Klaver
adrian.klaver@aklaver.com
On Fri, Jan 15, 2021 at 10:22 AM Alexander Stoddard <
alexander.stoddard@gmail.com> wrote:
How many tables are involved? Are you sure it is stats getting updated
causing the change in behavior? Are you hitting the genetic optimizer?
On Fri, Jan 15, 2021 at 9:10 AM Thomas Kellerer <shammat@gmx.net> wrote:
"hundreds of millions of rows"
Also of note: the default autovacuum settings, with scale factors of
10-20% for the analyze and vacuum thresholds, may need to be lowered for
such a table. The OP has chimed in that they believe it is being
vacuumed/analyzed often enough, but if the data is primarily or only
maintained by bulk load, it would seem like autovacuum/autoanalyze wouldn't
be a factor anyway.
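If the thresholds do turn out to matter, they can be lowered per table, though after bulk loads an explicit ANALYZE is the simpler fix (table names hypothetical):

```sql
-- Default autovacuum_analyze_scale_factor is 0.10: a 500M-row table must
-- accumulate ~50M changed rows before auto-analyze fires. Lower it per table:
ALTER TABLE big SET (autovacuum_analyze_scale_factor = 0.01);

-- Or simply analyze explicitly at the end of each bulk ETL run:
ANALYZE big;
ANALYZE big_sub;
```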
On 1/15/21 10:49 AM, Alexander Stoddard wrote:
Please reply to list also.
Ccing list.
The process imports data, ANALYZE is run and then queries run in x time.
A subsequent ANALYZE, may or may not, change the time to y.
x may be greater or less than y for any given pair of runs, and the
difference is vast. Two very different performance domains, due to the
plan, I believe. If I am correctly reading the EXPLAIN plans, the row
estimates are always way off (and low), regardless of whether a high- or
low-performing plan is actually chosen.
Well I'm going to say this is not going to get a useful answer without
some concrete numbers. Too many variables involved to just start
guessing at solutions.
Thank you,
Alex
--
Adrian Klaver
adrian.klaver@aklaver.com
On 1/15/21 11:54 AM, Adrian Klaver wrote:
Not sure if it would work for the vendor or not but:
offers an option to obfuscate EXPLAIN/ANALYZE output.
--
Adrian Klaver
adrian.klaver@aklaver.com
On Fri, Jan 15, 2021 at 12:27 PM Michael Lewis <mlewis@entrata.com> wrote:
How many tables are involved?
The queries are complex, multiple joins to 10 plus tables, although most
are to tiny enum type lookup tables. I believe it is the join between the
two large tables that I have described that causes the issue, and that
seems to be reflected in the different strategies in the plans. For my own
learning and to clarify the problem I probably will have to try and
reproduce the behavior in a test case.
Are you sure it is stats getting updated causing the change in behavior?
No, I'm not sure - could something else flip a plan after an ANALYZE?
Differing performance of multiple runs of the same query could be due to
caching etc., but that would be a timing difference without a change in
query plan. The output plans I see are radically different and correlate
with large-magnitude performance changes.
Are you hitting the genetic optimizer?
I am doing nothing to specify the optimizer. Do I have configurable options
in that regard? I was unaware of them.
Thank you,
Alex
On Fri, Jan 15, 2021 at 3:27 PM Alexander Stoddard <
alexander.stoddard@gmail.com> wrote:
Read up on join_collapse_limit and geqo.
https://www.postgresql.org/docs/current/runtime-config-query.html
Note that you can set these values within your session and then just run
EXPLAIN on the query to check whether the plan that would be used varies.
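For example, within a single session (the SELECT is a hypothetical stand-in for the real multi-join query):

```sql
-- Session-local experiments; nothing here persists beyond the session.
SET join_collapse_limit = 1;   -- join in the order the query is written
SET geqo = off;                -- force the exhaustive (standard) planner

EXPLAIN
SELECT b.id, s.extra
FROM big b
JOIN big_sub s ON s.id = b.id;  -- does the chosen plan change?

RESET join_collapse_limit;
RESET geqo;
```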
On Tue, Jan 19, 2021 at 2:47 PM Michael Lewis <mlewis@entrata.com> wrote:
Thank you for this. I can get different plans by changing the various
parameters, but if the output of EXPLAIN VERBOSE tells me that geqo
planning specifically is being used, I'm not seeing it. Is there a way to
see in the explain plan whether geqo was used?
Alexander Stoddard <alexander.stoddard@gmail.com> writes:
It's not reported directly, but you can assume that it was used if the
number of relations in the query (the number of table scans in the plan,
more or less) is >= geqo_threshold. Or if you're unsure, check whether
turning the "geqo" boolean on/off changes the plan at all. In theory you
might get exactly the same plan from the regular and geqo planners, but
for reasonable values of geqo_threshold that's pretty unlikely, I think.
(Another way to be sure might be to notice whether turning geqo on/off
changes the planning time significantly. geqo off should be a great deal
more expensive for large plans.)
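In practice that check might look like this (the SELECT is a hypothetical stand-in; the real query must involve enough relations to reach geqo_threshold, which defaults to 12):

```sql
-- EXPLAIN (SUMMARY) prints planning time without executing the query
-- (PostgreSQL 10+); psql's \timing works too.
SET geqo = on;
EXPLAIN (SUMMARY)
SELECT b.id FROM big b JOIN big_sub s ON s.id = b.id;

SET geqo = off;
EXPLAIN (SUMMARY)
SELECT b.id FROM big b JOIN big_sub s ON s.id = b.id;
-- A different plan, or a much longer planning time with geqo off,
-- indicates the genetic optimizer was being used.
```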
regards, tom lane