When to store data that could be derived

Started by Frank Millmanabout 7 years ago15 messagesgeneral

frank@chagford.com

about 7 years ago

Hi all

As I understand it, a general rule of thumb is that you should never
create a physical column if the data could be derived from existing
columns. A possible reason for breaking this rule is for performance
reasons.

I have a situation where I am considering breaking the rule, but I am
not experienced enough in SQL to know if my reason is valid. I would
appreciate it if someone could glance at my 'before' and 'after'
scenarios and see if, from a 'gut-feel' point of view, I should proceed.

I have a VIEW constructed as follows -

CREATE VIEW view_name AS
[select statement 1]
UNION ALL
[select statement 2]
etc.

This is one of the select statements. I will give the 'after' scenario
first -

The two columns a.arec_cust and a.arec_local *can* be derived from other
columns, and in fact that is how it is working at the moment, so here is
the 'before' scenario -

SELECT
    'arec' AS tran_type, a.row_id AS tran_row_id,
    a.tran_number AS tran_number, a.cust_row_id AS cust_row_id,
    CASE
        WHEN a.tran_type = 'ar_rec' THEN y.tran_date
        WHEN a.tran_type = 'cb_rec' THEN w.tran_date
    END AS tran_date,
    CASE
        WHEN a.tran_type = 'ar_rec' THEN y.text
        WHEN a.tran_type = 'cb_rec' THEN w.text
    END AS text,
    ROUND(0 - (ROUND(a.arec_amount / CASE
        WHEN a.tran_type = 'ar_rec' THEN y.tran_exch_rate
        WHEN a.tran_type = 'cb_rec' THEN w.tran_exch_rate
    END * a.cust_exch_rate, u.scale)), 2) AS amount_cust,
    ROUND(0 - (ROUND(a.arec_amount / CASE
        WHEN a.tran_type = 'ar_rec' THEN y.tran_exch_rate
        WHEN a.tran_type = 'cb_rec' THEN w.tran_exch_rate
    END, s.scale)), 2) AS amount_local
FROM ar_rec_subtran a
LEFT JOIN ar_tran_rec_det z ON z.row_id = a.tran_det_row_id
LEFT JOIN ar_tran_rec y ON y.row_id = z.tran_row_id
LEFT JOIN cb_tran_rec_det x ON x.row_id = a.tran_det_row_id
LEFT JOIN cb_tran_rec w ON w.row_id = x.tran_row_id
LEFT JOIN ar_customers v ON v.row_id = a.cust_row_id
LEFT JOIN adm_currencies u ON u.row_id = v.currency_id
LEFT JOIN adm_params t ON t.row_id = 1
LEFT JOIN adm_currencies s ON s.row_id = t.local_curr_id
WHERE
    CASE
        WHEN a.tran_type = 'ar_rec' THEN y.posted
        WHEN a.tran_type = 'cb_rec' THEN w.posted
    END = '1'

As you can see, complexity has increased and there are four additional
JOINs.

I am expecting the VIEW to be used extensively for query purposes, and
my gut-feel says that the second one is likely to lead to performance
problems in a system with a lot of data and a lot of users.

I am not looking for an answer - I know that I should create dummy data
and run some timing tests. I was just wondering if someone more
experienced would wince when they look at the second SELECT, or if they
would shrug and think that it looks fine.

Any input will be appreciated.

Frank Millman

Ron

ronljohnsonjr@gmail.com

about 7 years ago

In reply to: Frank Millman (#1)

Re: When to store data that could be derived

On 3/24/19 1:42 AM, Frank wrote:

Hi all

As I understand it, a general rule of thumb is that you should never
create a physical column if the data could be derived from existing
columns. A possible reason for breaking this rule is for performance reasons.

I have a situation where I am considering breaking the rule, but I am not
experienced enough in SQL to know if my reason is valid. I would
appreciate it if someone could glance at my 'before' and 'after' scenarios
and see if, from a 'gut-feel' point of view, I should proceed.

I have a VIEW constructed as follows -

CREATE VIEW view_name AS
[select statement 1]
UNION ALL
[select statement 2]
etc.

This is one of the select statements. I will give the 'after' scenario
first -

SELECT
    'arec' AS tran_type, a.row_id AS tran_row_id,
    a.tran_number AS tran_number, a.cust_row_id AS cust_row_id,
    CASE
        WHEN a.tran_type = 'ar_rec' THEN y.tran_date
        WHEN a.tran_type = 'cb_rec' THEN w.tran_date
    END AS tran_date,
    CASE
        WHEN a.tran_type = 'ar_rec' THEN y.text
        WHEN a.tran_type = 'cb_rec' THEN w.text
    END AS text,
    0 - a.arec_cust AS amount_cust,
    0 - a.arec_local AS amount_local
FROM ar_rec_subtran a
LEFT JOIN ar_tran_rec_det z ON z.row_id = a.tran_det_row_id
LEFT JOIN ar_tran_rec y ON y.row_id = z.tran_row_id
LEFT JOIN cb_tran_rec_det x ON x.row_id = a.tran_det_row_id
LEFT JOIN cb_tran_rec w ON w.row_id = x.tran_row_id
WHERE
    CASE
        WHEN a.tran_type = 'ar_rec' THEN y.posted
        WHEN a.tran_type = 'cb_rec' THEN w.posted
    END = '1'

The two columns a.arec_cust and a.arec_local *can* be derived from other
columns, and in fact that is how it is working at the moment, so here is
the 'before' scenario -

SELECT
    'arec' AS tran_type, a.row_id AS tran_row_id,
    a.tran_number AS tran_number, a.cust_row_id AS cust_row_id,
    CASE
        WHEN a.tran_type = 'ar_rec' THEN y.tran_date
        WHEN a.tran_type = 'cb_rec' THEN w.tran_date
    END AS tran_date,
    CASE
        WHEN a.tran_type = 'ar_rec' THEN y.text
        WHEN a.tran_type = 'cb_rec' THEN w.text
    END AS text,
    ROUND(0 - (ROUND(a.arec_amount / CASE
        WHEN a.tran_type = 'ar_rec' THEN y.tran_exch_rate
        WHEN a.tran_type = 'cb_rec' THEN w.tran_exch_rate
    END * a.cust_exch_rate, u.scale)), 2) AS amount_cust,
    ROUND(0 - (ROUND(a.arec_amount / CASE
        WHEN a.tran_type = 'ar_rec' THEN y.tran_exch_rate
        WHEN a.tran_type = 'cb_rec' THEN w.tran_exch_rate
    END, s.scale)), 2) AS amount_local
FROM ar_rec_subtran a
LEFT JOIN ar_tran_rec_det z ON z.row_id = a.tran_det_row_id
LEFT JOIN ar_tran_rec y ON y.row_id = z.tran_row_id
LEFT JOIN cb_tran_rec_det x ON x.row_id = a.tran_det_row_id
LEFT JOIN cb_tran_rec w ON w.row_id = x.tran_row_id
LEFT JOIN ar_customers v ON v.row_id = a.cust_row_id
LEFT JOIN adm_currencies u ON u.row_id = v.currency_id
LEFT JOIN adm_params t ON t.row_id = 1
LEFT JOIN adm_currencies s ON s.row_id = t.local_curr_id
WHERE
    CASE
        WHEN a.tran_type = 'ar_rec' THEN y.posted
        WHEN a.tran_type = 'cb_rec' THEN w.posted
    END = '1'

As you can see, complexity has increased and there are four additional JOINs.

I am expecting the VIEW to be used extensively for query purposes, and my
gut-feel says that the second one is likely to lead to performance
problems in a system with a lot of data and a lot of users.

Generate an artificial load and test it?

I am not looking for an answer - I know that I should create dummy data
and run some timing tests. I was just wondering if someone more
experienced would wince when they look at the second SELECT, or if they
would shrug and think that it looks fine.

Sure the second query joins a lot of tables, but is pretty straightforward.

What REALLY worries me is whether or not the query optimiser would look at
the WHERE CASE, run away screaming and then make it use sequential scans.
Thus, even query #1 would be slow.

Any input will be appreciated.

Is this a historical data set that's never updated, or current data that's
constantly added to?

Frank Millman

--
Angular momentum makes the world go 'round.

Frank Millman

frank@chagford.com

about 7 years ago

In reply to: Ron (#2)

Re: When to store data that could be derived

On 2019-03-24 9:25 AM, Ron wrote:

On 3/24/19 1:42 AM, Frank wrote:

Hi all

As I understand it, a general rule of thumb is that you should never
create a physical column if the data could be derived from existing
columns. A possible reason for breaking this rule is for performance
reasons.

I have a situation where I am considering breaking the rule, but I am
not experienced enough in SQL to know if my reason is valid. I would
appreciate it if someone could glance at my 'before' and 'after'
scenarios and see if, from a 'gut-feel' point of view, I should proceed.

[snip]

Sure the second query joins a lot of tables, but is pretty straightforward.

What REALLY worries me is whether or not the query optimiser would look
at the WHERE CASE, run away screaming and then make it use sequential
scans. Thus, even query #1 would be slow.

I had not realised that. I hope someone else chimes in on this.

Is this a historical data set that's never updated, or current data
that's constantly added to?

It is the latter - current data constantly added to.

Frank

Ron

ronljohnsonjr@gmail.com

about 7 years ago

In reply to: Frank Millman (#3)

Re: When to store data that could be derived

On 3/24/19 3:05 AM, Frank wrote:

On 2019-03-24 9:25 AM, Ron wrote:

On 3/24/19 1:42 AM, Frank wrote:

Hi all

As I understand it, a general rule of thumb is that you should never
create a physical column if the data could be derived from existing
columns. A possible reason for breaking this rule is for performance
reasons.

I have a situation where I am considering breaking the rule, but I am
not experienced enough in SQL to know if my reason is valid. I would
appreciate it if someone could glance at my 'before' and 'after'
scenarios and see if, from a 'gut-feel' point of view, I should proceed.

[snip]

Sure the second query joins a lot of tables, but is pretty straightforward.

What REALLY worries me is whether or not the query optimiser would look
at the WHERE CASE, run away screaming and then make it use sequential
scans. Thus, even query #1 would be slow.

I had not realised that. I hope someone else chimes in on this.

In every DBMS that I've used, the lside (left side) needs to be static (not
"a" static) instead of variable (like a function).

For example, this always leads to a sequential scan:
WHERE EXTRACT(DAY FROM DATE_FIELD) = 5

Is this a historical data set that's never updated, or current data
that's constantly added to?

It is the latter - current data constantly added to.

Frank

--
Angular momentum makes the world go 'round.

Chris Travers

chris.travers@gmail.com

about 7 years ago

In reply to: Frank Millman (#3)

Re: When to store data that could be derived

On Sun, Mar 24, 2019 at 9:05 AM Frank <frank@chagford.com> wrote:

On 2019-03-24 9:25 AM, Ron wrote:

On 3/24/19 1:42 AM, Frank wrote:

Hi all

As I understand it, a general rule of thumb is that you should never
create a physical column if the data could be derived from existing
columns. A possible reason for breaking this rule is for performance
reasons.

I have a situation where I am considering breaking the rule, but I am
not experienced enough in SQL to know if my reason is valid. I would
appreciate it if someone could glance at my 'before' and 'after'
scenarios and see if, from a 'gut-feel' point of view, I should proceed.

[snip]

Sure the second query joins a lot of tables, but is pretty

straightforward.

What REALLY worries me is whether or not the query optimiser would look
at the WHERE CASE, run away screaming and then make it use sequential
scans. Thus, even query #1 would be slow.

I had not realised that. I hope someone else chimes in on this.

There are a few other things to note here.

1 If the data is frequently used in its derivative form, putting it in a
function helps, and
2. You can index the output of the function which means you run it on
insert/update time and can often avoid running it on selection time if it
is just a part of the where clause.

In my experience usually we have used trigger-updated functions when the
field values are very large or expensive, and may need to be part of the
column list, and functions with functional indexes when we don't need to
ever put them in the select column list or where the columns are small and
easy to calculate.

--
Best Wishes,
Chris Travers

Efficito: Hosted Accounting and ERP. Robust and Flexible. No vendor
lock-in.
http://www.efficito.com/learn_more

Tony Shelver

tshelver@gmail.com

about 7 years ago

In reply to: Frank Millman (#1)

Re: When to store data that could be derived

Not the answer you are looking for, but...

I'd suggest trying to create a non-trivial set of dummy data to test your
assumptions before deciding on a route.
It's saved my (professional) life a few times over the years when dealing
with untested designs and new (to us) technology.

Some years ago we were implementing an identity management system for a
large US bank, with SQL Server as the store, with a planned integration to
the id / access / permissions of some 300+ systems, targeting 30k plus
users.

Requirements kept changing as we added new systems to the initial mix,
which the Id management package couldn't handle out the box, so we had to
implement a custom design. We had to choose between 2 database designs,
one being fully 'normalized' (considering that everything was an object')
and one where we made some assumptions and fixed some table structures in
the interest of performance.

Eventually we spent a few days adding non-trivial amounts of test data to
the proposed designs and it became quickly became very apparent that option
1 was unworkable once we got beyond 10 systems or so.

Frank Millman

frank@chagford.com

about 7 years ago

In reply to: Tony Shelver (#6)

Re: When to store data that could be derived

On 2019-03-24 11:11 AM, Tony Shelver wrote:

Not the answer you are looking for, but...

I'd suggest trying to create a non-trivial set of dummy data to test your
assumptions before deciding on a route.
It's saved my (professional) life a few times over the years when dealing
with untested designs and new (to us) technology.

Some years ago we were implementing an identity management system for a
large US bank, with SQL Server as the store, with a planned integration to
the id / access / permissions of some 300+ systems, targeting 30k plus
users.

Requirements kept changing as we added new systems to the initial mix,
which the Id management package couldn't handle out the box, so we had to
implement a custom design. We had to choose between 2 database designs,
one being fully 'normalized' (considering that everything was an object')
and one where we made some assumptions and fixed some table structures in
the interest of performance.

Eventually we spent a few days adding non-trivial amounts of test data to
the proposed designs and it became quickly became very apparent that option
1 was unworkable once we got beyond 10 systems or so.

Good advice - much appreciated.

Frank

Ron

ronljohnsonjr@gmail.com

about 7 years ago

In reply to: Ron (#4)

Re: When to store data that could be derived

On 3/24/19 3:45 AM, Ron wrote:
[snip]

In every DBMS that I've used, the lside (left side) needs to be static
(not "a" static) instead of variable (like a function).

Thanks to Chris Travers for reminding me that the word is "immutable", not
"static".

--
Angular momentum makes the world go 'round.

Peter J. Holzer

hjp-pgsql@hjp.at

about 7 years ago

In reply to: Frank Millman (#3)

Re: When to store data that could be derived

On 2019-03-24 10:05:02 +0200, Frank wrote:

On 2019-03-24 9:25 AM, Ron wrote:

On 3/24/19 1:42 AM, Frank wrote:

As I understand it, a general rule of thumb is that you should
never create a physical column if the data could be derived from
existing columns.

The main reason for this rule (which leads to the 3rd normal form) is to
avoid inconsistencies when data is changed.

Is this a historical data set that's never updated, or current data
that's constantly added to?

It is the latter - current data constantly added to.

So the important part here is not whether data is added, but whether
data is changed. Sure, new transactions are added all the time. But is
it expected that the data used to derive amount_cust and amount_local
(e.g. the exchange rate) is changed retroactively, and if it is should
the computed amount change? (I'm a bit worried about the join with the
customers table here - what happens when a customer moves their
headquarters to a country with a different currency?)

Sure the second query joins a lot of tables, but is pretty straightforward.

What REALLY worries me is whether or not the query optimiser would look
at the WHERE CASE, run away screaming and then make it use sequential
scans. Thus, even query #1 would be slow.

I had not realised that. I hope someone else chimes in on this.

Your condition is:

CASE
WHEN a.tran_type = 'ar_rec' THEN y.posted
WHEN a.tran_type = 'cb_rec' THEN w.posted
END = '1'

Think about how you would check that. The straightforward way is to
compute the value of the case/end clause and compare that to 1. But to
compute that value you first need the value of a.tran_type. There are
two possible values here, so maybe an index scan on a.tran_type might be
possible, but I'm not sure whether the optimizer is even smart enought
to figure that out and if it is, whether those to values are selective
enough (Maybe all or most records are either ar_rec or cb_rec). After
that you can retrieve the posted value from the correct table.

As a human I see that the condition can only ever be true for records
from y and w with posted = 1. So it might be better to use index scans
on those columns. But this is not that easy to see, and I don't know
whether the optimizer can do it.

Rewriting the condition as

(a.tran_type = 'ar_rec' and y.posted = 1) or
(a.tran_type = 'cb_rec' and w.posted = 1)

might make it easier for the optimizer to find a good plan.

(Please note that I haven't tested any of this. Use EXPLAIN to check
what the optimizer really does.

#10

Chris Travers

chris.travers@gmail.com

about 7 years ago

In reply to: Ron (#4)

Re: When to store data that could be derived

Meant to send this to the list but hit the wrong button.

On Sun, Mar 24, 2019 at 9:45 AM Ron <ronljohnsonjr@gmail.com> wrote:

On 3/24/19 3:05 AM, Frank wrote:

On 2019-03-24 9:25 AM, Ron wrote:

On 3/24/19 1:42 AM, Frank wrote:

Hi all

As I understand it, a general rule of thumb is that you should never
create a physical column if the data could be derived from existing
columns. A possible reason for breaking this rule is for performance
reasons.

I have a situation where I am considering breaking the rule, but I am
not experienced enough in SQL to know if my reason is valid. I would
appreciate it if someone could glance at my 'before' and 'after'
scenarios and see if, from a 'gut-feel' point of view, I should

proceed.

[snip]

Sure the second query joins a lot of tables, but is pretty

straightforward.

What REALLY worries me is whether or not the query optimiser would look
at the WHERE CASE, run away screaming and then make it use sequential
scans. Thus, even query #1 would be slow.

I had not realised that. I hope someone else chimes in on this.

In every DBMS that I've used, the lside (left side) needs to be static
(not
"a" static) instead of variable (like a function).

For example, this always leads to a sequential scan:
WHERE EXTRACT(DAY FROM DATE_FIELD) = 5

PostgreSQL allows expression indexes

So you can:

create index foo on bar ((id % 1000));

And then use the index on:

select * from bar where id % 1000 = 45;

You could similarly

create index foo on bar (extract(day from date_field));

The left side needs to be indexed (and an immutable expression) but beyond
that.....

Is this a historical data set that's never updated, or current data
that's constantly added to?

It is the latter - current data constantly added to.

Frank

--
Angular momentum makes the world go 'round.

--
Best Wishes,
Chris Travers

Efficito: Hosted Accounting and ERP. Robust and Flexible. No vendor
lock-in.
http://www.efficito.com/learn_more

#11

Frank Millman

frank@chagford.com

about 7 years ago

In reply to: Peter J. Holzer (#9)

Re: When to store data that could be derived

On 2019-03-24 2:41 PM, Peter J. Holzer wrote:

On 2019-03-24 10:05:02 +0200, Frank wrote:

Many thanks to Peter et al for their valuable insights. I have learned a
lot.

So the important part here is not whether data is added, but whether
data is changed. Sure, new transactions are added all the time. But is
it expected that the data used to derive amount_cust and amount_local
(e.g. the exchange rate) is changed retroactively, and if it is should
the computed amount change? (I'm a bit worried about the join with the
customers table here - what happens when a customer moves their
headquarters to a country with a different currency?)

I think I have got both of those covered. I store the exchange rates in
physical columns on the transaction, so the compute expressions will
always return the same values. I have separate tables for
'organisations' (O) and for 'customers' (C). C has a foreign key
reference to O, and most static data such as addresses and contact
details are stored on O. So if a customer moved, I would create a new C
record with the new currency, and flag the old C record as inactive.
They can happily co-exist, so receipts can be processed against the old
C record until it is paid up.

I have started doing some volume tests, and at this stage, for the kind
of volumes I am concerned about, it looks as if performance is a non-issue.

I generated about 22000 invoices and 22000 receipts, over 12 customers
and 6 months. Invoices and receipts are stored in separate tables, and a
VIEW presents them as a single table.

Using the VIEW, I selected all transactions for a given customer for a
given month. It returned 620 rows and (on my slow desktop computer) it
took 20ms. I can live with that.

I will generate some higher volumes overnight, and see if it makes a big
difference. If you do not hear from me, you can consider it 'problem
solved' :-)

Again, thanks to all.

Frank

#12

Ron

ronljohnsonjr@gmail.com

about 7 years ago

In reply to: Frank Millman (#11)

Re: When to store data that could be derived

On 3/25/19 8:15 AM, Frank wrote:

On 2019-03-24 2:41 PM, Peter J. Holzer wrote:

On 2019-03-24 10:05:02 +0200, Frank wrote:

Many thanks to Peter et al for their valuable insights. I have learned a lot.

So the important part here is not whether data is added, but whether
data is changed. Sure, new transactions are added all the time. But is
it expected that the data used to derive amount_cust and amount_local
(e.g. the exchange rate) is changed retroactively, and if it is should
the computed amount change? (I'm a bit worried about the join with the
customers table here - what happens when a customer moves their
headquarters to a country with a different currency?)

I think I have got both of those covered. I store the exchange rates in
physical columns on the transaction, so the compute expressions will
always return the same values. I have separate tables for 'organisations'
(O) and for 'customers' (C). C has a foreign key reference to O, and most
static data such as addresses and contact details are stored on O. So if a
customer moved, I would create a new C record with the new currency, and
flag the old C record as inactive. They can happily co-exist, so receipts
can be processed against the old C record until it is paid up.

I have started doing some volume tests, and at this stage, for the kind of
volumes I am concerned about, it looks as if performance is a non-issue.

I generated about 22000 invoices and 22000 receipts, over 12 customers and
6 months. Invoices and receipts are stored in separate tables, and a VIEW
presents them as a single table.

Using the VIEW, I selected all transactions for a given customer for a
given month. It returned 620 rows and (on my slow desktop computer) it
took 20ms. I can live with that.

I will generate some higher volumes overnight, and see if it makes a big
difference. If you do not hear from me, you can consider it 'problem
solved' :-)

It would be interesting to see what the query planner tries to do with this:

WHERE
    CASE
        WHEN a.tran_type = 'ar_rec' THEN y.posted
        WHEN a.tran_type = 'cb_rec' THEN w.posted
    END = '1'

--
Angular momentum makes the world go 'round.

#13

Frank Millman

frank@chagford.com

about 7 years ago

In reply to: Ron (#12)

Re: When to store data that could be derived

On 2019-03-25 4:06 PM, Ron wrote:

On 3/25/19 8:15 AM, Frank wrote:

It would be interesting to see what the query planner tries to do with
this:

WHERE
    CASE
        WHEN a.tran_type = 'ar_rec' THEN y.posted
        WHEN a.tran_type = 'cb_rec' THEN w.posted
    END = '1'

I have attached the schema showing the full VIEW definition, and the
result of the following EXPLAIN -

EXPLAIN SELECT * FROM ccc.ar_trans WHERE cust_row_id = 4 AND tran_date
BETWEEN '2015-06-01' AND '2015-06-30'.

Because I have used 'WHERE tran_date' in the query, and tran_date is
also derived from a CASE expression, I imagine that will also add some
complication.

I am running PostgreSQL 11.1 on Fedora 29.

Frank

#14

Frank Millman

frank@chagford.com

about 7 years ago

In reply to: Frank Millman (#13)

Re: When to store data that could be derived

On 2019-03-25 5:11 PM, Frank wrote:

On 2019-03-25 4:06 PM, Ron wrote:

On 3/25/19 8:15 AM, Frank wrote:

It would be interesting to see what the query planner tries to do with
this:

WHERE
     CASE
         WHEN a.tran_type = 'ar_rec' THEN y.posted
         WHEN a.tran_type = 'cb_rec' THEN w.posted
     END = '1'

I have attached the schema showing the full VIEW definition, and the
result of the following EXPLAIN -

EXPLAIN SELECT * FROM ccc.ar_trans WHERE cust_row_id = 4 AND tran_date
BETWEEN '2015-06-01' AND '2015-06-30'.

Because I have used 'WHERE tran_date' in the query, and tran_date is
also derived from a CASE expression, I imagine that will also add some
complication.

I am running PostgreSQL 11.1 on Fedora 29.

Frank

On reflection, I have not been consistent with my use of indexes, and I
think that will affect the query plan.

There are at least two issues -

1. The first table defined in the VIEW is ccc.ar_tran_inv. It has the
following index -
"arinv_cust_date" btree (cust_row_id NULLS FIRST, tran_date NULLS
FIRST) WHERE deleted_id = 0

I have not used "WHERE deleted_id = 0" when constructing the VIEW, but I
have used "WHERE posted = '1'". I don't think the index can be used with
this setup.

2. The complicated table in the VIEW is ccc.ar_rec_subtran. Various
columns such as tran_date and posted are retrieved via CASE expressions
from two underlying tables. Those tables have certain indexes defined,
but I cannot see how they can be utilised from my current setup.

I think I should spend some time tidying this up before you try to make
sense of the query plan. Any tips on how to improve it will be appreciated.

Frank

#15

Frank Millman

frank@chagford.com

about 7 years ago

In reply to: Frank Millman (#14)

Re: When to store data that could be derived

On 2019-03-25 5:44 PM, Frank wrote:

On reflection, I have not been consistent with my use of indexes, and I
think that will affect the query plan.

There are at least two issues -

1. The first table defined in the VIEW is ccc.ar_tran_inv. It has the
following index -
"arinv_cust_date" btree (cust_row_id NULLS FIRST, tran_date NULLS
FIRST) WHERE deleted_id = 0

I have not used "WHERE deleted_id = 0" when constructing the VIEW, but I
have used "WHERE posted = '1'". I don't think the index can be used with
this setup.

2. The complicated table in the VIEW is ccc.ar_rec_subtran. Various
columns such as tran_date and posted are retrieved via CASE expressions
from two underlying tables. Those tables have certain indexes defined,
but I cannot see how they can be utilised from my current setup.

I think I should spend some time tidying this up before you try to make
sense of the query plan. Any tips on how to improve it will be appreciated.

I have spent a lot of time testing various permutations and trying to
understand them using EXPLAIN.

My original concern was the overhead of calculating derived data. I now
realise that it is more important to get the indexes right, as that has
a much bigger impact on performance.

The VIEW that I have been using for testing is actually masking the
problem. The view combines four tables, three of which are
straightforward, easy to index, and fast. The fourth one is complex,
difficult to index, and comparatively slow. So I forgot about the VIEW
and concentrated on the complex table.

I now understand the caveats I received earlier in this thread. It seems
impossible to make use of the indexes on the JOINed tables in the
following query. I did create an index on cust_row_id in the main table,
and it made a difference with a simple query, but by the time I added
the JOINs, the improvement was hardly noticeable.

This is the query that I used -

EXPLAIN SELECT COUNT(*)
FROM ccc.ar_rec_subtran a
LEFT JOIN ccc.ar_tran_rec_det z ON z.row_id = a.tran_det_row_id
LEFT JOIN ccc.ar_tran_rec y ON y.row_id = z.tran_row_id
LEFT JOIN ccc.cb_tran_rec_det x ON x.row_id = a.tran_det_row_id
LEFT JOIN ccc.cb_tran_rec w ON w.row_id = x.tran_row_id
WHERE a.cust_row_id = 4 AND
CASE
WHEN a.tran_type = 'ar_rec' THEN y.tran_date
WHEN a.tran_type = 'cb_rec' THEN w.tran_date
END BETWEEN '2015-05-01' AND '2015-05-31' AND
CASE
WHEN a.tran_type = 'ar_rec' THEN y.posted
WHEN a.tran_type = 'cb_rec' THEN w.posted
END = true AND
a.deleted_id = 0;

Attached is the EXPLAIN for this one.

I also ran the same query with the following WHERE clause -

WHERE a.cust_row_id = 4 AND
((a.tran_type = 'ar_rec' AND
y.tran_date BETWEEN '2015-05-01' AND '2015-05-31') OR
(a.tran_type = 'cb_rec' AND
w.tran_date BETWEEN '2015-05-01' AND '2015-05-31')) AND
((a.tran_type = 'ar_rec' AND y.posted = true) OR
(a.tran_type = 'cb_rec' AND w.posted = true)) AND
a.deleted_id = 0;

The timings were virtually identical, so I have not attached that EXPLAIN.

Frank

When to store data that could be derived

Attachments:

Attachments: