design help for performance

Started by Culley Harrelsonover 14 years ago8 messagesgeneral

harrelson@gmail.com

over 14 years ago

I am bumping into some performance issues and am seeking help.

I have two tables A and B in a one (A) to many (B) relationship. There are
1.4 million records in table A and 44 million records in table B. In my
web application any request for a record from table A is also going to need
a count of associated records in table B. Several years ago I added
table_b_rowcount to table A in order to minimize queries on table B. And
now, as the application has grown, I am starting to having locking problems
on table A. Any change to table B requires the that table_b_rowcount be
updated on table A... The application has outgrown this solution.

So... is there a common solution to this problem?

culley

Misa Simic

misa.simic@gmail.com

over 14 years ago

In reply to: Culley Harrelson (#1)

Re: design help for performance

Hi Culley,

Have you tried to create fk together with index on fk column on table B?

What are results? Would be good if you could send the query and explain
analyze...

Sent from my Windows Phone
------------------------------
From: Culley Harrelson
Sent: 21 December 2011 00:57
To: pgsql-general@postgresql.org
Subject: [GENERAL] design help for performance

I am bumping into some performance issues and am seeking help.

So... is there a common solution to this problem?

culley

Import Notes

Resolved by subject fallback

David G. Johnston

david.g.johnston@gmail.com

over 14 years ago

In reply to: Misa Simic (#2)

Re: design help for performance

Continued top-posting to remain consistent….

It isn’t that the application has outgrown the solution but rather the solution was never correct in the first place. You attempted pre-mature optimization and are getting burned because of it. The reference solution is simply:

SELECT a.*, COUNT(*) AS b_count

FROM a

JOIN b USING (a_id)

GROUP BY a.* {expanded * as needed)

Make sure table b has an index on the a.id column.

This is reference because you never want to introduce computed fields that keep track of other tables WITHOUT some kind of proof that the maintenance nightmare/overhead you are incurring is more than offset by the savings during usage.

Any further optimization requires two things:

Knowledge of the usage patterns of the affected data

Testing to prove that the alternative solutions out-perform the reference solution

Since you already have an existing query you should implement the reference solution above and then test and see whether it performs better or worse than you current solution. If it indeed performs better than move to it; and if it is still not good enough then you need to provide more information about what kinds of queries are hitting A and B as well as Insert/Delete patterns on Table B.

David J.

From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Misa Simic
Sent: Tuesday, December 20, 2011 7:13 PM
To: Culley Harrelson; pgsql-general@postgresql.org
Subject: Re: [GENERAL] design help for performance

Hi Culley,

Have you tried to create fk together with index on fk column on table B?

What are results? Would be good if you could send the query and explain analyze...

Sent from my Windows Phone

_____

From: Culley Harrelson
Sent: 21 December 2011 00:57
To: pgsql-general@postgresql.org
Subject: [GENERAL] design help for performance

I am bumping into some performance issues and am seeking help.

I have two tables A and B in a one (A) to many (B) relationship. There are 1.4 million records in table A and 44 million records in table B. In my web application any request for a record from table A is also going to need a count of associated records in table B. Several years ago I added table_b_rowcount to table A in order to minimize queries on table B. And now, as the application has grown, I am starting to having locking problems on table A. Any change to table B requires the that table_b_rowcount be updated on table A... The application has outgrown this solution.

So... is there a common solution to this problem?

culley

Culley Harrelson

harrelson@gmail.com

over 14 years ago

In reply to: David G. Johnston (#3)

Re: design help for performance

Thanks David. That was my original solution and it began to bog down the
website so I resorted to demoralization 3 years ago.... This is an
extremely high volume website.

On Tue, Dec 20, 2011 at 4:27 PM, David Johnston <polobo@yahoo.com> wrote:

Show quoted text

Continued top-posting to remain consistent….****

** **

It isn’t that the application has outgrown the solution but rather the
solution was never correct in the first place. You attempted pre-mature
optimization and are getting burned because of it. The reference solution
is simply:****

** **

SELECT a.*, COUNT(*) AS b_count****

FROM a****

JOIN b USING (a_id)****

GROUP BY a.* {expanded * as needed)****

** **

Make sure table b has an index on the a.id column.****

** **

This is reference because you never want to introduce computed fields that
keep track of other tables WITHOUT some kind of proof that the maintenance
nightmare/overhead you are incurring is more than offset by the savings
during usage.****

** **

Any further optimization requires two things:****

Knowledge of the usage patterns of the affected data****

Testing to prove that the alternative solutions out-perform the reference
solution****

** **

Since you already have an existing query you should implement the
reference solution above and then test and see whether it performs better
or worse than you current solution. If it indeed performs better than move
to it; and if it is still not good enough then you need to provide more
information about what kinds of queries are hitting A and B as well as
Insert/Delete patterns on Table B.****

** **

David J.****

** **

*From:* pgsql-general-owner@postgresql.org [mailto:
pgsql-general-owner@postgresql.org] *On Behalf Of *Misa Simic
*Sent:* Tuesday, December 20, 2011 7:13 PM
*To:* Culley Harrelson; pgsql-general@postgresql.org
*Subject:* Re: [GENERAL] design help for performance****

** **

Hi Culley,

Have you tried to create fk together with index on fk column on table B?

What are results? Would be good if you could send the query and explain
analyze...

Sent from my Windows Phone****
------------------------------

*From: *Culley Harrelson
*Sent: *21 December 2011 00:57
*To: *pgsql-general@postgresql.org
*Subject: *[GENERAL] design help for performance****

I am bumping into some performance issues and am seeking help.

I have two tables A and B in a one (A) to many (B) relationship. There
are 1.4 million records in table A and 44 million records in table B. In
my web application any request for a record from table A is also going to
need a count of associated records in table B. Several years ago I added
table_b_rowcount to table A in order to minimize queries on table B. And
now, as the application has grown, I am starting to having locking problems
on table A. Any change to table B requires the that table_b_rowcount be
updated on table A... The application has outgrown this solution.

So... is there a common solution to this problem?

culley****

Alban Hertroys

haramrae@gmail.com

over 14 years ago

In reply to: Culley Harrelson (#1)

Re: design help for performance

On 21 Dec 2011, at 24:56, Culley Harrelson wrote:

Several years ago I added table_b_rowcount to table A in order to minimize queries on table B. And now, as the application has grown, I am starting to having locking problems on table A. Any change to table B requires the that table_b_rowcount be updated on table A... The application has outgrown this solution.

When you update rowcount_b in table A, that locks the row in A of course, but there's more going on. Because a new version of that row gets created, the references from B to A also need updating to that new version (creating new versions of rows in B as well). I think that causes a little bit more locking than originally anticipated - it may even be the cause of your locking problem.

Instead, if you'd create a new table C that only holds the rowcount_b and a reference to A (in a 1:1 relationship), most of those problems go away. It does add an extra foreign key reference to table A though, which means it will weigh down updates and deletes there some more.

CREATE TABLE C (
table_a_id int PRIMARY KEY
REFERENCES table_a (id) ON UPDATE CASCADE ON DELETE CASCADE,
table_b_rowcount int NOT NULL DEFAULT 0
);

Yes, those cascades are on purpose - the data in C is useless without the accompanying record in A. Also, the PK makes sure it stays a 1:1 relationship.

Alban Hertroys

--
Screwing up is an excellent way to attach something to the ceiling.

Marc Mamin

M.Mamin@intershop.de

over 14 years ago

In reply to: Alban Hertroys (#5)

Re: design help for performance

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
owner@postgresql.org] On Behalf Of Alban Hertroys
Sent: Mittwoch, 21. Dezember 2011 08:53
To: Culley Harrelson
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] design help for performance

On 21 Dec 2011, at 24:56, Culley Harrelson wrote:

Several years ago I added table_b_rowcount to table A in order to

minimize queries on table B. And now, as the application has grown, I
am starting to having locking problems on table A. Any change to

table

B requires the that table_b_rowcount be updated on table A... The
application has outgrown this solution.

When you update rowcount_b in table A, that locks the row in A of
course, but there's more going on. Because a new version of that row
gets created, the references from B to A also need updating to that

new

version (creating new versions of rows in B as well). I think that
causes a little bit more locking than originally anticipated - it may
even be the cause of your locking problem.

Instead, if you'd create a new table C that only holds the rowcount_b
and a reference to A (in a 1:1 relationship), most of those problems

away. It does add an extra foreign key reference to table A though,
which means it will weigh down updates and deletes there some more.

CREATE TABLE C (
table_a_id int PRIMARY KEY
REFERENCES table_a (id) ON UPDATE CASCADE ON DELETE
CASCADE,
table_b_rowcount int NOT NULL DEFAULT 0
);

Yes, those cascades are on purpose - the data in C is useless without
the accompanying record in A. Also, the PK makes sure it stays a 1:1
relationship.

Alban Hertroys

Hello,

it may help to combine Alban solution with yours but at the cost of a
higher complexity:

In table C use instead a column table_b_delta_rowcount (+1 /-1 ,
smallint) and only use INSERTs to maintain it, no UPDATEs (hence with a
non unique index on id).

Then regularily flush table C content to table A, in order to only have
recent changes in C.
Your query should then query both tables:

SELECT A. table_b_rowcount + coalesce(sum(C.table_b_delta_rowcount))
FROM A LEFT OUTER JOIN B on (A.id=B.id)
WHERE A.id = xxx

Marc Mamin

Culley Harrelson

harrelson@gmail.com

over 14 years ago

In reply to: Marc Mamin (#6)

Re: design help for performance

Thank you so much everyone! Introducing table C was indeed my next step
but I was unsure if I was going to be just moving the locking problems from
A to C. Locking on C is preferable to locking on A but it doesn't really
solve the problem. It sounds like I should expect less locking on C
because it doesn't relate to B. Thanks again, I am going to give it a
try.

I am not going to take it to the delta solution for now.

On Wed, Dec 21, 2011 at 1:46 AM, Marc Mamin <M.Mamin@intershop.de> wrote:

Show quoted text

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
owner@postgresql.org] On Behalf Of Alban Hertroys
Sent: Mittwoch, 21. Dezember 2011 08:53
To: Culley Harrelson
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] design help for performance

On 21 Dec 2011, at 24:56, Culley Harrelson wrote:

Several years ago I added table_b_rowcount to table A in order to

minimize queries on table B. And now, as the application has grown, I
am starting to having locking problems on table A. Any change to

table

B requires the that table_b_rowcount be updated on table A... The
application has outgrown this solution.

When you update rowcount_b in table A, that locks the row in A of
course, but there's more going on. Because a new version of that row
gets created, the references from B to A also need updating to that

new

version (creating new versions of rows in B as well). I think that
causes a little bit more locking than originally anticipated - it may
even be the cause of your locking problem.

Instead, if you'd create a new table C that only holds the rowcount_b
and a reference to A (in a 1:1 relationship), most of those problems

go

away. It does add an extra foreign key reference to table A though,
which means it will weigh down updates and deletes there some more.

CREATE TABLE C (
table_a_id int PRIMARY KEY
REFERENCES table_a (id) ON UPDATE CASCADE ON DELETE
CASCADE,
table_b_rowcount int NOT NULL DEFAULT 0
);

Yes, those cascades are on purpose - the data in C is useless without
the accompanying record in A. Also, the PK makes sure it stays a 1:1
relationship.

Alban Hertroys

Hello,

it may help to combine Alban solution with yours but at the cost of a
higher complexity:

In table C use instead a column table_b_delta_rowcount (+1 /-1 ,
smallint) and only use INSERTs to maintain it, no UPDATEs (hence with a
non unique index on id).

Then regularily flush table C content to table A, in order to only have
recent changes in C.
Your query should then query both tables:

SELECT A. table_b_rowcount + coalesce(sum(C.table_b_delta_rowcount))
FROM A LEFT OUTER JOIN B on (A.id=B.id)
WHERE A.id = xxx

Marc Mamin

Misa Simic

misa.simic@gmail.com

over 14 years ago

In reply to: Culley Harrelson (#7)

Re: design help for performance

Hm...

I think result on the end will be the same... I am not sure realation
produce any locks on parent table...

What produces locks is UPDATE, so is it on table A or C should make no
difference...

If simple join and count fk is so slow - other option would be materialized
view... So it would need to include table C as materialized view but on the
way to to don't make expensive calculations in real time during insert in B
(and locking)

There is a article about materialized views on postgres wiki...

Sent from my Windows Phone
------------------------------
From: Culley Harrelson
Sent: 21 December 2011 22:07
To: Marc Mamin
Cc: Alban Hertroys; pgsql-general@postgresql.org
Subject: Re: [GENERAL] design help for performance

I am not going to take it to the delta solution for now.

On Wed, Dec 21, 2011 at 1:46 AM, Marc Mamin <M.Mamin@intershop.de> wrote:

Show quoted text

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
owner@postgresql.org] On Behalf Of Alban Hertroys
Sent: Mittwoch, 21. Dezember 2011 08:53
To: Culley Harrelson
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] design help for performance

On 21 Dec 2011, at 24:56, Culley Harrelson wrote:

Several years ago I added table_b_rowcount to table A in order to

minimize queries on table B. And now, as the application has grown, I
am starting to having locking problems on table A. Any change to

table

B requires the that table_b_rowcount be updated on table A... The
application has outgrown this solution.

When you update rowcount_b in table A, that locks the row in A of
course, but there's more going on. Because a new version of that row
gets created, the references from B to A also need updating to that

new

version (creating new versions of rows in B as well). I think that
causes a little bit more locking than originally anticipated - it may
even be the cause of your locking problem.

Instead, if you'd create a new table C that only holds the rowcount_b
and a reference to A (in a 1:1 relationship), most of those problems

go

away. It does add an extra foreign key reference to table A though,
which means it will weigh down updates and deletes there some more.

CREATE TABLE C (
table_a_id int PRIMARY KEY
REFERENCES table_a (id) ON UPDATE CASCADE ON DELETE
CASCADE,
table_b_rowcount int NOT NULL DEFAULT 0
);

Yes, those cascades are on purpose - the data in C is useless without
the accompanying record in A. Also, the PK makes sure it stays a 1:1
relationship.

Alban Hertroys

Hello,

it may help to combine Alban solution with yours but at the cost of a
higher complexity:

In table C use instead a column table_b_delta_rowcount (+1 /-1 ,
smallint) and only use INSERTs to maintain it, no UPDATEs (hence with a
non unique index on id).

Then regularily flush table C content to table A, in order to only have
recent changes in C.
Your query should then query both tables:

SELECT A. table_b_rowcount + coalesce(sum(C.table_b_delta_rowcount))
FROM A LEFT OUTER JOIN B on (A.id=B.id)
WHERE A.id = xxx

Marc Mamin

Import Notes

Resolved by subject fallback