Best practices for aggregate table design
Hi,
I'm trying to construct an agg table to capture phone call data and group by
state, city and time but also want just general measures by month. I'm
thinking to have this:
month | city_id | state_id | total_calls_inbound | total_calls_outbound |
total_calls
2015-01 12 2 54 2 56
2015-01 10 4 147 15 162
2015-01 null null 201 17 218
-----------------------
and a dimension table to easily convert city, state into their string
versions and also provide other attributes (e.g. GPS coordinates).
My questions are:
1. I'm including 'total_calls' in the schema even thought it could easily be
calculated from inbound + outbound. I did this for simplicity in a REST
call, is that a bad idea?
2. I'm adding a 'null' row to show all the calls for a given month
regardless of city or state, again to simplify the client side. It adds a
row and is somewhat sparse but preferrable by the developer. Acceptable
practice?
Thanks!
--
View this message in context: http://postgresql.nabble.com/Best-practices-for-aggregate-table-design-tp5868940.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Tue, Oct 6, 2015 at 11:59 AM, droberts <david.roberts@riverbed.com>
wrote:
Hi,
I'm trying to construct an agg table to capture phone call data and group
by
state, city and time but also want just general measures by month. I'm
thinking to have this:month | city_id | state_id | total_calls_inbound | total_calls_outbound |
total_calls2015-01 12 2 54 2 56
2015-01 10 4 147 15 162
2015-01 null null 201 17 218-----------------------
and a dimension table to easily convert city, state into their string
versions and also provide other attributes (e.g. GPS coordinates).My questions are:
1. I'm including 'total_calls' in the schema even thought it could easily
be
calculated from inbound + outbound. I did this for simplicity in a REST
call, is that a bad idea?
Hard to say given the limited insight into the use case. More concerned
about writing since on the read side you can easily wrap the table in a
view that provides the derived value as a column. You are also trading
space for processing power. You only end up processing the small subset
actively being queried presently while you end up storing the derived data
for every single record even if it is likely never to be queried again - or
at least queried in a highly time-sensitive environment.
2. I'm adding a 'null' row to show all the calls for a given month
regardless of city or state, again to simplify the client side. It adds a
row and is somewhat sparse but preferrable by the developer. Acceptable
practice?
I would make up a city_id and state_id representing "ALL" and use that in
place of NULL.
Storing derived information is a matter of making calculated trade-offs in
risking data anomalies in exchange for performance benefits. Choosing to
go this route is likely worthwhile if you can execute it correctly.
Neither choice is flat-out wrong. Beyond that it takes more information
than provided to pass judgement.
David J.
Thanks for your response. One more follow-up question.
Is there a best practice when to create a measure that includes a
property/dimension?
Let me give an example, say in my example where I have outbound and inbound
calls.
Is is best to have measures:
-total_inbound
-total_outbound
OR a dimension 'type' with values outbound/inbound and a single measure
column 'total' ?
I think i would have more rows but fewer columns. Perhaps it depends upon
the number of possible options there would be (e.g. is it always just
outbound/inbound or would they ever grow further)
--
View this message in context: http://postgresql.nabble.com/Best-practices-for-aggregate-table-design-tp5868940p5868958.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On 10/6/2015 11:24 AM, droberts wrote:
OR a dimension 'type' with values outbound/inbound and a single measure
column 'total' ?
that smells a bit too much like an "EAV" (entity-attribute-value) which
is considered an antipattern in relational circles
--
john r pierce, recycling bits in santa cruz
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Tue, Oct 6, 2015 at 2:34 PM, John R Pierce <pierce@hogranch.com> wrote:
On 10/6/2015 11:24 AM, droberts wrote:
OR a dimension 'type' with values outbound/inbound and a single measure
column 'total' ?that smells a bit too much like an "EAV" (entity-attribute-value) which is
considered an antipattern in relational circles
All models are wrong - but some are useful.
In this example what if you care to distinguish between external and
internal variations of inbound/outbound. i.e., Inter-office calls.
While this could degrade to EAV used in moderation its called
normalization. Think about the best way to record phone numbers for a
person. This is similar.
The more arbitrary the distinction the more trouble you can get into with
using columns instead of rows. Typically if its simply "one, two, three"
you are asking to get hosed. When it is more attribute based you are
generally safe in-so-far as useful questions that your model is expected to
answer can be answered with a minimum of difficulty. And if you end up
having to answer different questions than designed for the choice of
row-vs-column probably will be the smallest of the changes that need to be
made.
In a data warehouse situation the choice should also be informed but how
the source system models the data in question. Which itself is based upon
the real-world applicability of the model as it pertains to the users of
the system.
David J.
Okay, so is it safe to say I should use loosely use these guidelines when
deciding whether to model an attribute as a dimension
(type=[inbound,outbound]) vs. bundling with a measure (total_inbound) ?
If you know the number of values for a dimension are fixed (e.g. boolean),
then creating a measure will have benefits of:
- reduced number of rows/storage
- better performance since less indexing/vacuuming
the drawbacks are:
-rigid structure, not very extensible over time (e.g. later realize I need
to also track 'internal' calls).
In my case, I'm now needing to add another measure 'encrypted=true/false',
so my table is starting to look like
month | city_id | state_id | total_calls_inbound | total_calls_outbound |
total_calls_inbound_encr | total_calls_outbound_encr |
getting a bit hairy but the alternative seems like it would start growing
too quickly in rows and more I/O for inserts.
month | city_id | state_id | encrypted | type | total_calls
2015-01 12 2 true, false, 56
2015-01 10 4 true, true, 147
2015-01 null null 201 17 218
--
View this message in context: http://postgresql.nabble.com/Best-practices-for-aggregate-table-design-tp5868940p5868967.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Tue, Oct 6, 2015 at 2:53 PM, droberts <david.roberts@riverbed.com> wrote:
month | city_id | state_id | total_calls_inbound | total_calls_outbound |
total_calls_inbound_encr | total_calls_outbound_encr |getting a bit hairy but the alternative seems like it would start growing
too quickly in rows and more I/O for inserts.month | city_id | state_id | encrypted | type | total_calls
2015-01 12 2 true, false, 56
2015-01 10 4 true, true, 147
2015-01 null null 201 17 218
A bit beyond my experience but my gut tells me I would choose the later
over the former. It even has a name - star schema.
David J.
2. I'm adding a 'null' row to show all the calls for a given month
regardless of city or state, again to simplify the client side. It adds a
row and is somewhat sparse but preferrable by the developer. Acceptable
practice?
do you see any advantage with this model?
I would store your monthly data within a separate table.
regards,
Marc Mamin
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
I see the advantage is for the developer. We right one REST API call that
leverages this single table regardless whether he wants groups by city for a
month or total for a month. Creating a separate table would make the
backend a bit more complex is all and wouldn't save on space I don't think.
--
View this message in context: http://postgresql.nabble.com/Best-practices-for-aggregate-table-design-tp5868940p5869195.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
droberts schrieb am 06.10.2015 um 20:53:
Okay, so is it safe to say I should use loosely use these guidelines when
deciding whether to model an attribute as a dimension
(type=[inbound,outbound]) vs. bundling with a measure (total_inbound) ?If you know the number of values for a dimension are fixed (e.g. boolean),
then creating a measure will have benefits of:
- reduced number of rows/storage
- better performance since less indexing/vacuumingthe drawbacks are:
-rigid structure, not very extensible over time (e.g. later realize I need
to also track 'internal' calls).In my case, I'm now needing to add another measure 'encrypted=true/false',
so my table is starting to look like
Have you considered using a hstore column to store the attributes you don't know yet?
Which makes this extensible, flexible and fast.
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Thomas Kellerer <spam_eater@gmx.net> writes:
droberts schrieb am 06.10.2015 um 20:53:
Okay, so is it safe to say I should use loosely use these guidelines when
deciding whether to model an attribute as a dimension
(type=[inbound,outbound]) vs. bundling with a measure (total_inbound) ?If you know the number of values for a dimension are fixed (e.g. boolean),
then creating a measure will have benefits of:
- reduced number of rows/storage
- better performance since less indexing/vacuumingthe drawbacks are:
-rigid structure, not very extensible over time (e.g. later realize I need
to also track 'internal' calls).In my case, I'm now needing to add another measure 'encrypted=true/false',
so my table is starting to look likeHave you considered using a hstore column to store the attributes you
don't know yet?Which makes this extensible, flexible and fast.
Is there an advantage of hstore vs. json/jsonb?
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Thu, Oct 8, 2015 at 3:49 AM, <hari.fuchs@gmail.com> wrote:
Is there an advantage of hstore vs. json/jsonb?
Definitely over json because that is not indexable. I'm not seeing an
advantage over jsonb unless you want to prevent storing complex data
structures.
I haven't but wouldn't it be better to wait and just add new columns if/when
I need to?
--
View this message in context: http://postgresql.nabble.com/Best-practices-for-aggregate-table-design-tp5868940p5869372.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Nabble has a "quote" feature - please use it.
On Thu, Oct 8, 2015 at 5:00 PM, droberts <david.roberts@riverbed.com> wrote:
I haven't but wouldn't it be better to wait and just add new columns
if/when
I need to?
Its worth keeping in the back of your mind but I tend to think that
choosing hstore in order to "be flexible" is a solution looking for a
problem. The dynamics between your system and the source system will play
a large role in determining whether to use a mini-table-within-a-table in
your model.
David J.
David G Johnston wrote
Nabble has a "quote" feature - please use it.
On Thu, Oct 8, 2015 at 5:00 PM, droberts <
david.roberts@
> wrote:
I haven't but wouldn't it be better to wait and just add new columns
if/when
I need to?Its worth keeping in the back of your mind but I tend to think that
choosing hstore in order to "be flexible" is a solution looking for a
problem. The dynamics between your system and the source system will play
a large role in determining whether to use a mini-table-within-a-table in
your model.David J.
Thanks for everyone's help. Can anyone recommend a good book on database
modeling, around these 'cube' or aggregate concepts in particular? I'm
using Postgres but shouldn't matter too much I assume.
Thanks
--
View this message in context: http://postgresql.nabble.com/Best-practices-for-aggregate-table-design-tp5868940p5869500.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On 10/9/2015 3:16 PM, droberts wrote:
Thanks for everyone's help. Can anyone recommend a good book on
database modeling, around these 'cube' or aggregate concepts in
particular? I'm using Postgres but shouldn't matter too much I assume.
Given the shift towards NoSQL for BI, and the age of the consultants, I
don't know how much of a driver of design patterns the Kimball Group
still is in the BI/Data warehouse space - but their blogs and books have
been good fodder for thought for the modeling I've done.
Roxanne
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On 10/9/2015 6:07 PM, Roxanne Reid-Bennett wrote:
On 10/9/2015 3:16 PM, droberts wrote:
Thanks for everyone's help. Can anyone recommend a good book on
database modeling, around these 'cube' or aggregate concepts in
particular? I'm using Postgres but shouldn't matter too much I assume.Given the shift towards NoSQL for BI, and the age of the consultants,
I don't know how much of a driver of design patterns the Kimball Group
still is in the BI/Data warehouse space - but their blogs and books
have been good fodder for thought for the modeling I've done.
actually, there've been numerous cases where NoSQL deployments never
worked right, and people reverted to SQL for reporting.
--
john r pierce, recycling bits in santa cruz
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general