Approach to Data Summary and Analysis

Started by Robert DiFalco · almost 12 years ago · 19 messages · general
#1 Robert DiFalco
robert.difalco@gmail.com

I have several related tables that represent a call state. Let's think of
these as phone calls to simplify things. Sometimes I need to determine the
last time a user was called, the last time a user answered a call, or the
last time a user completed a call.

The basic schema is something like this:

CREATE TABLE calls (
id BIGINT NOT NULL, -- sequence generator
user_id BIGINT NOT NULL,
called TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,

PRIMARY KEY (id),
FOREIGN KEY (user_id) REFERENCES my_users(id) ON DELETE CASCADE
);

CREATE TABLE calls_answered (
id BIGINT NOT NULL,
answered TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,

PRIMARY KEY (id),
FOREIGN KEY (id) REFERENCES calls(id) ON DELETE CASCADE
);

And so on for calls_connected, calls_completed, call_errors, etc.

Occasionally I will want to know things like "When was the last time a user
answered a call" or "How many times has a user been called".

I can do these queries using a combination of MAX or COUNT. But I'm
concerned about the performance.

SELECT MAX(a.id)
FROM calls_answered a JOIN calls c ON c.id = a.id
WHERE c.user_id = ?;

Or the number of answered calls:

SELECT COUNT(a.id)
FROM calls_answered a JOIN calls c ON c.id = a.id
WHERE c.user_id = ?;

Sometimes I might want to get this data for a whole bunch of users. For
example, "give me all users who have not answered a call in the last 5
days." Or even "what percentage of users called actually answered a call."
This approach could become a performance issue. So the other option is to
create a call_summary table that is updated with triggers.

The summary table would need fields like "user_id", "last_call_id",
"call_count", "last_answered_id", "answered_count", "last_completed_id",
"last_completed_count", etc.
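Concretely, the trigger-maintained version I'm imagining would look something like this (just a sketch; ON CONFLICT needs PostgreSQL 9.5+, and the names are illustrative):

```sql
CREATE TABLE call_summary (
    user_id          BIGINT PRIMARY KEY REFERENCES my_users(id) ON DELETE CASCADE,
    last_call_id     BIGINT,
    call_count       BIGINT NOT NULL DEFAULT 0,
    last_answered_id BIGINT,
    answered_count   BIGINT NOT NULL DEFAULT 0
    -- ... last_completed_id, completed_count, etc.
);

-- One trigger per event table; this one handles inserts into calls.
CREATE OR REPLACE FUNCTION bump_call_summary() RETURNS trigger AS $$
BEGIN
    INSERT INTO call_summary (user_id, last_call_id, call_count)
    VALUES (NEW.user_id, NEW.id, 1)
    ON CONFLICT (user_id)
    DO UPDATE SET last_call_id = NEW.id,
                  call_count   = call_summary.call_count + 1;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER calls_summary_trg AFTER INSERT ON calls
FOR EACH ROW EXECUTE PROCEDURE bump_call_summary();
```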

My only issue with a summary table is that I don't want a bunch of null
fields. For example, if the user was *called* but has never *answered* a
call, then the last_call_id and call_count fields on the summary table would
be non-NULL while the last_answer_id and answer_count fields would be NULL.
But over time all fields would eventually become non-NULL.

So that leads me to a summary table for EACH call state. Each summary table
would have a user id, a ref_id, and a count -- one summary table for each
state e.g. call_summary, call_answered_summary, etc.
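One of those per-state tables, together with the upsert its trigger would run, might look like this (again a sketch; ON CONFLICT needs PostgreSQL 9.5+):

```sql
CREATE TABLE call_answered_summary (
    user_id        BIGINT PRIMARY KEY REFERENCES my_users(id) ON DELETE CASCADE,
    ref_id         BIGINT NOT NULL,  -- most recent calls_answered.id for this user
    answered_count BIGINT NOT NULL DEFAULT 0
);

-- Body of an AFTER INSERT trigger on calls_answered:
INSERT INTO call_answered_summary (user_id, ref_id, answered_count)
SELECT c.user_id, NEW.id, 1
FROM calls c
WHERE c.id = NEW.id
ON CONFLICT (user_id)
DO UPDATE SET ref_id         = EXCLUDED.ref_id,
              answered_count = call_answered_summary.answered_count + 1;
```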

This approach has the down side that it creates a lot of tables and
triggers. It has the upside of being pretty efficient without having to
deal with NULL values. It's also pretty easy to reason about.

So for my question -- is the choice between these a personal preference
sort of thing or is there a right or wrong approach? Am I missing another
approach that would be better? I'm okay with SQL but I'm not an expert, so
I'm not sure if there is an accepted DESIGN PATTERN for this that I am missing.

Thanks!

#2 Vincent Veyron
vincent.veyron@libremen.org
In reply to: Robert DiFalco (#1)
Re: Approach to Data Summary and Analysis

On Mon, 14 Apr 2014 09:27:29 -0700
Robert DiFalco <robert.difalco@gmail.com> wrote:

I have several related tables that represent a call state.

And so on for calls_connected, calls_completed, call_errors, etc.

So for my question -- is the choice between these a personal preference
sort of thing or is there a right or wrong approach? Am I missing another
approach that would be better?

Hi Robert,

I guess a call state is subject to change, in which case you would have to shuffle records between tables when that happens?

ISTM you should consider using only a 'calls' table, and add an 'id_call_state' field to it that references the list of possible states. This would make your queries simpler.

create table call_state(
id_call_state text PRIMARY KEY,
libelle text);

INSERT INTO call_state (id_call_state, libelle) VALUES ('calls_connected', 'Connected'), ('calls_completed', 'Completed'), ('call_errors', 'Error');

CREATE TABLE calls (
id BIGINT NOT NULL, -- sequence generator

id_call_state TEXT NOT NULL REFERENCES call_state,

user_id BIGINT NOT NULL,
called TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,

PRIMARY KEY (id),
FOREIGN KEY (user_id) REFERENCES my_users(id) ON DELETE CASCADE
);
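For instance, the earlier questions become one-table queries (a sketch against the schema above):

```sql
-- Last time this user's call reached the 'Completed' state:
SELECT max(called)
FROM calls
WHERE user_id = ?
  AND id_call_state = 'calls_completed';

-- How many times the user has been called, regardless of state:
SELECT count(*) FROM calls WHERE user_id = ?;
```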

--
Regards, Vincent Veyron

http://libremen.com/
Legal case, contract and insurance claim management software

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3 Vincent Veyron
vv.lists@wanadoo.fr
In reply to: Vincent Veyron (#2)
Re: Approach to Data Summary and Analysis

On Mon, 14 Apr 2014 09:27:29 -0700
Robert DiFalco <robert.difalco@gmail.com> wrote:

I have several related tables that represent a call state.

And so on for calls_connected, calls_completed, call_errors, etc.

So for my question -- is the choice between these a personal preference
sort of thing or is there a right or wrong approach? Am I missing another
approach that would be better?

Hi Robert,

I guess a call state is subject to change, in which case you would have to shuffle records between tables when that happens?

ISTM you should consider using only a 'calls' table, and add an 'id_call_state' field to it that references the list of possible states. This would make your queries simpler.

create table call_state(
id_call_state text PRIMARY KEY,
libelle text);

INSERT INTO call_state (id_call_state, libelle) VALUES ('calls_connected', 'Connected'), ('calls_completed', 'Completed'), ('call_errors', 'Error');

CREATE TABLE calls (
id BIGINT NOT NULL, -- sequence generator

id_call_state TEXT NOT NULL REFERENCES call_state,

user_id BIGINT NOT NULL,
called TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,

PRIMARY KEY (id),
FOREIGN KEY (user_id) REFERENCES my_users(id) ON DELETE CASCADE
);

--

Salutations, Vincent Veyron

http://marica.fr
Gestion des contentieux juridiques, des contrats et des sinistres d'assurance


#4 Robert DiFalco
robert.difalco@gmail.com
In reply to: Vincent Veyron (#3)
Re: Approach to Data Summary and Analysis

But then I lose a bunch of data like the TIMESTAMPTZ of the call, answer,
connection, etc. By the way, these tables never need to be UPDATEd; they
are immutable in the current design. And in the end I'm not sure how the
proposal of one table with an updatable state changes the basic thrust of
the question, e.g. getting the last call, last answered, total called, and
total answered. If the state of a call transitions from called to answered,
then making state a single field loses all the data from the previous
state. Make sense?


#5 Rob Sargent
robjsargent@gmail.com
In reply to: Robert DiFalco (#4)
Re: Approach to Data Summary and Analysis

I wonder if you really need to place the parts of the call into the
various tables. Ring time, answer time, close time and all the values
associated with those parts of a call are fundamental to a single
call, though perhaps collected incrementally. Easy queries, for sure.
(Sorry, I haven't gone back to see your original schema. If it's clear
there why these are in separate tables, say no more.)

#6 Robert DiFalco
robert.difalco@gmail.com
In reply to: Rob Sargent (#5)
Re: Approach to Data Summary and Analysis

Things like this: average ring time before answer, average connected call
duration, % of calls never answered, % of answered calls that are
connected, number of times John has answered a call versus how many times
we've called him. That sort of stuff.


#7 Vincent Veyron
vv.lists@wanadoo.fr
In reply to: Robert DiFalco (#4)
Re: Approach to Data Summary and Analysis

On Mon, 14 Apr 2014 15:22:13 -0700
Robert DiFalco <robert.difalco@gmail.com> wrote:

Hi Robert,

But then I lose a bunch of data like the TIMESTAMPTZ of the call, answer,
connection, etc. Btw, currently these tables never need to be UPDATEd. They
are immutable in the current design.

Yes, but you wrote :

Occasionally I will want to know things like "When was the last time a user
answered a call" or "How many times has a user been called".

With your schema, you would have to query several tables each time. Queries will get messy in a hurry, updates and DDL changes too.

And in the end I'm not sure how the
proposal of one table and a state that is updatable changes the basic
thrust of the question. For example, getting last call, last answered,
total called, total answered. If the state of a call transitions from
called to answered then making it a field loses all the data with the
previous state, make sense?

If you need this, you can use a schema that accommodates it.

(Off the top of my head: insert a new record instead of altering an existing one, and create a hierarchy with parent records that point to an original call, maybe with a 'parent' field and recursive queries. You probably have many requirements that I'm not aware of, but this one can be met.)
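A rough sketch of that insert-only idea (the names are only illustrative):

```sql
-- One row per state change; each later event points back at the
-- original call through parent_id, so nothing is ever updated.
CREATE TABLE call_events (
    id        BIGINT PRIMARY KEY,
    parent_id BIGINT REFERENCES call_events(id),  -- NULL for the original call
    user_id   BIGINT NOT NULL,
    state     TEXT NOT NULL,  -- 'called', 'answered', 'completed', 'error', ...
    at        TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- "Last time the user answered" becomes a single-table query:
SELECT max(at)
FROM call_events
WHERE user_id = ? AND state = 'answered';
```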

--
Salutations, Vincent Veyron

http://marica.fr
Gestion des contentieux juridiques, des contrats et des sinistres d'assurance


#8 Robert DiFalco
robert.difalco@gmail.com
In reply to: Vincent Veyron (#7)
Re: Approach to Data Summary and Analysis

I'm sorry Vincent I'm not exactly sure what you are proposing. Are you
proposing that I add another table in addition to what I already have that
all the other tables JOIN to and add a state field in that parent table?
How is that different than what I have except now I have a new table with
an updatable state field? Maybe you can show a query or two to more
specifically show what you are suggesting?

Right now this seems like a simple way to get the last time John answered a call:

-- last answered call for John
SELECT MAX(a.answered)
FROM calls_answered a JOIN calls c ON c.id = a.id
WHERE c.user_id = ?; -- John's user id

If I don't have a calls_answered table I'm not sure how I would get this
information if I had a single table with a mutable state. Unless you are
suggesting denormalizing all the tables into one table that would have a
lot of null fields. For example answered_date would be null if the call was
never answered.


#9 Chris Curvey
chris@chriscurvey.com
In reply to: Robert DiFalco (#1)
Re: [GENERAL] Approach to Data Summary and Analysis

My initial thought is: that design is over-normalized. The thing you are
trying to model is the call, and it has several attributes, some of which
may be unknown or not applicable (which is what NULL is for). So my
thought would be to do something like this:

CREATE TABLE calls (
id BIGINT NOT NULL, -- sequence generator
user_id BIGINT NOT NULL,
called TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
answered TIMESTAMPTZ,

PRIMARY KEY (id),
FOREIGN KEY (user_id) REFERENCES my_users(id) ON DELETE CASCADE
);

--
I asked the Internet how to train my cat, and the Internet told me to get a
dog.

#10 Chris Curvey
chris@chriscurvey.com
In reply to: Chris Curvey (#9)
Re: Approach to Data Summary and Analysis


(Sorry, fat-fingered and hit "send too early"...)

CREATE TABLE calls (
id BIGINT NOT NULL, -- sequence generator
user_id BIGINT NOT NULL,
called TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
answered TIMESTAMPTZ NULL,
connected TIMESTAMPTZ NULL,
completed TIMESTAMPTZ NULL,

PRIMARY KEY (id),
FOREIGN KEY (user_id) REFERENCES my_users(id) ON DELETE CASCADE
);

Then your queries end up looking like this:

-- last time john answered
select max(answered)
from calls
where answered is not null
and user_id = ?;

-- users that have not answered a call in the last five days (I can think
-- of a few ways to interpret that phrase)
select myusers.*
from myusers
where not exists
( select *
from calls
where calls.user_id = myusers.user_id
and answered >= now() - interval '5 days');

-- average ring time, in seconds
select avg(extract(epoch from answered - called))
from calls
where answered is not null;
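For a big calls table you'd want an index to match; something like this (just a sketch):

```sql
-- Speeds up the per-user "answered" lookups above; the partial
-- index skips rows that were never answered.
create index calls_user_answered_idx
on calls (user_id, answered)
where answered is not null;
```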

--
I asked the Internet how to train my cat, and the Internet told me to get a
dog.

#11 Vincent Veyron
vv.lists@wanadoo.fr
In reply to: Robert DiFalco (#8)
Re: Approach to Data Summary and Analysis

On Tue, 15 Apr 2014 07:21:58 -0700
Robert DiFalco <robert.difalco@gmail.com> wrote:

I'm sorry Vincent I'm not exactly sure what you are proposing. Are you
proposing that I add another table in addition to what I already have that
all the other tables JOIN to and add a state field in that parent table?

No : keep table 'calls' with an additional 'status' field, and drop the other tables

How is that different than what I have except now I have a new table with
an updatable state field? Maybe you can show a query or two to more
specifically show what you are suggesting?

Right now this seems like a simple way to get the last time John was called:

// last answered called for John
SELECT MAX(a.answered)
FROM calls_answered a JOIN calls c ON c.id = a.id
WHERE c.user = John;

I probably misunderstood something in your first post. I'm not sure what calls_completed and call_errors hold, for instance, but I had the impression your records would go into a different table according to their status. Do calls_answered rows move to calls_completed at some point?

In that case, how do you know that you should query calls_answered instead of calls_completed? Or, if John did not answer, do you query call_errors?

If I don't have a calls_answered table I'm not sure how I would get this
information if I had a single table with a mutable state.

Checking an appropriate boolean field (call_answered boolean not null default false) would be one way. Again, this needs studying, and it would take more details to go on.

I stand by my earlier comment though; see Chris's answer, which is along the same lines.

Unless you are
suggesting denormalizing all the tables into one table that would have a
lot of null fields. For example answered_date would be null if the call was
never answered.

Not a big problem, I should say, unless you deal with really huge data.

--
Salutations, Vincent Veyron

http://marica.fr
Gestion des contentieux juridiques, des contrats et des sinistres d'assurance


#12 Robert DiFalco
robert.difalco@gmail.com
In reply to: Chris Curvey (#10)
Re: [GENERAL] Approach to Data Summary and Analysis

Actually that was exactly the initial table design. There were more fields
because for my use case there were a lot more states, and certain states
have additional data (for example, when a call goes from answered to
connected it also gets the user_id of the person being connected to). So
that one table started getting a LOT of columns, which started making it
hard to reason about.

The more normalized version has a couple of things going for it. COUNT,
MIN, MAX, etc. are very fast because I don't have to conditionally add NULL
checks. Everything is inserted, so for the millions of calls that get made
the normalized schema was much more efficient for writing. It was also
easier to understand. The answer table only has calls that were answered,
the error table only has calls that resulted in an error after being
connected, etc.

I know this kind of gets into a religious area when discussing NULLs and
what level of normalization is appropriate so I don't want to spark any of
that on this thread. But only doing inserts and never doing updates or
deletes performed very well for large data sets.

That said, I could explore a compromise between the monolithic table
approach and the completely normalized set of tables approach. Thanks for
your input!

On Tue, Apr 15, 2014 at 8:12 AM, Chris Curvey <chris@chriscurvey.com> wrote:

Show quoted text

On Tue, Apr 15, 2014 at 10:56 AM, Chris Curvey <chris@chriscurvey.com>wrote:

On Mon, Apr 14, 2014 at 12:27 PM, Robert DiFalco <
robert.difalco@gmail.com> wrote:

I have several related tables that represent a call state. Let's think
of these as phone calls to simplify things. Sometimes I need to determine
the last time a user was called, the last time a user answered a call, or
the last time a user completed a call.

The basic schema is something like this:

CREATE TABLE calls (
id BIGINT NOT NULL, // sequence generator
user_id BIGINT NOT NULL,
called TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,

PRIMARY KEY (id),
FOREIGN KEY (user_id) REFERENCES my_users(id) ON DELETE CASCADE
);

CREATE TABLE calls_answered (
id BIGINT NOT NULL,
answered TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,

PRIMARY KEY (id),
FOREIGN KEY (id) REFERENCES calls(id) ON DELETE CASCADE
);

And so on for calls_connected, calls_completed, call_errors, etc.

Occasionally I will want to know things like "When was the last time a
user answered a call" or "How many times has a user been called".

I can do these queries using a combination of MAX or COUNT. But I'm
concerned about the performance.

SELECT MAX(a.id)
FROM calls_answered a JOIN calls c ON c.id = a.id
WHERE c.user_id = ?;

Or the number of answered calls:

SELECT MAX(a.id)
FROM calls_answered a JOIN calls c ON c.id = a.id
WHERE c.user_id = ?;

Sometimes I might want to get this data for a whole bunch of users. For
example, "give me all users whose have not answered a call in the last 5
days." Or even "what percentage of users called actually answered a call."
This approach could become a performance issue. So the other option is to
create a call_summary table that is updated with triggers.

The summary table would need fields like "user_id", "last_call_id",
"call_count", "last_answered_id", "answered_count", "last_completed_id",
"last_completed_count", etc.

My only issue with a summary table is that I don't want a bunch of null
fields. For example, if the user was *called* but they have never
*answered* a call then the last_call_id and call_count fields on the
summary table would be non-NULL but the last_answer_id and answer_count
fields would be NULL. But over time all fields would eventually become
non-NULL.

So that leads me to a summary table for EACH call state. Each summary
table would have a user id, a ref_id, and a count -- one summary table for
each state e.g. call_summary, call_answered_summary, etc.
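
For the answered state that would be something like this (a sketch only;
the trigger body ignores the update-then-insert race, which is fine if a
single process is doing the inserts):

CREATE TABLE call_answered_summary (
user_id BIGINT NOT NULL,
ref_id BIGINT NOT NULL,
answer_count BIGINT NOT NULL DEFAULT 0,

PRIMARY KEY (user_id),
FOREIGN KEY (user_id) REFERENCES my_users(id) ON DELETE CASCADE
);

CREATE FUNCTION bump_call_answered_summary() RETURNS trigger AS $$
BEGIN
-- try to bump the existing per-user row first
UPDATE call_answered_summary
SET ref_id = NEW.id, answer_count = answer_count + 1
WHERE user_id = (SELECT user_id FROM calls WHERE id = NEW.id);
IF NOT FOUND THEN
INSERT INTO call_answered_summary (user_id, ref_id, answer_count)
SELECT user_id, NEW.id, 1 FROM calls WHERE id = NEW.id;
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER call_answered_summary_trg
AFTER INSERT ON calls_answered
FOR EACH ROW EXECUTE PROCEDURE bump_call_answered_summary();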

This approach has the down side that it creates a lot of tables and
triggers. It has the upside of being pretty efficient without having to
deal with NULL values. It's also pretty easy to reason about.

So for my question -- is the choice between these a personal preference
sort of thing or is there a right or wrong approach? Am I missing another
approach that would be better? I'm okay with SQL but I'm not an expert so I'm
not sure if there is an accepted DESIGN PATTERN for this that I am missing.

Thanks!

(Sorry, fat-fingered and hit "send too early"...)

CREATE TABLE calls (
id BIGINT NOT NULL, -- sequence generator
user_id BIGINT NOT NULL,
called TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
answered TIMESTAMPTZ NULL,
connected TIMESTAMPTZ NULL,
completed TIMESTAMPTZ NULL,

PRIMARY KEY (id),
FOREIGN KEY (user_id) REFERENCES my_users(id) ON DELETE CASCADE
);

Then your queries end up looking like this:

--last time john answered
SELECT MAX(answered)
FROM calls
where answered is not null
and user_id = ?;

-- users that have not answered a call in the last five days (I can think
-- of a few ways to interpret that phrase)
select my_users.*
from my_users
where not exists
( select *
from calls
where calls.user_id = my_users.id
and answered >= <five days ago>);

-- average ring time
select avg(extract(epoch from answered - called))
from calls
where answered is not null;

--
I asked the Internet how to train my cat, and the Internet told me to get
a dog.

#13Rob Sargent
robjsargent@gmail.com
In reply to: Robert DiFalco (#12)
Re: Approach to Data Summary and Analysis

On 04/15/2014 09:53 AM, Robert DiFalco wrote:

Actually that was exactly the initial table design. There were more
fields because for my use case there were a lot more states and
certain states have additional data (for example when a call goes from
answered to connected it also gets the user_id of the person being
connected to). So that one table started getting a LOT of columns
which started making it hard to reason about.

The more normalized version has a couple of things going for it.
COUNT, MIN, MAX, etc are very fast because I don't have to
conditionally add null checks. Everything is inserted so for the
millions of calls that get made the normalized schema was much more
efficient for writing. It was also easier to understand. The answer
table only has calls that were answered, the error table only has
calls the resulted in an error after being connected, etc.

I know this kind of gets into a religious area when discussing NULLs
and what level of normalization is appropriate so I don't want to
spark any of that on this thread. But only doing inserts and never
doing updates or deletes performed very well for large data sets.

That said, I could explore a compromise between the monolithic table
approach and the completely normalized set of tables approach. Thanks
for your input!

I wonder if the "LOT of columns" are the bits that need to be parcelled
off as specific to one condition of a call?

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#14Roxanne Reid-Bennett
rox@tara-lu.com
In reply to: Robert DiFalco (#1)
Re: Approach to Data Summary and Analysis

On 4/14/2014 12:27 PM, Robert DiFalco wrote:

And so on for calls_connected, calls_completed, call_errors, etc.

Occasionally I will want to know things like "When was the last time a
user answered a call" or "How many times has a user been called".
...
Sometimes I might want to get this data for a whole bunch of users.
...
So the other option is to create a call_summary table that is updated
with triggers.
...
My only issue with a summary table is that I don't want a bunch of
null fields.
...
But over time all fields would eventually become non-NULL.

So that leads me to a summary table for EACH call state. This approach
has the down side that it creates a lot of tables and triggers. It has
the upside of being pretty efficient without having to deal with NULL
values. It's also pretty easy to reason about.
...
So for my question -- is the choice between these a personal
preference sort of thing or is there a right or wrong approach? Am I
missing another approach that would be better? I'm okay with SQL but
I'm not expert so I'm not sure if there is an accepted DESIGN PATTERN
for this that I am missing.

There is no right or wrong - there is better, worse, best, and worst for
any specific scenario. In my experience, most people have time/money to
get to an 80% "better" design than all the other answers during design
and then it gets refined over time. And yes, personal experience does
play a part in how people interpret better/worse [aka religion] ;)

I didn't see anybody ask these questions - and to identify "better" -
they have to be asked.
1. How much data are you feeding into your system how fast?
this directly affects your choices on distribution, parallel
processing... writes vs updates vs triggers for copying vs all reads
[and if on bare metal - potentially where you place your logs,
indexes, core lookup tables, etc]
2. How much data are you reading out of your system - how fast?
you have given "simple" use cases (how many calls completed
within a time frame or to a number)
you have given very slightly more complex use cases (when was
the last time John answered a call)
you have given a slightly more bulky processing question of (how
many times have these users been called)
So...
a) How many users executing read queries do you have?
b) What is the expected load for simple queries (per
week/day/hour/minute - depending upon your resolution on speed)
c) What is the expected load for your mid-line complex queries
d) What is the "maximum" volume you expect a bulk query to go
after (like all users in the last 18 years, or this city's users in the
last day?) and how frequently will that kind of query be executed? How
much tolerance for delay do your users have?
e) do you have any known really complex queries that might bog
the system down?
f) How much lag time can you afford between capture and reporting?

Answers to the above define your performance requirements - which
defines the style of schema you need. Queries can be written to pull
data from any schema design - but how fast they can perform or how
easily they can be created...

Chris and Vincent both targeted a balance between writes and reads -
which adequately answers 80-85% of the usages out there. But you didn't
give us any of the above - so their recommendation (while very likely
valid) may not actually fit your case at all.

As to design patterns -
"Generally" a database schema is more normalized for an operational
system because normalization results in fewer writes/updates and lowers
the risk of corruption if a failure takes place. It also isolates
updates for any specific value to one location minimizing internally
caused data corruption.
Reporting systems are generally less normalized because writes are more
one-time and reads are where the load occurs.
Sometimes you have to use data replication to have a system that
appropriately supports both.

you have shown you are already aware of normalization.
If you weren't aware of approaches to Data Warehousing... you can review
information about how it is accomplished
- see the blogs on kimballgroup DOT com they cover a lot of high(er)
level concepts with enough specificity to be of some direct use.
[that website also covers some ideas for "Big Data" which aren't
necessarily limited to RDBMS']

Specify your performance requirements, then figure out your schema design.

FWIW I don't understand your (or any other person's) hesitancy for "lots
of" "NULL" values. They provide meaning in a number of different
ways... not the least of which is that you don't know (yet) - which is
knowledge in and of itself.

Roxanne


#15Robert DiFalco
robert.difalco@gmail.com
In reply to: Roxanne Reid-Bennett (#14)
Re: Approach to Data Summary and Analysis

1. >500K rows per day into the calls table.
2. Very rarely. The only common query is gathering users that have not been
called "today" (along with some other qualifying criteria). More analytical
queries/reports are done for internal use and it is not essential that they
be lickity-split.
a. Usually just one connection at a time executes read queries.
b. the users not called today query will be done once a day.
c. Daily
d. All users for the last year (if you are asking about retention). We will
also rarely have to run for all time.
e. Not that I'm aware of (or seen) today.
f. For the simple queries we cannot afford latency between calls and
querying who was already called.

While I don't seem to be getting much support for it here :D my write
performance (which is most essential) has been much better since I further
normalized the tables and made it so that NULL is never used and data is
never updated (i.e. it is immutable once it is written).

As for wanting to avoid NULLs I don't really know what to say. Obviously
some times NULL's are required. For this design I don't really need them
and they make the data harder to reason about (because they are kind of
open to interpretation). They can also give you different results than you
sometimes expect (for example when looking for a non matching key, you
start having to inject some OR IS NULLs and such). Also, the absence of
null can make a lot of queries more optimal. That said, I understand where
you all are coming from with de-normalization. It's definitely the path of
the least resistance. Our instinct is to want to see all related data in a
single table when possible.

The summary table was really a separate point from whether or not people
liked my schema or not -- I mean whether I de-normalize as people are
asking or not, there would still be the question of a summary table for MAX
and COUNT queries or to not have a summary table for those. I probably made
the original question too open ended.

On Tue, Apr 15, 2014 at 3:26 PM, Roxanne Reid-Bennett <rox@tara-lu.com> wrote:

Show quoted text


#16Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: Robert DiFalco (#15)
Re: Approach to Data Summary and Analysis

On 16/04/14 13:10, Robert DiFalco wrote:

Show quoted text

Have you considered partial indexes? Using the WHERE predicate
option of CREATE INDEX.

This can be useful if you often look for things that are only a
small subset of keys. For example a partial index on sex would be useful
for nurses, only indexing those that are male as they are in a very
small minority.
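
For the calls example, assuming the single-table variant with an answered
column, an untested sketch:

CREATE INDEX calls_unanswered_idx
ON calls (user_id, called)
WHERE answered IS NULL;

Only the rows that are still unanswered get indexed, so the index stays
small even as the table grows.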

Cheers,
Gavin

#17Roxanne Reid-Bennett
rox@tara-lu.com
In reply to: Robert DiFalco (#15)
Re: Approach to Data Summary and Analysis

On 4/15/2014 9:10 PM, Robert DiFalco wrote:

1. >500K rows per day into the calls table.
2. Very rarely. The only common query is gathering users that have not
been called "today" (along with some other qualifying criteria). More
analytical queries/reports are done for internal use and it is not
essential that they be lickity-split.
a. Usually just one connection at a time executes read queries.
b. the users not called today query will be done once a day.
c. Daily
d. All users for the last year (if you are asking about retention). We
will also rarely have to run for all time.
e. Not that I'm aware of (or seen) today.
f. For the simple queries we cannot afford latency between calls and
querying who was already called.

While I don't seem to be getting much support for it here :D my write
performance (which is most essential) has been much better since I
further normalized the tables and made it so that NULL is never used
and data is never updated (i.e. it is immutable once it is written).

Based on the above you are primarily capturing data and feeding back
essentially one easy to find result set [who has NOT been successfully
called] on an ongoing single threaded basis [once per day?]. So you are
absolutely correct - tune for writing speed.

The summary table was really a separate point from whether or not
people liked my schema or not -- I mean whether I de-normalize as
people are asking or not, there would still be the question of a
summary table for MAX and COUNT queries or to not have a summary table
for those. I probably made the original question too open ended.

Do you know your answer?
you said : "Occasionally I will want to know things like "
you answered to frequency on queries as "the users not called today
query will be done once a day." as was c) [I'm assuming once?]
and d) appears to be "ad-hoc" and you said your users can deal with
latency in response for those.

So finding Min/Max/Count quickly really *doesn't* matter for tuning.

So the only reason I can see to add a summary table is to ... simplify
maintenance [note I did NOT say "development"] and then only IF it
doesn't impact the write speeds beyond an acceptable level. Proper
internal / external documentation can mitigate maintenance nightmares.
If your developer(s) can't figure out how to get the data they need from
the schema - then give them the queries to run. [you are likely better
at tuning those anyway]

Last consideration - business consumption of data does change over
time. Disk space is cheap [but getting and keeping speed sometimes
isn't]. You might consider including ongoing partial archival of the
operational data during slow usage (write) periods.

Roxanne


#18Robert DiFalco
robert.difalco@gmail.com
In reply to: Roxanne Reid-Bennett (#17)
Re: Approach to Data Summary and Analysis

Thanks Roxanne, I suppose when it comes down to it -- for the current use
cases and data size -- my only concern is the "calling" query that will
need to use max to determine if a user has already had a call today. For a
large user set, for each user I would either have to MAX on the answered
timestamp to compare it against today or do an EXISTS query to see if any
timestamp for that user is greater than or equal to "today".
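
The EXISTS variant would be something like this (a sketch against the
normalized schema, taking "today" as the server's current day):

SELECT u.*
FROM my_users u
WHERE NOT EXISTS
( SELECT 1
FROM calls_answered a JOIN calls c ON c.id = a.id
WHERE c.user_id = u.id
AND a.answered >= date_trunc('day', now()));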

But I suppose I just need to construct a huge dataset and see. I was
thinking by keeping a summary so I always knew the last answer or call time
for each user that I could mitigate this becoming an issue. Over time a
single user could have answered a call thousands of times. So that would
make a "<=" timestamp query be just # of users instead of # of users X 1000
(or however many calls they have answered over the non-archived time
period).

On Wed, Apr 16, 2014 at 8:42 AM, Roxanne Reid-Bennett <rox@tara-lu.com> wrote:

Show quoted text


#19Roxanne Reid-Bennett
rox@tara-lu.com
In reply to: Robert DiFalco (#18)
Re: Approach to Data Summary and Analysis

On 4/16/2014 2:40 PM, Robert DiFalco wrote:

Thanks Roxanne, I suppose when it comes down to it -- for the current
use cases and data size -- my only concern is the "calling" query that
will need to use max to determine if a user has already had a call
today. For a large user set, for each user I would either have to MAX
on the answered timestamp to compare it against today or do an exist
query to see if any timestamp for that user is greater or equal than
"today".

I didn't go back to look at your original schema- but.. if your 500K
records are coming in time ordered... You may be able to track "max" as
an attribute on an "SCD" based on the caller/callee table [or the
caller/ee table itself if that table is only used by your app] with an
update from a post-insert trigger on the appropriate table. Even if they
aren't time ordered, you add the overhead of a single comparative in the
trigger. Downside is that you fire a trigger and an update for every
insert. [or just an update depending on what is driving your load of the
500K records]
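
As a sketch of that trigger function (names hypothetical; it assumes a
per-user state table and that the inserted row carries the user's id and
the answered timestamp):

CREATE FUNCTION track_last_answered() RETURNS trigger AS $$
BEGIN
-- the single comparative: only ever move the tracked max forward
UPDATE user_call_state
SET last_answered = NEW.answered
WHERE user_id = NEW.user_id
AND (last_answered IS NULL OR last_answered < NEW.answered);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;

fired AFTER INSERT on whichever table records the answer.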

Again - the proof on "value" of this overhead is a comparison of the
cost for the updates vs the cost on the query to find max() I suspect
your once a day query can afford all sorts of other optimizations that
are "better" than a trigger fired on every insert. [such as the
function index - that was already mentioned] I really suspect you just
don't have enough load on the query side (complex queries * # of users)
to justify the extra load on the write side (+1 trigger, +1 update /
insert) to avoid a (potentially) heavy query load 1x/day.

Another option... if only worried about "today".. then keep only
"today's" data in your query table, and migrate historical data nightly
to a pseudo archive table for those "every once in a while" questions.
I haven't played with table inheritance in Postgres - but that's a
capability I might look at if I were doing a pseudo archive table.
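
In outline that could look like this (untested - and note that querying
the parent table sees the child's rows unless you say ONLY):

CREATE TABLE calls_archive () INHERITS (calls);

-- nightly: move everything before today into the archive
WITH moved AS (
DELETE FROM ONLY calls
WHERE called < date_trunc('day', now())
RETURNING *
)
INSERT INTO calls_archive SELECT * FROM moved;

-- "today" queries read FROM ONLY calls;
-- historical questions read FROM calls (parent plus children).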

Roxanne
