Populating missing dates in postgresql data

Started by Lavrenz, Steven Mabout 11 years ago9 messagesgeneral
Jump to latest
#1Lavrenz, Steven M
slavrenz@purdue.edu

Alright everyone, this is a doozy of a problem. I am new to Postgres so I appreciate patience/understanding. I have a database of hardware objects, each of which has several different "channels". Once per day, these channels are supposed to check in with a central server, generating an event log table (TABLE A) like the following:

object_id channel check-in date
****************************************
990 1 2014-12-01
990 1 2014-12-02
990 2 2014-12-01
990 2 2014-12-02
286 2 2014-12-01
286 2 2014-12-02
286 5 2014-12-01
286 5 2014-12-02
4507 1 2014-12-01
4507 1 2014-12-02
4507 2 2014-12-01
4507 2 2014-12-02

And so on. Occasionally, communications will break down to the hardware, such that no reporting occurs. For example, let's say that object 286 loses communications on 12/1/2014. Then the table might look like:

object_id channel check-in date
****************************************
990 1 2014-12-01
990 1 2014-12-02
990 2 2014-12-01
990 2 2014-12-02

286 2 2014-12-02

286 5 2014-12-02
4507 1 2014-12-01
4507 1 2014-12-02
4507 2 2014-12-01
4507 2 2014-12-02

Or let's say that for some reason, just channel 2 loses reporting for a day. Then we would have:

object_id channel check-in date
****************************************
990 1 2014-12-01
990 1 2014-12-02
990 2 2014-12-01
990 2 2014-12-02

286 2 2014-12-02
286 5 2014-12-01
286 5 2014-12-02
4507 1 2014-12-01
4507 1 2014-12-02
4507 2 2014-12-01
4507 2 2014-12-02

I have a second table (TABLE B) with all of the object_ids and channels that are supposed to be reporting in each day. For cases where a certain channel does not check in, I want to add a column that indicates the comm failure. So, for the example where all channels on object 286 do not check in, I would like to get is something like this:

object_id channel check-in date comm failure
**********************************************************
990 1 2014-12-01 No
990 1 2014-12-02 No
990 2 2014-12-01 No
990 2 2014-12-02 No
286 2 2014-12-01 Yes
286 2 2014-12-02 No
286 5 2014-12-01 Yes
286 5 2014-12-02 No
4507 1 2014-12-01 No
4507 1 2014-12-02 No
4507 2 2014-12-01 No
4507 2 2014-12-02 No

I have been racking my mind for the better part of a day on how to do this. The thing is that I can do a right join of TABLE B on TABLE A, and this will populate the missing object ids and channels. However, this only works for a single day, and it gives me something like:

object_id channel check-in date comm failure
**********************************************************
990 1 2014-12-01 No
990 1 2014-12-02 No
990 2 2014-12-01 No
990 2 2014-12-02 No
286 2 Yes
286 2 2014-12-02 No
286 5 Yes
286 5 2014-12-02 No
4507 1 2014-12-01 No
4507 1 2014-12-02 No
4507 2 2014-12-01 No
4507 2 2014-12-02 No

I need to do a count of comm failures by day, so I need to populate the check-in date field. Please help!

Best Regards,
Steve

*************************************************
Steven Lavrenz, MS, EIT
Doctoral Research Fellow, Ph.D. Candidate
Purdue University | Transportation Engineering
Hampton Hall of Civil Engineering, Room 1122
550 Stadium Mall Drive
West Lafayette, IN 47907
765-775-6423
slavrenz@purdue.edu<https://exchange.purdue.edu/owa/redir.aspx?C=1KvRJRKXRUmCQXrzhc8xxDD2sumyJNIICxsOWx4t9faOapFqc_-gBVMV21AYScHluQX0uMJmOYA.&amp;URL=mailto%3aslavrenz%40purdue.edu&gt;

#2Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Lavrenz, Steven M (#1)
Re: Populating missing dates in postgresql data

On 03/25/2015 05:25 PM, Lavrenz, Steven M wrote:

Alright everyone, this is a doozy of a problem. I am new to Postgres so
I appreciate patience/understanding. I have a database of hardware
objects, each of which has several different �channels�. Once per day,
these channels are supposed to check in with a central server,
generating an event log table (TABLE A) like the following:

object_id channel check-in date

****************************************

990 1 2014-12-01

990 1 2014-12-02

990 2 2014-12-01

990 2 2014-12-02

286 2 2014-12-01

286 2 2014-12-02

286 5 2014-12-01

286 5 2014-12-02

4507 1 2014-12-01

4507 1 2014-12-02

4507 2 2014-12-01

4507 2 2014-12-02

And so on. Occasionally, communications will break down to the hardware,
such that no reporting occurs. For example, let�s say that object 286
loses communications on 12/1/2014. Then the table might look like:

object_id channel check-in date

****************************************

990 1 2014-12-01

990 1 2014-12-02

990 2 2014-12-01

990 2 2014-12-02

286 2 2014-12-02

286 5 2014-12-02

4507 1 2014-12-01

4507 1 2014-12-02

4507 2 2014-12-01

4507 2 2014-12-02

Or let�s say that for some reason, just channel 2 loses reporting for a
day. Then we would have:

object_id channel check-in date

****************************************

990 1 2014-12-01

990 1 2014-12-02

990 2 2014-12-01

990 2 2014-12-02

286 2 2014-12-02

286 5 2014-12-01

286 5 2014-12-02

4507 1 2014-12-01

4507 1 2014-12-02

4507 2 2014-12-01

4507 2 2014-12-02

I have a second table (TABLE B) with all of the object_ids and channels
that are supposed to be reporting in each day. For cases where a certain
channel does not check in, I want to add a column that indicates the
comm failure. So, for the example where all channels on object 286 do
not check in, I would like to get is something like this:

object_id channel check-in
date comm failure

**********************************************************

990 1
2014-12-01 No

990 1
2014-12-02 No

990 2
2014-12-01 No

990 2
2014-12-02 No

286 2
2014-12-01 Yes

286 2
2014-12-02 No

286 5
2014-12-01 Yes

286 5
2014-12-02 No

4507 1
2014-12-01 No

4507 1
2014-12-02 No

4507 2
2014-12-01 No

4507 2
2014-12-02 No

I have been racking my mind for the better part of a day on how to do
this. The thing is that I can do a right join of TABLE B on TABLE A, and
this will populate the missing object ids and channels. However, this
only works for a single day, and it gives me something like:

object_id channel check-in
date comm failure

**********************************************************

990 1
2014-12-01 No

990 1
2014-12-02 No

990 2
2014-12-01 No

990 2
2014-12-02 No

286 2
Yes

286 2
2014-12-02 No

286 5
Yes

286 5
2014-12-02 No

4507 1
2014-12-01 No

4507 1
2014-12-02 No

4507 2
2014-12-01 No

4507 2
2014-12-02 No

I need to do a count of comm failures by day, so I need to populate the
check-in date field. Please help!

Without seeing the actual query this is just a suggestion. I would say
use CASE:

http://www.postgresql.org/docs/9.3/interactive/functions-conditional.html#FUNCTIONS-CASE

Where if the date was not available from table A use the one from table B.

Best Regards,

Steve

--
Adrian Klaver
adrian.klaver@aklaver.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3David G. Johnston
david.g.johnston@gmail.com
In reply to: Lavrenz, Steven M (#1)
Re: Populating missing dates in postgresql data

On Wed, Mar 25, 2015 at 5:25 PM, Lavrenz, Steven M <slavrenz@purdue.edu>
wrote:

Alright everyone, this is a doozy of a problem. I am new to Postgres so
I appreciate patience/understanding. I have a database of hardware objects,
each of which has several different “channels”. Once per day, these
channels are supposed to check in with a central server, generating an
event log table (TABLE A) like the following:

object_id channel check-in date

****************************************

990 1 2014-12-01

990 1 2014-12-02

990 2 2014-12-01

990 2 2014-12-02

286 2 2014-12-01

286 2 2014-12-02

286 5 2014-12-01

286 5 2014-12-02

4507 1 2014-12-01

4507 1 2014-12-02

4507 2 2014-12-01

4507 2 2014-12-02

And so on. Occasionally, communications will break down to the hardware,
such that no reporting occurs. For example, let’s say that object 286 loses
communications on 12/1/2014. Then the table might look like:

object_id channel check-in date

****************************************

990 1 2014-12-01

990 1 2014-12-02

990 2 2014-12-01

990 2 2014-12-02

286 2 2014-12-02

286 5 2014-12-02

4507 1 2014-12-01

4507 1 2014-12-02

4507 2 2014-12-01

4507 2 2014-12-02

Or let’s say that for some reason, just channel 2 loses reporting for a
day. Then we would have:

object_id channel check-in date

****************************************

990 1 2014-12-01

990 1 2014-12-02

990 2 2014-12-01

990 2 2014-12-02

286 2 2014-12-02

286 5 2014-12-01

286 5 2014-12-02

4507 1 2014-12-01

4507 1 2014-12-02

4507 2 2014-12-01

4507 2 2014-12-02

I have a second table (TABLE B) with all of the object_ids and channels
that are supposed to be reporting in each day. For cases where a certain
channel does not check in, I want to add a column that indicates the comm
failure. So, for the example where all channels on object 286 do not check
in, I would like to get is something like this:

object_id channel check-in
date comm failure

**********************************************************

990 1
2014-12-01 No

990 1
2014-12-02 No

990 2
2014-12-01 No

990 2
2014-12-02 No

286 2
2014-12-01 Yes

286 2
2014-12-02 No

286 5
2014-12-01 Yes

286 5
2014-12-02 No

4507 1
2014-12-01 No

4507 1
2014-12-02 No

4507 2
2014-12-01 No

4507 2
2014-12-02 No

I have been racking my mind for the better part of a day on how to do
this. The thing is that I can do a right join of TABLE B on TABLE A, and
this will populate the missing object ids and channels. However, this only
works for a single day, and it gives me something like:

object_id channel check-in
date comm failure

**********************************************************

990 1
2014-12-01 No

990 1
2014-12-02 No

990 2
2014-12-01 No

990 2
2014-12-02 No

286 2
Yes

286 2
2014-12-02 No

286 5
Yes

286 5
2014-12-02 No

4507 1
2014-12-01 No

4507 1
2014-12-02 No

4507 2
2014-12-01 No

4507 2
2014-12-02 No

I need to do a count of comm failures by day, so I need to populate the
check-in date field. Please help!

​People are much more inclined to help if you make it easy for them.

Read up on "CTE" (WITH clause) in the documentation. And "VALUES". Both
in the "SELECT" command section.

Use those to create inline versions of your two tables and then provide the
query you can get working.

The basic solution is:

WITH actual_checkins (id, channel, date, count_for_day) AS (...)
, expected_checkins (id, channel, date) AS (...)
SELECT id, channel, date, COALESCE(count_for_day, 0) AS number_of_checkins
FROM expected_checkins
LEFT JOIN actual_checkins USING (id, channel, date)

You can convert zero/non-true to "no/yes" via a CASE WHEN count = 0 THEN
'no' ELSE 'yes' END or similar.

David J.

​P.S. the function "generate_series(date, date)" may serve you well

#4Mitu Verma
mitu.verma@ericsson.com
In reply to: David G. Johnston (#3)
Re: Populating missing dates in postgresql data

Hi,

We have a customer complaining about the time taken by one of the application scripts while deleting older data from the log tables.
During the deletion, customer reported that he often sees the below error and because of which table size doesn’t reduce.

ERROR: canceling autovacuum task
Date: 2015-03-14 04:29:19
Context: automatic analyze of table "fm_db_Server3.mmsuper.audittraillogentry"

We have the following queries in this regard:

- How often is the autovacuum task invoked by postgres

- If the task gets cancelled (while we were deleting data from the database) would this task be re-invoked at a later time?

- If insertion of data into a table also impact this task?

- If we can manually schedule this task to a particular time (like off peak hours)?

Regards
Mitu

#5David G. Johnston
david.g.johnston@gmail.com
In reply to: Mitu Verma (#4)
Re: Populating missing dates in postgresql data

On Wed, Mar 25, 2015 at 8:57 PM, Mitu Verma <mitu.verma@ericsson.com> wrote:

Hi,

We have a customer complaining about the time taken by one of the
application scripts while deleting older data from the log tables.

During the deletion, customer reported that he often sees the below error
and because of which table size doesn’t reduce.

​You should look into partitioning (implemented via inheritance) for your
log table so that you can simply issue a "DROP TABLE ..."​

ERROR: canceling autovacuum task

Date: 2015-03-14 04:29:19

Context: automatic analyze of table
"fm_db_Server3.mmsuper.audittraillogentry"

We have the following queries in this regard:

- How often is the autovacuum task invoked by postgres

​As often as it needs to - based upon changes being made to tables​

- If the task gets cancelled (while we were deleting data from
the database) would this task be re-invoked at a later time?

​Yes​

- If insertion of data into a table also impact this task?

​Yes​

- If we can manually schedule this task to a particular time
(like off peak hours)?

​Yes. You can issue "ANALYZE" (or VACUUM) manually via SQL or the
PostgreSQL provided "vacuumdb" command using whatever external scheduling
mechanism you have at your disposal.

David J.

#6Alban Hertroys
haramrae@gmail.com
In reply to: Lavrenz, Steven M (#1)
Re: Populating missing dates in postgresql data

On 26 Mar 2015, at 1:25, Lavrenz, Steven M <slavrenz@purdue.edu> wrote:

Alright everyone, this is a doozy of a problem. I am new to Postgres so I appreciate patience/understanding. I have a database of hardware objects, each of which has several different “channels”. Once per day, these channels are supposed

object_id channel check-in date

****************************************

990 1 2014-12-01

990 1 2014-12-02

990 2 2014-12-01

990 2 2014-12-02

286 2 2014-12-01

286 2 2014-12-02

286 5 2014-12-01

286 5 2014-12-02

4507 1 2014-12-01

4507 1 2014-12-02

4507 2 2014-12-01

4507 2 2014-12-02

And so on. Occasionally, communications will break down to the hardware, such that no reporting occurs. object 286 loses communications on 12/1/2014. Then the table might look like:

object_id channel check-in date

****************************************

990 1 2014-12-01

990 1 2014-12-02

990 2 2014-12-01

990 2 2014-12-02

286 2 2014-12-02

286 5 2014-12-02

4507 1 2014-12-01

4507 1 2014-12-02

4507 2 2014-12-01

4507 2 2014-12-02

Or let’s say that for some reason, just channel 2 loses reporting for a day. Then we would have:

object_id channel check-in date

****************************************

990 1 2014-12-01

990 1 2014-12-02

990 2 2014-12-01

990 2 2014-12-02

286 2 2014-12-02

286 5 2014-12-01

286 5 2014-12-02

4507 1 2014-12-01

4507 1 2014-12-02

4507 2 2014-12-01

4507 2 2014-12-02

I have a second table (TABLE B) with all of the object_ids and channels that are supposed to be reporting in each day. For cases where a certain channel does not check in, I want to add a column that indicates the comm failure. So, for

object_id channel check-in date comm failure

**********************************************************

990 1 2014-12-01 No

990 1 2014-12-02 No

990 2 2014-12-01 No

990 2 2014-12-02 No

286 2 2014-12-01 Yes

286 2 2014-12-02 No

286 5 2014-12-01 Yes

286 5 2014-12-02 No

4507 1 2014-12-01 No

4507 1 2014-12-02 No

4507 2 2014-12-01 No

4507 2 2014-12-02 No

I have been racking my mind for the better part of a day on how to do this. The thing is that I can do a right join of TABLE B on TABLE A, and this will populate the missing object ids and channels. However, this only works for a single

object_id channel check-in date comm failure

**********************************************************

990 1 2014-12-01 No

990 1 2014-12-02 No

990 2 2014-12-01 No

990 2 2014-12-02 No

286 2 Yes

286 2 2014-12-02 No

286 5 Yes

286 5 2014-12-02 No

4507 1 2014-12-01 No

4507 1 2014-12-02 No

4507 2 2014-12-01 No

4507 2 2014-12-02 No

I need to do a count of comm failures by day, so I need to populate the check-in date field. Please help!

Easiest would be to insert the missing values in your table, something like:

WITH RECURSIVE calendar (missing_date) AS (
SELECT MAX(check_in_date) FROM table_a WHERE comm_failure = 'YES'
UNION ALL
SELECT missing_date + interval '1 day' FROM calendar
WHERE missing_date < CURRENT_DATE
)
INSERT INTO table_a (object_id, channel, check_in_date, comm_failure)
SELECT b.object_id, b.channel. c.missing_date, 'YES'
FROM table_b b, calendar c
WHERE NOT EXISTS (
SELECT 1
FROM table_a a
WHERE a.object_id = b.object_id
AND a.channel = b.channel
AND a.check_in_date = c.missing_date
);

That's off the top of my head, untested, etc, but I think I got that mostly right.

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#7Vincent Veyron
vv.lists@wanadoo.fr
In reply to: Lavrenz, Steven M (#1)
Re: Populating missing dates in postgresql data

On Thu, 26 Mar 2015 00:25:09 +0000
"Lavrenz, Steven M" <slavrenz@purdue.edu> wrote:

I have a second table (TABLE B) with all of the object_ids and channels that are supposed to be reporting in each day. For cases where a certain channel does not check in, I want to add a column that indicates the comm failure.

Not sure if your context allows it, but if you can change your program's logic, it might be easier to :

-add a boolean field (e.g. 'checked') to table B, set to false
-whenever an object checks in do 'update table B set checked = true where object_id = X and channel = Y'
-run a cron job once a day that
-runs 'select * from B where checked = false' and stores results somewhere
-resets B with 'update B set checked = false'

--
Salutations, Vincent Veyron

https://libremen.com/
Legal case, contract and insurance claim management software

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#8David G. Johnston
david.g.johnston@gmail.com
In reply to: Vincent Veyron (#7)
Re: Populating missing dates in postgresql data

On Fri, Mar 27, 2015 at 3:41 AM, Vincent Veyron <vv.lists@wanadoo.fr> wrote:

On Thu, 26 Mar 2015 00:25:09 +0000
"Lavrenz, Steven M" <slavrenz@purdue.edu> wrote:

I have a second table (TABLE B) with all of the object_ids and channels

that are supposed to be reporting in each day. For cases where a certain
channel does not check in, I want to add a column that indicates the comm
failure.

Not sure if your context allows it, but if you can change your program's
logic, it might be easier to :

-add a boolean field (e.g. 'checked') to table B, set to false
-whenever an object checks in do 'update table B set checked = true where
object_id = X and channel = Y'
-run a cron job once a day that
-runs 'select * from B where checked = false' and stores results
somewhere
-resets B with 'update B set checked = false'

​This is a performance optimization that I would avoid at nearly any cost,
and there are likely better ways to limit the processing scope without
having to trust the a cron job runs daily​

​in order to not lose data. Now, there are other points of failure here
that are of a similar nature already but still adding one for
(pre-mature...) optimization doesn't seem like a good move.

David J.

#9Vincent Veyron
vv.lists@wanadoo.fr
In reply to: David G. Johnston (#8)
Re: Populating missing dates in postgresql data

On Fri, 27 Mar 2015 08:33:36 -0700
"David G. Johnston" <david.g.johnston@gmail.com> wrote:

​This is a performance optimization that I would avoid at nearly any cost,
and there are likely better ways to limit the processing scope without
having to trust the a cron job runs daily​

mmmhh... it's not so much performance optimization as query simplification, but I agree it adds a maintenance problem.

I never heard of problems with cron not running though, is this really a concern?

--
Salutations, Vincent Veyron

https://legalcase.libremen.com/
Legal case, contract and insurance claim management software

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general