Summing activity intervals without any obvious column to group by

Started by Carey Tildenover 13 years ago7 messagesgeneral
Jump to latest
#1Carey Tilden
carey.tilden@gmail.com

Apologies for the awkward title. I haven't quite thought of the right way
to describe my problem, which may be why I've had a hard time figuring out
how to solve it. I have a list of program start/stop times, and I want to
know how long each run takes to complete. The thing that's really tripping
me up is there are gaps in the sequence. I've figured out how to collapse
the results down to a single row per attempt, but I can't quite figure out
how to further collapse down each full run to its own row. It'd be easy if
I had a session_id or something to group on, but I don't. All I have are
the start/stop times.

Here's some sample data. Hopefully this clarifies what I'm talking about:

drop table if exists program_runs;

create temporary table program_runs (
id serial,
time_stamp timestamptz,
action text
);

insert into program_runs (time_stamp, action) values
('2012-01-01 10:00:00 PST', 'started'), ('2012-01-01 10:10:00 PST',
'stopped early'),
('2012-01-01 10:20:00 PST', 'started'), ('2012-01-01 10:30:00 PST',
'stopped early'),
('2012-01-01 10:40:00 PST', 'started'), ('2012-01-01 10:47:00 PST',
'completed'),
('2012-01-01 10:50:00 PST', 'started'), ('2012-01-01 11:00:00 PST',
'stopped early'),
('2012-01-01 11:10:00 PST', 'started'), ('2012-01-01 11:13:00 PST',
'completed'),
('2012-01-01 11:20:00 PST', 'started'), ('2012-01-01 11:30:00 PST',
'stopped early'),
('2012-01-01 11:40:00 PST', 'started'), ('2012-01-01 11:50:00 PST',
'stopped early'),
('2012-01-01 12:00:00 PST', 'started'), ('2012-01-01 12:10:00 PST',
'stopped early'),
('2012-01-01 12:20:00 PST', 'started'), ('2012-01-01 12:29:00 PST',
'completed');

select
this_time_stamp as starting_time_stamp,
next_time_stamp - this_time_stamp as time_elapsed,
next_action as closing_action
from (
select
time_stamp as this_time_stamp, lead(time_stamp) over (order by
id) as next_time_stamp,
action as this_action, lead(action) over (order by id) as
next_action,
id as this_id, lead(id) over (order by id) as next_id
from program_runs
) q
where this_action = 'started';

Note that each run has a pair of entries in the table. The first is always
"started", but the second may be either "stopped early" or "completed".
The final results I'd like to see are:

starting_time_stamp | total_time_elapsed
------------------------+--------------------
2012-01-01 10:00:00-08 | 00:27:00
2012-01-01 10:50:00-08 | 00:13:00
2012-01-01 11:20:00-08 | 00:39:00

Hope that's enough detail. Any ideas or suggestions gladly accepted!

Regards,
Carey

#2Andrew Sullivan
ajs@crankycanuck.ca
In reply to: Carey Tilden (#1)
Re: Summing activity intervals without any obvious column to group by

On Mon, Aug 13, 2012 at 05:28:24PM -0700, Carey Tilden wrote:

how to solve it. I have a list of program start/stop times, and I want to
know how long each run takes to complete. The thing that's really tripping
me up is there are gaps in the sequence. I've figured out how to collapse
the results down to a single row per attempt, but I can't quite figure out
how to further collapse down each full run to its own row. It'd be easy if
I had a session_id or something to group on, but I don't. All I have are
the start/stop times.

Looking at your data and the sample output you provided, it would
appear that the "real" start time of a program is always the one that
is started _after_ a completion (or else it's the very first start).
In other words, it's never the case that a "start" in the program_runs data
is the start of a new run when an existing, previously-started run
hasn't completed.

Is that right? If so, then you ought to be able to use windowing
functions. For each completion, pick the earliest start before it that is
_after_ a completion or, if there is no such completion, is the very
first record. See the discussion of windowing functions in the manual
for how to do this.

Best,

A

--
Andrew Sullivan
ajs@crankycanuck.ca

#3Carey Tilden
carey.tilden@gmail.com
In reply to: Andrew Sullivan (#2)
Re: Summing activity intervals without any obvious column to group by

On Mon, Aug 13, 2012 at 5:53 PM, Andrew Sullivan <ajs@crankycanuck.ca>wrote:

On Mon, Aug 13, 2012 at 05:28:24PM -0700, Carey Tilden wrote:

how to solve it. I have a list of program start/stop times, and I want

to

know how long each run takes to complete. The thing that's really

tripping

me up is there are gaps in the sequence. I've figured out how to

collapse

the results down to a single row per attempt, but I can't quite figure

out

how to further collapse down each full run to its own row. It'd be easy

if

I had a session_id or something to group on, but I don't. All I have are
the start/stop times.

Looking at your data and the sample output you provided, it would
appear that the "real" start time of a program is always the one that
is started _after_ a completion (or else it's the very first start).
In other words, it's never the case that a "start" in the program_runs data
is the start of a new run when an existing, previously-started run
hasn't completed.

Is that right? If so, then you ought to be able to use windowing
functions. For each completion, pick the earliest start before it that is
_after_ a completion or, if there is no such completion, is the very
first record. See the discussion of windowing functions in the manual
for how to do this.

That's right, but I'm not simply trying to see how long it's been between
the "real" start and the "real" end. I want to exclude the gaps between
runs. I'm looking for how many minutes it spent actually processing.

Carey

#4David G. Johnston
david.g.johnston@gmail.com
In reply to: Carey Tilden (#1)
Re: Summing activity intervals without any obvious column to group by

On Aug 13, 2012, at 20:28, Carey Tilden <carey.tilden@gmail.com> wrote:

Apologies for the awkward title. I haven't quite thought of the right way to describe my problem, which may be why I've had a hard time figuring out how to solve it. I have a list of program start/stop times, and I want to know how long each run takes to complete. The thing that's really tripping me up is there are gaps in the sequence. I've figured out how to collapse the results down to a single row per attempt, but I can't quite figure out how to further collapse down each full run to its own row. It'd be easy if I had a session_id or something to group on, but I don't. All I have are the start/stop times.

Here's some sample data. Hopefully this clarifies what I'm talking about:

drop table if exists program_runs;

create temporary table program_runs (
id serial,
time_stamp timestamptz,
action text
);

insert into program_runs (time_stamp, action) values
('2012-01-01 10:00:00 PST', 'started'), ('2012-01-01 10:10:00 PST', 'stopped early'),
('2012-01-01 10:20:00 PST', 'started'), ('2012-01-01 10:30:00 PST', 'stopped early'),
('2012-01-01 10:40:00 PST', 'started'), ('2012-01-01 10:47:00 PST', 'completed'),
('2012-01-01 10:50:00 PST', 'started'), ('2012-01-01 11:00:00 PST', 'stopped early'),
('2012-01-01 11:10:00 PST', 'started'), ('2012-01-01 11:13:00 PST', 'completed'),
('2012-01-01 11:20:00 PST', 'started'), ('2012-01-01 11:30:00 PST', 'stopped early'),
('2012-01-01 11:40:00 PST', 'started'), ('2012-01-01 11:50:00 PST', 'stopped early'),
('2012-01-01 12:00:00 PST', 'started'), ('2012-01-01 12:10:00 PST', 'stopped early'),
('2012-01-01 12:20:00 PST', 'started'), ('2012-01-01 12:29:00 PST', 'completed');

select
this_time_stamp as starting_time_stamp,
next_time_stamp - this_time_stamp as time_elapsed,
next_action as closing_action
from (
select
time_stamp as this_time_stamp, lead(time_stamp) over (order by id) as next_time_stamp,
action as this_action, lead(action) over (order by id) as next_action,
id as this_id, lead(id) over (order by id) as next_id
from program_runs
) q
where this_action = 'started';

Note that each run has a pair of entries in the table. The first is always "started", but the second may be either "stopped early" or "completed". The final results I'd like to see are:

starting_time_stamp | total_time_elapsed
------------------------+--------------------
2012-01-01 10:00:00-08 | 00:27:00
2012-01-01 10:50:00-08 | 00:13:00
2012-01-01 11:20:00-08 | 00:39:00

Hope that's enough detail. Any ideas or suggestions gladly accepted!

Regards,
Carey

First artificially generate row (pair) identifiers by integer dividing the ordered row number by 2.

Using window or sub-queries identify the bookends for each group (i.e., the identifier for each completed and the prior completed). Give these groups artificial session identifiers/row numbers.

Assign the artificial session id to each transaction row by using the bookends.

Now you have identifiers with which to group.

This makes a number of assumptions regarding the form of the input data. It will solve for your example data but it may not generalize. In particular it assumes non-overlapping sessions.

HTH

David J.

#5Carey Tilden
carey.tilden@gmail.com
In reply to: David G. Johnston (#4)
Re: Summing activity intervals without any obvious column to group by

On Mon, Aug 13, 2012 at 6:01 PM, David Johnston <polobo@yahoo.com> wrote:

On Aug 13, 2012, at 20:28, Carey Tilden <carey.tilden@gmail.com> wrote:

Apologies for the awkward title. I haven't quite thought of the right

way to describe my problem, which may be why I've had a hard time figuring
out how to solve it. I have a list of program start/stop times, and I want
to know how long each run takes to complete. The thing that's really
tripping me up is there are gaps in the sequence. I've figured out how to
collapse the results down to a single row per attempt, but I can't quite
figure out how to further collapse down each full run to its own row. It'd
be easy if I had a session_id or something to group on, but I don't. All I
have are the start/stop times.

Here's some sample data. Hopefully this clarifies what I'm talking

about:

drop table if exists program_runs;

create temporary table program_runs (
id serial,
time_stamp timestamptz,
action text
);

insert into program_runs (time_stamp, action) values
('2012-01-01 10:00:00 PST', 'started'), ('2012-01-01 10:10:00

PST', 'stopped early'),

('2012-01-01 10:20:00 PST', 'started'), ('2012-01-01 10:30:00

PST', 'stopped early'),

('2012-01-01 10:40:00 PST', 'started'), ('2012-01-01 10:47:00

PST', 'completed'),

('2012-01-01 10:50:00 PST', 'started'), ('2012-01-01 11:00:00

PST', 'stopped early'),

('2012-01-01 11:10:00 PST', 'started'), ('2012-01-01 11:13:00

PST', 'completed'),

('2012-01-01 11:20:00 PST', 'started'), ('2012-01-01 11:30:00

PST', 'stopped early'),

('2012-01-01 11:40:00 PST', 'started'), ('2012-01-01 11:50:00

PST', 'stopped early'),

('2012-01-01 12:00:00 PST', 'started'), ('2012-01-01 12:10:00

PST', 'stopped early'),

('2012-01-01 12:20:00 PST', 'started'), ('2012-01-01 12:29:00

PST', 'completed');

select
this_time_stamp as starting_time_stamp,
next_time_stamp - this_time_stamp as time_elapsed,
next_action as closing_action
from (
select
time_stamp as this_time_stamp, lead(time_stamp) over (order

by id) as next_time_stamp,

action as this_action, lead(action) over (order by id) as

next_action,

id as this_id, lead(id) over (order by id) as next_id
from program_runs
) q
where this_action = 'started';

Note that each run has a pair of entries in the table. The first is

always "started", but the second may be either "stopped early" or
"completed". The final results I'd like to see are:

starting_time_stamp | total_time_elapsed
------------------------+--------------------
2012-01-01 10:00:00-08 | 00:27:00
2012-01-01 10:50:00-08 | 00:13:00
2012-01-01 11:20:00-08 | 00:39:00

Hope that's enough detail. Any ideas or suggestions gladly accepted!

Regards,
Carey

First artificially generate row (pair) identifiers by integer dividing the
ordered row number by 2.

Using window or sub-queries identify the bookends for each group (i.e.,
the identifier for each completed and the prior completed). Give these
groups artificial session identifiers/row numbers.

Assign the artificial session id to each transaction row by using the
bookends.

This is the part where I draw a blank. How would I do that? Seems like it
should be easy with window functions, but I just can't think of the way to
do it.

Now you have identifiers with which to group.

This makes a number of assumptions regarding the form of the input data.
It will solve for your example data but it may not generalize. In
particular it assumes non-overlapping sessions.

The assumptions hold fairly well. Sessions do not overlap, thankfully.
There are different program runs to untangle, but that's simple enough
(order by program_name, time_stamp).

Thanks for the suggestions so far!

Carey

#6David G. Johnston
david.g.johnston@gmail.com
In reply to: Carey Tilden (#5)
Re: Summing activity intervals without any obvious column to group by

On Aug 13, 2012, at 21:22, Carey Tilden <carey.tilden@gmail.com> wrote:

On Mon, Aug 13, 2012 at 6:01 PM, David Johnston <polobo@yahoo.com> wrote:
On Aug 13, 2012, at 20:28, Carey Tilden <carey.tilden@gmail.com> wrote:

Apologies for the awkward title. I haven't quite thought of the right way to describe my problem, which may be why I've had a hard time figuring out how to solve it. I have a list of program start/stop times, and I want to know how long each run takes to complete. The thing that's really tripping me up is there are gaps in the sequence. I've figured out how to collapse the results down to a single row per attempt, but I can't quite figure out how to further collapse down each full run to its own row. It'd be easy if I had a session_id or something to group on, but I don't. All I have are the start/stop times.

Here's some sample data. Hopefully this clarifies what I'm talking about:

drop table if exists program_runs;

create temporary table program_runs (
id serial,
time_stamp timestamptz,
action text
);

insert into program_runs (time_stamp, action) values
('2012-01-01 10:00:00 PST', 'started'), ('2012-01-01 10:10:00 PST', 'stopped early'),
('2012-01-01 10:20:00 PST', 'started'), ('2012-01-01 10:30:00 PST', 'stopped early'),
('2012-01-01 10:40:00 PST', 'started'), ('2012-01-01 10:47:00 PST', 'completed'),
('2012-01-01 10:50:00 PST', 'started'), ('2012-01-01 11:00:00 PST', 'stopped early'),
('2012-01-01 11:10:00 PST', 'started'), ('2012-01-01 11:13:00 PST', 'completed'),
('2012-01-01 11:20:00 PST', 'started'), ('2012-01-01 11:30:00 PST', 'stopped early'),
('2012-01-01 11:40:00 PST', 'started'), ('2012-01-01 11:50:00 PST', 'stopped early'),
('2012-01-01 12:00:00 PST', 'started'), ('2012-01-01 12:10:00 PST', 'stopped early'),
('2012-01-01 12:20:00 PST', 'started'), ('2012-01-01 12:29:00 PST', 'completed');

select
this_time_stamp as starting_time_stamp,
next_time_stamp - this_time_stamp as time_elapsed,
next_action as closing_action
from (
select
time_stamp as this_time_stamp, lead(time_stamp) over (order by id) as next_time_stamp,
action as this_action, lead(action) over (order by id) as next_action,
id as this_id, lead(id) over (order by id) as next_id
from program_runs
) q
where this_action = 'started';

Note that each run has a pair of entries in the table. The first is always "started", but the second may be either "stopped early" or "completed". The final results I'd like to see are:

starting_time_stamp | total_time_elapsed
------------------------+--------------------
2012-01-01 10:00:00-08 | 00:27:00
2012-01-01 10:50:00-08 | 00:13:00
2012-01-01 11:20:00-08 | 00:39:00

Hope that's enough detail. Any ideas or suggestions gladly accepted!

Regards,
Carey

First artificially generate row (pair) identifiers by integer dividing the ordered row number by 2.

Using window or sub-queries identify the bookends for each group (i.e., the identifier for each completed and the prior completed). Give these groups artificial session identifiers/row numbers.

Assign the artificial session id to each transaction row by using the bookends.

This is the part where I draw a blank. How would I do that? Seems like it should be easy with window functions, but I just can't think of the way to do it.

With detail as ()
, bookmarks as ()
Select detail.id, bookmarks.id, ...
From detail
Join bookmarks on ( detail.id between bookmarks.startid and bookmarks.endid )

Bookmarks as (
Select row_number() over () as id,
detail.id, coalesce(min(detail.id) over (rows 1 preceeding),0) --coalesce for the first row
From detail where 'completed'
)

Probably need to play with row ordering but this should get you started.

Show quoted text

Now you have identifiers with which to group.

This makes a number of assumptions regarding the form of the input data. It will solve for your example data but it may not generalize. In particular it assumes non-overlapping sessions.

The assumptions hold fairly well. Sessions do not overlap, thankfully. There are different program runs to untangle, but that's simple enough (order by program_name, time_stamp).

Thanks for the suggestions so far!

Carey

#7Carey Tilden
carey.tilden@gmail.com
In reply to: David G. Johnston (#6)
Re: Summing activity intervals without any obvious column to group by

On Mon, Aug 13, 2012 at 7:05 PM, David Johnston <polobo@yahoo.com> wrote:

On Aug 13, 2012, at 21:22, Carey Tilden <carey.tilden@gmail.com> wrote:

On Mon, Aug 13, 2012 at 6:01 PM, David Johnston <polobo@yahoo.com> wrote:

On Aug 13, 2012, at 20:28, Carey Tilden <carey.tilden@gmail.com> wrote:

Apologies for the awkward title. I haven't quite thought of the right

way to describe my problem, which may be why I've had a hard time figuring
out how to solve it. I have a list of program start/stop times, and I want
to know how long each run takes to complete. The thing that's really
tripping me up is there are gaps in the sequence. I've figured out how to
collapse the results down to a single row per attempt, but I can't quite
figure out how to further collapse down each full run to its own row. It'd
be easy if I had a session_id or something to group on, but I don't. All I
have are the start/stop times.

Here's some sample data. Hopefully this clarifies what I'm talking

about:

drop table if exists program_runs;

create temporary table program_runs (
id serial,
time_stamp timestamptz,
action text
);

insert into program_runs (time_stamp, action) values
('2012-01-01 10:00:00 PST', 'started'), ('2012-01-01 10:10:00

PST', 'stopped early'),

('2012-01-01 10:20:00 PST', 'started'), ('2012-01-01 10:30:00

PST', 'stopped early'),

('2012-01-01 10:40:00 PST', 'started'), ('2012-01-01 10:47:00

PST', 'completed'),

('2012-01-01 10:50:00 PST', 'started'), ('2012-01-01 11:00:00

PST', 'stopped early'),

('2012-01-01 11:10:00 PST', 'started'), ('2012-01-01 11:13:00

PST', 'completed'),

('2012-01-01 11:20:00 PST', 'started'), ('2012-01-01 11:30:00

PST', 'stopped early'),

('2012-01-01 11:40:00 PST', 'started'), ('2012-01-01 11:50:00

PST', 'stopped early'),

('2012-01-01 12:00:00 PST', 'started'), ('2012-01-01 12:10:00

PST', 'stopped early'),

('2012-01-01 12:20:00 PST', 'started'), ('2012-01-01 12:29:00

PST', 'completed');

select
this_time_stamp as starting_time_stamp,
next_time_stamp - this_time_stamp as time_elapsed,
next_action as closing_action
from (
select
time_stamp as this_time_stamp, lead(time_stamp) over (order

by id) as next_time_stamp,

action as this_action, lead(action) over (order by id) as

next_action,

id as this_id, lead(id) over (order by id) as next_id
from program_runs
) q
where this_action = 'started';

Note that each run has a pair of entries in the table. The first is

always "started", but the second may be either "stopped early" or
"completed". The final results I'd like to see are:

starting_time_stamp | total_time_elapsed
------------------------+--------------------
2012-01-01 10:00:00-08 | 00:27:00
2012-01-01 10:50:00-08 | 00:13:00
2012-01-01 11:20:00-08 | 00:39:00

Hope that's enough detail. Any ideas or suggestions gladly accepted!

Regards,
Carey

First artificially generate row (pair) identifiers by integer dividing
the ordered row number by 2.

Using window or sub-queries identify the bookends for each group (i.e.,
the identifier for each completed and the prior completed). Give these
groups artificial session identifiers/row numbers.

Assign the artificial session id to each transaction row by using the
bookends.

This is the part where I draw a blank. How would I do that? Seems like
it should be easy with window functions, but I just can't think of the way
to do it.

With detail as ()
, bookmarks as ()
Select detail.id, bookmarks.id, ...
From detail
Join bookmarks on ( detail.id between bookmarks.startid and
bookmarks.endid )

Bookmarks as (
Select row_number() over () as id,
detail.id, coalesce(min(detail.id) over (rows 1 preceeding),0) --coalesce
for the first row
From detail where 'completed'
)

Probably need to play with row ordering but this should get you started.

Thanks much! That finally clicked and I now have the results I was after.
I included my latest code as an attachment, since pasting it inline seemed
a bit much. I'm all ears if anyone wants to take a look and suggest any
further refinements.

Cheers,
Carey

Show quoted text

Now you have identifiers with which to group.

This makes a number of assumptions regarding the form of the input data.
It will solve for your example data but it may not generalize. In
particular it assumes non-overlapping sessions.

The assumptions hold fairly well. Sessions do not overlap, thankfully.
There are different program runs to untangle, but that's simple enough
(order by program_name, time_stamp).

Thanks for the suggestions so far!

Carey

Attachments:

program_runs_testbed.sqlapplication/octet-stream; name=program_runs_testbed.sqlDownload