Statistics query

Started by Steve Crawfordalmost 13 years ago4 messagesgeneral

scrawford@pinpointresearch.com

almost 13 years ago

I'm seeking ideas on the best way to craft the following query. I've
stripped everything down to the bare essentials and simplified it below.

Input data has a timestamp (actually an int received from the system in
the form of a Unix epoch), a unit identifier and a status:

event_time | unit_id | status
------------+---------+--------
1357056011 | 60 | 1
1357056012 | 178 | 0
1357056019 | 168 | 0
1357056021 | 3 | 0
1357056021 | 4 | 1
1357056021 | 179 | 0
1357056022 | 0 | 1
1357056022 | 1 | 0
1357056023 | 2 | 0
1357056024 | 9 | 0
1357056025 | 5 | 0
1357056025 | 6 | 0
1357056026 | 7 | 1
...

A given unit_id cannot have two events at the same time (enforced by
constraints).

Given a point in time I would like to:

1. Identify all distinct unit_ids with an entry that exists in the
preceding hour then

2. Count both the total events and sum the status=1 events for the most
recent 50 events for each unit_id that fall within a limited period
(e.g. don't look at data earlier than midnight). So unit_id 60 might
have 50 events in the last 15 minutes while unit_id 4 might have only 12
events after midnight.

The output would look something like:

unit_id | events | status_1_count
---------+--------+----------------
1 | 50 | 34
2 | 27 | 18
1 | 50 | 34
1 | 2 | 0
...

Each sub-portion is easy and while I could use external processing or
set-returning functions I was hoping first to find the secret-sauce to
glue everything together into a single query.

Cheers,
Steve

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Chris Curvey

chris@chriscurvey.com

almost 13 years ago

In reply to: Steve Crawford (#1)

Re: Statistics query

On Wed, Apr 10, 2013 at 12:30 PM, Steve Crawford <
scrawford@pinpointresearch.com> wrote:

I'm seeking ideas on the best way to craft the following query. I've
stripped everything down to the bare essentials and simplified it below.

Input data has a timestamp (actually an int received from the system in
the form of a Unix epoch), a unit identifier and a status:

event_time | unit_id | status
------------+---------+-------**-
1357056011 | 60 | 1
1357056012 | 178 | 0
1357056019 | 168 | 0
1357056021 | 3 | 0
1357056021 | 4 | 1
1357056021 | 179 | 0
1357056022 | 0 | 1
1357056022 | 1 | 0
1357056023 | 2 | 0
1357056024 | 9 | 0
1357056025 | 5 | 0
1357056025 | 6 | 0
1357056026 | 7 | 1
...

A given unit_id cannot have two events at the same time (enforced by
constraints).

Given a point in time I would like to:

1. Identify all distinct unit_ids with an entry that exists in the
preceding hour then

2. Count both the total events and sum the status=1 events for the most
recent 50 events for each unit_id that fall within a limited period (e.g.
don't look at data earlier than midnight). So unit_id 60 might have 50
events in the last 15 minutes while unit_id 4 might have only 12 events
after midnight.

The output would look something like:

unit_id | events | status_1_count
---------+--------+-----------**-----
1 | 50 | 34
2 | 27 | 18
1 | 50 | 34
1 | 2 | 0
...

Each sub-portion is easy and while I could use external processing or
set-returning functions I was hoping first to find the secret-sauce to glue
everything together into a single query.

Cheers,
Steve

something like

select unit_id, count(*), sum(status)
from mytable a
where event_time >= [whatever unix epoch translates to "last midnight"]
and exists
( select *
from mytable b
where b.unit_id = a.unit_id
and b.epoch >= [unix epoch that translates to "one hour ago"])
group by unit _id;

1) I think figuring out the unix epoch should be reasonable...but I don't
know how to do it off the top of my head.
2) I could completely be misunderstanding this. I'm not sure why the
example results would have unit id 1 repeated. (which my suggestion WON'T
do)

Steve Crawford

scrawford@pinpointresearch.com

almost 13 years ago

In reply to: Chris Curvey (#2)

Re: Statistics query

On 04/10/2013 10:31 AM, Chris Curvey wrote:

On Wed, Apr 10, 2013 at 12:30 PM, Steve Crawford
<scrawford@pinpointresearch.com
<mailto:scrawford@pinpointresearch.com>> wrote:

I'm seeking ideas on the best way to craft the following query.
I've stripped everything down to the bare essentials and
simplified it below.

Input data has a timestamp (actually an int received from the
system in the form of a Unix epoch), a unit identifier and a status:

event_time | unit_id | status
------------+---------+--------
1357056011 | 60 | 1
1357056012 | 178 | 0
1357056019 | 168 | 0
1357056021 | 3 | 0
1357056021 | 4 | 1
1357056021 | 179 | 0
1357056022 | 0 | 1
1357056022 | 1 | 0
1357056023 | 2 | 0
1357056024 | 9 | 0
1357056025 | 5 | 0
1357056025 | 6 | 0
1357056026 | 7 | 1
...

A given unit_id cannot have two events at the same time (enforced
by constraints).

Given a point in time I would like to:

1. Identify all distinct unit_ids with an entry that exists in the
preceding hour then

2. Count both the total events and sum the status=1 events for the
most recent 50 events for each unit_id that fall within a limited
period (e.g. don't look at data earlier than midnight). So unit_id
60 might have 50 events in the last 15 minutes while unit_id 4
might have only 12 events after midnight.

The output would look something like:

unit_id | events | status_1_count
---------+--------+----------------
1 | 50 | 34
2 | 27 | 18
1 | 50 | 34
1 | 2 | 0
...

Each sub-portion is easy and while I could use external processing
or set-returning functions I was hoping first to find the
secret-sauce to glue everything together into a single query.

Cheers,
Steve

something like

select unit_id, count(*), sum(status)
from mytable a
where event_time >= [whatever unix epoch translates to "last midnight"]
and exists
( select *
from mytable b
where b.unit_id = a.unit_id
and b.epoch >= [unix epoch that translates to "one hour ago"])
group by unit _id;

1) I think figuring out the unix epoch should be reasonable...but I
don't know how to do it off the top of my head.
2) I could completely be misunderstanding this. I'm not sure why the
example results would have unit id 1 repeated. (which my suggestion
WON'T do)

Because I screwed up cutting and pasting to make an example. The unit_id
in the output should, in fact, be distinct:

unit_id | events | status_1_count
---------+--------+----------------
1 | 50 | 34
2 | 27 | 18
3 | 50 | 34
4 | 2 | 0

You are correct, epoch is easy:
abstime(epoch)
or
extract(epoch from timestamptz)
depending on which direction you are going or for an hour difference
just subtract 3600.

The solution, however, misses the important complicating gotcha. The
units I want listed are only those that have had at least one event in
the last hour. But for each such unit, I only want the statistics to
reflect the most-recent 50 events (even if those events occurred earlier
than the current hour) provided the event occurred on the current date.
So the events column can never be less than 1 nor more than 50.

For example...

One unit might have a single event at the start of the last hour but 49
more in the preceding 10 minutes. I want to see that unit and the stats
for those 50 events.

Same thing if a unit has 50 events clustered at the end of an hour - I
don't want the earlier ones.

Another might have 50 events early in the day but none this hour. I
don't want to see that one.

But I do want to see the one that had an event in the last hour late in
the day along with the 48 other events that have accumulated since midnight.

Cheers,
Steve

Steve Crawford

scrawford@pinpointresearch.com

almost 13 years ago

In reply to: Steve Crawford (#3)

Re: SOLVED Statistics query

On 04/10/2013 10:51 AM, Steve Crawford wrote:

...

Given a point in time I would like to:

1. Identify all distinct unit_ids with an entry that exists in
the preceding hour then

2. Count both the total events and sum the status=1 events for
the most recent 50 events for each unit_id that fall within a
limited period (e.g. don't look at data earlier than midnight).
So unit_id 60 might have 50 events in the last 15 minutes while
unit_id 4 might have only 12 events after midnight.

...

Guess I needed to wait for the coffee to absorb. I've come up with an
initial working solution (perhaps excess use of CTE but it's useful for
testing/debugging over different portions of collected data):

with

report_time as (
select
1365526800::int as list_end
),

report_ranges as (
select
extract(epoch from date_trunc('day', abstime(list_end)))::int as
day_start,
greatest(list_end-3600, extract(epoch from date_trunc('day',
abstime(list_end)))::int) as list_start,
list_end
from
report_time
),

today_events as (
select
unit_id,
event_time,
status
from
event_log d,
report_ranges r
where
d.event_time >= r.day_start and
d.event_time <= r.list_end
),

unit_id_list as (
select
distinct unit_id,
coalesce((select
i.event_time
from
today_events i
where
i.unit_id = o.unit_id and
i.event_time <= r.list_end
order by
event_time desc
limit 1
offset 49), r.day_start) as first_event
from
event_log o,
report_ranges r
where
event_time between r.list_start and r.list_end
)

select
unit_id,
(select
count(*)
from
today_events ii
where
ii.unit_id = oo.unit_id and
ii.event_time >= oo.first_event) as events,
(select
sum (case when status = -6 then 1 else 0 end)
from
today_events ii
where
ii.unit_id = oo.unit_id and
ii.event_time >= oo.first_event) as live_answer
from
unit_id_list oo
order by
unit_id
;

Cheers,
Steve